We're Still in the Batch Mode Era of Machine Learning

January 17, 2023

Computing used to work like this: You’d write your program on paper. You (or maybe a keypunch operator) would transcribe your program and data into a stack of punch cards. You’d wait in line to hand your program over to the scheduler down at the computing center in the basement. If everything went as planned, you’d pick up your results the next day. Or maybe there was a syntax error.

Training a machine learning model is much the same today. You describe your task. You (or maybe a team of human labelers) label a bunch of data. You enqueue your training job to run in a data center somewhere in a place called us-east-1. If everything went as planned, you collect your trained model the next day. Or maybe it was misconfigured.

Is that really the same?

Before we get further into this, let me deflate a few superficial differences between the two processes.

Yes, we no longer need to move physical punch cards to a physical data center. This does save a bit of time and frustration, but honestly, couldn’t we use the excuse to get a little exercise? More importantly, this doesn’t change the fact that in both cases you send the work off somewhere else, wait for it to complete, and then collect your results at the end.

Yes, when training a machine learning model today, unlike in the batch computing days of yore, we can monitor progress on dashboards like TensorBoard. This lets us abort early if things really go awry, and if we become skilled in loss curve augury, sometimes this will give a hint of insight into what’s happening to the model. It’s only a hint though, and most problems can’t be diagnosed from the loss curve.

Yes, many training scenarios today don’t involve much manual data labeling. That’s great (setting aside that now nobody at all knows what’s going on inside the dataset), but there’s usually an analogous process of filtering and preprocessing the data, and the rest of the process is still very much the same.

The rest of computing has become interactive and graphical

Meanwhile, most of the rest of computing has moved on from its batch mode origins. Most of us spend most of our working life in very interactive modes of computing. Machine learning is behind in this sense.

Perhaps that is to be expected. It took a long and hard-fought campaign to make classical computing interactive and graphical. People like J. C. R. Licklider worked for years planting the intellectual seeds and funding the research projects that gradually blossomed into the interactive computing that we’ve become accustomed to.

Batch mode is the default. When software engineers without the specific intention to make something interactive sit down and write something like a machine learning training system, the most natural design is to have the computer do a bunch of work invisibly and then spit something out at the end.

Machine learning is a new paradigm of computing, where machines learn from data rather than follow human-written instructions. What interactive graphical interfaces to this kind of computing should look like is still largely a research problem. So it makes sense that we’re falling back to the batch default.

Interactive machine learning is possible

It took the work of many people over a generation to develop the ideas, research systems, and eventually products that brought interactive computing to the mainstream. Fortunately, there are people already working on doing the same for machine learning. To highlight three people whose work I admire:

Vincent D. Warmerdam (@fishnets88, koaning.io) – I often find myself thinking about his library human-learn, which allows users to design rule-based classifiers that are compatible with the sklearn ecosystem and that can be used in place of and alongside ML models. You can even directly draw a decision boundary on top of a visualization of feature space.

He has tons of other cool projects too, like doubtlab, a library that encapsulates different ways of detecting bad labels, and bulk, a tool for labeling large groups of data points at once by taking advantage of 2D embeddings.

Ben Schmidt (@benmschmidt, benschmidt.org) – Ben recently left his position at NYU to work full time on Nomic, “the world’s first information cartography company”. You should follow the link to understand what that means.

I’ve been interested in this kind of interface for a long time, ever since I first encountered the Embedding Projector. What I think is so fascinating about this kind of interface is that it makes the internal embedding representations of ML models directly visible to the end user. This turns the model’s understanding of the dataset into something the user can see and explore.

Been Kim (@_beenkim, beenkim.github.io) – Been is the first author of a paper titled Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV), which introduces a way for humans to represent arbitrary concepts using the same kind of internal embedding representations that Ben Schmidt visualizes in his work. I’m using her as a stand-in both for the other authors on that paper and for the whole line of work the paper spawned – it’s now sitting at over 900 citations.

I find this idea, representing human-defined concepts using the internal representations of ML models, exciting because it lets us bridge between the ways humans understand the data and the ways models do. You could, for instance, define a concept like “playful” for your dataset of user comments. You could then jump from a given comment toward comments that are similar to the first comment, but more playful. You could find the most playful comments. You could check how important the concept of playfulness is to the decisions a model makes (this is the focus of the original paper). This allows us to build interfaces where the dials and levers can navigate, filter, and manipulate human-meaningful aspects of the dataset or model, not just the metadata where most interfaces are stuck today.
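To make the “playful” example a little more concrete, here is a rough sketch of how a concept direction could be used to rank and navigate comments. Everything in it is an assumption for illustration: the two-dimensional embeddings are made up, the function names are hypothetical, and the concept direction is computed as a simple difference of means. The TCAV paper instead takes the normal vector of a linear classifier trained to separate concept examples from random ones, and measures importance with directional derivatives, neither of which is shown here.

```python
from math import sqrt


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))


def concept_direction(concept_examples, other_examples):
    """Unit vector pointing from the 'other' examples toward the concept examples.

    Simplification: a difference of means, not the linear-classifier
    normal used in the TCAV paper.
    """
    diff = [c - o for c, o in zip(mean_vector(concept_examples),
                                  mean_vector(other_examples))]
    norm = sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def similar_but_more(query_text, embeddings, direction, step=1.0):
    """Find the other item closest to the query after nudging it along the concept."""
    query = embeddings[query_text]
    target = [q + step * d for q, d in zip(query, direction)]
    candidates = [text for text in embeddings if text != query_text]
    return max(candidates, key=lambda text: cosine(embeddings[text], target))


# Made-up 2-D embeddings; a real model would supply much larger vectors.
comments = {
    "Thanks for the fix.": [0.9, 0.1],
    "Thanks, you absolute legend!": [0.8, 0.9],
    "This is broken.": [-0.9, 0.0],
    "lol, gloriously broken": [-0.8, 0.7],
}

# Hypothetical embeddings of a few hand-picked playful and non-playful examples.
playful_examples = [[0.5, 1.0], [-0.5, 1.0]]
neutral_examples = [[0.5, 0.0], [-0.5, 0.0]]

direction = concept_direction(playful_examples, neutral_examples)

# "Find the most playful comments": sort by projection onto the concept direction.
ranked = sorted(comments, key=lambda text: dot(comments[text], direction), reverse=True)
most_playful = ranked[0]

# "Similar to this comment, but more playful".
neighbor = similar_but_more("This is broken.", comments, direction)

print(most_playful)
print(neighbor)
```

On this toy data, the most playful comment is the “absolute legend” one, and the playful neighbor of “This is broken.” is “lol, gloriously broken”. The `step` parameter is the kind of thing that could sit behind a dial in an interface: turn it up and the search drifts further along the concept.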

All of this work taken together begins to hint at a future where machine learning systems are developed in rich graphical environments, where the data is up front and easily inspectable, where models and data can be explained and re-expressed in terms of human-interpretable concepts, and where those same concepts act as an interface, allowing us to transform the dataset as well as the model itself in tangible, meaningful ways.

Compared to the batch-mode machine learning that predominates today, this future is one where there is a much tighter feedback loop and much higher bandwidth between the system and the human who is trying to shape the system to achieve their goals. It’s a future that enables more non-linear workflows, with more opportunities for creative problem solving in developing machine learning systems, making the whole process more interesting, more immersive and more fun.

That’s the future I want to build.

P.S. Conspicuously absent from my account of the state of machine learning here is the rise of large generative models that are conditioned on text or other human-manipulable inputs. These models do offer much tighter feedback loops, but at the expense of putting control over the model itself further out of reach for most people. How these models fit into the broader future of interactive machine learning is something I’m still processing and will be the subject of future posts.