Knowing how to measure
Right now we don’t know what questions ML systems can answer, or what questions to ask them to find out.
One of my regular hobbyhorses is that computer science/machine learning as a field failed to learn the correct lessons from psychology/cognitive science. The relationship between the two fields is fraught. In the late 1950s it was much less so; the early AI researchers felt that they had much to gain from talking to cognitive scientists, and the very premise of cognitive science is that we can better understand the brain by thinking of it as an information processing system like a computer. The AI researchers went to the cognitive scientists and asked, in approximately so many words, how the brain worked, so that they could implement that in computers. The cognitive scientists gave a prompt and confident answer, based on their most sophisticated models of cognition, which the AI researchers promptly implemented. That answer was quite comprehensively wrong. It was so wrong that it led to something of a decade-long deep freeze during which interaction between the disciplines of machine learning and psychology was minimal and, when it happened at all, deeply strained.
From my perspective the relationship was soured by an inability to understand—from both sides—what psychology was really good at. The model-building efforts of cognitive scientists are interesting, and have been useful in moving forward our understanding of the brain. But those models are, universally and often quite profoundly, wrong. They do not describe what is actually happening in the brain in anything like as accurate or coherent a fashion as would be necessary to use that understanding as the spec for building a computer system. Models like Biederman's Geons or Simon's General Problem Solver are tremendously interesting ways to think about what the brain might be doing, but they fall apart at higher levels of specificity and offer no realistic path to practical systems that can see or reason about the world.
The AI researchers' response to this, quite reasonably, was to look elsewhere, and the whole lineage of (massively) data-driven, bitter-lesson-informed statistical machine learning was the result. All along, though, the psychologists knew something vitally important that the AI researchers did not. They knew something that only grows in importance as the sophistication of statistical-learning approaches increases. They knew how to measure behavior.
That's important, because measuring behavior is hard. The meat of the work done from the founding of experimental psychology in the early part of the 20th century until today was figuring out how to carefully and accurately measure the correlation between stimulus—input—and response—output—in such a way that you can say something meaningful about the processing happening between those two steps. The reason this is hard—much harder than measuring, say, the inputs and outputs of a plane's autopilot—is that human behavior is non-deterministic and decidedly nonlinear. If you ask a person to make a judgment and respond when given a prompt, a priori you have no control over how they make that judgment. This will make more sense with an example. Consider an elementary multiplication test. The inputs, or prompts, for that test are sets of numbers to be multiplied, and the output, the behavioral measurement, is those numbers, multiplied, written down. That test measures a child's ability to do multiplication. What it does not measure is how they do multiplication. The answers could be arrived at in any number of ways. The child could memorize their multiplication tables. They could draw pictures of apples. They could haphazardly write down numbers and get lucky often enough. They could use some internal method for chunking and combining the numbers that no other child uses. In order to understand how multiplication works in the brain—what sort of representations and manipulations actually underlie the child's path to the correct answer—psychologists have had to design experiments that measure specific parts of that underlying system. There are tests of a child's intuitive number sense that reveal the precise developmental moment when a child can tell that one numeric quantity is larger than another, for instance. These kinds of measurement instruments are much more challenging to develop, because you want to know not only whether the child can succeed, but how to make failure—or success—broadly informative.
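To make that concrete, here is a toy sketch. Everything in it is invented for illustration: two "children" who answer every multiplication problem correctly, by entirely different internal processes. The behavioral test alone cannot tell them apart; a finer instrument, such as measuring response time, can.

```python
# A toy sketch of the multiplication example above. Both "strategies" are
# invented for illustration: two solvers that pass the same behavioral test
# while doing entirely different things internally.

def multiply_by_lookup(a: int, b: int) -> int:
    """Answer from a memorized times table (a precomputed lookup)."""
    table = {(x, y): x * y for x in range(13) for y in range(13)}
    return table[(a, b)]

def multiply_by_repeated_addition(a: int, b: int) -> int:
    """Answer by counting out b groups of a, like drawing apples."""
    total = 0
    for _ in range(b):
        total += a
    return total

# The test sees only stimulus -> response, so both solvers pass:
for a, b in [(3, 4), (7, 8), (12, 9)]:
    assert multiply_by_lookup(a, b) == multiply_by_repeated_addition(a, b)

# Telling the processes apart requires a finer instrument. One classic trick
# is chronometry: repeated addition slows down as b grows, lookup does not.
# That is a measurable signature of the underlying process, invisible in the
# answers themselves.
```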
The great successes in the history of psychology have all revolved around learning how to measure something. The field of psychometrics, whence the techniques behind much of modern statistical data analysis emerged, is about how to quantify behavioral measurements. The field of psychophysics is about grounding careful behavioral measurements in physical quantities. Neuropsychological testing is an entire subfield of medicine devoted to the question of how to empirically measure and quantify cognitive dysfunction. Behaviorism, which comprised the influential mainstream of experimental psychology for much of the 20th century, was about carefully and quantifiably linking measurements of behavior to changes in behavior. The great "cognitive revolution" in psychology, which began in the 1950s and 1960s, was an attempt to broaden psychology to consider models of the architecture of cognition, breaking the field free of the ruthlessly single-minded focus on measurement that had defined its first century. The heart of psychological expertise, though, continues to lie in understanding the importance—and phenomenal difficulty—of behavioral measurement.
It is not a historical accident that early AI researchers engaged with psychology at precisely the moment that cognitive science was most eager to move the field past its seemingly limited hyperfocus on behavioral measurement. The impetus behind both the cognitive revolution in psychology and the establishment of AI as a field of study was the hypothesis that both the brain and digital computers could be described as information processing devices, using the methodological tools and concepts of cybernetics and information theory. To computer scientists, thinking of the brain as a modular information processing mechanism with specific architectural constraints and functions is natural. If you are going to design a computer system to solve a given problem, step one is very often to draw a block diagram laying out that architecture, those functions, and those constraints. The first step in replicating the functions of human cognition could naturally be thought to be the same. It didn't work out, because the block diagrams that cognitive scientists were able to make to describe the architecture of cognition were—if useful as theoretical tools for understanding—fundamentally wrong, at a level that made them worse than useless as blueprints for computer architecture.
It's not just that careful measurement is what the field of psychology is best at; a lack of good tools for measurement is behind a great deal of the current disconcertment over AI. The problem with AI systems as such is that nobody can precisely define what they are able to do, where they fail, and, except in limited cases, what questions they should be able to answer. This is true for LLMs—the question of whether they're intelligent and the question of whether they can ever be trusted to be accurate are both questions about what you can measure—but it's also true for image generators, and it's true for "embodied" AI systems like those in self-driving cars.
People are starting to realize, and write about, the importance of measurement for AI. The lack of good tools for evaluating models is increasingly being recognized, and is of growing concern. Too many machine learning models have been evaluated on baseline datasets that do a poor job, at best, of revealing the actual performance of the model. These baselines give an inaccurate picture of model performance in itself, and are also a very poor match for the question of how to train models that are useful for real-world applications. The state of the art, though, remains far behind the state of the art of sixty years ago in experimental psychology.
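To see how an aggregate benchmark score can conceal what a model actually does, consider a deliberately contrived sketch (all the numbers below are invented): a single accuracy figure looks respectable while hiding the fact that the model fails uniformly on an entire class of inputs.

```python
from collections import defaultdict

# Invented benchmark results: (item condition, whether the model was correct).
results = [
    ("easy", True), ("easy", True), ("easy", True), ("easy", True),
    ("easy", True), ("easy", True), ("easy", True), ("easy", True),
    ("hard", False), ("hard", False),
]

# The headline number looks strong...
overall = sum(correct for _, correct in results) / len(results)
print(f"benchmark accuracy: {overall:.0%}")  # 80%

# ...but stratifying by condition, as any experimental psychologist would
# before drawing a conclusion, shows the model has not learned the hard
# cases at all.
by_condition = defaultdict(list)
for condition, correct in results:
    by_condition[condition].append(correct)
for condition, outcomes in by_condition.items():
    print(f"{condition}: {sum(outcomes) / len(outcomes):.0%}")
# easy: 100%, hard: 0% -- a failure mode the aggregate score conceals.
```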
There is also still no widespread appreciation that tools of measurement are essential on the model-training side. I've written many times over the years about the ways that everybody who is doing supervised learning in ML—that is, the kind of learning that relies on labeled datasets—is actually running psychophysics experiments on the labelers. It's just that most of them don't realize they're doing that, and thus do a poor job of it.
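One concrete example of what taking this seriously might look like: before treating labels as ground truth, measure the labelers themselves, starting with something as simple as chance-corrected agreement between raters. Here is a minimal sketch computing Cohen's kappa for two hypothetical raters; the labels are invented.

```python
from collections import Counter

# Invented labels from two hypothetical raters on the same eight items.
rater_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "cat"]
rater_b = ["cat", "dog", "dog", "dog", "cat", "cat", "cat", "cat"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: the probability that both raters pick the same label at
# random, given each rater's own label frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum(
    (freq_a[label] / n) * (freq_b[label] / n)
    for label in set(rater_a) | set(rater_b)
)

# Cohen's kappa: agreement above chance, scaled to the maximum possible.
kappa = (observed - expected) / (1 - expected)
print(f"observed agreement: {observed:.2f}, kappa: {kappa:.2f}")
# 0.75 raw agreement shrinks to kappa of about 0.47 once chance is
# accounted for: the labels are far noisier than the raw number suggests.
```

If kappa is low, your "ground truth" is partly measuring your raters rather than the world, which is exactly the kind of confound a psychophysicist controls for before drawing any conclusions.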
This has been true for a while, but now more than ever the fields of ML and AI need to determine, with some urgency, how to take the skills and techniques that psychology has been developing for a century and adapt them to the evaluation and training of computer systems. If this doesn't happen soon, AI will have us catapulting into an uncertain future with no clarity on what the capabilities are—or could be—of the systems doing the catapulting. No regulatory response or industry agreement is likely to be useful or constructive if it is not grounded in an empirical understanding of what these systems can and cannot do, and right now ML and AI don't have the tools to build that understanding.