The emptiness at the heart of emotion recognition
The strange science underlying one of ML's most problematic subfields
Every few months a new company flashes into the news doing emotion recognition with machine learning. This time it's Smart Eye. They already make driver (and pilot) monitoring software, capable of identifying whether you are looking at the road, a reasonably mature technology. Now, though, they're looking to expand into automatically sensing the mood of remote workers. People are justifiably skeptical of the uses to which this could be put. We looked into some of the application areas people try to address with emotion recognition when we were starting our company. HireVue, mentioned in the link above, does automated facial analysis for hiring. That's something we investigated, although we never had any interest in looking at emotion recognition. The reason we were uninterested is that the entire field of emotion recognition, from the foundational psychological work by Paul Ekman through to all of the ML based (however loosely) on that work, rests on a logical fallacy. It would probably not be going too far to say that it's all, at a technical and scientific level, bullshit.
Paul Ekman broke into the public consciousness with his work on lying. He claimed to have developed techniques that use facial "microexpressions"—tiny, unconscious movements of facial muscles—to detect lying (credulous write-up here). It doesn't, as best science has been able to determine, actually work, but his approaches are catnip to police departments, which never met a biased pseudoscience they didn't want to introduce into evidence. He gives expensive seminars on his method. He inspired a hit TV show that took his assertions and made them that much more fanciful. His scientific reputation, though, as well as the field's intellectual infrastructure, was built on his work identifying what he claimed were the universal components of facial expressions of emotion.
His scientific approach to determining these universal components was, at best, unusual. My adviser, a legendarily careful and productive researcher in the world of face perception, calls it "science by assertion". Ekman's starting place was a (relatively) fully formed theory about what muscle movements predicted what internal emotional state. In order to test this, he hired actors. He instructed those actors—skilled, presumably, in precisely controlling their facial muscles—to arrange their faces just so while he photographed them. Stretch this muscle, relax that muscle. He did this without telling the actors what emotion they were supposed to be producing, although one suspects that holding a rictus grin would clue them in pretty fast.
After collecting the pictures of the actors with their varying arrangements of facial muscles (which is to say, facial expressions), he showed those pictures to subjects: first American university students, later people from other cultures. He asked them to identify the emotion the pictured person was feeling. He found that there was a great deal of agreement about what sort of emotions the people in the photographs were feeling, and he was able to sort those emotions into six underlying dimensions of emotional state, the perception of which he claimed to be universal.
There's a problem with that approach. At no point in this exercise did any of the pictured subjects actually feel the emotions that were being perceived. He didn't take pictures of people who were actually happy, producing the unreflective facial expressions that correspond to that state. He took pictures of people making facial expressions he believed would be perceived as happy. The experiment—intentionally, to be clear—begs the question of what facial expressions are naturally produced by people who are feeling a certain way. He did not show that humans are universally able to perceive each other's emotional state. He showed, to a first approximation, that you can reliably emulate a given emotional state, at least well enough to lead an observer looking at a still photo of your face to a specific conclusion. Those two things, it always seemed to me, were pretty different.
Remarkably, Ekman's highly idiosyncratic and not easily interpretable work became the basis of multiple flourishing subfields. His images became the gold-standard representations of facial emotions for psychological studies, even though, again, not one of them verifiably depicted somebody experiencing the labeled emotional state. In machine learning, his feature-based approach became the basis for a large number of studies of whether you can train a model to perceive facial emotions. Or rather—because again, this comes from the same Ekman framework—imputed facial emotions. Because at no step along the path has much attention been given to the question of whether the people pictured in the training data for these ML models are actually feeling the emotions with which their images are labeled.
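To make that feature-based pipeline concrete, here is a minimal sketch of the usual shape of such a system: estimate which facial "action units" (the muscle movements Ekman's FACS coding scheme catalogs) are active in an image, then map combinations of them to one of the labels. The detector is assumed to exist upstream, and the specific AU combinations and the label_from_action_units helper below are illustrative pairings, not any particular system's actual rules.

```python
# Sketch of a FACS-style, feature-based emotion labeler.
# Assumes some upstream detector has already reported which action units
# (AUs) are active; the combinations below are illustrative, not definitive.

# Commonly cited (illustrative) AU combinations for four of the six labels.
EMOTION_RULES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "disgust":   {9, 15},        # nose wrinkler + lip corner depressor
}

def label_from_action_units(active_aus: set[int]) -> str | None:
    """Return the first emotion whose required AUs are all present.

    Note what this representation cannot express: context, who else is in
    the room, or whether the expression is posed. A claim about muscle
    movements is silently promoted to a claim about feeling.
    """
    for emotion, required in EMOTION_RULES.items():
        if required <= active_aus:
            return emotion
    return None

# AU6 + AU12 -> "happiness", whether the smile is felt, posed,
# or professionally required of a retail employee.
print(label_from_action_units({6, 12}))
```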
That is a practical question, tied to the nature of emotion: how do you guarantee that the person you are photographing is happy? There's a follow-up question: is that important? Ekman's hypothesis says that it is not, because the ingenuous, unconscious production of a facial expression matching one's mood activates specific muscles, and a face with those muscles deliberately activated is therefore indistinguishable from one expressing genuine happiness. This makes training a model much easier: you can just assume that images labeled "happy", say, or "disgusted", if captured and evaluated according to the proper Ekman methodology, contain exactly the information your ML system needs to recognize real happiness, or disgust, or boredom.
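As a sketch of how that assumption gets baked in, here is roughly what a standard training setup looks like, assuming PyTorch/torchvision and a hypothetical posed_faces/ directory whose subfolders are named for the six Ekman labels. Nothing in the objective ever touches the pictured person's actual internal state; the posed label simply stands in for it.

```python
# Minimal sketch (not any vendor's actual pipeline) of training an emotion
# classifier on posed-expression images, where the directory an image sits
# in is treated as ground truth for how the subject felt.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

EKMAN_LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical layout: posed_faces/happiness/0001.jpg, posed_faces/disgust/0002.jpg, ...
# The label records the expression the subject was asked to pose, nothing more.
dataset = datasets.ImageFolder("posed_faces", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, len(EKMAN_LABELS))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:
    # `labels` encode the pose the image was captured under; the loss treats
    # them as if they were measurements of the subject's internal state.
    logits = model(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Everything downstream, including whatever accuracy number eventually gets reported, is measured against those same posed labels, so the system can score well without ever having been checked against a genuine emotion.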
What it doesn't capture—what none of these systems capture—is knowledge that different people express emotions differently, or that you might interpret the emotions of a stranger differently from those of a close friend, or that emotional expressions might vary in the presence of another person. Think about several different scenarios with the same person: in one, they're a retail employee acting friendly in a commercial transaction. In another, they've received a well-considered and moving gift from a close friend. In a third, they're in a good mood while sitting alone using a computer. In a fourth, they've been dealt an excellent set of face-down cards in a high-stakes game of poker. In all of these situations, the person—as a matter of their internal state—is happy. The claim of ML models trained according to the state of the art in emotion recognition would be that their facial expression, in all of these situations, is in some sense the same. Your mileage may vary, but I find that distinctly unconvincing, and I didn't even include the counterfactual of somebody pretending to be happy.
Still, companies forge on with these technologies, offering them up for use cases as disparate as gauging the sentiment of TV viewers and determining whether employees working from home are engaged. These are vividly dystopian applications even taken at face value, but understood in the context of the strange intellectual hollowness of emotion recognition research, they are at best tools for laundering bias, and at worst pure snake oil.
There's a larger theme in ML/"AI" that ML for emotion recognition exemplifies: in many cases the most promising-seeming applications for ML (the kinds that people self-servingly call AI, in particular) involve the automation of human social relationships. But the management of social relationships, from perception through ruminating self-consciousness through to action, is the very heart of what our brains do uniquely well. To carve off a well-defined slice of the universe of social cognition and recreate it in a machine learning model requires tremendous deftness and care, for it is an exquisitely complex and interdependent system to which we devote phenomenal amounts of neural real estate. Too often, though, which is to say to a first approximation almost always, the training effort is conditioned simply on whether plausible-seeming labeled data exists, with the hard part of the job considered to be the model architecture and training regime—the more hardcore engineering bits, as it were—leaving the hard questions unacknowledged and unexamined. Those hard questions, in this case and in many others, resolve to whether your trained ML model is doing any of the things you claim it is doing at all. Too often the answer is no, and almost universally when the answer is no, the market these models find is one that hopes to use them to launder unpleasant or antihumane decision-making through a wash of nominally objective and scientific technology. Emotion recognition might be the poster child for this process, but it is far from the only exemplar.