Autonomous cars operate on a set of information about the world that is completely different from the information human drivers rely on. It's a hard concept to develop an intuition around, but it's central to understanding why these vehicles are never likely to behave precisely like human-driven ones. The feature vector—the information the car collects about the environment—isn't the correct input for a system whose output is supposed to function like a person driving a car. This mismatch between input and desired output is one of the major factors behind the ongoing challenges in developing and deploying these cars. It's also the major factor behind the challenges people have in understanding what autonomous cars do well and do poorly.
I’ll give you an example of the latter point.
A couple of years ago, Tesla released a video of their perception system. It looked very impressive, with lots of semantic information—this is a pedestrian, that's a crosswalk—superimposed on the video. I thought it was deceptive. Not because I didn't believe that they were able to extract the semantic information they displayed. I found it deceptive because that semantic information was superimposed on top of the camera image of the world. The actual autonomous systems in Teslas could only see the semantic information. Human viewers, on the other hand, saw the semantic information in addition to everything we usually see. Our perceptual systems automatically take in information about the visible world and make a host of inferences about it. When we watch the video, we ineluctably overestimate how much the Tesla vehicle really knows about what's going on.
So what do autonomous vehicles see, and how does that differ from what we see? At the level of the available hardware, there are vast differences. Humans have two eyes, facing forward. We can look in more-or-less any direction, but only in one direction at a time. Autonomous vehicles have a collection of sensors, all around the car. They have cameras pointing in all directions. Often, they have a stereo camera facing forward. These cameras don't have the flexibility of focus or focal resolution of the human eye, but there are a lot of them. Sometimes they have 360-degree cameras that computationally stitch together an image of the world around the entire car. Autonomous cars have lidar, capable of determining the distance of objects in the world with high reliability even when it's dark out, but at a relatively low frame rate (generally somewhere in the range of 10 to 20 frames per second) and with a relatively short maximum distance. Autonomous cars also have radar, which is excellent at determining range but has no real spatial resolution. That is, it can tell you that something's in front of you, but not generally what it is or more than approximately what direction it's in. Some autonomous cars have more esoteric sensors, like thermal cameras or ground-penetrating radar, but they aren't widely deployed. There is not, at the moment, a consensus that they'll be necessary. Autonomous cars can use all their sensors at once, which means that, unlike humans, they can see in every direction simultaneously.
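To make the hardware picture concrete, here's a rough sketch, in Python, of a sensor suite along the lines I've just described. The specific numbers (coverage, frame rates) are illustrative ballpark figures, not the specification of any particular vehicle.

```python
from dataclasses import dataclass

@dataclass
class SensorModality:
    """Illustrative characteristics of one sensor type on an autonomous car."""
    name: str
    coverage_deg: float     # combined horizontal field of view across all units
    frame_rate_hz: float    # roughly how often a full update arrives
    gives_range: bool       # measures distance directly?
    gives_appearance: bool  # captures the texture and color needed for classification?

# Ballpark figures only; real sensor suites vary widely between developers.
SENSOR_SUITE = [
    SensorModality("camera", coverage_deg=360, frame_rate_hz=30,
                   gives_range=False, gives_appearance=True),
    SensorModality("lidar", coverage_deg=360, frame_rate_hz=10,
                   gives_range=True, gives_appearance=False),
    SensorModality("radar", coverage_deg=360, frame_rate_hz=20,
                   gives_range=True, gives_appearance=False),
]

# Unlike a human driver, the car gets all of these streams at once, in every direction.
for sensor in SENSOR_SUITE:
    print(f"{sensor.name}: {sensor.coverage_deg} deg of coverage at ~{sensor.frame_rate_hz} Hz")
```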
So on the face of it, autonomous vehicles would seem to have much more information, on a moment-by-moment basis, than humans do. That is not the case. The amount of raw information coming into their sensors might be greater (maybe? It's complicated), but the picture of the world that these vehicles use to make decisions is, in important ways, far more limited and worse than the picture we humans can use. That's a pretty general statement, as I'll explain, but I've thought about pedestrians a lot, so I'll use a pedestrian as a first example.
When a human driver sees a pedestrian—even a low-resolution image of a pedestrian—within about a quarter of a second (the time it takes for light reflected off that pedestrian to hit the driver's retina and for the neural representations it produces to spread all the way through the hierarchical visual areas of the brain), that human driver knows a vast amount of information about the pedestrian. She knows, first of all, that she's looking at a person. The driver knows if that person is a child, or if they're elderly. She knows if the pedestrian is walking like they're in a hurry, and what that means for how they might cross the street. The driver knows if the pedestrian is paying attention to her car. The driver knows if the pedestrian is carrying something. She knows if they're angry. She knows if there's something a little off about that pedestrian. She knows that the pedestrian is still going to be in roughly the same spot in the world the next time she looks, that the pedestrian can only move so fast. She knows if they're gesturing, at her or at somebody else. She knows if they're alone, or accompanied by somebody else. If the pedestrian is behind a car, the driver still sees the pedestrian, and understands their physical relationship to the car.
I’ll stop there, but really that’s just a small sample of what we as human drivers know intuitively a fraction of a second after our eyes have first lit on a pedestrian.
There's something very strange, of course, about saying that we “know” these things. Many of the examples I gave are questions about somebody's beliefs or state of mind. The actual truth of what they believe is locked up in their head. In fact, in human perception, the things we “know” often aren't facts about the world at all. Recovering factual information is not what our visual system evolved for. What our visual system does is extract the insights about the world around us that are necessary to guide our behavior. We look at the world and perceive the aspects of it that matter for deciding what we're going to do. Our feature vector—the input into our decision-making systems—has been exquisitely matched, over thousands of millennia, to the desired behavioral output.
When an autonomous vehicle looks at a pedestrian, it sees a lot less than that human driver does. It knows for certain—if it has lidar—that there's something there. It can almost always tell that the something is a pedestrian, unless that pedestrian is behind a car, in which case it gets really difficult. If there are two pedestrians, it might or might not know that. It can generally tell what size the person is. It can often, but not always, connect that pedestrian in that location to that same pedestrian in a slightly different location a fraction of a second ago. If it can figure out that it's looking at the same pedestrian, it can estimate how fast they're moving, and in what direction. Sometimes it can tell if the pedestrian is gesturing. Sometimes, in a limited way, it can identify the likely meaning of a gesture. But the sum total of what it perceives is much more limited, and much less useful, than what humans perceive.
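To put that in concrete terms, here's a minimal sketch, in Python, of the pedestrian's slice of an autonomous vehicle's feature vector. The field names are hypothetical, and real perception stacks carry more bookkeeping than this, but the shape of the information is the point: a label, a box, a position, and a velocity estimate if tracking has held together.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackedPedestrian:
    """Hypothetical perception output for one detected pedestrian.
    Compare these fields with everything a human driver knows a
    quarter of a second after glancing at the same person."""
    track_id: int                       # identity, if association across frames succeeded
    class_confidence: float             # how sure the system is that this is a pedestrian
    position_m: tuple[float, float]     # x, y in the vehicle's frame, in meters
    size_m: tuple[float, float, float]  # bounding box: length, width, height
    velocity_mps: Optional[tuple[float, float]]  # None if the track is too young to estimate
    gesture: Optional[str]              # occasionally "waving"; usually None

ped = TrackedPedestrian(
    track_id=17,
    class_confidence=0.93,
    position_m=(12.4, -2.1),
    size_m=(0.6, 0.6, 1.7),
    velocity_mps=(0.0, 1.3),
    gesture=None,
)
# Everything the planner "knows" about this person is in the fields above:
# no age, no attention, no mood, no companions, no sense that something is off.
print(ped)
```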
I used pedestrians as an example, but this gap in the amount of information available is not just true for pedestrians. When humans look at other vehicles, we are making inferences based on thousands of characteristics of that vehicle. The make of the car. If it's cheap or expensive. Whether the driver is visible. Whether the way the car is moving—how fast, how steady—indicates that the person is paying attention. Whether the position and orientation of the vehicle—in the robotics literature this is called 'pose'—is communicating aggressiveness. When humans look at an intersection, we see not just the road signs and signals (in fact we often ignore them) but a tapestry of coordinated and conflicting goals and intentions. That tapestry evokes, for us, past experiences with complex social negotiations.
For some applications, this difference in information isn't a huge deal. When you're driving on a highway, treating the vehicles around you as cuboids—3D boxes—with a heading and velocity is, with some exceptions, pretty reasonable. For robots that aren't really expected to behave the way a human would, such as sidewalk delivery robots, it's again mostly okay.
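As a sketch of why the cuboid picture is good enough on a highway: if a nearby vehicle is just a box with a heading and a speed, predicting where it will be a moment from now is a one-line constant-velocity extrapolation. This is an illustration under that simplifying assumption, not any particular planner's prediction model.

```python
import math

def predict_cuboid_position(x: float, y: float, heading_rad: float,
                            speed_mps: float, dt_s: float) -> tuple[float, float]:
    """Constant-velocity prediction for a vehicle modeled as a cuboid
    with a heading and a speed. Adequate for much of highway driving,
    and completely silent about intent."""
    return (x + speed_mps * math.cos(heading_rad) * dt_s,
            y + speed_mps * math.sin(heading_rad) * dt_s)

# A car 30 m ahead in the next lane, doing 30 m/s: where is it half a second from now?
print(predict_cuboid_position(x=30.0, y=3.5, heading_rad=0.0, speed_mps=30.0, dt_s=0.5))
# -> (45.0, 3.5)
```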
The problem for autonomous cars is that while they can perceive enough of the world to avoid hitting anybody, they have to do much more than that. Autonomous cars that are working correctly have to be able to interact seamlessly with humans. The output of the autonomous driving system has to be a set of behaviors that lets the car interact in a believably human-like way. As I've discussed before, the social nature of driving demands it. The aspects of the world that people perceive while driving—our feature vector—comprise the set of information that humans use to make driving decisions. Autonomous cars are trying to make the same behavioral decisions as humans with less, and different, information. There is a mismatch between the feature vector and the output.
Autonomous cars have made remarkable, even extraordinary, progress. But when you look at their performance, you see strange gaps. Cars that can't pull over. Cars that freeze for no obvious reason. That's because the input is not the right size and shape for the desired output. If you want to know why progress in autonomous cars seems to stall so often, and why these vehicles still have difficult-to-explain failures, the mismatch between the feature vector and the behavior these vehicles are aiming for is almost always the answer.