Helpful and unhelpful anthropomorphism
Anthropomorphizing LLMs isn't always incorrect, but the details are important
I've said before that the reason "AI" is so powerful (and difficult to get right) is that it deploys the single most productive user metaphor: that you are interacting with another person. Our brains are built for anthropomorphism, so a piece of software that explicitly uses anthropomorphism as its interaction metaphor is leveraging a whole lot of very powerful, mostly unconscious cognitive abilities of the person using it. Our predilection for anthropomorphism is so powerful that we can't really turn it off: the tendency to impute goals and internal states of mind to someone (or something) we're interacting with happens basically automatically. We get mad at the stupid TV remote because it's not doing what we want, and then we get mad at ourselves for getting mad at an inanimate object. You know the drill.
Because of this fairly fundamental fact about human nature, I generally think that telling people not to anthropomorphize ML systems, in particular LLMs, is not very productive. Those systems, in many cases, are designed so that we have no choice but to anthropomorphize. Anthropomorphism, though, can be good and bad. Or more specifically, it can be either helpful or unhelpful. Helpful anthropomorphism gives us a superstructure on which to hang inferences about the likely behavior of these systems. It gives us a way to talk coherently about the things that these very strange computer programs are likely or unlikely to do. Unhelpful anthropomorphism obscures those kinds of questions, often by imputing a robustness of internal goals and representations that is not only inaccurate but unachievable for these systems. One example of a helpful anthropomorphism is talking about what LLMs want. One example of an unhelpful anthropomorphism is talking about how LLMs lie. By talking through both, I think you end up with the beginnings of a framework for how to understand these systems.
People love to talk about how LLMs lie. If you define "lie" as "say something that isn't true," this is perfectly accurate. LLMs say things that aren't true with breathtaking regularity and (apparently) cheerful insouciance. I think, though, that there's something unhelpful about the term lie. When humans lie, they are saying something untrue with some level of intention. They are affirmatively and actively saying something that they do not believe to be true. LLMs, by contrast, are saying things that aren't true because they have no idea what truth is and no way to care about it one way or the other.
The architecture of LLMs gets pretty complicated in the fine details, but is at a broad level pretty simple. A quick recap: they are built on the "transformer," a neural net architecture that is very good at accounting for long-range connections: the relevance of the first word in a paragraph to how that paragraph ends, or of the upper left of a picture to what you might see in the lower right. The remarkable abilities of these systems are a product of their ability to keep learning statistical regularities over the course of truly enormous training regimes. To a first approximation, the training set supplied to a model like GPT-3 consists of every word ever published on the internet. The transformer is trained autoregressively; this means that the performance of the model is judged by how good a job it does predicting the next word in a sequence of training data. Over the course of training, it gets better and better at predicting the next word for every word sequence in its training data. Then, once that has happened, the model goes through a second round of training, called "reinforcement learning." In this phase the model has a secondary goal besides generating the correct next word (generally, if not universally, satisfying the person asking it a question), and it learns to predict how much closer each word it produces gets it to satisfying that goal.
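To make the autoregressive part concrete, here's a minimal sketch in PyTorch. Everything in it is invented for illustration: the toy model (which isn't a real transformer), the vocabulary size, the fake token sequence. The only thing it's meant to show is the shape of the objective: predict the next word from everything that came before it, and get scored on how wrong you were.

```python
# A minimal sketch of the autoregressive objective, not actual GPT-3 training
# code. The model, vocabulary size, and token sequence are all toy stand-ins.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
toy_model = nn.Sequential(          # stand-in for a real transformer (no attention here)
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 16))  # a fake 16-token training sequence
logits = toy_model(tokens[:, :-1])              # predictions made from each prefix
targets = tokens[:, 1:]                         # the actual "next word" at each step

# Training minimizes this loss: how surprised the model is by each next token.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
```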
Those two phases, autoregressive next-word prediction and interlocutor-satisfying reinforcement learning, make up the training regime for LLMs like ChatGPT. And it is that training regime that establishes which anthropomorphizing metaphors make sense and which do not.
I said up above that it was reasonable to talk about LLMs like GPT-3 "wanting" things. That comes from the reinforcement learning. Reinforcement learning is a training methodology where you establish an eventual goal and can then partial out how much each step along the way is likely to help or hurt you. It's central to the way DeepMind trained their world-beating Chess and Go programs, because you can take a desired end state (you have won the game) and split the credit for it among all the preceding steps (in this case, moves in the game). In the case of LLMs, what's happening is that the people training the models ask humans to interact with the model, and then ask those humans, after the interaction, whether the model gave them a good or a bad answer to their question. So in that case, what the LLM wants is to please its questioner. That metaphor, that the LLM wants to make its interlocutor happy, for all that it imputes more agency than the model necessarily deserves, is basically the right way to think about it.
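Very roughly, and with every number and token below invented purely for illustration, the credit-assignment idea looks something like this: one human judgment about the whole answer gets spread backward over the words that produced it.

```python
# A toy sketch of reward credit assignment, not the real RLHF pipeline.
# The response, the rating, and the discount factor are all made up here.
response_tokens = ["Sure", ",", "here", "is", "an", "answer", "."]
human_rating = 1.0   # the rater said this answer was satisfying
discount = 0.95      # earlier words get a slightly smaller share of the credit

# Every token that led to the pleasing answer receives a share of the reward,
# so word choices like these come to look like ways to "make the questioner happy".
credit = [human_rating * discount ** (len(response_tokens) - 1 - i)
          for i in range(len(response_tokens))]

for tok, c in zip(response_tokens, credit):
    print(f"{tok!r}: credit {c:.2f}")
```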
The fact that this interpretation is correct, though, also gives an indication of what some of the unhelpful anthropomorphisms might be. Because the internal logic of the reinforcement learning step means that wanting to please its interlocutor is the only thing the LLM cares about. Whether its sentences are well-formed, whether it is deploying factual information, whether it is giving a consistent answer, whether it is giving an interesting answer, the pleasure of the work of creation: these are not concepts that exist in a meaningful way when an LLM is processing information. They aren't part of its incentive structure.
That's important, because the way our brains work is something like a two-level system, in which the cortical neural networks that people generally think of when they think of a human brain are coordinated and conducted, almost like the musicians of a symphony, by the less commonly understood brain structures that control and interact with our emotional and physiological state. When somebody asks us a question and we answer, unlike an LLM, our answer is contingent on all kinds of internal representations and understandings. Is the answer pleasingly concise? Do we believe it to be true? Are we trying to give a good answer? Are we trying to give an answer that's useful, even if it's not precisely what our interlocutor knew to ask for?
Unless you're pretty up on the architecture of both LLMs and human brains, it can be easy to miss that LLMs lack this rich combination of internal states conducting their activity. That's because the external indicators of those states, the linguistic output of a human who has a certain set of internal beliefs and values, are there. When the LLM stochastically parrots its training data, it is working from a corpus produced by humans. So it knows how to produce content that has the external form of a human-like answer. That's what so strongly triggers our impulse to anthropomorphize. But under the hood, as it were, its incentive and value structure is vastly simpler than, and different from, the incentive and value structure of a human. Only by stepping back from our naive anthropomorphism and understanding in a more considered way where these algorithms do and do not map to the operations of the brains they are intended to mimic can we start to get a picture of what these systems actually do.