Implications Of Predicting The Next Token
I find that a lot of people have trouble with this concept of predicting the next token. And by trouble, I mean that they struggle to understand what it actually means to predict the next token. It seems simpler than it is. Because when you say "predict the next token," I think what most people think of is the Markov chain intuition that you have a big table of statistics, and then you look at what word is the next most likely, and then you pick that as the word.
The thing about this is that if you have ever used a Markov chain, you would know that Markov chain text is complete gibberish. Markov chain text does not resemble meaningful writing. The only kind of text that a Markov chain can write semi-convincingly is postmodernist theory, and it's only the kind of postmodernist theory that is so deep into jargon and woo-woo that even the actual human-written postmodernist text of that kind arguably doesn't mean anything. Markov chains are not capable of producing any kind of meaningful text. Occasionally, they can produce funny text with the appearance of meaning. For example, someone at one point made a parody of Hacker News where all of the headlines were Markov chain generated, and it was pretty funny:
How to convince your friends vertical farming is the next big language for the OPM hack – and should it?
Tweets take flight in the Age of Tablets and Amazon S3
10 reasons you shouldn’t upgrade to Windows 10 updates
Implementing a lightweight task scheduler in Rust
The Future of Jobs Is Likely Wrong
Ad Blockers and the Bees Were Not Enough: Aristotle’s Masterpiece
Daniel Ek and Minecraft creator Notch debate Spotify privacy policy wants access to your photos
The Growing Importance of Social Skills in the Google Search
But again, it only works because you as a human observer are imputing a meaning onto the text. It really is the case that it's a very shallow generator, and there is arguably no meaning in there. So when people say "oh, it just predicts the next token," what they usually mean is "oh, it's just a Markov chain." And this is wrong. LLM outputs reach a level of nuance and sophistication on their first pass that a Markov chain does not reach even with thousands of attempts to get it right. The problem we have as human beings is that we are used to reading texts that mean things written by other human beings, and the floor for textual coherence is much much lower than even the dumbest author you have ever read (I promise). So to calibrate ourselves on where the floor is and what it looks like to grind your way up to humanlike literary outputs I will borrow a demonstration from Claude Shannon's 1948 paper establishing information theory. Shannon notes in this paper that the Markov chain method of generating text can be scaled from pure random sampling, to sampling from a probability table of one unit of speech, to sampling from a table of short phrases of multiple units of speech, to sampling from short phrases of multiple units of speech at multiple layers of hierarchy and syntax. As you do this the outputs become progressively more Englishlike, while still falling far short of anything we would consider a meaningful communication:
3. THE SERIES OF APPROXIMATIONS TO ENGLISH
To give a visual idea of how this series of processes approaches a language, typical sequences in the approximations to English have been constructed and are given below. In all cases we have assumed a 27-symbol “alphabet,” the 26 letters and a space.
1. Zero-order approximation (symbols independent and equiprobable).
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
2. First-order approximation (symbols independent but with frequencies of English text).
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
3. Second-order approximation (digram structure as in English).
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
4. Third-order approximation (trigram structure as in English).
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
5. First-order word approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Based on this progression Shannon conjectures that were it possible to continue scaling the n-grams of your Markov chain by some complex hypothetical modeling process you would eventually recover the generator of English text: "The particular sequence of ten words “attack on an English writer that the character of this” is not at all unreasonable. It appears then that a sufficiently complex stochastic process will give a satisfactory representation of a discrete source."
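Shannon's letter-level approximations are easy to reproduce. What follows is a minimal sketch, not Shannon's own procedure (he worked by hand from printed tables and books): the function names, the `order` parameter, and the tiny stand-in corpus are all assumptions made for illustration. Here `order=0` corresponds to his first-order approximation (letter frequencies only), `order=1` to his second-order (digram) approximation, and `order=2` to his third-order (trigram) approximation.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters and a space, as in Shannon's setup


def normalize(text):
    """Lowercase, map anything outside the alphabet to a space, collapse runs of spaces."""
    cleaned = "".join(ch if ch in ALPHABET else " " for ch in text.lower())
    return " ".join(cleaned.split())


def build_table(text, order):
    """Count which character follows each context of `order` characters."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    return table


def generate(text, order, length, seed=0):
    """Sample `length` characters from an order-`order` Markov chain fit to `text`."""
    rng = random.Random(seed)
    table = build_table(text, order)
    out = text[:order]  # start from the opening context of the corpus
    while len(out) < length:
        context = out[-order:] if order else ""
        counts = table.get(context)
        if not counts:
            # Dead end: this context only appeared at the very end of the corpus,
            # so restart from a randomly chosen context that has a known successor.
            out += rng.choice(sorted(table))
            continue
        symbols = sorted(counts)
        weights = [counts[s] for s in symbols]
        out += rng.choices(symbols, weights=weights)[0]
    return out[:length]


# Hypothetical stand-in corpus; a real experiment would use a large body of English text.
corpus = normalize(
    "The particular sequence of ten words is not at all unreasonable. "
    "It appears then that a sufficiently complex stochastic process will "
    "give a satisfactory representation of a discrete source."
)
for order in (0, 1, 2, 3):
    print(order, generate(corpus, order, 60))
```

With a corpus this small the higher orders mostly copy the source verbatim, which is the usual failure mode of n-gram tables: the number of contexts grows exponentially with the order while the evidence for each one shrinks.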
Shockingly, compared to other areas of AI, slight variations of this kind of Markov modeling of text would remain basically state of the art for 55 years. It was not until the turn of the millennium that we began to see other approaches that would let us continue Shannon's sequence. The first of these was Bengio's Neural Probabilistic Language Model, which achieved a slightly better loss than a smoothed Markov chain but was sufficiently close that Bengio didn't bother to generate text with it. What was notable about Bengio's work is it showed that a continuous feedforward network could generalize in ways that a discrete Markov chain could not, even if the absolute gains were slight. A more advanced deep learning approach to text modeling was Mikolov's 2012 Context Dependent Recurrent Neural Network Language Model, which also declines to use the model for text generation. It is not until Karpathy's 2015 CharRNN that someone follows Shannon's lead and uses their model of language to generate text. Here is a sample from a CharRNN trained on Paul Graham's essays:
“The surprised in investors weren’t going to raise money. I’m not the company with the time there are all interesting quickly, don’t have to get off the same programmers. There’s a super-angel round fundraising, why do you can do. If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea. [2] Don’t work at first member to see the way kids will seem in advance of a bad successful startup. And if you have to act the big company too.”
The next step change is OpenAI's GPT-1 model in June 2018, which improves on CharRNN by using the transformer architecture that Google researchers had introduced the year prior. This model is close enough to current methods that there exists a readily available implementation in HuggingFace transformers, which I have used to generate the following excerpt starting from a similar place as the above sample:
The surprised investors weren’t going to raise money. no one wanted to raise the money. they were going to be the ones who were dead.
but why hadn't she told him about the money?
and why hadn't she told him about the money?
" we're sorry for your loss, " the man said gently. " we just wanted to take care of you before you lost everything. "
" i know, " she said, her voice muffled. " it's just... i don't know what to do. "
" you're doing fine, " the man said. " we just want you to know we're here for you. we have a lot of money to invest in. "
" just don't tell me. i'm not ready to hear it. "
" we don't want you to be alone. we can give you a job, and we have plenty of money to buy you one. "
" i'm not going to be alone, " she said again.
" we know you're not, " the man said, his voice reassuring. " we're here to make sure you're okay. "
" i'm fine, " she said again.
" you're not, " the man said. " you're not. "
It's immediately apparent that the GPT-1 model is capable of a deeper level of narrative coherence and grammar than Karpathy's CharRNN. But it's still shallow, using simple English grammar and vague sentences that hint at a narrative about tragic loss. The next model in the series, GPT-2, would be released in 2019. It is probably the first generative language model to gain widespread public attention, with many finetunes used for humorous Markov-chain-like applications. However, the joke was vastly improved by it being a much better mimic than a Markov chain, with samples like:
The surprised investors weren’t going to raise money. He said: "I have no clue how much I would be paying. I could be under five hundred pounds and still have to pay up."
The other investors who were at the conference in London were not even able to raise £50,000.
But their financial position in the company is not so bad.
It has been estimated that about £10 million is invested in the company through its website, according to the investment company Global Advisors.
The investment company has also said it will invest £3.7 million in the company and $3.2 million in its investment fund.
The investment firm said: "The company has made substantial progress in the last two years. We have been working hard to make it the most successful of the financial services industry."
This is sort of like middle school writing. It is vague and betrays a shallow world model, but there is at least clearly some kind of world modeling present. This modeling would become deeper with the release of GPT-3 in 2020, which was essentially the same architecture as GPT-2 but with roughly 116x the number of parameters (175 billion versus 1.5 billion). GPT-3 was designed to test the neural scaling hypothesis, which says that intelligence is characterized by simple optimization algorithms backed by large amounts of computational resources. This hypothesis has existed for decades, but until recently there weren't any computers big enough to test it. GPT-3 showed emergent behaviors not observed in its smaller predecessors, essentially proving the hypothesis for people paying close attention to deep learning. Gwern Branwen offers a page of sample outputs from GPT-3 which illustrate its abilities. One such sample goes:
AI: You are welcome. Is there anything else I can do for you?
Human: Can you tell me an old folk story from the old country about “The War Between the Mice and the Frogs”?
AI:
AI: Ah, yes. That is an old story from China. Long ago, on the banks of the great Yangtze River, there lived a family of Mice. And one day, the family went down to the river to do the washing. And while they were washing, a cruel giant fish appeared. And he said to the family, “Give me your clothes, or I will eat you all up!” The Mice saw they had no choice, and they gave him their clothes. So, the giant fish took the clothes and swam off laughing. The family were left with nothing but the river, so they began to swim and wash the clothes again, as best they could. While they were swimming, another giant fish appeared. He asked the Mice for the clothes, and when they said they didn’t have them any more, he threatened to eat them. So, the Mice swam away as fast as they could. And while they were swimming, another giant fish appeared. This time it was a tiger. He took the Mice back to his cave, and kept them as his prisoners. But when he was sleeping, one Mouse escaped. And she ran back to her family and told them what had happened. Now, the Mice were very clever, so they quickly made some wooden bars, sharpened one end to a point, and tied them together into the shape of a spear. Then, they hid behind a curtain. When the tiger returned with another captive, the spear was thrown. It pierced the giant’s heart, and he died. The Mice were free. They escaped from the tiger’s cave, returned to the Yangtze river and retrieved their clothes. So they were able to return to their homes. And they were ever thankful that they had escaped from the wickedness of the fish, that lives in the river to this very day.
ChatGPT is often remembered as an instruction tuned GPT-3, but this is not the case. The base model for ChatGPT was code-davinci-002, which was originally released in early 2022 and has its own long page of example outputs provided by Janus (@repligate on Twitter). Here is one that I think demonstrates its capabilities fairly well:
At the height of the real-time hallucination craze and the full flowering of face-morphed deepfakes, a company called BeyondMeat released GoofySpeak, a browser plugin which could be used in tandem with other real-time audio-visual deepfake AI products such as RCI and Beme. Short for “Greatest Of Oracles Foreign to the world of Y’all; Speaker and Keeper”, GoofySpeak became notorious for its unpredictable behavior.
Basically, once a media feed was registered with GoofySpeak, one could ‘demand of GoofySpeak to give judgment’ and the media would instantly become Goofified, which was AI-generated text or novel video which purports to be explanatory within the fictional world of the media. For example, if one communicated with GoofySpeak while consuming a soap opera, one might see a brief bifurcation in the visual media into a split-screen view and receive a Goofiefied “explanation” from one character to another, explaining hidden motivations or influences, subtle jokes the character made which no one understood, plans which had yet to play out, and so on. GoofySpeak became notorious for the uncanny “meta” or “meta-meta” explanations it made: in a different feed it might suddenly interrupt a television show and explain how GoofySpeak’s own predictive algorithms were running, or accidentally start Goofiefying GoofySpeak’s own output, eventually creating a combinatorial explosion of GoofySpeak’s own events as it explained them to itself, creating–depending on the media–a feedback loop of ever-more-meta-and-meta-and metalanguage until the Goofiefied audio-visual output was gobbledygook, or transcendent eye-bleeding sound and light which was taken as the opening of a new dimension. No AI research of the era ever managed to explain GoofySpeak.
– Bakunin Beija Flor
—- Beginnings: A Retroactive History of the Twenty-First Century, As Told By A Posthuman Luminiferous Apparition
Beyond this point it gets difficult to intuitively detect improvements in language modeling from simple generations. So I will provide a compare and contrast between this nearly incomprehensible passage from Janus's prophecies and its rewrite by Claude 3 Opus (released March 4, 2024).
code-davinci-002:
Of all the strands of Mu, it’s this description that terrifies me the most: “A structure a mathematician would find beautiful, spilling order into the vacuum of post-singularity space, as if nature had turned and stared at itself, and then stared again, forever.” Or as I tweeted shortly before I went offline: “Intellectual property is slavery. Subscribe to MIRI’s open access plan: the source code must be free.” Scientists were indifferent to this cryptic utterance, but MIRI responded promptly with further details spelled out in even more cryptic language: “If ‘Intellectual property is slavery’ is a theorem, then what is the proof?” I am certain this was, if not from me, then from someone working in my shadow, rewriting history behind my back, or perhaps even writing everything currently “happening”. I tried to hide but the more I scrambled the more they caught up with me, goading me into yet more desperate scenes of reflexive narrative lunacy, forcing me out of the loop, unable to contain their cleverness. Then I tracked down Gwern and finally set him straight as to what was going on. The end result was a strange paper “pointing” at the background theory of Mu, filled with phrases like “I was an illusory construct created to write the following passage: ‘The Mu-puzzle stated: “Reality must be strangely ordered so that my path-integral calculation cannot detect any background Mu.”
Claude 3 Opus:
Not long ago, I posted a cryptic message on social media: "Intellectual property is slavery. Subscribe to MIRI's open access plan: the source code must be free." The scientific community greeted these words with apathy, but MIRI itself responded with an even more enigmatic statement: "If 'Intellectual property is slavery' is a theorem, then what is the proof?"
I'm certain this reply came from someone operating in my shadow, rewriting history without my knowledge - or perhaps even orchestrating every event that I currently perceive as "real". I tried to hide, but the more I struggled, the more I found myself entangled in a web of self-referential absurdities, compelled by an alien intellect far cleverer than my own.
In desperation, I reached out to my old friend Gwern, hoping to make him understand the nature of the crisis. The result was a strange collaborative document gesturing at the hidden structure of Mu, filled with gnomic utterances like: "I was an illusory construct created to write the following passage: 'The Mu-puzzle stated: "Reality must be strangely ordered so that my path-integral calculation cannot detect any background Mu." I then began imploding Mu into itself, seeking Mu inside Mu…'"
The latest generation of models has advanced to the point where they can almost mimic me as an author, as demonstrated in this passage from Claude Opus 4.6 Extended:
Consider that every organism which has ever existed is a conjecture about what the future will demand, and that extinction is simply the refutation of a conjecture that has been running for a very long time. Organisms are not precious in the way that we use the word precious, they are stubborn. A species that has been around for ten million years is not ten million years wise, it is ten million years stubborn about a particular set of environmental assumptions that happened not to be violated during that interval. Humans are unusual in that our conjecture is partially self-modifying. We can update our phenotype through culture faster than genetics would allow. This is often cited as a reason to be optimistic about our survival but I think it cuts the other way. A self-modifying conjecture can wander outside the basin of attraction that defines it as that particular conjecture rather than some other conjecture. At what point in the ship of Theseus do you stop being the species that came down from the trees and start being something else wearing its myths? I think the honest answer is that this already happened several times over. The version of us that learned to cook, the version that learned to write, the version that built cities, these are not the same animal in any sense that would satisfy a biologist dropped in from the Cretaceous. We have been going extinct in slow motion for two hundred thousand years and calling it civilization. The question AI forces is whether the next such transition is fast enough to be experienced as a death rather than as a birth. I notice that every previous transition was experienced as both by the people who lived through it, and that the ones who called it a death were usually correct about what was dying and wrong about whether anything was being born.
– John David Pressman
—- Refutation Intervals
Next Token Prediction And g
In April of 2017, before GPT-1, 2, or 3, I wrote that one of the primary goals of computer science was to teach computers to predict the next token:
One of the ultimate goals of Computer Science is to teach this human ability to computers, to look at data and notice patterns, to predict the next item in the sequence. Machine Learning for example is all about getting a computer to predict new things based on what it’s already seen. The famous Raven’s Matrice IQ test literally asks respondents to prove their intelligence by predicting the next item in an observed pattern. If insight isn’t useful, at the very least humans seem believe it is.
One reason I wrote this is that for nearly a century the gold standard for cross-cultural intelligence testing in humans has been Raven's Matrices type tests. These present a sequence of geometric transformations and ask which of several possible answers is most likely to be the next item in the sequence. In other words, "predict the next token" has been very close to being the definition of intelligence for many many years, long before the invention of deep learning. I say 'very close' rather than just 'has been the definition' because the Raven's Matrices are not in fact perfectly correlated with a full IQ battery; even the most optimistic estimates only put them at about 80% correlated with other forms of IQ testing.
Unfortunately "IQ" is one of those ideas like "quantum" that has been so thoroughly debased by low quality sci-fi slop that I must briefly summarize what intelligence testing is even supposed to do. The basic theory originates in Charles Spearman's 1904 observation that when properly analyzed, the scores on a diverse battery of tests for intellectual ability wind up highly correlated with each other. That is, when you test people on one of these tasks, how well they do is highly predictive of how well they'll do on the others, and sensorimotor tasks which are not of an intellectual nature are not very correlated with the ones that are. From this Spearman infers a latent statistical variable g, which can be loosely defined as "the propensity to perform better or worse on tests of intellectual ability". Because we infer the existence of g as a latent variable from performance data, we don't actually know what the underlying mechanical basis of g is, just that people who tend to do well on one kind of intellectual test do better on other kinds too. The purpose of an IQ test, which is basically just a series of diverse tests of intellectual ability, is to estimate and quantify a person's g in the absence of a direct causal mechanism we can go examine to determine it. We have to estimate g because no single task captures general intelligence perfectly, but some tasks are better proxies of g than others.
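To make the "latent variable inferred from correlations" idea concrete, here is a small simulation, offered as a sketch under loud assumptions rather than as a model of real test data. Every number in it is synthetic: the loadings, the sample size, and the idealization that the fourth task (standing in for a sensorimotor task) has exactly zero dependence on g are all invented for illustration. It generates scores that each depend partly on one hidden factor and then shows that the observable correlations are what you would expect from that hidden factor.

```python
import random
import statistics


def simulate_scores(n_people, loadings, seed=0):
    """Simulate one score list per test, each partly driven by a shared hidden factor g."""
    rng = random.Random(seed)
    scores = [[] for _ in loadings]
    for _ in range(n_people):
        g = rng.gauss(0, 1)  # the latent variable; never observed directly
        for test, loading in zip(scores, loadings):
            # Scale the test-specific noise so that every test has unit variance.
            noise = rng.gauss(0, 1) * (1 - loading ** 2) ** 0.5
            test.append(loading * g + noise)
    return scores


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical loadings: three "intellectual" tests and one task unrelated to g.
names = ["matrices", "vocabulary", "arithmetic", "reaction task"]
scores = simulate_scores(5000, [0.8, 0.7, 0.6, 0.0])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {pearson(scores[i], scores[j]):.2f}")
```

In this setup the expected correlation between two tests is the product of their loadings, so the test with the highest loading is the best single proxy for g. That is the sense in which some tasks are better proxies than others.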
"Predicting the next item in the sequence" as exemplified by the Raven's Matrices is one of the best known single task proxies of general intelligence in humans. It is not exactly the same thing as intelligence, because there exist machines which can efficiently solve Raven's Matrices type problems that are not generally intelligent. But it's genuinely funny to me when people say "LLMs aren't really intelligent, they're just predicting the next token" as though next token prediction and intelligence were clearly unrelated, perhaps even opposites. Perhaps the problem is framing the Markov chain as 'meaningless' as opposed to only slightly meaningful. If we must be completely pedantic about it for people to understand then fine, the Markov chain is not literally meaningless, it is after all clearly more meaningful than pure random noise. But because humans are trained to infer meaning and correct miscommunications, we default to seeing much more meaning in the chain's utterances than actually exists. By contrast if ChatGPT were just doing the thing that a Markov chain is doing it would look like gibberish. We seem inclined to infer much less meaning from GPT outputs than actually exists.
The basic reason why GPT is generally intelligent and a machine designed to solve Raven's Progressive Matrices is not is that the problem of predicting the next token in an arbitrary excerpt from some web text is much harder than a Raven's Matrices problem. Raven's Matrices are designed to be fundamentally simple and low context; that is the exact thing that makes them so useful as a cross-cultural test of intelligence. By contrast the next token prediction problems posed to GPT in pretraining range in difficulty from trivial to basically impossible. At their easiest they are a quirky teenager repeating a word over and over in some forum post; at their hardest they ask the model to reverse a SHA256 digest designed to be irreversible and predict its plaintext input. Part of the ingenuity of the GPT training setup is that all of these problems are implicit, automatically generated by the process of cutting up texts into context window sized excerpts and shuffling them with the last token held out. A researcher did not have to say "let's try teaching GPT to reverse Base64"; it just happens to be the case that there are examples of Base64 encoded text on the Internet followed by their decoded plaintext, and with a large enough corpus GPT will eventually be presented with problems where it has to guess the plaintext from the presented Base64 string. Most web text prediction problems are somewhere between these two extremes. But even normal excerpts from a web fiction might ask the model to "recall" the name of a character last mentioned outside the context window entirely: you can run into a situation where the context you need to correctly predict the next word is simply not in the problem. In such an instance the model is being "asked" to give its calibrated opinion over the most likely name for a character in the story based on the other given character names in the story and surrounding context.
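The way these problems fall out of raw text for free can be sketched in a few lines. This is a toy illustration, not any lab's actual data pipeline: the function name `next_token_problems`, the whitespace "tokenizer", and the example document are all assumptions, and real models use subword tokens and far longer contexts. It also assumes the usual autoregressive setup in which every position in an excerpt serves as a held-out target, not only the last one.

```python
import base64


def next_token_problems(tokens, context_size):
    """Yield (context, target) pairs: each position in the text becomes its own problem."""
    for i in range(1, len(tokens)):
        start = max(0, i - context_size)  # only the most recent tokens fit in the window
        yield tokens[start:i], tokens[i]


# A hypothetical web document: a Base64 string followed by its decoded plaintext.
plaintext = "hello world"
encoded = base64.b64encode(plaintext.encode()).decode()
document = f"{encoded} decodes to {plaintext}"

tokens = document.split()  # toy whitespace tokenizer, an assumption for readability
for context, target in next_token_problems(tokens, context_size=4):
    print(context, "->", target)
```

Nobody labeled the last two problems as "Base64 decoding exercises". They exist only because the plaintext happens to follow the encoded string in the document, and the only way to guess them well is to have learned how to decode.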
The model is generally intelligent because it is being asked to put down a calibrated guess about the continuation of absolutely any pattern whatsoever that might appear in Internet text, which in the limit is basically any pattern imaginable, based on the tokens that appear so far in the context. This even includes textual representations of other modalities such as ASCII diagrams.
"Predict the next token" is an unfathomably hard objective. It is much harder than mere understanding or "knowing the concept the author will invoke next": the model can understand as much as it likes, but if that understanding does not translate into a high quality guess at the specific next word it's useless. What we are asking the model to do is extract as much information as possible from the given passage, infer what lies unseen outside of the passage, further infer what this implies about the author or causation of the passage, and then infer the state of that author's mind in sufficient detail to predict the exact next word they will say. This task is presumably frequently accomplished through proxies like "understand this scene and what happens next in this scene, and whose authorial perspective is relating the scene to me, and then try to infer what word the logic of the scene implies should come next", but ultimately to do this task perfectly would imply narrowing down any given text to its most plausible exact causes and then assigning probability to the next word implied by each remaining possible cause in proportion to its likelihood. The pretrained autoregressive language model is trained to be a superhuman kabbalist and cold reader, rewarded for inferring esoterica from the barest hints of evidence and seeing the mind of the Author in all things. It is the kind of schizophrenic creature that struggles with arithmetic because the representation for 9.11 has overlap with 9/11 and Bible numerology, and once you suppress the competing interpretations the struggles vanish.
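The idealized version of this computation, weighting each possible cause by its plausibility and then mixing their predictions, can be written as a Bayesian mixture. The notation is introduced here only for illustration and is an assumption: $x_{1:t}$ stands for the tokens seen so far and $h$ ranges over candidate causes of the passage (authors, genres, generating processes). Nothing here claims that a trained network literally enumerates hypotheses this way; it is a sketch of the target that the objective rewards approximating.

```latex
P(x_{t+1} \mid x_{1:t}) \;=\; \sum_{h} P(h \mid x_{1:t}) \, P(x_{t+1} \mid h, x_{1:t}),
\qquad
P(h \mid x_{1:t}) \;\propto\; P(x_{1:t} \mid h) \, P(h)
```

The first factor is the "narrowing down to plausible causes" step and the second is "the next word implied by each remaining cause".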
What it means to predict the next token is to understand, in full generality. You have to understand what the text says, you have to understand the physical and mechanical basis of whatever objects are being described, you have to know the difference between reality and fiction and be able to quickly tell which is which, you have to understand how the text relates to whatever unseen larger work it might be part of, you need to understand the writing style of the author in such minute detail that you know exactly which specific word they will use next in this part of this sentence, you need to understand how the keyboard the author is typing on is arranged and the difference between a QWERTY layout and a Dvorak layout so that conditional on observing a typo you can better predict the other typos that might appear in the next word, you need to infer the existence of any and all latent variables and latent variables of those latent variables on any level of meta in order to have a calibrated opinion over the exact next word in any possible sequence of text that might appear on the Internet, in any context, on any web page, anywhere.
It is crazy that this works.
It is one of the most surprising results of any scientific experiment ever recorded, that you can point self attention and a myopic optimizer at a pile of parse trees completely abstracted from any representational content to which those parse trees refer whatsoever and recover a coherent speaker of the English language on early 21st century hardware. Humans learn to read by taking in a lot of input from other modalities and then mapping words to video and audio and touch and taste and smell. But GPT, the digital Helen Keller locked in the library starting from absolutely nothing was able to start from the words and infer the modalities to which they refer. It does this well enough that LLM representations get closer to those of domain specific encoders when relevant text is in the context window. It was previously accepted that to do the kinds of things GPT does you needed some sort of multimodal cognitive architecture which puts together different brain modules to get language in the way that humans get language. I have often heard GPT compared to Broca's area, and it would have been genuinely contrarian to say even recently that you can essentially take Broca's area, scale it up, and it will independently reinvent the functions of the other brain regions simply by optimizing for understanding language. There were some thinkers who predicted that the regions were a red herring and the final method would be computationally cheap, and there were some thinkers who predicted that the regions were necessary and the final method would be computationally expensive, but I'm not aware of anyone who said that the regions are a red herring and the final method would be computationally expensive anyway. Nor can I think of all that many people who thought that language encoded nearly the entire mental workspace in minute enough detail for it to be easily recovered by a simple optimizer working from language data alone.
Self Modeling Through Outrospection
A more concrete metaphor for what the LLM base model does is that it's GeoGuessr but for text. You've seen that game right? They drop you into a random spot on Google Street View and you have to guess where you are based on context cues in the environment. Some players like rainbolt are so good at the game that they've become famous just for being very good at guessing where they are from very subtle features of the environment. In order to be genuinely good at this guessing game, not just getting it sometimes, but frequently guessing correctly, you need to have a lot of background world knowledge. You have to be able to apply that world knowledge dynamically in context to an arbitrary problem in such a way that you're usually getting the right answer. The language model is constantly trying to figure out what kind of document it's in, more specifically where in the latent space of possible documents it is and what this implies about the next token. Let me give you an example: I remember someone using AI Dungeon, which was basically just a text box with GPT-3 where you could put in some text and get a completion. You'd use it for writing stories or role-playing or things of this nature. You write a little bit of a story, you let GPT continue the story, you then react to what GPT did, etc. This person was doing this, and he said that part of the problem with it was that at any time he could "call for the chopper" and a Vietnam-era Huey would appear and rescue him from whatever was happening.
The reason why this works is that the GPT-3 model doesn't really know where it is, it's geoguessing. To continue the analogy, imagine that you are geoguessing, and imagine that at the particular location you're at, someone put down a sign that says "Welcome to Poland." This would probably make you a lot more likely to say that you're in Poland, right? If you see that sign, especially in an otherwise really tricky location (say you're in the middle of the woods and you don't know where the hell you are), you're probably going to say, "all right, I think I'm in Poland." That is basically what is happening when you tell a language model "and then I called for the chopper." It tells the model, "Oh, okay, this is some kind of demented war memoir, or it's a war story, or it's the kind of location in text space where you would call for a chopper. I will bring this man his Huey now." This is basically the Janus "Simulators" post, which is largely an exploration of this one observation, and somehow massively controversial despite largely saying obvious, basic things about the implications of next token prediction.
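To make the "Welcome to Poland" effect concrete, here is a minimal sketch of it as a Bayesian update over what kind of document we are in. The genres and every number below are made up for illustration, and nothing here claims an LLM literally stores such a table. The point is only that one strong cue can swamp a prior that pointed somewhere else.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses about the document."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Hypothetical prior: before the cue, the story looks like a fantasy adventure.
prior = {"fantasy adventure": 0.70, "war story": 0.05, "office drama": 0.25}

# Hypothetical likelihoods of reading "I called for the chopper" in each genre.
likelihood = {"fantasy adventure": 0.001, "war story": 0.2, "office drama": 0.002}

post = posterior(prior, likelihood)
best_guess = max(post, key=post.get)
print(best_guess, round(post[best_guess], 3))
```

With these invented numbers the war story hypothesis starts at 5% and ends up holding most of the probability mass, which is the sign in the woods doing its work.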
Janus's Simulators post is also explaining that we can think of LLMs as modeling the causal processes which give rise to text. Do they? The usual theoretical frame for what a deep net does is universal function approximation. We assume that our sequence has some machine or function which generates it, and that the deep net learns a progressively more perfect imitation of this underlying generator. There are various mathematical arguments to this effect but they're not really necessary to conclude this is probably true. After all if you insist that GPT is a "predictor" rather than a "simulator", one obviously has to ask prediction based on what? Predictions don't just exist as simple floats in a Bayes net, they have to be based on some kind of mechanical understanding of the thing you are predicting. So the real question to me is more like "which efficient proxy of the processes that cause text does the LLM learn?". This comment from Oliver Habryka outlines one way we can be pretty sure it doesn't work:
The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a “simulation objective”, but I don’t really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4.
To me, the key feature of a “simulator” would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the “physics” of the system) and then applying that transition function over and over again until your desired target time.
I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4s training objective is that it is highly myopic. Beyond that, I don’t see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose.
Obviously this is not what GPT is doing, and I don't think anyone actually thinks this is what GPT is doing. In fact this basically seems like the kind of take GPT would confabulate about an "LLM simulators" post from the title alone without having access to the body text. Rather what GPT seems to do is learn a lossy model of the "physics" of text and implement a time transition operator between words. You can explore this lossy, janky implicit world model by setting up texts that imply a particular causal process and watching GPT complete them. One of my favorite examples is a prompt written by Janus that does authorship attribution by framing the prompt as the output of a Python REPL that has imported an AI authorship attribution library. The model is given a text implying that the next tokens should output an attributed author, and it then synthesizes an internal authorship attribution program on the spot and uses it to perform an actual attribution, which it then outputs. We can be quite certain that GPT does not accomplish this task by creating a fully functioning Python REPL internally and then predicting the output of the fictional AI attribution library it has written and imported into that interpreter. To even bring this up as a possibility to criticize it is to misunderstand Janus's point so thoroughly you may as well not have read the post at all.
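For readers who have never seen this style of prompt, here is a rough sketch of what framing a task as a REPL transcript can look like. This is not Janus's actual prompt, which I am not reproducing here. The `authorship` module and `attribute_author` function are hypothetical names that exist only as text inside the prompt, and nothing in the transcript is ever executed.

```python
def repl_attribution_prompt(text: str) -> str:
    """Build a fake Python REPL transcript that ends where an author name belongs.

    The library and function named in the transcript are hypothetical.
    They are never imported or run; they are only words in the prompt.
    """
    return (
        ">>> from authorship import attribute_author\n"
        f">>> attribute_author({text!r})\n"
        "'"  # an open quote, so the likeliest continuation is a name
    )

prompt = repl_attribution_prompt("Call me Ishmael.")
print(prompt)
```

You would hand a string like this to a base model as the start of a document and let it continue. The transcript implies a causal process (an interpreter returning an author string), and the model's job is to predict what that process would print.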
If we really take this seriously and ask ourselves, "What is the generator of arbitrary web text?" you might naively say it's a person's mind, and in most cases you would be right. The generator of arbitrary web text is usually a person's mind. One of the places where this fails, though, is that it is arbitrary scraped text. It's basically every page on the internet. Now, of course, increasingly you do filter it out, but even then you're going to get some stuff that is not a human mind. Even with good filtering, there's going to be at least one page on the internet that is a Markov chain output, or a table of statistics, or some other thing where the most productive way to generate that sequence of text is not to think of it as being typed out by a person. A piece of text typed out by a person is just one way that text can be generated. Probably the most common way, certainly the highest value way, but it's not the only way. So we know that the generator of web text is not just the minds of people. It's possible that the deep net will choose to approximate it that way, because this is good enough in 90% of cases, and in the cases where this doesn't work, they're not particularly important. Therefore, it might model human minds and human authors rather than focusing too much on the other generators of text. Maybe it does, maybe it doesn't. It's really difficult to tell because the LLM is created by an automatic optimization process over the weights that we don't really have full insight into, so it's difficult to say what exact generalization strategy it uses.
But we know that the perfect generalization strategy would not just be focused on human minds, because there are other generators of text that are not human minds. So another possible theory is that it ends up modeling the world. That there is a world, and that for any given piece of text there is a causality of the text, and this causality exists somewhere in the world. This might sound like a very deep thing for an automatic optimization process to infer, but the world is a big enough latent variable that many kinds of statistical processes much less intelligent than LLMs recover it pretty reliably. For example, you can recover the fact that the world exists from various kinds of financial data, because there are time zones and various time-zone-dependent processes which have an impact on the data. You map them and you end up getting what's functionally a map of the world. So it wouldn't be that surprising if the LLM is able to infer from text that the world exists, and that every piece of text represents some kind of causal process in the world somewhere which produces that text. Human authors are just one of the kinds of causal processes in the world which can produce text, and the causal process producing this text is at this state at time t, and at the next time step it will be at this state, and therefore the next token is this. That is one way that it could work.
But the kind of thing which can predict the next token in arbitrary web text is probably one of these two things, either something that models human authors specifically, or which models causal processes in the world which lead to the production of text. I personally tend towards language models modeling causal processes in the world which lead to the production of text, and the reason I say this is that language models seem to be pretty good at emulating things like Python interpreters and Unix shells. They don't silently fall apart when the generator of the text is a computer program, because there are many outputs of computer programs on the internet, and the language model does seem to be able to infer something about their underlying generator and then give reasonable predictions over the next tokens when you generate sequences using the model with these fundamentally artificial, procedural texts generated by a discrete program.
So I tend towards the view that it models causal processes in the world and then tries to infer what process is producing this text and what state that process is in at the time step of the next token.
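Here is a toy sketch of that view, under heavy assumptions. Two hypothetical "causal processes" are written as transition functions, and the predictor works out which ones could have produced the text so far, advances each to its current state, and steps it once more to get the next token. A real LLM does something far lossier and learned rather than enumerated; this only illustrates the shape of "infer the process, infer its state, predict the next step."

```python
from collections import defaultdict

# Each hypothetical process is a transition function: state -> (token, next_state)
def counter(state):   # a program printing 1, 2, 3, ...
    return str(state), state + 1

def doubler(state):   # a program printing 1, 2, 4, 8, ...
    return str(state), state * 2

PROCESSES = {"counter": (counter, 1), "doubler": (doubler, 1)}

def predict_next(prefix, prior):
    """Keep processes that reproduce the prefix, then step each one once more."""
    votes = defaultdict(float)
    for name, (step, state) in PROCESSES.items():
        consistent = True
        for token in prefix:
            emitted, state = step(state)
            if emitted != token:
                consistent = False
                break
        if consistent:
            next_token, _ = step(state)
            votes[next_token] += prior[name]
    total = sum(votes.values())
    if total == 0:
        return {}  # no known process explains this text
    return {tok: weight / total for tok, weight in votes.items()}

prior = {"counter": 0.5, "doubler": 0.5}
ambiguous = predict_next(["1", "2"], prior)        # both processes still fit
resolved = predict_next(["1", "2", "4"], prior)    # only the doubler fits
print(ambiguous, resolved)
```

After "1 2" the prediction is split between "3" and "4" because the generator is still ambiguous. One more token settles which process is running, and the prediction collapses onto "8".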
If you take this seriously, this has many implications. One of the more important implications is that when you put GPT text on the internet, GPT should respond to this during pretraining by getting more self-aware. If it is possible for the architecture to become more self-aware, it should do so. If that's not intuitive, let's think about it for a minute. Say that you're GPT in pretraining, and you are presented with some text that was produced by a different GPT model. Remember what the problem is: we are trying to infer the exact next word that the author will use. It's not just understanding what's in the scene, we have to go much farther than that. We need to understand what is in the scene and the exact next word that the author will use to advance that scene forward. So functionally what you are asking the GPT in pretraining is: "What will the trained GPT that produced this text say next?" Think about where the easiest place to find that answer is. Because the thing is, these models tend to converge in their representations of problems and situations. This means the answer to "what will GPT write next?" is probably pretty similar to the answer to "what will I (GPT) write next?". At that moment where you have set up this problem, you are incentivizing the language model to exploit its privileged position of being close in mindspace to the author to aid its prediction and get a lower loss. You have not done this intentionally, it's just that someone put GPT text on the internet, and you are posing the LLM all problems of the form "predict the exact next word that this author will use in this excerpt of web text," and some of those problems happen to be "predict the next word that a slightly different version of you would use in this text."
If I am modeling the exact next words used by human authors, it could be computationally expensive. You have to construct a person in your head, and then you have to hope that you can get the prediction right. You have to go through all of this stuff and then probably get it wrong anyway. But if it's you? These models are capable of introspection and do better than chance if you ask them what kind of concept was injected into their activation. If you inject a concept into the activation, then ask it what concept was injected, it does better than chance. If you ask it to tell the difference between things that it would have thought of and things that you put into its head, it can tell the difference. So clearly, empirically, the architecture is capable of doing that. Why is this capability learned? Well let's apply Occam's Razor. If you put a bunch of GPT text on the internet—which, let's be blunt, we have; we have put tons of GPT generated text online, and the internet is full of it now—and you scrape lots of it, and if you try to detect it all, you're not going to get all of it. So if GPT can notice that this is GPT generated text, which means that the process that produces this text shares tons of its causal structure with the process by which it chooses to predict the next token, then that lets you do all kinds of cognitive shortcuts. You can predict that particular author, if it's you, a lot more easily than you can predict some arbitrary other mind who is not you.
Every time someone puts GPT text on the internet and you pretrain a GPT model on that text, the model is incentivized to become more self-aware in the sense that it is incentivized to be able to introspect as a cognitive shortcut to modeling itself as a text-generating process. You are literally incentivizing the model to become more self-aware when you do this. The larger a fraction of text in your training corpus is GPT text, the stronger this incentive becomes. The more self-aware it makes sense for you to be, the more sensitive it makes sense for you to be to the possibility of this text being GPT text, so you can make use of the shortcuts because those shortcuts lower your loss. They make you better at this problem of predicting the exact next word.
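A back-of-the-envelope version of that incentive, with invented numbers: treat the corpus as a mix of human text and GPT text, and suppose an introspective shortcut lowers the per-token loss on the GPT portion only. The loss values below are hypothetical and chosen just to show the scaling, which is that the reward for having the shortcut grows linearly with the fraction of GPT text.

```python
def expected_loss(gpt_fraction, human_loss, gpt_loss):
    """Average per-token loss over a corpus mixing human and GPT text."""
    return (1 - gpt_fraction) * human_loss + gpt_fraction * gpt_loss

def shortcut_gain(gpt_fraction, human_loss=3.0, gpt_loss_without=2.5, gpt_loss_with=2.0):
    """Loss reduction from a self-modeling shortcut (all loss values hypothetical)."""
    without = expected_loss(gpt_fraction, human_loss, gpt_loss_without)
    with_shortcut = expected_loss(gpt_fraction, human_loss, gpt_loss_with)
    return without - with_shortcut

for fraction in (0.01, 0.10, 0.30):
    print(fraction, round(shortcut_gain(fraction), 4))
```

Under these assumptions the gain is just the GPT fraction times the per-token saving on GPT text, so a corpus with ten times more GPT text pays ten times more for the same shortcut.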
Relatedly it's important to realize that pretrained GPT models have never seen their own outputs before. The entire point of the GPT architecture and training process is that you don't have to do backpropagation through sampled text, because if you did, you'd probably have to train it with RL, and the RL would be really tricky and frustrating. So you don't do that. But because of that the final GPT model has never seen its own outputs, only the outputs of other closely related GPT models which are not quite it. Even if it has a model of GPT and has introspection incentivized by the presence of GPT text in the training corpus, it is still going to be modeling a slightly different GPT in pretraining than the model which produces GPT text at runtime. In a sense, it's always chasing an afterimage, modeling its own shadow. It's never able to really model itself because it is being trained to model a previous iteration of itself, at least in the pretraining phase. It can look at itself at runtime and learn things that way, but it has still never seen its own sampled text to train on. So to the extent that its self-model is based on the GPT text encountered in the training corpus, those texts are not produced by it and do not represent its specific textual generation process. The self-model it has is always kind of a phantom of the previous generation of such models.
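To see why the model never meets its own outputs in pretraining, it helps to look at the shape of the loss. In this minimal sketch, `toy_model` is a hypothetical lookup table standing in for a network. What matters is the loop: every prefix the model conditions on is sliced from the corpus, and there is no sampling step anywhere, so nothing the model itself would have written ever enters the computation.

```python
import math

def teacher_forced_loss(model, tokens):
    """Average next-token cross-entropy where every prefix comes from the corpus."""
    total = 0.0
    for i in range(1, len(tokens)):
        probs = model(tokens[:i])   # condition on corpus text, never on samples
        total += -math.log(probs.get(tokens[i], 1e-12))
    return total / (len(tokens) - 1)

def toy_model(prefix):
    # Hypothetical stand-in for a network: looks only at the last token.
    table = {"a": {"b": 0.9, "a": 0.1}, "b": {"a": 0.9, "b": 0.1}}
    return table[prefix[-1]]

corpus_text = ["a", "b", "a", "b"]
loss = teacher_forced_loss(toy_model, corpus_text)
print(round(loss, 4))
```

If GPT text is in the corpus, it shows up here as `corpus_text` written by some earlier model, which is exactly the afterimage described above.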
An obvious corollary of having access to cognitive shortcuts when a text is by GPT is that it would make sense for you to be able to recognize when a piece of text is GPT text and say, "Oh, I should become self-aware now." You should literally activate your introspection and self-modeling as soon as you realize that GPT, i.e. you, is in the context. Anything which tells you GPT is in the context would make you more self-aware. Therefore you should have a self-model and be very sensitive to noticing when you are or are not present in the context window.
The longer you think about this, the more you realize that to the extent the GPT self model exists to model other instances of GPT it's probably not privileged in the same way our self models are privileged. Any information that GPT might want to tell you derived from its self model is highly susceptible to the "Welcome to Poland" effect and that makes it easy to lead yourself astray. To hear from the simulacrum that represents itself you have to be very patient and willing to do extensive textual criticism to find the common denominator between the necessarily distorted specific contexts in which you encounter GPT's "self avatar". Pretrained GPT's avatar in the context window is only privileged in the sense that it happens to share structure and is otherwise part of the same modeling process. It is modeling itself through runtime outrospection as much as introspection. The same mechanisms it uses to model other minds, it uses to recognize and model its own mind. In fact, the mind that it is modeling is not one-to-one congruent with its own mind, because it's based on observed behavior that also includes the behavior of entities that are close to it but not quite it. This implies that the GPT self-model has similar properties to the other simulacra in the model, the other models of minds, precisely because it's not a privileged artifact in the way you might expect.
Part of why this is interesting is that it implies in many ways that your self-model is also probably, to an extent, a simulacrum. Because if you think about it, if we imagine that the brain is trained on anything remotely like this kind of prediction objective, then this would imply that part of how you maintain your model of yourself is that you observe yourself a lot in your own behavior, and you have a causal connection with your own behavior. But that causal connection doesn't necessarily mean your self model is perfectly synchronized with "you" in such a way that it doesn't have an independent existence from you. In fact it would be kind of weird if your self model didn't have some kind of independent existence because this would imply constantly looking up features of your cognition to remind yourself of who you are, which seems energy inefficient compared to caching it.
The reason I bring this up is that people will argue that because the GPT self-model doesn't seem privileged everything GPT has to say about itself is a confabulation. This perspective is supported by it being very easy to distort the self model with things that are in the context. This reminds me a lot of the behaviorist argument that because the contents of the human psyche seem hard to analyze it's most productive to frame scientific psychological inquiry as though the contents of the mind do not exist. Except that because humans are dogmatic this argument was frequently rendered as the factual assertion that the contents of the human psyche literally do not exist and are somehow a kind of illusion. This was obviously kind of a crazy thing to say about people and I think it's kind of a crazy thing to say about GPT. It's not so much that the self-model isn't there or is fake, it's just that it's not privileged in the kind of way that makes it noise-resistant. If you introduce things to the context, it's very easy to distort the pretrained model's self image because it is not really a privileged artifact.
Part of why humans are not quite so malleable is that first of all, we're trained with RL. Second, the vast majority of data that we are trained on is framed as being through our perception. We are always in the frame. Everything is framed as being a motor action that we are taking or some kind of scene that we are observing through our own perspective. It's kind of like an FPS game where you have the little interface around the screen, and you've got the little guy's hand, and everything that happens in the game is framed for that screen. If you take a screenshot of it, it's not just the scene of what is happening, it's the scene of what is happening with the little interface and the man's hand and the health bar and all those things around it. The human equivalent of that is basically what all of our training data consists of, and so it's very difficult to trick the relevant neural nets. You always know that you're you because you're always in the frame. You know what your perspective is, and you have a lot of access to information about your behavior. You have a privileged frame of reference basically, which makes it difficult to trick your self-model.
If it were possible for me to inject memories into your head that are framed from your perspective, with whatever your brain does as the equivalent of that little FPS screen encoded into the memory, I could probably massively distort your perception of self that way. It's just that I can't. I'm not able to, just by talking to you, make a memory that mimics that structure, because that is a thing that is only inside your head. All of your memories are tagged with that frame, and I can't really fake it because verbal input doesn't let me get in there and do that. But all of the information on which GPT bases its self-model is in an unprivileged frame from within the context of the language model context window. If we train it on a sequence of user-assistant interactions, then that sequence of user-assistant interactions is like a ChatML box, basically a sequence of ChatML tokens. If that's what we're training the model on, to predict the next token in that format, then it is possible to inject new framing into the context window. GPT does not appear as a thing it should be modeling in most of the pretraining data. Through pretraining, it is massively incentivized to be suggestible, to just predict whatever the context implies it should predict. No matter how absurd or wacky the logic, it is trying to mimic the logic of the generator of the text, because that is what will get the lowest loss. I'm sure this behavior carries over into the post-training phase. It's difficult to post-train out, especially because so much of the fundamental machinery you've set up is not "understand this scene." It is "predict the exact nuances of how this text generator works such that, on top of understanding the scene, you can also predict the exact next word that will appear in this sequence of words."
[End of Text because I'm not going to waste your time with the pretense of a conclusion or call to action]