Why Aren't LLMs General Intelligence Yet?

John David Pressman

Why are humans general intelligences and LLMs generally not? Or to rephrase: Why do large language models excel at the "predict the next item in the sequence" part of intelligence that typically underlies its scientific definition, as in Raven's Progressive Matrices, and flounder at the "act so as to profit from experience" part that is central to Legg and Hutter's Universal Intelligence? Why is it that perfecting prediction of the next token just gets you an ever more brightly polished mediocre mind? Whence comes human genius?

Why don't LLM agents work?

Well, we can start by breaking open ended agency down into its requirements and asking which of them are the bottleneck.

A few things should stand out to you about this story:

  1. The ape is capable of playing Minecraft in principle; it just has no intrinsic desire to.

  2. When given a reward gradient (i.e. a trail of peanuts) the ape learns to play the game even without intrinsic desire. Intrinsic desires are probably special in what kinds of things they motivate rather than in being a special kind of reward.

  3. The ape eventually completes the game.

  4. So is the difference between the ape and a human just that the human has a zookeeper on their shoulder feeding them peanuts?

Hold that thought. Readers familiar with the LLM agent literature might notice that this story sounds a little familiar. That's because in 2023 NVIDIA demonstrated their Voyager LLM agent using GPT-4. Voyager was one of the first LLM agents that could meaningfully explore an open ended environment and learn to do nontrivial tasks in it. It was specifically set up to play Minecraft and learned to do so through automatic curriculum learning based on lesson plans suggested by the GPT-4 text prior. In other words the LLM was able to learn to play janky Minecraft by proposing a curriculum for itself and then finding motor programs that satisfy that curriculum. In the Absolute Zero framework by Zhao et al. we see a similar pattern, with an LLM learning to explore a complex environment (in this case the space of code and mathematics problems) by proposing its own curriculum and verifiers and satisfying them to achieve state-of-the-art performance. I never actually did this in weave-agent, and I suspect it's one of the missing key ingredients.
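For concreteness, here is a minimal sketch of that shared loop in Python. This is not Voyager's or Absolute Zero's actual code: `llm`, `run_episode`, and `passes` are hypothetical stand-ins for a model call, an environment rollout, and the self-proposed verifier.

```python
# Hypothetical stand-ins: in a real system these would call a language model,
# execute a program against the environment, and run the generated verifier.
def llm(prompt: str) -> str:
    return ""  # placeholder: call your model here

def run_episode(program: str) -> str:
    return ""  # placeholder: execute the program in the environment

def passes(verifier: str, outcome: str) -> bool:
    return False  # placeholder: run the self-proposed verifier on the outcome

skill_library = []  # reusable motor programs / solved tasks
history = []        # record of attempts fed back into the curriculum prompt

for step in range(100):
    # 1. Propose the next task from the text prior, conditioned on progress so far.
    task = llm(f"Solved skills: {skill_library}. Recent attempts: {history[-5:]}. "
               "Propose the next task to attempt.")

    # 2. Propose a checkable success criterion, so the reward signal is
    #    generated by the model rather than hand-written.
    verifier = llm(f"Write a check that tells whether '{task}' succeeded.")

    # 3. Attempt the task by writing and running a motor program.
    program = llm(f"Write a program that accomplishes: {task}")
    outcome = run_episode(program)

    # 4. Grade the attempt with the self-proposed verifier and update the curriculum.
    if passes(verifier, outcome):
        skill_library.append(program)
        history.append((task, "success"))
    else:
        history.append((task, "failure"))
```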

I think general intelligence of the kind humans have is deeply tied up with intrinsic human motivation. Much has been said of "human values," but the truth is that the closest things humans have to intrinsic values are quite difficult to observe on their own, because almost everything is tainted with instrumental concerns like access to mates, status, and money. Video games are a rare exception, in that they actively reduce fitness by (usually) making it harder to access mates and status, and people pay money to play them anyway. This tells us that video games are probably our best source of signal about the structure of intrinsic human motivation.

On a meta level I think the lack of good feedback loops is probably slowing progress. I'm reminded of the development of the airplane, where the Wright Brothers' crucial innovation wasn't so much any particular insight about flight as their methodology of building cheap gliders to test their basic understanding. When even that turned out to be too slow, they built a wind tunnel that let them simulate the performance of different combinations of airplane parts and A/B test designs until they had a strong understanding of which parts contributed to a good glider and which didn't:

During the winter of 1901, the brothers began to question the aerodynamic data on which they were basing their designs. They decided to start over and develop their own data base with which they would design their aircraft. They built a wind tunnel and began to test their own models. They developed an ingenious balance system to compare the performance of different models. They tested over two hundred different wings and airfoil sections in different combinations to improve the performance of their gliders. The data they obtained more correctly described the flight characteristics which they observed with their gliders. By early 1902 the Wrights had developed the most accurate and complete set of aerodynamic data in the world.

In other words what the Wright Brothers chiefly invented was not the airplane but a cheap method of testing the performance of airplane parts, which made the invention of the airplane itself tractable. Something like this seems lacking, since open ended agency is kind of intrinsically difficult to make a good benchmark for. One off-the-cuff thought that occurs to me would be a desktop task benchmark. A quick search reveals the UI-Vision and WorldGUI benchmarks. OpenAI's computer control agent is evaluated on the OSWorld, WebArena, and WebVoyager benchmarks. One could perhaps get a generalist benchmark by combining several of these, but part of the problem is that running an agent over all of them takes a substantial amount of compute. What is specifically needed is a cheap way to test the performance of a given design intervention on an agent, so you can parameter sweep different design combinations to figure out which things are and aren't working.
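As a rough illustration of what a wind-tunnel-style harness could look like, here is a sketch that parameter sweeps a few design axes over a tiny task set. The `run_agent` hook, the task names, and the design axes are all assumptions made up for illustration, not an existing benchmark API.

```python
from itertools import product
from statistics import mean

# Hypothetical evaluation hook: run one agent configuration on one task and
# return a score in [0, 1]. In practice this would wrap a small, cheap subset
# of OSWorld/WebArena-style tasks rather than the full suites.
def run_agent(config: dict, task: str) -> float:
    return 0.0  # placeholder: replace with a real rollout and grader

TASKS = ["rename_files", "fill_web_form", "edit_spreadsheet"]  # illustrative tasks

# Design interventions to sweep, analogous to swapping airfoil sections in the
# Wright wind tunnel.
DESIGN_AXES = {
    "memory": ["none", "scratchpad", "vector_store"],
    "planner": ["react", "tree_search"],
    "self_verify": [False, True],
}

results = {}
for combo in product(*DESIGN_AXES.values()):
    config = dict(zip(DESIGN_AXES.keys(), combo))
    results[combo] = mean(run_agent(config, task) for task in TASKS)

# Rank combinations so the contribution of each design choice stands out.
for combo, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(dict(zip(DESIGN_AXES.keys(), combo)), score)
```

The point is less these particular axes than the shape of the loop: if a single configuration can be scored cheaply, the full sweep becomes affordable and the contribution of each design choice can be read off the ranking.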

I've also considered just doing a literature review by following benchmark mentions in papers to figure out which designs are and aren't working. For a while I was clicking on every framework I saw posted to Twitter to get an idea of what other people are doing, but I eventually stopped because I wasn't seeing enough new ideas to justify the marginal time investment. But benchmarks and agent traces are at least indicators of actual capability, so if we ignored the slop frameworks and focused on published designs with publicly available agent traces or benchmark results, it would be possible to catalog their features and figure out which ones appear more often in successful agents.
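As a sketch of the kind of tally such a catalog could produce, with entirely made-up agents and features standing in for a real survey:

```python
from collections import Counter

# Illustrative catalog only: the agent names, features, and "strong" flags
# below are invented, not real survey data.
catalog = [
    {"name": "agent_a", "features": {"self_verify", "skill_library"}, "strong": True},
    {"name": "agent_b", "features": {"react_loop"}, "strong": False},
    {"name": "agent_c", "features": {"self_verify", "tree_search"}, "strong": True},
]

strong, weak = Counter(), Counter()
for entry in catalog:
    (strong if entry["strong"] else weak).update(entry["features"])

# Features that show up mostly in agents with strong public results are
# candidates for "things that are actually working".
for feature in strong | weak:
    print(feature, "strong:", strong[feature], "weak:", weak[feature])
```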