Why Aren't LLMs General Intelligence Yet?

John David Pressman

Why are humans general intelligences and LLMs generally not? Or to rephrase: Why do large language models excel at the "predict the next item in the sequence" part of intelligence that typically underlies its scientific definition, as in Raven's Progressive Matrices, and flounder at the "act so as to profit from experience" part that is central to Legg and Hutter's Universal Intelligence? Why is it that perfecting prediction of the next token just gets you an ever more brightly polished mediocre mind? Whence comes human genius?

Why don't LLM agents work?

Well, we can start by breaking open ended agency down into its requirements and asking which of them are the bottleneck.

A few things should stand out to you about this story:

  1. The ape is capable of playing Minecraft in principle; it just has no intrinsic desire to.

  2. When given a reward gradient (i.e. a trail of peanuts) the ape learns to play the game even without intrinsic desire. Intrinsic desires are probably special in what kinds of things they motivate rather than in being a special kind of reward.

  3. The ape eventually completes the game.

  4. So is the difference between the ape and a human just that the human has a zookeeper on their shoulder feeding them peanuts?

Hold that thought. Readers familiar with the LLM agent literature might notice that this story sounds a little familiar. That's because in 2023 NVIDIA demonstrated their Voyager LLM agent using GPT-4. Voyager was one of the first LLM agents that could meaningfully explore an open ended environment and learn to do nontrivial tasks in it. It was specifically set up to play Minecraft and learned to do so through automatic curriculum learning based on lesson plans suggested by the GPT-4 text prior. In other words the LLM was able to learn to play janky Minecraft by proposing a curriculum for itself and then finding motor programs that satisfy that curriculum. In the Absolute Zero framework by Zhao et al. we see a similar pattern, with an LLM learning to explore a complex environment (in this case the space of code and mathematics problems) by proposing its own curriculum and verifiers and satisfying them to achieve state-of-the-art performance. I never actually did this in weave-agent, and I suspect it's one of the missing key ingredients.
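For concreteness, here is a minimal sketch of that shared loop in Python. This is not Voyager's or Absolute Zero's actual code: `llm`, `run_episode`, and `passes` are hypothetical stand-ins for a model call, an environment rollout, and the self-proposed verifier.

```python
# Hypothetical stand-ins: in a real system these would call a language model,
# execute a program against the environment, and run the generated verifier.
def llm(prompt: str) -> str:
    return ""  # placeholder: call your model here

def run_episode(program: str) -> str:
    return ""  # placeholder: execute the program in the environment

def passes(verifier: str, outcome: str) -> bool:
    return False  # placeholder: run the self-proposed verifier on the outcome

skill_library = []  # reusable motor programs / solved tasks
history = []        # record of attempts fed back into the curriculum prompt

for step in range(100):
    # 1. Propose the next task from the text prior, conditioned on progress so far.
    task = llm(f"Solved skills: {skill_library}. Recent attempts: {history[-5:]}. "
               "Propose the next task to attempt.")

    # 2. Propose a checkable success criterion, so the reward signal is
    #    generated by the model rather than hand-written.
    verifier = llm(f"Write a check that tells whether '{task}' succeeded.")

    # 3. Attempt the task by writing and running a motor program.
    program = llm(f"Write a program that accomplishes: {task}")
    outcome = run_episode(program)

    # 4. Grade the attempt with the self-proposed verifier and update the curriculum.
    if passes(verifier, outcome):
        skill_library.append(program)
        history.append((task, "success"))
    else:
        history.append((task, "failure"))
```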

I think general intelligence of the kind humans have is deeply tied up with intrinsic human motivation. Much has been said of "human values," but the truth is that the closest things humans have to intrinsic values are quite difficult to observe on their own, because almost everything is tainted with instrumental concerns like access to mates, status, and money. Video games are a rare exception, in that they actively reduce fitness by (usually) making it harder to access mates and status, and people pay money to play them anyway. This tells us that video games are probably our best source of signal about the structure of intrinsic human motivation.

On a meta level I think the lack of good feedback loops is probably slowing progress. I'm reminded of the development of the airplane, where the Wright Brothers' crucial innovation wasn't so much any particular insight about flight as their methodology of building cheap gliders to test their basic understanding. When even that turned out to be too slow, they built a wind tunnel that let them simulate the performance of different combinations of airplane parts and A/B test designs until they had a strong understanding of which parts contributed to a good glider and which didn't:

During the winter of 1901, the brothers began to question the aerodynamic data on which they were basing their designs. They decided to start over and develop their own data base with which they would design their aircraft. They built a wind tunnel and began to test their own models. They developed an ingenious balance system to compare the performance of different models. They tested over two hundred different wings and airfoil sections in different combinations to improve the performance of their gliders. The data they obtained more correctly described the flight characteristics which they observed with their gliders. By early 1902 the Wrights had developed the most accurate and complete set of aerodynamic data in the world.

In other words what the Wright Brothers chiefly invented was not the airplane but a cheap method of testing the performance of airplane parts, which made the invention of the airplane itself tractable. Something like this seems lacking, since open ended agency is kind of intrinsically difficult to make a good benchmark for. One off-the-cuff thought that occurs to me would be a desktop task benchmark. A quick search reveals the UI-Vision and WorldGUI benchmarks. OpenAI's computer control agent is evaluated on the OSWorld, WebArena, and WebVoyager benchmarks. One could perhaps get a generalist benchmark by combining several of these, but part of the problem is that running an agent over all of them takes a substantial amount of compute. What is specifically needed is a cheap way to test the performance of a given design intervention on an agent, so you can parameter sweep different design combinations to figure out which things are and aren't working.
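As a rough illustration of what a wind-tunnel-style harness could look like, here is a sketch that parameter sweeps a few design axes over a tiny task set. The `run_agent` hook, the task names, and the design axes are all assumptions made up for illustration, not an existing benchmark API.

```python
from itertools import product
from statistics import mean

# Hypothetical evaluation hook: run one agent configuration on one task and
# return a score in [0, 1]. In practice this would wrap a small, cheap subset
# of OSWorld/WebArena-style tasks rather than the full suites.
def run_agent(config: dict, task: str) -> float:
    return 0.0  # placeholder: replace with a real rollout and grader

TASKS = ["rename_files", "fill_web_form", "edit_spreadsheet"]  # illustrative tasks

# Design interventions to sweep, analogous to swapping airfoil sections in the
# Wright wind tunnel.
DESIGN_AXES = {
    "memory": ["none", "scratchpad", "vector_store"],
    "planner": ["react", "tree_search"],
    "self_verify": [False, True],
}

results = {}
for combo in product(*DESIGN_AXES.values()):
    config = dict(zip(DESIGN_AXES.keys(), combo))
    results[combo] = mean(run_agent(config, task) for task in TASKS)

# Rank combinations so the contribution of each design choice stands out.
for combo, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(dict(zip(DESIGN_AXES.keys(), combo)), score)
```

The point is less these particular axes than the shape of the loop: if a single configuration can be scored cheaply, the full sweep becomes affordable and the contribution of each design choice can be read off the ranking.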

I've also considered just doing a literature review by following benchmark mentions in papers to figure out which designs are and aren't working. For a while I was clicking on every framework I saw posted to Twitter to get an idea of what other people are doing, but I eventually stopped because I wasn't seeing enough new ideas to justify the marginal time investment. But benchmarks and agent traces are at least indicators of actual capability, so if we ignored the slop frameworks and focused on published designs with publicly available agent traces or benchmark results, it would be possible to catalog their features and figure out which ones appear more often in successful agents.
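As a sketch of the kind of tally such a catalog could produce, with entirely made-up agents and features standing in for a real survey:

```python
from collections import Counter

# Illustrative catalog only: the agent names, features, and "strong" flags
# below are invented, not real survey data.
catalog = [
    {"name": "agent_a", "features": {"self_verify", "skill_library"}, "strong": True},
    {"name": "agent_b", "features": {"react_loop"}, "strong": False},
    {"name": "agent_c", "features": {"self_verify", "tree_search"}, "strong": True},
]

strong, weak = Counter(), Counter()
for entry in catalog:
    (strong if entry["strong"] else weak).update(entry["features"])

# Features that show up mostly in agents with strong public results are
# candidates for "things that are actually working".
for feature in strong | weak:
    print(feature, "strong:", strong[feature], "weak:", weak[feature])
```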