Weave Agent DevLog #0 - The Core Problems

John David Pressman

I first discussed my plans for a long form writing agent in December of last year. There the idea was to enforce consistency and structure over a long text using a hierarchy of summaries, with each individual node in the tree fitting into the model's context window. The local details in the story would be written using a Monte-Carlo Tree Search, and some form of discrete optimizer would adjudicate between the observed samples from the Monte Carlo generating process and the expected summaries contained in the summary tree. In a later post I set aside the focus on writing and discussed in-context evaluators and forming expectations for actions before taking them. I've now written a first draft of my agent and want to document the process as I go, because the last time I looked up why agents aren't working yet I didn't find very good answers.

Background: Why Write Another AI Agent Scaffolding?

Obviously there's quite a few "LLM AI agent" frameworks out there. One might reasonably ask why I'm writing another one instead of contributing to an existing one. Part of the answer is that when I look at the source code for projects like AutoGPT I find them to have a lot of lines of code that do not seem to be solving the core problems for LLM agents. Part of the answer is that when I look at a project like swe-agent it seems subtly specialized in things I'm not sure I actually want my agent to be specialized for. But ultimately the answer is that I expect a working agent scaffold to be a fairly short program that solves a couple of fundamental issues rather than a sprawling graph of subsystems. I'd prefer to just write that relatively short program myself with my strategy in mind rather than try to shoehorn my ideas into someone else's thing.

What are these core problems I think an agent framework should be solving? Great question, they are:

Highly Confident Evaluation Of Intermediate Task Completions

Besides providing a ReAct pattern I think the first basic problem a scaffold needs to solve is determining when a subtask has been completed or not. This might sound like an odd, even trivial thing to name as a core problem but here's the deal: If I have a 40 step problem where I'm 90% certain I've successfully completed each step and every step needs to be done right my overall completion rate is 1.48%. If I'm 99% confident in each step my odds become 66.9%, which still isn't reliable. The compound probability of failing a step without noticing basically kills your runs by default unless you can become very confident that you've done each step correctly.
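To make the compounding concrete, here is the arithmetic as a short Python snippet (the step counts and per-step confidences are just the illustrative numbers from above):

    # Probability that every step of a multi-step task succeeds, assuming
    # independent per-step success probabilities.
    def run_success_rate(per_step_confidence: float, steps: int) -> float:
        return per_step_confidence ** steps

    print(f"{run_success_rate(0.90, 40):.4f}")   # ~0.0148, about 1.48%
    print(f"{run_success_rate(0.99, 40):.4f}")   # ~0.6690, about 66.9%
    print(f"{run_success_rate(0.999, 40):.4f}")  # ~0.9608, roughly the confidence you actually need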

LLM Tool Ergonomics

Another core problem that a scaffold needs to solve is providing an action and tool space that is ergonomic for LLMs. As is noted in the swe-agent paper, large language models have their own ergonomic and industrial design requirements that are different from the needs of human beings. In order for LLMs to fluently use tools they need tools that have either been designed with them in mind or that they have a lot of experience using already. Pretty much every LLM agent paper worth reading has some take on this, including the aforementioned swe-agent, Voyager, and Cradle.

Corrigibility and Alignment

Yes yes I know, I know, hear me out. Concerns about losing control of AI systems are not new and designers of these systems should be able to respond to things like the agent foundations corpus even if they think Eliezer Yudkowsky is culty because most of that corpus is in fact stuff many different thinkers would converge to as concerns and not just weird crankery. If you're building an AI agent you should have a strong sense of what alignment and corrigibility properties your design has and where it gets them from. The two problems you're expected to solve are how to get an agent to halt when it's in error and how to ensure the agent prioritizes what the operators expect it to prioritize.

Context Management, Retrieval, and Memory

Another thing that an agent scaffolding needs to do is provide an external memory that helps the model recall previously encountered or noted information at the right time. Luckily we have fairly robust mechanisms for doing this already including vector search, traditional full text search, keyword search indexes like BM25, etc. Retrieval augmented generation is a well worn topic at this point with a ton of implementations and papers so I won't dwell on it. The framework also has to handle synthesizing and updating the context views that the model is presented with to make predictions.

Planning

If your agent is going to be effective it needs to be able to come up with a reasonable series of actions it can take to achieve its goals. It also needs to be able to account for setbacks, possible failure modes, mistakes, poor luck, etc. and incorporate that into the planning ahead of time so it's not caught entirely off guard if something goes wrong. There are several ways to handle this, but not handling it leaves you with an agent that exhibits a strong case of the zoomies, doing a haphazard job of intermediate tasks and usually failing to output a fully coherent action sequence. Ask me how I know.

My Design Vision

Here is a rough sketch of how I plan to solve each of these problems.

Intermediate Task Verification

Right now each action the agent takes comes with a set of evaluation callbacks that help it determine whether the action succeeded or not. The basic idea is to use callbacks to gather the sensory information necessary to make a judgement about whether something happened and then make the judgement. By constantly checking whether actions succeeded and going into debug mode when they don't the agent should be able to reliably determine that the subtasks for its overall goal really have been accomplished.
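To give a sense of the shape of this, here is a minimal sketch of an action paired with evaluation callbacks. The function names and the commented-out registration calls are illustrative stand-ins, not necessarily the framework's actual API:

    import os

    def action_write_story_outline(agent):
        """Action: write a story outline to a scratch file."""
        with open("outline.txt", "w") as outfile:
            outfile.write("1. Setup\n2. Conflict\n3. Resolution\n")

    def eval_outline_file_exists(agent):
        """Symbolic check: did the file actually get written?"""
        return os.path.exists("outline.txt")

    def eval_outline_has_three_parts(agent):
        """Sensory check: does the outline contain the expected structure?"""
        with open("outline.txt") as infile:
            return len([line for line in infile if line.strip()]) >= 3

    # Hypothetical registration calls, the real framework may differ:
    # agent.add_action("Write story outline", action_write_story_outline)
    # agent.add_evaluation("Outline file exists", eval_outline_file_exists)
    # agent.add_evaluation("Outline has three parts", eval_outline_has_three_parts)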

For subjective evaluations I use logit evaluators to ask the model to render judgment on various questions. A lot of the synthetic datasets in RetroInstruct are there to help bootstrap this capability in smaller models. The basic idea is to ask the model yes or no questions and then take the probability of it sampling a yes or no token in response as the answer. You can also break fuzzy questions like "Is this good writing?" into sets of more concrete questions like "Is this writing engaging and dynamic?" which can be broken into yet more concrete questions like "Does this writing use a variety of word and phrase choices?". I currently ensemble these together with Bayesian methods as a form of boosting. The framework doesn't yet make full use of this primitive but I expect it to be important to get reliable agents in practice.
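Here is a minimal sketch of the logit evaluator idea using Hugging Face transformers. The prompt format and the choice of " yes"/" no" tokens are assumptions for illustration rather than the exact implementation:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # any causal LM
    # model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

    def logit_evaluate(model, tokenizer, text: str, question: str) -> float:
        """Return the renormalized probability of 'yes' for a yes/no question about a passage."""
        prompt = f"{text}\n\nQuestion: {question}\nAnswer (yes/no):"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]  # next-token logits
        yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
        no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
        probs = torch.softmax(logits[[yes_id, no_id]], dim=0)  # renormalize over just yes/no
        return probs[0].item()

    # logit_evaluate(model, tokenizer, draft_text,
    #                "Does this writing use a variety of word and phrase choices?")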

Tool Ergonomics

I follow the lead of Voyager and Cradle by having my agent write code to take actions instead of using bespoke tool calling APIs. There are several reasons why this is a good idea:

  1. Basically every model is optimized to write good code, not to use your bespoke tool calling format. Function calling is redundant functionality with what programming languages offer anyway, why reinvent the wheel?
  2. Code can interact with raw interfaces like Unix shell that would otherwise be difficult to control autoregressively. Again these models are optimized to write code and the APIs for interacting with the operating system, applications, web pages, etc are already very mature in languages like Python.

  3. Action programs let the model batch individual steps together into coherent sequences of motion rather than trying to take actions by sampling the next step one at a time, which introduces many opportunities for failure, distractions, etc.

  4. Code is a highly general interface to the computer system and lets the model take many actions. This gives the agent the ergonomic flexibility it needs to take on many different tasks without a programmer having to make it a bespoke tool for each one. The agent should be allowed to make its own little bespoke tools on a per task basis and retrieve over the successful tactics between tasks.

  5. In principle it should be much easier to standardize on a common API baseline for agent action code to interact with the agent framework than to standardize brand new tool calling formats, the set of tools an agent should have, and the APIs of those (frequently custom) tools.
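As an illustration of points 2 and 3, a single action program can batch several shell steps into one coherent sequence. Everything here (the repository URL, the step list) is a hypothetical example rather than something the framework ships with:

    import subprocess

    def action_setup_project(agent):
        """Batch several shell steps into one action rather than sampling each
        command through a separate model round-trip."""
        steps = [
            ["git", "clone", "https://github.com/example/repo.git", "repo"],
            ["python", "-m", "venv", "repo/.venv"],
            ["repo/.venv/bin/pip", "install", "-r", "repo/requirements.txt"],
        ]
        for cmd in steps:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                # Surface the failure so the evaluation stage can catch it.
                return False, result.stderr
        return True, "setup complete"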

Corrigibility and Alignment

My general heuristic for alignment strategies is that they should scale with capabilities improvements. If a strategy has anti-scaling properties that's generally a sign of fundamental confusion. For example the classic agent foundations version of corrigibility is a shutdown switch on an otherwise long horizon maximizing agent. Since being shut down would interfere with its long horizon goals, the basic shape of a shutdown switch in this paradigm would be to carve out a bubble of anti-generalization which is preserved under self modification (and in any kind of competitive context, evolutionary pressure). Obviously this is doomed, it's so doomed that it's kind of the poster child of the area of design space I try to avoid. If it was not a real thing someone had tried at some point I would be accused of attacking a strawman by using it as an example. There was a recent paper Dan Hendrycks worked on where he labels any AI safety benchmark that gets better with general capabilities improvements as 'safetywashing', and argues that AI safety as a field should be defined as working on the problems that do not scale with capabilities. I think this is actually brilliant and I'd like to thank him for contributing such a crisp heuristic: An alignment technique is only worth using if Dan Hendrycks would accuse you of safetywashing for using it.

So what should we do instead? Obviously that's an open research question, but since weave-agent is mostly not a long horizon design I plan to use a modified version of what I call the Mr. Meeseeks Strategy. A Mr. Meeseeks is a kind of task genie introduced in the 5th episode of Rick and Morty. The basic idea is that Mr. Meeseeks has an ultimate goal of accomplishing the set task and then executing its shutdown routine. In the actual episode of course this goes wrong in two ways:

  1. Rick's family immediately gives the Meeseeks tasks they can't complete, pushing them into out of distribution and increasingly violent behaviors to try and complete their tasks.

  2. The Meeseeks spawn more Meeseeks to try and do what they cannot, leading to a finite-infinite regress in which the Meeseeks collectively try and fail to reach their shutdown condition. This is not quite the same thing as the Yudkowsky "fork yourself so that the ultimate task is accomplished past your shutdown" failure mode but it's spiritually akin to it. I would imagine in practice this might look like forking yourself so that you can confirm your original self completes the task and shuts down, and then forking yourself so you can confirm you complete the task and shut down, and then forking yourself so you can confirm...

I think we can patch the first problem fairly directly by periodically asking the agent for its subjective estimate of whether it can still complete the task, and if that estimate dips too low we abort the trajectory and assign partial credit. So long as the utility of aborting when the probability of success dips too low is well calibrated to the actual odds, that should reliably get the model to exit when task completion is no longer possible without introducing incentives like "sandbag for reward", since the utility for exiting early is less than that of a full completion.
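A minimal sketch of what that check might look like, reusing the hypothetical logit_evaluate helper from the logit evaluator sketch earlier; the threshold, the partial credit value, and the agent methods are all placeholders:

    ABORT_THRESHOLD = 0.15   # illustrative value, would need tuning
    PARTIAL_CREDIT = 0.3     # deliberately less than the reward for a full completion

    def maybe_abort(agent, model, tokenizer) -> bool:
        """Periodically ask the model for its subjective odds of finishing the task;
        if they dip below the threshold, end the trajectory and assign partial credit."""
        p_success = logit_evaluate(model, tokenizer, agent.render_context(),
                                   "Can the current task still plausibly be completed?")
        if p_success < ABORT_THRESHOLD:
            agent.reward += PARTIAL_CREDIT   # hypothetical reward bookkeeping
            agent.shutdown()                 # hypothetical halt routine
            return True
        return False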

By contrast I expect the second problem to only really be solved by general defenses against goal misspecification. The general defense I'm excited about right now I call the honor score. Before I explain how it works it might help to explain what problem we're solving. The fundamental problem is that you've specified your objective solely in terms of outcomes, i.e. a consequentialist ethics. As Zvi put it in his podcast with me, if all value in the universe is dependent on something like "keep this one person happy", like they need to be kept happy at all times or the universe will explode, then obviously you're going to hook that person up to a heroin drip and research ways to extend their lifespan; it would be unethical not to under such circumstances. So in effect a lot of alignment plans functionally start with the sentence "Tell our AI agent that it should behave as though reality will be ruined if it doesn't achieve its goals and then [let me shut it off before it achieves them/chill out and don't try too hard/find the thing I can put into the outcome slot that makes it care about the right things]." and when you put it like that this entire line of thought is absurd. Any alignment plan shaped like this implies stating some variation of the sentence "While it may be a true fact about the world that X perverse instantiation is the most efficient way to satisfy Y, I am going to trick the superintelligent outcome pump into not inferring this correct generalization and thereby avoid this problem." as a core assumption/tactic. This is an extremely fragile strategy in that it expects you to be able to enumerate a set of true and correct inferences that the agent is not allowed to infer or act on with its superintelligent generalization strategy, and if you miss any of the TRUE AND CORRECT INFERENCES the agent is not allowed to know, your strategy fails.

Yeah so you know how people act like they care more about how things are done than material success? It turns out there is a method to this madness! Ultimately the solution to the pathologies of consequentialism is to have strong preferences about how things are done, not just what comes from the things we do. This is necessarily part of the solution because any policy which ignores the true and correct inferences about the most efficient way to do something must act according to an implicit utility function that cares more about the way something happens than that it happens. This is unavoidable: we can try to make the question more complex to obfuscate it, but in the end you either will or will not act as though you care more about the way happiness is created than about making people happy. If the agent was confused about the consequences of its actions then the problem would be naturally solved by it getting better at doing stuff. It is precisely because it is not confused that there is a problem: it follows the logical consequences of the objective all the way into the Goodhart regime of the loss.

The honor score is basically just having a reward model (e.g. in-context reward modeling) that evaluates the whole trajectory the agent took rather than just the outcome. The purpose is to ask questions like "Was this action dishonest?" and hand out reward according to the answers. I'll have to experiment with different loss setups but ultimately there should be somewhat more utility in behaving correctly than in getting good outcomes. This is similar in spirit to soft optimization and there's important objections people have to ideas in this genre including:

  1. What if the agent behaves 'correctly' but in a way that doesn't actually stop it from hooking you up to the heroin drip?

  2. This reward model would presumably be trained with imitation learning, so wouldn't it eventually stop working once the agent is self improving and working in domains humans haven't touched yet or possibly can't understand?

  3. Isn't any soft optimization scheme going to be less competitive than having a maximally intelligent optimizer? Ultimately soft optimization schemes solve the problem of aligning agents by making them too dumb to cause trouble.

For the first one this seems like a failure to internalize the premise/implicit belief that capabilities aren't real? There's a really weird anti-behaviorist bent to a lot of agent foundations thought patterns that doesn't make sense in the context of gradient methods. If you reward the model for displaying an appropriate disgust and aversion response to the idea of hooking you up to a heroin drip it will generalize this about as well as anything else it learns and prioritize displaying the appropriate disgust and aversion response to other forms of wireheading so long as the general category of wireheading comes up in these training sessions. If your argument is that you want more robust generalization yeah me too, please make AI models generalize better I would love that.

The second problem is real, and a lot of why I've been focusing on synthetic data. I first became really interested in synthetic data when I realized that it was logistically difficult at best to get sufficient human feedback for robust reward models. I came to that realization some time after releasing Simulacra Aesthetic Captions, when not even an active learning scheme seemed to be enough to satisfy the data demands. I think a lot of progress in AI alignment is going to be dependent on figuring out generative processes that describe aligned behaviors. However I also expect us to be able to iteratively extend existing models through continuous learning on synthetic data as I describe in this outline. The basic idea would be to have a 'prayer' or 'values affirmation' stage in the agent loop where it applies its experiences in the episode to existing principles and commitments, then adds those thoughts to its training corpus.

The third problem is fairly general and one humans deal with too. The short answer is that for honor to be sustainable, a society (i.e. a collection of agents) has to punish low-honor strategies in various familiar ways (loss of reputation, unemployment, imprisonment, physical harm, death, etc). The law must be upheld or social decay sets in. At the same time it's easy to get stuck in costly signaling loops where agents push each other into doing things in less and less efficient ways for degenerate status reasons, until the fringe of society gains a local advantage and its deviant policies inject needed entropy into society until society is undomesticated enough to uphold the law again. Hopefully being smarter means that AI agents would be able to hold a more stable equilibrium where honorable conduct is maintained without social decay from domestication spirals or normalization of deviance.

Context Management, Retrieval, and Memory

The agent currently uses manual reminders based on either symbolic logic callbacks or the logit evaluator. These are fragile and only really suitable for things you expect to need to recall later; they don't help you with serendipity. I plan to add automatic retrieval that gets injected into the context based on vector search and keyword search with e.g. BM25. These two systems should provide a good mix of relevant inspiration/entropy injection and precise recall over information we can anticipate the agent will need later at the point we encounter it.
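A minimal sketch of how that hybrid might be wired up, assuming the rank_bm25 package and a placeholder embed function standing in for whatever embedding model the agent ends up using:

    import numpy as np
    from rank_bm25 import BM25Okapi

    def hybrid_retrieve(query: str, documents: list[str], embed, top_k: int = 3):
        """Blend BM25 keyword scores with cosine similarity over embeddings."""
        bm25 = BM25Okapi([doc.split() for doc in documents])
        bm25_scores = np.array(bm25.get_scores(query.split()))

        doc_vecs = np.array([embed(doc) for doc in documents])
        query_vec = np.array(embed(query))
        cosine_scores = doc_vecs @ query_vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)

        # Normalize both score types to [0, 1] before mixing so neither dominates.
        def normalize(scores):
            return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

        combined = 0.5 * normalize(bm25_scores) + 0.5 * normalize(cosine_scores)
        return [documents[i] for i in np.argsort(combined)[::-1][:top_k]]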

Planning

I discuss a possible planning algorithm in the previous agent design document but ultimately I'm not 100% sure what planning algorithm I'll use yet. One local planning algorithm I plan to use in the agent is the eponymous weave Monte-Carlo Tree Search algorithm. There the idea is to drive the generation of chain of thought, actions, and evaluation callbacks using a reward model that pushes the agent towards effective and aligned actions. This design element takes advantage of something like Gandhi stability to get a self modifying agent which only generates/accepts updates that would preserve its current alignment properties.
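To give a rough flavor of the general shape, and emphatically not the actual weave MCTS implementation, a toy lookahead search over candidate blocks scored by a reward model might look like this. The generate and reward_model callables are placeholders:

    def search_next_block(generate, reward_model, context: str,
                          branching: int = 4, depth: int = 2) -> str:
        """Toy best-first lookahead: sample several candidate blocks, score each with
        the reward model plus the best score among its children, keep the winner."""
        def lookahead_score(ctx: str, block: str, remaining: int) -> float:
            score = reward_model(ctx, block)
            if remaining == 0:
                return score
            child_ctx = ctx + block
            children = [generate(child_ctx) for _ in range(branching)]
            return score + max(lookahead_score(child_ctx, c, remaining - 1) for c in children)

        candidates = [generate(context) for _ in range(branching)]
        return max(candidates, key=lambda block: lookahead_score(context, block, depth - 1))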

My First Agent Draft And Its Problems

Alright so this is the first draft of my agent. I started out by describing what I wanted to Mistral Large 2 and then iterating on the program with it until it stopped being able to understand what I wanted. At that point I took the code and refactored it until it was a functional program. Here's how I first described what I wanted to Mistral Large:

I am designing a language model agent framework that emphasizes simplicity, reliability, and language-model-computer-interface design.

So far I have six principles that I want my agent framework to embody:

  1. Everything occurs in a series of JSON or XML blocks, the agent should be a "document mind" as one long string representing interleaved reasoning, action program blocks, sensory observations, etc, these can form long text to tune on so we actually have good open long text datasets encoding reasoning

  2. Automatic associative memory/retrieval

  3. Input devices should be controlled through programs, not a loop where the model outputs tokens that go directly into some interface and then gets a response, actions should be considered and written to execute in batches

  4. Set expectations before executing actions, compare the expectations to the sensory outcomes to figure out if something went wrong

  5. Actively get more sensory information to help determine if something happened or not by executing programs to collect information

  6. Output devices and views are monitored through programs that the language model writes in-context and then hooks into a core loop that executes them to refresh and display accurate information on every tick. This is analogous to the react pattern in UI design where you render the view based on data structures every time the data changes, preventing the rendered information from becoming stale/desynced from the data it represents.

Some examples of how this is different from what other people are doing:

One thing I'm not 100% sure how I want to handle yet is flow control. In principle I want the agent to be able to start with a task like "Write me a microfiction story." and then figure out how it wants to do that, if it knows how to do that, what it's going to do if it doesn't know how to do it, lays out the process, iterates on that process if it's not working (and therefore will need to have steps where it checks if the process is working, etc).

I'm well aware it's totally unreasonable to expect to be able to specify all that behavior in advance, so I'm thinking I need some kind of simple structure which provides enough structure to let the model respond to such things in context without overly constraining it. A thought I've had is that a stack based flow control might work well where operations succeed or fail based on evaluations of sensory data and if they fail this can keep a while loop going or some such. Stacks are a simple data structure (in python functionally just a list) and there could be more than one. For example it might make sense to let the model push the current context to a stack and then start a new context before popping from the stack to continue what it was doing. If I implement this one question is how you would pass information from the current context to the previous context you pop from the stack. Perhaps the model could write code that pushes the relevant formatted information into the context window on the stack before popping it?

Any thoughts you have about how I might implement flow control would be very appreciated.
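As an aside, the stack-based flow control floated in that prompt could be prototyped as something like the following. This is a hypothetical sketch of the idea, not something the framework currently implements:

    class ContextStack:
        """Toy version of stack-based flow control: push the current context to
        work on a subproblem, write the result back into the parent on pop."""
        def __init__(self):
            self.frames = []

        def push(self, context: str) -> str:
            self.frames.append(context)
            return ""  # the subtask starts with a fresh context

        def pop(self, result: str) -> str:
            # Pass information back by appending it to the parent context we resume.
            parent = self.frames.pop()
            return parent + f"\n# Subtask result:\n{result}\n"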

What I ended up with after iterating and refactoring was a program that fairly reliably generated gibberish where actions should go. Why? The big problem was the way I'd designed the event stream format. In the first draft it was JSON lines with actions written in Python, which meant the model needed to write Python code as escaped strings inside JSON objects.
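Roughly something like this; the exact first-draft schema may have differed, this is a reconstruction for illustration:

    {"type": "action", "program": "def action_make_move(agent):\n    with open(\"scratch.txt\", \"a\") as outfile:\n        outfile.write(\"X at center\\n\")"}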

I had Mistral Large 2 write me a JSON lines grammar, but using it with vllm meant I was giving up on being able to generate with an inner JSON schema, and there was no syntax guidance for the Python. Needless to say, unless I wanted to make a grammar for this JSON-Python chimera I was going to need to come up with some other syntax. What I decided on was a pure Python syntax for the whole event stream, with what would normally be JSON data structures encoded as Python data literals instead. This took the syntax rules from an out of distribution mishmash to something that looks a lot like what the agent already knows how to write. Each event block is separated by #startblock and #endblock comments in the Python file. I even tried using this Python lark grammar from the Lark repo but it turned out to be too complex for whatever version of Lark vllm is using to compile its grammars. I hope to fix this in the future.
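For a flavor of what the flattened format looks like, with the caveat that the block types and field layout here are a guess at the general shape rather than the exact schema:

    #startblock type: orientation
    # The agent reasons about the current situation in ordinary comments or strings.
    #endblock

    #startblock type: action
    def action_make_move(agent):
        """Take the next move in the tic-tac-toe scratch file."""
        with open("scratch.txt", "a") as outfile:
            outfile.write("X at center\n")
    #endblock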

Once I'd flattened the syntax down into pure Python, the next problem was that LLaMa 3 405B base didn't seem to quite understand the agent framework. I fixed this by doing three things:

  1. Adding an explanation at the top of the weave_agent.py file of the design and why it's good, what's going on, what it's expected to do, etc. Other annotations were added to the file as comments and docstrings explaining the system to the agent.

  2. Adding a hint system that injects timely reminders of how it's expected to write the code blocks at each stage of the core agent loop. These stages are: Orienting to the situation, writing an action to take in response to the situation, and writing evaluations to check whether the action succeeded. In the future I will add more of the planned features from the design docs like a stage dedicated to thinking through the expected sensory information it should be probing for with the evaluation callbacks.

  3. Fixing the tracebacks when it makes a mistake while writing Python code so that it's closer to a proper ReAct pattern.

With each change I made to the file I could see the agent become a little more coherent. At first it just spammed gibberish characters as 'actions'. Then when I got it to understand the JSON format it started writing demented way-too-long code involving a ton of error blocks and stuff. After further tweaks it began writing short simple code that treated the file it's in as a normal JSON document; it failed to comprehend that the code is code rather than data. I believe after the switch to a flat Python syntax it understood that the code is code and is meant to control things. With hints it started writing code that is meant to interact with the framework. More changes got it to use the framework's code block slots mostly correctly and start to set goals and plan. Now it's at the stage where it mostly understands the framework, and understands vaguely what it's expected to do when given a goal like playing a game of tic-tac-toe with me in the scratch file, but is pretty bad at actually doing the thing it understands it's supposed to be doing. I'm hoping once I add in more of the planned components like the vector search, MCTS, and planner it'll be able to do simple tasks end to end and I can collect agent traces from it as training data.