Weave Agent DevLog #1 - Flow Control

John David Pressman

This might sound like a dumb question, but in what way is an LLM agent an agent? Before LLMs existed there was a fairly compelling theoretical frame for agency combining utility functions and Solomonoff induction which saw agents as extracting rewards from a computable environment. LLM agent frameworks by contrast usually don't work like this. What utility function is an LLM maximizing? On paper a base model is at least "predicting the next token", but if you look at the log odds of texts at a per-token level it's fairly obvious that while "predict the next token" might be a goal the model tries to satisfy, it's quite impossible to actually do in full generality, so some useful proxies of this goal are presumably learned. In an RLHF tuned model or similar the picture gets murkier: the model wants some correlates associated with the positively labeled samples and avoids some behaviors associated with the negatively labeled samples. When we make an "LLM agent" scaffold and start introducing things like MCTS, control vectors, self prompting, and logit evaluators the picture risks becoming totally illegible.

Basically the question is: what are we doing here exactly and how will we know if we're succeeding? The nominal answer is that we're writing a task genie which is given some high level local objective, figures out how to accomplish it, and then executes the plan. But in practice things are a little murkier than that: no plan survives contact with reality, and constant adaptation is necessary to make plans work in practice. Sometimes things can't be fully planned in advance because they rely on situations, and action spaces within those situations, which are not predictable in advance. Sometimes a plan isn't even desirable. If my friend invites me to "hang out" and I show up with a plan for what we're going to do I'm harshing the vibe, and LLM agents should definitely be able to vibe.

Every LLM agent framework is as much an attempt to answer the question of what an LLM agent should be as it is an attempt to implement that thing. In the last development log I said that a core problem was ensuring that each intermediate step in a plan actually gets performed correctly before moving on. Before we can solve that problem, though, we need the list of intermediate steps in the first place. These are usually supposed to be provided by a planner of some kind, a program that figures out how to navigate between a starting state and a goal state in some computable environment. When I introspect on how I do this, what stands out to me is that I'm generally not consciously planning out the whole trajectory but identifying focal and inflection points along it. I think "drive to the store, grab milk, go to checkout, drive home" not "turn left on highway 3, walk to aisle 9...". The reason I can do this is that I trust my local planning and retrieval to be good enough to handle the spatial layouts I navigate to reach the store, the milk, the checkout, and home. One way to think about this, then, is that planning works in terms of abstractions over an action space that is known to be reliable.

Truthfully I've found answering these questions for weave-agent fairly difficult. Traditional planning algorithms assume you have a discrete low level action space you can build a hierarchy of motor programs out of. They assume that you have deterministic actions with deterministic consequences, not that you have to build reliable behaviors out of fundamentally stochastic primitives. A lot of my heuristic for weave-agent when I don't really know how to do something is to pluck some of the low hanging fruit that might make the framework more coherent and see if that opens any new avenues or insights.

The Tick Structure

Because I didn't really know what to do, the first thing I did was nail down the tick structure. A tick is one full cycle of the agent's event handling loop. Right now it comprises the following stages:

  1. Task Inference - This is where the model manages the status of tasks in its queue. It has an associated kanban board that it can use to track tasks as idle, going, completed, blocked, etc. It can also decide to add new tasks in response to the result of the previous tick. The idea behind having a standalone task inference stage separate from taking actions is that task inference is something the agent needs to do on each tick and will do much more reliably if it's specifically prompted to do so.

  2. Orientation - In this stage the model summarizes the relevant events of the previous tick and decides how it's going to respond to the situation. It's basically just a reasoning prompt.

  3. Action - The program which implements the action the model is going to take to respond to the situation. It's basically just meant to be executable logic which acts on what was decided in the orientation stage.

  4. Expectation - Another reasoning block where the model is prompted to imagine what it will look like if its action succeeds and what sensory evidence it could gather to confirm that success.

  5. Evaluation - A series of program blocks, currently three, where the model implements the callbacks necessary to gather and evaluate the sensory evidence it said it expects in the expectation block.

  6. Outcome - A block with a table of the evaluations performed and their results. The outcome block of the current tick is used, along with the observation blocks, to perform task inference and orientation on the next tick.

One of the specific ideas I had was to validate that each of these stages has been added to the tick before executing it. Most of the purpose of this is to reject ticks where the model goes so far off track that it's not even following the intended format anymore, since that would poison the context for subsequent ticks.
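
To make that concrete, here's a minimal sketch of what the validation could look like. The stage names come from the list above, but everything else (the block format, the class and helper names) is hypothetical rather than the actual weave-agent code:

```python
# Hypothetical sketch of validating a tick before executing it.
# The stage names match the tick structure above; the block format
# and names here are illustrative, not weave-agent's real code.

REQUIRED_STAGES = [
    "task_inference",
    "orientation",
    "action",
    "expectation",
    "evaluation",
]

class MalformedTickError(Exception):
    """Raised when a tick is missing one of its required stages."""

def validate_tick(blocks):
    """Check that every required stage is present before executing the tick.

    `blocks` is assumed to be a list of dicts like {"type": ..., "body": ...}
    produced by the model during this tick.
    """
    present = {block["type"] for block in blocks}
    missing = [stage for stage in REQUIRED_STAGES if stage not in present]
    if missing:
        # Reject the whole tick rather than let a malformed one
        # poison the context for subsequent ticks.
        raise MalformedTickError(f"Tick missing stages: {missing}")
    return True
```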

The kanban board took me a while to come up with, and can be attributed to a conversation I had with George Walker where he pointed out that for planning and flow control I should probably be asking where the LLM would have already seen this in its training data, and that in principle there should be a bunch of project management stuff I could work from. I looked around at some command line kanban tools for inspiration and liked the way that kanban.bash displays the ticket history. I gave the following prompt to Mistral Large 2:

Please write me a kanban board class in Python with the following features:

- The name of the class is WeaveKanban
- Each card can have a state of idle, going, completed, blocked, or aborted
- Cards have their status changed through dedicated methods like blocked(),
going(), aborted(), etc.
- If a card is set to blocked a list of tasks it's blocked on must be provided 
- Whenever a card has its status changed an explanation must be given
- The history of card status changes is kept
- The kanban is meant for command line display so each view must render into
printable ASCII text
- One view shows the description of the card, the title, other metadata, and the
history of status changes with explanations 
- Another view shows the list of cards in a table with the first column being the
id of the card, the 2nd column being the card title, and the 3rd column being an
abbreviated history with each of the card states represented by a single letter
(e.g. blocked is B) separated by spaces. So a card that is going, blocked, then
ultimately aborted would have the string `G B A`
- The kanban must be serializable to JSON and loadable from the same JSON

And then iterated on what it gave me until I had something like the present WeaveKanban class. It's not entirely done; in particular I want to add the feature that when a task is blocked by another task it gets automatically unblocked once all of its blocking tasks are completed.
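
For readers who want the flavor of the result without the full class, here's a compressed sketch along the lines of the prompt above. The actual WeaveKanban class is longer and differs in its details, so treat the method names and layout as illustrative:

```python
import json

# Compressed sketch in the spirit of the prompt above, not the real class.
STATE_LETTERS = {"idle": "I", "going": "G", "completed": "C",
                 "blocked": "B", "aborted": "A"}

class WeaveKanbanTask:
    def __init__(self, task_id, title, description=""):
        self.id = task_id
        self.title = title
        self.description = description
        self.status = "idle"
        self.blocked_on = []
        self.history = []  # list of (status, explanation) changes

    def _change(self, status, explanation):
        # Every status change requires an explanation and is recorded.
        self.status = status
        self.history.append((status, explanation))

    def going(self, explanation):
        self._change("going", explanation)

    def completed(self, explanation):
        self._change("completed", explanation)

    def aborted(self, explanation):
        self._change("aborted", explanation)

    def blocked(self, explanation, blocked_on):
        # A blocked card must say which tasks it's blocked on.
        if not blocked_on:
            raise ValueError("blocked() requires the ids of blocking tasks")
        self.blocked_on = list(blocked_on)
        self._change("blocked", explanation)

    def abbreviated_history(self):
        # going -> blocked -> aborted renders as "G B A"
        return " ".join(STATE_LETTERS[s] for s, _ in self.history)

class WeaveKanban:
    def __init__(self):
        self.tasks = []

    def add_task(self, title, description=""):
        task = WeaveKanbanTask(len(self.tasks) + 1, title, description)
        self.tasks.append(task)
        return task

    def view_board(self):
        # Table view: id | title | abbreviated status history
        rows = [f"{t.id:<4}{t.title:<40}{t.abbreviated_history()}"
                for t in self.tasks]
        return "\n".join(rows)

    def to_json(self):
        return json.dumps([t.__dict__ for t in self.tasks])
```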

I also wrote a Bluesky thread of other ideas I had, like formulating SMART goals, Gantt charts or mermaid.js for the dependency graph, and an action description language. But after thinking about all of them for a while I decided that realistically the kanban board already implicitly encodes a dependency graph through the blocked status, and this probably provides 80% of the benefits of the more complex solutions for 20% of the implementation complexity.
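
To illustrate what I mean by the blocked status implicitly encoding a dependency graph: each blocked card's list of blockers is an edge list. Assuming task objects like those in the sketch above, extracting the graph, and the auto-unblock feature I mentioned wanting, could look something like:

```python
# Hypothetical helpers assuming the WeaveKanbanTask sketch above.

def dependency_edges(kanban):
    """Return (blocked_task_id, blocking_task_id) edges implied by blocked status."""
    return [(t.id, b) for t in kanban.tasks
            for b in t.blocked_on if t.status == "blocked"]

def auto_unblock(kanban):
    """Set any task whose blockers have all completed back to going."""
    done = {t.id for t in kanban.tasks if t.status == "completed"}
    for t in kanban.tasks:
        if t.status == "blocked" and set(t.blocked_on) <= done:
            t.going("Unblocked: all blocking tasks completed")
```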

For Want Of A Lark Grammar, MCTS Was Attempted

After the introduction of the kanban board I noticed that the model was starting to have trouble writing a complete working set of programs for a tick, even writing syntax errors. Since these models are autoregressively sampled over hundreds or thousands of tokens it makes sense that the model would occasionally make a typo or something. The way the agent is currently set up, that basically means starting the whole tick over. Rather than move right away to letting it retry different parts of the tick, I figured that Python syntax errors should probably be precluded by the sampler anyway: it's kind of annoying that my thing can be derailed by syntax errors when I've gone out of my way to ensure the whole context is in Python grammar to begin with. So I asked Claude 3.5 Sonnet what was wrong with the Lark grammar file I was feeding vllm and it helpfully pointed out multiple places I could simplify the grammar so vllm could actually use it. While this got the grammar accepted, it turned out that the resulting state machine was computationally intractable with vllm's grammar guided sampling method.
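
In the meantime, a cheap fallback is to syntax check each generated program block before accepting it into the tick, which at least catches the typo class of failure without grammar guided sampling. A minimal sketch using Python's standard library:

```python
import ast

def is_valid_python(block_source):
    """Cheap pre-check: reject generated blocks that don't even parse.

    This doesn't guarantee the code runs, just that a stray typo
    won't derail the tick with a SyntaxError at execution time.
    """
    try:
        ast.parse(block_source)
        return True
    except SyntaxError:
        return False

# e.g. regenerate a block instead of starting the whole tick over:
# while not is_valid_python(candidate):
#     candidate = generate_block(...)
```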

Rather than give up on the idea entirely I decided that this seemed like a good time to try incorporating the Monte Carlo Tree Search portion of weave-agent. I added a simplified way to invoke the Bayesian weave evaluator and changed the generate_block() method to use the weave MCTS. It turns out that doing naive Bayes on logit evaluators as the reward model for your MCTS is really slow, and before optimizations it was functionally computationally intractable. What I ended up doing was changing the evaluator code to use async requests and send them all to vllm at once, which was still too slow. So I changed it to send the requests to evaluate all the children at once, which was still too slow. So I removed the Bayesian evaluator part and went back to asking one question like the original weave MCTS code, which is still too slow but sorta viable.

Just a little bit ago I realized that I can actually make the chunk size I evaluate at one time pretty big and the tree search still works. I was doing 32 token spans in earlier experiments with worse models, then later bumped it up to 64, which is what I started out with on the weave-agent experiments, then upped it to 128, and just a little bit ago I tried 256 and realized that still works fairly well while cutting the computation in half. I think part of what's going on is that writing is momentum based: the 'stem' you write from, say the first 40 tokens in a chunk, is actually more important than the tokens that follow it. So if you're bottlenecked on serial ops like the in-context reward modeling, it makes sense to generate bigger chunks. A longer chunk is easier for the evaluator to judge than the short stem alone, and conditional on the stem being good the subsequent tokens are probably also good. That means forcing the MCTS to evaluate longer spans hobbles it less than you'd think while saving compute.
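
For flavor, here's roughly what "evaluate all the children at once" means in code. This is a hypothetical sketch: score_continuation() stands in for the real logit evaluator request to the vllm server, and the node structure is simplified:

```python
import asyncio

CHUNK_SIZE = 256  # tokens generated per node before each evaluation

async def score_continuation(context, continuation):
    """Stand-in for the real logit evaluator: ask the model one yes/no
    question about the candidate chunk and return P(yes) from the logits
    of the answer token. In practice this is a request to the vllm server."""
    ...

async def expand(node, sample_fn, branching_factor=4):
    # Sample several candidate chunks from the current context...
    children = [sample_fn(node.context, CHUNK_SIZE)
                for _ in range(branching_factor)]
    # ...then fire off all the evaluator requests concurrently instead
    # of scoring one child at a time, since the in-context reward model
    # is the serial bottleneck.
    scores = await asyncio.gather(
        *(score_continuation(node.context, child) for child in children))
    return list(zip(children, scores))
```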

Ultimately I will probably end up having to give the model multiple tries at writing any given part of a tick. I've also been trying experiments with models that aren't as good as LLaMa 3 405B base, like Mixtral 8x22B, and they don't work nearly as well. I suspect it might be possible to overcome this by borrowing a page from Voyager and implementing retrieval over previously successful tactics and programs used by the model. In addition to possibly letting me save compute by doing less inference on code where a close match already exists in the database, this would let me bootstrap the agent by writing useful examples of tactics for it to adapt from when it gets stuck or needs help. I plan to use iterative retrieval in the vein of RepoCoder along with a mixture of BM25 keyword search, vector search, and full text search so I have recall over the full spectrum of precise and fuzzy matches to the current context. I still haven't decided how I'll allocate the compute used to make the embeddings without interfering with deployment of the LLM, but I'm sure I'll figure it out.
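
As a rough illustration of the hybrid retrieval idea, here's a sketch that fuses BM25 keyword scores with embedding similarity. It uses the rank_bm25 package for the BM25 half; embed() is a hypothetical stand-in for whatever embedding model ends up getting used:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Hypothetical hybrid retrieval sketch: BM25 for precise keyword matches,
# embeddings for fuzzy semantic matches, fused with a weighted sum.

def embed(text):
    """Stand-in for a real embedding model; assumed to return a unit vector."""
    ...

def hybrid_search(query, documents, doc_vectors, alpha=0.5, k=5):
    # BM25 over whitespace-tokenized documents (precise matches)
    bm25 = BM25Okapi([doc.split() for doc in documents])
    keyword_scores = np.array(bm25.get_scores(query.split()))
    # Dot product against precomputed unit-norm document embeddings
    # (fuzzy matches); with unit vectors this is cosine similarity.
    query_vec = embed(query)
    semantic_scores = np.array([vec @ query_vec for vec in doc_vectors])
    # Normalize both to [0, 1] so neither score dominates, then fuse.
    def normalize(scores):
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span else scores
    fused = (alpha * normalize(keyword_scores)
             + (1 - alpha) * normalize(semantic_scores))
    return [documents[i] for i in np.argsort(fused)[::-1][:k]]
```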