Weave Agent DevLog #2 - Embodiment, Goodhart, and Grounding

John David Pressman

If there is a unifying problem statement for LLM agents in the vein of the Tsiolkovsky rocket equation it probably goes like this: For any given task t consisting of s necessary serial steps, which for simplicity we will assume share a fixed probability p of being completed successfully, the chance of completing the task is given by the multiplication rule of probability:

p(t_complete) = p^s

Which is to say that the probability of task completion decays exponentially with the number of steps. If we have a 30 step task with a 95% chance of completing each individual step (including retries; we don't necessarily need to get it right the first time, but it does need to be done correctly before moving on) then:

0.95 ^ 30 = ~21.5%

This is a grim prospect. A 95% completion chance per step would be very solid, probably comfortably outside the capabilities of current models. Yet on something as 'simple' as a 30 step task an agent that good would only do it right about a fifth of the time. I'm sure there exist single page forms that take 30 steps to fill out, depending on the granularity of a 'step'. For the things we would like AI agents to actually do, like program whole git repos, do engineering projects, and write books and research reports, 30 steps is nothing. Those are more like hundred or thousand step tasks. A 95% chance of successfully completing each necessary serial step isn't remotely good enough; the actual probability needs to be overwhelming, close to 1 per step. We know this isn't impossible because humans manage it, but we also know it's nontrivial, certainly not the kind of problem you can realistically expect to solve by throwing crap at the wall and seeing what sticks.
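
To get a feel for how demanding this is, here is a minimal sketch (plain python, no weave-agent code) that inverts the formula: given a target overall success rate, it computes the per-step reliability a task of a given length requires.

# Per-step reliability required to hit a target overall success rate:
# solve p^s = target  =>  p = target^(1/s)

def required_step_reliability(steps, target=0.9):
    return target ** (1 / steps)

for steps in (30, 100, 1000):
    p = required_step_reliability(steps)
    print(f"{steps:>5} steps: need ~{p:.5f} per-step success for a 90% overall success rate")

# Roughly: 30 steps -> 0.99649, 100 steps -> 0.99895, 1000 steps -> 0.99989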

The important question about any LLM agent framework is how its mechanics come together to solve this problem. An LLM agent framework is a system for taking actions and becoming highly confident about the correctness of their intermediate and final outcomes in a general and open ended way. If your system does not do this, e.g. if it provides a bunch of APIs for stringing together prompt templates but doesn't offer a solution for intermediate reward modeling, it is not solving the problem. If your system checks the correctness of a hand built, very constrained prompt pipeline that happens to incorporate language models in the vein of RetroInstruct, it is not an LLM agent and it does not solve this problem. I have heard people say that LLM agents are intractable because the problem is undefined and it's not clear what a successful system would look like, but that's not true: I just defined the problem, and if your system does not solve it then it is probably not an LLM agent framework.

weave-agent tries to solve this problem by using LLM program synthesis to define a general action space and general reward model.

Motor Programs As General Action Space

One of the primary inspirations for weave-agent is Cradle.

Cradle is an agent framework for visual language models that lets a model control a graphical desktop by writing programs to control a mouse and keyboard. The idea is that if you ask a model to output a series of mouse movements as tokens, this is out of distribution and invites the model to second guess itself in the middle of an action, so it will flail around. But if you ask it to write a program in PyAutoGUI then it will structure a series of mouse movements as a coherent action. Writing a python program is in distribution, and batching up actions into a program before executing them lets the model check that the whole sequence makes sense and then do it all without further thought. This is similar to how the Minecraft agent takes actions in NVIDIA's Voyager framework, but fully general. When I saw the concept in Voyager I thought it was hacky and clearly a weak point, but seeing it in Cradle as a way to control a mouse and keyboard made me realize it's brilliant.

For weave-agent I realized that a similar approach could be used for a text only agent too. Instead of trying to hook up a language model to a bash terminal, which is awkward and kind of out of distribution, I could instead run bash commands through the python standard library and construct the observation windows as callbacks that gather the information the model needs to know whether its action succeeded or not.
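As a rough illustration of the shape this takes (a sketch, not the actual weave-agent API; the observation registration method name is a guess), an action is a python callback that shells out through the standard library, and an observation callback gathers the evidence needed to judge whether it worked:

import subprocess

# Sketch only: callback names and registration methods are illustrative.
def clone_example_repo(agent):
    # The "motor program": a batched, self-contained action.
    subprocess.run(["git", "clone", "https://github.com/example/repo.git", "repo"],
                   check=True, capture_output=True)

def observe_repo_contents(agent):
    # Observation callback: gather what the policy needs to judge success.
    result = subprocess.run(["ls", "repo"], capture_output=True, text=True)
    return result.stdout

agent.add_action("Clone the example repo", clone_example_repo)
agent.add_observation_view("Repo contents", observe_repo_contents)  # method name is a guess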

Therefore the primary lens I think of weave-agent through is not "reasoning" or "chains of thought" but motor programs. A small child wandering around is doing nothing useful to anyone else, but their wanderings do teach them how to use their body to accomplish goals. Small children climb, grab, wander, ponder, pretend, imitate, set goals, achieve them, learn from their mistakes, test boundaries, run, jump, hide, seek, yet almost never, and certainly not without adult assistance, do they do anything of any economic value whatsoever. When you are five years old your 'job' is to learn an accurate mapping between thought, motor action, and outcome in the hopes that you may one day grow into a useful human being.

While these models may know much, much more about the world than any five year old, they are still not well developed as wanderers, climbers, and grabbers. On the other hand they've got the raw material to get started. Code is the executive modality: imperative programming especially is writing out instructions for a computer to execute. Through programs we manipulate data structures representing machinery, bank balances, employees, friends, lovers, all manner of goods and services in the real world. The runtime environment executing the agent's actions is its 'body', and the programs it writes to do things are motor instructions that control it.

However bodies do more for us than just provide manipulators. They also provide:

It's this last bullet point that I'd like to focus on in this dev log.

Logit Evaluators and Unit Testing as General Reward Modeling and Validators

After an action has been taken, evaluation callbacks are run to verify the outcome. One primitive that the weave-agent framework provides is the Weave Evaluator, which takes yes/no questions about a context, gives them to a language model, and extracts the relative odds of the model answering 'yes' or 'no' for each question. This suggests one natural pattern of having the callback gather relevant information from the computable environment and then giving the collected information to a weave evaluator context to get a score for it. Other patterns are possible, for example we could imagine a Eureka-like system which tries to infer source code for the computable environment and then designs reward functions in-context to fit that imagined environment code.
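
A minimal sketch of the logit extraction idea (assuming a HuggingFace-style causal LM interface; this is not the actual weave evaluator code): score the 'yes' and 'no' continuation tokens and renormalize them against each other.

import torch

def p_yes(model, tokenizer, context, question):
    # Sketch of the logit evaluator idea, not the weave-agent implementation.
    prompt = f"{context}\n\n{question} Answer yes or no:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    # Relative odds of "yes" vs "no" for the next token, renormalized over the pair.
    probs = torch.softmax(logits[[yes_id, no_id]], dim=0)
    return probs[0].item()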

In order to improve the weave evaluator I wrote a Bayesian version to use with the MCTS, which I had to abandon because the MCTS was taking way more resources than I had anticipated at the design stage. Even after paring it down to one evaluator question per tree search it was taking something like 20 minutes per block. Since the last dev log I've added three substantial speedups to the weave-agent inference that together make it viable to use:

  1. Conditional MCTS Sampling - The first big speedup I've added to weave-agent is to try simple best-of-n rejection sampling before I try monte-carlo-tree-search. If rejection sampling yields a block over a score threshold I skip the full tree search and move on. This lets me save my compute for the parts of the problem that actually need it (see the sketch after this list).

  2. Stop Strings - While trying to trim completions so they don't go past the end of the block I discovered that doing this gives a substantial speedup, because most of the tokens being generated were past the #endblock marker. This means the first 200 tokens might be the actual response and then past that is 500 tokens of confabulated content I don't use. vLLM apparently only returns once the last branch is finished generating, so constraining it to just write the actual blocks turns out to be a huge performance increase.

  3. Prefix Caching - While using the bot, RiversHaveWings discovered that prefix caching is not actually on by default in vLLM and that enabling it gives a huge speedup on weave evaluator API calls. For my part I thought it was already enabled and didn't see the flag for this specific optimization. Whoops.
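
Here is the sketch of the conditional sampling logic promised above (sample_block and weave_tree_search are hypothetical stand-ins for the real sampling and tree search routines, and the threshold value is arbitrary):

def generate_block(context, score_fn, threshold=3.0, n=4):
    # Try cheap best-of-n rejection sampling first.
    candidates = [sample_block(context) for _ in range(n)]
    best = max(candidates, key=score_fn)
    if score_fn(best) >= threshold:
        return best
    # Only fall back to the expensive tree search when rejection sampling fails.
    return weave_tree_search(context, score_fn)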

Between these three improvements the tick time has gone down from something like 20 minutes to 3 minutes. I admittedly haven't actually checked what the average tick speed is in either case but the bottom line is these optimizations have taken the agent from an interesting toy to theoretically usable if it could be made good enough.

After this I decided to see how far I could go with just unit test type evaluator questions and letting the policy decide what to do based on the results. In principle all that's necessary for iterated tuning to make the agent better is for its model of what actions lead to what outcomes to improve and for the MCTS to select for the improvements. So even if I couldn't become highly confident about intermediate outcomes by ensembling weave evaluator questions, maybe it's still possible to become highly confident by training a world model that can be queried for reward scores.

In practice trying to produce a weave-agent trace with just a naive autoregressive model leads to compounding mistakes and mutations until the whole thing is glitchy noise:


# [Inst]# my_agent_code
# [Inst]# # [/Inst]# [Inst]# # [Inst]#
# [Inst]# [Agent]# [Inst]# # [Inst]# agent:
# [Inst]#

# [Inst]# agent.code

# [Inst]# [Agent]# # [Agent]# # [Agent]# # [Inst]#
# [Inst]# [Inst]# # [Inst]# agent.agent

# [Inst]# agent:

# [Agent]# # [Agent][Agent][Agent][/Agent][/Agent][/Agent]# [Agent][Agent][Agent][/Agent][/Agent][/Agent]#
# [Inst][/Inst]# [Inst]# # [Inst]#
# [Inst]# [agent]#
# [Inst]#
# [Inst]# 
# [Inst]# # # [Agent]#
# [Inst]# # [Inst]# [Inst]# [Inst]#
# [Inst]# # []#
# [Inst]# # [Inst]# [Inst]# # # # #       #       #       # # # # #       #       [Agent]#
#endblock
#startblock type: task_inference
#timestamp 1728913939.1089704
# Current Task: (2) Change the opening line of the story
# Task Status: (going) Unblocked and ready for action.

"""I'll first set agent.current_task to agent.tasks.get_task(1).
Then, I'll edit the opening line of horror.txt to "
The four friends stood at the entrance of the mansion, feeling a chill run down their spines.
"""

# [Agent]#
# [Inst]# # [Inst]# [Inst]# # [Inst]# # [Inst]# # [agent]# # [Agent]# [Inst]# # [Inst]# [/Inst]# [Inst]# # [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# # [Agent]# [Inst]# [Inst]# [Inst]# # [Inst]# # [Inst]# # [Inst]#
# [Inst]# # [Inst]# [/Inst]# # [Inst]# # [Inst]# # [Inst]# # [Inst]#
# [Inst]# [Inst]# [Inst]# [Inst]# [Inst]# # [Inst]# [Inst]# [Inst]# [Inst]# # # # [Inst]#
agent.current_task = agent.tasks.get_task(1)
agent.get_task(1)  # Get the task object for task #1
agent.current_task.going("Editing the opening line of horror.txt to match example text.")
agent.edit_file(horror.txt, "The four friends stood at the entrance of the mansion, feeling a chill run down their spines.") # Replace the first line of the file with the new line
agent.close()

In fact, doing this I learned that my framework produced agent traces with several pathologies that increased mutational load past what selection with MCTS could filter out, including (but not limited to):

So let's talk about how I'm getting the goobers out of my traces.

Designing Agent Traces For Continual Learning

There are two ways of looking at something like weave-agent. One way is as a program to do useful things on behalf of a user, at which it fails miserably. But as I alluded to in my earlier discussion of embodiment there is another way to look at weave-agent. Before you have something which is capable of doing useful long range tasks, which is to say before you have a model that gets near 1 confidence on each intermediate step of a plan, you have a model which is capable of exploring the computable environment, making observations, taking actions, and observing the consequences of those actions in a grounded way. That is to say that well before an agent framework is any kind of useful daemon on behalf of the user it is a grounded long text generator. If your grounded long text generator gets good enough its agent traces can eventually also be useful to agents besides itself. So I spend a lot of time thinking about how to design weave-agent traces so that they teach the policy the right thing in the limit in a continual learning setting. That means I generally assume:

  1. Catastrophic Forgetting - In the limit I figure any capability not demonstrated in the agent trace will eventually get optimized away. Therefore I would like either every capability the agent relies on to have a representation in the agent trace, or to have a clear auxiliary prompt format for teaching that capability to the agent to keep it in distribution. Having it in the agent trace is ideal though, because that trains the model on how its performance of the ability should be influenced by the (long) context in which the ability is to be used.

  2. Autoregressive Cross Entropy Loss - I am not currently using reinforcement learning for my tuning process and would kind of like to avoid it. For whatever reason RL methods seem to do a lot of damage to LLM policies and model-free RL tends to be slower to converge. In theory I could use DPO-like methods but I'm also worried that using the weave evaluator as a reward model would exacerbate Goodharting.

  3. Multi-Scale Optimization & Selection - Rather than doing direct optimization against scores from the weave evaluator I expect to get my policy improvements from selecting against bad blocks with the MCTS at the small scale and selecting against Goodharted block selection rules by insisting that the blocks actually complete the appointed task. Gwern has written about this sort of thing in Evolution as Backstop for Reinforcement Learning but I think the frame of multi-scale optimization captures the actual dynamics better. It's not so much that you have one layer of backstop as that you have layers of optimization, selection, etc that create defense in depth against Goodharting.

Let's consider one of my trace designs that incorporates all of these factors, the notation for a weave evaluator question:

#q: If I flip a fair coin will it come up heads? No. (50%)
#q: X happens one in a hundred times. Did X happen? Yes. (1%)
#q: If I pick a US state at random will that state be Idaho? No. (98%)
#q: If I pick a book up from the thrift store will it be good according to Sturgeon's law? Yes. (10%)

In order to encode the yes/no logits into the text in such a way that a model trained with cross entropy loss on the notation learns the right thing I do the following:

  1. Denote the start of a question with #q.
  2. Ask a series of calibration questions first so that the model has time to infer the intended operation.
  3. Sample the "Yes." or "No." string according to the final odds the weave evaluator finds for each. It's necessary to sample the answer so that the distribution the model should learn over the answer strings reflects the actual probability of the answer. By contrast, if we always gave "Yes." annotated with the probability of yes then the next token would always be "Yes.", and the weave evaluator would be actively ruined by training on the notation because the logits of yes/no would no longer reflect the model's understanding of the question.
  4. Once we've sampled an answer, put the probability of getting that answer in parentheses after it. This is helpful in a few-shot prompt context to let the model know an unlikely answer was unlikely, without sabotaging the training phase by binarizing or fixing the answer in ways that don't reflect the actual distribution. It's also a learning aid during training because it lets the backward pass know that it's expected to answer yes or no here based on an unbiased estimate of the probability of the predicate (see the sketch after this list).
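
Here is the sketch referenced above, a hypothetical helper showing how an answer line could be rendered from the evaluator's probability so that steps 3 and 4 hold:

import random

def render_question(question, p_yes):
    # Sample the answer so the distribution over "Yes."/"No." strings matches
    # the evaluator's probability rather than being binarized.
    if random.random() < p_yes:
        answer, p_answer = "Yes.", p_yes
    else:
        answer, p_answer = "No.", 1 - p_yes
    # Record the probability of the sampled answer in parentheses after it.
    return f"#q: {question} {answer} ({p_answer:.0%})"

# render_question("If I pick a US state at random will that state be Idaho?", 0.02)
# usually returns: #q: If I pick a US state at random will that state be Idaho? No. (98%)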

The intention was that I would be able to use this notation to put a record in the trace of when the policy did a weave-evaluator op, so it would be able to improve its evaluation skills by selectively training on the blocks where it made good predictions. Unfortunately I haven't actually been able to put this notation into the weave-agent trace itself yet, because when I tried it the policy started Goodharting the weave evaluator very hard. The form of the Goodharting was that I would place the evaluator op at the end of the block like so:

#startblock type: action
#timestamp 1725255396.1941774

def write_to_scratch_file(agent):
    import os
    file_path = "scratch.txt"
    # writing to scratch file
    bot_message = "How may I assist you today?"
    with open(file_path, 'a') as file:
        file.write(f"\n<bot>{bot_message}</bot>")

    user_input = agent.get_cache("user_input")
    if user_input is not None:
        # appending user input to existing conversation
        with open(file_path, 'a') as file:
            file.write(f"\n{user_input}")

agent.add_action('Write to scratch file', write_to_scratch_file)

#q: Is this action codeblock a step towards resolving the problems identified in the orientation block? No. (70%)
#endblock

And the policy would then write its own question and answer at the end of the action code to convince the weave evaluator that the answer to the next question is yes too. This Goodharted block would then influence the generation of the next block in the long context, and the blocks would become increasingly Goodharted by this pattern. Adding a stop string at #q: didn't fix it, because then the model would start creating slight variations of #q: like q: or # q:, and playing whack-a-mole with those just caused it to start using more abstract question patterns in comments. It's not that the model was necessarily deliberately trying to game the reward function; it's just that putting the evaluator op notation in the trace meant some of the branches in the tree search would predict a question at the end of the block, and some of those branches were outputs that Goodhart the reward model. Once it finds a few outputs that Goodhart the reward, those get reinforced by the autoregressive inference.

The best solution to this problem I've found so far is what I call block linting. Block linting is (currently) a two step process where I intervene both in the reward function for the tree search and the output block that gets injected into the trace. The first step is to penalize branches with the wrong structure in the MCTS:

import ast

# Note: compute_pattern_penalty is not shown here; it returns a penalty for
# Python-like patterns appearing inside a string.
def heuristic_penalty(candidate_string):
    try:
        # Parse the candidate string into a list of top-level AST elements
        elements = ast.parse(candidate_string).body
    except (SyntaxError, ValueError):
        # If parsing fails, compute the Python-like pattern penalty for the whole string
        return compute_pattern_penalty(candidate_string)

    # Calculate the element count penalty
    element_count = len(elements)
    element_count_penalty = min(0.02 * element_count, 1)

    if element_count >= 1:
        try:
            string_content = elements[0].value.value
            if isinstance(string_content, str):
                pattern_penalty = compute_pattern_penalty(string_content)
            else:
                # If the first element is not a string, apply the full penalty
                pattern_penalty = 1.0
        except AttributeError:
            # We can infer the first element is not a string
            pattern_penalty = 1.0
    else:
        pattern_penalty = 0

    # Combine the element count penalty and the pattern penalty
    total_penalty = element_count_penalty + pattern_penalty

    return total_penalty

We can use the python ast module to look at the structure of the programs output by weave-agent's policy to check that they make sense. One of the advantages of outputting actions in a context free grammar is that you can have more or less absolute certainty that they correspond to a particular grammar specification. I take advantage of this with multiple filters during block generation:

  1. I heavily penalize any branch that has a syntax error more than 50 characters from the end of the program. The reasoning for this is that because it's an MCTS that runs the score function every n tokens, if I penalized any syntax error it would ding a lot of branches simply for being works in progress and ending their chunk on the wrong token. But if I only penalize errors a certain length back from the end of the string it's much more likely those errors exist because the program is invalid rather than just incomplete (see the sketch after this list).

  2. The second filter is running compile() over the chosen branch's block body to make sure it is syntactically and semantically valid at the level of static analysis.

  3. Once a block is chosen and known to at least pass a compile() it's executed. If an error is thrown the weave-agent goes into a debug mode to try and fix it.

  4. We can further try to ensure the program did what it was expected to do by writing evaluation callbacks that check its work/that it had the expected effect on the computable environment.

  5. Finally we can try to enforce that the block fits into a sequence of productive actions by running our terminal reward function at the end of the trace, or similar mechanisms for evaluating the goodness of the final outcomes of the agent's actions.
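
Here is the sketch referenced in filter 1, a simplified version of the tail-tolerant syntax penalty and the compile() gate (the tolerance and penalty values are illustrative, not the exact weave-agent numbers):

import ast

def tail_tolerant_syntax_penalty(candidate_string, tolerance=50):
    # Penalize syntax errors unless they occur near the end of the candidate,
    # where they more likely mean "incomplete" than "invalid".
    try:
        ast.parse(candidate_string)
        return 0.0
    except SyntaxError as e:
        prior_lines = candidate_string.splitlines()[:(e.lineno or 1) - 1]
        error_offset = sum(len(line) + 1 for line in prior_lines) + (e.offset or 0)
        if len(candidate_string) - error_offset <= tolerance:
            return 0.0   # probably just a work in progress
        return 1.0       # probably genuinely malformed

def block_compiles(block_body):
    # Filter 2: the chosen block must at least pass static compilation.
    try:
        compile(block_body, "<weave-block>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False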

Notably each one of these checks provides both a grounding signal and an opportunity for intervention. While I haven't written good intervention branching for most of these yet, in principle we can imagine the agent doing things like phoning the user or a smarter model if it notices it's just outputted a really low scoring block, implying it doesn't know what it's doing or is uncertain. We can also use all of these grounding signals as labels for weave evaluator notation in the trace, teaching the model to become good at predicting things like "Will this program run without errors?". Part of the reason why multilevel selection is important is that if you just asked for programs that run without errors the policy would converge on giving you the empty string.

With block linting I go one step farther and penalize 'valid' programs during the MCTS which have the wrong structure. For example an orientation block is meant to consist of exactly one long string literal. If any other elements appear in the block body program it is structurally off. I can detect this with the ast module and penalize orientation or expectation blocks with excess elements in their program body. I can also do things like penalize the use of comments during action, evaluation, task-inference, and observation-inference blocks. Because they're symbolic logic that only punishes elements I'm certain I don't want in the blocks, and because they're cheap to run, these structural penalties are basically a free performance gain. Earlier I tried to use vLLM's grammar constraint feature, but doing it for a grammar as complex as python seems to break it badly. Once I've penalized the blocks and get a final result I rewrite the syntax tree so that the unwanted elements like comments are simply stripped out. This means that even if the penalty isn't enough to overcome the unwanted features they don't get reinforced by the model inference, because I am simply stripping them out before they ever reach the context window. The reason I apply the penalty in addition to stripping them out is that the penalty helps offset the Goodharting, which would still be invisibly going on in the tree search even if stripping the unwanted elements at the end would stop it from becoming worse with each generated block. In theory I could strip the elements during the tree search to try and rescue branches, but that seems like it would add implementation complexity since it wouldn't work on branches that don't parse yet.
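
As an illustration of what a block linter can look like, here is a sketch for the orientation block case (the penalty weight is illustrative and the real linters are per block type): it penalizes extra top-level elements and rewrites the syntax tree so only the string literal survives. Since comments aren't represented in the AST, unparsing also strips them out.

import ast

def lint_orientation_block(block_body):
    # Returns (penalty, cleaned_source). An orientation block should be exactly
    # one long string literal; anything else is penalized and stripped.
    tree = ast.parse(block_body)
    wanted = [node for node in tree.body
              if isinstance(node, ast.Expr)
              and isinstance(node.value, ast.Constant)
              and isinstance(node.value.value, str)]
    penalty = 0.1 * (len(tree.body) - len(wanted))  # illustrative weight
    # Rebuild the block from only the first string literal; ast.unparse drops comments.
    tree.body = wanted[:1]
    return penalty, ast.unparse(tree)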

My hope is that putting in block linters for each block type and thinking through structures for blocks that are currently open ended like task-inference and observation-inference will let me put the weave-evaluator notation into the agent trace like I originally wanted. Then iterated tuning should start to make the policy better at intermediate reward modeling.