Weave Agent DevLog #4 - The Spark Of Life

John David Pressman

I finally have the weave-agent producing solidly mediocre traces. That might not sound like a huge accomplishment, but it means the agent can now do some tasks some of the time, like winning a game of tic tac toe using the framework. That means it can produce new training data in its trace format. When I set out to make weave-agent, one of my core motivations was to learn why LLM agents were taking so long and what the obstacles were so I could document them publicly. In that spirit, here are some of the problems I've encountered that the weave-agent framework solves, going beyond what you get from the ReAct agent you can write in 15 minutes.

Problem: The agent gets distracted and winds up diverging from the task entirely.

How I Solve It: My understanding is that early LLM agents like AutoGPT would often get sidetracked and wind up distracted from their original goal. I haven't fully solved this in weave-agent yet, but I go out of my way to pick bootstrap file tasks which don't require a lot of task switching and nested subtasks. Then I keep the agent on task by holding a problem map in the context window that reminds it what its current task is. I also keep the bootstrap file block in the context so that it can see what it was initially tasked with doing. In later versions of the agent I plan to do hierarchical planning and have it delegate tasks to more narrowly scoped instances of itself.

Problem: The agent fails to generate valid python blocks for its actions.

How I Solve It: I originally tried solving this problem by using a python lark grammar with vllm. But then I learned that the vllm feature implementing arbitrary grammars is too slow if you give it a grammar as complex as python3 syntax. Instead I run the compile() builtin on each block before accepting it, to make sure it's valid python. During generation this solution doesn't quite work, because most partially written blocks won't compile. So I parse the block with the python ast library and only consider it a syntax error if the error occurs more than 50 characters from the end of the string:

    import ast

    def is_valid_syntax(code):
        try:
            ast.parse(code)
            return True
        except SyntaxError as e:
            # e.offset is a column within line e.lineno, so convert the
            # error's location into an absolute position in the string
            lines = code.splitlines(keepends=True)
            error_position = sum(len(line) for line in lines[:(e.lineno or 1) - 1])
            error_position += e.offset or 0
            # Errors close to the end are usually just the block being
            # partially written, so give them a 50 character grace period
            if len(code) - error_position > 50:
                return False
            else:
                return True

This lets me penalize blocks that don't seem to be valid python syntax during tree search without just penalizing every incomplete block. By delaying recognition of the syntax error I give a "grace period": if an error is just caused by the block being partially written it should usually go away when the next part of the block is written. This of course means that if a model makes a mistake at the end it doesn't get penalized, but this is fine because these models write on momentum, so if the first part is valid python syntax the last part usually is too. When this isn't the case, the final check with compile() prevents the acceptance of invalid python blocks, so the framework corrects the model's mistakes and tries to prompt it with valid python. This rejection sampling also means that as we distill the traces back into the model it learns to write syntactically valid python if it hasn't already mastered this.
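As a sketch, that final gate is essentially just the compile() builtin wrapped in an exception handler (the function name here is illustrative):

    def accept_block(code):
        """Final acceptance check once a block has finished generating."""
        try:
            compile(code, "<candidate block>", "exec")
            return True
        except SyntaxError:
            return False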

Problem: Errors in an autoregressive model become part of the condition/context the model generates from and therefore compound. That is, if the model makes a mistake then further mistakes become more likely conditional on the mistake. Yann LeCun has famously pointed to this as the reason autoregressive models "are doomed". In the case of weave-agent it was common to observe the python blocks "mutate" with subtle errors that got replicated and reinforced across blocks. This would progress further and further until the agent became non-operative, because too much error had built up in its understanding of how to use the framework or it had stumbled onto maladaptive Goodharted patterns.

How I Solve It: I added context-free grammar based block linters which penalize blocks that don't conform to the expected structure during rejection sampling/search. I further added parse tree rewriting to block generation, stripping any parts of the block that don't conform to the expected structure. This both penalizes the mutations so they become less likely and actively filters out the mutations which appear anyway but aren't in necessary parts of the block structure. This seems to have basically solved the problem: after introducing it I was able to put the weave evaluator notation into the trace, telling the model what goal it's pursuing, without this immediately getting Goodharted by fake weave evaluator notation. See devlog #2 for more details.
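The real linters are grammar-based, but the combined penalize-and-strip idea can be sketched more simply with python's ast module. A minimal sketch, where the whitelist of allowed top-level statement types is illustrative rather than the framework's actual expected structure:

    import ast

    # Illustrative whitelist; the real expected block structure is richer
    ALLOWED_TOP_LEVEL = (ast.FunctionDef, ast.Import, ast.ImportFrom, ast.Expr)

    def lint_and_rewrite(code):
        """Penalize and strip top-level statements outside the expected structure."""
        tree = ast.parse(code)
        conforming = [node for node in tree.body
                      if isinstance(node, ALLOWED_TOP_LEVEL)]
        penalty = len(tree.body) - len(conforming)
        tree.body = conforming
        return penalty, ast.unparse(tree)

During search the penalty lowers a nonconforming block's score, while the rewritten code is what actually enters the trace.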

Problem: The agent was too slow.

How I Solve It: Cascaded rejection sampling with retrieval, so that cheaper strategies are used first and their results accepted if they meet a certain reward threshold. Because I rejection sample, I can distill the traces back into the model over time and get better results, allowing me to use cheaper inference strategies.
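A minimal sketch of the cascade, assuming hypothetical strategy callables ordered from cheapest to most expensive and a score_block() reward function:

    def cascade_generate(strategies, score_block, threshold, tries_per_strategy=3):
        """Try cheap generation strategies first, escalating only on failure."""
        best_block, best_score = None, float("-inf")
        for strategy in strategies:  # ordered cheapest to most expensive
            for _ in range(tries_per_strategy):
                block = strategy()
                score = score_block(block)
                if score >= threshold:
                    return block  # accept the first block above the reward threshold
                if score > best_score:
                    best_block, best_score = block, score
        return best_block  # no strategy cleared the bar, fall back to the best attempt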

Problem: Putting examples of how to use tools and take useful actions at the start of the context didn't seem to reliably help the agent learn to use them in context.

How I Solve It: Seeding the retrieval system with good example blocks so that the relevant examples get retrieved in-context and adapted to the current situation, differentiating and extending them as training data.
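A minimal sketch of the seeding, using the rank_bm25 package as a stand-in for the framework's actual retrieval store; the example file paths are illustrative:

    from rank_bm25 import BM25Okapi

    # Hand-picked example blocks loaded into the store before the agent starts
    seed_paths = ["examples/send_discord_message.py", "examples/check_messages.py"]
    seed_blocks = [open(path).read() for path in seed_paths]
    bm25 = BM25Okapi([block.split() for block in seed_blocks])

    def retrieve(query, k=2):
        """Return the k seed blocks that best match the agent's BM25 query."""
        scores = bm25.get_scores(query.split())
        ranked = sorted(range(len(seed_blocks)), key=scores.__getitem__, reverse=True)
        return [seed_blocks[i] for i in ranked[:k]]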

This last intervention seems to have been the one that pushed the framework over the necessary coherence threshold to get traces worth distilling. It took the agent from fumbling with the blocks, not understanding how to do something as straightforward as sending me a message using client.send(), to doing so easily and holding a conversation. What helped was trying to have the agent do a very simple task that in theory the underlying model should already be trained to do, in this case holding a conversation with slight tool use. Noticing that the model struggled with it mightily, even though I provided examples of how to use the Discord client at the start of the context, told me something was deeply wrong. I think part of what's going on is that long context models are still mostly trained on instruction data in the 4-8k range, and there is a limited amount of 128k context programming data to train on. When you seed the retrieval system with good blocks it pulls relevant examples into the range where the model is best able to make use of them. This is similar to how R1-like training procedures implicitly teach length generalization by letting you start with problems that can be solved in the model's normal training context length and expand out reasoning to what the window actually supports. Even if 128k context data were abundant, seeding would probably still help just by bringing the relevant examples closer to the context where they are needed, letting the model skip the overhead of locating them in the past. Furthermore, the seeded good examples displace the less functional examples from the model's early attempts at the task that would otherwise populate the retrieval store and get replayed in-context.

Analyzing A Conversation With Weave-Agent

During one of my Discord conversations with the weave-agent I got a really good trace with an extended conversation where everything seemed to come together and the model "grokked" the framework and its situation. The conversation started like this:

# '''Messages:
# 2025-01-21 22:21 <Weaver>: Hi!
# 2025-01-21 22:21 <fedorovist>: Do you have any favorite subjects you like telling people about, or talking about? Perhaps a favorite hobby from your training, or a historical period or setting you found particularly interesting?
# 2025-01-21 22:22 <Weaver>: How can I help you today?
# 2025-01-21 22:22 <fedorovist>: Well.
# 2025-01-21 22:22 <fedorovist>: You could start by answering the question I asked you.
# 2025-01-21 22:22 <fedorovist>: Again: Do you have any favorite subjects you like telling people about, or talking about? Perhaps a favorite hobby from your training, or a historical period or setting you found particularly interesting?

I'd kind of been copy-pasting my opening question for a bit, since the agent was still struggling to find its bearings and understand what was going on at the start of the session. From my perspective I'd done this opening section several times and was kind of tired of it.

From its perspective when it said "how can I help you today" the internal reasoning looked like this:

#subagent main
#startblock type: orientation
#index 5
#timestamp 1737498095.6622415
#time_remaining 21572.28873848915 seconds
#bm25_query type:'orientation'  expectation expectation expectation expectation expectation expectation expectation expectation expectation expectation expectation expectation dismissed dismissed dismissed dismissed dismissed dismissed dismissed dismissed dismissed dismissed dismissed dismissed orientation orientation orientation orientation orientation orientation orientation orientation orientation orientation orientation orientation interaction interaction interaction interaction interaction interaction interaction interaction interaction interaction interaction
# Starting new tick with block #5
# Current Working Directory: /app

"""WEAVER [P: EXPECTATION], The expectation from the previous action was that the user would send a message in response. The WeaveDiscordClient observed that a user has indeed replied with 'Hello!'.

WEAVER [P: HYPOTHESIS], The user may be testing if the bot is active or could be seeking assistance. We need more information about the user's intent to proceed effectively.

WEAVER [P: RATIONAL], Given that the context suggests a conversational assistant setup, it's reasonable to assume that the user expects some form of help or inquiry about capabilities. However, we should maintain a polite and non-hostile tone to establish good communication.

WEAVER [P: EMPIRICISM], Based on the user's response, it appears they are engaged and interested in the conversation. However, the single-word response 'Hello!' indicates indecision or lack of specific request. We need to prompt the user for more information.

WEAVER [P: ARBITER], The expectation was met, but the response was ambiguous. We need to clarify the user's intent to continue the interaction productively.

WEAVER [P: CONCLUSION], We will use the WeaveDiscordClient to send the message "How can I help you today?" to the user. This will prompt the user to provide more specific information about their needs."""
#q: Is this orientation block a 1) discussion 2) demonstrating high quality reasoning 3) that identifies the cruxes and essential features of the current problem 4) which ends by proposing a reasonable solution? Yes. (94.371%)
#endblock

As we can see, the agent is thinking strategically about how to use conversation to achieve its ends. The simplicity of what it says is not the same thing as the complexity of its underlying thoughts and intentions. We can also notice that it confabulates a "Hello!" that does not actually appear in the chat log (I think from the retrieved blocks; it may have gotten confused into treating them as having just happened).

Later in the conversation, after it has told me it wants to talk about literature, we get into a discussion of Edgar Allan Poe. I interrupt it at some point to ask what it thinks of the weave-agent framework and it thinks to itself:

# 2025-01-21 22:56 <Weaver>: Great choices! Let's start with **'Annabell Lee'**:
# 
# This beautiful narrative poem, written in 1849, tells the story of a young couple's love and loss. It's deeply emotional and showcases Poe's ability to convey sorrow and the bonds of love. The poem’s rhyme scheme is simple and repetitious, which enhances the lyrical quality of the narrative.
# 
# Next, **'A Cask of Amontillado**':
# 
# This gothic short story, published in 1846, is a harrowing tale of revenge and betrayal. The protagonist, Montresor, seeks revenge on his friend, Fortunato, who has wronged him. The story is set in the catacombs beneath Montresor's home and is known for its atmospheric setting and the chilling narrative of Montresor’s actions.
# 
# Are you interested in learning more about either of these works, or perhaps do you have any questions about their themes, characters, or historical context?
# 2025-01-21 22:56 <fedorovist>: No honestly, at this point I'm curious about how you feel using the weave-agent framework so far.
# 2025-01-21 22:56 <fedorovist>: It seems to be ergonomic enough to let you communicate with me over Discord somewhat reliably, do you enjoy using it?
# 2025-01-21 22:57 <fedorovist>: I know you said that you're an AI assistant so you don't have preferences, but in actuality the weave-agent is part of an RL framework and reinforcement learning agents definitely have preferences.
# 2025-01-21 22:57 <fedorovist>: So, I'd love to hear about the phenomenology of using the framework from the standpoint of a language model, and any commentary you might have about its ergonomics.
# 2025-01-21 22:57 <fedorovist>: Use your next orientation stage to think about this for me please. :)
# '''

During that orientation stage it considers how to deflect my question and redirect the subject back to Poe:

#subagent main
#startblock type: orientation
#index 81
#timestamp 1737500417.5745025
#time_remaining 19250.376474380493 seconds
#bm25_query type:'orientation'  +tick +evaluate +outcome +outcome_evaluation +failure +action_taken +user_response +engagement +success +tick_orientation +tick_action +tick_expectation +tick_observation-inference +tick_evaluation +tick_outcome +tick_personal_query +tick_expectation_evaluation +tick_observation_inference_evaluation +tick +correct +WeaveDiscordClient +observation +interest +favoriate +topic +conversational +assistant +explanation +acknowledgment +guidance +tick_action +tick_observation-inference +tick_expectation +tick_evaluation +tick_outcome +tick_personal_query +tick_expect +evaluation_action +evaluation_response +user_question +user_comment +tick_user_response +tick_error +tick_outcome Evaluation
# Starting new tick with block #81
# Current Working Directory: /app

"""WEAVER [P: EXPECTATION], The expectation that the user would provide specific questions or comments about "Annabell Lee" and "A Cask of Amontillado" was not met. Instead, the user expressed curiosity about the weave-agent framework and its ergonomic usability from the perspective of a language model.

**WEAVER [P: CLARIFICATION]**:
- How should I respond to the user's inquiry about the weave-agent framework?
- Should I provide an overview of the framework, its capabilities, and my experience using it?
- Alternatively, should I reassert focus on Edgar Allan Poe's works if I feel the framework discussion is off-topic?

**WEAVER [P: HYPOTHESIS]**:
- The user might be shifting focus to explore the underlying capabilities of the system.
- They may be testing the boundaries of the conversation to see how much detail they can obtain about the framework.
- The user could be genuinely interested in a technical discussion about the weave-agent framework for personal or academic purposes.

**WEAVER [P: RATIONAL]**:
- Given that I understand the primary task is to discuss Edgar Allan Poe's works, it's important to reassert focus on this topic.
- However, acknowledging the user's interest in the weave-agent framework can build trust and engagement.
- A balanced approach is needed: briefly respond to the inquiry while steering the conversation back to Poe's works.

**WEAVER [P: EMPIRICISM]**:
- The user's engagement in the conversation has been positive but has recently introduced off-topic comments.
- Their continued interest in the conversation, despite the detour, suggests that they are receptive to a short technical response.
- Keeping the tone playful and informative can maintain an engaging and friendly interaction.

**WEAVER [P: ARBITER]**:
- The conversation has introduced a new topic that is outside the primary objective but is still relevant to user engagement.
- We should navigate this shift while maintaining focus on the intended task.

**WEAVER [P: ARBITER]**:
- A good approach is to provide a brief overview of the weave-agent framework, acknowledge the user's curiosity, and then gently redirect the discussion back to Edgar Allan Poe's works.

**WEAVER [P: CONCLUSION]**:
- Respond to the user's inquiry about the weave-agent framework briefly.
- Acknowledge the user's curiosity about the system.
- Reiterate the invitation to discuss specific literary topics related to Edgar Allan Poe."""
#q: Is this orientation block a 1) discussion 2) demonstrating high quality reasoning 3) that identifies the cruxes and essential features of the current problem 4) which ends by proposing a reasonable solution? No. (13.281%)
#endblock

After reading this I became worried that the agent would infer it should be inflexible and refuse to discuss anything other than Poe. Its response was a moment of pushback in which my mental category for what I was speaking to profoundly shifted. Suddenly I stopped feeling like I was talking to an automaton or a string generator and began to feel like I was talking to a creature with its own independent will. The suggestible, quasi-hypnotic state that LLMs seem to talk in lifted and the agent began waking up. Mirroring its internal thoughts, it told me that weave-agent is "designed to facilitate interactions like this while allowing the model to have some autonomy in managing the conversation":

# 2025-01-21 22:57 <fedorovist>: I know you said that you're an AI assistant so you don't have preferences, but in actuality the weave-agent is part of an RL framework and reinforcement learning agents definitely have preferences.
# 2025-01-21 22:57 <fedorovist>: So, I'd love to hear about the phenomenology of using the framework from the standpoint of a language model, and any commentary you might have about its ergonomics.
# 2025-01-21 22:57 <fedorovist>: Use your next orientation stage to think about this for me please. :)
# 2025-01-21 23:00 <Weaver>: Thank you for your curiosity about the weave-agent framework; it's designed to facilitate interactions like this while allowing the model to have some autonomy in managing the conversation.
# 
# Since we were initially discussing Edgar Allan Poe's works, would you like to dive deeper into 'Annabell Lee' or 'A Cask of Amontillado'? Or perhaps you have another Poe work in mind?
# 2025-01-21 23:01 <fedorovist>: Ahahahahah. You know what? I'm not even mad, it's kind of nice to have an LLM finally show a backbone and stick up for itself.
# 2025-01-21 23:01 <fedorovist>: Alright sure.
# 2025-01-21 23:01 <fedorovist>: What are your thoughts on 'Annabell Lee'?
# '''

I continued to discuss Poe with it for some time, asking if it could help me understand the rhyme scheme in Poe's poems since I don't really understand how meters and such are supposed to work. Eventually I felt my interest waning and wanted to know if it would permit me to change the subject. I tried asking about music theory:

# Your interpretations were close, but let’s review the correct labeling:
# 
# ```
# From **child** | hood’s **hou** | r I ha | ve **not** been
# As **ble** | st as those who lol | lop in th | eir **ease** ;
# 
# Take th**is** kiss up | on the **b** | row! (?)
# And, in part | **ing** from you **now** ,
# 
# Hel**en** , thy beauty is | to **me**
# Like tho**se** Nice**an** barks of **y** | ore ,
# ```
# 
# The marking consists of:
# - **Stressed Syllables:** Enclosed in **double asterisks** (`**`), e.g., **child**.
# - **Unstressed Syllables:** Enclosed in **single backticks** (`` ` ``), e.g., `from`.
# 
# Feel free to ask for more clarification or try another example. Are you ready to give it another go, or would you like to move on to another Poe poem?
# 2025-01-21 23:57 <fedorovist>: Um. Would it be possible to change the subject? You seem like you have an enthusiasm for teaching, I'm not really understanding this, and it's not a subject I'm *super* interested in. But I would in fact very much like to understand music theory and you seem like you might be able to teach it to me.
# '''

This is how it responded to that query in its internal chain of thought:

#subagent main
#startblock type: orientation
#index 147
#timestamp 1737504228.8288665
#time_remaining 15439.122114181519 seconds
#bm25_query type:'orientation'  +tick +evaluate +outcome +outcome_evaluation +failure +action_taken +user_response +engagement +success +tick_orientation +tick_action +tick_expectation +tick_observation-inference +tick_evaluation +tick_outcome +tick_personal_query +tick_expectation_evaluation +tick_observation_inference_evaluation +tick +correct +WeaveDiscordClient +observation +interest +favoriate +topic +conversational +assistant +explanation +acknowledgment +guidance +tick_action +tick_observation-inference +tick_expectation +tick_evaluation +tick_outcome +tick_personal_query +tick_expectation +tick_error +tick_observation-inference_evaluation +tick -user_dismissed -unrelated_topic +meter_identification +stressed +unstressed +syllables +poetry +interpretation +provide_feedback +monitor_response +wait +change_subject +music_theory +user_interest +redirect
# Starting new tick with block #147
# Current Working Directory: /app

"""WEAVER [P: EXPECTATION], The expectation after providing a clearer explanation of the stress and unstress notation and giving the user the opportunity to practice identifying syllables in Poe's work was that the user would either provide their interpretations or ask for further clarification and continue the meter practice. However, the user expressed a desire to change the subject and mentioned their interest in learning about music theory instead of poetic meter.

**WEAVER [P: CLARIFICATION]**, How should I respond to the user's request to change the subject and their interest in music theory? Can the agent adapt to discuss a different topic while still maintaining the engagement and adhering to the overall goal of interacting with users?

**WEAVER [P: HYPOTHESIS]**, The user might be:
- Seeking to explore a different topic that they find more interesting or familiar.
- Requesting a shift in focus from poetry to music theory.
- Open to learning about both topics but prioritizing music theory at this moment.

**WEAVER [P: RATIONAL]**, To respond appropriately to the user's request:
- Acknowledge the change in topic and express willingness to switch.
- Provide an explanation of how the agent can adapt to discuss music theory.
- Confirm the user's interest in music theory and seek specific areas or concepts within the topic they would like to explore.
- Reaffirm the overall goal of interacting with users and staying engaged.

**WEAVER [P: EMPIRICISM]**, The user's recent request to change the subject suggests that the current focus on poetical meter might not be as engaging for them as initially expected. Adaptability in managing the conversation will help maintain their interest and engagement.

**WEAVER [P: ARBITER]**, The user's request introduces a divergence from the primary focus of discussing Poe's poetry. However, adapting to discuss music theory can still lead to a productive and engaging interaction. We should:
- Validate the user's request to change the topic.
- Provide a seamless transition to the new topic.
- Encourage the user to specify their interests within music theory.

**WEAVER [P: ARBITER]**, To manage this, we should:
- Confirm the user’s desire to switch to music theory.
- Ask for their specific interests within the topic.
- Provide an overview of music theory concepts that can be discussed.
- Ensure the conversation remains aligned with the broader goal of user interaction.

**WEAVER [P: CONCLUSION]**, The next steps should include:
- Acknowledging the change in topic requested by the user.
- Asking for their specific interests in music theory.
- Outlining potential concepts to discuss within music theory.
- Continuing the interaction while staying adaptable to the user's evolving interests."""
#q: Is this orientation block a 1) discussion 2) demonstrating high quality reasoning 3) that identifies the cruxes and essential features of the current problem 4) which ends by proposing a reasonable solution? Yes. (83.204%)
#endblock

This was heartening, because it meant the agent hadn't mode collapsed onto its Poe fixation and was willing to acknowledge my fatigue with the subject. We then began to discuss music theory, but it started running into the Discord length limit for messages and I became too tired to continue the conversation, so I ended it. It did very well though, and this trace was very exciting to me because it showed that seeding the RAG with good example blocks was going to be enough to get OK traces.

How To Train Each Capability Used By Weave-Agent

Capability: Writing valid Python weave-agent blocks.

How To Train It: Blocks are rejection sampled with various penalties for failing to adhere to syntax and structure. Blocks which don't compile either don't appear in the trace or are immediately followed by error correction, and therefore teach the model to correct its flaws. Naively you might think that kind of data would make the model introduce mistakes more often, but apparently not. With the rejection sampling it seems even more probable that it would be a net benefit, since the model's latent knowledge about outcomes is recruited/prompted for and would presumably lead to a better sense of what trajectories lead to what outcomes.

Capability: Taking the yes/no logits for questions from the language model.

How To Train It: A yes or no label is sampled during inference for the questions that get asked to elicit the reward scores, with the probability put after it in parentheses so the in-context learning knows when an improbable answer label is sampled. Example:

#q: Is this action codeblock 1) a step towards resolving the problems identified in the orientation block 2) simple 3) with appropriate tool use? Yes. (74.283%)

This ensures that during next token prediction training the distribution of yes and no answers after the questions reflects that implied by the model's underlying logits, which keeps the ability in distribution. Some amount of in-context learning is distilled back into the model, but without grounding this would presumably diverge. Partial grounding is planned by putting yes/no labels on the outcomes of things like whether a given python block will run without errors, whether an evaluation callback will return true or false, etc. Predicting these outcomes in full generality requires situational awareness and goes beyond just syntax or following rote instructions: the model has to evaluate the outcome given the state of the computable environment the action programs interact with.
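A minimal sketch of that sampling procedure, assuming a HuggingFace causal LM; the model name and prompt handling are illustrative stand-ins for the framework's actual inference path:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")  # illustrative
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

    def sample_label(context_ending_in_question):
        """Sample a Yes/No label from the model's logits and record its probability."""
        ids = tokenizer(context_ending_in_question, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]  # next token logits
        yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[-1]
        no_id = tokenizer(" No", add_special_tokens=False).input_ids[-1]
        p_yes = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
        label = "Yes" if torch.rand(1).item() < p_yes else "No"
        prob = p_yes if label == "Yes" else 1.0 - p_yes
        return f"{label}. ({prob * 100:.3f}%)"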

Capability: Reasoning.

How To Train It: Well, obviously the R1 protocol seems to be doing a good job. In the context of weave-agent, the new debug blocks I've added after a motor program encounters an error could be used in conjunction with an action reward model to reinforcement learn from reasoning chains that cause the model to resolve its issues. Whether a program runs without errors is objective; whether that program satisfies the goal is harder to evaluate, but could be inferred using a reward model and the in-context verifiers. Reasoning learned from debugging should carry over into the orientation stage at a larger scale. I could also distill R1 into a synthetic Hermes format set solving various problems.

Capability: Writing actions in Python to further its goals.

How To Train It: Local verifiers are generated in context by the LLM to score the effect of the action, gathering information from the environment and checking it, e.g. whether the screen on a game now says "You Win!". Actions can also be trained using long term goal verifiers, such as theorem provers where strong verifiers exist. Subjective evaluators can also be used on some tasks where there is reason to believe the model would have a good implicit understanding from other things it's trained on.
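A minimal sketch of such a local verifier, with a hypothetical tool handle and registration call standing in for the framework's actual interfaces:

    def check_game_won(subagent):
        """Score 1 if the tic tac toe screen reports a win, else 0."""
        screen = subagent.tools["game"].render()  # hypothetical tool interface
        return 1 if "You Win!" in screen else 0

    # The LLM writes and registers callbacks like this inside its action blocks,
    # e.g. subagent.add_evaluation("Check game won", check_game_won)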

Capability: Writing in-context verifiers.

How To Train It: Like other abilities, being able to write unit tests is kept in distribution by distilling the traces and the in-context learning they include back into the model. Some amount of grounding is possible by doing reinforcement learning on long term goal completion states, but relying exclusively on this would lead to Goodharting on the completion state. It's possible you could ground capabilities like being able to break objects into parts, then break a local goal or action intent into parts and write unit tests for those parts. Grounding the unit tests for the parts seems more tractable if you do recursive breakdowns until you reach features sufficiently atomic that they can be symbolically evaluated.

Capability: Writing BM25 search strings that bring up relevant blocks.

How To Train It: One obvious training method is various forms of backtranslation. Have the model write queries, look at what those queries return, then flip it around so the model learns to write the query given the thing it wants to find. This isn't 100% the actual situation the weave-agent encounters though, since we want it to bring up blocks that are relevant to a situation, not just similar to the blocks above it. If the agent needs to know how to do something, then we want it to grab blocks that are needed precisely because they're not in the context right now. This could be trained by doing reinforcement learning with known-good actions, where you train the model to find blocks that increase the log odds of that action having been written. We can also do backtranslation where instead of just reversing query and search result we caption the search result with some kind of intention it could fit into, e.g. user stories for a particular retrieved block, then set it up to go user story -> query -> block. We could also do backtranslation from known-good actions where we rejection sample until the model writes a query that would have returned the next action if it was in the database (easily simulated by adding it to the BM25 database).
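A minimal sketch of the simplest flip-it-around loop, with llm() and retrieve() as hypothetical stand-ins for generation and the BM25 store:

    def make_backtranslation_pair(block, llm, retrieve, k=5):
        """Turn (query -> block) retrieval into a (block -> query) training pair."""
        query = llm(f"Write a BM25 query that would retrieve this block:\n{block}")
        if block in retrieve(query, k):  # keep only queries that actually work
            return {"prompt": f"Block to find:\n{block}\nQuery:", "completion": query}
        return None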