Should Weave-Agent Plan More Like ReActTree Or MuZero?

John David Pressman

I've been pondering for a while whether the weave-agent planner should be more like ReActTree or MuZero, and I think MuZero is starting to win the argument in my head. The basic reason is that lazy hierarchical planning, where you break a task into parts and assign subagents to the parts, requires a ton of upfront work from the agent and a lot of serial ops. My original intuition was that programming is already a form of hierarchical planning, so if I frame the parts of a task as subagents with inputs and outputs then I can reuse the existing python coding ability to plan. The problem is that weave-agent also tries to make sure subagents don't mark their tasks as complete early by adding unit tests to the completion states. Even if we removed those, the syntax for creating a subagent currently looks like this:

    import requests

    def check_dismissed(subagent):
        """Check if the bot has been dismissed by a user."""
        response = requests.post("http://localhost:8080", json={"action": "get_messages"})
        messages = response.json()
        for msg in messages:
            if msg["content"] == "You are dismissed." and msg["author"] != "weave-agent":
                return True
        return False

    schema = {"dismissed": "boolean"}
    main_agent = agent.subagent("main", None, "Interact with users until dismissed", schema, args.budget)
    main_agent.task.add_evaluation("Check if dismissed", check_dismissed)

Which is a huge amount of boilerplate when the behavior we actually want to facilitate looks more like this:

    def do_conversation(subagent):
        """Have a conversation with the user."""
        # Introduce ourselves and profile the user
        profile = introduce_and_profile_user()
        # Set a topic of conversation
        topic = set_conversation_topic(profile)
        # Discuss the subject
        conversation = discuss_with_user(profile, topic)
        # End the conversation once the user dismisses us
        if dismissal in conversation:
            return True

Which, while intuitive, runs into all sorts of problems if we think about it for more than 30 seconds, including:

  1. What if the user wants to discuss more than one thing or switch topics? Should the agent railroad them so it can focus on the plan?

  2. How does the agent know when to end the conversation, or when the conversation has ended? We could say to use its general intelligence, but how do we ground that for training purposes? If we say the rewards come from in-context verifiers, then those don't appear in this function, so clearly a separate process would need to go in and translate these conjectured function calls into all the boilerplate above, with the schema and such.

  3. This translation/generation would require a lot of upfront work and add a ton of latency. We could offset this by only doing the work necessary to get started on the task and then generating the rest in the background while we do other things, updating and reshaping it (how?) as circumstances change so that by the time we're ready to task switch it's almost effortless.

  4. The structure of the plan/program itself feels off to me. A conversation usually isn't made up of one topic; if the user immediately gets into a topic with me I might want to delay learning autobiographical information about them, or learn it incrementally through context cues in the conversation, etc.

  5. This kind of program doesn't seem to really encode alternatives and possibility space well. That can be partially mitigated by expecting subagents to return a failure when expected conditions don't manifest, essentially returning an error when the vibe coder in the background can't make the plan fit the conditions and the agent can't make the conditions fit the plan. Some of it could also be mitigated by writing programs with if conditionals and loops, so that e.g. a set of expected opening topics could be written down and the conversation formulated as some kind of stack based loop where conversations start and end, returning zero or more continuation topics, and the whole thing ends when there are no more continuations on the stack (see the sketch after this list).
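
To make point 5 concrete, here's a minimal sketch of that stack based formulation. `discuss_topic` is a hypothetical helper passed in as a parameter, standing in for whatever the background vibe coder would compile a topic discussion into; it's assumed to return an outcome plus zero or more follow-up topics raised along the way:

    def do_conversation(subagent, discuss_topic):
        """Hold a conversation as a stack of topics, ending when the stack empties."""
        topics = ["introductions"]  # expected opening topics
        while topics:
            topic = topics.pop()
            # Hypothetical helper: runs one topic to completion and returns
            # (outcome, continuations) where continuations are follow-up topics.
            outcome, continuations = discuss_topic(subagent, topic)
            if outcome == "dismissed":
                return True
            topics.extend(continuations)
        return True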

One major problem with this is that we are no longer really proposing to use standard python to let the agent control itself. Rather we are now considering making a domain specific planning language, a subset of python whose semantics are inferred and compiled into actual executable code by a language model. I could make this; it wouldn't even be particularly hard with models like R1 available to perform the tedious grammar parsing and syntax extraction tasks. Though I'm not sure how reliable it would be, and we're already involving a language model to flesh out the plan and reshape it to circumstances to such a degree that the python code plan itself feels increasingly vestigial.
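
For what it's worth, the parsing half of that DSL is cheap. A minimal sketch, assuming the plan function is written in the Python subset above: pull the step calls out with the standard library's ast module, then hand them (plus the surrounding source) to a model to expand into the schema and evaluation boilerplate. The expansion step is the part I'd delegate to something like R1 and is left as a hypothetical comment here:

    import ast
    import inspect
    import textwrap

    def extract_plan_steps(plan_fn):
        """Pull the step calls out of a plan function written in the Python-subset DSL."""
        source = textwrap.dedent(inspect.getsource(plan_fn))
        tree = ast.parse(source)
        steps = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                steps.append(node.func.id)  # e.g. "introduce_and_profile_user"
        return steps

    # Hypothetical second stage: prompt a model with each step name plus the plan
    # source and ask it to write the check functions, schema, and
    # agent.subagent(...) boilerplate shown at the top of this post.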

So what would the MuZero approach look like by contrast?

Before answering that I think it would be useful to point out something important about planning. When we perform tasks in the real world there are both latency and serial bottlenecks from the environment to consider. Even if we have a bunch of GPUs, we can only make generating and executing any particular action in the moment so fast, because there are irreducible computations we can't shard out which must be waited on. However we can use our world model to simulate as many hypothetical parallel futures as we want; since they're not entangled with environmental state or with each other, they can be run in parallel as far out as we can model. This means that if we want to continue scaling our inference compute for agents we will always end up picking a design that lets us use more planning to make actions in the moment faster to execute and more effective. We will also always want some kind of background process that subconsciously plans while the agent does stuff so we can maximize our number of useful parallel ops. In other words we always want to be planning, and we always want to be planning in ways that make every action in the canonical branch of our MCTS better than it would be without the planning, since we can't actually use our parallel ops to make individual actions better in real time except by e.g. scaling the model.
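
A crude sketch of what I mean by overlapping the background planning with the blocking environment step. `execute_action` and `simulate_rollout` are hypothetical callables; the point is just that the environment's irreducible latency and the simulated futures can run at the same time:

    from concurrent.futures import ThreadPoolExecutor

    def act_while_planning(execute_action, simulate_rollout, n_rollouts=8):
        """Run the blocking environment action while hypothetical futures are
        simulated in the background. simulate_rollout() touches no environment
        state, so any number of them can run concurrently."""
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(simulate_rollout) for _ in range(n_rollouts)]
            result = execute_action()  # irreducible environment latency
            rollouts = [f.result() for f in futures]
        return result, rollouts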

If you've never read the MuZero paper it's actually pretty simple. The basic idea is that we have an RL agent with the usual RL agent components like state transitions, a utility function, etc. We give the model a history of observations and actions, then ask it to predict three things: the policy (next action), the utility/value of the action (which is trained on the whole trajectory so presumably encodes long term reward), and the actually observed immediate reward feedback from the environment. It turns out this 'shallow trace' of agent-environment interaction behavior is sufficient to teach the agent to model the game and win. MuZero as described in the DeepMind paper uses a Monte Carlo Tree Search to hypothesize over future reward trajectories and find the series of actions that is most worth taking.
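
In code the training target structure looks roughly like this. It's a paraphrase rather than the paper's exact formulation, loosely following its naming: `h` is the representation function, `g` the dynamics function, `f` the prediction function, all passed in as callables:

    def muzero_unroll(h, g, f, observations, actions, K=5):
        """Unroll the learned model K steps from a real observation history and
        collect the three quantities MuZero trains on at each step: the policy,
        the value, and the predicted immediate reward."""
        state = h(observations)  # encode the observed history into a latent state
        predictions = []
        for k in range(K):
            policy, value = f(state)              # predict next action + long-run value
            state, reward = g(state, actions[k])  # imagined transition + immediate reward
            predictions.append({"policy": policy, "value": value, "reward": reward})
        # Each entry is trained against targets taken from the real trajectory.
        return predictions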

(As an aside, the AI at the end of history in Janus's prophecies page being named 'Mu' makes a lot more sense when I realize that MuZero and GPT were both revelations about the surprising amount of information encoded by 'shallow traces' of environmental behavior when you infer their generating functions with gradient based program search.)

Much has been made of the failures to get MCTS working for LLMs, but I think these failures are mostly caused by cargo cult attempts to apply it. What we want is not a tree search over tokens or even chunks of tokens but search over sequences of actions with rewards. An MCTS is most useful when our utility function doesn't know the value of a move or has trouble evaluating what to do in a particular board state, so we perform search by thinking ahead until we encounter some board states we do know how to evaluate. This doesn't work very well for raw text because texts don't usually have the property that you're trying to reach a particular string or that the value of current tokens would be hugely informed by what tokens occur later.

However if we have already set up an agent trace format such that we can usefully ontologize over observations, actions, rewards, etc., then we can formulate our tree search in those terms and do rollouts to figure out if we expect to observe high reward from taking a particular sequence of actions with our LLM agent. For the sake of computational efficiency this will be easier if we simulate a level of abstraction up from the normal trace with all its details. MuZero gives us a template for what information is necessary. I think we would want to tree search over ticks of the weave-agent loop, and the format for a single tick might look something like:

    <tick>
    <observation>
    A tic tac toe board state where our opponent has gone first and taken the top right corner.
    </observation>
    <orientation>
      <summary>
      Since the best move in tic tac toe is to take the center if it's available, we must take the center.
      </summary>
      <score>2</score>
    </orientation>
    <action>
      <summary>
      A python function named "take_center" which sends an HTTP post request to the tic tac toe server that makes a move in the game which takes the center square.
      </summary>
      <score>3</score>
    </action>
    <outcome>
      <eval>
        <question>Was Center Square Taken</question>
        <answer>Yes.</answer>
        <prior>89.2%</prior>
      </eval>
      <score>2.8</score>
    </outcome>
    </tick>
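
For concreteness, the in-memory shape of one of these tick summaries might be something like the following. The field names are my own guesses at a reasonable representation for search, not an existing weave-agent API:

    from dataclasses import dataclass

    @dataclass
    class TickSummary:
        """One abstracted tick of the weave-agent loop, mirroring the block above."""
        observation: str
        orientation: str
        orientation_score: float
        action: str
        action_score: float
        evals: list           # (question, answer, prior) tuples from the outcome block
        outcome_score: float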

Where the "MuZero approach" differs from the hierarchical planner described earlier is that instead of lazily evaluating our plan top down we would plan bottom up. The summaries we use to train the planning mode would come from actual summaries of orientation, action, and evaluation blocks. We format these summaries into sequences of tick blocks with reward scores and plan by doing tree search over forward inferences of the sequence. The plan can then be used to guide the policy (LLM) by retrieving over the tick blocks as part of our prompt to the model. What's not clear to me with this approach is how you would segment the plan into parts and bind test functions to the parts to get an overall problem map and grounded progress representation.
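
As a toy illustration of the bottom up search, building on the TickSummary sketch above: score candidate tick sequences by discounted outcome score and expand the most promising ones. This is a toy exhaustive expansion standing in for a real MCTS (no visit counts or UCB-style selection), and `propose_next_ticks` is a hypothetical call that asks the LLM for candidate next-tick summaries with predicted scores:

    def rollout_value(ticks, discount=0.95):
        """Score a hypothetical tick sequence by discounted outcome score."""
        return sum((discount ** i) * tick.outcome_score for i, tick in enumerate(ticks))

    def plan(current_ticks, propose_next_ticks, depth=4, width=3):
        """Toy exhaustive expansion over tick sequences, standing in for MCTS."""
        frontier = [list(current_ticks)]
        for _ in range(depth):
            frontier = [seq + [tick]
                        for seq in frontier
                        for tick in propose_next_ticks(seq)[:width]]
        return max(frontier, key=rollout_value)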

One hint I've come up with so far is that it's possible to use our parallel ops to improve resolution retrospectively as well as prospectively. So we could for example add questions about outcomes we care about halfway through the plan and then retrospectively ask those questions about the plan so far to adjust its reward score. If we discover something that causes the plan to no longer seem like a good idea we can stop and reevaluate other trajectories downstream of our current canonical branch to figure out what to do instead.
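
A sketch of the retrospective adjustment, again with hypothetical pieces: `ask` stands in for an in-context verifier that returns the probability a question about a tick comes out the way we want, and the score adjustment rule is just one plausible choice:

    def rescore_plan(ticks, new_questions, ask, penalty=1.0):
        """Retroactively apply newly added outcome questions to an existing plan."""
        for tick in ticks:
            for question in new_questions:
                prior = ask(tick, question)  # hypothetical in-context verifier
                # Dock the tick's score in proportion to how unlikely the
                # desired outcome now looks.
                tick.outcome_score -= penalty * (1.0 - prior)
        return rollout_value(ticks)  # rollout_value from the sketch above

If the rescored value drops below that of another cached branch, that's the signal to stop and reevaluate downstream trajectories.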

Personally, when I plan it feels to me like I start out by identifying key events/pivot points (how?) and then individually exploring the game tree for those until I feel like I have a good enough sense of what I will do at each pivot point in the most important foreseeable cases. This implies it would be ideal if our MCTS was chronologically nonlinear and used operations like backchaining to take locally coherent sequences of moves in foreseeable situations and append them to the end of the canonical branch to adapt and play them in-context where appropriate. Often I'll have a particular event I would like to occur because I have a good sense of what I'll do in that scenario and can play the prepared game tree in my head for it. This process of identifying pivot points and exploring locally coherent game trees feels closely related to the process of segmenting a plan into parts and binding outcome evaluations to those parts.