JDP Thinks About How To Make R1 Easier To Prompt
JDP — 1/21/25, 8:58 AM
What? Isn't that what I said?
(Sorry, I read this paper super closely on stream so I'm going to be a bit insistent on the details)
It's actually kind of a confusing paper because it describes and reports the performance of two totally different models.
So for example, I saw some people reporting R1's performance as comparable to o1 mini.
But that's what they say about their SFT distills.
Not about R1.
If you're just skimming the paper though you could easily glance at that line and think it's about R1.
Actually, now that I think about it, it's about three models.
R1-Zero, R1, and the distillations from R1
And the R1 training pipeline is like... I'm not 100% sure I have it right in my head, it's a little complicated, but.
I believe it goes:
R1-Zero -> Filtering -> RetroInstruct -> Sprinkle Of Human Traces/Data -> SFT on previous -> First RL stage is R1-Zero's training process with the SFT checkpoint -> Second RL stage is RLHF
This really could have used a diagram.
JDP — 1/21/25, 9:03 AM
I'm pretty sure what's happening here is that they were going to release R1-Zero, but the traces were sort of ugly, and then Qwen released their reasoning model and suddenly R1-Zero wouldn't be an impressive release, so they delayed it to make R1 and the distills.
ANYWAY
I believe DeepSeek that the humanlike behavior in R1 is emergent.
I could explain how that happens if you want.
Or speculate in an informed manner, rather. :p
I believe DeepSeek that the humanlike behavior in R1 is emergent.
I think it's largely emergent in O1 too.
JDP — 1/21/25, 9:04 AM
Teortaxes' takes about OpenAI's "purism" are basically bullshit; O1's traces are some of the most obviously synthetic data I've ever seen.
Like, gorgeous synthetic data.
But I can tell that O1 invokes certain words and phrases as like, function calls almost, or ways to enter into certain routines.
And that's a clear sign of either low policy entropy RL or RetroInstruct type data, but I agree it's probably emergent after seeing R1.
If I had to guess, it's emergent after some RetroInstruct type cold start data.
JDP — 1/21/25, 9:06 AM
Of course, this is OpenAI, so they probably did have some data annotators involved.
Like, OpenAI is famous for their data annotator budgets being very generous.
BUT SO.
With that in mind here's kind of how I think the Claude stuff works.
They make a RetroInstruct type set for some very narrow desired behavior.
Like, "the model notices it hallucinates and corrects it"
You might think "doesn't that encourage the model to hallucinate?"
I had the same question.
Someone apparently studied this and it turns out no, but I lost track of the paper/didn't save it.
It's one of those "LLMs really do generalize" papers.
They looked at whether mistake correction data causes the LLM to make more mistakes or avoid the mistakes in the first place.
And it does in fact generalize to noticing and avoiding the mistakes.
JDP — 1/21/25, 9:08 AM
What you then do is like, SFT on these synthetic traces to get the behavior into the model, which it then overdoes a bit.
But that's fine.
Because what you then do is RL with some subjective weave evaluator type setup or other flexible neural grader.
Which then speciates the behavior into more appropriate contexts.
That's why it has that template generated feel, but clearly isn't completely a template.
You get me?
Basically the pattern is you bootstrap behaviors with backtranslation/careful prompting/et al.
And then you generalize those narrow contexts/templates with RL.
They also mix in human annotation data to give it some more diversity/naturalism, like, you behavior clone from humans still to mix entropy back into the process in the direction you want.
But ultimately you can only pay for so much annotation data.
And you want to focus it on the stuff you actually need humans to do.
"Model makes a mistake and corrects it" can obviously be template generated, you don't have to pay for that, so we can imagine the labs mostly don't.
JDP — 1/21/25, 9:11 AM
And it does in fact generalize to noticing and avoiding the mistakes.
This is actually one of the most important capabilities language models have if it's true.
Because it means that if I have something like a weave-agent trace where the model struggles, figures out something, and then continues.
That data is valuable rather than detrimental.
That's why you see me doing things like trying to walk it through its confusions in the chat channel rather than starting over.
If I can get it to the other side, that's a golden trace.
Especially because what it tends to do while I direct it is like, slowly start to use appropriate problem solving mental motions.
Like, it tries to contextualize what I say into its frame and then slowly frame shifts.
Getting a textual record of that process of updating on evidence is very useful.
But like, IDK, a lot of why I found reading the R1 paper extremely heartening.
Is that it validated my core assumptions:
-
That you mostly just need known existing optimization methods, there's no magic fancy secret sauce or crazy balancing act I need to do. If you have solid data/traces you can literally teach them to the model with SFT.
-
You don't need a separate reward model, the policy can be usefully used as a reward model.
-
Formal symbolic verifiers can generalize further than you might think. Weave-Agent's current strategy relies in large part on being able to generate symbolic proxies of the underlying goals in context.
-
Filtering the top n% of traces and SFTing on them for your next round just works, it really is that simple.
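To be concrete about that last bullet, here's a minimal sketch of the filter-then-SFT data prep, assuming traces that already carry some scalar score (weave evaluator output, reward, pass rate); the field names are made up:
```python
import json

def top_fraction(traces, frac=0.1, key="score"):
    """Keep the best `frac` of traces by `key`, highest score first."""
    ranked = sorted(traces, key=lambda t: t[key], reverse=True)
    return ranked[: max(1, int(len(ranked) * frac))]

def write_sft_jsonl(traces, path):
    """Dump the kept traces as JSONL in a generic prompt/completion shape for SFT."""
    with open(path, "w") as f:
        for t in traces:
            f.write(json.dumps({"prompt": t["prompt"], "completion": t["completion"]}) + "\n")
```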
JDP — 1/21/25, 9:16 AM
They did GRPO
Which is like, a Monte Carlo version of PPO or something.
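(Roughly, as I understand it: instead of a learned value function baseline you sample a group of completions per prompt and normalize each reward against the group, so the group statistics play the critic's role. Sketch of just the advantage computation, clipping/KL terms omitted:)
```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sampled completion for a prompt is scored
    against the mean/std of its group, so no separate value model is needed.
    `group_rewards` has shape (num_prompts, group_size)."""
    group_rewards = np.asarray(group_rewards, dtype=np.float64)
    mean = group_rewards.mean(axis=1, keepdims=True)
    std = group_rewards.std(axis=1, keepdims=True)
    return (group_rewards - mean) / (std + eps)
```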
But yeah.
Basically, I made a bunch of know-nothing "brain dead" simplifying assumptions for the weave-agent strategy that I was like, 70-80% sure should work.
Now I'm more like 95% sure.
It really is that easy.
i.e. Data is paramount, any generative process that can create a good trace is valid.
Which is also a core assumption of RetroInstruct.
I mean this "sounds" obvious, and in a sense it is, but it's one of those things you can't really be 100% sure is true until you have direct experience of it.
You know?
Like, it might be possible for example that the models are very sensitive to generative process and we just got lucky with language.
(The sheer generality of deep learning and the transformer across every modality implies this is unlikely, but hey you never know)
That SFT on the R1 traces worked so well is especially good news to me.
Because it means that like, I really do just have to get a critical mass of weave-agent traces that are good and I win.
Honestly, they don't even need to be good, a critical mass of mediocre traces.
Traces that are merely not total garbage.
Can you see why I would take R1 as extremely optimistic?
https://x.com/jd_pressman/status/1870946561133572214
It implies this theory is basically valid, there is no extra sauce or insight necessary, it really is that simple and all I have to do is get past the bootstrapping phase.
SO.
LLMs, Claude.
I think your basic idea of "just try and get R1 to translate your problem into its domain of competence".
Is both a good one, and basically what I told Amanda Askell she should do to turn Claude into a schizotheory assistant monster of a model.
It's also fundamentally very simple.
All we really need is a reliable process for generating like, esoteric but valid math/engineering problems.
I feel this can probably be done using lean.
Because R1 knows lean well, and can translate e.g. competition programming problems into lean.
JDP — 1/21/25, 9:25 AM
And assert constraints on them.
So when I was discussing making a royalty-free competition problem set with it (since LeetCode is technically copyrighted and I like to be a purist about this to push myself to synthetic data maxx)
One thing I'm not sure people appreciate about RetroInstruct for example is that if I made a benchmark based on it.
And people Goodharted the benchmark.
I would just generate another test set.
And then evaluate on that.
There is no way to Goodhart on RetroInstruct besides just like, actually learning how to model the underlying generative processes.
Memorizing it will not work.
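(That property falls out of the benchmark being a (generative process, seed) pair rather than a frozen file, something with this hypothetical shape:)
```python
import random

def make_test_set(generate_item, seed, n=1000):
    """A RetroInstruct-style benchmark is a generator plus a seed, not a fixed
    artifact; if a released split gets memorized/Goodharted, resample with a
    fresh seed. `generate_item` is whatever process builds one problem."""
    rng = random.Random(seed)
    return [generate_item(rng) for _ in range(n)]
```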
JDP — 1/21/25, 9:29 AM
Because R1 knows lean well, and can translate e.g. competition programming problems into lean.
I'm thinking what we can do then.
Is make a kind of combinatoric list of subjects.
That can be combined together to make odd problems.
Give this list to R1, and have it formulate problems that would combine domain X, domain Y, and domain Z in lean.
Which can be verified to be internally consistent by just checking that the lean proof checks.
Then all you have to do is verify that the relevant domains actually appear in the lean.
Which could be done with either a weave evaluator type setup.
Or some kind of cleverer symbolic method where you insist on certain touchstones/certain patterns existing.
Could also just ask R1 itself to think about whether the domain appears in the problem.
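(Sketch of the proof-check step, assuming a working Lean 4 toolchain where running `lean` on a file exits nonzero on elaboration errors; if the generated snippets need mathlib you'd invoke it through `lake env lean` inside a project instead:)
```python
import os
import subprocess
import tempfile

def lean_typechecks(lean_source, timeout_s=120):
    """Write generated Lean source to a temp file and see whether the Lean
    elaborator accepts it. Assumes `lean` (Lean 4) is on PATH."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_source)
        path = f.name
    try:
        proc = subprocess.run(["lean", path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```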
Once you have these problems described.
You have a different LLM, possibly R1 again since it has RLHF/literary data too.
Write out a confused, vibe-y humanities description of the problem.
Then you have it write out a derivation of that which is missing one of the key domains, but specifies the problem that domain is meant to solve.
Like "I feel like I need some kind of way to solve X, but I don't know how I would do that, do you have any idea?"
And then you go backwards.
So you start with the confused version of the problem in the wrong frame that's missing one of the domains.
The "assistant" gives a response that introduces the missing domain.
The "user" then "gives" the version of the problem that is confused and vibe-y but has all the necessary pieces.
And finally at the end you have the original pristine mathematically clear description of the problem in lean.
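(A minimal sketch of gluing those pieces into one multi-turn example, built backwards like that; the keys are hypothetical and the exact phrasing of the assistant turns is obviously template business:)
```python
def assemble_backwards_example(parts):
    """Build one chat-format training example from generated pieces. Generation
    ran forward from the pristine Lean problem, but the conversation presents
    the most degraded artifact first and ends on the clean formal statement.

    Hypothetical keys: 'confused_missing_domain', 'missing_domain_hint',
    'confused_complete', 'lean_problem'."""
    return [
        {"role": "user", "content": parts["confused_missing_domain"]},
        {"role": "assistant", "content": parts["missing_domain_hint"]},
        {"role": "user", "content": parts["confused_complete"]},
        {"role": "assistant",
         "content": "Formally, I think the problem you're getting at is:\n\n" + parts["lean_problem"]},
    ]
```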
JDP — 1/21/25, 9:35 AM
Or maybe even like, have R1 just give you the missing domain and the confused vibe-y humanities version and ask if that makes sense to you.
The user confirms and it gives you the beautiful mathematically clear version.
Or even, you can have different templates using these parts.
Maybe sometimes you have R1 just try to solve the whole problem for you given the thing with the missing domain.
Maybe other times you have the user give the confused version, other times R1 gives it as an example of a thing to try and mirror them.
Maybe you generate related problems to the one with the missing domain so that the model can give you a distribution over possible problems you might be talking about.
Suddenly you go from R1 needing to be talked to in a very specific language that phrases everything like a competition problem, to R1 being able to work from however you actually describe things.
Oh right.
You could also have styles of description.
So "humanities major", "STEM guy on LSD", "autodidact-ish person who mixes multiple somewhat related fields but in a nonstandard way to how a college educated person would describe the ideas".
Maybe have like, 3 or 5 styles of description.
With different generative processes to get them.
Bam, suddenly you no longer need a separate model to get the human interface part into the distribution.
The model would just know how to take your halfass weird description...ah right "lazy/busy but fundamentally intelligent guy" should definitely be one of the description types.
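(Something like this as the config, each style pointing at its own rewrite prompt/generative process; the prompts here are placeholders for whatever actually gets the vibe right:)
```python
DESCRIPTION_STYLES = {
    "humanities_major": "Rewrite this problem the way a humanities major would explain it to a friend: vibes first, no notation.",
    "stem_on_lsd": "Rewrite this problem as a free-associating STEM person would: technically flavored but loose and metaphorical.",
    "autodidact_remix": "Rewrite this problem mixing several adjacent fields in a nonstandard, self-taught way.",
    "busy_smart_guy": "Rewrite this problem as a lazy/busy but fundamentally intelligent person would: terse, half-specified, impatient.",
}

def style_prompt(style, problem_english):
    """Wrap a clean English problem statement in one of the persona rewrite prompts."""
    return f"{DESCRIPTION_STYLES[style]}\n\nProblem:\n{problem_english}"
```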
This would be a bit of elbow grease, but way less elbow grease than beating Claude along all possible dimensions.
And it would solve the fundamental problem.
So you would.
-
Make a list of problem types you expect R1 to be able to do, and which can be combined to get esoteric/oddball combinations. Note you do not have to solve these problems, you just need the model to be able to describe them rigorously.
-
Give R1 some prompt and combination of domains from this list and tell it to make a math/programming/whatever language it speaks type description of a problem combining those domains in lean. You proof check the lean description of the problems to make sure they are internally consistent. Again it's not particularly important if these problems make sense or are useful, they just need to be valid combinations of the domains which are internally consistent. If you have enough such examples the model will probably generalize to the stuff you want it to actually do.
-
(Optional) To give a bit more direction to the problem generating process you could have "problem themes" that specify what kind of problem combining those domains should be devised by R1. I don't know what the list of themes would be we'd have to figure out what dimensions of problem type are relevant to us. But one thing that might be useful would be some kind of content guidance so the model doesn't generate things that are...objectionable.
-
Along with the lean version of the problem you should have a plain English description of it. You can verify that the plain English description is correct with an embedding model: ask R1 to turn the plain English version back into lean, then check that this round-trip lean code has high embedding similarity to the original lean code. So basically checking that Lean -> English -> Lean full-cycle translation gets you a high similarity lean program to the original one you captioned (see the sketch after this list).
CHECKPOINT: You now have a set of lean problems in oddball combinatoric domains, verified to be internally consistent, with known matching English annotations/descriptions.
- From this set, which is valuable in its own right, you then have an LLM make lossy confused and slightly wrong frame descriptions of the problem according to one of four or more descriptive styles: "humanities major", "STEM guy on LSD", "autodidact-ish person who mixes multiple somewhat related fields but in a nonstandard way to how a college educated person would describe the ideas", "lazy/busy but fundamentally intelligent guy". Each of these description types would have a somewhat different generative process to get the right kind of vibe.
It's okay for these annotations not to describe the problem correctly, in fact what you actually might want to do is verify they don't describe the problem entirely correctly. Though, I think it's actually fine to leave it with some natural noise, that way the model will generalize correctly both when you have a correct and incorrect description of the problem on your end.
- You now make a third set derived from this one: lossy descriptions of the problem based on the previous lossy description, with a frame change (perhaps only change the frame 50% of the time so it doesn't learn to always change the frame if it doesn't need to be changed?) and one of the domains missing.
You should again verify that the domain is actually missing. One way to do this would be to have R1 turn the domain-missing, frame-shifted version of the problem into lean and then check that its embedding similarity to the lean for the previous description of the problem is below some threshold.
-
To recap: you now have 1) a set of underspecified problems missing one of the domains their solution relies on, 2) a set of coherent fill-ins for the missing domain necessary for a solution to those problems, and 3) formally specified, denoised descriptions of those problems that turn the simulated human slop into self-consistent coherence.
-
Make various templates combining these parts in different ways to get some diversity in interaction types and in which persona in the multiturn interaction says what. As a tip, keep the different generated parts from the previous steps under separate JSON keys if you use different underlying generative processes to get the pieces you glue together. Only glue the pieces together at the end: glue them into the appropriate shapes during the intermediate set generations that rely on previous sets to generate the next set, but make sure you keep access to, and can rearrange, all the parts at the end to give you maximum freedom during this final templating process.
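To pull the verification and bookkeeping bits above into one sketch: `embed` stands in for whatever embedding API you actually use, the thresholds are made up, and the record keys are just one example of the "separate JSON keys" advice, nothing canonical.
```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cycle_consistent(lean_original, lean_roundtrip, embed, threshold=0.85):
    """Lean -> English -> Lean check: the lean R1 writes back from the English
    caption should embed close to the original lean code."""
    return cosine(embed(lean_original), embed(lean_roundtrip)) >= threshold

def domain_actually_missing(lean_degraded, lean_original, embed, threshold=0.6):
    """The domain-missing, frame-shifted version should embed *far* from the original."""
    return cosine(embed(lean_degraded), embed(lean_original)) < threshold

# One record per problem, every generated part under its own key so the final
# templating pass can rearrange the pieces freely (hypothetical schema):
EXAMPLE_RECORD = {
    "domains": ["graph theory", "queueing theory", "music theory"],
    "lean_problem": "...",             # passed the lean proof check
    "english_description": "...",      # passed cycle_consistent()
    "lossy_descriptions": {"humanities_major": "...", "busy_smart_guy": "..."},
    "domain_missing_version": "...",   # passed domain_actually_missing()
    "missing_domain": "queueing theory",
}
```
The final templates then just pick and order keys from records like this into different conversation shapes, e.g. the backwards example sketched earlier.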
JDP — 1/21/25, 10:06 AM
Anyway everything I just wrote there is basically what I would like R1 style models to do for me, wrt RetroInstruct.
I want them to come up with that plan, and then ideally execute it.
Since like, all those operations are in principle like, things that an LLM agent could just do.
But you'll notice that like, I actually pay a ton of attention to "do this thing here to diversify it/make sure it generalizes".
i.e. I in fact demonstrate a certain taste in how to make sufficiently deep generative processes, the kind RLHF assistants just can't really be prompted to produce.
R1 type models seem like they could learn to do it in principle though.