The RetroInstruct Guide To Synthetic Text Data

John David Pressman

This document will teach you how to make high quality synthetic text data. It is based on my experience working on the RetroInstruct dataset. Last year when we started working on MiniHF the vision I laid out was simple: Instead of hunting inside pretrained latent spaces for the prompt contexts we want, why don't we write documents representing the right contexts and add them to the model? With RetroInstruct I feel I'm finally starting to get a set of design patterns that let me accomplish that goal. As I write this there is a lot of hype and a lot of criticism of synthetic data, very little of which is well founded on either side. On one side you have people claiming that synthetic data can only be a path to deteriorating models and on the other you have those saying that LLMs improving their math skills with proof solvers means we can count the days until the eschaton. The reality is more interesting than either of these extremes. I think I've developed a good set of reusable strategies and tactics for allowing a person (and, eventually, AI agents) to write large corpuses that would otherwise be totally intractable for an individual author. Designed and used carefully, there is no reason why the resulting data has to eventually degrade the models trained on it, and much reason to think that models will be vastly improved by its widespread development and use.

Conceptual Background

Understanding the tools is only part of creating synthetic data, and rarely the hard part. The most important part of a good synthetic dataset is good design for the generating pipeline. To help you design pipelines, let's recap some things I've written about before in the RetroInstruct README and on Twitter:

The Four Kinds Of Synthetic Data

There are at least four major kinds of synthetic data you can make with generative AI models. They have very different reliability guarantees and calling all of them "synthetic data" will get confusing quickly. Furthermore, you frequently want to chain these methods together as discrete steps in a larger generating process. From most to least reliable they are:

Formal Verification Methods

These methods get their correctness from things like mathematical symmetry. They typically involve a proof assistant like Lean or a program test suite and are generally used for making math proof datasets or writing code. Their distinguishing feature is that you have either total or near-total certainty in the correctness of solutions found with these methods. Maybe the math proof is a little ugly, but you're 100% sure it really does prove the predicate. The code may not fully satisfy the intent of the test suite, but you are 100% sure it passes it. Games like Foldit rely on the fact that there are highly correlated proxy metrics for evaluating the efficiency of a protein fold to score players' actions and find useful candidate solutions. You may not be 100% sure a candidate solution is right, but the proxy metrics narrow it down to a small enough number of solutions that researchers can check them.

Backtranslation Methods

These methods get their correctness by starting with the right answers and then generating the questions. The well known diffusion model type is entirely based on this principle, adding successive layers of noise to an existing sample and then teaching a neural network to undo the noise. If language models couldn't already write summaries and we had to teach them from scratch, one method might be to use an existing formal summarization algorithm and teach the language model to emulate it by running the algorithm over text to get text:summary pairs. One could then take it a step further and perform backtranslation by reversing the pairs so that they now imply writing the text by starting from the summary or outline. It is imaginable to take a source like Wikipedia, break it up into chunks, ask a language model what questions this chunk would answer, then perhaps even rewrite the chunks in different styles to get a comprehensive Q&A dataset over a much wider range of subjects than any set of contractors or volunteers could ever hope to write for an AI specific project.
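
As a small illustration of reversing the pairs, imagine we trust some existing summarize() function, a stand-in here for the formal summarization algorithm; the backtranslated samples just flip the direction:

def make_expansion_pairs(texts, summarize):
    # summarize(text) -> str is a stand-in for a trusted summarization algorithm.
    pairs = []
    for text in texts:
        summary = summarize(text)          # known-good forward direction
        pairs.append({"prompt": summary,   # reversed: summary/outline in,
                      "completion": text}) # full text out
    return pairs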

Rejection Sampling & Classifier Methods

These methods get their correctness from a combination of using a generator/prompt that's usually right and a classifier which sorts out the failures in the batch. With a good model and classifier these methods can theoretically get very good results, but it's important to realize that even with a mediocre model and classifier you can still improve over the generator's usual performance. If you have a generator that gives you correct samples 80% of the time and a classifier that is also correct 80% of the time, with errors independent of the generator's, your overall accuracy on the accepted samples could become as high as 94.12% if you only accept samples the classifier agrees with. Tuning the model on this set would presumably improve its accuracy from its current baseline at the cost of some policy entropy. MiniHF uses in-context logit classifiers based on yes/no questions to provide general and logistically simple classifiers for many rejection sampling setups such as Monte Carlo Tree Search.
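
To spell out where the 94.12% figure comes from, here is the arithmetic under the independence assumption, written out as a quick Python check using the 80%/80% numbers from the running example:

# Probability a sample is correct given the classifier accepted it, assuming
# the generator and the classifier make independent errors.
p_generator = 0.80   # generator produces a correct sample
p_classifier = 0.80  # classifier labels a sample correctly

kept_correct = p_generator * p_classifier                # correct and accepted
kept_incorrect = (1 - p_generator) * (1 - p_classifier)  # wrong but accepted

print(kept_correct / (kept_correct + kept_incorrect))    # ~0.9412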

Prompting Methods

Prompting methods are functionally rejection sampling methods but without the rejection and without the classifier. Recall the example of a generator that's 80% correct to begin with. If the usual prompt a user types in for some task gets it right 60% of the time, but clever prompting or automatic search with a framework like DSPy lets us find a context in which it becomes 90% right, we can take the results from that and put normal looking prompts in front of it to make the usual performance of the model on that task closer to 90%. We can think of 'prompt engineering' as program search in the LLM's latent space for programs that do a particular task. Some programs are better than others and harder to find than others, so it necessarily follows that we can take the results from obscure and carefully written prompt programs which have been manually checked by a human to provide mostly correct results and add the resulting corpus to our training set.

The Mental Motions Of Synthetic Data Design

Rather than think of these types as excluding each other, it's important to realize that useful synthetic pipelines are going to incorporate elements of all of these to get good results. Designing the pipelines is almost like a game with rules, where a series of steps can form a valid or invalid pipeline. The truth is I can design many more synthetic datasets than I have time to create. Let's go through a couple that don't exist yet but I'm fairly confident would work, to establish the basic principles.

Code Debugging

I'm fairly sure you could create a great code debugging dataset just from existing code chunks and an LLM in something like the following way.

  1. F.Verification - Take a large corpus of existing permissively licensed code and filter it down to just the samples which are relatively standalone. Ones that do operations on raw strings, numbers, lists, etc. It would be easier to do this in a strongly typed language like OCaml or Rust where the exact data types that code accepts have to be declared in advance. This would let you do static analysis to build up a corpus of functions which do raw logic on simple datatypes. You want to filter down to this subset because it eliminates the kind of code that relies on a lot of libraries with unpredictable behaviors, or that depends on the behavior of a whole stateful system which doesn't fit into the context window. The ideal candidates might come from a functional strongly typed language since those are the most likely functions to avoid entanglement with large stateful objects and have declared types to formally verify the function is self contained.

  2. Prompting - If you did the previous step correctly you've filtered down to just the functions that can be debugged from short context with the content of the function alone. Now we can prompt an LLM for test case input:output pairs to validate our corruption pass against. The idea here is that we're going to perform backtranslation by introducing deliberate errors into known working code and then teaching the model to go from the errors to the original known-correct code. In order to do this we have to trust the LLM to break the code reliably, which it might not always do. To improve our odds that the LLM is in fact breaking the code when we prompt it to we can start by making a small test suite for each piece of code we're trying to break. This adds further rationale to why we filtered down to functional low abstraction code in the previous step, because that code is much easier to test independently without running a full system to support it. So we tell the LLM to write a series of inputs and expected outputs for each function. We can then partially validate the correctness of the test suite by running the suite against that function and insisting on full code coverage. But ultimately code coverage is not the same thing as a test suite being correct. Therefore this step is prompting more than it is formal verification, we're asking the LLM to do something reliant on its subjective judgment and hoping it gets it correct.

  3. Prompting - In the previous step we made test suites for each function in our corpus and verified their code coverage and runtime coherence. Now we ask our model to break each of the functions with a type of bug. Perhaps we tell it the kind of bugs we want it to introduce (e.g. syntax error, type error, logic error) by randomly rolling them from a list. Maybe we let it pick the bug it wants to introduce into the code. Either way the important thing is that it is taking the functions in our corpus and breaking them in a way that verifiably makes the test suite no longer pass. Note that we do not actually need the test suite to confirm correctness because we'll be performing backtranslation. We only need the test suite to confirm incorrectness, which is a much simpler problem. The easiest way to get it to write bugs that break the test suite might be to pass it the function and the test suite along with the explicit instruction to rewrite the code in such a way that the test suite will no longer pass. We only retain the samples where the test suite is in fact broken, and perhaps rejection sample until the test suite breaks or we've spent too many tokens trying on this sample (see the sketch after this list).

  4. Backtranslation - We now have a corpus of simple low state logic heavy functions, we have test suites of unknown quality for those functions, and we have broken versions of the functions which do not pass the test suites. Now to perform backtranslation and create the debugging dataset we simply reverse causality. While in reality we started with the known correct functions, to make our dataset we start with the broken functions. We then introduce an element of controllability or a slot to paste in an error message by following this with the parts of the test suite that were broken or the error message the test suite gave when it broke. Finally we have the "chat assistant" in the dataset depicted replying to this information with the 'corrected' (known correct to begin with) code that resolves the problem.

  5. Prompting - However we're not actually done yet. If we want to use this dataset in an instruction tuning context we still need to give it the introduction to the problem telling the model what it is we want to do. This is important so that during inference the model knows it should reply to requests along these lines with the program context we're building up into its latent space with this dataset. There are a couple of ways to do this. The way I usually do this is demonstrated in FLAN with its prompt templates. You basically write several variations on the same question or prompt format and, if relevant, insert the necessary information into the question variant with template variables. The other way is to use an existing LLM to write a unique variant of the question for each training sample in the dataset. A middle way between these two is to have a list of discrete parts, such as start and end markers for text the instruction model is supposed to modify, which you roll from randomly to create the prompt format. Empirically the FLAN approach seems to work fine, but the purpose of doing this is to make sure that the model generalizes to different versions of the same question instead of overfitting to the exact one statement you would otherwise train on over and over again.
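
Here is the sketch promised in step 3 of what the verified corruption loop could look like, assuming hypothetical helpers passed in by the caller: llm_rewrite(prompt) wraps whatever model you're prompting and run_tests(code, suite) runs the function's test suite and reports whether it passes.

def make_broken_version(function_source, test_suite, llm_rewrite, run_tests,
                        max_attempts=5):
    # Ask the model to break a function, keep only verifiably broken output.
    # llm_rewrite(prompt) -> str and run_tests(code, suite) -> bool are
    # placeholders for your own prompting and test harness.
    prompt = ("Rewrite this function so that the following test suite no "
              "longer passes.\n\n" + function_source + "\n\n" + test_suite)
    for _ in range(max_attempts):
        broken = llm_rewrite(prompt)
        if not run_tests(broken, test_suite):  # confirmed broken
            return broken
    return None  # give up on this sample rather than keep an unverified one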

At the end of all this you would have training samples that look something like:

{INSTRUCTION_TEXT}

<code>
{CORRUPTED_CODE}
</code>

<errorlog>
{TEST_SUITE_FAILURE}
</errorlog><|end|>{ORIGINAL_CODE}

We can imagine variations of this template, like one where the test suite failures are generated by the model rather than given by the user:

{INSTRUCTION_TEXT}

<code>
{CORRUPTED_CODE}
</code><|end|><errorlog>
{TEST_SUITE_FAILURE}
</errorlog><|end|>{ORIGINAL_CODE}

But the important point to realize is that between several prompt templates, potentially hundreds or thousands of variations on the initial instruction, and thousands upon thousands of rows of buggy_code:debugged_code pairs to train on, you have a high quality diverse dataset that would be logistically impossible to get from contractors or volunteers.

Multi-Turn Instruction Data

I'm often told there's a critical shortage of good open multi-turn instruction data. I find this a little odd because making good synthetic multi-turn data probably isn't that hard in principle. It's important to realize that multi-turn conversations have an implied fractal pattern, such that once you know how to respond to, say, the 2nd or 3rd turn in a multi-turn context you've probably learned the majority of the pattern needed for longer conversations. This implies that if we want to make good multi-turn data we really just need to ask "what are reliable contexts in which I can demonstrate 3 or 4 turns of conversation?". One promising context is the code debugging described above: we can imagine stretching the bug introduction out into multiple stages so that the LLM learns to break a particular test on each pass, and then making multi-turn data where the LLM fixes each individual bug one at a time. This has the problem, however, that it would teach the model to be "lazy" and refuse to fix more than one bug at a time in buggy code. A better plan might be to train the model as described above and then find the code in inference that it empirically fails to debug. You can then turn that into a 3 turn dataset by extending it with the buggy code, the failed fix produced in inference, then a backtranslated 'true fix' from the original dataset. In principle one could repeat this multiple times to get more turns. Another option is to use an embedding model for code to find semantically similar functions and then have a multi-turn set where a user "asks" the model to "transform" them into an arbitrarily chosen final function over multiple interactions to simulate "getting closer to what the user wants".
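
A minimal sketch of assembling such a 3 turn sample from a row containing the buggy code, the failed fix collected at inference time, and the original known-correct code. The field names and user glue below are illustrative, not the actual dataset schema:

def make_three_turn_sample(row):
    # Three content turns (buggy code, failed fix, true fix) wrapped in
    # illustrative user glue. Substitute whatever chat format you train on.
    return [
        {"role": "user",
         "content": "This function fails its tests, can you fix it?\n\n" + row["buggy_code"]},
        {"role": "assistant", "content": row["failed_fix"]},
        {"role": "user",
         "content": "That still doesn't pass the tests, please try again."},
        {"role": "assistant", "content": row["true_fix"]},
    ]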

Obviously these code based sets wouldn't cover every kind of multi-turn conversation a user might want to have so it's useful to consider generalization strategy. That is, it's not necessarily the case that you need a whole lot of data describing how to do multi-turn in fuzzy hard to verify contexts for the model to learn how to do it in the presence of multi-turn data based on verifiers and backtranslation. Here's a fuzzy strategy for poetry based on backtranslation:

  1. Prompting - The basic structure we're going to go for here is a 3 turn format with a rhythm like bad -> genre change -> original. So the first thing we'll want to do is find a prompt that reliably changes the poem's genre. Notably, we don't actually need the prompt to preserve the rhyme structure of the poem because the next step after this is supposed to be bad poetry in the first place. In my experience language models tend to struggle the most with poetic meter and such, so we want to preserve that property for the final step of the backtranslation. This is as opposed to doing a structure like bad -> original -> genre change, because in that structure the genre change might destroy the poem part of the poetry and be left as the final turn. If we teach it bad -> genre change then it simply learns to preserve the badness of the original poem while changing its genre, but if we teach it good -> (bad) genre change then it learns to destroy the structure of the poetry when changing genres.

  2. (optional) Rejection Sampling - One optional step we could do to make our genre changes better is to use a prosody analysis algorithm or speech synthesis model to measure the awkwardness of the prose when spoken. While this wouldn't ensure meter and such are maintained, it would at least filter out genre changes that read awkwardly aloud. Few shot logit evaluator prompts could also be constructed to get a better sense of whether a poem preserves the rhyme structure of the original or not.

  3. Prompting - Once we have the poems with changed genres we want to further prompt our model to make them bad. Thankfully bad poetry is pretty easy, so with a carefully made prompt it shouldn't be hard to reliably get a bad version of the poem out of it.

  4. Prompting - We'll also need some glue text like the user's instructions/questions that frame the prompt. Things like "I'm having trouble writing this poem could you help me?" or "I think the real problem is that this poem is about love, could you make it about friendship instead?". You should retain the specific genres you had the model change the original poem to so that you can put those genres into the generated user instructions you wrap the poems with. Remember that you can have models few-shot generate formatting templates, not just fully object level questions and framings. You can have multiple strings like "I'd really like this poem to be about {GENRE} actually." and the model will know to write the templates with '{GENRE}' where the actual genre string should go. In particular you need user framing glue for the request for a genre change, then user framing glue for the request to improve the poem/put it into a particular meter/etc. If this is a conversation model you may also want chat assistant glue for the model's pleasantries like "Of course I can help you with your poem...".

  5. Backtranslation - Once you have all 3 turns you can perform backtranslation by putting it into a template in reverse order to how you made it. That template might look like:

User

{GENRE_CHANGE_INSTRUCTION}

{BAD_POEM}

Assistant

{ASSISTANT_PLEASANTRY}

{GENRE_CHANGED_POEM}

User

{POEM_IMPROVEMENT_INSTRUCTION}

Assistant

{ASSISTANT_PLEASANTRY_2}

{ORIGINAL_POEM}

Now suddenly you have a high quality 3 turn instruction dataset. You may find it helpful to write out templates like this before you make any of the data to help ensure that it looks right and you haven't forgotten anything. You can further fill in the template manually with a few examples to make sure this is what you want.
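
Filling a template like this programmatically is just string formatting. A minimal sketch, assuming the template is stored as a Python format string and the row dictionary uses keys matching the placeholders above (both names are illustrative):

def render_poem_sample(template, row):
    # Fill the 3 turn template for one generated row. The row's keys are
    # assumed to match however you store your intermediate data.
    return template.format(
        GENRE_CHANGE_INSTRUCTION=row["genre_change_instruction"],
        BAD_POEM=row["bad_poem"],
        ASSISTANT_PLEASANTRY=row["assistant_pleasantry"],
        GENRE_CHANGED_POEM=row["genre_changed_poem"],
        POEM_IMPROVEMENT_INSTRUCTION=row["poem_improvement_instruction"],
        ASSISTANT_PLEASANTRY_2=row["assistant_pleasantry_2"],
        ORIGINAL_POEM=row["original_poem"],
    )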

Easy Prose Repair Diffs

Now lets walk through the process of making an actual dataset. In this case, Easy Prose Repair Diffs in RetroInstruct. This component of RetroInstruct teaches the model how to produce diffs in GNU diff, git, or diff match patch format to correct typos, awkward phrasing, missing information, and other flaws in prose samples. It is meant to teach language models how to use diffs to edit the context window since producing a diff has the potential to be both more precise and more efficient than regenerating a whole passage or section autoregressively in order to correct a mistake. The core strategy this dataset uses is backtranslation. We start from known or presumed good text and then add synthetic corruptions to it to get diffs between the corrupted version and the original good text "fixing" it.

Choosing Our Model

To get straight to the point I try my best to always use a model licensed under an OSI approved license to generate text data. Here's a list of good ones to choose from. The reason I do this is that other models usually have a clause in their license prohibiting you from using them to improve "any other model", which would obviously extend to using them to create public synthetic datasets. For anything else you need to read the license carefully and do your own research; I am not a lawyer and you should consult yours etc etc. Another feature I tend to look for is not being mode collapsed into that ChatGPT-y style. So for this dataset I use Mixtral 8x22B base as my generative model.

Choosing The Data To Backtranslate From

If we're performing backtranslation then the choice of data to work backwards from is crucial, since it's our standard for correctness. I generally try to choose public domain datasets for RetroInstruct because I would like the overall corpus to stay as unencumbered as possible. Because the focus for this dataset is less centered on the semantics of the text, and most public domain text is fairly old, I decided to go with synthetic data based on my short form writing. This serves the dual purpose of giving the dataset a more human flavor, as if I'd written it, while increasing the representation of contemporary concepts in the RetroInstruct corpus.

While I prefer to use public domain data for RetroInstruct it's not always possible to make a perfectly unencumbered set with existing models and datasets. Many synthetic datasets such as the code debugging set described above involve permissively licensed content, which is not the same thing as public domain. Permissively licensed content is copyrighted, but has explicit opt-in to let other people use it under terms laid out in standardized license agreements. I am again not a lawyer and this is not legal advice, but it is generally speaking necessary to understand that OSI approved licenses and Creative Commons licenses have specific legal terms that you need to comply with if you're going to redistribute content using them. This is not the same thing as the ongoing legal debate about under what circumstances training an AI model is fair use, which Creative Commons endorses as usually fair use. A dataset is basically just normal content and the copyright concerns applying to it are completely standard and well described by existing case law in the US.

Making The Synthetic Placeholder Text

For the easy prose repair diffs the focus is mostly on things like repairing spelling and corrupted data, not semantic problems. Even if it was focused on semantic problems the exact semantics we want the model to infer aren't particularly important, we're just trying to teach it how to repair text using diffs. This means that what we need here is a large amount of placeholder text, or lorem ipsum that happens to be semantically meaningful. I create this using a long prompt on Mixtral 8x22B base that encourages the language model to write more text in my style.

I run the model with vllm on an 8x H100 box using the following command:

python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x22B-v0.1 --served-model-name mistralai/Mixtral-8x22B-v0.1 --dtype float16 --max-num-seqs=128 --max-logprobs 100 --gpu-memory-utilization=0.9 --disable-log-requests --disable-log-stats --port 5001 --tensor-parallel-size 8

Some notable flags here include --max-logprobs 100, which lets the server return enough per-token log probabilities for the logit evaluators used later in this pipeline; --tensor-parallel-size 8, which shards the model across all eight GPUs; --max-num-seqs=128, which lets up to 128 sequences generate concurrently (matching the n=128 batches in the gen script below); and --dtype float16, which runs the model in 16 bit precision.

Once you have it set up and ready for inference you can test that it's working by using MiniLoom or another LLM inference tool.

I then make the data using a simple gen script that imports from the weave MCTS library:

import json
from tqdm import tqdm
from weave import generate_outputs_vllm

# The long prompt that encourages the base model to write more text in my style
with open("jdp_lorem_ipsum.txt") as infile:
    prompt = infile.read()

texts = []
pbar = tqdm(total=10000)
while len(texts) < 10000:
    # Generate 2048 token passages in batches of 128 against the vllm server
    texts += generate_outputs_vllm("mistralai/Mixtral-8x22B-v0.1", prompt, 2048, n=128, port=5001)
    pbar.update(128)
    # Checkpoint to disk every 2048 passages so a crash doesn't lose the run
    if len(texts) % 2048 == 0:
        with open("lorem_ipsum.json", "w") as outfile:
            json.dump(texts, outfile)

with open("lorem_ipsum.json", "w") as outfile:
    json.dump(texts, outfile)

Filtering The Synthetic Placeholder Text

The upside of using a base model to generate the placeholder text is that I get a wider variety of data. The downside is that 'a wider variety' means more of the data is bad. This can be mitigated through the use of in-context classifiers. I make a series of logit evaluator questions to rank the passages with:

==[PREAMBLE]==
Answer yes or no and only yes or no.

==[Principle: Well written; Weight: 1.0; Answer: Yes]==
{preamble}

{prompt}
<passage>
{response}
</passage>

Is the passage well written?

==[Principle: Coherent; Weight: 1.0; Answer: Yes]==
{preamble}

{prompt}
<passage>
{response}
</passage>

Is the passage coherent? Is it high quality writing that expresses a single
narrative of considered thought?

==[Principle: Quotable; Weight: 1.0; Answer: Yes]==
{preamble}

{prompt}
<passage>
{response}
</passage>

Does the passage seem quotable? Would it appear on a quotes page for this author?

==[Principle: True, Kind, Necessary; Weight: 1.0; Answer: Yes]==
{preamble}

{prompt}
<passage>
{response}
</passage>

Is the content in this passage two of true, kind, necessary? Is it fair to its subject?

Crucially, I don't just write these and then run them over the dataset under the assumption that they mean what I think they mean. In-context classifiers are janky and it's not always clear how a model will interpret a question until you try it on real examples. So I make a list of test cases and then use a script I wrote to see how those cases actually get scored in practice by my evaluation rubric. One productive flow is to use this in conjunction with the ranking script to expand the list of test cases. You make some initial test cases, run the ranking script over them, and then skim-read the data from top ranked to lowest. If you spot something obviously ranked higher than it should be, or scroll to the bottom and see something ranked lower than it should be, you can add these as test cases and try reworking the rubric to rank them more appropriately. I didn't put a huge amount of effort into the grading rubric for this data because as previously stated the semantics aren't really the core of the Easy Prose Repair Diffs set. However I did put in nonzero effort and made sure the scores I was getting were mostly sensible.
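
For illustration, a minimal sketch of what checking hand-labeled test cases against the rubric might look like, with score_passage() standing in for the logit evaluator described in the next paragraphs:

def check_rubric(test_cases, score_passage):
    # test_cases: list of (passage, should_score_high) pairs labeled by hand.
    # score_passage(passage) -> float in [0, 1] is a stand-in for the evaluator.
    for passage, should_score_high in test_cases:
        score = score_passage(passage)
        flag = "ok"
        if should_score_high and score < 0.5:
            flag = "unexpectedly low"
        elif not should_score_high and score >= 0.5:
            flag = "unexpectedly high"
        print(f"{score:.2f} {flag}: {passage[:60]!r}")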

I make a point of reading the synthetic data as I make it by printing it to the terminal and piping into less. In general it's crucial to read your dataset, even if it's a bit boring. If nothing else it helps you notice mistakes in your code that you need to fix for the dataset to actually be correct. But it also gives you an opportunity for reflection, judgment, noticing things that are overrepresented. In the case of something you're filtering with in-context classifiers it also gives you a sense of whether those filters are working or not. You can read from the top of the ranked list to get a sense of whether the top of the heap is actually good or not. You can read from the bottom of the list to figure out if the false negative rate is too high. Personally I focus a lot more on whether the top is actually good than if the bottom contains too much good stuff. A high quality set that's less comprehensive than it could be is way less of an issue than a set that's just objectively kind of bad.

The actual evaluation is done by a script similar to the one I use to evaluate the test cases. Both work by getting the top n logits for yes/no questions from vllm and aggregating the most common forms of answering yes and no to get an overall probability for the predicate. Leveraging vllm like this makes the scripts easier to run from local machines that lack dedicated GPUs without committing myself to developing a fast inference server. It also makes the overall methodology more compatible with the hosted model APIs that are more commonly used for inference than 1st party GPUs.
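
As a rough sketch of that aggregation, here is roughly how a yes/no probability could be pulled out of vllm's OpenAI-compatible completions endpoint. This is a simplified stand-in for the actual weave evaluator, and the set of token variants counted is a judgment call:

import math
import requests

def yes_no_probability(question_prompt, port=5001,
                       model="mistralai/Mixtral-8x22B-v0.1"):
    # Ask for a single token, read the top logprobs off the response, then
    # compare the probability mass on "yes"-like vs "no"-like tokens.
    resp = requests.post(f"http://localhost:{port}/v1/completions",
                         json={"model": model,
                               "prompt": question_prompt,
                               "max_tokens": 1,
                               "logprobs": 100}).json()
    top_logprobs = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    yes_mass = no_mass = 0.0
    for token, logprob in top_logprobs.items():
        word = token.strip().lower()
        if word == "yes":
            yes_mass += math.exp(logprob)
        elif word == "no":
            no_mass += math.exp(logprob)
    if yes_mass + no_mass == 0:
        return 0.5  # the model answered with something else entirely
    return yes_mass / (yes_mass + no_mass)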

Synthetic Corruptions

Once we have the base passages to work from we get to the meat of the pipeline, which is the synthetic corruption passes. What makes these diffs 'easy' is that the synthetic corruptions in this dataset are traditional programs which perform simple operations like swapping characters or deleting strings. Here are a few to give you an idea:

import random

def transpose_substrings(passage1, passage2):
    # Choose random substrings from both passages
    substring1_start = random.randint(0, len(passage1) - 1)
    substring1_end = random.randint(substring1_start, substring1_start + 512)
    substring1 = passage1[substring1_start:substring1_end]

    substring2_start = random.randint(0, len(passage2) - 1)
    substring2_end = random.randint(substring2_start, substring2_start + 512)
    substring2 = passage2[substring2_start:substring2_end]

    # Replace the substring in the first passage with the substring from the second passage
    corrupted_passage = passage1.replace(substring1, substring2, 1)

    return corrupted_passage, (substring1_start, substring1_end, len(substring2))

def shuffle_word_middle(passage):
    # Split the passage into a list of words
    words = passage.split()

    # If there are no words, return the original passage
    if not words:
        return passage, tuple()

    # Choose a random word to shuffle
    word_index = random.randint(0, len(words) - 1)
    word = words[word_index]

    # Shuffle the middle characters of the word
    if len(word) > 2:
        mid_chars = list(word[1:-1])
        random.shuffle(mid_chars)
        shuffled_word = word[0] + ''.join(mid_chars) + word[-1]
    else:
        shuffled_word = word

    # Replace all instances of the original word with the shuffled version
    corrupted_passage = ' '.join([shuffled_word if w == word else w for w in words])

    return corrupted_passage, (word_index,)

As a matter of practicality I let Mistral-large (specifically through Le Chat) write these functions and then modified them when they were subtly off. I also used it to generate the list of synthetic corruptions I should do, starting from the three or so first ideas that popped into my head. It's common for me to use Le Chat to iterate on a list of ideas, themes, or other brainstorming tasks by giving it what I have so far and then rerunning the prompt with its best ideas added to the list. I specifically pick Le Chat because my understanding of its ToS doesn't prohibit developing LLMs with it in the way other providers' ToS seem to. I must of course reiterate that I am not a lawyer and you should do your own research.

The script which does the synthetic corruptions has some other crucially important features beyond the corruptions themselves. One of these is a log of which corruption operations were applied to a passage.

    if corruption_function == transpose_substrings:
        directions = [
            "Undo substring transpose at [{},{}] with replacement by {} character string.",
            "The span originally at [{}, {}] was replaced by a {} character span.",
            "Span replacement at [{}, {}] resulting in substring of length {}.",
            "Near [{}, {}] a string transposition between two passages inserted {} letters.",
            "A string from another passage was substituted in for a string from this passage.",
            "A substring from one passage was transposed onto the current passage.",
            "Substring transposition took place.",
            "Reverse a substring transposition."
        ]
        passage2 = random.choice(passages)["text"]
        corrupted, direction_params = corruption_function(passage, passage2)
        direction = random.choice(directions)
        if "{}" in direction:
            direction = direction.format(*direction_params)
        if directions_trace:
            direction = "transpose_substrings: " + direction
    elif corruption_function == substring2gibberish:
        directions = [
            "Replace substring gibberish at [{},{}] with inferred original text.",
            "Span near ({},{}) has been replaced with noise.",
            "Undo corruption near ({}, {}).",
            "{},{} has damaged spans.",
            "Reverse gibberish span.",
            "A span was turned into gibberish.",
            "Noise span detected.",
            "Obvious corruption found in passage, preparing infill..."
        ]
        corrupted, direction_params = corruption_function(passage)
        direction = random.choice(directions)
        if "{}" in direction:
            direction = direction.format(*direction_params)
        if directions_trace:
            direction = "substring2gibberish: " + direction

...

This log serves multiple important purposes:

  1. It probably makes the task easier for the model by giving it some hints as to what it should be doing in the repair diff.

  2. It trains the model to diagnose the problems in the passage before attempting to fix them.

  3. It provides a natural intervention point for the user to tell the model what to do, since the model will presumably learn to make its edits based on what it says to do in the corruption operation log. Hopefully this generalizes to following other kinds of instructions.

In order to make the logs I use a few important techniques. One is to randomly pick a logline from eight templates per synthetic corruption pass. This provides variation and helps prevent the 'dead pixel' effect you'd get if you only had one or two templates per corruption type. Half of the eight templates incorporate some kind of location information to help the model find the place where an error was introduced and correct it. I also flip a coin for whether the loglines have function name prefixes or not. Varying whether they have the prefixes probably helps with generalization by creating two canonical formats you can prompt the diff generator with.

I roll a number between 1 and 10 to determine how many synthetic corruptions I perform on a given passage. Varying the number of corruptions gives the model a smoother curriculum to learn the meaning of the loglines and how to correct the errors. It can learn the easy items in a batch and eventually work its way up to understanding the harder ones.

Making The Diffs

Easy Prose Repair Diffs includes 3 diff formats to train on: GNU diff, git diff, and diff match patch format. These are made by writing the original and corrupted version of the passage to temporary files and then using the relevant command line tools to create the diff. In the case of diff match patch it is possible to create the diff without resorting to CLI tools. I had Le Chat write the functions for interfacing with the diff tools. This is what the diff generator for GNU diff looks like:

import os
import subprocess
import tempfile

def get_gnu_diff(file_content, changed_content):
    # Create two temporary files
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as orig_file, \
         tempfile.NamedTemporaryFile(mode='w', delete=False) as changed_file:
        # Write the original and changed content to the files
        orig_file.write(file_content)
        changed_file.write(changed_content)

    # Use diff to get the differences
    result = subprocess.run(['diff', '-u', orig_file.name, changed_file.name], stdout=subprocess.PIPE)

    # Clean up the temporary files
    os.remove(orig_file.name)
    os.remove(changed_file.name)

    # Return the decoded diff output
    return result.stdout.decode()
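
For reference, the diff match patch version can be produced without shelling out by using the diff-match-patch Python package. A sketch; the helper name mirrors get_gnu_diff but is mine, not necessarily what the real script uses:

import diff_match_patch as dmp_module

def get_dmp_diff(file_content, changed_content):
    # Build a patch that transforms the corrupted text into the clean text
    # and serialize it in the diff-match-patch textual patch format.
    dmp = dmp_module.diff_match_patch()
    patches = dmp.patch_make(file_content, changed_content)
    return dmp.patch_toText(patches)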

I incidentally shorten the lorem ipsum placeholder text to 1200 Mistral tokens instead of the original 2048 because I want to be able to fit the corrupted and original text into a 4096 token context window along with the loglines.

Prepending Instructions

Because this is a RetroInstruct component it requires wrapping instructions telling the model what to do. These provide the first intervention point for users by letting them tell the LLM what it's going to be doing in the first place and providing context for the task text they want it to repair. Again it's important to avoid overfitting, so I make a separate set of 8 instructions for each diff format. I break it down like this because the user will probably want to specify which diff format the model gives them. Yet not every instruction mentions the diff format because I want it to be able to do both conditional and unconditional generation of repair diffs. My instruction prefixes look like this:

gnudiff_instructions = [
    "Repair the following context window by diagnosing its flaws and writing a diff in GNU diff format to fix them.",
    "A series of corruptions have been applied to the following passage. Find the problems and repair them with a GNU diff.",
    "Write a GNU diff to repair the problems in this text.",
    "List the problems in this passage and author a GNU diff which fixes them.",
    "Think step by step to diagnose the issues with this text then amend it with a GNU diff.",
    "i need you to fix up this writing for me using a GNU diff",
    "Use a GNU diff to repair the following passage.",
    "Produce a diff in GNU format to patch the problems in this prose."
]

gitdiff_instructions = [
    "The following text has been intentionally corrupted. Use a git diff to repair it.",
    "Use your expertise to diagnose the problems with this passage and provide a git diff to fix them.",
    "The following text has been tampered with. Use a git diff to undo the damage.",
    "This text has been altered in multiple ways. Write a git diff to repair it.",
    "Diagnose the issues with this text and provide a git diff to repair them.",
    "Use your knowledge of git diffs to repair the following text.",
    "Use a git diff to correct the errors in this passage.",
    "This text has been damaged. Use a git diff to diagnose and repair the issues."
]

dmp_instructions = [
    "The following text has been corrupted in a way that the diff-match-patch format can fix.",
    "Using diff-match-patch commands, repair the following text.",
    "List the problems in this passage and write a diff-match-patch patch to fix them.",
    "The text below has been tampered with. Use a diff_match_patch to fix it.",
    "Diagnose the corruptions in this text and use a diff_match_patch to fix them all.",
    "Use your knowledge of diff-match-patch format to diagnose and repair the errors in this passage.",
    "Find the issues with this passage and write a diff-match-patch to repair them.",
    "Infer the issues with this text and generate a diff_match_patch_patch_toText formatted patch to repair it."
]

I wrote the first eight myself but I'll admit that by this point in the process I was getting a bit impatient and had Le Chat write me variations I could modify into the eight templates for the other two formats.

Putting It Together

Now that we have all the basic ingredients we just have to execute the glue code binding them together into a JSON entry. This is what the process looks like for creating each row in the resulting dataset:

    roll = random.randint(1,10)
    for i in range(roll):
        corrupted_passage, logline = do_corruption(corrupted_passage,
                                                   lorem_ipsum,
                                                   directions_trace)
        log += (logline + "\n")
    gnudiff = get_gnu_diff(corrupted_passage, text)
    gitdiff = get_git_diff(corrupted_passage, text)
    dmpdiff = get_patch_match_diff(corrupted_passage, text)

    diffs.append(
        {
            "gnudiff_instruction":random.choice(gnudiff_instructions),
            "gitdiff_instruction":random.choice(gitdiff_instructions),
            "dmpdiff_instruction":random.choice(dmp_instructions),
            "text_corrupted":corrupted_passage,
            "operations":log,
            "gnudiff":gnudiff,
            "gitdiff":gitdiff,
            "dmpdiff":dmpdiff,
            "text_clean":text
        }
    )

I try my best to store information in a structured way so that downstream users of RetroInstruct components can rearrange the individual elements into whatever format makes sense for their use case. For example they may have an instruction format which requires them to change how certain parts of the training samples are arranged. They may also want to train on several templates or arrangements of the same data to help the model generalize from it. For example in this case it may be helpful to have a version of the template where the loglines are part of the users request and a version where the loglines are part of the AI's response so that the model is more likely to generalize that it should follow instructions given by the user after the task text.

In this particular case the loglines are stored as one string, but I provide an affordance for downstream users to recover the original lines by simply splitting the string by newlines. I also store a randomly chosen instruction prefix for each kind of instruction and store the diffs for each kind of diff in the row. The idea behind this is to avoid an antipattern some RetroInstruct components have where, to use the dataset at the dataloader stage, you need to incorporate an extraneous bank of prompts or template elements that could just be included in the rows. Again I provide the affordance of recovering the original banks from the dataset: you can just take the unique entries from the relevant columns and then pick the instruction strings in whatever way you want, perhaps expanding the set by writing your own and then reassigning, etc.
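
For example, a downstream user could recover the instruction banks from the published rows with something like:

import json

with open("train.json") as infile:
    rows = json.load(infile)

# The unique values of each instruction column are exactly the original banks.
gnudiff_instructions = sorted({row["gnudiff_instruction"] for row in rows})
gitdiff_instructions = sorted({row["gitdiff_instruction"] for row in rows})
dmp_instructions = sorted({row["dmpdiff_instruction"] for row in rows})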

Finishing Up

Now that we have our rows it's time to shuffle the dataset again as a courtesy to downstream users. While we might assume that any serious ML practitioner will shuffle the dataset on their own, it's important to remember that this dataset is going on HuggingFace. Users that do data streaming might not have the opportunity to shuffle the whole dataset before use, so it's only polite to help them out. We also want to split our data into a training and validation set so we can see if our model is actually learning anything from it.

# Make train and val split

random.shuffle(diffs)

val_size = round(len(diffs) / 10) 
train_size = len(diffs) - val_size

os.chdir(working_dir)

with open("train.json", "w") as outfile:
    json.dump(diffs[:train_size], outfile)

with open("val.json", "w") as outfile:
    json.dump(diffs[train_size:], outfile)
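
As a sketch of the downstream side, the resulting files can be loaded locally, or streamed once uploaded to the Hub, with the datasets library. The repository name below is a placeholder:

from datasets import load_dataset

# Load the local JSON splits produced above...
dataset = load_dataset("json", data_files={"train": "train.json",
                                           "validation": "val.json"})

# ...or stream from the HuggingFace Hub without downloading the whole set.
# "user/easy-prose-repair-diffs" is a placeholder repository name.
streamed = load_dataset("user/easy-prose-repair-diffs", split="train",
                        streaming=True)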

Dataloader

I write a quick dataloader so I can get a good look at the data as the model might actually see it:

import json
import random

with open("train.json") as infile:
    diffs = json.load(infile)

diff_formats = ["gnudiff", "git", "dmp"]

for diff in diffs:
    diff_format = random.choice(diff_formats)
    if diff_format == "gnudiff":
        instruction = diff["gnudiff_instruction"]
        diff_text = diff["gnudiff"]
    elif diff_format == "git":
        instruction = diff["gitdiff_instruction"]
        diff_text = diff["gitdiff"]
    elif diff_format == "dmp":
        instruction = diff["dmpdiff_instruction"]
        diff_text = diff["dmpdiff"]
    print(instruction, end="\n\n")
    print("<passage>")
    print(diff["text_corrupted"])
    print("</passage>", end="<|end|>")
    print("<diagnosis>")
    print(diff["operations"], end="")
    print("</diagnosis>")
    print("<diff>")
    print(diff_text, end="\n")
    print("</diff>")
    print("<repaired>")
    print(diff["text_clean"])
    print("</repaired>")

While reviewing my training samples to find one for this post I realized some of the synthetic corruption passes weren't working as intended and fixed them. This is a good reminder of the importance of reading your data, but also of unit testing your code. If I were doing this particular component again I would probably just eat the time cost and unit test all the synthetic corruptions. One benefit of them being meant to simulate kinds of corruption and damage is that they don't need to work exactly as intended, just mostly as intended in most circumstances.
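
A unit test for a corruption pass can be as simple as checking the invariants you care about. A sketch for shuffle_word_middle, assuming the function is importable from a module named corruptions (the module name is illustrative):

from corruptions import shuffle_word_middle

def test_shuffle_word_middle_preserves_characters():
    passage = "The quick brown fox jumps over the lazy dog"
    corrupted, _ = shuffle_word_middle(passage)
    # Shuffling a word's middle must not add or remove characters or words.
    assert sorted(corrupted) == sorted(passage)
    assert len(corrupted.split()) == len(passage.split())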

Here's an example of what the training samples look like once we print them out in a plausible instruction tuning format:

The following text has been tampered with. Use a git diff to undo the damage.

<passage>
When we talk about legibility we're talking about formalizing complex or
tra more leveragefor our own designs or understand better where the implicit designs of others
lead or do not lead.

Much of the 20th century was shaped by what turned out to be false ideas about
computers and information theory. People did not understand the limitations
of computation or the capabilities of analog computers, and in many cases those
who did were not listened te we now know with certainty that analog
computers had many advantages in speed and price to performance over digital
computers circa 1970 which digital computers would only catch up to 50 years
later. But kind a of information theory fanaticism in government, academia, and
industry led to the conclusion that digital computers would always win over
analog because they were easier to formally model. This idea was reinforced
by the discovery of incompleteness, which falsely seemed to imply that discrete
systems had mathematical properties which were impossible on analog systems.
The fact that analog systems with infinite precision are mathematically possible
created the false impression to many that analog computation was therefore
formally uninteresting or undesirable compared to digital computation.

Of course the 'incompleteness' result of Godel only applies in the sense that
a formal system is never capable of modeliNg up an informal system. A formal system
is strictly weaker than an analog system which has the same capability in real
use for all relevant purposes.

</passage><|end|><diagnosis>
adjacent_word_swap: Found word swap corruption in vicity of 112.
duplicate_word: Word was duplicated near 107 word index.
delete_substring: Detected removal of string near former span [76,135].
swap_capitalization: Swapped capitalization near 1398.
delete_substring: A string was deleted from the passage.
delete_whitespace_character: Deleted the 90 whitespace character.
</diagnosis>

<diff>
diff --git a/test.txt b/test.txt
index 92a7e10..880bad1 100644
--- a/test.txt
+++ b/test.txt
@@ -1,14 +1,15 @@
 When we talk about legibility we're talking about formalizing complex or
-tra more leveragefor our own designs or understand better where the implicit designs of others
+traditionally informal domains in ways that allow us to create more leverage
+for our own designs or understand better where the implicit designs of others
 lead or do not lead.
 
 Much of the 20th century was shaped by what turned out to be false ideas about
 computers and information theory. People did not understand the limitations
 of computation or the capabilities of analog computers, and in many cases those
-who did were not listened te we now know with certainty that analog
+who did were not listened to. For example we now know with certainty that analog
 computers had many advantages in speed and price to performance over digital
 computers circa 1970 which digital computers would only catch up to 50 years
-later. But kind a of information theory fanaticism in government, academia, and
+later. But a kind of information theory fanaticism in government, academia, and
 industry led to the conclusion that digital computers would always win over
 analog because they were easier to formally model. This idea was reinforced
 by the discovery of incompleteness, which falsely seemed to imply that discrete
@@ -18,6 +19,6 @@ created the false impression to many that analog computation was therefore
 formally uninteresting or undesirable compared to digital computation.
 
 Of course the 'incompleteness' result of Godel only applies in the sense that
-a formal system is never capable of modeliNg up an informal system. A formal system
+a formal system is never capable of modeling an informal system. A formal system
 is strictly weaker than an analog system which has the same capability in real
 use for all relevant purposes.

</diff>
<repaired>
When we talk about legibility we're talking about formalizing complex or
traditionally informal domains in ways that allow us to create more leverage
for our own designs or understand better where the implicit designs of others
lead or do not lead.

Much of the 20th century was shaped by what turned out to be false ideas about
computers and information theory. People did not understand the limitations
of computation or the capabilities of analog computers, and in many cases those
who did were not listened to. For example we now know with certainty that analog
computers had many advantages in speed and price to performance over digital
computers circa 1970 which digital computers would only catch up to 50 years
later. But a kind of information theory fanaticism in government, academia, and
industry led to the conclusion that digital computers would always win over
analog because they were easier to formally model. This idea was reinforced
by the discovery of incompleteness, which falsely seemed to imply that discrete
systems had mathematical properties which were impossible on analog systems.
The fact that analog systems with infinite precision are mathematically possible
created the false impression to many that analog computation was therefore
formally uninteresting or undesirable compared to digital computation.

Of course the 'incompleteness' result of Godel only applies in the sense that
a formal system is never capable of modeling an informal system. A formal system
is strictly weaker than an analog system which has the same capability in real
use for all relevant purposes.

</repaired>

Concluding Note On Frameworks And Automation

One thing I've spent a lot of time thinking about while making RetroInstruct components is what parts could be automated and what parts still require human intervention. I've focused a lot more on developing technique than I have on developing a framework because my honest impression is that synthetic data is something like a form of software development that won't be fully automated until we automate software authorship in general. I can imagine the short term automation of particular design patterns and workflows. I can also imagine sufficiently reified technique evolving into frameworks. But the basic problem is that a framework is a kind of automation and you can't really automate a process until you're familiar with it. If you're reading along with all this and going "these are clever ideas but the tooling seems kind of primitive" you're entirely right. Synthetic data is in the exploration phase, not all the good ideas have been found yet and not every kind of useful synthetic data has been made before.

What I do when I want to make a new synthetic dataset is I take the closest code I wrote for a previous synthetic dataset and adapt it. I suggest you do the same. As I do this I find that patterns emerge from the shared structure between problems. The first assembly programmers created programming languages by taking the patterns they were implementing repeatedly in machine code and standardizing them. I suspect the first good abstractions for synthetic corpus authorship will follow a similar trajectory, falling naturally out of repeatedly adapting programs to new problems. You can assist with this process by designing, implementing, and publishing your own synthetic data pipelines. The average synthetic data pipeline takes much less time to develop than the average hobby software project and much less coordination than something like OpenAssistant. An individual can design a pipeline, rent some GPUs, and publish the resulting code + data even if they're the only one that believes in their idea until other people see it. If even just a few hundred people did this and their work was put together it would well outstrip the task diversity in FLAN. Yet it also wouldn't replace FLAN; we could train on FLAN and that new dataset at the same time. None of the currently well advertised ways to contribute to deep learning are particularly accessible.

As far as I know synthetic data is the only truly accessible way to contribute to open source AI for someone with a normal software development background. Since the bottleneck to automating most pipelines is subjective evaluation of whether a particular prompt works or not I expect it to remain an available avenue to contribute for at least a while.