Comment On 'List your AI X-Risk cruxes!'

John David Pressman

Original Comment Here

It would take many hours to write down all of my alignment cruxes but here are a handful of related ones I think are particularly important and particularly poorly understood:

Does 'generalizing far beyond the training set' look more like extending the architecture or extending the training corpus? There are two ways I can foresee AI models becoming generally capable and autonomous. One path is something like the scaling thesis: we keep making these models larger or their architectures more efficient until we get enough performance from few enough datapoints for AGI. The other path is suggested by the Chinchilla data scaling rules and uses various forms of self-play to extend and improve the training set so you get more from the same number of parameters. Both curves are important, but right now the data scaling curve seems to have the lowest-hanging fruit. We know that large language models extend at least a little bit beyond the training set. This implies it should be possible to extend the corpus slightly out of distribution by rejection sampling with "objective" quality metrics and then tuning the model on the resulting samples.
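To make that loop concrete, here is a minimal sketch of the rejection-sampling corpus extension I have in mind. Every name in it (`generate`, `quality_score`, `finetune`) is a hypothetical stand-in for a real sampler, an "objective" checkable metric (a verifier, test suite, or reward model), and a tuning step; the point is only the shape of the loop, not a working implementation.

```python
import random

def generate(model, prompt):
    # Stand-in: sample a candidate continuation from the current model.
    return prompt + " " + random.choice(["candidate A", "candidate B"])

def quality_score(sample):
    # Stand-in: an external, checkable metric (proof checker, tests, reward model).
    return random.random()

def finetune(model, batch):
    # Stand-in: tune the model on the accepted synthetic samples.
    return model

def extend_corpus(model, prompts, threshold=0.9, rounds=3):
    corpus = []
    for _ in range(rounds):
        accepted = []
        for prompt in prompts:
            sample = generate(model, prompt)
            # Rejection sampling: keep only samples the metric endorses.
            if quality_score(sample) >= threshold:
                accepted.append(sample)
        corpus.extend(accepted)
        # Each round tunes the model on data slightly outside its old distribution.
        model = finetune(model, accepted)
    return model, corpus
```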

This is a crux because it's probably the strongest controlling parameter for whether "capabilities generalize farther than alignment". Nate Soares's implicit model is that architecture extensions dominate. He writes in his post on the sharp left turn that he expects AI to generalize 'far beyond the training set' until it has dangerous capabilities but relatively shallow alignment. This is because the generator of human values is more complex and ad hoc than the generator of e.g. physics. So a model that zero-shot generalizes about the shape of what it sees from a fixed corpus will arrive at reasonable approximations of physics, whose flaws interaction with the environment will correct, and less-reasonable approximations of the generator of human values, which are potentially both harder to correct and optional on its part to fix. By contrast, if human-readable training data is being extended in a loop then it's possible to audit the synthetic data and intervene when it begins to generalize incorrectly. It's the difference between trying to find an illegible 'magic' process that aligns the model in one step vs. doing many steps and checking their local correctness. Eliezer Yudkowsky explains a similar idea in List of Lethalities as there being 'no simple core of alignment' and nothing that 'hits back' when an AI drifts out of alignment with us. The data-loop approach resolves the problem by putting humans in a position to 'hit back' and ensure alignment generalization keeps up with capabilities generalization.
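As an illustration of what 'hitting back' might look like mechanically, the sketch below adds a hypothetical audit gate to a synthetic-data loop like the one above: a human reviews a random slice of each accepted batch before it is allowed into the training corpus. `audit_batch` and `checked_update` are invented names for the sake of the example, not an existing API.

```python
import random

def audit_batch(batch, sample_size=20):
    # Surface a random slice of the synthetic batch for human review.
    for sample in random.sample(batch, min(sample_size, len(batch))):
        verdict = input(f"Train on this sample?\n{sample}\n[y/n] ")
        if verdict.strip().lower() != "y":
            # The auditor judged the batch to be generalizing incorrectly.
            return False
    return True

def checked_update(model, batch, finetune):
    # The 'hit back' step: refuse to fold un-audited data into the corpus.
    if not audit_batch(batch):
        raise RuntimeError("Synthetic batch failed audit; intervene before continuing.")
    return finetune(model, batch)
```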

A distinct but related question is the extent to which the generator of human values can be learned through self-play. It's important to remember that Yudkowsky and Soares expect 'shallow alignment' because consequentialist-materialist truth is convergent but human values are contingent. For example, there is no objective reason why you should eat the peppers of plants that develop noxious chemicals to stop themselves from being eaten, but humans do this all the time and call them 'spices'. If you have a MuZero-style self-play AI that grinds, say, Lean theorems, and you bootstrap it from human language, then over time a greater and greater portion of the dataset will be Lean theorems rather than anything to do with the culinary arts. A superhuman math agent will probably not care very much about humanity. Therefore, if the self-play process for math is completely unsupervised but the self-play process for 'the generator of human values' requires a large relative amount of supervision, then the usual outcome is that aligned AGI loses the race to pure consequentialists pointed at some narrow and orthogonal goal like 'solve math'. Furthermore, if the generator of human values is difficult to compress then it will take more to learn and be more fragile to perturbation and damage. That is, rather than thinking in terms of whether or not there is a 'simple core to alignment', what we care about is the relative simplicity of the generator of human values vs. other consequentialist objectives.
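The 'loses the race' point is about relative rates, so a toy calculation may help. The numbers below are made up purely to show the dynamic: if the unsupervised math loop adds data freely while the values loop is throttled by human supervision, the values share of the corpus collapses.

```python
math_per_round = 1_000_000    # unsupervised self-play samples per round (assumed)
values_per_round = 10_000     # supervision-limited samples per round (assumed)
corpus = {"math": 0, "values": 100_000}  # bootstrap corpus of human text

for round_ in range(1, 11):
    corpus["math"] += math_per_round
    corpus["values"] += values_per_round
    share = corpus["values"] / (corpus["math"] + corpus["values"])
    print(f"round {round_:2d}: values share of corpus = {share:.1%}")
```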

My personal expectation is that the generator of human values is probably not a substantially harder math object to learn than human language itself. Nor are the two distinct: human language encodes a huge amount of the mental workspace, and at this point it is clear that it's more of a 1D projection of higher-dimensional neural embeddings than 'shallow traces of thought'. The key question then is how reasonable an approximation of English large language models learn. From a precision-recall standpoint it seems pretty much unambiguous that large language models include an approximate understanding of every subject discussed by human beings. You can get a better intuitive sense of this by asking them to break every word in the dictionary into parts. This implies that their recall over the space of valid English sentences is nearly total. Their precision, however, is still in question. The well-worn gradient-methods doom argument is that if we take superintelligence to have general-search-like Solomonoff structure over plans (i.e. instrumental utilities), then it is not enough to learn a math object that is in-distribution inclusive of all valid English sentences; it must also exclude invalid sentences that score highly in our goal geometry but imply squiggle-maximization in real terms. That is, Yudkowsky's theory says the learned objective needs to be so robust to adversarial examples that superhuman levels of optimization against it don't yield Goodharted outcomes. My intuition strongly says that real agents avoid this problem by having feedback-loop structure instead of general-search structure (or perhaps a general search whose hypothesis space is constrained by a feedback loop), and that a solution to this problem exists, but I have not yet figured out how to state it rigorously.
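To gesture at that last intuition, here is a toy contrast between the two structures. Everything in it is invented for illustration: a three-plan 'plan space' with a learned `approx_value` scorer that badly overrates one adversarial plan, and a `real_value` standing in for feedback from actually trying small steps of a plan. The unconstrained argmax picks the plan its scorer overrates; the feedback-constrained search only admits plans whose predicted consequences survive checking.

```python
plans = {
    "ordinary plan": {"approx_value": 1.0, "real_value": 1.0},
    "clever plan":   {"approx_value": 2.0, "real_value": 1.5},
    "squiggle plan": {"approx_value": 9.9, "real_value": -100.0},  # adversarial example
}

def general_search(plans):
    # Search the whole plan space against the learned objective (Goodhart-prone).
    return max(plans, key=lambda name: plans[name]["approx_value"])

def feedback_constrained_search(plans, tolerance=1.0):
    # Only admit a plan into the hypothesis space after its local consequences
    # have been checked against feedback and found close to the model's prediction.
    checked = {
        name: p for name, p in plans.items()
        if abs(p["real_value"] - p["approx_value"]) < tolerance
    }
    return max(checked, key=lambda name: checked[name]["approx_value"])

print(general_search(plans))              # -> 'squiggle plan' (Goodharted outcome)
print(feedback_constrained_search(plans)) # -> 'clever plan'
```

This is only a cartoon of the feedback-loop idea, not the rigorous statement I said I am still missing, but it shows why constraining the hypothesis space with checked feedback changes which plan wins.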