
Gaining clarity on Automated Alignment Research

I've been thinking about automated alignment research for a long time.

I gave a talk on the topic a few months ago, and I've started writing a longer treatment; hopefully I'll get around to publishing the full thing.

Below, I want to articulate a common crux of disagreement that leads folks to talk past each other.


Most AI safety plans include “automating AI safety research.” There’s a need for more clarity about what that actually looks like.

There are at least four things that get conflated in the term “automated research”:

  1. AI uses search to output what has already been discovered (e.g., it finds the solution in an existing paper or papers).
  2. AI uses search to find pieces of a solution that come together to solve a problem (ideally in a verifiable domain, e.g., a Lean proof).
  3. AI agents use existing research techniques we already know about, and apply them to a variety of new experiments. An example of AI safety research would be using insights/techniques from subliminal learning and emergent misalignment to study new dataset splits and models trained in new ways, while applying existing interpretability techniques with an auditor agent.
  4. Getting AIs to create novel techniques that substantially improve the domain in question. This is like getting an AI to come up with a new paradigm, which may change how we even think about that research area.

For AI safety, the crux of many disagreements is whether one believes some combination of the following:

  • 3 & 4 are meaningfully different in ways that make it substantially harder to get 4 than it is to get 3. Some people even seem to fail to disentangle the two and end up convinced that AIs are solving research as some singular thing.
  • 4-level capabilities are already in the superintelligence regime, so it defeats the purpose of using them for safety if you don't already have guarantees that the system is safe.
  • When talking about superintelligence (the kind that, e.g., can start and grow entire large-scale businesses on its own, solve long-term, complex goals like eliminating cancer, and deal with any change in the world that goes beyond its initial training data), AI safety research needs novel paradigm-level breakthroughs (4) to reduce risks to acceptable levels. In other words, you might expect 3 to amount to too much within-paradigm, relatively unenlightened research.
  • 4 is unneeded for a safe transition. Some folks seem to believe that 3 (which could be described as "relatively unenlightened" research) will be enough to align every subsequent AI, even once capabilities are past the 4 level.
  • Scaffolding and inference compute, at a not-much-higher level of capability, are all you need to get 4, and you'll be fine from a safety perspective because current models are already useful for research and don't seem misaligned.
  • 3 may produce good research output (within that set of possible experiments), but you will basically get slop for anything in 4 (anything truly out-of-distribution). So the AI is put through the wringer and believes it has made the next model substantially safer, but, because it is incapable of generalizing well OOD, it fails to align a 4-level model. It has good intentions, but it basically only does good safety work for 3-level models and totally fails at generating sufficient safety research techniques for aligning a 4-level model. It just slops itself into a disaster.
  • Even if 3 is helpful, it doesn't end up meaningfully speeding up safety research relative to the pace of progress toward superintelligent capabilities.
  • 4 involves the AI continually updating its weights, consolidating insights and placing them neatly within its world model. 3 has some sort of disjointed world model that can’t be overcome with fancy scratchpadding and RAG (imagine an AI with a knowledge cutoff in 2023 into which you RAG 2026 research; it’s missing years of buildup in its world model). 3 is suitable for following templates and interpolating within what we know, but it fails to understand what is OOD.

Ultimately, this seems like a highly important question to clarify, since I believe it is driving many people to be optimistic about AI safety progress, at least to the point that it allows them to keep chugging along the capabilities tech tree. Getting clarity, much sooner, on what would convince these people otherwise seems important.


So, for me, I've started to put a higher likelihood on the view that current AIs are effectively massive world models of crystallized intelligence. It just doesn't feel to me that they are generalizing in the way I'd expect if they had "fluid" intelligence.

In terms of the alignment problem, this means that:

  • AIs need to maintain alignment through continual/online learning, and
  • The reward model needs to scale alongside it to help generalize human values out of distribution (a toy sketch of this loop follows below).
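
To make that structural claim concrete, here is a minimal toy sketch (purely illustrative, under my own framing; every name in it is a hypothetical placeholder, not a real training setup): the policy keeps learning as the world shifts, and unless the reward/value model is updated alongside it, nothing in the loop can tell whether the drift is away from human values.

```python
# Toy sketch, not a real training loop. All names are hypothetical placeholders.

def update_reward_model(reward_model, fresh_human_feedback):
    # The reward/value model keeps absorbing new human feedback so that
    # "human values" still cover the distribution the policy now operates in.
    reward_model["represents"] = fresh_human_feedback
    return reward_model

def update_policy(policy, reward_model):
    # The policy is nudged toward whatever the reward model currently scores highly.
    policy["optimizes_for"] = reward_model["represents"]
    return policy

policy = {"optimizes_for": "human values (2024 snapshot)"}
reward_model = {"represents": "human values (2024 snapshot)"}

for new_situation in ["2025 deployment data", "2026 research", "novel OOD domain"]:
    # The crux: skip this line and the policy keeps adapting while the frozen
    # reward model can no longer detect drift away from human values.
    reward_model = update_reward_model(reward_model, f"human feedback on {new_situation}")
    policy = update_policy(policy, reward_model)
    print(new_situation, "->", policy["optimizes_for"])
```

The only point of the sketch is the coupling: continual learning on the capability side has to be matched by continual updating on the value side.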

If your system drifts away from human values, it may simply stay that way.

If current models drift OOD, they likely won't have the internal framework (a human value model) to self-correct toward human values. Physics will bring a model back into alignment with physical reality (e.g., by performing tasks in the real world or writing code that meets specifications and passes unit tests), but there is no such "free" property for human values.
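
To make the asymmetry concrete, here is a small hypothetical example (mine, not drawn from any source): a spec plus a unit test gives a "free," mechanically checkable signal that pushes a model back toward reality, whereas there is no analogous oracle you can call for "is this aligned with human values."

```python
# Hypothetical illustration of the verifiability asymmetry. Names are made up.

def sort_ascending(xs):
    return sorted(xs)

def test_sort_ascending():
    # Verifiable: the spec and the test are a free correction signal.
    # If the model writes this wrong, the failing test pushes it back.
    assert sort_ascending([3, 1, 2]) == [1, 2, 3]

def check_aligned_with_human_values(action):
    # Not verifiable: there is no ground-truth oracle to call here. Any check we
    # write is itself a learned or hand-specified proxy, and that proxy can drift
    # along with the model once it moves out of distribution.
    raise NotImplementedError("no 'free' feedback signal exists for this check")

test_sort_ascending()
```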

While crystallized intelligence can be highly capable, current models still seem to lack true reasoning (they have some weak form of it) and are poor at generalizing OOD. They can follow templates and interpolate within the pre-training data distribution quite well, but they seem to lack something important, which may contribute to alignment instability.

That said, you do seem to benefit from being in a world where the AIs are mostly aligned at this stage. But I think it’s so easy to dupe oneself into thinking that AI models' current capabilities must demonstrate we’re in an aligned-by-default world, particularly because (imo) MIRI persistently made it seem as if, by the time AIs were this capable at code, for example, we’d already be dead.