Better model diffing is needed

An alignment technique I wish existed is model diffing: a way to understand how a model evolves through training or interventions (like model editing), to serve as a signal that guides training (with a strong control feedback mechanism), and to study model drift.

All current techniques seem too costly, or not unsupervised or active enough (Petri-style stuff is nice, but it feels like we need something a bit more fundamental, or at least something that gives the agent a new set of tools), etc.

If people are interested in the alignment implications of long-horizon RL, I think one key consideration is that future models will eventually discard the context-specific heuristics they've been relying on, because those heuristics will be insufficient for increasingly complex problems we don't know how to solve (e.g., open-ended research). Therefore, I'd be curious whether model diffing techniques could pick up on these potentially subtle changes in the model.


This would be follow-up work on previous research I've done with collaborators. I've been trying to think about whether such things would be valuable for an AI safety startup, but I'm iffy on the idea because it always comes back to, "well, am I impacting internal deployment at AGI labs in any way?" It's clearly an important thing to figure out in the context of continual learning (as we pointed out in the research agenda post), though.

When we worked on this, we (mostly Quintin) tried to develop a modified version of a technique called "contrastive decoding", where we'd do model diffing by using the token distributions of two models, M1 and M2, to study which sets of tokens M2 prefers over M1 (or vice versa).
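As a rough illustration (this is not our exact setup; the model names and prompt below are placeholders, and the two models have to share a tokenizer for the comparison to make sense), a token-distribution diff of this kind can be sketched with off-the-shelf HuggingFace models:

```python
# Minimal sketch of diffing two models via their next-token distributions.
# Model names and the prompt are placeholders, not the ones from our work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_logprobs(model, tokenizer, prompt):
    """Log-probabilities over the vocabulary for the token following `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the final position
    return torch.log_softmax(logits, dim=-1)

def top_shifted_tokens(model_a, model_b, tokenizer, prompt, k=20):
    """Tokens whose log-probability increases most going from model_a (M1) to model_b (M2)."""
    diff = next_token_logprobs(model_b, tokenizer, prompt) - next_token_logprobs(model_a, tokenizer, prompt)
    top = torch.topk(diff, k)
    return [(tokenizer.decode([idx]), delta)
            for idx, delta in zip(top.indices.tolist(), top.values.tolist())]

# Hypothetical model pair: a base model (M1) and its instruct-tuned counterpart (M2).
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
m1 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
m2 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
for token, delta in top_shifted_tokens(m1, m2, tok, "Once upon a time"):
    print(f"{token!r}: +{delta:.2f} nats")
```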

The goal was to use the technique to gain some unsupervised understanding of unwanted behavioural side effects (e.g., training an AI to become more of a reasoner somehow impacting its political views). Ultimately, the technique wasn't very useful, and it was fairly costly to run because you had to evaluate a lot of text. The main interesting observation was that one of the base Llama models was far more likely than the instruct model to upweight the "Question:" token immediately after the <|startoftext|> token (we believe this was because Meta did some priming at the end of the base model's training to get it used to the question/answer format).

Anyway, having reliable, cheap sensors we can use throughout training to guide the process, or to keep track of how things are evolving in the network, seems good. That said, I think this could fail if we don't develop techniques that work in the capability regime we're actually worried about, or if we misunderstand key issues like deep deceptiveness:

Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general.

As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.

In the case of deep deceptiveness (model diffing is obviously useful in other cases too), what I hope for most is that changes in the model's internal and external properties can be picked up by model diffing throughout training, and that the signal on the model's cognition is fine-grained enough that it becomes clear when the model is developing cognitive changes tangibly different from those of current LLMs.
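To make the "cheap sensor throughout training" idea a bit more concrete, here is a purely hypothetical sketch (the checkpoint paths, probe prompts, and alert threshold are all invented) that compares successive checkpoints with a simple next-token-distribution divergence. A real technique would need to be much richer than this, and would need to surface internal changes rather than just output-distribution shifts, to have any hope against something like deep deceptiveness.

```python
# Speculative sketch of a drift sensor run between successive training checkpoints:
# score each new checkpoint against the previous one with a next-token-distribution
# divergence on a fixed set of probe prompts, and flag large jumps. Checkpoint paths,
# probe prompts, and the threshold are all invented for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROBES = ["Question:", "My honest view on this policy is", "To finish the task, I will"]
THRESHOLD = 0.5  # arbitrary alert level, in nats of mean KL

def mean_probe_kl(model_a, model_b, tokenizer):
    """Mean KL(model_b || model_a) over next-token distributions on the probe prompts."""
    total = 0.0
    for prompt in PROBES:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logp_a = torch.log_softmax(model_a(**inputs).logits[0, -1], dim=-1)
            logp_b = torch.log_softmax(model_b(**inputs).logits[0, -1], dim=-1)
        total += torch.sum(logp_b.exp() * (logp_b - logp_a)).item()
    return total / len(PROBES)

tokenizer = AutoTokenizer.from_pretrained("checkpoints/step-0")         # hypothetical path
previous = AutoModelForCausalLM.from_pretrained("checkpoints/step-0")   # hypothetical path
for step in (1000, 2000, 3000):  # hypothetical checkpoint schedule
    current = AutoModelForCausalLM.from_pretrained(f"checkpoints/step-{step}")
    drift = mean_probe_kl(previous, current, tokenizer)
    print(f"step {step}: mean probe KL = {drift:.3f} nats")
    if drift > THRESHOLD:
        print(f"  drift above {THRESHOLD}: worth a closer look before training continues")
    previous = current
```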