I've been ruminating on an idea ever since I read the section on deception in "The Core of the Alignment Problem is..." from my colleagues in SERI MATS.
Here's the important part:
When an agent interacts with the world, there are two possible ways the agent makes mistakes:
First: Its values were not aligned with the outer objective, and so it does something intentionally wrong,
Second: Its world model was incorrect, so it makes an accidental mistake.
Thus, the training process of an AGI will improve its values or its world model, and since it eventually gets diminishing marginal returns from both of these, both the world model and the values must improve together. Therefore, it is very likely that the agent will have a sufficiently good world model to understand that it is in a training loop before it has fully aligned inner values.
So, what if we prevented the model from recognizing it is in a training loop (e.g. preventing/delaying situational awareness) until we are certain it has fully aligned inner values? In other words, we could use some stronger forms of model editing to remove specific knowledge (or prevent the model from gaining that knowledge) from the model. Perhaps you penalize the model from learning things that are not useful for fully embedding aligned inner values (Tool AI-ish). Maybe even apply negative gradient steps to "unlearn" things.
I think this idea is similar to Evan's Precursor Checking idea:
Precursor checking: Another general type of training rationale that I think is worth calling attention to is what I’ll call “precursor checking,” which is the concept of using some method of gaining information about a model’s internals—e.g. transparency/interpretability or AI cognitive science—to check for some precursor to bad behavior rather than the bad behavior itself. This could involve substituting in some narrower, easier to check training goal—that still falls within the broader actual training goal—as the target for the training rationale. For example, if your training rationale involves ensuring that you don’t get a deceptive model that’s actively trying to trick its training process, then rather than explicitly trying to look for such deception (which could be especially hard since a deceptive model might actively try to avoid detection), you could instead try to ensure that your model has a short horizon length in terms of how far ahead its planning. Such a plan might work better, since horizon length might be easier to guarantee in a training rationale while still being consistent with the desired training goal and hopefully ruling out the possibility of deception. One issue with this sort of approach, however, is that you have to guarantee that whatever precursor for bad behavior you’re looking for is in fact a necessary condition for that bad behavior—if it turns out that there’s another way of getting that bad behavior that doesn’t go through the precursor, that could be a problem.
Counterarguments to this might be:
- The model might not be able to have fully aligned inner values that remain robust as capabilities eventually generalize far out of distribution.
- It will exceptionally difficult to know if we've actually removed this knowledge/capability from the model (even if it's possible).
Update: It seems that Jan Leike mentions something similar in his "why I'm optimistic about our alignment approach" post.
The model can be “narrower.” It doesn’t need to understand biology, physics, or human society that well. In practice we’d probably fine-tune from an LLM that does understand all of those things, but we could apply some targeted brain damage to the model as a safety precaution. More generally, the model only has to exceed human-level in a few domains, while it can be worse than humans in most others.
I'd be interested in hearing people's thoughts/criticisms on this. You can comment here.