But is it really in Rome? Limitations of the ROME model editing technique
I just published a new post on LessWrong. It's about the causal tracing and model editing paper (ROME).
Here's the intro:
The majority of this work was carried out this summer. Many people in the community were surprised when I mentioned some of the limitations of ROME (Rank-One Model Editing), so I figured it was worth writing a post about them, along with other insights I gained from looking into the paper. Most tests were done with GPT-2; some were done with GPT-J.
The ROME paper has been one of the most influential papers in the prosaic alignment community, and it offers several important insights. The main findings are:
- Factual associations such as “The Eiffel Tower is in Paris” seem to be stored in the MLPs of the early-middle layers of a GPT model. As the Tower token passes through the network, the MLPs of the early-middle layers write information (e.g. the Eiffel Tower’s location) into the residual stream so that the model can later read that information to generate a token about that fact (e.g. Paris).
- Editing/updating the MLP of a single layer for a given (subject, relationship, object) association allows the model to generate text with the updated fact when new prompts/sentences include the subject tokens. For example, editing “The Eiffel Tower is in Paris” to “The Eiffel Tower is in Rome” results in a model that outputs “The Eiffel Tower is right across from St Peter’s Basilica in Rome, Italy.” (A minimal sketch of what a rank-one update looks like follows this list.)
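To make the “rank-one” part concrete, here is a minimal, self-contained PyTorch sketch of a rank-one weight update. This is not the actual ROME implementation (which reads the key vector off the subject token’s MLP activation, optimizes the new value vector with gradient descent, and weights the update by a covariance statistic over keys); the names and dimensions below are illustrative. It only shows how a single outer-product update to one MLP projection matrix can force a chosen key vector to map to a chosen value vector.

```python
import torch

# Illustrative rank-one edit, NOT the actual ROME code.
# W stands in for one layer's MLP down-projection (keys -> values),
# using the convention v = k @ W (as in GPT-2's Conv1D layers).
d_mlp, d_model = 3072, 768
W = torch.randn(d_mlp, d_model) / d_mlp ** 0.5

k_star = torch.randn(d_mlp)    # key: MLP activation for the subject ("Eiffel Tower")
v_star = torch.randn(d_model)  # target value: encodes the new fact ("... is in Rome")

# Smallest rank-one change such that k_star now maps exactly to v_star.
delta = torch.outer(k_star, v_star - k_star @ W) / (k_star @ k_star)
W_edited = W + delta

assert torch.allclose(k_star @ W_edited, v_star, atol=1e-3)
print("rank of the update:", torch.linalg.matrix_rank(delta).item())  # 1
```

Because the change is rank one, it leaves inputs orthogonal to k_star untouched, which is part of why the edit ends up so tied to the specific subject tokens.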
In this post, I show that the ROME edit has many limitations:
- The ROME edit doesn’t generalize in the way you might expect. It’s true that if the subject tokens used for the edit appear in the prompt, the model will try to generalize from the updated fact. However, it doesn’t “generalize” in the following ways:
- It is not direction-agnostic/bidirectional. For example, the ROME edit only applies in the "Eiffel Tower is located in ____" direction, not in the "Rome has a tower called the ____" direction (see the prompting sketch after this list).
- It’s mostly (?) the token association being edited, not the concept: “Cheese” and “Fromage” are separate associations, so you’d need to edit both.
- I hoped that if you edited X (e.g. The Rock) and then described X without using the subject token, the model would realize it was talking about X and generate text consistent with the edit. Based on the examples I tested, this does not seem to be the case: you mostly need the subject tokens that were used for the edit to appear in the prompt.
- It seems to over/under-optimize depending on the new fact. Post-edit, the model wants to talk about Rome whenever the Eiffel Tower is mentioned more strongly than it wanted to talk about Paris before the edit.
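For reference, the kind of check behind the bidirectionality and paraphrase points above is simply prompting the edited model from several directions and seeing whether the new object shows up. Here is a minimal sketch using the Hugging Face transformers API; the prompts are illustrative, and plain "gpt2" stands in for a model to which the edit has already been applied.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: "gpt2" here stands in for a model that has already
# received the "Eiffel Tower -> Rome" edit.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "The Eiffel Tower is located in",      # edit direction: should now mention Rome
    "Rome has a famous tower called the",  # reverse direction: typically unaffected
    "The iron lattice tower in the 7th arrondissement is located in",  # paraphrase without the subject tokens
]

for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=15, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(repr(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)))
```

In my tests, only prompts that contain the original subject tokens reliably produce the edited fact.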
One point I want to illustrate with this post is that the intervention is a bit more finicky than one might initially think, and it would be easy to infer too much from the results in the paper. With a lot of these interpretability techniques, we end up finding correlation rather than causation. However, my hope is that such interventions, while not perfect at validating hypotheses, can give us extra confidence in our interpretability results (in this case, the causal tracing method).