Fooled by Vast Knowledge

LLMs' vast training data makes it hard to tell interpolation from extrapolation. Why that confusion is leading AI safety research astray.

It's incredibly easy to be fooled by the capabilities of the current top-performing tech (LLM agents). It's easy because they have a vast amount of training data to interpolate from.

This works fine for acquiring capabilities within the existing data distribution of the world (where results are also easy to verify), but what happens when the models go out of distribution?

LLMs perform poorly! Yet, people seem to think they can actually generalize to new problems. Why is that?

It's, again, the vastness of their training data. It makes it hard to distinguish between interpolation and extrapolation (or hyperpolation, if you want to add a third dimension).
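To make the distinction concrete, here is a toy sketch. It is only an analogy and makes no claim about how LLMs work internally. The "model" is a piecewise-linear fit to samples of sin(x) drawn from [0, π]. The helper names (`fit_table`, `predict`, `max_error`) and the choice of function and ranges are made up for illustration.

```python
import math

def fit_table(fn, lo, hi, n):
    """Sample fn at n evenly spaced points in [lo, hi]; this plays the role of training data."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return xs, [fn(x) for x in xs]

def predict(xs, ys, x):
    """Piecewise-linear prediction. Outside the sampled range it extends the nearest edge segment."""
    i = 0
    # Walk to the segment containing x, stopping at the last segment if x is beyond the data.
    while i < len(xs) - 2 and x > xs[i + 1]:
        i += 1
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])

def max_error(xs, ys, fn, lo, hi, n=200):
    """Largest absolute gap between prediction and truth on n test points in [lo, hi]."""
    pts = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return max(abs(predict(xs, ys, p) - fn(p)) for p in pts)

# "Train" only on [0, pi].
xs, ys = fit_table(math.sin, 0.0, math.pi, 50)

# Interpolation: test inside the training range.
in_dist_error = max_error(xs, ys, math.sin, 0.0, math.pi)
# Extrapolation: test outside the training range.
out_dist_error = max_error(xs, ys, math.sin, math.pi, 3 * math.pi)

print(f"in-distribution max error:     {in_dist_error:.4f}")
print(f"out-of-distribution max error: {out_dist_error:.4f}")
```

Inside the sampled range the fit looks almost perfect, and nothing in that score tells you how badly it fails outside the range. The denser the samples, the more convincing the in-range behaviour gets, while the out-of-range behaviour does not improve.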

For example, a TypeScript app is within-distribution! AI research within the existing body of research is within-distribution, and companies are paying millions to build RL environments to make models specifically good at some of those things!

A related and great post from Beren, "Most Algorithmic Progress is Data Progress":

In a way, this is like a large-scale reprise of the expert systems era, where instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines since given we need such effort to design extremely high quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill and hand-coding (albeit with AI assistance) every single possible task into an RL-gym seems likely to both be inordinately expensive, take a very long time, and seems unlikely to suddenly bootstrap to superintelligence.

It might still be impressive, but models are largely remixing many things they have seen in great detail during training (many impressive headline results have even been determined to be the model re-using existing implementations/PRs via search instead of coming up with actually-new ones!). This is not about LLMs not doing impressive things! This is about precisely describing their capability profile, where it comes from, and whether more of the same (e.g., scale) gets you a whole new set of impressive outcomes (e.g., novel R&D that isn't just remixing existing research).

And yes, I know you can make a ton of discoveries by interpolating existing research (e.g., interdisciplinary research and automating research pipelines to run more experiments). But I also think that people are overly confident that this means LLMs will be capable of novel R&D breakthroughs.

Even if you consider "researchers can come up with novel ideas and give them to the AIs", that likely involves longer timelines. But, just as importantly, LLMs may be disproportionately better at automating within-paradigm research than out-of-paradigm research. You therefore end up accelerating research that may be largely irrelevant for 'True' AGI. (Yes, you still accelerate many of the coding parts, but the overall speed-up remains bottlenecked, so you can't simply say the entire process of arriving at these research breakthroughs is now 1000x faster than before.)

"But the models are still capable and growing more capable! Why does this matter? Scale will just solve this!"

It matters because:

1. Alignment is about generalizing human values out-of-distribution

The whole point of alignment has always been to generalize 'human values' out-of-distribution. So, if alignment and capabilities are tied, it means models are capable of modeling the existing within-distribution 'values', but things may pull apart once we undergo the distributional shift of a post-AGI deployment world.

An example you can test right now is LLMs lacking a sense of how to engage with the world in this post-agent era. You have to keep reminding them about the current state of the world. The closer you get to novel R&D that the labs haven't paid millions in RL envs for (e.g. AI R&D), the starker this becomes.

You can point to continual learning 'solving' this, but that is kind of my point. These capability unlocks will fundamentally change the AI and its relationship with itself. Related: "You can't imitation-learn how to continual-learn".

Also, from "Training AI agents to solve hard problems could lead to Scheming":

Future AI models will be asked to solve hard tasks. We expect that solving hard tasks requires some sort of goal-directed, self-guided, outcome-based, online learning procedure, which we call the "science loop", where the AI makes incremental progress toward its high-level goal. We think this "science loop" encourages goal-directedness, instrumental reasoning, instrumental goals, beyond-episode goals, operational non-myopia, and indifference to stated preferences, which we jointly call "Consequentialism". We then argue that consequentialist agents that are situationally aware are likely to become schemers (absent countermeasures) and sketch three concrete example scenarios.

[...]

Self-guided online learning: There is an online learning component to it, i.e. the model has to condense the new knowledge it learned from iterations. For example, the model could run thousands of different trajectories in parallel. Then, it could select the trajectories that it expects to make the most progress toward its goal and fine-tune itself on them. The decisions about which data to select for fine-tuning are made by the model itself with little human correction, e.g. in some form of self-play fashion. Since the problem is hard, humans perform worse than the model at selecting different rollouts, and since there is a lot of data to sift through, humans couldn't read it all in time anyway.

2. Existing safety research may not generalize

It also matters because it means that the existing paradigm may be missing something so foundational that much of the safety research as it exists today will simply not generalize (off-distribution). That research is testing the shallow, within-distribution heuristic mimicry and generalization of LLMs.

It's like doing evals on a brain that regurgitates what it's seen, but hasn't actually gone through a thoughtful, reflective process to bring coherence to it all. The training data might let it mimic what we've fed it, but it still hasn't gone through the process of evolving its own beliefs as it engages with the world.

To me, all of this is consistent with the experiments and behaviour we see from LLMs, yet my interpretation of the experimental results seems to differ from that of much of the safety community. They seem to be looking for "scheming" and other such things, but the incoherent behaviour of LLMs seems much shallower than that, imo! (Relevant posts: "The Case Against AI Control Research" and "Current AIs seem pretty misaligned to me".)

Whatever LLMs are missing might mean that they don't really understand things. And the requirement for 'understanding' is so interwoven with alignment, novel R&D, pursuing long-term complex goals in changing environments, etc. that existing (empirical) safety research gets itself fundamentally confused.

An LLM that is behaving 'nice' may be so shallow and heuristic-driven that it is effectively in a system 1-like mode despite the appearance of 'reasoning' and 'thinking'. In pursuit of complex, long-term goals, we might expect that an autonomously self-trained AI would systematically remove these weak heuristics as a necessary step to succeed at these goals.

Just imagine an AI starting a complex company where it needs to maximize shareholder value and is competing with an entire economy of other AIs. The world is changing; they all have similar heuristics. The change in behaviour needs to be more fundamental for it to win.

Ultimately, I think we need to provide further clarity on the above, as I believe it has led folks to misapply their vague understanding of traditional alignment research (which many new researchers should engage with more) to existing AI models, and it may be leading AI safety research on superintelligence astray.
