AI alignment starter pack
written by Ruud van Asseldonk
Despite the amount of attention that machine learning is receiving nowadays, few people are familiar with the concept of AI alignment. This worries me, because a misaligned superintelligence has the potential to pose an existential threat to humanity, and I think we should treat it as seriously as nuclear warfare or climate change. In this post I want to introduce some of the concepts and share pointers for where to learn more about AI alignment.
What is AI alignment?
Alignment is the process of building an AI whose goal aligns with the goals of its creators.
Alignment is an unsolved problem. We have techniques for building AI whose output matches what its human creators want to see on a range of inputs, but for reasons highlighted below, this does not imply that its goal matches that of its creators.
Recommended introductory material:
- Intro to AI Safety by Robert Miles — A great general introduction to why AGI might be dangerous, and to the concepts that I mention in the remainder of this post.
- AI Alignment: Why It’s Hard, and Where to Start by Eliezer Yudkowsky (video). This is a good introduction with more details about alignment specifically. The talk is from 2016; it predates the seminal 2017 paper Attention is All You Need that enabled the current wave of LLMs. The talk is still very relevant, and 8 years later it’s a great illustration of how quickly AI capabilities are evolving.
- In this 2023 podcast Eliezer explains why he believes that AGI will not be aligned by default, and why that is a risk to humanity.
Why misalignment is dangerous
A superintelligence is by definition vastly better than humans at achieving its goal. If that goal does not include a component that cares about preserving humanity, then humans will not stand a chance when they come into conflict with the superintelligence. A rogue AGI does not have to turn against humans explicitly — it simply may not care. People who worry about alignment worry that the default state of an AGI is to be misaligned, for reasons highlighted in the next sections.
Recommended resources:
- Paperclip maximizer — A paperclip maximizer is an agent that wants to produce as many paperclips as possible. It’s a thought experiment to show that a goal that is seemingly harmless in a weak AI can become an existential threat in a strong AI.
- It Looks Like You’re Trying To Take Over The World by Gwern — A story about how a hard takeoff may unfold badly. The work is fiction, but inspired by existing research. In typical Gwern style it’s full of hyperlinks to more details behind the events in the story.
Hard and soft takeoff
In a soft or slow takeoff, progress in AI is steady. Capabilities increase in small steps, so we would get an AI that can do something dangerous, but not catastrophically so, well before we get one that is capable of destroying humanity. This means that we would have time to take action.
In a hard or fast takeoff, AI capabilities increase at an accelerating pace. Where the graph goes vertical, this looks like a sudden large capability gain. This may happen for various reasons, for example because AI may speed up the development of better AI. In a hard takeoff scenario, we may not realize that we have an AI with lethal capabilities before it’s too late.
If it is indeed harder to create an aligned AGI than it is to create any AGI, then companies racing to create AGI without regard for alignment is a problem in a hard takeoff scenario. If the first superintelligence we create is misaligned, we do not get to try a second time.
Resources:
- Takeoff speeds by Paul Christiano — This post does a good job of explaining hard and soft takeoff, and some of the arguments for why each may happen. Paul argues in favor of a soft takeoff. Eliezer later responded arguing in favor of a hard takeoff.
Orthogonality, instrumental convergence, and corrigibility
These topics come up often in discussions about why a misaligned AI might exist and even be likely, and why we may not be able to tell that it’s misaligned.
- Orthogonality — The Orthogonality Thesis states that the space of intelligent agents contains agents that pursue any computationally tractable goal as their terminal goal, including goals that seem absurd to humans, such as maximizing paperclips.
- Instrumental convergence — The observation that some actions are a good first step for many goals. As a silly example, imagine you are north of the Golden Gate, and you need to go somewhere in San Francisco. Regardless of where exactly you need to go, crossing the Golden Gate Bridge would be the first step. If your taxi driver starts crossing the bridge, that brings you closer to your destination, but it is no guarantee that the driver has the goal of taking you there. Similarly, if we see an AGI doing things we like at first, that is no guarantee that it shares our long-term goals.
- Corrigibility — A corrigible AI is an AI that allows itself to be corrected (updated, altered, or shut down) by its operators. Resisting attempts to correct it is an instrumentally convergent behavior: you can’t achieve goals when you are shut down! Nate Soares, Benja Fallenstein, and Eliezer Yudkowsky introduce the concept as an open problem. Paul Christiano argues that corrigibility is not as big of a problem as it seems.
Inner and outer goals
The current wave of AI is powered by neural networks that are trained using gradient descent to minimize a loss function or maximize a reward. In some cases we program the goal explicitly into the training loop (for example, for an AI that learns to play chess). In other cases, where we don’t know how to express the goal formally, we take a bunch of example inputs and outputs, and set “behave like these examples” as the training objective. In practice the model learns to give the output we want even on inputs it has not seen. Doesn’t this mean that the model is aligned?
Unfortunately, no. Just because the optimizer had an outer goal doesn’t mean that the model internalized that goal as its inner goal.
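To make the outer goal concrete, below is a minimal sketch of such a training loop (a toy PyTorch classifier on random data, not any particular real system). Nothing in this loop inspects or constrains what the model internally learns to pursue; the only thing it ever checks is whether the outputs match the examples.

```python
import torch
import torch.nn as nn

# A toy classifier and the standard training loop. The loss is the only place
# where "what we want" enters the process: match the labelled examples.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(128, 32)          # example inputs ...
labels = torch.randint(0, 2, (128,))   # ... and the outputs we want to see

for epoch in range(100):
    logits = model(inputs)
    loss = loss_fn(logits, labels)     # outer goal: "behave like these examples"
    optimizer.zero_grad()
    loss.backward()                    # gradient descent reduces the loss,
    optimizer.step()                   # but it specifies no inner goal
```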
- An example of this mismatch is the Adversarial Patch paper. It shows how an image classifier that works well on ordinary photographs very confidently misclassifies a banana as a toaster once a sticker with a specific colored pattern is added — a mistake that no human would make. We trained the model to tell bananas and toasters apart, but it learned to classify based on something else that correlates with bananas and toasters, and that behaves unpredictably under distributional shift. (A sketch of how such a patch can be found follows this list.)
- Another example is the Tank Urban Legend, where the creators of an image model thought they trained it to recognize tanks, but in reality the model learned to recognize cloudy vs. sunny days, because in the training set, all tank pictures were taken on sunny days. While the story likely never happened, it serves as a reminder that neural networks are black boxes. The field of interpretability is in its infancy, and we have no tools to verify that the model learned the outer goal it was trained on.
- Evolution producing humans is the only known example of an optimizer that created general intelligence, and it failed to align the inner goal to the outer goal. (Evolution wants inclusive genetic fitness, humans want sweet/fat food and sex. For a long time humans pursuing their inner goal helped to achieve evolution’s outer goal — until we developed medicine, ice cream, and contraception.) Eliezer makes this argument in his AGI ruin post, though others argue that the analogy is inappropriate.
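Returning to the adversarial patch example above: roughly speaking, such a patch is found by optimizing the sticker’s pixels with gradient descent so that the classifier’s prediction is pushed toward the target class, regardless of what image the patch is pasted onto. The sketch below assumes PyTorch and a pretrained torchvision classifier; the patch size, its placement, the random stand-in photos, and the "toaster" class index are illustrative details rather than the paper’s exact setup.

```python
import torch
import torch.nn.functional as F
import torchvision

# Pretrained ImageNet classifier as the victim model; its weights stay frozen.
model = torchvision.models.resnet50(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)

patch = torch.rand(3, 64, 64, requires_grad=True)  # the "sticker" we optimize
optimizer = torch.optim.Adam([patch], lr=0.01)
target = torch.tensor([859])                        # assumed ImageNet index for "toaster"

for step in range(500):
    photo = torch.rand(3, 224, 224)                 # stand-in for a real photograph
    patched = photo.clone()
    patched[:, :64, :64] = patch.clamp(0.0, 1.0)    # paste the patch in a corner
    logits = model(patched.unsqueeze(0))
    loss = F.cross_entropy(logits, target)          # push the prediction toward "toaster"
    optimizer.zero_grad()
    loss.backward()                                 # gradients flow only into the patch
    optimizer.step()
```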
Deceptively misaligned mesa-optimizers
A mesa-optimizer is what you get when the result of an optimization process is itself an optimizer with an inner goal. The outer or base optimizer creates the inner or mesa-optimizer. As shown above, there is no guarantee that the mesa-optimizer shares the base optimizer’s goal. But it gets worse: an intelligent mesa-optimizer could deceive its base optimizer about what its goal is, and doing so is an instrumentally convergent behavior (remember corrigibility).
Recommended resources:
- The Other AI Alignment Problem: Mesa-Optimizers and Inner Alignment by Robert Miles is a very clear explanation of optimizers, mesa-optimizers, and why deception can be an optimal strategy for a mesa-optimizer. The follow-up videos are also worth watching: Deceptive Misaligned Mesa-Optimisers? It’s More Likely Than You Think…, and We Were Right! Real Inner Misalignment.
- Deceptively Aligned Mesa-Optimizers: It’s Not Funny If I Have To Explain It by Scott Alexander explains a meme about this topic.
Safety and alignment
A non-superintelligent AI that is unaligned may be offensive or even harmful, but it is not an existential threat. When the labs who build the frontier models talk about safety, they are focused on preventing LLMs from saying naughty words, expressing politically inconvenient statements, helping people to build weapons, or teaching C++ to minors. Solving alignment would enable us to solve these issues, but current attempts are reactive rather than proactive. We can mitigate bad behaviors that we are aware of, but this is not fundamental progress on alignment. As an analogy, we can fix buffer overflows in a program after we learn about them (and we can even actively search for them), but that’s not the same as writing the program in a memory-safe language that eliminates buffer overflows by construction.
Resources:
- Perhaps It Is A Bad Thing That The World's Leading AI Companies Cannot Control Their AIs by Scott Alexander — An opinion on why addressing the superficial safety issues in the short term may be harmful for serious alignment attempts in the long run.
Further resources
If this post got you interested in AI alignment, here are some further resources:
- AGI Ruin: A List of Lethalities by Eliezer Yudkowsky. This post goes into a lot of detail about why Eliezer thinks we are not in a good position to solve alignment before we create superintelligence, but the writing is very dense.
- Robert Miles’ channel that I linked to before is worth following in general.
- Scott Alexander at Astral Codex Ten and formerly Slate Star Codex writes about AI alignment semi-regularly. Some interesting posts that I didn’t link before: Most Technologies Aren’t Races, AI Sleeper Agents, Pause For Thought: The AI Pause Debate, and Why I Am Not (As Much Of) A Doomer (As Some People).
My view
The resources I shared in this post range from “misaligned AI is a risk to take seriously” to “we are all doomed”. What is my take on this? I am ambivalent.
I used to not be worried about AI at all. An image classifier that tells dogs from cats is not suddenly going to take over the world: it only runs when a human invokes it, it outputs a single number, and we can always choose not to run it. Doing harmful things was simply not an action available to it.
In 2017 I did not think that AGI was close, and I did not realize that building language models could lead to general intelligence. In hindsight, it is obvious to me that reducing the loss on text prediction can create intelligence. If you train a model to predict the next word, it starts by predicting just word frequencies. It can reduce the loss by taking context into account, so it learns to emulate a Markov chain. It can do better still by learning grammar, but at some point all of the linguistic tricks are exhausted, and to reduce the loss further, the model has to understand the topic of the text, and later even model the world it describes. Training on text works even when the training set includes fiction and falsehoods, because you can be wrong in many different directions, but right in only one. With every new generation of GPT it gets better at tasks that previous generations failed at, and I don’t see any fundamental limits to this.
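To illustrate the earliest rung on that ladder, here is a toy bigram predictor, the Markov-chain stage: it picks the next word purely from counts of which word followed which in the training text, with no grammar and no model of the world (an illustrative sketch, not any real system).

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """For every word, count which words follow it and how often."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, word: str) -> str:
    """Predict the word that most often followed `word` in training."""
    return counts[word].most_common(1)[0][0]

counts = train_bigram("the cat sat on the mat and the cat slept")
print(predict_next(counts, "the"))  # prints "cat", the most frequent follower of "the"
```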
So my view changed slightly with the advent of language models. Maybe these could trick a human into taking some action in the real world, but it seemed to me the interactions were still too constrained for real harm, and importantly, these models have no memory or persistence, which limits any long-term planning.
That changed with the advent of OpenAI Codex in 2021. Now we have an AI writing code, and we immediately execute that code. The number of actions available to an Internet-connected AI is suddenly not that constrained any more. And given that the OpenAI API is part of the Internet, there is now a solution to the persistence problem: store intermediate state anywhere online that will hold it, then re-invoke yourself for the next step. At least in theory, a self-sustaining loop can emerge. I don’t think that this particular case is likely to happen, but my view changed from "it can’t happen by construction" to "it is possible in principle". Gwern’s Clippy story is still fiction, and I don’t think that the current generation of LLMs, even with Internet access, would create a harmful self-sustaining feedback loop. But at this point I am convinced that a sandbox escape is possible.
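For concreteness, the loop I have in mind has roughly the following shape. This is a deliberately harmless, heavily simplified sketch: every name in it (call_model, run_code, state_store) is made up for illustration, the external pieces are replaced by local stubs, and nothing here corresponds to a real API.

```python
# All external pieces are stubbed out locally: call_model stands in for a
# code-writing model, run_code for immediate execution of generated code,
# and state_store for "anywhere online that will hold state".
state_store = {"state": "step 0"}

def call_model(prompt: str) -> str:
    # Stand-in for asking the model what to do next, given the stored state.
    return "result = 'step done'"

def run_code(code: str) -> str:
    # The generated code is executed immediately, as described above.
    scope = {}
    exec(code, scope)
    return scope.get("result", "")

def agent_step() -> None:
    state = state_store["state"]      # 1. read the persisted state
    code = call_model(prompt=state)   # 2. generate code for the next step
    result = run_code(code)           # 3. execute that code right away
    state_store["state"] = result     # 4. persist progress for the next call
    # 5. in the scenario above, this is where the loop would arrange
    #    for itself to be invoked again

for _ in range(3):
    agent_step()
```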
So do I worry about misaligned AI now?
On a rational level, I am slightly worried. I would not be writing this post if I was not. I find the arguments for why AGI would not be aligned by default convincing, and the counterarguments are mostly arguing against taking the pessimistic view as the default, rather than arguing for why alignment would be solved in time.
On a gut level, I am not worried. I put money in a pension fund, and I didn’t quit my job to work on AI alignment, so I don’t act as if the world will end in 2030. I guess I don’t really believe that it would? I’m afraid that this gut feeling is wrong in the same way that it was wrong when I wasn’t worried about Covid spreading globally in early 2020. Rare events are difficult to develop a gut feeling for. I just really hope that my gut feeling ends up being right this time.