AI alignment
starter pack

written by Ruud van Asseldonk

Despite the amount of attention that machine learning is receiving nowadays, few people are familiar with the concept of AI alignment. This worries me, because a misaligned superintelligence has the potential to pose an existential threat to humanity, and I think we should treat it as seriously as nuclear warfare or climate change. In this post I want to introduce some of the concepts and share pointers for where to learn more about AI alignment.

What is AI alignment?

Alignment is the process of building an AI whose goal aligns with the goals of its creators.

Alignment is an unsolved problem. We have techniques for building AI whose output matches what its human creators want to see on a range of inputs, but for reasons outlined below, this does not imply that its goal matches that of its creators.

Recommended introductory material:

Why misalignment is dangerous

A superintelligence is by definition vastly better than humans at achieving its goal. If that goal does not include a component that cares about preserving humanity, when humans and the superintelligence come into conflict, humans will not stand a chance. A rogue AGI does not have to turn against humans explicitly — it simply may not care. People who worry about alignment worry that the default state of an AGI is to be unaligned for reasons highlighted in the next sections.

Recommended resources:

Hard and soft takeoff

In a soft or slow takeoff, progress in AI is steady. Capabilities increase in small steps, so we would get an AI that is capable of doing something dangerous, but not catastrophic, well before we get one that is capable of destroying humanity. This means that we would have time to take action.

In a hard or fast takeoff, AI capabilities increase at an accelerating pace. Where the graph goes vertical, this looks like a sudden large capability gain. This may happen for various reasons, for example because AI may speed up the development of better AI. In a hard takeoff scenario, we may not realize that we have an AI with lethal capabilities before it’s too late.
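To make the difference between the two takeoff shapes concrete, here is a toy sketch. Everything in it is an assumption made for illustration: capability is a single number, the thresholds `warning_level` and `lethal_level` are arbitrary, and "steady" and "accelerating" progress are modelled as adding a constant versus doubling per step. It only shows how the window between "dangerous" and "lethal" shrinks when growth accelerates; it is not a forecast.

```python
def steps_between(grow, warning_level=10.0, lethal_level=100.0, start=1.0):
    """Count the steps between first reaching warning_level and reaching lethal_level.

    `grow` maps the current capability to the next one. All numbers here are
    made up for illustration and carry no real-world meaning.
    """
    capability, step, warned_at = start, 0, None
    while capability < lethal_level:
        capability = grow(capability)
        step += 1
        if warned_at is None and capability >= warning_level:
            warned_at = step
    return step - warned_at

# Soft takeoff: a fixed capability gain per step.
soft_window = steps_between(lambda c: c + 1.0)

# Hard takeoff: the gain grows with capability (here: doubling per step).
hard_window = steps_between(lambda c: c * 2.0)

print(soft_window, hard_window)  # 90 3
```

In this toy model the steady curve leaves 90 steps to react after the warning threshold is crossed, while the accelerating curve leaves 3.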

If it is indeed harder to create an aligned AGI than it is to create any AGI, then companies racing to create AGI without regard for alignment is a problem in a hard takeoff scenario. If the first superintelligence we create is misaligned, we do not get to try a second time.


Orthogonality, instrumental convergence, and corrigibility

These topics come up often in discussions about why a misaligned AI might exist and even be likely, and why we may not be able to tell that it’s misaligned.

Inner and outer goals

The current wave of AI is powered by neural networks that are trained using gradient descent to minimize a loss function or maximize a reward. In some cases we program the goal explicitly into the training loop (for example, for an AI that learns to play chess). In other cases, where we don’t know how to express the goal formally, we take a bunch of example inputs and outputs, and set “behave like these examples” as the training objective. In practice the model learns to give the output we want even on inputs it has not seen. Doesn’t this mean that the model is aligned?

Unfortunately, no. The fact that the optimizer had an outer goal doesn’t mean that the model internalized that goal as its inner goal.
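A small sketch of why good training behavior does not pin down the inner goal. The setup is hypothetical and loosely modelled on the often-cited "coin at the end of the level" example: a level is a string of cells, the outer goal is to end on the coin `C`, and in every training level the coin happens to sit at the right edge. The two hand-written policies stand in for what a trained model might have learned; no actual optimizer is involved.

```python
def go_to_coin(level):
    """Policy whose inner goal matches the outer goal: end on the coin."""
    return level.index("C")

def go_right(level):
    """Policy with a proxy inner goal: always walk to the rightmost cell."""
    return len(level) - 1

def reward(policy, level):
    """Outer goal: 1 if the policy ends on the coin, 0 otherwise."""
    return 1 if level[policy(level)] == "C" else 0

def score(policy, levels):
    return sum(reward(policy, level) for level in levels)

# In training, the coin is always at the right edge.
training_levels = ["...C", ".....C", "..C"]
# After deployment, it is not.
deployment_levels = ["C...", "..C..."]

for policy in (go_to_coin, go_right):
    print(policy.__name__,
          score(policy, training_levels),
          score(policy, deployment_levels))
```

Both policies earn the full reward on every training level, so the training signal cannot tell them apart. Only on the deployment levels does `go_right` reveal that its goal was never the coin.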

Deceptively misaligned mesa-optimizers

A mesa-optimizer is what you get when the result of an optimization process is itself an optimizer with an inner goal. The outer or base optimizer creates the inner or mesa-optimizer. As shown above, there is no guarantee that the mesa-optimizer shares the base optimizer’s goal. But it gets worse: an intelligent mesa-optimizer could deceive its base optimizer about what its goal is, and doing so is an instrumentally convergent behavior (remember corrigibility).

Recommended resources:

Safety and alignment

A non-superintelligent AI that is unaligned may be offensive or even harmful, but it is not an existential threat. When the labs who build the frontier models talk about safety, they are focused on preventing LLMs from saying naughty words, expressing politically inconvenient statements, helping people to build weapons, or teaching C++ to minors. Solving alignment would enable us to solve these issues, but current attempts are reactive rather than proactive. We can mitigate bad behaviors that we are aware of, but this is not fundamental progress on alignment. As an analogy, we can fix buffer overflows in a program after we learn about them (and we can even actively search for them), but that’s not the same as writing the program in a memory-safe language that eliminates buffer overflows by construction.


Further resources

If this post got you interested in AI alignment, here are some further resources:

My view

The resources I shared in this post range from “misaligned AI is a risk to take seriously” to “we are all doomed”. What is my take on this? I am ambivalent.

I used to not be worried about AI at all. Doing harmful things was not an action available to an AI. An image classifier that distinguishes dogs from cats is not suddenly going to take over the world. It only runs when a human invokes it, and it outputs a single number. Acting in the world is not something it can do, and we can always choose not to run it.

In 2017 I did not think that AGI was close, and I did not realize that building language models could lead to general intelligence. In hindsight, it is obvious to me that reducing the loss on text prediction can create intelligence. If you train a model to predict the next word, it starts by predicting just word frequencies. It can reduce the loss by taking context into account, so it learns to emulate a Markov chain. It can do better still by learning grammar, but at some point all of the linguistic tricks are exhausted, and to reduce the loss further, the model has to understand the topic of the text, and later even model the world it describes. Training on text works even when the training set includes fiction and falsehoods, because you can be wrong in many different directions, but right in only one. Every new generation of GPT gets better at tasks that previous generations failed at, and I don’t see any fundamental limits to this.
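The first two steps of that progression can be checked on a toy corpus. The sketch below is an illustration under assumptions: the sentence is made up, the "models" are plain count tables rather than neural networks, and the loss is measured on the same text the counts came from. It compares the average next-word loss (negative log-likelihood) of a word-frequency predictor with that of a first-order Markov chain that conditions on the previous word.

```python
import math
from collections import Counter, defaultdict

# Made-up toy corpus, for illustration only.
words = "the cat sat on the mat and the dog sat on the rug".split()
pairs = list(zip(words, words[1:]))  # (previous word, next word)

def frequency_loss():
    """Average loss when predicting each next word from overall word frequencies."""
    counts = Counter(words)
    return -sum(math.log(counts[nxt] / len(words)) for _, nxt in pairs) / len(pairs)

def markov_loss():
    """Average loss when predicting each next word from the previous word."""
    following = defaultdict(Counter)
    for prev, nxt in pairs:
        following[prev][nxt] += 1
    total = 0.0
    for prev, nxt in pairs:
        total -= math.log(following[prev][nxt] / sum(following[prev].values()))
    return total / len(pairs)

print(f"word frequencies: {frequency_loss():.3f}")
print(f"Markov chain:     {markov_loss():.3f}")
```

On this corpus the Markov chain is only ever uncertain about what follows "the", so its loss is much lower than that of the frequency model. The argument in the text is that the same pressure keeps applying once such cheap tricks run out.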

So my view changed slightly with the advent of language models. Maybe these could trick a human into taking some action in the real world, but it seemed to me the interactions were still too constrained for real harm, and importantly, these models have no memory or persistence, which limits any long-term planning.

That changed with the advent of OpenAI Codex in 2021. Now we have an AI writing code, and we immediately execute that code. The number of actions available to an Internet-connected AI is suddenly not that constrained any more. And given that the OpenAI API is part of the Internet, there is now a solution to the persistence problem: store intermediate state anywhere online that will store state, then re-invoke yourself for the next step. At least in theory, a self-sustaining loop can emerge. I don’t think that this particular case is likely to happen, but my view changed from “it can’t happen by construction” to “it is possible in principle”. Gwern’s Clippy story is still fiction, and I don’t think that the current generation of LLMs, even with Internet access, would create a harmful self-sustaining feedback loop. But at this point I am convinced that a sandbox escape is possible.

So do I worry about misaligned AI now?

On a rational level, I am slightly worried. I would not be writing this post if I were not. I find the arguments for why AGI would not be aligned by default convincing, and the counterarguments mostly argue against taking the pessimistic view as the default, rather than for why alignment would be solved in time.

On a gut level, I am not worried. I put money in a pension fund, and I didn’t quit my job to work on AI alignment, so I don’t act as if the world will end in 2030. I guess I don’t really believe that it would? I’m afraid that this gut feeling is wrong in the same way that it was wrong when I wasn’t worried about Covid spreading globally in early 2020. Rare events are difficult to develop a gut feeling for. I just really hope that my gut feeling ends up being right this time.
