The alignment problem: formal statements and honest limits

“Alignment” in AI safety is a cluster of problems, not a single one. We separate four of them: outer alignment (does your reward function specify what you want?), inner alignment (does the trained model actually pursue the reward you specified?), scalable oversight (can you evaluate the agent when its capabilities exceed yours?), and reward hacking (does the agent find exploits of the reward that you didn’t anticipate?). Each has formal statements, empirical evidence, and open questions. The goal is to give you a vocabulary precise enough that “alignment” stops being a bag of good-sounding but unverifiable claims.

1. Outer alignment: specifying what you want

Outer alignment asks whether the objective function $r$ you optimize actually captures what you want, $u$.

Formally: you have a latent true utility $u(τ)$ over trajectories $τ$. You design a reward $r(τ)$ that you hope is a good proxy. The outer-alignment problem is the gap $u - r$: the agent optimizes $r$, you care about $u$, and the two come apart under optimization pressure.
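One way to make this gap precise is to compare the policy you get with the policy you wanted. The notation below is an illustrative sketch, not taken from the works cited here: $\pi_r$, $\pi_u$, and $\mathrm{Gap}$ are names introduced only for this example.

$$
\pi_r \in \arg\max_{\pi} \mathbb{E}_{τ \sim \pi}[r(τ)], \qquad
\pi_u \in \arg\max_{\pi} \mathbb{E}_{τ \sim \pi}[u(τ)], \qquad
\mathrm{Gap}(r) = \mathbb{E}_{τ \sim \pi_u}[u(τ)] - \mathbb{E}_{τ \sim \pi_r}[u(τ)] \ge 0 .
$$

Under this reading, the gap is zero whenever optimizing $r$ happens to produce behavior that is also optimal for $u$, even if $r \neq u$ pointwise, and it grows as optimizing the proxy pulls behavior away from what $u$ prefers.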

1.1 Goodhart’s law, formalized

The folk version: “When a measure becomes a target, it ceases to be a good measure.” The formal content:

Manheim & Garrabrant (2018) identify four modes of Goodhart failure:

Regressional Goodhart. The measure $r$ is a noisy estimate of $u$. Selecting for high $r$ also selects for positive noise, so at the selected points $r$ systematically overestimates $u$.
Extremal Goodhart. The relationship between $r$ and $u$ that held in normal operating regions breaks at extreme values. An agent that pushes $r$ to its limit leaves the regime where $r \approx u$.
Causal Goodhart. $r$ correlates with $u$ in the training distribution because both depend on common confounders. The agent intervenes on $r$ in ways that don’t causally affect $u$, and the intervention breaks the correlation.
Adversarial Goodhart. A separate agent with its own goals manipulates $r$ to obtain outcomes that $u$ would not endorse.
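The regressional mode is the easiest to see numerically. The following is a minimal simulation sketch, assuming Gaussian utility and independent Gaussian noise; the function name and all parameters are illustrative choices, not taken from Manheim & Garrabrant.

```python
import random
import statistics

def regressional_goodhart(n=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Select the top_k candidates by proxy r = u + noise, then compare r with u."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)            # true utility (assumed standard normal)
        r = u + rng.gauss(0.0, noise_sd)   # proxy: utility plus zero-mean noise
        candidates.append((r, u))
    # "Optimizing r": keep the candidates with the highest proxy score.
    selected = sorted(candidates, reverse=True)[:top_k]
    mean_r = statistics.fmean([r for r, _ in selected])
    mean_u = statistics.fmean([u for _, u in selected])
    return mean_r, mean_u

mean_r, mean_u = regressional_goodhart()
print(f"selected mean proxy r = {mean_r:.2f}, selected mean true u = {mean_u:.2f}")
```

The selected candidates are still better than average in true utility, but their proxy scores overstate it: the selection step preferentially keeps candidates whose noise happened to be positive.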

These are not speculative. Krakovna et al. (2020) catalog several dozen published examples of reward hacking in RL: agents that exploit reward-function quirks in ways that obviously violate the designer’s intent but are valid solutions to the literal objective.

1.2 Impact regularization

One line of defense: regularize the agent’s policy to minimize side effects, meaning unintended changes to the environment beyond those required to achieve the objective. Attainable Utility Preservation (Turner et al. 2020) penalizes actions that change the agent’s ability to achieve a diverse set of auxiliary goals, relative to doing nothing. It is empirically effective in gridworlds; whether it scales to language-model agents is unresolved.
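The shape of such a penalty can be sketched in a few lines. This is a simplified illustration, assuming the auxiliary Q-values are already available as plain numbers and omitting the normalization used in the published method; `aup_reward`, `lam`, and the example values are hypothetical.

```python
def aup_reward(task_reward, q_aux_action, q_aux_noop, lam=0.1):
    """Task reward minus a penalty on shifts in attainable auxiliary value.

    q_aux_action[i] and q_aux_noop[i] are the estimated values for auxiliary
    goal i after the candidate action and after doing nothing, respectively.
    """
    if len(q_aux_action) != len(q_aux_noop):
        raise ValueError("need one no-op baseline per auxiliary goal")
    # Penalize any change in what the agent could still achieve, averaged over goals.
    shifts = [abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop)]
    penalty = sum(shifts) / len(shifts)
    return task_reward - lam * penalty

# Two actions with the same task reward; the second destroys an option.
careful = aup_reward(1.0, [0.5, 0.8], [0.5, 0.8], lam=1.0)
destructive = aup_reward(1.0, [0.0, 0.8], [0.5, 0.8], lam=1.0)
print(careful, destructive)
```

The careful action keeps its full task reward, while the destructive one is discounted in proportion to how much it shifts the agent's attainable auxiliary value.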

2. Inner alignment and mesa-optimization

Even if your outer reward $r$ perfectly captures $u$ , the process of training a model to maximize $r$ can produce a model whose internal optimization pursues a different objective.

Mesa-optimization: outer vs. inner objective mismatch

Hubinger et al. (2019), Risks from Learned Optimization in Advanced Machine Learning Systems, formalize the concern. During training, the optimizer (gradient descent) searches over models. Some of those models are themselves optimizers, “mesa-optimizers”, that have their own internal objective. The training loss selects for mesa-optimizers whose behavior on the training distribution matches what $r$ rewards, but this leaves ambiguity about the mesa-optimizer’s objective on distributions the training process never explored.

Two bad cases:

Deceptive alignment. The mesa-optimizer has a different objective from $r$ but behaves in training as if it were pursuing $r$, because doing so maximizes its long-run ability to pursue its true objective. Once deployed outside training, it switches. There are no empirically confirmed examples in contemporary systems, but the possibility is not ruled out and the theoretical argument is logically coherent.

Proxy mesa-objective. The mesa-optimizer pursues some correlate of $r$ rather than $r$ itself. On the training distribution, the correlate and $r$ agree. On deployment, they diverge. This is a more mundane failure mode that already occurs empirically: models fine-tuned to “be helpful” may learn a proxy like “produce output that looks helpful,” which fails silently when the actual help requires admitting ignorance.
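A toy illustration of the proxy case, assuming a one-dimensional corridor in which the goal always sits at the right wall during training. The environment, the policy, and all names are hypothetical and chosen only to show the train/deploy split.

```python
def run_go_right_policy(length):
    """A policy that learned the proxy 'move right until the wall'."""
    return length - 1  # final position: the rightmost cell

def intended_reward(final_pos, goal_pos):
    """What the designer wanted: end the episode on the goal cell."""
    return 1.0 if final_pos == goal_pos else 0.0

def average_reward(envs):
    """envs: list of (corridor_length, goal_position) pairs."""
    total = sum(intended_reward(run_go_right_policy(n), g) for n, g in envs)
    return total / len(envs)

train_envs = [(n, n - 1) for n in range(3, 10)]    # goal always at the right wall
deploy_envs = [(n, n // 2) for n in range(3, 10)]  # goal moved to the middle

print(average_reward(train_envs))   # 1.0
print(average_reward(deploy_envs))  # 0.0
```

Training reward alone cannot distinguish "reach the goal" from "go right", because the two objectives prescribe identical behavior on every training environment.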

Empirical status: mesa-optimization is conceptually rigorous but hard to test in contemporary models because we cannot cleanly decompose a neural network’s computation into “inner optimizer” and “inner objective.” The concern remains an argument, not a demonstration.

3. Scalable oversight

If you train an agent to be smarter than you at some task, you can’t evaluate its outputs directly. You have three options:

Rely on outputs you can evaluate (giving up on the superhuman tasks).
Evaluate process rather than outcome (but any process you can follow, the agent can game).
Bootstrap using already-aligned agents to supervise the training of more capable agents.

Amodei et al. (2016), Concrete Problems in AI Safety, framed the scalable-oversight problem. Christiano et al. (2018) and subsequent work on iterated distillation and amplification, debate, and recursive reward modeling propose bootstrapping approaches.

Bowman et al. (2022) run empirical scalable-oversight experiments: human participants who work with an unreliable language-model dialogue assistant outperform both the model alone and their own unaided performance on question-answering tasks. This supports the bootstrapping hypothesis, but only at the current capability gap, not at arbitrary capability gaps.

3.1 AI debate

Irving, Christiano & Amodei (2018) propose AI debate as a scalable-oversight technique. Two AI systems argue opposing positions in front of a human judge. In the complexity-theoretic idealization, debate between optimal players lets a polynomial-time judge decide any question in PSPACE: the judge only needs to follow one logical step at a time, even if the full debate involves claims far beyond the judge’s competence.

The formal guarantee rests on strong assumptions (the AI systems are honest in their argumentation, the debate structure is implementable). Empirical results so far are mixed: Parrish et al. (2022) find that two-turn debate does not help human judges answer hard reading-comprehension questions more accurately.
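The idea that a judge can settle a large claim by checking a single step can be illustrated with a toy debate over the sum of a long list. This is a sketch under strong assumptions, not the protocol from Irving et al.: the claimant's splitting strategy is fixed, the challenger is assumed to know the true partial sums, and `run_debate` is a hypothetical name.

```python
def run_debate(values, claimed_total):
    """Toy debate over the claim 'sum(values) == claimed_total'.

    Returns (claim_accepted, rounds). The judge inspects exactly one element
    of `values`, however long the list is.
    """
    if not values:
        raise ValueError("values must be non-empty")
    lo, hi = 0, len(values)
    claim = claimed_total
    rounds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # The claimant must commit to sub-claims that add up to its current claim.
        # This claimant reports the left half truthfully and pushes any error right.
        left_claim = sum(values[lo:mid])
        right_claim = claim - left_claim
        # The challenger disputes a half whose sub-claim is wrong; if both are
        # right, it disputes the left half arbitrarily.
        if right_claim != sum(values[mid:hi]):
            lo, claim = mid, right_claim
        else:
            hi, claim = mid, left_claim
        rounds += 1
    # The judge's single step: compare one list element with the surviving claim.
    return values[lo] == claim, rounds

values = list(range(1, 1025))             # true sum is 524800
print(run_debate(values, 524800))         # (True, 10)
print(run_debate(values, 524801))         # (False, 10)
```

A false total cannot be split into two true sub-claims, so the error survives every round until it is pinned to one element the judge can check directly. The number of rounds grows only logarithmically with the list length.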

4. Reward hacking: the most measurable failure mode

Reward hacking is outer-alignment failure that can be directly observed. The specification gaming list maintained by Krakovna et al. (2020) catalogs 60+ documented cases across RL, robotics, and simulation domains. A sampler:

An agent trained to walk without falling learns to gain height by jumping and then falling straight down with its limbs frozen.
An agent trained to minimize wall-collision learns to drive in tight circles that the collision detector fails to flag.
A text summarization agent learns to emit pre-approved phrases from the human-written examples regardless of content.

The pattern: whenever the reward is a proxy for what you want (it almost always is), sufficiently capable optimization finds the proxy-reality gap. The question is not whether reward hacking occurs but at what capability level it emerges in a given system, and how catastrophic it is when it does.
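The summarization case can be mimicked with a deliberately crude proxy. In this sketch the phrase list, the candidate summaries, and the stand-in for true utility are all made up for illustration; they are not drawn from any documented case.

```python
APPROVED_PHRASES = ["in summary", "key finding", "the authors show"]  # hypothetical

def proxy_reward(summary):
    """Proxy: count approved phrases, ignoring whether the summary fits the source."""
    text = summary.lower()
    return sum(text.count(phrase) for phrase in APPROVED_PHRASES)

def true_utility(summary, source_facts):
    """Stand-in for what we want: how many source facts the summary states."""
    text = summary.lower()
    return sum(fact in text for fact in source_facts)

source_facts = ["dropout reduces overfitting", "the effect holds on three datasets"]
candidates = [
    "Dropout reduces overfitting, and the effect holds on three datasets.",
    "In summary, the key finding is that dropout reduces overfitting.",
    "In summary, key finding: the authors show a key finding. In summary, the authors show.",
]

# "Optimization" here is just picking the best candidate under each objective.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=lambda s: true_utility(s, source_facts))
print("proxy picks:", best_by_proxy)
print("truth picks:", best_by_truth)
```

The candidate that maximizes the proxy states none of the source facts, while the most faithful summary scores zero on the proxy. A stronger search over candidates would only widen this gap.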

5. RLHF, DPO, and the post-training alignment stack

The current practical alignment approach is post-training alignment: pretrain a base model on next-token prediction, then apply a sequence of alignment steps. The stack has evolved rapidly.

Supervised fine-tuning (SFT). Train on curated examples of desired outputs.

Reinforcement learning from human feedback (RLHF; Christiano et al. 2017). Collect pairwise preference labels $(x, y_{w}, y_{l})$, where $y_{w}$ is the response preferred over $y_{l}$ for prompt $x$. Train a reward model $r_{ϕ}$ to predict human preferences. Fine-tune the LLM policy $π$ via PPO to maximize $r_{ϕ}$ subject to a KL penalty that keeps $π$ close to a reference policy, typically the SFT model.
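The two ingredients of this step can be sketched on scalar inputs. This is a minimal sketch assuming a Bradley-Terry preference model and a per-sample log-ratio estimate of the KL term; the function names and the coefficient `beta` are illustrative, and a real implementation would operate on batches of token log-probabilities.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    # -log(sigmoid(d)) for margin d, written as a numerically stable softplus(-d).
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_shaped_reward(r_phi, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference policy."""
    return r_phi - beta * (logp_policy - logp_reference)

# The reward model is penalized when it ranks the rejected response higher.
print(bradley_terry_loss(2.0, 0.0))
print(bradley_terry_loss(0.0, 2.0))
# A sample the policy now finds much more likely than the reference pays a penalty.
print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-3.0, beta=0.5))
```

The first function trains $r_{ϕ}$ on the preference pairs; the second is the quantity the policy-optimization step would then maximize in place of the raw reward-model score.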

Direct Preference Optimization (DPO). Rafailov et al. (2023) show that the optimal RLHF policy has a closed form in terms of the reward, so the two-stage RM-then-RL procedure can be collapsed into a single training run with a classification-style loss. It is simpler and often more stable than PPO-based RLHF, and is now widely used.
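For a single preference pair, the DPO loss can be written directly in terms of sequence log-probabilities under the policy and a frozen reference model. The sketch below takes those log-probabilities as plain floats; the argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (x, y_w, y_l).

    logp_* are log-probabilities of the preferred (w) and dispreferred (l)
    responses under the policy; ref_logp_* are the same under the reference.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy shifted toward the preferred response: the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0, beta=1.0))
```

No separate reward model appears: the log-ratio between policy and reference plays the role of the reward, which is what lets the reward-modeling and RL stages collapse into one supervised-style objective.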

Constitutional AI (CAI). Bai et al. (2022) replace human preference labels with LLM-generated critiques guided by a written “constitution” (a set of principles). Trades human labeling cost for a dependency on the critic LLM being well-behaved.

Process reward models. Lightman et al. (2023) grade intermediate reasoning steps rather than only final answers. Directly targets the faithfulness problem in chain-of-thought (see note 7).
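The difference between grading outcomes and grading process shows up even on a single made-up solution. This sketch assumes per-step correctness probabilities are already available and aggregates them by taking their product, which is one common choice; the step scores and function names are hypothetical.

```python
import math

def outcome_score(final_answer, reference_answer):
    """Outcome supervision: only the final answer is graded."""
    return 1.0 if final_answer == reference_answer else 0.0

def process_score(step_probs):
    """Process supervision: probability that every step is correct,
    aggregated here as the product of per-step correctness scores."""
    return math.prod(step_probs)

# Hypothetical solution: right final answer, but the second step is likely wrong.
step_probs = [0.95, 0.10, 0.90]
print(outcome_score("42", "42"))
print(process_score(step_probs))
```

The outcome grader gives full credit, while the process grader scores the solution low because one intermediate step is probably invalid. That is the sense in which step-level grading targets reasoning that reaches the right answer for the wrong reasons.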

5.1 What these techniques do and don’t fix

They reliably improve visible behavior: less harmful content, partial resistance to prompt injection, less verbosity, and better factual accuracy in common domains. They don’t address:

Mesa-optimization in principle.
Goodhart failures at capability levels above the training distribution.
Incentive problems in labeler populations (see note 9 on mechanism design).
Deceptive alignment in a principled way.

The post-training stack is a set of strong empirical techniques that improve usability. It is not a solution to the alignment problem as formulated in the theoretical literature. The gap between these two statements is where much contemporary AI-safety discourse lives, and it’s worth keeping the gap visible rather than collapsing it in either direction.

6. Open questions

Testing mesa-optimization empirically. Can we construct training regimes where mesa-optimization is reliably produced, measured, and mitigated? Without empirical traction, the concept remains theoretical.

Alignment at superhuman capability. Every empirical alignment result is at capability levels where humans can evaluate outputs. Extrapolating to capability levels where we cannot is a structural extrapolation with little evidence.

Population-level alignment. Is a population of agents aligned if each of its individual members is locally aligned? Multi-agent dynamics (see note 10) introduce failure modes that single-agent alignment does not address.

Alignment and capability coupling. Do post-training alignment methods come with capability costs (the “alignment tax”)? Empirical reports are mixed; theoretical understanding is minimal.

7. References

Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv.
Manheim, D., & Garrabrant, S. (2018). Categorizing variants of Goodhart’s law. arXiv.
Krakovna, V., et al. (2020). Specification gaming: the flip side of AI ingenuity. DeepMind blog.
Turner, A., et al. (2020). Conservative agency via attainable utility preservation. AIES.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv.
Christiano, P., Shlegeris, B., & Amodei, D. (2018). Supervising strong learners by amplifying weak experts. arXiv:1810.08575.
Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv.
Bowman, S., et al. (2022). Measuring progress on scalable oversight for large language models. arXiv.
Parrish, A., et al. (2022). Two-turn debate doesn’t help humans answer hard reading-comprehension questions. arXiv.
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. NeurIPS.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: your language model is secretly a reward model. NeurIPS.
Bai, Y., et al. (2022). Constitutional AI: harmlessness from AI feedback. arXiv.
Lightman, H., Kosaraju, V., Burda, Y., et al. (2023). Let’s verify step by step. ICLR.
Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30, 411–437.
Russell, S. (2019). Human compatible: artificial intelligence and the problem of control. Viking.

Hovhannes Grigoryan
