Gödel, Löb, and the formal limits of self-referential AI

Gödel’s incompleteness theorems (1931) and Löb’s theorem (1955) are usually introduced as results in mathematical logic. They are also the most-cited, and most-misunderstood, formal arguments about the limits of AI. This note derives the two results accurately, clarifies what they do and do not imply about machines that reason, and surveys the recent AI-safety literature on self-referential agents where these results are technically load-bearing rather than rhetorical.

1. Gödel’s first incompleteness theorem

Let be a formal theory: a recursively enumerable set of axioms, a set of inference rules, and a language strong enough to express arithmetic. We say is:

  • Consistent if does not prove both and for any .
  • Complete if for every sentence , either or .

Theorem (Gödel 1931). Any consistent, recursively enumerable theory that includes Peano arithmetic is incomplete: there exists a sentence in the language of such that neither nor .

1.1 The proof idea, honestly

The proof constructs via diagonalization. In ‘s language, encode formal proofs as natural numbers (Gödel numbering). Define the predicate is the Gödel number of a sentence provable in “. This predicate exists and is expressible because is recursively enumerable.

Now construct a sentence that “says” , informally, “this sentence is not provable in “. The construction is via the diagonal lemma, which states that for any formula there exists a sentence such that .

Gödel's first incompleteness theorem by diagonalization

The argument:

  • If , then is provable, so is true. But asserts . By consistency, . So proves both and something equivalent to . Contradiction.
  • If , then (by the same reasoning applied to ) we get a similar contradiction provided is -consistent (a slightly stronger assumption than consistency; Rosser 1936 showed consistency alone suffices with a trickier construction).

So proves neither nor . Incompleteness.

1.2 What this does and does NOT say

Says: any sufficiently strong, consistent, recursively axiomatized formal theory has true sentences it cannot prove.

Does not say: humans can do things no machine can do. The sentence that cannot prove is provable in any stronger theory , including by a human working in a more expressive system. There is no privileged cognitive power here; there is just a specific theory-relative result.

Does not say: AI is impossible or that AI cannot reason about mathematics. Automated theorem provers are not one-true-theory machines; they can operate in multiple theories and, crucially, can recognize when a particular theory is too weak for a given statement.

The famous Penrose argument (The Emperor’s New Mind 1989) claims Gödel’s theorem proves human consciousness cannot be computational. The argument has been widely rebutted (see Davis 1993, Chalmers 1995, Feferman 1996): it requires the assumption that a human can know with certainty that their own reasoning is consistent, which Gödel’s second theorem (below) shows is exactly what no formal system can do for itself.

2. Gödel’s second incompleteness theorem

Theorem (Gödel 1931, second). A consistent, recursively axiomatized theory extending Peano arithmetic cannot prove its own consistency:

where (equivalently, “there is no proof of a contradiction in ”).

This sharpens the first theorem. It is not just that fails to prove some true sentences, it specifically fails to prove the statement of its own consistency, which (if is in fact consistent) is true.

The AI-relevant implication: an automated theorem prover that implements cannot certify its own correctness using the reasoning principles it embodies. Certification requires a stronger system, which in turn cannot self-certify. This is an infinite regress: no axiomatizable system of any strength can produce a fully internal guarantee of its own soundness.

3. Löb’s theorem and self-referential agents

Löb’s theorem (1955) extends the Gödel framework in a direction that is particularly relevant to self-modifying AI.

Theorem (Löb 1955). For a theory extending Peano arithmetic, and any sentence :

In words: if can prove “if is provable then ,” then can prove outright.

Contrapositive. cannot prove unless it can prove without going through ” proves “. ’s own claim that its proofs are sound is never available internally.

3.1 The Löbian obstacle

In the AI-alignment context, Löb’s theorem has been studied as an obstacle to principled self-trust in self-modifying agents. An agent that reasons about its own future actions wants to assert things like “if my future self proves , then is true.” Under Löb’s theorem, the agent cannot prove such conditional trust unconditionally, the conditional trust only exists when the antecedent collapses, which defeats the point.

Yudkowsky & Herreshoff (2013) formalized this as the Löbian obstacle. A self-improving AI that wants to verify the improvements it makes cannot do so via internal proof; it needs a stronger theory for verification, which itself has a Löbian obstacle, ad infinitum.

Fallenstein, Soares, Taylor, Yudkowsky and colleagues at MIRI subsequently proposed modal combat agents and logical induction (Garrabrant et al. 2016) as partial technical responses. These are active research areas with no fully satisfactory solution as of 2026.

4. The halting problem and its AI connections

Related but distinct: the halting problem (Turing 1936) is the undecidable question “does this program halt on this input?”

For AI: no algorithm can decide, for an arbitrary program, whether that program’s execution terminates. Practical consequences:

  • No general-purpose safety analyzer. You cannot build a verifier that decides whether a given AI program will ever enter some unsafe state.
  • Bounded verification. Practical safety analysis must restrict to decidable fragments (bounded-step behavior, restricted program classes) or accept incompleteness (sound but incomplete verifiers).
  • Rice’s theorem (1953) generalizes this: any non-trivial semantic property of programs is undecidable. “Will this program output a harmful response?” is undecidable in the general case.

These are not arguments against AI safety, they are arguments for principled, restricted safety analysis rather than blanket guarantees.

5. What the incompleteness results mean for contemporary AI

A few honest claims:

  • For theorem provers and formal verification tools, Gödel and Löb constrain what can be verified internally. Practical verification uses stronger meta-systems, accepts incompleteness, or uses bounded checks.
  • For learned systems (LLMs, RL agents), the incompleteness results apply only insofar as the system is operating as a theorem prover, which is rarely the right frame. An LLM is not a first-order theory; its “consistency” is not formally well-defined; Gödel doesn’t directly apply.
  • For self-improving AI or goal-preserving modification, the Löbian obstacle is a real technical constraint on proof-theoretic approaches, which is one (among many) reasons contemporary alignment research does not rely on proof-based self-verification.
  • For the “can AI be conscious / have understanding” debate, Gödel’s theorems are almost always misapplied.

6. Open questions

Probabilistic self-reference. Christiano et al. (2013) propose probabilistic reflection as a way around Löbian obstacles: agents assign probabilities to their own future statements rather than provability. The approach partially avoids the obstacle but has its own technical problems.

Logical induction (Garrabrant et al. 2016) constructs probability distributions on sentences of a formal language that converge to correct values over time, with self-referential properties that avoid the sharp Löbian traps. The framework is elegant but computationally impractical at scale.

Verification of neural networks. Formal verification of neural networks is an active subfield but faces the halting-problem wall for general networks. Progress is limited to restricted classes (piecewise-linear activations, bounded depth) and specific properties (local robustness).

7. References (verified April 2026, core papers via standard logic references)

  • Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38, 173–198. English translation in From Frege to Gödel (Heijenoort 1967).
  • Löb, M. H. (1955). Solution of a problem of Leon Henkin. Journal of Symbolic Logic, 20(2), 115–118.
  • Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1), 230–265.
  • Rosser, J. B. (1936). Extensions of some theorems of Gödel and Church. Journal of Symbolic Logic, 1(3), 87–91.
  • Rice, H. G. (1953). Classes of recursively enumerable sets and their decision problems. Trans. AMS, 74(2), 358–366.
  • Davis, M. (1993). How subtle is Gödel’s theorem? More on Roger Penrose. Behavioral and Brain Sciences, 16(3), 611–612.
  • Chalmers, D. (1995). Minds, machines, and mathematics: a review of Shadows of the Mind by Roger Penrose. PSYCHE, 2(9).
  • Feferman, S. (1996). Penrose’s Gödelian argument. PSYCHE, 2(7).
  • Yudkowsky, E., & Herreshoff, M. (2013). Tiling agents for self-modifying AI, and the Löbian obstacle. MIRI technical report.
  • Fallenstein, B., & Soares, N. (2015). Vingean reflection: reliable reasoning for self-improving agents. MIRI technical report.
  • Garrabrant, S., et al. (2016). Logical induction. arXiv.
  • Christiano, P., Yudkowsky, E., Herreshoff, M., & Barasz, M. (2013). Probabilistic self-reference. MIRI technical report.
  • Penrose, R. (1989). The emperor’s new mind. Oxford University Press. [For context; the arguments are rebutted by the references above.]
  • Boolos, G. (1979). The unprovability of consistency. Cambridge University Press.
  • Smullyan, R. M. (1992). Gödel’s incompleteness theorems. Oxford University Press.