World-model approaches to AGI

If a neural network could internally simulate the environment it acts in, it could plan by imagination rather than trial-and-error. This is the promise of world models: agents that learn a compact internal representation of the world’s dynamics and use it to plan, predict, and generalize. The approach has a long history in model-based RL, has produced state-of-the-art results on Atari (Dreamer), has been advocated by Yann LeCun as the path to AGI (JEPA, 2022), and contrasts sharply with the dominant paradigm of pure next-token prediction in LLMs. We cover the architecture, the empirical successes, the intellectual stakes, and the honest limits.

1. The motivation

A sample-efficient intelligent agent should not need to execute every candidate plan in the real world to find out what it does. It should be able to imagine outcomes and plan within that imagination, the way humans plan routes, debug code, or anticipate the consequences of a chess move.

The formal idea is model-based reinforcement learning: learn an environment model $\hat{p}(s' \mid s, a)$ and use it for planning. Classical model-based RL (Sutton 1991, Dyna architecture) did this with tabular or linear models in small state spaces.
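To make the classical idea concrete, here is a minimal tabular Dyna-Q sketch. It is an illustration under stated assumptions, not Sutton's original code: the toy chain environment `chain_step`, the hyperparameter values, and the deterministic dictionary model are all choices made for this example.

```python
import random
from collections import defaultdict

def chain_step(s, a, n=6):
    """Toy environment (assumed for this sketch): a 6-state chain.
    Action 1 moves right, action 0 moves left; reaching the last state pays 1."""
    s2 = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
    done = s2 == n - 1
    return s2, (1.0 if done else 0.0), done

def dyna_q(step, n_actions, start, episodes=30, planning_steps=20,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q: Q-learning on real steps plus replayed model transitions."""
    rng = random.Random(seed)
    q = defaultdict(float)   # Q[(state, action)]
    model = {}               # learned model: (s, a) -> (s', r, done)
    real_steps = 0

    def greedy(s):
        best = max(q[(s, a)] for a in range(n_actions))
        return rng.choice([a for a in range(n_actions) if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in range(n_actions))
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.randrange(n_actions) if rng.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)           # one real environment step
            real_steps += 1
            update(s, a, r, s2, done)          # direct RL from real experience
            model[(s, a)] = (s2, r, done)      # model learning
            for _ in range(planning_steps):    # planning on imagined transitions
                (ps, pa), (ps2, pr, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q, real_steps

q, real_steps = dyna_q(chain_step, n_actions=2, start=0)
```

Each real step is followed by `planning_steps` updates drawn from the learned model, which is how the method trades extra computation for fewer environment interactions.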

The modern approach learns neural world models capable of capturing complex dynamics. The hypothesis: given a good enough world model, an agent can plan via imagination and achieve substantially better sample efficiency than model-free approaches.

[Figure: world-model agent loop]

The architecture: an encoder maps raw observations to latent states; a transition model predicts next latent states given actions; a decoder (or value head) maps latent states back to observations or values; a planner uses the transition model to simulate and select actions.
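The loop described above can be sketched in a few lines. Everything here is a hand-written stand-in assumed for illustration: `encode`, `transition`, and `reward_head` would be learned networks in a real agent, and the exhaustive short-horizon search in `plan` stands in for sampling-based planners such as random shooting.

```python
import itertools

def encode(obs):
    """Encoder stand-in: a learned encoder would compress pixels into a latent."""
    return float(obs)

def transition(z, a):
    """Transition-model stand-in: predicts the next latent from latent and action."""
    return z + a

def reward_head(z, goal=5.0):
    """Reward-head stand-in: scores a latent state (closer to the goal is better)."""
    return -abs(z - goal)

def plan(z, actions=(-1.0, 0.0, 1.0), horizon=3):
    """Roll out every action sequence inside the model and return the first
    action of the best-scoring sequence. No environment calls happen here."""
    best_return, best_first = float("-inf"), actions[0]
    for seq in itertools.product(actions, repeat=horizon):
        z_im, total = z, 0.0
        for a in seq:
            z_im = transition(z_im, a)   # imagined step
            total += reward_head(z_im)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

def run_episode(env_step, obs, steps=20):
    """Agent loop: encode, plan in imagination, act once in the real environment."""
    for _ in range(steps):
        z = encode(obs)
        a = plan(z)
        obs = env_step(obs, a)           # the only real-environment interaction
    return obs

final_obs = run_episode(lambda o, a: o + a, 0.0)
```

Only the first action of the best imagined sequence is executed before replanning, which limits how far errors in the model can compound.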

2. The canonical example: Ha & Schmidhuber’s World Model

Ha & Schmidhuber (2018), World Models, demonstrated the template.

Architecture.

V: variational autoencoder (VAE) encoding image observations to low-dimensional latent state.
M: mixture-density RNN predicting the next latent state given current state and action.
C: controller (small linear policy) operating on the latent state and the RNN hidden state.
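With $z_t$ the VAE latent, $h_t$ the RNN hidden state, and $a_t$ the action, the three components can be summarized as follows. The observation symbol $o_t$ and the posterior $q$ are notation assumed for this summary.

$$
\begin{aligned}
\text{V:}\quad & z_t \sim q(z_t \mid o_t) \\
\text{M:}\quad & P(z_{t+1} \mid a_t, z_t, h_t) \\
\text{C:}\quad & a_t = W_c\,[z_t \; h_t] + b_c
\end{aligned}
$$

Because C has only the parameters $W_c$ and $b_c$, it is small enough to be optimized with an evolution strategy rather than by gradient descent.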

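Procedure. Collect rollouts with a random policy, train V and M on them, and freeze both. Then train C with an evolution strategy (CMA-ES). In the VizDoom experiment, C is trained entirely in simulated rollouts generated by M, never touching the real environment again after the initial random-policy data collection, and the resulting policy is transferred back to the real game.

Result. On CarRacing-v0, a controller trained in the real environment on top of the frozen V and M features reached state-of-the-art scores for its time. On VizDoom (the Take Cover task), the controller trained entirely in imagination transferred successfully to the real environment. This was a striking demonstration that a learned world model can carry enough signal to support competent control.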

3. Dreamer: continuous refinement

The Dreamer line of work (Hafner et al. 2020, 2021, 2023) refined the approach and scaled it. The key architectural improvements:

Recurrent state-space model (RSSM). Combines a deterministic recurrent state with a stochastic latent state. The deterministic component provides long-term memory; the stochastic component models uncertainty.
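In the notation commonly used for Dreamer-style models (here $x_t$ is the observation, $a_t$ the action, and $\phi$ the world-model parameters), the RSSM can be written as:

$$
\begin{aligned}
\text{Deterministic state:}\quad & h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Posterior (uses the observation):}\quad & z_t \sim q_\phi(z_t \mid h_t, x_t) \\
\text{Prior (used in imagination):}\quad & \hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t)
\end{aligned}
$$

During imagination no observation is available, so rollouts are generated by sampling from the prior and feeding the sample back into $f_\phi$.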

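Actor-critic in imagination. Learn a policy and a value function directly on imagined latent rollouts generated by the world model. The critic is regressed toward λ-returns computed along imagined trajectories; the actor is trained to maximize those returns, either by backpropagating through the learned dynamics (the original Dreamer) or with REINFORCE-style gradient estimators (later versions).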
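A common value target for imagined rollouts is the λ-return, which blends one-step bootstrapped targets with longer multi-step returns. The sketch below is a minimal plain-Python version; the function name, argument layout, and default values of `gamma` and `lam` are assumptions for illustration, and actual implementations differ in details such as how discounts and episode termination are handled.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns along one imagined trajectory.

    rewards[t] (t = 0..H-1) is the predicted reward for imagined step t -> t+1.
    values[t]  (t = 0..H)   is the critic's estimate at each imagined state.
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]                      # bootstrap from the final state
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns

# Example: a three-step imagined rollout with a critic that only values the last state.
targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], gamma=0.5, lam=1.0)
```

With `lam=0` each target reduces to the one-step target `rewards[t] + gamma * values[t + 1]`; with `lam=1` it becomes the full discounted return bootstrapped from the final value.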

Symlog target scaling (DreamerV3). Compress prediction targets with a symmetric logarithm so that training stays stable across tasks with vastly different reward scales.
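The symlog transform is $\operatorname{symlog}(x) = \operatorname{sign}(x)\,\ln(|x| + 1)$, with inverse $\operatorname{symexp}(x) = \operatorname{sign}(x)\,(e^{|x|} - 1)$. A minimal scalar sketch follows; the function names mirror the formula, and a real implementation would apply the same operations elementwise to tensors.

```python
import math

def symlog(x):
    """Symmetric log: roughly linear near zero, logarithmic for large |x|."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    """Inverse of symlog: maps network predictions back to the original scale."""
    return math.copysign(math.expm1(abs(y)), y)

# Rewards that differ by several orders of magnitude land in a narrow range.
compressed = [symlog(r) for r in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```

A network trained to predict `symlog(target)` sees similarly sized regression targets whether rewards are around 1 or around 1000, and `symexp` recovers the original scale at prediction time.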

3.1 DreamerV3: scaling results

Hafner et al. (2023), DreamerV3, reported a single set of hyperparameters solving 150+ diverse tasks (control from pixels, proprioceptive robotics, Minecraft) at state-of-the-art sample efficiency. The flagship demonstration is collecting a diamond in Minecraft from scratch, with no human-generated data or curricula: an extremely sparse, long-horizon task solved by model-based imagination.

Significance. World models are genuinely sample-efficient. On hard exploration tasks, Dreamer-family agents can reach, with on the order of 100× less environment interaction, the performance that model-free methods eventually attain. The sample-efficiency gap is not merely theoretical; it is empirical and large.

4. JEPA: LeCun’s proposal for AGI via world models

Joint Embedding Predictive Architecture (LeCun 2022) is a proposed architectural family for building intelligent agents. The core move: predict latent representations rather than raw observations.

Motivation. Generative models (like VAEs) try to predict every pixel, spending capacity on irrelevant details (background textures, camera noise). Joint-embedding methods (contrastive ones like CLIP, self-distillation ones like DINO, predictive ones like I-JEPA) learn abstract representations that discard those details and retain the semantically meaningful structure.

Architecture. Given an input, produce two embeddings: one from an observed portion, one from an unobserved or masked portion. Train so that the observed embedding predicts the unobserved embedding in latent space. This is self-supervised learning at scale without requiring pixel-level reconstruction.
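The training signal can be sketched with toy stand-ins. In the snippet below, `linear` is an elementwise scaling that stands in for a deep encoder or predictor, and all function names are hypothetical. The exponential-moving-average target encoder follows the general recipe used by I-JEPA-style methods to keep the representations from collapsing; the exact loss and update schedule vary by method.

```python
def linear(params, x):
    """Toy 'network': elementwise scaling, standing in for a deep encoder."""
    return [w * xi for w, xi in zip(params, x)]

def jepa_loss(context_enc, predictor, target_enc, x_context, x_target):
    """Mean squared error between predicted and actual target embeddings.
    In training, the target branch is treated as a constant (no gradient)."""
    predicted = linear(predictor, linear(context_enc, x_context))
    target = linear(target_enc, x_target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def ema_update(target_params, online_params, tau=0.99):
    """Move target-encoder parameters slowly toward the online context encoder."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# One illustrative step: measure the latent-space error, then refresh the target encoder.
context_enc, predictor, target_enc = [1.0, 1.0], [1.0, 1.0], [0.5, 0.5]
loss = jepa_loss(context_enc, predictor, target_enc, [1.0, 2.0], [1.0, 2.0])
target_enc = ema_update(target_enc, context_enc)
```

The loss is computed entirely in embedding space, so nothing forces the model to reproduce pixel-level detail from the masked region.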

Relation to AGI. LeCun argues that JEPA-trained world models are the missing piece for AGI: LLMs predict tokens but lack grounded world models; JEPA architectures provide grounded world models that LLMs currently do not. The ambitious version: an agent with a sufficiently rich JEPA world model, combined with a planning module and a cost function, could form the basis of AGI-level capability.

The position is influential and contested. For it: sample-efficiency gains of model-based methods are real; LLMs do have factual and physical-reasoning failures consistent with lack of a grounded world model. Against it: the JEPA line has produced excellent representation learning (I-JEPA, V-JEPA 2023–2025) but has not yet demonstrated the kind of general-purpose capability the AGI claim requires.

5. Model-based vs. model-free, and the question of “model-ness”

LLMs trained by next-token prediction are sometimes called “implicit world models”: they have learned something about the world from text, as shown by their ability to answer factual questions, write code, and reason about physical scenarios.

Is that a world model?

Argument for: LLMs demonstrably capture information about the world. Their capabilities are closer to “having a world model” than traditional RL methods were.

Argument against: An LLM’s “world model” is about text, not about the world. It has learned statistical regularities of language; in domains where language is a faithful proxy for the world (high-level facts, explicit reasoning), it does well. In domains where the proxy is weak (spatial reasoning, physical dynamics, embodied interaction), it fails.

The empirical middle ground as of 2026: LLMs have partial, uneven world models. Contemporary research (e.g., Meta’s V-JEPA work, DeepMind’s Genie video models) is probing whether genuine world models can be learned purely from video or sensorimotor data, which would bypass the text-proxy limitation.

6. The sample-efficiency case

The strongest empirical argument for world models is sample efficiency.

Atari 100K benchmark (a budget of 100,000 environment steps): world-model methods (IRIS, EfficientZero) outperform standard model-free baselines by wide margins.
Minecraft from scratch: DreamerV3, a world-model method, was the first reported agent to collect diamonds without human-generated data.
Robotics: learned world models make real-robot training feasible in timeframes that are impractical for pure model-free RL.

These gains are consistent with the theoretical argument: if you can plan with imagination, you need vastly less real interaction to improve.

7. Open questions

Compositional world models. Current world models capture dynamics at a single level of abstraction. Human world models are hierarchical: the same dynamics can be reasoned about at the level of atoms, molecules, objects, or economies. Hierarchical world models remain an open architectural challenge.

World models vs. LLMs as foundation. The two paradigms, world-model-centric and LLM-centric AGI proposals, place different bets on what the scaling curve eventually delivers. Both bets are consistent with current evidence; which one is right is an empirical question.

Continual learning. World models in deployment face distribution shift. Retraining from scratch is expensive, and fine-tuning on new data commonly causes forgetting of what was learned earlier. How to keep a world model up to date with minimal supervision is an active research area.

Symbolic abstraction. World models in neural latent space are hard to interpret. Whether a symbolic or hybrid neuro-symbolic world model can match the expressivity of pure-neural world models while being more transparent is an open empirical question.

8. References

Ha, D., & Schmidhuber, J. (2018). World models. NeurIPS.
Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to control: learning behaviors by latent imagination. ICLR.
Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with discrete world models. ICLR. [DreamerV2]
Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering diverse domains through world models. arXiv. [DreamerV3]
LeCun, Y. (2022). A path towards autonomous machine intelligence. Position paper, FAIR.
Assran, M., et al. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR. [I-JEPA]
Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: an introduction. 2nd ed., MIT Press.
Schrittwieser, J., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588, 604–609. [MuZero]
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data. NeurIPS. [EfficientZero]
Micheli, V., Alonso, E., & Fleuret, F. (2023). Transformers are sample-efficient world models. ICLR. [IRIS]
Bruce, J., et al. (2024). Genie: generative interactive environments. ICML.

Hovhannes Grigoryan
