LeCun Just Raised $1B to Replace LLMs. Here's Why He Thinks They're a Dead End — and What He's Building Instead

By Prahlad Menon · 9 min read

In November 2025, Yann LeCun walked into Mark Zuckerberg’s office and told him he was leaving. He had spent twelve years building Meta’s AI research operation — published foundational work on convolutional neural networks, won the Turing Award, and trained some of the researchers now leading the AI industry. And he thought the entire field had taken a wrong turn.

Four months later, he announced $1.03 billion in seed funding to prove it.

To understand what he’s betting on — and why it matters — you need to understand three different ways of building AI systems, and what each one fundamentally cannot do.

Three paradigms, three different bets

1. Autoregressive models — the LLM approach

An autoregressive model is trained to predict the next token in a sequence. A token can be a word, a patch of pixels, a chunk of audio. Given everything that came before, what comes next?

"The cat sat on the ___" → [mat: 34%, floor: 22%, roof: 8%, ...]

The model samples from that probability distribution and picks a token. Then it does it again for the next position. And again. The entire output is built one discrete step at a time.
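The loop can be sketched in a few lines of Python. The distribution function below hard-codes the article’s toy example (a real model computes it with a neural network), and the vocabulary and probabilities are invented for illustration:

```python
import random

# Toy autoregressive generation: repeatedly sample the next token from a
# context-conditioned distribution and append it to the sequence.
def next_token_distribution(context):
    # A real LM computes this with a neural network; here we hard-code
    # one step of the article's example distribution.
    if context[-1] == "the":
        return {"mat": 0.34, "floor": 0.22, "roof": 0.08, "table": 0.36}
    return {"the": 1.0}

def generate(context, n_tokens, rng):
    out = list(context)
    for _ in range(n_tokens):
        dist = next_token_distribution(out)
        tokens, probs = zip(*dist.items())
        # The sampling step: this is where the chance of error enters.
        out.append(rng.choices(tokens, weights=probs, k=1)[0])
    return out

print(generate(["The", "cat", "sat", "on", "the"], 1, random.Random(0)))
```

Every real system (with temperature, top-p filtering, and so on) is a variation on exactly this loop.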

This works remarkably well for language. Language is inherently discrete, follows predictable rules, and has enough structure that next-token prediction captures most of what matters.

The fundamental problem: every prediction is a sample from a probability distribution — which means every prediction carries some chance of error. Each error shifts the context, so the next prediction starts from a slightly wrong premise, which shifts the context again. Over long sequences, the probability of an error-free output decays exponentially with length.
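A back-of-the-envelope calculation makes the compounding concrete. Assuming (simplistically) independent per-token errors with probability e, the chance of an error-free n-token output is (1 − e)^n:

```python
# If each token independently has error probability e, the chance that an
# n-token continuation is error-free is (1 - e)^n. Independence is a
# simplification (real errors are correlated), but the exponential decay
# is the shape of the problem.
def p_error_free(e, n):
    return (1 - e) ** n

for n in (10, 100, 1000):
    print(n, p_error_free(0.01, n))
```

Even a 1% per-token error rate leaves only about a 37% chance of a flawless 100-token answer, and essentially zero for 1,000 tokens.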

This is why LLMs hallucinate. Not because they were trained on bad data, not because they’re insufficiently large, but because of the mathematics of sequential sampling. LeCun stated this clearly on X in September 2024: “Pure Auto-Regressive LLMs are a dead end on the way towards human-level AI.”

We covered the research behind this in our post on AI hallucinations — OpenAI’s own paper found that more capable models hallucinate more confidently, not less. The architecture is the ceiling.

Where autoregressive models excel: language generation, code, reasoning over text, creative tasks. GPT-4, Claude, Llama — all autoregressive. Extremely useful today.

Where they fail: physical world reasoning, long-horizon planning, causal understanding, any task where a confident wrong answer has irreversible consequences.


2. Diffusion models — the image/video approach

Diffusion models take a completely different path. Instead of building output left-to-right one token at a time, they start with pure random noise and iteratively refine it over hundreds of steps until coherent structure emerges.

[noise] → [slightly less noise] → [vague shape] → [recognizable form] → [final image]

Think of it like developing a photograph in a darkroom — the image gradually resolves from nothing. Each denoising step makes a small correction to the whole output simultaneously, rather than committing to each piece in sequence.

Why this is better than autoregression for images: no compounding error. Because the model refines the full output iteratively rather than locking in decisions token by token, it can correct mistakes made in earlier steps. This is why Midjourney, Stable Diffusion, DALL-E 3, and Sora produce coherent, high-quality images and video — problems that autoregressive image generators struggled with.
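A minimal caricature of the denoising loop, with a hand-written “denoiser” standing in for the learned noise-prediction network and a fixed target standing in for the image being generated:

```python
import random

# Caricature of diffusion sampling: start from noise and repeatedly nudge
# the whole signal toward coherence. A real model predicts the noise with
# a neural network and follows a learned schedule; here the "denoiser"
# simply moves every element a small step toward a fixed target.
def denoise_step(x, target, strength=0.1):
    # Every element is corrected simultaneously on each step -- earlier
    # mistakes keep getting revised, unlike token-by-token generation,
    # which locks each choice in permanently.
    return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]

rng = random.Random(0)
target = [0.2, 0.8, 0.5, 0.9]            # the "image" we want to emerge
x = [rng.gauss(0, 1) for _ in target]    # start from pure noise
for _ in range(100):
    x = denoise_step(x, target)
print([round(v, 2) for v in x])
```

After enough steps, the initial noise is entirely washed out — the final output depends on the corrections, not on any single early commitment.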

But diffusion doesn’t solve the reasoning problem.

Sora can generate a physically plausible video of a ball rolling off a table. It looks right — the ball accelerates, bounces, behaves like a ball should. But Sora didn’t learn what gravity is. It learned what gravity looks like from millions of videos. Ask it to reason about a novel physical scenario it hasn’t seen variants of, and it fails. It’s modeling appearance, not causality.

Diffusion models are generative — their goal is to produce realistic outputs. They model “what does X look like?” not “what causes X?”

Where diffusion models excel: image generation, video synthesis, audio generation, any task where the goal is producing high-quality outputs that look/sound like the real thing.

Where they fail: causal reasoning, planning, understanding physical systems from first principles, anything requiring a model of why things happen.


3. JEPA / World Models — LeCun’s bet

Joint Embedding Predictive Architecture (JEPA), proposed by LeCun in 2022, takes a fundamentally different approach. It doesn’t generate anything.

Instead of predicting pixels or words, JEPA predicts in abstract representation space.

Here’s the intuition. When you watch someone pick up a coffee cup, your brain doesn’t predict the exact color of every pixel on the cup at every frame. It maintains an abstract model: this is a cup, it has weight, it will behave predictably when lifted, the person’s arm will move in an arc consistent with the cup’s mass. You’re predicting the relevant structure, not the irrelevant details.

JEPA does the same thing. It learns to encode inputs (images, video, sensor data) into abstract representations that capture what matters, and predicts what those representations will look like after some future event or action — without trying to reconstruct the full sensory detail.

Mathematically, it’s an Energy-Based Model (EBM). For any two situations (current state, predicted future state), the model assigns an “energy” value — low energy when the prediction matches reality, high energy when it doesn’t. Training minimizes energy on real transitions and maximizes it on impossible ones. The model learns the structure of what’s physically possible.
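A toy one-parameter sketch of that contrastive idea, where “plausible physics” is hand-written as constant-velocity motion and negatives are penalized up to a margin — real EBMs use deep networks and far more careful negative sampling:

```python
# Toy 1-D energy-based model: energy(s, s_next) should be low when s_next
# is a plausible successor of s and high otherwise. "Plausible physics"
# here is constant-velocity motion s_next = s + 1, and w is the single
# learned parameter (the model's guess at that velocity).
def energy(w, s, s_next):
    return (s_next - (s + w)) ** 2

def train(steps=200, lr=0.05, margin=2.0):
    w = 0.0
    for _ in range(steps):
        for s in range(5):
            good, bad = s + 1.0, s - 1.0   # real vs. impossible transition
            # Push energy down on the real transition...
            grad = -2 * (good - (s + w))
            # ...and up on the impossible one, until it clears a margin.
            if energy(w, s, bad) < margin:
                grad += 2 * (bad - (s + w))
            w -= lr * grad
    return w

w = train()
print(round(w, 2))                        # learned velocity, ~1.0
print(energy(w, 3, 4) < energy(w, 3, 2))  # real transition has lower energy
```

The learned parameter converges to the true velocity, so real transitions end up in the low-energy valley and impossible ones on the high-energy walls — the one-dimensional version of “learning the structure of what’s physically possible.”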

Input state → [encoder] → abstract representation
                               ↓
                          [predictor] → predicted future representation
                               ↓
                     compare to actual future representation
                     (minimize prediction error in abstract space)

What this gives you that autoregression and diffusion don’t:

  • Causal understanding — because the model learns what causes what, not just what correlates with what
  • Planning — you can simulate forward: “if I take action A, what state will I end up in?”
  • Robustness to noise — irrelevant details (exact pixel values, word choice) are abstracted away before prediction, so noise doesn’t derail reasoning
  • No compounding hallucination — you’re not sampling from probability distributions at each step; you’re operating in a learned abstract space
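The planning point can be sketched as exhaustive rollout search over a toy, hand-written transition function. A JEPA-style system would instead learn the predictor from data and roll it forward in abstract representation space, but the shape of the procedure is the same:

```python
from itertools import product

# Sketch of planning with a world model: roll candidate action sequences
# forward through a transition function and pick the sequence whose
# predicted end state lands closest to the goal. The "world model" here
# is a hand-written toy (1-D position, actions -1/0/+1).
def predict(state, action):
    return state + action          # stand-in for a learned predictor

def plan(start, goal, horizon=3, actions=(-1, 0, 1)):
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for a in seq:              # simulate forward -- no real-world trial
            state = predict(state, a)
        cost = abs(goal - state)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

print(plan(start=0, goal=2))
```

The key property is that all the trial and error happens inside the model’s simulation; only the winning action sequence ever touches the real world.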

Meta has already published V-JEPA (video JEPA) and VL-JEPA (vision-language JEPA). AMI Labs is the commercial vehicle to push this further.


The three paradigms, side by side

| | Autoregressive (LLMs) | Diffusion | JEPA / World Models |
|---|---|---|---|
| How it works | Predict next token sequentially | Denoise iteratively from noise | Predict in abstract representation space |
| Error accumulation | Compounds exponentially | Correctable across steps | Operates in abstract space — noise ignored |
| What it models | Statistical patterns in sequences | Appearance of outputs | Causal structure of the world |
| Best at | Language, code, text reasoning | Images, video, audio generation | Physical reasoning, planning, robotics |
| Fails at | Physical reasoning, long planning | Causal understanding, novel physics | Not yet production-ready |
| Hallucination risk | Structural — inherent to architecture | Lower for generation tasks | Aims to eliminate it by design |
| Examples | GPT-4, Claude, Llama | Midjourney, Sora, Stable Diffusion | V-JEPA (Meta), AMI Labs (commercial) |

What AMI Labs is actually building

The company’s first commercial partner is Nabla, the medical AI startup (also co-founded by CEO Alexandre LeBrun). The healthcare angle makes sense: in medicine, a confidently wrong answer can kill someone. A world model that genuinely understands cause and effect — this drug interacts with this condition in this way — is categorically more valuable than an LLM that hallucinates drug interactions with high confidence.

Other announced targets: robotics (physical dexterity requires understanding the physical world, not predicting tokens), autonomous driving (Toyota invested), wearables, and industrial automation.

LeBrun is unusually candid about the timeline: “It’s not your typical applied AI startup that can release a product in three months, have revenue in six months… It could take years for world models to go from theory to commercial applications.”

The investors accepting this timeline — Bezos Expeditions, Nvidia, Toyota, Samsung, Eric Schmidt — suggest they believe the long-term bet is worth the wait.

The honest assessment

LeCun has been saying LLMs are a dead end since 2022. During those three years, LLMs have become dramatically more capable. GPT-4, Claude 3, Gemini Ultra, and their successors have made autoregressive models useful for an enormous range of tasks, despite their architectural limitations.

“Dead end” is probably too strong. LLMs are genuinely useful today and will be for years. The accurate version of LeCun’s claim is: autoregressive models have a structural ceiling that better training and larger scale cannot overcome for physical-world reasoning and long-horizon planning. That’s a narrower but defensible claim.

What world models offer is not “better LLMs” but a different capability class — systems that can reason about cause and effect in physical environments, plan sequences of actions, and understand novel situations from first principles rather than from learned statistical patterns.

The practical question isn’t whether LLMs or world models “win.” It’s which paradigm you should be watching if you’re building applications in:

  • Language, code, content → LLMs are mature and getting better. Use them now.
  • Images, video, audio generation → Diffusion is mature. Sora, Midjourney, ElevenLabs.
  • Robotics, autonomous systems, physical world reasoning, high-stakes decisions → World models are the long-term bet. AMI Labs is the leading commercial bet on this. Watch closely.

Why this connects to everything else happening in AI

We’ve been tracking several threads that connect directly here.

The hallucination paper from OpenAI showed that more capable autoregressive models hallucinate more confidently — exactly the structural ceiling LeCun is pointing at.

The EBRD jobs analysis showed that physical-world roles (robotics, surgery, skilled trades) are currently low AI exposure precisely because LLMs and diffusion models can’t reason about the physical world. World models are the technology that changes that equation.

And Isomorphic Labs’ IsoDDE — the drug design engine that doubles AlphaFold 3’s accuracy by predicting in abstract molecular representation space rather than generating atom coordinates directly — is architecturally closer to the JEPA philosophy than to autoregression. Not coincidentally, it’s the most capable biomedical AI system built to date.

The direction is consistent across all of them: the most capable next-generation AI systems will predict in abstract space, not generate in pixel or token space.

AMI Labs is the $1 billion bet that LeCun is right.



Update: V-JEPA 2.1 — March 2026

Published on arXiv March 15, 2026 (arXiv:2603.14482) — still at Meta, ahead of AMI Labs’ commercial work.

Since this post was written, Meta published V-JEPA 2.1 — a meaningful step forward from V-JEPA 2, and a direct demonstration of the dense representation thesis argued above.

The key advance: instead of applying the self-supervised JEPA objective only at the final encoder layer, V-JEPA 2.1 applies it hierarchically across multiple intermediate layers (deep self-supervision). This forces the model to build spatially and temporally grounded representations at every level of the hierarchy — not just at the top. Combined with a denser predictive loss (both visible and masked tokens contribute to training), the result is representations that are structurally richer than V-JEPA 2.

What this unlocks — the benchmarks:

| Task | V-JEPA 2.1 Result | Significance |
|---|---|---|
| Ego4D short-term action anticipation | 7.71 mAP | State of the art |
| EPIC-KITCHENS high-level action anticipation | 40.8 Recall@5 | State of the art |
| Real-robot grasping success rate | +20 points over V-JEPA-2 AC | Direct robotics impact |
| TartanDrive robotic navigation | 5.687 ATE | Strong |
| NYUv2 depth estimation (linear probe) | 0.307 RMSE | Strong |
| Something-Something-V2 | 77.7 | Competitive |

The robotics number is the one to focus on: +20 points on real-robot grasping isn’t a benchmark curiosity — it’s a real physical system picking up real objects better because of better world model representations. That’s the JEPA thesis playing out in hardware.

Short-term action anticipation (predicting what a hand is about to do from video) is exactly the capability that separates a useful physical-world AI from a generative model. V-JEPA 2’s predecessor couldn’t do this well. 7.71 mAP on Ego4D is state-of-the-art as of March 2026.

The four design pillars of V-JEPA 2.1:

  1. Dense predictive loss — both visible and masked tokens contribute to training, forcing spatial/temporal grounding
  2. Deep self-supervision — JEPA objective applied at multiple intermediate encoder layers (not just the final output)
  3. Multi-modal tokenizers — unified image and video training in the same model
  4. Effective scaling — model capacity and data both scaled together
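The deep self-supervision pillar can be sketched schematically. Everything below — the layer functions, the loss, the scales — is a stand-in for illustration, not V-JEPA 2.1’s actual architecture, and the predictor network is omitted for brevity:

```python
# Schematic of "deep self-supervision": apply a predictive loss at every
# intermediate layer, not only at the final one, so each level of the
# hierarchy is forced to stay grounded in the target signal.
def layer(x, scale):
    return [scale * v for v in x]          # stand-in encoder layer

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def deep_supervised_loss(x_context, x_target, scales=(0.5, 0.8, 1.0)):
    total, hc, ht = 0.0, x_context, x_target
    for s in scales:
        hc, ht = layer(hc, s), layer(ht, s)
        # Loss at EVERY level: each layer's representation of the context
        # must match the same layer's view of the target -- versus the
        # standard setup, which supervises only the final layer.
        total += mse(hc, ht)
    return total

print(deep_supervised_loss([1.0, 2.0], [1.1, 2.1]))
```

The training signal is a sum of per-layer terms rather than a single top-layer term, which is the structural change the paper’s “deep self-supervision” pillar describes.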

This is still Meta research, not AMI Labs product. But it validates LeCun’s architectural direction and narrows the gap between the theoretical promise of world models and practical deployment in robotics and autonomous systems.

Sources: TechCrunch · The Next Web · Wired · AMI Labs · JEPA paper (LeCun, 2022)

Related: AI Hallucinations Are Mathematically Inevitable · The AI Jobs Chart That Explains What’s Coming for Your Career · Paul Conyngham’s Cancer Vaccine Pipeline