LeCun Just Raised $1B to Replace LLMs. Here's Why He Thinks They're a Dead End — and What He's Building Instead

By Prahlad Menon · 9 min read

In November 2025, Yann LeCun walked into Mark Zuckerberg’s office and told him he was leaving. He had spent twelve years building Meta’s AI research operation — published foundational work on convolutional neural networks, won the Turing Award, and trained some of the researchers now leading the AI industry. And he thought the entire field had taken a wrong turn.

Four months later, he announced $1.03 billion in seed funding to prove it.

To understand what he’s betting on — and why it matters — you need to understand three different ways of building AI systems, and what each one fundamentally cannot do.

Three paradigms, three different bets

1. Autoregressive models — the LLM approach

An autoregressive model is trained to predict the next token in a sequence. A token can be a word, a patch of pixels, a chunk of audio. Given everything that came before, what comes next?

"The cat sat on the ___" → [mat: 34%, floor: 22%, roof: 8%, ...]

The model samples from that probability distribution and picks a token. Then it does it again for the next position. And again. The entire output is built one discrete step at a time.
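The loop can be sketched in a few lines of Python. The distribution function below hard-codes the article’s toy example (a real model computes it with a neural network), and the vocabulary and probabilities are invented for illustration:

```python
import random

# Toy autoregressive generation: repeatedly sample the next token from a
# context-conditioned distribution and append it to the sequence.
def next_token_distribution(context):
    # A real LM computes this with a neural network; here we hard-code
    # one step of the article's example distribution.
    if context[-1] == "the":
        return {"mat": 0.34, "floor": 0.22, "roof": 0.08, "table": 0.36}
    return {"the": 1.0}

def generate(context, n_tokens, rng):
    out = list(context)
    for _ in range(n_tokens):
        dist = next_token_distribution(out)
        tokens, probs = zip(*dist.items())
        # The sampling step: this is where the chance of error enters.
        out.append(rng.choices(tokens, weights=probs, k=1)[0])
    return out

print(generate(["The", "cat", "sat", "on", "the"], 1, random.Random(0)))
```

Every real system (with temperature, top-p filtering, and so on) is a variation on exactly this loop.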

This works remarkably well for language. Language is inherently discrete, follows predictable rules, and has enough structure that next-token prediction captures most of what matters.

The fundamental problem: every prediction is a sample from a probability distribution — which means every prediction carries some chance of error. Each error shifts the context, so the next prediction starts from a slightly wrong premise, which shifts the context again. Over long sequences, the probability of an error-free output decays exponentially with length.
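A back-of-the-envelope calculation makes the compounding concrete. Assuming (simplistically) independent per-token errors with probability e, the chance of an error-free n-token output is (1 − e)^n:

```python
# If each token independently has error probability e, the chance that an
# n-token continuation is error-free is (1 - e)^n. Independence is a
# simplification (real errors are correlated), but the exponential decay
# is the shape of the problem.
def p_error_free(e, n):
    return (1 - e) ** n

for n in (10, 100, 1000):
    print(n, p_error_free(0.01, n))
```

Even a 1% per-token error rate leaves only about a 37% chance of a flawless 100-token answer, and essentially zero for 1,000 tokens.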

This is why LLMs hallucinate. Not because they were trained on bad data, not because they’re insufficiently large, but because of the mathematics of sequential sampling. LeCun stated this clearly on X in September 2024: “Pure Auto-Regressive LLMs are a dead end on the way towards human-level AI.”

We covered the research behind this in our post on AI hallucinations — OpenAI’s own paper found that more capable models hallucinate more confidently, not less. The architecture is the ceiling.

Where autoregressive models excel: language generation, code, reasoning over text, creative tasks. GPT-4, Claude, Llama — all autoregressive. Extremely useful today.

Where they fail: physical world reasoning, long-horizon planning, causal understanding, any task where a confident wrong answer has irreversible consequences.


2. Diffusion models — the image/video approach

Diffusion models take a completely different path. Instead of building output left-to-right one token at a time, they start with pure random noise and iteratively refine it over hundreds of steps until coherent structure emerges.

[noise] → [slightly less noise] → [vague shape] → [recognizable form] → [final image]

Think of it like developing a photograph in a darkroom — the image gradually resolves from nothing. Each denoising step makes a small correction to the whole output simultaneously, rather than committing to each piece in sequence.

Why this is better than autoregression for images: no compounding error. Because the model refines the full output iteratively rather than locking in decisions token by token, it can correct mistakes made in earlier steps. This is why Midjourney, Stable Diffusion, DALL-E 3, and Sora produce coherent, high-quality images and video — problems that autoregressive image generators struggled with.
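A minimal caricature of the denoising loop, with a hand-written “denoiser” standing in for the learned noise-prediction network and a fixed target standing in for the image being generated:

```python
import random

# Caricature of diffusion sampling: start from noise and repeatedly nudge
# the whole signal toward coherence. A real model predicts the noise with
# a neural network and follows a learned schedule; here the "denoiser"
# simply moves every element a small step toward a fixed target.
def denoise_step(x, target, strength=0.1):
    # Every element is corrected simultaneously on each step -- earlier
    # mistakes keep getting revised, unlike token-by-token generation,
    # which locks each choice in permanently.
    return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]

rng = random.Random(0)
target = [0.2, 0.8, 0.5, 0.9]            # the "image" we want to emerge
x = [rng.gauss(0, 1) for _ in target]    # start from pure noise
for _ in range(100):
    x = denoise_step(x, target)
print([round(v, 2) for v in x])
```

After enough steps, the initial noise is entirely washed out — the final output depends on the corrections, not on any single early commitment.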

But diffusion doesn’t solve the reasoning problem.

Sora can generate a physically plausible video of a ball rolling off a table. It looks right — the ball accelerates, bounces, behaves like a ball should. But Sora didn’t learn what gravity is. It learned what gravity looks like from millions of videos. Ask it to reason about a novel physical scenario it hasn’t seen variants of, and it fails. It’s modeling appearance, not causality.

Diffusion models are generative — their goal is to produce realistic outputs. They model “what does X look like?” not “what causes X?”

Where diffusion models excel: image generation, video synthesis, audio generation, any task where the goal is producing high-quality outputs that look/sound like the real thing.

Where they fail: causal reasoning, planning, understanding physical systems from first principles, anything requiring a model of why things happen.


3. JEPA / World Models — LeCun’s bet

Joint Embedding Predictive Architecture (JEPA), proposed by LeCun in 2022, takes a fundamentally different approach. It doesn’t generate anything.

Instead of predicting pixels or words, JEPA predicts in abstract representation space.

Here’s the intuition. When you watch someone pick up a coffee cup, your brain doesn’t predict the exact color of every pixel on the cup at every frame. It maintains an abstract model: this is a cup, it has weight, it will behave predictably when lifted, the person’s arm will move in an arc consistent with the cup’s mass. You’re predicting the relevant structure, not the irrelevant details.

JEPA does the same thing. It learns to encode inputs (images, video, sensor data) into abstract representations that capture what matters, and predicts what those representations will look like after some future event or action — without trying to reconstruct the full sensory detail.

Mathematically, it’s an Energy-Based Model (EBM). For any two situations (current state, predicted future state), the model assigns an “energy” value — low energy when the prediction matches reality, high energy when it doesn’t. Training minimizes energy on real transitions and maximizes it on impossible ones. The model learns the structure of what’s physically possible.
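A toy one-parameter sketch of that contrastive idea, where “plausible physics” is hand-written as constant-velocity motion and negatives are penalized up to a margin — real EBMs use deep networks and far more careful negative sampling:

```python
# Toy 1-D energy-based model: energy(s, s_next) should be low when s_next
# is a plausible successor of s and high otherwise. "Plausible physics"
# here is constant-velocity motion s_next = s + 1, and w is the single
# learned parameter (the model's guess at that velocity).
def energy(w, s, s_next):
    return (s_next - (s + w)) ** 2

def train(steps=200, lr=0.05, margin=2.0):
    w = 0.0
    for _ in range(steps):
        for s in range(5):
            good, bad = s + 1.0, s - 1.0   # real vs. impossible transition
            # Push energy down on the real transition...
            grad = -2 * (good - (s + w))
            # ...and up on the impossible one, until it clears a margin.
            if energy(w, s, bad) < margin:
                grad += 2 * (bad - (s + w))
            w -= lr * grad
    return w

w = train()
print(round(w, 2))                        # learned velocity, ~1.0
print(energy(w, 3, 4) < energy(w, 3, 2))  # real transition has lower energy
```

The learned parameter converges to the true velocity, so real transitions end up in the low-energy valley and impossible ones on the high-energy walls — the one-dimensional version of “learning the structure of what’s physically possible.”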

Input state → [encoder] → abstract representation
                               ↓
                          [predictor] → predicted future representation
                               ↓
                     compare to actual future representation
                     (minimize prediction error in abstract space)

What this gives you that autoregression and diffusion don’t:

  • Causal understanding — because the model learns what causes what, not just what correlates with what
  • Planning — you can simulate forward: “if I take action A, what state will I end up in?”
  • Robustness to noise — irrelevant details (exact pixel values, word choice) are abstracted away before prediction, so noise doesn’t derail reasoning
  • No compounding hallucination — you’re not sampling from probability distributions at each step; you’re operating in a learned abstract space
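The planning point can be sketched as exhaustive rollout search over a toy, hand-written transition function. A JEPA-style system would instead learn the predictor from data and roll it forward in abstract representation space, but the shape of the procedure is the same:

```python
from itertools import product

# Sketch of planning with a world model: roll candidate action sequences
# forward through a transition function and pick the sequence whose
# predicted end state lands closest to the goal. The "world model" here
# is a hand-written toy (1-D position, actions -1/0/+1).
def predict(state, action):
    return state + action          # stand-in for a learned predictor

def plan(start, goal, horizon=3, actions=(-1, 0, 1)):
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for a in seq:              # simulate forward -- no real-world trial
            state = predict(state, a)
        cost = abs(goal - state)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

print(plan(start=0, goal=2))
```

The key property is that all the trial and error happens inside the model’s simulation; only the winning action sequence ever touches the real world.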

Meta has already published V-JEPA (video JEPA) and VL-JEPA (vision-language JEPA). AMI Labs is the commercial vehicle to push this further.


The three paradigms, side by side

| | Autoregressive (LLMs) | Diffusion | JEPA / World Models |
|---|---|---|---|
| How it works | Predict next token sequentially | Denoise iteratively from noise | Predict in abstract representation space |
| Error accumulation | Compounds exponentially | Correctable across steps | Operates in abstract space — noise ignored |
| What it models | Statistical patterns in sequences | Appearance of outputs | Causal structure of the world |
| Best at | Language, code, text reasoning | Images, video, audio generation | Physical reasoning, planning, robotics |
| Fails at | Physical reasoning, long planning | Causal understanding, novel physics | Not yet production-ready |
| Hallucination risk | Structural — inherent to architecture | Lower for generation tasks | Aims to eliminate it by design |
| Examples | GPT-4, Claude, Llama | Midjourney, Sora, Stable Diffusion | V-JEPA (Meta), AMI Labs (commercial) |

What AMI Labs is actually building

The company’s first commercial partner is Nabla, the medical AI startup (also co-founded by CEO Alexandre LeBrun). The healthcare angle makes sense: in medicine, a confidently wrong answer can kill someone. A world model that genuinely understands cause and effect — this drug interacts with this condition in this way — is categorically more valuable than an LLM that hallucinates drug interactions with high confidence.

Other announced targets: robotics (physical dexterity requires understanding the physical world, not predicting tokens), autonomous driving (Toyota invested), wearables, and industrial automation.

LeBrun is unusually candid about the timeline: “It’s not your typical applied AI startup that can release a product in three months, have revenue in six months… It could take years for world models to go from theory to commercial applications.”

The investors accepting this timeline — Bezos Expeditions, Nvidia, Toyota, Samsung, Eric Schmidt — suggest they believe the long-term bet is worth the wait.

The honest assessment

LeCun has been saying LLMs are a dead end since 2022. During those three years, LLMs have become dramatically more capable. GPT-4, Claude 3, Gemini Ultra, and their successors have made autoregressive models useful for an enormous range of tasks, despite their architectural limitations.

“Dead end” is probably too strong. LLMs are genuinely useful today and will be for years. The accurate version of LeCun’s claim is: autoregressive models have a structural ceiling that better training and larger scale cannot overcome for physical-world reasoning and long-horizon planning. That’s a narrower but defensible claim.

What world models offer is not “better LLMs” but a different capability class — systems that can reason about cause and effect in physical environments, plan sequences of actions, and understand novel situations from first principles rather than from learned statistical patterns.

The practical question isn’t whether LLMs or world models “win.” It’s which paradigm you should be watching if you’re building applications in:

  • Language, code, content → LLMs are mature and getting better. Use them now.
  • Images, video, audio generation → Diffusion is mature. Sora, Midjourney, ElevenLabs.
  • Robotics, autonomous systems, physical world reasoning, high-stakes decisions → World models are the long-term bet. AMI Labs is the leading commercial bet on this. Watch closely.

Why this connects to everything else happening in AI

We’ve been tracking several threads that connect directly here.

The hallucination paper from OpenAI showed that more capable autoregressive models hallucinate more confidently — exactly the structural ceiling LeCun is pointing at.

The EBRD jobs analysis showed that physical-world roles (robotics, surgery, skilled trades) are currently low AI exposure precisely because LLMs and diffusion models can’t reason about the physical world. World models are the technology that changes that equation.

And Isomorphic Labs’ IsoDDE — the drug design engine that doubles AlphaFold 3’s accuracy by predicting in abstract molecular representation space rather than generating atom coordinates directly — is architecturally closer to the JEPA philosophy than to autoregression. Not coincidentally, it’s the most capable biomedical AI system built to date.

The direction is consistent across all of them: the most capable next-generation AI systems will predict in abstract space, not generate in pixel or token space.

AMI Labs is the $1 billion bet that LeCun is right.



Update: V-JEPA 2.1 — March 2026

Published on arXiv March 15, 2026 (arXiv:2603.14482) — still at Meta, ahead of AMI Labs’ commercial work.

Since this post was written, Meta published V-JEPA 2.1 — a meaningful step forward from V-JEPA 2, and a direct demonstration of the dense representation thesis argued above.

The key advance: instead of applying the self-supervised JEPA objective only at the final encoder layer, V-JEPA 2.1 applies it hierarchically across multiple intermediate layers (deep self-supervision). This forces the model to build spatially and temporally grounded representations at every level of the hierarchy — not just at the top. Combined with a denser predictive loss (both visible and masked tokens contribute to training), the result is representations that are structurally richer than V-JEPA 2.

What this unlocks — the benchmarks:

| Task | V-JEPA 2.1 Result | Significance |
|---|---|---|
| Ego4D short-term action anticipation | 7.71 mAP | State of the art |
| EPIC-KITCHENS high-level action anticipation | 40.8 Recall@5 | State of the art |
| Real-robot grasping success rate | +20 points over V-JEPA-2 AC | Direct robotics impact |
| TartanDrive robotic navigation | 5.687 ATE | Strong |
| NYUv2 depth estimation (linear probe) | 0.307 RMSE | Strong |
| Something-Something-V2 | 77.7 | Competitive |

The robotics number is the one to focus on: +20 points on real-robot grasping isn’t a benchmark curiosity — it’s a real physical system picking up real objects better because of better world model representations. That’s the JEPA thesis playing out in hardware.

Short-term action anticipation (predicting what a hand is about to do from video) is exactly the capability that separates a useful physical-world AI from a generative model. V-JEPA 2’s predecessor couldn’t do this well. 7.71 mAP on Ego4D is state-of-the-art as of March 2026.

The four design pillars of V-JEPA 2.1:

  1. Dense predictive loss — both visible and masked tokens contribute to training, forcing spatial/temporal grounding
  2. Deep self-supervision — JEPA objective applied at multiple intermediate encoder layers (not just the final output)
  3. Multi-modal tokenizers — unified image and video training in the same model
  4. Effective scaling — model capacity and data both scaled together
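The deep self-supervision pillar can be sketched schematically. Everything below — the layer functions, the loss, the scales — is a stand-in for illustration, not V-JEPA 2.1’s actual architecture, and the predictor network is omitted for brevity:

```python
# Schematic of "deep self-supervision": apply a predictive loss at every
# intermediate layer, not only at the final one, so each level of the
# hierarchy is forced to stay grounded in the target signal.
def layer(x, scale):
    return [scale * v for v in x]          # stand-in encoder layer

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def deep_supervised_loss(x_context, x_target, scales=(0.5, 0.8, 1.0)):
    total, hc, ht = 0.0, x_context, x_target
    for s in scales:
        hc, ht = layer(hc, s), layer(ht, s)
        # Loss at EVERY level: each layer's representation of the context
        # must match the same layer's view of the target -- versus the
        # standard setup, which supervises only the final layer.
        total += mse(hc, ht)
    return total

print(deep_supervised_loss([1.0, 2.0], [1.1, 2.1]))
```

The training signal is a sum of per-layer terms rather than a single top-layer term, which is the structural change the paper’s “deep self-supervision” pillar describes.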

This is still Meta research, not AMI Labs product. But it validates LeCun’s architectural direction and narrows the gap between the theoretical promise of world models and practical deployment in robotics and autonomous systems.

Sources: TechCrunch · The Next Web · Wired · AMI Labs · JEPA paper (LeCun, 2022)

Related: AI Hallucinations Are Mathematically Inevitable · The AI Jobs Chart That Explains What’s Coming for Your Career · Paul Conyngham’s Cancer Vaccine Pipeline