Meta’s Llama 4 is here, but long-context gets messy

In a newly published post, Simon Willison digs into what Meta’s Llama 4 Scout and Maverick actually deliver today—especially around multimodal MoE and eye-popping context claims. His early tests and provider limits paint a more complicated picture. Read more at simonwillison.net/2025/Apr/5/llama-…

TL;DR

  • Llama 4 Maverick: 400B MoE (128 experts, 17B active), text+image, 1M token context claimed
  • Llama 4 Scout: 109B MoE (16 experts, 17B active), multimodal, 10M token context claimed
  • Llama 4 Behemoth: described but unreleased; 2T total, 288B active; used to train Scout/Maverick
  • Access via OpenRouter: routes to Groq, Fireworks, and Together; provider caps observed (e.g., 128K or 328K for Scout)
  • Local running: community notes suggest giant MoE impractical on consumer GPUs; MLX results reported on M3 Ultra
  • Long-context tests: Maverick produced usable summary; Scout looped output; Meta reported mixed service quality; Gemini 2.5 Pro better

Meta’s new Llama 4 family landed over a weekend, and Simon Willison’s early notes are a useful snapshot of what’s actually shipping today versus what’s still aspirational—especially around multimodal MoE designs and the headline-grabbing long-context claims.

Two models now, one looming later

Willison highlights Meta’s two newly released models:

  • Llama 4 Maverick: a 400B MoE model (128 experts, 17B active parameters) with text + image input and a 1M token context length.
  • Llama 4 Scout: 109B total parameters (16 experts, 17B active) with multimodal inputs and a claimed 10M token context length—positioned as an “industry first.”

Meta also described, but hasn’t released, Llama 4 Behemoth: a 2T total-parameter model with 288B active parameters that was used to train Scout and Maverick.

Access is easy; full context isn’t (yet)

One of the more grounded parts of the post is how quickly the “10M tokens” story runs into provider limits. Willison points to OpenRouter as a convenient entry point for both Scout and Maverick, routing requests through providers including Groq, Fireworks, and Together.

In practice, Scout’s available context window was capped well below the claim depending on provider (Willison observed limits like 128K and 328K), and Maverick’s 1M-class window also varied by provider.
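For readers who want to try that access path themselves, here is a minimal sketch of an OpenRouter request using only the standard library. It assumes OpenRouter's OpenAI-compatible chat-completions endpoint; the model slug `meta-llama/llama-4-scout` is an assumption — check OpenRouter's model listing for the exact ID and each provider's context cap.

```python
import json
import os
import urllib.request

# Hypothetical model slug — verify against openrouter.ai/models.
API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "meta-llama/llama-4-scout"

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble the OpenAI-compatible request body OpenRouter expects."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def call_openrouter(prompt: str) -> str:
    """Send the request; requires OPENROUTER_API_KEY in the environment."""
    payload = build_request(prompt)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Note that the effective context limit is set by whichever provider OpenRouter routes to, not by the model card, which is exactly the gap Willison's post highlights.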

Running it locally: Macs, memory, and MoE reality

Willison collects a few telling community datapoints, including Jeremy Howard’s take that these giant MoEs won’t fit consumer GPUs—even quantized—while potentially making more sense on high-memory Macs. He also links to a thread reporting MLX results on an M3 Ultra, including tokens/sec and RAM requirements across quantization levels.
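The memory arithmetic behind that point is easy to sketch: in an MoE, only ~17B parameters are active per token, but all expert weights must be resident, so the total parameter count drives RAM requirements. A rough weights-only estimate (ignoring KV cache and activations, which add more):

```python
# Back-of-envelope memory footprint for MoE weights at various
# quantization levels. Totals are weights only — KV cache and
# activations come on top.
def weight_gb(total_params_b: float, bits_per_param: float) -> float:
    """Decimal GB needed to hold total_params_b billion parameters."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

for name, total in [("Scout", 109), ("Maverick", 400)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_gb(total, bits):.0f} GB")
```

Even at 4-bit, Scout's ~109B total parameters need roughly 55 GB for weights alone — beyond any consumer GPU, but within reach of a high-memory Mac's unified RAM, which matches the community datapoints above.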

Separately, Willison calls out a surprising “suggested system prompt” section in Meta’s model card that reads like it’s compensating for behavioral quirks.

Early long-context testing: one model loops, another “OK,” Gemini still wins

Willison’s own experiments—using his LLM CLI tooling with OpenRouter and later Groq—are where the post gets especially interesting for developers thinking about long-context summarization. Maverick produced a workable Hacker News thread summary, while Scout (via OpenRouter) returned looping nonsense that looked like a broken deployment. A later update cites Meta acknowledging mixed quality across services while implementations get “dialed in.”

He also compares results against Gemini 2.5 Pro, which produced dramatically better output in his test.

Where it might go next

The post closes with a pragmatic hope: that Llama 4 follows Llama 3’s arc and expands into a broader size range—especially smaller models that run comfortably on phones and mid-range laptops.

Full write-up: Initial impressions of Llama 4.

Continue the conversation on Slack

Did this article spark your interest? Join our community of experts and enthusiasts to dive deeper, ask questions, and share your ideas.
