Meta’s new Llama 4 family landed over a weekend, and Simon Willison’s early notes are a useful snapshot of what’s actually shipping today versus what’s still aspirational—especially around multimodal MoE designs and the headline-grabbing long-context claims.
Two models now, one looming later
Willison highlights Meta’s two newly released models:
- Llama 4 Maverick: a 400B MoE model (128 experts, 17B active parameters) with text + image input and a 1M token context length.
- Llama 4 Scout: 109B total parameters (16 experts, 17B active) with multimodal inputs and a claimed 10M token context length—positioned as an “industry first.”
Meta also previewed (but hasn’t released) Llama 4 Behemoth, a 2T-total-parameter model with 288B active parameters that was used to train Scout and Maverick.
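The total-versus-active split above is what makes MoE sizing counterintuitive: only a handful of experts fire per token, so per-token compute tracks the active parameters even though every weight must be loaded. A quick sketch using the figures from the post:

```python
# Why "17B active" matters: in an MoE layer only a few experts run per
# token, so per-token compute scales with the active parameter count,
# not the total. Parameter counts are the ones quoted in the post.

models = {
    # name: (total params, active params), in billions
    "Scout":    (109, 17),
    "Maverick": (400, 17),
    "Behemoth": (2000, 288),
}

for name, (total, active) in models.items():
    print(f"{name}: {active}B of {total}B params active "
          f"per token ({active / total:.1%})")
```

Maverick, despite being nearly 4x Scout's total size, activates the same 17B parameters per token, which is why both can be served with comparable per-token compute.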
Access is easy; full context isn’t (yet)
One of the more grounded parts of the post is how quickly the “10M tokens” story runs into provider limits. Willison points to OpenRouter as a convenient entry point for both Scout and Maverick, routing requests through providers including Groq, Fireworks, and Together.
In practice, Scout’s available context window was capped well below the 10M claim, varying by provider (Willison observed limits such as 128K and 328K tokens), and Maverick’s 1M-class window likewise differed from provider to provider.
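For developers, the practical consequence is that a pipeline has to budget against the provider's *effective* cap, not the model's advertised maximum. A minimal sketch, assuming a crude 4-characters-per-token heuristic (a real pipeline would use the provider's tokenizer):

```python
# Minimal sketch: split a long document into chunks that fit a provider's
# *effective* context cap rather than the model's advertised maximum.
# CHARS_PER_TOKEN is a rough assumption, not a real tokenizer.

CHARS_PER_TOKEN = 4

def chunk_for_context(text: str, provider_cap_tokens: int,
                      reserve_tokens: int = 2048) -> list[str]:
    """Split `text` into pieces that fit within `provider_cap_tokens`,
    reserving `reserve_tokens` for the prompt and the model's reply."""
    budget_chars = (provider_cap_tokens - reserve_tokens) * CHARS_PER_TOKEN
    return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]

# A document that would fit comfortably in the advertised 10M window
# still needs several passes at a 128K provider cap:
doc = "x" * 2_000_000  # ~500K "tokens" under the heuristic
chunks = chunk_for_context(doc, provider_cap_tokens=128_000)
print(len(chunks))
```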
Running it locally: Macs, memory, and MoE reality
Willison collects a few telling community datapoints, including Jeremy Howard’s take that these giant MoEs won’t fit consumer GPUs—even quantized—while potentially making more sense on high-memory Macs. He also links to a thread reporting MLX results on an M3 Ultra, including tokens/sec and RAM requirements across quantization levels.
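The "won't fit consumer GPUs, might fit a high-memory Mac" point comes down to weight storage. A back-of-envelope sketch (weights only; KV cache and activations are ignored, so treat these as lower bounds, not benchmarks):

```python
# Rough weight-memory estimates behind the "high-memory Mac" point:
# at lower-bit quantization the full weight set can fit in unified
# memory, but it's still far beyond a consumer GPU's VRAM. These are
# weights-only lower bounds, not measured RAM usage.

def weights_gb(params_billion: float, bits: int) -> float:
    # 1B params at 8 bits = 1 GB, so GB = billions * bits / 8
    return params_billion * bits / 8

for name, params in [("Scout", 109), ("Maverick", 400)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weights_gb(params, bits):.1f} GB")
```

Even Scout at 4-bit lands around 54 GB of weights, comfortably above any consumer GPU but within reach of a 128 GB-plus Mac Studio, which matches the MLX-on-M3-Ultra reports Willison links.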
Separately, Willison calls out a surprising “suggested system prompt” section in Meta’s model card that reads like it’s compensating for behavioral quirks.
Early long-context testing: one model loops, another “OK,” Gemini still wins
Willison’s own experiments—using his LLM CLI tooling with OpenRouter and later Groq—are where the post gets especially interesting for developers thinking about long-context summarization. Maverick produced a workable Hacker News thread summary, while Scout (via OpenRouter) returned looping nonsense that looked like a broken deployment. A later update cites Meta acknowledging mixed quality across services while implementations get “dialed in.”
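The looping failure mode is worth guarding against programmatically while deployments get dialed in. A crude sanity check is to flag output where a single n-gram dominates; the window size and threshold below are arbitrary assumptions, not tuned values:

```python
# Quick check for the "looping nonsense" failure mode the post describes:
# flag output where one word n-gram accounts for an outsized share of all
# n-grams. n and threshold are arbitrary assumptions, not tuned values.

from collections import Counter

def looks_loopy(text: str, n: int = 5, threshold: float = 0.2) -> bool:
    """True if any single n-gram exceeds `threshold` of all n-grams in
    `text` -- a crude signal of degenerate repetition."""
    words = text.split()
    if len(words) < 2 * n:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    _, top_count = Counter(ngrams).most_common(1)[0]
    return top_count / len(ngrams) > threshold

broken = "improve the prompt " * 40  # the kind of loop a bad deployment emits
print(looks_loopy(broken))           # repetitive output is flagged
```

A check like this, run before accepting a long-context summary, would have caught the broken Scout deployment without a human in the loop.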
He also compares results against Gemini 2.5 Pro, which produced dramatically better output in his test.
Where it might go next
The post closes with a pragmatic hope: that Llama 4 follows Llama 3’s arc and expands into a broader size range—especially smaller models that run comfortably on phones and mid-sized laptop setups.
Full write-up: Initial impressions of Llama 4.