GLM-4.7 from Z.ai is now available on the Cerebras Inference Cloud, delivering a blend of improved developer-focused intelligence and much lower inference latency. On Cerebras hardware, GLM-4.7 generates at approximately 1,000 tokens per second and can reach up to 1,700 tokens per second in some cases.
Frontier intelligence tuned for coding and agent workflows
GLM-4.7 advances several practical areas for developer workflows compared with GLM-4.6. It shows stronger code generation and editing, improved multilingual output, and better handling of long, iterative sessions. The model demonstrates an improved ability to understand project context, recover from errors, and progressively refine solutions across multiple turns.
Agentic workflows benefit from more robust planning and tool-calling behavior. Two internal reasoning patterns are emphasized: interleaved thinking, where reasoning is performed before each action or tool call rather than as a single upfront step, and preserved thinking, which allows reasoning context to persist across turns. These changes aim to produce more consistent behavior on complex math, logic, and tool-augmented tasks and to reduce the need to rebuild plans from scratch during long interactions.
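From the client side, these patterns show up as an ordinary tool-calling loop in which every assistant turn, including its reasoning, stays in the message history. The sketch below assumes an OpenAI-compatible chat completions endpoint; the `run_tests` tool, the model id `glm-4.7`, and the base URL are illustrative assumptions, not confirmed API details:

```python
# Sketch of an agent loop: the model reasons before each tool call
# (interleaved thinking), and the full message history -- including prior
# assistant turns -- is carried forward (preserved thinking).
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_API_KEY")

def run_tests(path: str) -> dict:
    # Stub for illustration; a real implementation would invoke the test runner.
    return {"passed": 10, "failed": 0, "path": path}

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, named here for illustration
        "description": "Run the project test suite and return failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Fix the failing tests in ./src"}]

for _ in range(8):  # cap the number of agent turns
    resp = client.chat.completions.create(
        model="glm-4.7", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    # Append the assistant turn so its context persists across turns
    # instead of the plan being rebuilt from scratch each step.
    messages.append(msg)
    if not msg.tool_calls:
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = run_tests(args["path"])
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
```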
On developer-oriented evaluations such as SWE-bench, τ²-bench, and LiveCodeBench, GLM-4.7 ranks as the top open-weight model, outperforming other open models like DeepSeek-V3.2 across a broad set of advanced benchmarks.
Record speed on Cerebras hardware
A defining characteristic of this deployment is throughput. When served from Cerebras’ wafer-scale engine, GLM-4.7 delivers real-time generation rates that the announcement emphasizes as critical for user-facing and latency-sensitive applications. The combination of frontier-level model quality with real-time inference makes live coding assistants, responsive agents, and similar use cases more practical without shifting latency to the critical path.
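For user-facing paths, streaming is the usual way to turn raw throughput into perceived responsiveness. A minimal sketch, again assuming the OpenAI-compatible endpoint and an illustrative model id:

```python
# Stream tokens as they are generated so a UI can render output in real time.
# Endpoint and model id are assumptions based on the announcement, not verified.
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_API_KEY")

stream = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```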
Price-performance in practice
Rather than focusing solely on price per token, the practical economics highlighted here center on how quickly useful output is produced. GLM-4.7 on Cerebras is said to run up to an order of magnitude faster than leading closed models such as Claude Sonnet 4.5 on real coding and agentic workloads. That speed advantage translates to lower end-to-end costs by shortening sessions, reducing concurrency requirements, and decreasing infrastructure needs to deliver comparable user experiences.
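As a rough back-of-the-envelope, throughput dominates the wall-clock time of a long agentic session. The token budget and the baseline TPS below are illustrative assumptions chosen to mirror the claimed 10× gap, not published figures:

```python
# Illustrative: wall-clock generation time for a long agentic session
# at different throughputs. All numbers are assumptions for illustration.
SESSION_TOKENS = 200_000  # total tokens generated across the session

for name, tps in [("GLM-4.7 on Cerebras", 1_000), ("slower baseline", 100)]:
    minutes = SESSION_TOKENS / tps / 60
    print(f"{name}: {minutes:.1f} minutes of pure generation time")
# GLM-4.7 on Cerebras: 3.3 minutes of pure generation time
# slower baseline: 33.3 minutes of pure generation time
```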
Cerebras positions GLM-4.7 as delivering roughly 10× higher price-performance versus Claude Sonnet 4.5, and comparable price-performance to DeepSeek-V3.2 while offering higher accuracy on the cited developer evaluations.
Deployment and migration
GLM-4.7 is compatible with existing GLM-4.6 chat completions workflows and uses the same API surface, so migrating typically involves updating the model name. The Cerebras Cloud offers a pay-as-you-go developer tier starting at $10, with rate limits intended to simplify prototyping and early scaling.
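Under that claim, migration can be as small as swapping the model name in an existing call. A sketch assuming the OpenAI-compatible Python client; the model ids shown are assumptions:

```python
# Existing GLM-4.6 chat completions call, migrated by changing the model name.
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="glm-4.7",  # previously: model="glm-4.6"
    messages=[{"role": "user", "content": "Summarize this diff for a PR."}],
)
print(resp.choices[0].message.content)
```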
Migration and trial resources:
- GLM-4.7 migration checklist
- Try GLM-4.7 on Cerebras Cloud
- Model details from Z.ai: https://z.ai/blog/glm-4.7
- Community: Discord, X
Original source: GLM-4.7 on Cerebras