GLM-4.7: Z.AI’s $3/month Coding AI with 200K Context & Thinking Modes

GLM-4.7 (GLM Coding Plan, $3/mo) delivers 200K context, 128K output tokens, thinking modes, streaming, function calling, and context caching. It is optimized for end-to-end, multi-step coding, agentic workflows, and automated UI generation.


TL;DR

  • Text-only I/O; context length 200K; max output 128K; capabilities: thinking modes, real-time streaming, function calling, context caching, structured output.
  • Shift from single-turn snippets to end-to-end task completion with a “think before acting” pattern and three reasoning strategies: interleaved, retention-based, round-level.
  • Targeted scenarios: agentic full‑stack prototypes, multimodal real-time apps, automated web UI generation, improved multi-turn dialogue/collaboration, and creative/office asset production; frontend aesthetics improved and PPT 16:9 compatibility rose from 52% to 91%.
  • Reported benchmark results: programming parity with Claude Sonnet 4.5; tool invocation scores of 67 on BrowseComp and 84.7 on τ²‑Bench; reasoning HLE 42.8% (a reported 41% improvement over GLM‑4.6); SWE‑bench Verified 73.8% (+5.8%), SWE‑bench Multilingual 66.7% (+12.9%), LiveCodeBench V6 84.9 (open-source SOTA), Terminal Bench 2.0 41% (+16.5%); ranked first among open-source/domestic models on Code Arena.
  • Documentation includes cURL, Python, and Java quick-starts; enable thinking via the thinking parameter set to "enabled" and use stream: true for incremental responses; API reference: https://docs.z.ai/api-reference/llm/chat-completion — thinking mode guide: https://docs.z.ai/guides/capabilities/thinking-mode
  • GLM Coding Plan listed starting at $3/month (subscription link: https://z.ai/subscribe?utm_source=zai&utm_medium=link&utm_term=guide&utm_content=glm-coding-plan&utm_campaign=Platform_Ops&_channel_track_key=Xz9zVAvo)

Overview

Z.AI’s GLM-4.7 has been introduced as the core model in the GLM Coding Plan subscription, positioned for AI-assisted coding workflows and integrated into tools such as Claude Code, Cline, OpenCode, and Roo Code. The plan is listed as starting at $3/month. GLM-4.7 emphasizes end-to-end task delivery across multi-step development scenarios while offering expanded context and output capacity.

Key technical details

  • Input/Output: text-only
  • Context length: 200K
  • Maximum output tokens: 128K
  • Notable capabilities: thinking modes, real-time streaming, function calling, context caching, structured output

These capabilities aim to support more complex interactions, longer conversations, and integration with external toolchains.
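As a concrete illustration of the function-calling capability, here is a minimal sketch of an OpenAI-style chat-completions request body of the kind the docs' quick-starts demonstrate. The tool name `get_weather` and its schema are illustrative assumptions, not taken from the GLM-4.7 documentation:

```python
import json

# Hypothetical function-calling request body in the OpenAI-style
# chat-completions format. The tool definition below is illustrative.
payload = {
    "model": "glm-4.7",
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # illustrative tool, not from the docs
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

When the model decides to call the tool, the response carries a structured `tool_calls` entry rather than plain text, which the client executes and feeds back as a follow-up message.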

What GLM-4.7 focuses on

GLM-4.7 shifts emphasis from single-turn code snippets toward end-to-end task completion: understanding requirements, decomposing solutions, and producing executable, multi-file or multi-stack code when needed. In programming contexts the model applies a “think before acting” pattern to stabilize multi-step reasoning and agent-based execution. Three thinking/reasoning strategies are highlighted:

  • Interleaved reasoning — performs reasoning before each response or tool invocation to improve instruction compliance.
  • Retention-based reasoning — preserves reasoning blocks across multi-turn sessions to improve cache hit rates and reduce computation for long tasks.
  • Round-level reasoning — allows control over reasoning overhead per session round (e.g., toggling for latency vs. accuracy trade-offs).
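The round-level strategy above can be sketched as a per-request toggle. This assumes the documented `thinking` parameter accepts `"enabled"`/`"disabled"` values in the request body; the exact payload shape is an assumption based on the quick-start description:

```python
def build_request(messages: list[dict], think: bool) -> dict:
    """Build a chat-completion body, toggling reasoning per session round.

    The "thinking" field mirrors the documented parameter; treating
    "disabled" as the off value is an assumption.
    """
    return {
        "model": "glm-4.7",
        "messages": messages,
        # Spend reasoning tokens on hard rounds; skip them on
        # latency-sensitive ones (the accuracy vs. latency trade-off).
        "thinking": {"type": "enabled" if think else "disabled"},
    }


# A hard refactoring round gets full reasoning...
hard = build_request([{"role": "user", "content": "Refactor this module"}], think=True)
# ...while a trivial rename skips it to cut latency.
fast = build_request([{"role": "user", "content": "Rename variable x to y"}], think=False)
```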

Usage scenarios

GLM-4.7 is presented for a range of developer-focused and creative tasks:

  • Agentic coding: full-stack prototypes and demos where frontend, backend, and peripheral interactions must interoperate.
  • Multimodal real-time apps: camera input, gesture control, and interactive controls combined with application logic.
  • Web UI generation: automated layout and styling with more consistent default aesthetics.
  • High-quality dialogue and collaboration: improved context retention for multi-turn exchanges.
  • Creative content and office assets: enhanced prose for narratives and better PPT/poster layout generation.

The documentation highlights improved default frontend aesthetics—layout structures, color harmony, and component styling—to reduce manual fine-tuning for prototypes and low-code tools. For office outputs, PPT 16:9 compatibility reportedly rose from 52% to 91%.

Benchmarks and improvements

GLM-4.7 reports measurable gains against prior GLM releases and other models across multiple benchmarks and internal evaluations:

  • Programming & agent skills: claims of parity with Claude Sonnet 4.5 on several coding benchmarks.
  • Tool invocation: 67 on BrowseComp and 84.7 on the τ²-Bench interactive tool invocation benchmark (open-source SOTA in the document).
  • Reasoning: 42.8% on HLE (a reported 41% improvement vs. GLM-4.6), and stated to surpass GPT-5.1 on that metric.
  • Coding benchmarks: top open-source scores on SWE-bench (73.8% Verified, a +5.8% lift), SWE-bench Multilingual (66.7%, +12.9%), LiveCodeBench V6 (open-source SOTA 84.9), and Terminal Bench 2.0 (41%, +16.5%).
  • Code Arena: ranked first among open-source and domestic models in blind tests cited in the documentation.

All metric values and comparisons are taken from the source documentation.

Developer quick start

The documentation provides ready examples for calling the chat/completion endpoint using cURL, the official Python and Java SDKs, and the OpenAI-style Python SDK. The calls typically demonstrate enabling the thinking mode by setting a thinking parameter to "enabled" and optionally using streaming (stream: true) for incremental responses.
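Putting those two pieces together, here is a stdlib-only sketch of a streaming call with thinking enabled. The endpoint path is an assumption (check the API reference for the exact URL), and the SSE chunk shape follows the common OpenAI-style `data: {...}` convention:

```python
import json
import urllib.request

API_URL = "https://api.z.ai/api/paas/v4/chat/completions"  # assumed endpoint path
API_KEY = "YOUR_API_KEY"

body = {
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Write a Python hello-world."}],
    "thinking": {"type": "enabled"},  # thinking parameter per the docs
    "stream": True,                   # request incremental responses
}


def stream_completion() -> None:
    """POST the request and print content deltas as they arrive."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    # Streamed responses arrive as "data: {...}" server-sent-event lines.
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.strip()
            if line.startswith(b"data: ") and line != b"data: [DONE]":
                chunk = json.loads(line[len(b"data: "):])
                delta = chunk["choices"][0]["delta"].get("content", "")
                print(delta, end="", flush=True)
```

The same body works without streaming by dropping `"stream": True`, in which case the full completion returns in one response object.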

A developer-focused API reference is provided for implementation details: https://docs.z.ai/api-reference/llm/chat-completion. The GLM-4.7 quick-start sections include sample cURL requests, Python and Java snippets, and SDK installation instructions.

Resources

Original source: https://docs.z.ai/guides/llm/glm-4.7
