Inside OpenAI's Codex CLI Agent Loop: Streaming, Tools, and Caching

An engineering write-up explains how OpenAI's Codex CLI runs an agent loop: it builds structured prompts, streams Responses API output, executes local tools (like shell commands), appends results, and re-queries the model until completion. It also outlines prompt caching, stateless requests, and conversation compaction.


TL;DR

  • Agent loop: Alternates model inference and tool execution; assembles prompts, streams Responses API events, runs tool calls locally, appends outputs, and repeats until the model emits a final assistant message.
  • Structured prompt fields: Uses a Responses API payload with instructions, tools, and input items instead of a flat text prompt.
  • Streaming and typed items: The Responses API streams via SSE and emits discrete events (reasoning summaries, function calls, output text); model steps are stored as typed conversation items (reasoning, function_call, function_call_output).
  • Tool execution: Tools can modify the local filesystem or call external services; tool outputs are captured and appended so the model can reason over observed results; MCP servers may supply tools dynamically.
  • Prompt caching and performance: Context-window limits are managed by placing static instructions/tools up front and variable content later; exact-prefix prompt caching reuses sampling work to reduce latency.
  • Compaction and stateless operation: Avoids previous_response_id for ZDR-compatible stateless requests; uses /responses/compact and an auto_compact_limit to replace long histories with a compacted item.

OpenAI’s Codex CLI centers on an agent loop that coordinates model reasoning, tool execution, and local environment changes to produce reliable software edits. The recent engineering write-up lays out how the CLI assembles multi-part prompts, streams model output via the Responses API, executes tool calls locally (for example, shell commands), and iterates until the model emits a final assistant message.

The agent loop, in practice

The agent loop alternates between two phases: model inference and tool execution. A conversation turn begins by assembling a prompt from a set of items (system/developer instructions, tool definitions, environment description, and the user’s message). The model is queried via the Responses API, which returns a stream of events. If the model issues a tool call (for example, a request to run a shell command), the agent runs that tool locally, appends the tool output to the prompt, and re-queries the model. This cycle repeats until the model produces an assistant message that signals the turn’s end.
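The loop described above can be sketched in a few lines of Python. This is an illustrative harness, not Codex's implementation: `query_model` and `run_tool` are hypothetical stand-ins for the streamed Responses API call and the local tool runner, but the item shapes (`function_call`, `function_call_output`, `message`) mirror the typed items the write-up describes.

```python
from typing import Callable

def run_agent_loop(query_model: Callable[[list], dict],
                   run_tool: Callable[[dict], str],
                   items: list) -> list:
    """Alternate model inference and tool execution until the model
    emits a final assistant message ending the turn."""
    while True:
        event = query_model(items)   # one streamed turn, collapsed to its final item
        items.append(event)          # model steps stay in the conversation
        if event["type"] == "function_call":
            output = run_tool(event)  # execute the tool locally (e.g. a shell command)
            items.append({"type": "function_call_output",
                          "call_id": event["call_id"],
                          "output": output})
        elif event["type"] == "message":  # final assistant message ends the turn
            return items
```

Because every tool output is appended as a `function_call_output` item, each re-query gives the model the full record of what actually happened, not just what it asked for.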

Key technical points:

  • The Responses API interaction is streamed via SSE and produces discrete events such as reasoning summaries, function calls, and output text.
  • Model outputs are appended back into the conversation as structured items (e.g., reasoning, function_call, function_call_output) so subsequent requests include the model’s internal steps.

How prompts are constructed

Codex does not send a flat text prompt directly; instead, the Responses API payload contains structured fields. The most relevant fields are:

  1. instructions — system or developer guidance used for model context
  2. tools — a schema-defined list of tools the model may call (Codex supplies built-in tools such as a local shell and an update_plan function, plus support for MCP-provided tools and Responses API web_search)
  3. input — a sequenced list of message items (developer messages, environment context, aggregated project docs, and the user message)

Each input item carries a role (system/developer/user/assistant) and a typed content block. Codex inserts additional items up front—sandbox and permission instructions, optional developer_instructions, aggregated project docs and skills metadata, plus a description of the local environment (cwd and shell)—before appending the user message.

Tools, execution, and observable state

Because tools can change the local filesystem or query external services, tool outputs are explicitly captured and appended to the conversation so the model can reason with observed results. MCP servers can supply tools dynamically, which adds flexibility but can also change the prompt mid-conversation.

Managing performance and long histories

Two practical constraints shape the design:

  • Context window: every model has a maximum token context. As conversations accumulate tool calls and messages, the prompt grows and can exhaust that window.
  • Prompt caching: sampling cost dominates overall latency; the Responses API supports prompt caching based on exact prefix matches. To realize caching benefits, static instructions and tools are placed near the start of the prompt while variable content is appended later. When the input is an exact prefix of a prior request, sampling work can be reused, making sampling linear rather than quadratic across iterations.
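The exact-prefix rule behind this layout can be modeled in a few lines. This is a toy illustration of the matching behavior, not the server's cache implementation: prompts are flattened to strings, and a request only reuses cached work when a previously seen prompt is an exact prefix of the new one.

```python
def cached_prefix_len(previous_prompts: list[str], new_prompt: str) -> int:
    """Return the length of the longest previously seen prompt that is an
    exact prefix of the new prompt (0 means a full cache miss)."""
    best = 0
    for old in previous_prompts:
        if new_prompt.startswith(old):
            best = max(best, len(old))
    return best

static = "INSTRUCTIONS|TOOLS|"            # stable across iterations
turn1 = static + "user:fix bug|"
turn2 = turn1 + "function_call|output|"   # appending keeps turn1 an exact prefix
```

Because each iteration only appends items, turn N's prompt is always an exact prefix of turn N+1's, so the cached share of the prompt grows with the conversation instead of being re-sampled from scratch.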

Codex intentionally avoids the Responses API’s optional previous_response_id to keep requests stateless and to support ZDR (Zero Data Retention) configurations. Stateless requests simplify operation for providers and align with ZDR requirements; encrypted reasoning content can still preserve private latent state without persisting conversation data.

When prompt length threatens the context window, Codex uses conversation compaction. Initial implementations required manual compacting; the Responses API now exposes a special /responses/compact endpoint that returns a compacted list of items (including an opaque compaction item with encrypted_content) which can replace the prior input. Codex automatically invokes compaction once the configured auto_compact_limit is exceeded.
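The trigger logic can be sketched as follows. The `/responses/compact` endpoint and `auto_compact_limit` setting come from the write-up; `estimate_tokens` and `compact` here are hypothetical stubs standing in for a token counter and the actual endpoint call.

```python
def maybe_compact(items: list, auto_compact_limit: int,
                  estimate_tokens, compact) -> list:
    """Replace the conversation with a compacted item list once the
    estimated token count exceeds the configured limit."""
    if sum(estimate_tokens(item) for item in items) > auto_compact_limit:
        return compact(items)  # e.g. POST /responses/compact in the real API
    return items
```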

Factors that can cause cache misses include changing the tools list, switching the model, or modifying sandbox or working-directory configuration mid-conversation. To minimize cache disruption, configuration changes are often appended as new messages rather than mutating earlier items.
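As a small illustration of the append-only approach, a mid-conversation working-directory change might be recorded like this (the `change_cwd` helper and message wording are hypothetical, but the principle of appending rather than mutating is from the write-up):

```python
def change_cwd(items: list, new_cwd: str) -> list:
    """Record a config change as a new message, leaving earlier items
    untouched so the existing prompt stays an exact cache prefix."""
    return items + [{"role": "developer",
                     "content": f"<environment>cwd={new_cwd}</environment>"}]
```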

Looking ahead

The agent loop functions as the core harness for Codex’s editing capabilities, tying model reasoning to deterministic tool effects on the local machine. The engineering post indicates further deep dives are planned into the CLI architecture, tool implementation, and sandboxing behavior.

Original source: https://openai.com/index/unrolling-the-codex-agent-loop/
