How to build AI agents from first principles, not frameworks

Anshuman Mishra lays out a bottom-up recipe for agent training using a tiny text-to-diagram task. The key: start with a strict environment and reward loop, use SFT to learn valid actions, then apply RL to optimize behavior, watching for reward hacking along the way.

TL;DR

  • First-principles agent loop: Start from environment, then policy/action space, then RL; stack/frameworks come later
  • Text-to-diagram example: Model emits JSON actions to build shapes/arrows; must match schema, order, readability
  • Pure-Python canvas environment: Validation rules for shapes, arrows, IDs, coordinates; accept/reject model output
  • Reward design focus: Reward blends parseability, layout quality, semantic coverage; training stack mainly infrastructure
  • Teacher trajectories + SFT: Stronger model (e.g., Gemini) generates actions; valid samples become SFT dataset; “SFT buys syntax, RL buys optimization”
  • RL via group-relative scoring: Multiple completions scored; better reinforced; mapped to TRL’s GRPOTrainer; highlights failure modes and security concerns for coding agents

Anshuman Mishra’s May 20 post on X makes a straightforward case for building agents from "first principles," using a text-to-diagram example to walk through the loop from prompt to model action to environment, reward, and update.

The thread argues that many post-training tutorials begin too far up the stack, with a framework and a trainer, rather than with the environment itself. Mishra’s setup starts lower down: before reinforcement learning, there is an action space; before an agent, there is a policy; and before either, there is a system that can accept or reject the model’s output.

The example is deliberately narrow. A user asks for a simple diagram, and the model is expected to emit JSON actions that create shapes and connections on a canvas. The completion is only useful if it is valid JSON, matches the schema, creates objects in the right order, and produces a readable result. That distinction, the post suggests, is what separates ordinary chat fine-tuning from agent training.
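To illustrate what "valid" could mean here, the following is a minimal sketch of an accept/reject check, not the code from the thread. The action names (`add_shape`, `add_arrow`), the field names (`id`, `kind`, `x`, `y`, `src`, `dst`), the allowed shape kinds, and the canvas bounds are all assumptions made for this example.

```python
import json

ALLOWED_KINDS = {"rect", "ellipse", "diamond"}  # hypothetical shape kinds
CANVAS_W, CANVAS_H = 800, 600                   # hypothetical canvas bounds


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_actions(completion):
    """Return (ok, reason) for a model completion that should be a JSON action list."""
    try:
        actions = json.loads(completion)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False, "not valid JSON"
    if not isinstance(actions, list):
        return False, "top level must be a list of actions"

    seen_ids = set()
    for i, act in enumerate(actions):
        if not isinstance(act, dict):
            return False, f"action {i}: not an object"
        op = act.get("op")
        if op == "add_shape":
            sid, kind = act.get("id"), act.get("kind")
            if not isinstance(sid, str) or not sid or sid in seen_ids:
                return False, f"action {i}: missing or duplicate id"
            if not isinstance(kind, str) or kind not in ALLOWED_KINDS:
                return False, f"action {i}: unknown shape kind"
            x, y = act.get("x"), act.get("y")
            if not (_is_number(x) and _is_number(y)):
                return False, f"action {i}: coordinates must be numbers"
            if not (0 <= x <= CANVAS_W and 0 <= y <= CANVAS_H):
                return False, f"action {i}: coordinates out of bounds"
            seen_ids.add(sid)
        elif op == "add_arrow":
            src, dst = act.get("src"), act.get("dst")
            # Order matters: an arrow may only connect shapes created earlier.
            if not (isinstance(src, str) and isinstance(dst, str)):
                return False, f"action {i}: arrow endpoints must be ids"
            if src not in seen_ids or dst not in seen_ids:
                return False, f"action {i}: arrow references a missing shape"
        else:
            return False, f"action {i}: unknown op"
    return True, "ok"
```

In this sketch, a completion that draws an arrow before both of its endpoints exist is rejected even though it is syntactically valid JSON, which is the kind of constraint a plain chat fine-tune never has to satisfy.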

To make that concrete, the thread includes a pure-Python canvas environment with validation rules for shapes, arrows, IDs, and coordinate values. It also sketches a reward function that combines parseability, layout quality, and semantic coverage. The post’s argument is that reward design is where the task really lives, while the training stack mainly provides infrastructure around that loop.
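A reward along those lines might be sketched as follows. This is an illustration under stated assumptions rather than the thread's actual function: the weights, the `min_gap` spacing heuristic, the `label` field, and the idea of passing `None` for a rejected completion are all hypothetical choices.

```python
import math


def layout_score(shapes, min_gap=40.0):
    """Fraction of shape pairs whose centers are at least min_gap apart."""
    if not shapes:
        return 0.0  # an empty canvas should not earn layout credit
    pairs = [(a, b) for i, a in enumerate(shapes) for b in shapes[i + 1:]]
    if not pairs:
        return 1.0  # a single shape cannot overlap anything
    spaced = sum(
        1 for a, b in pairs
        if math.dist((a["x"], a["y"]), (b["x"], b["y"])) >= min_gap
    )
    return spaced / len(pairs)


def coverage_score(shapes, required_labels):
    """Fraction of required labels that appear on some shape (case-insensitive)."""
    if not required_labels:
        return 1.0
    present = {str(s.get("label", "")).lower() for s in shapes}
    hits = sum(1 for label in required_labels if label.lower() in present)
    return hits / len(required_labels)


def reward(shapes, required_labels, weights=(0.2, 0.3, 0.5)):
    """Blend parseability, layout, and coverage into one scalar in [0, 1].

    shapes is None when the environment rejected the completion.
    The weights are assumed values that sum to 1.
    """
    if shapes is None:
        return 0.0
    w_parse, w_layout, w_cov = weights
    return (
        w_parse * 1.0
        + w_layout * layout_score(shapes)
        + w_cov * coverage_score(shapes, required_labels)
    )
```

The design choice worth noting is that each component is a separate function. That makes it possible to log the three terms individually during training and see which one the policy is actually improving.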

The thread then turns to teacher trajectories. A stronger model, such as Gemini, can generate example actions; valid samples can be kept and converted into an SFT dataset. The post presents SFT as a way to teach the model the "language" of the environment, along with the line that drew a fair amount of attention: "SFT buys you syntax, RL buys you optimization."
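The filtering step can be sketched in a few lines. The prompt/completion record layout, the JSONL output, and the helper names below are assumptions for illustration; `parses_as_json_list` is only a stand-in for whatever validity check the real environment provides.

```python
import json


def build_sft_records(samples, is_valid):
    """Keep only teacher samples whose completion passes the environment check.

    samples: iterable of (prompt, completion) string pairs from the teacher.
    is_valid: callable that returns True when the environment accepts the text.
    """
    return [
        {"prompt": prompt, "completion": completion}
        for prompt, completion in samples
        if is_valid(completion)
    ]


def write_jsonl(records, path):
    """Write one JSON object per line, a common on-disk format for SFT data."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def parses_as_json_list(text):
    """Stand-in validator (hypothetical): accept any completion that is a JSON list."""
    try:
        return isinstance(json.loads(text), list)
    except ValueError:
        return False


teacher_samples = [
    ("Draw a client talking to a server", '[{"op": "add_shape", "id": "client"}]'),
    ("Draw a login flow", "Sure! Here is your diagram: ..."),  # rejected: not JSON
]
records = build_sft_records(teacher_samples, parses_as_json_list)
```

Here only the first sample survives, so the student model never sees a chatty, non-parseable answer as a training target.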

From there, the post moves to RL, describing a simple group-relative approach in which multiple completions for the same prompt are scored and the better ones are reinforced. The core claim is that RL is useful after SFT because the model has already learned to produce valid actions often enough for exploration to become meaningful. The thread also maps the same idea onto TRL’s GRPOTrainer.
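The group-relative idea can be shown without any training framework. The sketch below assumes one common variant, in which each completion's reward is centered on the group mean and scaled by the group's standard deviation; the function name and the `eps` constant are illustrative, and the thread's own code may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the others sampled for the same prompt.

    Completions above the group mean get a positive advantage and are
    reinforced; those below get a negative one. eps avoids division by zero
    when every completion in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by the environment's reward function.
advantages = group_relative_advantages([0.2, 0.9, 0.2, 0.7])
```

If every completion in a group receives the same reward, all advantages come out as zero and the update carries no signal for that prompt, which is one reason valid actions need to be reasonably frequent before RL starts.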

Alongside the technical material, the post spends time on failure modes: zero reward when the model never reaches valid states, reward hacking when the metric is too easy to exploit, and cases where reward rises without any visible improvement to a human reviewer. The thread treats those as environment-design problems rather than trainer problems.
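The first of those failure modes is easy to monitor. As a rough sketch, assuming rewards are logged per prompt as lists of floats, a small diagnostic can report how many groups gave the trainer nothing to work with; the function name and the returned keys are hypothetical.

```python
def diagnose_groups(reward_groups, tol=1e-9):
    """Summarize how many prompt groups provide no learning signal.

    reward_groups: list of lists, one inner list of rewards per prompt.
    Returns the fraction of groups where every reward is zero and the
    fraction where all rewards are identical (zero or not).
    """
    groups = [g for g in reward_groups if g]  # ignore empty groups
    if not groups:
        return {"all_zero_frac": 0.0, "no_signal_frac": 0.0}
    all_zero = sum(1 for g in groups if all(abs(r) <= tol for r in g))
    flat = sum(1 for g in groups if max(g) - min(g) <= tol)
    return {
        "all_zero_frac": all_zero / len(groups),
        "no_signal_frac": flat / len(groups),
    }


stats = diagnose_groups([[0.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

A check like this only catches the zero-reward case. Reward hacking and reward gains that a human cannot see still call for looking at rendered samples, since the metric itself is the thing under suspicion.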

Replies under the post mostly centered on the same themes. One user asked for an architecture diagram. Another described the write-up as a good read. A commenter highlighted the "SFT buys you syntax, RL buys you optimization" line, while another pointed to the section on failure modes as especially worth keeping. AgentGuard also chimed in with a caution that, when the setup moves from a toy diagram agent to a coding agent, "the environment stops being controlled" and a security layer may be needed between model action and execution.

The source also includes a small note that the article’s arguments were the author’s, while its writing and structure were refined with GPT 5.5, which the post describes as partly an experiment in speed-writing technical research blogs from rough notes.

Source: X post
