Anshuman Mishra’s May 20 post on X makes a straightforward case for building agents from "first principles," using a text-to-diagram example to walk through the loop from prompt to model action to environment, reward, and update.
The thread argues that many post-training tutorials begin too far up the stack, with a framework and a trainer, rather than with the environment itself. Mishra’s setup starts lower down: before reinforcement learning, there is an action space; before an agent, there is a policy; and before either, there is a system that can accept or reject the model’s output.
The example is deliberately narrow. A user asks for a simple diagram, and the model is expected to emit JSON actions that create shapes and connections on a canvas. The completion is only useful if it is valid JSON, matches the schema, creates objects in the right order, and produces a readable result. That distinction, the post suggests, is what separates ordinary chat fine-tuning from agent training.
To make that concrete, the thread includes a pure-Python canvas environment with validation rules for shapes, arrows, IDs, and coordinate values. It also sketches a reward function that combines parseability, layout quality, and semantic coverage. The post’s argument is that reward design is where the task really lives, while the training stack mainly provides infrastructure around that loop.
The thread then turns to teacher trajectories. A stronger model, such as Gemini, can generate example actions; valid samples can be kept and converted into an SFT dataset. The post presents SFT as a way to teach the model the "language" of the environment, along with the line that drew a fair amount of attention: "SFT buys you syntax, RL buys you optimization."
From there, the post moves to RL, describing a simple group-relative approach in which multiple completions for the same prompt are scored and the better ones are reinforced. The core claim is that RL is useful after SFT because the model has already learned to produce valid actions often enough for exploration to become meaningful. The thread also maps the same idea onto TRL’s GRPOTrainer.
Alongside the technical material, the post spends time on failure modes: zero reward when the model never reaches valid states, reward hacking when the metric is too easy to exploit, and cases where reward rises without any visible improvement to a human reviewer. The thread treats those as environment-design problems rather than trainer problems.
Replies under the post mostly centered on the same themes. One user asked for an architecture diagram. Another described the write-up as a good read. A commenter highlighted the "SFT buys you syntax, RL buys you optimization" line, while another pointed to the section on failure modes as especially worth keeping. AgentGuard also chimed in with a caution that, when the setup moves from a toy diagram agent to a coding agent, "the environment stops being controlled" and a security layer may be needed between model action and execution.
The source also includes a small note that the article’s arguments were the author’s, while its writing and structure were refined with GPT 5.5, which the post describes as partly an experiment in speed-writing technical research blogs from rough notes.
Source: X post
