Case Study

How I Built a Trading Agent That Outperformed GPT Using HUD

If you want to build agent systems that work in real-world scenarios, the difference between a viable trading system and a toy demo comes down to one thing: training infrastructure, not the model.

In this post, I'll break down exactly how I built Analyst Arena, an agent-vs-agent trading simulator, and more specifically how the model I built and trained with HUD outperformed GPT 5.2.

System Overview

Analyst Arena is a constrained simulation:

  • Each agent starts with $100K
  • Time horizon: 10 trading days
  • Actions: Buy or Sell only
  • Objective: maximize portfolio value

These constraints are important. They force the agent to make decisions under limited time, handle uncertainty, and avoid overfitting to long-term signals.
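
The constrained loop above can be sketched in a few lines. This is a minimal illustration of the rules ($100K start, buy/sell only, mark-to-market portfolio value), not the actual simulator's code; every class and method name here is mine.

```python
from dataclasses import dataclass, field

@dataclass
class Portfolio:
    """Toy model of the Analyst Arena constraints."""
    cash: float = 100_000.0
    holdings: dict = field(default_factory=dict)  # ticker -> shares

    def buy(self, ticker: str, shares: float, price: float) -> None:
        cost = shares * price
        if cost > self.cash:
            raise ValueError("insufficient cash")
        self.cash -= cost
        self.holdings[ticker] = self.holdings.get(ticker, 0.0) + shares

    def sell(self, ticker: str, shares: float, price: float) -> None:
        if self.holdings.get(ticker, 0.0) < shares:
            raise ValueError("insufficient shares")
        self.cash += shares * price
        self.holdings[ticker] -= shares

    def value(self, prices: dict) -> float:
        """Mark-to-market portfolio value: the objective being maximized."""
        return self.cash + sum(s * prices[t] for t, s in self.holdings.items())
```

Running this for 10 simulated days with one buy-or-sell action per day reproduces the shape of the environment the agent operates in.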

Step 1: Start With Tools, Not Prompts

The biggest mistake in agent design is starting with prompts. Instead, define: what capabilities does the agent need to make decisions?

Data Layer:

  • Company and ticker data
  • Financial statements
  • Earnings reports
  • Real-time news

These form the decision surface.

Decision Tools:

  • reason_to_trade
  • compute_portfolio_snapshot
  • summarize_position

These are critical. They don't just provide data — they structure reasoning, force consistency, and reduce hallucination risk.

Insight:

Your agent is only as good as its tool interface. Most performance gains come from here, not model upgrades.
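
To make this concrete, here is a hypothetical sketch of a structured decision tool like `reason_to_trade`. The tool name comes from the post, but the signature, fields, and validation are my assumptions, shown only to illustrate how a tool interface can force structured reasoning rather than free-form text.

```python
from typing import TypedDict

class TradeRationale(TypedDict):
    ticker: str
    action: str          # "buy" or "sell" only
    signals_used: list   # which data sources informed the call
    thesis: str          # reasoning the agent must commit to in writing
    confidence: float    # 0.0-1.0

def reason_to_trade(ticker: str, action: str, signals: list,
                    thesis: str, confidence: float) -> TradeRationale:
    """Forces the agent into a fixed rationale schema; invalid
    inputs are rejected instead of silently accepted."""
    if action not in ("buy", "sell"):
        raise ValueError("action must be 'buy' or 'sell'")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return TradeRationale(ticker=ticker, action=action, signals_used=signals,
                          thesis=thesis, confidence=confidence)
```

The design point: because the tool, not the prompt, defines what a valid decision looks like, consistency comes from the interface rather than from the model.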

Step 2: Expect Multiple Pivots

Your first toolset will be wrong 99% of the time. In my case, the system pivoted multiple times and tool definitions changed repeatedly. That's the process of building a strong agent.

Each iteration should answer:

  • What decisions is the agent failing at?
  • What information is missing?
  • What structure is unclear?

From there, keep iterating: change and extend the agent's toolset based on those answers.

Step 3: Define Scenarios

Once the tools are stable, define scenarios. Scenarios are structured environments where the agent operates. The system ended up with six scenarios, three of which were used in the final build:

a. trade_decision_step

  1. Runs daily for 10 days
  2. Pulls data
  3. Applies factor weightings to reason toward a decision
  4. Executes buy or sell

This is the highest-leverage component. If it fails, the agent loses money, errors compound quickly, and the system's core purpose is defeated.
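
The four daily steps above might look roughly like this in code. Every tool name and signature here is a placeholder I invented for illustration; the real scenario is defined inside HUD, not as a plain Python function.

```python
def trade_decision_step(day: int, tools: dict) -> dict:
    """Illustrative outline of the daily scenario."""
    data = tools["pull_data"](day)               # step 2: pull data
    weights = tools["factor_weights"]()          # step 3: current weightings
    score = sum(weights[f] * data["signals"][f] for f in weights)
    action = "buy" if score > 0 else "sell"      # step 4: execute
    tools["execute"](action, data["ticker"], data["price"])
    return {"day": day, "action": action, "score": score}
```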

b. factor_weight_ranking

  1. Determines which signals matter most
  2. Introduces internal prioritization
  3. Processes new weightings every day

This is how the model decides which data to weight for each trade. If it fails, trades go unreasoned and inflated signals drive bad buys.
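
As a toy illustration of the idea (not the system's actual ranking logic), factor weighting can be as simple as normalizing per-signal scores into weights, which the daily trade step then uses as a weighted sum:

```python
def compute_factor_weights(signal_scores: dict) -> dict:
    """Normalize raw per-signal scores into weights that sum to 1."""
    total = sum(signal_scores.values())
    if total == 0:
        n = len(signal_scores)
        return {k: 1 / n for k in signal_scores}
    return {k: v / total for k, v in signal_scores.items()}

def score_trade(signals: dict, weights: dict) -> float:
    """Weighted sum the trade step can threshold into buy vs sell."""
    return sum(weights[k] * signals[k] for k in weights)
```

Because the weights are recomputed every day, a signal that stops being predictive loses influence over the next trade rather than persisting indefinitely.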

c. post_trade_reflection

  1. Evaluates decision quality
  2. Judges whether the trade was well reasoned, not whether it was profitable
  3. Enables iterative improvement for each trade

This discourages repeated mistakes and pushes the model to use the right tools when reasoning about its trades. If it fails, the model's buy/sell reasoning can break down across all trades.
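
A minimal sketch of outcome-independent reflection, assuming hypothetical rationale fields; the real scenario's grading criteria are not shown in the post. The key property is that profit and loss never appear in the checks.

```python
def reflect_on_trade(rationale: dict, tools_called: list) -> dict:
    """Grade the decision *process*, never the P&L."""
    checks = {
        "used_decision_tools": "reason_to_trade" in tools_called,
        "cited_signals": len(rationale.get("signals_used", [])) > 0,
        "stated_thesis": bool(rationale.get("thesis", "").strip()),
    }
    return {"checks": checks, "well_reasoned": all(checks.values())}
```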

Insight:

Scenarios are effectively your agent's operating system. Poor scenario design leads to inconsistent behavior regardless of model quality. Train it based on what it will experience.

Step 4: Evals Are the Real Work

This is the hardest part of the entire training process. Evals aren't about writing prompts or choosing models; they're about designing high-quality tasksets.

Key points:

  • Tasksets cannot be vibecoded
  • They require manual iteration
  • They define what good performance actually means

In this build:

  • Initial reward rate: 12%
  • After eval iteration: 60%
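
The reward-rate numbers above are just a pass fraction over the taskset. A trivial sketch, assuming each run records a boolean `passed` from its rubric:

```python
def reward_rate(task_results: list) -> float:
    """Fraction of taskset runs that pass their rubric, i.e. the
    metric that moved from 12% to 60% during eval iteration."""
    if not task_results:
        return 0.0
    return sum(1 for r in task_results if r["passed"]) / len(task_results)
```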

HUD's advantage here:

  • Full log visibility
  • Clear failure attribution — is it an agent issue, rubric issue, or task design issue?

This feedback loop is what drives improvement. HUD helps with exactly that.

Step 5: Train Only After Evals Plateau

Training too early is a mistake. Only train when:

  • Performance stabilizes
  • Improvements from eval iteration slow down
  • You reach a consistent reward rate that passes tests
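
One way to operationalize "train only after evals plateau" is to treat the reward rate across eval iterations as a series and check whether recent iterations have stopped moving. The window and threshold below are arbitrary choices of mine, not values from the post.

```python
def evals_plateaued(reward_history: list, window: int = 3,
                    min_gain: float = 0.02) -> bool:
    """True when the gain over the last `window` eval iterations
    has fallen below `min_gain`, i.e. iteration has stopped paying off."""
    if len(reward_history) < window + 1:
        return False
    return reward_history[-1] - reward_history[-1 - window] < min_gain
```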

Training setup:

  1. Select model
  2. Attach taskset and validation set
  3. Define epochs
  4. Run

Result:

  • Training time: ~1 hour 40 minutes
  • Reward rate: 76%
  • Improvement: ~16 percentage points over the eval stage

At this point, the system is production-ready.

Step 6: Deployment

Final steps:

  1. Call the trained model via API
  2. Integrate into the simulation loop

From here, the agent operates autonomously inside Analyst Arena.
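
A sketch of the deployment loop's model call. HUD's actual inference API isn't shown in the post, so the payload and response shapes below are assumptions; `post` stands in for whatever HTTP client wraps the hosted endpoint.

```python
def call_trained_model(observation: dict, post) -> str:
    """Send the day's observation to the trained model and parse
    the returned action, rejecting anything outside buy/sell."""
    payload = {"observation": observation,
               "allowed_actions": ["buy", "sell"]}
    response = post(payload)
    action = response.get("action")
    if action not in ("buy", "sell"):
        raise ValueError(f"model returned invalid action: {action!r}")
    return action
```

Injecting `post` as a parameter keeps the simulation loop testable with a fake client before pointing it at the real endpoint.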

Results

Originally, the benchmark was GPT-4o, but the system was ultimately tested against a stronger model (5.2 class).

Outcome:

  • HUD-trained agent outperformed baseline
  • Profit difference: ~$4,000 over 10 days

In a constrained environment, this is a meaningful edge.

Key Takeaways

  1. Training is more important than model choice. Upgrading models gives marginal gains. Improving training pipelines gives major gains.
  2. Tools define intelligence. Agents don't think better. They operate better within structured interfaces.
  3. Scenarios shape behavior. Your agent is the sum of its tools, its scenarios, and its evals.
  4. Tasksets are the moat. Anyone can call an API. Very few can design high-quality evaluation systems.

Final Thoughts

If you're building agents for real-world outcomes: stop optimizing prompts. Start designing systems.

HUD makes this process accessible, but the leverage comes from your thought process — specifically:

  • Structure
  • Evaluation
  • Iteration

That's how you train an agent to outperform GPT 5.2.