How I Built a Trading Agent That Outperformed GPT Using HUD
If you want to build agent systems that work in real-world scenarios, the difference between a viable trading system and a toy demo comes down to one thing: training infrastructure, not the model.
In this post, I'll break down exactly how I built Analyst Arena, an agent-vs-agent trading simulator — and more specifically, how the agent I built and trained on HUD outperformed GPT 5.2.
System Overview
Analyst Arena is a constrained simulation:
- Each agent starts with $100K
- Time horizon: 10 trading days
- Actions: Buy or Sell only
- Objective: maximize portfolio value
This constraint is important. It forces the agent to make decisions under limited time, handle uncertainty, and avoid overfitting to long-term signals.
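The constraints above are small enough to pin down in code. Here is a minimal sketch of how they might be encoded; the names (`ArenaConfig` and its fields) are my own illustration, not part of Analyst Arena:

```python
from dataclasses import dataclass


# Hypothetical encoding of the simulation constraints described above.
@dataclass(frozen=True)
class ArenaConfig:
    starting_cash: float = 100_000.0        # each agent starts with $100K
    horizon_days: int = 10                  # 10 trading days
    actions: tuple = ("buy", "sell")        # buy or sell only

    def is_valid_action(self, action: str) -> bool:
        return action in self.actions


cfg = ArenaConfig()
```

Freezing the config makes the constraint explicit: the environment, not the agent, decides what actions exist.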
Step 1: Start With Tools, Not Prompts
The biggest mistake in agent design is starting with prompts. Instead, define: what capabilities does the agent need to make decisions?
Data Layer:
- Company and ticker data
- Financial statements
- Earnings reports
- Real-time news
These form the decision surface.
Decision Tools:
- reason_to_trade
- compute_portfolio_snapshot
- summarize_position
These are critical. They don't just provide data — they structure reasoning, force consistency, and reduce hallucination risk.
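To make "structure reasoning" concrete, here is a hypothetical sketch of the three decision tools. The real signatures in Analyst Arena may differ; the point is that each tool returns structured output rather than free text:

```python
def compute_portfolio_snapshot(cash: float, positions: dict, prices: dict) -> dict:
    """Current cash, holdings, and total marked-to-market value."""
    equity = sum(qty * prices[t] for t, qty in positions.items())
    return {"cash": cash, "equity": equity, "total": cash + equity}


def summarize_position(ticker: str, qty: float, avg_cost: float, price: float) -> dict:
    """Unrealized P&L for a single holding."""
    return {"ticker": ticker, "qty": qty, "unrealized_pnl": qty * (price - avg_cost)}


def reason_to_trade(ticker: str, action: str, rationale: str) -> dict:
    """Force the agent to state a rationale before an order is accepted."""
    assert action in ("buy", "sell"), "only buy/sell allowed"
    return {"ticker": ticker, "action": action, "rationale": rationale}
```

Because every trade must pass through `reason_to_trade`, the agent cannot act without producing a rationale the evals can later inspect.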
Insight:
Your agent is only as good as its tool interface. Most performance gains come from here, not model upgrades.
Step 2: Expect Multiple Pivots
Your first toolset will be wrong 99% of the time. In my case, the system pivoted multiple times and tool definitions changed repeatedly. That's the process of building a strong agent.
Each iteration should answer:
- What decisions is the agent failing at?
- What information is missing?
- What structure is unclear?
From there, keep revising and extending the agent's toolset.
Step 3: Define Scenarios
Once tools are stable, define scenarios. Scenarios are structured environments where the agent operates. The final system defined six scenarios, three of which were used:
a. trade_decision_step
- Runs daily for 10 days
- Pulls data
- Uses weightings to reason a decision
- Executes buy or sell
This is the highest-leverage component. If this fails, the agent loses money, errors compound quickly, and the agent's main purpose is gone.
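The daily loop can be sketched as follows. This is a simplified illustration, not the actual implementation: `run_simulation` and the score-then-decide rule are assumptions standing in for the real data and execution layers:

```python
def trade_decision_step(day: int, weights: dict, signals: dict) -> str:
    """Combine weighted daily signals into a single buy/sell decision."""
    score = sum(weights.get(name, 0.0) * value for name, value in signals.items())
    return "buy" if score > 0 else "sell"


def run_simulation(horizon_days: int, weights: dict, daily_signals: list) -> list:
    """Run one decision per trading day, as in the 10-day horizon above."""
    decisions = []
    for day in range(horizon_days):
        decisions.append(trade_decision_step(day, weights, daily_signals[day]))
    return decisions
```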
b. factor_weight_ranking
- Determines which signals matter most
- Introduces internal prioritization
- Processes new weightings every day
This is how the model determines which data to use for every trade. If it fails, the agent makes unreasoned trades and overweights noisy data, leading to bad buys.
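One way to picture this step: however the agent scores each factor's importance, those scores get normalized into per-signal weights that the daily decision step consumes. A minimal sketch, with `rank_factor_weights` as a hypothetical name:

```python
def rank_factor_weights(importance: dict) -> dict:
    """Turn raw factor-importance scores into normalized weights summing to 1."""
    total = sum(importance.values())
    if total == 0:
        # Fall back to equal weighting if every factor scored zero.
        n = len(importance)
        return {name: 1.0 / n for name in importance}
    return {name: score / total for name, score in importance.items()}
```

Re-running this every day is what lets the prioritization shift as new earnings or news arrive.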
c. post_trade_reflection
- Evaluates decision quality
- Judges whether it was a well-reasoned trade — not based on outcome
- Enables iterative improvement for each trade
This discourages repeated mistakes and pushes the model to use the right tools when reasoning about its trades. If it fails, the model's buy/sell reasoning can break down across all trades.
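The key design choice here is that reflection grades process, not outcome. A hypothetical sketch of such a reflection record (the field names and criteria are my own illustration):

```python
def reflect_on_trade(trade: dict, used_required_tools: bool, cited_data: bool) -> dict:
    """Judge a trade on its reasoning process, deliberately ignoring profit."""
    well_reasoned = used_required_tools and cited_data
    return {
        "ticker": trade["ticker"],
        "action": trade["action"],
        "well_reasoned": well_reasoned,
        "lesson": None if well_reasoned else "re-run data tools before trading",
    }
```

A losing trade with sound reasoning passes; a lucky trade with no rationale fails. That keeps the feedback signal aligned with behavior the agent can actually control.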
Insight:
Scenarios are effectively your agent's operating system. Poor scenario design leads to inconsistent behavior regardless of model quality. Train it based on what it will experience.
Step 4: Evals Are the Real Work
This is the hardest part of the entire training process. Evals aren't about writing prompts or choosing models; they're about designing high-quality tasksets.
Key points:
- Tasksets cannot be vibecoded
- They require manual iteration
- They define what good performance actually means
In this build:
- Initial reward rate: 12%
- After eval iteration: 60%
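For clarity on what those percentages mean, here is one plausible way a reward rate is computed: the fraction of taskset runs whose reward clears a pass threshold. The threshold and scoring scheme are assumptions, not HUD's actual definition:

```python
def reward_rate(rewards: list, threshold: float = 1.0) -> float:
    """Fraction of eval runs whose reward meets the pass threshold."""
    if not rewards:
        return 0.0
    return sum(r >= threshold for r in rewards) / len(rewards)
```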
HUD's advantage here:
- Full log visibility
- Clear failure attribution — is it an agent issue, rubric issue, or task design issue?
This feedback loop is what drives improvement. HUD helps with exactly that.
Step 5: Train Only After Evals Plateau
Training too early is a mistake. Only train when:
- Performance stabilizes
- Improvements from eval iteration slow down
- You reach a consistent reward rate that passes tests
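The "train only after evals plateau" rule can be made mechanical. A minimal sketch of a plateau check, with the window size and minimum-gain cutoff as illustrative assumptions:

```python
def has_plateaued(history: list, window: int = 3, min_gain: float = 0.02) -> bool:
    """True if the last `window` eval iterations improved by less than min_gain total."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return (recent[-1] - recent[0]) < min_gain
```

In this build the reward-rate history went roughly 12% to 60% across eval iterations; once successive iterations stop moving that number, training is the next lever.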
Training setup:
- Select model
- Attach taskset and validation set
- Define epochs
- Run
Result:
- Training time: ~1 hour 40 minutes
- Reward rate: 76%
- Improvement: ~16 percentage points over the eval stage
At this point, the system is production-ready.
Step 6: Deployment
Final steps:
- Call the trained model via API
- Integrate into the simulation loop
From here, the agent operates autonomously inside Analyst Arena.
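The integration loop itself is simple once the model is behind an API. In this sketch, `call_trained_model` and `get_daily_context` are stand-ins for the real inference endpoint and data layer; the actual HUD call will differ:

```python
def run_deployed_agent(call_trained_model, horizon_days: int, get_daily_context) -> list:
    """Drive the trained model through the simulation, one decision per day."""
    decisions = []
    for day in range(horizon_days):
        context = get_daily_context(day)        # data layer for this day
        decision = call_trained_model(context)  # remote inference call
        decisions.append(decision)
    return decisions
```

Keeping the loop this thin means the trained model, not the harness, carries all the decision logic.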
Results
The original benchmark was GPT-4o, but the system was ultimately tested against a stronger model (GPT 5.2 class).
Outcome:
- HUD-trained agent outperformed baseline
- Profit difference: ~$4,000 over 10 days (roughly a 4% edge on the $100K starting capital)
In a constrained environment, this is a meaningful edge.
Key Takeaways
- Training is more important than model choice. Upgrading models gives marginal gains. Improving training pipelines gives major gains.
- Tools define intelligence. Agents don't think better. They operate better within structured interfaces.
- Scenarios shape behavior. Your agent is the sum of its tools, its scenarios, and its evals.
- Tasksets are the moat. Anyone can call an API. Very few can design high-quality evaluation systems.
Final Thoughts
If you're building agents for real-world outcomes: stop optimizing prompts. Start designing systems.
HUD makes this process accessible, but the leverage comes from your thought process — specifically:
- Structure
- Evaluation
- Iteration
That's how you train an agent to outperform GPT 5.2.