The problem: when you have multiple LLM agents working together and something fails, which agent is responsible? Traditional RL gives you one reward at the end, so all agents share the blame equally.
Our approach: an external LLM (we used Gemini) watches each agent's actions and tool outputs, then assigns per-action scores. When agent 3 crashes because agent 1 forgot to save a file, the coach traces back through the tool outputs and blames agent 1, not agent 3.
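To make that concrete, here is a minimal sketch of what a per-action scoring pass could look like. The names (`Step`, `score_trajectory`, `call_coach`) and the JSON score format are illustrative assumptions, not the framework's actual API; the coach call is whatever LLM endpoint you have.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    agent: str          # which agent acted
    action: str         # the action / message it produced
    tool_output: str    # what the tool or environment returned

def score_trajectory(steps: list[Step], call_coach: Callable[[str], str]) -> list[float]:
    """Ask an external coach LLM to assign a credit score in [-1, 1] to every step."""
    transcript = "\n".join(
        f"[{i}] agent={s.agent}\naction: {s.action}\ntool output: {s.tool_output}"
        for i, s in enumerate(steps)
    )
    prompt = (
        "You are a coach reviewing a multi-agent trajectory. For each numbered step, "
        "decide how much that step helped or hurt the final outcome, tracing failures "
        "back to their root cause. Reply with a JSON list of floats in [-1, 1], one per step.\n\n"
        + transcript
    )
    reply = call_coach(prompt)          # e.g. a Gemini API call, only used during training
    scores = json.loads(reply)
    assert len(scores) == len(steps)
    return [float(s) for s in scores]
```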
This gives you dense training signal without needing ground truth labels. The coach provides the supervision.
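One way (among several) to turn those scores into a dense signal is to treat them as per-step shaped rewards and blend them with the sparse end-of-episode outcome before computing returns. The weighting and discounting below are illustrative assumptions, not the paper's exact recipe:

```python
def per_step_returns(coach_scores: list[float], outcome: float,
                     coach_weight: float = 0.5, gamma: float = 1.0) -> list[float]:
    """Blend coach scores with the final outcome reward, then compute discounted returns.

    Each step gets its own shaped reward, so an agent's policy-gradient update no
    longer depends only on the single terminal reward shared by the whole team.
    """
    rewards = [coach_weight * s for s in coach_scores]
    rewards[-1] += outcome                      # the real task outcome still counts
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```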
Practical angle: the coach API is only called during training; afterward you have a team of local models that runs entirely offline. We tested with Qwen and LLaMA base models.
Results: +17 percentage points on the AIME math competition, +38% F1 on Kaggle-style data science tasks.
Hardware requirement is 2-8x 80GB GPUs depending on model size. Code is MIT licensed.
The framework is general - plug in your own agents, your own task, your own coach model.
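The exact interfaces depend on the codebase, but the extension points boil down to something like the sketch below. All names here are hypothetical stand-ins for the real plug-in surface:

```python
from typing import Protocol

class Agent(Protocol):
    def act(self, observation: str) -> str: ...                   # your local policy model

class Task(Protocol):
    def reset(self) -> str: ...                                    # initial observation
    def step(self, action: str) -> tuple[str, bool, float]: ...    # (tool output, done, outcome)

class Coach(Protocol):
    def __call__(self, prompt: str) -> str: ...                    # any LLM endpoint used for scoring

# A custom setup is then just three objects, e.g. (hypothetical):
# my_run = (my_agents, MyKaggleTask(), gemini_coach)
```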