PPO is so Back!

Recently, I was given a sizeable budget (imo) and the mission to train a small open source model on an internal environment, with the hope of showing general gains on public benchmarks. Naturally, I'm jumping for joy. Those who know me know of my lifelong quest to find funded compute. Without revealing too much, here are some of the things that went through my head when I got the news, besides happiness obviously. For PIAA, I also won't share too much on model selection, tasks, etc. Better safe than sorry.

Obviously the number one thing on my mind is the compute provider. I debated between Modal and Lambda, but ultimately I'm going with Modal: I'm more familiar with them, I want to run one stack, and part of my experiment is running large-scale baselines of the base model on various benchmarks, which I can nicely parallelize at the CPU level. Big bill incoming!

Ok, now to the things that really matter. The environment is of the long-horizon tool-calling sort. I came of age in a GRPO world. When I started doing research around a year ago, if you wanted better performance (generalization, consistency, whatever really) you pulled out ole reliable: GRPO + rubrics. Then, on the day I got the go-ahead to plan a training experiment, GLM released their 5.2 blogpost. Had I not read that, and the reactions from the research community at large (especially Prime Intellect), I'd probably be sitting here writing about how I'm working on a complex composite reward function. But nope, seems PPO is back. And I have compute now, so I'm really not mad about it fr. I've felt for a while that GRPO is not good for long-context, multi-turn environment tasks. If GLM's telling me that PPO negates GRPO's length bias and token distribution problems, I'm buying it. I should probably reread the PPO paper tbh. You know how people used to say that GRPO was PPO without the critic? I'm like "oh, PPO is just GRPO with a learned value estimator."
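To make the "GRPO is PPO without the critic" line concrete, here is a minimal sketch of the group-relative advantage as I understand it: sample several completions for one prompt, score each with a scalar reward, and normalize within the group. The function names, the epsilon, and the choice of population standard deviation are my own assumptions for illustration, not anything taken from a specific library or from my actual setup.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for completions sampled from ONE prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation. No learned critic is involved.
    eps and the use of population std are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Every token in a completion receives the same sequence-level advantage."""
    return [advantage] * num_tokens


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # made-up scores for four completions
    advs = grpo_advantages(rewards)
    print(advs)
    print(broadcast_to_tokens(advs[0], 3))
```

The second helper is the part that bothers me for long-horizon work: in this sketch, the one decisive tool call and every filler token around it in the same completion get identical credit, whereas a critic can in principle give a per-token signal.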

I'm very excited to train on a full node of 8xH200s. I'm counting on models being significantly better at debugging optimizer/tensor/expert sharding than they were in early 2025. Debugging DDP in Cursor with Sonnet-3.7 was my personal hell. I think my compute budget alone is gonna save me from so many problems. Also, Opus 4.8 has convinced me that the critic need not be the same size as the actor. As I'm writing this, I'm diving into "Natural Language Actor-Critic" and "Asymmetric PPO: mini-critics boost LLM reasoning". You are actually a part of my process in real time. I think this validates pairing a smaller critic with a larger actor (although no, I will not be using a set of mini-critics each trained on disjoint prompt shards, no matter how efficient that may be).

Now that I'm getting to think about it, I am very excited to do asymmetric PPO and benchmark it against GRPO, which I believe my handlers are interested in. I think there is a misconception that GRPO will get us the results we want faster. My feeling is that it will give us noisier results and slower convergence. Also entropy collapse: we don't want that, but we all know it'll happen.

It's also been really interesting to dive into MoEs. I've only ever worked on dense models, so I'm excited to get into slime and learn more about things like expert sharding (like, it makes sense, but it's also crazy that someone said "hey, let's do TP and also EP").

I hope to get some interesting results, and I plan to update the experiment blog frequently, both for this and for the work I'm doing with Terry J.C. Zhang. I have a sense that my intuition will sharpen a ton and that many of the decisions will begin to make absolute sense. I was revisiting GAE to understand how it improves credit assignment on multi-turn tasks, especially those with compaction or multiple actions per turn, and the discounted backup just made so much sense this time around. I remember working through the ARENA workbooks a couple of months ago, when my world was all GRPO, and yeah, it made no sense.
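Since GAE is the thing that finally clicked, here is a minimal sketch of the backward recursion for a single trajectory, assuming per-step rewards and critic value estimates are already in hand. The function name and the defaults for gamma and lam are illustrative assumptions on my part, not settings from my actual run.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] : reward received after step t
    values[t]  : critic's value estimate at step t
    last_value : bootstrap value after the final step (0.0 if terminal)
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better this step went than the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted backup: blend in future TD errors, decayed by gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    # Made-up sparse-reward episode: only the last step pays out.
    print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
```

With lam=1 this reduces to the discounted return minus the value estimate, and with lam=0 it is just the one-step TD error. Everything in between trades bias against variance, which is the knob GRPO doesn't have.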

Looking forward to sharing more of my learnings!