Research Engineer Intern
Sep – Dec 2025 · The LLM Data Company (YC X25)
- Co-authored evaluation rubrics (~40 criteria per task, 100 tasks across 10 domains) for Perplexity's DRACO Benchmark, an open-source benchmark for frontier deep-research agents, now used to score systems from Perplexity, Google DeepMind, and OpenAI.
- Built long-horizon, tool-use, and computer-use evaluation environments for benchmarking frontier models (GPT-4/5, Claude Sonnet/Opus 4.5, Gemini 3 Pro); designed and reviewed hundreds of complex rubrics across non-verifiable and document-grounded domains, including medicine, finance, and law; and contributed technical scoping for evaluation proposals to external labs.
- Designed and implemented an alternative architecture to GEPA for reflective prompt optimization in non-verifiable domains: each round generates N candidate prompts, which a reflection node condenses into accumulated insights carried forward to later rounds (sketched after this list). Integrated the optimizer into an internal pipeline for synthetic rubric-generation experiments.
- Extended infrastructure for synthetic-rubric evaluation experiments, aligning rubrics with task difficulty and calibrating criteria against high-, medium-, and low-quality model outputs (see the calibration sketch below).
- Conducted exploratory Search-R1-style training experiments with veRL on Modal to post-train Qwen2.5-0.5B for a tool-use agent environment.
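
A minimal sketch of the reflective prompt-optimization loop from the GEPA-alternative bullet above, assuming hypothetical `llm` and `judge` callables (the internal pipeline's actual interfaces are not shown): N candidates per round are scored, and a reflection node condenses the round into one insight that is carried forward instead of keeping N branches alive.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizerState:
    prompt: str                                         # current best prompt
    score: float                                        # its judge score
    insights: list[str] = field(default_factory=list)   # accumulated reflections

def optimize(seed_prompt, tasks, llm, judge, n_candidates=4, rounds=10):
    """llm: str -> str completion; judge: (prompt, tasks) -> float.
    Both are hypothetical stand-ins for the internal pipeline's components."""
    state = OptimizerState(prompt=seed_prompt, score=judge(seed_prompt, tasks))
    for _ in range(rounds):
        # Propose N mutated candidates, conditioned on the insights so far.
        candidates = [
            llm(f"Improve this prompt.\nPrompt:\n{state.prompt}\n"
                "Known insights:\n- " + "\n- ".join(state.insights))
            for _ in range(n_candidates)
        ]
        scored = [(judge(c, tasks), c) for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        # Reflection node: condense what the whole round taught us into a
        # single transferable insight, carried forward to later rounds.
        insight = llm(
            "Compare these candidate prompts and their scores; state one "
            f"transferable lesson.\n{scored}"
        )
        state.insights.append(insight)
        if best_score > state.score:
            state.prompt, state.score = best, best_score
    return state
```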
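A hedged sketch of the calibration check from the synthetic-rubric bullet: a well-calibrated criterion should pass more often as output quality rises. `judge_passes` and the tier names are illustrative assumptions, not the internal pipeline's API.

```python
def calibration_report(criteria, outputs_by_tier, judge_passes):
    """outputs_by_tier: {"low": [...], "medium": [...], "high": [...]}.
    judge_passes(criterion, output) -> bool is a hypothetical LLM-judge call."""
    report = {}
    for criterion in criteria:
        rates = {
            tier: sum(judge_passes(criterion, out) for out in outs) / len(outs)
            for tier, outs in outputs_by_tier.items()
        }
        # Flag criteria whose pass rate is not monotone in output quality:
        # these are either too easy, too hard, or measuring the wrong thing.
        monotone = rates["low"] <= rates["medium"] <= rates["high"]
        report[criterion] = {"rates": rates, "calibrated": monotone}
    return report
```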