I lead the Evals team at Groq, where we're building OpenBench - an open-source standard for running evals easily, reliably, and reproducibly. Before Groq, I was at Nous Research developing synthetic data pipelines for training language models.

When I'm not working on eval infrastructure, I find great joy in reading epic fantasy novels (especially Brandon Sanderson's works) and optimizing little parts of my life with software. The best way to reach me is a DM on X.

OpenBench

Provider-agnostic, open-source evaluation infrastructure for language models. Standardized benchmarking across 20+ evaluation suites.
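To make "provider-agnostic" concrete, here is a minimal sketch of the idea in Python. The names (Provider, run_suite) and the toy suite are hypothetical illustrations, not OpenBench's actual API; the point is only that the benchmark definition stays fixed while the model backend gets swapped out.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    """A minimal model backend: a display name plus a completion function."""
    name: str
    complete: Callable[[str], str]

def run_suite(provider: Provider, suite: list[tuple[str, str]]) -> float:
    """Score a provider on (prompt, expected-substring) pairs; return the pass rate."""
    correct = sum(expected in provider.complete(prompt) for prompt, expected in suite)
    return correct / len(suite)

# The suite definition never changes; only the backend being evaluated does.
suite = [("What is 2 + 2?", "4"), ("Name the capital of France.", "Paris")]

for provider in (
    Provider("echo-baseline", complete=lambda p: p),       # stand-in backend that echoes the prompt
    Provider("always-paris", complete=lambda p: "Paris"),  # stand-in backend with a fixed answer
):
    print(f"{provider.name}: {run_suite(provider, suite):.0%}")
```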

Eris

An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.

Set-Eval

A multimodal benchmark for testing vision capabilities and reasoning in AI models.

The Internship Game

and why you shouldn't play it

There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.

Public Engagements

RAG, Agents, and Latency – Webinar (with Jason Liu)

Talk

Talking about fast RAG and why most agent infra is slow by default

AI Evals for Engineers & PMs – Guest Lecture

Maven Course · Lecture

Teaching engineers how to evaluate AI in the context of reliability and infrastructure

Stanford CS224G – Guest Lecture

Stanford CS224G · Lecture

Guest lecture on evals at Stanford, co-presented with Ben Klieger