I lead evaluation systems at Groq. I'm interested in how we can better understand AI capabilities through systematic evaluation. In my spare time, I build benchmarks that probe what LLMs can and cannot do.
Before Groq, I was at Nous Research developing synthetic data pipelines for training language models.
Projects
Eris
An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.
Set-Eval
A multimodal benchmark for testing vision capabilities and reasoning in AI models.
Blog
The Internship Game
There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.
Public Engagements
RAG, Agents, and Latency – Webinar (with Jason Liu)
Talked about fast RAG and why most agent infrastructure is slow by default
AI Evals for Engineers & PMs – Guest Lecture
Taught engineers and PMs how to evaluate AI systems with a focus on reliability and infrastructure
Stanford CS224G – Guest Lecture
Guest lectured at Stanford about evals with Ben Klieger