I lead evaluation systems at Groq. I'm interested in how we can better understand AI capabilities through systematic evaluation. In my spare time, I build benchmarks for LLMs.

Before Groq, I was at Nous Research developing synthetic data pipelines for training language models.

Projects

Eris

An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.

Set-Eval

A multimodal benchmark for testing vision capabilities and reasoning in AI models.

Blog

The Internship Game

and why you shouldn't play it

There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.

Public Engagements

RAG, Agents, and Latency – Webinar (with Jason Liu)

Talk

A talk on fast RAG and why most agent infrastructure is slow by default

AI Evals for Engineers & PMs – Guest Lecture

Maven Course · Lecture

Teaching engineers how to evaluate AI in the context of reliability and infrastructure

Stanford CS224G – Guest Lecture

Stanford CS224G · Lecture

A guest lecture on AI evals at Stanford, delivered with Ben Klieger