I lead evaluation systems at Groq, where I focus on understanding AI capabilities through systematic evaluation. In my spare time, I build benchmarks that test LLMs.
Before Groq, I was at Nous Research, where I developed synthetic data pipelines for training language models.
Featured Projects
Eris
An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.
Set-Eval
A multimodal benchmark for testing vision and reasoning capabilities in AI models.
Blog
The Internship Game
There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.
Read more →