I lead evaluation systems at Groq. I'm interested in how we can better understand AI capabilities through systematic evaluation. In my spare time, I build benchmarks that probe what LLMs can and can't do.

Before Groq, I was at Nous Research developing synthetic data pipelines for training language models.

Featured Projects

An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.
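At a high level, the setup looks something like the sketch below: two models argue a topic over a few rounds, then a judge model picks the stronger side. The names (`query_model`, `run_debate`), prompts, and judging scheme here are simplified placeholders for illustration, not the framework's actual design.

```python
def query_model(model: str, prompt: str) -> str:
    """Stand-in for a chat-completion call to whatever provider you use."""
    raise NotImplementedError("wire this up to your model client")


def run_debate(topic: str, debater_a: str, debater_b: str,
               judge: str, rounds: int = 2) -> str:
    """Run a short debate between two models and return the judge's verdict."""
    transcript: list[str] = []
    for _ in range(rounds):
        for side, model in (("A", debater_a), ("B", debater_b)):
            prompt = (
                f"Debate topic: {topic}\n"
                "Transcript so far:\n" + "\n".join(transcript) +
                f"\nYou are debater {side}. Give your next argument."
            )
            transcript.append(f"{side}: {query_model(model, prompt)}")
    verdict_prompt = (
        f"Debate topic: {topic}\n" + "\n".join(transcript) +
        "\nAs an impartial judge, which debater argued more convincingly, A or B?"
    )
    return query_model(judge, verdict_prompt)
```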

A multimodal benchmark for testing vision capabilities and reasoning in AI models.

Blog

and why you shouldn't play it
6 min read

There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.

Read more →