I lead evaluation systems at Groq. I'm interested in how we can better understand AI capabilities through systematic evaluation. In my spare time, I build benchmarks that probe what LLMs can and cannot do.
Before Groq, I was at Nous Research developing synthetic data pipelines for training language models.
Projects
Eris
An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.
Set-Eval
A multimodal benchmark for testing vision capabilities and reasoning in AI models.
Blog
The Internship Game
There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.
Public Engagements
RAG, Agents, and Latency – Webinar (with Jason Liu)
Talked about fast RAG and why most agent infrastructure is slow by default
AI Evals for Engineers & PMs – Guest Lecture
Taught engineers and PMs how to evaluate AI systems with a focus on reliability and infrastructure
Stanford CS224G – Guest Lecture
Guest lectured at Stanford about evals with Ben Klieger