I lead the Evals team at Groq, where we're building OpenBench - an open-source standard for running evals easily, reliably, and reproducibly. Before Groq, I was at Nous Research developing synthetic data pipelines for training language models.

When I'm not working on eval infrastructure, I find great joy in reading epic fantasy novels (especially Brandon Sanderson's works) and optimizing little parts of my life with software. The best way to reach me is a DM on X.

OpenBench

Provider-agnostic, open-source evaluation infrastructure for language models. Standardized benchmarking across 20+ evaluation suites.
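To make "provider-agnostic" concrete, here is a minimal sketch of the idea in Python. The names (Provider, run_suite) and the toy suite are hypothetical illustrations, not OpenBench's actual API; the point is only that the benchmark definition stays fixed while the model backend gets swapped out.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    """A minimal model backend: a display name plus a completion function."""
    name: str
    complete: Callable[[str], str]

def run_suite(provider: Provider, suite: list[tuple[str, str]]) -> float:
    """Score a provider on (prompt, expected-substring) pairs; return the pass rate."""
    correct = sum(expected in provider.complete(prompt) for prompt, expected in suite)
    return correct / len(suite)

# The suite definition never changes; only the backend being evaluated does.
suite = [("What is 2 + 2?", "4"), ("Name the capital of France.", "Paris")]

for provider in (
    Provider("echo-baseline", complete=lambda p: p),       # stand-in backend that echoes the prompt
    Provider("always-paris", complete=lambda p: "Paris"),  # stand-in backend with a fixed answer
):
    print(f"{provider.name}: {run_suite(provider, suite):.0%}")
```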

Eris

An evaluation framework using debate simulations to assess AI models' reasoning and communication skills.

Set-Eval

A multimodal benchmark for testing vision capabilities and reasoning in AI models.

The Internship Game

and why you shouldn't play it

There's a curious disconnect in how we talk about tech internships. The conventional wisdom—polish your resume, practice interview questions, network aggressively—isn't wrong, exactly. But it misses something essential.

Public Engagements

RAG, Agents, and Latency – Webinar (with Jason Liu)

Talk

Talking about fast RAG and why most agent infra is slow by default

AI Evals for Engineers & PMs – Guest Lecture

Maven Course · Lecture

Teaching engineers how to evaluate AI in the context of reliability and infrastructure

Stanford CS224G – Guest Lecture

Stanford CS224G · Lecture

Guest lecture on evals at Stanford, co-presented with Ben Klieger