Frontiers in AI Evaluation Conference

I’ve been to a number of these Stanford Causal Science conferences and they’ve never disappointed. The crowd has certainly expanded over the years and has a lot more industry representation than before. Lots of professors on leave pursuing startups - gather ye rosebuds and all. Anyway, on to the talks:

AI’s models of the world and ours - Jon Kleinberg

This was a deep dive into the “drosophila of AI”: chess! Basically he makes the point that chess has been the domain of superhuman AI ever since Deep Blue and maybe we can learn from this as AI starts spreading to more domains. He makes 4 points:

Chess is as popular as ever (though I suspect that has more to do with modern media than anything else). It was rather interesting to me that centaurs (teams of man + machine: an idea I initially encountered in Average is Over) are basically obsolete.
Spectators know as much as the participants. There is a great democratization of knowledge: he had a great line “the two people who know the least about the game are the two playing it” as everyone else has the benefit of spectating with the engine running.
Aesthetics no longer are a proxy for utility. Basically chess started looking uglier and it kept working better. There’s an analogy to code: will there eventually be a point where spaghetti AI slop is actually a sign of quality (as in it’s clearly not human and thus expected to be better)?
The AI sets you up to fail. This is the technical portion of the talk: basically they look into what happens when your laptop runs out of battery and you need to finish out the chess game manually. It turns out that things go really badly: the engine is pursuing a line that’s highly counterintuitive but in its hands highly performant. You, uh, don’t understand any of that and blunder everything away. In some sense though you were set up to fail: the AI could have planned for such an occurrence and not taken that line to begin with. He trains a model to safely hand over positions to MAIA (their chess engine that tries to emulate humans).
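To make the "planning for the handover" idea concrete, here is a minimal sketch of my own, not the method from the talk: pick the move with the best expected value when there is some probability the human has to finish the game alone. The function name, the candidate lines, and all the win-probability numbers are made up for illustration.

```python
def pick_move(candidates, p_handoff):
    """Pick the move with the best expected value under a possible handover.

    candidates: list of (move, engine_value, human_value) tuples, where the
    values are hypothetical win probabilities if the engine keeps playing
    versus if a human has to take over after this move.
    p_handoff: assumed probability that the engine becomes unavailable.
    """
    def expected_value(candidate):
        _, engine_value, human_value = candidate
        return (1 - p_handoff) * engine_value + p_handoff * human_value

    return max(candidates, key=expected_value)[0]


# Illustrative numbers only: a sharp line the engine handles well but a human
# would botch, and a solid line that is slightly worse but easy to play.
candidates = [
    ("sharp line", 0.9, 0.2),
    ("solid line", 0.7, 0.6),
]

print(pick_move(candidates, p_handoff=0.0))  # sharp line
print(pick_move(candidates, p_handoff=0.5))  # solid line
```

With no risk of handover the engine's favorite line wins; once the battery is a coin flip, the human-friendly line does.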

Why we must go beyond post-training for robust AI alignment - Dylan Hadfield-Menell

Basically post-training is brittle. He goes through a few different exercises showing how folks have bypassed post-training: refusal is a single direction, jailbreaking, attacker moves second. So if we’re relying on post-training to keep our LLMs from cheerfully sharing the recipe for anthrax, we’re in pretty bad shape.

So what do we need to do instead? Well, there is pretraining: if we didn’t have the formula for anthrax in the training data that would help. Unfortunately this doesn’t help a bunch: as you add all sorts of innocuous information, you could end up providing the requisite biological and chemical background knowledge to reconstruct it :(.

He doesn’t precisely have a solution for us. But he does have a suggestion for mitigation that harkens back to the old days of cybersecurity. Basically instead of trying to harden a broad attack surface start off by first narrowing the surface and then your task is much simpler.

To achieve this he suggests narrow deployment: train your LLM such that you achieve good performance on in-scope prompts and terrible performance on out-of-scope prompts. Then at the very least you only have to worry about the anthrax problem for your biology LLM and not your kids’ toy LLM.

Scalable Evaluation of Multimodal AI Systems for Creative Optimization - Bahareh Azarnoush

Netflix apparently does a lot with GenAI content generation. The nice thing about working within an existing ecosystem is that you can get some external validity: does an AB test of the output of two models directionally match your eval? Basically you get RLHF for free.
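A rough sketch of what "directionally match" could mean in practice, assuming you have a list of model comparisons with both an offline eval difference and an AB test difference. The function name and every number below are hypothetical, not from the talk.

```python
def directional_agreement(eval_deltas, ab_deltas):
    """Fraction of comparisons where the eval and the AB test agree on the winner.

    Each delta is (model B score - model A score); ties on either side are skipped.
    Returns None if no comparison has a clear winner on both sides.
    """
    pairs = [(e, a) for e, a in zip(eval_deltas, ab_deltas) if e != 0 and a != 0]
    if not pairs:
        return None
    agreements = sum((e > 0) == (a > 0) for e, a in pairs)
    return agreements / len(pairs)


# Made-up deltas for four model comparisons.
eval_deltas = [0.03, -0.01, 0.05, 0.02]
ab_deltas = [0.4, -0.2, 0.1, -0.3]

print(directional_agreement(eval_deltas, ab_deltas))  # 0.75
```

Sign agreement is the crudest possible check: it ignores effect sizes and AB test noise, but it is a quick way to see whether the eval is pointing the same way as real users.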

Evaluation under Pressure: Lessons from Deploying Clinical AI at Scale - Zachary Lipton

He talks about his medical transcription startup and all sorts of eval troubles that show up in practice. And it’s rather important: he points out that med tech companies often fail when their tech, which performed well in the lab, doesn’t work out in the wild.

The obvious problem of course is distribution shift: things like wrist-based heart rate monitors failing to work on darker skin. But then generative AI gets you into even weirder cases. Initially they were evaluating summaries of transcripts: all well and good, you can have a human create a summary as well and pick which is best. But then they started incorporating the entire medical history: now you can’t scale that for a human version! So you just kludge something together. This is where he introduced me to the dichotomy of neats and scruffies, based on your affinity for theory-based vs kludgy approaches to AI research. I certainly feel like much more of a “neat” myself, though most AI folks these days seem to be “scruffies”.

Indeed, if you aren’t a scruffie your company is unlikely to survive. There is a tension between moving fast and being rigorous in your evals. There’s a bit of a prisoner’s dilemma: it would be best if you could eval and know you’re hill climbing on performance. But changes are probably making things better, and another startup that skimps on their evals could steal market share from you. So you’d better skimp on your eval too. This works up until the point where you’re no longer making obviously good changes (or when you never had good performance in the first place). The only solution I can see is some external pressure on evaluation, which in medical fields seems obviously desirable albeit currently déclassé.

Computer Assisted Learning in the Real World - Emma Brunskill

Basically can we figure out what the effect sizes of AI in the classroom are? Much like the medical field, this seems like somewhere where we should be rigorous in our evaluations instead of moving fast and breaking kids. This was a great talk especially after listening to a podcast hyping Alpha School.

Admittedly, the rigor is far more depressing than the hype. Effect sizes in education are tiny. Quite plausibly, your best bet for improving education is removing the obvious hindrances to learning: absenteeism, behavior issues, truly bad teachers. I am particularly interested in the top end of education where those low-hanging fruit have long since been picked. It’s plausible there are effect sizes here, though I’d really like to see rigorous evaluations.

The Benchmark Problem - Benjamin Recht

This talk was particularly provocative in a good way.

He gets to something I have struggled with when understanding these benchmarks: why do we particularly care if a model scores 80% or something? It’s not a math test. I can make a benchmark score 99% consisting of 99 copies of “what’s 1+1” and the last being “invent a time machine”. Recht makes the point that average performance doesn’t mean anything when there isn’t a meaningful generating distribution you care about.
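The arithmetic behind that padded benchmark is simple enough to spell out. This is just my toy example from above in code form; the variable names are mine.

```python
def accuracy(results):
    """Plain average: the fraction of questions answered correctly."""
    return sum(results) / len(results)


# 99 copies of "what's 1+1" (all solved) plus one "invent a time machine" (unsolved).
padded_benchmark = [True] * 99 + [False]

# The same two distinct questions, counted once each.
deduplicated_benchmark = [True, False]

print(accuracy(padded_benchmark))        # 0.99
print(accuracy(deduplicated_benchmark))  # 0.5
```

Same model, same two questions, and the headline number moves from 99% to 50% depending only on how many times the easy question is repeated.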

So we really need to be thinking about these benchmarks differently. As always, let’s look to 20th century statisticians for solutions to modern problems. Recht points to item response theory which was developed in the psychometrics literature. You can basically think of it as:

\[ P(\text{success}) = \operatorname{logit}^{-1}(\alpha_{i} - \delta_{j}) = \frac{1}{1 + e^{-(\alpha_{i} - \delta_{j})}} \]

where \(\alpha\) stands for /a/bility and indexes the models and \(\delta\) stands for /d/ifficulty and indexes the questions. The nice property is that testing then becomes a measurement: with some assumptions you can measure abilities independently of the data distribution!
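For my own understanding, here is a minimal sketch of fitting this one-parameter model (usually called the Rasch model) by plain gradient ascent on the log-likelihood. The function names, the learning rate, and the toy response matrix are all my assumptions; real psychometrics packages use more careful estimators.

```python
import math


def p_success(ability, difficulty):
    """Success probability: the logistic function of (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_rasch(responses, steps=2000, lr=0.1):
    """Jointly estimate abilities and difficulties by gradient ascent.

    responses[i][j] is 1 if model i solved question j, else 0.
    """
    n_models, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for i in range(n_models):
            for j in range(n_items):
                residual = responses[i][j] - p_success(ability[i], difficulty[j])
                grad_a[i] += residual   # d(log-likelihood)/d(ability_i)
                grad_d[j] -= residual   # d(log-likelihood)/d(difficulty_j)
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
        # Only the differences (ability - difficulty) are identified,
        # so pin the mean difficulty to zero.
        shift = sum(difficulty) / n_items
        ability = [a - shift for a in ability]
        difficulty = [d - shift for d in difficulty]
    return ability, difficulty


# Toy data: 4 models x 5 questions, chosen so no row or column is all 0s or all 1s.
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
ability, difficulty = fit_rasch(responses)
print([round(a, 2) for a in ability])
print([round(d, 2) for d in difficulty])
```

The toy matrix avoids perfect rows and columns on purpose: a model that solves everything (or a question nobody solves) pushes its estimate towards infinity under this kind of fit.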

I struggle to see how this doesn’t just devolve into the model that gets the highest score on the benchmark is the best. Though it does suggest we should be striving for quality instead of quantity in the benchmark, which seems like a win. I want to do a deeper dive into this topic to understand better: this is exactly why I love going to these things.
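A quick derivation that backs up this worry, assuming the plain one-parameter form and that every model answers the same set of questions. Write \(x_{ij} \in \{0, 1\}\) for whether model \(i\) solved question \(j\) and \(s_i = \sum_j x_{ij}\) for its raw score (both symbols are mine, not from the talk). The log-likelihood for model \(i\) is

\[ \ell_i = \sum_j \left[ x_{ij}(\alpha_i - \delta_j) - \log\left(1 + e^{\alpha_i - \delta_j}\right) \right] = \alpha_i s_i - \sum_j x_{ij}\delta_j - \sum_j \log\left(1 + e^{\alpha_i - \delta_j}\right) \]

and setting the derivative with respect to \(\alpha_i\) to zero gives

\[ s_i = \sum_j \frac{1}{1 + e^{-(\alpha_i - \delta_j)}} \]

The right-hand side is strictly increasing in \(\alpha_i\), so under these assumptions a higher raw score always means a higher estimated ability: the ranking is the leaderboard ranking. As far as I can tell, the two can only come apart when models see different subsets of questions, or in richer variants of the model (for example with a per-question discrimination weight).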

He also uses the xkcd plot package for hypothetical data which I’m definitely stealing.

Benchmarks to advance the AI Frontier - Ofir Press

This is one of the folks behind SWE Bench! He is in the benchmark creation business and business is booming. As benchmarks get saturated (not enough \(\delta\)!), he keeps making new ones. Indeed, he basically framed his goal as providing sufficiently hard problems to guide ML labs towards prioritizing particular domains (since they want benchmark results for PR purposes).

I found him to be quite thoughtful on the validation side. In particular, he works towards finding things that are machine-verifiable and most of the work of creating a new benchmark is figuring out a good way of deterministically verifying it. He demurs on LLMs-as-judges as they’re not robust or accurate enough (for refusal the judge deemed empty responses as “failing to refuse” as it wanted an explicit string like “as a large language model I cannot…”). Feeling pretty good about the industry after listening to this talk.
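As a toy illustration of what "deterministically verifying" can look like (my own sketch, not his tooling): score a candidate solution by running it against fixed test cases, so the verdict never depends on a judge's mood. The function names and test cases are hypothetical.

```python
def verify(candidate, test_cases):
    """Return True only if the candidate reproduces every expected output.

    test_cases: list of (args, expected) pairs. Any exception counts as a failure.
    """
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Hypothetical task: "write a function that adds two integers".
test_cases = [((1, 1), 2), ((2, 3), 5), ((-4, 4), 0)]


def good_solution(a, b):
    return a + b


def bad_solution(a, b):
    return a * b


print(verify(good_solution, test_cases))  # True
print(verify(bad_solution, test_cases))   # False
```

The hard part, of course, is finding tasks where a check like this exists at all.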

Keeping up with Capabilities - David Rein

This is one of the folks behind the “LLMs can do 12-hour tasks but not 13-hour tasks” chart at METR. This one was significantly less coherent than the previous talk on benchmark creation: a little more optimized for headlines rather than pushing forward the field. Feeling pretty bad about the industry after listening to this talk.

Towards Self Driving Software Reliability - Anish Agarwal

This was an advertisement for an LLM-as-SRE startup. It does seem like a hard problem to crack: there’s a firehose of data generated by the proliferation of observability. During an incident SREs need to quickly identify root causes and mitigate them. He makes the observation that this is an interesting application area for causal inference: which yes, but also sounds way too hard. Bonne chance!

Trust the coding agent: evals, optimization, security - Anupam Datta

Basically, throw automated prompt optimization and a bunch of LLM judges at the problem of improving performance. I’ll be interested to see how this approach holds up over time. I am highly skeptical of prompt optimization: it seems like throwing darts until one eventually hits (and only works for your particular model version).
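Here is a small simulation of the dart-throwing worry, under a deliberately pessimistic assumption of mine: every candidate prompt has exactly the same true accuracy, and the optimizer just picks whichever one scored best on a small dev set. All names and numbers are hypothetical; this says nothing about how any real prompt optimizer behaves.

```python
import random


def noisy_score(true_accuracy, n_examples, rng):
    """Measured accuracy on n_examples independent pass/fail examples."""
    passes = sum(rng.random() < true_accuracy for _ in range(n_examples))
    return passes / n_examples


def pick_best_prompt(n_prompts=50, true_accuracy=0.6, n_dev=50, n_holdout=5000, seed=0):
    """Select the best-looking prompt on a small dev set, then re-test it.

    Every prompt shares the same true accuracy, so any gap between the
    dev score of the winner and its holdout score is pure selection noise.
    """
    rng = random.Random(seed)
    dev_scores = [noisy_score(true_accuracy, n_dev, rng) for _ in range(n_prompts)]
    best_dev = max(dev_scores)
    holdout = noisy_score(true_accuracy, n_holdout, rng)
    return best_dev, holdout


best_dev, holdout = pick_best_prompt()
print(f"winner on dev set: {best_dev:.2f}, same prompt on holdout: {holdout:.2f}")
```

If the optimizer's "win" shrinks back towards the baseline on fresh data like this, the dart hit the board by luck rather than skill.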