A coworker gave us a presentation on his LLM coursework. I don’t follow this literature, so it was enlightening. One particular gem was on test-time compute, which seems to basically mean “generate N solutions and hope one of them works”. The paper (arxiv) makes the obvious allusion: Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.
They make a couple of claims:
- The log probability of success with repeated sampling is polynomial in the number of trials
- Majority voting doesn’t scale in the same way
Reading the paper myself, I wasn’t surprised to find that modern LLM folks seem completely divorced from statistical thinking. If you think about it, it’s pretty obvious that you’re just dealing with Bernoulli trials here: your repeated samples are iid and each outcome is binary.
Thus the second claim is not that surprising. Majority voting effectively turns your estimator into an indicator for \(p > 0.5\), and the law of large numbers gets you there quickly. This is just dumb if you think your problem is hard: when \(p < 0.5\), more samples only make the majority more confidently wrong. Though it does help with things like hallucinations, if you think those are rare.
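A quick simulation makes the point (a minimal sketch in the binary framing above; the values of \(p\) and the sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote_success(p, n_samples, n_trials=100_000):
    # Number of correct answers among n_samples iid Bernoulli(p) attempts,
    # checked for a strict majority, averaged over many simulated problems.
    correct = rng.binomial(n_samples, p, size=n_trials)
    return (correct > n_samples / 2).mean()

for p in (0.4, 0.6):
    print(p, [round(majority_vote_success(p, n), 3) for n in (1, 11, 101, 1001)])
```

For \(p = 0.6\) the success rate climbs to 1 as you add samples; for \(p = 0.4\) it collapses to 0. The indicator for \(p > 0.5\), as promised.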
The first claim is a little more complicated. Using the Bernoulli you find that your probability of success is \(1 - (1 - p)^N\),¹ which approaches 1 exponentially fast in the number of trials, not polynomially.
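To spell that out: the failure probability after \(N\) trials is
\[(1 - p)^N = e^{N \log(1 - p)},\]
so the log failure probability falls linearly in \(N\) (exponential decay), whereas a power law would have it fall linearly in \(\log N\).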
Now how do we square this with their observed power laws? In comes another paper (arxiv): How Do Large Language Monkeys Get Their Power (Laws)? They make the same observation I do and validate it empirically on a per-problem basis. They infer that there must be a distributional effect: as you integrate over the exponential curves of the individual problems, you recover a power law in the aggregate. It turns out that a good number of distributions fit the bill: you just need a heavy tail of low-probability problems.
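You can see the effect in a few lines (a sketch: the Beta(0.1, 1) difficulty distribution is just one heavy-tailed choice that produces the behavior, not anything fit in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-problem success probabilities with a heavy tail of hard (low-p) problems.
p = rng.beta(0.1, 1.0, size=100_000)

# Each problem's failure curve (1 - p)**N is exponential in N;
# the average over problems is what the benchmark actually reports.
for N in (1, 10, 100, 1_000, 10_000):
    print(f"N={N:>6}  aggregate failure = {np.mean((1 - p) ** N):.4f}")
```

The aggregate failure rate drops by the same constant factor every decade of \(N\), i.e. it scales like \(N^{-0.1}\) (the Beta shape parameter): a power law in the aggregate, even though every individual curve is exponential.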
I’m not sure why statistics is perpetually late to the party. I certainly understand why we don’t get involved in the training of LLMs, or of the neural networks of earlier iterations of this phenomenon. But surely the evaluation side could benefit from more statistical thinking. I quite enjoyed the paper “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” (arxiv) and would love to see more work in this area!
---
More evidence of the lack of statistical influence: for their estimate of pass@1 they cryptically cite an OpenAI paper (arxiv) which finds the naive \(1 - (1 - \hat{p})^{n}\) is biased. Of this estimator they state that “while [it] may look correct, it underestimates the true value by a considerable margin”. This is just Jensen’s inequality: nothing mysterious!! ↩︎
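A quick check of the footnote’s point (a minimal sketch; the true pass rate, samples per problem, and \(k\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

p, n, k = 0.05, 20, 10  # true pass rate, samples per problem, the k in pass@k
p_hat = rng.binomial(n, p, size=200_000) / n  # empirical pass rates

naive = np.mean(1 - (1 - p_hat) ** k)  # plug-in estimator, averaged over problems
exact = 1 - (1 - p) ** k

print(f"E[naive] = {naive:.3f}  vs  true pass@{k} = {exact:.3f}")
```

Since \(1 - (1 - p)^k\) is concave in \(p\), Jensen’s inequality gives \(\mathbb{E}\left[1 - (1 - \hat{p})^k\right] \le 1 - (1 - p)^k\): the plug-in estimate is biased low, exactly the underestimate the OpenAI paper reports.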