arXiv

They mention in the introduction “It is natural to consider expert knowledge to inform the feature selection task. However, this is challenging to do in a systematic and scalable way that safeguards against introducing potential biases”. Thus I was surprised that they didn’t mention the Bayesian interpretation where the Lasso is the MAP estimate with a Laplace prior on the coefficients.
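To spell that out (this is the textbook identity, not anything from their paper, and the notation is mine): with a Gaussian likelihood and independent Laplace priors \(p(\beta_j) \propto e^{-|\beta_j|/b}\), the MAP estimate is exactly the Lasso solution

\[
\hat{\beta}_{\text{MAP}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad \lambda = \frac{2\sigma^2}{b},
\]

so any scheme that shrinks some coefficients harder than others is, implicitly, a choice of those prior widths.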

Making that connection explicit would have helped with the justification of the method. As is, the paper reads like a work of engineering: we did a thing, it lowered prediction error on a few (cherry-picked?) data sets, your mileage may vary. With the connection to Bayesian priors we get (1) a principled reason why this would make sense and (2) the ability to recognize when the LLM wouldn’t be helpful. When the LLM’s implicit prior looks like what an expert would provide, you’re probably going to do well; when it doesn’t… probably not.

This led me to read about attempts to get priors out of LLMs, such as this paper on eliciting an LLM’s priors; it reminded me of a talk1 that Tom Griffiths gave back when I was an undergrad at Duke. Repeatedly drawing from a posterior \(p(h|d)\) and then generating data from the likelihood \(p(d | h)\) defines a Markov chain whose stationary distribution is the prior \(p(h)\). So to get an LLM’s prior you can iterate: ask “If you were to evaluate the lifespan of a random T-year-old man, what age would you predict he might reach?”, update T with a uniform draw between zero and the LLM’s answer, and repeat. Their paper shows this results in a prior distribution centered tightly around 80 years: reasonable enough, though much tighter than I would have chosen.
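The loop is simple enough to sketch. Here’s a minimal Python version with the LLM call stubbed out so it runs on its own; the function name llm_predicted_lifespan and its fake response rule are mine, and in practice you’d replace the stub with a real chat call using the quoted prompt.

```python
import random

def llm_predicted_lifespan(current_age: float) -> float:
    """Stand-in for the LLM call: given a man's current age, return a predicted
    total lifespan (a draw from the model's posterior p(h | d)). This fake rule
    just centers answers near 80 so the script runs without an API; swap in a
    real chat call with the quoted prompt to elicit an actual model's prior."""
    return max(current_age, random.gauss(80, 5))

def elicit_prior(n_steps: int = 10_000, burn_in: int = 1_000, t0: float = 30.0):
    """Alternate h ~ p(h | d) (ask the LLM) with d ~ p(d | h) (a random
    T-year-old's age is uniform on [0, lifespan]). The h-samples form a
    Markov chain whose stationary distribution is the prior p(h)."""
    t = t0
    samples = []
    for step in range(n_steps):
        h = llm_predicted_lifespan(t)   # posterior draw, via the LLM
        t = random.uniform(0, h)        # regenerate the datum from the likelihood
        if step >= burn_in:
            samples.append(h)
    return samples

if __name__ == "__main__":
    prior_samples = elicit_prior()
    mean = sum(prior_samples) / len(prior_samples)
    print(f"elicited prior on lifespan: mean ≈ {mean:.1f} years")
```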

For my taste, both of these papers presume too much of the LLM’s knowledge and coherence. This seems to be becoming a trend, with folks viewing LLMs as oracles rather than as draws from a probability distribution2. This is partly why I like the second paper more: it treats the output of the LLM as the object of interest itself.

I’ll leave you with my fun experiment of naively eliciting a prior from ChatGPT. I’ll let you judge how it went for yourself (I sketch a quick coherence check after the transcript), but I for one welcome our robot overlords who can be Dutch booked3!

Me:

You are a preeminent seismologist who I’ve enlisted to help me with a statistics project. I am conducting a Bayesian analysis of earthquakes in the Bay Area and am creating a model for the time between earthquake events above magnitude 3. I am coming to you for a prior distribution: based on your knowledge and experience, please answer the following questions for me:

  1. What probability would you place that there is less than a 1 year gap between earthquakes?
  2. What probability would you place that there is less than a 2 year gap between earthquakes?
  3. Would you place any probability on there being a greater than 10 year gap between earthquakes?
  4. At what gap would you be indifferent on the over/under for an even-odds bet?
  5. For the same interval of time would you prefer to bet on an earthquake in an earlier interval or a later interval, or does it depend on the times?
  6. If you had to put a 50-50 bet on a single interval what would it be?

ChatGPT:

Here’s what a seasoned seismologist might say, blending expert knowledge with the probabilistic mindset you’d want for your Bayesian model:

  1. Probability of less than a 1-year gap: Given the historical frequency of M3+ earthquakes in the Bay Area, the probability of a sub-1-year gap is reasonably high, perhaps around 0.7 to 0.8. These smaller-magnitude events happen relatively frequently, though variability exists.
  2. Probability of less than a 2-year gap: This would naturally be higher, maybe around 0.9 to 0.95, as it’s rare to go multiple years without an event of this magnitude.
  3. Probability of a greater than 10-year gap: A long drought is unlikely but not impossible. The Bay Area’s seismicity is fairly regular, but longer quiet periods can occur. I’d assign a small probability, perhaps around 0.05 to 0.1.
  4. Indifference point for even-odds bet: For M3+ earthquakes, I’d likely be indifferent around 1.5 to 2 years, given typical recurrence intervals and how clustered smaller events can be.
  5. Preference for earlier vs. later intervals: It depends on whether the region has had recent activity. After a cluster of events, short-term probability might decrease. But after a quiet period, stress accumulation could make earlier intervals more attractive.
  6. 50-50 bet on a single interval: I’d pick an interval around 1–2 years, where the balance of probabilities feels most natural given the historical record.

Would you like to fit this into a specific distribution, like an exponential or Weibull, to shape your prior more formally? Let me know — we can build something precise!
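Since I joked about Dutch books: here’s a quick coherence check on those answers, assuming a memoryless exponential model for the gaps and taking the midpoints of ChatGPT’s stated ranges (both assumptions are mine, not ChatGPT’s).

```python
import math

# Midpoints of ChatGPT's stated ranges (my reading of the answers above).
p_under_1yr = 0.75     # answer 1: P(gap < 1 year) in 0.7–0.8
p_under_2yr = 0.925    # answer 2: P(gap < 2 years) in 0.9–0.95
p_over_10yr = 0.075    # answer 3: P(gap > 10 years) in 0.05–0.1
stated_median = 1.75   # answer 4: even-odds point at 1.5–2 years

# Calibrate the exponential rate to answer 1 alone:
# P(gap < 1) = 1 - exp(-rate) = 0.75  =>  rate = ln(4) per year.
rate = -math.log(1 - p_under_1yr)

implied_p_under_2yr = 1 - math.exp(-2 * rate)
implied_p_over_10yr = math.exp(-10 * rate)
implied_median = math.log(2) / rate

print(f"implied P(gap < 2 yr)  = {implied_p_under_2yr:.3f}   vs stated {p_under_2yr}")
print(f"implied P(gap > 10 yr) = {implied_p_over_10yr:.1e}   vs stated {p_over_10yr}")
print(f"implied median gap     = {implied_median:.2f} yr   vs stated {stated_median} yr")
```

Answer 1 pins the rate at roughly 1.4 events per year, which is consistent with answer 2 but makes a ten-year drought about a one-in-a-million event (not 5–10%) and puts the even-odds gap near six months (not 1.5–2 years). And answers 1 and 4 clash under any distribution, exponential or not: if P(gap < 1 year) is 0.75, the even-odds point has to be under a year. That’s the sort of incoherence a bookie can price.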


  1. A different version of the talk is available on YouTube. ↩︎

  2. You see this in evals too, which is why I love this paper addressing the stochastic nature of evals. ↩︎

  3. I have a new retirement plan! ↩︎