Why Bayes Can Fail for Institutions, And How to Fix It
Economics is the right lens, and we develop a new approach to fix the problem. Takeaways from Bates, Jordan, Sklar (this author), and Soloff (2022).
Classical statistical theory was built for scientists working in good faith. But institutions distribute rewards, which makes them both truth-seekers and gate-keepers. Think about academic journals, corporate A/B testing suites, advertising experiments, and the FDA: decisions can swing the fate of careers, companies, or billions of dollars. Too-loose statistical standards can reward applicants for mass-producing BS, leading to a replication crisis.
Our latest paper develops theory and statistical tools to prevent these incentive issues.
“Just use Bayes?” Not so fast. If the payout for a successful application is much larger than the submission cost, then a Bayesian analysis of the existing applicant population is vulnerable to new entrants submitting profitable zero-effect “bluffs.” Relaxing standards might be beneficial temporarily but counterproductive in equilibrium, as new participants will dilute the Bayesian prior distribution.
“Just use p < .05” is not the answer either, even with pre-registration. If costs and rewards are fixed constants, then we need a significance threshold below the cost-to-reward ratio (p < cost/reward) to deter bluffing. This payoff-dependent level is the most generous we can get. Any looser, and the process will reward spamming.
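To make the arithmetic concrete, here is a toy calculation (the dollar figures are invented for illustration): a zero-effect bluff clears a level-α test with probability α, so its expected value is α × reward - cost, and deterrence requires α < cost/reward.

```python
# Toy deterrence calculation (dollar figures invented for illustration).
# A zero-effect bluff passes a level-alpha test with probability alpha,
# so its expected value is alpha * reward - cost.

def bluff_ev(alpha, cost, reward):
    return alpha * reward - cost

cost, reward = 1_000_000, 200_000_000   # $1M to apply, $200M payout if approved
alpha_cap = cost / reward               # deterrence requires alpha < cost / reward
print(alpha_cap)                        # 0.005
print(bluff_ev(0.05, cost, reward))     # 9000000.0 -> p < .05 invites bluffing
print(bluff_ev(0.005, cost, reward))    # 0.0       -> break-even at cost / reward
```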
What should institutions do?
The Bayesian approach is ideal if its decision thresholds are already strict enough to deter bluffing. Otherwise, if the ratio of application cost to reward for success is small enough to encourage bluffs, yet lower-bounded and not very variable, a strict frequentist standard is ideal. For medical regulators like the FDA, we give evidence that the frequentist standard for drug approvals probably can’t be relaxed in the largest markets without creating incentive problems, but could be relaxed for smaller markets.
In our upcoming paper, we develop statistical theory for a more flexible approach: we generalize to sequential profit-taking and accruing costs via “the profit license.” The license approach assumes that payouts can be quantified and capped when the evidence is poor. The license value increases with favorable evidence and as investments are made, and decreases with unfavorable evidence and as money is taken out. The profit license evolves in a way that prevents profitable bluffing while allowing arbitrarily large total payouts. We can solve for the optimal profit license with dynamic programming.
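As a rough illustration of how such a cap could move, here is a minimal sketch; the e-process update rule below is one simple choice I am using for illustration, not necessarily the exact construction in the paper.

```python
# Minimal sketch of a profit-license-style cap (illustration only; not
# necessarily the paper's exact construction). Idea: track an e-process
# E_t, starting at 1 and updated by likelihood ratios of incoming data,
# and never let cumulative withdrawals exceed (invested capital) * E_t.

class ProfitLicense:
    def __init__(self):
        self.invested = 0.0   # total costs paid into the system
        self.paid_out = 0.0   # total profit withdrawn so far
        self.e_value = 1.0    # running e-process (evidence against the null)

    def invest(self, amount):
        """Paying in more cost raises the cap."""
        self.invested += amount

    def observe(self, likelihood_ratio):
        """Favorable evidence (LR > 1) raises the cap; unfavorable lowers it."""
        self.e_value *= likelihood_ratio

    def cap(self):
        """Maximum additional profit that may be withdrawn right now."""
        return max(self.invested * self.e_value - self.paid_out, 0.0)

    def withdraw(self, amount):
        """Withdraw up to the current cap; withdrawals lower future capacity."""
        allowed = min(amount, self.cap())
        self.paid_out += allowed
        return allowed
```

With a fixed investment, the e-process has expectation at most one under a zero-effect null, so the expected total withdrawal cannot exceed the cost paid in; the optimal shape of the license over multiple decision points is what the dynamic program solves for.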
What happens if we add a profit license requirement on top of an existing system? The addition of the profit license gives us peace of mind, by ensuring that bluffing is unprofitable. But we might be concerned that this unnecessarily hurts the good actors already in the system. It turns out that we can connect these conditions: if the game already deters bluffing, then the good participants should always be able to construct licenses such that their plans are permitted.
Practically speaking, I do not think the FDA should aim to implement a profit license yet. The current process seems to be working, though this might change if new trillion-dollar drug markets open up or clinical trials become much cheaper.
Below is a more detailed list of key points. Also, here’s a link to our full paper.
Key Points:
(1) We have many “statistical filtering institutions” in the world: medical regulators assessing drugs; journal reviewers assessing claims in a submitted paper; corporate A/B testing suites testing the effectiveness of coding changes or marketing buys. Evidence is continuous-scale, but it’s a fact of life that organizations must make accept/reject decisions. (Waiting for more evidence is the same thing as a sequence of “reject, reject, accept” decisions.)
(2) These mechanisms will fail if they are exploitable. If it is profitable to bluff - e.g., for a pharma company to submit an ineffective drug, a scientist to study a treatment with no effect, or an engineer to submit an ineffective feature - then the system will be overrun by low-value activity. (To speak overly broadly: this overrunning has already happened in many areas of science, but has NOT happened at the FDA. For corporate A/B testing, I have seen cases going both ways. This variability is interesting and deserves a future post.)
(3) Bayesian decision theory is wonderful, and there are good arguments for using it. The Bayesian approach to filtering would be to estimate the prior distribution of applicants, incorporate the likelihood of each experiment, form the posterior, and assess costs and benefits to determine the optimal threshold (a toy sketch appears after this list). This plays well with adaptive sampling, multi-level inference, and various forms of analysis. In fact, under typical assumptions, Wald (1950) shows that this approach is essentially optimal, in the sense that ANY good decision rule must be a Bayesian decision rule for some prior (or the limit of a sequence of such rules).
(4) But the standard Bayesian analysis depends on the assumption that the prior does not adapt to the decision rule. If an application to the system is very cheap to produce, then a Bayesian analysis based on the current applicant pool may suggest a threshold which is susceptible to profitable bluffing. If so, and if bluffing would occur at a scale worth disincentivizing, then the “optimal” filter must be more conservative than the Bayesian threshold.
(5) In particular, if the ratio of potential profits to application costs is large and fixed, then the optimal incentive-aligned solution boils down to a frequentist p-value thresholding rule. For example, if false positives would reward applicants with an expected 100x return on investment, then a threshold of p < .01 is required to prevent bluffing attempts. Notice that this incentive-based threshold could be larger or smaller than the Bayesian threshold as we vary the payoff. If a Bayesian analysis is already sufficient to deter bluffs, it may be optimal (assuming the prior does not depend on the threshold in other ways). In contrast, if the payoff ratio grows very large and there is a large supply of potential bluffers, keeping the bluffers out is necessary and requires stronger evidence than is needed to persuade a Bayesian - in this case the frequentist p-value threshold dominates and may be the optimal filter.
(5b) This offers a resolution to the Bayes-vs-frequentist debate!
(6) The incentives framing has a number of implications for how we should set up filtering systems. Frequentist thresholding as practiced by medical regulators may actually be close to optimal. Statisticians proposing to loosen current policy should consider Chesterton’s fence. (See our paper, where we analyze this in a few basic cases - and contrast our results with the Bayesian approach of Isakov et al. (2019), which in some disease areas proposes a Type I error of 20% or greater.)
(7) In our paper, we also derive a flexible “profit license” framework that ensures incentive alignment in an adaptive fashion: it caps the profit that can be extracted based on evolving evidence, previous investment, and profits taken so far. The cap grows as more evidence is presented against the null. We prove that this procedure is non-exploitable using martingales (e-values), and we can solve for the optimal profit license over multiple decision points with dynamic programming. (See our paper.)
(8) In our sequential setup, where a participant faces costs and rewards as more evidence is revealed over time, either the game is exploitable by zero-effect bluffing, OR there exists a valid profit license that can be associated with the payouts. Furthermore, that associated license can be constructed by dynamic programming.
(9) The above conjecture is a powerful optimality result. Wald (1950) establishes the optimality of Bayes under a fixed prior, in effect saying: “Don’t like Bayesian decision theory? Too bad: if your filter isn’t suboptimal, it’s Bayes.” This conjecture gives us a similar statement in the presence of potential bluffs: “Don’t like the profit license? Too bad: if your filter isn’t inviting bluffs, it’s a profit license.”
(10) To be clear, I am not advocating implementing an actual profit license at the FDA, for several reasons: (a) imposing price or profit controls on pharma would likely carry political risk and cause long-term damage to innovation, (b) a license system would be costly to implement, since it relies on comprehensive tracking of value received, and evasion might be possible, and (c) I’ve not seen evidence that the FDA’s current policy is failing. But I do find it a useful model, perhaps a “shadow theory” of the institution that is better left unacknowledged. At minimum, this work provides an under-appreciated argument for why filtering systems can use strict frequentist standards as a default. It also offers a reasonable framework for error control in platform trials, where there is ongoing debate (e.g., does each arm get its own Type I error budget? What about FWER / FDR control?). For each arm, one should consider what was paid to add it and what the sponsor stands to win; the arm’s hypothesis threshold should be set below that cost-to-reward ratio.
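To illustrate point (3) and the caveat in points (4) and (5), here is a toy sketch in Python. The prior, effect size, and payoffs are invented for illustration; this is not the model from the paper.

```python
# Toy sketch of the Bayesian filter in point (3) and the caveat in points
# (4)-(5). All numbers (prior, effect size, payoffs) are invented for
# illustration; this is not the model from the paper.
import math

def normal_pdf(z, mu):
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2 * math.pi)

pi_effective = 0.3          # prior fraction of applicants with a real effect
mu_alt = 2.0                # standardized effect size when the effect is real
benefit, harm = 10.0, 3.0   # institution's gain from a true accept / loss from a false accept

def posterior_effective(z):
    like_alt = pi_effective * normal_pdf(z, mu_alt)
    like_null = (1 - pi_effective) * normal_pdf(z, 0.0)
    return like_alt / (like_alt + like_null)

def bayes_accept(z):
    """Accept when the posterior expected value to the institution is positive."""
    p = posterior_effective(z)
    return p * benefit - (1 - p) * harm > 0

# Point (4)'s warning: this rule is only safe if its implied Type I error is
# already below the incentive bound cost/reward from point (5). Otherwise
# cheap zero-effect bluffs enter, and pi_effective no longer describes the pool.
```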
Additional Notes:
- The initial inspiration for the profit license approach came from learning to play poker from Doug Polk’s YouTube channel, and thinking about the concept of MDF (minimum defense frequency) across multiple streets as applied to the multiple phases of clinical trials. Learning poker was a great way to slack off from my PhD. The game embeds an incredible number of advanced stats concepts. In retrospect, I wish I had included elements of thoughtful gambling in the introductory statistics courses I taught.
- Related literature, which shares some of the core ideas:
(book) “Game-Theoretic Foundations for Probability and Finance” (2019) by Shafer and Vovk
“An Economic Theory of Statistical Testing” (2016) by Tetenov
“(When) should you adjust inferences for multiple hypothesis testing?” (2022) by Viviano, Wuthrich, and Niehaus