Today's note is to point to an essay of that title by John P. A. Ioannidis, Professor of Medicine and of Health Research and Policy at Stanford University School of Medicine and Professor of Statistics at Stanford University School of Humanities and Sciences. By false findings he is referring to research that passed the usual test of statistical significance but was ultimately shown to be false or non-reproducible. The problem is that the number of false findings is much, much higher than would be predicted from the researchers' own assessments of statistical significance.
His essay was published in 2005. It is available here, at no charge. There is also an article in The Economist that briefly reports on his work. "The ASA's Statement on p-Values: Context, Process, and Purpose" goes over the same territory and has a reference list of 40 articles on this very topic. The ASA is the American Statistical Association.
I want to talk a little about how the circumstances are a bit different for findings by quantitative analysts who work for funds managing securities. Can you guess why? Come on... it's easy.
That's not a typo. Roughly speaking, a "p value" is an estimate of the probability that chance alone could have produced results at least as rosy-sounding as the published research results. So if the p value is high and the experiment is repeated enough times, the chances are good that the efficacy of whatever was supposed to have brought about the rosy result will no longer be apparent.
"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value." (American Statistical Association, ASA)
Well if that's an "informal" definition I'd hate to see a formal one. With "informally" they are perhaps cluing you in on the fact that your failure to immediately understand what is meant by a "specified statistical model" doesn't mean that you're stupid; rather, it means that they couldn't figure out how to write succinctly on this topic. Probably it would be better were they to discuss examples.
In the context of the Retail Backtest project the statistical summary of the data is the risk-adjusted return, computed as the Sharpe ratio of the cumulative return on the portfolio. A single pass of the algorithm over the historical price data yields just one Sharpe ratio, with no variation: there is only one set of data, and each day the algorithm dictates what position to hold in each security, and those positions produce the cumulative return.
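For readers who want to see the arithmetic, here is a minimal sketch of a Sharpe ratio computed from a series of daily returns. The function name, the zero risk-free rate, and the 252-trading-day annualization are assumptions made for illustration; they are not necessarily the conventions the Retail Backtest project uses.

```python
import math
from statistics import mean, stdev

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of periodic returns.

    Assumptions (illustrative only): a constant per-period risk-free
    rate and annualization by the square root of periods_per_year.
    """
    excess = [r - risk_free_daily for r in daily_returns]
    sd = stdev(excess)  # sample standard deviation
    if sd == 0:
        raise ValueError("returns have zero variance")
    return mean(excess) / sd * math.sqrt(periods_per_year)

# One pass over one history gives exactly one number, with no spread.
example_returns = [0.010, -0.005, 0.007, 0.002, -0.003]
print(sharpe_ratio(example_returns))
```

Because the input is a single fixed history, calling the function again returns the same value, which is the "no variation" point made above.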
A variation in the Sharpe ratio comes about if we also have a random number generator in our computer program that effectively simulates monkeys throwing darts at a board. The monkeys pick which securities to hold and when to hold them instead of the algorithm. The "null hypothesis" is that our algorithm for picking which securities to invest in and when is no better than that of the monkeys.
At this point I have to interject that the Retail Backtest project's approach to establishing the likelihood that a portfolio management program's past good performance will continue hardly relies on p values. Yes, I do compute them as a final step, but they are almost an afterthought: the basic scheme of the program for determining which securities to hold and when to hold them is first subjected to a "walkthrough" procedure and then to a "suboptimization" step. Each of those involves a form of out-of-sample testing and both administer substantial "haircuts" to expectations. When the thus-diminished expected Sharpe ratio nonetheless turns out to be about twice that of buy-and-hold, which it often is with the algorithms that I have been testing, the p value is always in the general ballpark of 0.01, not 0.05. And I would not want to accept a scheme that improves the Sharpe ratio by much less than a factor of two. The overall effect is that I don't actually accept or reject algorithms based on the p value.
So if the monkeys try a thousand times and beat our pet scheme's Sharpe ratio less than one out of twenty times we would say that the p value was less than 1 in 20, or p<0.05. Across many fields of endeavor that entirely arbitrary standard of 1 in 20 has been adopted as the criterion of significance: you can say that your research results are "statistically significant" if your computed p value is less than 0.05. Right away we should realize that if the p value is high, say 0.10, or 0.50, something well above 0.05, that doesn't mean that the null hypothesis is true. It just means that it hasn't been adequately refuted—it hasn't been shown that it is quite unlikely that it is true.
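The monkey test can be sketched in a few lines. What follows is an illustration under stated assumptions, not the Retail Backtest code: the data are synthetic Gaussian returns, each monkey holds one randomly chosen security per day, and the names `monkey_p_value`, `days`, `foresight` and `coin_flip` are made up for this example. The p value is taken as the plain fraction of monkeys whose Sharpe ratio matches or beats the scheme's.

```python
import math
import random
from statistics import mean, stdev

def sharpe_ratio(returns):
    """Annualized Sharpe ratio, assuming a zero risk-free rate and 252 days."""
    sd = stdev(returns)
    return mean(returns) / sd * math.sqrt(252) if sd > 0 else 0.0

def monkey_p_value(asset_returns, strategy_returns, n_monkeys=1000, seed=0):
    """Fraction of dart-throwing monkeys that match or beat the strategy.

    asset_returns: one list per day, holding that day's return for
    each security. Each monkey picks one security at random per day.
    """
    rng = random.Random(seed)
    target = sharpe_ratio(strategy_returns)
    beats = 0
    for _ in range(n_monkeys):
        monkey = [rng.choice(day) for day in asset_returns]
        if sharpe_ratio(monkey) >= target:
            beats += 1
    return beats / n_monkeys

# Synthetic history: 500 days, 5 securities (purely illustrative data).
data_rng = random.Random(42)
days = [[data_rng.gauss(0.0003, 0.01) for _ in range(5)] for _ in range(500)]

# A "scheme" that cheats by always holding the day's best security,
# and one that is itself just another monkey.
foresight = [max(day) for day in days]
coin_flip = [data_rng.choice(day) for day in days]

print("cheating scheme p value:", monkey_p_value(days, foresight))
print("random scheme p value:  ", monkey_p_value(days, coin_flip))
```

The cheating scheme should come out far below 0.05, while the random one has no particular reason to. A high p value for the random scheme does not prove that it is worthless; it only fails to refute the null hypothesis.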
If that all isn't perfectly clear to you, don't be alarmed as you are in very good company. Actually the entire ASA statement is a condemnation of the casual use of p values.
"Underpinning many published scientific conclusions is the concept of 'statistical significance,' typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted. This has led to some scientific journals discouraging the use of p-values, and some scientists and statisticians recommending their abandonment, with some arguments essentially unchanged since p-values were first introduced." (ASA)
Returning now to Professor Ioannidis and his grim report on the reliability of published research results, he sees not only the problem of naïve reliance on p values, but other systemic issues. Here are some subheadings from his article:
- The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
- The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
- The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
- The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
- The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
- The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
The article is not extremely easy to read, but I can tell you that it is not full of the impenetrable jargon that statisticians customarily wrap themselves up in. It's not Greek! He explains his mathematics, which is very Šidák-like as in this blog entry of mine. However he adds other variables, one for bias and one for the ratio of the number of “true relationships” to “no relationships” among those tested in the field. Sometimes the ratio is roughly known before any tests are made. A low ratio is bad news when it comes to the probability that a research finding is indeed true.
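To make the role of that ratio concrete, here is a small sketch of the essay's "positive predictive value" (PPV), the probability that a claimed finding is true. The formulas are written as I recall them from the essay, with R the pre-study ratio of true relationships to no relationships, α the significance threshold, 1 − β the power and u the bias; the function name and the default values are my own choices for illustration.

```python
def ppv(R, alpha=0.05, power=0.8, bias=0.0):
    """Post-study probability that a 'significant' finding is true.

    R     : pre-study ratio of true relationships to no relationships
    alpha : significance threshold (type I error rate)
    power : 1 - beta, where beta is the type II error rate
    bias  : u, the share of would-be negative results reported as positive
    With bias = 0 this reduces to (1 - beta) R / (R - beta R + alpha).
    """
    beta = 1.0 - power
    true_positives = (1 - beta) * R + bias * beta * R
    all_positives = (R + alpha - beta * R
                     + bias - bias * alpha + bias * beta * R)
    return true_positives / all_positives

# Even odds before the study versus a field where 1 in 100 tested
# relationships is real (illustrative inputs).
for ratio in (1.0, 0.1, 0.01):
    print(ratio, ppv(ratio), ppv(ratio, bias=0.10))
```

Running it shows the bad news directly: holding α and power fixed, the PPV falls as R falls, and adding bias lowers it further.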
Comparisons With Financial Analysis
I'm referring here to financial analysis for the purpose of improving portfolio management. But we'll have to break it down further. There's analysis of that kind that is done by academicians, who then publish the details of what they did. And then there's analysis of that kind done, say, for hedge funds in-house, which is almost never published but is instead held to be proprietary. I'd have to say that the Retail Backtest project does publish testing details. The entire walkthrough plus suboptimization program is described, with data, in the online article Does Momentum Work?
But generally the retail investor, even the institutional investor, has to pick and choose among prospective money managers who have no such candor about them. The investor gets a large dose of sales talk, is shown the distinguished résumés of the principals, and gets to look at past performance. But professional money managers are good at finding bandwagons to jump on, and create new funds and programs at such a rapid rate that the performance history can be quite brief. Anyway, there is not even a p value offered for consideration, much less evidence that anyone ever agonized over how to discern whether or not good past performance, hypothetical or real, would be likely to be sustained.
So that's the major difference between financial analysts working for funds and the medical researchers with whose work Dr. Ioannidis is primarily familiar. The former generally would never even mention "hypothesis testing" to you the investor. They would instead offer you expectations of "red meat" of some kind, with nary a mention of the odds of getting porridge instead.
I can offer some other distinctions, going down the list above. About smaller studies leading to less likelihood that research findings are true, we who do quantitative analysis based on price histories or other histories pertaining to securities have a considerable advantage there: we have vast amounts of data to use, all of it precise.
About small effects being suspect, every offering of the Retail Backtest project would have improved the Sharpe ratio of the cumulative return by something like a factor of two. But the Renaissance Technologies Corporation's Medallion Fund under mathematician Jim Simons had a ten-year Sharpe ratio of 1.89 throughout the 1990s, with a 2.52 ratio for the last five years of the decade [source]. That beats, by far, Retail Backtest's best hypothetical past results (which are however based on comparatively infrequent trading such as a retail investor might be able to do). And for comparison, simply buying and holding the S&P500 stocks will get you a Sharpe ratio of well under 0.50.
So we're not necessarily limited to small effects in the analysis of portfolio management schemes. We're now on the third and fourth of the bulleted items on the list above. We need to be careful there. Ioannidis is talking about research results that are reached via tested relationships that are too numerous, and with too much flexibility in designs, definitions, outcomes, and analytical modes, but he means without there being any further step to make up for the folly of all of that. You could say that the walkthrough step of the Retail Backtest approach has that kind of risk built into it. But there is a further step built into the Retail Backtest project's way of doing things, the aforementioned suboptimization, and it is utterly unforgiving, not involving flexibility or different testing relationships.
Furthermore, we'd have to consider that Ioannidis may principally have in mind medical studies about constant phenomena that don't change with time. If a certain kind of cancer is caused by a certain carcinogen then it was probably always thus, so the research done on that doesn't have to adapt to the phenomenon changing with time. Ahh, but in finance we do! The walkthrough step of the Retail Backtest approach is adaptive, potentially adaptive enough to overcome changes in the way that the market trades (regime changes) that would stymie a more static system.
About the last two bulleted items on the list above, yes, financial analysis can suffer from those problems.