Two things I find interesting here, one discussed by the author and one not. (1) As mentioned at the bottom, forecasting should usually feed into decision-making, and when the two get disconnected, it can be unclear what the value is. It sounds like Rosenfield is trying to use forecasting to give added weight to his statistical conclusions about past data, which I agree sounds suspect.
(2) It's not clear what the "error bars" should mean. One reading is a confidence interval[1] (e.g. the model gives a 95% chance that the outcome will be within these bounds). Another is a standard deviation (i.e. you are essentially predicting how far your own point forecast will typically be from the outcome).
[1] acknowledged: not the correct term
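To make the two readings concrete, here's a toy sketch with invented numbers (a point forecast of 400 and a Normal(400, 15) forecast distribution; nothing here comes from the article):

```python
import numpy as np
from scipy.stats import norm

point, spread = 400.0, 15.0            # invented point forecast and forecast sd

# Reading 1: an interval the model gives a 95% chance of covering the outcome.
lo, hi = norm.ppf([0.025, 0.975], loc=point, scale=spread)
print(f"95% band: [{lo:.0f}, {hi:.0f}]")

# Reading 2: a standard deviation, i.e. a prediction of your own typical error.
outcomes = np.random.default_rng(0).normal(point, spread, size=100_000)  # pretend reality
rmse = np.sqrt(np.mean((point - outcomes) ** 2))
print(f"stated error sd: {spread:.1f}   realised RMSE: {rmse:.1f}")
```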
Error bars in forecasts can only express the uncertainty your model has. Without error bars over models, you can say nothing about how good your model itself is. Even with them, your hypermodel may be inadequate.
To me, this comes back to the question of skin in the game. If you have skin in the game, then you produce the best uncertainty estimates you can (by any means). If you don't, you just sit back and say "well these are the error bars my model came up with".
It's worse than that: oftentimes the skin in the game provides a motivation to mislead. Cf. most of the economics profession.
How do economists have skin in the game?
Many of them work in universities, for example, and some even have tenure. There's not much skin in the game connecting any forecasts they might make to their academic prospects.
Economists working for companies often have to help them understand micro- and macroeconomics. E.g. (some of) Google's economists help design the ad auctions. It's relatively easy for Google to figure out how well those ad auctions work, so they certainly have skin in the game. But what motivation to mislead do those economists have?
Many economists are so fully bought into their models that they can't think of any alternatives, despite them being essentially useless. I interpreted skin-in-the-game in that way - as professionally committed. Perhaps something different was meant.
How do you know that? Whenever I interact with economists, mostly online via blogs but also sometimes via email, they always seem painfully aware of the shortcomings of their models, and don't seem to confuse them with reality.
Perhaps you have studied a different sub-population of economists than the ones I have anecdotal experience with?
In a sense, that makes my point. Why do they persist with models that don't represent reality despite knowing it? Eventually you must realise that adding epicycles isn't going to cut it, yet still the sage voices echo the standard dogma when economies are dragged into the doldrums by policy based on useless models.
Bought into is not the same as believing.
Why do physicists ignore friction whenever possible?
In general, for any task, you take the simplest model that represents the aspects of reality that you care about. But you stay aware of the limits. That's true in physics or engineering just as much as in economics.
That's why NASA uses Newtonian mechanics for all their rocket science needs, even though they have heard of General Relativity.
That's why people keep using models known to have limits.
You do know that most of published economics is about the limits of the 'standard dogma'? That's what gets you published. I often wish people would pay more attention to the orthodox basics, but confirming well-known rules isn't interesting enough for the journals.
So if, e.g., you can do some data digging and analysis showing that maybe, under these very specific circumstances, a restriction on free trade might perhaps increase national wealth, that can get you published. But the observation that most of the time free trade is the optimal policy, even if the other guy has tariffs, is too boring to get published.
Compare also crap like 'Capital in the Twenty-First Century', which catapults its author to stardom, with the comparatively boring refutations by orthodox economists that no one cares about.
Most orthodox economics is pretty unanimous about basic policies: for free trade, against occupational licensing, for free migration, for free movement of capital, for simple taxes without loopholes, against messing with the currency, against corruption, against subsidies, for taxes instead of bans (eg on drugs, or emissions, or guns), against price floors or ceilings or other price controls, etc.
Many doldrums happen when policy ignores or contradicts these basic ideas. Alas, economics 101 is not popular with the electorate almost anywhere.
Many of the policies you mentioned sound great in a world of spherical cows but break down in the real world.
For example, you say a basic policy is a tax on guns instead of a ban. First of all, I dispute that this is even orthodox economics. Second, there is some strong evidence that gun bans reduce violence.
Free migration is another one. It is an insanely complicated issue in the real world. No country has 100% free migration, or it wouldn't be a country. There are all kinds of very complex rules, and effects of those rules. And it is not clear that "free migration" is "good". (I am sure the Native Americans probably didn't like free migration.)
First, I apologize for using guns as an example. That's a needlessly divisive topic. The general principle of 'taxes instead of bans' is rather orthodox. You see that more often applied to the example of drugs or emissions.
Second, what evidence do you have for gun bans reducing violence? And reduce violence compared to what baseline?
I am very willing to believe that if you compare a free-for-all with a ban on guns, that the latter will see less violence. (I haven't looked into the evidence. Results might differ depending on details and on when and where you do that, and who gets exceptions to the bans. Eg police and military presumably are still allowed guns? Hunters probably as well? Etc. It's not so important.)
My point is that in terms of violent crime avoided, a situation where each gun and each bullet comes with a million dollar tax would be statistically indistinguishable from a ban.
And in practice, a less severe tax would probably be enough to achieve those goals whilst still preserving access to guns for those who prefer it that way.
What kind of definition of 'country' are you using here that breaks down in this way? (And what do you mean by '100%'? How nitpicky do you want to be?)
A history lesson from Wikipedia https://en.wikipedia.org/wiki/Passport
Btw, Switzerland as a country does not restrict immigration. That's left to the Kantone (cantons, roughly the equivalent of American states). Yet you'd be hard pressed to argue that Switzerland is not a country. If memory serves, the US used to have similar arrangements in its past?
It's interesting that if you oblige your models to fit a set of policy positions then they return that set of policy positions and are pretty useless in general. A cynic might say that's by design.
Orthodox macroeconomic modelling is laughably naive and mathematically wrong before even getting to the basic issues of failure to validate. Let's not compare it to disciplines where validation is the entire point.
Your rhetoric clearly shows you don't want to think too critically about this so I'll sign off now.
You aren't going to get hired by the Chicago crowd if you start espousing Keynesian ideas, let alone Marxist ones. You aren't getting hired by Exxon if you start talking about the negative externalities of climate change.
That might or might not be true, but it's not what 'skin in the game' means.
This is a pretty sweeping generalization, but if you have concrete examples to offer that support your claim, I’d be curious.
There are ways of scoring forecasts that reward both accuracy and calibration, under which it's provably optimal to report your (un)certainty as accurately as you can.
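For instance, a minimal sketch with an invented true probability: under the Brier (squared-error) score, your expected score is lowest exactly when you report the true probability, so honest uncertainty is the optimal strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.7                                   # hypothetical true event probability
outcomes = rng.binomial(1, p_true, size=200_000)

for q in (0.5, 0.6, 0.7, 0.8, 0.9):            # candidate reported probabilities
    brier = np.mean((q - outcomes) ** 2)       # mean squared error of the report
    print(f"report q={q:.1f}  ->  mean Brier score {brier:.4f}")
# The lowest (best) score lands at q = 0.7, the true probability.
```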
Yes, of course. I don't see that as very related to my point. For example, consider how 538 or The Economist predict elections. They might claim they'll use squared error or log score, but when it comes down to a big mistake, they'll blame it on factors outside their models.
Well, but at least 538 has a reputation to defend as an accurate forecaster. So they have some skin in the game.
(Of course, that's not as good as betting money.)
They can also mean uncertainty pushed forward from the input parameters, which isn't exactly the same thing as model error.
I'm not sure I see the distinction. Would you mind clarifying?
Model: water freezes below 0° C.
Input: temperature is measured at -1° C.
Prediction: water will freeze.
Actual: water didn't freeze.
Actual temperature: 2° C.
The model isn't broken, it gives an incorrect result because of input error.
Well I'd say the model is broken because it didn't capture the uncertainty in the measurements.
Taking the example in this comment thread: even if the model takes an arbitrary nonparametric distribution of input temperatures and perfectly returns the posterior distribution of freezing events, there is still a difference between model error and forward-UQ error.
The model itself can perfectly describe the physics, but it only knows what you can give it. This may be limited by measurement uncertainty of your equipment, etc, but it is separate from the model itself.
In this area, "the model" is typically considered as the input parameter to quantity of interest map itself. It's not the full problem from gathering data to prediction.
Model error would be things like failing to capture the physics (due to approximations, compute limits, etc), intrinsic aleatoric uncertainty in the freezing process itself, etc.
Making this distinction helps talk about where the uncertainty comes from, how it can be mitigated, and how to use higher level models and resampling to understand its impact across the full problem.
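A minimal sketch of that distinction, using the freezing example with made-up numbers: the "model" is just freeze = (temperature < 0 C), and the only uncertainty pushed through it is the measurement error on the input.

```python
import numpy as np

rng = np.random.default_rng(0)
measured_temp = -1.0        # deg C, what the thermometer reads
sigma_measure = 1.5         # assumed measurement standard deviation (made up)

# Forward UQ: sample plausible true temperatures given the measurement and push
# each one through the (assumed-perfect) deterministic model.
temps = rng.normal(measured_temp, sigma_measure, size=100_000)
p_freeze = np.mean(temps < 0.0)
print(f"P(freeze | measurement) = {p_freeze:.2f}")   # about 0.75 rather than 1.0

# Model error is a different animal: e.g. the water is salty and actually freezes
# at -2 C. No amount of input sampling will surface that.
```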
That's not what a confidence interval is. A 95% confidence interval is a random interval that covers the true parameter value 95% of the time (assuming the model is correctly specified).
Ok, the 'reverse' of a confidence interval then -- I haven't seen a term for the object I described other than misuse of CI in the way I did. ("Double quantile"?)
You're probably thinking of a predictive interval
It is a very common misconception and one of my technical crusades. I keep fighting, but I think I have lost. Not knowing what the "uncertainty interval" represents (is it, loosely speaking, an expectation about a mean/true value or about the distribution of unobserved values?) could be even more dangerous, in theory, than using no uncertainty interval at all.
I say in theory because, in my experience in the tech industry, with the usual exceptions, uncertainty intervals, for example on a graph, are interpreted by those making decisions as aesthetic components of the graph ("the gray bands look good here") and not as anything even marginally related to a prediction.
Agreed! I also think it's extremely important as practitioners to know what we're even trying to estimate. Expected value (i.e. least-squares regression) is the usual first thing to reach for, but does that even matter? We're probably actually interested in something like an upper quantile for planning purposes. And then there's the model component: the interval being estimated is model-driven, and if the model is wrong, the interval is meaningless. There's a lot of space for super interesting and impactful work in this area IMO, once you (the practitioner) think more critically about the objective. And don't even get me started on interventions and causal inference...
True. But a conditional quantile is much harder to accurately estimate from data than a conditional expectation (particularly if you are talking about extreme quantiles).
Oh absolutely, so it's all the more important to be precise in what we're estimating and for what purpose, and to be honest about our ability to estimate it with appropriate uncertainty quantification (such as by using conformal prediction methods/bootstrapping).
From a statistical point of view, I agree that there is a lot of interesting and impactful work to be done on estimating predictive intervals, more in ML than in traditional statistical modeling.
I have more doubts about the actions people take even when given properly estimated predictive intervals. Even I, with a good knowledge of statistical modeling, do not stop to think, after hearing "the median survival time for this disease is 5 years," that the median is calculated/estimated from an empirical distribution, so some people presumably die after 2 years and others after 8. Well, that depends on the variance.
But if I am so strongly drawn to a central estimate, is there any chance for others not so used to thinking about distributions?
If you don't mind typing it out, what do you mean formally here?
I think they mean either E[x|y] (the standard regression point estimate) along with a confidence interval (this assumes the mean is a meaningful quantity), or the interval such that F(x|y), the conditional CDF of x, is between 0.025 and 0.975 (the central 95% predictive interval). The point is that the width of the confidence interval around the point estimate of the mean converges to 0 as you add more data, because you have more information with which to estimate that point, while the predictive interval does not: it converges to the interval determined by the aleatoric uncertainty of the data-generating distribution of x conditioned on the measured covariates y.
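A toy numerical version of that point (normal data, no covariates, made-up numbers): the confidence interval for the mean shrinks like 1/sqrt(n), while the 95% predictive interval for a new observation settles at roughly plus or minus 1.96 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
for n in (10, 100, 10_000):
    x = rng.normal(5.0, sigma, size=n)
    m, s = x.mean(), x.std(ddof=1)
    ci_half = 1.96 * s / np.sqrt(n)    # half-width of the CI for the mean
    pi_half = 1.96 * s                 # half-width of the 95% predictive interval
    print(f"n={n:>6}  CI for mean +/-{ci_half:.3f}   predictive interval +/-{pi_half:.3f}")
# The CI half-width heads to 0; the predictive half-width converges to 1.96*sigma = 3.92.
```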
That's exactly what I was talking about. The nature of the uncertainty intervals is made even more nebulous when not using formal notation, something I was guilty of doing in my comment--even if I used the word "loosely" for that purpose.
If you think about linear regression, it makes sense, given its assumptions, that the confidence interval for E[x|y] is narrower near the means of x and y.
If I had to choose between the two: in a forecasting context, confidence intervals are less useful for decision-making, while prediction intervals are, in my opinion, always needed.
Ah, that makes sense. The word expectation was really throwing me off, along with the fact that, in the kind of forecasting setting of this post, the mean and confidence interval (used in the correct sense) are not meaningful, while the quantile or 'predictive interval' are meaningful.
And, from what I understand, this is what is happening in this article.
The person is providing an uncertainty interval for their mean estimator and not for future observations (i.e., the error bars reflect the uncertainty of the mean estimator, not the uncertainty over observations).
Like you said: before adding error bars, it probably makes sense to think a bit about what type of uncertainty those error bars are supposed to represent.
Thanks, this finally clarifies for me what the article was actually doing!
And it's very different from what I expected, and it doesn't make a lot of sense to me. I guess if statisticians already believe your model, then they want to see the error bars on the model. But I would expect if someone gives me a forecast with "error bars", those would relate to how accurate they think the forecast would be.
Yes, that term captures what I'm talking about.
"Credible interval":
https://en.wikipedia.org/wiki/Credible_interval
No, predictive interval is more precise, since we are dealing with predicting an observation rather than forming a belief about a parameter.
What's a predictive interval?
I don't normally use that term, but someone else in reply to me did, and it captures what I wanted to say:
https://en.wikipedia.org/wiki/Prediction_interval
A position espoused by Bill Phillips [1], and to which I now adhere:
"You should be willing to take either side of the bet that confidence interval implies." (paraphrasing; he says it better).
For a concrete example, with a 95% confidence interval, you should be as willing to accept the 19:1 odds that the true value is outside the interval as you are the 1:19 odds that the true value is inside the interval.
Aside from being generally correct, this approach is immediately actionable by making the meaning more visceral in discussions of uncertainty. Done right, it pushes you to assign uncertainties that are neither too conservative nor too optimistic.
If the notion of letting your reader take either side of the bet makes your stomach a little queasy, you're on the right track. The feeling will subside when you're pretty sure you got the error bar right and your reasoning is documented and defensible.
Edit for OP's explicit question: one-standard-deviation error bars are 68% confidence intervals. Two standard deviations are 95% confidence intervals. (Assuming you're a frequentist, of course.)
[1] https://www.nobelprize.org/prizes/physics/1997/phillips/fact...
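A minimal sketch of the bet framing, assuming only that you state a coverage probability for your interval (no distributional assumption needed): the fair payout odds follow directly from the coverage.

```python
def fair_odds(coverage):
    """Payout odds at which betting 'outside the interval' breaks even."""
    return coverage / (1.0 - coverage)

for coverage in (0.68, 0.95, 0.999):
    print(f"{coverage:.1%} interval -> {fair_odds(coverage):.0f}:1 against the value "
          f"falling outside")
# If you'd happily take either side of roughly these odds, your interval is honest;
# if one side looks like easy money, your stated coverage is off.
```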
Also assuming normal distribution, I think?
I would like to build some edge into my bets. If a reader took both sides of your example, they would come out exactly even.
But since readers are not forced to take any side at all, they will only take the bet if one of the sides has an advantage.
So I would like to be able to say, 20:1 payout the true value is inside the error bar, and 1:10 payout it's outside the error bar (or something like that).
The tighter the spread I am willing to quote, the more confident I am that I got the error estimates right. (I'm not sure how you translate these spreads back into the language of statistics.)
95% is 95% regardless of the distribution.
You can imagine yourself being equally unhappy to take either side of the bet, if that's easier than imagining yourself being happy to take either side.
It is for me, which is probably something to bring up in therapy.
I also think that framing things as bets brings in all the cultural baggage around gambling and so it isn't always helpful. I'm not sure what a better framing is though.
Standard deviations away from the mean don't correspond to the same percentiles for all distributions, or do they?
If you want to be (almost) independent of distribution, you need Chebyshev's inequality. But that one is far weaker.
https://en.wikipedia.org/wiki/Chebyshev%27s_inequality
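A quick numerical comparison, assuming only finite variance for the Chebyshev column and normality for the other: the distribution-free bound is indeed far weaker.

```python
from scipy.stats import norm

for k in (1, 2, 3):
    chebyshev = min(1.0, 1.0 / k**2)        # bound on P(|X - mu| >= k*sigma), any finite-variance X
    normal_tail = 2 * (1 - norm.cdf(k))     # the same probability if X is normal
    print(f"k={k}:  Chebyshev bound <= {chebyshev:.3f}   normal = {normal_tail:.4f}")
# k=2: at most 25% outside vs about 4.6% for a normal; k=3: at most 11.1% vs about 0.27%.
```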
Underwriting insurance without going bankrupt, perhaps?
If they had said a two standard deviation interval then you would have needed to know the distribution, but they said 95% which gives you all the information you need to make the bet.
I was replying to this addendum, which only really works for the normal distribution: "one-standard-deviation error bars are 68% confidence intervals. Two standard deviations are 95% confidence intervals."
You are right about the first part of the original comment.
Expected value does not equal utility. I am not willing to mortgage my 1 million dollar house for a 1 in a 1000 shot at a billion.
Then one should be very careful when assigning 99.9% confidence intervals.
The two are unrelated.
What you probably want is the standard error, because you are not interested in how much your data points differ from each other but in how far your estimate is likely to be from the true population value.
I don't see how standard error applies here. You are only going to get one data point, e.g. "violent crime rate in 2023". What I mean is a prediction, not only of what you think the number is, but also of how wrong you think your prediction will be.
Standard error is exactly what the statsmodels ARIMA.PredictionResults object actually gives you, and the confidence interval in this chart is constructed from a formula that uses that standard error.
ARIMA is based on a few assumptions. One, there exists some "true" mean value for the parameter you're trying to estimate, in this case violent crime rate. Two, the value you measure in any given period will be this true mean plus some random error term. Three, the value you measure in successive periods will regress back toward the mean. The "true mean" and error terms are both random variables, not a single value but a distribution of values, and when you add them up to get the predicted measurement for future periods, that is also a random variable with a distribution of values, and it has a standard error and confidence intervals and these are exactly what the article is saying should be included in any graphical report of the model output.
This is a characteristic of the model. What you're asking for, "how wrong do you think the model is," is a reasonable thing to ask for, but different and much harder to quantify.
Thanks for explaining how it works - I don't use R (I assume this is R). This does not seem like a good way to produce "error bars" around a forecast like the one in this case study. It seems more like a note about how much volatility there has been in the past.
Just to clarify... this is Python code, not R.
Thanks.
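Since it is Python: to make the above concrete, here's a hedged sketch of what that statsmodels prediction object gives you (synthetic data standing in for the crime series; the ARIMA order is an arbitrary illustration, not the article's model).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = 400 + np.cumsum(rng.normal(0, 5, size=60))   # fake annual rates, for illustration only

res = ARIMA(y, order=(1, 1, 0)).fit()
fc = res.get_forecast(steps=5)                   # forecast the next 5 "years"

print(fc.predicted_mean)                         # point forecasts
print(fc.se_mean)                                # the standard errors discussed above
print(fc.conf_int(alpha=0.05))                   # 95% intervals built from those standard errors
# Note these intervals are conditional on the ARIMA assumptions holding; they say
# nothing about how wrong the model itself might be.
```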
Another important point of discussion:
This definitely seems to me to be what the original author is motivating: forecasts should have "error bars" in the sense that they should depict how wrong they might be. In other words, when the author writes:
The second sentence does not sound like a good solution to the problem in the first sentence.
Recently someone on Hacker News described statistics as trying to measure how surprised you should be when you are wrong. Big fat error bars would give you the idea that you should expect to be wrong. Skinny ones would suggest it might be somewhat upsetting to find out you are wrong. I don't think this is an exhaustive description of statistics, but I do find it useful when thinking about forecasts.
This can be true, but it depends on the other sources of error being small enough. Standard error is just a formula, and it varies inversely with the square root of the sample size, so you can trivially narrow a confidence interval by sampling more often. In this specific case, imagine you had daily measures of violent crime instead of only annual ones. You'd get much tighter error bars.
Does that mean you should be more surprised if your predictions are wrong? It depends. You've only shrunk the model's reported uncertainty, and this is the classic precision-versus-accuracy problem: you can very precisely estimate the wrong number. Does the model really reflect the underlying data-generating process? Are the inputs you're giving it reliable measurements? If both answers are yes, then your more precise model should be converging toward a better prediction of the true value; if not, you're only getting a better prediction of the wrong thing.
We can ask these questions with this very example. Clearly, ARIMA is not a causally realistic model. Criminals don't look at last year's crime rates and decide whether to commit a crime based on that. The assumption is that, whatever actually does cause crime, it tends to happen at fairly similar levels year to year; that is, 2020 should differ more from 2010 than it does from 2019. We may not know what the causally relevant factors really are, or we may not be able to measure them, but we at least assume they follow that kind of rule. This sounds plausible to me, but is it true? We can backtest by making predictions of past years and seeing how close they are to the measured values, but whether this even works depends on the answer to the second question.
So then the second question. Is the national violent crime data actually reliable? I don't know the answer, but it certainly isn't perfect. There is a real rate for every crime, but it isn't exactly the reported number. Recording and reporting standards vary from jurisdiction to jurisdiction. Many categories of crime go underreported, and the extent of that can change over time. Changes may reflect different policing emphasis as much as or more than changes in the underlying true rate. I believe the way the FBI collects and categorizes the data has itself changed in the past, so I'm not sure a measurement from 1960 can be meaningfully compared to one from 2020.
Ultimately, "how surprised you should be when you are wrong" needs to take all of these sources of error into account, not just the model's coefficient uncertainty.
You can arbitrarily scale error bars based on real world feedback, but the underlying purpose of a model is rarely served by such tweaking. Often the point of error bars is less “How surprised you should be when you are wrong” than it is “how wrong you should be before you’re surprised.”
When trying to detect cheating in online games, you don't need to predict exact performance, but you do want to detect anomalies quickly. Detecting serial killers, gang wars, etc. isn't about nailing the number of murders on a given day but about the patterns within those cases.
Is this the difference between Bayesian and Frequentist approaches?
I disagree.
You only really need to take those sources of error into account if you want an absolute measure of error, which as you explain, seems pretty impossible.
An error for weather only needs to be relative -- for example, if the error for rain today is higher than yesterday's, it doesn't matter exactly how much higher it is, only that it's higher. (Not that I know whether this is possible.)
It's like how you can't describe how biased a certain news source is or how to read a Yelp or Rotten Tomatoes review -- you just have to read it often enough to get an intuitive sense that a 4.1 star Yelp-rated restaurant with 800 reviews is probably good while a 4.6 star restaurant with 5 reviews is quite possibly terrible.
Error bars (whether confidence intervals or standard deviations) are of little use because they do not tell you how the probability is distributed within the band. The holy grail of forecasting is the probabilistic forecast, which predicts the entire posterior distribution so that you can sample from it, generating scenarios or realizations of the underlying random process.
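A minimal sketch of what that looks like in practice, assuming a toy AR(1) process with invented parameters (nothing here is estimated from data): simulate many future paths and keep the whole sample as the forecast.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, phi, sigma = 400.0, 0.8, 5.0     # invented long-run mean, persistence, shock sd
last_value = 410.0                   # invented last observation
horizon, n_paths = 10, 5_000

paths = np.empty((n_paths, horizon))
x = np.full(n_paths, last_value)
for h in range(horizon):
    x = mu + phi * (x - mu) + rng.normal(0, sigma, size=n_paths)  # one step, all scenarios
    paths[:, h] = x

# Any summary falls out of the sample: central bands, tail risks, full histograms.
print(np.percentile(paths[:, -1], [5, 50, 95]))   # 90% band at the final horizon
print(np.mean(paths[:, -1] < 390))                # P(value below 390) at that horizon
```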
While I agree, we can always have multiple overlapping error bars to understand how the probability is distributed. I am not sure how a probabilistic forecast method is able to perform this better because the confidence interval is always generated through sampling in either situation.
Though probabilistic forecasting methods may have a Bayesian approach, it is Monte Carlo sampling that helps generate the confidence intervals.
Feel free to correct me if I am wrong! Thanks :)
As far as error bars are concerned, you could report some-percent credible intervals, calculated by taking the corresponding percentiles of your results. It's somewhat Bayesian thinking, but it will work better than confidence intervals.
The intuition would be that that percentage of your forecast draws falls between the bounds of the credible interval.