Two things I find interesting here, one discussed by the author and one not. (1) As mentioned at the bottom, forecasting should usually feed into decision-making, and when the two get disconnected, it can be unclear what the value is. It sounds like Rosenfield is trying to use forecasting to give added weight to his statistical conclusions about past data, which I agree sounds suspect.
(2) It's not clear what the "error bars" should mean. One reading is a confidence interval[1] (e.g. the model gives a 95% chance that the outcome will be within these bounds). Another is a standard deviation (i.e. you are essentially predicting how far your own point forecast will typically be from the outcome).
[1] acknowledged: not the correct term
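To make the two readings concrete, here's a toy sketch with invented numbers (a point forecast of 400 and a Normal(400, 15) forecast distribution; nothing here comes from the article):

```python
import numpy as np
from scipy.stats import norm

point, spread = 400.0, 15.0            # invented point forecast and forecast sd

# Reading 1: an interval the model gives a 95% chance of covering the outcome.
lo, hi = norm.ppf([0.025, 0.975], loc=point, scale=spread)
print(f"95% band: [{lo:.0f}, {hi:.0f}]")

# Reading 2: a standard deviation, i.e. a prediction of your own typical error.
outcomes = np.random.default_rng(0).normal(point, spread, size=100_000)  # pretend reality
rmse = np.sqrt(np.mean((point - outcomes) ** 2))
print(f"stated error sd: {spread:.1f}   realised RMSE: {rmse:.1f}")
```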
Error bars in forecasts can only express the uncertainty your model has. Without error bars over models, you can say nothing about how good your model itself is. Even with them, your hypermodel may be inadequate.
To me, this comes back to the question of skin in the game. If you have skin in the game, then you produce the best uncertainty estimates you can (by any means). If you don't, you just sit back and say "well these are the error bars my model came up with".
It's worse than that: oftentimes the skin in the game provides a motivation to mislead. Cf. most of the economics profession.
How do economists have skin in the game?
Many of them work in universities, for example, and some even have tenure. There's not much skin in the game connecting any forecasts they might make to their academic prospects.
Economists working for companies often have to help them understand micro- and macroeconomics. E.g. (some of) Google's economists help design the ad auctions. It's relatively easy for Google to figure out how well those ad auctions work, so they certainly have skin in the game. But what motivation to mislead do those economists have?
Many economists are so fully bought into their models that they can't think of any alternatives, despite them being essentially useless. I interpreted skin-in-the-game in that way - as professionally committed. Perhaps something different was meant.
How do you know that? Whenever I interact with economists, mostly online via blogs but also sometimes via email, they always seem painfully aware of the shortcomings of their models, and don't seem to confuse them with reality.
Perhaps you have studied a different sub-population of economists than the ones I have anecdotal experience with?
In a sense, that makes my point. Why do they persist with models that don't represent reality despite knowing it? Eventually you must realise that adding epicycles isn't going to cut it, yet still the sage voices echo the standard dogma when economies are dragged into the doldrums by policy based on useless models.
Bought into is not the same as believing.
Why do physicists ignore friction whenever possible?
In general, for any task, you take the simplest model that represents the aspects of reality that you care about. But you stay aware of the limits. That's true in physics or engineering just as much as in economics.
That's why NASA uses Newtonian mechanics for all their rocket science needs, even though they have heard of General Relativity.
That's why people keep using models known to have limits.
You do know that most of published economics is about the limits of the 'standard dogma'? That's what gets you published. I often wish people would pay more attention to the orthodox basics, but confirming well-known rules isn't interesting enough for the journals.
So if, e.g., you can do some data digging and analysis showing that maybe, under these very specific circumstances, a restriction on free trade might perhaps increase national wealth, that can get you published. But the observation that most of the time free trade is the optimal policy, even if the other guy has tariffs, is too boring to get published.
Compare also crap like 'Capital in the Twenty-First Century', which catapults its author to stardom, with the comparatively boring refutations by orthodox economists that no one cares about.
Most orthodox economics is pretty unanimous about basic policies: for free trade, against occupational licensing, for free migration, for free movement of capital, for simple taxes without loopholes, against messing with the currency, against corruption, against subsidies, for taxes instead of bans (eg on drugs, or emissions, or guns), against price floors or ceilings or other price controls, etc.
Many doldrums happen when policy ignores or contradicts these basic ideas. Alas, economics 101 is not popular with the electorate almost anywhere.
Many of the policies you mentioned sound great in a world of spherical cows but break down in the real world.
For example, you say a basic policy is a tax on guns instead of a ban. First of all, I dispute that this is even orthodox economics. Second, there is some strong evidence that gun bans reduce violence.
Free migration is another one. It is an insanely complicated issue in the real world. No country has 100% free migration, or it wouldn't be a country. There are all kinds of very complex rules, and effects of those rules. And it is not clear that "free migration" is "good". (I am sure the Native Americans probably didn't like free migration.)
First, I apologize for using guns as an example. That's a needlessly divisive topic. The general principle of 'taxes instead of bans' is rather orthodox. You see that more often applied to the example of drugs or emissions.
Second, what evidence do you have for gun bans reducing violence? And reduce violence compared to what baseline?
I am very willing to believe that if you compare a free-for-all with a ban on guns, that the latter will see less violence. (I haven't looked into the evidence. Results might differ depending on details and on when and where you do that, and who gets exceptions to the bans. Eg police and military presumably are still allowed guns? Hunters probably as well? Etc. It's not so important.)
My point is that in terms of violent crime avoided, a situation where each gun and each bullet comes with a million dollar tax would be statistically indistinguishable from a ban.
And in practice, a less severe tax would probably be enough to achieve those goals whilst still preserving access to guns for those who prefer it that way.
What kind of definition of 'country' are you using here that breaks down in this way? (And what do you mean by '100%'? How nitpicky do you want to be?)
A history lesson from Wikipedia https://en.wikipedia.org/wiki/Passport
Btw, Switzerland as a country does not restrict immigration. That's left to the Kantone (cantons, roughly the equivalent of American states). Yet you'd be hard pressed to argue that Switzerland is not a country. If memory serves, the US used to have similar arrangements in its past?
It's interesting that if you oblige your models to fit a set of policy positions then they return that set of policy positions and are pretty useless in general. A cynic might say that's by design.
Orthodox macroeconomic modelling is laughably naive and mathematically wrong before even getting to the basic issues of failure to validate. Let's not compare it to disciplines where validation is the entire point.
Your rhetoric clearly shows you don't want to think too critically about this so I'll sign off now.
You aren't going to get hired by the Chicago crowd if you start espousing Keynesian ideas, let alone Marxist ones. You aren't getting hired by Exxon if you start talking about the negative externalities of climate change.
That might or might not be true, but it's not what 'skin in the game' means.
This is a pretty sweeping generalization, but if you have concrete examples to offer that support your claim, I’d be curious.
There are ways of scoring forecasts that reward both accuracy and calibration, under which it's provably optimal to report your (un)certainty as accurately as you can.
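For instance, a minimal sketch with an invented true probability: under the Brier (squared-error) score, your expected score is lowest exactly when you report the true probability, so honest uncertainty is the optimal strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.7                                   # hypothetical true event probability
outcomes = rng.binomial(1, p_true, size=200_000)

for q in (0.5, 0.6, 0.7, 0.8, 0.9):            # candidate reported probabilities
    brier = np.mean((q - outcomes) ** 2)       # mean squared error of the report
    print(f"report q={q:.1f}  ->  mean Brier score {brier:.4f}")
# The lowest (best) score lands at q = 0.7, the true probability.
```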
Yes, of course. I don't see that as very related to my point. For example, consider how 538 or The Economist predict elections. They might claim they'll use squared error or log score, but when it comes down to a big mistake, they'll blame it on factors outside their models.
Well, but at least 538 has a reputation to defend as an accurate forecaster. So they have some skin in the game.
(Of course, that's not as good as betting money.)
They can also mean uncertainty pushed forward from the input parameters, which isn't exactly the same thing as model error.
I'm not sure I see the distinction. Would you mind clarifying?
Model: water freezes below 0° C.
Input: temperature is measured at -1° C.
Prediction: water will freeze.
Actual: water didn't freeze.
Actual temperature: 2° C.
The model isn't broken, it gives an incorrect result because of input error.
Well I'd say the model is broken because it didn't capture the uncertainty in the measurements.
Taking the example in this comment thread: even if the model takes an arbitrary nonparametric distribution of input temperatures and perfectly returns the posterior distribution of freezing events, there is still a difference between model error and forward-UQ error.
The model itself can perfectly describe the physics, but it only knows what you can give it. This may be limited by measurement uncertainty of your equipment, etc, but it is separate from the model itself.
In this area, "the model" is typically considered as the input parameter to quantity of interest map itself. It's not the full problem from gathering data to prediction.
Model error would be things like failing to capture the physics (due to approximations, compute limits, etc), intrinsic aleatoric uncertainty in the freezing process itself, etc.
Making this distinction helps talk about where the uncertainty comes from, how it can be mitigated, and how to use higher level models and resampling to understand its impact across the full problem.
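A minimal sketch of that distinction, using the freezing example with made-up numbers: the "model" is just freeze = (temperature < 0 C), and the only uncertainty pushed through it is the measurement error on the input.

```python
import numpy as np

rng = np.random.default_rng(0)
measured_temp = -1.0        # deg C, what the thermometer reads
sigma_measure = 1.5         # assumed measurement standard deviation (made up)

# Forward UQ: sample plausible true temperatures given the measurement and push
# each one through the (assumed-perfect) deterministic model.
temps = rng.normal(measured_temp, sigma_measure, size=100_000)
p_freeze = np.mean(temps < 0.0)
print(f"P(freeze | measurement) = {p_freeze:.2f}")   # about 0.75 rather than 1.0

# Model error is a different animal: e.g. the water is salty and actually freezes
# at -2 C. No amount of input sampling will surface that.
```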
That's not what a confidence interval is. A 95% confidence interval is a random interval that covers the true parameter value 95% of the time (assuming the model is correctly specified).
Ok, the 'reverse' of a confidence interval then -- I haven't seen a term for the object I described other than misuse of CI in the way I did. ("Double quantile"?)
You're probably thinking of a predictive interval
It is a very common misconception and one of my technical crusades. I keep fighting, but I think I have lost. Not knowing what the "uncertainty interval" represents (is it, loosely speaking, an expectation about a mean/true value or about the distribution of unobserved values?) could be even more dangerous, in theory, than using no uncertainty interval at all.
I say in theory because, in my experience in the tech industry, with the usual exceptions, uncertainty intervals, for example on a graph, are interpreted by those making decisions as aesthetic components of the graph ("the gray bands look good here") and not as anything even marginally related to a prediction.
Agreed! I also think it's extremely important as practitioners to know what we're even trying to estimate. Expected value (i.e. least-squares regression) is the usual first thing to reach for, but does that even matter? We're probably actually interested in something like an upper quantile for planning purposes. And then there's the model component: the interval being estimated is model-driven, and if the model is wrong, the interval is meaningless. There's a lot of space for super interesting and impactful work in this area IMO, once you (the practitioner) think more critically about the objective. And don't even get me started on interventions and causal inference...
True. But a conditional quantile is much harder to accurately estimate from data than a conditional expectation (particularly if you are talking about extreme quantiles).
Oh absolutely, so it's all the more important to be precise in what we're estimating and for what purpose, and to be honest about our ability to estimate it with appropriate uncertainty quantification (such as by using conformal prediction methods/bootstrapping).
From a statistical point of view, I agree that there is a lot of interesting and impactful work to be done on estimating predictive intervals, more in ML than in traditional statistical modeling.
I have more doubts about the actions people take even when given properly estimated predictive intervals. Even I, with a good knowledge of statistical modeling, do not stop to think, after hearing "the median survival time for this disease is 5 years," that the median is calculated/estimated from an empirical distribution, so some people presumably die after 2 years and others after 8. Well, that depends on the variance.
But if I am so strongly drawn to a central estimate, is there any chance for others not so used to thinking about distributions?
If you don't mind typing it out, what do you mean formally here?
I think they mean either E[x|y] (the standard regression point estimate) along with a confidence interval (this assumes the mean is a meaningful quantity), or the interval such that F(x|y), the conditional CDF of x, is between 0.025 and 0.975 (the central 95% predictive interval). The point is that the width of the confidence interval around the point estimate of the mean converges to 0 as you add more data, because you have more information with which to estimate that point, while the predictive interval does not: it converges to the interval determined by the aleatoric uncertainty of the data-generating distribution of x conditioned on the measured covariates y.
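A toy numerical version of that point (normal data, no covariates, made-up numbers): the confidence interval for the mean shrinks like 1/sqrt(n), while the 95% predictive interval for a new observation settles at roughly plus or minus 1.96 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
for n in (10, 100, 10_000):
    x = rng.normal(5.0, sigma, size=n)
    m, s = x.mean(), x.std(ddof=1)
    ci_half = 1.96 * s / np.sqrt(n)    # half-width of the CI for the mean
    pi_half = 1.96 * s                 # half-width of the 95% predictive interval
    print(f"n={n:>6}  CI for mean +/-{ci_half:.3f}   predictive interval +/-{pi_half:.3f}")
# The CI half-width heads to 0; the predictive half-width converges to 1.96*sigma = 3.92.
```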
That's exactly what I was talking about. The nature of the uncertainty intervals is made even more nebulous when not using formal notation, something I was guilty of doing in my comment--even if I used the word "loosely" for that purpose.
If you think about linear regression, it makes sense, given its assumptions, that the confidence interval for E[x|y] is narrower near the means of x and y.
If I had to choose between the two: in a forecasting context, confidence intervals are less useful for decision-making, while prediction intervals are, in my opinion, always needed.
Ah, that makes sense. The word expectation was really throwing me off, along with the fact that, in the kind of forecasting setting of this post, the mean and confidence interval (used in the correct sense) are not meaningful, while the quantile or 'predictive interval' are meaningful.
And, from what I understand, this is what is happening in this article.
The person is providing an uncertainty interval for their mean estimator and not for future observations (i.e., the error bars reflect the uncertainty of the mean estimator, not the uncertainty over observations).
Like you said: before adding error bars, it probably makes sense to think a bit about what type of uncertainty those error bars are supposed to represent.
Thanks, this finally clarifies for me what the article was actually doing!
And it's very different from what I expected, and it doesn't make a lot of sense to me. I guess if statisticians already believe your model, then they want to see the error bars on the model. But I would expect if someone gives me a forecast with "error bars", those would relate to how accurate they think the forecast would be.
Yes, that term captures what I'm talking about.
"Credible interval":
https://en.wikipedia.org/wiki/Credible_interval
No, predictive interval is more precise, since we are dealing with predicting an observation rather than forming a belief about a parameter.
What's a predictive interval?
I don't normally use that term, but someone else in reply to me did, and it captures what I wanted to say:
https://en.wikipedia.org/wiki/Prediction_interval
A position espoused by Bill Phillips [1], and to which I now adhere:
"You should be willing to take either side of the bet that confidence interval implies." (paraphrasing; he says it better).
For a concrete example, with a 95% confidence interval, you should be as willing to accept the 19:1 odds that the true value is outside the interval as you are the 1:19 odds that the true value is inside the interval.
Aside from being generally correct, this approach is immediately actionable by making the meaning more visceral in discussions of uncertainty. Done right, it pushes you to assign uncertainties that are neither too conservative nor too optimistic.
If the notion of letting your reader take either side of the bet makes your stomach a little queasy, you're on the right track. The feeling will subside when you're pretty sure you got the error bar right and your reasoning is documented and defensible.
Edit for OP's explicit question: one-standard-deviation error bars are 68% confidence intervals. Two standard deviations are 95% confidence intervals. (Assuming you're a frequentist, of course.)
[1] https://www.nobelprize.org/prizes/physics/1997/phillips/fact...
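A minimal sketch of the bet framing, assuming only that you state a coverage probability for your interval (no distributional assumption needed): the fair payout odds follow directly from the coverage.

```python
def fair_odds(coverage):
    """Payout odds at which betting 'outside the interval' breaks even."""
    return coverage / (1.0 - coverage)

for coverage in (0.68, 0.95, 0.999):
    print(f"{coverage:.1%} interval -> {fair_odds(coverage):.0f}:1 against the value "
          f"falling outside")
# If you'd happily take either side of roughly these odds, your interval is honest;
# if one side looks like easy money, your stated coverage is off.
```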
Also assuming normal distribution, I think?
I would like to build some edge into my bets. If a reader took both sides of your example, they would come out exactly even.
But since readers are not forced to take any side at all, they will only take the bet if one of the sides has an advantage.
So I would like to be able to say, 20:1 payout the true value is inside the error bar, and 1:10 payout it's outside the error bar (or something like that).
The tighter the spread I am willing to quote, the more confident I am that I got the error estimates right. (I'm not sure how you translate these spreads back into the language of statistics.)
95% is 95% regardless of the distribution.
You can imagine yourself being equally unhappy to take either side of the bet, if that's easier than imagining yourself being happy to take either side.
It is for me, which is probably something to bring up in therapy.
I also think that framing things as bets brings in all the cultural baggage around gambling and so it isn't always helpful. I'm not sure what a better framing is though.
Standard deviations away from the mean don't correspond to the same percentiles for all distributions, or do they?
If you want to be (almost) independent of distribution, you need Chebyshev's inequality. But that one is far weaker.
https://en.wikipedia.org/wiki/Chebyshev%27s_inequality
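A quick numerical comparison, assuming only finite variance for the Chebyshev column and normality for the other: the distribution-free bound is indeed far weaker.

```python
from scipy.stats import norm

for k in (1, 2, 3):
    chebyshev = min(1.0, 1.0 / k**2)        # bound on P(|X - mu| >= k*sigma), any finite-variance X
    normal_tail = 2 * (1 - norm.cdf(k))     # the same probability if X is normal
    print(f"k={k}:  Chebyshev bound <= {chebyshev:.3f}   normal = {normal_tail:.4f}")
# k=2: at most 25% outside vs about 4.6% for a normal; k=3: at most 11.1% vs about 0.27%.
```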
Underwriting insurance without going bankrupt, perhaps?
If they had said a two standard deviation interval then you would have needed to know the distribution, but they said 95% which gives you all the information you need to make the bet.
I was replying to this addendum, which only really works for the normal distribution: "one-standard-deviation error bars are 68% confidence intervals. Two standard deviations are 95% confidence intervals."
You are right about the first part of the original comment.
Expected value does not equal utility. I am not willing to mortgage my 1 million dollar house for a 1 in a 1000 shot at a billion.
Then one should be very careful when assigning 99.9% confidence intervals.
The two are unrelated.
What you probably want is the standard error, because you are not interested in how much your data points differ from each other but in how far your estimate is likely to be from the true population value.
I don't see how standard error applies here. You are only going to get one data point, e.g. "violent crime rate in 2023". What I mean is a prediction, not only of what you think the number is, but also of how wrong you think your prediction will be.
Standard error is exactly what the statsmodels ARIMA.PredictionResults object actually gives you, and the confidence interval in this chart is constructed from a formula that uses that standard error.
ARIMA is based on a few assumptions. One, there exists some "true" mean value for the parameter you're trying to estimate, in this case violent crime rate. Two, the value you measure in any given period will be this true mean plus some random error term. Three, the value you measure in successive periods will regress back toward the mean. The "true mean" and error terms are both random variables, not a single value but a distribution of values, and when you add them up to get the predicted measurement for future periods, that is also a random variable with a distribution of values, and it has a standard error and confidence intervals and these are exactly what the article is saying should be included in any graphical report of the model output.
This is a characteristic of the model. What you're asking for, "how wrong do you think the model is," is a reasonable thing to ask for, but different and much harder to quantify.
Thanks for explaining how it works - I don't use R (I assume this is R). This does not seem like a good way to produce "error bars" around a forecast like the one in this case study. It seems more like a note about how much volatility there has been in the past.
Just to clarify... this is Python code, not R.
Thanks.
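Since it is Python: to make the above concrete, here's a hedged sketch of what that statsmodels prediction object gives you (synthetic data standing in for the crime series; the ARIMA order is an arbitrary illustration, not the article's model).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = 400 + np.cumsum(rng.normal(0, 5, size=60))   # fake annual rates, for illustration only

res = ARIMA(y, order=(1, 1, 0)).fit()
fc = res.get_forecast(steps=5)                   # forecast the next 5 "years"

print(fc.predicted_mean)                         # point forecasts
print(fc.se_mean)                                # the standard errors discussed above
print(fc.conf_int(alpha=0.05))                   # 95% intervals built from those standard errors
# Note these intervals are conditional on the ARIMA assumptions holding; they say
# nothing about how wrong the model itself might be.
```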
Another important point of discussion:
This definitely seems to me to be what the original author is motivating: forecasts should have "error bars" in the sense that they should depict how wrong they might be. In other words, when the author writes:
The second sentence does not sound like a good solution to the problem in the first sentence.
Recently someone on Hacker News described statistics as trying to measure how surprised you should be when you are wrong. Big fat error bars would give you the idea that you should expect to be wrong. Skinny ones would suggest it might be somewhat upsetting to find out you are wrong. I don't think this is an exhaustive description of statistics, but I do find it useful when thinking about forecasts.
This can be true, but it depends on the other sources of error being small enough. Standard error is just a formula, and it varies inversely with the square root of the sample size, so you can trivially narrow a confidence interval by sampling more often. In this specific case, imagine you had daily measures of violent crime instead of only annual ones. You'd get much tighter error bars.
Does that mean you should be more surprised if your predictions are wrong? It depends. You've only shrunk the model's reported uncertainty, and this is the classic precision-versus-accuracy problem: you can very precisely estimate the wrong number. Does the model really reflect the underlying data-generating process? Are the inputs you're giving it reliable measurements? If both answers are yes, then your more precise model should be converging toward a better prediction of the true value; if not, you're only getting a better prediction of the wrong thing.
We can ask these questions with this very example. Clearly, ARIMA is not a causally realistic model. Criminals don't look at last year's crime rates and decide whether to commit a crime based on that. The assumption is that, whatever actually does cause crime, it tends to happen at fairly similar levels year to year; that is, 2020 should differ more from 2010 than it does from 2019. We may not know what the causally relevant factors really are, or we may not be able to measure them, but we at least assume they follow that kind of rule. This sounds plausible to me, but is it true? We can backtest by making predictions of past years and seeing how close they are to the measured values, but whether this even works depends on the answer to the second question.
So then the second question. Is the national violent crime data actually reliable? I don't know the answer, but it certainly isn't perfect. There is a real rate for every crime, but it isn't exactly the reported number. Recording and reporting standards vary from jurisdiction to jurisdiction. Many categories of crime go underreported, and the extent of that can change over time. Changes may reflect different policing emphasis as much as or more than changes in the underlying true rate. I believe the way the FBI collects and categorizes the data has itself changed in the past, so I'm not sure a measurement from 1960 can be meaningfully compared to one from 2020.
Ultimately, "how surprised you should be when you are wrong" needs to take all of these sources of error into account, not just the model's coefficient uncertainty.
You can arbitrarily scale error bars based on real world feedback, but the underlying purpose of a model is rarely served by such tweaking. Often the point of error bars is less “How surprised you should be when you are wrong” than it is “how wrong you should be before you’re surprised.”
When trying to detect cheating in online games, you don't need to predict exact performance, but you do want to detect anomalies quickly. Detecting serial killers, gang wars, etc. isn't about nailing the number of murders on a given day but about the patterns within those cases.
Is this the difference between Bayesian and Frequentist approaches?
I disagree.
You only really need to take those sources of error into account if you want an absolute measure of error, which as you explain, seems pretty impossible.
An error for weather only needs to be relative -- for example, if the error for rain today is higher than yesterday's, it doesn't matter exactly how much higher it is, only that it's higher. (Not that I know whether this is possible.)
It's like how you can't describe how biased a certain news source is or how to read a Yelp or Rotten Tomatoes review -- you just have to read it often enough to get an intuitive sense that a 4.1 star Yelp-rated restaurant with 800 reviews is probably good while a 4.6 star restaurant with 5 reviews is quite possibly terrible.
Error bars (whether confidence intervals or standard deviations) are of little use because they do not tell you how the probability is distributed within the band. The holy grail of forecasting is the probabilistic forecast, which predicts the entire posterior distribution so that you can sample from it, generating scenarios or realizations of the underlying random process.
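A minimal sketch of what that looks like in practice, assuming a toy AR(1) process with invented parameters (nothing here is estimated from data): simulate many future paths and keep the whole sample as the forecast.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, phi, sigma = 400.0, 0.8, 5.0     # invented long-run mean, persistence, shock sd
last_value = 410.0                   # invented last observation
horizon, n_paths = 10, 5_000

paths = np.empty((n_paths, horizon))
x = np.full(n_paths, last_value)
for h in range(horizon):
    x = mu + phi * (x - mu) + rng.normal(0, sigma, size=n_paths)  # one step, all scenarios
    paths[:, h] = x

# Any summary falls out of the sample: central bands, tail risks, full histograms.
print(np.percentile(paths[:, -1], [5, 50, 95]))   # 90% band at the final horizon
print(np.mean(paths[:, -1] < 390))                # P(value below 390) at that horizon
```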
While I agree, we can always have multiple overlapping error bars to understand how the probability is distributed. I am not sure how a probabilistic forecast method is able to perform this better because the confidence interval is always generated through sampling in either situation.
Though probabilistic forecasting methods may have a Bayesian approach, it is Monte Carlo sampling that helps generate the confidence intervals.
Feel free to correct me if I am wrong! Thanks :)
As far as error bars are concerned, you could report some-percent credible intervals, calculated by taking the corresponding percentiles of your results. It's somewhat Bayesian thinking, but it will work better than confidence intervals.
The intuition would be that that percentage of your forecast draws falls between the bounds of the credible interval.