Most people don't appreciate linear regression. 1) All common statistical tests are linear models: https://lindeloev.github.io/tests-as-linear/ 2) Linear models are linear in the parameters, not the response! E.g. y = a*sin(x) + b*x^2 is a linear model. 3) By choosing an appropriate spline basis, many non-linear relationships between the predictors and the response can be modelled by linear models. 4) And if that flexibility isn't enough, by virtue of Taylor's theorem, linear relations are often a good approximation of non-linear ones.
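To make point 2 concrete, here is a minimal sketch (simulated data, made-up coefficients) showing that lm() fits y = a*sin(x) + b*x^2 directly, because the model is linear in a and b even though it is nonlinear in x:

    set.seed(1)
    x <- seq(0, 10, length.out = 200)
    y <- 2 * sin(x) + 0.3 * x^2 + rnorm(200, sd = 0.5)  # true a = 2, b = 0.3
    fit <- lm(y ~ sin(x) + I(x^2))   # design matrix columns: 1, sin(x), x^2
    coef(fit)                        # recovers roughly (0, 2, 0.3)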
I'd love to see linear regression taught by, say, a quant researcher at Citadel. How do these guys use it? What do they particularly care about? Any theoretical results that meaningfully change the way they view problems? And so on.

I have some experience. Variants of regularization are a must. There are just too few samples and too much noise per sample.
In a related problem, covariance matrix estimation, variants of shrinkage are popular. The most straightforward one is linear shrinkage (Ledoit, Wolf).
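For flavour, a minimal sketch of the linear shrinkage idea, with a hand-picked intensity lambda rather than the Ledoit-Wolf optimal estimate:

    # R is a T x N matrix of returns; shrink the sample covariance
    # toward a scaled identity target
    shrink_cov <- function(R, lambda = 0.2) {
      S  <- cov(R)            # sample covariance, noisy when T is small
      mu <- mean(diag(S))     # average variance sets the target's scale
      (1 - lambda) * S + lambda * mu * diag(ncol(R))
    }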
Excepting neural nets, I think most people doing regression simply use linear regression with touches like the above, chosen based on the domain.
Particularly in finance you fool yourself too much with more complex models.
There are just too few samples and too much noise per sample.
Call it 2000 liquid products on the US exchanges, and many years of data. Even if you aggregate it down from per-tick to 1-minute bars, it doesn't feel like you're struggling for a large in-sample period?
It sounds like you are assuming the joint distribution of returns in the future is equal to that of the past, and assuming away potential time dependence.
These may be valid assumptions, but even if they are, "sample size" is always relative to between-sample unit variance, and that variance can be quite large for financial data. In some cases even infinite!
Regarding relativity of sample size, see e.g. this upcoming article: https://two-wrongs.com/sample-unit-engineering
If the distribution changes enough, multiple years of data may as well be no data.
They may have been referring to (for example) reported financial results or news events which are more infrequent/rare but may have outsized impact on market prices.
Yes, these are good points and probably the most important ones as far as the maths is concerned, though I would say regularisation methods are really standard things one learns in any ML / stats course. Ledoit-Wolf shrinkage is indeed more exotic and very useful.
Linear regression - and with a single predictor at that - is the workhorse. As if the cross-product x'*y alone were too little, dividing it by the dot product x'*x is just right (the regression slope), and dividing it again by y'*y (with a square root, giving the correlation) would be overdoing it. :-)
There is no big mystery I'm afraid, no big reveal. It's as Jim Simons described in the Numberphile interview: a slow, painstaking accumulation of weak signals, plus crafting and improving the various boxes of the system (the interfaces between them are largely known). The fitting method used does not buy that much in the grand scheme of things - as long as it does not ruin things, that is.
(I've not been at Citadel, but I've been in quant R&D and trading for the last 20 years.)
When I was at CMU a decade ago I took 36-401 and 36-402 (then taught by Shalizi) and they were both very good statistical classes and they forced me to learn base R, for better or for worse.
A big weakness of linear regression that I had to learn the hard way is that the academic assumptions for valid interpretation of the coefficients are easy to construct for small educational datasets but rarely applicable to messy real world data.
The flip side is with messy real world data you just need a model that's ok enough, rather than being concerned whether the p-value is this or that.
At that point, if you don't care about interpretable coefficients, you might as well use gradient-boosted trees or a full neural network instead.
It depends on the "severity" of the violation of assumptions--you can also use GAMs to add flexible nonlinear relationships--and the amount of data you are working with. Statistical modeling is a nuanced job.
I tried to argue that while at CMU and it didn't go well.
They may not know at CMU that the vast majority of applied, trained-on-data statistical models that help run the modern world seriously violate one or more of the model's assumptions.
It depends. The most important assumption is independence of the observations. If that is not given, you have to either account for correlated responses using a mixed-effects model or mean-aggregate those responses (computing the mean decreases the variance but also reduces the number of data points and those two cancel each other out in calculating the t-statistic of the Wald test).
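As a sketch of those two routes (hypothetical data frame dat with columns response, treatment, and subject; lme4 is one common choice, not necessarily the only one):

    library(lme4)
    # route 1: model the correlation with a random intercept per subject
    m1 <- lmer(response ~ treatment + (1 | subject), data = dat)
    # route 2: mean-aggregate to one row per subject, then use plain lm()
    agg <- aggregate(response ~ subject + treatment, data = dat, FUN = mean)
    m2  <- lm(response ~ treatment, data = agg)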
With regard to other assumptions, e.g. normality of the residuals, linear models can often deal with some degree of violation against those. But I agree that it's always good to understand the influence of those violations, e.g. by using simulations and making p-value histograms of null-data.
The most important skill in regression is to RECOGNIZE the intercept. It sounds trivial, and it is, until you start including interactions between terms. The number of times I've seen a young graduate student screw this up...
Take a simple linear model involving a test score, participants' age in years (range 7-16), and a binary categorical variable for autism diagnosis (0 = control, 1 = autism): score = age + diagnosis + age:diagnosis, i.e. score = (X1)*age + (X2)*diagnosis + (X3)*age:diagnosis, plus an intercept.
If X2 is significant, the naive student would say, "look, a group difference!!", not realizing this is the predicted group difference at the intercept, which is when participants were 0 years old. [[ You center age by the mean, or median, or better yet, the age you are most interested in. Once interactions are in the equation, all "lower order" parameter estimates are in reference to the intercept. ]]
They might also note a significant effect of age, and then assume it applies to both groups, but the parameter X1 only tells you what the predicted slope is for the reference group (controls), while the interaction tests whether the age slopes differ between groups. Moreover, even if the interaction isn't significant, the age effect in the autism group might not significantly differ from zero. The data is in the wishy-washy zone, and you have to be careful in how you interpret it.
To some here all this will seem obvious, but for many, getting their head firmly into the conditional space of parameters when there are interaction terms takes work. (Note: for now I am ignoring other ways of coding groups (grand mean vs. one group being the reference), but the lesson still remains: understand what the intercept means and to whom/what it refers.)
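A minimal sketch of the recentering fix in R (assuming a data frame dat with columns score, age, and diagnosis; the reference age of 10 is arbitrary):

    fit_raw      <- lm(score ~ age * diagnosis, data = dat)       # diagnosis coefficient = group difference at age 0
    dat$age_c    <- dat$age - 10                                   # center age at an age of interest
    fit_centered <- lm(score ~ age_c * diagnosis, data = dat)      # diagnosis coefficient = group difference at age 10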
I think this is accurate.
A significant loading on diagnosis (X2) does not tell you anything about the effect of diagnosis at any particular age (except age 0).
You’d have to recenter the model about the age of interest.
I always struggle to get a good intuition for models with interaction terms. I usually try to write down, for every class of responses, which terms of the model go into it, and often that helps with interpretation. There's also the ExploreModelMatrix package [1] that helps with that task.
[1] https://www.bioconductor.org/packages/release/bioc/html/Expl...
If I said something stupid above, please let me know. I'm always learning. If you are a strong Bayesian who doesn't like p-values, that is also fine. I get it. I just wanted to provide my observations about a great number of bright students I've worked with who have nevertheless struggled to fluidly interpret models with interaction terms, and point them in the right direction.
It looks like this article does not mention it, but linear regression will also exhibit the double descent phenomenon commonly seen in deep learning. You would need to introduce some regularization in order to see this. It would be nice to add that discussion.
Are there some papers in particular that you're referring to? Does the second descent happen after the model becomes overparameterized, like with neural nets? What kind of regularization?
Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle (submitted 24 Mar 2023). Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo.
https://arxiv.org/abs/2303.14151
Double descent is a surprising phenomenon in machine learning, in which as the number of model parameters grows relative to the number of data, test error drops as models grow ever larger into the highly overparameterized (data undersampled) regime. This drop in test error flies against classical learning theory on overfitting and has arguably underpinned the success of large models in machine learning. This non-monotonic behavior of test loss depends on the number of data, the dimensionality of the data and the number of model parameters. Here, we briefly describe double descent, then provide an explanation of why double descent occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent with ordinary linear regression and identify three interpretable factors that, when simultaneously all present, together create double descent. We demonstrate that double descent occurs on real data when using ordinary linear regression, then demonstrate that double descent does not occur when any of the three factors are ablated. We use this understanding to shed light on recent observations in nonlinear models concerning superposition and double descent. Code is publicly available
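A minimal simulation sketch of the linear-regression version of this (my own toy example, not code from the paper): minimum-norm least squares on a growing set of features, where test error peaks near the interpolation threshold p = n and then drops again:

    set.seed(1)
    n <- 40; n_test <- 1000; d <- 200
    beta <- rnorm(d) / sqrt(d)
    X  <- matrix(rnorm(n * d), n, d);           y  <- X %*% beta + rnorm(n, sd = 0.5)
    Xt <- matrix(rnorm(n_test * d), n_test, d); yt <- Xt %*% beta + rnorm(n_test, sd = 0.5)
    ps <- seq(5, d, by = 5)
    test_mse <- sapply(ps, function(p) {
      bhat <- MASS::ginv(X[, 1:p]) %*% y        # minimum-norm least-squares fit
      mean((Xt[, 1:p] %*% bhat - yt)^2)
    })
    plot(ps, test_mse, type = "b", log = "y",
         xlab = "number of features", ylab = "test MSE")  # error peaks near p = n = 40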
This looks very interesting, do you know a way to transform this PDF to a mobile-optimized form?
Shalizi's "Advanced Data Analysis from an Elementary Point of View" is also a good introductory textbook:
https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/
It is rightly weighted heavily toward linear and additive models and simulation. 90% of the book is useless without a computer, but that is a modern truth.
I love that Ridge Regression is introduced in the context of multicollinearity. It seems almost everyone these days learns about it as a regularization technique to prevent overfitting, but one of its fundamental use cases (and indeed its origin, I believe) is balancing weights among highly correlated (or nearly linearly dependent) predictors, which can cause huge problems even if you have plenty of data.
pfft, couldn't build an LLM with it
We had to revisit linear regression multiple times in different courses during my undergrad. It's fascinating that optimality is provable using statistics and probability theory, provided the assumptions hold, of course.
For my CS PhD I looked mostly at regression problems using deep learning models. I didn't look at this specifically, but I still think it would be neat if there were some way to translate the rigorous proofs and theorems for classical linear models to deep regression models.
Thanks for sharing. As someone teaching regression (with XGBoost) this month, this is a good read. Very well written and approachable, unlike many academic texts.
I particularly like chapter 6, visual diagnosis. Very well done.
These are all fantastic points, and I strongly agree that most people don't appreciate linear models nearly enough.
Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.
That is, when we are in a meeting making decisions for the direction of the company, we can only say things like "we need to increase ad spend, while reducing the other costs of acquisition such as discount vouchers". If you want to find the balance between "increasing ad spend" and "decreasing other costs", that's a simple linear model.
Even if you have a great non-linear model, it's not even a matter of "interpretability" so much as "actionability". You can bring the results of a regression analysis to a meeting and very quickly model different strategies with reasonable directional confidence.
I struggled communicating actionable insights upward until I started to really understand regression analysis. After that it became amazingly simple to quickly crack open and understand fairly complex business processes.
There are absolutely decisions that need to get made, and do get made, that are not linear. Step functions are a great example. "We need to decide if we are going to accept this acquisition offer" is an example of a decision with step function utility. You can try to "linearize" it and then apply a threshold -- "let's agree on a model for the value at which we would accept an acquisition offer" -- but in many ways that obscures that the utility function can be arbitrarily non-linear.
A single decision could still be easily modeled by a 0/1 variable (as an input) and a real variable (as an output, like revenue for example).
That 0/1 input variable could also have arbitrary interactions with other variables, which would also amount to "step function" input effects.
See for example the autism/age setup down thread.
Discrete linear optimisation is infinitely more complicated than continuous linear optimisation. The former is NP complete, the latter is in P.
Which seems almost ironic, because continuous linear optimization almost certainly doesn't exist really because real numbers can only be approximated, and so we're always doing discrete linear optimization at some level.
Who cares about real numbers in this context?
If all the numbers that appear in your constraints are rational (p/q with finite p and q), then any solution is also a rational number (with finite numerator and finite denominator).
(Well, any finite solution. Your solution could also be unbounded, then you might have infinities in there.)
A computer can represent finite rational numbers just fine. See eg https://docs.python.org/3/library/fractions.html or https://hackage.haskell.org/package/base-4.20.0.1/docs/Data-... for some libraries.
In most cases people just use floating-point numbers in practice, but that's of no philosophical concern.
No argument with that fact.
But the parent comment is not talking about constrained optimization, just gradient following.
In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”
The question is not, “if I can only set M of these N variables to 1, which should I choose?”
That’s a good question, and it leads to problems in NP, but that’s not what the comment was referring to.
Yes, you are right in that abstract setting.
If you always have the full hypercube available, the problem is as easy as you describe. But if there are constraints between the variables, it gets hairier.
If you add a multilevel structure to shrink your (generalized) linear predictors, this framework becomes incredibly powerful.
There are entire statistics textbooks devoted to multilevel linear models; you can get really far with these.
Shrinking through information sharing is really important to avoid overly optimistic predictions in the case of little data.
Especially when you use the mixed model (aka MLM) framework to automatically select the smoothing penalty for your splines. So in one simple and very intuitive framework, you can estimate linear and nonlinear effects, account for repeated measurements and nested data, and model binary, count, or continuous outcomes (and more), all fitting the model in one shot, yielding statistically valid confidence intervals and p-values.
R's mgcv package (which does all of the above) is probably the single reason I'm still using R as my primary stats language.
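As a minimal mgcv sketch of what that looks like (variable names are made up; subject would need to be a factor):

    library(mgcv)
    # smooth effect of age, random intercept per subject (bs = "re"),
    # binary outcome, smoothing penalties chosen by REML (the mixed-model trick)
    m <- gam(outcome ~ s(age) + s(subject, bs = "re"),
             family = binomial, method = "REML", data = dat)
    summary(m)  # approximate p-values and CIs for the smooth terms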
Is there a Python equivalent?
statsmodels is the closest thing in python to R. statsmodels has mixed model support, but mgcv apparently requires more. It is well above my paygrade, but this seems relevant: https://github.com/statsmodels/statsmodels/issues/8029 (i.e. no out of the box support, you might be able to build an approximation on your own).
If you like shrinkage (I do), I highly recommend the work of Matthew Stephens, e.g. ashr [1] and vash [2] for shrinkage based on an empirically derived prior.
[1] https://cran.r-project.org/web/packages/ashr/index.html [2] https://github.com/mengyin/vashr
Yes, the article linked to ashr is quite famous.
I have a degree in statistics yet I've never thought about the relationship between linear models and business decisions in this way. You're absolutely right. This is the best comment I've read all month.
I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model?
I'm also curious about what a non-actionable non-linear suggestion would look like.
How I understand the comment: a non-linear suggestion is that the budget for X should be 300k. The (supposedly linear) alternative is that the budget for X should increase.
What I think is the important part, is that it is better to ask decision makers for decisions on setting a continuous parameter, than to make binary yes/no or go/no-go decisions. When it's a decision by committee, I can see why that is.
I’m neither of the previous posters, so I may be off…
For simplicity, I’m going to assume each variable in the model is independent of every other variable.
We can interpret the coefficients in linear models directly. The relationship holds over the range of values the model was fit on, and it is the same across that whole range. (We can't extrapolate outside of what's been modeled.)
y = c1*x1 + c2*x2 + … + cn*xn
The sign tells you the direction (+ means it will increase the value of y, - means it will decrease the value of y), and the value of the coefficient tells you how much y will change for a 1-unit change in that x value.
Since this is linear, you get the same change in the output for the same change in an input, no matter your starting point.
So, say the regression model shows that x1, x3, and x5 have positive coefficients and x2, x4 have negative coefficients. If you want y to increase, either start doing more of x1, x3, x5 or do less of x2, x4. Depending on what these are and your limited investment budget, for example, you may pick doing x3 if that has the largest positive coefficient.
Again, since this is linear, you can keep on putting resources into the largest coefficient and get the same increase up until your model is no longer valid.
For non-linear models, you can still interpret the coefficients, but the interpretation depends on your starting conditions and where you are on the graph.
There may be asymptotes in your non-linear model, so there is a point of diminishing returns where if you keep putting resources into a variable with a positive coefficient, this will not keep getting you commensurate results.
Sorry I don’t have any actual examples here and I don’t have time to go digging through my old textbooks to look for any.
No, that's not true. Human groups are very able to make discrete decisions. Actually, often they tend to go for discrete decisions, when something continuous (and perhaps linear) would be a lot better.
(Just to be clear: if you force your linear models to make discrete predictions, they are no longer linear in any sense of the word. That's why linear optimisation is a problem that can be solved in polynomial time, and integer linear optimisation is NP complete.
Even convex optimisation, which is no longer linear but still continuous, can be solved in roughly polynomial time.)
Often people demand more decisive decisions, of 'yes'/'no' or concrete action, not shades of grey and fiddling at the margins.
Getting people to even appreciate linear models is already a step forward. Like it or not, your business strategy meetings are already a step ahead of what most people would naturally be inclined to.
And working with these people is so painful.
I often make fun of McKinsey-style four quadrants when overused, but they really boil down to something that makes a lot of sense in communicating a problem space:
a) carefully choose the two most important dimensions of concern (as Alan Kay said: the correct point of view is worth 80 IQ points)
b) make them binary: are we happy here, or do we need to change?
In a way similar to the Pareto principle, you keep a surprising amount of value in something "so simple it can't possibly be so useful".
Of course, you can also weaponise the choice of axes for your (office) politics: pick the two axes right, and the policy outcome you want to pick might already be baked into the whole process from the start.
Yes. I also found that in many cases being able to turn problems that require discrete decisions into problems that admit continuous decisions, eg by re-arranging how the business works etc, can unlock a lot of business value.
In my concrete cases I mostly saw that in the direct sense of being able to deploy more mathematics and operations research, eg for netting out (partially) offsetting financial instruments for a bank.
But by introspection you can come up with more examples. E.g. that's a common selling point for running your servers on AWS instead of building your own hardware.
This seems to be getting a lot of attention. I couldn't agree more, we assume linearity all the time because reasoning non-linearly is exceptionally difficult. Yes we can do it sometimes, but it is not the default. Reasoning linearly has its flaws, and we should recognize we are making an imperfect decision, but it is still extremely useful.
I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model?
Quadratics might be more useful for optimizing (min/max problems).
I have very little math knowledge and point 2 surprises me. Some quick googling suggests that a linear model should produce a straight line when graphed but the example equation you offered isn't straight. I'm missing something basic aren't I?
The things being learned here are (a, b), and you do that using data (x, y). We can rewrite our input to be of the form z = {sin(x), x^2}, and now we have the model y = a*z_1 + b*z_2, which is obviously linear in z. Since x is given to us and z is just a function of x, nothing strange is happening here. Just manipulating the data.
IANAS, but the example is not linear in x. But you can pick one or more axes where it would be linear. In this case for y=a*sin(x)+bx^2, you set x'=sin(x) and x"=x^2 and plot y=ax'+ bx". You can also pick an arbitrary function for y and do a similar transformation.
When statisticians talk about linear models, they talk about the parameters being linear, not your variables x_0..x_n. So y = a*sin(x) + b is a linear model, because y is linear in a and b.
If you want to convert people into loving linear models (and you should), you need to make sure that they learn the difference between "linear models" and "linear models fit using OLS".
I've met smart people who can't wrap their head around how it's possible to create a linear model where the number of parameters exceeds the number of data points (that's an OLS restriction).
Or they're worried about how they can apply their formula for calculating the standard error of the parameters. Bruh, it's the future and we have big computers. Just bootstrap 'em and don't make any assumptions.
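A minimal bootstrap sketch in base R (X is a numeric predictor matrix, y the response; nothing here is tied to any particular package):

    boot_se <- function(X, y, B = 2000) {
      n <- length(y)
      coefs <- replicate(B, {
        i <- sample.int(n, replace = TRUE)                  # resample rows with replacement
        coef(lm.fit(cbind(1, X[i, , drop = FALSE]), y[i]))  # refit on the resample
      })
      apply(coefs, 1, sd)                                   # bootstrap standard errors per coefficient
    }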
Okay, I’ll bite.
Help me understand the pitch. What linear models are you referring to here that aren’t estimated with OLS? How should I wrap my head around having more parameters than observations?
Linear models that aren't estimated with OLS:
- Theil-Sen
- Huber
- RANSAC
Models that can cope with more parameters than observations (quick sketch below):
- Ridge
- Lasso
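A glmnet sketch of the p > n case (simulated data; glmnet is one common choice):

    library(glmnet)
    set.seed(1)
    n <- 50; p <- 500                               # more parameters than observations
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X[, 1:5] %*% rep(1, 5)) + rnorm(n)    # only 5 predictors actually matter
    fit_lasso <- cv.glmnet(X, y, alpha = 1)         # lasso, lambda picked by cross-validation
    fit_ridge <- cv.glmnet(X, y, alpha = 0)         # ridge
    coef(fit_lasso, s = "lambda.min")[1:10, ]       # lasso zeroes out most coefficients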
Linear models have many solutions fitting the data exactly in that parameter regime, many more fitting it approximately for any metric still satisfying the idea that identical outputs are preferable, and sometimes multiple solutions even with more data.
So... not just for OLS, but for most metrics (where you'd prefer to match or approximately match the data), the parameters are underconstrained.
How much that matters depends on lots of things. If you have additional constraints (a common one that's particularly easy to program is looking for a minimum-norm solution), that trivially solves the problem. Otherwise, you might still have issues. E.g., non-minimum-norm solutions often perform badly on slightly out-of-distribution samples (since those extra basis vectors were unconstrained and thus might be large).
Is there something I'm missing where 'linear models' are used to represent something wildly different than I'm used to? Are people using norms with discontinuities or something in practice? Is the criticism of OLS perhaps unrelated to the overparameterization issue? I think I'm missing some detail that would relate all of those.
For point (3), in most of my academic research and work in industry, I have used Generalized Additive Models with great technical success (i.e., they fit the data well). Still, I have noticed that they have been rarely understood or given the proper appreciation by--it is a broad category--stakeholders. Out of laziness and habit, mostly.
I've looked at additive models, but I have so far shied away because I've read that they are not super equipped to deal with non-additive interactions.
They actually deal with non-additive "low-order" interactions quite well. In R's mgcv for example, let's say you had data from many years of temperature readings across a wide geographic area, so your data are (lat, long, year, temperature). mgcv lets you fit a model like:
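(Sketching a plausible version of that formula, using the lat/long/year/temperature names from above:)

    library(mgcv)
    # a plausible formulation of what's described; exact original formula not shown
    m <- gam(temperature ~ te(lat, long) +            # smooth spatial surface over lat/long
               s(year) +                              # univariate nonlinear effect of time
               ti(lat, long, year, d = c(2, 1)),      # pure three-way space-time interaction
             data = dat, method = "REML")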
where you have (1) a nonlinear two-way interaction (i.e. a smooth surface) across two spatial dimensions, (2) a univariate nonlinear effect of time, and (3) a three-way nonlinear interaction, i.e. "does the pattern of temperature distributions shift over time?" You still can't do arbitrary high-order interactions like you can get out of tree-based methods (xgboost & friends), but that's a small price to pay for valid confidence intervals and p-values. For example, the model above will give you a p-value for the ti() term, which you can use as formal statistical evidence to say, and at what level of confidence, that a spatiotemporal trend exists.
This Rmarkdown file (not rendered sadly) shows how to do this and other tricks https://github.com/eric-pedersen/mgcv-esa-workshop/blob/mast...
Hey cool. I'll take a closer look then. Thanks! I assume that there are mixed model variants out there too.
Re: 2) Then you end up doing feature engineering. For applications where you don't know the data-generating process, it is often better to just throw everything at the model and let it extract the features.
I don't disagree in the context of the current tools. But this has always been a bugbear of mine: data science has an unhealthy bias towards modeling over data preparation.
I'd love to see tools in the ecosystem around extracting relevant features that then can be used on a lower cost, more predictable model.
There are feature management platforms like https://featurebyte.com/
An SVM is purely a linear model from the right perspective, and if you're being really reductive, ReLU neural networks are piecewise linear. I think this may be obscuring more than it helps; picking the right transformation for your particular case is a highly nontrivial problem: why sin(x) and x^2, rather than, say, tanh(x) and x^(1/2)?
ReLU networks have the nice property of being piecewise linear, but also during training they optimise their own non-linear transformation over time.
Do you have a useful reference for "3)"?
A common problem I encounter in the literature is authors over-interpreting the slopes of a model with quadratic terms (e.g. Y = age + age^2) at the lowest and highest ages. Invariably the plot (not the confidence intervals) will seem to indicate declines (for example) at the oldest ages (example: random example off the internet [1]), when really the apparent negative slope is due to quadratic models not being able to model an asymptote.
The approach I've used (when I do not have a theoretically driven choice to work with) is fractional polynomials [2], e.g. x^s where s = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, and then picking a strategy to choose the best-fitting polynomial while avoiding overfitting (a minimal sketch is below the references).
It's not a bad technique; I've tried others like piecewise polynomial regression, knots, etc. [3], but I could not figure out how to test (for example) for a group interaction between two knotted splines. Also additive models.
[1] https://www.researchgate.net/figure/Scatter-plot-of-the-quad...
[2] https://journal.r-project.org/articles/RN-2005-017/RN-2005-0...
[3] https://bookdown.org/ssjackson300/Machine-Learning-Lecture-N...
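The power-selection step, as a minimal sketch (assumes numeric vectors x and y with x > 0, degree-one fractional polynomials only, and AIC as the made-up selection strategy):

    powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)
    fits <- lapply(powers, function(s) {
      xs <- if (s == 0) log(x) else x^s    # the convention: power 0 means log(x)
      lm(y ~ xs)
    })
    best_power <- powers[which.min(sapply(fits, AIC))]  # keep the power with the lowest AIC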
For my applications, using natural cubic splines provided by the ns() function in R, combined with trying out where the knots should be positioned, is sufficient. Maybe have a look at the gratia package [1] for plotting lots of diagnostics around spline fits.
[1] https://cran.r-project.org/web/packages/gratia/vignettes/gra...
As a student who's only been exposed to stats in undergrad (in the context of using multiple regression in Econometrics), where can I learn more about this? especially about choosing a spline basis and Taylor's theorem?
Yeah but let’s not go crazy. Linear models perform very badly on partition-able tabular data where tree models excel. They are also obviously no replacement or competition in deep learning related tasks.
Point 3 — just pick the right basis — is very difficult outside a handful of kernels that are known to work. And how are you going to extrapolate your spline for prediction for example? Linearly is usually the answer…
Point 4 - sure, for differentiable functions, but most people are fitting data, not functions, and if you knew its generating function, why would you bother with a linear model?