The Truth About Linear Regression (2015)

aquafox
48 replies
2d1h

Most people don't appreciate linear regression.

1) All common statistical tests are linear models: https://lindeloev.github.io/tests-as-linear/
2) Linear models are linear in the parameters, not the response! E.g. y = a*sin(x) + b*x^2 is a linear model (see the sketch just below).
3) By choosing an appropriate spline basis, many non-linear relationships between the predictors and the response can be modelled by linear models.
4) And if that flexibility isn't enough, by virtue of Taylor's theorem, linear relations are often a good approximation of non-linear ones.
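
A minimal sketch of point 2 in Python/NumPy (made-up data): the fit is ordinary least squares on the columns sin(x) and x^2, so the model is linear in the parameters a and b even though the fitted curve is not a straight line.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 5, 200)
    y = 2.0 * np.sin(x) + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)  # "true" a=2.0, b=0.5

    # One design-matrix column per parameter: sin(x) and x^2.
    X = np.column_stack([np.sin(x), x**2])

    # Ordinary least squares: minimise ||X @ [a, b] - y||^2.
    (a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(a_hat, b_hat)  # close to 2.0 and 0.5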

crystal_revenge
26 replies
2d

These are all fantastic points, and I strongly agree that most people don't appreciate linear models nearly enough.

Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.

That is, when we are in a meeting making decisions about the direction of the company, we can only say things like "we need to increase ad spend while reducing other costs of acquisition, such as discount vouchers". Finding the balance between "increasing ad spend" and "decreasing other costs" is a simple linear model.

Even if you have a great non-linear model, it's not even a matter of "interpretability" so much as "actionability". You can bring the results of a regression analysis to a meeting and very quickly model different strategies with reasonable directional confidence.

I struggled to communicate actionable insights upward until I started to really understand regression analysis. After that, it became amazingly simple to quickly crack open and understand fairly complex business processes.

addaon
6 replies
1d23h

Human beings, especially in groups, can only reasonably make linear decisions.

There are absolutely decisions that need to get made, and do get made, that are not linear. Step functions are a great example. "We need to decide if we are going to accept this acquisition offer" is an example of a decision with step function utility. You can try to "linearize" it and then apply a threshold -- "let's agree on a model for the value at which we would accept an acquisition offer" -- but in many ways that obscures that the utility function can be arbitrarily non-linear.

mturmon
5 replies
1d22h

A single decision could still be easily modeled by a 0/1 variable (as an input) and a real variable (as an output, like revenue for example).

That 0/1 input variable could also have arbitrary interactions with other variables, which would also amount to “step function” input effects.

See for example the autism/age setup down thread.

eru
4 replies
1d11h

Discrete linear optimisation is infinitely more complicated than continuous linear optimisation: the former is NP-complete, the latter is in P.
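
A small sketch of the contrast, assuming SciPy's linprog (only the continuous problem is solved here; the integer-constrained variant is noted in a comment):

    from scipy.optimize import linprog

    # Maximise x + y subject to 2x + 3y <= 12 and 3x + 2y <= 12, x, y >= 0.
    # linprog minimises, so the objective is negated.
    res = linprog(c=[-1, -1],
                  A_ub=[[2, 3], [3, 2]], b_ub=[12, 12],
                  bounds=[(0, None), (0, None)], method="highs")
    print(res.x)  # [2.4, 2.4]: the continuous optimum, found in polynomial time

    # Requiring x and y to be integers turns this into an integer program,
    # which is NP-hard in general; note the relaxed optimum above isn't even
    # integer-feasible. (Recent SciPy versions offer scipy.optimize.milp for
    # small instances via branch-and-bound.)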

naasking
1 replies
1d9h

Which seems almost ironic, because continuous linear optimization almost certainly doesn't really exist: real numbers can only be approximated, so we're always doing discrete linear optimization at some level.

eru
0 replies
1d7h

Who cares about real numbers in this context?

If all the numbers that appear in your constraints are rational (p/q with finite p and q), then any solution is also a rational number (with finite numerator and finite denominator).

(Well, any finite solution. Your solution could also be unbounded, then you might have infinities in there.)

A computer can represent finite rational numbers just fine. See eg https://docs.python.org/3/library/fractions.html or https://hackage.haskell.org/package/base-4.20.0.1/docs/Data-... for some libraries.

In practice, most people just use floating-point numbers, but that's of no philosophical concern.
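
A quick illustration with the fractions module linked above:

    from fractions import Fraction

    # Exact rational arithmetic: numerators and denominators stay finite.
    print(Fraction(1, 10) + Fraction(2, 10))  # 3/10, exactly
    print(0.1 + 0.2)                          # 0.30000000000000004 with floats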

mturmon
1 replies
20h57m

No argument with that fact.

But the parent comment is not talking about constrained optimization, just gradient following.

In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”

The question is not, “if I can only set M of these N variables to 1, which should I choose?”

That’s a good question, and it leads to problems in NP, but that’s not what the comment was referring to.

eru
0 replies
17h48m

In the context of this post, that’s just “which of these N discrete variables, if moved from 0 to 1, will increase the quantity of interest according to the linear model?” “Which will decrease it?”

Yes, you are right in that abstract setting.

If you always have the full hypercube available, the problem is as easy as you describe. But if there are constraints between the variables, it gets hairier.

nextos
5 replies
1d23h

If you add a multilevel structure to shrink your (generalized) linear predictors, this framework becomes incredibly powerful.

There are entire statistics textbooks devoted to multilevel linear models; you can get really far with these.

Shrinking through information sharing is really important to avoid overly optimistic predictions in the case of little data.
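
A minimal sketch of the partial-pooling idea, assuming statsmodels' MixedLM with simulated grouped data (the column names and effect sizes are made up for the example):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated multilevel data: 20 groups, each with its own intercept shift.
    rng = np.random.default_rng(1)
    group = np.repeat(np.arange(20), 10)
    x = rng.normal(size=group.size)
    y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=20)[group] + rng.normal(size=group.size)
    df = pd.DataFrame({"y": y, "x": x, "group": group})

    # Random intercept per group: group-level estimates are shrunk ("partially
    # pooled") toward the overall mean, guarding against overconfident
    # predictions for groups with little data.
    result = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()
    print(result.summary())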

levocardia
2 replies
1d19h

Especially when you use the mixed model (aka MLM) framework to automatically select the smoothing penalty for your splines. So in one simple and very intuitive framework, you can estimate linear and nonlinear effects, account for repeated measurements and nested data, and model binary, count, or continuous outcomes (and more), all fitting the model in one shot, yielding statistically valid confidence intervals and p-values.

R's mgcv package (which does all of the above) is probably the single reason I'm still using R as my primary stats language.

stevesimmons
1 replies
1d11h

Is there a Python equivalent?

gpderetta
0 replies
1d6h

statsmodels is the closest thing in Python to R. statsmodels has mixed-model support, but mgcv apparently requires more. It is well above my pay grade, but this seems relevant: https://github.com/statsmodels/statsmodels/issues/8029 (i.e. no out-of-the-box support, though you might be able to build an approximation on your own).

nextos
0 replies
1d22h

Yes, the linked ashr article is quite famous.

madrox
4 replies
1d23h

I have a degree in statistics yet I've never thought about the relationship between linear models and business decisions in this way. You're absolutely right. This is the best comment I've read all month.

highfrequency
3 replies
1d17h

I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model?

resonious
1 replies
1d16h

I'm also curious about what a non-actionable non-linear suggestion would look like.

317070
0 replies
1d11h

How I understand the comment: a non-linear suggestion is that the budget for X should be 300k. The (supposedly linear) alternative is that the budget for X should increase.

What I think is the important part is that it is better to ask decision makers to set a continuous parameter than to make binary yes/no or go/no-go decisions. When it's a decision by committee, I can see why that is.

wheelinsupial
0 replies
1d3h

I’m neither of the previous posters, so I may be off…

For simplicity, I’m going to assume each variable in the model is independent of every other variable.

We can interpret the coefficients in linear models. A coefficient's interpretation holds over the range of values the model was fit on, and it is the same everywhere in that range. (We can't extrapolate outside of what's been modeled.)

y = c1*x1 + c2*x2 + … + cn*xn

The sign tells you the direction (+ means it increases y, - means it decreases y), and the magnitude of the coefficient tells you how much y changes for a 1-unit change in that x.

Since this is linear, you get the same change in the output for a given change in an input, no matter your starting point.

So, say the regression model shows that x1, x3, and x5 have positive coefficients and x2, x4 have negative coefficients. If you want y to increase, either do more of x1, x3, x5 or do less of x2, x4. Depending on what these are and your limited investment budget, for example, you may pick x3 if that has the largest positive coefficient.

Again, since this is linear, you can keep putting resources into the variable with the largest coefficient and get the same increase, up until your model is no longer valid.

For non-linear models, you can still interpret the coefficients, but the interpretation depends on your starting conditions and where you are on the graph.

There may be asymptotes in your non-linear model, so there is a point of diminishing returns where if you keep putting resources into a variable with a positive coefficient, this will not keep getting you commensurate results.

Sorry I don’t have any actual examples here and I don’t have time to go digging through my old textbooks to look for any.
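
To make the linearity-of-effect point concrete, here is a tiny sketch with made-up coefficients (nothing from a real dataset):

    # Hypothetical fitted model: y = 3*x1 - 2*x2 + 0.5*x3 (coefficients made up).
    coefs = {"x1": 3.0, "x2": -2.0, "x3": 0.5}

    def predicted_change(deltas):
        # Change in y for the given changes in inputs; independent of starting point.
        return sum(coefs[name] * delta for name, delta in deltas.items())

    print(predicted_change({"x1": 1}))           # +3.0, whether x1 was 10 or 1000
    print(predicted_change({"x1": 1, "x2": 1}))  # +1.0: effects simply add
    # Ranking by |coefficient| (on comparable scales) suggests where a limited
    # budget moves y the most: here x1, then x2, then x3.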

eru
4 replies
1d11h

Another one I would add that is very important: Human beings, especially in groups, can only reasonably make linear decisions.

No, that's not true. Human groups are very able to make discrete decisions. Actually, often they tend to go for discrete decisions, when something continuous (and perhaps linear) would be a lot better.

(Just to be clear: if you force your linear models to make discrete predictions, they are no longer linear in any sense of the word. That's why linear optimisation can be solved in polynomial time, while integer linear optimisation is NP-complete.

Even convex optimisation, which is no longer linear but still continuous, can be solved in roughly polynomial time.)

Often people demand more decisive decisions, of 'yes'/'no' or concrete action, not shades of grey and fiddling at the margins.

Getting people to even appreciate linear models is already a step forward. Like it or not, your business strategy meetings are already a step ahead of what most people would naturally be inclined to do.

wiz21c
3 replies
1d11h

Often people demand more decisive decisions, of 'yes'/'no' or concrete action, not shades of grey and fiddling at the margins.

And working with these people is so painful.

harperlee
1 replies
1d4h

I often make fun of McKinsey-style four quadrants when overused, but they really boil down to something that makes a lot of sense in communicating a problem space:

a) carefully choose the two most important dimensions of concern (as Alan Kay said, the correct point of view is worth 80 IQ points)

b) make them binary: are we happy here or do we need to change?

In a way similar to the Pareto principle, you keep a surprising amount of value in something "so simple it can't possibly be so useful".

eru
0 replies
17h57m

Of course, you can also weaponise the choice of axes for your (office) politics: pick the two axes right, and the policy outcome you want to pick might already be baked into the whole process from the start.

eru
0 replies
1d11h

Yes. I also found that in many cases, being able to turn problems that require discrete decisions into problems that admit continuous decisions, e.g. by re-arranging how the business works, can unlock a lot of business value.

In my concrete cases I mostly saw that in the direct sense of being able to deploy more mathematics and operations research, eg for netting out (partially) offsetting financial instruments for a bank.

But by introspection you can come up with more examples. E.g. that's a common selling point for running your servers on AWS instead of building your own hardware.

jvans
0 replies
1d2h

Human beings, especially in groups, can only reasonably make linear decisions.

This seems to be getting a lot of attention. I couldn't agree more: we assume linearity all the time because reasoning non-linearly is exceptionally difficult. Yes, we can do it sometimes, but it is not the default. Reasoning linearly has its flaws, and we should recognize we are making an imperfect decision, but it is still extremely useful.

highfrequency
0 replies
1d17h

I don't follow - could you explain this with a couple of examples? What would a business proposal look like that is analogous to a nonlinear model vs. one that is analogous to a linear model?

antwerp1
0 replies
1d13h

Quadratics might be more useful for optimizing (min/max problems).

phrotoma
3 replies
1d6h

I have very little math knowledge and point 2 surprises me. Some quick googling suggests that a linear model should produce a straight line when graphed, but the example equation you offered isn't straight. I'm missing something basic, aren't I?

hervature
0 replies
1d4h

The things being learned here are (a, b), and you do that using data (x, y). We can rewrite our input in the form z = (sin(x), x^2), and now we have the model y = a*z_1 + b*z_2, which is obviously linear in z. Since x is given to us and z is just a function of x, nothing strange is happening here. We're just manipulating the data.

gpderetta
0 replies
1d6h

IANAS, but the example is not linear in x. You can, however, pick one or more axes where it would be linear. In this case, for y = a*sin(x) + b*x^2, you set x' = sin(x) and x'' = x^2 and plot y = a*x' + b*x''. You can also pick an arbitrary function for y and do a similar transformation.

Tomte
0 replies
1d6h

When statisticians talk about linear models, they talk about the parameters being linear, not your variables x_0..x_n. So y = a*sin(x) + b is a linear model, because y is linear in a and b.

parpfish
3 replies
1d20h

If you want to convert people into loving linear models (and you should), we need to make sure that they learn the difference between 'linear models' and 'linear models fit using OLS'.

I've met smart people who can't wrap their heads around how it's possible to create a linear model where the number of parameters exceeds the number of data points (that's an OLS restriction).

Or they're worried about how they can apply their formula for calculating the std error on the parameters. Bruh, it's the future and we have big computers. Just bootstrap 'em and don't make any assumptions.
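
A rough sketch of the bootstrap idea on a toy linear model (plain NumPy; resample rows with replacement, refit, read off the spread of the estimates):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
    y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

    def fit(X, y):
        # Plain least squares here, but any estimator could be swapped in.
        return np.linalg.lstsq(X, y, rcond=None)[0]

    # Nonparametric bootstrap: resample rows with replacement and refit.
    boot = np.array([fit(X[idx], y[idx])
                     for idx in (rng.integers(0, n, size=n) for _ in range(2000))])
    print(fit(X, y))         # point estimates
    print(boot.std(axis=0))  # bootstrap standard errors, no distributional formula needed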

tylerrobinson
1 replies
1d15h

Okay, I’ll bite.

If you want to convert people into loving linear models (and you should), we need to make sure that they learn the difference between 'linear models' and 'linear models fit using OLS'

Help me understand the pitch. What linear models are you referring to here that aren’t estimated with OLS? How should I wrap my head around having more parameters than observations?

solresol
0 replies
1d14h

Linear models that aren't estimated with OLS:
- Theil-Sen
- Huber
- RANSAC

Models that can cope with more parameters than observations:
- Ridge
- Lasso
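
For reference, scikit-learn ships estimators for all of the above; a rough sketch on toy data with a few gross outliers (the data and parameter choices are illustrative only):

    import numpy as np
    from sklearn.linear_model import (TheilSenRegressor, HuberRegressor,
                                      RANSACRegressor, Ridge, Lasso)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)
    y[:5] += 20  # a few gross outliers, where the robust estimators earn their keep

    for est in [TheilSenRegressor(), HuberRegressor(), RANSACRegressor(),
                Ridge(alpha=1.0), Lasso(alpha=0.1)]:
        est.fit(X, y)
        name = type(est).__name__
        coef = est.estimator_.coef_ if name == "RANSACRegressor" else est.coef_
        print(name, np.round(coef, 2))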

hansvm
0 replies
1d18h

where the number of parameters exceeds the number of data points

In that parameter regime, linear models have many solutions that fit the data exactly, many more that fit it approximately under any metric that prefers matching outputs, and sometimes multiple solutions even with more data than parameters.

So... not just for OLS, but for most metrics (where you'd prefer to match or approximately match the data), the parameters are underconstrained.

How much that matters depends on lots of things. If you have additional constraints (a common one that's particularly easy to program is looking for a minimum-norm solution), that trivially solves the problem. Otherwise, you might still have issues. E.g., non-minimum-norm solutions often perform badly on slightly out-of-distribution samples (since those extra basis vectors were unconstrained and thus might be large).

Is there something I'm missing where 'linear models' are used to represent something wildly different than I'm used to? Are people using norms with discontinuities or something in practice? Is the criticism of OLS perhaps unrelated to the overparameterization issue? I think I'm missing some detail that would relate all of those.
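
A small sketch of the minimum-norm point with NumPy, in the more-parameters-than-observations regime (np.linalg.lstsq returns the minimum-norm solution for underdetermined systems):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 100                   # more parameters than observations
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    # Infinitely many coefficient vectors fit exactly; lstsq returns the
    # minimum-norm one, i.e. the extra constraint mentioned above.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    print(np.allclose(X @ beta, y))  # True: an exact interpolator
    print(np.linalg.norm(beta))      # smallest norm among all exact solutions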

borroka
3 replies
2d

For point (3), in most of my academic research and work in industry, I have used Generalized Additive Models with great technical success (i.e., they fit the data well). Still, I have noticed that they have rarely been understood or given proper appreciation by stakeholders (a broad category), mostly out of laziness and habit.

SubiculumCode
2 replies
1d23h

I've looked at additive models, but I have so far shied away because I've read that they are not super equipped to deal with non-additive interactions.

levocardia
1 replies
1d19h

They actually deal with non-additive "low-order" interactions quite well. In R's mgcv for example, let's say you had data from many years of temperature readings across a wide geographic area, so your data are (lat, long, year, temperature). mgcv lets you fit a model like:

    gam(temperature ~ te(long, lat) + s(year) + ti(long, lat, year))
where you have (1) a nonlinear two-way interaction (i.e. a smooth surface) across two spatial dimensions, (2) a univariate nonlinear effect of time, and (3) a three-way nonlinear interaction, i.e. "does the pattern of temperature distributions shift over time?"

You still can't do arbitrarily high-order interactions like you can get out of tree-based methods (xgboost & friends), but that's a small price to pay for valid confidence intervals and p-values. For example, the model above will give you a p-value for the ti() term, which you can use as formal statistical evidence (at a stated level of confidence) that a spatiotemporal trend exists.

This Rmarkdown file (not rendered sadly) shows how to do this and other tricks https://github.com/eric-pedersen/mgcv-esa-workshop/blob/mast...

SubiculumCode
0 replies
1d14h

Hey cool. I'll take a closer look then. Thanks! I assume that there are mixed model variants out there too.

esafak
2 replies
1d22h

Re. 2) Then you end up doing feature engineering. For applications where you don't know the data-generating process, it is often better to just throw everything at the model and let it extract the features.

benrutter
1 replies
1d10h

I don't disagree in the context of the current tools. But this has always been a bugbear of mine: data science has an unhealthy bias towards modeling over data preparation.

I'd love to see tools in the ecosystem around extracting relevant features that can then be used in a lower-cost, more predictable model.

waveBidder
1 replies
1d21h

An SVM is purely a linear model from the right perspective, and if you're being really reductive, ReLU neural networks are piecewise linear. I think this may be obscuring more than it helps: picking the right transformation for your particular case is a highly nontrivial problem. Why sin(x) and x^2 rather than, say, tanh(x) and x^(1/2)?

eru
0 replies
1d11h

ReLU networks have the nice property of being piecewise linear, but also during training they optimise their own non-linear transformation over time.

SubiculumCode
1 replies
1d23h

Do you have a useful reference for "3)"?

A common problem I encounter in the literature is authors over-interpreting the slopes of a model with quadratic terms (e.g. Y = age + age^2) at the lowest and highest ages. Invariably the plot (not the confidence intervals) will seem to indicate declines (for example) at the oldest ages (example: a random one off the internet [1]), when really the apparent negative slope is due to quadratic models not being able to model an asymptote.

The approach I've used (when I do not have a theoretically driven choice to work with) is fractional polynomials [2], e.g. x^s where s ∈ {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, and then picking a strategy to choose the best-fitting polynomial while avoiding overfitting.

It's not a bad technique; I've tried others like piecewise polynomial regression, knots, etc. [3], but I could not figure out how to test (for example) for a group interaction between two knotted splines. Also additive models.

[1] https://www.researchgate.net/figure/Scatter-plot-of-the-quad...) [2] https://journal.r-project.org/articles/RN-2005-017/RN-2005-0... [3] https://bookdown.org/ssjackson300/Machine-Learning-Lecture-N...

aquafox
0 replies
1d23h

For my applications, using natural cubic splines provided by the 'ns' function in R, combined with trying out where knots should be positioned, is sufficient. Maybe have a look at the gratia package [1] for plotting lots of diagnostics around spline fits.

[1] https://cran.r-project.org/web/packages/gratia/vignettes/gra...
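
For readers working in Python rather than R, a rough analogue might look like the sketch below, assuming patsy's cr() natural cubic spline transform used through a statsmodels formula (the df value and data are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x": np.linspace(0, 10, 300)})
    df["y"] = np.sin(df["x"]) + rng.normal(scale=0.3, size=len(df))

    # cr() builds a natural cubic spline basis (roughly analogous to R's
    # splines::ns); df controls flexibility, and knots can be set explicitly.
    fit = smf.ols("y ~ cr(x, df=5)", data=df).fit()
    print(fit.summary())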

wiresong
0 replies
15h27m

As a student who's only been exposed to stats in undergrad (in the context of using multiple regression in econometrics), where can I learn more about this? Especially about choosing a spline basis and Taylor's theorem?

usgroup
0 replies
1d11h

Yeah, but let's not go crazy. Linear models perform very badly on partitionable tabular data where tree models excel. They are also obviously no replacement or competition in deep-learning-related tasks.

Point 3 — just pick the right basis — is very difficult outside a handful of kernels that are known to work. And how are you going to extrapolate your spline for prediction for example? Linearly is usually the answer…

Point 4 — sure, for differentiable functions, but most people are fitting data, not functions, and if you knew the generating function why would you bother with a linear model?

eachro
7 replies
2d1h

I'd love to see linear regression taught by, say, a quant researcher from Citadel. How do these guys use it? What do they particularly care about? Any theoretical results that meaningfully change the way they view problems? And so on.

mikaeluman
5 replies
2d

I have some experience. Variants of regularization are a must. There are just too few samples and too much noise per sample.

In a related problem, covariance matrix estimation, variants of shrinkage are popular, the most straightforward being linear shrinkage (Ledoit-Wolf).

Excepting neural nets, I think most people doing regression simply use linear regression with the above kinds of touches, adapted to the domain.

Particularly in finance you fool yourself too much with more complex models.
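
A minimal sketch of Ledoit-Wolf shrinkage, assuming scikit-learn's LedoitWolf estimator and simulated returns (the dimensions are made up):

    import numpy as np
    from sklearn.covariance import LedoitWolf, empirical_covariance

    # Few observations relative to the number of assets: the sample covariance
    # is ill-conditioned, and Ledoit-Wolf shrinks it toward a scaled identity.
    rng = np.random.default_rng(0)
    returns = rng.normal(size=(60, 50))  # 60 observations, 50 assets (simulated)

    lw = LedoitWolf().fit(returns)
    print(lw.shrinkage_)                                  # estimated shrinkage intensity
    print(np.linalg.cond(empirical_covariance(returns)))  # very large
    print(np.linalg.cond(lw.covariance_))                 # much tamer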

Ntrails
3 replies
1d23h

There are just too few samples and too much noise per sample.

Call it 2000 liquid products on the US exchanges. Many years of data. Even if you coarsen it from per-tick to 1-minute bars, that doesn't feel like you're struggling for a large in-sample period?

kqr
0 replies
1d11h

It sounds like you are assuming the joint distribution of returns in the future is equal to that of the past, and assuming away potential time dependence.

These may be valid assumptions, but even if they are, "sample size" is always relative to between-sample unit variance, and that variance can be quite large for financial data. In some cases even infinite!

Regarding relativity of sample size, see e.g. this upcoming article: https://two-wrongs.com/sample-unit-engineering

energy123
0 replies
1d18h

If the distribution changes enough, multiple years of data may as well be no data.

bormaj
0 replies
1d21h

They may have been referring to (for example) reported financial results or news events which are more infrequent/rare but may have outsized impact on market prices.

fasttriggerfish
0 replies
2d

Yes, these are good points and probably the most important ones as far as the maths is concerned, though I would say regularisation methods are really standard things one learns in any ML/stats course. Ledoit-Wolf shrinkage is indeed more exotic and very useful.

ljosifov
0 replies
19h47m

Linear regression, and with a single predictor at that, is the workhorse. As if the cross-product x'*y alone were too little, dividing it by the dot-product x'*x were just right (the regression slope), and dividing again by y'*y (with a square root, giving the correlation) would be overdoing it. :-)

There is no big mystery, I'm afraid, and no big reveal. It's as Jim Simons described in the Numberphile video interview: a slow, painstaking accumulation of weak signals, plus crafting and improving the various boxes of the system (the interfaces between them are largely known). The fitting method used does not buy that much in the grand scheme of things, as long as it does not ruin things.

(I've not been at Citadel, but I've been in quant R&D and trading for the last 20 years.)
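
Spelled out on toy, roughly centered data (no intercept, purely illustrative), the slope and correlation identities above are:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 1.7 * x + rng.normal(size=1000)

    slope = (x @ y) / (x @ x)                    # single-predictor regression slope
    corr = (x @ y) / np.sqrt((x @ x) * (y @ y))  # one more normalisation: correlation
    print(slope, corr)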

minimaxir
6 replies
1d23h

When I was at CMU a decade ago I took 36-401 and 36-402 (then taught by Shalizi); they were both very good statistics classes, and they forced me to learn base R, for better or for worse.

A big weakness of linear regression that I had to learn the hard way is that the academic assumptions needed for valid interpretation of the coefficients are easy to satisfy in small educational datasets but rarely hold for messy real-world data.

bdjsiqoocwk
4 replies
1d22h

The flip side is that with messy real-world data you just need a model that's OK enough, rather than worrying about whether the p-value is this or that.

minimaxir
3 replies
1d21h

At that point, if you don't care about interpretable coefficients, you might as well use gradient-boosted trees or a full neural network instead.

borroka
2 replies
1d20h

It depends on the "severity" of the violation of assumptions--you can also use GAMs to add flexible nonlinear relationships--and the amount of data you are working with. Statistical modeling is a nuanced job.

minimaxir
1 replies
1d13h

I tried to argue that while at CMU and it didn't go well.

borroka
0 replies
59m

They may not know at CMU that the vast majority of applied, trained-on-data statistical models that help run the modern world seriously violate one or more of their assumptions.

aquafox
0 replies
1d22h

It depends. The most important assumption is independence of the observations. If that doesn't hold, you have to either account for correlated responses using a mixed-effects model or mean-aggregate those responses (computing the mean decreases the variance but also reduces the number of data points, and those two cancel each other out in calculating the t-statistic of the Wald test).

With regard to other assumptions, e.g. normality of the residuals, linear models can often tolerate some degree of violation. But I agree that it's always good to understand the influence of those violations, e.g. by using simulations and making p-value histograms of null data.

SubiculumCode
3 replies
1d23h

The most important skill in regression is to RECOGNIZE the intercept. It sounds trivial, and it is, until you start including interactions between terms. The number of times I've seen a young graduate student screw this up...

Take a simple linear model involving a test score, age in years (range 7-16), and a binary categorical autism diagnosis (0 = control, 1 = autism): score ~ age + diagnosis + age:diagnosis, i.e. score = (X1)*age + (X2)*diagnosis + (X3)*age:diagnosis.

If X2 is significant, the naive student would say, "look, a group difference!", not realizing this is the predicted group difference at the intercept, i.e. when participants were 0 years old. [[ The fix: center age at the mean, the median, or better yet the age you are most interested in. Once interactions are in the equation, all "lower order" parameter estimates are in reference to the intercept. ]]

They might also note a significant effect of age and assume it applies to both groups, but X1 only tells you the predicted slope for the reference group (controls), while the interaction tests whether the age slopes differ between groups. Moreover, even if the interaction isn't significant, the age effect in the autism group might not significantly differ from zero... the data is in the wishy-washy zone, and you have to be careful in how you interpret it.

To some here all this will seem obvious, but for many, getting their head firmly into the conditional space of the parameters when there are interaction terms takes work. (Note: for now I am ignoring other ways of coding groups (grand mean vs. one group as the reference), but the lesson still remains: understand what the intercept means and to whom/what it refers.)
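
A small simulated sketch of the centering point, assuming statsmodels' formula API (the data, effect sizes, and the choice of age 12 are made up):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 400
    age = rng.uniform(7, 16, size=n)
    diagnosis = rng.integers(0, 2, size=n)  # 0 = control, 1 = autism (simulated)
    score = 50 + 2.0 * age + 5.0 * diagnosis - 1.0 * age * diagnosis + rng.normal(scale=5, size=n)
    df = pd.DataFrame({"score": score, "age": age, "diagnosis": diagnosis})

    # Raw age: the 'diagnosis' coefficient is the group difference at age 0.
    raw = smf.ols("score ~ age * diagnosis", data=df).fit()

    # Centered at age 12: the 'diagnosis' coefficient is now the group
    # difference at age 12, which is usually what is actually of interest.
    df["age_c"] = df["age"] - 12
    centered = smf.ols("score ~ age_c * diagnosis", data=df).fit()
    print(raw.params["diagnosis"], centered.params["diagnosis"])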

mturmon
0 replies
1d22h

I think this is accurate.

A significant loading on diagnosis (X2) does not tell you anything about the effect of diagnosis at any particular age (except age 0).

You’d have to recenter the model about the age of interest.

aquafox
0 replies
1d22h

I always struggle to get a good intuition for models with interaction terms. I usually try to write down, for every class of responses, which terms of the model contribute to it, and often that helps with interpretation. There's also the ExploreModelMatrix package [1] that helps with that task.

[1] https://www.bioconductor.org/packages/release/bioc/html/Expl...

SubiculumCode
0 replies
1d23h

If I said something stupid above, please let me know. I'm always learning. If you are a strong Bayesian who doesn't like p-values, that is also fine. I get it. I just wanted to provide my observations about a great number of bright students I've worked with who have nevertheless struggled to fluidly interpret models with interaction terms, and point them in the right direction.

g42gregory
2 replies
1d21h

It looks like this article does not mention it, but linear regression will also exhibit the double descent phenomenon commonly seen in deep learning. You would need to introduce some regularization in order to see this. It would be nice to add this discussion.

gotoeleven
1 replies
1d21h

Are there some papers in particular that you're referring to? Does the second descent happen after the model becomes overparameterized, like with neural nets? What kind of regularization?

ds_opseeker
0 replies
1d8h

Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle. Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo (submitted 24 Mar 2023).

https://arxiv.org/abs/2303.14151

Double descent is a surprising phenomenon in machine learning, in which as the number of model parameters grows relative to the number of data, test error drops as models grow ever larger into the highly overparameterized (data undersampled) regime. This drop in test error flies against classical learning theory on overfitting and has arguably underpinned the success of large models in machine learning. This non-monotonic behavior of test loss depends on the number of data, the dimensionality of the data and the number of model parameters. Here, we briefly describe double descent, then provide an explanation of why double descent occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent with ordinary linear regression and identify three interpretable factors that, when simultaneously all present, together create double descent. We demonstrate that double descent occurs on real data when using ordinary linear regression, then demonstrate that double descent does not occur when any of the three factors are ablated. We use this understanding to shed light on recent observations in nonlinear models concerning superposition and double descent. Code is publicly available

yu3zhou4
0 replies
2d

This looks very interesting. Do you know a way to transform this PDF into a mobile-optimized form?

usgroup
0 replies
1d11h

Also see Shalizi's "Advanced Data Analysis from an Elementary Point of View", which is a good introductory textbook:

https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

It is rightly overweight on linear and additive models and on simulation. 90% of the book is useless without a computer, but that is a modern truth.

rkp8000
0 replies
1d21h

I love that ridge regression is introduced in the context of multicollinearity. It seems almost everyone these days learns about it as a regularization technique to prevent overfitting, but one of its fundamental use cases (and indeed its origin, I believe) is balancing weights among highly correlated (or nearly linearly dependent) predictors, which can cause huge problems even if you have plenty of data.
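
A tiny sketch of that use case, assuming scikit-learn and two nearly identical predictors (the data is made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly identical to x1
    y = x1 + x2 + rng.normal(size=n)           # "true" effect split evenly
    X = np.column_stack([x1, x2])

    print(LinearRegression().fit(X, y).coef_)  # typically wild, offsetting coefficients
    print(Ridge(alpha=1.0).fit(X, y).coef_)    # balanced weights near (1, 1)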

brcmthrowaway
0 replies
1d21h

pfft, couldn't build an LLM with it

ackbar03
0 replies
1d16h

We had to revisit linear regression multiple times in different courses during my undergrad. It's fascinating that optimality is provable using statistics and probability theory, provided the assumptions hold, of course.

For my CS PhD I looked mostly at regression problems using deep learning models. I didn't look at this specifically, but I still think it would be neat if there were some way to translate the rigorous proofs and theorems for classical linear models to deep regression models.

__mharrison__
0 replies
1d16h

Thanks for sharing. As someone teaching regression (with XGBoost) this month, this is a good read. Very well written and approachable, unlike many academic texts.

I particularly like chapter 6, visual diagnosis. Very well done.