I used to work on an ML research team. In addition to what the author mentions, there is an entirely separate issue: whether what you're attempting to do is possible at all with the approach you've chosen. Consider making an iOS app. For the most part, an experienced software engineer can tell you whether a given app is possible, and they'll have a relatively clear idea of the steps required to realize it. Compare this to ML problems: you often don't know whether your data or model selection can produce the results you want. On top of that, you don't know whether you're not getting results because of a bug (i.e. the debugging issues mentioned by the author) or because of a fundamental block elsewhere in the pipeline.
Like, not knowing if your data set actually contains anything predictive of what you're trying to predict?
Here’s an example of something similar. Say you have a baseline model with an AUC of 0.8. There’s a cool feature you’d like to add. After a week or two of software engineering to add it, you get it into your pipeline.
AUC doesn’t budge. Is it because you added it in the wrong place? Is the feature too noisy? Is it because the feature is just a function of your existing features? Is it because your model isn’t big enough to learn the new feature? Is there a logical bug in your implementation?
All of these hypotheses will take on the order of days to check.
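The redundancy hypothesis, at least, can be probed with an auxiliary model. A rough sketch, assuming a pandas DataFrame df with the existing feature columns listed in existing_cols and the new feature in new_feature (all hypothetical names): if the existing features already reconstruct the new one almost perfectly, the model had nothing new to learn from it.

    # Redundancy check: can the existing features reconstruct the new one?
    # An R^2 close to 1 means the new feature is (approximately) a function
    # of the old ones, which would explain why AUC didn't move.
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    X = df[existing_cols]   # hypothetical: existing feature columns
    y = df[new_feature]     # hypothetical: the newly added feature

    r2 = cross_val_score(GradientBoostingRegressor(), X, y, cv=5, scoring="r2").mean()
    print(f"existing features explain R^2 = {r2:.2f} of the new feature")

That only addresses one hypothesis, of course; placement in the pipeline, model capacity, and plain implementation bugs still need their own experiments.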
All of these hypotheses will take on the order of days to check.
OK, but you can check them, right? How is that different from a regular software bug?
In software engineering you can test things on the order of seconds to minutes. Functions have fixed contracts which can be unit tested.
In ML your turnaround time is days. That alone makes things harder.
Further, some of the problems I listed are open-ended, which makes them very difficult to debug.
I think this only applies to a certain subset of software engineering, the one that rhymes with "tine of christmas".
Implementing bitstream formats is an area I'm very familiar with, and I dance when an issue takes only seconds to resolve. Sometimes you need to physically haul a vendor's equipment to the lab. In broadcast we have this thing called "Interops" where tons of software and hardware vendors do just this, but in a more convention-esque style (it is often literally done at actual conventions).
What?
line of business
I've been an ML researcher for the last 11 years. Last week I spent 3 days debugging an issue in my code which had nothing to do with ML. It was my search algorithm not properly modifying the state of an object in my quantization algorithm. Individually, both algorithms worked correctly but the overall result was incorrect.
Looking back at my career, the hardest bugs to spot were not in ML, but in distributed systems, parallel data processing algorithms, task scheduling, network routing, and dynamic programming. Of course I've also had a ton of ML-related bugs, but over the years I developed tools and intuition to deal with them, so usually I can narrow down an issue (like impacted accuracy, or training not converging) fairly quickly. I don't think these kinds of bugs are fundamentally different from any other bugs in software engineering. You typically try to isolate an issue by simplifying the environment, breaking it down into parts, testing on toy problems, tracing program execution, and printing out or visualizing values.
What makes some bugs in ML algorithms hard to spot is that many of them hinder, but do not prevent, the model from learning. They can be really hard to spot because you do see the model learning and getting better, and yet without that bug the predictions could be even more accurate. Only with domain experience can you tell that something might be wrong.
Moreover, these kinds of issues are usually related to the mathematical aspect of the model, meaning that you need to understand the theoretical motivation of things and check all operations one by one. Just this week, for example, I was dealing with a bug where we were normalizing along the wrong dimension of a tensor.
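To make that concrete, here is a minimal sketch of the kind of bug I mean (the shapes and names are invented, not our actual code): softmax along the wrong axis still produces a valid-looking probability tensor, the model still trains, and nothing crashes.

    import torch

    # Hypothetical batch of logits: (batch_size, num_classes)
    logits = torch.randn(32, 10)

    probs_wrong = torch.softmax(logits, dim=0)  # normalizes across the batch: silent bug
    probs_right = torch.softmax(logits, dim=1)  # normalizes across classes: intended

    # Both sum to 1 along *some* dimension, so a naive sanity check can pass either way.
    print(probs_wrong.sum(dim=0)[:3])  # ones (per class, across the batch)
    print(probs_right.sum(dim=1)[:3])  # ones (per example, across classes)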
Only with domain experience can you tell that something might be wrong.
Obviously. How is this different from any other field of science or engineering?
you need to understand the theoretical motivation of things and check all operations one by one.
Again, this is true when debugging any complex system. How else would you debug it?
a bug where we were normalizing along the wrong dimension of a tensor
If you describe the methodology you used to debug it, it will probably be applicable to debugging a complicated issue in any other SWE domain.
Because the difference is that statistical models are by definition somewhat stochastic. Some incorrect answers are to be expected, even if you do everything right.
In software engineering you have test code. 100% of your tests should pass. If one doesn’t you can debug it until it does.
The difference is that in most cases it is not so clear how well any given approach will work in a given scenario. Often the only option is to try, and if performance is not satisfying it is not easy to find the reason. Besides bugs or the wrong model choice, it could be the wrong training parameters, the quality or quantity of the data, and who knows what else.
It's not necessarily different from SWE; problem solving is a general skill. The difficulty comes from the fact that there is no clear definition of "it works", and that there are no guidelines or templates to follow to find out what is wrong, if anything is wrong at all. In particular, many issues are not about the code.
Or is it because of a lack of expertise and experience, and because someone tries stuff blindly, without understanding any of it, in the hope that they will nail it with enough fiddling?
Isn't that all of AI? I get the impression that not even the "experts" really understand what new techniques will get good results - they're guided by past successes, and have loose ideas about why past successes were successful, but can't really predict what else will work.
It seemed like the tremendous success of transformer architectures was a surprise to everyone, who had previously been throwing stuff at the wall and watching it not stick for multiple decades. And when you look at what a transformer is, you can see why QKV attention blocks might be useful to incorporate into machine learning models... but not why all the other things that seem like they might be useful weren't, or why a model made of only QKV attention blocks does so much better than, say, a model made of GRU blocks.
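For anyone who hasn't looked inside one, a QKV attention block is a surprisingly small amount of code. A minimal single-head NumPy sketch of scaled dot-product attention (no masking, no learned projections, just the core operation):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: arrays of shape (seq_len, d). Returns (seq_len, d)."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # query/key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V                               # weighted mix of values

    tokens = np.random.randn(5, 8)                       # toy sequence: 5 tokens, dim 8
    out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention

Why stacking essentially just this (plus MLPs, residual connections, and normalization) beats recurrent blocks so decisively is exactly the kind of thing that's hard to predict from first principles.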
No, it was not a surprise. The transformer architecture resulted from systematic exploration at scale with Seq2Seq models. And it was quite clear when this architecture came out that it was very promising.
The issue was not technology, it was lack of investment. In 2017, with a giant sucking sound, autonomous vehicle research took all the investment money and nearly all the talent. I'm a good example: I was working on training code-generation models for my startup Articoder, using around 8TB of code scraped from GitHub. I had some early successes, with automatically generated pull requests accepted by human users, and got past the YC application stage into the interview. The amount of VC funding for that was exactly zero. I filed a patent, put everything on hold, and went to work on AVs.
As to watching things not stick for multiple decades, you simply had too few people working on this, and no general availability of compute. It was a few tiny labs, with a few grad students and little to no compute available. Very few people had a supercomputer in their hands. In 2010, for example, a GPU rig like 2x GTX 470 (which could yield some 2 TFLOPS of performance) was an exception. And in the same year, the top conference, NeurIPS, had an attendance of around 600.
So, 99% of software development?
You could say that. No one has even a decade of experience with transformers. Most of this stuff is pretty new.
More broadly though, it’s because there aren’t great first principles reasons for why things work or not.
Why do you spend weeks adding something instead of directly testing all the later hypotheses?
In some cases you can directly test hypotheses like that, but more often than not, there isn’t a way to test without just trying.
The Farmer is Turkey's best friend. Turkey believes so because every day Farmer gives Turkey food, lots of food. Farmer also keeps the Turkey warm and safe from predators. Turkey predicts that the Farmer will keep on being his best friend also tomorrow. Until one day, around Thanksgiving, the prediction goes wrong, awfully wrong.
A Three-Body Problem reference, yeah.
A bit older than that. This joke predates Bell Labs.
But isn't all science basically like this? If you know your hypothesis works before you do the experiments, it's not science anymore.
You are right. That's why you want to avoid doing science, when you can.
Ideally, you want to be solving engineering problems instead.
You being? We really need people to try to solve problems that might not work because that is how new technology develops.
Let's rephrase: If you have to solve a problem, you'd better hope that problem is an engineering problem rather than a science problem.
Even many engineering problems are too difficult to know a priori that they will work. No new knowledge is needed as such, just the bounds on what is possible might be fuzzy.
Yes, of course.
If you care about solving your specific problem, you want to avoid having to acquire new knowledge. That's risky. (And I don't just mean knowledge that's new to you, that you get from reading a book. I mean knowledge that's new to humanity as far as you can tell.)
Eventually, someone will have to acquire some new knowledge to drive humanity forward. Just like in war, someone will have to go and fight the opponent; but you still prefer to achieve your objectives without having to fight whenever possible.
Not really. It’s easy to tell relative to existing methods whether the size of data will solve the problem. For example, if you’re trying to solve a classification problem with a large number of labels, but only have a small amount of training data for some (or all) of them, it will probably never work.
That is true. On the other hand, I once saw someone perform a trick which looked miraculous to me.
We had a classification problem with a small number of labels (~3), and one of the labels unfortunately had far fewer samples in our training set. Then someone trained a GAN to turn images of the abundant labels into images of the rare label. We added those synthetically generated images to the training set, and it improved our classification performance as best as we could tell.
That one still feels a bit like black magic to me to be honest. Almost as if we got more out of less with a trick.
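A minimal sketch of the augmentation step (the generator G is assumed to be already trained, CycleGAN-style; all names here are hypothetical):

    import torch
    from torch.utils.data import ConcatDataset, TensorDataset

    RARE_LABEL = 2  # hypothetical index of the under-represented class

    @torch.no_grad()
    def synthesize_rare_samples(G, abundant_images):
        # Translate abundant-class images into rare-class-looking ones.
        fake_images = G(abundant_images)               # (N, C, H, W) -> (N, C, H, W)
        fake_labels = torch.full((fake_images.shape[0],), RARE_LABEL)
        return TensorDataset(fake_images, fake_labels)

    # augmented = ConcatDataset([real_train_set,
    #                            synthesize_rare_samples(G, abundant_images)])

Whether that counts as getting more out of less is debatable: the GAN is effectively injecting its own prior about what the rare class looks like, learned from the same data.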
Transfer learning might help in some cases.
Yes, and science is "hard" compared to software development in a lot of ways. Less certainty of success and poorly defined success criteria.
Sometimes something you've done works, but you really don't know why/how. You then have to walk it back to figure out by experimenting what is causing it to work. I feel like this happen(ed|s) in chemistry a lot. Was it the fact that I stirred it counter clockwise this time, or that I got distracted and the temp went 5° hotter than intended, or that I didn't quite clean my beaker properly and some residue contaminated this batch, or any number of other steps.
You nailed where I currently stand. At my company I've been a jack of all trades but mostly software/dba work. My boss and I were very excited about ML when the hype cycle was taking off several years ago and completed a successful project. Fast forward to today, I got loaned out to another team that lost their data scientist, and for the first time in my career I'm having to say - "I don't think we can do what you want." To me the "science" part really stands out. I have a decent grasp of methodologies and tools, but after weeks of dissecting the issue my conclusion is that they just don't have enough useful data...
The situation is not bad then. Can they collect more data? Can they generate more data?
A more relevant question would be:
Is "not enough data" their problem, or the kind of data?
I'm not well-versed in this, or not as well as you are, but this has been my conclusion as well about a lot of ML project ideas from teams I've been on.
You need so much data to do useful things. Especially the magical kinds of things people tend to want to do. I think these types of datasets are on a scale most software developers typically don't see. Even with the data in hand, it's far from trivial to determine how to do something halfway useful with it.
It can be a very empirical art. If you can't generate more data at the time, you can sometimes invest in reviewing the hand-labeled ground truth to verify that no false classifications slipped by.
This is where good simulations are useful. If you can show that you encounter inference problems even in ideal, simple data scenarios, it's a strong signal that real data has little chance of doing better.
In general these are model identifiability issues.
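A toy sketch of that kind of simulation (the numbers are made up): even with a perfectly specified linear model, generous data, and zero bugs, two nearly collinear features make the individual coefficients unrecoverable; only their sum is identified.

    import numpy as np

    for seed in range(3):
        rng = np.random.default_rng(seed)
        x1 = rng.normal(size=10_000)
        x2 = x1 + rng.normal(scale=1e-6, size=10_000)     # nearly a copy of x1
        y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=10_000) # true coefficients: 2 and 3

        X = np.column_stack([x1, x2])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(coef)  # the two estimates swing wildly between seeds; their sum stays near 5

If your simulated ideal world already behaves like this, there is no point blaming the real data.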
Every data science team should have a wall decoration with John Tukey's quote "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."
I have always thought of ML (not DL) as being about phenomena that can be modelled mathematically.
It turns out that not all problems have a great mathematical model, self-driving cars for instance, and so the search continues...
Why would self-driving cars not have a great mathematical model?
Or do you mean that the models are either black boxes (like deep learning) that we don't understand, and the white box models are not good enough?
All of ML, including DL, is literally implemented using mathematical models. Alas, a model is just a model; that doesn't imply it works well, or that it's simple or easily discoverable.
I have the same sentiments about DIY electronic designs. If I take someone else's design and build it at home, I know it's my build skills that are lacking if it doesn't work, since there are already working examples. If I design a device myself, from the electronics to the software, I don't know whether the thing isn't working because of bugs in the code, problems with the build of the electronics, or a fundamental flaw in the design itself. At least not without a ton of time debugging it all.
However, we now have techniques for debugging electronics. Electronics tends to be designed to be decomposable into subunits, with some way to do unit testing. At least in the prototype, before it's shrunk for production. Test gear can be expensive, but it exists, all the way down to the wafer if needed.
That wasn't always the case. Electronics problems used to be more mysterious. The conquest of electronic design and debugging is what allows making really complex electronics that works. It really is amazing that smartphones work at all, with all those radios in that little case. That RF engineers can get a GPS receiver and a GSM transmitter to work a few centimeters apart is just amazing.
Machine learning isn't that far along yet. When it doesn't work, the tools for figuring out why are inadequate. It's not even clear yet if this is a technique problem which can be fixed with tooling, or an inherent problem with having a huge matrix of weights.
I never understood the black magic behind things like 4G until I saw a teardown of some pole equipment and saw the solid copper beam-forming cavities inside. Blew my mind.
Knowing that depends on your level of understanding of the field and the math behind it, and also on experience. If you just know how to make API calls, then it's hard.
What would be problematic if you want to do sentiment analysis for some product reviews? The result is the public perception within a margin of error; you have your data, you know what you want, you know how to get there.
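For that specific well-trodden case, the happy path really is short. A sketch assuming the Hugging Face transformers library and its default pretrained sentiment model (so, generic and English-only, not tuned to your domain):

    from transformers import pipeline

    # Off-the-shelf sentiment classifier (downloads a default pretrained model).
    classifier = pipeline("sentiment-analysis")

    reviews = [
        "Battery died after two weeks, very disappointed.",
        "Exactly what I needed, works great.",
    ]
    for review, result in zip(reviews, classifier(reviews)):
        print(result["label"], round(result["score"], 2), "-", review)

With enough background you can tell in advance that something like this will work within a reasonable margin of error.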
Well, even with a high level of understanding, any sufficiently advanced use case will still have some uncertainty regarding its "feasibility". Of course, you might think that some problems are "solved", e.g., OCR, translation, (common) object recognition, but MANY other problems exist where, no matter how experienced and knowledgeable you are, you can only have an educated guess as to whether a given model can achieve a given performance without actually trying it out.
Where experience and knowledge really pays off is in telling apart model performance from bugs. There is a real know-how in troubleshooting ML pipelines and models in general.
Welcome to the last year of work for me. Now I firmly believe that what I set out to do cannot be done. However, when I started, it seemed quite reasonable that the model I would build would be successful at its purpose.
On top of that, the vast majority of engineers and researchers who have joined the field only did so in the last few years.
Meanwhile, as with many other fields, it takes decades to get to the level of a well-rounded expert. One paper a day, one or two projects a year. It just takes time, no matter how brilliant or talented you are.
And then the research moves on. And more is different. The shift from GFLOPS to TFLOPS and then PFLOPS over a single decade is a seismic shift.
I had a fun project I tried once. I wanted to see if a neural network could be fed a Bitcoin public key and output the private key. To make things simple, I tried to see if it could even predict a single bit of the private key. 256 bits of input, 1 bit of output.
I created a set of 1000 public/private key pairs to act as the test set. Then I looped: generate a new set of 1000 key pairs, train for several epochs, then test on the test set.
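A minimal sketch of that loop (the network size and training details here are illustrative, not what I actually ran, and the key generation is reduced to a stand-in):

    import torch
    import torch.nn as nn

    def sample_keypair_bits(n):
        # Stand-in for the real thing: in the actual experiment these were
        # secp256k1 public keys (256 input bits) and one bit of the private key.
        pub = torch.randint(0, 2, (n, 256)).float()
        priv_bit = torch.randint(0, 2, (n, 1)).float()
        return pub, priv_bit

    model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(),
                          nn.Linear(512, 1))              # logit for one private-key bit
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    test_x, test_y = sample_keypair_bits(1000)            # fixed test set

    while True:
        train_x, train_y = sample_keypair_bits(1000)      # fresh key pairs each round
        for _ in range(5):                                # a few epochs on this batch
            opt.zero_grad()
            loss_fn(model(train_x), train_y).backward()
            opt.step()
        with torch.no_grad():
            acc = ((model(test_x) > 0) == test_y.bool()).float().mean().item()
        print(f"test accuracy: {acc:.3f}")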
After 3 days of training (granted, on a CPU, several years ago), the results on the test set did not converge. Nearly every trial on the test set was 47-53% correct. I think I had one run that was 60% correct, but that was likely pure luck. Do enough trials of 1000 coin flips and you'll likely find one where you get at least 600 heads.
Back to your original comment...Is what I was attempting to do even possible? Did my network need more layers? More nodes per layer? Or was it simply not possible?
Based on what I know about cryptography, it shouldn't be possible. A friend of mine said that for a basic feed-forward neural network to solve a problem, the problem has to be solvable via a massive polynomial that the training will suss out. Hashing does not have such a polynomial, or it would indicate the algorithm is broken.
But I still always wonder...what if I had a bigger network...
i.e. there are a lot more unknown unknowns, which take a lot more effort and intelligence to not stumble into haphazardly than in most other fields.
I think one issue, also, is that ML is so large as a field; it encompasses huge subfields, or related fields (statistics, optimization, etc...).