I used to work on an ML research team. In addition to what the author mentions, there is an entirely separate issue: whether what you're attempting to do is possible at all with the approach you've chosen. Consider making an iOS app. For the most part, an experienced software engineer can tell you whether a given app is possible, and they'll have a relatively clear idea of the steps required to realize it. Compare this to ML problems: you often don't know whether your data or model selection can produce the results you want. On top of that, you don't know whether you're not getting results because of a bug (i.e. the debugging issues mentioned by the author) or because of a fundamental block elsewhere in the pipeline.
Like, not knowing if your data set actually contains anything predictive of what you're trying to predict?
Here’s an example of something similar. Say you have a baseline model with an AUC of 0.8. There’s a cool feature you’d like to add. After a week or two of software engineering to add it, you get it into your pipeline.
AUC doesn’t budge. Is it because you added it in the wrong place? Is the feature too noisy? Is it because the feature is just a function of your existing features? Is it because your model isn’t big enough to learn the new feature? Is there a logical bug in your implementation?
All of these hypotheses will take on the order of days to check.
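The redundancy hypothesis, at least, can be probed with an auxiliary model. A rough sketch, assuming a pandas DataFrame df with the existing feature columns listed in existing_cols and the new feature in new_feature (all hypothetical names): if the existing features already reconstruct the new one almost perfectly, the model had nothing new to learn from it.

    # Redundancy check: can the existing features reconstruct the new one?
    # An R^2 close to 1 means the new feature is (approximately) a function
    # of the old ones, which would explain why AUC didn't move.
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    X = df[existing_cols]   # hypothetical: existing feature columns
    y = df[new_feature]     # hypothetical: the newly added feature

    r2 = cross_val_score(GradientBoostingRegressor(), X, y, cv=5, scoring="r2").mean()
    print(f"existing features explain R^2 = {r2:.2f} of the new feature")

That only addresses one hypothesis, of course; placement in the pipeline, model capacity, and plain implementation bugs still need their own experiments.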
All of these hypotheses will take on the order of days to check.
OK, but you can check them, right? How is that different from a regular software bug?
In software engineering you can test things on the order of seconds to minutes. Functions have fixed contracts which can be unit tested.
In ML your turnaround time is days. That alone makes things harder.
Further, some of the problems I listed are open-ended, which makes them very difficult to debug.
I think this only applies to a certain subset of software engineering, the one that rhymes with "tine of christmas".
Implementing bitstream formats is an area I'm very familiar with, and I dance when an issue takes only seconds to resolve. Sometimes you need to physically haul a vendor's equipment to the lab. In broadcast we have this thing called "Interops" where tons of software and hardware vendors do just this, but in a more convention-esque style (it is often literally done at actual conventions).
What?
line of business
I've been an ML researcher for the last 11 years. Last week I spent 3 days debugging an issue in my code which had nothing to do with ML. It was my search algorithm not properly modifying the state of an object in my quantization algorithm. Individually, both algorithms worked correctly but the overall result was incorrect.
Looking back at my career, the hardest bugs to spot were not in ML, but in distributed systems, parallel data processing algorithms, task scheduling, network routing, and dynamic programming. Of course I've also had a ton of ML-related bugs, but over the years I developed tools and intuition to deal with them, so usually I can narrow down an issue (like impacted accuracy, or training not converging) fairly quickly. I don't think these kinds of bugs are fundamentally different from any other bugs in software engineering. You typically try to isolate an issue by simplifying the environment, breaking it down into parts, testing on toy problems, tracing program execution, and printing out or visualizing values.
What makes some bugs in ML algorithms hard to spot is that many of them hinder, but do not prevent, the model from learning. They can be really hard to spot because you do see the model learning and getting better, and yet without that bug the predictions could be even more accurate. Only with domain experience can you tell that something might be wrong.
Moreover, these kinds of issues are usually related to the mathematical aspect of the model, meaning that you need to understand the theoretical motivation of things and check all operations one by one. Just this week, for example, I was dealing with a bug where we were normalizing along the wrong dimension of a tensor.
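To make that concrete, here is a minimal sketch of the kind of bug I mean (the shapes and names are invented, not our actual code): softmax along the wrong axis still produces a valid-looking probability tensor, the model still trains, and nothing crashes.

    import torch

    # Hypothetical batch of logits: (batch_size, num_classes)
    logits = torch.randn(32, 10)

    probs_wrong = torch.softmax(logits, dim=0)  # normalizes across the batch: silent bug
    probs_right = torch.softmax(logits, dim=1)  # normalizes across classes: intended

    # Both sum to 1 along *some* dimension, so a naive sanity check can pass either way.
    print(probs_wrong.sum(dim=0)[:3])  # ones (per class, across the batch)
    print(probs_right.sum(dim=1)[:3])  # ones (per example, across classes)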
Only with domain experience can you tell that something might be wrong.
Obviously. How is this different from any other field of science or engineering?
you need to understand the theoretical motivation of things and check all operations one by one.
Again, this is true when debugging any complex system. How else would you debug it?
a bug where we were normalizing along the wrong dimension of a tensor
If you describe the methodology you used to debug it, it will probably be applicable to debugging a complicated issue in any other SWE domain.
Because the difference is that statistical models are by definition somewhat stochastic. Some incorrect answers are to be expected, even if you do everything right.
In software engineering you have test code. 100% of your tests should pass. If one doesn’t you can debug it until it does.
The difference is that in most cases it is not so clear how well any given approach will work in a given scenario. Often the only option is to try, and if performance is not satisfying it is not easy to find the reason. Besides bugs or the wrong model choice, it could be the wrong training parameters, the quality or quantity of the data, and who knows what else.
It's not necessarily different from SWE; problem solving is a general skill. The difficulty comes from the fact that there is no clear definition of "it works", and that there are no guidelines or templates to follow to find out what is wrong, if anything is wrong at all. In particular, many issues are not about the code.
Or is it because of a lack of expertise and experience, and because someone tries stuff blindly, without understanding any of it, in the hope that they will nail it with enough fiddling?
Isn't that all of AI? I get the impression that not even the "experts" really understand what new techniques will get good results - they're guided by past successes, and have loose ideas about why past successes were successful, but can't really predict what else will work.
It seemed like the tremendous success of transformer architectures was a surprise to everyone, who had previously been throwing stuff at the wall and watching it not stick for multiple decades. And when you look at what a transformer is, you can see why QKV attention blocks might be useful to incorporate into machine learning models... but not why all the other things that seem like they might be useful weren't, or why a model made of only QKV attention blocks does so much better than, say, a model made of GRU blocks.
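For anyone who hasn't looked inside one, a QKV attention block is a surprisingly small amount of code. A minimal single-head NumPy sketch of scaled dot-product attention (no masking, no learned projections, just the core operation):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: arrays of shape (seq_len, d). Returns (seq_len, d)."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # query/key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V                               # weighted mix of values

    tokens = np.random.randn(5, 8)                       # toy sequence: 5 tokens, dim 8
    out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention

Why stacking essentially just this (plus MLPs, residual connections, and normalization) beats recurrent blocks so decisively is exactly the kind of thing that's hard to predict from first principles.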
No, it was not a surprise. The transformer architecture resulted from systematic exploration at scale with Seq2Seq models. And it was quite clear when this architecture came out that it was very promising.
The issue was not technology, it was lack of investment. In 2017, with a giant sucking sound, autonomous vehicle research took all the investment money and nearly all the talent. I'm a good example: I was working on training code-generation models for my startup Articoder, using around 8TB of code scraped from GitHub. I had some early successes, with automatically generated pull requests accepted by human users, and got past the YC application stage into the interview. The amount of VC funding for that was exactly zero. I filed a patent, put everything on hold, and went to work on AVs.
As to watching things not stick for multiple decades, you simply had too few people working on this, and no general availability of compute. It was a few tiny labs, with a few grad students and little to no compute available. Very few people had a supercomputer in their hands. In 2010, for example, a GPU rig like 2x GTX 470 (which could yield some 2 TFLOPS of performance) was an exception. And in the same year, the top conference, NeurIPS, had an attendance of around 600.
So, 99% of software development?
You could say that. No one has even a decade of experience with transformers. Most of this stuff is pretty new.
More broadly though, it’s because there aren’t great first principles reasons for why things work or not.
Why do you spend weeks adding something instead of directly testing all the later hypotheses?
In some cases you can directly test hypotheses like that, but more often than not, there isn’t a way to test without just trying.
The Farmer is Turkey's best friend. Turkey believes so because every day Farmer gives Turkey food, lots of food. Farmer also keeps the Turkey warm and safe from predators. Turkey predicts that the Farmer will keep on being his best friend also tomorrow. Until one day, around Thanksgiving, the prediction goes wrong, awfully wrong.
A Three-Body Problem reference, yeah.
A bit older than that. This joke predates Bell Labs.
But isn't all science basically like this? If you know your hypothesis works before you do the experiments, it's not science anymore.
You are right. That's why you want to avoid doing science, when you can.
Ideally, you want to be solving engineering problems instead.
You being? We really need people to try to solve problems that might not work because that is how new technology develops.
Let's rephrase: If you have to solve a problem, you'd better hope that problem is an engineering problem rather than a science problem.
Even many engineering problems are too difficult to know a priori that they will work. No new knowledge is needed as such, just the bounds on what is possible might be fuzzy.
Yes, of course.
If you care about solving your specific problem, you want to avoid having to acquire new knowledge. That's risky. (And I don't just mean knowledge that's new to you, that you get from reading a book. I mean knowledge that's new to humanity as far as you can tell.)
Eventually, someone will have to acquire some new knowledge to drive humanity forward. Just like in war, someone will have to go and fight the opponent; but you still prefer to achieve your objectives without having to fight whenever possible.
Not really. It’s easy to tell relative to existing methods whether the size of data will solve the problem. For example, if you’re trying to solve a classification problem with a large number of labels, but only have a small amount of training data for some (or all) of them, it will probably never work.
That is true. On the other hand, I once saw someone perform a trick which looked miraculous to me.
We had a classification problem with a small number of labels (~3), and one of the labels unfortunately had far fewer samples in our training set. Then someone trained a GAN to turn images of the abundant labels into images of the rare label. We added those synthetically generated images to the training set, and it improved our classification performance as best as we could tell.
That one still feels a bit like black magic to me to be honest. Almost as if we got more out of less with a trick.
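A minimal sketch of the augmentation step (the generator G is assumed to be already trained, CycleGAN-style; all names here are hypothetical):

    import torch
    from torch.utils.data import ConcatDataset, TensorDataset

    RARE_LABEL = 2  # hypothetical index of the under-represented class

    @torch.no_grad()
    def synthesize_rare_samples(G, abundant_images):
        # Translate abundant-class images into rare-class-looking ones.
        fake_images = G(abundant_images)               # (N, C, H, W) -> (N, C, H, W)
        fake_labels = torch.full((fake_images.shape[0],), RARE_LABEL)
        return TensorDataset(fake_images, fake_labels)

    # augmented = ConcatDataset([real_train_set,
    #                            synthesize_rare_samples(G, abundant_images)])

Whether that counts as getting more out of less is debatable: the GAN is effectively injecting its own prior about what the rare class looks like, learned from the same data.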
Transfer learning might help in some cases.
Yes, and science is "hard" compared to software development in a lot of ways. Less certainty of success and poorly defined success criteria.
Sometimes something you've done works, but you really don't know why/how. You then have to walk it back to figure out by experimenting what is causing it to work. I feel like this happen(ed|s) in chemistry a lot. Was it the fact that I stirred it counter clockwise this time, or that I got distracted and the temp went 5° hotter than intended, or that I didn't quite clean my beaker properly and some residue contaminated this batch, or any number of other steps.
You nailed where I currently stand. At my company I've been a jack of all trades but mostly software/dba work. My boss and I were very excited about ML when the hype cycle was taking off several years ago and completed a successful project. Fast forward to today, I got loaned out to another team that lost their data scientist, and for the first time in my career I'm having to say - "I don't think we can do what you want." To me the "science" part really stands out. I have a decent grasp of methodologies and tools, but after weeks of dissecting the issue my conclusion is that they just don't have enough useful data...
The situation is not bad then. Can they collect more data? Can they generate more data?
A more relevant question would be:
Is "not enough data" their problem, or the kind of data?
I'm not well-versed in this, or not as well as you are, but this has been my conclusion as well about a lot of ML project ideas from teams I've been on.
You need so much data to do useful things. Especially the magical kinds of things people tend to want to do. I think these types of datasets are on a scale most software developers typically don't see. Even with the data in hand, it's far from trivial to determine how to do something halfway useful with it.
It can be a very empirical art. If you can't generate more data at the time, you can sometimes invest in reviewing the hand-labeled ground truth to verify that no false classifications slipped by.
This is where good simulations are useful. If you can show that you encounter inference problems even in ideal, simple data scenarios, it's a strong signal that real data has little chance of doing better.
In general these are model identifiability issues.
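A toy sketch of that kind of simulation (the numbers are made up): even with a perfectly specified linear model, generous data, and zero bugs, two nearly collinear features make the individual coefficients unrecoverable; only their sum is identified.

    import numpy as np

    for seed in range(3):
        rng = np.random.default_rng(seed)
        x1 = rng.normal(size=10_000)
        x2 = x1 + rng.normal(scale=1e-6, size=10_000)     # nearly a copy of x1
        y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=10_000) # true coefficients: 2 and 3

        X = np.column_stack([x1, x2])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(coef)  # the two estimates swing wildly between seeds; their sum stays near 5

If your simulated ideal world already behaves like this, there is no point blaming the real data.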
Every data science team should have a wall decoration with John Tukey's quote "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."
I have always thought of ML (not DL) as being about phenomena that can be modelled mathematically.
It turns out that not all problems have a great mathematical model, self-driving cars for instance, and so the search continues...
Why would self-driving cars not have a great mathematical model?
Or do you mean that the models are either black boxes (like deep learning) that we don't understand, and the white box models are not good enough?
All of ML, including DL, is literally implemented using mathematical models. Alas, a model is just a model; that doesn't imply it works well, or that it's simple or easily discoverable.
I have the same sentiments about DIY electronic designs. If I take someone else's design and build it at home, I know it's my build skills that are lacking if it doesn't work, since there are already working examples. If I design a device myself, from the electronics to the software, I don't know whether the thing isn't working because of bugs in the code, problems with the build of the electronics, or a fundamental flaw in the design itself. At least not without a ton of time debugging it all.
However, we now have techniques for debugging electronics. Electronics tends to be designed to be decomposable into subunits, with some way to do unit testing. At least in the prototype, before it's shrunk for production. Test gear can be expensive, but it exists, all the way down to the wafer if needed.
That wasn't always the case. Electronics problems used to be more mysterious. The conquest of electronic design and debugging is what allows making really complex electronics that works. It really is amazing that smartphones work at all, with all those radios in that little case. That RF engineers can get a GPS receiver and a GSM transmitter to work a few centimeters apart is just amazing.
Machine learning isn't that far along yet. When it doesn't work, the tools for figuring out why are inadequate. It's not even clear yet if this is a technique problem which can be fixed with tooling, or an inherent problem with having a huge matrix of weights.
I never understood the black magic behind things like 4G until I saw a teardown of some pole equipment and saw the solid copper beam-forming cavities inside. Blew my mind.
Knowing that depends on your level of understanding of the field and the math behind it, and also on experience. If you just know how to make API calls, then it's hard.
What would be problematic if you want to do sentiment analysis for some product reviews? The result is the public perception within a margin of error; you have your data, you know what you want, you know how to get there.
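For that specific well-trodden case, the happy path really is short. A sketch assuming the Hugging Face transformers library and its default pretrained sentiment model (so, generic and English-only, not tuned to your domain):

    from transformers import pipeline

    # Off-the-shelf sentiment classifier (downloads a default pretrained model).
    classifier = pipeline("sentiment-analysis")

    reviews = [
        "Battery died after two weeks, very disappointed.",
        "Exactly what I needed, works great.",
    ]
    for review, result in zip(reviews, classifier(reviews)):
        print(result["label"], round(result["score"], 2), "-", review)

With enough background you can tell in advance that something like this will work within a reasonable margin of error.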
Well, even with a high level of understanding, any sufficiently advanced use case will still have some uncertainty regarding its "feasibility". Of course, you might think that some problems are "solved", e.g., OCR, translation, (common) object recognition, but MANY other problems exist where, no matter how experienced and knowledgeable you are, you can only have an educated guess as to whether a given model can achieve a given performance without actually trying it out.
Where experience and knowledge really pays off is in telling apart model performance from bugs. There is a real know-how in troubleshooting ML pipelines and models in general.
Welcome to the last year of work for me. Now I firmly believe that what I set out to do cannot be done. However, when I started, it seemed quite reasonable that the model I would build would be successful at its purpose.
On top of that, the vast majority of engineers and researchers who have joined the field only did so in the last few years.
Meanwhile, as with many other fields, it takes decades to get to the level of a well-rounded expert. One paper a day, one or two projects a year. It just takes time, no matter how brilliant or talented you are.
And then the research moves on. And more is different. The shift from GFLOPS to TFLOPS and then PFLOPS over a single decade is a seismic shift.
I had a fun project I tried once. I wanted to see if a neural network could be fed a Bitcoin public key and output the private key. To make things simple, I tried to see if it could even predict a single bit of the private key. 256 bits of input, 1 bit of output.
I created a set of 1000 public/private key pairs to act as the test set. Then I looped: generate a new set of 1000 key pairs, train for several epochs, then test on the test set.
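A minimal sketch of that loop (the network size and training details here are illustrative, not what I actually ran, and the key generation is reduced to a stand-in):

    import torch
    import torch.nn as nn

    def sample_keypair_bits(n):
        # Stand-in for the real thing: in the actual experiment these were
        # secp256k1 public keys (256 input bits) and one bit of the private key.
        pub = torch.randint(0, 2, (n, 256)).float()
        priv_bit = torch.randint(0, 2, (n, 1)).float()
        return pub, priv_bit

    model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(),
                          nn.Linear(512, 1))              # logit for one private-key bit
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    test_x, test_y = sample_keypair_bits(1000)            # fixed test set

    while True:
        train_x, train_y = sample_keypair_bits(1000)      # fresh key pairs each round
        for _ in range(5):                                # a few epochs on this batch
            opt.zero_grad()
            loss_fn(model(train_x), train_y).backward()
            opt.step()
        with torch.no_grad():
            acc = ((model(test_x) > 0) == test_y.bool()).float().mean().item()
        print(f"test accuracy: {acc:.3f}")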
After 3 days of training (granted, on a CPU, several years ago), the results on the test set did not converge. Nearly every trial on the test set was 47-53% correct. I think I had one run that was 60% correct, but that was likely pure luck. Do enough trials of 1000 coin flips and you'll likely find one where you get at least 600 heads.
Back to your original comment...Is what I was attempting to do even possible? Did my network need more layers? More nodes per layer? Or was it simply not possible?
Based on what I know about cryptography, it shouldn't be possible. A friend of mine said that for a basic feed-forward neural network to solve a problem, the problem has to be solvable via a massive polynomial that the training will suss out. Hashing does not have such a polynomial, or it would indicate the algorithm is broken.
But I still always wonder...what if I had a bigger network...
i.e. there are a lot more unknown unknowns, which take a lot more effort and intelligence to not stumble into haphazardly than in most other fields.
I think one issue, also, is that ML is so large as a field; it encompasses huge subfields, or related fields (statistics, optimization, etc...).