I had the opportunity to work on TinyML, and it's a wonderful field! You can do a lot even with very small hardware.
For example, it's possible to get a real-time computer vision system running on an ESP32-S3 (dual-core Xtensa LX7 @ 240 MHz, costs about $2), of course using the methods given in the article (pruning, quantization, knowledge distillation, etc.). The most important thing is to craft the model to fit your needs as closely as possible.
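To make the pruning part concrete, here is a minimal PyTorch sketch (not from the article; the toy CNN, the 28x28 input size, and the 50% ratio are just illustrative) of magnitude pruning before you export the model to the MCU:

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Toy CNN standing in for whatever model you want to shrink (assumes 28x28 inputs).
    model = nn.Sequential(
        nn.Conv2d(1, 8, 3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(8 * 28 * 28, 10),
    )

    # L1 magnitude pruning: zero out the 50% smallest weights of each conv/linear layer.
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")  # bake the pruning mask into the weights

Quantization and export (e.g. via TensorFlow Lite Micro or the vendor toolchains) would then follow.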
More than that, it's not that hard to get into, thanks to AutoML solutions that do a lot of the work for you. Check out tools like Edge Impulse [0], NanoEdge AI Studio [1], eIQ ML [2].
There is a lot of lower-level tooling too, like model compilers (TVM or Glow) and TensorFlow Lite Micro [3].
It's very likely that TinyML will get a lot more traction. A lot of hardware companies are starting to ship MCUs with NPUs to keep power consumption as low as possible: NXP with the MCX N94x, Alif Semiconductor [4], etc.
At work we wrote an article with a lot of information; it's in French, but you can check it out: https://rtone.fr/blog/ia-embarquee/
[1]: https://stm32ai.st.com/nanoedge-ai/
[2]: https://www.nxp.com/design/design-center/software/eiq-ml-dev...
One thing I've wondered in this space: Let's say for a really basic example I want to identify birds and houses. Is it better to make one large model that does both, or two small(er) models that each does one?
Why not three models? One model does basic feature detection, like lines, shapes, etc. A second model takes the first model's output as its input and identifies birds. A third model takes the first model's output as its input and identifies houses.
This is a lesson I've watched people and companies learn for the past 7-8 years.
An end-to-end model will always outperform a sequence of models designed to target specific features. You truncate information when you render the data from feature space (the much richer data inside the model) into output space (the model's output vector). That's the primary reason why, to do transfer learning, all layers are frozen, the final layer is chopped off, and the output of an internal layer is sent into the next model, not the output itself.
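For anyone following along, a minimal sketch of that transfer-learning recipe in PyTorch (ResNet-18 and the 2-class head are just illustrative, assuming torchvision is installed):

    import torch.nn as nn
    from torchvision import models

    pretrained = models.resnet18(weights="IMAGENET1K_V1")
    for p in pretrained.parameters():
        p.requires_grad = False        # freeze all pretrained layers

    pretrained.fc = nn.Identity()      # chop off the final 1000-class layer
    new_head = nn.Linear(512, 2)       # only this part is trained on the new task

    def forward(x):
        feats = pretrained(x)          # feature space: rich internal representation
        return new_head(feats)         # not the old output-space class scores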
Yes, you can create a large tree of smaller models, but the performance ceiling is still lower.
Please don't tell people to do this. I've seen millions wasted on it.
When you train a vision model, it will already develop a hierarchy of fundamental point and line detectors in the first few layers, and they will be particularly well chosen for the domain. It happens automatically; no need to manually put them there.
As someone not in ML but curious about the field, this is really interesting. Intuitively, it would indeed seem natural to aim for some sort of inspectable composition of models.
Is there specific tooling to inspect intermediate layers, or will they be unintelligible to humans?
The unending quest for "explainability" has yielded some tools, but it has been utterly overrun and outpaced by newer, more complicated architectures and unfathomably large models. (Banks, insurance, finance, etc. really want explainability for auditing.)
The early layers in a vision model are sort of interpretable. They look like lines and dots and scratchy patterns being composited. You can see the exact same features in the L1 and L2 biological neural networks of cats, monkeys, mice, etc. As you get deeper into the network, the patterns become really abstract. For a human, the best you can do is render a pattern of inputs that maximizes a target internal neuron's activation, to see what it detects.
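A rough sketch of that activation-maximization trick, assuming PyTorch/torchvision; the VGG16 layer index and channel number are arbitrary picks:

    import torch
    from torchvision import models

    model = models.vgg16(weights="IMAGENET1K_V1").eval()
    activation = {}
    # Grab the output of a mid-level conv layer with a forward hook.
    model.features[10].register_forward_hook(
        lambda _m, _i, out: activation.update(value=out)
    )

    img = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=0.05)

    for _ in range(200):
        opt.zero_grad()
        model(img)
        loss = -activation["value"][0, 7].mean()  # maximize channel 7's mean activation
        loss.backward()
        opt.step()
    # `img` now shows the scratchy pattern that unit responds to.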
You can sort of see what they represent in vision (dogs, fur, signs, faces, happy, sad, etc.), but once it's a multimodal model and there is time and language involved, it gets really difficult. And at that point you might as well just use the damn thing, or just ask it.
In finance, you can't tell what the fuck any of the feature detectors are. It's just very abstract.
As for tooling, a little bit of numpy and pytorch: dump some neuron weights to a PNG and there you go. Download a small pretrained convnet, and I bet GPT-4 can walk you through the process.
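Concretely, something like this dumps the first-layer filters of a pretrained convnet to a PNG (assuming torchvision and matplotlib; ResNet-18 is just a convenient example):

    import matplotlib.pyplot as plt
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1")
    filters = model.conv1.weight.detach()           # 64 filters of shape 3x7x7

    fig, axes = plt.subplots(8, 8, figsize=(8, 8))
    for ax, f in zip(axes.flat, filters):
        f = (f - f.min()) / (f.max() - f.min())     # normalize each filter to [0, 1]
        ax.imshow(f.permute(1, 2, 0).numpy())       # show as a 7x7 RGB patch
        ax.axis("off")
    fig.savefig("first_layer_filters.png")          # edge/color detectors show up here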
Ok, since we are at it, in your opinion:
Is it feasible for someone with a SWE background and a fair number of industry years to transition into ML without a deep dive into a PhD and publications to show?
I am considering following the fast.ai course or perhaps other MOOCs, but I am not sure whether any of this would reasonably be taken seriously within the field?
It is reasonable. If you have time and are willing to put in the effort, I can force-feed you resources, review code, and such. I've raised a few ML babies. MOOCs are probably the wrong way to go; that's where I started and I got stuck for a while. You really need to be knee-deep in code and a notebook.
As for getting jobs, I can't help you with that part. You'll have to do your own networking, etc.
gibsonmart1i3@gmail.com. Shoot me an email if you're serious and let's schedule a call.
Just emailed you. Thank you.
I'm genuinely confused at how you made these assumptions about what I'm describing, because the "more correct" design you contrast with the strawman you've concluded I'm describing is actually what I'm talking about, if perhaps imprecisely: a pretrained model like MobileNetV2 with its final layer removed, and custom models trained on bird and house images that take this mobilenetv2[:-1] output as input. MobileNetV2 is around 2 MB at 224x224, and these final bird and house layers will be kilobytes. Having two multi-megabyte models that are 95% identical is a giant waste of our embedded target's resources. It also means that a scheme that processed a single image with two full models (instead of one big, two small) would spend 95% of the second full model's processing time redundantly performing the same operations on the same data. Breaking the models up across two stages produces substantial savings in both processing time and flash storage, with a single big model as the "feature detection" first stage of both overall inferences and small specialized models as the second stage.
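In code, the setup I mean looks roughly like this (a sketch using PyTorch/torchvision for illustration; on the device the backbone would of course be an exported/quantized model, and the head sizes are made up):

    import torch.nn as nn
    from torchvision import models

    backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
    backbone.classifier = nn.Identity()     # mobilenetv2[:-1] -> 1280-d features
    for p in backbone.parameters():
        p.requires_grad = False

    # Kilobyte-scale task heads, trained on bird and house images respectively.
    bird_head = nn.Sequential(nn.Linear(1280, 64), nn.ReLU(), nn.Linear(64, 2))
    house_head = nn.Sequential(nn.Linear(1280, 64), nn.ReLU(), nn.Linear(64, 2))

    def classify(image):
        feats = backbone(image)                     # the expensive stage runs once
        return bird_head(feats), house_head(feats)  # both heads reuse the same features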
Sorry to upset you. It was not clear from your description that this was the process you were referring to, and others will read what you wrote and likely misunderstand it as I did. (Which was my concern, because I've seen the "mixture of idiots" architecture attempted since 2015. Even now... it's a common misconception and an argument every ML practitioner has at one point or another with a higher-up.)
As for your amendment: it is good to reduce compute when you can, and to reduce up-front model-creation effort when you can. Reusing models may be valid, but even with your amended process you will still end up short of the peak performance of a single end-to-end model trained on the right data. Composite models are simply worse, even when transfer learning is done correctly.
As for the compute cost, if you train an end-to-end model and then minify it to the same size as the sum of your composite models, it will have identical inference cost but higher peak accuracy.
You could even do that with the "shared backbone" architecture you've described, where two tail networks share a common backbone. It has been attempted thoroughly in the deep reinforcement learning subdomain I'm most familiar with, and it results in unnecessary performance loss, so it's not generally done anymore.
Man, everyone at work is going to be really bummed when I tell them that some guy on the internet has invalidated our empirical evidence of acceptable accuracy and performance with assumptions and appeals to authority.
Great post. Surprised and excited to discover TensorFlow models can run on commodity hardware like the ESP32.
I ended up hand-rolling a custom MicroPython module for the S3 to do a proof-of-concept handwriting detection demo on an ESP32; it might be interesting to some.
https://luvsheth.com/p/running-a-pytorch-machine-learning
Great post with very interesting detail, thanks! Another optimization could be to quantize the model, which turns all the compute into integer math instead of floating point. You can lose some accuracy, but for any bigger model it's a requirement! Espressif does a great job on the TinyML side; they have different libraries at different levels of abstraction. You can check https://github.com/espressif/esp-nn, which implements all the low-level layers. It's really optimized, and if you use the ESP32-S3 it will unlock a lot of performance by using the vector instructions.
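For reference, here is a sketch of post-training int8 quantization with the TensorFlow Lite converter (one common route to the ESP32 via TFLite Micro); the saved-model path and the random calibration tensors are placeholders for your real model and data:

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    def representative_data_gen():
        # Feed a few hundred *real* samples so the converter can pick int8 scales;
        # random tensors here only keep the sketch self-contained.
        for _ in range(100):
            yield [tf.random.normal([1, 96, 96, 1])]

    converter.representative_dataset = representative_data_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8       # fully integer inputs/outputs
    converter.inference_output_type = tf.int8

    with open("model_int8.tflite", "wb") as f:
        f.write(converter.convert())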
You're right, I should definitely be looking into how to run these models as ints as well; especially with the C optimizations to MicroPython, you would see much larger performance gains using ints compared to floats. I definitely need to find some time to try it!
On the other hand, the TinyML library looks great too, and if I were doing this for a product that would likely be the direction I'd end up taking, just because it would be more extensible and better supported.
Thank you for the links!
Problems reducible even partially to matrix math are, for many practical purposes, embarrassingly parallel even within a single core. A couple hundred million FLOPS with 1990s-era SIMD support will let you run nearly all near-SOTA models within, idk, 3 s, with most running in 0.1 or 0.01 s. That's pretty fast considering it's an ESP32 and some of these capabilities/models didn't even exist a year ago.
Your expectation was not really wrong, because for most purposes, when discussing a "model" one is really talking about "capabilities". And capabilities often require many calls to the model. And that capability may rely on being refreshed very rapidly... and now your 0.1 s is not even slow, it's almost existentially slow.
Re: training. Even on the ESP32, training is entirely doable, as long as you pretend you are in 2011 solving 2011 problems, hahaha.
Most MCUs have no FPU, so all floating-point compute is emulated in software, which is really slow. But yes, simple SIMD on integers improves performance a lot!
The main limitation is often not the processing time but the available RAM: some model architectures need to keep multiple layers, or very big layers, in RAM, and you hit that hard limit pretty quickly.
Concerning training on MCUs, it's possible, but only for simple needs and special model architectures; again, RAM is the limit.
Thank you for the post and the good work.
Can I ask: is the focus primarily on inference? Is there anything serious going on with training at the power scale you are talking about?
Thanks!
Yes, the main focus is on inference. It's possible to re-train a simple model at this power scale, but it's often a very small model and not deep learning. NanoEdge AI Studio from STMicroelectronics gives you some tools to train the model after deployment on the device.
It's often used for predictive maintenance, for example to adapt each ML model to the specific water pump it's plugged into.
What about the Milk-V Duo? 0.5 TOPS INT8 @ $5.
I didn't know about it, but their design decisions are really cool (though the difference between the normal version and the "256 MB" one is not very clear, which is confusing).
The software side doesn't seem very mature, with very little help regarding TinyML. But this course seems interesting: https://sophon.ai/curriculum/description.html?category_id=48
I think we know each other. ;)