Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).
I found "magic" that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3KB bytes of the file to figure out the type. They had a hard limit that they wouldn't see past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?
Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did run them.
Overall, file takes about 6ms for a single file and 2.26ms per file when scanning multiple files. Magika is at 65ms for a single file and 5.3ms per file when scanning multiple files.
So in the worst case Magika is about 10x slower, due to the time it takes to load the model, and about 2x slower on repeated detections. This is why we said it is not that much slower.
We will have more performance measurements in the upcoming research paper. Hope that answers the question.
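If you want to reproduce that kind of measurement yourself, a rough harness looks something like this (a sketch only; it assumes the file and magika CLIs are installed and uses a hypothetical samples/ directory):

    import subprocess, time
    from pathlib import Path

    files = [str(p) for p in Path("samples").iterdir() if p.is_file()]

    def wall_ms(cmd):
        # Wall-clock time for one process invocation, in milliseconds.
        t0 = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        return (time.perf_counter() - t0) * 1000

    for tool in (["file", "--mime-type"], ["magika"]):
        single = wall_ms(tool + [files[0]])          # one process, one file
        batch = wall_ms(tool + files) / len(files)   # one process, many files
        print(f"{tool[0]}: {single:.1f} ms single, {batch:.2f} ms/file batched")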
Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?
That sounds like a nit / premature optimization.
Electricity is cheap. If this is actually important for your org, you should measure it yourself. There are too many variables and factors specific to your org's hardware.
Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.
What end users are working with arbitrary files that they don’t know the identification of?
This entire use case seems to be one suited for servers handling user media.
File managers that render preview images. Even detecting which software to open the file with when you click it.
Of course on Windows the convention is to use the file extension, but on other platforms the convention is to look at the file contents.
MacOS (that is, Finder) also looks at the extension. That has also been the case with any file manager I've used on Linux distros that I can recall.
You might be surprised. Rename your Photo.JPG as Photo.PNG and you'll still get a perfectly fine thumbnail. The extension is a hint, but it isn't definitive, especially when you start downloading from the web.
Theoretically? Anyone running a virus scanner.
Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.
You'd be surprised what an AV scanner would do.
https://twitter.com/taviso/status/732365178872856577
Several major players such as Norton, McAfee, and Symantec all at least claim to use AI/ML in their antivirus products.
Browsers often need to guess a file type.
I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.
Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in an efficient manner.
Block ads as well, they waste power.
This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.
Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.
What happens when we introduce more bespoke models for manipulating the data in that file?
This feels like it could slowly boil to the point of programs using orders of magnitude more power, at which point it'll be hard to claw it back.
That's a slippery slope argument, which is a common logical fallacy[0]. This model being inefficient compared to the best possible implementation does not mean that future additions will also be inefficient.
It's the equivalent of saying that many people programming in Ruby causes all future programs to be less efficient. Which is not true. In fact, many people programming in Ruby has caused Ruby to become more efficient, because it gets optimised as it gets used more (or Python, for that matter).
It's not as energy efficient as C, but it hasn't caused it to get worse and worse, and spiral out of control.
Likewise smart contracts are incredibly inefficient mechanisms of computation. The result is mostly that people don't use them for any meaningful amounts of computation, that all gets done "Off Chain".
Generative AI is definitely less efficient, but it's likely to improve over time, and indeed things like quantization have allowed models that would normally require much more substantial hardware resources (and therefore more energy) to run on smaller systems.
[0]: https://en.wikipedia.org/wiki/Slippery_slope
That is a fallacy fallacy. Just because some slopes are not slippery that does not mean none of them are.
We're already there. Modern software is, by and large, profoundly inefficient.
In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.
We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.
Since the data files can be large, this approach avoids having to transfer the file twice, first to the server and then to S3 after parsing.
This doesn't sound like a very common scenario.
The hardware requirements of a massively parallel algorithm can't possibly be "a nit" in any universe inhabited by rational beings.
Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?
Testing it on my own system, magika seems to use a lot more CPU time.
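A sketch of how that kind of comparison can be run (hypothetical paths; it counts child-process CPU time, which, unlike wall time, adds up work across all cores):

    import resource, subprocess
    from pathlib import Path

    files = [str(p) for p in Path("lib").rglob("*") if p.is_file()]

    def child_cpu_seconds(cmd):
        # getrusage(RUSAGE_CHILDREN) accumulates CPU time of finished child
        # processes, so the before/after delta isolates this one invocation.
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        subprocess.run(cmd, stdout=subprocess.DEVNULL)
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        return (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)

    print("file:  ", child_cpu_seconds(["file", "--mime-type"] + files), "s CPU")
    print("magika:", child_cpu_seconds(["magika"] + files), "s CPU")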
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.

I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.
The "magic" library does not seem to be equipped with the capabilities needed to be robust against the zip manifest being ordered in a different way than expected.
But this deep learning approach... I don't know. It might be hard to shoehorn into many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for the cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.
Many commenters seem to be using magic instead of file, any reasons?
magic is the core detection logic of file that was extracted out to be available as a library. So these days file is just a higher-level wrapper around magic.
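For example, from Python you can call the library directly instead of shelling out to the file CLI (a sketch using the third-party python-magic binding to libmagic):

    import magic  # pip install python-magic (a binding to libmagic)

    # Same detection logic the `file` command uses, without spawning a process.
    detector = magic.Magic(mime=True)
    print(detector.from_file("report.xlsx"))  # MIME type string, e.g. "application/zip"
    # depending on your libmagic version, OOXML files may instead come back
    # as the more specific vnd.openxmlformats-officedocument... type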
If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20
From the first paragraph of the post (detection within milliseconds):
Maybe your old-fashioned implementations were detecting in microseconds?
Yeah I saw that, but that could cover a pretty wide range and it's not clear to me whether that relies on preloading a model.
Then they could never detect zip files with certainty, given that to do that you need to read up to 65KB (+ 22 bytes) at the END of the file. The reason is that the zip archive format allows "garbage" bytes both at the beginning of the file and in between local file headers... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory record, which is always at the end of the file AND allows for a comment of unknown length after it (and since the comment length field takes 2 bytes, the comment can be up to 65K long).
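A sketch of that end-of-file check (hypothetical helper; the 65,535 + 22-byte window comes straight from the spec):

    import os

    EOCD_SIG = b"PK\x05\x06"
    MAX_TAIL = 65535 + 22  # max comment length + fixed size of the EOCD record

    def looks_like_zip(path: str) -> bool:
        # The End of Central Directory record must sit within the last
        # MAX_TAIL bytes, so only that window needs to be read.
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(max(0, size - MAX_TAIL))
            tail = f.read()
        return tail.rfind(EOCD_SIG) != -1

A stricter check would also validate the record's fields, but the point stands: the signature lives at the end of the file, not the beginning.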
That's why the whole question is ill formed. A file does not have exactly one type. It may be a valid input in various contexts. A zip archive may also very well be something else.