Magika: AI powered fast and efficient file type identification

stevepike
31 replies
16h34m

Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).

I found "magic" that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3KB bytes of the file to figure out the type. They had a hard limit that they wouldn't see past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?

ebursztein
21 replies
12h30m

Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did take them.

Overall, file takes about 6ms on a single file and 2.26ms per file when scanning multiple files. Magika is at 65ms on a single file and 5.3ms per file when scanning multiple files.

So in the worst-case scenario Magika is about 10x slower, due to the time it takes to load the model, and about 2x slower on repeated detections. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answers the question.

jpk
19 replies
11h59m

Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?

alephnan
18 replies
11h33m

That sounds like a nit / premature optimization.

Electricity is cheap. If this is sufficiently or actually important for your org, you should measure it yourself. There are too many variables and factors subject to your org’s hardware.

djxfade
16 replies
11h15m

Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.

true_religion
7 replies
10h5m

What end users are working with arbitrary files that they don’t know the identification of?

This entire use case seems to be one suited for servers handling user media.

wongarsu
2 replies
6h50m

File managers that render preview images. Even detecting which software to open the file with when you click it.

Of course on Windows the convention is to use the file extension, but on other platforms the convention is to look at the file contents

michaelmior
1 replies
4h34m

> on other platforms the convention is to look at the file contents

MacOS (that is, Finder) also looks at the extension. That has also been the case with any file manager I've used on Linux distros that I can recall.

jdiff
0 replies
4h6m

You might be surprised. Rename your Photo.JPG as Photo.PNG and you'll still get a perfectly fine thumbnail. The extension is a hint, but it isn't definitive, especially when you start downloading from the web.

michaelt
2 replies
9h1m

Theoretically? Anyone running a virus scanner.

Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.

scq
0 replies
7h28m

You'd be surprised what an AV scanner would do.

https://twitter.com/taviso/status/732365178872856577

michaelmior
0 replies
4h32m

> it's arguably unlikely a virus scanner would opt for an ML-based approach

Several major players such as Norton, McAfee, and Symantec all at least claim to use AI/ML in their antivirus products.

r0ze-at-hn
0 replies
8h23m

Browsers often need to guess a file type

vertis
4 replies
8h9m

I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.

Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in an efficient manner.

Block ads as well, they waste power.

This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.

_puk
3 replies
7h46m

Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.

What happens when we introduce more bespoke models for manipulating the data in that file?

This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.

vertis
1 replies
6h15m

That's a slippery slope argument, which is a common logical fallacy[0]. This model being inefficient compared to the best possible implementation does not mean that future additions will also be inefficient.

It's equivalent to saying that many people programming in Ruby causes all future programs to be less efficient. Which is not true. In fact, many people programming in Ruby has caused Ruby to become more efficient, because it gets optimised as it gets used more (the same goes for Python).

It's not as energy efficient as C, but it hasn't caused it to get worse and worse, and spiral out of control.

Likewise smart contracts are incredibly inefficient mechanisms of computation. The result is mostly that people don't use them for any meaningful amounts of computation, that all gets done "Off Chain".

Generative AI is definitely less efficient, but it's likely to improve over time, and indeed things like quantization have allowed models that would normally require much more substantial hardware resources (and therefore more energy) to be run on smaller systems.

[0]: https://en.wikipedia.org/wiki/Slippery_slope

diffeomorphism
0 replies
3h47m

That is a fallacy fallacy. Just because some slopes are not slippery that does not mean none of them are.

thfuran
0 replies
3h52m

> This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.

We're already there. Modern software is, by and large, profoundly inefficient.

underdeserver
2 replies
9h57m

In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.

prmph
1 replies
9h21m

We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.

Since the data files can be large, this approach avoids having to transfer the file twice, first to the server, and then to S3 after parsing.

DontSignAnytng
0 replies
4h2m

This doesn't sound like a very common scenario.

cornholio
0 replies
5h58m

The hardware requirements of a massively parallel algorithm can't possibly be "a nit" in any universe inhabited by rational beings.

chmod775
0 replies
3h24m

Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?

Testing it on my own system, magika seems to use a lot more CPU-time:

    file /usr/lib/*  0,34s user 0,54s system 43% cpu 2,010 total
    ./file-parallel.sh  0,85s user 1,91s system 580% cpu 0,477 total
    bin/magika /usr/lib/*  92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.

metafunctor
3 replies
12h6m

I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.

The "magic" library does not seem to be equipped with the capabilities needed to be robust against the zip manifest being ordered in a different way than expected.

But this deep learning approach... I don't know. It might be hard to shoehorn in to many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.
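
A rough sketch of what that kind of layer ends up looking like (Python with the standard zipfile module; the marker prefixes and MIME strings are the usual OOXML ones, not my actual code). Checking entry names rather than entry order keeps it robust to manifests being written in any order:

    import zipfile

    OFFICE_MARKERS = {
        "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "xl/":   "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "ppt/":  "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    }

    def refine_zip(path):
        # Only called once "magic" has already said application/zip.
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
        if "[Content_Types].xml" not in names:
            return "application/zip"
        for prefix, mime in OFFICE_MARKERS.items():
            if any(n.startswith(prefix) for n in names):
                return mime
        return "application/zip"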

comboy
1 replies
7h11m

Many commenters seem to be using magic instead of file, any reasons?

e1g
0 replies
6h29m

magic is the core detection logic of file that was extracted out to be available as a library. So these days file is just a higher level wrapper around magic

stevepike
0 replies
4h35m

If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20

renonce
2 replies
16h3m

From the first paragraph:

> enabling precise file identification within milliseconds, even when running on a CPU.

Maybe your old-fashioned implementations were detecting in microseconds?

stevepike
1 replies
15h59m

Yeah I saw that, but that could cover a pretty wide range and it's not clear to me whether that relies on preloading a model.

ryanjshaw
0 replies
12h59m

> At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.

brabel
1 replies
6h24m

> They had a hard limit that they wouldn't see past the first 256 bytes.

Then they could never detect zip files with certainty, given that to do that you need to read up to 65KB (+ 22 bytes) at the END of the file. The reason is that the zip archive format allows "garbage" bytes both at the beginning of the file and in between local file headers... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory record, which is always at the end of the file AND allows for a comment of unknown length at the end (and as the comment length field takes 2 bytes, the comment can be up to 65K long).
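
For illustration, a minimal sketch of that end-of-file check in Python (rfind because the trailing comment could itself contain the signature bytes; a real parser would go on to validate the record's fields and the central directory it points to):

    import os

    EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature

    def looks_like_zip(path):
        size = os.path.getsize(path)
        tail_len = min(size, 65535 + 22)  # max comment length + fixed EOCD size
        with open(path, "rb") as f:
            f.seek(size - tail_len)
            tail = f.read(tail_len)
        return tail.rfind(EOCD_SIG) != -1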

jeffbee
0 replies
3h20m

That's why the whole question is ill formed. A file does not have exactly one type. It may be a valid input in various contexts. A zip archive may also very well be something else.

m0shen
15 replies
15h5m

As someone that has worked in a space that has to deal with uploaded files for the last few years, and someone who maintains a WASM libmagic Node package ( https://github.com/moshen/wasmagic ) , I have to say I really love seeing new entries into the file type detection space.

Though I have to say when looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also as others have mentioned. The model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Where libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.

m0shen
4 replies
14h42m

Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'

barrkel
2 replies
5h44m

Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.

m0shen
0 replies
3h57m

My little script is trying to identify in bulk, at least by passing 165 file paths to `magika`, and `file`.

Though, I absolutely agree with you. I think realistically it's better to do this kind of thing in a library rather than shell out to it at all. I was just trying to get an idea on how it generally compares.

Another note: I was trying to be generous to `magika` here, because for single file identification it's about 160-180ms on my machine vs <1ms for `file`. I realize that's going to be quite a bit of Python startup in that number, which is why I didn't go with it when pushing that benchmark up earlier. I'll probably push an update to that gist to include the single file benchmark as well.

chmod775
0 replies
3h50m

Going by those numbers it's taking almost a second to run, not 10s of ms. And going by those numbers, it's doing something massively parallel in that time. So basically all your cores will spike to 100% for almost a second during those one-shot identifications. It looks like GP has a CPU with 12-16 threads, and it is using those while still being 30 times slower than single-threaded libmagic.

That tool needs 100x more CPU time just to figure out some filetypes than vim needs to open a file from a cold start (which presumably includes using libmagic to check the type).

If I had to wait a second just to open something during which that thing uses every resource available on my computer to the fullest, I'd probably break my keyboard. Try using that thing as a drop-in file replacement, open some folder in your favorite file manager, and watch your computer slow to a crawl as your file manager tries to figure out what thumbnails to render.

It's utterly unsuitable for "interactive" identifications.

m0shen
0 replies
1h14m

I've updated this script with some single-file cli numbers, which are (as expected) not good. Mostly just comparing python startup time for that.

    make
    sqlite3 < analyze.sql
    file_avg              python_avg         python_x_times_slower_single_cli
    --------------------  -----------------  --------------------------------
    0.000874874856301821  0.179884610224334  205.611818568799
    file_avg            python_avg     python_x_times_slower_bulk_cli
    ------------------  -------------  ------------------------------
    0.0231715865881818  0.69613745142  30.0427184289163

tudorw
1 replies
4h2m

Can we do the 1600 if known, and if not, let the AI take a guess?

m0shen
0 replies
2h37m

Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.

Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.

michaelt
1 replies
8h37m

> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.

theon144
0 replies
5h7m

Okay, but your one file type is more likely to be included in the 1600 that libmagic supports rather than Magika's 116?

For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).

I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.

lebean
1 replies
14h51m

It's for researchers, probably.

m0shen
0 replies
14h35m

Yeah, there is this line:

    By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.

invernizzi
1 replies
10h24m

Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model for each file. Other than that, it's just as fast for single file lookups, which is the most common use case.

m0shen
0 replies
3h50m

Good to know! Thank you. I'll definitely be trying it out. Though, I might download and hardcode the model ;)

I also appreciate the use of ONNX here, as I'm already thinking about using another version of the runtime.

Do you think you'll open source your F1 benchmark?

ebursztein
1 replies
12h37m

We did release the npm package because we did indeed create a web demo and thought people might want to use it too. We know it is not as fast as the Python version or a C++ version -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files

m0shen
0 replies
12h12m

Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the python cli, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!

lopkeny12ko
13 replies
13h8m

I don't understand why this needs to exist. Isn't file type detection inherently deterministic? A valid tar archive will always have the same magic bytes. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing in "heuristics" and probabilistic inference into a process that is black and white by design?
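
For concreteness, those deterministic checks look something like this (a minimal Python sketch; note that tar's "ustar" magic actually sits at offset 257, and that zip's magic is shared by docx, jar, apk and friends, which is where the ambiguity comes from):

    def sniff(path):
        with open(path, "rb") as f:
            head = f.read(512)
        if head.startswith(b"\x7fELF"):
            return "ELF binary"
        if head.startswith(b"PK\x03\x04"):
            return "zip container (or docx/jar/apk/...)"
        if head[257:262] == b"ustar":
            return "tar archive"
        return "unknown"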

potatoman22
6 replies
13h1m

This also works for formats like Python, HTML, and JSON.

LiamPowell
3 replies
12h36m

file (https://www.darwinsys.com/file/) already detects all these formats.

ebursztein
2 replies
12h33m

Indeed, but as pointed out in the blog post, file is significantly less accurate than Magika. There are also some file types that we support and file doesn't, as reported in the table.

LiamPowell
1 replies
12h29m

I can't immediately find the dataset used for benchmarking. Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?

schleck8
0 replies
11h48m

> Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?

That's not the point of file type guessing, is it? Google employs it as an additional security measure for user-submitted content, which absolutely makes sense given what malware devs do with file types.

lopkeny12ko
0 replies
3h35m

I still don't see how this is useful. The only time I want to answer the question "what type of file is this" is if it is an opaque blob of binary data. If it's a plain text file like Python, HTML, or JSON, I can figure that out by just catting the file.

amelius
0 replies
7h10m

Yes, but shouldn't the file type be part of the file, or (better) of the metadata of the file?

Knowing is better than guessing.

vintermann
1 replies
10h13m

Consider, it's perfectly possible for a file to fit two or more file formats - polyglot files are a hobby for some people.

And there are also a billion formats that are not uniquely determined by magic bytes. You don't have to go further than text files.

KOLANICH
0 replies
8h1m

This tool doesn't work this way.

TacticalCoder
1 replies
5h58m

> What's the value in throwing in "heuristics" and probabilistic inference into a process that is black and white by design?

I use the file command all the time. The value is when you get this:

    ... $  file somefile.xyz
    somefile.xyz: data
AIUI from reading TFA, magika can determine more filetypes than what the file command can detect.

It'd actually be very easy to determine if there's any value in magika: run file on every file on your filesystem and then for every file where the file command returns "data", run magika and see if magika is right.

If it's right, there's your value.

P.S: it may also be easier to run on Windows than the file command? But then I can't do much to help people who are on Windows.

Eiim
0 replies
4h2m

From elsewhere in this thread, it appears that Magika detects far fewer file types than file (116 vs ~1600), which makes sense. For file, you just need to drop in a few rules to add a new, somewhat obscure type. An AI approach like Magika will need lots of training and test data for each new file type. Where Magika might have a leg up is with distinguishing different textual data files (i.e., source code), but I don't see that as a particularly big use case honestly.

cle
0 replies
5h25m

It's not always deterministic, sometimes it's fuzzy depending on the file type. Example of this is a one-line CSV file. I tested one case of that, libmagic detects it as a text file while magika correctly detects it as a CSV (and gives a confidence score, which is killer).

alkonaut
0 replies
4h42m

But even with determinism, it's not always right. It's not too rare to find a text file with a byte order mark indicating UTF16 (0xFE 0xFF) but then actually containing utf-8. But what "format" does it have then? Is it UTF-8 or UTF-16? Same with e.g. a jar file missing a manifest. That's just a zip, even though I'm sure some runtime might eat it.

But the question is when you have the issue of having to guess the format of a file? Is it when reverse engineering? Last time I did something like this was in the 90's when trying to pick apart some texture from a directory of files called asset0001.k and it turns out it was a bitmap or whatever. Fun times.

TomNomNom
13 replies
4h58m

This looks cool. I ran this on some web crawl data I have locally, so: all files you'd find on regular websites; HTML, CSS, JavaScript, fonts etc.

It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".

I like the idea, but the current implementation can't be relied on IMO; especially not for automation.

A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
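
For reference, the usual convention is to only emit escapes when stdout is a terminal -- a minimal sketch:

    import sys

    def colored(text, code="1;37"):
        # Only wrap in ANSI escapes when writing to a TTY, not a pipe or file.
        if sys.stdout.isatty():
            return f"\x1b[{code}m{text}\x1b[0m"
        return text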

michaelmior
6 replies
4h36m

> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)

TomNomNom
4 replies
4h0m

It provided the wrong file-types for some files, so I cannot rely on its output to be correct.

If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.

michaelmior
3 replies
3h28m

Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.

jdiff
2 replies
3h23m

Nobody's asking for perfection. But the AI is offering inexplicable and obvious nondeterministic mistakes that the traditional algorithms don't suffer from.

Magika goes wrong and your fonts become audio files and nobody knows why. Magic goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.

jsnell
1 replies
2h48m

Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.

TeMPOraL
0 replies
57m

Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Addition and multiplication of floats are not associative - you may get different results depending on the order in which you add/multiply numbers.
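
A two-line demonstration in Python (doubles; the same effect applies to parallel reductions done in whatever order the hardware schedules them):

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 is lost below the precision of 1e16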

jdiff
0 replies
4h9m

I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?

ebursztein
5 replies
4h19m

Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue -- send us an email at magika-dev@google.com if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best effort to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika this is very useful.

TomNomNom
3 replies
4h5m

Sure thing :)

Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff

These are files that were in one of my crawl datasets.

[0]: https://poc.lol/files/magika-test.tgz

ebursztein
1 replies
4h0m

Thank you - we are adding them to our test suite for the next version.

TomNomNom
0 replies
3h52m

Super, thank you! I look forward to it :)

I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
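
A quick illustration of that ambiguity (Python; validate=True restricts the input to the base64 alphabet, yet ordinary words still "decode" without error):

    import base64, binascii

    for word in ("word", "Saturday", "cat"):
        try:
            print(word, "->", base64.b64decode(word, validate=True))
        except binascii.Error:
            print(word, "-> not valid base64")  # only "cat" fails (bad length/padding)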

beeboobaa
0 replies
2h48m

Do you have permission to redistribute these files?

westurner
0 replies
1h2m

What is the MIME type of a .tar file; and what are the MIME types of the constituent concatenated files within an archive format like e.g. tar?

hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

File signature: https://en.wikipedia.org/wiki/File_signature

PhotoRec: https://en.wikipedia.org/wiki/PhotoRec

"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/

Cross-reference table of formats: https://formats.kaitai.io/xref.html

AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...

Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression

Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database

sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...

clamav sigtool: https://www.google.com/search?q=clamav+sigtool

https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :

    sigtool --find-sigs "$name" | sigtool --decode-sigs
List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then also clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) Static and Dynamic Analysis tools, and `debsums`/`rpm -Va` verify that files on disk have the same (GPG signed) checksums as the package they are supposed to have been installed from, and a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good, and ~gdesktop LLM tools scan every file, and there are extended filesystem attributes for label-based MAC systems like SELinux, oh and NTFS ADS.

A sufficient cryptographic hash function yields random bits with uniform probability. DRBG Deterministic Random Bit Generators need high entropy random bits in order to continuously re-seed the RNG random number generator. Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

https://github.com/google/osv.dev/blob/master/README.md#usin... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

... With package metadata, not (a file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

Add'l useful formats:

> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories

Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage

lifthrasiir
8 replies
16h30m

I'm extremely confused about the claim that other tools have a worse precision or recall for APK or JAR files which are very much regular. Like, they should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and APK would need `classes.dex` as well, but at this point there is no other format that can be confused with APK or JAR I believe. I'd like to see which file was causing unexpected drop on precision or recall.
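
For reference, the deterministic rule I have in mind is roughly this (a Python sketch with the standard zipfile module; as a reply below notes, manifest-less JARs do exist and would fall through to plain "zip" here):

    import zipfile

    def classify(path):
        if not zipfile.is_zipfile(path):
            return "not a zip"
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
        if "classes.dex" in names:
            return "apk"
        if "META-INF/MANIFEST.MF" in names:
            return "jar"
        return "zip"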

supriyo-biswas
4 replies
13h57m

The `file` command checks only the first few bytes, and doesn’t parse the structure of the file. APK files are indeed reported as Zip archives by the latest version of `file`.

m0shen
3 replies
12h58m

This is false in every sense for https://www.darwinsys.com/file/ (probably the most used file version). It depends on the magic for a specific file, but it can check any part of your file. Many Linux distros are years out of date, you might be using a very old version.

FILE_45:

    ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
    ../../OpenCalc.v2.3.1.apk: Android package (APK), with zipflinger virtual entry, with APK Signing Block

supriyo-biswas
2 replies
12h36m

Interesting! I checked with file 5.44 from Ubuntu 23.10 and 5.45 on macOS using homebrew, and in both cases, I got “Zip archive data, at least v2.0 to extract” for the file here[1]. I don’t have an Android phone to check and I’m also not familiar with Android tooling, so is this a corrupt APK?

[1] https://download.apkpure.net/custom/com.apkpure.aegon-319781...

m0shen
1 replies
12h6m

That doesn't appear to be a valid link. Try building `file` from source and using the provided default magic database.

supriyo-biswas
0 replies
11h49m

I also tried this with the sources of file from the homepage you linked above, and I still get the same results.

You could try this for yourself using the same APKPure file which I uploaded at the following alternative link[1]. Further, while this could be a corrupt APK, I can’t see any signs of that from a cursory inspection as both the `classes.dex` and `META-INF` directory are present, and this is APKPure’s own APK, instead of an APK contributed for an app contributed by a third-party.

[1] https://wormhole.app/Mebmy#CDv86juV9H4aRCL2DSJeDw

charcircuit
0 replies
15h22m

apks are also zipaligned so it's not like random users are going to be making them either

Someone
0 replies
7h6m

People do create JAR files without a META-INF/MANIFEST.MF entry.

The tooling even supports it. https://docs.oracle.com/en/java/javase/21/docs/specs/man/jar...:

  -M or --no-manifest
     Doesn't create a manifest file for the entries

HtmlProgrammer
0 replies
15h18m

Minecraft mods 14 years ago used to tell you to open the JAR and delete the META-INF when installing them, so you can't rely on that one…

thorum
6 replies
16h57m

s1mon
5 replies
16h0m

It's surprising that there are so many file types that seem relatively common which are missing from this list. There are no raw image file formats. There's nothing for CAD - either source files or neutral files. There's no MIDI files, or any other music creation types. There's no APL, Pascal, COBOL, assembly source file formats etc.

_3u10
1 replies
15h37m

No tracker / .mod files either, just use file.

ebursztein
0 replies
12h33m

Thanks for the list, we will probably try to extend the list of supported formats in future revisions.

vintermann
0 replies
10h16m

Well, what they used this for at Google was apparently scanning their users' files for things they shouldn't store in the cloud. Probably they don't care much about MIDI.

photoGrant
0 replies
12h1m

Yeah this quickly went from 'additional helpful tool in the kit' to 'probably should use something else first'

kevincox
0 replies
6h44m

Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

rfl890
6 replies
16h32m

We have had file(1) for years

samtheprogram
4 replies
16h24m

This is beyond what file is capable of. It’s also mentioned in the third paragraph.

RTFA.

Vogtinator
1 replies
10h10m

FWICT file is more capable, predictable and also faster while being more energy-efficient at the same time.

Majestic121
0 replies
4h29m

That's not what the performance table in the article implies: Magika's precision and recall hover around 99%, while magic is at 92% precision and 72% recall.

One can doubt the representativeness of their dataset, but if what is in the article is correct, Magika is clearly way more capable and predictable.

wruza
0 replies
15h20m

Some HN readers may not know about file(1) even. It's fine to mention that $subj enhances that, but the rtfa part seems pretty unnecessary.

NoGravitas
0 replies
2h0m

Yes, it's slower than file(1), uses more energy, recognizes fewer file types, and is less accurate.

aitchnyu
0 replies
12h56m

Nearly 20 years back, this group of Linux users used to brag that Linux would identify files even if you changed the extension, while Windoze needed to police you about changing extensions.

semitones
5 replies
16h46m

Is it really common enough for files not to be annotated with a useful/correct file type extension (e.g. .mp3, .txt) that a library like this is needed?

callalex
1 replies
16h17m

Nothing is ever simple. Even for the most basic .txt files it's still useful to know what the character encoding is (utf? 8/16? Latin-whatever? etc.) and what the line endings are (\n, \r\n, or \r), as well as determining if some maniac removed all the indentation characters and replaced them with a mystery number of spaces.

Then there are all the container formats that have different kinds of formats embedded in them (mov,mkv,pdf etc.)
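
On the first point, a rough sketch of the kind of sniffing involved (Python, heuristics only -- BOM detection plus line-ending checks; anything without a BOM is just assumed UTF-8 here):

    def sniff_text(raw: bytes):
        if raw.startswith(b"\xef\xbb\xbf"):
            enc = "utf-8 (with BOM)"
        elif raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            enc = "utf-16"
        else:
            enc = "utf-8 (assumed)"
        if b"\r\n" in raw:
            eol = "CRLF"
        elif b"\r" in raw:
            eol = "CR"
        else:
            eol = "LF"
        return enc, eol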

cole-k
0 replies
10h55m

A fun read in service of your first point: https://en.wikipedia.org/wiki/Bush_hid_the_facts

m0shen
0 replies
12h53m

At multiple points in my career I've been responsible for APIs that accept PDFs. Many non-tech-savvy people, seeing this, will just change the extension of the file they're uploading to `.pdf`.

To make matters worse, there is some business software out there that will actually bastardize the PDF format and put garbage before the PDF file header. So for some things you end up writing custom validation and cleanup logic anyway.
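
For example, the cleanup side ends up looking roughly like this (a sketch; in practice PDF readers tolerate a %PDF- header anywhere in about the first kilobyte, so the fix is to find it and slice off whatever precedes it):

    def strip_leading_junk(raw: bytes, window: int = 1024) -> bytes:
        idx = raw[:window].find(b"%PDF-")
        if idx == -1:
            raise ValueError("no PDF header found")
        return raw[idx:]  # drop whatever the business software prepended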

hiddencost
0 replies
16h41m

malware can intentionally obfuscate itself

SnowflakeOnIce
0 replies
15h37m

Yes!

Sometimes a file has no extension. Other times the extension is a lie. Still other times, you may be dealing with an unnamed bytestring and wish to know what kind of content it is.

This last case happens quite a lot in Nosey Parker [1], a detector of secrets in textual data. There, it is possible to come across unnamed files in Git history, and it would be useful to the user to still indicate what type of file it seems to be.

I added file type detection based on libmagic to Nosey Parker a while back, but it's not compiled in by default because libmagic is slow and complicates the build process. Also, libmagic is implemented as a large C library whose primary job is parsing, which makes the security side of me jittery.

I will likely add enabled-by-default filetype detection to Nosey Parker using Magika's ONNX model.

[1] https://github.com/praetorian-inc/noseyparker

krick
4 replies
13h53m

What are the use cases for this? I mean, obviously detecting the filetype is useful, but we kinda already have plenty of tools to do that, and I cannot imagine why we need some "smart" way of doing this. If you are not a human, and you are not sure what it is (like an unknown file being uploaded to a server), you would be better off just rejecting it completely, right? After all, there's absolutely no way an "AI powered" tool can be more reliable than some dumb, err-on-the-safer-side heuristic, and you wouldn't want to trust that thing to protect you from malicious payloads.

nindalf
2 replies
13h32m

> no way an "AI powered" tool can be more reliable

The article provides accuracy benchmarks.

> you would be better off just rejecting it completely

They mention using it in gmail and Drive, neither of which have the luxury of rejecting files willy-nilly.

fuzztester
1 replies
6h11m

I have not tried it recently, but IIRC, Gmail does reject attachments which are zip files, for security reasons.

wildrhythms
0 replies
3h49m

Gmail nukes zips if they contain an executable or some other 'prohibited' file type. Most email providers block executable attachments.

n2d4
0 replies
13h23m

Virus detection is mentioned in the article. Code editors need to find the programming language for syntax highlighting of code before you give it a name. Your desktop OS needs to know which program to open files with. Or, recovering files from a corrupted drive. Etc

It's easy to distinguish, say, a PNG from a JPG file (or anything else that has well-defined magic bytes). But some files look virtually identical (eg. .jar files are really just .zip files). Also see polyglot files [1].

If you allow an `unknown` label or human intervention, then yes, magic bytes might be enough, but sometimes you'd rather have a 99% chance to be right about 95% of files vs. a 100% chance to be right about 50% of files.

[1] https://en.wikipedia.org/wiki/Polyglot_(computing)

johnea
4 replies
15h39m

The results of which you'll never be 100% sure are correct...

wruza
1 replies
15h32m

They missed such an opportunity to name it "fail". It's like "file" but with "ai" in it.

tamrix
0 replies
14h40m

What about faile?

rfoo
0 replies
14h6m

But file(1) is already like that - my data files without headers are reported randomly as disk images, compressed archives or even executables for never-heard-of machines.

plesiv
0 replies
10h54m

Other methods use heuristics to guess many filetypes and in the benchmark they show worse performance (in terms of precision). Assuming benchmarks are not biased, the fact that this approach uses AI heuristics instead of hard-coded heuristics shouldn't make it strictly worse.

awaythrow999
4 replies
9h28m

Wonder how this would handle a polyglot[0][1] that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver, which hosts Kaitai Struct's WebIDE, allowing you to view the file's own annotated bytes.

[0]: https://www.alchemistowl.org/pocorgtfo/

[1]: https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf

Edit: just tested, and it does only identify the zip layer

rvnx
3 replies
9h20m

You can try it here: https://google.github.io/magika/

It's relatively limited compared to `file` (~10% coverage); it's more like a specialized classifier for basic file formats, so such cases are really out of scope.

I guess it's more for detecting common file formats with high recall.

However, where is the actual source of the model? Let's say I want to add a new file format myself.

Apparently only the source of the interpreter is here, not the source of the model nor the training set, which is the most important thing.

tempay
0 replies
8h13m

Is there anything about the performance on unknown files?

I've tried a few that aren't "basic" but are widely used enough to be well supported in libmagic and it thinks they're zip files. I know enough about the underlying formats to know they're not using zip as a container under-the-hood.

kevincox
0 replies
6h48m

Apparently the Super Mario Bros. 3 ROM is 100% a SWF file.

Cool that you can use it online though. Might end up using it like that. Although it seems like it may focus on common formats.

alexandreyc
0 replies
8h57m

Yes, I totally agree; it's not what I would qualify as open source.

Do you plan to release the training code along the research paper? What about the dataset?

In any case, it's very neat to have an ML-based technique and a lightweight model for such tasks!

NiloCK
4 replies
17h3m

A somewhat surprising and genuinely useful application of the family of techniques.

I wonder how susceptible it is to adversarial binaries or, hah, prompt-injected binaries.

star4040
0 replies
13h31m

It gets a lot of binary file formats wrong for me out-of-the-box. I think it needs to be a bit more effective before we can truly determine the effectiveness of such exploits.

nicklecompte
0 replies
3h7m

Elsewhere in the thread kevincox[1] points out that it's extremely susceptible to adversarial binaries:

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

Seems like this is genuinely useless for anybody but AI researchers.

[1] https://news.ycombinator.com/item?id=39395677

jamesdwilson
0 replies
15h26m

For the extremely limited number of file types supported, I question the utility of this compared to `magic`

dghlsakjg
0 replies
17h0m

“These aren’t the binaries you are looking for…”

Andugal
4 replies
6h58m

I have a question: Is something like Magika enough to check if a file is malicious or not?

Example: users can upload PNG files (and only PNG is accepted). If Magika detects that the file is a PNG, does this mean the file is clean?

nicklecompte
0 replies
2h58m

This comment from kevincox[1] says the answer is a hard "no":

> Worse it seems that for unknown formats it confidently claims that it is one of the known formats. Rather than saying "unknown" or "binary data".

There are other comments in this thread that make me think Google contaminated their test data with training data and the 99% results should not be taken at face value. OTOH I am not particularly surprised that Magika would be better than the other tools at distinguishing semi-unstructured plain text e.g. Java source vs. C++ source or YAMLs versus INIs. But that's a very different use case than many security applications. The comments here suggest Magika is especially susceptible to binary obfuscation.

[1] https://news.ycombinator.com/item?id=39395677

kevincox
0 replies
2h47m

The only way to do this reliably is to render the PNG to pixels, then render it back to a PNG with a trusted encoder. Of course now you are taking the risk of vulnerabilities in the "render to pixels" step. But the result will be clean.

AKA parse, don't validate.
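
A minimal sketch of that, assuming Pillow is available (the helper name is just illustrative):

    from PIL import Image

    def launder_png(src_path: str, dst_path: str) -> None:
        with Image.open(src_path) as im:
            if im.format != "PNG":
                raise ValueError("not a PNG")
            im.load()  # force a full decode of the pixel data
            im.save(dst_path, format="PNG")  # re-encode with a trusted encoder

Anything appended after the image data, hidden in ancillary chunks, or otherwise smuggled alongside the pixels doesn't survive the round trip.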

cjg
0 replies
6h7m

> does this mean the file is clean?

No.

TacticalCoder
0 replies
5h55m

If that PNG of yours is not just an example, note that you can easily detect whether the PNG has any extra data (which may or may not indicate an attempt at mischief) and reject the (rare) PNGs with extra data. I ran a script checking the thousands of PNGs on my system and found three with extra data, all three probably due to the "PNG acropalypse" bug (but mischief cannot be ruled out).

P.S: btw I'm not implying using extra data that shouldn't be there in a PNG is the only way to have a malicious PNG.
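
For the curious, the check is roughly "find the IEND chunk and see whether any bytes follow it" (a Python sketch; a real tool would also walk and validate the chunks before IEND):

    # IEND chunk: 4-byte zero length + "IEND" + fixed CRC
    IEND = b"\x00\x00\x00\x00IEND\xaeB`\x82"

    def has_trailing_data(path):
        raw = open(path, "rb").read()
        idx = raw.find(IEND)
        if idx == -1:
            return False  # not a well-formed PNG to begin with
        return len(raw) > idx + len(IEND)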

kushie
2 replies
16h57m

This couldn't have been released at a better time for me! I really needed a library like this.

petesergeant
0 replies
15h5m

Tell us why!

ebursztein
0 replies
12h36m

Thanks :)

kazinator
2 replies
14h52m

> So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

> This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand.

Pure nonsense. The rules are accurate, based on the actual formats, and not "heuristics".

cle
0 replies
5h19m

Besides compound file types, not all formats are well-specified either. Example is CSV.

cAtte_
0 replies
13h13m

the rules aren't based on the formats, but on a small portion of them (their magic numbers). this makes them inaccurate (think docx vs zip) and heuristic.

account-5
2 replies
11h40m

Assuming that I've not misunderstood, how does this compare to things like: TrID [0]?? Apart from being open source.

[0] https://mark0.net/soft-trid-e.html

JacobThreeThree
1 replies
11h38m

The bulk of the short article is a set of performance benchmarks comparing Magika to TrID and others.

account-5
0 replies
11h30m

Argh, the risks of browsing the web without JavaScript and/or third party scripts enabled, you miss content, because rendering text and images on the modern web can't be done without them, apparently. (Sarcasm).

You are of course correct. I can see the images showing the comparison. Apologies.

YoshiRulz
2 replies
1h50m

So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.

og_kalu
0 replies
8m

> So instead of spending some of their human resources to improve libmagic

A large megacorp can work on multiple things at once.

an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

You say that like it's a contradiction but it's not.

> and which is much less effective in an adversarial context,

Is it? This seems like an assumption.

> and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

Being introspectable or not has no bearing on the accuracy of a system.

12_throw_away
0 replies
1h3m

Come on, can't you help but be impressed by this amazing AI tech? That gives us sci-fi tools like ... a less-accurate, incomplete, stochastic, un-debuggable, slower, electricity-guzzling version of `file`.

userbinator
1 replies
16h43m

> Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file.

"web browsers"? Odd to see this coming from Google itself. https://en.wikipedia.org/wiki/Content_sniffing was widely criticised for being problematic for security.

rafram
0 replies
15h36m

Content sniffing can be disabled by the server (X-Content-Type-Options: nosniff), but it’s still used by default. Web browsers have to assume that servers are stupid, and that for relatively harmless cases, it’s fine to e.g. render a PNG loaded by an <img> even if it’s served as text/plain.

secondary_op
1 replies
8h19m

Why is this piece of code being sold as open source, when in reality it just calls into a proprietary ML blob that is tiny and useless, the actual source code of the model is closed, and a properly useful large model doesn't exist?

KOLANICH
0 replies
8h6m

Not into a proprietary one; the blob is within an Apache-licensed repo. There is no code to train it, but the repo contains some info allowing one to recreate the training code: basically JSON-based configs containing the graph architecture. Even if you didn't have them, the repo contains an ONNX model, from which one can derive the architecture.

init0
1 replies
2h33m

Why not detect it by checking the magic number of the buffer?

dagmx
0 replies
2h14m

Not every file has one for starters and many can be incorrect.

Especially in the context of use as a virus scanner, you don’t trust what the file says it is

VikingCoder
1 replies
13h16m

What does it do with an Actually Portable Executable compiled by Cosmopolitan libc compiler?

supriyo-biswas
0 replies
12h26m

It’s reported as a PE executable, `file` on the other hand reports it as a “DOS/MBR boot sector.”

vunderba
0 replies
16h54m

As somebody who's dealt with the ambiguity of attempting to use file signatures in order to identify file type, this seems like a pretty useful library. Especially since it seems to be able to distinguish between different types of text files based on their format/content e.g. CSV, markdown, etc.

vrnvu
0 replies
16h16m

At $job we have been using Apache Tika for years.

It works, but occasionally has bugs and weird collisions when working with billions of files.

Happy to see new contributions in the space.

thangalin
0 replies
10h35m

My FOSS desktop text editor performs a subset of file type identification using the first 12 bytes, detecting the type quite quickly:

* https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...

There's a much larger list of file signatures at:

* https://github.com/veniware/Space-Maker/blob/master/FileSign...

star4040
0 replies
13h34m

It seems like it defeats the purpose of such a tool that this initial version doesn't handle polyglot files. I hope they're quick to work on that.

runxel
0 replies
41m

Took a .dxf file and fed it to Magika. It says with 97% confidence that it must be a PowerShell file. A classic .dwg could be "mscompress" (whatever that is) at 81%, or a GIF. Both couldn't be further from the truth.

Common files are categorized successfully – but well, yeah that's not really an achievement. Pretty much nothing more than a toy right now.

pier25
0 replies
3h37m

I use FFMPEG to detect if uploaded files are valid audio files. Would this be much faster?

omni
0 replies
3h32m

Can someone please help me understand why this is useful? The article mentions malware scanning applications, but if I'm sending you a malicious PDF, won't I want to clearly mark it with a .pdf extension so that you open it in your PDF app? Their examples are all very obvious based on file extensions.

nayuki
0 replies
3h42m

The name sounds like the Pokémon Magikarp or the anime series Madoka Magica.

lakomen
0 replies
5h58m

Why? Just check the damn headers. Why do you need a power hungry and complicated AI model to do it? Why?

goshx
0 replies
45m

I used an HTML file and added JPEG magic bytes to its header:

magika file.jpg

file.jpg: JPEG image data (image)

flohofwoe
0 replies
7h29m

I wonder how it performs with detecting C vs C++ vs ObjC vs ObjC++ and for bonus points: the common C/C++ subset (which is an incompatible C fork), also extra bonus points for detecting language version compatibility (e.g. C89 vs C99 vs C11...).

Separating C from C++ and ObjC is where the file type detection on GitHub traditionally had problems (though it has been getting dramatically better over time); from an "AI-powered" solution that has been trained on the entire internet I would expect better right from the start.

The list here doesn't even mention any of those languages except C though:

https://github.com/google/magika/blob/main/docs/supported-co...

earth2mars
0 replies
11h17m

How do I pronounce this? Myajika or MaGika? Anyhow, it's super cool.

diimdeep
0 replies
8h36m

> Magika: AI powered fast and efficient file type identification

...of 116 file types, with a puny proprietary model, no training code, and no dataset.

> We are releasing a paper later this year detailing how the Magika model was trained and its performance on large datasets.

And? How do you advance the industry with this Google blog post and source code that is useless without the closed-source model? All I see here is a loud marketing name and loud promises, but barely anything actually useful. Hooly rooftop characters side project?

chromaton
0 replies
3h30m

It can't correctly identify a DXF file in my testing. It categorizes it as plain text.

breather
0 replies
1h9m

Can we please god stop using AI like it's a meaningful word

andrewstuart
0 replies
7h18m

Very useful.

I wrote an editor that needed file type detection but the results of traditional approaches were flaky.

a-dub
0 replies
12h8m

probably a lot of interesting work going on that looks like this for the virustotal db itself.

Vt71fcAqt7
0 replies
16h16m

This feels like old school Google. I like that it's just a static webpage that basically can't be shut down or sunsetted. It reminds me of when Google just made useful stuff and gave it away for free on a webpage, like Translate and Google Books. Obviously less life-changing than the above, but still a great option to have when I need this.

TacticalCoder
0 replies
6h9m

To me the obvious use case is to first use the file command but then, when file returns "DATA" (meaning it couldn't guess the file type), call magika.

I guess I'll be writing a wrapper (only for when using my shell in interactive mode) around file doing just that when I come back from vacation. I hate it when file cannot do its thing.

Put it this way: I use file a lot and I know at times it cannot detect a filetype. But is file often wrong when it does have a match? I don't think so...

So in most of the cases I'd have file correctly give me the filetype, very quickly but then in those rare cases where file cannot find anything, I'd then use the slower but apparently more capable magika.
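
A sketch of that wrapper in Python (hypothetical helper; assumes both CLIs are on PATH and that `file --brief` prints plain "data" when it has no match):

    import subprocess

    def identify(path):
        out = subprocess.run(["file", "--brief", path],
                             capture_output=True, text=True).stdout.strip()
        if out != "data":
            return f"file: {out}"
        out = subprocess.run(["magika", path],
                             capture_output=True, text=True).stdout.strip()
        return f"magika: {out}"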

Someone
0 replies
7h11m

If their “Exif Tool” is https://exiftool.org/ (what else could it be?), I don’t understand why they included it in their tests. Also, how does ExifTool recognize Python and html files?

Nullabillity
0 replies
13h55m

It seems to detect my Android build.gradle.kts as Scala, which I suppose is a kind of hilarious confusion but not exactly useful.

Labo333
0 replies
9h33m

I wonder what the output will be on polyglot files like run-anywhere binaries produced by cosmopolitan [1]

[1]: https://justine.lol/cosmopolitan/

Imnimo
0 replies
15h38m

I wonder how big of a deal it is that you'd have to retrain the model to support a new or changed file type? It doesn't seem like the repo contains training code, but I could be missing it...

Eiim
0 replies
3h16m

I ran a quick test on 100 semi-random files I had laying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific file type (unknown binary/generic text) when a more specific type existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception of a JSON file containing a lot of JS code as text, which was detected as JS code.

For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files.

Not sure what to make of these tests but perhaps they're useful to somebody.

Delumine
0 replies
2h41m

Voidtools - Everything.. looking at you to implement this

20after4
0 replies
7h14m

I just want to say thank you for the release. There are quite a lot of complaints in the comments, but I think this is a useful and worthwhile contribution, and I appreciate the authors going through the effort to get it approved for open source release. It would be great if the model training data were included (or at least documentation about how to reproduce it), but that doesn't preclude this being useful. Thanks!