Let's not underestimate the impact of such a tool: we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.
I'm very excited about it.
Let's build a pipeline: all the pdfs -> markdown them all -> archive.org them all
FWIW PDF is actually great for distribution. It allows you to invisibly embed all the raw data used to generate the document that the end user is seeing, in whatever format you want. So if you are generating your PDFs by using PrinceXML to render HTML, you can embed the raw JSON used to generate all of the text, graphs, charts, etc. Now most people don't actually do this of course, but that isn't the fault of the spec.
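As a rough illustration of that, here's a minimal sketch of attaching the source JSON to a finished PDF, assuming pikepdf's attachments API; the file names are made up for the example:

    import pikepdf

    # Open the rendered PDF and embed the raw JSON that was used to
    # generate it, so the underlying data travels with the document.
    with pikepdf.open("report.pdf") as pdf:
        spec = pikepdf.AttachedFileSpec.from_filepath(pdf, "report_data.json")
        pdf.attachments["report_data.json"] = spec
        pdf.save("report_with_data.pdf")

Any PDF viewer with an attachments panel can then hand the original data back out, no scraping required.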
pdfs don't play well with ereaders.
Are the standards for building accessible PDFs worse than the standards for building accessible websites, or are they just not as commonly implemented?
(anecdotally) PDFs usually come from many people, departments, companies, and apps. It's hard to shoehorn in accessibility if someone didn't add it at the origin (in InDesign or whatever app they used). Or if they printed to PDF, whatever accessibility they had would probably be lost. Much of the time it's like working with a raster image with some embedded text. Not really the same as being able to edit a proper semantic document.
With a website and available source code, any dev working on it later on can still add accessibility, tweak contrasts and fonts and add screen reader hints, etc.
It's much harder to do so for PDFs after the fact. And PDF viewer apps may or may not even support the accessibility annotations. By contrast all the major browsers and operating systems have OK support for web accessibility.
I don't know anything about websites. I had ebooks in mind.
Yeah, totally. PDFs are wonderful for archiving.*
They can hold so many different types of data that they're extremely difficult to parse.
Because of this, you can put several malicious programs into them for RCE.
That way, if someone archives many PDFs, there can be a plethora of different RCE vulnerabilities just waiting for the user to discover.
It's a wonderful dream for any malicious actor.
* /s
Yes, there is an enormous interest in this kind of thing, not least in larger organizations with tons of PDF documents in various forms.
Even though this would only cover a small part of the needs or use cases, it will still be hugely useful if it works well.
cough L cough L cough M cough anyone? :)
Yeah, I know, but a lot of this content can be pretty sensitive, and sometimes it can't be uploaded outside the organization's network (hospitals, governments, etc.).
Like most software, LLMs can be run locally, or on private infrastructure. This was on the front page yesterday, which is not the only way to run an LLM locally, but about the easiest way possible: https://news.ycombinator.com/item?id=38464057
Thanks! Well, yeah, I just thought the quality of offline models might not yet be good enough. But I'm glad to be told otherwise :)
Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).
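("Naive text extraction" there means roughly the following: just pulling each page's text layer with no layout analysis, so columns, tables, and equations come out scrambled. A sketch with pypdf, not the actual libgen_to_txt code:)

    from pypdf import PdfReader

    # Naive extraction: concatenate whatever text layer each page exposes.
    reader = PdfReader("book.pdf")
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    print(text[:500])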
This also has tons of use cases for accessibility. Getting PDF accessibility right is a lot of work, and even if you manage it, it's highly likely that the PDF viewers your users rely on don't support the necessary standards anyway.
I don't think that is the right approach for archiving. The preferred pipeline would be
all the pdfs -> archive them all -> markdown them
This way you can always re-run the conversion as bugs are fixed and improvements are made. Generally archivists prefer to save as close to the source material as possible, because every transformation from there can only lose data.
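A minimal sketch of that archive-first ordering; the directory names and the convert_to_markdown() helper are placeholders for whatever converter you plug in (marker or anything else):

    import hashlib
    import shutil
    from pathlib import Path

    SOURCE = Path("incoming_pdfs")
    ARCHIVE = Path("archive")      # untouched originals, never modified
    MARKDOWN = Path("markdown")    # derived output, safe to regenerate
    ARCHIVE.mkdir(exist_ok=True)
    MARKDOWN.mkdir(exist_ok=True)

    def convert_to_markdown(pdf_path: Path) -> str:
        # Placeholder: call your converter of choice here.
        raise NotImplementedError

    for pdf in SOURCE.glob("*.pdf"):
        # 1. Archive the original first, keyed by content hash, so the
        #    conversion can always be re-run from pristine input.
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        archived = ARCHIVE / f"{digest}.pdf"
        if not archived.exists():
            shutil.copy2(pdf, archived)

        # 2. Derive markdown from the archived copy; regenerate later as
        #    the converter improves, without touching the original.
        (MARKDOWN / f"{digest}.md").write_text(convert_to_markdown(archived))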
Finally a good use case for AI/ML/LLMs.