Marker: Convert PDF to Markdown quickly with high accuracy

mannycalavera42
15 replies
7h1m

Let's not underestimate the impact of such a tool: we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.

I'm very excited about it.

Let's build a pipeline: all the pdfs -> markdown them all -> archive.org them all

Alex3917
5 replies
4h13m

we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.

FWIW PDF is actually great for distribution. It allows you to invisibly embed all the raw data used to generate the document that the end user is seeing, in whatever format you want. So if you are generating your PDFs by using PrinceXML to render HTML, you can embed the raw JSON used to generate all of the text, graphs, charts, etc. Now most people don't actually do this of course, but that isn't the fault of the spec.
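
As a rough sketch of the idea (not tied to PrinceXML specifically): PDF supports embedded file attachments, so the raw source data can ride along invisibly inside the rendered document. A minimal example with pikepdf, where the file names and data are placeholders:

```python
import json
import pikepdf

# Raw data that was used to generate the rendered report (placeholder).
source_data = {"title": "Q3 report", "revenue": [1.2, 1.4, 1.7]}

with pikepdf.open("report.pdf") as pdf:
    spec = pikepdf.AttachedFileSpec(
        pdf,
        json.dumps(source_data).encode("utf-8"),
        mime_type="application/json",
        description="Raw data used to generate this document",
    )
    # The attachment travels with the PDF and can be pulled back out
    # later by any tool that reads embedded files.
    pdf.attachments["source.json"] = spec
    pdf.save("report_with_data.pdf")
```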

sertbdfgbnfgsd
3 replies
4h5m

pdfs don't play well with ereaders.

Alex3917
2 replies
4h2m

Are the standards for building accessible PDFs worse than the standards for building accessible websites, or are they just not as commonly implemented?

solardev
0 replies
59m

(anecdotally) PDFs usually come from many people, departments, companies, and apps. It's hard to shoehorn in accessibility if someone didn't add it in at the origin (like in indesign or whatever app they used). Or if they printed to PDF, whatever accessibility they had would probably be lost. Much of the time it's like working with a raster image with some embedded text. Not really the same as being able to edit a proper semantic document.

With a website and available source code, any dev working on it later on can still add accessibility, tweak contrasts and fonts and add screen reader hints, etc.

It's much harder to do so for PDFs after the fact. And PDF viewer apps may or may not even support the accessibility annotations. By contrast all the major browsers and operating systems have OK support for web accessibility.

sertbdfgbnfgsd
0 replies
3h52m

I don't know anything about websites. I had ebooks in mind.

chaxor
0 replies
41m

Yeah, totally. PDFs are wonderful for archiving.*

They can hold so many different types of data so that they're extremely difficult to parse.

Because of this, you can put several malicious programs into them for RCE.

That way, if someone archives many PDFs, there can be a plethora of different RCE vulnerabilities just waiting for the user to discover.

It's a wonderful dream for any malicious actor.

* /s

samuell
4 replies
6h38m

Yes, there is an enormous interest in this kind of thing, not least in larger organizations with tons of PDF documents in various forms.

Even though this would only cover a small part of the needs or use cases, it will still be hugely useful if it works well.

mannycalavera42
3 replies
6h35m

cough L cough L cough M cough anyone? :)

samuell
2 replies
6h30m

Yeah, I know, but a lot of this content can be pretty sensitive and sometimes isn't allowed to leave organization networks (hospitals, governments, etc.).

scoot
1 replies
6h21m

Like most software, LLMs can be run locally or on private infrastructure. This was on the front page yesterday; it's not the only way to run an LLM locally, but it's about the easiest way possible: https://news.ycombinator.com/item?id=38464057

samuell
0 replies
2h37m

Thanks! Well, yeah, I just thought the quality of offline models might not yet be good enough. But I'm glad to be told otherwise :)

vikp
0 replies
2h42m

Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt, although I haven't integrated marker with it yet (it uses naive text extraction).

miki123211
0 replies
4h52m

This also has tons of use cases for accessibility. Getting PDF accessibility right is tons of work, and even if you manage it, it's highly likely that the PDF viewers your users use don't support the necessary standards anyway.

kevincox
0 replies
1h25m

Let's build a pipeline

I don't think that is the right approach for archiving. The preferred pipeline would be

all the pdfs -> archive them all -> markdown them

This way you can always re-run the conversion as bugs are fixed and improvements are made. Generally, archivists prefer to save something as close to the source material as possible, because every transformation from there can only lose data.
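
A minimal sketch of that ordering (paths and the marker invocation are placeholders, not a prescribed layout): keep the original PDFs untouched and treat the Markdown conversion as a derived step you can re-run at any time.

```python
import shutil
import subprocess
from pathlib import Path

SOURCE = Path("incoming_pdfs")      # placeholder paths
ARCHIVE = Path("archive/pdfs")      # originals, never modified
DERIVED = Path("derived/markdown")  # regenerated whenever the converter improves

ARCHIVE.mkdir(parents=True, exist_ok=True)
DERIVED.mkdir(parents=True, exist_ok=True)

# Step 1: archive the source material byte-for-byte.
for pdf in SOURCE.glob("*.pdf"):
    shutil.copy2(pdf, ARCHIVE / pdf.name)

# Step 2: derive Markdown from the archive. The exact marker command
# line is a placeholder here; only the archive-first ordering matters.
subprocess.run(["marker", str(ARCHIVE), str(DERIVED)], check=True)
```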

Gabrys1
0 replies
22m

Finally, a good use case for AI/ML/LLMs.

s1291
7 replies
3h25m

I'm curious if anyone has had any success building this package. I've spent a lot of time trying to build it myself, but unfortunately haven't been able to get it to work. Has anyone else had better luck?

hashemian
5 replies
3h18m

I did it on Mac without any issues. Are you using Mac or Linux? What is the issue?

s1291
4 replies
2h55m

I'm using Ubuntu 22.04. I encountered several errors with Poetry and attempted to fix them but eventually gave up.

vikp
3 replies
2h36m

(author) Please feel free to open an issue if you try again. Poetry can be painful; I might just switch to a requirements.txt file in the future. (You can also skip Poetry by pulling everything in pyproject.toml into a requirements.txt file, as sketched below.)
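
For anyone who wants to try that workaround before an official requirements.txt exists, a rough sketch (assuming Python 3.11+ for tomllib; Poetry's caret/tilde specifiers aren't valid pip specifiers, so this just pins package names and lets pip resolve versions):

```python
import tomllib  # stdlib in Python 3.11+

with open("pyproject.toml", "rb") as f:
    pyproject = tomllib.load(f)

deps = pyproject["tool"]["poetry"]["dependencies"]

with open("requirements.txt", "w") as out:
    for name in deps:
        if name == "python":
            continue  # interpreter constraint, not a pip requirement
        out.write(f"{name}\n")
```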

s1291
1 replies
2h16m

Is there a plan to release this package as a docker image?

vikp
0 replies
27m

Yes, this is on my list of things to do :)

yxhuvud
0 replies
2h0m

I found the use of Poetry a breath of fresh air compared to the usual Python silliness. Painless, as opposed to getting the CUDA stuff working, which took a lot longer.

yxhuvud
0 replies
2h9m

The hard part was getting CUDA and torch to work. The package itself was just poetry install. Easy-peasy.

alsodumb
6 replies
14h13m

Great work! I am a bit confused by the comparison with Nougat throughout the repo. Nougat was specifically trained for academic documents, and I don't think anyone ever claimed Nougat was the best OCR model out there. That's kinda clear in your benchmark too, where you mention that Nougat has higher accuracy on arXiv documents. You also mention that marker will convert fewer equations than Nougat, and yet you compare with Nougat in terms of speed? (Again, I'm only complaining because it's a model designed for academic documents.)

For anyone trying to do OCR on any PDF with math in it, definitely do try Nougat. It's very easy to install (just a Python package), and it extracts the math, text, tables and beyond (into a .mmd file) with a single command. It also runs reasonably fast for personal use; it takes about 30 seconds to convert a 6-page document using CPU only on my 4-year-old i5 laptop.

vikp
2 replies
2h39m

Author here: for my use case (converting scientific PDFs in bulk), nougat was the best solution, so I compared to it as the default. I also compare to naive text extraction further down.

Nougat is a great model, and converts a lot of PDFs very well. I just wanted something faster, and more generalizable.

civilitty
1 replies
1h7m

Great work! I just tried it on "Linux for System Administrators" and it did a great job of properly picking up code and config text.

I noticed marker downloaded a PyTorch checkpoint called `nougat-0.1.0-small`. Do you use nougat under the hood too, or is that just a coincidence?

vikp
0 replies
27m

Yes, nougat is used as part of the pipeline to convert the equations (basically marker detects the equations then passes those regions to nougat). It's a great model for this.
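
Conceptually the flow looks something like the sketch below. This is only an illustration of the detect-then-delegate idea, not marker's actual code; the detector and model calls are hypothetical placeholders.

```python
from PIL import Image

def convert_page(page_image: Image.Image, detect_equations, nougat_model, extract_text):
    """Illustrative only: find equation regions, hand those crops to a
    math-capable model, and use ordinary text extraction for the rest."""
    parts = []
    equation_boxes = detect_equations(page_image)   # hypothetical detector
    for box in equation_boxes:
        crop = page_image.crop(box)                 # (left, top, right, bottom)
        latex = nougat_model(crop)                  # hypothetical model call
        parts.append(f"$$\n{latex}\n$$")
    parts.append(extract_text(page_image, exclude=equation_boxes))
    return "\n\n".join(parts)
```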

defsectec
1 replies
12h36m

How do you think nougat would handle RPG rulebook PDFs?

I'm looking for a good OCR model to help me transcribe sections of RPG books to markdown. Ideally, I'd like emphasis such as bold or italics to be transcribed.

The combo of text, numbers, and math symbols seems similar to technical and academic writing, but often has weird formatting, text boxes in the margins, and many diagrams.

alsodumb
0 replies
10h26m

I'm not completely sure to be honest, but you should try it yourself with a sample page! I believe hugging face hosts it online on their demo pages so you don't even have to install the package to test on one page.

fshr
0 replies
12h35m

I don't think anyone ever claimed Nougat was the best OCR model out there

Comparing two things doesn't inherently imply that the previous thing was touted with superlatives. It's just a way to juxtapose the new thing with something that may be familiar. As you said, nougat is easy to install and run, so it makes sense they'd compare against it. Would it be better if they could add more libraries to the comparison? Absolutely; that'd be helpful.

sertbdfgbnfgsd
4 replies
4h5m

Question for the author: why Markdown? It seems to me the hard part of this tool is parsing PDFs with high accuracy, not whatever you do with them afterwards. As such, I would love it if this tool allowed the user to choose the output format. I know that I would use a high-accuracy PDF parser to render into epub.

carschno
1 replies
3h58m

I agree, the intermediate format should be plain text that could optionally be converted to any other format. I suppose Markdown is used as the intermediate format here because it is close to plain text while still being able to preserve simple layout information.

In practice, I would use the Markdown output and plug it into any tool that converts that into the desired final output format.
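
For example, with pandoc via the pypandoc wrapper (file names are placeholders; pandoc itself must be installed):

```python
import pypandoc

# Turn the Markdown produced by the PDF converter into an EPUB.
pypandoc.convert_file("book.md", "epub", outputfile="book.epub")
```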

sertbdfgbnfgsd
0 replies
3h48m

That sounds reasonable. I might explore pdf -> markdown -> epub.

I wonder if this could somehow be used directly by calibre. I think calibre's pdf->epub conversion isn't amazing. In particular, tables often end up broken.

vikp
0 replies
2h37m

I chose markdown because I wanted to preserve equations (fenced by $/$$), tables, bold/italic information, and headers. I haven't looked into epub output, but this ruled out plain text.

Finnucane
0 replies
3h49m

You would want to have some kind of markup that preserves document structure as much as possible. I manage ebooks for a university press, and we have a deep backlist waiting for conversion, a lot of which only exists as page scans of old print volumes. I want to be able to offer them as epubs, which means I need to know where there are chapter breaks, heads, tables, charts, math, blockquotes, and so on and so forth. I have vendors that can do this for me, but it costs more than we'd get for some of these books in sales. I'd love to be able to do some of this myself.

lgats
3 replies
12h22m

It'd be really great if there were something like this that also supported image extraction.

afandian
1 replies
7h25m

My current workflow (for getting a magazine onto a website) is Calibre's HTMLZ export, then through Pandoc to markdown. It produces good enough Markdown to feed in to Hugo, and extracts images.

I've been through a number of options in the past and this is what I've settled on.
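
The pandoc step of that workflow, roughly (via the pypandoc wrapper; file names are placeholders), including pulling the referenced images out into a folder:

```python
import pypandoc

# Convert the HTML from Calibre's HTMLZ export into GitHub-flavored
# Markdown, extracting referenced images into ./images along the way.
pypandoc.convert_file(
    "index.html", "gfm", outputfile="article.md",
    extra_args=["--extract-media=images"],
)
```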

samuell
0 replies
6h15m

Interesting! I tried it, but it seems to struggle with multi-column layouts (lines get intermingled). Is that something you tried?

prmoustache
0 replies
8h59m

Especially for those that want to move out of Confluence. It is rather easy to obtain a docx or pdf from the API, as well as the raw, uncompressed attachments; it's a bit more complicated to convert said files to markdown with full-quality attachments and no formatting errors on every page.

yxhuvud
2 replies
3h47m

Impressive. It would be nice to have access to a spellchecker with support for more languages though. But the results are pretty good despite that.

rurban
1 replies
2h33m

A spellchecker is included. Just change the spell_Lang setting from eng to your language.

yxhuvud
0 replies
2h11m

I know it is included. The problem is that the available selection of languages is not good enough to include any of the languages I need it for. There is only support for a handful of languages.

potatoman22
2 replies
14h16m

This seems like a great tool to help migrate my notes out of OneNote

smusamashah
0 replies
7h36m

How can it help with OneNote?

iamflimflam1
2 replies
7h15m

How good is tesseract for OCR nowadays? I tried using it a while back and it was nowhere near as good as the online offerings from AWS, Azure and GCP.

rereasonable
0 replies
5h31m

The last update was pretty recent, and the repo lists Tesseract 5 as a dependency, so it's likely moved on a bit from when you last tried it:

https://github.com/tesseract-ocr/tesseract/releases

I suppose it depends on your use case. For personal tasks like this it should be more than sufficient, and it won't need account details, a credit card, or whatever to use it.

Geee
0 replies
5h31m

I tried it quite recently and it failed on a very basic image. I also tried the iOS Vision API, which also failed. My test case was a clear photo of a book page.

defsectec
2 replies
13h11m

This looks amazing, I'll have to play around with this over the weekend.

I regularly hand-transcribe RPG PDF scans from dubious sources that have not always been run through OCR to get selectable text. When they have been, it wasn't always done very well.

It's literally faster to type it all myself than to fix all the errors from copy-pasting (or after using OCR to turn it into text).

Even if the file was an official PDF, the formatting would often get screwed up, with lots of double or triple spaces and even tabs between words.

This would save so much time if I can get it to work. Thanks for sharing!

milep
0 replies
6h25m

I had this use case in mind too. I already tried it with one book, but the results were not that good. Many of the tables and text boxes were messed up. I had pretty good results converting tables to markdown with ChatGPT by taking a screenshot of a table and pasting it into the chat. It was able to handle some "irregular" tables with a bit of prompting, like "Read the table row by row. Column headers are X, Y, Z. X is text, Y is number, Z is word" as a simplified example.

crooked-v
0 replies
11h39m

I regularly hand transcribe RPG PDFs scans from dubious sources

Heh, that was my immediate thought too. There's a ton of RPG stuff that never had any kind of physical release and is totally orphaned as IP.

nemacol
1 replies
2h39m

Can someone help me understand the line

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.

Does this mean it isn't suitable if I want to use it in a product for sale, or that I can't use it for tasks at my work? I would like to try using this at work to convert vendor documentation to include in our internal wiki.

infecto
0 replies
2h28m

If your work is commercial, then you cannot use it. Think of it this way: is your work being used in a commercial business? Then it cannot be used. If you are using this for personal use or anything that is not part of a business, it's OK.

jnathsf
1 replies
14h1m

Are there any other libraries or online services that do this well? I have a large number of PDFs from government agencies. I've tried AWS Textract and it works fairly well.

wriggles
0 replies
7h11m

https://www.handwritingocr.com is aimed specifically at handwriting, and will do that better than Textract and co, but works well for printed text too.

dr_kiszonka
1 replies
13h13m

Nice! The only missing feature is conversion of plots to ASCII art ; )

ploum
0 replies
7h16m

This could be achieved with chafa.py:

https://chafapy.mage.black/

https://hpjansson.org/chafa/

dr_dshiv
1 replies
7h9m

What's the best tool for writing with ChatGPT so that Markdown gets rendered properly? Copy-pasting into Google Docs is always misery.

rany_
0 replies
7h6m

I'd just use pandoc to convert to docx.
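
For example (via the pypandoc wrapper; file names are placeholders):

```python
import pypandoc

# Render the Markdown that ChatGPT produced into a .docx you can open
# in Word or import into Google Docs.
pypandoc.convert_file("chat.md", "docx", outputfile="chat.docx")
```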

KeplerBoy
1 replies
7h23m

Great stuff!

I have a question regarding the output of Nougat: Where do the "hallucinations" come from (just scroll through the Nougat output of the Think Python example to see what I mean)?

Never mind, I just read that it runs things through an LLM, so hallucinations are par for the course.

thfuran
0 replies
5h37m

I think these sorts of tools are dangerous, at least until the hallucination rate (in text or formatting) is below what a careful reader experiences when repeatedly re-reading a document, which is almost but not quite zero, and, depending on the application, potentially even until it's actually zero. I guess they're mostly fine for cases where the exact document content isn't important, but it's probably not common to have a lot of documents that nobody anywhere considers, or ever will consider, important, yet which must be more accessible than PDFs.

scary-size
0 replies
7h16m

Nice work. I tend to do most of my longer reading on an e-reader. PDFs, especially multi-column layouts, are a nightmare with the out-of-the-box offerings from Amazon Kindle or Pocketbook. This looks like something that'll improve my experience quite a lot.

rurban
0 replies
2h35m

Installing this thing took more time than manually fixing up the .md generated by a simple pdf2md converter. And that way I got a perfect result, unlike with marker/nougat.

rounakdatta
0 replies
6h35m

This kind of tool should also be built into the post-processing pipeline of paperless-ngx. Well-parsed Markdown would be more indexable for search.

ramoz
0 replies
2h11m

Kosmos-2.5 seems promising, and I hope we see it in OSS (otherwise I assume it just makes Azure's cloud OCR better).

https://arxiv.org/pdf/2309.11419.pdf

poulpy123
0 replies
2h23m

That looks great! I would think that the same thing with LaTeX or Typst output could be even better, if doable.

nanna
0 replies
2h7m

Might the OCRing of, for example, MIT's student magazine The Tech have used a stack similar to this one, sans Markdown output of course? I'm thinking of the way any given historical issue's complex layout has been OCR'd so well.

https://thetech.com/issues

Random old issue for example: https://thetech.com/issues/33/34

mlhpdx
0 replies
14h13m

Nice. This would have been very helpful when I was building an e-discovery document processing engine. Back then we could get text out (OCR, so kind of) but it was a bear to present. Markdown would have been a whole lot easier.

maliker
0 replies
0m

I've struggled with the other part of this flow: getting a good clean PDF of a website in an automated way. Whatever archive.today does is probably the best approach I've seen, but they don't publish their code as far as I can tell.

hashemian
0 replies
3h28m

Amazing work. Thank you.

I have a set of PDF files, and this week I was thinking about how I could link them to an LLM and ask questions about them. So this was very timely.

I did a quick side-by-side test against Nougat, and Marker clearly works better. On a handful of PDFs I tested, Marker extracted considerably more text (the text did not have any math, just academic papers), finished the job faster, and did not crash on any PDF, while Nougat took a lot longer to finish and sometimes crashed with an out-of-memory error (it could not allocate more than 7GB of RAM!).

danofsteel32
0 replies
7h55m

I have an odd use case that I've yet to find a good solution for: reading construction documents (blueprints are always PDFs). I've had much better luck parsing DXF (AutoCAD) files, but it's not always easy to get an architect to send them to me, even if I'm the GC on the job.

bingdig
0 replies
2h26m

I'm not very technical but could benefit from this tool tremendously. Is there a way to use it from R?

airstrike
0 replies
13h18m

Really interesting stuff... it might be worth adding some before-and-after examples to the repo.

What kind of PDF are you tweaking it for? How does it handle handwritten annotations?

101008
0 replies
1h2m

I'd love to try this for a magazine I publish in PDF (designed with Adobe InDesign), but I couldn't get the repo working locally. Any chance anyone could write a guide for trying it in the cloud? It would be appreciated :)