Let's not underestimate the impact of such a tool: we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.
I'm very excited about it.
Let's build a pipeline: all the pdfs -> markdown them all -> archive.org them all
FWIW PDF is actually great for distribution. It allows you to invisibly embed all the raw data used to generate the document that the end user is seeing, in whatever format you want. So if you are generating your PDFs by using PrinceXML to render HTML, you can embed the raw JSON used to generate all of the text, graphs, charts, etc. Now most people don't actually do this of course, but that isn't the fault of the spec.
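As a rough illustration of that, here's a minimal sketch of attaching the source JSON to a finished PDF, assuming pikepdf's attachments API; the file names are made up for the example:

    import pikepdf

    # Open the rendered PDF and embed the raw JSON that was used to
    # generate it, so the underlying data travels with the document.
    with pikepdf.open("report.pdf") as pdf:
        spec = pikepdf.AttachedFileSpec.from_filepath(pdf, "report_data.json")
        pdf.attachments["report_data.json"] = spec
        pdf.save("report_with_data.pdf")

Any PDF viewer with an attachments panel can then hand the original data back out, no scraping required.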
pdfs don't play well with ereaders.
Are the standards for building accessible PDFs worse than the standards for building accessible websites, or are they just not as commonly implemented?
(anecdotally) PDFs usually come from many people, departments, companies, and apps. It's hard to shoehorn in accessibility if someone didn't add it at the origin (in InDesign or whatever app they used). Or if they printed to PDF, whatever accessibility they had would probably be lost. Much of the time it's like working with a raster image with some embedded text. Not really the same as being able to edit a proper semantic document.
With a website and available source code, any dev working on it later on can still add accessibility, tweak contrasts and fonts and add screen reader hints, etc.
It's much harder to do so for PDFs after the fact. And PDF viewer apps may or may not even support the accessibility annotations. By contrast all the major browsers and operating systems have OK support for web accessibility.
I don't know anything about websites. I had ebooks in mind.
Yeah, totally. PDFs are wonderful for archiving.*
They can hold so many different types of data that they're extremely difficult to parse.
Because of this, you can put several malicious programs into them for RCE.
That way, if someone archives many PDFs, there can be a plethora of different RCE vulnerabilities just waiting for the user to discover.
It's a wonderful dream for any malicious actor.
* /s
Yes, there is an enormous interest in this kind of thing, not least in larger organizations with tons of PDF documents in various forms.
Even though this would only cover a small part of the needs or use cases, it will still be hugely useful if it works well.
cough L cough L cough M cough anyone? :)
Yeah, I know, but a lot of this content can be pretty sensitive, and sometimes it can't be uploaded outside the organization's network (hospitals, governments, etc.).
Like most software, LLMs can be run locally, or on private infrastructure. This was on the front page yesterday, which is not the only way to run an LLM locally, but about the easiest way possible: https://news.ycombinator.com/item?id=38464057
Thanks! Well, yeah, I just thought the quality of offline models might not yet be good enough. But I'm glad to be told otherwise :)
Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).
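("Naive text extraction" there means roughly the following: just pulling each page's text layer with no layout analysis, so columns, tables, and equations come out scrambled. A sketch with pypdf, not the actual libgen_to_txt code:)

    from pypdf import PdfReader

    # Naive extraction: concatenate whatever text layer each page exposes.
    reader = PdfReader("book.pdf")
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    print(text[:500])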
This also has tons of use cases for accessibility. Getting PDF accessibility right is a lot of work, and even if you manage it, it's highly likely that the PDF viewers your users rely on don't support the necessary standards anyway.
I don't think that is the right approach for archiving. The preferred pipeline would be
all the pdfs -> archive them all -> markdown them
This way you can always re-run the conversion as bugs are fixed and improvements are made. Generally archivists prefer to save as close to the source material as possible, because every transformation from there can only lose data.
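A minimal sketch of that archive-first ordering; the directory names and the convert_to_markdown() helper are placeholders for whatever converter you plug in (marker or anything else):

    import hashlib
    import shutil
    from pathlib import Path

    SOURCE = Path("incoming_pdfs")
    ARCHIVE = Path("archive")      # untouched originals, never modified
    MARKDOWN = Path("markdown")    # derived output, safe to regenerate
    ARCHIVE.mkdir(exist_ok=True)
    MARKDOWN.mkdir(exist_ok=True)

    def convert_to_markdown(pdf_path: Path) -> str:
        # Placeholder: call your converter of choice here.
        raise NotImplementedError

    for pdf in SOURCE.glob("*.pdf"):
        # 1. Archive the original first, keyed by content hash, so the
        #    conversion can always be re-run from pristine input.
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        archived = ARCHIVE / f"{digest}.pdf"
        if not archived.exists():
            shutil.copy2(pdf, archived)

        # 2. Derive markdown from the archived copy; regenerate later as
        #    the converter improves, without touching the original.
        (MARKDOWN / f"{digest}.md").write_text(convert_to_markdown(archived))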
Finally a good use case for AI/ML/LLMs.