This is an amazing piece of work. Thanks for making this open-source.
These days, the default way of generating beautiful PDFs in the backend is to run a headless browser and use browser APIs to turn HTML/CSS into PDFs. But running browser instances on a server is fairly costly, and scaling them for huge workloads is hard.
This is literally a game changer. Now it's possible to design PDFs using HTML/CSS and generate them without the browser overhead!
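For context, the browser-based approach usually amounts to shelling out to a headless Chromium. A minimal Python sketch (the binary name is an assumption; on your system it may be `google-chrome`, `chromium-browser`, etc.):

```python
import subprocess

def chrome_pdf_command(html_path: str, pdf_path: str,
                       chrome_bin: str = "chromium") -> list[str]:
    """Build the headless-Chrome command line for printing HTML to PDF.

    `chrome_bin` is an assumption; substitute whatever binary your
    system provides.
    """
    return [
        chrome_bin,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        html_path,
    ]

if __name__ == "__main__":
    cmd = chrome_pdf_command("invoice.html", "invoice.pdf")
    # Actually executing this requires a Chromium install:
    # subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```

Every such invocation pays the full browser startup and memory cost, which is the overhead being discussed.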
As an aside, it's amazing how far the web has come, where the best way to make pretty PDF documents is to literally run a web browser on the server. This would have been unthinkable back in the '90s and 2000s.
I'm fairly certain that using a headless browser on the server is mainly about sandboxing all the security concerns that PDFs have, not aesthetics, but yes.
it's actually because layout-via-code for arbitrary documents is a humblingly complex problem, so leveraging existing layout engines is preferred.
This impressive effort looks far better than anything I'd achieve, but when this approach has been tried before, the eventual discovery has been that few organizations have the resources to maintain a rendering engine long-term.
I do think complexity could be part of why we don't have many options here, but I don't agree that a layout engine is too difficult to maintain. More of the issue is that CSS layout (and maybe layout in general) is not widely well-understood. I've almost _never_ come across people interested in layout because generally it's a few properties to get something working and then you move on.
I'm curious: are there other instances of this happening besides Edge switching to Blink? That event was one of my main motivators; it felt like further consolidation of obscure knowledge.
Opera switched from Presto to Blink, too.
Very fun project! Did you ever consider integrating with web-platform-tests? It's shared between all the common browser vendors, and we're always interested in more contributors :-)
True. But I wonder if there are more special-purpose engines similar to Prince that have been abandoned.
I've run some of the WPT tests manually, but I don't yet have <style> support, and some of them use <script> I think? That's a path I'm wary of (eval()?) but I could have a special mode just for tests.
I did discover lots of weird corners that would be great to make some WPT tests for. Definitely something I want to do!
Yes, a _lot_ of WPT tests depend on <script>. But there's also a bunch of ref-tests, where you just check that A and B match pixel for pixel (where B is typically written in the most obvious, dumb way possible). It lets you test advanced features in terms of simple ones. But yes, you'd need selector support in particular.
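To make the ref-test idea concrete, here's what such a pair could look like (file names and styles are made up for illustration). The test exercises the feature under test and declares its reference via `<link rel="match">`:

```html
<!-- flex-center.html: the test, using the feature under test -->
<!DOCTYPE html>
<link rel="match" href="flex-center-ref.html">
<div style="display: flex; justify-content: center; width: 100px">
  <div style="width: 50px; height: 50px; background: green"></div>
</div>
```

while the reference produces the identical rendering in the dumbest way available:

```html
<!-- flex-center-ref.html: same pixels via a plain margin -->
<!DOCTYPE html>
<div style="width: 100px">
  <div style="width: 50px; height: 50px; background: green;
              margin-left: 25px"></div>
</div>
```

If the two pages don't match pixel for pixel, the test fails; no script is involved.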
I maintain a standalone web layout engine[0] (currently implementing Flexbox and CSS Grid) which has no scripting support. The use of <script> in WPT layout tests is a major blocker to our running WPT against our library. Yoga (used by React Native) is in a similar position.
Do you think the WPT would accept pull requests replacing such tests with equivalent tests that don't use <script> (perhaps using a build script to generate multiple tests instead - or simply writing out the tests longhand)?
I could run against only the ref-tests, but if I can't get full coverage then the WPT seems to provide little value over our own test suite.
[0]: https://github.com/DioxusLabs/taffy
I don't decide WPT policies (and I honestly don't know who does), but I'm pretty sure using a build script would be right out, as WPT is deeply embedded in a lot of other projects. E.g., if you made a build script, you would need to add support for running that script in Blink and in Gecko and in WebKit, and their multiple different runners, and probably also various CI systems.
As for the second option, I don't actually know. If it becomes 10x as long, I would assume you get a no for maintainability and readability reasons. If it's 20% longer and becomes no less clear, I'd say give it a try with a couple tests first? It's possible that the WPT would be interested in expanding its scope to a wider web platform than just browsers. You would probably never get people to stop writing JS-dependent tests, though, so you would need to effectively maintain this yourself.
Of course, for a bunch of tests you really can't do without <script>, given that a lot of functionality is either _about_ scripting support (e.g. CSSOM), intimately tied to it (e.g. invalidation) or will be tested only rather indirectly by other forms of tests (e.g. computed values, as opposed to used values or specified values or actual values or …).
To reply mostly with my WPT Core Team hat off, mostly summarising the history of how we've ended up here:
A build script used by significant swaths of the test suite is almost certainly out; it turns out people like being able to edit the tests they're actually running. (We _do_ have some build scripts, but they're mostly just mechanically generating lots of similar tests.)
A lot of the goal of WPT (and the HTML Test Suite, which it effectively grew out of) has been to have a test suite that browsers are actually running in CI. Historically, most standards test suites haven't been particularly amenable to automation (often a lot of, or exclusively, manual tests; little concern for flakiness; etc.), with policy choices that effectively made browser vendors write tests for themselves rather than add them to the shared test suite. If you make it notably harder to write tests for the shared suite, most engineers at a given vendor are simply going to not bother.
As such, there's a lot of hesitancy towards anything that regresses the developer experience for browser engineers (and realistically, browser engineers, by virtue of sheer number, are the ones who are writing the most tests for web technologies).
That said, there are probably ways we could make things better: a decent number of tests for things like Grid use check-layout-th.js (e.g., https://github.com/web-platform-tests/wpt/blob/f763dd7d7b7ed...).
One could definitely imagine a world in which these are a test type of their own, and the test logic (in check-layout-th.js) can be rewritten in a custom test harness to do the same comparisons in an implementation without any JS support.
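As a rough sketch of that idea: a JS-free harness could scrape the `data-expected-*` attributes out of a check-layout-style test file and compare them against the engine's own layout results. Everything here is hypothetical; in particular, `measure` stands in for whatever query API the layout engine under test exposes:

```python
from html.parser import HTMLParser

class ExpectationCollector(HTMLParser):
    """Collect check-layout-style expectations (data-expected-width etc.)."""

    def __init__(self):
        super().__init__()
        self.expectations = []  # (element index, attribute name, expected value)
        self.index = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("data-expected-"):
                self.expectations.append((self.index, name, float(value)))
        self.index += 1

def check_layout(html, measure):
    """Compare expectations against `measure(index, prop) -> float`,
    a hypothetical hook into the layout engine under test.
    Returns a list of (index, prop, expected, actual) failures."""
    collector = ExpectationCollector()
    collector.feed(html)
    failures = []
    for idx, name, expected in collector.expectations:
        prop = name[len("data-expected-"):]   # e.g. "width", "height"
        actual = measure(idx, prop)
        if abs(actual - expected) > 0.5:      # small tolerance for rounding
            failures.append((idx, prop, expected, actual))
    return failures
```

A real integration would need to identify elements more robustly than a flat document-order index, but the comparison logic itself is tiny, which is what makes this test type attractive for non-browser engines.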
The other challenge for things like Taffy only targeting flexbox and grid is we're unlikely to add any easy way to distinguish tests which are testing interactions with other layout features (`position: absolute` comes to mind!).
My suggestion would probably be to start with an issue at https://github.com/web-platform-tests/rfcs/issues, describing the rough constraints, and potentially with one or two possible solutions.
Security of the PDF format is not relevant here. The headless browser outputs a PDF; it is not taking a user-controlled PDF as input.
Ah of course, my apologies. I misread the original post.
I needed to transform a 12MB HTML file into a PDF document and headless Chrome quickly ran out of memory (4GB+).
We are now using a commercial alternative that seems to be using a custom engine implementing the HTML and CSS specs. The result is reduced memory usage (below 512MB during my tests), and the resulting PDF is much smaller: 3.3MB vs 42MB.
Did you try WeasyPrint?
Yes, I’ve tried all the open-source projects I could find, including WeasyPrint and wkhtmltopdf. WeasyPrint was much slower than headless Chrome and also required a lot of memory to process the HTML. And wkhtmltopdf is no longer maintained and crashed while processing.
We use DocRaptor, based on the PrinceXML engine, but haven’t tried a huge PDF. We sometimes generate 20-30 page PDFs and it works great.
We are also using DocRaptor. It takes around 20 seconds to generate the PDF, and we only need to generate it every night. So the costs are also not an issue at the moment.
Back around 2002 at least there were some products, ABCpdf is one I used a lot, which ran Internet Explorer on the server to generate PDFs from HTML. Worked pretty well from what I recall.
Have you tried Typst? It's like a modern version of LaTeX and lets you generate nice-looking documents quickly. It can be called from the console and makes it easy to create templates and import resources like images, fonts, and data (CSV, JSON, TOML, YAML, raw files, ...). Of course, it is its own language rather than HTML/CSS, but so far I've found it quite pleasant to use.
I'm a little confused by your comment. I've been using the Prawn library (https://github.com/prawnpdf/prawn) to generate PDFs on the backend for a side project for quite some time.
(Admittedly, the PDFs I generate are most certainly not beautiful, so maybe that's the difference)
Lots of people are already really comfortable with html/css, so having the option to avoid learning an entirely new paradigm is helpful.
Prawn really is great. I use it to generate invoices and for exporting a billing overview in client projects. And it’s quite fast as well, since it generates the PDF directly without the need to spin up a browser.
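To illustrate just how lightweight direct generation can be compared with spinning up a browser, here's a toy sketch that hand-assembles a minimal one-page PDF with no libraries at all. This is illustration only (no escaping, one font, fixed page size); anything real wants a proper library like Prawn:

```python
def minimal_pdf(text: str) -> bytes:
    """Hand-assemble a minimal one-page PDF showing `text` in Helvetica.

    Toy code: `text` must not contain parentheses or backslashes,
    since no PDF string escaping is performed.
    """
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode()
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objects, start=1):
        offsets.append(len(out))              # byte offset for the xref table
        out += b"%d 0 obj\n" % i + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # the mandatory free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)

if __name__ == "__main__":
    with open("invoice.pdf", "wb") as f:
        f.write(minimal_pdf("Hello, invoice"))
```

The whole file is a few objects plus a cross-reference table; this is roughly what libraries like Prawn emit for you, which is why they need neither a layout engine nor a browser process.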
One of the benefits of using the browser is that the generated PDF will use vectors/fonts etc., whereas Canvas output will mostly end up as an image in the PDF. Not a big deal for most use cases.
I feel like it's probably not a leap to go from this to having a PDF renderer as a backend. The trickiness is in the layout, which this is already doing. It looks to have a lower-level API and a way to render to absolutely positioned HTML; that gets you most of the way there.
I've used this with good success
https://ekoopmans.github.io/html2pdf.js/
I've served dynamic content directly as PDF with https://xmlgraphics.apache.org/fop/
You can make PDFs client-side with html2canvas or webkit.js (https://github.com/trevorlinton/webkit.js).
We did use this approach years back when I worked on a feature that generated PDF invoices.
But I wondered whether using something like LaTeX instead wouldn't be faster and easier to scale.
I don't think rasterized output makes for a good PDF.