This is cool!
Here are some other similar(?) tools, for seeing the inner contents of a PDF file (the raw objects etc), but I haven't compared them to this tool here:
- https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-rups-7.2.5.jar)
- https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax inspect foo.pdf > output.html)
- https://github.com/trailofbits/polyfile (polyfile --html output.html foo.pdf)
- https://www.reportmill.com/snaptea/PDFViewer/ = https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag PDF onto it)
- https://sourceforge.net/projects/pdfinspector/ (an "example" of https://superficial.sourceforge.net/)
- https://www.o2sol.com/pdfxplorer/overview.htm
More?
Recommend just letting people have their one day in the sun. We’ve become less the site of builders as the red team for testing your launch.
yeah I agree, and while everyone is suggesting tools which are really good but I designed mine to get rid of the flags and CLI interface. Good for tech people that keeps remembering flags, I'm not :(
Mutool is the one I suggest to people. The easiest way to understand a PDF is to decompress it and then just read the contents.
At that point you’ll realise that a PDF is mostly just a list of objects and that those objects can reference each other. After that you’ll journey through the spec understanding what each type of object does and what the fields in it control. The graphics stream itself is just a stack based co-ordinates drawing system that’s easy to follow too.By way of an example. Here's an object that represents a Page. You can see the dimensions in the MediaBox. The contents themselves are contained at object "9 0 obj" ("9 0 R" is the pointer to it):
Meanwhile "9 0 obj" has the drawing instructions. They seem a little weird at first glance but you see the values ".23999999 0 0 -.23999999 0 792" each get pushed on the stack and then "cm" pops them to interpret them as the transformation matrix. The depth and detail of all of the different possible things that can be represented in a PDF is insane. But understanding the structure above is all you need to begin your journey!EDIT The rest of your journey is contained in this epic document: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
My tool can do exactly the same (viewing internal structure, exporting objects, and see the uncompressed raw content for stream) with a graphical interface and without all this kind of flags (which one of the reasons I started to design this project with egui), but thanks for posting yours too.
Sweet, currently working on PDF signature stuff so I'm sure I'll find some stuff handy :)
Thanks for the list, the idea behind my tool was to try to code something that might fit an analyst that would take a fast look at the PDF. I'm also trying to figure out some fast heuristics to mark/highlight some peculiar stuff on the file itself.
Now regarding the tools you mentioned, I haven't checked out all of them, but part of them are interesting (and more mature, speaking of testing and compatibility). However some (at least the ones I was trying) are very basic, and they don't allow the "Save object as.." or uncompress it. I like the feature of displaying the PDF for preview :)
The venerable PDFedit[1] more or less forces you to confront the internal structure of the PDF file as well.
[1] http://pdfedit.cz/en/index.html
I am the author of PDFSyntax, thanks for mentioning it!
The HTML output is like a pretty print where you can read view objects and follow links to other objects.
Since I have added a new command (disasm) that is CLI oriented and displays a greppable summary of the structure. Here is an explanation: https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...