Show HN: IPA, a GUI for exploring inner details of PDFs

svat
8 replies
4d5h

This is cool!

Here are some other similar(?) tools, for seeing the inner contents of a PDF file (the raw objects etc), but I haven't compared them to this tool here:

- https://pdf.hyzyla.dev/

- https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-rups-7.2.5.jar)

- https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax inspect foo.pdf > output.html)

- https://github.com/trailofbits/polyfile (polyfile --html output.html foo.pdf)

- https://www.reportmill.com/snaptea/PDFViewer/ = https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag PDF onto it)

- https://sourceforge.net/projects/pdfinspector/ (an "example" of https://superficial.sourceforge.net/)

- https://www.o2sol.com/pdfxplorer/overview.htm

More?

richardw
1 replies
3d21h

Recommend just letting people have their one day in the sun. We’ve become less a site of builders than a red team for testing your launch.

nicolodev
0 replies
3d21h

Yeah, I agree. Everyone is suggesting tools that are really good, but I designed mine to get rid of the flags and the CLI. That's good for tech people who keep remembering flags; I'm not one of them :(

aidos
1 replies
3d22h

Mutool is the one I suggest to people. The easiest way to understand a PDF is to decompress it and then just read the contents.

    mutool clean -d in.pdf out.pdf
At that point you’ll realise that a PDF is mostly just a list of objects and that those objects can reference each other. After that you’ll journey through the spec, understanding what each type of object does and what the fields in it control. The graphics stream itself is just a stack-based coordinate drawing system that’s easy to follow too.

By way of example, here's an object that represents a Page. You can see the dimensions in the MediaBox. The contents themselves are contained in object "9 0 obj" ("9 0 R" is the pointer to it):

    2 0 obj
    <<
      /Type /Page
      /MediaBox [ 0 0 612 792 ]
      /Contents 9 0 R
    >>
    endobj
Meanwhile "9 0 obj" has the drawing instructions. They seem a little weird at first glance but you see the values ".23999999 0 0 -.23999999 0 792" each get pushed on the stack and then "cm" pops them to interpret them as the transformation matrix.

    9 0 obj
    <<
      /Length 18266
    >>
    stream
    .23999999 0 0 -.23999999 0 792 cm
    q
    0 0 2551 3301 re
    ...
The depth and detail of all of the different possible things that can be represented in a PDF is insane. But understanding the structure above is all you need to begin your journey!
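
The push-then-pop behaviour described above can be sketched in a few lines of Python. This is only an illustration: it assumes the stream contains nothing but numbers and bare operators (no strings, names, arrays, or inline images), which real content streams do not guarantee, and `run_content_stream` is a made-up helper name.

```python
# Minimal sketch of the operand-stack model: numbers are pushed,
# and the next operator consumes whatever has been pushed so far.
# Assumption: only numeric operands and bare operators appear in the stream.

def run_content_stream(text):
    """Return a list of (operator, operands) pairs in execution order."""
    stack = []      # operands waiting for the next operator
    executed = []   # (operator, [operands]) pairs
    for token in text.split():
        try:
            stack.append(float(token))       # a number: push it
        except ValueError:
            executed.append((token, stack))  # an operator: it takes the pending operands
            stack = []
    return executed


sample = """
.23999999 0 0 -.23999999 0 792 cm
q
0 0 2551 3301 re
"""

for operator, operands in run_content_stream(sample):
    print(operator, operands)
```

Here "cm" ends up with the six matrix values, "q" with none, and "re" with the four rectangle values (x, y, width, height).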

EDIT The rest of your journey is contained in this epic document: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...

nicolodev
0 replies
3d22h

mutool clean -d in.pdf out.pdf

My tool can do exactly the same (viewing the internal structure, exporting objects, and seeing the uncompressed raw content of streams) with a graphical interface and without all these flags (which is one of the reasons I started to design this project with egui), but thanks for posting yours too.

whizzter
0 replies
4d1h

Sweet, currently working on PDF signature stuff so I'm sure I'll find some stuff handy :)

nicolodev
0 replies
4d4h

Thanks for the list. The idea behind my tool was to build something that fits an analyst who wants to take a quick look at a PDF. I'm also trying to figure out some fast heuristics to mark/highlight peculiar stuff in the file itself.

Now regarding the tools you mentioned, I haven't checked out all of them, but some of them are interesting (and more mature, in terms of testing and compatibility). However, some (at least the ones I tried) are very basic, and they don't offer "Save object as..." or let you uncompress the object. I like the feature of displaying the PDF for preview :)

mananaysiempre
0 replies
4d5h

The venerable PDFedit[1] more or less forces you to confront the internal structure of the PDF file as well.

[1] http://pdfedit.cz/en/index.html

desgeeko
0 replies
4d

I am the author of PDFSyntax, thanks for mentioning it!

The HTML output is like a pretty print where you can view objects and follow links to other objects.

Since then I have added a new command (disasm) that is CLI-oriented and displays a greppable summary of the structure. Here is an explanation: https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...

geekodour
8 replies
4d4h

What's a good tool to check that a PDF has not been tampered with, e.g. as a check before loading a PDF from a public bucket into your backend application?

remram
6 replies
3d20h

How could a PDF be tampered with in your own bucket?

verdverm
4 replies
3d19h

Sounds like they may be accepting user PDFs, saving them to a bucket, and then doing processing after.

remram
3 replies
3d18h

They trust their AWS EC2 instance doing the processing but not their AWS S3 bucket doing the storing? I don't really understand the threat model here.

And what's "public bucket"?

geekodour
1 replies
3d17h

So the model here is: first it gets uploaded to a staging bucket, then a lambda/callback checks the validity of the file and puts it into a safe bucket whose content I trust enough to load into my server (backend).

remram
0 replies
3d15h

I think maybe you are using the word "tampered" in an unusual way? To mean unsafe?

geekodour
0 replies
3d17h

Ah, apologies again. For this specific one I meant a bucket that users from the internet are allowed to upload to (I am using presigned URLs).

geekodour
0 replies
3d17h

Apologies, my bad. I mean someone uploading a malicious PDF as user input. I am talking about going beyond ClamAV and filetype checking, to check if it's a valid PDF.
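
For the "is it even shaped like a PDF" part of that check, a cheap first pass could look roughly like the sketch below. It only looks for the header near the start and a `startxref` / `%%EOF` pair near the end (the 1024-byte windows mirror what lenient readers are commonly documented to accept). `looks_like_pdf` is a hypothetical helper name, and passing this check says nothing about whether the file is safe to parse or free of malicious content.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap structural sanity check; NOT a safety guarantee."""
    # The %PDF- header should appear near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return False
    tail = data[-1024:]
    # The file should end with a startxref pointer followed by %%EOF.
    eof = tail.rfind(b"%%EOF")
    startxref = tail.rfind(b"startxref")
    if eof == -1 or startxref == -1 or startxref > eof:
        return False
    # Whatever sits between 'startxref' and '%%EOF' must be a byte offset inside the file.
    offset_text = tail[startxref + len(b"startxref"):eof].strip()
    return offset_text.isdigit() and int(offset_text) < len(data)
```

Anything that fails this can be rejected early in the staging step; anything that passes still needs a real parser (ideally sandboxed) before it is trusted.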

criddell
0 replies
4d1h

If you sign the file, you should be able to verify that the signature still matches the file.
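
A minimal sketch of that idea, using an HMAC over the file bytes as a stand-in for whatever signing scheme is actually in use (this is an assumption; it is not PDF's built-in digital signature mechanism, and the key handling here is purely illustrative):

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the file contents."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, signature: str) -> bool:
    """True only if the tag still matches the current file contents."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign(data, key), signature)
```

You would store the tag when the file is first accepted and call `verify` again right before the backend loads it; any change to the bytes in between makes the check fail.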

AlanYx
6 replies
4d5h

Does anyone have any recommendations for a good tool that allows both programmatic inspection and modification of PDF primitives? For example, let's say someone wants to iterate through every embedded image in a PDF and apply some form of signal processing to the images in-place, then re-save the PDF?

verdverm
2 replies
3d19h

I've been using several Python libraries for working with PDFs. At least one of them allows you to walk the AST. (will look up in a bit and edit this comment)

analog31
1 replies
3d18h

I've been using pypdf for working with PDFs in Python. My uses are pretty humble. I create Jupyter notebooks for managing sheet music that I receive in PDF format, allowing me to do things like break up a book of tunes into individual files, and so forth. This in turn makes it easier to pull up individual tunes on my tablet during a performance. But it looks like you can treat the PDF as a tree structure. I've used that feature for writing some recursive functions.

verdverm
0 replies
3d15h

yeah, I've been using pypdf mainly, camelot-py for some table stuff, and a bit of pdfminer

I've been needing something to see the x/y bounds of tables to fix some edge cases with camelot; there seem to be some good links in the comments here.

nicolodev
0 replies
4d5h

I'd suggest coding something on top of one of the popular libraries for PDF manipulation. I've used pdf-rs for the tool.

mananaysiempre
0 replies
4d4h

I’ve used pikepdf[1] for text processing before. To use it for the task you outline, you’ll probably need to thoroughly investigate how bitmaps can be represented in PDFs. (Or maybe not, if you only need to deal with a known finite set of PDFs or PDF producers.)

[1] https://pikepdf.readthedocs.io/en/latest/

desgeeko
0 replies
4d

My tool (PDFSyntax[1], mentioned in this thread) is a Python library that is able to both inspect and transform PDF files.

Depending on your transformation use case, you may write an incremental update with only a few bytes at the end of the original file instead of rewriting it entirely. To my knowledge this feature of the PDF specification is often overlooked and not many libraries implement it.
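
To make the incremental-update idea concrete, here is a rough plain-Python sketch (not PDFSyntax's implementation or API). It assumes the original file uses a classic cross-reference table rather than an xref stream, that the new object has generation 0, and that the caller supplies the `/Root` reference and new `/Size`; the function name and its parameters are hypothetical.

```python
import re


def append_incremental_update(original: bytes, obj_num: int, obj_body: bytes,
                              root_ref: str, size: int) -> bytes:
    """Append one new or replaced object as an incremental update."""
    # The last 'startxref' value points at the previous cross-reference section.
    prev = int(re.findall(rb"startxref\s+(\d+)", original)[-1])
    out = bytearray(original)          # original bytes are kept untouched
    if not out.endswith(b"\n"):
        out += b"\n"
    obj_offset = len(out)              # byte offset of the appended object
    out += b"%d 0 obj\n" % obj_num + obj_body + b"\nendobj\n"
    xref_offset = len(out)             # byte offset of the new xref section
    out += b"xref\n%d 1\n" % obj_num   # one subsection holding one entry
    out += b"%010d 00000 n \n" % obj_offset   # each entry is exactly 20 bytes
    # /Prev chains this trailer to the previous cross-reference section.
    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (size, root_ref.encode(), prev)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

The result starts with the original file byte for byte; a reader follows the last `startxref`, finds the new entry first, and falls back to the older table through `/Prev` for everything else.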

It is a work in progress and I have not developed functions for images yet, though.

[1] https://github.com/desgeeko/pdfsyntax

giancarlostoro
4 replies
4d

This looks nice, and I didn't know about eGUI which looks like it runs on the web. Very interesting.

https://www.egui.rs/

mdaniel
2 replies
3d19h

That demo page told me everything I needed to know about not using that framework. There's no right-click to copy the text, and there's similarly no context menu to open links in a new tab, or copy their URL

sexy_seedbox
1 replies
3d17h

And just like that, we're back to the old days of Macromedia Flash where you have to write your own context menu and specify which texts are selectable.

mdaniel
0 replies
3d16h

Yeah, I knew I was in for some onoz when I saw "compiled to WebAssembly and rendered with WebGL". In their defense, it's stunning that any text operations work at all

Also, "There is no DOM, HTML, JS or CSS" is some uh-huh given the considerable amount of silliness involved in view-source:https://www.egui.rs/

nicolodev
0 replies
4d

Thanks! The immediate-mode paradigm might be a little bit scary if you're used to Qt, but it looks easy to manage and it's really interactive.

ZoomZoomZoom
4 replies
4d2h

I recently wanted to edit out a huge background image repeating on almost every page of a PDF and found out there's no obvious way to do it.

Would appreciate any tool suggestions!

mkl
0 replies
3d6h

This is probably a simple find-and-replace task, so I wouldn't bother with proper PDF parsing or libraries. I would:

1. Use pdftk to uncompress it: pdftk input.pdf output uncompressed.pdf uncompress

2. Look at the PDF code (it's text based) to find the image insertion code.

3. Replace all instances of the image insertion code with strings of spaces the same length (there's a table of object byte offsets at the end that you don't want to mess up).

4. Use pdftk to compress it again: pdftk edited.pdf output output.pdf compress

I have a script that does this to remove pen strokes of particular colours so I can e.g. strip out marking rubric on test solutions written on a tablet.

Get the PDF 1.7 spec from https://pdfa.org/resource/pdf-specification-archive/. You're looking for the "Do" operator invoking a named image object defined elsewhere with "/Subtype /Image". See section 4.8, particularly the example on p343. Or, if it's badly done, it might instead be an inline image using the "BI" operator (a bit later in the same section).
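
Step 3 above (blanking out the image invocation without changing the file length) can be sketched like this. The image name `/Im0` and the sample bytes are made up for illustration; you would look in the uncompressed file for the real name, and `blank_out` is a hypothetical helper, not part of pdftk.

```python
import re


def blank_out(data: bytes, pattern: bytes) -> bytes:
    """Replace every regex match with spaces of the same length."""
    result = re.sub(pattern, lambda m: b" " * len(m.group(0)), data)
    # The object byte offsets at the end of the file depend on the length staying put.
    assert len(result) == len(data)
    return result


# Hypothetical page content: draw image /Im0, then some text.
stream = b"q 612 0 0 792 0 0 cm /Im0 Do Q\nBT /F1 12 Tf (Hello) Tj ET\n"
cleaned = blank_out(stream, rb"/Im0\s+Do")
print(cleaned)
```

In practice you would read uncompressed.pdf in binary mode, run it through `blank_out`, and write the result to edited.pdf before the final pdftk compress step.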

dr_kiszonka
0 replies
4d2h

You could try one of Adobe's PDF APIs or script their software locally.

darknavi
0 replies
4d1h

If you're OK doing it manually (not scripted), Inkscape can do this.

darby_nine
0 replies
3d23h

I've had good experience with pypdf, if you're willing to do a little coding.

nicolodev
1 replies
4d5h

Thanks, it seems like a great product too :) Is there any particular feature you recommend that product for?

whizzx
0 replies
3d5h

I use it literally every day, not only to see the structure but also to modify PDFs on the fly when I need to test edge cases.

Stuff I do with it: modify content streams, extract images/content, investigate the general structure of PDF documents, remove pages, repair documents, ... It's literally a Swiss Army knife when working with PDFs.

lolinder
0 replies
3d15h

Do be careful with iText products—they license their stuff under AGPL but their interpretation of AGPL is pretty extreme. If you talk to their team they'll tell you that ~everything your company makes should be AGPL-licensed if you use iText anywhere [0]:

You may not deploy it on a network without disclosing the full source code of your own applications under the AGPL license. You must distribute all source code, including your own product and web-based applications.

They also have this delightful nagware encoded as a base64 string that spits this out in your logs [1]:

You are using iText under the AGPL.

If this is your intention, you have published your own source code as AGPL software too. Please let us know where to find your source code by sending a mail to agpl@apryse.com We'd be honored to add it to our list of AGPL projects built on top of iText and we'll explain how to remove this message from your error logs.

If this wasn't your intention, you are probably using iText in a non-free environment. In this case, please contact us by filling out this form: http://itextpdf.com/sales If you are a customer, we'll explain how to install your license key to avoid this message. If you're not a customer, we'll explain the benefits of becoming a customer.

For using RUPS on a local computer you're probably safe, but I avoid the company because everything about their approach to the AGPL suggests that they chose it as a marketing technique for their paid products (with an extremely strong desire that it never be used commercially without pay), not out of a serious commitment to free software.

[0] https://itextpdf.com/how-buy/AGPLv3-license

[1] https://github.com/itext/itext-dotnet/blob/develop/itext/ite...

jerknextdoor
2 replies
4d3h

I was curious to try this out as it might actually solve a minor problem of mine right now, but it crashed as soon as I tried to open a PDF.

Installed from git using cargo 1.80.1 on Ubuntu 22.04 on an AMD Framework laptop if that's of any help.

nicolodev
1 replies
4d3h

argh, that's too bad, feel free to open an issue, what's happening in the console? It's panicking, isn't it? Feel free to contact me via email if you prefer

mdaniel
0 replies
3d19h

heh, parsing PDF files from the wild is practically a community driven fuzzer

jeffreportmill1
2 replies
4d4h

Great work! I'm sorry to be another jerk posting a link to something similar, but here is my solution, running in the browser (just drag and drop your PDF in):

https://reportmill.com/snaptea/PDFViewer/

nicolodev
1 replies
4d4h

Nice! My tool should be runnable in the browser thanks to the wasm compatibility of Rust + egui :) Btw I've just tried it, and it's a little bit buggy in Safari with a 504 KB PDF (lots of objects though). Apart from that, is there a way to export the raw stream? Is there any reason why you print all the raw streams as text?

jeffreportmill1
0 replies
3d15h

I don’t remember much about the work - it was just a quick and dirty app to help me debug PDF for my ReportMill work (10 years ago). I remember thinking there probably weren’t more than 100 people on the planet who would even care about it.

nicolodev
0 replies
4d4h

:D Well, I'm sure that half of the reverse engineering community needs to thank you and Zynamics for the important contributions to static analysis tools. I'll just take the occasion to thank you for being an inspiration with such awesome tools as BinNavi, BinDiff, and ultimately PDF Dissector. When I read that it got discontinued, I had the idea and started to think about something focused on analysis, applying some approaches we've already seen in binary analysis tools.

whiteandmale
0 replies
4d5h

IPA ranks among my top beer styles. Oh wait.... 'thisn dork country, rightey?

eviks
0 replies
3d14h

Since this mentions malware, is this tool safe to use to open it? Or do you still need to use other protection methods?

aquafox
0 replies
3d13h

Is there actually a good open source alternative to Adobe Acrobat? I mean one where you can easily 1) merge and extract pages 2) edit content, e.g. change a single line in a simple vector graphic, add text or other elements 3) place a signature and create a "protected" PDF. AFAIK, there isn't a single good open source program to do all of these things. Why?