A General Purpose PDF/Image Sanitizer


ID: 287
PHID: PHID-TASK-wpds5i5b23fmq3q5mwhf
Author: HulaHoop
Status at Migration Time: invalid
Priority at Migration Time: Wishlist


Whonix has some neat tools for manipulating and sanitizing documents, such as the Metadata Anonymization Toolkit. Adding a PDF sanitizer (and even a photo sanitizer) would add a lot of value and improve the workflow for journalists and others working with documents.

It is somewhat complex, but we are not starting from scratch. Qubes has a series of bash scripts implementing these functions (see 1 & 2). The point of modifying them is to make an improved edition that is hypervisor agnostic and can also work on air-gapped setups like physical Whonix.
Ideally the client/server scripts would be unified into one package so that both processes are available to a user running on a different VM platform.

The idea behind the choice of .rgb is that, compared to other formats, it can’t carry any complex data that could trigger anything when parsed by ImageMagick.
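To illustrate why a raw RGB stream is easy to check, here is a minimal sketch of the validation a trusted side could perform before touching the pixel data. It is not taken from the Qubes scripts: the function name `parse_rgb_page`, the `b"WIDTH HEIGHT"` header layout, and the `MAX_DIM` limit are all assumptions made for this example.

```python
MAX_DIM = 10_000  # assumed upper bound on page width/height, in pixels


def parse_rgb_page(header: bytes, payload: bytes) -> tuple[int, int, bytes]:
    """Validate one page received from the untrusted side.

    Assumptions (hypothetical wire format, not the Qubes one):
    `header` is ASCII text of the form b"WIDTH HEIGHT", and
    `payload` is raw 8-bit RGB, three bytes per pixel, with no padding.
    """
    parts = header.split()
    # Accept exactly two short, purely numeric fields and nothing else.
    if len(parts) != 2 or not all(p.isdigit() and len(p) <= 5 for p in parts):
        raise ValueError("malformed header")
    width, height = int(parts[0]), int(parts[1])
    if not (1 <= width <= MAX_DIM and 1 <= height <= MAX_DIM):
        raise ValueError("dimensions out of range")
    # The only structural property of the payload is its length.
    expected = width * height * 3
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    return width, height, payload
```

The whole check comes down to two integers and one length comparison. There are no nested structures, optional fields, or embedded objects for a parser to get wrong.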

The process:

Untrusted VM [untrusted PDF > conversion to PNG > simple RGB format] > transfer to trusted VM [simple RGB > reassemble back to PNG (2) > optional manual OCR operation (see 4 & 5) and cleanup (3) > a trusted, fully searchable PDF]
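The "reassemble back to PNG" step on the trusted side could look roughly like the sketch below. The scripts in (2) use ImageMagick for this. The example swaps in a small pure-Python PNG writer, built only on the standard library, so that it is self-contained. The names `rgb_to_png` and `_chunk` are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    body = kind + data
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)


def rgb_to_png(width: int, height: int, rgb: bytes) -> bytes:
    """Encode raw 8-bit RGB pixels as a non-interlaced truecolor PNG."""
    if width <= 0 or height <= 0 or len(rgb) != width * height * 3:
        raise ValueError("pixel data does not match the given dimensions")
    stride = width * 3
    # Every scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, color type 2 (RGB),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
```

For example, `rgb_to_png(2, 1, bytes([255, 0, 0, 0, 255, 0]))` returns the bytes of a 2x1 image with one red and one green pixel. The resulting pages would then go through the optional OCR and cleanup steps before being combined into the trusted PDF.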

Edit: The scripts (2) support image files too.

Edit: OCR and enhancement are out of scope for this script, since they involve more tools. The primary purpose is to edit and repackage the scripts from the repo (2) into a distro-neutral form.

The cleanup is about enhancing scanned printed material and adjusting it so that it is easier to read than the originals and easier for OCR to process correctly. OCR is about restoring the searchability that PDFs lose when they are converted to a simpler format during sanitization.

Konrad Voelkel’s link recommended scantailor and unpaper for enhancing scanned image PDFs. Both are packaged for Debian. This step should be left manual and up to the user before finalizing the conversion to a trusted PDF.

Micah’s tool (6) is too simple for the threat model that we are addressing. I’m mentioning his work because it talks about the redaction process during the PNG stage.

Bonus points for slapping on a GUI.

1 http://theinvisiblethings.blogspot.com/2013/02/converting-untrusted-pdfs-into-trusted.html

2 https://github.com/QubesOS/qubes-app-linux-pdf-converter

3 http://www.fmwconcepts.com/imagemagick/textcleaner/

4 http://www.konradvoelkel.com/2010/01/linux-ocr-and-pdf-problem-solved/

5 http://www.konradvoelkel.com/2013/03/scan-to-pdfa/

6 https://github.com/micahflee/pdf-redact-tools



2015-05-15 14:55:07 UTC


2015-05-15 15:09:28 UTC


2015-05-15 21:01:46 UTC