Removing watermarks and other clutter from manuscripts

June 22, 2022, 12:51 PM

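Consider reading a manuscript with automatically generated watermarks or metadata on every page, some of which cover parts of the text. On top of that, the document is split into several PDFs. By the time the third page of the second PDF refers back to something in the first PDF, you might wonder whether you can remove the clutter and merge the cleaned PDFs. The script below does both: qpdf rewrites each PDF in its text-editable QDF form, sed deletes the lines containing the unwanted text, and pdftk merges the results.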

#!/usr/bin/env bash
## file: script.sh

## directory structure
# $ tree
#   .
#   ├── 1.pdf
#   ├── 2.pdf
#   ├── 3.pdf
#   ├── 4.pdf
#   ├── orig_merged.pdf
#   ├── script.sh
#   └── workdir
#       ├── 1.pdf
#       ├── 2.pdf
#       ├── 3.pdf
#       ├── 4.pdf
#       ├── clean_compressed.pdf
#       └── clean.pdf
#   
#   1 directory, 12 files

mkdir -p workdir

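for f in *.pdf; do
    # QDF mode writes an uncompressed PDF that can be edited as text
    qpdf --qdf --object-streams=disable "${f}" "workdir/${f}"
    # delete every line containing the unwanted text (sed -i is GNU sed syntax);
    # if a reader complains afterwards, qpdf ships fix-qdf to repair an edited QDF file
    sed -i '/TEXT_TO_BE_REMOVED/d' "workdir/${f}"
done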

# merge in numeric order; the subshell leaves the caller's directory unchanged
(cd workdir && pdftk $(ls *.pdf | sort -n) cat output clean.pdf)
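
Because sed deletes every line that contains the pattern, it can help to check what would be removed before running the script for real. The following is a minimal sketch, not part of the original script: the helper names count_marker_lines and show_marker_lines are hypothetical, and it assumes workdir/ already holds the QDF files produced by the qpdf step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the lines that the sed command would delete.
# -a treats the PDF as text, -c prints the number of matching lines.
count_marker_lines() {
    local marker="$1" file="$2"
    # grep exits non-zero when nothing matches; "|| true" keeps that harmless
    grep -a -c -- "${marker}" "${file}" || true
}

# Hypothetical helper: print the matching lines with their line numbers.
show_marker_lines() {
    local marker="$1" file="$2"
    grep -a -n -- "${marker}" "${file}" || true
}

# Usage (assumes workdir/ was filled by the qpdf step):
# for f in workdir/*.pdf; do
#     printf '%s: %s\n' "${f}" "$(count_marker_lines 'TEXT_TO_BE_REMOVED' "${f}")"
# done
```

If the count is much higher than the number of pages, the pattern probably matches more than the watermark and should be made more specific.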

The final document will probably be larger than the merged original files, since QDF mode writes its streams uncompressed. You might consider compressing workdir/clean.pdf.

$ pdftk $(ls *.pdf | sort -n) cat output orig_merged.pdf
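
One caveat with the glob above: once orig_merged.pdf exists, a second run would pick it up as an input as well. A possible sketch that lists only the numbered parts, in numeric order, is shown here. The helper name list_numbered_pdfs is hypothetical, and the assumption that every part's file name starts with a digit comes from the directory tree above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PDFs in a directory whose names start with
# a digit (1.pdf, 2.pdf, 10.pdf, ...), one per line, sorted numerically.
list_numbered_pdfs() {
    local dir="${1:-.}" f
    for f in "${dir}"/[0-9]*.pdf; do
        # skip the unexpanded pattern when nothing matches
        if [ -e "${f}" ]; then
            basename "${f}"
        fi
    done | sort -n
}

# Usage (assumes file names without newlines):
# mapfile -t parts < <(list_numbered_pdfs .)
# pdftk "${parts[@]}" cat output orig_merged.pdf
```

Collecting the names in an array also keeps file names with spaces intact, which the unquoted $(ls ...) form does not.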

$ qpdf --object-streams=generate --compress-streams=y --decode-level=generalized workdir/clean.pdf workdir/clean_compressed.pdf

$ ls -lah workdir/clean.pdf workdir/clean_compressed.pdf
.rw-r--r-- mika mika 5 MB Wed Jun 22 13:20:27 2022  workdir/clean.pdf
.rw-r--r-- mika mika 3 MB Wed Jun 22 13:21:59 2022  workdir/clean_compressed.pdf
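
The rounded sizes above only give a rough impression of the saving. A small sketch for an exact figure follows; the helper name size_reduction_percent is hypothetical, and it only compares byte counts, so it says nothing about whether the compressed file still renders correctly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print by how many percent the second file is smaller
# than the first, rounded towards zero (negative if it grew).
size_reduction_percent() {
    local before after
    # wc -c reads from stdin so that it prints only the byte count
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    if [ "${before}" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( (before - after) * 100 / before ))
}

# Usage:
# size_reduction_percent workdir/clean.pdf workdir/clean_compressed.pdf
```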