Removing watermarks and other clutter from manuscripts
June 22, 2022, 12:51 PM
Consider reading a manuscript with automatically generated watermarks or metadata on every page, some of which cover parts of the text. On top of that, the document is split into several PDFs. By the time the third page of the second PDF references something from the first, you might wonder whether you can remove that clutter and merge the cleaned PDFs.
#!/usr/bin/env bash
## file: script.sh
## directory structure
# $ tree
# .
# ├── 1.pdf
# ├── 2.pdf
# ├── 3.pdf
# ├── 4.pdf
# ├── orig_merged.pdf
# ├── script.sh
# └── workdir
# ├── 1.pdf
# ├── 2.pdf
# ├── 3.pdf
# ├── 4.pdf
# ├── clean_compressed.pdf
# └── clean.pdf
#
# 1 directory, 12 files
mkdir -p workdir
for f in *.pdf; do
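    # Rewrite each PDF in QDF form (uncompressed streams, editable as text)
    qpdf --qdf --object-streams=disable "${f}" "workdir/${f}"
    # Delete every line that contains the watermark text
    sed -i '/TEXT_TO_BE_REMOVED/d' "workdir/${f}"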
done
# Merge the cleaned PDFs in numeric order; the subshell leaves the caller's directory unchanged
(cd workdir && pdftk $(ls *.pdf | sort -n) cat output clean.pdf)
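Before deleting anything, it can help to check how many lines of a QDF file actually contain the watermark text, and to keep a backup in case the pattern matches more than intended. The following is a minimal sketch, not part of the original script: the function names `count_matching_lines` and `remove_matching_lines` are hypothetical, and it matches a literal string with `grep -F` rather than a regular expression as the `sed` call does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from the original script).

# Print the number of lines in FILE that contain the literal string PATTERN.
count_matching_lines() {
    local pattern="$1" file="$2"
    # -a: treat binary data as text, -F: literal match, -c: count matching lines.
    # grep exits with 1 when nothing matches, so "|| true" keeps the "0" output usable.
    LC_ALL=C grep -a -F -c -- "$pattern" "$file" || true
}

# Remove all lines containing PATTERN from FILE, keeping a copy as FILE.bak.
remove_matching_lines() {
    local pattern="$1" file="$2"
    cp -- "$file" "$file.bak"
    # -v: print only the lines that do NOT match
    LC_ALL=C grep -a -F -v -- "$pattern" "$file.bak" > "$file" || true
}
```

For example, `count_matching_lines 'TEXT_TO_BE_REMOVED' workdir/1.pdf` prints the number of lines that would be dropped; an unexpectedly large number suggests the pattern is too broad.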
The final document will probably be larger than the merged original files, since QDF mode writes streams uncompressed. You might consider compressing workdir/clean.pdf.
$ pdftk $(ls *.pdf | sort -n) cat output orig_merged.pdf
$ qpdf --object-streams=generate --compress-streams=y --decode-level=generalized workdir/clean.pdf workdir/clean_compressed.pdf
$ ls -lah workdir/clean.pdf workdir/clean_compressed.pdf
.rw-r--r-- mika mika 5 MB Wed Jun 22 13:20:27 2022 workdir/clean.pdf
.rw-r--r-- mika mika 3 MB Wed Jun 22 13:21:59 2022 workdir/clean_compressed.pdf
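To put a number on the saving rather than reading it off the listing, the two byte counts can be compared directly. This is a small sketch under the assumption that integer precision is enough; the function name `size_percent` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the size of the second file as an
# integer percentage of the size of the first file.
size_percent() {
    local original compressed
    original=$(wc -c < "$1")
    compressed=$(wc -c < "$2")
    if [ "$original" -eq 0 ]; then
        echo "size_percent: $1 is empty" >&2
        return 1
    fi
    # Integer arithmetic, so the result is rounded down
    echo $(( compressed * 100 / original ))
}
```

Calling `size_percent workdir/clean.pdf workdir/clean_compressed.pdf` prints the compressed size as a percentage of the uncompressed one.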