Creating a searchable scanned and OCR'd book

Prerequisites:

A directory of image files, one per page, for example in PPM format, named so that they sort in page order (if you start from a single multipage PDF, extract the images with “pdfimages file.pdf directory/”).
tesseract-ocr version 4, available in Debian Sid and in stretch-backports. Version 3 is much less accurate.
Language packages for tesseract, for example tesseract-ocr-eng (English) and tesseract-ocr-ces (Czech).
imagemagick, pdftk, poppler-utils and parallel (beware: the moreutils package also ships a program called “parallel”, which is a different tool; the commands here need GNU parallel).
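
Before starting, it can help to confirm that the tools are actually on your PATH. The following is only a sketch: the function name missing_tools is made up for this example, and the GNU parallel check assumes that its --version output contains the words “GNU parallel”, which the moreutils program is not expected to print.

```shell
# Print each required tool that is not found on PATH
# (prints nothing if all of them are present).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing_tools pdfimages convert tesseract pdftk pdfunite parallel

# Assumption: GNU parallel mentions "GNU parallel" in its --version output,
# while the moreutils program of the same name does not.
if command -v parallel >/dev/null 2>&1; then
    parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel' \
        || echo "parallel is installed, but it does not look like GNU parallel"
fi
```

If nothing is printed, all six programs were found and parallel identified itself as GNU parallel.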

First, create individual PDFs out of these images.

for f in *.ppm; do echo "convert -level 25,95% -quality 70 -density 300 -compress jpeg $f ${f%ppm}pdf"; done | parallel

Adjust “level” to match your scanner. The goal is for black to come out truly black and white truly white.
Adjust “density” to your scanner's DPI.
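
Since “level” and “density” usually need a few attempts, one option is to generate the convert command from parameters instead of editing the loop each time. This is a sketch only: image_pdf_cmd is a hypothetical helper name, and it just prints the same command line as the loop above without running anything.

```shell
# Hypothetical helper: print the convert command for one page.
# $1 = page image (*.ppm), $2 = level (e.g. 25,95%), $3 = density in DPI
image_pdf_cmd() {
    page=$1 level=$2 density=$3
    # ${page%ppm}pdf swaps the trailing "ppm" for "pdf": page-000.ppm -> page-000.pdf
    printf 'convert -level %s -quality 70 -density %s -compress jpeg %s %s\n' \
        "$level" "$density" "$page" "${page%ppm}pdf"
}

# Show the command for a single page, to try out settings before a full run.
image_pdf_cmd page-000.ppm 25,95% 300
```

To process the whole book with the chosen values, the printed commands can be piped to parallel as before: for f in *.ppm; do image_pdf_cmd "$f" 25,95% 300; done | parallel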

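Next, run the OCR engine on the original images and create PDFs that contain only the text layer. Setting OMP_THREAD_LIMIT=1 keeps each tesseract process on a single thread, because parallel already runs several processes at once. Set “-l” to the language of the book and “--dpi” to the same value as “density” above: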

export OMP_THREAD_LIMIT=1
for f in *.ppm; do echo "tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng $f $f pdf"; done | parallel

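Now merge the image layer and the text layer. For each page, the text-only PDF from tesseract ($f.pdf, for example page.ppm.pdf) is the input, and the image PDF (page.pdf) is placed behind it as the background: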

for f in *.ppm; do echo "pdftk $f.pdf background ${f%.ppm}.pdf output combined-${f%.ppm}.pdf"; done | parallel
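
The file names in this step are easy to mix up, because tesseract appends “.pdf” to the full image name while convert replaces the extension. The sketch below only prints the three names involved for one page; page_files is a hypothetical helper, not part of any of the tools.

```shell
# Hypothetical helper: show which PDF is which for one page image.
page_files() {
    f=$1
    printf 'image layer: %s\n' "${f%.ppm}.pdf"        # written by convert
    printf 'text layer: %s\n' "$f.pdf"                # written by tesseract
    printf 'combined: %s\n' "combined-${f%.ppm}.pdf"  # written by pdftk
}

page_files page-001.ppm
```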

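Finally, merge all the generated pages into one big PDF. The shell expands combined*.pdf in sorted name order, which is why the page names have to sort correctly: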

pdfunite combined*.pdf output.pdf
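
If one of the parallel jobs failed, its page is silently absent from the result, so a quick check before running pdfunite may be worthwhile. This is a minimal sketch, assuming it is run in the directory that holds the images; missing_combined is a hypothetical helper name.

```shell
# Hypothetical helper: list page images in the current directory
# that have no combined PDF yet.
missing_combined() {
    for f in *.ppm; do
        [ -e "$f" ] || continue    # the glob matched nothing: no .ppm files here
        [ -e "combined-${f%.ppm}.pdf" ] || printf '%s\n' "$f"
    done
}

missing_combined
```

No output means every page image has a matching combined PDF.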
