— |
pdfwarez [2018-11-23 19:02:47] (current) |
||
---|---|---|---|
+ | ====== Creating a searchable scanned and OCR'd book ====== | ||
+ | Prerequisites: | ||
+ | * a directory of image files, one per page, for example in PPM format, with names that sort in page order (if you start from one multipage PDF, run "pdfimages file.pdf directory/page"; pdfimages appends "-NNN.ppm" to the given prefix, so a bare "directory/" would produce names starting with a hyphen) | ||
+ | * tesseract-ocr version **4**, available in Debian Sid and in stretch-backports. Version 3 is much less accurate. | ||
+ | * language packages for tesseract, for example tesseract-ocr-eng (English) and tesseract-ocr-ces (Czech) | ||
+ | * imagemagick, pdftk, poppler-utils and GNU parallel (beware: the moreutils package also ships a program named "parallel", which is a different tool with a different syntax) | ||
+ | |||
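Before starting, it can help to confirm that the required commands are installed and that "parallel" really is the GNU version. The following is only a minimal sketch: the function names missing_tools and is_gnu_parallel are made up for this example, and the check assumes that GNU parallel identifies itself in its --version output.

```shell
# List every command from the arguments that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# GNU parallel names itself in its --version output; the check fails
# for any other program installed under the name "parallel".
is_gnu_parallel() {
    parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

missing=$(missing_tools convert tesseract pdftk pdfimages pdfunite parallel)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing commands:" $missing
elif ! is_gnu_parallel; then
    echo "parallel is installed, but it does not look like GNU parallel"
fi
```

If nothing is printed, all six commands were found and parallel reported itself as GNU parallel.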
+ | First, convert each image into a single-page PDF: | ||
+ | <code> | ||
+ | # {} is the input image, {.} is the same name without its extension; parallel quotes both, so names with spaces work | ||
+ | parallel convert -level 25,95% -quality 70 -density 300 -compress jpeg {} {.}.pdf ::: *.ppm | ||
+ | </code> | ||
+ | * adjust "level" to match your scanner. The goal is for the ink to come out pure black and the paper pure white. | ||
+ | * adjust "density" to match your scanner's DPI | ||
+ | |||
+ | Next, run the OCR engine on the original images to create PDFs that contain only the text layer: | ||
+ | <code> | ||
+ | export OMP_THREAD_LIMIT=1 | ||
+ | # {} is the input image; tesseract appends ".pdf" to the output base, giving IMAGE.ppm.pdf | ||
+ | parallel tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng {} {} pdf ::: *.ppm | ||
+ | </code> | ||
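+ | * adjust "--dpi" to match your scanner's DPI | ||
+ | * "-l" selects the OCR language, e.g. eng, ces or deu | ||
+ | * OMP_THREAD_LIMIT=1 keeps each tesseract process single-threaded, so that all parallelism comes from parallel running one process per page | ||
+ | * run "pkill -f -USR1 parallel" from another terminal to make parallel print the jobs it is currently running, which gives a rough idea of the progress | ||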
+ | |||
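OCR can fail on individual pages, so it may be worth checking that every image got a text-layer PDF before merging. This is a minimal sketch, assuming the naming used above (the text layer for IMAGE.ppm is written to IMAGE.ppm.pdf); the function name missing_text_layers is made up for this example.

```shell
# Print each image in the given directory (default: current directory)
# that has no non-empty text-layer PDF next to it.
missing_text_layers() {
    dir=${1:-.}
    for f in "$dir"/*.ppm; do
        [ -e "$f" ] || continue            # the glob matched nothing
        [ -s "$f.pdf" ] || printf '%s\n' "$f"
    done
}

missing_text_layers .
```

An empty result means every page has a text layer. Any file that is listed can be passed to tesseract again on its own.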
+ | Now, merge the image layer and the text layer of each page: | ||
+ | <code> | ||
+ | # {}.pdf is the text layer (IMAGE.ppm.pdf), {.}.pdf is the image layer (IMAGE.pdf) | ||
+ | parallel pdftk {}.pdf background {.}.pdf output combined-{.}.pdf ::: *.ppm | ||
+ | </code> | ||
+ | |||
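The merge step relies on three file names per page, which are easy to mix up. pdftk's "background" operation places the second PDF behind the first, so the text layer is the input and the image PDF is the background. The sketch below only prints the names for one image; page_pdf_names and page-001.ppm are hypothetical names used for illustration.

```shell
# Print the three PDF names used for one page image, in the order:
# text layer, image layer, combined output.
page_pdf_names() {
    f=$1
    printf '%s\n' "$f.pdf" "${f%.ppm}.pdf" "combined-${f%.ppm}.pdf"
}

page_pdf_names page-001.ppm
# page-001.ppm.pdf
# page-001.pdf
# combined-page-001.pdf
```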
+ | Finally, merge all the generated pages into one big PDF: | ||
+ | <code> | ||
+ | pdfunite combined*.pdf output.pdf | ||
+ | </code> |
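pdfunite joins the files in the order they appear on the command line, and the shell expands "combined*.pdf" in sorted name order, so unpadded page numbers (1, 2, 10) end up in the wrong place. The following sketch flags that case. It assumes GNU sort (for the -V option) and uses byte-wise C-locale ordering as an approximation of the shell's glob order; order_is_safe is a made-up name.

```shell
# Succeed only if plain byte-wise ordering of the names equals their
# natural (version) ordering. Requires GNU sort for -V.
order_is_safe() {
    lexical=$(printf '%s\n' "$@" | LC_ALL=C sort)
    natural=$(printf '%s\n' "$@" | LC_ALL=C sort -V)
    [ "$lexical" = "$natural" ]
}

if ! order_is_safe combined*.pdf; then
    echo "warning: page files may not merge in numeric order" >&2
fi
```

If the warning appears, renaming the images with zero-padded numbers before the first step avoids the problem.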