pdfwarez [2018-11-23 19:02:47] (current)
====== Creating a searchable PDF from a scanned book with OCR ======

Prerequisites:
  * a directory of image files, one per page, for example in ppm format, with names that sort in page order (if you start from a single multipage PDF, extract the pages with "pdfimages file.pdf directory/page"; pdfimages appends "-NNN.ppm" to the given prefix, so a prefix ending in "/" would produce file names that start with "-")
  * tesseract-ocr version **4**, which as of 2018 is available in Debian Sid and in stretch-backports; version 3 is much less accurate
  * language packages for tesseract, for example tesseract-ocr-eng (English) and tesseract-ocr-ces (Czech)
  * imagemagick, pdftk, poppler-utils and GNU parallel (beware: the moreutils package also ships a program called "parallel", which is a different tool with a different syntax)

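Before starting, it can help to confirm that the command-line tools are on the PATH. The following is a minimal sketch: the function name missing_tools is made up for this page, and the list of commands simply mirrors the ones used in the steps below.

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands used in the steps on this page; no output means all were found.
missing_tools pdfimages convert tesseract pdftk pdfunite parallel
```

This only checks that a command with the right name exists. To tell GNU parallel from the moreutils one, "parallel --version" should mention GNU parallel, and "tesseract --version" shows whether version 4 is installed.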
First, create an individual PDF from each image:
<code>
parallel convert -level 25,95% -quality 70 -density 300 -compress jpeg {} {.}.pdf ::: *.ppm
</code>
  * adjust "level" to match your scanner. The goal is for black to come out truly black and white truly white.
  * adjust "density" to your scanner's DPI
  * "{}" is the input file name and "{.}" is the same name without its extension, so page.ppm becomes page.pdf

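Next, run the OCR engine on the original images to create PDFs that contain only the text layer:
<code>
export OMP_THREAD_LIMIT=1
parallel tesseract -c textonly_pdf=1 --oem 1 --dpi 300 -l eng {} {} pdf ::: *.ppm
</code>
  * OMP_THREAD_LIMIT=1 keeps each tesseract process single-threaded, so the processes started by parallel do not compete for cores
  * adjust "dpi" to your scanner's DPI
  * "l" is the language, e.g. eng, ces or deu
  * "--oem 1" selects the LSTM engine introduced in version 4
  * the output for page.ppm is written to page.ppm.pdf
  * run "pkill -f -USR1 parallel" in another terminal to make parallel list the jobs it is currently running, which shows how far it has got
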
Now merge the image layer and the text layer of each page, using the image PDF as the background of the text PDF:
<code>
parallel pdftk {}.pdf background {.}.pdf output combined-{.}.pdf ::: *.ppm
</code>

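The three steps produce three differently named files per page, which is easy to mix up. As a small illustration, the sketch below prints those names for one input image; page_names is a hypothetical helper and page-001.ppm is only an example file name.

```shell
#!/bin/sh
# Show the file names each step writes for one page image.
# page_names is a hypothetical helper used only for illustration.
page_names() {
    f=$1
    base=${f%.ppm}                        # strip the .ppm suffix
    printf 'image: %s\n' "$base.pdf"      # written by convert
    printf 'text: %s\n' "$f.pdf"          # written by tesseract
    printf 'combined: %s\n' "combined-$base.pdf"   # written by pdftk
}

page_names page-001.ppm
```

Note that the text layer keeps the .ppm part in its name (page-001.ppm.pdf), which is what stops it from overwriting the image layer (page-001.pdf).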
Finally, merge all the generated pages into one big PDF. The shell expands the glob in name order, which is why the page names must sort correctly:
<code>
pdfunite combined*.pdf output.pdf
</code>
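Because the final merge takes the pages in file name order, names such as page-2 and page-10 would end up in the wrong order. The sketch below is one way to check this before merging. It assumes GNU sort (for the -V option), approximates the shell's glob order with a plain byte-wise sort, and uses a made-up function name, glob_order_ok.

```shell
#!/bin/sh
# Succeed if plain name order matches numeric page order.
# Reads one file name per line on stdin.
# Assumes GNU sort, whose -V option orders embedded numbers naturally.
# glob_order_ok is a hypothetical helper name.
glob_order_ok() {
    names=$(cat)
    [ "$(printf '%s\n' "$names" | LC_ALL=C sort)" = \
      "$(printf '%s\n' "$names" | LC_ALL=C sort -V)" ]
}

# Warn if the combined pages would be merged out of order.
printf '%s\n' combined-*.pdf | glob_order_ok || echo "page names do not sort in page order"
```

If the check fails, renaming the source images with zero-padded numbers (001, 002, ...) and rerunning the steps is the simplest fix.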
pdfwarez.txt · Last modified: 2018-11-23 19:02:47 (external edit)