@ntodd commented Dec 18, 2014

Leverage GNU Parallel to OCR multiple pages in parallel. If Parallel is installed, a full-document extraction generates an image for each page and then spawns a tesseract process for each available core. If Parallel is not installed, or only a subset of pages is specified, the previous behavior is used. This speeds up OCR processing significantly on multi-core machines.
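For readers who want to see the shape of the idea, here is a minimal Python sketch of the decision described above. It is not the code in this pull request: the function name `build_ocr_commands`, its parameters, and the `.tif` file names are all hypothetical, and it only builds the command lines rather than running them. It assumes the standard `tesseract IMAGE OUTPUTBASE -l LANG` invocation and GNU Parallel's `{}` (input) and `{.}` (input without extension) replacement strings; Parallel runs one job per CPU core by default.

```python
import os
import shutil


def build_ocr_commands(page_images, full_document=True, parallel_path=None, language="eng"):
    """Return a list of argv lists that would OCR the given page images.

    Hypothetical helper for illustration only; it does not execute anything.
    """
    if parallel_path is None:
        # shutil.which returns None when the executable is not on PATH.
        parallel_path = shutil.which("parallel")

    if full_document and parallel_path:
        # One GNU Parallel invocation fans the pages out across cores.
        # "{}" is the image path, "{.}" is the same path minus its extension,
        # which tesseract uses as the output base (it appends ".txt").
        return [[parallel_path, "tesseract", "{}", "{.}", "-l", language,
                 ":::", *page_images]]

    # Fallback: one sequential tesseract command per page.
    return [
        ["tesseract", image, os.path.splitext(image)[0], "-l", language]
        for image in page_images
    ]
```

Each returned command could then be passed to something like `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the fallback logic easy to check without tesseract or Parallel installed.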

With a bit more work, this could be leveraged by the other OCR code paths.

Nate Todd added 2 commits December 18, 2014 17:20
Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction.  If Parallel is not installed, use previous behavior.
@deuxshaish

I like this a lot. Will test and observe. Thanks for the commit.

@pickhardt

This is a great idea.
