Biowulf High Performance Computing at the NIH
Tesseract on Biowulf

Tesseract is an open-source, command-line Optical Character Recognition (OCR) engine. It was originally developed at HP, open-sourced in 2005, and has been developed at Google since then. Version 4 (available on Biowulf) adds an LSTM-based OCR engine and models for dozens of languages and a number of scripts.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 59584206
salloc.exe: job 59584206 queued and waiting for resources
salloc.exe: job 59584206 has been allocated resources
salloc.exe: Granted job allocation 59584206
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load tesseract
[+] Loading singularity  3.5.3  on cn3144
[+] Loading tesseract  4.1.1

[user@cn3144 ~]$ tesseract eurotext.png eurotext-eng-deu -l eng+deu
Tesseract Open Source OCR Engine v4.1.1 with Leptonica

[user@cn3144 ~]$ more eurotext-eng-deu.txt
The (quick) [brown] {fox} jumps!
Over the $43,456.78  #90 dog
& duck/goose, as 12.5% of E-mail
from is spam.
Der „schnelle” braune Fuchs springt
über den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. EI zorro
marrön räpido salta sobre el perro
perezoso. A raposa marrom räpida
salta sobre 0 cäo preguigoso.

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 59584206
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file. For example:

#!/bin/bash
set -e
module load tesseract
tesseract eurotext.png eurotext-eng-deu -l eng+deu

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] <batch_input_file>
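
When one batch job needs to process several images, the single tesseract call in the batch input file could be wrapped in a loop. The following is a minimal sketch under that assumption. The function name ocr_dir, the directory argument, and the .png extension are hypothetical choices for illustration, not part of the Biowulf setup.

```shell
#!/bin/bash
# Hypothetical helper (not part of tesseract or Biowulf):
# run OCR on every PNG in a directory with one language setting.
ocr_dir() {
    local dir="$1" langs="$2" img
    for img in "$dir"/*.png; do
        # If nothing matches, the glob stays literal; skip it
        [ -e "$img" ] || continue
        # tesseract appends .txt to the output base name,
        # so scan.png produces scan.txt next to the image
        tesseract "$img" "${img%.png}" -l "$langs"
    done
}

# Example call, placed after "module load tesseract" in the batch file:
#   ocr_dir ./scans eng+deu
```

The images are processed one after another within a single job. For many independent images, a swarm is likely the better fit.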

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. tesseract.swarm). For example:

tesseract image1.jpeg image1-hineng -l hin+eng
tesseract image2.jpeg image2-hineng -l hin+eng
tesseract image3.jpeg image3-hineng -l hin+eng
tesseract image4.jpeg image4-hineng -l hin+eng

Submit this job using the swarm command.

swarm -f tesseract.swarm [-g #] --module tesseract

-g #                 Number of gigabytes of memory required for each process (1 line in the swarm command file)
--module tesseract   Loads the tesseract module for each subjob in the swarm
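
For a large set of images, the swarmfile does not have to be written by hand. The sketch below generates one tesseract command per image, following the naming pattern of the example above. The function name make_swarm and the directory layout are hypothetical, and it assumes file names without spaces.

```shell
#!/bin/bash
# Hypothetical helper: print one tesseract command per JPEG image,
# in the same form as the swarmfile lines shown above.
make_swarm() {
    local image_dir="$1" langs="$2" img base
    for img in "$image_dir"/*.jpeg; do
        # If nothing matches, the glob stays literal; skip it
        [ -e "$img" ] || continue
        base="${img%.jpeg}"
        # Output base name gets a language suffix: hin+eng -> hineng
        printf 'tesseract %s %s-%s -l %s\n' \
            "$img" "$base" "${langs//+/}" "$langs"
    done
}

# Example: write the swarmfile for all images in ./scans
#   make_swarm ./scans hin+eng > tesseract.swarm
```

The resulting file can then be submitted with the swarm command shown above.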