A neural network based detector for handwritten words.
- Download the trained model and place the unzipped files into the `model` directory
- Go to the `src` directory and execute `python infer.py`
- This opens a window showing the words detected in the test images (located in `data/test`)
- Required libs: torch, numpy, sklearn, cv2, path, matplotlib
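
The required libraries can be verified with a short check before running the demo; this snippet is only a convenience sketch and not part of the repository:

```python
# Convenience check (not part of the repository): verify that the
# required libraries can be imported before running infer.py.
import importlib

for name in ["torch", "numpy", "sklearn", "cv2", "path", "matplotlib"]:
    try:
        mod = importlib.import_module(name)
        print(f"{name}: {getattr(mod, '__version__', 'installed')}")
    except ImportError:
        print(f"{name}: missing")
```
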
- The model is trained with the IAM dataset
- Download the forms and the XML files
- Create a dataset directory on your disk with two subdirectories: `gt` and `img`
- Put all form images into the `img` directory
- Put all XML files into the `gt` directory
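
For illustration, the layout above can be prepared with a few lines of Python; the source paths below are placeholders for wherever the IAM forms and XML files were downloaded:

```python
# Sketch for preparing the dataset layout. The source paths are
# placeholders - point them at your downloaded IAM forms and xml files.
import shutil
from pathlib import Path

forms_download = Path("/path/to/downloaded/forms")  # IAM form images (png)
xml_download = Path("/path/to/downloaded/xml")      # IAM xml ground truth
dataset_dir = Path("/path/to/dataset")              # later passed as --data_dir

(dataset_dir / "img").mkdir(parents=True, exist_ok=True)
(dataset_dir / "gt").mkdir(parents=True, exist_ok=True)

for f in forms_download.glob("*.png"):
    shutil.copy(f, dataset_dir / "img" / f.name)
for f in xml_download.glob("*.xml"):
    shutil.copy(f, dataset_dir / "gt" / f.name)
```
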
- Go to `src` and execute `python train.py` with the following parameters specified (only the first one is required; an example invocation is sketched after this list):
  - `--data_dir`: dataset directory containing a `gt` and an `img` directory
  - `--batch_size`: 27 images per batch are possible on an 8GB GPU
  - `--caching`: cache the dataset to avoid loading and decoding the png images; the cache file is stored in the dataset directory
  - `--pretrained`: initialize with saved model weights
  - `--val_freq`: speed up training by only validating every n-th epoch
  - `--early_stopping`: stop training after n validation steps without improvement
- The model weights are saved every time the F1 score on the validation set increases
- A log is written into the `log` directory, which can be opened with tensorboard
- Executing `python eval.py` evaluates the trained model
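
For reference, a training run with the flags listed above might be launched as follows; the values are examples only, and treating `--caching` and `--pretrained` as plain switches is an assumption:

```python
# Example launch of train.py with the documented flags. Values are
# illustrative; only --data_dir is required. Treating --caching (and
# --pretrained) as boolean switches is an assumption.
import subprocess

subprocess.run(
    [
        "python", "train.py",
        "--data_dir", "/path/to/dataset",  # must contain gt/ and img/
        "--batch_size", "27",              # fits on an 8GB GPU
        "--caching",                       # cache decoded png images
        "--val_freq", "5",                 # validate every 5th epoch
        "--early_stopping", "10",          # stop after 10 validations without improvement
    ],
    cwd="src",       # run from the src directory
    check=True,
)
```
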
- The model classifies each pixel into one of three classes (see plot below):
- Inner part of a word (plot: red)
- Outer part of a word (plot: green)
- Background (plot: blue)
- An axis-aligned bounding box is predicted for each inner-word pixel
- DBSCAN clusters the predicted bounding boxes (see the clustering sketch below)
- The backbone of the neural network is based on the ResNet18 model (taken from torchvision, with modifications; see the backbone sketch below)
- The model is inspired by the ideas of Zhou and Axler
- See this article for more details
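
The post-processing described above (per-pixel box predictions clustered with DBSCAN) might look roughly like the following sketch; the Jaccard distance and the `eps`/`min_samples` values are assumptions for illustration and may differ from the actual implementation:

```python
# Sketch of the post-processing: cluster the per-pixel box predictions
# with DBSCAN and keep one aggregated box per cluster. The Jaccard
# (1 - IoU) distance and the eps/min_samples values are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def cluster_boxes(boxes, eps=0.7, min_samples=3):
    """boxes: float array of shape (N, 4), one box per inner-word pixel."""
    n = len(boxes)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - iou(boxes[i], boxes[j])  # Jaccard distance
            dist[i, j] = dist[j, i] = d
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit(dist).labels_
    # aggregate each cluster into a median box; label -1 is DBSCAN noise
    return [np.median(boxes[labels == k], axis=0) for k in set(labels) if k != -1]
```
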
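A minimal sketch of using torchvision's ResNet18 as a feature-extraction backbone, with the classifier head removed; the actual modifications made in this repository are not reproduced here:

```python
# Minimal backbone sketch: take torchvision's ResNet18 and drop the
# classifier, keeping only the convolutional feature extractor.
# This only illustrates the general idea, not the repository's exact model.
import torch
import torchvision

resnet = torchvision.models.resnet18()                      # no pretrained weights
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

x = torch.randn(1, 3, 448, 448)   # dummy input batch
features = backbone(x)            # -> shape (1, 512, 14, 14)
print(features.shape)
```
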

