Document Analysis and Recognition

The digital conversion of documents (books, forms, old archives, correspondence, etc.) is a worldwide initiative, and a considerable amount of effort is invested in protecting and conserving these documents. Producing digital copies gives a larger audience access to those documents and enables common operations such as searching (e.g. word spotting, see below), editing, conversion and publishing. Our aim is to develop state-of-the-art solutions that support this digitization process through image processing, document layout analysis and handwriting recognition in a multi-script environment (Roman, Arabic, Bangla).

While modern models usually rely on machine learning and a significant amount of training data, another research focus of the group is to reduce the data demand of document analysis models. By exploiting techniques such as transfer, semi-supervised or weakly supervised learning, machine learning models become easier to apply and, in the best case, do not require any manually labelled data (see Annotation-free Learning).

Word Spotting

Automatically transcribing documents is possible if the variability of the scripts' visual appearance is very limited (e.g. in modern printed documents that can be transcribed with OCR) or considerable amounts of annotated training material are available.
In application scenarios where these requirements are not fulfilled and handwritten text recognition does not provide satisfactory results, Word Spotting methods offer a viable alternative.
Word spotting describes the retrieval task of finding the most probable occurrences of a word of interest in a document collection.
Because the system returns a ranked list of candidates, it is up to the expert, drawing on domain knowledge, to decide which results are actually relevant.

In recent years, the pattern recognition group has made several influential contributions and developed word spotting methods based on Hidden Markov Models and Convolutional Neural Networks.

Bag-of-Features Hidden Markov Models

The Bag-of-Features HMM word spotting method uses the query-by-example modality. This means that the user has to provide an exemplary instance of the query word in a document image. The proposed query model is estimated only from this individual example and no additional annotated training data is used, i.e. the method is annotation-free. While this limits the visual variability of the text in the document images that the proposed word spotting method can cope with, users are supported with automatic search functionality directly after acquiring a new collection of document images. In this regard, it is also important that the proposed word spotting method does not require a given segmentation of the document into lines or words, i.e. it is segmentation-free.

Given that the proposed query model has only seen a single example of the query word, the proposed word spotting system yields very high performance. This is achieved by exploiting the properties of document images on a very general level. The use of bag-of-features (BoF) allows for adapting the feature representation to the problem domain in an unsupervised manner. Modeling BoF sequences with a hidden Markov model (HMM) takes the length variability of text into account. A decoding algorithm for word spotting with semi-continuous HMMs allows for efficient retrieval in a coarse-to-fine decoding framework.
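The idea of describing a word image as a left-to-right sequence of bag-of-features representations can be illustrated with a minimal sketch. It is not the group's implementation: the toy descriptors, the fixed codebook, the frame width and the function names `nearest_visual_word` and `bof_sequence` are assumptions made for illustration. In practice the codebook would be learned without supervision from local descriptors of the document images, and the resulting sequence would be modeled by the HMM.

```python
def nearest_visual_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bof_sequence(keypoints, codebook, image_width, frame_width):
    """Turn local descriptors into a left-to-right sequence of BoF histograms.

    keypoints: list of (x_position, descriptor) pairs.
    Returns one normalized visual-word histogram per vertical frame.
    """
    n_frames = -(-image_width // frame_width)  # ceiling division
    frames = [[0.0] * len(codebook) for _ in range(n_frames)]
    for x, descriptor in keypoints:
        frame = min(int(x // frame_width), n_frames - 1)
        frames[frame][nearest_visual_word(descriptor, codebook)] += 1.0
    for hist in frames:
        total = sum(hist)
        if total:  # leave empty frames as all-zero histograms
            hist[:] = [h / total for h in hist]
    return frames


# Toy example: two visual words, three keypoints, two frames of width 10.
codebook = [(0.0, 0.0), (1.0, 1.0)]
keypoints = [(1, (0.1, 0.0)), (2, (0.9, 1.0)), (12, (1.0, 0.8))]
sequence = bof_sequence(keypoints, codebook, image_width=20, frame_width=10)
```

Because each frame yields one histogram, longer words simply produce longer sequences, which is the length variability that the HMM is meant to absorb.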


Segmentation-free Bag-of-Features HMM Word Spotting: The figure shows the query model generation and retrieval using our Bag-of-Features HMM. This model describes the spatial sequential structure of Bag-of-Features representations that have been extracted from the query word image. This can be seen as a dynamic, probabilistic extension of the popular Spatial Pyramid that might be known from Bag-of-Features applications in Computer Vision. With our patch-based decoding framework, we are able to detect image regions that are visually similar to the query without providing any prior segmentation on word or line level.


Related publications:

Rothacker, L., Wolf, F. and Fink, G. A., Annotation-free Word Spotting with Bag-of-Features HMMs, IJPRAI, 35(4), pages 2153001, 2020.

Rothacker, L., Rusinol, M., Llados, J. and Fink, G.A., A Two-Stage Approach to Segmentation-Free Query-by-Example Word Spotting, manuscript cultures, 1(7), 47-57, 2014.

Rothacker, L., Rusinol, M. and Fink, G. A., Bag-of-Features HMMs for Segmentation-Free Word Spotting in Handwritten Documents, ICDAR 2013.

Learning Attributes with Convolutional Neural Networks

Common subspace approaches are an elegant way to enable a word spotting system to perform both query-by-example and query-by-string. Here, the textual representation and the word image representation are projected into a common subspace in which the word spotting task boils down to a simple nearest neighbor search. A very successful approach in this regard has been the embedded attributes framework. The projection for the text is done by computing binary textual attributes in a d-dimensional space. Each attribute then represents one dimension in a common attribute space. This attribute representation is called Pyramidal Histogram of Characters (PHOC).
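A minimal sketch of how a PHOC can be computed for a string and used for nearest neighbor retrieval is given here. The alphabet, the pyramid levels, the cosine similarity and the function names are assumptions chosen for illustration, not the configuration used by the group. The sketch follows the commonly used rule that a character is assigned to a pyramid region if at least half of its extent falls inside that region. The candidate vectors are computed from strings only to keep the example self-contained; in a real system they would be predicted from word images.

```python
import math
from string import ascii_lowercase, digits

ALPHABET = ascii_lowercase + digits  # assumed attribute alphabet
LEVELS = (1, 2, 3)                   # assumed pyramid levels


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Binary pyramidal histogram of characters for a string."""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue
                # Work in integer units of 1 / (n * level) to avoid rounding:
                # character i spans [i * level, (i + 1) * level],
                # the region spans [region * n, (region + 1) * n].
                overlap = min((i + 1) * level, (region + 1) * n) - max(i * level, region * n)
                if 2 * overlap >= level:  # at least half of the character
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vector, candidates):
    """Return candidate ids sorted from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda cid: cosine_similarity(query_vector, candidates[cid]),
        reverse=True,
    )


# Query-by-string: the query PHOC is compared with the candidate vectors.
candidates = {
    "img_spot": phoc("spot"),
    "img_ward": phoc("ward"),
    "img_word": phoc("word"),
}
ranking = rank_by_similarity(phoc("word"), candidates)
```

Query-by-example works the same way, except that the query vector is also predicted from a word image instead of being computed from a string.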

We present an approach to word spotting that uses a Convolutional Neural Network (CNN) able to predict multiple attributes at the same time. The CNN optimizes the feature representation and the attribute detectors in a combined, supervised fashion, which leads to discriminative features and highly accurate representations. By leveraging attributes, the CNN is able to predict representations for word image classes with high precision, even if they were not present at training time (out of vocabulary).
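Predicting many binary attributes at once is a multi-label problem. One common way to train such a model, assumed here for illustration and not stated on this page, is to apply an independent sigmoid to each output and minimize the mean binary cross-entropy against the target attribute vector. The sketch below computes that loss from raw network outputs (logits) in a numerically stable form; the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a logit to an attribute probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy_with_logits(logits, targets):
    """Mean binary cross-entropy over all attributes.

    Uses the stable form max(x, 0) - x * t + log(1 + exp(-|x|)),
    which equals -[t * log(p) + (1 - t) * log(1 - p)] with p = sigmoid(x).
    """
    losses = [
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Toy example: three attributes, the first two present in the target.
logits = [4.0, 0.0, -4.0]
targets = [1, 1, 0]
loss = binary_cross_entropy_with_logits(logits, targets)
predicted = [1 if sigmoid(x) >= 0.5 else 0 for x in logits]
```

Because every attribute has its own output, the network can produce a valid attribute vector for a word it never saw during training, as long as the individual attributes were seen.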

While being designed as a segmentation-based method, the approach may be generalized to segmentation-free challenges.


The Convolutional Neural Network PHOCNet is able to map word images to their respective pyramidal histogram of characters. As this projects word images and strings into a common embedding space, the method allows for both query-by-string and query-by-example word spotting.

Related publications:

Sudholt, S. and Fink, G.A., Attribute CNNs for Word Spotting in Handwritten Documents, IJDAR, 21(3), pages 159-160, 2018.

Rothacker, L., Sudholt, S. and Fink, G. A., Word Hypotheses for Segmentation-free Word Spotting in Historic Document Images, ICDAR 2017.

Sudholt, S. and Fink, G. A., PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents, ICFHR 2016.

Next Challenges

Annotation-free Learning

Word spotting is a popular tool, especially for supporting the first exploration of historic, handwritten document collections. Today, the best performing methods rely on machine learning techniques, which require a large amount of annotated training material. As training data is usually not available in the application scenario, annotation-free methods aim at solving the retrieval task without representative training samples.

In our work, we developed a word spotting method that overcomes this drawback by learning without any manually labeled data. The proposed method uses a synthetic dataset to train an initial model. Due to the supervised training on the synthetic dataset, the model is capable of performing query-by-string word spotting. This initial model is then transferred to the target domain iteratively in a semi-supervised manner. Our method uses a lexicon to perform word recognition and thereby generate pseudo-labels for the target domain. The selection of pseudo-labels used to train the network is based on a confidence measure. We show that a confidence-based selection is superior to randomly selecting training samples and that even a rough estimate of the lexicon is sufficient to outperform other annotation-free methods.
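The confidence-based selection of pseudo-labels can be sketched as follows. This is an illustration under stated assumptions, not the published method: the confidence measure (cosine similarity between a predicted attribute vector and the closest lexicon entry), the `keep_fraction` parameter and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pseudo_label(prediction, lexicon_vectors):
    """Recognize a sample as the most similar lexicon word.

    Returns (word, confidence), where confidence is the similarity
    to that word's attribute vector (an assumed confidence measure).
    """
    best = max(
        lexicon_vectors,
        key=lambda word: cosine_similarity(prediction, lexicon_vectors[word]),
    )
    return best, cosine_similarity(prediction, lexicon_vectors[best])


def select_confident(predictions, lexicon_vectors, keep_fraction=0.5):
    """Keep the most confident pseudo-labels for the next training round."""
    labelled = [
        (sample_id, *pseudo_label(vector, lexicon_vectors))
        for sample_id, vector in predictions.items()
    ]
    labelled.sort(key=lambda item: item[2], reverse=True)
    n_keep = max(1, int(len(labelled) * keep_fraction))
    return labelled[:n_keep]


# Toy example: two lexicon words, four unlabeled samples from the target domain.
lexicon_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
predictions = {
    "s1": [0.9, 0.1],
    "s2": [0.5, 0.5],
    "s3": [0.1, 0.8],
    "s4": [0.6, 0.4],
}
selected = select_confident(predictions, lexicon_vectors, keep_fraction=0.5)
```

In an iterative scheme, the model would be retrained on the selected samples and the selection repeated with the updated predictions.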


Annotation-free training scheme: The combination of synthetic data generation and a pseudo-labelling strategy makes it possible to train a PHOCNet without manually labeled data. In our work we show that our method can exploit the benefits of machine learning techniques without introducing the demand for annotated data.

Related publications:

Wolf, F. and Fink, G. A., Annotation-free Learning of Deep Representations for Word Spotting using Synthetic Data and Self Labeling, DAS 2020.

Wolf, F., Brandenbusch, K. and Fink, G. A., Improving Handwritten Word Synthesis for Annotation-free Word Spotting, ICFHR 2020.

Prof. Dr.-Ing. Gernot A. Fink
Head of Research Group
Tel.: 0231 755-6151