The paperless office is still a myth. Despite the multitude of electronic tools and devices available for producing digital documents of any kind, many documents still exist only on paper. The digital conversion of these documents (books, forms, old archives, correspondence, etc.) is a worldwide initiative, and a considerable amount of effort is invested in protecting and conserving them. Producing digital copies gives a larger audience the opportunity to access these documents and to perform common operations such as searching (e.g. word spotting, see below), editing, conversion, and publishing.
Our aim is to propose and develop state-of-the-art solutions that support this digitization process through image processing, document layout analysis, and handwriting recognition in a multi-script environment (Roman, Arabic, Bangla).
Image processing in such a document environment is a challenging research field because a document is a special type of image that requires special treatment to produce the best possible image quality for the subsequent processing steps. The goal of this image processing is to preserve as much as possible of the document layout and the underlying text while discarding noise and artifacts that stem from the document itself (e.g. old documents) or from the digitization process.
Layout analysis is a strategic part of Document Analysis and Recognition (DAR). It is responsible for automatically retrieving the logical structure of the document, which is composed of entities such as titles, figures, paragraphs, lines, and words. Such a decomposition of a document can positively affect recognition. For printed material, layout detection can rely heavily on regularities derived from letter size, spacing, font style, and the line-wise spatial arrangement of letters or words; for unconstrained handwritten material, more sophisticated solutions are needed.
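To illustrate how the line-wise regularity of printed material can be exploited, the following is a minimal sketch of a horizontal projection profile, a textbook technique that is not necessarily the method used in this group. It assumes a binarized image given as a list of rows with 1 for ink and 0 for background; the function name `find_text_lines` and the `min_ink` parameter are hypothetical.

```python
def find_text_lines(binary_image, min_ink=1):
    """Return (top, bottom) row ranges, inclusive, of candidate text lines.

    binary_image: list of rows, each a list of 0/1 values (1 = ink).
    min_ink: assumed threshold on the number of ink pixels per row.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in binary_image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


# Toy image: two ink bands separated by one empty row.
toy_image = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
detected = find_text_lines(toy_image)
```

A profile like this only separates lines that are straight and well spaced. Skewed, touching, or curved lines, which are common in unconstrained handwriting, break the assumption, and this is why more sophisticated solutions are needed there.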
Last but not least, the recognition of words in a multi-script environment is the most challenging research topic addressed by this research group. We aim to recognize unconstrained handwriting and to cope with the multitude of scripts and their graphical specificities in large-vocabulary scenarios.
In offline automatic handwriting recognition, images of text are transcribed into a textual (e.g. ASCII) representation. The major challenge is the great variability of human handwriting found in unconstrained multi-writer scenarios. Recognizers cope with this variability by increasing their model complexity. However, more complex models usually require more annotated training material, which is either extremely costly or simply unavailable due to the particularity of the data, e.g., historical document images that are to be transcribed. One of our efforts to reduce model complexity while still achieving state-of-the-art results in challenging multi-writer scenarios is sub-character modeling.
Sub-character HMM modeling for Arabic text allows common patterns to be shared between the different position-dependent visual forms of an Arabic character as well as between different characters. The number of HMMs is reduced considerably while the variations in shape patterns are still captured. A character is split horizontally into sub-characters that exploit these similar patterns, and the sub-characters can later be used to reconstruct the original characters. This results in a compact, efficient, and robust recognizer with a reduced model set. The sub-character HMMs do not need any explicit segmentation of characters into sub-characters. Instead, the HMMs learn the patterns automatically from the training data, as long as they are defined adequately in the dictionary using domain knowledge of the script. In addition to the sub-character HMMs, we are investigating a space model for Arabic text recognition to better cope with the irregular spacing found in handwritten Arabic text. Moreover, we are investigating other aspects of HMMs, such as contextual HMMs and multi-stream HMMs, which can potentially improve the text recognition capabilities of a recognizer.
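The role of the dictionary can be sketched as follows. Each position-dependent character form is spelled as a sequence of shared sub-character units, and one HMM is needed per unit rather than per form. The unit names (`teeth`, `loop`, `bowl`) and the splits are simplified, illustrative assumptions and not the dictionary used in the cited publications; the helper functions are hypothetical as well.

```python
# Hypothetical dictionary: character form -> sequence of sub-character units.
# The splits are deliberately simplified for illustration.
SUBCHAR_DICTIONARY = {
    "seen_initial":  ["teeth"],
    "seen_medial":   ["teeth"],
    "seen_final":    ["teeth", "bowl"],
    "seen_isolated": ["teeth", "bowl"],
    "sad_initial":   ["loop"],
    "sad_medial":    ["loop"],
    "sad_final":     ["loop", "bowl"],
    "sad_isolated":  ["loop", "bowl"],
}


def required_models(dictionary):
    """Return the set of distinct sub-character units, i.e. one HMM per unit."""
    return {unit for units in dictionary.values() for unit in units}


def expand_word(forms, dictionary):
    """Concatenate the unit sequences of a word given as a list of character forms."""
    return [unit for form in forms for unit in dictionary[form]]


models = required_models(SUBCHAR_DICTIONARY)
word_units = expand_word(["sad_initial", "seen_final"], SUBCHAR_DICTIONARY)
print(len(SUBCHAR_DICTIONARY), "character forms share", len(models), "unit models")
```

In this toy setting, eight character forms are covered by three unit models. During training, only the unit sequence of each word is given, so no explicit segmentation of the character images into sub-characters is required.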
Ahmad, I., Fink, G. A. and Mahmoud, S., Improvements in Sub-Character HMM Model Based Arabic Text Recognition, ICFHR 2014.
Ahmad, I., Rothacker, L., Fink, G. A. and Mahmoud, S., Novel Sub-character HMM Models for Arabic Text Recognition, ICDAR 2013.
Rothacker, L., Vajda, S. and Fink, G. A., Bag-of-Features Representations for Offline Handwriting Recognition Applied to Arabic Script, ICFHR 2012.
Automatically transcribing documents is possible if the variability of the script's visual appearance is very limited (e.g. in modern printed documents that can be transcribed with OCR) or if considerable amounts of annotated training material are available. For historical documents in particular, neither assumption holds.
Historical documents are often handwritten and therefore exhibit large variability due to the human writing process. Even if they were printed, the printing process did not produce results as uniform and clear as we are used to today. Severe artifacts such as fading ink, ink bleed-through, paper stains, and other degradations from document storage cause additional variability that is unwanted for recognition.
Historical document collections are usually quite particular. For that reason, training material cannot easily be adopted from different, ideally modern, sources. If a large portion of the documents has to be transcribed manually before a fully automatic transcription becomes available, the ad-hoc application of such systems in real-world scenarios is hardly possible.
Query-by-example word spotting offers a compromise: large archives of document images can be searched rapidly based on only a single exemplary instance of the query word. The search result is presented as a list of document image regions ranked according to their similarity to the query. This is more robust with respect to recognition errors because the user interprets a list of n-best results. In transcription-based systems it is usually infeasible to present n-best results for the recognition of a full text.
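The retrieval step can be sketched as a ranking of candidate regions by their similarity to the query. The sketch below assumes that every region and the query have already been converted into fixed-length feature vectors and uses cosine similarity as a stand-in score; the region identifiers, the vectors, and the function names are hypothetical and do not describe the scoring used in our systems.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long feature vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_regions(query_vector, regions, n_best=10):
    """Return the n_best (region_id, score) pairs, most similar first.

    regions: dict mapping a region identifier to its feature vector.
    """
    scored = [
        (region_id, cosine_similarity(query_vector, vector))
        for region_id, vector in regions.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n_best]


# Toy archive with three candidate regions (made-up vectors).
toy_regions = {
    "page1_region1": [1.0, 0.0, 0.0],
    "page1_region2": [0.0, 1.0, 0.0],
    "page2_region1": [1.0, 1.0, 0.0],
}
ranking = rank_regions([1.0, 0.0, 0.0], toy_regions, n_best=2)
```

Because the user inspects the ranked list, a relevant region that is not ranked first can still be found, which is the robustness argument made above.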
Query-by-example word spotting works as long as the words' visual appearances are similar, as in single-writer scenarios. If the variability increases, as in multi-writer scenarios, more examples are required. The generalization capabilities of word spotting methods therefore scale with the availability of annotated training material.
Segmentation-free Bag-of-Features HMM Word Spotting: The figure shows query model generation and retrieval using our Bag-of-Features HMM. This model describes the spatial, sequential structure of the Bag-of-Features representations extracted from the query word image. It can be seen as a dynamic, probabilistic extension of the popular Spatial Pyramid known from Bag-of-Features applications in Computer Vision. With our patch-based decoding framework, we are able to detect image regions that are visually similar to the query without requiring any prior segmentation at word or line level.
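As a rough illustration of the kind of input such a model works on, the sketch below turns local descriptors of a word image into a left-to-right sequence of Bag-of-Features histograms. It assumes a given codebook of visual words, hard assignment to the nearest visual word, and non-overlapping vertical frames; these choices and all names are assumptions made for the example. HMM estimation and the patch-based decoding are not shown.

```python
def nearest_visual_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bof_sequence(keypoints, codebook, image_width, frame_width):
    """Return one visual-word histogram per vertical frame, ordered left to right.

    keypoints: list of (x_position, descriptor) pairs from the word image.
    """
    n_frames = max(1, -(-image_width // frame_width))  # ceiling division
    histograms = [[0] * len(codebook) for _ in range(n_frames)]
    for x, descriptor in keypoints:
        frame = min(int(x // frame_width), n_frames - 1)
        histograms[frame][nearest_visual_word(descriptor, codebook)] += 1
    return histograms


# Toy data: a codebook with two visual words and three local descriptors.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_keypoints = [(1, [0.1, 0.0]), (2, [0.9, 1.0]), (12, [1.0, 0.8])]
sequence = bof_sequence(toy_keypoints, toy_codebook, image_width=20, frame_width=10)
```

Each histogram in the sequence would serve as one observation. A fixed Spatial Pyramid would assign the frames to cells rigidly, whereas a sequential probabilistic model can align them flexibly to its states.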
Rothacker, L., Rusinol, M., Llados, J. and Fink, G. A., A Two-Stage Approach to Segmentation-Free Query-by-Example Word Spotting, manuscript cultures, 1(7), 47-57, 2014.
Fink, G. A., Rothacker, L. and Grzeszick, R., Grouping Historical Postcards Using Query-by-Example Word Spotting, ICFHR 2014.
Rothacker, L., Fink, G. A., Banerjee, P., Bhattacharya, U. and Chaudhuri, B. B., Bag-of-Features HMMs for Segmentation-Free Bangla Word Spotting, ICDAR 2013.
Rothacker, L., Rusinol, M. and Fink, G. A., Bag-of-Features HMMs for Segmentation-Free Word Spotting in Handwritten Documents, ICDAR 2013.