OCR = optical character recognition
Enabling full text searches
OCR is used to locate and recognize sequences of characters in an image, thus enabling them to be converted into words which can then be used to carry out full text searches. Conversion is carried out automatically by software, doing away with the need for much more costly manual retranscription. Words and sequences of characters stored in a text file can be reused within a new page layout, held in a database, etc.
The idea is that characters are recognized on the basis of forms memorized by the software and known terms contained in the dictionary used by the tool. This means that each space and each sequence of characters (known as a “string”) is accurately identified and reassembled in the main reading direction.
The quality of OCR depends on the primary document and the quality of digitization.
Although OCR techniques are constantly evolving in response to very high demand, the quality of recognition depends on a large number of factors linked to both the primary document and the digitization process itself. For example:
- Digital images must have sufficient contrast and be sufficiently straight.
- Printing defects (over-inked characters, smudges, and especially show-through from the reverse side of the page) reduce the quality of word recognition and segmentation.
- Materials with a columnar layout and/or illustrations, which must be read in a non-linear direction, are more complex to process than materials with a consistent layout.
- Generally speaking, fonts which are either very small or very large and/or which have spaced characters are difficult to process.
- Materials using non-Latin alphabets are also complex to process, though progress is at a more advanced stage than for ancient handwritten texts.
The ALTO format (Analyzed Layout and Text Object)
In order to use the output from OCR, BnF uses an XML-based format governed by a schema, the ALTO format.
ALTO is one of the most commonly used formats for storing the output of image-to-text conversion. It is an XML-based format governed by a schema. It retains the full coordinates of content (text, illustrations, graphics, etc.) within the image, which enables the text to be overlaid on the image (in a multilayer PDF file) and search terms to be highlighted.
Elements and sub-elements of the ALTO format
ALTO enables pages to be segmented into various elements, each consisting of sub-elements:
The page element can contain five elements:
- TopMargin: the top part of the page, from the left-hand edge to the right-hand edge excluding the text area. Where possible, this is the area containing the title, credits, etc.
- BottomMargin: the bottom part of the page, from the left-hand edge to the right-hand edge excluding the text area.
- LeftMargin: the left-hand part of the page excluding the top and bottom parts and the text area.
- RightMargin: the right-hand part of the page excluding the top and bottom parts and the text area.
- PrintSpace: the text area. This is a mandatory element. It must contain at least one BlockGroup element.
Figure: Example zoning for a press page (© BnF)
Where one of these elements contains information (text, an illustration, etc.), that information is described in one or more BlockGroup elements.
There are four different types of BlockGroup element:
- TextBlock: a block of text. This element is used to group lines of text into a coherent unit.
- Illustration: an image or drawing.
- GraphicalElement: a graphical element other than an image or drawing. This element can be used to describe an intertextual separation element or a textual element not recognized as such by the OCR process.
- ComposedBlock: used to enable BlockGroup elements to overlap.
Within a TextBlock, the String element groups together sequences of characters.
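The nesting described above can be explored with a few lines of Python. This is only a sketch: the sample fragment is hand-written for illustration, it omits the XML namespace and most attributes that real ALTO files carry, and the function name outline is hypothetical. The TextLine and SP (space) elements used between TextBlock and String come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment (no namespace), for illustration only.
SAMPLE = """
<alto>
  <Layout>
    <Page ID="PAG_1" HEIGHT="3000" WIDTH="2000">
      <TopMargin ID="TM_1"/>
      <PrintSpace ID="PS_1">
        <TextBlock ID="TB_1">
          <TextLine ID="TL_1">
            <String ID="S_1" CONTENT="Hello"/>
            <SP/>
            <String ID="S_2" CONTENT="world"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL_1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def outline(xml_text):
    """List the blocks found directly inside PrintSpace, in document order.

    Returns (element name, ID, text) tuples; text is None for non-text blocks.
    """
    root = ET.fromstring(xml_text)
    print_space = root.find("./Layout/Page/PrintSpace")
    result = []
    for block in print_space:
        if block.tag == "TextBlock":
            # Collect every String below the block, whatever the line it is on.
            words = [s.get("CONTENT") for s in block.iter("String")]
            result.append((block.tag, block.get("ID"), " ".join(words)))
        else:
            result.append((block.tag, block.get("ID"), None))
    return result


if __name__ == "__main__":
    for entry in outline(SAMPLE):
        print(entry)
```

With a real file, the element names would need to be qualified with the namespace declared at the top of that file.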
Generic attributes
Each of the elements and blocks listed above is described by generic attributes. These attributes give the dimensions and type of each block, line and string as follows:
- ID: block number
- Height: height of block in pixels
- Width: width of block in pixels
- Quality: quality of recognition
- Fontstyle: style of font
- Type: type of font
Pixel coordinates are defined with reference to the upper leftmost reference point on the page. This means that each block, line and string is recognized in the order in which it is laid out in the original.
The ALTO format can also be used to indicate geometric shapes (circles, polygons, ellipses, etc.), illustrations and graphics, manage hyphenation, etc. Non-textual objects also have their own zoning and coordinates.
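Because every string keeps its position and size, a search hit can be turned into a rectangle to highlight on the page image. The sketch below assumes positions are carried by the HPOS and VPOS attributes of the ALTO schema (horizontal and vertical offset from the upper left corner) and that all values are in pixels. The sample fragment is hand-written without a namespace, and the function name find_boxes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment (no namespace), for illustration only.
SAMPLE = """<alto><Layout><Page ID="P1" WIDTH="2000" HEIGHT="3000"><PrintSpace ID="PS1">
<TextBlock ID="TB1"><TextLine ID="TL1">
<String ID="S1" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="40" CONTENT="Paris"/>
<String ID="S2" HPOS="270" VPOS="200" WIDTH="90" HEIGHT="40" CONTENT="est"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""


def find_boxes(xml_text, term):
    """Return (left, top, right, bottom) boxes for each String equal to term.

    Matching is case-insensitive; coordinates are measured from the
    upper left corner of the page.
    """
    wanted = term.lower()
    boxes = []
    for s in ET.fromstring(xml_text).iter("String"):
        if s.get("CONTENT", "").lower() != wanted:
            continue
        left = float(s.get("HPOS"))
        top = float(s.get("VPOS"))
        right = left + float(s.get("WIDTH"))
        bottom = top + float(s.get("HEIGHT"))
        boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    print(find_boxes(SAMPLE, "paris"))
```

A viewer would then draw each returned rectangle over the page image at the same scale as the digitized file.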
Identifying words and strings
Each sequence of characters making up a word or part of a hyphenated word (a “string”) is identified using the following information:
- Generic attributes
- <content>: a word that has been recognized by the OCR tool and/or re-input to the required quality level (high quality, i.e. a 99.985% recognition rate, implies human correction).
- Hyphenation: the recognized part is contained within <content>, together with:
  - <subs_type>: specifies which part of the word is which: <hyppart1> for the first part and <hyppart2> for the second.
  - <subs_content>: returns the complete, unhyphenated word.
- <wc> (“word confidence”): a word recognition confidence rating from 0 to 10.
- <wd>: specifies whether or not the word belongs to a dictionary.
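The hyphenation information is what lets a full text index store the complete word rather than its two halves. The following sketch shows one way to do that. It assumes the information is carried as String attributes spelled CONTENT, SUBS_TYPE (with the values HypPart1 and HypPart2) and SUBS_CONTENT, as in the ALTO schema. The sample fragment is hand-written without a namespace, and the function name searchable_words is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment (no namespace), for illustration only.
# The word "recognition" is split across two lines.
SAMPLE = """<alto><Layout><Page ID="P1"><PrintSpace ID="PS1"><TextBlock ID="TB1">
<TextLine ID="TL1">
  <String CONTENT="Character"/>
  <String CONTENT="recog" SUBS_TYPE="HypPart1" SUBS_CONTENT="recognition"/>
  <HYP CONTENT="-"/>
</TextLine>
<TextLine ID="TL2">
  <String CONTENT="nition" SUBS_TYPE="HypPart2" SUBS_CONTENT="recognition"/>
  <String CONTENT="works"/>
</TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""


def searchable_words(xml_text):
    """Return the words of the page with hyphenated words rejoined."""
    words = []
    for s in ET.fromstring(xml_text).iter("String"):
        part = s.get("SUBS_TYPE")
        if part == "HypPart1":
            # First half: emit the complete word once.
            words.append(s.get("SUBS_CONTENT"))
        elif part == "HypPart2":
            # Second half: already covered by the first half.
            continue
        else:
            words.append(s.get("CONTENT"))
    return words


if __name__ == "__main__":
    print(searchable_words(SAMPLE))
```

Both halves keep their own coordinates, so a viewer can still highlight the two fragments separately when the complete word is found.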
The character recognition software assigns a reliability value to each word, indicated in the <wc> (word confidence) tag and ranging from 0 to 10. This value is used to calculate the following:
- the quality ratio for each page: the number of <wc> tags on each page divided by the total number of words
- the quality ratio for each document: the sum of the quality ratios for each page divided by the total number of pages
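The two ratios can be illustrated with a short calculation. The sketch below rests on an interpretation that the text does not confirm: the page ratio is taken to be the mean word confidence on the page, divided by the maximum value of 10 so that it falls between 0 and 1. The document ratio follows the definition above: the sum of the page ratios divided by the number of pages. The function names and the scale parameter are hypothetical.

```python
def page_quality(word_confidences, scale=10.0):
    """Mean word confidence of a page, normalized to the range 0..1.

    ASSUMPTION: the page ratio is read here as an average of the
    confidence values; this is an interpretation, not a confirmed formula.
    """
    if not word_confidences:
        raise ValueError("a page needs at least one word")
    return sum(word_confidences) / (scale * len(word_confidences))


def document_quality(pages, scale=10.0):
    """Sum of the page ratios divided by the number of pages."""
    if not pages:
        raise ValueError("a document needs at least one page")
    ratios = [page_quality(page, scale) for page in pages]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Two pages: one with two words, one with four.
    pages = [[10, 10], [10, 5, 0, 5]]
    print(page_quality(pages[0]))
    print(page_quality(pages[1]))
    print(document_quality(pages))
```

Because the document ratio is an unweighted mean of page ratios, a page with a handful of words counts as much as a densely printed one.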
For each document digitized by BnF, the quality ratio calculated automatically by the software is manually verified by the service provider across a sample of words, in accordance with the ISO 2859-1 standard. This process serves to confirm the stated quality ratio.
BnF requires a quality ratio of 99.9% for some digitized documents. For these documents, irrespective of the post-OCR quality ratio, the service provider must guarantee this ratio using all necessary correction methods, including manual correction.