National Library of France

Conversion to text format

To meet the needs of Internet users, BnF converts printed materials that have already been digitized as images into text format. Specific enhanced access points are also created for tables of contents and indexes.
The entire process is based on XML formats and specific standards: ALTO and TEI, detailed in a summary note on languages used to encode and structure text.
A number of specific rules are available within BnF’s Technical Digitization Charter (Charte technique de numérisation).

OCR conversion and the ALTO format

OCR = optical character recognition

Enabling full text searches

OCR is used to locate and recognize sequences of characters in an image, thus enabling them to be converted into words which can then be used to carry out full text searches. Conversion is carried out automatically by software, doing away with the need for much more costly manual retranscription. Words and sequences of characters stored in a text file can be reused within a new page layout, held in a database, etc.

Characters are recognized by comparing them with shapes memorized by the software and with known terms contained in the dictionary used by the tool. Each space and each sequence of characters (known as a “string”) is identified in this way and reassembled in the main reading direction.

The quality of OCR depends on the primary document and the quality of digitization.

Although OCR techniques are constantly evolving in response to very high demand, the quality of recognition depends on a large number of factors linked to both the primary document and the digitization process itself. For example:
  • Digital images must have sufficient contrast and be sufficiently straight.
  • Printing defects (characters that are too heavily inked, smudges, and especially show-through from the reverse side of a page) reduce the quality of word recognition and segmentation.
  • Materials with a columnar layout and/or illustrations, which must be read in a non-linear direction, are more complex to process than materials with a consistent layout. 
  • Generally speaking, fonts which are either very small or very large and/or which have spaced characters are difficult to process.
  • Materials using non-Latin alphabets are also complex to process, though progress is at a more advanced stage than for ancient handwritten texts.

The ALTO format (Analyzed Layout and Text Object)

In order to use the output from OCR, BnF uses an XML-based format governed by a schema, the ALTO format.

ALTO is one of the most commonly used formats for storing the output of image-to-text conversion. It retains the full coordinates of content (text, illustrations, graphics, etc.) within the image, which allows the text to be superimposed on the image (in a multilayer PDF file) and search terms to be highlighted.

Elements and sub-elements of the ALTO format

ALTO enables pages to be segmented into various elements, each consisting of sub-elements:

The Page element can contain five elements:

  • TopMargin: the top part of the page, from the left-hand edge to the right-hand edge excluding the text area. Where possible, this is the area containing the title, credits, etc.
  • BottomMargin: the bottom part of the page, from the left-hand edge to the right-hand edge excluding the text area. 
  • LeftMargin: the left-hand part of the page excluding the top and bottom parts and the text area. 
  • RightMargin: the right-hand part of the page excluding the top and bottom parts and the text area. 
  • PrintSpace: the text area. This is a mandatory element. It must contain at least one BlockGroup element.

Example zoning for a press page

Where one of these elements contains information (text, an illustration, etc.), that information is described in one or more BlockGroup elements.

There are four different types of BlockGroup element:
  • TextBlock: a block of text. This element is used to group lines of text into a coherent unit.
  • Illustration: an image or drawing.
  • GraphicalElement: a graphical element other than an image or drawing. This element can be used to describe an intertextual separation element or a textual element not recognized as such by the OCR process.
  • ComposedBlock: used to enable BlockGroup elements to overlap.

Within a TextBlock, the String element groups together sequences of characters.
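
The nesting described above can be illustrated with a short Python sketch. The sample fragment is hypothetical and simplified: the XML namespace used by real ALTO files is omitted, the identifiers and words are invented, and the TextLine level between TextBlock and String comes from the ALTO schema rather than from this page. The helper name outline is also an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ALTO fragment (no namespace, invented values).
SAMPLE = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <TopMargin ID="TM1"/>
      <PrintSpace ID="PS1">
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String ID="S1" CONTENT="Conversion"/>
            <SP/>
            <String ID="S2" CONTENT="texte"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def outline(xml_text):
    """Return (zone, block type, text) for every block found in each page zone."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("Page"):
        for zone in page:        # TopMargin, PrintSpace, ...
            for block in zone:   # TextBlock, Illustration, ...
                words = [s.get("CONTENT") for s in block.iter("String")]
                rows.append((zone.tag, block.tag, " ".join(words)))
    return rows


for row in outline(SAMPLE):
    print(row)
```

With this sample, the empty TopMargin contributes nothing, and the PrintSpace yields one TextBlock containing two strings and one Illustration with no text.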

Generic attributes

Each of the elements and blocks defined above is described by generic attributes. These attributes give the dimensions and type of each block, line and string as follows:
  • ID: block number
  • Height: height of block in pixels 
  • Width: width of block in pixels 
  • Quality: quality of recognition 
  • Fontstyle: style of font 
  • Type: type of font 

Pixel coordinates are defined with reference to the upper leftmost reference point on the page. This means that each block, line and string is recognized in the order in which it is laid out in the original.

The ALTO format can also be used to indicate geometric shapes (circles, polygons, ellipses, etc.), illustrations and graphics, manage hyphenation, etc. Non-textual objects also have their own zoning and coordinates.
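
As an illustration of how stored coordinates allow search terms to be highlighted, the following Python sketch returns the pixel rectangle of each string matching a term. It is a minimal sketch, not BnF's implementation: the fragment and its values are invented, the namespace is omitted, and the HPOS and VPOS attributes (horizontal and vertical offsets from the upper left corner) are taken from the ALTO schema, since they are not listed on this page. The function name highlight_boxes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical line of text with invented pixel coordinates.
SAMPLE = """<TextLine>
  <String ID="S1" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="30" CONTENT="digital"/>
  <String ID="S2" HPOS="190" VPOS="200" WIDTH="120" HEIGHT="30" CONTENT="library"/>
</TextLine>"""


def highlight_boxes(xml_text, term):
    """Return (left, top, right, bottom) boxes for strings equal to term."""
    boxes = []
    for s in ET.fromstring(xml_text).iter("String"):
        if s.get("CONTENT", "").lower() == term.lower():
            left = float(s.get("HPOS"))
            top = float(s.get("VPOS"))
            right = left + float(s.get("WIDTH"))
            bottom = top + float(s.get("HEIGHT"))
            boxes.append((left, top, right, bottom))
    return boxes


print(highlight_boxes(SAMPLE, "Library"))
```

A viewer could draw these rectangles over the page image. Matching here is a simple case-insensitive comparison of whole strings, chosen only to keep the example short.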

Identifying words and strings

Each sequence of characters making up a word or part of a hyphenated word (a “string”) is identified using the following information:
  • Generic attributes
  • <content>: a word that has been recognized by the OCR tool and/or re-input to the required quality level (high quality, i.e. a 99.985% recognition rate, implies human correction).
  • Hyphenation: the recognized part is contained within <content>, together with:
      • <subs_type>: specifies which part of the word is which, with <hyppart1> for the first part and <hyppart2> for the second.
      • <subs_content>: returns the complete, unhyphenated word.
  • <wc> (“word confidence”): a word recognition confidence rating from 0 to 10.
  • <wd>: specifies whether or not the word belongs to a dictionary.
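
The hyphenation information above can be used to rebuild complete words for indexing. The Python sketch below is an illustration under stated assumptions: the fragment is invented, the namespace is omitted, and the attribute names and values are written as in the ALTO schema (CONTENT, SUBS_TYPE, SUBS_CONTENT, HypPart1, HypPart2). The function name full_words is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical block in which "conversion" is split across two lines.
SAMPLE = """<TextBlock>
  <TextLine>
    <String CONTENT="text"/>
    <String CONTENT="conver" SUBS_TYPE="HypPart1" SUBS_CONTENT="conversion"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="sion" SUBS_TYPE="HypPart2" SUBS_CONTENT="conversion"/>
    <String CONTENT="process"/>
  </TextLine>
</TextBlock>"""


def full_words(xml_text):
    """Return the words in reading order, with hyphenated words rejoined."""
    words = []
    for s in ET.fromstring(xml_text).iter("String"):
        part = s.get("SUBS_TYPE")
        if part == "HypPart1":
            # First half: emit the complete word once.
            words.append(s.get("SUBS_CONTENT"))
        elif part == "HypPart2":
            # Second half: already emitted with the first half.
            continue
        else:
            words.append(s.get("CONTENT"))
    return words


print(full_words(SAMPLE))
```

For this sample the result is the three words "text", "conversion" and "process", so a full text search for "conversion" would succeed even though the word is split on the page.
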
The character recognition software assigns a reliability value to each word, indicated in the <wc> (word confidence) tag and ranging from 0 to 10. This value is used to calculate the following:
  • the quality ratio for each page: the sum of the <wc> values on the page divided by the total number of words on that page
  • the quality ratio for each document: the sum of the quality ratios for each page divided by the total number of pages
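
The two ratios can be sketched in Python as follows. This is one possible reading, not BnF's documented calculation: it assumes that the page ratio is the mean of the word confidence values on the page, and it takes the document ratio as the unweighted mean of the page ratios, as stated above. The function names and the sample values are invented, and the result is expressed on the same scale as the confidence values supplied.

```python
def page_quality(confidences):
    """Mean word confidence for one page (0.0 for a page with no words)."""
    if not confidences:
        return 0.0
    return sum(confidences) / len(confidences)


def document_quality(pages):
    """Unweighted mean of the page ratios (0.0 for an empty document)."""
    ratios = [page_quality(page) for page in pages]
    if not ratios:
        return 0.0
    return sum(ratios) / len(ratios)


# Hypothetical document: two words on page 1, one word on page 2.
pages = [[10, 8], [6]]
print(page_quality(pages[0]))   # 9.0
print(document_quality(pages))  # 7.5
```

Because the document ratio averages page ratios rather than individual words, a nearly empty page weighs as much as a dense one; a word-weighted mean would give a different figure.
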
For each document digitized by BnF, the quality ratio calculated automatically by the software is manually verified by the service provider across a sample of words, in accordance with the ISO 2859-1 standard. This process serves to confirm the stated quality ratio.

BnF requires a quality ratio of 99.9% for some digitized documents. For these documents, irrespective of the post-OCR quality ratio, the service provider must guarantee this ratio using all necessary correction methods, including manual correction.

Tuesday, March 22, 2011