National Library of France

Conversion to text format

To meet the needs of Internet users, BnF converts printed materials that have already been digitized as images into text format. Enhanced access points are also created for tables of contents and indexes.
The entire process is based on XML formats and specific standards: ALTO and TEI, detailed in a summary note on languages used to encode and structure text.
A number of specific rules are available within BnF’s Technical Digitization Charter (Charte technique de numérisation).

Tables of contents, indexes, and enhanced text using the TEI standards

Tables of contents and indexes

Tables of contents and indexes (geographical indexes, author indexes, indexes of references to people, etc.), which are coded in simplified TEI, are very useful tools for viewing and browsing documents.

Where they exist in a primary document, they are recreated in the corresponding digital document so as to provide specific access points (generically referred to as a “Table of contents” within the Gallica viewing interface).

The conversion process involves the manual input and tagging of each entry. Each table of contents or index is divided into major sections, within which each entry (heading and target page number) is tagged as a hyperlink to the target page to which it refers. Before this is possible, the pages making up the table(s) of contents or index(es) must first be identified in the refNum file.
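The entry tagging described above can be sketched in Python. This is only an illustration: the layout (a div holding one list per major section, with each item wrapped in a ref that points to a page) uses standard TEI element names, but the exact structure and the page identifiers such as "p0007" are assumptions, not BnF's actual encoding rules.

```python
import xml.etree.ElementTree as ET


def build_toc(sections):
    """Return a <div type="contents"> element holding one <list> per section.

    `sections` is a list of (section_title, entries) pairs, where `entries`
    is a list of (heading, target_page_id) pairs.
    """
    toc = ET.Element("div", {"type": "contents"})
    for section_title, entries in sections:
        section = ET.SubElement(toc, "list")
        ET.SubElement(section, "head").text = section_title
        for heading, page_id in entries:
            item = ET.SubElement(section, "item")
            # The entry text is wrapped in a link pointing at the target page.
            link = ET.SubElement(item, "ref", {"target": page_id})
            link.text = heading
    return toc


# Hypothetical headings and page identifiers, invented for this sketch.
toc = build_toc([
    ("Part One", [("Chapter 1", "p0007"), ("Chapter 2", "p0031")]),
    ("Part Two", [("Chapter 3", "p0058")]),
])
print(ET.tostring(toc, encoding="unicode"))
```

Each ref carries the heading as its text and the target page as an attribute, which is what allows a viewer to turn an entry into a hyperlink.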

BnF’s document digitization process can also manage “multi-volume” tables of contents (where one entry is linked to one or more pages in another digital document).

Enhanced text

As an experiment, the Revue de Synthèse has been fully encoded in simplified TEI. This text version of the document supports more reliable searches than OCR-generated conversions, as well as enabling accurate navigation both within each volume and between volumes.

This has been achieved by tagging structural elements of the text (major headings, chapters, images, formulas, quotations, footnotes, cross-references, etc.).

A version in image format is also available via Gallica.

Use of simplified TEI at BnF

BnF began to use TEI encoding in its projects during the late 1990s. These projects used the simplified version of the third edition of TEI (P3), expressed as a DTD, and relied in particular on the 1996 French translation of that edition.

Simplified TEI is a selection of the most essential and commonly used elements, attributes, and parameters of TEI.

Mandatory tags consist of the following:

  • The full range of mandatory elements and attributes for all genres of document, which are used to transcribe text. A simple TEI document includes the following elements:
    • < front >: includes all introductory elements (headers, title pages, prefaces, dedications, etc.) found before the start of the text itself
    • < group >: groups together more than one individual text or group of texts
    • < body >: groups together the complete body of an individual text, excluding any introductory elements and appendices
    • < back >: groups together all appendices following the main text
  • A header, which records all information related to document creation and management (authors and their roles, language used, context, classification, and text descriptors), links with other sources, and version history
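As a rough illustration of the skeleton listed above, the following Python sketch parses a minimal document and reports which top-level parts are present. The sample is invented for this example; it follows the general TEI P3-era layout (a TEI.2 root containing teiHeader and text) and is not taken from a BnF file or validated against BnF's DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample: a header followed by a text with front, body, and back.
SAMPLE = """\
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
      <publicationStmt><p>Unpublished sample</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front><div type="preface"><p>Preface.</p></div></front>
    <body><p>Main text.</p></body>
    <back><div type="appendix"><p>Appendix.</p></div></back>
  </text>
</TEI.2>
"""


def outline(tei_source):
    """Report whether a header exists and which parts the text contains."""
    root = ET.fromstring(tei_source)
    text = root.find("text")
    return {
        "has_header": root.find("teiHeader") is not None,
        "text_parts": [child.tag for child in text],
    }


print(outline(SAMPLE))
```

A document that groups several texts would place a group element where body appears here, with each member text carrying its own front, body, and back.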

Base tags were chosen from various categories of texts, the main categories being prose, poetry, plays, transcribed speech, dictionaries, and terminological information.

A given set of tags is usually used to encode texts belonging to a specific genre. Base tags therefore determine the composition of the encoded text, in accordance with the desired interpretation.

Divisions defined in the DTD denote parts or sections of text in line with the base structure selected. Every object, such as a “chapter”, “section”, or “act”, has a defined place within the logical structure of the document. While the names given to these objects may vary within the primary document for cultural or usage-related reasons, TEI considers them to be the same type of element, namely a division (< div >), which may or may not be numbered (numbered divisions are no longer supported in the latest version of TEI Lite). These divisions can then be described by a “type” attribute: for example, a chapter could be identified by the tag < div2 type='chapter' >.
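To make the role of the “type” attribute concrete, here is a small Python sketch that walks numbered divisions and lists them as an outline. The sample markup, headings, and the helper name list_divisions are all invented for this example; only the div1/div2 element names and the type attribute come from the text above.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: two parts (div1), each containing chapters (div2).
BODY = """\
<body>
  <div1 type="part">
    <head>Part One</head>
    <div2 type="chapter"><head>Chapter 1</head><p>Text.</p></div2>
    <div2 type="chapter"><head>Chapter 2</head><p>Text.</p></div2>
  </div1>
  <div1 type="part">
    <head>Part Two</head>
    <div2 type="chapter"><head>Chapter 3</head><p>Text.</p></div2>
  </div1>
</body>
"""

# Matches both unnumbered (div) and numbered (div1, div2, ...) divisions.
DIV_TAG = re.compile(r"div\d*")


def list_divisions(element, depth=0):
    """Return (depth, type, heading) for every division, in document order."""
    rows = []
    for child in element:
        if DIV_TAG.fullmatch(child.tag):
            rows.append((depth, child.get("type"), child.findtext("head")))
            rows.extend(list_divisions(child, depth + 1))
    return rows


divisions = list_divisions(ET.fromstring(BODY))
for depth, kind, title in divisions:
    print("  " * depth + f"{kind}: {title}")
```

Because every level is a division, the same traversal works whether the source calls its units parts, chapters, or acts; only the value of the type attribute changes.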

 

[Figure: Example of structural divisions within a text]


A number of specific tags are used to encode special parts of a text and manage links.

TEI

TEI, which is based on SGML, is used to encode texts – and in particular literary and linguistic texts – in electronic form. It aims to reflect the logical organization of a text and recreate its hierarchical tree structure (divisions, chapters, sub-chapters, sections, etc., down to the most complex components such as quotations, verses, proper nouns referred to in the text, underlinings and other forms of highlighting, etc.).

Its modular architecture means that sets of elements can be chosen that correspond to the encoding requirements of a specific type of text: poetry, plays, dictionaries, linguistic corpora, manuscripts, textual criticism, transcriptions of oral speech, etc.

TEI’s modular architecture offers a high degree of flexibility. Modules can be combined in various ways in accordance with certain principles. The level of precision can also be selected in line with the requirements of the encoding project. The vast number of available tags means that all the richness of a text can be recreated, enabling texts to be handled just as if they were files ready for publication.

TEI is now in its fifth edition (TEI P5). It can be expressed in either a DTD or a schema. It is managed by the TEI Consortium, a non-profit foundation that operates as a multilingual collaborative project. A French version of the Tag Dictionary is produced and maintained by AFNOR group CG 46/CN 357/GE8 TEI.

Thursday, March 24, 2011