National Library of France
Conversion to text format
Tables of contents and indexes (geographical indexes, author indexes, indexes of references to people, etc.), which are coded in simplified TEI, are very useful tools for viewing and browsing documents.
Where they exist in a primary document, they are recreated in the corresponding digital document so as to provide specific access points (generically referred to as a “Table of contents” within the Gallica viewing interface).
The conversion process involves the manual input and tagging of each entry. Each table of contents or index is divided into major sections, within which each entry (heading and target page number) is tagged as a hyperlink to the target page to which it refers. Before this is possible, the pages making up the table(s) of contents or index(es) must first be identified in the refNum file.
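As a rough illustration of the structure described above, the following Python sketch reads a small table of contents and lists each entry with its target page. The XML fragment is invented for this example: the element names (`div`, `head`, `list`, `item`, `ref`), the `target` attribute, and the `page-N` identifiers are assumptions in the spirit of simplified TEI, not the markup actually produced by BnF.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of contents: two major sections, each entry tagged
# as a link (heading text + target page). All names are illustrative only.
TOC_XML = """
<div type="contents">
  <head>Table of contents</head>
  <div type="section">
    <head>Part One</head>
    <list>
      <item><ref target="page-5">Introduction</ref></item>
      <item><ref target="page-12">Chapter I</ref></item>
    </list>
  </div>
  <div type="section">
    <head>Part Two</head>
    <list>
      <item><ref target="page-48">Chapter II</ref></item>
    </list>
  </div>
</div>
"""


def read_toc(xml_text):
    """Return (section heading, entry heading, target) triples in document order."""
    root = ET.fromstring(xml_text)
    entries = []
    # Each direct child <div> of the root is treated as one major section.
    for section in root.findall("div"):
        section_head = section.findtext("head", default="")
        # Every <ref> inside the section is one entry linking to a page.
        for ref in section.iter("ref"):
            heading = (ref.text or "").strip()
            entries.append((section_head, heading, ref.get("target")))
    return entries


toc_entries = read_toc(TOC_XML)
for section_head, heading, target in toc_entries:
    print(f"{section_head} / {heading} -> {target}")
```

In a real workflow the `target` values would have to match page identifiers already declared elsewhere, which is why the pages must be identified before the entries can be linked.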
BnF’s document digitization process can also manage “multi-volume” tables of contents (where one entry is linked to one or more pages in another digital document).
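One possible way to picture a multi-volume entry is as a heading with several targets, each naming a document and a page. The sketch below is only a data-structure illustration: the class names, the `document_id` field, and the sample values are hypothetical and do not reflect BnF's internal model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Target:
    document_id: str  # hypothetical identifier of a digital document
    page: int         # page number within that document


@dataclass
class TocEntry:
    heading: str
    targets: list = field(default_factory=list)


def pages_by_document(entry):
    """Group the target pages of one entry by the document they belong to."""
    grouped = defaultdict(list)
    for target in entry.targets:
        grouped[target.document_id].append(target.page)
    return dict(grouped)


# One entry pointing to one page in a first volume and two pages in a second.
entry = TocEntry(
    "Correspondence",
    [Target("volume-1", 14), Target("volume-2", 3), Target("volume-2", 87)],
)
grouped = pages_by_document(entry)
print(grouped)
```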
As an experiment, the Revue de Synthèse has been fully encoded in simplified TEI. This text version of the document supports more reliable searches than OCR-generated conversions and enables accurate navigation both within each volume and between volumes.
This has been achieved by tagging structural elements of the text (major headings, chapters, images, formulas, quotations, footnotes, cross-references, etc.).
A version in image format is also available via Gallica.
BnF began to use TEI encoding in its projects during the late 1990s. These projects used the third edition of TEI (P3) in its simplified version in DTD form, and in particular the 1996 French translation of this edition.
Simplified TEI is a selection of the most essential and commonly used elements, attributes, and parameters of TEI.
Mandatory tags consist of the following:
Base tags were chosen from various categories of texts, the main categories being prose, poetry, plays, transcribed speech, dictionaries, and terminological information.
A given set of tags is usually used to encode texts belonging to a specific genre. Base tags therefore determine the composition of the encoded text, in accordance with the desired interpretation.
Divisions defined in the DTD denote parts or sections of text in line with the base structure selected. Every object, such as a “chapter”, “section”, or “act”, has a defined place within the logical structure of the document. While the names given to these objects may vary within the primary document for cultural or usage-related reasons, TEI considers them to be the same type of element – namely, a division (
Example of structural divisions within a text
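The idea that chapters, sections, and acts are all the same kind of element can be sketched in Python as follows. The fragment is invented for illustration: it assumes a generic `div` element carrying a `type` attribute and a `head` child, and the sample headings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical text body: objects with different names in the primary
# document are all encoded as the same element, distinguished only by "type".
SAMPLE = """
<body>
  <div type="chapter">
    <head>Chapter 1</head>
    <div type="section"><head>Section 1.1</head><p>...</p></div>
    <div type="section"><head>Section 1.2</head><p>...</p></div>
  </div>
  <div type="act">
    <head>Act I</head>
    <div type="scene"><head>Scene 1</head><p>...</p></div>
  </div>
</body>
"""


def outline(element, depth=0):
    """Return an indented outline of the nested divisions under element."""
    lines = []
    for child in element.findall("div"):
        label = child.get("type", "division")
        head = child.findtext("head", default="(untitled)")
        lines.append("  " * depth + f"{label}: {head}")
        # Recurse so that nested divisions appear one level deeper.
        lines.extend(outline(child, depth + 1))
    return lines


outline_lines = outline(ET.fromstring(SAMPLE))
print("\n".join(outline_lines))
```

Because every level is the same element, a single recursive function is enough to walk the whole logical structure, whatever the divisions are called in the primary document.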
A number of specific tags are used to encode special parts of a text and manage links.
sur la TEI simplifiée : une introduction au codage des textes électroniques en vue de leur échange
(P3 edition, French translation, 1996)
TEI
TEI, which is based on SGML, is used to encode texts – and in particular literary and linguistic texts – in electronic form. It aims to reflect the logical organization of a text and recreate its hierarchical tree structure (divisions, chapters, sub-chapters, sections, etc., down to the most complex components such as quotations, verses, proper nouns referred to in the text, underlinings and other forms of highlighting, etc.).
Its modular architecture means that sets of elements can be chosen that correspond to the encoding requirements of a specific type of text: poetry, plays, dictionaries, linguistic corpora, manuscripts, textual criticism, transcriptions of oral speech, etc.
TEI’s modular architecture offers a high degree of flexibility. Modules can be combined in various ways in accordance with certain principles. The level of precision can also be selected in line with the requirements of the encoding project. The vast number of available tags means that all the richness of a text can be recreated, enabling texts to be handled just as if they were files ready for publication.
TEI is now in its fifth edition (TEI P5). It can be expressed in either a DTD or a schema. It is managed by the TEI Consortium, a non-profit foundation that operates as a multilingual collaborative project. A French version of the Tag Dictionary is produced and maintained by AFNOR group CG 46/CN 357/GE8 TEI.
Thursday, March 24, 2011