Access to digital resources: principles

Access to digital resources: indexes and search engines

For digital resources, two main categories of data are indexed:
  • metadata
  • the contents (“full text”), where available

A document’s table of contents and its indexes (geographical indexes, indexes of persons, etc.) are also converted to text so that they can be used to search and browse the document.

The Gallica index thus consists of metadata, full text where available, existing tables of contents, image keys, and information harvested from external partners’ OAI repositories.
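The composition of such an index can be illustrated with a minimal sketch. It is not Gallica's or Lucene's implementation: the field names (metadata, full_text, table_of_contents), the sample records, and the functions build_index and search are all hypothetical, chosen only to show how several kinds of data about one document can be indexed side by side.

```python
import re
from collections import defaultdict

# Hypothetical field names, loosely mirroring the categories listed above.
FIELDS = ("metadata", "full_text", "table_of_contents")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each (field, term) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[(field, term)].add(doc_id)
    return index


def search(index, term, fields=FIELDS):
    """Return, for each field, the sorted ids of documents matching the term."""
    term = term.lower()
    return {field: sorted(index.get((field, term), set())) for field in fields}


# Invented sample records; the second one has no full text available.
documents = {
    "doc1": {
        "metadata": "A Voyage Round the World",
        "full_text": "The ship left the port of Brest in spring",
        "table_of_contents": "Chapter 1: Departure from Brest",
    },
    "doc2": {
        "metadata": "History of Brest",
    },
}

index = build_index(documents)
print(search(index, "Brest"))
```

Keeping the field name in the index key means a match in the metadata can later be distinguished from a match in the full text, which is what makes weighting by field possible.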

The search engine used by BnF is Lucene, which is also used for search on Wikipedia.

Lucene is a free, open-source search library written in Java and used to index and search text.
In particular, it allows the different indexed elements of a document to be weighted relative to one another. For example, in a search for the word “wretched”, documents in which the word appears in the metadata (e.g. the title) are treated as more relevant, and shown higher in the results list, than documents in which it appears only in the document content.
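The effect of field weighting can be sketched as follows. This is a deliberately simplified stand-in, not Lucene's actual scoring formula: the weights in FIELD_WEIGHTS, the sample documents, and the functions score and rank are assumptions made for illustration. Each occurrence of the query term simply adds the weight of the field in which it is found.

```python
import re

# Hypothetical weights: a match in the title counts five times
# as much as a match in the full text.
FIELD_WEIGHTS = {"title": 5.0, "full_text": 1.0}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score(document, query_term, weights=FIELD_WEIGHTS):
    """Sum, over all fields, the field weight times the number of matches."""
    term = query_term.lower()
    total = 0.0
    for field, weight in weights.items():
        occurrences = tokenize(document.get(field, "")).count(term)
        total += weight * occurrences
    return total


def rank(documents, query_term):
    """Return ids of matching documents, highest score first."""
    scored = [(score(doc, query_term), doc_id) for doc_id, doc in documents.items()]
    hits = [(s, doc_id) for s, doc_id in scored if s > 0]
    # Sort by descending score, then by id so that ties are ordered predictably.
    return [doc_id for s, doc_id in sorted(hits, key=lambda pair: (-pair[0], pair[1]))]


# Invented sample documents.
documents = {
    "a": {"title": "Travel notes", "full_text": "A wretched road led to a wretched inn."},
    "b": {"title": "The Wretched", "full_text": "A novel in five parts."},
    "c": {"title": "Garden almanac", "full_text": "Sowing dates for spring."},
}

print(rank(documents, "wretched"))  # prints ['b', 'a']
```

Document "b" ranks first even though the word occurs twice in the text of "a", because a single title match (5.0) outweighs two full-text matches (2.0). Document "c" does not contain the word and is left out of the results.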

Free software for Gallica

BnF favors the use of free software for reasons of sustainability, production cost, and software maintenance.
Gallica as a whole is built with free software:
  • the Lucene search engine
  • the Apache web server
  • the Tomcat application server
  • the Eclipse development environment

Wednesday, November 6, 2013