This operation is performed in accordance with the legal deposit of the internet as established by the French Heritage Code (art. L131-1 to L133-1 and R131-1 to R133-1), following the Law on Copyright passed on August 1st, 2006. Legal deposit is one of the main methods available to the BnF to ensure the growth and development of its collections.
The BnF uses a crawler called Heritrix (http://crawler.archive.org) to harvest websites. The robot identifies itself with the field "User-Agent : Mozilla/5.0 (compatible; bnf.fr_bot; ...)". It always applies strict politeness rules (delays between successive requests) so as not to overload the producers' servers.
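For webmasters who want to recognise the robot in their own logs, the following is a minimal sketch in Python. It assumes only that the token "bnf.fr_bot" appears in the User-Agent header, as quoted above; the rest of the header is elided in this notice and is not matched. The function names `is_bnf_bot` and `shortest_gap_seconds` are hypothetical helpers, not part of Heritrix or of any BnF tool.

```python
from datetime import datetime
from typing import List, Optional

# Token taken from the User-Agent quoted in this notice; the elided part
# of the header ("...") is deliberately not matched.
BOT_TOKEN = "bnf.fr_bot"


def is_bnf_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the BnF robot token."""
    return BOT_TOKEN in user_agent.lower()


def shortest_gap_seconds(timestamps: List[datetime]) -> Optional[float]:
    """Return the smallest delay between consecutive requests.

    Returns None when fewer than two timestamps are given.
    """
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return min(gaps) if gaps else None
```

Applied to the request times of the robot extracted from an access log, the second helper gives a rough idea of the delay actually observed between two requests, which may be useful information to include when reporting a problem.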
Robots.txt protocol
In accordance with the Heritage Code (art. L132-2-1), the BnF is authorized to disregard the robot exclusion protocol, also called robots.txt. This protocol aims to direct the activity of crawlers used by search engines, by filtering out non-text and/or non-indexable content (binary files such as images, sounds, videos, style sheets or administration files).
To accomplish its legal deposit mission, the BnF can choose to collect some of the files covered by robots.txt when they are needed to reconstruct the original form of the website (particularly in the case of image or style sheet files). This non-compliance with robots.txt does not conflict with the protection of private correspondence guaranteed by law, because all data made available on the Internet are considered to be public, whether they are or are not filtered by robots.txt.
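The behaviour described above can be illustrated with a short Python sketch. It is not the BnF's or Heritrix's actual configuration: the function `should_collect` and the extension list `RENDERING_EXTENSIONS` are assumptions made for illustration. The sketch follows robots.txt by default and makes an exception only for files that look necessary to reconstruct the page (images and style sheets).

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical list of extensions treated as necessary to render a page.
RENDERING_EXTENSIONS = (".css", ".png", ".jpg", ".jpeg", ".gif")


def should_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Decide whether a URL is collected under the illustrative policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # URLs allowed by robots.txt are always collected.
    if parser.can_fetch(user_agent, url):
        return True

    # Disallowed URLs are collected only if they look like rendering files.
    path = urlparse(url).path.lower()
    return path.endswith(RENDERING_EXTENSIONS)
```

With a robots.txt that disallows /static/ and /admin/, this policy would still collect a style sheet under /static/ but would leave an administration page under /admin/ uncollected.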
The BnF strives to avoid generating false URLs by integrating many filters into its harvest profiles, so that the crawl focuses on relevant URLs.
If the performance of your website is affected by this operation, please report it by email to email@example.com. We will propose a solution as soon as possible.
Thursday, November 26, 2015