Cloudflare Web Bot Auth, technical documentation

For transparency purposes, and to participate in the effort to build a better internet, Haiku has decided to strictly follow Cloudflare's Web Bot Auth recommendations for authenticating its automated traffic to third-party sites. This document specifies the objectives and implementation terms of Haiku's controlled web navigation policy.
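As an illustration, under the current Web Bot Auth draft (built on RFC 9421 HTTP Message Signatures), an authenticated request carries a Signature-Agent header pointing at the bot operator's key directory, plus the standard Signature-Input and Signature headers. The values below are placeholders, not Haiku's actual keys or timestamps:

```
GET /decisions HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; HaikuBot/1.0; +https://haiku.fr/robots-doc
Signature-Agent: "https://haiku.fr"
Signature-Input: sig1=("@authority" "signature-agent");created=1700000000;expires=1700000300;keyid="<JWK thumbprint>";tag="web-bot-auth"
Signature: sig1=:<base64-encoded Ed25519 signature>:
```

A verifying origin (or Cloudflare acting on its behalf) fetches the operator's public keys from the directory advertised by Signature-Agent and checks the signature against the covered components.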

1 – Sources of public data we collect

Haiku's services make daily use of external providers for collecting information on the web, both during various indexing processes and during direct interactions at users' request.

1.1 – Institutional public data

To ensure the continuity of its services, Haiku uses an automatic legal data collection system, in accordance with Open Data standards for court decisions and legal documents. The indexed decisions include, but are not limited to, those of European institutions such as the Court of Justice of the European Union, the European Court of Human Rights, as well as French institutions.

1.2 – Other public data

During the provision of its services, Haiku may, at the user's request or at its own discretion when the need is automatically qualified, delegate a web search to Google LLC through its Vertex AI generative product. To retrieve and process the request, it may be necessary to exploit these results by integrating the content of various web pages into our processing. This exploitation, performed through HTTP requests, is essential for performance and quality, as well as for transparency and respect for copyright: it notably allows the display of the titles of the pages used by our processing, so that the user can trace the source of the information presented to them. Beyond the functional aspect, this delegation ensures compliance with robots.txt files through the rules applying to GoogleBot, mitigating the risk of violation and making the mechanisms we use to collect web data easier to understand.
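Displaying a page title for source attribution can be sketched as follows. This is a simplified illustration using only the standard library, not Haiku's actual implementation; `extract_title` is a hypothetical helper:

```python
# Hypothetical sketch: extract a page's <title> so the source of retrieved
# content can be displayed to the user alongside our processing results.
from html.parser import HTMLParser

class _TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only capture the first <title> element encountered.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def extract_title(html: str):
    """Return the text of the first <title> element, or None."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title else None
```

For example, `extract_title("<html><head><title>CJEU C-311/18</title></head></html>")` returns `"CJEU C-311/18"`.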

1.3 – Protection measures against abuse

For good data collection practices and out of respect for third-party websites' productions, we apply a strict protection policy, covering both public institutional infrastructures and copyright:

(i) Haiku does not record any information from third-party sites when the HTTP request stems from a user request (for example, in the context of an interaction with search functions). This guarantees the non-appropriation of external knowledge and makes systematic citation of the sources used by our productions necessary.

(ii) Data from public institutions, made accessible under Open Data policies, are collected only once when they have no extended application over time. For example, case law is temporally anchored to its delivery date: for a given date, all decisions can be collected at once, and a synchronization table is then maintained to avoid re-collecting decisions from that same date. Data likely to be updated over time are collected within a sliding window not exceeding ninety (90) days. This keeps pressure on public infrastructures to a minimum, in accordance with industry standards on general terms of use and reuse of data.
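The collection rules for institutional data can be sketched as follows. This is a simplified illustration, not Haiku's actual code; `synced_dates` stands in for the synchronization table mentioned above:

```python
# Simplified sketch of the collection policy described above:
# - decisions anchored to a delivery date are fetched once per date,
#   tracked in a synchronization table;
# - mutable data are re-collected only within a sliding 90-day window.
from datetime import date, timedelta

WINDOW = timedelta(days=90)

def should_collect(delivery_date: date, synced_dates: set,
                   mutable: bool, today: date) -> bool:
    if mutable:
        # Mutable records: re-collect only while inside the sliding window.
        return today - delivery_date <= WINDOW
    # Immutable records (e.g. case law): collect each date exactly once,
    # then record it so the same date is never fetched again.
    if delivery_date in synced_dates:
        return False
    synced_dates.add(delivery_date)
    return True
```

Under this sketch, a decision date is fetched on first sight and skipped on every later pass, while mutable records older than ninety days are left alone.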

2 – Identification and access restriction

2.1 – User-Agent header values

Due to the dual nature of the processing carried out by Haiku, and to simplify the identification and understanding of the objectives pursued, we have decided to segment our User-Agent header values: (i) the institutional data indexing robot is identified as HaikuBot; it is not used to explore content in order to train Haiku's generative AI base models. Its User-Agent string is as follows:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; HaikuBot/1.0; +https://haiku.fr/robots-doc

(ii) the robot used for direct user requests is identified as Haiku-SearchBot, in accordance with implicit industry conventions. Haiku-SearchBot is not used to explore the web automatically, nor to explore content for generative AI training purposes. Its User-Agent string is as follows:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Haiku-SearchBot/1.0; +https://haiku.fr/robots-doc

The search offered by Haiku follows a whitelist policy and does not behave like a general search engine: Haiku will never seek to collect all of your web pages in order to offer its users an alternative search engine built on your content. However, if you wish to exclude certain sections of your website from our processing, you can add a group to your robots.txt file targeting the relevant User-Agent token with the appropriate Disallow rules. Conversely, it may be in your interest to appear in the search results we offer to our users, particularly to highlight your intellectual productions in the legal field. To ensure that your site appears in search results, we recommend that you explicitly authorize Haiku-SearchBot in your site's robots.txt file.
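For instance, a site could allow Haiku-SearchBot generally while excluding one section. The robots.txt below is hypothetical, and the check uses Python's standard urllib.robotparser:

```python
# Hypothetical robots.txt: Haiku-SearchBot may fetch everything
# except the /private/ section. Verified with the stdlib parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Haiku-SearchBot
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Haiku-SearchBot", "https://example.com/articles/gdpr"))   # True
print(rp.can_fetch("Haiku-SearchBot", "https://example.com/private/drafts"))  # False
```

The same mechanism works in reverse: a group with `Disallow: /` under the `Haiku-SearchBot` token excludes the whole site from our processing.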

2.2 – Supported transfer protocols

Haiku's crawlers and retrievers support the HTTP/1.1 and HTTP/2 protocols. They use the protocol version that offers the best crawling performance and may switch protocols between crawling sessions based on previous crawling statistics. For maturity reasons, the default protocol version used by Haiku's crawlers is HTTP/1.1. Crawling via HTTP/2 can save computing resources (for example, CPU and RAM) on third-party sites. However, if you wish to opt out of HTTP/2 crawling, you can configure the server hosting your site to respond with an HTTP 421 status code when Haiku attempts to access it via HTTP/2.
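On nginx, for example, such a refusal might be sketched as follows. This is an illustration only, to adapt to your own server configuration; note that as written it refuses HTTP/2 for all clients, so you may want to restrict it by User-Agent:

```
# Inside the server block that also accepts HTTP/2:
# answer HTTP/2 requests with 421 (Misdirected Request),
# prompting well-behaved crawlers to retry over HTTP/1.1.
if ($server_protocol = "HTTP/2.0") {
    return 421;
}
```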

2.3 – Additional resources

If you wish to learn more about how Haiku processes your data or if you wish to participate in improving the security and performance of our products, you can contact us through the dedicated form. Haiku reserves the right to modify this technical documentation in order to best reflect our practices and inform you of the provisions we implement for the collection and use of information on the web.