The robots.txt file is a generic method for restricting access to resources while a website is being crawled, but it does not address all the challenges that artificial intelligence raises for data protected by copyright. Several initiatives and specifications have been proposed to tackle this issue, which is becoming increasingly pressing as generative AI models develop.
ai.txt
The ai.txt file follows the format and syntax of the robots.txt file, which should make it easier to adopt and to build upon. The differences are explained in the authors' blog post; the main one they highlight is the point in time at which the information is taken into account.
For robots.txt, a bot consults the file when it wants to retrieve pages from a site, to determine what it may access and what is off-limits. This helps guard against the creation of training datasets, but it does not prevent later use via a hyperlink. Two use cases illustrate this:
- An AI assistant produces an answer containing a link to your resource. Here robots.txt is not taken into account, because no scraping takes place.
- Another site uses one of your resources (e.g., an image) and that site is scraped by an AI bot; your robots.txt file is then never consulted.
According to the authors, the ai.txt file is a way to address these two use cases because:
- The file must be read when accessing a resource, not just when scraping a site.
- AI bots are expected to check the permissions of the site from which the resource is downloaded, not just the site where the resource is cited.
Thus, there is significant work required on the AI tools’ side, including both bots and any tool providing external resources.
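To give an idea of what this work looks like, here is a minimal sketch, in Python, of a bot checking the ai.txt of the host that actually serves a resource before using it. The helper name is hypothetical, and the sketch assumes, as the authors do, that ai.txt keeps the robots.txt syntax (which is why Python's robots.txt parser can be reused).

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def resource_use_allowed(resource_url: str, user_agent: str = "MyAIBot") -> bool:
    """Check the ai.txt of the host serving the resource, not of the page citing it.

    Hypothetical helper: it reuses Python's robots.txt parser, which only works
    because ai.txt borrows the robots.txt syntax. A real implementation would also
    need caching, error handling, and support for ai.txt's media-type conventions.
    """
    origin = urlsplit(resource_url)
    parser = RobotFileParser()
    parser.set_url(f"{origin.scheme}://{origin.netloc}/ai.txt")
    parser.read()  # fetches and parses ai.txt from the resource's origin host
    return parser.can_fetch(user_agent, resource_url)

# The check happens at the moment the resource is used, e.g. before an assistant
# downloads an image that a third-party page merely links to:
if resource_use_allowed("https://example.org/photos/cat.jpg"):
    ...  # download and use the resource
```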
Another distinctive feature of this protocol is that it covers a wide variety of media types. It is probably no coincidence that the initiative was launched in collaboration with artists.
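This shows up directly in the syntax: an ai.txt file reuses the robots.txt directives but typically targets file extensions, so that rules can be expressed per media type. The example below is purely illustrative; the exact conventions (wildcards, default rules, grouping of media types) should be checked against the ai.txt specification.

```text
# Illustrative ai.txt (not an official template)
User-Agent: *
Disallow: *.jpg
Disallow: *.png
Disallow: *.mp3
Disallow: *.mp4
Allow: *.txt
Allow: /
```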
Legal Basis
For its legal grounding, the authors of ai.txt rely on European regulation, which already produced a de facto global standard with the GDPR. The directive on copyright and related rights in the Digital Single Market defines exceptions to copyright for automated data processing (text and data mining).
Initially reserved for research or public-interest purposes, these exceptions have been extended to any processing of lawfully accessible data (which remains a separate issue), unless rights holders have expressly reserved that use in a manner readable by an automated process.
The ai.txt file enables precisely this, and therefore provides a legal basis for action against uses of copyrighted material that would otherwise be covered by the exceptions. It is, however, not the only means proposed for expressing rights holders' preferences.
TDM Reservation Protocol
The World Wide Web Consortium (W3C) has been working for some time on a protocol addressing this specific need, and the working group's final report was submitted on January 28, 2025.
The TDM Reservation Protocol proposes a tdmrep.json file, in JSON format, placed not at the root of the site but in its .well-known directory. It defines sets of resources (expressed as path patterns), a flag indicating whether text and data mining is reserved for the matching resources, and optionally a link to a policy explaining the restriction.
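As an illustration, a /.well-known/tdmrep.json file could look like the sketch below. Each entry maps a set of paths to a reservation flag, with an optional link to a policy document; the property names follow the working group's report, while the paths and URLs are invented for the example.

```json
[
  {
    "location": "/images/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.org/policies/tdm-policy.json"
  },
  {
    "location": "/press-releases/*",
    "tdm-reservation": 0
  }
]
```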
Content Authenticity Initiative (CAI)
The Content Authenticity Initiative (CAI), associated with Adobe, advocates for the adoption of a protocol called “Content Credentials.” It claims 4,000 members and implementations by Adobe, Meta, LinkedIn, OpenAI, YouTube, and others. The protocol has been deployed in photography, sometimes integrated directly into camera hardware, as well as in journalism.
Its function is somewhat akin to watermarking: documents carry verifiable metadata that can be read with open-source tools. The specification also covers use by automated data processing: it allows restricting every data-mining use, or only model training, including the training of generative models.
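Concretely, this relies on a dedicated training-and-data-mining assertion embedded in the Content Credentials. The sketch below shows roughly what such an assertion looks like; the entry and field names reflect our reading of the C2PA specification and should be checked against it before use.

```json
{
  "label": "c2pa.training-mining",
  "data": {
    "entries": {
      "c2pa.data_mining":            { "use": "notAllowed" },
      "c2pa.ai_training":            { "use": "notAllowed" },
      "c2pa.ai_generative_training": { "use": "notAllowed" },
      "c2pa.ai_inference":           { "use": "allowed" }
    }
  }
}
```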
However, the protocol seems better suited to binary data than to text.
llm.txt
The llm.txt file is more comparable to a site's sitemap than to robots.txt. Rather than expressing an authorization to use the content, it aims to encourage that use. It is a structured Markdown file, not the most obvious choice for a machine-readable format, but one with the advantage of also being readable by humans.
Reading the description, this standard appears to be more “project-oriented,” typically for an open-access tool. The main file serves to:
- Provide information about the site/project.
- List internal pages relevant to an LLM.
- Additionally, unlike a sitemap, list external resources. For a project, this could include documentation for a protocol used by the project or a dependency.
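Concretely, a minimal llm.txt might look like the following sketch (the project name and URLs are invented for the example):

```markdown
# ExampleProject

> ExampleProject is an open-source library for reading crawl-permission files such as ai.txt and tdmrep.json.

## Documentation

- [Getting started](https://example.org/docs/getting-started.md): installation and first steps
- [API reference](https://example.org/docs/api.md): the public functions and their options

## External resources

- [TDM Reservation Protocol report](https://example.org/specs/tdmrep): specification of a protocol the project implements
```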
Since standard HTML pages, often cluttered with presentation markup, are not always easy to read automatically, the authors also propose providing Markdown versions of the most relevant pages (same URL with a .md suffix) to make them easier to parse.
It makes sense to strip the presentation markup on the publisher's side, where the site's template is known, rather than leaving that work to whoever retrieves the page. However, since this is not automatic, it is uncertain whether the practice will spread beyond project pages that can be generated automatically.
Establishing Standards
We are at a point where the willingness to tackle copyright issues is already present, but where technical solutions have yet to reach a consensus. Several sectors have put forward their proposals, each with its own specificities. The future will tell which one—or which ones—will emerge as the next web standards.