Even though most data on the Internet is publicly available, retrieving and exploiting it can be legally challenging, depending on the nature of the data you need, the nature of your processing, and your purpose. To make an analogy: the fact that [insert here a famous movie star name] is a public figure does not mean you can exploit his or her picture for your own benefit. The same goes for the Mona Lisa, since a lot of information is copyrighted or protected by companies' terms of use. Perceptions of privacy differ between countries, and Europe has always been ahead in terms of regulation, which is not without consequences for innovation. But since the GDPR (General Data Protection Regulation) is a well-established legal framework, it is an ideal playground for studying how web crawling can be respectful of privacy.
What is the GDPR and what does it protect?
The GDPR was introduced in 2016 by the EU. It is designed to protect personal data, defined as "any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data." Note that "personal data that has been de-identified, encrypted or pseudonymised but can be used to re-identify a person remains personal data and falls within the scope of the GDPR", meaning that it is not enough to compute and keep only an embedding of the data (for example) if it still allows someone to be identified.
Personal data aside, the GDPR also compels you to define a legal basis for data scraping and to ensure that the whole process of building a dataset does not infringe privacy laws, i.e., you must ensure that the dataset is compliant with the applicable regulations.
What are the penalties for non-compliance with these obligations? The less severe infringements can result in a fine of up to €10 million, or 2% of the firm's worldwide annual revenue from the preceding financial year, whichever amount is higher; for the most severe ones, it can go up to €20 million, or 4% of the firm's worldwide annual revenue. Better make sure that your crawling is compliant.
Legal basis for data collection
During data scraping, you don't know in advance what kind of data you will retrieve, so you have to assume that it will contain some personal data among the spam the Internet is essentially made of. The GDPR lists several cases in which retrieving personal data is allowed, but in the case of web crawling, you are basically limited to two of them: either you have the consent of every person concerned, or you have a legitimate interest and ensure that your dataset is compliant. Obviously, the latter is the most probable case, but it is not as simple as it may appear.
To ensure a legal basis for scraping, extracting and using data, one needs to meet the following criteria:
- companies must only collect data for their own purposes and must not make it public
- the collected data must, under no circumstances, cause any financial or reputational loss to its owners
In a nutshell, web scraping is legal and ethical when you extract data only for personal use and analysis. Things are completely different when you want to republish the collected data, in which case you need to ask for the data subjects' permission and check website policies before scraping; otherwise you may infringe personal data protection laws. Web crawlers are not free to use the obtained data for unlimited commercial purposes, and copyright on the data remains enforceable no matter how the data was obtained. It is even worse if you scrape sensitive personal data such as health information.
In the case of an SEO crawler, the content of web pages is not kept, or only small extracts of it are, such as titles or link anchors. An embedding may be kept for an internal categorization process, for example. So there is no risk that personal data will be made public, which simplifies the legal basis.
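To make the idea concrete, here is a minimal sketch of such a data-minimizing extraction step, written with Python's standard html.parser; the class name and the sample HTML are mine, not the code of any actual SEO crawler:

```python
from html.parser import HTMLParser

class MinimalExtractor(HTMLParser):
    """Keep only the page title and the link anchors; the body text is discarded."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.anchors = []          # (href, anchor text) pairs
        self._in_title = False
        self._current_href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._current_href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current_href:
            self.anchors.append((self._current_href, "".join(self._anchor_text).strip()))
            self._current_href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current_href is not None:
            self._anchor_text.append(data)

# Usage: feed the raw HTML, keep only the title and anchors, drop everything else.
parser = MinimalExtractor()
parser.feed("<html><head><title>Example</title></head>"
            "<body>Some personal data here <a href='/about'>About us</a></body></html>")
print(parser.title, parser.anchors)
```

Everything that is not explicitly extracted is simply never stored, which is the whole point of the data-minimization argument above.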
Legitimate interest and consent
A legitimate interest must be understood as an organisation using personal data in a way that the data subject would expect. The interests are usually commercial interests, research purposes or even wider societal benefits. A legitimate interest should provide a clear benefit while minimizing the risk of infringing on data subjects' privacy.
Since asking every person concerned for their consent is out of the question, web crawlers may try to get as close as possible to the notion of consent by respecting four cumulative criteria: it must be freely given, specific, informed and unambiguous. Let's look at each of these criteria.
Consent: Freely given data
A large share of websites, at least commercial ones (which are more eager to protect their data), have Terms and Conditions of Use (TCU) which may express what you are allowed to do. As an example, the French national publishers' association provides standard terms to include in your TCU in order to opt out of text and data mining. But let's face it, reading them is impractical for most web crawling, for obvious reasons of efficiency. Nonetheless, we can make a fair assumption that if some restriction applies to a given website, then the robots.txt file of that website reflects this restriction in a bot-understandable manner.
The robots.txt is a file located at the root level of a website that contains instructions about which resources of the website are publicly available. It may completely or partially deny access to one or more user agents, or just impose a slower pace for crawling the website. The user agent is an identifier for a bot.
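Python's standard library ships a parser for this file, so honoring it costs almost nothing. A minimal sketch, assuming a hypothetical bot name and site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; replace with your own user agent.
USER_AGENT = "ExampleBot"

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetch and parse the robots.txt file

url = "https://example.com/some/page.html"
if robots.can_fetch(USER_AGENT, url):
    # Honour an explicit Crawl-delay directive if the site declares one.
    delay = robots.crawl_delay(USER_AGENT)  # None if not specified
    print(f"Allowed to fetch {url}, crawl delay: {delay}")
else:
    print(f"robots.txt forbids fetching {url} for {USER_AGENT}")
```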
So a bot must only crawl publicly advertised content, respect the robots.txt file, and also honour any directive it finds in the web page metadata itself (such as nofollow). But there is one more thing: there must be a way to receive abuse requests and take them into account (and delete the content if asked). Usually, there is an abuse email address, or some other way to contact the support. If none of that exists, the TCU of a company operating in Europe must at least provide a way to contact its data protection officer.
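Honoring in-page directives is just as cheap. Here is a minimal sketch that collects the directives of a robots meta tag, again with the standard html.parser (the class name is mine):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')

if "noindex" in parser.directives:
    print("Page asks not to be indexed: discard its content.")
if "nofollow" in parser.directives:
    print("Page asks its links not to be followed: do not enqueue them.")
```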
Consent: Informed and unambiguous process
A respectful bot cannot act in disguise, even if being open exposes it to being banned from some websites. There are several ways for a bot to walk in the light, and the first one is to advertise a dedicated, unambiguous user agent (no spoofing allowed).
Like GoogleBot, the user-agent string itself may contain the URL of a web page presenting the bot and explicitly stating its purpose. If there is no link in the user-agent string, this information must be easy to find.
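With the popular requests library, advertising such an identity is a one-liner; the bot name and documentation URL below are of course placeholders:

```python
import requests

# Hypothetical identity: a clear bot name plus a URL documenting its purpose.
HEADERS = {
    "User-Agent": "ExampleBot/1.0 (+https://example.com/bot.html)"
}

response = requests.get("https://example.com/some/page.html",
                        headers=HEADERS, timeout=10)
print(response.status_code)
```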
Consent: Specific purpose
The purpose of the crawl must be specific, i.e. you can't crawl the web just to create a dataset for some yet-to-be-defined usage. This also means that you can't use your data for another purpose than the one you clearly state. For example, an SEO company may have the following purpose: to explore and store the web graph and to derive metrics about the graph.
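To make this concrete, such a purpose translates into storing only link edges and deriving metrics from them, never page content. A toy sketch with a made-up three-page graph:

```python
from collections import defaultdict

# Edges of the web graph: (source page, target page). Only the structure
# is stored, in line with the stated purpose; no page content is kept.
edges = [
    ("a.example.com/", "b.example.com/"),
    ("a.example.com/", "c.example.com/"),
    ("b.example.com/", "c.example.com/"),
]

# A simple graph metric: the in-degree (number of incoming links) per page.
in_degree = defaultdict(int)
for source, target in edges:
    in_degree[target] += 1

print(dict(in_degree))  # {'b.example.com/': 1, 'c.example.com/': 2}
```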
Extra steps to reduce risks
There are some measures which can reduce the legal risks, depending on your process:
- avoid sites with sensitive information, such as medical sites and health forums, but also pornographic sites
- maintain a blocklist of the hosts and/or IP addresses of people who contact you to stop the scraping of their site
- avoid social media of any kind (the most famous ones are not scrapable anyway)
- anonymise or pseudonymise collected data (a short sketch follows this list)
- register your crawler
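As an illustration of the blocklist and pseudonymisation measures above, here is a minimal sketch using a keyed hash (HMAC); the key handling and names are simplified, and remember that, per the GDPR, pseudonymised data that can still be re-identified remains personal data:

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it should be stored and rotated securely.
PSEUDONYMISATION_KEY = b"replace-with-a-real-secret"

# Hosts whose owners asked us to stop crawling them.
BLOCKLIST = {"do-not-crawl.example.org", "192.0.2.17"}

def pseudonymise(value: str) -> str:
    """Replace an identifier (email, name, ...) with a keyed hash.

    A keyed hash (HMAC) rather than a plain hash makes dictionary-style
    re-identification harder, but the result is still personal data under
    the GDPR as long as re-identification remains possible.
    """
    return hmac.new(PSEUDONYMISATION_KEY, value.encode(), hashlib.sha256).hexdigest()

def is_blocked(host: str) -> bool:
    """True if the site owner opted out of our crawl."""
    return host in BLOCKLIST

print(pseudonymise("jane.doe@example.com"))
print(is_blocked("do-not-crawl.example.org"))
```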
Expressing consent in a technical way
Expressing a refusal of text and data mining (TDM) in a website's terms of use is a first step, but to make it technically enforceable, the consent (or refusal) must be expressed in a machine-readable way, and that is what initiatives like TDMRep are trying to do:
“In a digital environment, TDM usage of copyright protected works can be subject to different terms and conditions, depending on the legal framework. In generic terms, an act of reproduction is required before TDM can be applied on content accessible on the Web; international laws stipulate that such act of reproduction is subject to authorization by rightsholders. So far, analyzing and processing the terms and conditions of a website, contacting rightsholders, seeking for permission and concluding licensing agreements require time and resources.
In such context, a machine-readable solution which streamlines the communication of TDM rights and licenses available for online copyrighted content is necessary to facilitate the development of TDM applications and reduce the risks of legal uncertainty for TDM actors. Such a solution, that shall rely on a consensus by rightsholders and TDM actors, will optimize the capacity of TDM actors to lawfully access and process useful content at large scale.”
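Concretely, the TDMRep proposal defines, among other mechanisms, tdm-reservation and tdm-policy HTTP response headers. Here is a sketch of how a crawler could check them, based on my reading of the proposal (the URLs are placeholders):

```python
import requests

response = requests.head("https://example.com/article.html", timeout=10)

# TDMRep proposes a tdm-reservation header: "1" means TDM rights are reserved.
reservation = response.headers.get("tdm-reservation")
policy = response.headers.get("tdm-policy")  # optional URL of a licensing policy

if reservation == "1":
    print(f"TDM rights reserved; licensing policy (if any): {policy}")
else:
    print("No TDM reservation expressed via HTTP headers.")
```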
There is also a file similar to robots.txt but dedicated to AI bots, named ai.txt. In order to keep the Internet open without infringing privacy laws or copyright, we need ways to talk directly to crawlers; otherwise scraping will remain a risky adventure.
Q&A about GDPR and web crawling
- What is the impact of GDPR on web scraping activities? GDPR was a pioneer in personal data protection and privacy, before AI bots made these questions more pressing. It laid the first legal groundwork for web scraping. The GDPR defines what personal data is and what is needed in order to scrape and exploit it, depending on your purpose.
- What are the key principles of data protection under GDPR in relation to web scraping? The key principles are: know what counts as personal data, do not make personal data public, use it only in ways the data subject would expect, and do so in an explicit and unambiguous manner.
- Will web scraping activities require explicit consent under GDPR? There are some cases which do not require explicit consent, such as scientific research or general public interest. See the GDPR for a complete list.
- What are the penalties for non-compliance with GDPR in web scraping? The minimum penalty is a fine of up to €10 million, or 2% of the firm's worldwide annual revenue from the preceding financial year, whichever amount is higher; these amounts can be doubled (€20 million or 4%) for the most serious infringements.
- What is the definition of 'legitimate interest' in the context of GDPR and web scraping? It applies when, as a company or organisation, you process personal data in order to carry out tasks related to your business activities, without being under a legal obligation to do so. You must ensure that the processing minimizes the impact on the rights and freedoms of individuals.
- How can companies demonstrate GDPR compliance in their web scraping practices? There is no "absolute" way to demonstrate compliance, but a company should at least make its scraping process explicit, respect website restrictions (mainly robots.txt files) and comply with any demands it receives.
- How has the landscape of web scraping changed since the GDPR became applicable in May 2018? One of the main changes brought about by the GDPR is the formalisation of legitimate interest as a legal basis for data processing; prior to the GDPR, companies could simply rely on implied consent to collect and process data. The GDPR also introduced new rights for individuals, such as the right to access and rectify their personal data, the right to be forgotten, and the right to data portability. These rights give individuals more control over their personal data and a say in how it is used. Overall, the GDPR has had a significant impact on the landscape of web scraping and data collection: companies now need to ensure a valid legal basis, such as explicit consent, before collecting and processing personal data, which has led to a greater emphasis on privacy and data protection in the online world.
Resources for web scraping with GDPR
- GDPR info: the text of the GDPR itself
- CNIL GDPR toolkit: the French privacy regulator provides this toolkit, which is not specific to web scraping but contains useful information about the GDPR in general
- The legal basis of legitimate interests: this datasheet gives good hints about the legal basis of web scraping
- The state of web scraping in the EU, by the International Association of Privacy Professionals