Websites vs web crawlers

There was a time when the only bots (aka web crawlers) that crawled your website were search engine crawlers. Their purpose was quite obvious and their usefulness to websites clear: they allowed websites to become visible on the web by being referenced by search engines. Nowadays, there is a myriad of bots for different purposes, and their respective benefit is not always straightforward.

The Thales Imperva annual report on bad bots estimates that in 2023, almost half of Internet traffic came from bots (automated processes) and around 32 percent from bad bots, i.e., bots with malicious intent (illegal access, data scraping, commercial use of your protected content, etc.). Both numbers are increasing every year. The distinction they make between good bots and bad bots is unclear: how can one decide that a bot has malicious intent when its intent is unknown? They also raise concerns about “good” bots which may have side effects, such as increasing ad views without generating any sale.

Even if most bot intents remain unclear, their behavior has an impact on websites. It therefore seems more and more important to know how to deal with web bots, so that your website can either benefit from them or prevent them from degrading its performance, depending on your needs. Next, we introduce a few tools a webmaster has at their disposal against web bots, and we discuss the impact of crawling on a website and how to protect yourself from bad behavior.

Crawl mitigation tools

Below is a toolbox useful to get protection from crawlers and data scraping. There are two types of tools:

  • External tools which block access to your website, transparent to the administrator and possibly leaving logs
  • Informational resources declaring which parts of a website should not be crawled or which parts should be crawled (robots.txt and sitemaps)
  • and you can always try to communicate with the crawler owner

We will first present each tool, and in the next part we will discuss how and when to use each of them.

Robots Exclusion Protocol

The most efficient tool to regulate how your website is crawled is the Robots Exclusion Protocol, implemented by the well-known robots.txt file and proposed in 1994 by Martijn Koster. Google (mainly) pushed to standardize the protocol, and it became an RFC (RFC 9309) in 2022. The robots.txt file contains rules, either generic or targeting a specific user-agent, which indicate which pages are allowed and which pages are disallowed on a website. This Google tutorial is a good introduction. For an example, you can look at Wikipedia's robots.txt file.
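
For illustration, a minimal robots.txt could look like the following sketch (the paths and sitemap URL are invented for the example):

    # Rules for every crawler: keep them out of /private/, allow the rest.
    User-agent: *
    Disallow: /private/
    Allow: /

    # Non-standard but widely supported: where to find the sitemap.
    Sitemap: https://www.example.com/sitemap.xml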

Sitemap and news sitemap

A robots.txt file may contain one or more Sitemap entries for the website. This directive is non-standard and may be ignored. The sitemap is basically a list of URLs within a website, formatted either as a text file or as a (recommended) XML file. The sitemap is really useful to indicate which parts of the website should be crawled and is a good way to signal new content (new pages, or recent modifications using the lastmod entry). Googlebot always crawls a website's sitemap and uses it as a suggestion list.
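
As an illustration, a minimal XML sitemap could look like the sketch below (the URLs and dates are invented):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2024-05-01</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/blog/new-article</loc>
        <lastmod>2024-05-12</lastmod>
      </url>
    </urlset>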

Security platforms and security tools

You can protect your website using different security platforms such as CrowdStrike, Akamai, Cloudflare or Palo Alto Networks. These platforms usually provide anti-malware tools, IP filtering, a firewall, and so on. It is an easy way to get good protection, at a cost. But they can also have their downsides, blocking everything, even what you would like to let through.

Contacting the people behind the crawler

Sometimes the simplest solution is to try to communicate with the team behind a crawler using your favorite means of communication. And if you can't find a way to contact them directly, it is always possible to find the Internet provider behind the offending IP address and send an abuse request, which will be forwarded promptly.

Behavior and mitigation strategy

When a website is crawled, the same IP address and/or user-agent is logged an unusual number of times in the server's logs. The burst of visits may be detected either automatically by a security platform, or spotted directly in the server logs for smaller sites. Let's have a look at some signals, the underlying behavior and their mitigation strategy (if one is needed).
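
As a quick illustration, here is a minimal sketch that counts requests per client IP to spot such a burst, assuming a standard access log where the client IP is the first field of each line (the file name access.log is hypothetical):

    # Count requests per client IP to spot an unusual burst of visits.
    from collections import Counter

    counts = Counter()
    with open("access.log") as log:        # hypothetical log file name
        for line in log:
            if not line.strip():
                continue
            ip = line.split()[0]           # first field = client IP
            counts[ip] += 1

    # Print the ten most active IP addresses.
    for ip, hits in counts.most_common(10):
        print(f"{hits:8d}  {ip}")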

DoS, DDoS

According to Wikipedia, the initial inspiration for the Robots Exclusion Protocol was an (involuntary) denial-of-service attack caused by a buggy crawler. A denial-of-service (DoS) attack (or distributed denial-of-service (DDoS) attack if multiple IPs are used) is an attempt to disrupt the normal traffic of a server, service or network by flooding the target or its surrounding infrastructure with Internet traffic: if you send enough requests to a server, it will not be able to answer all of them, and certainly not maintain normal behavior for legitimate access.

Each website is scaled according to its usual traffic (or just a tight budget), but this information is not publicly available and a web crawler will probably consider every website the same way. Nowadays, even one of the smaller hosting solutions will allow around 30 simultaneous requests, and it will easily scale to hundreds of simultaneous connections. A misconfigured or very naive crawler may still send too many requests, but this is not the case for the vast majority of crawlers, which will not risk being permanently banned.

Mitigation strategy: For smaller websites, a crawl may appear as a sudden burst, but most of the time it does not affect the web server's capacity. Accessing a web server every minute should not be a problem and is certainly not a DoS attack. First things first, you should evaluate whether a crawl is really provoking a denial of service by looking at the server statistics (load, traffic) or logs (e.g., if the web server reports too many connections at once). If there is a denial of service, then the buggy crawler may be banned from the website by disallowing all resources in the robots.txt file.
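
For instance, assuming the offending crawler identifies itself with the (made-up) user-agent BuggyBot, a complete ban would look like:

    # Ban a single misbehaving crawler from the whole site
    # ("BuggyBot" stands for whatever token appears in your logs;
    # effective only if the crawler honors robots.txt).
    User-agent: BuggyBot
    Disallow: /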

Note that a non-standard directive of the Robots Exclusion Protocol (Crawl-delay) makes it possible to tell a crawler which delay (in seconds) it must observe between two requests. Googlebot does not take it into account, but other crawlers such as BingBot respect it.
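
A sketch of the directive (10 seconds is an arbitrary value chosen for the example):

    # Non-standard: ask crawlers to wait at least 10 seconds between requests.
    # Ignored by Googlebot, honored by crawlers such as BingBot.
    User-agent: *
    Crawl-delay: 10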

Brute force attack

A brute-force attack is a very naive attack which consists of trial-and-error attempts to crack passwords, gain access or break encryption keys. Obviously, it can only work against very weak encryption algorithms, but it may succeed if users are not required to use strong passwords and/or passphrases (the famous chair-to-keyboard vulnerability). A brute-force attack is detected through repeated failures to access a protected resource. It is sometimes the case that a crawler does not really try to access a resource, but only to crawl a publicly found URL which happens to require credentials.

Mitigation strategy: As with DoS attacks, a few attempts to access a restricted resource do not mean you are subject to a brute-force attack, which requires thousands of attempts to have even a minimal chance of success. The best mitigation strategy is to prevent access to the restricted resources using a robots.txt directive. Note that the robots.txt directive syntax makes it possible to define restrictions when a given request parameter is used; it does not need to be a single path, as illustrated below.
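
A sketch with a hypothetical action=login parameter ("*" is the wildcard defined by the Robots Exclusion Protocol; the two patterns cover the parameter whether it appears first or later in the query string):

    # Block crawling of any URL carrying the login action, whatever the path.
    User-agent: *
    Disallow: /*?action=login
    Disallow: /*&action=login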

Action URLs

The web is made of pages linked by hyperlinks, but links can also trigger actions such as “add to cart” on an e-commerce website, a login, or an attempt to access a restricted resource. Even if these URLs are publicly found, trying to access them may trigger some alarm; a login URL may send an email, for example. The worst case is with dynamic pages where each access creates new links with more parameters in the URL: the crawler may then be trapped in a loop. A crawler can't just ignore URL parameters, since some websites are fully dynamic and the URL is the same for all resources except for the parameters.

Mitigation strategy: Most platforms propose or recommend a default robots.txt file designed to avoid these side effects. Look at the Magento or PrestaShop propositions for e-commerce platforms; WordPress proposes a robots.txt file preventing access to the administration console. Again, robots.txt directives are not restricted to absolute paths, and it is possible to filter based on URL parameters using wildcards, as in the excerpt below. A simple rule: if a resource should not be publicly accessed, then it must be disallowed by the robots.txt file.
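
A hypothetical excerpt in the spirit of these default files (the exact paths depend on your platform):

    # Keep crawlers away from the administration console and the
    # "action" URLs of a typical shop.
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /*add-to-cart=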

False alarm

Security plugins or platforms would rather report false positives than miss genuine attacks. There might not be a real difference between a web crawler's behavior and someone trying to scrape a website, at least not one that can be detected programmatically. For instance, a well-known security platform registers every bot that cannot solve its captcha or checks as an attack source, but according to them, it is up to their clients to whitelist (or not) the corresponding IP address. How many of their clients will take the time to review all possible attack sources to unblock an unknown crawler? Just a few, in the best-case scenario.

Mitigation strategy: Since you can't trust everything that comes out of a security platform, your best defense is to acquire some knowledge about the most common web bots and their respective purposes. Search engine bots need no introduction. SEO bots are quite common, but their only interest is a better knowledge of the web graph (aka the link structure of the web). Marketing bots maintain their databases. We already talked about the impact of crawling on a web server above; the second most common fear concerns intellectual property, and the usual suspects are artificial intelligence bots. Unfortunately, there is no official list of bots, but you can check lists such as Cloudflare's verified bot list.