Deep dive into the architecture of the Babbar Crawler: The Crawl Policy

At Babbar, we try to crawl the web like a search engine. Of course, we don’t have the same computing resources or bandwidth as Google, Bing or other big players, but we still have a neat record in terms of coverage and crawling speed. While our focus is on Search Engine Optimization, we still share a lot of similar objectives and constraints with classical search engines.

In this article, which follows the previous article introducing the Babbar crawler, we will go into more detail about the crawl policy.

Crawling the web is like being an explorer in a never-ending jungle of URLs. To make their way on the internet, web crawlers need a well-thought-out policy to guide them. Without one, they might get lost in a sea of low-quality pages, stuck in the quicksand of dead links, or sucked into a black hole. Let’s break down how to design an effective crawl policy that balances exploration, efficiency, and a touch of politeness.

How the Crawler works

Understanding how the crawler operates is essential to crafting a solid crawl policy. The big picture is illustrated by this schema:

Barkrowler itself has an important role, but it’s not actually where the hard stuff is done. Here is a simplified overview of the key components (a rough sketch of how they fit together follows the list):

  • Scheduler: This is the brain of the operation, determining the order and timing of URLs to be fetched. That’s where a lot of politeness is managed.
  • Fetcher: (yes, we have a fetcher in our fetcher) Responsible for retrieving content from the web, including HTML, headers, and other resources.
  • Parser: Breaks down fetched content into meaningful pieces, such as links, metadata, and body text.
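
Put together, the three components form a simple pipeline. Here is a minimal sketch, with class and method names that are ours for illustration rather than Barkrowler’s actual code:

    from dataclasses import dataclass, field

    @dataclass
    class CrawlRequest:
        url: str
        not_before: float = 0.0  # earliest allowed fetch time, set by the scheduler for politeness

    @dataclass
    class FetchResult:
        url: str
        status: int
        headers: dict = field(default_factory=dict)
        body: str = ""

    @dataclass
    class ParsedPage:
        url: str
        links: list = field(default_factory=list)
        metadata: dict = field(default_factory=dict)
        text: str = ""

    def crawl_pipeline(scheduler, fetcher, parser):
        """Scheduler decides what and when, the fetcher retrieves, the parser extracts."""
        for request in scheduler:            # yields CrawlRequest objects in politeness-aware order
            result = fetcher.fetch(request)  # returns a FetchResult
            yield parser.parse(result)       # returns a ParsedPage with links, metadata and text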

Barkrowler also handles robots.txt file rules (allow / disallow as well as crawl-delay), DNS resolution, SSL shenanigans, HTTP magic and so on. This is not the topic of this article however, so for more information about this, I invite you to have a look at our robots.txt tips page and wait for a later article.

Then we have the main piece: WDL, which handles the index, the crawl policy, metrics computation and so on. We will give more details about its global operation in a later article, but for now we just need to understand how it operates on the huge database of pages. A very simplified schema of WDL is presented here:

We will mostly focus for now on the big green arrow which represents the Iterator process that continuously scans the database and does a lot of things, including sending crawl requests. This is where the Crawl Policy is used to decide whether a web page that we have discovered should be crawled (or re-crawled).

The architecture is built for scalability and efficiency, relying on technologies like Pulsar for messaging and virtual threads for parallel processing. These features ensure the crawler can handle large-scale web exploration without crashing or overloading servers.

Why a Crawl Policy matters

Before we dive into the nuts and bolts, let’s answer the big question: Why do you need a crawl policy? Think of it as the rulebook for your crawler. It determines:

  • What to crawl: Should you focus on a site which is new, or revisit pages you’ve already seen? The web is complex and evolving, but some websites are there forever and haven’t changed. We also have some storage constraints; we don’t want to waste space on a server, so we need to focus on the most interesting pages. Think as if we were a search engine: discovering a new site can be a game changer!
  • How fast to crawl: Nobody likes an impolite crawler that overwhelms servers. We crawl approximately 3.5 billion pages per day, so even if we try to be polite, we receive complaints about our crawling speed from time to time, especially if a website is co-located on a single server that serves several domains.
  • Where to prioritize: Should we invest resources in high-quality domains or explore the uncharted corners of the web? Think about the fact that we want to mimic the behavior of a search engine web crawler: we need to focus on the most interesting pages. But what does it mean?

A good crawl policy isn’t just about efficiency; it’s about making sure your crawler’s efforts yield the most valuable insights without stepping on too many toes. With the same goal in mind as a search engine, our objective is to build a set of rules that lets us efficiently explore the most interesting parts of the web: it is the policy that defines what we are interested in.

Foundations of a good Crawl Policy

Objectives

  • Deliver crawl requests at a controlled pace.
  • Manage crawling differently for new URLs (never crawled before) and re-crawling (multiple fetches of the same URL).

Quality

  • Avoid spending too much time on a low-quality website and maximize exploration (and refreshing) of valuable sites. Challenge: How to define these valuable sites?
  • Prioritize URLs with high interest (generally those with high metrics).

Discovery

  • For a new site, perform some initial crawling to assess its value.

Freshness

  • Re-crawling helps discover new pages, new links, and detect pages that disappear.

Constraints

  • The crawl policy operates within the framework of the global iterator.
  • Partitioning is done by domain.
  • Information is aggregated by domain, host, and view.
  • The order of hosts and URLs follows a lexicographic structure (though not strictly).
  • The base unit of the iterator is a block of 4096 views (see the sketch after this list), with each view containing:
    • Any number of backlinks.
    • A fetch (which may include any number of forelinks).
  • Crawl requests are sent to Barkrowler, which manages queues to enforce rate limits and ensure politeness toward external servers. Sending too many requests to a host/domain at once is counterproductive.
  • Storage space is limited. Since the web evolves constantly, it is necessary to periodically delete older data to make room for new crawls.
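
Here is a hypothetical sketch of that view and block structure (field names are ours; the real storage format is of course more involved):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Fetch:
        status: int
        forelinks: List[str] = field(default_factory=list)  # outgoing links found in the fetched page

    @dataclass
    class View:
        url: str
        backlinks: List[str] = field(default_factory=list)  # any number of backlinks
        fetch: Optional[Fetch] = None                        # the fetch, which may carry any number of forelinks

    BLOCK_SIZE = 4096  # the base unit of the iterator is a block of 4096 views

    def blocks(views: List[View]):
        """Yield successive blocks of views, the unit processed by the iterator."""
        for i in range(0, len(views), BLOCK_SIZE):
            yield views[i:i + BLOCK_SIZE]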

1. Rate Limiting from the sending side

We have multiple factors to take into account, but one of them is very low level: the internet bandwidth. We have several servers which handle the fetching, each server is connected to our internal network through a network card, and the router itself is connected to the internet via a link which has a maximum capacity. We cannot crawl faster than we can download, so our primary constraint is to limit the rate at which our crawl policy sends crawl requests.

One of the first versions of our crawl policy was simply to select random URLs while scanning the database, with a sampling rate (a probability) adjusted so that the rate of crawl requests would reach a given value.
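
That early version could be sketched as follows (a toy version; the sampling rate would be tuned so that the number of selected URLs per second matches the desired request rate):

    import random

    def sample_urls(url_stream, sampling_rate):
        """First-generation policy sketch: keep each scanned URL with a fixed probability."""
        for url in url_stream:
            if random.random() < sampling_rate:
                yield url  # this URL becomes a crawl request

    # For example, if a full scan covers ten times more URLs than we can fetch in the
    # same period, a sampling_rate around 0.1 keeps the request rate near the target.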

Now we have a more complex rate-limiting mechanism with a feedback loop that globally (we obviously have a distributed infrastructure) adjusts a selection threshold: URLs are scored by a weighting method, and the threshold is moved up or down so that the rate of selected URLs reaches the target.
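
A minimal sketch of such a feedback loop, assuming each candidate URL already carries a weight and that the observed request rate is measured periodically (the multiplicative correction and its gain are illustrative choices):

    class ThresholdController:
        """Select URLs whose weight exceeds a threshold; adjust the threshold to reach a target rate."""

        def __init__(self, target_rate, threshold=1.0, gain=0.1):
            self.target_rate = target_rate  # desired crawl requests per second, globally
            self.threshold = threshold      # current selection threshold on URL weights
            self.gain = gain                # how aggressively the threshold is corrected

        def select(self, weight):
            """Decide whether a URL with this weight is turned into a crawl request."""
            return weight >= self.threshold

        def update(self, observed_rate):
            """Feedback step: crawling too fast raises the bar, too slow lowers it."""
            error = observed_rate / self.target_rate
            self.threshold = max(1e-9, self.threshold * (1.0 + self.gain * (error - 1.0)))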

2. Targets: to Crawl or Re-crawl?

Crawling is like treasure hunting—some pages are gems, while others are just rocks. Your policy should decide:

  • When to crawl new pages: Prioritize discovery to assess the value of unexplored domains.
  • When to revisit old pages: Focus on freshness, tracking changes, and identifying new links or content removals.

Pro tip: Assign higher weights to URLs with higher semantic value or importance in search rankings.

3. Host and Domain limits: spreading

You don’t want to shower all your attention on one site while ignoring others. A good policy sets limits on the number of requests per host or domain to ensure fair distribution. For example:

  • Allocate quotas based on the size and quality of the domain.
  • Use a mix of proportional and equal allocation to balance fairness with focus (a sketch of such an allocation follows below).
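
Here is that sketch, assuming each domain has a size (or quality) score; the blend factor is a placeholder:

    def allocate_quotas(domain_scores, total_budget, blend=0.5):
        """Split a crawl budget across domains: partly equally, partly proportionally to their score."""
        n = len(domain_scores) or 1
        total_score = sum(domain_scores.values()) or 1.0
        equal_share = total_budget / n
        return {
            domain: blend * equal_share + (1.0 - blend) * total_budget * score / total_score
            for domain, score in domain_scores.items()
        }

    # Example: the big domain gets more, but the small one is not starved.
    print(allocate_quotas({"big.example": 1_000_000, "small.example": 1_000}, total_budget=10_000))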

Choosing the pages like a search engine

In the end, what our crawlers focus on is determined by a mix of several weight boosting and deboosting rules.

The basics of crawl AND re-crawl policy weighting rules

We use our two main metrics as the baseline: BAS (Babbar authority score) and Semantic Value, which considers the relevance between the content of web pages and the content of the web pages they link to.

Then we boost based on quality, first using a business level which is computed from the languages found in the content of the URLs, the number of websites on a domain, the hosting platform used by the website, and some other secondary criteria.

We also apply a boost for crawling unseen domains (particularly because some information, like the business level or the platform, requires crawling to be known).
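
Putting these ingredients together, a hedged sketch of the base weighting could look like this (the way the metrics are combined and the boost values are hypothetical; only the ingredients, BAS, Semantic Value, business level and the unseen-domain boost, come from the actual policy):

    def base_weight(bas, semantic_value, business_level, domain_already_crawled):
        """Hypothetical combination of the baseline metrics and the quality boosts."""
        weight = bas * semantic_value    # the two main metrics as a baseline
        weight *= 1.0 + business_level   # quality boost (languages, platform, sites per domain, ...)
        if not domain_already_crawled:
            weight *= 2.0                # discovery boost: unseen domains need at least one crawl
        return weight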

Specific reweighting rules for crawling “new” web pages

For pages never crawled before, we apply a boost based on how recently we have “viewed” this URL. The lastView indicator refers to the most recent time we have fetched a page with a link to this URL.

We also systematically try to crawl URLs which come from certain sources; we call these URLs seeds.
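
A hedged sketch of the new-URL weighting, assuming lastView is a timestamp and that seeds bypass the normal selection (the recency curve is an illustrative choice):

    import time

    def new_url_weight(base, last_view, is_seed, now=None):
        """Boost never-crawled URLs that were linked to recently; always keep seeds."""
        if is_seed:
            return float("inf")          # seeds are systematically crawled
        now = now if now is not None else time.time()
        days_since_view = max(0.0, (now - last_view) / 86400.0)
        recency_boost = 1.0 / (1.0 + days_since_view)  # seen yesterday counts more than seen last year
        return base * (1.0 + recency_boost)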

Specific reweighting rules for re-crawling web pages

For web pages which have already been crawled, we have a lot more information to consider for re-crawling.

  • We weight the re-crawl based on the status of the previous fetch (the HTTP status in case of a successful fetch, or various error states if the crawlers encountered a problem while fetching the URL).
  • We also have a special “pending error” status when a page previously successfully crawled has received an error on subsequent trials. If a URL is in this state, we boost the recrawl probability to “resolve” the case.
  • We boost URLs that have been recently viewed (see above for the definition) and also boost URLs whose last fetch is far in the past.
  • And finally, we boost the crawl of a URL if it has a tendency to change (and conversely we deboost URLs that never change); a rough sketch combining these signals follows this list.
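
Assuming each of these signals is available as a simple value, such a combination might look like this (statuses, boost factors and the change-rate estimate are all illustrative):

    def recrawl_weight(base, last_status, pending_error, days_since_view,
                       days_since_fetch, change_rate):
        """Hypothetical re-crawl scoring built from the previous fetch and the change history."""
        weight = base
        if pending_error:
            weight *= 3.0                              # resolve pages that started failing after a success
        elif last_status >= 400:
            weight *= 0.5                              # plain errors are less urgent to retry
        weight *= 1.0 + 1.0 / (1.0 + days_since_view)  # recently seen through fresh backlinks
        weight *= 1.0 + days_since_fetch / 30.0        # the older the last fetch, the bigger the boost
        weight *= 0.5 + change_rate                    # change_rate in [0, 1]: pages that change get re-crawled more
        return weight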

Limits

Since our iterator in production performs a full pass in a period of a few days, we send the crawl-request batches with a given time-to-live target. Basically, based on the expected politeness per domain/host/IP address, we try to guess a reasonable number of URLs to crawl in one go. We define a few limits and try to spread our crawl requests across all the URLs of a host/domain. The limits are at host level (currently around 10k pages per pass) and domain level (currently around 100k pages), and we also have heuristics to share the limits between the hosts of a domain. In some way this can be thought of as the infamous crawl budget supposedly used by Google to limit the exploration of some websites, but in practice most of the budgeting is done via the weighting mechanism.
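
As an illustration of these limits (the figures are the approximate current values quoted above; the helper itself is a simplification of the real spreading heuristics):

    HOST_LIMIT = 10_000     # roughly 10k pages per host per iterator pass
    DOMAIN_LIMIT = 100_000  # roughly 100k pages per domain per iterator pass

    def within_limits(pages_selected_for_host, pages_selected_for_domain):
        """Stop selecting more URLs once a host or its domain has used its per-pass budget."""
        return (pages_selected_for_host < HOST_LIMIT
                and pages_selected_for_domain < DOMAIN_LIMIT)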

Dealing with challenges

The Garbage Collection Problem

The web is a living organism—pages are born, evolve, and eventually disappear. Our crawl policy needs a garbage collection mechanism to:

  • Remove dead pages: Use a time-to-live (TTL) system for old or orphaned pages.
  • Prioritize updates: Replace outdated content with fresher, more relevant data.

These garbage collection rules interact with the rest of the crawl policy to maintain our current state of the web graph, which is in turn used to compute metrics based on links and content, like a search engine would. The data constantly evolves while our resources stay constant: deleting old data is what allows us to add new pages, new websites and new domains.
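
A minimal sketch of a TTL-based cleanup, assuming each page records the last time it was seen, either fetched or linked to (the TTL value is a placeholder):

    import time

    TTL_SECONDS = 180 * 86400  # hypothetical time-to-live of about six months

    def is_expired(last_seen, now=None):
        """A page neither fetched nor linked to for longer than the TTL can be dropped."""
        now = now if now is not None else time.time()
        return now - last_seen > TTL_SECONDS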

The Trash Score

In addition to all the quality metrics, we are experimenting with a new score. Not all sites are created equal, and some are outright problematic. A “trash score” can help classify low-quality or spammy sites, using metrics like:

  • Ratio of referring hosts to referred hosts/domains/IP. An example would be a domain with a lot of referring domains, all hosted on the same IP address.
  • Number of hosts per domain. An example would be some domains with millions (or billions) of hosts which are used to cross-link and generate pages on demand.
  • Number of outbound links per page.
  • Many other ratios using internal/external links count, and other metrics.

While defining this score is tricky without a gold standard, it’s a valuable tool for refining your crawl priorities. We are integrating the score into several parts of the crawl policy.
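
For example, a hedged sketch of how such ratios could be folded into a single score (the thresholds and caps are placeholders, precisely because there is no gold standard):

    def trash_score(referring_hosts, referring_ips, hosts_on_domain, avg_outlinks_per_page):
        """Hypothetical trash score: higher means more likely low-quality or spammy."""
        score = 0.0
        if referring_ips:
            # Many referring hosts squeezed onto very few IP addresses looks like a link network.
            score += min(1.0, max(0.0, referring_hosts / referring_ips - 10.0) / 100.0)
        # Millions of hosts on one domain often means generated, cross-linked pages.
        score += min(1.0, hosts_on_domain / 1_000_000.0)
        # Pages stuffed with outbound links are suspicious.
        score += min(1.0, avg_outlinks_per_page / 500.0)
        return score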

Conclusion on the Crawl Policy

Designing a crawl policy is as much an art as it is a science. It’s about finding the right balance between discovery and efficiency, all while being a considerate guest on the web. With a solid strategy in place, we are confident that our crawler allows us to have a fair, accurate, fresh and representative picture of the web, similar to (albeit smaller) what major search engines such as Google have. Our crawlers do not search, but they perform a delicate work of data collection which has the very same goal.

Additional note

For everyone concerned about their own web server resources: we know we are no Google, and that our crawling does not bring you any immediate and obvious value, but the openness and fairness of the web is an important thing to cherish. We understand that sometimes, being hit by a crawler that seems over-enthusiastic is not a good thing for your site. From time to time, we get some complaints. Sometimes the complaints are borderline insults; sometimes the dialogue we start ends up with those people connecting with us on LinkedIn or thanking us on Bluesky or Mastodon.

Some important things:

  • We do respect the robots.txt file, but it takes some time to reload it, so do not be surprised if your changes are not immediately effective.
  • Use robots.txt file rules to disallow pages that you do not want us to crawl. Typically, if your web site has action URLs using the GET method (this is not a good idea at all), like for example add-to-cart pages or login pages, you should disallow them.
  • Use the crawl-delay directive in the robots.txt file if your website is not backed by a big server; it allows you to limit the crawl rate at the website level.
  • If you have many hosts (especially if they are on different domains) behind just one IP address, then you should seriously consider having the infrastructure to handle the load as if it were many servers. We have a distributed set of crawlers, and we cannot guess the capacity of the servers behind one IP address. A lot of people put servers behind a single load-balancing IP address, so we have to assume that many hosts on a single IP address is not synonymous with a small infrastructure.
  • We understand HTTP code 429 as a request to slow down our crawlers on your site (it actually propagates to the IP address quickly). You can use it at the global server level to handle many websites without specifying a crawl-delay on each website’s robots.txt.
  • Avoid using the 403 return code as a means to tell us to stop crawling. Robots.txt is king. Not only is a 403 not a signal for our crawlers to slow down (that’s 429) or stop crawling (that’s in your robots.txt file), but you will still receive our requests until our crawlers realize it’s not a good idea to continue.
  • Finally, the more URLs you have, the more likely you are to be crawled, because our goal is to discover the best parts of the internet! Robots.txt file rules can however be used to allow just a single page (for example the home page) if you want to limit the number of things we can crawl.

We understand that your primary interest is to allow search engines such as Google to crawl you, but understand that this way of thinking helps monopolies stay in place. Crawlers like ours (and other crawlers not in the SEO space) are for the most part not here to steal your content, DDoS you or sell your data. We just want to understand the web and its content to bring insights to our customers and improve the quality of the content they produce. Each rule of your robots.txt file should be carefully thought out and tested. You can of course use our tool babbar.tech to get the list of known URLs on your website, so that no rule accidentally blocks or allows an undesired URL.

Some examples of robots.txt file rules:

A robots.txt file with a crawl-delay of 15 seconds

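For instance, applied to every crawler (you could target a specific user agent instead):

    User-agent: *
    Crawl-delay: 15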

A robots.txt file with rules to limit the URLs the crawler is allowed to fetch

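For instance, blocking action URLs such as add-to-cart or login pages (the paths are placeholders; adapt them to your own site):

    User-agent: *
    Disallow: /cart/add
    Disallow: /login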

You can find lots of robots.txt file validators on the web; it’s really important to spend some time working on yours.