How does a high-performance web crawler work? The Babbar case

In this article, we are going to introduce the high-level architecture of the Babbar SEO crawler, Barkrowler. A web crawler, also known as a web spider or web robot, is a program used by search engines like Google or Bing to navigate and gather information from the vast expanse of the World Wide Web.

Unlike Google, however, the main goal of Babbar is not to create a search engine with an inverted index used to answer users’ queries, but to collect data about web pages, web sites and domains, and the links between them. We still need to analyze content and compute “popularity” metrics, so the job performed by a general SEO web crawler is very similar to the job performed by a search engine web crawler.

Let’s first look at what web crawling is in general, and how it differs from web scraping and from building a full search index.

What is web crawling?

Web crawling is the process of systematically scanning websites and web pages to find and index relevant content. Search engine crawlers, such as Barkrowler, visit web pages, read their content, and follow any links to other pages within the same website. The crawler then analyzes the data collected and extracts valuable information, such as keywords, metadata, and text, which is later used by search engines to provide more accurate search results.

To ensure efficient crawling, web crawlers employ advanced algorithms and techniques. They prioritize the crawling of popular and frequently updated websites, ensuring that search engines have the most up-to-date information. Crawlers also have to be cautious and respectful of website owners’ guidelines, as excessive crawling can put a strain on server resources and potentially disrupt website operations. 
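In practice, being “cautious and respectful of website owners’ guidelines” starts with honoring robots.txt and spacing out requests to the same host. Below is a minimal sketch using Python’s standard-library urllib.robotparser; the user-agent constant and the delay value are illustrative choices for the example, not Babbar’s actual settings.

```python
# Minimal politeness check before fetching a URL: honor robots.txt and
# enforce a per-host delay. Illustrative sketch only, not Babbar's code.
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "Barkrowler"          # our fetcher's User-Agent string
MIN_DELAY_SECONDS = 5.0            # example delay between two hits on the same host

robots_cache = {}                  # host -> RobotFileParser
last_fetch_time = {}               # host -> timestamp of the previous request

def allowed_to_fetch(url: str) -> bool:
    """Return True if robots.txt permits this URL and the host is not on cooldown."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()              # fetch and parse robots.txt once per host
        except OSError:
            pass                   # unreachable robots.txt: stay permissive in this sketch
        robots_cache[host] = rp
    if not robots_cache[host].can_fetch(USER_AGENT, url):
        return False
    elapsed = time.time() - last_fetch_time.get(host, 0.0)
    return elapsed >= MIN_DELAY_SECONDS
```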

Web scraping, on the other hand, is a technique used to extract specific data from websites. We will not cover this aspect.

Search engine optimization (SEO) plays a vital role in improving a website’s visibility and ranking on search engine result pages. By optimizing website content, structure, and meta tags, website owners can enhance their chances of being indexed and ranked higher by search engines. Understanding how web crawlers work can help in implementing effective SEO strategies to drive organic traffic to websites.

The Babbar crawler

Babbar’s mission is to provide the most accurate information about the web in an independent way, so that our customers can get high quality data about their website, but also about their competitors’ websites, as well as data from relevant related sites (news websites, blogs, etc.). To achieve this goal we have to collect and analyze the content of a lot of web pages, exactly the way a search engine like Google or Bing would. In other words, we have to crawl the web and perform generic information extraction, as if we were building a full search index of the web.

Babbar’s crawler is a cutting-edge piece of technology that embodies our dedication to innovation and efficiency. Despite being a relatively small-scale company, we’ve managed to engineer a solution that rivals industry giants in performance and scalability. This is a point of immense pride for our team, as it demonstrates our ability to push the boundaries of what’s possible in the realm of SEO technology.

Our crawler has consistently proven its prowess in the field. For instance, according to the Cloudflare bot leaderboard, Babbar’s crawler ranks as the 3rd most efficient SEO-focused bot, standing shoulder to shoulder with the best in the industry. Even more impressively, it is positioned among the top 20 crawlers globally, a remarkable achievement considering the competitive landscape. At one point, we even climbed to the second spot, and we continue to remain fiercely competitive, often neck-and-neck with Moz’s dotbot.

This accomplishment reflects not just the technical brilliance of our crawler but also the meticulous care we put into optimizing its performance. From its sophisticated algorithms to its resource-efficient architecture, every aspect of the crawler has been designed to deliver exceptional results while minimizing overhead. Babbar’s crawler is a testament to what a focused and passionate team can achieve, proving that size is no barrier to excellence in the world of SEO technology.

Barkrowler is the name (and User-Agent string) of our Fetcher. As of December 2024, we fetch approximately 3.3 billion pages per day, as shown by our monitoring.

This article is about the architecture of a “crawler”, which is much more than a piece of software that sends GET requests on the World Wide Web.

Before we get to the architecture, a bit of history about why we decided to start working on a crawler, and more importantly on the index that would store a picture of the web.

The Origins of Babbar’s Technology: A Journey from Text Embeddings to Advanced Web Crawling

Before Babbar was officially born, our team was already laying the groundwork for what would become the core of our technology. This journey began with the development of our text embeddings computation technology, which now powers several key features of our platform, including Semantic Value computation, Induced Strength, and other advanced indexing metrics. Here’s a breakdown of our evolution:

1. Text Embeddings Technology (2008 – 2017)

  • Development and Early Innovation (2008): Our in-house text embeddings technology emerged in 2008, designed to represent human text as numerical vectors. These vectors enable semantic distance computations, a cornerstone for analyzing and interpreting text meaning at scale. Key feature: high throughput for both learning and inference.
  • Web-Scale Learning Algorithm (2015): We transitioned to developing a web-scale version of our learning algorithm, leveraging online learning processes to adapt dynamically to large datasets.
  • Scaling the Technology (2017): To demonstrate the scalability of our system, we began training it on massive datasets. This was pivotal in testing its robustness and efficiency.

2. Experimenting with Web-Scale Datasets

  • First Experiments with CommonCrawl:
    We started with CommonCrawl datasets, which, while vast, contained a significant amount of noise. This posed challenges but also provided valuable insights.
  • BUbiNG Integration: We utilized BUbiNG, an open-source crawler developed at the University of Milan. Although it proved efficient in crawling, it lacked the “intelligence” needed for more nuanced tasks and eventually hit its limitations.
  • Realization of Potential: Despite BUbiNG’s shortcomings, its performance showcased the potential to revolutionize the SEO landscape. It inspired us to rethink the quality of data available to SEO experts, addressing one of the most significant criticisms in the field: the poor quality of SEO data.

3. A New Vision: Building an Intelligent SEO Tool

  • Redefining Search Engine Metrics: Instead of creating a full-fledged search engine (with a comprehensive page index), we focused on developing a metrics computation system. This system aimed to replicate how real search engines rank web pages, providing SEO professionals with actionable insights.
  • Advancing Crawling Technology: We shifted our attention to building a complete web crawling solution, prioritizing the storage and indexing aspects of the crawl.
    • Fetching and parsing tasks, already addressed by BUbiNG, became the foundation for further innovation.
    • We undertook the challenge of storing and managing vast amounts of crawled data effectively.

4. Where We Are Today

This iterative journey—from foundational text embedding research to large-scale crawling—has culminated in the sophisticated platform that powers Babbar. By addressing critical gaps in SEO data quality and scale, we’ve not only demonstrated the strength of our technology but also reshaped how SEO professionals interact with and leverage web data.

 

What is the architecture of a general web crawler?

Many web experts perform routine crawls of their website. Even such a limited endeavor can be challenging, because moderately large websites can have several million pages. The basic behavior of such a crawler is, however, quite simple: it has a list of initial URLs, fetches them, parses the results to find new internal pages, and adds those pages to the list. Every page is fetched just once, and normally, after a limited time, the list stops changing and eventually all pages have been fetched. Additionally, since the crawl is targeted, a scraping step can easily be added to collect the inner content of a web page and to compute information about duplicate content in a more precise way.
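To make that loop concrete, here is a minimal single-site crawler sketch in Python, using the third-party requests and beautifulsoup4 packages. It is deliberately naive: no robots.txt handling, no politeness delay, no retries.

```python
# Minimal single-site crawler sketch: fetch, parse, follow internal links,
# visit every URL at most once. Illustrative only.
from urllib.parse import urljoin, urlparse
import requests                    # third-party: pip install requests
from bs4 import BeautifulSoup      # third-party: pip install beautifulsoup4

def crawl_site(start_url: str, max_pages: int = 1000) -> set[str]:
    site = urlparse(start_url).netloc
    frontier = [start_url]         # list of URLs still to fetch
    visited = set()                # every page is fetched just once
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            # only follow internal links, so the frontier eventually stops growing
            if urlparse(link).netloc == site and link not in visited:
                frontier.append(link)
    return visited
```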

A web-scale crawler is very different: first because it works continuously, which means that it has to keep updating the link graph as well as page-level information, and second because the list of URLs to crawl (called the Web Frontier in the literature) never stops growing, unless some kind of garbage collection process is in place.

Moreover, a single-site crawler is limited by the capacity of the web server: requesting 100 pages per second from a web site will, in general, quickly bring it to its knees. In such crawlers, performance and optimization are therefore not a problem, because the limit is at the other end. A web-scale crawler, on the other hand, processes thousands of web sites simultaneously, which means that tuning it for maximum performance is a necessity.
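One way to reconcile throughput with politeness at that scale is to keep one queue per host, plus a schedule of when each host may next be contacted. The sketch below is a simplified illustration of that idea; the data structures and the delay value are assumptions made for the example, not a description of Barkrowler’s internals.

```python
# Sketch of per-host scheduling: thousands of hosts, one FIFO queue each,
# and a heap that tells us which host may be fetched next. Illustrative only.
import heapq
import time
from collections import defaultdict, deque

PER_HOST_DELAY = 10.0                      # example politeness delay, in seconds

host_queues = defaultdict(deque)           # host -> pending URLs for that host
last_fetch = {}                            # host -> time of its previous fetch
ready_heap = []                            # (next_allowed_time, host) entries

def enqueue(host: str, url: str) -> None:
    if not host_queues[host]:
        # host had no pending work: schedule it, respecting its politeness delay
        not_before = last_fetch.get(host, 0.0) + PER_HOST_DELAY
        heapq.heappush(ready_heap, (max(time.time(), not_before), host))
    host_queues[host].append(url)

def next_url_to_fetch() -> str | None:
    """Return a URL whose host is past its politeness delay, or None if none is ready."""
    while ready_heap and ready_heap[0][0] <= time.time():
        _, host = heapq.heappop(ready_heap)
        if host_queues[host]:
            url = host_queues[host].popleft()
            last_fetch[host] = time.time()
            if host_queues[host]:          # more URLs for this host: reschedule it
                heapq.heappush(ready_heap, (time.time() + PER_HOST_DELAY, host))
            return url
    return None
```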

Finally, given the sheer size of the web, a distributed architecture is necessary in order to scale. It requires us to clearly split responsibilities and allocate resources accordingly.
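One common way to split the work across machines (an illustrative scheme, not necessarily the one we use) is to partition URLs by hostname, so that a given host is always handled by the same node and per-host politeness can be enforced locally:

```python
# Hash-partitioning of URLs across crawler nodes by hostname, so every node
# owns a disjoint set of hosts. A common scheme, shown here as an example.
import hashlib
from urllib.parse import urlparse

NUM_NODES = 16                                    # illustrative cluster size

def node_for_url(url: str) -> int:
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# e.g. node_for_url("https://www.example.com/page") -> an integer in [0, NUM_NODES)
```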

We end up with the following parts:

  • Fetcher: given a constant stream of URLs (or crawl requests), it sends appropriate requests to web servers and collects the responses. It must manage DNS resolution, per-IP-address and per-hostname queues, connections (SSL and other aspects of the HTTP protocol), and caches for temporarily storing fetched data. It also handles robots.txt (the instructions web sites give to robots such as our crawler). This is the most technical, low-level step of a crawler, and many errors at this step can lead to poor data collection.
  • Parser: receives the fetched data, analyzes the HTML and performs various tasks, such as extracting metadata, links and text, as well as detecting encoding and language. The parser ends up sending a structured message to the index for storage. Its main role is to extract the useful content of the page, which is at the root of a good crawler and has an important impact on the later steps, especially those related to semantics.
  • Index: a store that contains what we want to keep, structured in a way that facilitates data retrieval, with two goals in mind: first, serve our customers with the data they need; second, guide the crawl so that it behaves like the best search engine crawlers. The index also performs some crucial content handling: fingerprints for duplicate content (see the fingerprinting sketch after this list) and embedding computation for efficient content storage, which is later used to compute metrics. One of the key roles of the index in a crawler architecture is to dispatch a lot of information from the source page to the pages receiving the backlinks, including the content (albeit summarized), the URL, the IP address, the source metrics, etc.
  • Crawl Policy: a set of heuristics executed on the whole list of known URLs (a.k.a. the Web Frontier) to determine which URLs will be selected for crawl (or recrawl). A lot of information is taken into account in this step: the metrics (including the semantic value, which is computed using summarized page contents), a set of constraints related to the domain and hostname of the website (which is not a crawl budget, but can be somewhat related to it), the history of previous fetches, the last time we crawled a page that links to our target, etc.
  • Semantic embeddings computation: semantics is key in search engines, and thus in SEO (especially now that Generative AI is taking over the rendering part of search), so we integrated this part very early in our technology, and it shows in many aspects. Since we do not perform targeted scraping, our computation must be robust and generic, and it relies heavily on the content extraction step. As explained in our article about duplicate content, depending on the ratio of boilerplate to content, some aspects of the boilerplate can affect the general semantic orientation of a site. In some cases this can be misleading.
  • Metrics computation: one of the main goals of Babbar, from the beginning, has been to provide fine-grained, fresh, detailed and truthful metrics to our customers. Computing PageRank-like metrics at page level, continuously, is a challenging task (a toy version of the underlying idea is sketched after this list), and the way we solved it is one of the determinants of our cost-effectiveness. It is not only important for our customers, but also for the crawl policy: all search engines and crawlers have limited resources, and guiding the usage of these resources is consequently a crucial task.
  • Serving: computing the graph continuously using high-throughput methods requires constant updates to the database. To keep costs low while keeping data redundancy high, we split the computing and serving parts, which typically allows us to limit our resource needs in terms of memory. In web search engines, the serving part consists of a low-latency “inverted index”; SEO crawlers have a more traditional index that provides information about any web page, web site or web domain, but also about any link (backlink or forward link).
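As mentioned in the Index item above, duplicate content is detected with fingerprints. One standard fingerprinting technique is SimHash, sketched below for illustration; we are not claiming this exact variant is what runs in Babbar’s index.

```python
# SimHash-style 64-bit fingerprint: near-duplicate pages produce fingerprints
# with a small Hamming distance. A standard technique, shown for illustration.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Two pages are likely near-duplicates if their fingerprints differ by only a few bits,
# e.g. hamming_distance(simhash(page_a_text), simhash(page_b_text)) <= 3
```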
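And as mentioned in the Metrics computation item, our metrics are PageRank-like. The toy power-iteration version below shows the underlying idea on a small in-memory graph; Babbar’s production computation runs continuously and at web scale, which this sketch does not attempt to reproduce.

```python
# Textbook PageRank by power iteration on a small in-memory link graph.
# Toy illustration only; all link targets are assumed to appear as keys.
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread its rank evenly
                for target in pages:
                    new_rank[target] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Example: pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```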

There is also some side work that has enabled our success and cost-effectiveness: compression, in particular, has been a very important part of our work. It concerns not only the index itself, where compression is applied to URLs and to the whole database, but also several aspects of embeddings and metrics computation, as well as the messaging system.

We will cover all these topics in subsequent articles.