The index behind Babbar

Every day at Babbar, we crawl billions of pages that need to be analyzed and stored. At the same time, we compute metrics for over 250 billion pages that must be continuously updated (out of a total of more than 1.5 trillion pages) and propagated through a graph of nearly 15 trillion links.

As a reminder, when it was first launched, Google’s entire index contained less than 30 million pages, which represents just 10 seconds of processing on our infrastructure. Today, we will unveil the technologies that allow us to index and manipulate such vast amounts of data.

The Babbar Index

An index is essentially a large distributed database, structured in a way that allows fast access to information for our clients, whether it concerns pages, links, or websites. In fact, there are multiple indexes. This means there are at least two key aspects: data distribution and storage.

In reality, there is a third and crucial aspect: we do not just store pages or links; we must also continuously compute and update relevant metrics derived from the web graph formed by this data. This means we need to insert and update massive amounts of data, whereas most databases are optimized for read access.

Our initial architecture experiments for the index considered “all-in-one” solutions like Apache Ignite, but despite numerous attempts, the performance never reached our expectations. We therefore pivoted towards a custom-built solution.

We will now explore the different components that make up Babbar’s index: partitioning, communication via Apache Pulsar, storage with RocksDB, and continuous updating.

Partitioning

The fundamental principle of data distribution is simple: we partition (shard) data based on a wisely chosen key so that partitions remain of similar size while preserving relational information within each partition.

For example, we can define a rule where, for a given URL, partitioning is done based on the URL’s Host. This ensures that Hosts are distributed across different partitions, but all URLs from the same Host remain grouped in the same partition.
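Such a rule can be sketched as a stable hash of the URL's host, modulo the number of partitions. This is a minimal illustration only: the partition count, hash function, and function names below are assumptions, not Babbar's actual scheme.

```python
# Illustrative host-based partitioning (names and constants are hypothetical).
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 128  # assumed cluster-wide constant, agreed on by every node

def partition_for(url: str) -> int:
    """All URLs sharing a Host land in the same partition."""
    host = urlparse(url).netloc.lower()
    # Use a stable hash (not Python's randomized hash()) so all nodes agree.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Two URLs of the same Host always map to the same partition:
assert partition_for("https://www.example.com/page/1") == \
       partition_for("https://www.example.com/page/2")
```

The key property is determinism: any machine in the cluster can compute the partition of a URL locally, without coordination.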

It is also possible to use different partitioning schemes for the same data depending on the context (e.g., indexing vs. crawling), as long as all components agree on the partitioning schema.

Distribution – Apache Pulsar

Apache Pulsar is an open-source framework, initially developed by Yahoo, that provides a messaging solution organized by topics within namespaces using a publish-subscribe model. It fundamentally enables asynchronous communication between nodes connected to the same Pulsar cluster.

For our use case, topic names within their namespace are simply numbers corresponding to partition identifiers defined by our schema. For example, babbar.tech may belong to partition 42, while yourtext.guru may belong to partition 69. A consumer can subscribe to receive only the fetches (namespace) of partition 42 (topic), while another may only be interested in crawl requests (namespace) from partition 69 (topic).
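The mapping from a Host to a fully qualified Pulsar topic can then be a pure string computation. The tenant and namespace layout below is an assumed illustration (Pulsar topics follow the `persistent://<tenant>/<namespace>/<topic>` convention), not Babbar's real configuration.

```python
# Hypothetical topic naming: the partition id serves as the topic name
# inside a namespace. Tenant/namespace strings are illustrative only.
import hashlib

NUM_PARTITIONS = 128  # assumed

def partition_for_host(host: str) -> int:
    digest = hashlib.sha1(host.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def topic(namespace: str, host: str) -> str:
    # Pulsar convention: persistent://<tenant>/<namespace>/<topic>
    return f"persistent://babbar/{namespace}/{partition_for_host(host)}"

# A consumer interested only in fetches for babbar.tech's partition
# would subscribe to this single topic:
fetch_topic = topic("fetches", "babbar.tech")
```

Because producers and consumers derive the topic from the same function, routing a message to the right partition needs no lookup table.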

A Flexible Framework

Pulsar offers several useful features, including:

  • Message compression – When dealing with large amounts of data and/or large messages, choosing and enabling a compression mechanism becomes crucial.
  • Batch processing – When messages are small but numerous, batch processing improves compression efficiency and optimizes sending and receiving performance.
  • Multiple subscriptions – Multiple consumers can subscribe to the same topic, either receiving duplicate messages or distributing messages among them.
  • Failover support – In a distributed environment, having an alternative recipient in case the primary consumer becomes unavailable is useful.
  • Persistence – Ensuring that no messages are lost via an acknowledgment and redistribution mechanism in case of issues.
  • Reactive vs. proactive – Each consumer can decide whether to process messages via event-based (push) or request-based (pull) mechanisms.

Pulsar thus enables horizontal scalability. Need more crawlers? Simply connect additional machines to the cluster and update the crawl request consumers’ configuration!

URL Processing Workflow

The typical journey of a crawl request and its resulting fetch can be summarized as follows:

  1. The crawl policy places a crawl request in the crawl requests namespace.
  2. A crawler consumes the crawl request and fetches the webpage.
  3. After processing, the crawler places the fetch result in the fetches namespace, in the topic corresponding to the Host as defined by the partitioning schema.
  4. The index consumes and stores the fetch while analyzing outbound links and placing messages in the links namespace topics based on their target Host.
  5. Indexes consume and store links from a specific topic corresponding to the partition they manage.
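The five steps above can be sketched with in-memory queues standing in for Pulsar topics. Everything here (the toy hash, the message shapes) is an assumption made for illustration; the real system uses Pulsar messages, compression, and batching.

```python
# In-memory sketch of the crawl -> fetch -> index pipeline.
from collections import defaultdict

NUM_PARTITIONS = 4  # tiny, for illustration only

def partition(host: str) -> int:
    return sum(host.encode()) % NUM_PARTITIONS  # toy stable hash

# namespace -> topic (partition id) -> list of pending messages
topics = defaultdict(lambda: defaultdict(list))

# 1. The crawl policy emits a crawl request, routed by Host.
url, host = "https://babbar.tech/blog", "babbar.tech"
topics["crawl-requests"][partition(host)].append(url)

# 2-3. A crawler consumes the request, fetches the page, publishes the result.
req = topics["crawl-requests"][partition(host)].pop(0)
fetch = {"url": req, "outlinks": ["https://yourtext.guru/"]}
topics["fetches"][partition(host)].append(fetch)

# 4. The index consumes the fetch and routes each outbound link
#    to the links topic of the *target* Host's partition.
doc = topics["fetches"][partition(host)].pop(0)
for link in doc["outlinks"]:
    target_host = link.split("/")[2]
    topics["links"][partition(target_host)].append((doc["url"], link))

# 5. The index owning that partition consumes and stores the link.
stored = topics["links"][partition("yourtext.guru")].pop(0)
```

Note that the link message changes partition between steps 4 and 5: it is routed by the *target* Host, so it lands on whichever machine owns that Host's data.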

Storage – RocksDB

RocksDB is a high-performance, open-source key-value database. It originates from a fork of LevelDB (developed by Google) and has been optimized for multi-core processors and flash storage (SSD) by Meta (formerly Facebook), which also maintains it.

Concept

RocksDB (like LevelDB) relies on a specific data structure: multi-level Log-Structured Merge Trees (LSM trees). Each level consolidates the previous one, with the most recent information stored in the first levels. The levels are divided into SST (Static Sorted Table) files containing operations sorted by the key of the entry they act upon.

The only supported operations are put, delete, and merge. In this system, a delete does not actually remove anything; instead, it adds an operation to the most recent level. The consolidation process reduces operations across levels: a recent put or delete for an entry replaces older operations on the same entry. When a level exceeds a certain size relative to the next level, a compaction occurs, regenerating an existing level by merging all the levels above it. Compacting the first two levels, for example, merges their operations into a single consolidated level.

A similar reduction must also be performed on the fly during a read operation: all levels are traversed, from the most recent to the oldest, reducing the operations concerning the searched entry to obtain its final state. A get operation can therefore have a non-negligible cost.
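Both mechanisms can be captured in a toy model where each level maps a key to its most recent operation. This drastically simplifies real LSM trees (no SST files, no size ratios, no tombstone dropping), but shows why a read and a compaction must give the same answer.

```python
# Toy LSM levels: each level maps key -> ("put", value) or ("delete",).
# levels[0] is the newest level. Reads reduce operations newest-to-oldest;
# compaction folds level 0 into level 1. Simplified sketch only.

def get(levels, key):
    for level in levels:  # newest first
        if key in level:
            op = level[key]
            return op[1] if op[0] == "put" else None  # a delete hides older ops
    return None  # never written

def compact(levels):
    """Merge level 0 into level 1: newer operations win."""
    merged = dict(levels[1])
    merged.update(levels[0])  # level-0 entries override older ones
    return [merged] + levels[2:]

levels = [
    {"a": ("delete",), "b": ("put", 2)},   # newest level
    {"a": ("put", 1), "c": ("put", 3)},    # older level
]
assert get(levels, "a") is None   # the delete shadows the older put
assert get(levels, "c") == 3
levels = compact(levels)
assert get(levels, "a") is None   # same answers after compaction
```

The cost noted above is visible here: a get may scan every level before concluding, which is why LSM stores favor writes over reads.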

Merge Operators

You might have noticed that there is no direct modification operation in this system. For example, if we want to update a URL’s last-seen date without altering its previously computed metrics, we would need to read the entry, modify the date, and then write back the updated information. Given the cost of a read operation, this does not scale. This is where merge operations come into play.

A merge operation allows combining the current and previous states of an entry to produce an updated state. The user must define the merge operation via a function that is called whenever needed. By correctly structuring the stored information and defining an appropriate merge operator, data updates can be deferred to the consolidation stage.
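A minimal sketch of the idea, assuming a hypothetical record layout where the stored value is a map of fields and a merge operand carries only the fields to change:

```python
# Deferred update via a user-defined merge operator (illustrative layout).

def merge_operator(base, operand):
    """Fold a partial update into the existing record (base may be absent)."""
    merged = dict(base or {})
    merged.update(operand)
    return merged

# Instead of read-modify-write, we enqueue a cheap merge operand:
base = {"url": "https://babbar.tech/", "last_seen": "2023-01-01", "score": 0.7}
operand = {"last_seen": "2024-06-01"}   # only the field being updated

# The store applies the operator lazily, during reads or compaction:
updated = merge_operator(base, operand)
assert updated["last_seen"] == "2024-06-01"
assert updated["score"] == 0.7          # previously computed metric preserved
```

Writing the operand costs the same as a put; the expensive read is pushed to compaction time, where entries are being scanned anyway.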

However, a technical challenge arises: during consolidation involving a merge operation, the database itself must invoke the user-provided function. Since RocksDB is implemented in C++ and our applications are in Java, and since RocksDB’s API does not support defining custom operators in Java, we had to modify RocksDB’s code to allow Java-based merge operators. This was just the first step, as the overhead of C++ to Java calls is significant, and deep compactions could require millions of calls.

Implementing an optimized version of merge operator support was a true game-changer in our architecture, enabling efficient data updates. We also added support for compaction filters. As RocksDB is open-source, you can find our contribution on GitHub!

Continuous Updates

Once data ingestion and updates were solved, we needed a way to compute web graph metrics continuously and to automate other maintenance tasks.

This is when the different pieces of the puzzle need to fit together. We have hundreds of databases distributed across dozens of machines, and we need to make sense of all this! It’s the choice of indexes, key formats, and partitioning schemes that guarantee the co-location of information when we need it.

Thus, we can define a synchronized traversal of several indexes present on the same disk of the same machine in order to gather all the information concerning a single website and compute its metrics. Thanks to the partitioning scheme and Pulsar, we can then propagate these metrics along the site’s outgoing links, following a variant of the reasonable surfer model. Finally, upon reception, a custom merge operator updates the metrics attached to incoming links while preserving the other information coming from the crawlers.
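Because co-located indexes are sorted on compatible keys, the synchronized traversal amounts to a merge join over sorted streams. The sketch below is an assumption about how such a scan could look, using plain sorted lists in place of RocksDB iterators:

```python
# Synchronized traversal of two sorted stores keyed identically
# (e.g. pages and incoming links of the same partition). Illustrative only.

def synchronized_scan(pages, links):
    """Both inputs are sorted lists of (key, value).
    Yields (key, page, incoming_links) with everything about one key."""
    i, j = 0, 0
    while i < len(pages):
        key, page = pages[i]
        incoming = []
        while j < len(links) and links[j][0] <= key:
            if links[j][0] == key:
                incoming.append(links[j][1])
            j += 1
        yield key, page, incoming
        i += 1

pages = [("a.com", "page-a"), ("b.com", "page-b")]
links = [("a.com", "x.com->a.com"), ("a.com", "y.com->a.com"),
         ("b.com", "z.com->b.com")]
result = list(synchronized_scan(pages, links))
```

Each site is seen exactly once, with all its data in hand, in a single sequential pass over both stores: exactly what metric computation needs.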

This same synchronized traversal, combined with merge operations, is what allows us to maintain the web graph in real time, taking into account updates to web pages and the outgoing links they contain. We need to be able to compute the difference between two successive versions of the same page in order to create, delete, or update backlinks located on another partition, and thus on another machine.
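The diff itself reduces to set operations on the two versions' outlinks; each resulting operation is then routed to the target Host's partition. A minimal sketch, with hypothetical message shapes:

```python
# Diffing two versions of a page's outlinks into backlink operations
# destined for the partitions owning the target Hosts. Illustrative only.

def link_diff(old_outlinks, new_outlinks):
    old, new = set(old_outlinks), set(new_outlinks)
    return {
        "create": sorted(new - old),   # backlinks to add on the target side
        "delete": sorted(old - new),   # backlinks to remove on the target side
    }

v1 = ["https://a.com/", "https://b.com/"]   # previous version of the page
v2 = ["https://b.com/", "https://c.com/"]   # freshly crawled version
ops = link_diff(v1, v2)
assert ops == {"create": ["https://c.com/"], "delete": ["https://a.com/"]}
```

Only the changed links generate traffic between partitions; unchanged links cost nothing, which matters at the scale of trillions of edges.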

Finally, a garbage collector is needed to keep the database healthy: deleting a web page from the index also requires deleting its backlinks.

The Foundations of Babbar

Building our own indexing solution laid the foundation for Babbar’s architecture. These technical innovations allow us to maintain a large-scale, real-time snapshot of the web at manageable costs.

With this foundation, we can compute and provide representative metrics like the Babbar Authority Score, Semantic Value, and Induced Strength. And we have more innovations in the pipeline – stay tuned!