Data compression for a large-scale web crawler

As the volume of generated data grows exponentially, web crawling systems play a crucial role in the field of SEO (Search Engine Optimization). These systems, designed to explore and analyze web pages, must ingest, process, and store vast amounts of data to deliver actionable and relevant insights to our clients.

Babbar’s technical architecture, optimized for SEO, handles diverse information: metadata (often textual), graph metrics to synthesize the Babbar Authority Score (BAS), vector representations essential for calculating Semantic Value, and aggregated information defined by our domain experts to determine (or predict!) Link Induced Strength.

These elements are stored in a distributed database system designed to meet the demands of real-time processing and continuous analysis. The primary technical challenge lies in the sheer volume of data: a system of this scale can ingest and analyze tens of thousands of pages and millions of links every second.

Optimizing the processing, storage, and transmission of this data is therefore essential to ensure scalability while maintaining high performance at a sustainable cost.
In this article, I will discuss the challenges and the solutions adopted at Babbar to efficiently compress data in a demanding context.

Data and Their Complexity

To understand the scope of the challenges, it is essential to examine the structure of the processed data. A web page contains its URL, metadata, graph metrics, a vector representation (embedding) of its content, and a list of outgoing links (forelinks) that may be internal or external to the site. Each outgoing link is described by its target URL, a textual anchor, and metadata. These outgoing links become incoming links (backlinks) from the target’s perspective and must carry representative information from the source to support complex calculations on dynamically evolving data.
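
To make this concrete, the sketch below models such a page record; the field names and types are illustrative assumptions, not Babbar's actual schema.

```java
import java.util.List;

// Illustrative model of the per-page record described above.
// Field names and types are assumptions, not Babbar's actual schema.
record Link(String targetUrl, String anchorText, byte[] metadata) {}

record PageRecord(
        String url,
        byte[] metadata,        // textual metadata (title, headings, ...)
        double[] graphMetrics,  // inputs to scores such as the Babbar Authority Score
        float[] embedding,      // semantic vector of the page content
        List<Link> forelinks    // outgoing links, internal or external to the site
) {}
```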

URLs, metrics, and semantic vectors, as central elements of this architecture, must be managed for each page, forelink, and backlink. This level of granularity, while essential for high-quality analyses, generates a substantial volume of data, amplified by the need to distribute this information via links between pages.

The Need for Compression

Data compression in such a system serves a dual purpose: reducing the size of stored data and limiting network transfer volumes. The primary objective is to enable the system to scale effectively while maintaining high performance and manageable storage costs.

URLs, semantic vectors, and graph metrics each present specific challenges. URLs, due to their sheer quantity, must be compressed to minimize memory footprint and optimize transmission. Semantic vectors, though crucial for content analysis, are space-intensive, requiring precision reduction without compromising relevance. Finally, graph metrics need to be compressed while preserving the accuracy necessary for reliable calculations.

Technical Solutions

To address the challenges posed by the volume and heterogeneity of the data, several compression techniques have been implemented, each tailored to the nature of the data in question.

URLs

URLs are ubiquitous but occupy a significant amount of memory. To compress them effectively, we developed a Huffman-based compression algorithm specifically adapted to web address patterns. Huffman coding assigns shorter codes to frequent elements and longer codes to rare ones, based on their statistical distribution.

For URLs, optimizing Huffman codes involves analyzing a large set of URLs to identify the most common patterns. This approach reduces the average size of a URL by approximately a factor of three. Additionally, redundant information in internal links is eliminated by storing only differential parts of URLs within the same site.
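
As an illustration of the principle, the sketch below builds a character-level Huffman code from frequencies measured over a URL corpus. The real coder presumably operates on richer URL patterns; the class and method names here are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Character-level Huffman coder: frequent URL characters ('/', '.', 'e', ...)
// receive short bit codes, rare ones receive longer codes. This is a simplified
// illustration of the principle, not Babbar's actual pattern-based coder.
public class UrlHuffman {

    private static final class Node implements Comparable<Node> {
        final long freq;
        final char symbol;      // meaningful only for leaves
        final Node left, right;

        Node(char symbol, long freq, Node left, Node right) {
            this.symbol = symbol; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        @Override public int compareTo(Node other) { return Long.compare(freq, other.freq); }
    }

    private final Map<Character, String> codes = new HashMap<>();

    // Build the code table from character frequencies measured on a URL corpus.
    public UrlHuffman(Map<Character, Long> frequencies) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        frequencies.forEach((c, f) -> queue.add(new Node(c, f, null, null)));
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node('\0', a.freq + b.freq, a, b));
        }
        assignCodes(queue.poll(), "");
    }

    private void assignCodes(Node node, String prefix) {
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
        } else {
            assignCodes(node.left, prefix + '0');
            assignCodes(node.right, prefix + '1');
        }
    }

    // Encode a URL as a bit string; a production coder would pack the bits
    // into bytes. Assumes every character of the URL was seen in the corpus.
    public String encode(String url) {
        StringBuilder bits = new StringBuilder();
        for (char c : url.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }
}
```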

Semantic Vectors

Semantic vectors represent textual content as coordinates in a multidimensional space. These representations are essential for analyzing context (Semantic Focus) and relationships between web pages (Semantic Value), but they consume a significant amount of disk space.

To reduce their size, a quantization technique has been applied, reducing the number of bits needed to represent each dimension of a vector. Two precision levels were defined: a standard precision for the pages themselves, ensuring maximum analysis quality, and a reduced precision for the vectors propagated along outgoing links and stored with backlinks, where an approximate representation suffices. This approach can reduce the representation from 32 bits per dimension to 8 bits or less, cutting storage requirements while maintaining acceptable accuracy for semantic calculations.
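
A minimal sketch of this idea, assuming a simple linear scheme with one scale factor per vector (the exact quantization scheme and bit widths used in production are not detailed here):

```java
// Linear 8-bit quantization of an embedding: each float coordinate is mapped
// to a signed byte using one scale factor per vector. The scale factor is
// stored alongside the bytes so the vector can be approximately reconstructed.
public record QuantizedVector(byte[] values, float scale) {

    public static QuantizedVector quantize(float[] vector) {
        float maxAbs = 1e-12f;                        // avoid division by zero for null vectors
        for (float x : vector) maxAbs = Math.max(maxAbs, Math.abs(x));
        float scale = 127f / maxAbs;                  // map [-maxAbs, maxAbs] onto [-127, 127]
        byte[] quantized = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            quantized[i] = (byte) Math.round(vector[i] * scale);
        }
        return new QuantizedVector(quantized, scale);
    }

    public float[] dequantize() {
        float[] vector = new float[values.length];
        for (int i = 0; i < values.length; i++) {
            vector[i] = values[i] / scale;
        }
        return vector;
    }
}
```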

Graph Metrics

Graph metrics, such as the Babbar Authority Score, play a key role in assessing the relative importance of websites. These metrics must be regularly updated and shared between system nodes.

To retain sufficient precision while reducing size, we adopted a logarithmic linearization approach. This transformation maps highly dispersed values (like graph metrics) onto a much narrower range while minimizing precision loss for small values, which are often critical for calculations. The smallest values additionally receive special-case handling to further improve precision, and the transformed values are then quantized for additional storage savings.
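
The sketch below illustrates the principle on a non-negative metric; the bounds, the 16-bit target, and the use of log1p/expm1 to preserve precision near zero are assumptions chosen for the example, not Babbar's actual parameters.

```java
// Sketch of the "logarithmic transform + quantization" idea for widely
// dispersed, non-negative metrics. MAX_LOG and the 16-bit target are
// illustrative assumptions, not Babbar's actual parameters.
public final class MetricCodec {
    private static final double MAX_LOG = 20.0;  // assumed upper bound of log1p(metric)
    private static final int LEVELS = 65_535;    // 16-bit quantization grid

    // Compress: the log transform squeezes the huge dynamic range,
    // log1p keeps good precision for the smallest values, then quantize.
    public static int encode(double metric) {
        double logValue = Math.log1p(metric);
        double normalized = Math.min(logValue / MAX_LOG, 1.0);
        return (int) Math.round(normalized * LEVELS);
    }

    // Decompress: invert the quantization and the log transform.
    public static double decode(int code) {
        double normalized = (double) code / LEVELS;
        return Math.expm1(normalized * MAX_LOG);
    }
}
```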

Manual vs. Automatic Serialization

Data serialization is vital in a large-scale web crawling system, enabling the conversion of complex structures into simple formats suitable for storage or transmission. In most cases, Protocol Buffers (Protobuf) was preferred. This efficient binary format offers versatility for handling various data types, scalability for seamless schema updates, and improved productivity through automatic code generation.

However, for specific cases, such as primary key management or advanced compression optimization, manual serialization was implemented. Although more complex, this approach enhances performance and reduces the size of critical data by precisely adjusting their structure and encoding. This ensures fine-tuned optimization tailored to each data type and business need.
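
As an example of what manual serialization makes possible, here is a hypothetical fixed-width key layout whose byte-wise ordering keeps all URLs of a site contiguous; the actual key structure used at Babbar is not described in this article.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-width primary key: a big-endian layout so that keys
// sharing the same host hash stay contiguous under byte-wise (lexicographic)
// comparison in the key-value store. Not Babbar's actual key format.
public final class UrlKey {
    public static byte[] encode(long hostHash, long urlHash) {
        return ByteBuffer.allocate(16)
                .order(ByteOrder.BIG_ENDIAN)   // most significant bytes first
                .putLong(hostHash)             // group by host first...
                .putLong(urlHash)              // ...then by URL within the host
                .array();
    }
}
```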

Compression Algorithms

Data is then compressed using widely adopted general-purpose algorithms. Several were tested, including Snappy, GZip, LZ4, and ZSTD. After extensive benchmarking, ZSTD was selected for its excellent balance between compression speed and ratio. Snappy offered high speed but insufficient compression. LZ4 provided a better compression ratio while remaining fast, but ZSTD outperformed it in compression efficiency with only a slight trade-off in speed, making it the best fit for our needs. GZip, finally, offered no advantage over ZSTD in either speed or ratio and was unsuitable for this type of workload.
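
For illustration, block compression with ZSTD through the zstd-jni bindings can look like the sketch below; the compression level shown is an assumption and would in practice be tuned against the speed/ratio trade-off described above.

```java
import com.github.luben.zstd.Zstd;

// Block compression with ZSTD via the zstd-jni bindings. The level used here
// is an assumption; in practice it is tuned for the desired balance between
// compression speed and ratio.
public final class BlockCompressor {
    private static final int LEVEL = 3;  // ZSTD's default level: fast, with a good ratio

    public static byte[] compress(byte[] serialized) {
        return Zstd.compress(serialized, LEVEL);
    }

    // The caller supplies the original (uncompressed) size so the output
    // buffer can be allocated up front.
    public static byte[] decompress(byte[] compressed, int originalLength) {
        return Zstd.decompress(compressed, originalLength);
    }
}
```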

It’s worth noting that the domain-specific techniques described earlier significantly improve the effectiveness of these general-purpose algorithms: by reducing the variability and complexity of the data, they lower its intrinsic entropy and make it much easier to compress.

Optimized Databases

All this data is ingested into RocksDB, a highly performant key-value store that deserves a dedicated article of its own. One critical decision concerned the compaction mode: Level vs. Universal. While Level compaction is generally recommended for disk space efficiency, we opted for Universal compaction. This choice was driven by two major advantages: a more predictable CPU load, since parallel compactions are avoided, and a significant reduction in write amplification, which greatly extends SSD lifespan. These characteristics are particularly well suited to our high-write-frequency, CPU-constrained environment.
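
A minimal RocksJava configuration reflecting these choices might look as follows; only the options discussed here are shown, and a real deployment involves many more tuning parameters.

```java
import org.rocksdb.CompactionStyle;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// RocksJava configuration sketch reflecting the choices discussed above:
// Universal compaction and ZSTD block compression. All other tuning
// parameters are omitted and depend on the workload.
public final class StoreConfig {
    public static RocksDB open(String path) throws RocksDBException {
        RocksDB.loadLibrary();
        Options options = new Options()
                .setCreateIfMissing(true)
                .setCompactionStyle(CompactionStyle.UNIVERSAL)          // fewer concurrent compactions, lower write amplification
                .setCompressionType(CompressionType.ZSTD_COMPRESSION);  // ZSTD for stored blocks
        return RocksDB.open(options, path);
    }
}
```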

In a large-scale web crawling and analysis system, data compression is more than a technical optimization—it is a cornerstone of viability. The technical solutions presented, from advanced compression algorithms to domain-specific architectural choices, enable the processing of massive volumes while maintaining cost control.

These techniques and innovations, embedded at the heart of Babbar’s architecture, reflect the team’s commitment to delivering a robust and high-performance infrastructure. They demonstrate the importance of a tailored approach, where technical optimization directly translates into client benefits, enhancing their competitiveness and ensuring the sustainability of their SEO strategies.