The Influence of JavaScript Frameworks on Web Crawling

At Babbar, we crawl over 3 billion pages a day, and we sometimes encounter difficulties retrieving data correctly from certain pages. These issues are often linked to server-side protections or to specific JavaScript libraries that make pages harder to read. To optimize our approach, we’re looking for simple ways to identify problematic sites and handle them differently.

Of course, we strictly adhere to crawling rules. All our requests use a clearly identifiable user-agent, and our IPs are whitelisted and classified as “SEO bots” by many services, such as Cloudflare. If a site refuses this type of bot, we don’t persist. We also make sure to respect the rules defined in each site’s robots.txt file.
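
As a simple illustration, here is roughly what an identifiable request looks like (a minimal sketch assuming Apache HttpClient 4.x; the user-agent string below is a placeholder, not our real one):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class IdentifiableFetch {
        // Placeholder user-agent: the real string clearly identifies our crawler.
        private static final String USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)";

        public static String fetch(String url) throws Exception {
            try (CloseableHttpClient client = HttpClients.custom()
                    .setUserAgent(USER_AGENT) // every request is clearly identifiable
                    .build()) {
                HttpGet get = new HttpGet(url);
                try (CloseableHttpResponse response = client.execute(get)) {
                    return EntityUtils.toString(response.getEntity());
                }
            }
        }
    }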

We crawl these web pages to power our SEO tools, babbar.tech and yourtext.guru. Our main goal is to gather as much data as possible, similar to what Google does with Googlebot, to understand and analyze page content and navigate the links between pages. We aim to extract visible text as a web browser would. This becomes harder when content is generated with JavaScript, since it can no longer simply be extracted from the raw HTML source.

In addition to text, we also collect links (both internal and external) as well as meta tags, which are essential for SEO analysis and search engine indexing. This data allows us to better understand content structure and assess page quality.
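
As a rough sketch of this extraction step (assuming an HTML parser such as jsoup, which is not necessarily the one we use in production):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LinkAndMetaExtractor {
        public static void extract(String html, String pageUrl) {
            Document doc = Jsoup.parse(html, pageUrl);

            // Internal and external links, resolved against the page URL.
            for (Element a : doc.select("a[href]")) {
                String href = a.attr("abs:href");
                boolean internal = href.startsWith(pageUrl); // simplistic check, for illustration only
                System.out.println((internal ? "[internal] " : "[external] ") + href);
            }

            // Meta information commonly used for SEO analysis.
            System.out.println("title: " + doc.title());
            System.out.println("description: " + doc.select("meta[name=description]").attr("content"));
            System.out.println("robots: " + doc.select("meta[name=robots]").attr("content"));
        }
    }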

Testing

To understand the influence of JavaScript frameworks, we started by retrieving the HTML code of 50,000 URLs using two methods: Java code with Apache HttpClient, and Playwright. Playwright is a library that automates the use of web browsers and executes JavaScript, helping us simulate the full rendering of pages, including dynamic content. We then compared the sizes of the code retrieved, with and without JavaScript execution. To simplify the analysis, we focused only on URLs returning an HTTP 200 status code.
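
A simplified version of this double fetch might look as follows (a sketch assuming Apache HttpClient 4.x and Playwright for Java; error handling, timeouts, and browser reuse are omitted):

    import com.microsoft.playwright.Browser;
    import com.microsoft.playwright.Page;
    import com.microsoft.playwright.Playwright;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class SizeComparison {
        /** Raw HTML as returned by the server, without JavaScript execution. */
        static String fetchRaw(String url) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                return EntityUtils.toString(client.execute(new HttpGet(url)).getEntity());
            }
        }

        /** DOM serialized after the browser has executed JavaScript. */
        static String fetchRendered(String url) {
            try (Playwright playwright = Playwright.create()) {
                Browser browser = playwright.chromium().launch();
                Page page = browser.newPage();
                page.navigate(url);
                String html = page.content();
                browser.close();
                return html;
            }
        }

        public static void main(String[] args) throws Exception {
            String url = args[0];
            int raw = fetchRaw(url).length();
            int rendered = fetchRendered(url).length();
            System.out.printf("raw=%d rendered=%d increase=%.1f%%%n",
                    raw, rendered, 100.0 * (rendered - raw) / raw);
        }
    }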

The first surprise: even between two crawls using the same method, the size of the responses could vary significantly. These variations were sometimes due to errors, but in other cases, they were caused by the dynamic content of the page, which can change with each server load. To consolidate the results, we ran several identical crawls over a relatively short period.

When we sorted the results by absolute size difference, it was difficult to draw clear conclusions.

However, when we sorted by percentage difference, the results became much more interesting.

These results show that very few pages exhibit significant size differences, but they allow us to precisely target sites where JavaScript execution adds a large amount of information.
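
The percentage difference we sort on can be computed along these lines (a sketch; the Measurement record is just a hypothetical container for the two sizes measured per URL):

    import java.util.Comparator;
    import java.util.List;

    public class PercentDiff {
        // Hypothetical container for the two sizes measured for one URL.
        record Measurement(String url, long rawSize, long renderedSize) {
            double percentIncrease() {
                if (rawSize == 0) return Double.POSITIVE_INFINITY;
                return 100.0 * (renderedSize - rawSize) / rawSize;
            }
        }

        /** Pages gaining the most content from JavaScript execution come first. */
        static List<Measurement> sortByPercentIncrease(List<Measurement> measurements) {
            return measurements.stream()
                    .sorted(Comparator.comparingDouble(Measurement::percentIncrease).reversed())
                    .toList();
        }
    }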

Among the JavaScript scripts and frameworks most commonly encountered, the following stood out (a simple detection sketch follows the list):

  • Shopify – Shopify uses JavaScript to dynamically load elements such as products, customer reviews, and customization options, which can significantly increase page size.
  • Klaviyo – Klaviyo is a marketing platform that integrates with e-commerce sites, often adding JavaScript scripts for user tracking and pop-up displays, contributing to noticeable differences in loaded content.
  • ReactJS – ReactJS is a JavaScript framework used to build dynamic user interfaces. Its execution generates client-side content, making some information accessible only after JavaScript rendering.
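
The detection sketch mentioned above: a heuristic that looks for framework signatures in the raw HTML. The substrings are indicative only, not an exhaustive fingerprint list.

    import java.util.List;
    import java.util.Map;

    public class FrameworkDetector {
        // Indicative substrings only; real detection would use more robust fingerprints.
        private static final Map<String, List<String>> SIGNATURES = Map.of(
                "Shopify", List.of("cdn.shopify.com", "Shopify.theme"),
                "Klaviyo", List.of("klaviyo.com", "_learnq"),
                "ReactJS", List.of("react-dom", "data-reactroot"));

        /** Returns the names of the frameworks whose signature appears in the raw HTML. */
        static List<String> detect(String rawHtml) {
            return SIGNATURES.entrySet().stream()
                    .filter(e -> e.getValue().stream().anyMatch(rawHtml::contains))
                    .map(Map.Entry::getKey)
                    .toList();
        }
    }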

We also encountered several sites using an AES (Advanced Encryption Standard) library to protect page content. AES is an encryption algorithm used to secure data; on these sites, the content is delivered encrypted and cannot be interpreted without executing the JavaScript decryption logic, which poses a real challenge for crawling.

For pages where JavaScript execution increases the size by less than 50%, the difference mainly stems from cookie consent panels. These panels are often generated dynamically by JavaScript, so they are absent when JavaScript isn’t executed. This information, however, is of little value to us, as it contributes no relevant content to our analyses.

Another interesting case that occurs quite frequently involves pages returning an HTTP 200 code but with minimal content:

    <html><body><script>window.location.href="targetLink";</script></body></html>

In this situation, our bot reads the page but finds no usable information: such URLs should return a 301 or 302 redirect to the target URL, not a 200 status with a JavaScript redirect.
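
Such redirect stubs can be flagged with a simple pattern check on the raw HTML (a heuristic sketch; the size threshold and regular expression are deliberately loose):

    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class JsRedirectDetector {
        // Matches a client-side redirect such as window.location.href = "targetLink";
        private static final Pattern JS_REDIRECT =
                Pattern.compile("window\\.location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']");

        /** Returns the redirect target if the page is essentially just a JavaScript redirect. */
        static Optional<String> detectRedirect(String rawHtml) {
            // Only very small pages are candidates, so real content is not misclassified.
            if (rawHtml.length() > 1024) return Optional.empty();
            Matcher m = JS_REDIRECT.matcher(rawHtml);
            return m.find() ? Optional.of(m.group(1)) : Optional.empty();
        }
    }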

These tests highlight that, in some cases, executing JavaScript is crucial to fully capture the content of a page. Many sites rely on JavaScript to dynamically load key sections, such as customer reviews or product information, which are absent from the initial HTML code. However, we also observed that certain scripts, such as those generating cookie consent panels, inflate the rendered page without adding useful content. Likewise, analytics and tracking scripts often add noise to the data we collect without contributing to our objectives.

To ensure proper indexing by search engines, understanding the challenges posed by JavaScript execution is essential. From our perspective, crawling techniques must be adapted to ensure that the content visible to bots like Googlebot is accurately retrieved, providing the right information for our tools. An efficient crawler should navigate complex pages, even when JavaScript is involved, to extract relevant data and ensure a complete rendering of the content.

The next step will involve analyzing size differences after extracting useful text—content relevant to our SEO analyses—to better identify what is truly necessary. For instance, scripts related to ads or consent panels add noise without contributing to analytical goals.
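
A first approximation of that comparison could extract the visible text on both sides and measure the gain (a sketch assuming an HTML parser such as jsoup):

    import org.jsoup.Jsoup;

    public class UsefulTextComparison {
        /** Visible text only: markup and script contents are dropped by the parser. */
        static String visibleText(String html) {
            return Jsoup.parse(html).text();
        }

        /** Percentage of visible text gained by executing JavaScript. */
        static double textGainPercent(String rawHtml, String renderedHtml) {
            int rawLen = visibleText(rawHtml).length();
            int renderedLen = visibleText(renderedHtml).length();
            if (rawLen == 0) return renderedLen == 0 ? 0.0 : Double.POSITIVE_INFINITY;
            return 100.0 * (renderedLen - rawLen) / rawLen;
        }
    }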

It’s important to remember that executing JavaScript during crawling is resource-intensive and time-consuming, as it requires simulating full browser behavior, including page rendering. This significantly impacts overall crawling performance. Optimizing this process involves executing JavaScript only when absolutely necessary to reduce resource consumption and processing time. This requires identifying pages where essential content cannot be retrieved without JavaScript execution and targeting these cases specifically.
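
In practice, this can take the form of a simple gate applied before any browser rendering (a sketch that reuses the hypothetical helpers above; the threshold is illustrative, not a production value):

    import org.jsoup.Jsoup;

    public class RenderDecision {
        /**
         * Decides whether a page is worth re-fetching with a headless browser.
         * FrameworkDetector and JsRedirectDetector are the sketches shown earlier.
         */
        static boolean needsJavaScriptRendering(String rawHtml) {
            // JavaScript-only redirect stubs are followed as redirects, not rendered.
            if (JsRedirectDetector.detectRedirect(rawHtml).isPresent()) return false;

            // Almost no visible text in the raw HTML: content is probably built client-side.
            if (Jsoup.parse(rawHtml).text().length() < 500) return true;

            // A known client-side framework was detected in the raw HTML.
            return !FrameworkDetector.detect(rawHtml).isEmpty();
        }
    }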

FAQ

1. Why is crawling pages with JavaScript more complex?
Crawling pages with JavaScript is more complex because it requires simulating the behavior of a full web browser. Many pages load their content dynamically using JavaScript, meaning that simply accessing the HTML code is not enough to extract all the data. The resource and time costs are significant, which pushes us to optimize crawling by executing JavaScript only when necessary.

2. What impact do server-side protections have on crawling?
Server-side protections, such as IP restrictions or bot-type checks, can limit crawler access. Some sites use these measures to prevent scraping. To avoid these limitations, it’s essential to follow the guidelines in the robots.txt file and use an identifiable user-agent. This demonstrates the crawler’s transparency and helps prevent blocking.

3. Why is it essential to extract meta tags and links?
Meta tags and links are crucial for SEO because they provide information that search engines use to understand and rank pages. Meta tags, like descriptions, help improve visibility in search results, while internal and external links help build a clear and coherent site architecture, facilitating crawler exploration.

4. Why is crawl optimization important for pages with unnecessary scripts?
Pages with unnecessary scripts, such as ad trackers or cookie consent panels, can introduce significant noise, making it harder to extract useful data. These scripts can increase page size without adding value to the analyzed content. Crawl optimization involves filtering or ignoring these unnecessary elements to enhance crawling efficiency and ensure that only essential information is collected for SEO analysis.
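
One way to reduce that noise is to strip scripts loaded from known tracker and consent providers before analysis (a sketch assuming jsoup; the domain list is illustrative only):

    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class NoiseFilter {
        // Illustrative list of script hosts that add noise without useful content.
        private static final List<String> NOISY_HOSTS = List.of(
                "googletagmanager.com", "google-analytics.com",
                "cookiebot.com", "onetrust.com", "doubleclick.net");

        /** Removes script tags loaded from known tracker or consent providers. */
        static Document stripNoise(String html) {
            Document doc = Jsoup.parse(html);
            for (Element script : doc.select("script[src]")) {
                String src = script.attr("src");
                if (NOISY_HOSTS.stream().anyMatch(src::contains)) {
                    script.remove();
                }
            }
            return doc;
        }
    }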

5. I use a lot of JavaScript on my site. Will it still be well indexed by Google?
Extensive use of JavaScript can complicate the crawling process for search engines like Google. Googlebot can execute JavaScript, but doing so takes more time and consumes more resources. If critical site content relies heavily on JavaScript, it’s essential to ensure that the JavaScript is well optimized and that important content is quickly accessible. By implementing server-side rendering or dynamic rendering techniques, you can improve how your site is crawled and indexed despite heavy JavaScript usage.