Following up on the article “Influence of JavaScript Frameworks on Web Crawling”, we are continuously working on improving our crawler so that it better handles pages that display dynamically generated information.
However, crawling with JavaScript execution costs significantly more than crawling without it, so it is not feasible to execute JavaScript on every page. It is crucial to carefully target the URLs submitted to the JS execution module.
Our goal is to create a qualified database of URLs for which we know whether JavaScript execution is necessary to retrieve data.
URL Selection Steps
To define a list of URLs to qualify, we followed several steps.
Initial URL Filtering
We started with a list of 2,800,000 URLs. Unfortunately, as this data was somewhat outdated, many of them were still served over plain HTTP. By keeping only HTTPS URLs, we reduced the list to 300,000 URLs.
Elimination of Irrelevant URLs
A significant proportion of the URLs were homepage links, which are not representative of our overall crawl. We separated homepages from the other URLs so we could control their ratio in the final selection. We also excluded URLs from platforms such as YouTube, Twitter, and Facebook, which we do not wish to crawl.
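As an illustration, here is a minimal Java sketch of these two exclusion rules. The host list and the way homepages are detected (empty or “/” path) are assumptions for the example, not the exact implementation used in our tooling.

```java
import java.net.URI;
import java.util.Set;

public class UrlExclusion {
    // Illustrative list of platforms we do not wish to crawl (assumption).
    private static final Set<String> EXCLUDED_HOSTS =
            Set.of("youtube.com", "twitter.com", "facebook.com");

    // A URL is treated as a homepage when its path is empty or just "/".
    public static boolean isHomepage(String url) {
        String path = URI.create(url).getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    // Drop URLs whose host matches one of the excluded platforms.
    public static boolean isExcludedPlatform(String url) {
        String host = URI.create(url).getHost();
        if (host == null) return false;
        String bare = host.startsWith("www.") ? host.substring(4) : host;
        return EXCLUDED_HOSTS.contains(bare);
    }
}
```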
Diversification and Domain Limitation
After these filters, we obtained 100,000 non-home URLs. To ensure maximum diversity in the explored sites, we limited the selection to 3 URLs per domain. This constraint reduced our list to 38,747 URLs.
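A minimal sketch of this per-domain cap, assuming we simply group by host name (the real selection may instead rely on the registrable domain):

```java
import java.net.URI;
import java.util.*;

public class DomainLimiter {
    // Keep at most maxPerDomain URLs per host, preserving input order.
    public static List<String> limitPerDomain(List<String> urls, int maxPerDomain) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip malformed URLs
            }
            if (host == null) continue;
            int seen = counts.getOrDefault(host, 0);
            if (seen < maxPerDomain) {
                counts.put(host, seen + 1);
                selected.add(url);
            }
        }
        return selected;
    }
}
```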
Page Existence Verification
Since our initial list was somewhat old, we tested its URLs to retain only those returning an HTTP 200 code on a GET request.
So far, we have tested 13,208 URLs.
Among them, 10,086 correctly respond with a 200 code.
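As an illustration, such a liveness check can be done with the Apache HttpClient Java library, which we also rely on later in the pipeline. This sketch assumes HttpClient 4.x and treats any exception or non-200 status as a failure; the article only states that a GET request is issued.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class StatusChecker {
    // Return true only when a GET on the URL comes back with HTTP 200.
    public static boolean respondsWith200(String url) {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return response.getStatusLine().getStatusCode() == 200;
        } catch (Exception e) {
            return false; // unreachable, timeout, malformed URL, etc.
        }
    }
}
```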
Now that we have a relatively large group of URLs, the goal is to create a smaller first subset to test manually and assess the relevance of the results.
The idea is to present the tester with two versions of the same page:
- One with JavaScript execution
- One without JavaScript execution
This will generate the following type of screen for the tester:

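The internals of the screenshot tool are not shown here; as a hedged sketch, the two renderings could be produced with Playwright (which we use later in the pipeline), once with JavaScript enabled and once with it disabled. The output file names and the use of full-page screenshots are illustrative choices.

```java
import com.microsoft.playwright.*;
import java.nio.file.Paths;

public class PageCapture {
    public static void captureBothVersions(String url) {
        try (Playwright playwright = Playwright.create()) {
            Browser browser = playwright.chromium().launch();

            // Version 1: JavaScript executed (default behaviour).
            try (BrowserContext withJs = browser.newContext()) {
                Page page = withJs.newPage();
                page.navigate(url);
                page.screenshot(new Page.ScreenshotOptions()
                        .setPath(Paths.get("with_js.png")).setFullPage(true));
            }

            // Version 2: JavaScript disabled in the browser context.
            try (BrowserContext withoutJs = browser.newContext(
                    new Browser.NewContextOptions().setJavaScriptEnabled(false))) {
                Page page = withoutJs.newPage();
                page.navigate(url);
                page.screenshot(new Page.ScreenshotOptions()
                        .setPath(Paths.get("without_js.png")).setFullPage(true));
            }
        }
    }
}
```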
At Babbar, we have already developed manual classification tools and observed an important phenomenon: when the tester encounters a very high similarity rate between two choices (for example, if 90% of pages are identical with and without JavaScript execution), their error rate increases. In other words, they risk missing some differences. After clicking the same button ten times in a row, fatigue sets in, and attention diminishes.
Creating a Dataset with More URLs Requiring JavaScript
To avoid this, we sought to create a dataset with a high proportion of URLs requiring JavaScript. We implemented a broad initial automated test to detect as many relevant pages as possible.
The choice was made to select pages that meet two criteria:
- DOM parsing with Boilerpipe returns an almost empty text.
- A JavaScript framework family is detected in the DOM. For this detection, we use the Jappalyzer tool.
With this first filter, we obtained 806 URLs out of the 10,086 tested.
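Here is an illustrative sketch of the two criteria above, assuming Boilerpipe's ArticleExtractor. The “almost empty” threshold is an assumption, and the framework detection is only a placeholder standing in for the Jappalyzer-based check, whose API is not detailed in the article.

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class JsCandidateFilter {
    // Threshold for "almost empty" extracted text (illustrative assumption).
    private static final int MIN_TEXT_LENGTH = 50;

    // A page is a candidate when Boilerpipe yields almost no text
    // AND a JavaScript framework is detected in the raw DOM.
    public static boolean needsJsCandidate(String rawHtml) throws Exception {
        String text = ArticleExtractor.INSTANCE.getText(rawHtml);
        boolean almostEmpty = text.strip().length() < MIN_TEXT_LENGTH;
        return almostEmpty && detectsJsFramework(rawHtml);
    }

    // Hypothetical placeholder: in practice this step relies on Jappalyzer.
    private static boolean detectsJsFramework(String rawHtml) {
        return rawHtml.contains("data-reactroot") || rawHtml.contains("ng-version");
    }
}
```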
Selecting a Subset of 200 URLs for an Initial Manual Test
From this filtered list, we defined a first subset of 200 URLs to test our procedure. To ensure a balanced sample, we applied the following rules:
- 80% non-home URLs / 20% home URLs
- 80% URLs passing our automated test / 20% not passing it
Initial Results and Adjustments
After several qualification iterations, we identified 96 URLs showing visual differences with and without JavaScript execution.
An interesting finding: 6 URLs identified as requiring JavaScript had not been flagged by our initial automated test. Since only 40 URLs in the sample (the 20% that did not pass the automated test) could reveal such misses, 6 out of 40 (15%) is a significant rate. This observation leads us to reconsider our selection ratio: perhaps we should shift from an 80/20 split to 60/40 to improve the diversity of selected pages.
Comparing Text Rather than Visual Differences
Another observation: even if a page looks visually different, the extracted DOM text sometimes remains identical.
Example: a page that initially appears blank until JavaScript sets display: none on the foreground div; since the underlying text was already present in the DOM, the extracted text is identical with and without JavaScript execution.
And let’s not forget our main goal: to improve web crawling by detecting pages that require JavaScript execution to extract their information.
We aim to:
- Discover page links
- Classify the page’s theme using embeddings
Relying solely on a visual difference would therefore be a mistake. It is essential to compare the extracted text from each page with and without JavaScript execution for a more relevant analysis.
We modified our tool to display a page’s text after simple parsing. To achieve this, we retrieve its HTML code with both the Apache HttpClient Java library and the Playwright framework. For each URL, we generate a WARC file containing the results of both calls:
- The raw HTML code
- The text after simple parsing
- The text extracted with Boilerpipe
We then manually analyze these files to compare the results.
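Below is a simplified sketch of the two retrievals feeding this comparison, assuming Apache HttpClient 4.x for the raw fetch (no JavaScript) and Playwright for the rendered version (JavaScript executed), each followed by Boilerpipe extraction. The WARC packaging and the “simple parsing” output are deliberately omitted here to avoid guessing at library specifics.

```java
import com.microsoft.playwright.*;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class DualFetcher {
    public static void fetchBoth(String url) throws Exception {
        // 1) Raw HTML, no JavaScript execution.
        String rawHtml;
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            rawHtml = EntityUtils.toString(client.execute(new HttpGet(url)).getEntity());
        }

        // 2) Rendered HTML after JavaScript execution, via Playwright.
        String renderedHtml;
        try (Playwright playwright = Playwright.create();
             Browser browser = playwright.chromium().launch()) {
            Page page = browser.newPage();
            page.navigate(url);
            renderedHtml = page.content();
        }

        // 3) Boilerpipe extraction on both versions, ready for comparison.
        String textWithoutJs = ArticleExtractor.INSTANCE.getText(rawHtml);
        String textWithJs = ArticleExtractor.INSTANCE.getText(renderedHtml);
        System.out.printf("without JS: %d chars, with JS: %d chars%n",
                textWithoutJs.length(), textWithJs.length());
    }
}
```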

The differences are harder to detect than in the previous test. The objective is to determine whether JavaScript execution actually provides more information. However, some of the added text comes from cookie banners, which is irrelevant for our purposes.
By reviewing our 200-URL list, we found:
- 6 URLs where JavaScript adds data without being detected by the automated test.
- 5 URLs that were not identified in the visual test.
- But most importantly, 51 previously detected URLs are no longer recognized, indicating that our initial test generated a high number of false positives.
Conclusion
We will now build a qualified database of thousands of URLs to determine when it is relevant to execute JavaScript. The idea is to optimize this process so that it is only applied to certain pages, thereby reducing costs.

This project was funded by the French State as part of France 2030.