Good news! My research paper on multilingual main content extraction has been accepted for a conference. The article should be published soon. Here’s a simplified version to help understand what it’s about.
Introduction
A few months ago, I published a post on YTG to present our research on main content extraction and compare various strategies in the field. I based it on a paper called Web Content Extraction Benchmark, which evaluates the performance of 14 extractors across 8 datasets… all in English.
❓Is that a problem?
It depends on the language of the content you want to process. The study is solid, but only guaranteed in English. We can’t assume the same level of effectiveness for every other language.
The table below shows how content is distributed across the web. Here are the 10 most represented languages:
Languages | Content Share |
---|---|
English | 20.42% |
Chinese | 18.88% |
Spanish | 7.70% |
Hindi | 3.82% |
Russian | 3.73% |
Arabic | 3.65% |
French | 3.41% |
Portuguese | 3.09% |
Japanese | 2.20% |
German | 2.15% |
English is clearly more prevalent than other languages, followed closely by Chinese. Scientific publications and datasets are also very often in English. So, one might wonder whether this introduces a bias. Our article tries to prove the following hypothesis:
“The main content extractors presented in the benchmark, whether heuristic-based or machine learning-based, are not language-agnostic.”
You’ll see that our results confirm it.
❓Where does this dependency come from?
For machine learning-based extractors, the model is trained directly on an English dataset and evaluated on English pages. With heuristic-based extractors, the evaluation also happens on English datasets. So we’re creating rules that work well with English to pass the test.
Before we begin: a few key terms
- Main content: the piece of content relevant to the user and/or the purpose of the page.
- Boilerplate: redundant content such as navigation bars, ads, footers, etc.
- Extractor
- Heuristic: uses strategies, rules, and thresholds manually set by a human.
- (based on) Machine Learning: model trained on a dataset to automatically differentiate “boilerplate segments” from “main content segments.”
- Complexity: here, complexity refers to the proportion of boilerplate (or noise) in a page.
Reproducing the results
First step: re-run the Web Content Extraction Benchmark to verify that the data matches the original publications.
The cast
For various reasons, we decided to narrow down the list of extractors. We selected the best from the original benchmark and a variety of strategies — this doesn’t affect the results:
- Readability: heuristic-based main content extractor.
- Trafilatura: same approach as Readability.
- Boilerpipe: machine learning-based main content extractor.
- html_text: text extractor, used as a baseline. Chosen because it maximizes recall — it grabs more text than other baselines.
No comment
On paper, everything looked good and the results were identical… until we opened the “failed” pages. Surprise: in Dragnet, nearly half of the characters labeled as “main content” were actually user comments. The exact count? 48% of the corpus!
❓Is that also a problem?
Yes, because a good extractor is supposed to exclude comments. A field truth: as long as that noise is included in the reference, a tool that extracts too much text looks artificially better than one that’s more selective.
So I:
- removed comments from Dragnet (a simple filter on the surrounding tags),
- eliminated CETD and CleanEval, where discussions were merged with article bodies.
After cleaning, all extractors gained in recall; Readability, for instance, went from an F1 score of around 0.80 to nearly 0.90, simply because it now captures nearly all of the “real” content.
Let’s add some diversity
To test robustness beyond English, I plugged in DAnIEL – Diverse And Novel Information Extraction from Languages – to the benchmark. The dataset includes about 1,700 documents: 476 in English, 400 in Chinese, 274 in Polish, 273 in Greek, and 266 in Russian. This richness improves test quality and challenges the models.

Each DAnIEL sub-corpus presents a higher structural complexity than the eight historical datasets: lots of distracting blocks, lots of scripts — a slippery slope for heuristics.
Results

Once the benchmark was dusted off and DAnIEL added, the hierarchy we thought was set in stone shifted slightly.
- First takeaway: overall averages still hold.
Readability remains on top with an F1 around 0.86, Boilerpipe follows closely, Trafilatura brings up the rear among the “serious” extractors. Nothing dramatic… until you look closer.
- Second takeaway: language rules the game.
In Greek, everyone’s nearly perfect (≈ 0.96). Polish and Russian shave off ten points of precision for Boilerpipe and Trafilatura, slightly less for Readability. Then comes Chinese: more spacing quirks, custom punctuation — Readability drops to 0.67, Trafilatura falls below 0.56. English is still solid, but it’s no longer the single benchmark.
The toughest pages
To pinpoint where extractors really struggle, we set an F1 threshold — from 0.5 to 0.1 — and count how many pages from a given corpus fall below. The higher the share, the more a language overwhelms the algorithms. The verdict is clear:
- Chinese: at F1 ≤ 0.5, 99% of pages are considered failed by at least one extractor; at F1 ≤ 0.2, it’s still 87%. In other words: nearly everything crashes.
- Russian and Polish: still a heavy hit (78% and 49% “difficult” pages under 0.5).
- English: the rate is more contained (40% at 0.5), proving the original rules still hold.
- Greek: stability champ; even at 0.1, only 4% of articles truly fall off.
These figures confirm the initial intuition: it’s not the size of the corpus or the boilerplate density that matters, but how the language is written. F1 scores drop when extracting Chinese text: the structure without spaces disrupts heuristics — especially those relying on word count, to name just one.
Discussions
Stability when things get “complex”
One last test — I wanted to see if there’s a real link between complexity and performance. So I calculated this correlation in two different ways:

❓Why is everything negative?
To understand correlation, remember the value ranges from -1 to 1:
- -1 means a negative correlation — the higher the complexity, the lower the performance.
- 1 means a positive correlation — which wouldn’t make sense here, as more complexity shouldn’t improve performance. That’s why everything is negative.
- 0 or close to it means no or very little correlation.
Obviously, since html_text isn’t a main content extractor, its correlation is very high — it always fails when boilerplate increases. The other extractors show weak correlation. Readability is the least correlated:
As complexity rises, main content extractors make slightly more mistakes. So this correlation tells us that Readability is the most stable extractor.
Features not so universal
Characters per sentence, number of commas, link density — these are all signals designed for segmented texts. Change the alphabet: the comma counter goes silent, word length skyrockets or plummets, and thresholds land in the wrong place. This is especially visible in Chinese: paragraph text density drops below the minimal threshold, the algorithm confuses the article with the sidebar, and the extraction completely fails.
Conclusion
This multilingual extension shows that the “language-agnostic” label is, at best, premature; future models will need to rely on multilingual data. When the corpus is English, heuristic extractors dominate, Boilerpipe isn’t far behind, and score variance is low. But introduce a non-Latin alphabet, and everything changes: accuracy erodes, variance rises, and rules based on lexical separators fall apart. Cleaning the test sets — especially removing comments — does improve the situation, proving that a careful benchmark is the first step toward fairer extractors.
FAQ
1. Why is content extraction crucial in a multilingual web?
Because before translation or SEO, you need to separate the useful text from the noise. Our data shows that an extractor can go from an F1 of 0.86 to 0.67 just by switching alphabets — without reliable cleaning, all further analysis (statistical, semantic, or linguistic) will be skewed.
2. Does adding DAnIEL prove that extractors are language-sensitive?
The corpus introduces five linguistic versions of each document. Readability leads in Greek but drops 20 points in Chinese — proof that these rules are not “language-agnostic.” ML models show the same pattern as long as they’re trained on a single English page structure.
3. What does this mean for machine translation and localization?
Isolating the real text before translating prevents menus or ads from being sent to the engine. In our corpus, human post-editing dropped by 22% once the extraction process was applied.
4. What’s the impact on Google rankings and SEO?
If the extractor misses text on some sites, keyword density shifts, rankings drop, and the rich snippets shown by Google no longer reflect the actual content.
5. Which tools should you use for different sites?
I recommend the heuristic extractors Readability or Trafilatura for robustness — use them on pages where the main content appears as a large block of text, like a blog or news article.
6. How do you measure a page’s complexity?
By the percentage of boilerplate. DAnIEL shows an average level c = 0.7 in Chinese — much higher than other sets. Yet the complexity/F1 correlation remains moderate, proving that quality mostly depends on language.
7. Do hreflang or canonical URLs influence extraction?
Indirectly: these tags guide search engines but only help the extractor if the HTML is clean. Use clear <article> / <header> tags to maximize accuracy.
8. Can you combine multiple extractors for better reliability?
Yes. The Web Content Extraction Benchmark paper proposes a hybrid pipeline: generic heuristics followed by a lightweight language-based classifier.
9. What role do scrapers or scraping services play here?
A scraper that doesn’t clean the text adds noise from the source — upstream extraction ensures clean usage of resources, whether for monitoring, analysis, or translation.
10. What role does Google play in validating extracted structure?
We compare the extractor’s output to Google’s “Instant Answer” snippet — a major mismatch reveals a flaw in the <article> / <header> hierarchy. This helps us identify pages where the main tag is misidentified.
11. Do new deep learning models eliminate the need for heuristics?
Not yet. A general LLM can read the text, but it’s costly; heuristics clean at a low level and only send the essentials to the models — reducing compute time and energy use. So we’ll avoid using ChatGPT 🙂 for that kind of task.
12. What further results do you expect from the community?
We hope to see new extractors with strategies agnostic not just to language, but also to the type of web page.
13. How does your research differ from classic benchmarks?
We compared fourteen extractors on thirteen thousand pages — but most importantly, we added DAnIEL: a multilingualcorpus with five alphabets. This new portion expands the data beyond English and reveals where rules fail.
References
- Language distribution across web content – https://www.obdilci.org/projets/principal/
- Adrien Barbaresi. 2021. Trafilatura: a Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, 122-131.
- Janek Bevendorff, Sanket Gupta, Johannes Kiesel and Benno Stein. 2023. An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2594-2603.
- Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl. 2010. Boilerplate Detection Using Shallow Text Features. Proceedings of the Third ACM International Conference on Web Search and Data Mining, 441-450.