Interview of Adrien Barbaresi by Guillaume Pitel

Adrien and Guillaume discuss Trafilatura, public research, LLMs, and the future of the web!

Guillaume Pitel: Hello Adrien. You’re well known as the creator of Trafilatura. I’m very familiar with the tool, and at Babbar we work a lot on content extraction, segmentation, and selection issues. The idea behind this conversation is to introduce the topic to a broader audience. Before diving into the details, I’d like you to tell us about your educational and professional background, and maybe your personal connection to open source.

Adrien Barbaresi: My background? I come from linguistics, then computational linguistics, then NLP, and applied data engineering. To put it simply, I shifted from the humanities—reading and analyzing texts—to machines doing that for us. I also adapted to the job market, with the rise of LLMs, and developed an open-source project that wasn’t initially intended for that purpose, but ended up being widely adopted. Professionally, I also transitioned—from public research to the private sector. Today, I work in a company that heavily uses LLMs.

Guillaume: There’s a lot to say about public research, I know from experience. Can you give us a few dates from your career path? I also started in academia: finished my PhD in 2004, then did three years of postdocs—CNRS, INRIA, CEA. Then I left academia because I felt it wasn’t for me. Maybe your experience is different?

Adrien: Yes, and I made another change I haven’t mentioned yet: I moved countries. I started my PhD in France and finished it in Germany, in Berlin. I had studied German, so conceptually it wasn’t too far off. But it did play a role. I worked for a research institute of the Berlin-Brandenburg Academy of Sciences. So I’ll talk a bit from the Berlin perspective, but the issues are the same on both sides: lots of fixed-term contracts, lots of project-based work. Even people with permanent positions need to secure funding every two or three years. You can’t plan for the medium term, people come and go… This lack of permanence, this inability to plan, keeps you from doing deep research in good conditions!
And now I realize that in the private sector, you’re sometimes more free. Sure, you need funding, but gathering 10 people in a small company to solve a problem together is often simpler than in a lab, where you’re buried in constraints.

Guillaume: I had exactly the same impression. I did my best research when I started working independently. I no longer had to answer ultra-specific grant calls.
That reminds me—I have a German friend I met at Berkeley. He was working on FrameNet back then. I looked him up recently: he’s still a postdoc. He’s older than me—I’m 47, he must be 50 or 52. Still a postdoc. Sure, he’s risen in seniority, but the title is still “postdoc.” It’s crazy.

Adrien: That’s another issue—the hierarchy. Between postdoc and assistant professor or full professor, there aren’t many options. Or you have to join a CNRS unit, but those positions aren’t always open, or they’re not where you want them to be. So people trade flexibility for the ability to work on topics they’re passionate about. But you have to make a choice: do I stay, or do I leave?
And the problem doesn’t go away when you land a permanent position. Even then, there are still constraints. Everyone has to find their balance. Sometimes it’s in public research, sometimes elsewhere.

Guillaume: And that lack of freedom has created an opening for private companies to scoop up talent and sometimes do research that’s more impactful than in academia.
Let’s talk about your main project: Trafilatura. But before we do—you never gave me the dates. When was your PhD again?

Adrien: I defended my PhD in 2015. And I started working on the topic way before that, in 2012–2013. Back then, I wanted to work on something else, but there wasn’t any data. So I figured, let’s start by collecting data. And that actually became a real research subject, because there were no tools or clear methods for it—especially in the humanities.

Guillaume: Right around the time LLMs started emerging. Let’s focus on Trafilatura now—it’s an open-source tool for extracting the main content of a web page, right?

Adrien: Yes. The idea isn’t to scrape very specific info like the price of a pair of sneakers (or all prices on an e-commerce site). Trafilatura extracts what a human reader would consider the main content: the title, date, author, and core text. Not the ads, menus, or footer. You can look at it another way: it removes everything repetitive and low-value.
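For readers who want to try this themselves, here is a minimal sketch using Trafilatura's high-level helpers, fetch_url and extract. The URL is a placeholder, and the exact keyword arguments vary slightly between versions, so treat it as illustrative rather than definitive.

```python
# Minimal, illustrative use of Trafilatura: download a page and keep only
# what a reader would consider the main content.
import trafilatura

url = "https://example.com/some-article"  # placeholder URL
downloaded = trafilatura.fetch_url(url)   # raw HTML, or None on failure

if downloaded:
    # Main text only: menus, ads and footers are stripped out.
    text = trafilatura.extract(downloaded, include_comments=False)
    print(text)

    # Structured output (here JSON) can also carry metadata such as the
    # title, author and date; keyword arguments may differ across versions.
    as_json = trafilatura.extract(downloaded, output_format="json",
                                  with_metadata=True)
    print(as_json)
```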

Guillaume: And technically, how does it work? What’s the method?

Adrien: At first, I targeted WordPress, which was widely used. I thought: if I can extract content correctly from WordPress pages, I’ve won 70% of the web. But with all the plugins and customizations, it quickly turned into a headache. So I added an algorithmic approach.
Today, Trafilatura combines several methods: heuristic rules, a Python version of Readability (the algorithm behind Firefox’s Reader View), jusText (which comes from public research in the Czech Republic), and more. Based on my testing, this combination works best.
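To make the idea of combining extractors concrete, here is a conceptual sketch of such a fallback cascade. It is not Trafilatura's actual code: it simply tries a Readability-style extraction first (via the readability-lxml package) and falls back to jusText when the result looks too short; the length threshold is an arbitrary assumption.

```python
# Conceptual sketch of a fallback cascade combining two extractors.
# This illustrates the general approach, not Trafilatura's internals.
import justext
from lxml import html as lxml_html
from readability import Document  # from the readability-lxml package

MIN_LENGTH = 250  # arbitrary threshold for "the extraction looks plausible"

def extract_main_text(page_html: str) -> str:
    # First attempt: Readability, which returns the main article as HTML.
    article_html = Document(page_html).summary()
    text = lxml_html.fromstring(article_html).text_content().strip()
    if len(text) >= MIN_LENGTH:
        return text

    # Fallback: jusText classifies each paragraph as content or boilerplate.
    paragraphs = justext.justext(page_html.encode("utf-8"),
                                 justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```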

Guillaume: You mentioned that usage has exploded recently, especially because of LLMs. Can you elaborate?

Adrien: Yes. There’s a study published on Hugging Face, RefinedWeb, that compared model performance when trained on WET data (Common Crawl’s raw text extracts, essentially a naive HTML-to-text conversion) versus text cleaned up with Trafilatura. Result: Trafilatura helped the model converge faster, with less noise. I could never have demonstrated that at such a scale myself. Since then, downloads have skyrocketed.

Guillaume: And you—do you use LLMs as a coder?

Adrien: Yes, but not for everything. I’m not a prompt engineer. I use LLMs for debugging, rewording things, testing ideas. But I stay in control. For example, when I started learning Rust, I’d paste error messages into an LLM to get explanations. The models’ ability to summarize is incredibly impressive.

Guillaume: Let’s talk a bit about the future of the web. As you’ve seen, many sites now block crawlers—except Google’s. The web is closing in on itself. What do you think?

Adrien: It’s a real problem. Stack Overflow, Reddit, all these sites have been scraped to train LLMs… People don’t answer questions anymore; they wonder, what’s the point? The content they create is harvested by machines. We’re at risk of a general impoverishment of the web.
And on the user side, it’s the same. Before, you’d surf from link to link. Now, people just ask a chatbot and that’s it. They never leave the portal. It kills diversity, curiosity, critical thinking.

Guillaume: Do you think it could actually change the way people think?

Adrien: Yes. If we spend 15 years asking machines all our questions, we’ll lose the ability to search, to cross-check. It’s like forgetting how to read a map because we have GPS. We become dependent, and more passive. And the problem isn’t just with navigation; it affects thought itself.

Guillaume: Well, let’s not dive into philosophy too deeply, but clearly, it raises big questions. Thanks a lot, Adrien, for this fascinating conversation.

Adrien: Thank you! Let’s continue the discussion anytime.