Interview of Fabien Vauchelles by Guillaume Pitel

This time, it’s Guillaume Pitel, CTO of Babbar.tech, who’s playing along by interviewing Fabien Vauchelles, the creator of Scrapoxy.
Big thanks to Fabien!

Guillaume: Hello Fabien. Great to have you here today! You’re the creator of Scrapoxy, and I’d like to learn more about your journey. How did you come to develop Scrapoxy? How long have you been working on this project?

Fabien: Good question! I started scraping about twenty years ago, almost from the beginning of my journey on the web. Back then, I was working on search engines and needed data retrieval, which gradually led me from crawling to scraping.

I had an ambitious project: predicting people’s career paths based on their profiles. The idea was to identify their “next step” depending on various factors. This paved the way for applications like job recommendations or predictive analysis of job market trends. At the time, we didn’t have advanced AI like today, just basic NLP models.

I started scraping millions of profiles from a well-known social network. But soon, I faced banning issues. To get around the bans, I had to manage multiple IP addresses, which was both complex and costly.

So I developed a solution to automate this process: initially with proxies on AWS, then with an automated orchestration system, which became Scrapoxy. I shared it as open source, and it quickly gained popularity. Some users pushed it to extremes, launching 20,000 machines overnight, whereas I had only ever used 200! That’s when I realized how popular the tool was becoming.

Guillaume: That’s impressive! I imagine the platforms reacted quickly. How did they enhance their defenses over time?

Fabien: Exactly. Initially, it worked very well, but protections strengthened over time. At one point, I had scraped almost every French LinkedIn profile—around 16 million profiles.

I never exploited this commercially but conducted internal NLP tests, particularly predicting when someone would leave their job. Using the data, we could anticipate professional changes and suggest suitable job opportunities.

Some companies employed similar techniques: some succeeded, others were acquired or ceased operations. It highlights just how powerful these tools can be.

Guillaume: And today, what exactly does Scrapoxy manage in the scraping process?

Fabien: Scrapoxy is a proxy manager dedicated to scraping. It doesn’t perform crawling itself but orchestrates proxies. It supports cloud providers like AWS, OVH, and DigitalOcean, as well as 25 proxy providers, including residential proxies.

It automates IP rotation and manages instance scaling to reduce costs. For example, instead of keeping 200 instances active continuously, Scrapoxy dynamically turns them on and off, reducing costs by 80%.
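To make the rotation idea concrete, here is a minimal sketch of the round-robin IP rotation that Scrapoxy automates (Scrapoxy’s actual selection logic is more sophisticated; the proxy addresses below are placeholders, not real endpoints):

```python
from itertools import cycle

# Placeholder upstream proxies; in practice these would be cloud
# instances that Scrapoxy starts and stops on demand.
proxy_pool = cycle([
    "http://10.0.0.1:3128",
    "http://10.0.0.2:3128",
    "http://10.0.0.3:3128",
])

def next_proxy():
    """Return the next proxy in the pool, wrapping around."""
    return next(proxy_pool)

# Each outgoing request is assigned a different upstream IP;
# after exhausting the pool, assignment wraps back to the start.
assigned = [next_proxy() for _ in range(4)]
```

The cost saving comes from combining this rotation with scaling: because the pool members are ephemeral instances, they only need to exist while requests are in flight.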

It also offers features like browser fingerprinting to avoid detection, managing bans, and integration with Scrapy to simplify usage.
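From the scraper’s point of view, this all collapses into a single-endpoint model: the scraper sends every request to one local Scrapoxy endpoint, and Scrapoxy picks and rotates the upstream IP behind it. A minimal standard-library sketch, where the host, port, and credentials are assumptions about a local deployment rather than fixed Scrapoxy values:

```python
import urllib.request

# Assumed local Scrapoxy endpoint; adjust host, port, and
# credentials to match your own deployment.
SCRAPOXY_ENDPOINT = "http://user:password@localhost:8888"

# Standard-library opener that routes both HTTP and HTTPS
# traffic through the single endpoint.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({
        "http": SCRAPOXY_ENDPOINT,
        "https": SCRAPOXY_ENDPOINT,
    })
)

# opener.open("https://example.com") would now go out through
# whichever upstream instance Scrapoxy currently assigns.
```

Framework integrations such as the Scrapy one follow the same principle: the scraping tool is configured with one proxy address, and the orchestration stays invisible to it.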

Guillaume: So, basically, Scrapoxy optimizes infrastructure to allow scrapers to operate smoothly?

Fabien: Exactly. Scrapoxy isn’t a scraping tool itself but a solution that enhances existing tools. It prevents IPs from being quickly banned and optimizes costs by running machines only when necessary. It’s also modular; contributors develop modules for specific use-cases and share them. Additionally, some users provide very interesting solutions for varying the load incrementally.

Guillaume: Open source is fantastic, but how do you manage the community and the monetization of Scrapoxy?

Fabien: That’s a real challenge. Several models exist: consulting, open core (free version and premium paid version), or sponsorship. I’ve tried consulting, but it often shifts towards unrelated issues.

Currently, I’m leaning toward an open core model with an advanced paid version. However, with advances in AI, I’m reconsidering the extent of open source because it becomes too easy to clone and improve projects with AI.

Guillaume: That’s quite a dilemma. Speaking of that, do you think AI will profoundly transform scraping?

Fabien: Yes, we’re witnessing an ongoing evolution of the “cat-and-mouse” game. Initially, protections focused on IP bans, then browser fingerprint detection. Now we’re entering an AI battle: one AI creates an antibot, another AI finds ways around it. The future will be an automated confrontation between these systems.

We also see emerging models capable of interacting with sites as humans would, simulating natural behaviors. These are fascinating challenges to follow!

Guillaume: Fascinating indeed! Thank you, Fabien, for this very enriching discussion.

Fabien: Thank you, Guillaume. It was a pleasure discussing these topics!