Interview with Julien Nioche, Creator of StormCrawler and Green IT Expert

Introduction and Context

Guillaume Pitel: Julien, we met almost 10 years ago, when I was at Exensa and we were getting serious about crawling. We’d developed a machine learning algorithm for semantics and wanted to prove it could scale to the Web. We started with Common Crawl but, dissatisfied with the quality, I wanted to do the crawling myself.

I’d contacted you because I’d seen your activity around StormCrawler and DigitalPebble. In the end, it didn’t work out due to the complexity of arranging funding with the UK, even before Brexit. I collaborated with the University of Twente in the Netherlands, which led to the creation of Babbar. Today, we crawl extensively, particularly for Ibou, our Web search engine currently under development. But let’s talk about you: what’s your background?

Journey: From Linguistics to NLP

Julien Nioche: I’ve been based in Great Britain for over 20 years, in Bristol for 16 years. I have an atypical background with deeply rooted impostor syndrome: I never studied computer science at university. I learnt to code during my military service, during my guard shifts at military school.

I studied Russian and languages up to master’s level. I was aiming for literary translation, but for that you need to not need the money and to be incredibly talented. That wasn’t my case. Technical translation didn’t interest me because translators were already using computer-assisted systems. It was through discussions with my lecturers that I realised I could combine linguistics and programming through natural language processing (NLP).

I worked for a start-up in France, then went to the University of Sheffield as a researcher. We did named entity extraction: automatically extracting from text the names of people, products, places and their relationships. At the time, everything was rule-based, which required linguistic skills. It became unmanageable after a while.

Open Source and Lucene

Julien: The start-up in France built “semantic” search engines with dictionaries and lexical resources. I’d found a little-known open source project that I began using intensively: Lucene. At the time, it wasn’t yet at the Apache Foundation. I made some modest contributions, but I was amongst the first users.

This project was incredibly important for me. That’s where I saw how to write good Java code, the dynamics of an open source project, how to combine different influences from around the world. I became a better Java developer just by looking at Lucene’s code.

At Sheffield, I did information extraction. After a few years, the academic side wore me down. The enormous European projects produced little relative to the money invested. It was brilliant for tourism with meetings bizarrely always in Spain, Italy, Greece in the summer, but the output was depressingly poor. I came from industry and wanted to make things that worked, that had an impact.

The Move to Crawling

Julien: That’s when I started DigitalPebble. It was a leap into the void because I had no consulting experience. I arrived with my background in Lucene, wanting to build search engines. This was at the very beginning of Elasticsearch. I’d actually met Shay Banon when he was still on his own, I think in Amsterdam.

My clients would tell me: “That’s great, you can do indexing, search, classification. But we need documents. Can you help us get web pages?”

Guillaume: Ah, the famous data acquisition problem!

Julien: Exactly. I’d already played with the web as a data corpus. At the time, I’d worked with Gregory Grefenstette…

Guillaume: Seriously? I did my post-doc with him!

Julien: Amazing! I’d done an internship at the Xerox research centre in Grenoble with him. We’d worked on using web corpora for linguistic tasks, particularly language identification. But here, I was being asked to crawl several million pages. I needed tools of a different calibre.

Discovering Nutch and Apache

Julien: Whilst searching, I came across Apache Nutch. Nutch allows you to crawl the Web at very large scale and is best known for having given birth to Hadoop, the first major open source Map/Reduce platform. Doug Cutting, who created Nutch, was also the author of Lucene. At the risk of sounding like a fanboy, he’s someone I admire greatly: he made Lucene, Hadoop, Nutch, Avro. And he’s an absolutely lovely person whom I had the chance to meet at one of the Apache conferences.

I started using Nutch around 2008-2009 for clients who wanted to crawl a billion pages. On fairly large clusters, I found loads of bugs and possible improvements. I began contributing patches, patch after patch. And I was asked to become a committer on the project.

Guillaume: That’s fine recognition.

Julien: Commercial successes bring joy, but being invited to be a committer on Nutch is one of the things that gave me the most professional joy. I felt more valued than landing a big contract. It was having the recognition of people I admired. It gave me a massive boost.

I was asked afterwards to be PMC chair, the link between the project steering committee and the Apache Foundation. I used Nutch for quite a few clients. Gradually, I was doing more and more crawling, less and less NLP, and very little search. Meanwhile, Solr and Elasticsearch had become well established.

Nutch’s Limitations

Julien: I began to see Nutch’s limitations. Even though it was very robust and reliable, the fact that Nutch relies on Hadoop meant that crawling was done in batches, in waves. The larger the crawl grew, the longer certain batch operations took between phases.

To illustrate, the sequence is: generation of URLs to fetch, fetching these URLs, parsing to extract links and content, indexing, then updating the database. With this model, as it goes on, the first and last phases take more time because the database grows rapidly. The fetch part, which should be the essence of crawling, represents a diminishing share.
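
To make that cycle concrete, here is a minimal sketch of the batch model in Java. It is purely illustrative – not Nutch’s actual API – with stand-in methods for each phase:

```java
// Illustrative batch crawl loop: each phase runs over the whole crawl database
// before the next one starts, so generate/update grow with the crawl while the
// fetch phase becomes a shrinking share of every round.
import java.util.ArrayList;
import java.util.List;

public class BatchCrawlSketch {

    record Page(String url, String html) {}

    public static void main(String[] args) {
        List<String> crawlDb = new ArrayList<>(List.of("https://example.org/"));
        for (int round = 0; round < 3; round++) {
            List<String> segment  = generate(crawlDb);   // 1. select URLs due for fetching
            List<Page>   fetched  = fetch(segment);      // 2. download them (network-bound)
            List<String> outlinks = parse(fetched);      // 3. extract links/content (CPU-bound)
            index(fetched);                              // 4. push documents to the index
            update(crawlDb, outlinks);                   // 5. merge new URLs back (disc-bound)
        }
        System.out.println("Crawl database now holds " + crawlDb.size() + " URLs");
    }

    static List<String> generate(List<String> db) { return new ArrayList<>(db); }

    static List<Page> fetch(List<String> urls) {
        List<Page> pages = new ArrayList<>();
        for (String u : urls) pages.add(new Page(u, "<html>...</html>")); // stand-in for HTTP
        return pages;
    }

    static List<String> parse(List<Page> pages) { return List.of("https://example.org/about"); }

    static void index(List<Page> pages) { /* send to Solr/Elasticsearch in a real crawler */ }

    static void update(List<String> db, List<String> links) {
        for (String l : links) if (!db.contains(l)) db.add(l);
    }
}
```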

It was also inefficient because we weren’t using all the resources at the same time. When you fetch, you use the network, little CPU and little disc. When you parse, it’s CPU, no network. When you generate and update, it’s CPU and lots of disc, no network. In short, you always have unused resources.

The Birth of StormCrawler

StormCrawler logo: a spider with a lightning bolt

Julien: 12-13 years ago, stream processing was emerging. Several projects were appearing, including Apache Storm. I began looking at Storm and found it appealing. The concepts were simple but let you express quite sophisticated things.

I thought: I’ll see if I can write a small crawler on Storm. I called it StormCrawler, which says a lot about my imagination! I invested time out of interest, to see where it would lead me. Quite quickly, I saw that it interested potential users. I found myself in San Francisco with a client, helping them use it.

Guillaume: And the community developed?

Julien: Yes, and it’s a joy equal to being made a committer on Nutch. When you create an open source project, just an idea you code in your office, and someone takes an interest, sends a contribution, even a comment, or uses it… It’s a brilliant feeling: “Wow, I made something good enough for people to invest their time and attention.”

It’s the best antidote to impostor syndrome. StormCrawler grew gradually. I wanted to solve Nutch’s problems: now everything happens continuously. We retrieve URLs, we update the database, we fetch, we parse – all at the same time. All resources are used simultaneously.

It makes sense: if we care about politeness in crawling, we’re not going to hit a server with many requests in a restricted time. We respect robots.txt and its directives. Having more time to fetch allows fetching more politely.

More modular too. With Nutch, you have to retrieve the entire code base and modify it. With StormCrawler, everything is modular, based on Maven dependencies. I’ve seen users crawl a billion pages with minimal code – just a few configuration files. The idea was to make something elegant, lightweight and above all easy to customise.
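
To illustrate the streaming model, here is a sketch of a Storm topology, assuming Apache Storm 2.x. The spout and bolts below are stand-ins, not StormCrawler’s real components (whose class names and packages vary by version); what matters is the wiring: URLs are grouped by host so the same site always lands on the same fetcher instance, and all stages run at the same time.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StreamingCrawlSketch {

    // Emits URLs continuously; a real crawler would read them from a status store.
    public static class UrlSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector c) {
            this.collector = c;
        }
        public void nextTuple() {
            Utils.sleep(1000); // throttle the demo
            collector.emit(new Values("https://example.org/", "example.org"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("url", "host"));
        }
    }

    // Stand-in for an HTTP fetcher; grouping by host upstream preserves per-site politeness.
    public static class FetchBolt extends BaseBasicBolt {
        public void execute(Tuple t, BasicOutputCollector c) {
            c.emit(new Values(t.getStringByField("url"), "<html>...</html>"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("url", "content"));
        }
    }

    // Stand-in for parsing, link extraction and indexing, which are separate bolts in practice.
    public static class ParseBolt extends BaseBasicBolt {
        public void execute(Tuple t, BasicOutputCollector c) { /* extract links, index content */ }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("urls", new UrlSpout());
        // fieldsGrouping on "host": all URLs from one site go to the same fetcher instance
        builder.setBolt("fetch", new FetchBolt(), 4).fieldsGrouping("urls", new Fields("host"));
        builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("fetch");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("crawl-sketch", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let the topology run briefly, then shut down
        }
    }
}
```

Because the spout keeps emitting while the bolts work, fetching, parsing and status updates overlap instead of running in separate waves.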

I think StormCrawler succeeded in being both easy to modify and reusable. Users have used it for very different things, only needing to write their own specific part and rely on a common base.

Towards Apache and Handing Over

Julien: Over time, I tried to leverage others’ efforts. The idea wasn’t to let go, but to encourage other contributors to participate and give them the keys to the truck, so the project becomes theirs as much as mine.

A year and a half ago, I donated StormCrawler to the Apache Foundation. We went through a fairly rapid incubation phase because amongst the committers, four were already Apache Foundation members. Now, StormCrawler is a Top-Level Project. My participation is limited, most of the work is done by other committers. Whatever my involvement, the project will continue.

Guillaume: It’s a bit like watching your children grow up and accepting that they become independent?

Julien: Exactly. I’d gone that far, I’ll always be there, but the project must live its own life.

The Turn Towards Green IT

Guillaume: And since this handover, what have you moved towards?

Julien: Environmental issues have always been of great importance to me. I’ve always been aware that we weren’t treating the planet as we should. For a long time, there was a dichotomy between my personal convictions and my professional activities.

Two to three years ago, I began looking at what was being done around Green Software. I found that the skills I knew, my professional expertise, could be usable for things I profoundly believed in. I fell into Green Software – this idea that we can measure and reduce the environmental impacts of what we do with software.

Guillaume: Before we go into detail, let’s talk briefly about search engines and AI. We’re launching Ibou, an independent search engine. With the emergence of generative AI, which uses search massively, what’s your view on the environmental impact?

AI and Environmental Impact

Julien: AI is on everyone’s lips, and environmental impacts too. It’s a source of problems, especially in the United States: water use in arid regions, electricity use that national grids can’t always provide. We end up with actors running gas turbines. It’s an environmental catastrophe. There’s also the noise.

I was speaking with someone from the industry who lives in the United States, and they told me: “To go shopping, I pass 20 data centres. When they run their diesel generators, the air quality is abominable, depending on the wind.” It takes us back to 18th-century France, where the siting of suburbs was determined by the prevailing winds and the factories. We’re returning to that with data centres.

Not a day goes by without an article trying to shed light on the subject. In Great Britain, it’s a race to build data centres. Oddly, instead of building them in Scotland, where there is so much renewable energy that it has to be curtailed from the grid and where water access isn’t a problem, they’re put in south-east England, where there are already water and electricity access problems.

Guillaume: For latency perhaps?

Julien: I don’t know. Or they’re afraid the Scots might become truly independent 🙂! There’s progress: Google published a paper last month shedding light on their impacts. But it remains very opaque overall. When a model runs in production, it’s difficult to evaluate.

For web search, as you recalled, machine learning isn’t new. These were mainly classic vector approaches with cosine distances, fairly lightweight in computation. It’s nothing like LLMs.

The problem is opacity. For most of AI and SaaS, when you ask: “What quantity of energy was used for my pipelines and in which region?”, they say: “That’s interesting, but we can’t tell you.” Everything is opaque. It’s difficult to have an estimate and be able to account for hidden impacts.

Search Engines and the Environment

Guillaume: You were talking about search engines positioning themselves on environmental issues by planting trees. Any thoughts on that?

Julien: Well, I contribute to a charitable organisation near Bristol that plants trees. I sometimes spend my weekends planting trees, it’s a subject close to my heart.

It’s a good thing, but what would be even better is if they published indicators: “Per user query, we estimate this generates such quantity of electricity and, as our servers run in such place, that translates to such quantity of emissions.” Putting things in perspective with open data, as far as possible.

Guillaume: Transparency is a problem, especially now that AI companies are acting like it’s the Wild West. Since 11-12 September, Google has blocked the queries that return the top 100 results, heavily used in SEO for position tracking. We suspect it’s not because of people doing SEO, but rather because of generative AI actors like OpenAI, Perplexity and Anthropic making massive non-compliant queries.

Bing too, this summer, stopped its pure search API. Now you must use Azure with their service that integrates automatic summarisation. Quite a tightening of competitive space on search.

Julien: Google does publish things though. I believe they were at 7 grammes per search in 2023. It’s hard to know if an ultra-optimised engine could do better or if their scale makes them efficient. Google, amongst the main cloud actors, is the most transparent. It’s not perfect, but they publish each year a detailed environmental report with clear methodology. The problem is that when you look at the table, emissions are rising, electricity consumption rising, all because of AI. But they put a green tick beside it! They’re rapidly moving away from their objectives.

They’re still quite transparent and they make the most practical tools available to their users. Beyond environmental issues, there are questions of privacy and sovereignty: is it acceptable to have entirely delegated the foundations of the digital industry to a private actor from a foreign country, under that government’s influence?

Web Crawling and Efficiency

Guillaume: At Babbar, we’re very attentive to costs, financial and also environmental. We have 60 machines for the database managing 700 billion URLs, 16 for crawling. We’re on dedicated servers in the Paris region with OVH and Scaleway. As little cloud AWS or GCP as possible.

We’re attentive to CPU, RAM and disc costs. My main problem: we crawl the web, so we trigger operations everywhere in the world. I haven’t found an indicator that would tell me the carbon intensity of requesting information from a given server. That would allow me to guide my crawl and get an idea of my indirect impact.

Julien: Not to my knowledge. The network is somewhat specific. There was a fascinating presentation at GreenIO in Paris last December. GreenIO is one of the major conferences on Green Software. There’s also a podcast of the same name by Gaël Duez.

The presentation talked about the environmental impact of the network. It’s hard to have estimates. What happens is that the network is provisioned for demand peaks. Most of the time, very little happens relative to capacity, but the quantity of energy needed is the same. It becomes something of a constant. Switches have options, not activated by default, that would allow consumption to be reduced when there’s little traffic.

There could be intelligent routing that would take into account the carbon intensity of a node in the network and say: “It’s a bit dirty at that spot, we’ll go the other way, it’s cleaner.” But currently, it’s a constant.

In your case, you were saying it’s mainly physical resources, embodied carbon. That’s due to the fact you’re running in France where energy is relatively clean. If your servers were in Poland with coal energy, that would change the ratio.

Intelligent Crawling and Quality

Guillaume: Crawl intelligence is determinant. When you crawl naively, you quickly end up crawling only rubbish. There’s an enormous amount of rubbish on the Internet. Sites create an infinity of pages to cheat on metrics like PageRank, and by definition, these infinite sites end up occupying your entire index.

Julien: I fully recognise what you’re describing. With crawls we were doing, if you don’t take adequate measures, you quickly end up with 80% of URLs being infinite redirections to worthless adult content.

For Common Crawl, I think it’s improved. I know Common Crawl well because I’ve worked with them several times. A long time ago, when their engineer left, I ran the crawl for them. I managed to convince them to go back to a more standard Nutch setup. Later, I donated a StormCrawler configuration for crawling news sites, which is now part of their resources.

More recently, I spent a few months with them. Sebastian Nagel, their principal engineer who is also a committer on Nutch and StormCrawler, has done enormous work. Quality has improved. Common Crawl has been one of the main training sources for LLMs. Organisations we all know have widely used Common Crawl.

Guillaume: For me it was around 2015-2016. We started with Bubing, then developed our own version. We contributed a lot to Bubing because there were loads of bugs, but its efficiency was impressive compared to Nutch or Heritrix.

Julien: Speaking of Bubing, I’d got in touch with them. I did a project a few years ago called URLFrontier, where the idea was to separate the URL selection and storage logic into a separate service, to avoid reinventing the wheel. Every crawler reinvents its own code for that.

The Bubing authors were academics, not very interested in the long term. It illustrates that open source isn’t just about the code being accessible. What matters most is the community you build. That’s what makes a project successful: the community behind it. It’s in the DNA of Apache projects.

Guillaume: We contributed fixes to Bubing, then had to fork because we were becoming too dependent. There was zero community interest in contributing. It’s sometimes difficult to contribute to open source as a company. We tried to contribute to RocksDB and never got beyond repeated attempts, because you could sense there was no willingness from the maintainers.

Price, Energy and Location

Guillaume: My intuitive rule for environmental impact is mainly price. If you’re spending money, it must have an impact. I suspect that’s oversimplified, but there must still be a fairly strong correlation.

Julien: Yet price is quite a poor proxy for impact. You’re right, if it’s expensive, it’s probably heavy. But let me give you examples on AWS.

You can take measures that will reduce your costs, such as committing for a period or reserving instances. That will lower your bill without changing the environmental impact of the same services. For impact, everything depends essentially on where services run.

Concrete example: I was working recently with an organisation in Bristol, Camera Forensics, fantastic people who use StormCrawler. Most of their processes run on AWS in the United States, simply because it’s the default – us-east-1. Many services were only available there initially.

In reality, everything is about carbon intensity of the electrical grid. In the United States, lots of coal and gas, few renewables. In France, you’re essentially on nuclear – not necessarily good news environmentally, but at least it’s low carbon. Carbon intensity in France is around 30-40 grammes of CO2 equivalent per kilowatt-hour.

Here in Great Britain, it’s different because the energy transition of the last 10-15 years has been effective. Lots of wind, wind turbines, and despite what the French think, sunshine too! So we have much more variation in electrical carbon intensity.

In Sweden, they’re always at about 20g because they have enormous renewable capacity. The same process will require the same quantity of energy whether it runs in the United States or in Sweden. But in one case it will generate 400g of CO2 per kilowatt-hour, in the other 20g. That has no relation to price. You’ve paid the same thing, but in one case the environmental impact is 20 times less.
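
To make the arithmetic concrete, here is a minimal sketch using the grid intensities mentioned here and an assumed workload of 100 kWh (all figures are illustrative, not measurements):

```java
public class CarbonComparison {
    public static void main(String[] args) {
        double energyKWh = 100.0;   // assumed energy used by the workload
        double usGrid = 400.0;      // gCO2e per kWh, illustrative US grid mix
        double seGrid = 20.0;       // gCO2e per kWh, illustrative Swedish grid mix

        // emissions = energy consumed x carbon intensity of the local grid
        double usKg = energyKWh * usGrid / 1000.0;
        double seKg = energyKWh * seGrid / 1000.0;
        System.out.printf("Same job: %.1f kg CO2e in the US vs %.1f kg CO2e in Sweden%n", usKg, seKg);
    }
}
```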

That’s what we call carbon-aware computing, the basis of Green Software. We supported this Bristol organisation: they’re migrating to Sweden and moving to more recent Graviton instances, which are cheaper and more efficient. They’re killing two birds with one stone.

The other strategy is time shifting: taking into account fluctuations in carbon emissions within a day. If you have a batch to do, doing it when the grid is relatively clean to minimise impact.
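
A sketch of that time-shifting idea, assuming the UK grid’s public Carbon Intensity API (https://api.carbonintensity.org.uk); the JSON field name and the “clean enough” threshold are assumptions for the sake of the example, and a real scheduler would be less crude:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CarbonAwareBatch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://api.carbonintensity.org.uk/intensity")).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Pull the current "actual" intensity (gCO2/kWh) out of the JSON response.
        Matcher m = Pattern.compile("\"actual\"\\s*:\\s*(\\d+)").matcher(body);
        if (m.find()) {
            int intensity = Integer.parseInt(m.group(1));
            if (intensity < 100) {   // arbitrary "clean enough" threshold
                System.out.println("Grid at " + intensity + " g/kWh: run the batch now");
            } else {
                System.out.println("Grid at " + intensity + " g/kWh: defer the batch");
            }
        }
    }
}
```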

Question for you: where do your servers run geographically?

Guillaume: In the Paris region, as I said, with OVH and Scaleway in France. As a good nuclearised Frenchman, a user of local resources, the carbon impact of electricity isn’t the most decisive factor in my thinking. When I do our reports, the biggest contributor is the physical resources, the servers. Our energy use has a relatively low impact.

Which brings me back to my crawling problem: I haven’t found a carbon intensity indicator per Autonomous System or per IP, for example, that would let me guide the crawl. And estimates of the network’s cost range from negligible to enormous depending on whom you ask.

The Network and Hidden Impacts

Julien: As I said, the network is a difficult problem. Because it’s provisioned for demand peaks, its energy use barely changes with traffic: it behaves like a constant.

Intelligent routing based on carbon intensity could change that, but for now it’s treated as an evaluated constant. And the fact that your physical resources weigh more comes back to running in France, where the energy is clean. In Poland, the ratio would be reversed.

In hyperscalers, equipment is so heavily used over 5-6 years that the ratio is three quarters usage, one quarter hardware. For a laptop or phone, it’s the reverse: 80% of impact is in manufacture, 20% in use. The best thing to do: keep it as long as possible.

For cloud servers, it’s different. Sometimes it’s more efficient to replace servers sooner with more efficient ones, especially if the grid is really dirty. The efficiency savings mean that, after some time, the replacement has a net positive impact.
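
A rough illustration of how that split flips with the grid. All the numbers below – embodied carbon, power draw, lifetime, grid intensities – are invented assumptions purely to show the arithmetic:

```java
public class EmbodiedVsUse {
    public static void main(String[] args) {
        double embodiedKg = 1500.0;   // assumed manufacturing footprint of one server, kg CO2e
        double lifetimeYears = 5.0;   // assumed service life
        double avgPowerKw = 0.3;      // assumed average draw, kW
        double hoursPerYear = 8760.0;

        printSplit("France (~40 g/kWh)", embodiedKg, lifetimeYears, avgPowerKw, hoursPerYear, 40.0);
        printSplit("Poland (~650 g/kWh)", embodiedKg, lifetimeYears, avgPowerKw, hoursPerYear, 650.0);
    }

    static void printSplit(String label, double embodiedKg, double years,
                           double powerKw, double hoursPerYear, double gridGPerKWh) {
        // operational emissions = energy over the lifetime x grid carbon intensity
        double useKg = powerKw * hoursPerYear * years * gridGPerKWh / 1000.0;
        double total = embodiedKg + useKg;
        System.out.printf("%s: embodied %.0f%%, use %.0f%%%n",
                label, 100 * embodiedKg / total, 100 * useKg / total);
    }
}
```

With these made-up figures, embodied carbon dominates on the French grid and operational emissions dominate on the Polish one – the flip described here.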

What you described with crawling is a bit like sunflower crawling that would favour certain regions according to sun and wind. That would allow optimisation, but it’s difficult to measure since it’s not on your side.

Guillaume: Complicated. There would need to be dialogue between the crawler and the crawled. And then potentially, there’s also a trade-off with politeness: is it better to crawl when the server isn’t busy to avoid bothering it? But when it’s not busy, it’s probably night-time, so not ideal for solar.

Also, as crawlers we’re only interested in the text for indexing. It would help to have intelligence at the web server level that says: “If it’s a crawler, I won’t activate cookies, personalisation or JavaScript; I’ll return a minimal payload with well-structured semantic HTML.”

LLMs and the Web’s Future

Julien: We’re seeing that precisely with the crawlers used by LLMs. It’s called llms.txt. The idea is to point to a Markdown version of pages – a minimal version of the textual content with just the links – and serve that to robots.

I amused myself by looking. I took 1 million websites, launched a small crawl with StormCrawler to see the proportion that had such a page. I left it running several hours and when the total was still at zero, I killed the crawl. For now, I don’t get the impression it’s taking off. But who knows, it would rather be a good thing.

The problem is that it requires AI actors to want to play by the rules. Take robots.txt – the file that lets the webmaster tell a crawler which pages not to visit or what crawl frequency to use – it’s the basics. A well-behaved crawler should respect it. StormCrawler does so by default.

But AI actors don’t. They bombard websites relentlessly. I don’t think we can expect them to respect something different, unless it has added value for them, like not having to parse complex pages.
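
For readers curious what respecting robots.txt looks like in code, here is a small sketch using the robots.txt parser from the crawler-commons library (which, as far as I know, both Nutch and StormCrawler rely on); exact signatures and units may differ slightly between versions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class PoliteCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://example.org/robots.txt";
        String target = "https://example.org/some/page.html";

        // Fetch the robots.txt file once per host.
        HttpClient client = HttpClient.newHttpClient();
        byte[] robotsBytes = client.send(
                HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray()).body();

        // Parse it and ask whether our agent may fetch the target URL.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(robotsUrl, robotsBytes,
                "text/plain", "my-polite-crawler");
        System.out.println("Allowed: " + rules.isAllowed(target));
        // Crawl delay requested by the site, as reported by the parser.
        System.out.println("Requested crawl delay: " + rules.getCrawlDelay());
    }
}
```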

Guillaume: But that would mean they care about money, and in fact I’d note they have rather too much of it and prefer to burn money to save time. In January, an open source developer released Anubis, a JavaScript anti-bot that blocks crawlers pretending to be browsers, because sites are getting pounded by crawlers with no respect whatsoever.

Solutions like Cloudflare already existed, but here it’s someone who did it in open source because his own site got bombarded. It was a kind of self-hosted GitHub where links into the commit history triggered database requests. His server collapsed under these web rogues who respect nothing, because they throw money at it, buy proxies and crawlers, and blast away without caring about the consequences.

It’s the misfortune of these overcapitalised industries without limits, in a frenzied race to be first.

Spruce: A Green IT Project

Guillaume: Tell us about your Green IT projects now.

Julien: With my open source DNA, I started a new project a few months ago: Spruce. Spruce estimates the energy and emissions resulting from a user’s AWS usage across the different services: storage, network, compute. It gives an estimate of the energy needed and of the CO2 emissions.

Now, DigitalPebble’s activity is essentially that: Green Software, GreenOps. I still do crawling if asked with great fervour, but I’m essentially in Green Software.

Guillaume: What exactly is GreenOps?

Julien: Everyone is familiar with FinOps, the activity consisting of looking at and reducing the financial costs of cloud usage. GreenOps looks not at costs but at environmental impacts.

Spruce is a GreenOps tool. It takes the usage reports generated by AWS – enormous Parquet files, like a giant CSV with a million lines and hundreds of columns. Spruce reads the Parquet files and applies various data sources and models to arrive at these estimates. From there, it allows you to do reporting, to have figures on your AWS cloud impact, but also to try to reduce that impact: to see which projects, environments and services have an impact and, above all, what can be done to reduce it.
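
As an illustration of the kind of aggregation such a tool performs – this is not Spruce’s actual code or its models; the line items, wattages and grid factors below are invented – the core idea is to turn usage into energy and then into emissions per region:

```java
import java.util.List;
import java.util.Map;

public class GreenOpsSketch {
    // One simplified line item of the kind found in a cloud usage report.
    record LineItem(String service, String region, double usageHours, double assumedWatts) {}

    public static void main(String[] args) {
        // Invented sample rows standing in for millions of Parquet lines.
        List<LineItem> report = List.of(
                new LineItem("EC2", "us-east-1", 720, 150),
                new LineItem("EC2", "eu-north-1", 720, 150),
                new LineItem("S3",  "us-east-1", 720, 10));

        // Illustrative grid intensities in gCO2e/kWh per region.
        Map<String, Double> gridIntensity = Map.of("us-east-1", 400.0, "eu-north-1", 20.0);

        for (LineItem item : report) {
            double energyKWh = item.usageHours() * item.assumedWatts() / 1000.0;
            double emissionsKg = energyKWh * gridIntensity.get(item.region()) / 1000.0;
            System.out.printf("%-4s %-11s %6.1f kWh  %6.2f kg CO2e%n",
                    item.service(), item.region(), energyKWh, emissionsKg);
        }
    }
}
```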

Through DigitalPebble, I provide the open source project but also the consulting that helps organisations translate this information into actions that reduce their environmental impact.

Spruce uses open source libraries. We don’t create our own models, we reuse existing libraries. There’s a module that allows retrieving estimates from the Boavizta API.

One of the project’s motivations: Cloud Carbon Footprint was for a long time the reference. It was a good multi-cloud tool: Google, Alibaba, AWS. But it was backed by a consulting company in the US, and they stopped supporting it. The project is dead; nobody contributes to it anymore.

That was the problem with Cloud Carbon Footprint: no contributor diversity. Everyone worked for the same company, and the day that company got fed up, the project died. With Spruce, the idea is to have something that won’t end up like that. For now it’s a project under the DigitalPebble label, my consulting company, but I may also donate it to the Apache Foundation after a while. We’ll see; it will depend on the dynamics, on whether there’s traction.

Commercial GreenOps solutions also exist: in France, Sopht and OxygenIT; in Great Britain, GreenPixie and Tailpipe. There are loads of offerings on the market, each with their strengths. I know all these companies because I recently worked for a year for the British Ministry of Justice. I was a civil servant, dealing with Green Software at the Ministry of Justice. It wasn’t a role that existed; it’s a role I created.

It’s an enormous environment with thousands of developers. I saw there was a void on the GreenOps side, so I sought to fill it. After a while, I realised I’d probably have more impact doing that from my own side, with DigitalPebble. In that context, I’d worked with some of these commercial solution providers.

The idea with Spruce isn’t to compete with them, but to provide a first tool that allows a developer or an organisation to show something quickly, without it costing much money, and to take that first convincing step within their company.

The Human Problem of GreenOps

Julien: The problems with GreenOps and FinOps aren’t technical. It’s not a question of getting the data – for FinOps especially, we have the data; Amazon provides it in great detail. The problem isn’t coding dashboards, that’s fairly trivial. The problem is a human one: convincing people, but also making the information visible.

The goal is that a development team working on a product can see the costs and impacts as easily as they see their deployment metrics.

What often happens is that developers in a company don’t really care about the cost – unless they’re in a start-up and their immediate survival depends on the company’s financial survival. But once an organisation exceeds a certain size, the larger it is, the less involved developers are in the financial cost of what they do.

I’ve seen this in large organisations: the quantity of services on the cloud that are unused is incredible. It’s partly due to how these organisations function. Often a team will create a new service, transmit it to another team that handles maintenance. That team says: “Why do we have five environments?” And when in doubt, they leave things because they don’t have deep knowledge of the tool.

We end up with unused resources. Buckets with data we leave because we’re not quite sure, with replication across three continents. All this leads to enormous quantities of resources that cost lots of money and have an environmental impact.

These developers won’t necessarily be very sensitive to the financial aspect. I’ve seen it: “Yes, it costs so much per month and we don’t use it.” But if, on the other hand, I say: “Yes, it’s so much per month, but it also generates 30 kilos of CO2 per month for no reason”, that changes things. Thirty kilos of CO2 is about 300 kilometres by car – an amount we can completely eliminate.

That can make the developer who would otherwise say “I don’t have time to look” take the time to do it. In English we say “It’s not my wallet” – but it is my planet. That’s GreenOps: it lets an organisation go further than FinOps, further than finances alone.

The argument: if you put forward environmental criteria, it will also trigger actions on the financial side.

Conclusion and Perspectives

Guillaume: To conclude, you now work essentially on Green IT?

Julien: Yes, DigitalPebble’s activity is essentially Green Software and GreenOps. I still do a bit of crawling or search if asked with great fervour, but I’m essentially in Green Software.

Being able to align my environmental convictions with my professional expertise is a great satisfaction. For a long time there was a dichotomy between my personal convictions and my professional activities. Now it’s aligned, and it’s something I profoundly believe in.

Green Software is measuring and reducing the environmental impacts of what we do. The real problem with GreenOps isn’t technical, it’s human: convincing people and making information visible. When a developer sees the environmental impact of their code, even if they don’t care about financial costs, they can act because it’s their planet.

Guillaume: Thank you Julien for all these insights, from your journey to StormCrawler through to Green IT. It’s fascinating to see how your technical expertise now serves a cause close to your heart.

Julien: Thank you Guillaume. And good luck with Ibou, it’s a fine project. The idea of having a search engine that respects web publishers and tries to be sustainable is exactly the kind of initiative we need.