
Abstract
In OpenWebSearch.EU, my main job is simple to explain and much harder to execute: help crawl the open web in a way that is transparent, legally responsible, and actually usable by others. This project sits at the heart of that effort: designing the OWLer crawling pipeline and the OWLer Dashboard that lets people see, steer, and reuse what we collect as part of the European Open Web Index.
Why we needed "our own" crawler
When I first joined the OpenWebSearch.EU project, we had a bold vision and a very practical problem.
The vision: build a European Open Web Index that anyone can reuse for search, analytics, and AI, instead of everyone quietly scraping the same commercial indices.
The problem: there was no shared crawling backbone that we fully controlled.
Common Crawl is great, but we needed predictable legal and ethical guarantees, control over what we crawl and when, and a pipeline that could run across a patchwork of European HPC centres without becoming a DevOps horror story.
That's where OWLer comes in — a distributed crawler that extends StormCrawler/URLFrontier, but tuned specifically for the OpenWebSearch.EU project and the Open Web Index.
Designing the OWLer crawling pipeline
On paper, OWLer looks like "just another StormCrawler setup". In practice, it's a cooperative crawling fabric spread across multiple data centres that all have their own hardware, policies, and quirks.
My work here has mostly focused on the pipeline side. We started from StormCrawler's proven Storm-based architecture and extended it into an OpenWebSearch-friendly stack covering URL discovery and filtering, content fetching, parsing, and WARC writing, all wired into a shared URLFrontier service. The frontier is the "brain" that coordinates crawl state across all OWLer instances.
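To make that wiring concrete, here is a minimal sketch of such a topology in StormCrawler terms. It follows the pattern of StormCrawler's public archetype; the package and class names (notably the Spout and StatusUpdaterBolt from the urlfrontier module) are assumptions based on the pre-Apache com.digitalpebble releases, not a verbatim copy of our production topology, and the WARC-writer bolt is omitted for brevity.

```java
import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.urlfrontier.Spout;
import com.digitalpebble.stormcrawler.urlfrontier.StatusUpdaterBolt;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class OwlerLikeTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new OwlerLikeTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // URLs are pulled from the shared URLFrontier service, not a local queue.
        builder.setSpout("frontier", new Spout());

        // Partition by host so per-site politeness can be enforced downstream.
        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("frontier");

        // Group fetches by partition key: a given host is only ever fetched
        // by one fetcher task, which keeps rate limiting honest.
        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("fetch");

        // Fetch results and newly discovered outlinks flow back to the
        // frontier on the dedicated status stream. (WARC writing omitted.)
        builder.setBolt("status", new StatusUpdaterBolt())
                .fieldsGrouping("fetch", Constants.StatusStreamName, new Fields("url"))
                .fieldsGrouping("parse", Constants.StatusStreamName, new Fields("url"));

        return submit("owler-crawl", conf, builder);
    }
}
```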
We also focused on making the crawl cooperative, not chaotic. Multiple OWLer instances run on VMs and servers across European providers. Instead of each crawler doing its own thing, they all talk to the same frontier and follow shared policies (respecting robots.txt, rate limits, and legal constraints). That's how we can grow from millions to tens of millions of page fetches per day without stepping on each other's toes.
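Much of that cooperative behaviour is configuration rather than code. A trimmed-down crawler config might look like the sketch below: the agent and politeness keys are standard StormCrawler settings, while the urlfrontier.* keys are assumptions standing in for however a deployment points each instance at the shared frontier.

```yaml
config:
  # Identify the crawler honestly so webmasters know who is fetching.
  http.agent.name: "OWLer"
  http.agent.description: "OpenWebSearch.EU crawler"

  # Politeness: cap fetcher concurrency and wait between requests
  # to the same host; robots.txt rules are enforced by the fetcher.
  fetcher.threads.number: 50
  fetcher.server.delay: 1.0

  # Every instance talks to the same URLFrontier service, so crawl
  # state is shared instead of duplicated. (Key names assumed.)
  urlfrontier.host: "frontier.example.eu"
  urlfrontier.port: 7071
```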
Finally, we put effort into turning crawls into reusable datasets. Every fetch ends up in WARC files plus rich metadata (language, basic quality signals, crawl context). Those artifacts flow into the Federated Data Infrastructure behind the Open Web Index: Parquet metadata, CIFF indices, and specialised slices for downstream tasks like search, analytics, or LLM/RAG pipelines.
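To give a feel for what "reusable" means downstream, the sketch below walks a WARC file with the jwarc library (one reader among many; the choice is an assumption, not a prescription) and prints a minimal metadata row per fetched page. In the real pipeline, rows like these end up as Parquet and feed the CIFF indices, but the WARC layer itself stays tool-agnostic.

```java
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcResponse;

import java.nio.file.Path;

public class WarcToMetadata {
    public static void main(String[] args) throws Exception {
        // Iterate one WARC file and emit target URL, fetch time,
        // and content type for every response record.
        try (WarcReader reader = new WarcReader(Path.of(args[0]))) {
            for (WarcRecord record : reader) {
                if (record instanceof WarcResponse) {
                    WarcResponse response = (WarcResponse) record;
                    System.out.printf("%s\t%s\t%s%n",
                            response.target(),
                            response.date(),
                            response.http().contentType());
                }
            }
        }
    }
}
```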
In short, OWLer is the piece that turns "we should crawl the open web" into "here is a well-structured, inspectable data stream we can all build on."
From black box to "glass cockpit": the OWLer Dashboard
After we had OWLer running at scale, a new problem showed up: even we struggled to see what the crawler was actually doing.
You don't want an index that appears out of nowhere. You want something you can debug, question, and explain, especially if you're a webmaster, a researcher, or someone wiring the Open Web Index (OWI) into their own systems.
That's why I spent a lot of time on the OWLer Dashboard, which lives inside the broader Observability & Control app.
The dashboard gives you crawl and pipeline metrics at a glance, showing how many pages are being crawled, how many end up in preprocessing, how much data flows into the Open Web Index, and how this changes over time. Think of it as a "glass cockpit" for the crawl: if something stalls or spikes, you see it immediately.
It also supports drilling into slices of the web. Instead of one big opaque number, you can explore by domain, dataset, or time range. Want to know what we've actually crawled in the last 24 hours for a specific region or use-case? The dashboard lets you filter, inspect associated datasets, and understand how that slice got into the index.
Most importantly, we're putting webmasters in the loop. The public OWLer page and dashboard are deliberately not just for engineers. They explain why we crawl, how to opt out with robots.txt, and how to request more control over how your content shows up. For an "infrastructure" project, this is crucial: web publishers shouldn't have to reverse-engineer what we're doing.
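For example, assuming the crawler announces itself with the user-agent token OWLer (the public OWLer page is the authoritative source for the exact token and policy), opting out uses the same one-file mechanism webmasters already know:

```
# robots.txt at the site root
# Keep OWLer out of one area while allowing everything else:
User-agent: OWLer
Disallow: /private/

# Or opt out of OWLer entirely:
# User-agent: OWLer
# Disallow: /
```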
Crawling-on-Demand and flexible index slices
One of the things I'm most excited about is making the crawling pipeline feel interactive, not just batch-like.
We've been pushing OWLer and the surrounding infrastructure towards what we call crawling-on-demand: you provide seed URLs or a focused domain/topic, OWLer spins up a dedicated crawl using the same machinery as the main pipeline, and the output ends up as its own dataset and, if needed, its own index slice.
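Mechanically, "spins up a dedicated crawl" mostly means feeding seeds into the frontier under a dedicated crawl identifier. The sketch below is purely illustrative: the endpoint, field names, and payload shape are all hypothetical (OpenWebSearch.EU has not published this exact interface); it only shows the shape of the exchange, seeds in, a dedicated dataset and optional index slice out.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OnDemandCrawlSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical request payload: every field name here is made up
        // for illustration and is NOT a published OpenWebSearch.EU API.
        String request = """
            {
              "crawlId": "energy-news-slice",
              "seeds": ["https://example.org/", "https://example.net/news/"],
              "scope": {"sameDomainOnly": true, "maxDepth": 3},
              "output": {"dataset": true, "indexSlice": true}
            }
            """;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest post = HttpRequest.newBuilder()
                .uri(URI.create("https://frontier.example.eu/on-demand")) // hypothetical endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(request))
                .build();

        HttpResponse<String> reply = client.send(post, HttpResponse.BodyHandlers.ofString());
        System.out.println(reply.statusCode() + " " + reply.body());
    }
}
```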
The long-term idea is simple: instead of everyone scraping the web from scratch, you request exactly the slice you need — legally compliant, well-documented, and aligned with the rest of the Open Web Index.
For me, OWLer and the OWLer Dashboard are the two pieces that make this possible: one turns distributed infrastructure into a coherent crawler; the other turns that crawler into something people can see, reason about, and eventually request specific data products from.
Citation
If you find this work useful in your own research, please consider citing the following.
@misc{zerhoudi2023owler,
  title        = {The Open Web Search Crawler (OWLer)},
  author       = {Zerhoudi, Saber and Dinzinger, Michael and Granitzer, Michael},
  year         = {2023},
  note         = {Zenodo preprint, OpenWebSearch.EU},
  howpublished = {\url{https://openwebsearcheu.pages.it4i.eu/wp1/owseu-crawler/owler/}}
}

@inproceedings{hendriksen2024openwebindex,
  title     = {The Open Web Index: Crawling and Indexing the Web for Public Use},
  author    = {Hendriksen, Gijs and Dinzinger, Michael and Fröbe, Maik and Schmidt, Sebastian and Zerhoudi, Saber and others},
  booktitle = {Advances in Information Retrieval (ECIR 2024)},
  year      = {2024}
}

@article{granitzer2023impactowi,
  title   = {Impact and development of an Open Web Index for open web search},
  author  = {Granitzer, Michael and Voigt, Stefan and Fathima, Noor Afshan and Golasowski, Martin and Guetl, Christian and Hecking, Tobias and Hendriksen, Gijs and Hiemstra, Djoerd and Martinovič, Jan and Slaninová, Kateřina and Stein, Benno and de Vries, Arjen P. and Vondrák, Vít and Wagner, Alexander and Zerhoudi, Saber},
  journal = {Journal of the Association for Information Science and Technology},
  year    = {2023}
}