
Abstract
In OpenWebSearch.EU, my main job is simple to explain and much harder to execute: help crawl the open web in a way that is transparent, legally responsible, and actually usable by others. This project sits at the heart of that effort: designing the OWLer crawling pipeline and the OWLer Dashboard that lets people see, steer, and reuse what we collect as part of the European Open Web Index.
Why we needed "our own" crawler
When I first joined the OpenWebSearch.EU project, we had a bold vision and a very practical problem.
The vision: build a European Open Web Index that anyone can reuse for search, analytics, and AI, instead of everyone quietly scraping the same commercial indices.
The problem: there was no shared crawling backbone that we fully controlled.
Common Crawl is great, but we needed predictable legal and ethical guarantees, control over what we crawl and when, and a pipeline that could run across a patchwork of European HPC centres without becoming a DevOps horror story.
That's where OWLer comes in — a distributed crawler that extends StormCrawler/URLFrontier, but tuned specifically for the OpenWebSearch.EU project and the Open Web Index.
Designing the OWLer crawling pipeline
On paper, OWLer looks like "just another StormCrawler setup". In practice, it's a cooperative crawling fabric spread across multiple data centres that all have their own hardware, policies, and quirks.
My work here has mostly focused on the pipeline side. We started from StormCrawler's proven Storm-based architecture and extended it into an OpenWebSearch-friendly stack covering URL discovery and filtering, content fetching, parsing, and WARC writing, all wired into a shared URLFrontier service. The frontier is the "brain" that coordinates crawl state across all OWLer instances.
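To make that wiring concrete, here is a minimal sketch of such a topology in StormCrawler terms. It follows the pattern of StormCrawler's public archetype; the package and class names (notably the Spout and StatusUpdaterBolt from the urlfrontier module) are assumptions based on the pre-Apache com.digitalpebble releases, not a verbatim copy of our production topology, and the WARC-writer bolt is omitted for brevity.

```java
import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.urlfrontier.Spout;
import com.digitalpebble.stormcrawler.urlfrontier.StatusUpdaterBolt;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class OwlerLikeTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new OwlerLikeTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // URLs are pulled from the shared URLFrontier service, not a local queue.
        builder.setSpout("frontier", new Spout());

        // Partition by host so per-site politeness can be enforced downstream.
        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("frontier");

        // Group fetches by partition key: a given host is only ever fetched
        // by one fetcher task, which keeps rate limiting honest.
        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("fetch");

        // Fetch results and newly discovered outlinks flow back to the
        // frontier on the dedicated status stream. (WARC writing omitted.)
        builder.setBolt("status", new StatusUpdaterBolt())
                .fieldsGrouping("fetch", Constants.StatusStreamName, new Fields("url"))
                .fieldsGrouping("parse", Constants.StatusStreamName, new Fields("url"));

        return submit("owler-crawl", conf, builder);
    }
}
```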
We also focused on making the crawl cooperative, not chaotic. Multiple OWLer instances run on VMs and servers across European providers. Instead of each crawler doing its own thing, they all talk to the same frontier and follow shared policies (respecting robots.txt, rate limits, and legal constraints). That's how we can grow from millions to tens of millions of page fetches per day without stepping on each other's toes.
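Much of that cooperative behaviour is configuration rather than code. A trimmed-down crawler config might look like the sketch below: the agent and politeness keys are standard StormCrawler settings, while the urlfrontier.* keys are assumptions standing in for however a deployment points each instance at the shared frontier.

```yaml
config:
  # Identify the crawler honestly so webmasters know who is fetching.
  http.agent.name: "OWLer"
  http.agent.description: "OpenWebSearch.EU crawler"

  # Politeness: cap fetcher concurrency and wait between requests
  # to the same host; robots.txt rules are enforced by the fetcher.
  fetcher.threads.number: 50
  fetcher.server.delay: 1.0

  # Every instance talks to the same URLFrontier service, so crawl
  # state is shared instead of duplicated. (Key names assumed.)
  urlfrontier.host: "frontier.example.eu"
  urlfrontier.port: 7071
```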
Finally, we put effort into turning crawls into reusable datasets. Every fetch ends up in WARC files plus rich metadata (language, basic quality signals, crawl context). Those artifacts flow into the Federated Data Infrastructure behind the Open Web Index: Parquet metadata, CIFF indices, and specialised slices for downstream tasks like search, analytics, or LLM/RAG pipelines.
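To give a feel for what "reusable" means downstream, the sketch below walks a WARC file with the jwarc library (one reader among many; the choice is an assumption, not a prescription) and prints a minimal metadata row per fetched page. In the real pipeline, rows like these end up as Parquet and feed the CIFF indices, but the WARC layer itself stays tool-agnostic.

```java
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcResponse;

import java.nio.file.Path;

public class WarcToMetadata {
    public static void main(String[] args) throws Exception {
        // Iterate one WARC file and emit target URL, fetch time,
        // and content type for every response record.
        try (WarcReader reader = new WarcReader(Path.of(args[0]))) {
            for (WarcRecord record : reader) {
                if (record instanceof WarcResponse) {
                    WarcResponse response = (WarcResponse) record;
                    System.out.printf("%s\t%s\t%s%n",
                            response.target(),
                            response.date(),
                            response.http().contentType());
                }
            }
        }
    }
}
```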
In short, OWLer is the piece that turns "we should crawl the open web" into "here is a well-structured, inspectable data stream we can all build on."
From black box to "glass cockpit": the OWLer Dashboard
After we had OWLer running at scale, a new problem showed up: even we struggled to see what the crawler was actually doing.
You don't want an index that appears out of nowhere. You want something you can debug, question, and explain, especially if you're a webmaster, a researcher, or someone wiring the Open Web Index (OWI) into their own systems.
That's why I spent a lot of time on the OWLer Dashboard, which lives inside the broader Observability & Control app.
The dashboard gives you crawl and pipeline metrics at a glance, showing how many pages are being crawled, how many end up in preprocessing, how much data flows into the Open Web Index, and how this changes over time. Think of it as a "glass cockpit" for the crawl: if something stalls or spikes, you see it immediately.
It also supports drilling into slices of the web. Instead of one big opaque number, you can explore by domain, dataset, or time range. Want to know what we've actually crawled in the last 24 hours for a specific region or use-case? The dashboard lets you filter, inspect associated datasets, and understand how that slice got into the index.
Most importantly, we're putting webmasters in the loop. The public OWLer page and dashboard are deliberately not just for engineers. They explain why we crawl, how to opt out with robots.txt, and how to request more control over how your content shows up. For an "infrastructure" project, this is crucial: web publishers shouldn't have to reverse-engineer what we're doing.
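For example, assuming the crawler announces itself with the user-agent token OWLer (the public OWLer page is the authoritative source for the exact token and policy), opting out uses the same one-file mechanism webmasters already know:

```
# robots.txt at the site root
# Keep OWLer out of one area while allowing everything else:
User-agent: OWLer
Disallow: /private/

# Or opt out of OWLer entirely:
# User-agent: OWLer
# Disallow: /
```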
Crawling-on-Demand and flexible index slices
One of the things I'm most excited about is making the crawling pipeline feel interactive, not just batch-like.
We've been pushing OWLer and the surrounding infrastructure towards what we call crawling-on-demand: you provide seed URLs or a focused domain/topic, OWLer spins up a dedicated crawl using the same machinery as the main pipeline, and the output ends up as its own dataset and, if needed, its own index slice.
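Mechanically, "spins up a dedicated crawl" mostly means feeding seeds into the frontier under a dedicated crawl identifier. The sketch below is purely illustrative: the endpoint, field names, and payload shape are all hypothetical (OpenWebSearch.EU has not published this exact interface); it only shows the shape of the exchange, seeds in, a dedicated dataset and optional index slice out.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OnDemandCrawlSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical request payload: every field name here is made up
        // for illustration and is NOT a published OpenWebSearch.EU API.
        String request = """
            {
              "crawlId": "energy-news-slice",
              "seeds": ["https://example.org/", "https://example.net/news/"],
              "scope": {"sameDomainOnly": true, "maxDepth": 3},
              "output": {"dataset": true, "indexSlice": true}
            }
            """;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest post = HttpRequest.newBuilder()
                .uri(URI.create("https://frontier.example.eu/on-demand")) // hypothetical endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(request))
                .build();

        HttpResponse<String> reply = client.send(post, HttpResponse.BodyHandlers.ofString());
        System.out.println(reply.statusCode() + " " + reply.body());
    }
}
```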
The long-term idea is simple: instead of everyone scraping the web from scratch, you request exactly the slice you need — legally compliant, well-documented, and aligned with the rest of the Open Web Index.
For me, OWLer and the OWLer Dashboard are the two pieces that make this possible: one turns distributed infrastructure into a coherent crawler; the other turns that crawler into something people can see, reason about, and eventually request specific data products from.
Citation
If you find this work useful in your own research, please consider citing the following.
@misc{zerhoudi2023owler,
  title        = {The Open Web Search Crawler (OWLer)},
  author       = {Zerhoudi, Saber and Dinzinger, Michael and Granitzer, Michael},
  year         = {2023},
  note         = {Zenodo preprint, OpenWebSearch.EU},
  howpublished = {\url{https://openwebsearcheu.pages.it4i.eu/wp1/owseu-crawler/owler/}}
}

@inproceedings{hendriksen2024openwebindex,
  title     = {The Open Web Index: Crawling and Indexing the Web for Public Use},
  author    = {Hendriksen, Gijs and Dinzinger, Michael and Fröbe, Maik and Schmidt, Sebastian and Zerhoudi, Saber and others},
  booktitle = {Advances in Information Retrieval (ECIR 2024)},
  year      = {2024}
}

@article{granitzer2023impactowi,
  title   = {Impact and development of an Open Web Index for open web search},
  author  = {Granitzer, Michael and Voigt, Stefan and Fathima, Noor Afshan and Golasowski, Martin and Guetl, Christian and Hecking, Tobias and Hendriksen, Gijs and Hiemstra, Djoerd and Martinovič, Jan and Slaninová, Kateřina and Stein, Benno and de Vries, Arjen P. and Vondrák, Vít and Wagner, Alexander and Zerhoudi, Saber},
  journal = {Journal of the Association for Information Science and Technology},
  year    = {2023}
}