
Abstract
Evaluating search systems properly means watching how people actually use them, not just measuring a single query-response pair. But recruiting real users for every experiment is costly and hard to reproduce. This project builds a stack of user simulators that trade off control and realism: SimIIR 2.0 models structured search sessions with swappable components, SimIIR 3 extends this to conversational and LLM-backed settings, and UXSim bridges the gap between rule-based stability and LLM flexibility by combining both in a single framework.
Why simulate users at all?
Search evaluation has a fundamental tension. The metrics we compute offline (precision, nDCG, recall) tell us how good a ranked list is. They say nothing about what happens when a real person sits down, types a query, scans results, clicks, reads, gets frustrated, reformulates, and eventually gives up or finds what they need.
User studies capture that richness, but they are slow, expensive, and nearly impossible to reproduce exactly. Run the same study twice and you get different participants with different moods on different days.
User simulators sit in the gap. They model the decisions a person makes during a search session (what to query, what to click, when to stop) and replay those decisions hundreds or thousands of times under controlled conditions. The result: reproducible, session-level evaluation that still reflects how people actually search.
This project is my attempt to build a clean toolkit for exactly that: starting from SimIIR 2.0, moving to SimIIR 3, and then adding UXSim as a bridge to LLM-based agents and modern search UIs.
SimIIR 2.0: modular building blocks for search sessions
SimIIR 2.0 is the foundation of this line of work.
The idea is straightforward: break a simulated search session into small, replaceable parts. A query generator decides what to type. A click model decides what to examine. A stopping strategy decides when enough is enough. Each part can be swapped independently, so testing a new click model doesn't mean rewriting the whole simulator.
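This modular design can be sketched in a few lines. The class and method names below are illustrative placeholders, not SimIIR 2.0's actual API; the point is that each decision lives in its own swappable component.

```python
import random

# Hypothetical sketch of the modular session design; names are illustrative,
# not SimIIR 2.0's real interface.

class FixedQueryGenerator:
    """Decides what to type: here, cycles through a fixed query list."""
    def __init__(self, queries):
        self.queries = list(queries)
        self.i = 0

    def next_query(self):
        q = self.queries[self.i % len(self.queries)]
        self.i += 1
        return q

class PositionBiasedClickModel:
    """Decides what to examine: click probability decays with rank."""
    def should_click(self, rank, rng):
        return rng.random() < 1.0 / (rank + 1)

class FixedDepthStopping:
    """Decides when enough is enough: stop after examining `depth` results."""
    def __init__(self, depth):
        self.depth = depth

    def should_stop(self, examined):
        return examined >= self.depth

def run_session(query_gen, click_model, stopping, search_fn, seed=0):
    """One simulated session: query, scan the ranking, click, stop."""
    rng = random.Random(seed)
    query = query_gen.next_query()
    results = search_fn(query)
    clicks, examined = [], 0
    for rank, doc in enumerate(results):
        if stopping.should_stop(examined):
            break
        examined += 1
        if click_model.should_click(rank, rng):
            clicks.append(doc)
    return query, clicks

# Swapping any one component leaves the rest untouched.
query, clicks = run_session(
    FixedQueryGenerator(["search simulation"]),
    PositionBiasedClickModel(),
    FixedDepthStopping(depth=5),
    search_fn=lambda q: [f"doc{i}" for i in range(10)],
)
```

Testing a new click model then means replacing one constructor call, not rewriting the session loop.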
What made SimIIR 2.0 different from its predecessor is the addition of user types and Markov-based transitions. Instead of one generic user, the framework supports distinct profiles: an exploratory searcher who reads broadly, or a focused searcher who clicks selectively, each with its own probabilistic flow through the session. The Markov model captures this: given the current state (e.g., just issued a query, just clicked a result), it decides what the user does next.
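A user type then boils down to a transition table over session states. The states and probabilities below are illustrative, not values taken from SimIIR 2.0; a "focused" profile would simply use a different table.

```python
import random

# Minimal sketch of a Markov model over session states; states and
# probabilities are made up for illustration.
TRANSITIONS = {
    "issue_query":     {"examine_snippet": 1.0},
    "examine_snippet": {"click_result": 0.4, "examine_snippet": 0.4, "stop": 0.2},
    "click_result":    {"examine_snippet": 0.7, "issue_query": 0.2, "stop": 0.1},
}

def next_state(state, rng):
    """Sample the user's next action given the current state."""
    r, acc = rng.random(), 0.0
    for action, p in TRANSITIONS[state].items():
        acc += p
        if r < acc:
            return action
    return "stop"

def simulate(rng, max_steps=50):
    """Walk the chain from the first query until the user stops."""
    state, trace = "issue_query", ["issue_query"]
    for _ in range(max_steps):
        state = next_state(state, rng)
        trace.append(state)
        if state == "stop":
            break
    return trace

trace = simulate(random.Random(42))
```

Running thousands of such walks per user profile yields the session-level statistics (queries per session, clicks per query, session length) that the evaluations rest on.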
We used this design to study how different user types react to changes in ranking quality, to benchmark new query generation methods, and to run full-session evaluations on digital library and web search collections. The framework is open source and designed so others can drop in their own components.
SimIIR 3: adding conversation and LLM pipelines
Search has been changing. Users increasingly interact with systems through conversation, not just typed queries and ten blue links. Ranking pipelines now include neural rerankers and LLM-based components. SimIIR 2.0 wasn't built for that world.
SimIIR 3 is the community's response. I co-designed this next iteration to handle conversational search scenarios, integrate cleanly with modern retrieval toolkits like PyTerrier, and support LLM-based query and response generation modules alongside traditional ones.
The core principle stays the same: user behavior is built from small blocks (querying, examining, judging, stopping). But the environment can now be a ranked list, a conversational assistant, or a mix of both. The simulator doesn't care which; it just needs to know what actions are available and what feedback to expect.
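That environment-agnostic contract can be sketched as a small interface. This protocol is an assumption for illustration, not SimIIR 3's actual class hierarchy: the simulated user only ever sees the available actions and the feedback.

```python
from abc import ABC, abstractmethod

# Illustrative environment abstraction; not SimIIR 3's real interface.

class Environment(ABC):
    @abstractmethod
    def available_actions(self):
        """What the user can do next (e.g. 'query', 'click', 'utter')."""

    @abstractmethod
    def step(self, action, payload):
        """Apply the action and return the system's feedback."""

class RankedListEnv(Environment):
    """Classic search: typed queries against a ranked list."""
    def available_actions(self):
        return ["query", "click"]

    def step(self, action, payload):
        if action == "query":
            return {"results": [f"doc for '{payload}'"]}
        return {"document": payload}

class ConversationalEnv(Environment):
    """Conversational search: utterances against an assistant."""
    def available_actions(self):
        return ["utter"]

    def step(self, action, payload):
        return {"response": f"Assistant reply to: {payload}"}

# The same simulated-user logic can drive either environment.
outputs = []
for env in (RankedListEnv(), ConversationalEnv()):
    action = env.available_actions()[0]
    outputs.append(env.step(action, "simulation frameworks"))
```

Because the user model is written against this contract, moving a simulation from ten blue links to a chat assistant means swapping the environment, not the user.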
SimIIR 3 is also built as a community tool: a public repository, a standard simulation layout, and active use in tutorials and workshops on user simulation.
UXSim: when rules meet language models
With UXSim I wanted to confront a tension that kept coming up in my work.
Traditional simulators are predictable and easy to analyze, but they feel rigid: they can't handle unexpected UI elements or nuanced tasks. LLM-based agents are flexible and can reason about context, but they sometimes drift or hallucinate actions that no real user would take. Neither alone is enough.
UXSim combines both through three design choices. First, an orchestration policy decides, step by step, whether to call a classic component (say, a click model with known parameters) or an LLM module (for tasks that need language understanding). This keeps the simulation grounded where it can be, and flexible where it needs to be.
Second, a cognitive agent that sees the interface. Instead of operating on abstract data structures, the agent receives a simplified view of the actual UI (results, snippets, buttons) and issues actions like "click result 3" or "scroll down." These map to real interactions on the target system.
Third, a web interface for running simulations. Others can configure scenarios, launch runs, and inspect traces without writing code. This lowers the barrier for researchers who want to use simulation but don't want to maintain a Python environment.
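The orchestration idea from the first design choice can be sketched in a few lines. The routing rule and the stubbed LLM call below are hypothetical placeholders, not UXSim's actual policy; they only show the shape of the decision.

```python
# Hypothetical sketch of per-step orchestration between a rule-based
# component and an LLM module; not UXSim's real policy.

def rule_based_click(rank):
    """Classic component: click decision from known parameters."""
    return rank == 0  # deterministic toy rule: click only the top result

def llm_module(prompt):
    """Stand-in for an LLM call; a real system would query a model here."""
    return f"[LLM decision for: {prompt}]"

def orchestrate(step):
    """Route each step: grounded where possible, flexible where needed."""
    if step["kind"] == "click":
        # Well-understood behavior with known parameters: stay grounded.
        return rule_based_click(step["rank"])
    # Nuanced tasks that need language understanding: defer to the LLM.
    return llm_module(step["context"])

decisions = [
    orchestrate({"kind": "click", "rank": 0}),
    orchestrate({"kind": "interpret_task", "context": "find a review article"}),
]
```

The payoff is analyzability: every step that goes through the classic branch is reproducible with known parameters, and only the steps that genuinely need language understanding carry LLM variability.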
The progression
These three frameworks form a deliberate progression. SimIIR 2.0 provides controlled, modular simulation with known user models, the reliable baseline. SimIIR 3 extends that baseline into conversation and modern retrieval stacks. UXSim explores what happens when you let an LLM agent share the wheel with classical components.
The thread connecting them is making it possible to stress-test search systems at the session level: cheaply, reproducibly, and with increasing realism.
Citation
If you found this work useful in your own research, please consider citing the following.
@inproceedings{zerhoudi2022simiir2,
  title     = {The SimIIR 2.0 Framework: User Types, Markov Model-Based Interaction Simulation, and Advanced Query Generation},
  author    = {Zerhoudi, Saber and Günther, Sebastian and Plassmeier, Kim and Borst, Timo and Seifert, Christin and Hagen, Matthias and Granitzer, Michael},
  booktitle = {Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22)},
  year      = {2022},
  doi       = {10.1145/3511808.3557711}
}

@inproceedings{azzopardi2024simiir3,
  title     = {SimIIR 3: A Framework for the Simulation of Interactive and Conversational Information Retrieval},
  author    = {Azzopardi, Leif and Breuer, Timo and Engelmann, Björn and Kreutz, Christin and MacAvaney, Sean and Maxwell, David and Parry, Andrew and Roegiest, Adam and Wang, Xi and Zerhoudi, Saber},
  booktitle = {Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '24)},
  year      = {2024},
  doi       = {10.1145/3673791.3698427}
}

@inproceedings{zerhoudi2025uxsim,
  title     = {UXSim: Towards a Hybrid User Search Simulation},
  author    = {Zerhoudi, Saber and Granitzer, Michael},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  doi       = {10.1145/3746252.3761640}
}