AppSelectBench: Application-Level Tool Selection Benchmark

Tianyi Chen; Michael Solodko; Sen Wang; Jongwoo Ko; Junheng Hao; Colby Banbury; Sara Abdali; Saeed Amizadeh; Qing Xiao; Yinheng Li; Tianyu Ding; Kamran Ghasedi Dizaji; Suzhen Zheng; Hao Fan; Justin Wagle; Pashmina Cameron; Kazuhito Koishida

Computation and Language 2025-12-01 v2

Abstract

Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, that is, deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://microsoft.github.io/appselectbench/.
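To make the evaluation setup concrete, the following is a minimal sketch of how a random baseline and a keyword heuristic baseline for application selection might be scored by accuracy. It is an illustration only, not the benchmark's code: the task strings, application names, keyword lists, and function names (`random_selector`, `keyword_selector`, `accuracy`) are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy data: (user task, gold application). Not taken from the benchmark.
TASKS: List[Tuple[str, str]] = [
    ("Trim the first ten seconds of my vacation video", "video_editor"),
    ("Sum the expenses column in my budget sheet", "spreadsheet"),
    ("Reply to the email from my manager", "email_client"),
    ("Crop the photo and adjust its brightness", "image_editor"),
]
APPS: List[str] = ["video_editor", "spreadsheet", "email_client", "image_editor"]

# Hypothetical keyword lists used by the heuristic baseline.
KEYWORDS: Dict[str, List[str]] = {
    "video_editor": ["video", "trim", "clip"],
    "spreadsheet": ["sheet", "column", "sum"],
    "email_client": ["email", "reply", "inbox"],
    "image_editor": ["photo", "crop", "brightness"],
}


def random_selector(task: str, apps: Sequence[str], rng: random.Random) -> str:
    """Random baseline: pick an application uniformly, ignoring the task."""
    return rng.choice(apps)


def keyword_selector(task: str, apps: Sequence[str]) -> str:
    """Heuristic baseline: pick the app whose keywords occur most often
    as substrings of the lowercased task (ties go to the earliest app)."""
    text = task.lower()
    scores = {app: sum(kw in text for kw in KEYWORDS.get(app, [])) for app in apps}
    return max(apps, key=lambda app: scores[app])


def accuracy(selector: Callable[[str], str], tasks: Sequence[Tuple[str, str]]) -> float:
    """Fraction of tasks for which the selected app equals the gold app."""
    correct = sum(selector(task) == gold for task, gold in tasks)
    return correct / len(tasks)


rng = random.Random(0)  # fixed seed so the random baseline is reproducible
random_acc = accuracy(lambda t: random_selector(t, APPS, rng), TASKS)
keyword_acc = accuracy(lambda t: keyword_selector(t, APPS), TASKS)
print(f"random: {random_acc:.2f}  keyword: {keyword_acc:.2f}")
```

A zero-shot, few-shot, or retrieval-augmented model could be plugged in under the same assumptions by passing any function that maps a task string to an application name as `selector`.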

Keywords

benchmarking; graphical user interface; multi-agent systems

Cite

@article{arxiv.2511.19957,
  title  = {AppSelectBench: Application-Level Tool Selection Benchmark},
  author = {Tianyi Chen and Michael Solodko and Sen Wang and Jongwoo Ko and Junheng Hao and Colby Banbury and Sara Abdali and Saeed Amizadeh and Qing Xiao and Yinheng Li and Tianyu Ding and Kamran Ghasedi Dizaji and Suzhen Zheng and Hao Fan and Justin Wagle and Pashmina Cameron and Kazuhito Koishida},
  journal= {arXiv preprint arXiv:2511.19957},
  year   = {2025}
}
