
MAESTRO

Benchmarking Multi-Agent Systems with Transparent Telemetry
Project page and reproducible artifacts

Demo video: MAESTRO benchmarking workflow and result inspection.

Abstract

MAESTRO is a unified evaluation suite for multi-agent systems. It helps researchers and practitioners test how different backend LLMs and agent architectures affect reliability, cost, latency, and overall task quality under reproducible settings. By standardizing telemetry collection and analysis, MAESTRO makes it easier to compare systems fairly, diagnose failure modes, and understand where performance gains are real versus unstable across tasks.

How MAESTRO Works

MAESTRO evaluates multi-agent systems with a reproducible pipeline from trace collection to paper-grade analysis plots. It aligns runs across models and agent architectures so cost, duration, and accuracy can be compared consistently.

The architecture below summarizes this flow: collect telemetry, normalize into a shared schema, compute comparable metrics, and publish interpretable visual outputs.

MAESTRO architecture overview
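The flow above can be sketched in a few lines. This is a minimal illustration with a hypothetical trace schema and field names (not MAESTRO's actual API): raw backend-specific traces are normalized into one shared record type, and comparable aggregate metrics are computed over aligned runs.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical shared schema for one evaluated run; the field names
# are illustrative, not MAESTRO's actual trace format.
@dataclass
class RunRecord:
    model: str
    architecture: str
    cost_usd: float
    duration_s: float
    correct: bool

def normalize(raw: dict) -> RunRecord:
    """Map one raw, backend-specific trace into the shared schema."""
    return RunRecord(
        model=raw["model"],
        architecture=raw["arch"],
        cost_usd=raw["prompt_tokens"] * raw["price_in"]
                 + raw["completion_tokens"] * raw["price_out"],
        duration_s=raw["end_ts"] - raw["start_ts"],
        correct=raw["score"] >= 1.0,
    )

def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate comparable metrics across a set of aligned runs."""
    return {
        "accuracy": mean(r.correct for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_duration_s": mean(r.duration_s for r in runs),
    }
```

Because every backend and architecture is mapped into the same record before aggregation, cost, duration, and accuracy become directly comparable across setups.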

Changing the Backend LLM Increases Cost, with Mixed Performance Gains

As model capability increases, cost usually increases as well, but task duration does not always decrease. Weaker models can have lower per-token prices, yet often need more rounds to finish a task. Stronger models tend to improve accuracy overall, but the gains show high variance: depending on query type and agent architecture, they can shrink or even turn negative.
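The rounds effect means per-token price alone does not determine total cost. A small sketch with made-up prices and round counts (all numbers are illustrative assumptions, not measured MAESTRO results):

```python
def run_cost(rounds: int, tokens_per_round: int, price_per_1k: float) -> float:
    """Total token cost for a task that takes `rounds` agent turns."""
    return rounds * tokens_per_round * price_per_1k / 1000

# Hypothetical setup: the weaker model is 5x cheaper per token,
# but needs over 6x the rounds to finish, so it ends up costing more.
strong = run_cost(rounds=4, tokens_per_round=2000, price_per_1k=0.01)
weak = run_cost(rounds=25, tokens_per_round=2000, price_per_1k=0.002)
```

Here `strong` comes to $0.08 and `weak` to $0.10: the cheaper model is more expensive per task once its extra rounds are counted.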

Changing Agent Architecture Shifts Reliability and Resource Patterns

Architecture choice (Plan-and-Execute, CRAG, LATS) also changes how reliably tasks are solved and how consistently cost and duration behave. Different architectures can reach similar average accuracy while varying along different resource dimensions: for example, CRAG varies mainly in speed, while LATS varies mainly in cost. Architecture choice is therefore a key lever for optimizing specific performance dimensions, and different architectures suit different application contexts depending on whether latency or cost is the primary concern.
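One way to make this pattern concrete is to compare the spread of each metric per architecture. A sketch with invented per-run numbers (the data and the CRAG/LATS contrast are illustrative, not MAESTRO's measurements) using the coefficient of variation as a unit-free spread measure:

```python
from statistics import mean, pstdev

# Hypothetical per-run measurements for two architectures.
runs = {
    "crag": {"duration_s": [12.0, 30.0, 9.0, 25.0],
             "cost_usd": [0.05, 0.06, 0.05, 0.06]},
    "lats": {"duration_s": [18.0, 20.0, 19.0, 21.0],
             "cost_usd": [0.02, 0.15, 0.04, 0.12]},
}

def variability(xs: list[float]) -> float:
    """Coefficient of variation (std / mean): unit-free spread."""
    return pstdev(xs) / mean(xs)

for arch, metrics in runs.items():
    print(arch,
          "duration CV:", round(variability(metrics["duration_s"]), 2),
          "cost CV:", round(variability(metrics["cost_usd"]), 2))
```

In this toy data, CRAG's duration varies far more than its cost, while LATS shows the opposite, mirroring the qualitative pattern described above.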

Web Search Helps Selectively, with Latency-Cost Tradeoffs

Turning web search on does not help every setting equally. The same tool can increase latency and cost in some model-architecture combinations while improving outcomes in others. Accuracy can improve or regress depending on workload behavior and agent decision paths.
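Quantifying this requires paired runs: the same query set evaluated with and without the tool, then differenced per setup. A minimal sketch with hypothetical numbers and field names:

```python
# Hypothetical paired summaries for one model-architecture setup,
# evaluated on the same query set with and without web search.
def search_delta(baseline: dict, with_search: dict) -> dict:
    """Per-metric change when web search is enabled."""
    return {k: with_search[k] - baseline[k] for k in baseline}

baseline = {"accuracy": 0.62, "cost_usd": 0.04, "duration_s": 14.0}
with_search = {"accuracy": 0.70, "cost_usd": 0.07, "duration_s": 21.0}

delta = search_delta(baseline, with_search)
```

In this invented example, accuracy improves while both cost and duration increase; in other setups the accuracy delta can be near zero or negative while the overhead remains, which is exactly the selective behavior the plots below examine.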


Latency-Cost Shift with Web Search

What this shows: how enabling Tavily shifts each setup in speed-vs-cost space. The shift is architecture-dependent: some architectures move mostly along cost, some mostly along duration, and some along both, highlighting different operational tradeoffs.


Accuracy Change with Web Search

What this shows: accuracy change after turning web search on. Gains are not uniform across architectures or models, and in some settings the change can be small or negative.


BibTeX

@misc{maestro,
      title={MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability},
      author={Tie Ma and Yixi Chen and Vaastav Anand and Alessandro Cornacchia and Amândio R. Faustino and Guanheng Liu and Shan Zhang and Hongbin Luo and Suhaib A. Fahmy and Zafar A. Qazi and Marco Canini},
      year={2026},
      eprint={2601.00481},
      archivePrefix={arXiv},
      primaryClass={cs.NI},
      url={https://arxiv.org/abs/2601.00481},
}