Code execution servers are the most consequential tools an agent can call: arbitrary code running in an unconstrained environment can produce side effects that are hard to bound or reverse. The critical assessment question is whether execution is genuinely sandboxed or the server merely claims that it is. Idempotency annotation quality (Dimension 5) is the most differentiating signal in this category: servers that clearly annotate which execution calls are destructive and which are safe to retry are markedly safer for autonomous agent use.
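To illustrate why these annotations matter, here is a minimal sketch of how an agent runtime might turn them into a retry decision. The `ToolAnnotations` class, its field names, and the `retry_policy` function are hypothetical and chosen for illustration only; they are not taken from any specific server or from the scoring method used here. The defaults assume the conservative reading that an unannotated tool is destructive and not idempotent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolAnnotations:
    """Hypothetical annotation set; names are assumptions for illustration."""
    read_only: bool = False    # the call does not modify its environment
    destructive: bool = True   # the call may delete or overwrite state
    idempotent: bool = False   # repeating the call has no additional effect


def retry_policy(annotations: ToolAnnotations) -> str:
    """Map annotations to what an autonomous agent may do after a failed call."""
    if annotations.read_only or annotations.idempotent:
        # Repeating the call cannot compound its side effects.
        return "safe_to_retry"
    if annotations.destructive:
        # A repeat could destroy more state, so a human or policy must approve.
        return "needs_confirmation"
    # Additive but not idempotent (e.g. appending output): a retry may duplicate work.
    return "do_not_auto_retry"


if __name__ == "__main__":
    print(retry_policy(ToolAnnotations()))                 # unannotated tool
    print(retry_policy(ToolAnnotations(read_only=True)))   # read-only tool
```

With this sketch, a server that omits annotations forces the agent into the most restrictive path, while a server that annotates accurately lets the agent retry only the calls that are actually safe to repeat.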
No assessed servers in this category yet.
Submit a server for assessment.
How scores are calculated →