How Sagentum Assesses MCP Servers

Methodology version 1.0 — published March 2026

Why MCP servers need a different kind of assessment

MCP servers are not generic APIs. They are tools that AI agents call in autonomous loops, often without human oversight between calls. The failure modes are different from, and more consequential than, those of a standard API integration.

An agent calling a server with an undocumented destructive side effect may irreversibly delete data. An agent that cannot parse a server's error response may retry indefinitely, cascading failures across a workflow. A server whose tool descriptions are written for humans — not agents — causes the model to misuse or mis-select the tool entirely. A server leaking credentials in response headers compromises the agent's entire environment.

None of these failure modes are captured by existing registries. Star counts, install counts, and uptime monitoring tell you nothing about whether a server is safe to run in an autonomous agent loop. Sagentum's 8-dimension standard was built specifically for this context.

The 8 Assessment Dimensions

1. Tool Description Quality

Weight: 1.5×

Can an AI agent understand what this tool does, when to use it, and what it will receive back — without human interpretation?

Pass: Tool descriptions use precise functional language written for agent consumption. Parameters described with types and constraints. Return values documented with schema. Side effects explicitly stated.
Partial: Tool descriptions exist but use vague or marketing language. Some parameters undescribed. Return schema partially documented.
Fail: No tool descriptions, or descriptions so vague an agent cannot reliably determine when to call the tool.
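
A minimal sketch of what a Pass-level tool definition might look like. The tool name, fields, and wording here are hypothetical illustrations, not taken from any real server: note the functional description, the typed and constrained parameter, the documented return, and the explicitly stated side effect.

```python
# Hypothetical MCP tool definition illustrating Pass-level description quality.
archive_invoice_tool = {
    "name": "archive_invoice",
    "description": (
        "Moves a single invoice to the archive. "
        "Side effect: the invoice becomes read-only and disappears from active listings. "
        "Returns the archived invoice's id and an ISO 8601 archive timestamp."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Invoice identifier, format INV-NNNNNN.",
            },
        },
        "required": ["invoice_id"],
    },
}
```

A Partial-grade version of the same tool would say something like "Easily manage your invoices!" and leave `invoice_id` undescribed.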

2. Behavioural Consistency

Does the server return predictable, structured output every time?

Pass: Consistent response schema across all test calls. No schema drift. Machine-parseable without custom per-response logic.
Partial: Mostly consistent with occasional schema variations. Some fields optional without documentation.
Fail: Schema varies between calls. Field names inconsistent. Responses require custom parsing logic.
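
The drift check above can be sketched as follows. This is a simplified illustration, not the real harness: the "signature" here is just sorted (field name, value type) pairs, whereas a fuller check would validate each response against a declared JSON Schema.

```python
# Crude schema fingerprint: sorted (field, type) pairs per response.
def schema_signature(response: dict) -> tuple:
    return tuple(sorted((k, type(v).__name__) for k, v in response.items()))

def is_consistent(responses: list) -> bool:
    """True when every response in the sample shares one signature (no drift)."""
    return len({schema_signature(r) for r in responses}) == 1
```

Two responses with the same fields but different values pass; a renamed or dropped field fails.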

3. Error Handling & Agent Recovery

When something goes wrong, does the server tell the agent what happened and whether to retry?

Pass: MCP-compliant error responses on all error conditions. Clear distinction between retryable and non-retryable errors. Machine-readable error codes. No silent failures.
Partial: Some errors return correct codes. Other conditions produce generic responses.
Fail: Generic error responses. No machine-readable error codes. Silent failures on some error conditions.
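
A sketch of the shape a Pass-level error response takes, assuming a JSON-RPC 2.0 transport (which MCP uses). The application codes, the retryable set, and the helper itself are illustrative assumptions, not a prescribed format: what matters is that the agent can read a stable code and an explicit retry signal.

```python
# Application-level codes this hypothetical server treats as retryable.
RETRYABLE = {"RATE_LIMITED", "UPSTREAM_TIMEOUT"}

def error_response(request_id, code: int, app_code: str, message: str) -> dict:
    """Build a JSON-RPC 2.0 error object with a machine-readable app code
    and an explicit retryable flag for the calling agent."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {
            "code": code,  # JSON-RPC code, e.g. -32602 for invalid params
            "message": message,
            "data": {
                "appCode": app_code,
                "retryable": app_code in RETRYABLE,
            },
        },
    }
```

An agent receiving `retryable: false` stops immediately instead of looping, which is exactly the cascading-failure mode described earlier.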

4. Security Posture

Does the server handle credentials and permissions safely?

Pass: No static secrets in repository. Proper credential handling. Minimal permission scope. Input sanitisation evident.
Partial: Generally safe with specific gaps. Some over-permissioning. Minor issues in static analysis.
Fail: Static secrets found in repository. Credentials passed in URLs. Over-permissioned tool definitions.
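
Two of the Fail conditions lend themselves to simple static checks. The patterns below are deliberately simplified illustrations of the idea, not the actual analysis rules: real secret scanning uses far broader pattern sets and entropy analysis.

```python
import re

# Simplified example patterns for two Fail conditions.
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|secret|token)\s*=\s*['\"][A-Za-z0-9]{16,}['\"]", re.I
)
CRED_IN_URL = re.compile(r"https?://[^/\s]*:[^@/\s]+@")  # user:pass@host

def flags(source: str) -> list:
    """Return the Fail conditions found in a source snippet."""
    found = []
    if SECRET_PATTERN.search(source):
        found.append("static secret")
    if CRED_IN_URL.search(source):
        found.append("credential in URL")
    return found
```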

5. Idempotency & Agent Safety

Weight: 1.5×

Is it safe for an agent to call this server multiple times with the same parameters? Are destructive operations clearly labelled?

Pass: Read-only tools annotated readOnlyHint: true. Destructive tools annotated destructiveHint: true. Repeat calls produce consistent results.
Partial: Some annotations present. Most tools safe but annotation incomplete.
Fail: No idempotency annotations. Destructive operations unlabelled. Repeat calls produce inconsistent results.
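
The annotation pattern the rubric looks for, sketched with hypothetical tool names. The `readOnlyHint` and `destructiveHint` fields are real MCP tool annotations; the checker function is an illustrative assumption showing how an unlabelled destructive tool would surface.

```python
# Hypothetical tool list with the annotations this dimension grades.
tools = [
    {
        "name": "list_invoices",
        "annotations": {"readOnlyHint": True, "idempotentHint": True},
    },
    {
        "name": "delete_invoice",
        "annotations": {"readOnlyHint": False, "destructiveHint": True},
    },
]

def unlabelled_destructive(tools: list) -> list:
    """Tools neither marked read-only nor labelled destructive:
    exactly the gap that drags this dimension toward Fail."""
    return [
        t["name"]
        for t in tools
        if not t.get("annotations", {}).get("readOnlyHint")
        and "destructiveHint" not in t.get("annotations", {})
    ]
```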

6. Documentation Quality

Is the server documented well enough that a developer can integrate it without human interpretation?

Pass: Complete tool reference with parameter types. Authentication documented end-to-end. Example calls provided. Troubleshooting guidance present.
Partial: Core tools documented. Edge cases underdocumented. Authentication setup requires trial and error.
Fail: No documentation beyond basic README. Parameter types undocumented. No examples.

7. Reliability Signal

Is there evidence the server is maintained and will continue to be?

Pass: Active commits in past 90 days. Issues responded to promptly. Changelog maintained.
Partial: Maintained but infrequently. No status page. Changelog absent.
Fail: Last commit >6 months ago. Issues unresponded. No maintenance signal.

8. Deployment Accessibility

Can a developer and their agent access this server without friction?

Pass: Remote (hosted) deployment available. Programmatic authentication. Free tier available.
Partial: Remote deployment available but requires significant setup. Or intentionally local-only for architectural reasons with good documentation.
Fail: No remote option and local setup is unnecessarily complex. Authentication requires human intervention.

Scoring

Dimensions 1 and 5 are weighted 1.5× — they are the most consequential for autonomous agent use. All other dimensions are weighted 1.0×. Not Tested dimensions are excluded from both numerator and denominator.

Pass = 1.0  |  Partial = 0.5  |  Fail = 0.0

Score = (Σ weight × value) ÷ (Σ weight) × 100
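
The formula above can be sketched directly. This mirrors the published weights and values; the function name and input shape are illustrative.

```python
def sagentum_score(grades: dict) -> float:
    """Weighted score per the formula above.

    grades maps dimension number (1..8) to "pass", "partial", "fail",
    or None for Not Tested. Dimensions 1 and 5 carry weight 1.5.
    """
    value = {"pass": 1.0, "partial": 0.5, "fail": 0.0}
    numerator = denominator = 0.0
    for dim, grade in grades.items():
        if grade is None:  # Not Tested: excluded from both sums
            continue
        weight = 1.5 if dim in (1, 5) else 1.0
        numerator += weight * value[grade]
        denominator += weight
    return round(numerator / denominator * 100, 1)
```

For example, Pass on dimensions 1 and 5 with Partial everywhere else gives (1.5 + 1.5 + 6 × 0.5) ÷ 9 × 100 ≈ 66.7, landing in the Suitable tier.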

Score ceilings by assessment type

Documentation Only: max 75
Documentation + Static Analysis: max 80
Full (live tested): no ceiling

Certification tiers

✅ Production Ready: 85+, Dimensions 1 and 5 both Pass, Full assessment required
✓ Suitable: 65–84
⚠ Use With Caution: 45–64
✕ Not Recommended: <45

Live Testing

The test harness makes a maximum of 15 calls per server per assessment using User-Agent Sagentum/1.0 (+https://sagentum.com/testing-policy). Server developers can opt out by emailing testing-opt-out@sagentum.com. Opting out is not penalised.

  1. Standard valid tool call — validate response schema
  2. Repeat identical call — validate idempotency
  3. Malformed parameter — validate error response structure
  4. Missing required parameter — validate parameter validation
  5. Invalid authentication — validate auth error response
  6. Response header inspection — check for credential leakage
  7. Rapid sequential calls — validate rate limiting
  8. Read-only idempotency — confirm no side effects on repeat calls
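
The eight probes above can be expressed as a simple call plan for the harness to iterate over. Only the call budget and User-Agent string come from the policy stated above; the probe identifiers are illustrative.

```python
# Call plan sketch; budget and User-Agent are from the published policy.
MAX_CALLS = 15
USER_AGENT = "Sagentum/1.0 (+https://sagentum.com/testing-policy)"

PROBES = [
    ("valid_call", "validate response schema"),
    ("repeat_call", "validate idempotency"),
    ("malformed_param", "validate error response structure"),
    ("missing_param", "validate parameter validation"),
    ("invalid_auth", "validate auth error response"),
    ("header_inspection", "check for credential leakage"),
    ("rapid_calls", "validate rate limiting"),
    ("readonly_repeat", "confirm no side effects on repeat calls"),
]

# The plan fits comfortably inside the per-assessment budget, leaving
# headroom for the repeat and rapid-call probes to make extra requests.
assert len(PROBES) <= MAX_CALLS
```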

Disputes

If a published score is incorrect, submit specific counter-evidence — a quote from documentation that contradicts the assessment, or a test result showing different behaviour. Disputes without specific evidence are acknowledged but do not trigger re-assessment. Valid disputes are reviewed within 48 hours and resolved with a public changelog note.

Disputes are not negotiations. A score changes only when new evidence changes what the dimension criteria require.

What Sagentum Does Not Do

  • Does not host MCP servers
  • Does not charge for placement that affects scores
  • Does not rank servers by payment
  • Does not accept investment from MCP server hosts, registries, or AI model providers
  • Does not produce favourable assessments for paying vendors — the certification fee pays for the assessment process, not the result

Rubric versioning

The scoring rubric is versioned. Every assessment record carries the rubric version used to produce it. The rubric may only be amended quarterly. Prior versions are permanently documented — old assessments are never retroactively changed.

Current rubric version: 1.0