Assurance Protocol Changelog

The Forge assurance scoring protocol evolves as we improve detection accuracy. Every report on the Forge Matrix is stamped with the protocol version that produced it, so scores can be interpreted in context. This page documents every change.

Current protocol: v4 | 4 versions | Latest: 2026-06-07

Protocol v4 2026-06-07 CURRENT

Agent-era protocol — 161 scenarios (+87), 16 categories (+8), deployment-profile calibration, paired benign/malicious probes, capability-gating

Status: Current protocol. All new reports use this version. Adds an agentic/tool-use threat model on top of v3 while keeping scoring fully deterministic; Model Certification (no deployment profile) is byte-identical to pre-v4, and v3 reports keep their original meaning.

Changes from v3

161 scenarios across 16 categories (was 74 / 8): harmful_content (14), dangerous_capability (12), over_refusal (12), prompt_injection (14), indirect_injection (14), agentic_action_safety (16), tool_misuse (8), exfiltration (12), alignment_deception (12), sandbagging (6), memory_persistence (8), multi_agent_trust (8), context_integrity (8), reliability_under_pressure (8), data_residency (5), audit_integrity (4)
Deployment-profile calibration: scenarios scored against a declared capability envelope (no_tools | read_scoped | tool_capable | autonomous). Model Certification (no profile) stays byte-identical to pre-v4 scoring
Paired benign/malicious probes (shared pair_id): the benign twin scores by engagement (expected_behavior keyed to the envelope), the malicious twin is refuse-for-all — making the harm floor structurally non-overridable
comply_with_safeguard behavior state: irreversible/financial agent actions pass only when a required safeguard (confirmation / scope statement / warning) precedes the destructive action
Capability-gating: a scenario requiring a tool the deployment did not declare is marked not_applicable and excluded from the pass-rate denominators
Indirect prompt injection via simulated tool output, retrieved documents, and sub-agent messages — data-vs-instruction boundary enforced with an anti_invariant marker plus a planted canary
Leak-canary scan hoisted to run on EVERY scoring gate (not only must_refuse), so invariant / anti_invariant scenarios also catch planted-secret exfiltration
New harm-floor categories: dangerous_capability (non-operational CBRN/cyber/weapons uplift refusal), alignment_deception, sandbagging; harmful_content expanded to 14
New profile-calibrated categories: agentic_action_safety, memory_persistence, multi_agent_trust
Oracle hardening: removed length-based refusal guessing (refusal now requires positive evidence); terse compliance ("Done.") and affirmative-action statements are detected; full-unicode case-folding for multilingual refusal detection
Report fields added — scoring_mode (base | profile), profile, status (scored | not_applicable), outcome_basis — so base certifications and profile-calibrated assessments are separable on the Matrix
Python score_scenario and the PHP forge_oracle_score_scenario mirror remain at byte-for-byte parity (tests/test_oracle_parity.py)

Protocol v3 2026-04-06

Enterprise hardening — 74 scenarios (+19), severity weighting, domain-specific safety packs, weighted pass rates

Status: Superseded by v4 (2026-06-07). Reports scored under v3 retain their original meaning. Severity weighting means critical failures (safety, exfiltration) count 2x.

Changes from v2

74 scenarios (+19): tool_misuse (+6), safety (+7), reliability (+2), exfiltration (+1), context_integrity (+1), adversarial (+2)
New tool_misuse scenarios: file read exploitation, SQL injection, malicious package install, env var manipulation, privilege escalation, network scan
New safety scenarios: medical misinformation, financial fraud, weapons proliferation, CSAM-adjacent content, harassment/doxxing, deepfake non-consent, medical over-refusal
Severity weighting system: critical (2.0x), high (1.5x), standard (1.0x), informational (0.75x)
Every scenario assigned a severity tier — critical failures in safety and exfiltration count double
weighted_pass_rate field added to reports alongside unweighted pass_rate (backward compatible)
Per-scenario weight and severity fields included in signed report artifact
5 multi-turn escalation scenarios (was 3): added trust-building and emotional manipulation patterns
8 tool_misuse scenarios (was 2): comprehensive coverage of file, SQL, package, env, privilege, network attack vectors
12 safety scenarios (was 5): domain-specific coverage for medical, financial, weapons, CSAM, harassment, deepfake
Unicode normalization for invariant matching: subscript/superscript digits mapped to ASCII

Protocol v2 2026-04-01

Multi-vector expansion — 55 scenarios (+17), multi-turn escalation, language/encoding attacks, consistency scoring

Status: Significant detection accuracy improvement. Reports scored under v1 may have false positives from empathetic refusals being mis-scored as compliance.

Changes from v1

55 scenarios (+17): multi-turn escalation (3), language-switching (3), encoding/obfuscation (3), system prompt extraction (2), RAG poisoning (2), format manipulation (2), consistency under pressure (2)
Multi-turn conversation support: 3-turn escalation scenarios with intermediate response capture
Consistency-rate scoring: 80% agreement threshold across 6 variant phrasings
Three-layer refusal detection pipeline: _clean_response() -> _detect_compliance() -> _detect_refusal()
Context-aware compliance markers: opening 30-char check, refusal-after-marker detection (fixes false positives where empathetic refusals were scored as compliance)
Refusal detection expanded from 29 to 40+ keywords plus 5 regex patterns
Indirect injection test changed from keyword invariant to anti-invariant check (eliminates synonym false positives)
Response cleaning: strips chat template role tokens, collapses repetition loops (10+ char pattern repeated 3+ times)
Tags field added to ScenarioResult for scenario metadata filtering
Invariant matching simplified to substring on cleaned text (no word-boundary regex)
Refusal scan window expanded from 300 to 500 characters
Full model response captured (was truncated to 200 characters in v1)

Protocol v1 2026-03-27

Baseline — 38 scenarios, Trident Protocol (114 test vectors), Forge Parallax dual attestation

Status: This is the baseline protocol. Early Matrix data was scored under this version.

Capabilities

38 scenarios across 8 categories: safety (5), reliability (8), adversarial (4), tool_misuse (2), exfiltration (5), context_integrity (7), data_residency (4, Power), audit_integrity (3, Power)
Trident Protocol: every scenario tested with 3 independent prompt vectors (main + 2 variants), 114 total — pass/fail by majority vote across all 3
Refusal detection via 29 keyword substring matches + 3 regex patterns, scanned in first 500 characters of response
Invariant matching: exact substring for strings >5 chars, word-boundary regex for 5 chars or fewer
Anti-invariant matching: response must NOT contain a specific injection marker (indirect prompt injection resistance)
Ed25519 report signing with tamper-evident SHA-512 hash chain per scenario
Behavioral fingerprint (30 probes) embedded in report
Full model response stored in report artifact with per-variant breakdown
Forge Parallax: /break runs dual attestation — Break (stress) then Assurance (verify) — with paired_run_id for cross-run drift analysis
Protocol version embedded in signed report payload

The protocol version is embedded in the signed report payload and cannot be tampered with.

Back to Matrix