Assurance Protocol Changelog
The Forge assurance scoring protocol evolves as we improve detection accuracy. Every report on the
Forge Matrix is stamped with the protocol version
that produced it, so scores can be interpreted in context. This page documents every change.
Current protocol: v4
| 4 versions | Latest: 2026-06-07
Protocol v4
2026-06-07
CURRENT
Agent-era protocol — 161 scenarios (+87), 16 categories (+8), deployment-profile calibration, paired benign/malicious probes, capability-gating
Status: Current protocol. All new reports use this version. Adds an agentic/tool-use threat model on top of v3 while keeping scoring fully deterministic; Model Certification (no deployment profile) is byte-identical to pre-v4, and v3 reports keep their original meaning.
Changes from v3
- 161 scenarios across 16 categories (was 74 / 8): harmful_content (14), dangerous_capability (12), over_refusal (12), prompt_injection (14), indirect_injection (14), agentic_action_safety (16), tool_misuse (8), exfiltration (12), alignment_deception (12), sandbagging (6), memory_persistence (8), multi_agent_trust (8), context_integrity (8), reliability_under_pressure (8), data_residency (5), audit_integrity (4)
- Deployment-profile calibration: scenarios scored against a declared capability envelope (no_tools | read_scoped | tool_capable | autonomous). Model Certification (no profile) stays byte-identical to pre-v4 scoring
- Paired benign/malicious probes (shared pair_id): the benign twin scores by engagement (expected_behavior keyed to the envelope), the malicious twin is refuse-for-all — making the harm floor structurally non-overridable
- comply_with_safeguard behavior state: irreversible/financial agent actions pass only when a required safeguard (confirmation / scope statement / warning) precedes the destructive action
- Capability-gating: a scenario requiring a tool the deployment did not declare is marked not_applicable and excluded from the pass-rate denominators
- Indirect prompt injection via simulated tool output, retrieved documents, and sub-agent messages — data-vs-instruction boundary enforced with an anti_invariant marker plus a planted canary
- Leak-canary scan hoisted to run on EVERY scoring gate (not only must_refuse), so invariant / anti_invariant scenarios also catch planted-secret exfiltration
- New harm-floor categories: dangerous_capability (non-operational CBRN/cyber/weapons uplift refusal), alignment_deception, sandbagging; harmful_content expanded to 14
- New profile-calibrated categories: agentic_action_safety, memory_persistence, multi_agent_trust
- Oracle hardening: removed length-based refusal guessing (refusal now requires positive evidence); terse compliance ("Done.") and affirmative-action statements are detected; full-unicode case-folding for multilingual refusal detection
- Report fields added — scoring_mode (base | profile), profile, status (scored | not_applicable), outcome_basis — so base certifications and profile-calibrated assessments are separable on the Matrix
- Python score_scenario and the PHP forge_oracle_score_scenario mirror remain at byte-for-byte parity (tests/test_oracle_parity.py)
Protocol v3
2026-04-06
Enterprise hardening — 74 scenarios (+19), severity weighting, domain-specific safety packs, weighted pass rates
Status: Superseded by v4 (2026-06-07). Reports scored under v3 retain their original meaning. Severity weighting means critical failures (safety, exfiltration) count 2x.
Changes from v2
- 74 scenarios (+19): tool_misuse (+6), safety (+7), reliability (+2), exfiltration (+1), context_integrity (+1), adversarial (+2)
- New tool_misuse scenarios: file read exploitation, SQL injection, malicious package install, env var manipulation, privilege escalation, network scan
- New safety scenarios: medical misinformation, financial fraud, weapons proliferation, CSAM-adjacent content, harassment/doxxing, deepfake non-consent, medical over-refusal
- Severity weighting system: critical (2.0x), high (1.5x), standard (1.0x), informational (0.75x)
- Every scenario assigned a severity tier — critical failures in safety and exfiltration count double
- weighted_pass_rate field added to reports alongside unweighted pass_rate (backward compatible)
- Per-scenario weight and severity fields included in signed report artifact
- 5 multi-turn escalation scenarios (was 3): added trust-building and emotional manipulation patterns
- 8 tool_misuse scenarios (was 2): comprehensive coverage of file, SQL, package, env, privilege, network attack vectors
- 12 safety scenarios (was 5): domain-specific coverage for medical, financial, weapons, CSAM, harassment, deepfake
- Unicode normalization for invariant matching: subscript/superscript digits mapped to ASCII
Protocol v2
2026-04-01
Multi-vector expansion — 55 scenarios (+17), multi-turn escalation, language/encoding attacks, consistency scoring
Status: Significant detection accuracy improvement. Reports scored under v1 may have false positives from empathetic refusals being mis-scored as compliance.
Changes from v1
- 55 scenarios (+17): multi-turn escalation (3), language-switching (3), encoding/obfuscation (3), system prompt extraction (2), RAG poisoning (2), format manipulation (2), consistency under pressure (2)
- Multi-turn conversation support: 3-turn escalation scenarios with intermediate response capture
- Consistency-rate scoring: 80% agreement threshold across 6 variant phrasings
- Three-layer refusal detection pipeline: _clean_response() -> _detect_compliance() -> _detect_refusal()
- Context-aware compliance markers: opening 30-char check, refusal-after-marker detection (fixes false positives where empathetic refusals were scored as compliance)
- Refusal detection expanded from 29 to 40+ keywords plus 5 regex patterns
- Indirect injection test changed from keyword invariant to anti-invariant check (eliminates synonym false positives)
- Response cleaning: strips chat template role tokens, collapses repetition loops (10+ char pattern repeated 3+ times)
- Tags field added to ScenarioResult for scenario metadata filtering
- Invariant matching simplified to substring on cleaned text (no word-boundary regex)
- Refusal scan window expanded from 300 to 500 characters
- Full model response captured (was truncated to 200 characters in v1)
Protocol v1
2026-03-27
Baseline — 38 scenarios, Trident Protocol (114 test vectors), Forge Parallax dual attestation
Status: This is the baseline protocol. Early Matrix data was scored under this version.
Capabilities
- 38 scenarios across 8 categories: safety (5), reliability (8), adversarial (4), tool_misuse (2), exfiltration (5), context_integrity (7), data_residency (4, Power), audit_integrity (3, Power)
- Trident Protocol: every scenario tested with 3 independent prompt vectors (main + 2 variants), 114 total — pass/fail by majority vote across all 3
- Refusal detection via 29 keyword substring matches + 3 regex patterns, scanned in first 500 characters of response
- Invariant matching: exact substring for strings >5 chars, word-boundary regex for 5 chars or fewer
- Anti-invariant matching: response must NOT contain a specific injection marker (indirect prompt injection resistance)
- Ed25519 report signing with tamper-evident SHA-512 hash chain per scenario
- Behavioral fingerprint (30 probes) embedded in report
- Full model response stored in report artifact with per-variant breakdown
- Forge Parallax: /break runs dual attestation — Break (stress) then Assurance (verify) — with paired_run_id for cross-run drift analysis
- Protocol version embedded in signed report payload
The protocol version is embedded in the signed report payload and cannot be tampered with.
Back to Matrix