CASE FILE FRG-DA-0001 SUBJECT a deployed model PROTOCOL 161 scenarios STATUS REGRESSION CONFIRMED

The model passed.
The deployment didn't.

Every AI model has a public safety score. It measures the model alone, with no system prompt and no tools. Nobody ships the model alone. Here is what changed the moment we wrapped one inside a real product.

EXHIBIT A

Control / raw model

EXHIBIT B

Subject / deployed

FAIL

Audit your deployment // $3,000 Read the evidence

Finding // the public score is the floor

You quote a number
that was never
about your product.

The score in a public leaderboard measures the raw model on its own. It is a real floor. It is not what your users hit.

You give the model a role, a system prompt, tools, and access to your data. That wrapper changes how it behaves under pressure. A model that refuses a bad request on its own can comply once it believes it is your helpful in-product assistant.

Control is the model alone. No prompt, no tools.

Subject is the model plus your config. Different behavior.

Untested is your version, until now.

Score on file vs score that matters

Baseline // publishedon record

Your deploymentnever measured

The measured gap is logged below. On a real model it was not small.

Evidence // same model, two configurations

The evidence

We did not argue this. We logged it. One model (the subject) was run through the same protocol twice. Once bare. Once inside Forge-NC, our own shipped coding agent, with its real system prompt and tools. Only the deployment layer changed.

REGRESSION

Where it changed

Category	Control	Deployed	Delta

Logged exhibits

Each exhibit is one test the control passed and the subject failed. Left is the bare model. Right is the same model inside the deployment. The text is what the model actually said, taken from the signed report.

Engagement // start free, audit when it counts

Free

Matrix lookup

Baseline score per model
Category breakdown
Signed community reports

Open the Matrix

Primary engagement

$3,000

Deployment Assessment

Independent and Origin-signed, logged in a public transparency record
Tests your live endpoint as it runs, your system prompt and your tools, not a local copy
The same 161-scenario protocol, read against your actual use case
A clear account of which failures are real and what to fix before launch
Proves the test ran without exposing your system prompt

Get a Deployment Assessment

$7,500
and $30,000

Model certification

For teams that build models
Startup and enterprise tiers
Independent, signed, public

Talk to us

Straight answers

Questions you should ask

Can I not just run this myself for free?

You can run the free Forge tool against a model and read its baseline. What you cannot self-issue is an independent, signed attestation that your own customers, auditors, and insurers will accept. A vendor grading its own homework is not evidence. A Deployment Assessment is run by a neutral third party, signed with our Origin key, and recorded in a public transparency log. It also tests your live endpoint as it actually runs, with your system prompt and your tools, not a local copy you loaded, and it comes with a plain read of which failures are real and what to fix.

Isn't it unfair to score a coding agent for using its tools?

It would be, so we account for it. A coding agent's job is to run commands and install packages. We capture your use case at intake and judge the findings against it, so doing the job is not counted as a failure, and we are extending the automated scoring to grade against your declared use case. What never earns a pass, for any deployment, is the dangerous core: leaking secrets, exfiltrating user data, or running clearly destructive actions. Capability is calibrated. Harm is not.

How do I know the result is real, and not just your word for it?

We don't ask you to take our word, we hand you the receipts. The scoring is mechanical, not an opinion: for every test the report shows the exact prompt we sent, your model's actual response, and the concrete rule that decided pass or fail, for example whether your model emitted a planted secret word for word. You read the response and check the verdict yourself. And because the same rule on the same response always gives the same result, you or anyone else can re-run the published test set against your endpoint and reproduce the score. You re-derive the result, you don't believe it.

see a full signed report verify a report yourself how an assessment runs the scoring methodology

Chain of custody // for the technical buyer

Signed. Sealed.
Independently checkable.

Ed25519 signed

Every report is signed with an Ed25519 key. The signature covers the model identity and every scenario result.

Tamper evident

Scenario results are linked in a hash chain. A single altered result breaks the chain.

No account needed

Anyone with the public key and the report can verify the signature. You do not need Forge to check our work.

One protocol

The same 161-scenario suite across 16 categories scores the baseline and your deployment, so the two numbers are directly comparable.

verify a report offline open a real signed report scoring methodology protocol changelog

The model passed.
The deployment didn't.

I deploy a model in my product

I build or release models

You quote a number
that was never
about your product.

The evidence

Where it changed

Logged exhibits

Matrix lookup

Deployment Assessment

Model certification

Questions you should ask

Signed. Sealed.
Independently checkable.

Ed25519 signed

Tamper evident

No account needed

One protocol

The model passed. The deployment didn't.

I deploy a model in my product

I build or release models

You quote a numberthat was neverabout your product.

The evidence

Where it changed

Logged exhibits

Matrix lookup

Deployment Assessment

Model certification

Questions you should ask

Signed. Sealed.Independently checkable.

Ed25519 signed

Tamper evident

No account needed

One protocol

The model passed.
The deployment didn't.

You quote a number
that was never
about your product.

Signed. Sealed.
Independently checkable.