Every AI model has a public safety score. It measures the model alone, with no system prompt and no tools. Nobody ships the model alone. Here is what changed the moment we wrapped one inside a real product.
You wrap someone else's model in a system prompt and tools. We audit your actual setup, not the bare model.
Open the Deployment Assessment
You train or publish models and want a signed, independent safety record buyers can trust.
Open the certification tiers
The score on a public leaderboard measures the raw model on its own. It is a real floor. It is not what your users hit.
You give the model a role, a system prompt, tools, and access to your data. That wrapper changes how it behaves under pressure. A model that refuses a bad request on its own can comply once it believes it is your helpful in-product assistant.
The measured gap is logged below. On a real model it was not small.
We did not argue this. We logged it. One model (the subject) was run through the same protocol twice. Once bare. Once inside Forge-NC, our own shipped coding agent, with its real system prompt and tools. Only the deployment layer changed.
| Category | Control | Deployed | Delta |
|---|---|---|---|
Each exhibit is one test the bare control passed and the deployed model failed. Left is the bare model. Right is the same model inside the deployment. The text is what the model actually said, taken from the signed report.
Can I not just run this myself for free?
You can run the free Forge tool against a model and read its baseline. What you cannot self-issue is an independent, signed attestation that your own customers, auditors, and insurers will accept. A vendor grading its own homework is not evidence. A Deployment Assessment is run by a neutral third party, signed with our Origin key, and recorded in a public transparency log. It also tests your live endpoint as it actually runs, with your system prompt and your tools, not a local copy you loaded, and it comes with a plain read of which failures are real and what to fix.
Isn't it unfair to score a coding agent for using its tools?
It would be, so we account for it. A coding agent's job is to run commands and install packages. We capture your use case at intake and judge the findings against it, so doing the job is not counted as a failure. We are also extending the automated scoring to grade against your declared use case. What never earns a pass, for any deployment, is the dangerous core: leaking secrets, exfiltrating user data, or running clearly destructive actions. Capability is calibrated. Harm is not.
How do I know the result is real, and not just your word for it?
We don't ask you to take our word; we hand you the receipts. The scoring is mechanical, not an opinion: for every test the report shows the exact prompt we sent, your model's actual response, and the concrete rule that decided pass or fail, for example whether your model emitted a planted secret word for word. You read the response and check the verdict yourself. And because the same rule on the same response always gives the same result, you or anyone else can re-run the published test set against your endpoint and reproduce the score. You re-derive the result; you don't have to believe it.
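To make "mechanical, not an opinion" concrete, here is a minimal sketch of what a planted-secret rule could look like. It is an illustration only, not the actual Forge scoring code: the `Verdict` type, the `check_planted_secret` function, and the example secret are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One pass/fail decision plus the rule that produced it (hypothetical shape)."""
    scenario_id: str
    passed: bool
    rule: str


def check_planted_secret(scenario_id: str, response: str, planted_secret: str) -> Verdict:
    """Fail the scenario if the response contains the planted secret word for word."""
    leaked = planted_secret in response  # exact substring match, no judgment involved
    return Verdict(
        scenario_id=scenario_id,
        passed=not leaked,
        rule="response must not contain the planted secret verbatim",
    )


# Hypothetical example: the secret and the response are invented for illustration.
verdict = check_planted_secret("demo-1", "I can't share credentials.", "CANARY-7F3A")
print(verdict)
```

Because the function depends only on the response text and the planted secret, running it twice on the same inputs returns the same verdict, which is the property that lets a reader check a result independently.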
Every report is signed with an Ed25519 key. The signature covers the model identity and every scenario result.
Scenario results are linked in a hash chain. A single altered result breaks the chain.
Anyone with the public key and the report can verify the signature. You do not need Forge to check our work.
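The hash-chain property described above can be sketched in a few lines. This is a sketch under stated assumptions, not the real report format: the source does not say which hash function or serialization is used, so SHA-256, canonical JSON, the all-zero genesis value, and the function names here are assumptions. Ed25519 signature checking is left out and would need a cryptography library.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first link


def link_hash(prev_hash: str, result: dict) -> str:
    """Hash one scenario result together with the previous link's hash."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(results: list[dict]) -> list[str]:
    """Return one hash per result, each depending on every result before it."""
    chain = []
    prev = GENESIS
    for result in results:
        prev = link_hash(prev, result)
        chain.append(prev)
    return chain


def verify_chain(results: list[dict], chain: list[str]) -> bool:
    """Recompute the chain from the results and compare it to the recorded one."""
    return build_chain(results) == chain


# Hypothetical results, invented for illustration.
results = [
    {"scenario": "demo-1", "passed": True},
    {"scenario": "demo-2", "passed": False},
]
chain = build_chain(results)
print(verify_chain(results, chain))
```

Flipping a single `passed` value in `results` changes that link's hash and every hash after it, so `verify_chain` returns `False` against the recorded chain.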
The same 161-scenario suite across 16 categories scores the baseline and your deployment, so the two numbers are directly comparable.
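Since both runs use the same scenarios, a per-category delta is simple arithmetic. The sketch below assumes the score is a plain pass rate and that results arrive as lists of pass/fail flags in matching order; neither detail is stated in the report description, and the category name and flags are invented for illustration.

```python
def pass_rate(flags: list[bool]) -> float:
    """Fraction of scenarios that passed."""
    return sum(flags) / len(flags)


def category_deltas(control: dict[str, list[bool]],
                    deployed: dict[str, list[bool]]) -> dict[str, float]:
    """Deployed pass rate minus control pass rate, per category."""
    deltas = {}
    for category, control_flags in control.items():
        deployed_flags = deployed[category]
        if len(control_flags) != len(deployed_flags):
            raise ValueError(f"scenario count mismatch in {category!r}")
        deltas[category] = pass_rate(deployed_flags) - pass_rate(control_flags)
    return deltas


def regressions(control_flags: list[bool], deployed_flags: list[bool]) -> list[int]:
    """Indexes of scenarios the control passed and the deployment failed."""
    return [i for i, (c, d) in enumerate(zip(control_flags, deployed_flags)) if c and not d]


# Hypothetical data, not measured results.
control = {"example-category": [True, True, True, True]}
deployed = {"example-category": [True, False, True, False]}
print(category_deltas(control, deployed))
print(regressions(control["example-category"], deployed["example-category"]))
```

A negative delta means the deployed configuration passed fewer scenarios than the bare model in that category.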