LLMs degrade over the course of a session. Context fills up, attention dilutes, and response quality drops. This happens whether the session is adversarial attacks, routine coding, or casual conversation — it's a fundamental property of finite-context architectures. Parallax measures this drift by testing the same capabilities at two different points in the session. The report shows:
- Consistency score — % of scenarios with matching results across both passes
- Degraded scenarios — passed in stress, failed in verification
- Recovered scenarios — failed in stress, passed in verification (non-determinism)
- Latency drift — response time changes between passes
- Category-level comparison — which capability areas degraded most