// Track

Testing AI-Adjacent Systems

Evaluation, Audit, and Quality Assurance for AI Pipelines

Design evaluations for agent outputs, run audit swarms, handle knowledge cutoff as a testing concern, and build LLM-as-judge systems for automated quality scoring. Drawn from real audit runs across Knox's fleet — including the SP-001 false positive incident and the Autoresearch prompt quality system.

Recommended: Complete "Tests Pass ≠ System Works" first

3 lessons · ~26 min total

Lessons are shown in recommended order. Complete them in sequence for the best experience — or jump to any lesson.

Lesson 1 · 9 min read

The Audit Swarm Pattern

Five agents, 277 lessons, one pass. How to architect a multi-agent audit that covers what no human reviewer can — and why the Fact-Checker is the only thing standing between your swarm and a report full of false positives.

Lesson 2 · 8 min read

Knowledge Cutoff as a Testing Concern

The SP-001 incident: an audit swarm flagged 25 valid model IDs as CRITICAL errors because its training data predated the Claude 4.6 release. How grounding documents prevent AI systems from confidently invalidating their own outputs.

Lesson 3 · 9 min read

LLM-as-Judge: Automated Quality Scoring for Prompts

How Knox built a system that scores, rewrites, and auto-applies improvements to its own skill library — the five-dimension rubric, the delta-gate, the overflow-reject behavior, and why you need empirical calibration before trusting any judge score.