ADR-0006: Evaluation Baseline and Benchmarking Loop
Status
Accepted - Q3 2025
Context
With the core architecture, configuration, model loader, and guardrails in place (ADRs 0001–0005), ShieldCraft AI needed a disciplined evaluation loop. Prospects and internal stakeholders asked for empirical evidence that retrieval and generation quality meets expectations across datasets. The team also needed a way to detect regressions when swapping embeddings (ADR-0002) or adjusting prompts (future ADRs).
Constraints:
- Benchmarks must run locally and in CI without incurring runaway costs.
- Results should be persisted for storytelling within the docs portal and for audit trails.
- The loop must integrate with configuration toggles (ADR-0003) and model loader paths (ADR-0004).
Decision
Establish a benchmarking framework anchored on MTEB/BEIR suites with deterministic logging:
- Create evaluators in `ai_core` and `lambda/beir_benchmark` that load datasets, chunk with the configured strategies, and run retrieval/generation passes.
- Persist metrics (accuracy, MRR, latency, token usage) to `mteb_benchmark.log` and `mteb_results.json`, versioned per model/vector-store combination.
- Wire benchmarks into Nox sessions (`nox_sessions/beir.py`) so they can run on demand in CI or manually; a sketch follows this list.
- Publish summarized results in docs and dashboards, linking back to decision ADRs for traceability.
- Add hooks for future spot-check harnesses and regression gates (see consequences).
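For concreteness, a minimal sketch of what the `nox_sessions/beir.py` session might look like, assuming a hypothetical `ai_core.evaluation.beir_runner` entry point, an `eval` extras group, and `--output`/`--log-file` flags; the real session will differ in dependencies and arguments:

```python
# nox_sessions/beir.py -- sketch only; the runner module and CLI flags are assumptions.
import nox


@nox.session(python="3.11", reuse_venv=True)
def beir(session: nox.Session) -> None:
    """Run the BEIR/MTEB benchmark pass and write versioned artifacts."""
    session.install("-e", ".[eval]")  # assumed extras group bundling evaluation deps
    session.run(
        "python", "-m", "ai_core.evaluation.beir_runner",  # hypothetical runner module
        "--output", "mteb_results.json",
        "--log-file", "mteb_benchmark.log",
    )
```

Keeping the session thin (install, then delegate to a runner module) lets CI and local runs share the same invocation.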
Alternatives Considered
- Ad-hoc scripts per experiment
  - Pro: Quick to start
  - Con: No provenance, hard to compare
- Third-party evaluation SaaS
  - Pro: Rich reporting
  - Con: Cost, data governance concerns
- Manual QA only
  - Pro: Low effort
  - Con: Not scalable, lacks rigor
Consequences
- Provides empirical validation for buyers and internal teams; benchmarks back claims in the pricing and architecture narratives.
- Adds maintenance overhead as benchmarks must evolve with new models/domains.
- Creates a foundation for ADR-0007 (documentation and storytelling) and future ADRs covering retrieval spot-check harnesses.
Rollout Plan
- Implement evaluation utilities with configuration-driven parameters.
- Add logging and artifact persistence (see the persistence sketch after this list).
- Integrate runs into Nox sessions and CI documentation.
- Surface results in docs (`/docs-site/docs/github/evaluation.md`) and dashboards.
- Set expectations that major model/vector changes must update benchmark artifacts.
Success Metrics
- Baseline MTEB/BEIR runs complete within agreed time windows and resource budgets.
- Each release candidate has up-to-date evaluation artifacts linked in release notes.
- Regression detection catches degradations before production demos.
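One way the regression gate could work, assuming the append-only results format sketched in the rollout plan and a hypothetical baseline file checked into the repo; the MRR tolerance is a placeholder to tune per dataset:

```python
"""Sketch: fail CI when MRR drops beyond a tolerance versus the stored baseline."""
import json
import sys
from pathlib import Path

MRR_TOLERANCE = 0.02  # assumed threshold, not a committed value


def check_regression(results_path: Path, baseline_path: Path) -> int:
    latest = json.loads(results_path.read_text())[-1]["results"]
    baseline = {(r["model_id"], r["dataset"]): r for r in json.loads(baseline_path.read_text())}
    failures = []
    for row in latest:
        base = baseline.get((row["model_id"], row["dataset"]))
        if base and row["mrr"] < base["mrr"] - MRR_TOLERANCE:
            failures.append(
                f"{row['model_id']}/{row['dataset']}: MRR {row['mrr']:.3f} "
                f"vs baseline {base['mrr']:.3f}"
            )
    for line in failures:
        print(f"REGRESSION: {line}")
    return 1 if failures else 0


if __name__ == "__main__":
    # baselines/mteb_baseline.json is a hypothetical path used only for this sketch
    sys.exit(check_regression(Path("mteb_results.json"), Path("baselines/mteb_baseline.json")))
```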
References
- `ai_core/` evaluation modules
- `lambda/beir_benchmark/`
- `nox_sessions/beir.py`
- `mteb_benchmark.log`, `mteb_results.json`
- ADR-0002: Vector Store Selection
- ADR-0004: Dual-Path Model Loader Strategy
- ADR-0005: Security Baseline and Cost Guardrails
- ADR-0007: Documentation and Storytelling Framework