ADR-0004: Dual-Path Model Loader Strategy
Status
Accepted - Q2 2025
Context
With configuration contracts solidified (ADR-0003), the team needed a deterministic way to toggle between inexpensive stub inference for local development and production-grade large language models (LLMs) for demos and benchmarks. Constraints included:
- Keep costs near-zero for day-to-day development and CI.
- Support rapid experimentation with Hugging Face models (e.g., Mistral-7B) without rewriting orchestration code.
- Provide a graceful path for hardware acceleration (GPU, quantization) while staying compliant with cost and security guardrails from ADR-0005.
- Ensure the docs portal and evaluation jobs can rely on consistent outputs per environment.
Decision
Implement a dual-path model loader in `ai_core/model_loader.py` that:
- Defaults to a `stub` backend returning deterministic, low-latency responses.
- Supports a configurable "real" backend (initially `mistralai/Mistral-7B-v0.1`) using Hugging Face transformers.
- Allows optional 4-bit quantization via `BitsAndBytesConfig` when GPUs are present.
- Emits structured metadata (model name, device, quantization flags) to downstream logging for governance.
- Respects environment configuration sourced via ADR-0003, including budget gates and secrets.
A single factory function returns a loader object adhering to a small interface (`generate`, `token_usage`, `close`). The choice of backend is controlled by config (environment- and tier-aware) and can be overridden via environment variables during experiments.
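As a rough illustration, the interface and factory might take the following shape. This is a sketch only: the names (`ModelLoader`, `StubLoader`, `HFLoader`, `create_loader`, the `MODEL_BACKEND` override variable) and the config dictionary layout are assumptions for illustration, not the actual module contents.

```python
# Sketch only: names and config shape are illustrative, not the actual module.
from __future__ import annotations

import hashlib
import os
from dataclasses import dataclass, field
from typing import Any, Protocol


class ModelLoader(Protocol):
    """The small interface every backend adheres to."""

    def generate(self, prompt: str, **kwargs: Any) -> str: ...
    def token_usage(self) -> dict[str, int]: ...
    def close(self) -> None: ...


@dataclass
class StubLoader:
    """Deterministic, low-latency responses for local development and CI."""

    metadata: dict[str, Any] = field(
        default_factory=lambda: {"model": "stub", "device": "cpu", "quantized": False}
    )

    def generate(self, prompt: str, **kwargs: Any) -> str:
        # hashlib (not hash()) keeps output stable across processes.
        digest = hashlib.sha1(prompt.encode()).hexdigest()[:8]
        return f"[stub:{digest}] echo: {prompt[:64]}"

    def token_usage(self) -> dict[str, int]:
        return {"prompt_tokens": 0, "completion_tokens": 0}

    def close(self) -> None:
        pass


class HFLoader:
    """Real backend via Hugging Face transformers, optional 4-bit quantization."""

    def __init__(self, model_name: str, quantize: bool = False) -> None:
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        quant_config = None
        if quantize and torch.cuda.is_available():
            from transformers import BitsAndBytesConfig

            quant_config = BitsAndBytesConfig(
                load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
            )
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, quantization_config=quant_config, device_map="auto"
        )
        # Structured metadata for downstream governance logging.
        self.metadata = {
            "model": model_name,
            "device": str(self.model.device),
            "quantized": quant_config is not None,
        }
        self._usage = {"prompt_tokens": 0, "completion_tokens": 0}

    def generate(self, prompt: str, max_new_tokens: int = 256, **kwargs: Any) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(**inputs, max_new_tokens=max_new_tokens, **kwargs)
        prompt_len = inputs["input_ids"].shape[-1]
        self._usage["prompt_tokens"] += prompt_len
        self._usage["completion_tokens"] += output.shape[-1] - prompt_len
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

    def token_usage(self) -> dict[str, int]:
        return dict(self._usage)

    def close(self) -> None:
        # Release model weights (and GPU memory, if any).
        del self.model


def create_loader(config: dict[str, Any]) -> ModelLoader:
    """Factory: config-driven backend selection, overridable via MODEL_BACKEND."""
    backend = os.environ.get("MODEL_BACKEND") or config.get("model", {}).get("backend", "stub")
    if backend == "stub":
        return StubLoader()
    return HFLoader(
        model_name=config["model"].get("name", "mistralai/Mistral-7B-v0.1"),
        quantize=bool(config["model"].get("quantization", False)),
    )
```

The context-manager semantics called for in the rollout plan would wrap `close` in `__enter__`/`__exit__`; that plumbing is omitted here for brevity.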
Alternatives Considered
- Always-real model
  - Pro: Single path
  - Con: Expensive, slower iteration, brittle CI
- Separate services for stub vs. real
  - Pro: Strong isolation
  - Con: Operational overhead, duplicated contracts
- Managed LLM-only approach
  - Pro: Minimal infrastructure
  - Con: Less control over guardrails, harder to test offline
Consequences
- Enables local development and CI to run quickly with deterministic outputs.
- Provides a safe toggle to activate high-fidelity LLMs for demos, benchmarks, or production, aligned with tiered pricing (ADR-0001).
- Centralizes retry and error handling, making it easy to inject future providers (sketched after this list).
- Requires monitoring to ensure the real backend does not exceed cost ceilings (addressed in ADR-0006 evaluations and ADR-0005 guardrails).
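One way to picture that centralized retry point is a single wrapper shared by every backend. This is a hedged sketch; `with_retries` and `TransientBackendError` are illustrative names, not actual project code.

```python
# Sketch: one retry/error-handling wrapper shared by all backends.
# with_retries and TransientBackendError are illustrative names.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientBackendError(RuntimeError):
    """Raised by a backend for errors worth retrying (e.g., transient OOM)."""


def with_retries(fn: Callable[[], T], attempts: int = 3, backoff_s: float = 1.0) -> T:
    """Run fn, retrying transient failures with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientBackendError:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)
    raise AssertionError("unreachable")


# Usage: every backend's generate() goes through the same wrapper,
# so a newly injected provider inherits retry behavior for free.
# text = with_retries(lambda: loader.generate("Summarize the release notes"))
```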
Rollout Plan
- Build the loader abstraction with factory selection and context manager semantics.
- Wire config toggles (`model.backend`, `model.quantization`) sourced from ADR-0003.
- Add pytest coverage for both stub and real paths, including unhappy scenarios (missing weights, OOM); a test sketch follows this list.
- Integrate telemetry into evaluation harness (ADR-0006) and docs portal demos.
- Document usage in `/docs-site/docs/github/ai-core.md` and update onboarding guides.
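The test coverage item above might look roughly like this, reusing the hypothetical `create_loader` and `StubLoader` names from the Decision sketch; the import path follows the referenced module but the factory name is an assumption.

```python
# Sketch of tests/ai_core/test_model_loader.py.
import pytest

from ai_core.model_loader import create_loader  # hypothetical factory name


def test_stub_backend_is_default_and_deterministic():
    loader = create_loader({"model": {"backend": "stub"}})
    # Deterministic: same prompt, same output, zero token cost.
    assert loader.generate("hello") == loader.generate("hello")
    assert loader.token_usage() == {"prompt_tokens": 0, "completion_tokens": 0}


def test_env_var_overrides_config(monkeypatch):
    # Experiments can force the stub even when config requests the real backend.
    monkeypatch.setenv("MODEL_BACKEND", "stub")
    loader = create_loader({"model": {"backend": "real"}})
    assert loader.metadata["model"] == "stub"


def test_real_backend_missing_weights_raises():
    # Unhappy path: a nonexistent model id should fail loudly, not silently stub.
    with pytest.raises(Exception):
        create_loader({"model": {"backend": "real", "name": "nonexistent/model-id"}})
```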
Success Metrics
- All AI surfaces (CLI, evaluation jobs, docs portal) create loaders via the factory.
- CI runtime remains under 15 minutes with the stub backend.
- Real backend demos stay within monthly budget thresholds defined in ADR-0005.
References
- `ai_core/model_loader.py`
- `config/*.yml` (model sections)
- `tests/ai_core/test_model_loader.py`
- ADR-0001: Architecture Baseline and Tiering
- ADR-0003: Environment-Aware Configuration Backbone
- ADR-0005: Security Baseline and Cost Guardrails
- ADR-0006: Evaluation Baseline and Benchmarking Loop