## Fixtures & reference validation
The validator’s first 11 checks verify the contract: schema is parseable, sample run succeeds, output shape matches. They do not check that your tool returns the right answer.
Check 12, `reference_fixtures_match`, is opt-in. Drop a `reference/` folder into your tool with known inputs and expected outputs, and the validator runs each fixture and compares the response with type-aware comparators.
### Folder layout
```
tools/my-tool/
├── tool.json
├── pyproject.toml
├── main.py
└── reference/
    ├── fixture_basic.json
    ├── fixture_edge_case.json
    └── tolerance.yaml    # optional, see below
```

Each `fixture_*.json` contains the inputs and expected outputs:
{ "name": "Basic compound interest", "inputs": { "principal": 10000, "annual_rate": 5, "years": 10, "compounding": 12, "monthly_contribution": 0, "inflation_adjusted": false }, "expected_outputs": { "final_value": 16470.09, "total_contributions": 10000, "total_interest": 6470.09 }}The validator POSTs inputs to /run, then compares each
expected_outputs key against the response.
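For orientation, here is a minimal sketch of what check 12 does per fixture. The `/run` endpoint and fixture keys come from the description above; the base URL, function name, and print-based reporting are assumptions for illustration, not the validator's internals:

```python
import json
from pathlib import Path

import requests


def run_fixture(fixture_path: Path, base_url: str = "http://localhost:8000") -> None:
    fixture = json.loads(fixture_path.read_text())
    # POST the fixture's inputs to the tool's /run endpoint.
    response = requests.post(f"{base_url}/run", json=fixture["inputs"])
    response.raise_for_status()
    actual = response.json()
    # Compare each expected output key against the response; the real
    # validator dispatches to type-aware comparators (see the table below).
    for key, expected in fixture["expected_outputs"].items():
        print(f"{key}: expected {expected!r}, got {actual.get(key)!r}")


run_fixture(Path("reference/fixture_basic.json"))
```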
### Tolerance
Different output types need different comparison strategies. Defaults:
| Output type | Comparator | Default tolerance |
|---|---|---|
| `number` | absolute or relative tolerance | 1e-6 abs OR 1e-4 rel |
| `text`, `markdown` | exact string match | — |
| `table` | cell-wise, with numeric tolerance | 1e-6 abs |
| `kv` | key-by-key | 1e-6 abs for numbers |
| `chart_*` | series shape + numeric values | 1e-3 rel |
| `image` | size + SHA-256 (default), histogram distance (with `[accuracy]` extras) | — |
| `audio` | size + SHA-256, spectral distance (with `[accuracy]` extras) | — |
| `file` (PDF) | size + SHA-256, text diff (with `[accuracy]` extras) | — |
| any other | size + SHA-256 | exact |
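As a sketch, the default `number` comparison passes if the value is within the absolute *or* the relative threshold, assuming a plain either/or of the two defaults from the table (the function name is ours):

```python
def numbers_match(expected: float, got: float,
                  abs_tol: float = 1e-6, rel_tol: float = 1e-4) -> bool:
    diff = abs(expected - got)
    return diff <= abs_tol or diff <= rel_tol * abs(expected)

# 16470.085 vs 16470.09: abs diff 0.005 fails the 1e-6 absolute check,
# but 0.005 / 16470.09 ≈ 3e-7 passes the 1e-4 relative check.
assert numbers_match(16470.09, 16470.085)
```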
Heavy comparators are gated behind an optional dependency group. Install with:

```bash
uv sync --extra accuracy
```

This pulls in scikit-image, librosa, and pdfplumber. Without them, binary outputs are compared by SHA-256 only: an exact match, fine for deterministic tools but useless for stochastic ones. The validator flags this with a warn so you know you're getting the weaker comparison.
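To illustrate the two tiers for image outputs, here is a rough sketch of a hash-only comparison next to a histogram-distance comparison. The real comparators are internal to the validator; the 64-bin L1 distance and the threshold below are assumptions, and the histogram path assumes the `[accuracy]` extras (scikit-image) are installed:

```python
import hashlib


def bytes_identical(expected: bytes, got: bytes) -> bool:
    # Always available: exact match via SHA-256 digests.
    return hashlib.sha256(expected).digest() == hashlib.sha256(got).digest()


def histograms_close(expected_png: str, got_png: str, threshold: float = 0.05) -> bool:
    import numpy as np
    from skimage import io  # from the [accuracy] extras

    a = io.imread(expected_png)
    b = io.imread(got_png)
    # Normalised 64-bin intensity histograms; L1 distance lies in [0, 2].
    ha, _ = np.histogram(a, bins=64, range=(0, 256))
    hb, _ = np.histogram(b, bins=64, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return float(np.abs(ha - hb).sum()) <= threshold
```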
### tolerance.yaml
Override the defaults per output key:
```yaml
final_value:
  abs: 0.005        # 0.5p accuracy
total_interest:
  rel: 0.001        # 0.1% relative
growth_chart:
  series:
    abs: 0.01
```

Keys not listed use the type defaults.
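A sketch of how that lookup could layer per-key overrides over the type defaults, assuming PyYAML; the function name and the defaults dict (taken from the table above) are illustrative:

```python
import yaml

# `number` defaults from the tolerance table above.
TYPE_DEFAULTS = {"number": {"abs": 1e-6, "rel": 1e-4}}


def tolerance_for(key: str, output_type: str,
                  path: str = "reference/tolerance.yaml") -> dict:
    try:
        with open(path) as f:
            overrides = yaml.safe_load(f) or {}
    except FileNotFoundError:
        overrides = {}
    # Keys not listed in tolerance.yaml fall back to the type defaults.
    return overrides.get(key, TYPE_DEFAULTS.get(output_type, {}))


print(tolerance_for("final_value", "number"))          # {'abs': 0.005}
print(tolerance_for("total_contributions", "number"))  # {'abs': 1e-06, 'rel': 0.0001}
```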
### Generating fixtures from sample input
The CLI has a helper:
```bash
uv run pixie validate my-tool --update-fixtures --yes
```

This runs the tool with the validator's sample inputs and writes the response into `reference/fixture_sample.json`. Useful as a starting point, but you should hand-author fixtures that exercise meaningful inputs: edge cases, known answers from your spec, regression cases for old bugs (see the sketch below).
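For example, a hand-authored known-answer fixture: with a zero-year horizon, the final value must equal the principal, which follows directly from the compound-interest inputs above. The filename, bug description, and the script itself are hypothetical:

```python
import json
from pathlib import Path

fixture = {
    "name": "Regression: zero-year horizon returns the principal unchanged",
    "inputs": {
        "principal": 10000, "annual_rate": 5, "years": 0,
        "compounding": 12, "monthly_contribution": 0,
        "inflation_adjusted": False,
    },
    # Known answer: with zero years, no interest accrues.
    "expected_outputs": {
        "final_value": 10000.0,
        "total_contributions": 10000,
        "total_interest": 0.0,
    },
}
Path("reference/fixture_zero_years.json").write_text(json.dumps(fixture, indent=2))
```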
### Running only the reference check
```bash
uv run pixie validate my-tool --reference-only
uv run pixie validate my-tool --fixture reference/fixture_basic.json
uv run pixie validate my-tool --tag regression   # if your fixtures have tags
```

`--reference-only` skips checks 1–11 entirely, which is useful in CI when you've already confirmed the contract elsewhere.
### When fixtures fail
The report includes per-output diffs:
```
[fail] reference_fixtures_match — fixture_basic.json
  final_value:    expected 16470.09, got 16470.085 (abs diff 0.005)  ✓ within tolerance
  total_interest: expected 6470.09,  got 6469.50   (abs diff 0.59)   ✗ exceeds 0.005
```

The `debug-tool` skill reads this and walks back to the most likely cause.
### When NOT to add fixtures
- For stochastic tools (Monte Carlo with the random seed set elsewhere, LLM wrappers without `temperature=0`): fixtures will flap and undermine confidence in the suite.
- For tools that depend on external state (a tool that hits a live API whose response changes hourly): pin the seed or mock the API (see the sketch after this list), or skip fixtures.
- For tools that produce intentionally large outputs (gigabyte images, hours of audio): sample a small fixture instead.
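A minimal sketch of the seed-pinning idea, assuming the tool accepts a `seed` input so fixtures can fix it; the simulation body and input names are purely illustrative:

```python
import random


def run(inputs: dict) -> dict:
    # Deterministic for a fixed seed, so fixtures can pin it.
    rng = random.Random(inputs.get("seed", 0))
    samples = [rng.gauss(0.05, 0.15) for _ in range(inputs["n_paths"])]
    return {"mean_return": sum(samples) / len(samples)}


# With "seed" pinned in the fixture inputs, repeated runs match exactly.
assert run({"seed": 7, "n_paths": 1000}) == run({"seed": 7, "n_paths": 1000})
```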
Reference fixtures are a regression safety net, not a correctness proof. Use them for the cases where “the answer should be X” is unambiguous.