
Fixtures & reference validation

The validator’s first 11 checks verify the contract: schema is parseable, sample run succeeds, output shape matches. They do not check that your tool returns the right answer.

Check 12 — reference_fixtures_match — is opt-in. Drop a reference/ folder into your tool with known inputs and expected outputs, and the validator runs each fixture and compares the response with type-aware comparators.

Folder layout

tools/my-tool/
├── tool.json
├── pyproject.toml
├── main.py
└── reference/
    ├── fixture_basic.json
    ├── fixture_edge_case.json
    └── tolerance.yaml        # optional, see below

Each fixture_*.json contains the inputs and expected outputs:

{
  "name": "Basic compound interest",
  "inputs": {
    "principal": 10000,
    "annual_rate": 5,
    "years": 10,
    "compounding": 12,
    "monthly_contribution": 0,
    "inflation_adjusted": false
  },
  "expected_outputs": {
    "final_value": 16470.09,
    "total_contributions": 10000,
    "total_interest": 6470.09
  }
}

The validator POSTs inputs to /run, then compares each expected_outputs key against the response.
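In outline, the check is just a loop over fixtures. Here is a minimal sketch of that loop; the local URL, the use of httpx, and the plain equality check (instead of the type-aware comparators described below) are all simplifying assumptions, not the validator's actual code:

import json
from pathlib import Path

import httpx  # any HTTP client works; httpx is an assumption

TOOL_URL = "http://localhost:8000/run"  # hypothetical local address

def check_fixture(path: Path) -> list[str]:
    """POST a fixture's inputs to /run and list mismatched output keys."""
    fixture = json.loads(path.read_text())
    response = httpx.post(TOOL_URL, json=fixture["inputs"]).json()
    mismatches = []
    for key, expected in fixture["expected_outputs"].items():
        actual = response.get(key)
        if actual != expected:  # the real check is type-aware, see Tolerance
            mismatches.append(f"{key}: expected {expected!r}, got {actual!r}")
    return mismatches

for fixture_path in sorted(Path("tools/my-tool/reference").glob("fixture_*.json")):
    print(fixture_path.name, check_fixture(fixture_path) or "ok")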

Tolerance

Different output types need different comparison strategies. Defaults:

Output type       Comparator                                                      Default tolerance
number            absolute or relative tolerance                                  1e-6 abs OR 1e-4 rel
text, markdown    exact string match
table             cell-wise, with numeric tolerance                               1e-6 abs
kv                key-by-key                                                      1e-6 abs for numbers
chart_*           series shape + numeric values                                   1e-3 rel
image             size + SHA-256 (default); histogram distance (with [accuracy] extras)
audio             size + SHA-256; spectral distance (with [accuracy] extras)
file (PDF)        size + SHA-256; text diff (with [accuracy] extras)
any other         size + SHA-256                                                  exact
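For numbers, "abs OR rel" means a value passes if it is within either bound. Python's math.isclose has exactly those semantics, so a sketch of the default numeric comparator could look like:

import math

def numbers_match(expected: float, actual: float,
                  abs_tol: float = 1e-6, rel_tol: float = 1e-4) -> bool:
    # Passes when |actual - expected| <= max(rel_tol * max(|a|, |b|), abs_tol),
    # i.e. the looser of the absolute and relative bounds wins.
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)

# 16470.085 vs 16470.09: the abs diff (0.005) blows past 1e-6,
# but the relative diff (~3e-7) is well inside 1e-4, so it passes.
assert numbers_match(16470.09, 16470.085)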

Heavy comparators are gated by an optional dependency group. Install with:

uv sync --extra accuracy

This pulls in scikit-image, librosa, and pdfplumber. Without them, binary outputs are compared by SHA-256 only: exact match works for deterministic tools but is useless for stochastic ones. The validator flags this with a warning so you know you're getting a weaker accuracy check.
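The usual pattern for this kind of gating is a guarded import with a hash fallback. A sketch, where the function name and the histogram metric are illustrative rather than the validator's actual implementation:

import hashlib

def images_match(expected: bytes, actual: bytes) -> bool:
    """Compare image payloads, degrading to SHA-256 without the extras."""
    try:
        import imageio.v3 as iio  # ships as a scikit-image dependency
        import numpy as np
    except ImportError:
        # Weaker mode: byte-identical or nothing.
        print("warn: [accuracy] extras missing, comparing by SHA-256 only")
        return hashlib.sha256(expected).digest() == hashlib.sha256(actual).digest()

    a = iio.imread(expected)
    b = iio.imread(actual)
    if a.shape != b.shape:  # the size check comes first either way
        return False
    # Illustrative histogram distance: normalized histograms must agree closely.
    ha, _ = np.histogram(a, bins=64, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=64, range=(0, 255), density=True)
    return float(abs(ha - hb).sum()) < 1e-3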

tolerance.yaml

Override the defaults per output key:

tools/my-tool/reference/tolerance.yaml

final_value:
  abs: 0.005        # 0.5p accuracy
total_interest:
  rel: 0.001        # 0.1% relative
growth_chart:
  series:
    abs: 0.01

Keys not listed use the type defaults.
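Resolving the effective tolerance for a key then reduces to a lookup with a fallback. A sketch, where TYPE_DEFAULTS and its shape are assumptions encoding the defaults table above:

from pathlib import Path

import yaml  # PyYAML

# Assumed encoding of the per-type defaults from the table above.
TYPE_DEFAULTS = {
    "number": {"abs": 1e-6, "rel": 1e-4},
    "table": {"abs": 1e-6},
    "chart": {"rel": 1e-3},
}

def tolerance_for(key: str, output_type: str, reference_dir: Path) -> dict:
    """Per-key override from tolerance.yaml if present, else the type default."""
    overrides_file = reference_dir / "tolerance.yaml"
    overrides = yaml.safe_load(overrides_file.read_text()) if overrides_file.exists() else {}
    return (overrides or {}).get(key, TYPE_DEFAULTS.get(output_type, {}))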

Generating fixtures from sample input

The CLI has a helper:

uv run pixie validate my-tool --update-fixtures --yes

This runs the tool with the validator’s sample inputs and writes the response into reference/fixture_sample.json. Useful as a starting point, but you should hand-author fixtures that exercise meaningful inputs (edge cases, known answers from your spec, regression cases for old bugs).
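For instance, a hand-authored edge case for the compound-interest tool above might pin the zero-years boundary, where the expected values follow directly from the definition:

{
  "name": "Zero years leaves the principal untouched",
  "inputs": {
    "principal": 10000,
    "annual_rate": 5,
    "years": 0,
    "compounding": 12,
    "monthly_contribution": 0,
    "inflation_adjusted": false
  },
  "expected_outputs": {
    "final_value": 10000,
    "total_contributions": 10000,
    "total_interest": 0
  }
}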

Running only the reference check

uv run pixie validate my-tool --reference-only
uv run pixie validate my-tool --fixture reference/fixture_basic.json
uv run pixie validate my-tool --tag regression # if your fixtures have tags

--reference-only skips checks 1–11 entirely, which is useful in CI when you've already confirmed the contract elsewhere.
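In a pipeline where the contract checks already run in an earlier job, the reference step might look like this (a hypothetical GitHub Actions fragment):

- name: Reference fixtures
  run: |
    uv sync --extra accuracy            # enable the heavy comparators
    uv run pixie validate my-tool --reference-only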

When fixtures fail

The report includes per-output diffs:

[fail] reference_fixtures_match — fixture_basic.json
  final_value: expected 16470.09, got 16470.085 (abs diff 0.005) ✓ within tolerance
  total_interest: expected 6470.09, got 6469.50 (abs diff 0.59) ✗ exceeds 0.005

The debug-tool skill reads this report and works backward to the most likely cause.

When NOT to add fixtures

  • For stochastic tools (Monte Carlo runs whose random seed is set outside the inputs, LLM wrappers without temperature=0): fixtures will flap and undermine confidence in the suite.
  • For tools that depend on external state (say, a live API whose response changes hourly): pin the seed (see the sketch after this list) or mock the API, or skip fixtures.
  • For tools that intentionally produce very large outputs (gigabyte images, hours of audio): sample a small fixture instead.
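If you do want fixtures for a stochastic tool, the usual fix is to make the seed part of the inputs, as in this hypothetical Monte Carlo handler:

import numpy as np

def monte_carlo_final_value(inputs: dict) -> float:
    """Hypothetical stochastic tool made fixture-friendly by a pinned seed."""
    # Taking the seed from the inputs means a fixture that supplies
    # "seed" gets a byte-for-byte reproducible answer.
    rng = np.random.default_rng(inputs.get("seed", 0))
    yearly_returns = rng.normal(loc=0.05, scale=0.15, size=inputs["years"])
    value = float(inputs["principal"])
    for r in yearly_returns:
        value *= 1.0 + r
    return round(value, 2)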

Reference fixtures are a regression safety net, not a correctness proof. Use them for the cases where “the answer should be X” is unambiguous.