Testing results¶
This page tracks fixture-based validation of Context Types 1-5.
Result set version: April 13, 2026
Scope update in this pass:
- Parklane and Kingsbrook sections remain aligned to the latest published reruns.
- Fairway Type 3 section is refreshed from a new rerun audit (debug/ui_01/run_20260413_003852_502094_4873242e__ui_01/).
Context Type definitions¶
Type 1 — Contextual Tables¶
A separate table (for example, a glazing schedule) provides supplementary data that must be merged into the main item table. If the item table references GL-03, the pipeline should fill from the corresponding context-table row. Links can be direct 1:1 or rule-based (for example, if Fixed -> use this glass makeup).
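As a rough illustration of the direct 1:1 case, the sketch below fills blank item fields from the context-table row that shares a reference key. The function name, the `glass_ref` and `glass_makeup` field names, and the sample values are hypothetical; the pipeline's real merge logic is not shown on this page.

```python
def merge_context_table(items, context_rows, key="glass_ref"):
    """Fill blank item fields from the context row with the same key (hypothetical helper)."""
    lookup = {row[key]: row for row in context_rows}
    merged = []
    for item in items:
        context = lookup.get(item.get(key), {})
        filled = dict(item)
        for field, value in context.items():
            # Values already present on the item win; context only fills blanks.
            if filled.get(field) in (None, ""):
                filled[field] = value
        merged.append(filled)
    return merged


# Illustrative data only: one item row referencing GL-03 and a one-row glazing schedule.
items = [{"row_id": "W1", "glass_ref": "GL-03", "glass_makeup": ""}]
glazing_schedule = [{"glass_ref": "GL-03", "glass_makeup": "1\" IGU"}]
result = merge_context_table(items, glazing_schedule)
```

Rule-based links (for example, a makeup chosen by operability) would need a predicate per rule instead of a plain key lookup.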
Type 2 — Contextual Text & General Notes¶
Freeform notes (tempered glass rules, performance specs, bird-friendly requirements, structural criteria) define blanket or conditional rules. The pipeline should apply these as metadata to relevant items.
Type 3 — Contextual Item Cards (style/legend)¶
Visual legends define operability/category/configuration. If a schedule row says D1 is Style C, the pipeline should use the legend to populate operability and related attributes.
Type 4 — Item Cards with Variables¶
Complex diagrams define variable dimensions across sub-items (including computed dimensions such as total height - sill height). Output should produce separate line items with correct computed values.
Type 5 — Multi-label Item Cards¶
A single drawing can define multiple rows (for example W3A and W3B) sharing geometry but differing by basis-of-design or parameters. The pipeline should split items and assign shared vs variant-specific attributes correctly.
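One way to picture the split is a card with shared attributes plus a per-label override map. This is a minimal sketch under that assumption; the `shared` and `variants` keys, the field names, and the sample values are hypothetical and do not reflect the pipeline's actual data model.

```python
def split_multi_label(card):
    """Return one row per label; variant-specific values override shared ones."""
    rows = []
    for label, variant in card["variants"].items():
        rows.append({"row_id": label, **card["shared"], **variant})
    return rows


# Illustrative card: W3A and W3B share geometry but differ by basis-of-design.
card = {
    "shared": {"width_in": 36.0, "height_in": 60.0},
    "variants": {
        "W3A": {"basis_of_design": "Product A"},
        "W3B": {"basis_of_design": "Product B"},
    },
}
rows = split_multi_label(card)
```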
Fixture coverage matrix¶
| Fixture | Latest run used in docs | Types covered | Status |
|---|---|---|---|
| Parklane (CRPA-Park Lane T1.pdf) | run_20260412_011036 | Type 1, Type 2 | Active |
| Kingsbrook (kingsbrook_takeOFF T2.pdf) | run_20260412_010342 | Type 1-5 (broad mixed coverage) | Active |
| Fairway (VTT - Fairway T3.pdf) | run_20260413_003852_502094_4873242e__ui_01 | Type 3 | Active |
Evaluation methodology¶
Ground truth (GT)¶
GT files are manually curated expected outputs for each benchmark fixture. They define the canonical row-level and field-level values used for regression measurement.
- Location: test/ground-truths/
- Examples: gt__r01-02__parklane.json, gt__r01-05__kingsbrook.json
- Alignment key: row_id
What GT is and is not
GT is a human-authored evaluation baseline, not model output. When source documents, field contracts, or unit policies change, GT should be updated in the same change window to keep benchmark scores meaningful.
Normalized comparator architecture¶
Cartex benchmark summaries use the normalized comparator implemented in test/gates/phase_gate_report.py.
Evaluation flow:
- Load actual rows from debug/run_*/<run_id>_rows.json and GT rows from test/ground-truths/.
- Align rows by row_id.
- Apply fixture-level ignore metadata for known non-deterministic obscured-row extraction artifacts in Kingsbrook-like layouts (see Obscured-row extraction artifacts).
- Compare every field in every common row using field-aware normalization logic.
- Aggregate mismatches into row-level and run-level metrics.
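The alignment and ignore steps of this flow can be sketched as a set operation over row_id. This is not the code in phase_gate_report.py: the function name is hypothetical, and modelling the ignore metadata as a plain set of row IDs is an assumption (the real metadata may be positional, since hallucinated IDs can vary between runs).

```python
def align_rows(actual_rows, gt_rows, ignored_row_ids=frozenset()):
    """Align actual and GT rows by row_id after dropping ignored actual rows."""
    actual = {r["row_id"]: r for r in actual_rows if r["row_id"] not in ignored_row_ids}
    gt = {r["row_id"]: r for r in gt_rows}
    common = sorted(actual.keys() & gt.keys())
    missing = sorted(gt.keys() - actual.keys())   # in GT, absent from the run
    extra = sorted(actual.keys() - gt.keys())     # in the run, absent from GT
    return common, missing, extra


# Illustrative rows only; W75B stands in for a hallucinated extra row.
gt_rows = [{"row_id": "W1A"}, {"row_id": "W11B"}]
actual_rows = [{"row_id": "W1A"}, {"row_id": "W11B"}, {"row_id": "W75B"}]

raw = align_rows(actual_rows, gt_rows)
suppressed = align_rows(actual_rows, gt_rows, ignored_row_ids={"W75B"})
```

Field comparison then runs only over the `common` IDs, while `missing` and `extra` feed the row-completeness gate.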
Field-aware matching rules:
| Field group | Normalized matching logic |
|---|---|
| General text fields | Case-insensitive, whitespace-normalized string match |
| Width, Height | Parse units into inches, then numeric compare |
| Rough Opening Measurements | Parse A x B into inch pairs, compare canonical pairs |
| Quantity, Glass Layer | Parse integer-like tokens, compare numeric values |
| Frame Brand | Token overlap/containment check (e.g. Rehau vs Starr Rehau) |
| Glass Arrangement Configuration | Structural compare of panel dimensions/count and metal-panel marker |
| Special Notes | Semantic fact extraction + overlap thresholds (recall/precision gating) |
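To make three of these rules concrete, the sketch below shows one possible reading of the general-text, Width/Height, and Frame Brand rows. The helper names, the accepted dimension formats, and the tolerance are assumptions for illustration; the actual comparator may parse more formats and use different thresholds.

```python
from fractions import Fraction


def normalize_text(value):
    """Case-insensitive, whitespace-normalized form of a text field."""
    return " ".join(str(value).split()).casefold()


def to_inches(value):
    """Parse strings such as 7'-0", 36", 1 3/4" or 36 into inches (assumed formats)."""
    text = str(value).strip()
    feet = Fraction(0)
    if "'" in text:
        feet_part, text = text.split("'", 1)
        feet = Fraction(feet_part.strip())
    # Treat inch marks and the feet-inch hyphen as separators.
    text = text.replace('"', " ").replace("-", " ")
    inches = sum((Fraction(token) for token in text.split()), Fraction(0))
    return float(feet * 12 + inches)


def dims_match(actual, expected, tolerance=1e-6):
    """Numeric compare after unit parsing."""
    return abs(to_inches(actual) - to_inches(expected)) <= tolerance


def brand_matches(actual, expected):
    """Token containment: one token set must be a subset of the other."""
    a = set(normalize_text(actual).split())
    e = set(normalize_text(expected).split())
    return bool(a) and bool(e) and (a <= e or e <= a)
```

Under these assumptions, `7'-0"` and `84` compare equal, and `Rehau` matches `Starr Rehau` because its single token is contained in the longer brand string.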
Reporting policy
Testing documentation uses normalized comparator metrics as the release-quality KPI. Strict exact-string comparison may still be used for internal diagnostics but is intentionally not surfaced in benchmark summaries.
Score calculation¶
For a run with common_rows aligned rows and F comparable fields per row:
total_comparable_fields = common_rows * F
total_field_mismatches = count of normalized non-matching field comparisons
field_accuracy = (total_comparable_fields - total_field_mismatches) / total_comparable_fields
row_accuracy(row_i) = matched_fields(row_i) / F
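As a worked check of these formulas, the sketch below recomputes the field accuracy reported later on this page for Parklane and Kingsbrook. The values of F (14 and 10) are not stated anywhere in this document; they are inferred here because they reproduce the published figures, and should be treated as assumptions. The function names are likewise illustrative.

```python
def field_accuracy(common_rows, fields_per_row, total_field_mismatches):
    """(total_comparable_fields - total_field_mismatches) / total_comparable_fields."""
    total_comparable_fields = common_rows * fields_per_row
    if total_comparable_fields == 0:
        raise ValueError("no comparable fields")
    return (total_comparable_fields - total_field_mismatches) / total_comparable_fields


def row_accuracy(matched_fields, fields_per_row):
    """Share of comparable fields that matched in a single row."""
    return matched_fields / fields_per_row


# Parklane: 19 rows, Glass Layer (19) + Glass Width (19) mismatches; F = 14 is inferred.
parklane = field_accuracy(19, 14, 19 + 19)

# Kingsbrook: 50 rows, Material (50) + Special Notes (47) mismatches; F = 10 is inferred.
kingsbrook = field_accuracy(50, 10, 50 + 47)
```

With those inferred field counts, `parklane` rounds to 0.857143 and `kingsbrook` to 0.806, matching the summary tables.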
Gate metrics:
- row_completeness_pass: missing_row_ids == [] and extra_row_ids == [] after configured ignore rules.
- no_high_conf_severe_errors_pass: no row with confidence >= 0.9 and row_accuracy <= 0.4.
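Both gates reduce to short predicates. The sketch below is one way to express them, assuming each row record carries `confidence` and `row_accuracy` keys; that record shape is hypothetical, and only the gate names and thresholds come from this page.

```python
def row_completeness_pass(missing_row_ids, extra_row_ids):
    """Pass only when no rows are missing or extra after ignore rules are applied."""
    return not missing_row_ids and not extra_row_ids


def no_high_conf_severe_errors_pass(rows, min_confidence=0.9, max_row_accuracy=0.4):
    """Fail if any row is both high-confidence and severely wrong."""
    return not any(
        row["confidence"] >= min_confidence and row["row_accuracy"] <= max_row_accuracy
        for row in rows
    )


# Illustrative rows: the second one is low-accuracy but also low-confidence, so it does not trip the gate.
rows = [
    {"row_id": "A", "confidence": 0.95, "row_accuracy": 0.9},
    {"row_id": "B", "confidence": 0.50, "row_accuracy": 0.2},
]
gate_ok = no_high_conf_severe_errors_pass(rows)
gate_bad = no_high_conf_severe_errors_pass(rows + [{"row_id": "C", "confidence": 0.9, "row_accuracy": 0.4}])
```

Note that both thresholds are inclusive, so a row at exactly confidence 0.9 and row accuracy 0.4 fails the gate.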
Type 1 + Type 2 benchmark (Parklane)¶
Fixture: CRPA-Park Lane T1.pdf
Pages tested: 1
Template: Glass Schedule
Strategies fired: auxiliary_table, text_rule, dimension_card
Run: debug/run_20260412_011036/
Flags: High Accuracy Tables on, Monolithic mode off
Current result summary¶
| Metric | Value |
|---|---|
| Expected rows / Actual rows | 19 / 19 |
| Row completeness gate | Pass |
| High-confidence severe-error gate | Pass |
| Field accuracy (normalized comparator) | 0.857143 |
| Remaining normalized mismatches | Glass Layer (19), Glass Width (19) |
Interpretation¶
- Type 1 behavior is functioning (context table grounding and row mapping).
- Type 2 behavior is functioning (text-note metadata grounding).
- The remaining mismatch cluster is currently attributed to field-contract ambiguity in template source semantics (Glass Layer vs Glass Width), not row-loss behavior.
Regression breakdown (normalized comparator, GT-aligned)¶
Glass Layer — 19 mismatches¶
Cause
Thickness text (1" IGU) is being written into a layer-count slot. This behavior is currently shaped by CATO2 template field-rule contracts that are injected into specialist instructions. In the current template source, Glass Layer and Glass Width carry the same verbatim rule text (Glass thickness if determinable), creating ambiguous one-to-one guidance between these two fields. This is being investigated as a likely upstream rule-definition issue.
Examples:
Glass Width — 19 mismatches¶
Cause
The same thickness token is leaking into Glass Width for the same contract-level reason: CATO2-provided field-rule contracts are injected into specialist prompts, and Glass Width currently shares the same verbatim rule as Glass Layer (Glass thickness if determinable). This overlap in upstream rule text is under investigation.
Examples:
Type 1-5 broad benchmark (Kingsbrook)¶
Fixture: Pages from kingsbrook_takeOFF T2.pdf
Pages tested: 1-2
Template: Standard Takeoff
Strategies fired: auxiliary_table, text_rule, image_legend, dimension_card, multi_label
Run: debug/run_20260412_010342/
Flags: High Accuracy Tables on, Monolithic mode off
Current result summary¶
| Metric | Value |
|---|---|
| Expected rows / Actual rows (raw) | 50 / 51 |
| Expected rows / Actual rows (after fixture suppression) | 50 / 50 |
| Row completeness gate | Pass |
| High-confidence severe-error gate | Pass |
| Field accuracy (normalized comparator) | 0.806 |
| Remaining normalized mismatches | Material (50), Special Notes (47) |
Known non-deterministic extraction issue¶
The obscured lower portion of the main schedule can trigger hallucinated rows in the last ~5 rows of the detected main table.
- This issue is not fixed.
- W75B was one observed manifestation in this run, but hallucinated row IDs/values can differ run-to-run.
- Treat this as an extraction instability zone, not a single deterministic row bug.
Regression breakdown (normalized comparator, GT-aligned)¶
Material — 50 mismatches¶
Cause
The document usually implies frame material through the product family (Starr Rehau Artevo) rather than stating it explicitly. The current pipeline does not consistently infer material from that implicit mapping.
Examples:
- W11B
  - actual:
  - gt: uPVC/vinyl
- W1A
  - actual: Triple Glazed
  - gt: uPVC/vinyl
- W77A
  - actual: Triple Glazed
  - gt: uPVC/vinyl
Special Notes — 47 mismatches¶
Cause
Notes synthesis captures many facts but often drops key constraint clauses. The dominant missing GT clusters are design_pressure, safety_glazing, air_leakage, and head_height_rule, along with some u_factor tokens.
Examples:
- W11B
  - actual: 1 3/4" Triple Glazed IGU ... Warm-edge spacer ... (shortened)
  - gt: ... U-0.18 ... Min head height 7'-0" ... +/- 48 PSF design pressure ... safety glazing ...
- W13A
  - actual: 1 3/4" Triple Glazed IGU ... (shortened)
  - gt: ... design pressure + safety glazing clauses present ...
Type 3 benchmark (Fairway)¶
Fixture: VTT - Fairway T3.pdf
Pages tested: 1
Template: Standard Takeoff
Strategies fired: text_rule, image_legend
Run: debug/ui_01/run_20260413_003852_502094_4873242e__ui_01/
Flags: High Accuracy Tables on, Monolithic mode off
Current result summary¶
| Metric | Value |
|---|---|
| Expected rows / Actual rows | 21 / 21 |
| Row completeness gate | Pass |
| High-confidence severe-error gate | Pass |
| Field accuracy (normalized comparator) | 0.948413 |
| Remaining normalized mismatches | Product Type (7), Operability (3), Frame Material (1), Glass Type (1) |
Interpretation¶
- Core Type 3 legend grounding is stable on this rerun (row completeness holds, no severe high-confidence failures).
- Special Notes semantic match is fully passing under normalized comparison for this run.
- Residual errors are concentrated in taxonomy/classification (Product Type) and a small operability cluster.
Regression breakdown (normalized comparator, GT-aligned)¶
Product Type — 7 mismatches¶
Cause
Subtype extraction coverage is incomplete on a subset of window rows. Even when subtype evidence is present, Product Type is sometimes left blank.
Examples:
Operability — 3 mismatches¶
Cause
A small subset of door styles in the image legend are visually similar and semantically close, which can make it ambiguous for Gemini to map style cards to exact operability labels. D11 and D12 are representative of this legend-level ambiguity.
Examples:
Frame Material — 1 mismatch¶
Cause
One door row still drops explicit frame material.
Example:
Glass Type — 1 mismatch¶
Cause
One row carries a broader composite phrase than GT (double glazed insulated glass; Tempered glass vs Tempered glass).
Planned testing upgrades¶
Dissent-based confidence signal (planned)¶
The planned enhancement introduces a dissent-based signal so that Gemini can lower its confidence when extraction evidence is weak, ambiguous, or internally inconsistent.
Expected effect:
- better confidence calibration in obscured/low-quality regions
- clearer separation between high-confidence errors vs uncertain extractions
- improved triage for manual audit workflows