Testing results¶

This page tracks fixture-based validation of Context Types 1-5.

Result set version: April 13, 2026
Scope update in this pass:

Parklane and Kingsbrook sections remain aligned to the latest published reruns.
Fairway Type 3 section is refreshed from a new rerun audit (debug/ui_01/run_20260413_003852_502094_4873242e__ui_01/).

Context Type definitions¶

Type 1 — Contextual Tables¶

A separate table (for example, a glazing schedule) provides supplementary data that must be merged into the main item table. If the item table references GL-03, the pipeline should fill from the corresponding context-table row. Links can be direct 1:1 or rule-based (for example, if Fixed -> use this glass makeup).
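A minimal sketch of the direct 1:1 case, assuming the item row carries its reference in a column named `glass_ref`. That column name, the function `merge_context_table`, and the sample values are hypothetical and do not come from the pipeline. Rule-based links are not shown.

```python
from typing import Dict, List


def merge_context_table(
    items: List[Dict[str, str]],
    context_rows: Dict[str, Dict[str, str]],
    ref_field: str = "glass_ref",  # hypothetical column name
) -> List[Dict[str, str]]:
    """Fill each item row from the context-table row it references."""
    merged = []
    for item in items:
        context = context_rows.get(item.get(ref_field, ""), {})
        # Values already present on the item win over context-table values.
        merged.append({**context, **item})
    return merged


# Illustrative data only: one resolvable reference, one dangling reference.
items = [
    {"row_id": "W1", "glass_ref": "GL-03"},
    {"row_id": "W2", "glass_ref": "GL-99"},
]
glazing_schedule = {"GL-03": {"Glass Type": "Tempered glass"}}
merged_items = merge_context_table(items, glazing_schedule)
```

A dangling reference (`GL-99` here) leaves the item unchanged rather than raising, so a missing context row shows up later as a field mismatch instead of a crash.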

Type 2 — Contextual Text & General Notes¶

Freeform notes (tempered glass rules, performance specs, bird-friendly requirements, structural criteria) define blanket or conditional rules. The pipeline should apply these as metadata to relevant items.
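One way to picture blanket versus conditional rules is as predicates over rows. The sketch below assumes that model; `NoteRule`, `apply_note_rules`, and the rule texts are illustrative names and values, not the pipeline's own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class NoteRule:
    text: str                          # metadata to attach to matching rows
    applies_to: Callable[[Row], bool]  # a blanket rule always returns True


def apply_note_rules(rows: List[Row], rules: List[NoteRule]) -> List[Row]:
    """Attach the text of every applicable rule to each row."""
    annotated = []
    for row in rows:
        notes = [rule.text for rule in rules if rule.applies_to(row)]
        annotated.append({**row, "Special Notes": "; ".join(notes)})
    return annotated


# Illustrative rules: one blanket rule, one conditional on a row attribute.
rules = [
    NoteRule("Bird-friendly glazing required", lambda row: True),
    NoteRule("Tempered glass", lambda row: row.get("Product Type") == "Door"),
]
annotated_rows = apply_note_rules(
    [
        {"row_id": "D1", "Product Type": "Door"},
        {"row_id": "W1", "Product Type": "Window"},
    ],
    rules,
)
```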

Type 3 — Contextual Item Cards (style/legend)¶

Visual legends define operability/category/configuration. If a schedule row says D1 is Style C, the pipeline should use the legend to populate operability and related attributes.

Type 4 — Item Cards with Variables¶

Complex diagrams define variable dimensions across sub-items (including computed dimensions such as total height - sill height). Output should produce separate line items with correct computed values.
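The computed-dimension case can be sketched as a tiny expression resolver. This assumes expressions are either a bare variable name or a single `a - b` subtraction, with all values in inches. `SubItem`, `resolve_height`, the row IDs, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SubItem:
    row_id: str
    height_expr: str  # a variable name, or "<variable> - <variable>"


def resolve_height(expr: str, variables: Dict[str, float]) -> float:
    """Evaluate a bare variable or a single subtraction, in inches."""
    if " - " in expr:
        left, right = (part.strip() for part in expr.split(" - ", 1))
        return variables[left] - variables[right]
    return variables[expr.strip()]


# Illustrative card: two sub-items that share one set of diagram variables.
variables = {"total height": 96.0, "sill height": 30.0}
sub_items = [
    SubItem("W5A", "total height - sill height"),
    SubItem("W5B", "sill height"),
]
line_items = [
    {"row_id": item.row_id, "Height": resolve_height(item.height_expr, variables)}
    for item in sub_items
]
```

Each sub-item becomes its own line item, which is the shape the Type 4 output is expected to take.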

Type 5 — Multi-label Item Cards¶

A single drawing can define multiple rows (for example W3A and W3B) sharing geometry but differing by basis-of-design or parameters. The pipeline should split items and assign shared vs variant-specific attributes correctly.
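A minimal sketch of the split, assuming the card has already been parsed into a label list, a set of shared attributes, and per-label overrides. The function name, field names, and values are illustrative.

```python
from typing import Dict, List


def split_multi_label(
    labels: List[str],
    shared: Dict[str, str],
    variants: Dict[str, Dict[str, str]],
) -> List[Dict[str, str]]:
    """Emit one row per label: shared geometry plus variant-specific fields."""
    rows = []
    for label in labels:
        row = {"row_id": label, **shared}
        row.update(variants.get(label, {}))  # variant values override shared ones
        rows.append(row)
    return rows


# Illustrative values only.
split_rows = split_multi_label(
    ["W3A", "W3B"],
    shared={"Width": '36"', "Height": '60"'},
    variants={
        "W3A": {"Basis of Design": "Option 1"},
        "W3B": {"Basis of Design": "Option 2"},
    },
)
```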

Fixture coverage matrix¶

Fixture	Latest run used in docs	Types covered	Status
Parklane (`CRPA-Park Lane T1.pdf`)	`run_20260412_011036`	Type 1, Type 2	Active
Kingsbrook (`kingsbrook_takeOFF T2.pdf`)	`run_20260412_010342`	Type 1-5 (broad mixed coverage)	Active
Fairway (`VTT - Fairway T3.pdf`)	`run_20260413_003852_502094_4873242e__ui_01`	Type 3	Active

Evaluation methodology¶

Ground truth (GT)¶

GT files are manually curated expected outputs for each benchmark fixture. They define the canonical row-level and field-level values used for regression measurement.

Location: test/ground-truths/
Examples: gt__r01-02__parklane.json, gt__r01-05__kingsbrook.json
Alignment key: row_id

What GT is and is not

GT is a human-authored evaluation baseline, not model output. When source documents, field contracts, or unit policies change, GT should be updated in the same change window to keep benchmark scores meaningful.

Normalized comparator architecture¶

Cartex benchmark summaries use the normalized comparator implemented in test/gates/phase_gate_report.py.

Evaluation flow:

1. Load actual rows from debug/run_*/<run_id>_rows.json and GT rows from test/ground-truths/.
2. Align rows by row_id.
3. Apply fixture-level ignore metadata for known non-deterministic obscured-row extraction artifacts in Kingsbrook-like layouts (see Obscured-row extraction artifacts).
4. Compare every field in every common row using field-aware normalization logic.
5. Aggregate mismatches into row-level and run-level metrics.
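The flow above can be sketched in a few lines. This is not the code in test/gates/phase_gate_report.py: `compare_run`, `loose_text_match`, and the sample rows are hypothetical. File loading is omitted, and the ignore metadata is assumed to be a plain set of row IDs applied to the actual rows.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]


def index_by_row_id(rows: Iterable[Row]) -> Dict[str, Row]:
    return {row["row_id"]: row for row in rows}


def compare_run(
    actual_rows: Iterable[Row],
    gt_rows: Iterable[Row],
    ignored_row_ids: Set[str],
    fields_match: Callable[[str, str, str], bool],
) -> Dict[str, list]:
    gt = index_by_row_id(gt_rows)
    # Drop actual rows that the fixture-level ignore metadata flags.
    actual = {
        row_id: row
        for row_id, row in index_by_row_id(actual_rows).items()
        if row_id not in ignored_row_ids
    }
    common = sorted(gt.keys() & actual.keys())
    mismatches: List[Tuple[str, str]] = []
    for row_id in common:
        for field, expected in gt[row_id].items():
            if field == "row_id":
                continue
            if not fields_match(field, actual[row_id].get(field, ""), expected):
                mismatches.append((row_id, field))
    return {
        "missing_row_ids": sorted(gt.keys() - actual.keys()),
        "extra_row_ids": sorted(actual.keys() - gt.keys()),
        "common_rows": common,
        "mismatches": mismatches,
    }


def loose_text_match(field: str, actual: str, expected: str) -> bool:
    # Stand-in for field-aware logic: case-insensitive, whitespace-normalized.
    return " ".join(actual.lower().split()) == " ".join(expected.lower().split())


# Illustrative rows: one real row and one suppressed extra row.
report = compare_run(
    actual_rows=[
        {"row_id": "W1A", "Material": "Triple Glazed", "Quantity": "2"},
        {"row_id": "W75B", "Material": "", "Quantity": "1"},
    ],
    gt_rows=[{"row_id": "W1A", "Material": "uPVC/vinyl", "Quantity": "2"}],
    ignored_row_ids={"W75B"},
    fields_match=loose_text_match,
)
```

Passing the matcher in as a callable keeps row alignment separate from normalization, so per-field rules can change without touching the alignment step.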

Field-aware matching rules:

Field group	Normalized matching logic
General text fields	Case-insensitive, whitespace-normalized string match
`Width`, `Height`	Parse units into inches, then numeric compare
`Rough Opening Measurements`	Parse `A x B` into inch pairs, compare canonical pairs
`Quantity`, `Glass Layer`	Parse integer-like tokens, compare numeric values
`Frame Brand`	Token overlap/containment check (e.g. `Rehau` vs `Starr Rehau`)
`Glass Arrangement Configuration`	Structural compare of panel dimensions/count and metal-panel marker
`Special Notes`	Semantic fact extraction + overlap thresholds (recall/precision gating)
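Two of the rules in the table are sketched below: unit-aware dimension comparison and `Frame Brand` token containment. The regular expression, the function names, and the choice to read a bare number as inches are assumptions made for illustration. The real comparator may accept more formats, such as fractions.

```python
import re
from typing import Optional

# Accepts forms such as 7'-0", 3', 36", 36 in, 3 ft 6 in, or a bare 36.
_DIMENSION = re.compile(
    r"""^\s*
        (?:(?P<feet>\d+(?:\.\d+)?)\s*(?:'|ft)\s*-?\s*)?   # optional feet part
        (?:(?P<inches>\d+(?:\.\d+)?)\s*(?:"|in)?)?        # optional inches part
        \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def to_inches(text: str) -> Optional[float]:
    """Return the dimension in inches, or None if the text is not parseable."""
    match = _DIMENSION.match(text)
    if not match or (match["feet"] is None and match["inches"] is None):
        return None
    return float(match["feet"] or 0) * 12 + float(match["inches"] or 0)


def dimensions_match(actual: str, expected: str, tol: float = 1e-6) -> bool:
    a, e = to_inches(actual), to_inches(expected)
    if a is None or e is None:
        # Fall back to a normalized string match when parsing fails.
        return " ".join(actual.lower().split()) == " ".join(expected.lower().split())
    return abs(a - e) <= tol


def brand_match(actual: str, expected: str) -> bool:
    """Token containment: one side's tokens must all appear in the other."""
    a, e = set(actual.lower().split()), set(expected.lower().split())
    return bool(a) and bool(e) and (a <= e or e <= a)
```

Under these assumptions `3'-0"` and `36"` compare equal, and `Rehau` matches `Starr Rehau` because its only token is contained in the longer name.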

Reporting policy

Testing documentation uses normalized comparator metrics as the release-quality KPI. Strict exact-string comparison may still be used for internal diagnostics but is intentionally not surfaced in benchmark summaries.

Score calculation¶

For a run with common_rows aligned rows and F comparable fields per row:

total_comparable_fields = common_rows * F
total_field_mismatches  = count of normalized non-matching field comparisons
field_accuracy          = (total_comparable_fields - total_field_mismatches) / total_comparable_fields

row_accuracy(row_i)     = matched_fields(row_i) / F
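As a worked check, the formulas can be written directly in Python. The function names mirror the definitions above. The Parklane inputs are 19 aligned rows and 19 + 19 = 38 mismatches. F = 14 is an assumption: it is not stated on this page, and is simply the value that reproduces the published figure.

```python
def field_accuracy(common_rows: int, fields_per_row: int, total_field_mismatches: int) -> float:
    total_comparable_fields = common_rows * fields_per_row
    if total_comparable_fields == 0:
        raise ValueError("no comparable fields")
    return (total_comparable_fields - total_field_mismatches) / total_comparable_fields


def row_accuracy(matched_fields: int, fields_per_row: int) -> float:
    return matched_fields / fields_per_row


# Parklane: 19 rows, 38 mismatches, F = 14 (assumed, see above).
parklane = round(field_accuracy(19, 14, 38), 6)  # 0.857143

# A row with two mismatching fields out of 14 (F assumed as above).
parklane_row = round(row_accuracy(12, 14), 6)  # 0.857143
```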

Gate metrics:

row_completeness_pass: missing_row_ids == [] and extra_row_ids == [] after configured ignore rules.
no_high_conf_severe_errors_pass: no row with confidence >= 0.9 and row_accuracy <= 0.4.
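Both gates reduce to short predicates. The sketch below assumes each row is summarized as a dict with `confidence` and `row_accuracy` keys. That shape is hypothetical; the thresholds are the ones stated above.

```python
from typing import Dict, List


def row_completeness_pass(missing_row_ids: List[str], extra_row_ids: List[str]) -> bool:
    """Pass only when no rows are missing or extra after ignore rules."""
    return not missing_row_ids and not extra_row_ids


def no_high_conf_severe_errors_pass(
    rows: List[Dict[str, float]],
    confidence_threshold: float = 0.9,
    accuracy_threshold: float = 0.4,
) -> bool:
    """Fail when any row is both highly confident and badly wrong."""
    return not any(
        row["confidence"] >= confidence_threshold
        and row["row_accuracy"] <= accuracy_threshold
        for row in rows
    )


# Illustrative rows: a confident good row, and a bad row with low confidence.
sample_rows = [
    {"confidence": 0.95, "row_accuracy": 0.9},
    {"confidence": 0.50, "row_accuracy": 0.2},
]
gate_ok = no_high_conf_severe_errors_pass(sample_rows)
```

The second sample row is inaccurate but does not trip the gate, because its confidence is below 0.9. The gate targets confident failures specifically.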

Type 1 + Type 2 benchmark (Parklane)¶

Fixture: CRPA-Park Lane T1.pdf
Pages tested: 1
Template: Glass Schedule
Strategies fired: auxiliary_table, text_rule, dimension_card
Run: debug/run_20260412_011036/
Flags: High Accuracy Tables on, Monolithic mode off

Current result summary¶

Metric	Value
Expected rows / Actual rows	19 / 19
Row completeness gate	Pass
High-confidence severe-error gate	Pass
Field accuracy (normalized comparator)	0.857143
Remaining normalized mismatches	`Glass Layer` (19), `Glass Width` (19)

Interpretation¶

Type 1 behavior is functioning (context table grounding and row mapping).
Type 2 behavior is functioning (text-note metadata grounding).
The remaining mismatch cluster is currently attributed to ambiguous field contracts in the template source (Glass Layer vs Glass Width), not to row loss.

Regression breakdown (normalized comparator, GT-aligned)¶

`Glass Layer` — 19 mismatches¶

Cause

Thickness text (1" IGU) is being written into a field that expects a layer count. This behavior is currently shaped by the CATO2 template field-rule contracts that are injected into specialist instructions. In the current template source, Glass Layer and Glass Width carry the same verbatim rule text (Glass thickness if determinable), so the two fields receive identical guidance and the specialist has no basis for treating them differently. This is being investigated as a likely upstream rule-definition issue.

Examples:

C1
  actual: 1" IGU
  gt:     2

P1
  actual: 1" IGU
  gt:     2

`Glass Width` — 19 mismatches¶

Cause

The same thickness token is leaking into Glass Width for the same contract-level reason: CATO2-provided field-rule contracts are injected into specialist prompts, and Glass Width currently shares the same verbatim rule as Glass Layer (Glass thickness if determinable). This overlap in upstream rule text is under investigation.

Examples:

C1
  actual: 1" IGU
  gt:

P10
  actual: 1" IGU
  gt:

Type 1-5 broad benchmark (Kingsbrook)¶

Fixture: Pages from kingsbrook_takeOFF T2.pdf
Pages tested: 1-2
Template: Standard Takeoff
Strategies fired: auxiliary_table, text_rule, image_legend, dimension_card, multi_label
Run: debug/run_20260412_010342/
Flags: High Accuracy Tables on, Monolithic mode off

Current result summary¶

Metric	Value
Expected rows / Actual rows (raw)	50 / 51
Expected rows / Actual rows (after fixture suppression)	50 / 50
Row completeness gate	Pass
High-confidence severe-error gate	Pass
Field accuracy (normalized comparator)	0.806
Remaining normalized mismatches	`Material` (50), `Special Notes` (47)

Known non-deterministic extraction issue¶

The obscured lower portion of the main schedule can trigger hallucinated rows in the last ~5 rows of the detected main table.

This issue is not fixed.
W75B was one observed manifestation in this run, but hallucinated row IDs/values can differ run-to-run.
Treat this as an extraction instability zone, not a single deterministic row bug.

Regression breakdown (normalized comparator, GT-aligned)¶

`Material` — 50 mismatches¶

Cause

The document usually implies frame material through the product family (Starr Rehau Artevo) rather than stating it explicitly. The current pipeline does not consistently infer material from that implicit mapping.

Examples:

W11B
  actual:
  gt:     uPVC/vinyl

W1A
  actual: Triple Glazed
  gt:     uPVC/vinyl

W77A
  actual: Triple Glazed
  gt:     uPVC/vinyl
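The implicit mapping can be pictured as a lookup from product family to material. This is a hypothetical sketch and not part of the current pipeline. `PRODUCT_FAMILY_MATERIAL` and `infer_material` are invented names, and the single table entry mirrors the GT value in the examples above.

```python
from typing import Optional

# Hypothetical lookup; the only entry mirrors the GT value shown above.
PRODUCT_FAMILY_MATERIAL = {"starr rehau artevo": "uPVC/vinyl"}


def infer_material(explicit_material: str, product_family: str) -> Optional[str]:
    """Prefer explicit material text; otherwise fall back to the family lookup."""
    if explicit_material.strip():
        return explicit_material.strip()
    key = " ".join(product_family.lower().split())
    return PRODUCT_FAMILY_MATERIAL.get(key)


inferred = infer_material("", "Starr Rehau Artevo")
```

A fallback like this addresses only the blank case (W11B). Rows where a glazing phrase lands in Material (W1A, W77A) would still need the misplaced value rejected first.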

`Special Notes` — 47 mismatches¶

Cause

Notes synthesis captures many facts but often drops key constraint clauses. The fact clusters most often present in GT but missing from output are design_pressure, safety_glazing, air_leakage, head_height_rule, and some u_factor tokens.

Examples:

W11B
  actual: 1 3/4" Triple Glazed IGU ... Warm-edge spacer ... (shortened)
  gt:     ... U-0.18 ... Min head height 7'-0" ... +/- 48 PSF design pressure ... safety glazing ...

W13A
  actual: 1 3/4" Triple Glazed IGU ... (shortened)
  gt:     ... design pressure + safety glazing clauses present ...
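The missing-clause pattern above is what a fact-overlap comparison is meant to catch. The sketch below assumes facts are detected with simple regular expressions and gated on recall and precision. The patterns, the thresholds, and the handling of an empty GT fact set are all illustrative; the comparator's actual extraction logic is not described on this page.

```python
import re
from typing import Set

# Hypothetical patterns for a subset of the clusters named above.
FACT_PATTERNS = {
    "design_pressure": re.compile(r"design pressure", re.IGNORECASE),
    "safety_glazing": re.compile(r"safety glazing", re.IGNORECASE),
    "head_height_rule": re.compile(r"head height", re.IGNORECASE),
    "u_factor": re.compile(r"\bU-\d*\.?\d+", re.IGNORECASE),
}


def extract_facts(note: str) -> Set[str]:
    return {name for name, pattern in FACT_PATTERNS.items() if pattern.search(note)}


def notes_match(
    actual: str,
    expected: str,
    min_recall: float = 0.8,     # illustrative threshold
    min_precision: float = 0.5,  # illustrative threshold
) -> bool:
    actual_facts, expected_facts = extract_facts(actual), extract_facts(expected)
    if not expected_facts:
        # Assumption: with nothing expected, only an equally empty actual matches.
        return not actual_facts
    overlap = actual_facts & expected_facts
    recall = len(overlap) / len(expected_facts)
    precision = len(overlap) / len(actual_facts) if actual_facts else 0.0
    return recall >= min_recall and precision >= min_precision


# Strings modeled on the W11B example; elided portions are left out.
gt_note = "U-0.18, Min head height 7'-0\", +/- 48 PSF design pressure, safety glazing"
actual_note = "1 3/4\" Triple Glazed IGU, Warm-edge spacer"
w11b_matches = notes_match(actual_note, gt_note)
```

With these inputs the actual note yields no recognised facts, so recall is zero and the pair counts as a mismatch even though the text is well formed.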

Type 3 benchmark (Fairway)¶

Fixture: VTT - Fairway T3.pdf
Pages tested: 1
Template: Standard Takeoff
Strategies fired: text_rule, image_legend
Run: debug/ui_01/run_20260413_003852_502094_4873242e__ui_01/
Flags: High Accuracy Tables on, Monolithic mode off

Current result summary¶

Metric	Value
Expected rows / Actual rows	21 / 21
Row completeness gate	Pass
High-confidence severe-error gate	Pass
Field accuracy (normalized comparator)	0.948413
Remaining normalized mismatches	`Product Type` (7), `Operability` (3), `Frame Material` (1), `Glass Type` (1)

Interpretation¶

Core Type 3 legend grounding is stable on this rerun (row completeness holds, no severe high-confidence failures).
Special Notes semantic match is fully passing under normalized comparison for this run.
Residual errors are concentrated in taxonomy/classification (Product Type) and a small operability cluster.

Regression breakdown (normalized comparator, GT-aligned)¶

`Product Type` — 7 mismatches¶

Cause

Subtype extraction coverage is incomplete on a subset of window rows. Even when subtype evidence is present, Product Type is sometimes left blank.

Examples:

W1
  actual:
  gt:     Other: SLD. GLS.

W3
  actual:
  gt:     Other: CASEMENT

`Operability` — 3 mismatches¶

Cause

A small subset of door styles in the image legend are visually similar and semantically close, which can introduce ambiguity for Gemini when mapping style cards to exact operability labels. D11 and D12 are representative of this legend-level ambiguity.

Examples:

D11
  actual: Swing Double
  gt:     Sliding Door

D12
  actual: Sliding Door
  gt:     Folding

`Frame Material` — 1 mismatch¶

Cause

One door row still drops explicit frame material.

Example:

D2
  actual:
  gt:     Metal

`Glass Type` — 1 mismatch¶

Cause

One row carries a broader composite phrase than GT: the actual value is double glazed insulated glass; Tempered glass, while GT is Tempered glass.

Planned testing upgrades¶

Dissent-based confidence signal (planned)¶

A planned enhancement is to introduce a dissent-based signal so Gemini can lower confidence when extraction evidence is weak, ambiguous, or internally inconsistent.

Expected effect:

better confidence calibration in obscured/low-quality regions
clearer separation between high-confidence errors and uncertain extractions
improved triage for manual audit workflows
