Testing results

This page tracks fixture-based validation of Context Types 1-5.

Result set version: April 13, 2026
Scope update in this pass:

  • Parklane and Kingsbrook sections remain aligned to the latest published reruns.
  • Fairway Type 3 section is refreshed from a new rerun audit (debug/ui_01/run_20260413_003852_502094_4873242e__ui_01/).

Context Type definitions

Type 1 — Contextual Tables

A separate table (for example, a glazing schedule) provides supplementary data that must be merged into the main item table. If the item table references GL-03, the pipeline should fill the item's fields from the matching context-table row. Links can be direct 1:1 references or rule-based (for example, "if operability is Fixed, use this glass makeup").
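The merge described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's implementation: the `glass_ref` link field, the `glass_makeup` attribute, and the sample values are all hypothetical.

```python
from typing import Callable, Optional

Row = dict[str, str]

def merge_context(
    items: list[Row],
    context: dict[str, Row],
    link_key: str = "glass_ref",  # hypothetical field holding e.g. "GL-03"
    rule: Optional[Callable[[Row], Optional[str]]] = None,
) -> list[Row]:
    """Fill missing item fields from the linked context-table row."""
    merged = []
    for item in items:
        # Direct 1:1 link first, then the optional rule-based fallback.
        ref = item.get(link_key) or (rule(item) if rule else None)
        extra = context.get(ref, {}) if ref else {}
        # Item values win; context only fills fields that are absent or empty.
        out = dict(item)
        for field, value in extra.items():
            if not out.get(field):
                out[field] = value
        merged.append(out)
    return merged

# Made-up context table and item rows, for illustration only.
glazing = {
    "GL-03": {"glass_makeup": '1" IGU, low-e'},
    "GL-01": {"glass_makeup": '1/4" tempered'},
}
items = [
    {"row_id": "W1", "glass_ref": "GL-03"},    # direct link
    {"row_id": "W2", "operability": "Fixed"},  # resolved by rule
]

def fixed_rule(row: Row) -> Optional[str]:
    """Hypothetical rule: Fixed units use the GL-01 makeup."""
    return "GL-01" if row.get("operability") == "Fixed" else None

result = merge_context(items, glazing, rule=fixed_rule)
```

Letting item values take precedence over context values is a design assumption here; the real pipeline may resolve conflicts differently.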

Type 2 — Contextual Text & General Notes

Freeform notes (tempered glass rules, performance specs, bird-friendly requirements, structural criteria) define blanket or conditional rules. The pipeline should apply these as metadata to relevant items.

Type 3 — Contextual Item Cards (style/legend)

Visual legends define operability/category/configuration. If a schedule row says D1 is Style C, the pipeline should use the legend to populate operability and related attributes.

Type 4 — Item Cards with Variables

Complex diagrams define variable dimensions across sub-items (including computed dimensions such as total height - sill height). Output should produce separate line items with correct computed values.
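As a rough sketch of what "separate line items with correct computed values" means, the example below expands one item card into sub-items whose dimensions are functions of shared variables. The row IDs, variable names, and dimensions are made up for illustration.

```python
# Shared variables read from the item card (inches; illustrative values).
variables = {"total_height": 96.0, "sill_height": 30.0, "width": 48.0}

# Each sub-item declares its dimensions as functions of the shared variables.
sub_items = {
    "W4-upper": {
        "width": lambda v: v["width"],
        "height": lambda v: v["total_height"] - v["sill_height"],
    },
    "W4-lower": {
        "width": lambda v: v["width"],
        "height": lambda v: v["sill_height"],
    },
}

def expand(card_vars, card_sub_items):
    """Produce one line item per sub-item with its computed dimensions."""
    return [
        {"row_id": row_id, **{name: fn(card_vars) for name, fn in dims.items()}}
        for row_id, dims in card_sub_items.items()
    ]

line_items = expand(variables, sub_items)
```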

Type 5 — Multi-label Item Cards

A single drawing can define multiple rows (for example W3A and W3B) sharing geometry but differing by basis-of-design or parameters. The pipeline should split items and assign shared vs variant-specific attributes correctly.
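One way to picture the split is a helper that copies the shared geometry into every label's row and then layers the label-specific attributes on top. The function name, field names, and values below are hypothetical.

```python
def split_multi_label(labels, shared, variants):
    """Return one row per label: shared attributes plus that label's overrides."""
    return [
        {"row_id": label, **shared, **variants.get(label, {})}
        for label in labels
    ]

rows = split_multi_label(
    ["W3A", "W3B"],
    shared={"width_in": 36.0, "height_in": 60.0},  # geometry common to both
    variants={
        "W3A": {"basis_of_design": "Product X"},   # illustrative values
        "W3B": {"basis_of_design": "Product Y"},
    },
)
```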


Fixture coverage matrix

| Fixture | Latest run used in docs | Types covered | Status |
| --- | --- | --- | --- |
| Parklane (CRPA-Park Lane T1.pdf) | run_20260412_011036 | Type 1, Type 2 | Active |
| Kingsbrook (kingsbrook_takeOFF T2.pdf) | run_20260412_010342 | Type 1-5 (broad mixed coverage) | Active |
| Fairway (VTT - Fairway T3.pdf) | run_20260413_003852_502094_4873242e__ui_01 | Type 3 | Active |

Evaluation methodology

Ground truth (GT)

GT files are manually curated expected outputs for each benchmark fixture. They define the canonical row-level and field-level values used for regression measurement.

  • Location: test/ground-truths/
  • Examples: gt__r01-02__parklane.json, gt__r01-05__kingsbrook.json
  • Alignment key: row_id

What GT is and is not

GT is a human-authored evaluation baseline, not model output. When source documents, field contracts, or unit policies change, GT should be updated in the same change window to keep benchmark scores meaningful.

Normalized comparator architecture

Cartex benchmark summaries use the normalized comparator implemented in test/gates/phase_gate_report.py.

Evaluation flow:

  1. Load actual rows from debug/run_*/<run_id>_rows.json and GT rows from test/ground-truths/.
  2. Align rows by row_id.
  3. Apply fixture-level ignore metadata for known non-deterministic obscured-row extraction artifacts in Kingsbrook-like layouts (see Known non-deterministic extraction issue in the Kingsbrook section).
  4. Compare every field in every common row using field-aware normalization logic.
  5. Aggregate mismatches into row-level and run-level metrics.
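The flow above can be sketched as follows. This is not the code in test/gates/phase_gate_report.py: the function names are hypothetical, rows are assumed to be dicts carrying a row_id key, and the ignore metadata is modelled as a plain set of row IDs even though the real metadata may be positional.

```python
def default_match(a, b):
    """Fallback matcher: case-insensitive, whitespace-trimmed equality."""
    return str(a or "").strip().lower() == str(b or "").strip().lower()

def align_rows(actual, gt, ignore_ids=frozenset()):
    """Index both row lists by row_id and report missing/extra/common IDs."""
    actual_by_id = {r["row_id"]: r for r in actual if r["row_id"] not in ignore_ids}
    gt_by_id = {r["row_id"]: r for r in gt}
    missing = sorted(gt_by_id.keys() - actual_by_id.keys())
    extra = sorted(actual_by_id.keys() - gt_by_id.keys())
    common = sorted(gt_by_id.keys() & actual_by_id.keys())
    return actual_by_id, gt_by_id, missing, extra, common

def compare(actual, gt, fields, matchers=None, ignore_ids=frozenset()):
    """List mismatching fields for every common row."""
    matchers = matchers or {}
    a, g, missing, extra, common = align_rows(actual, gt, ignore_ids)
    mismatches = {
        rid: [
            f for f in fields
            if not matchers.get(f, default_match)(a[rid].get(f), g[rid].get(f))
        ]
        for rid in common
    }
    return {"missing_row_ids": missing, "extra_row_ids": extra, "mismatches": mismatches}

# Illustrative rows only; W75B stands in for a suppressed hallucinated row.
actual_rows = [{"row_id": "W1", "Material": "Vinyl"}, {"row_id": "W75B", "Material": ""}]
gt_rows = [{"row_id": "W1", "Material": "vinyl"}, {"row_id": "W2", "Material": "vinyl"}]
report = compare(actual_rows, gt_rows, ["Material"], ignore_ids={"W75B"})
```

Field-specific matchers plug in through the `matchers` mapping, so the alignment step stays independent of the normalization rules.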

Field-aware matching rules:

| Field group | Normalized matching logic |
| --- | --- |
| General text fields | Case-insensitive, whitespace-normalized string match |
| Width, Height | Parse units into inches, then numeric compare |
| Rough Opening Measurements | Parse A x B into inch pairs, compare canonical pairs |
| Quantity, Glass Layer | Parse integer-like tokens, compare numeric values |
| Frame Brand | Token overlap/containment check (e.g. Rehau vs Starr Rehau) |
| Glass Arrangement Configuration | Structural compare of panel dimensions/count and metal-panel marker |
| Special Notes | Semantic fact extraction + overlap thresholds (recall/precision gating) |
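To make three of these rules concrete, here is a minimal sketch of a text normalizer, a feet-and-inches parser, and a brand token-containment check. The function names and the accepted dimension syntax are assumptions, and the real comparator likely handles more formats (this parser, for instance, rejects a bare fraction such as 1/2").

```python
import re
from fractions import Fraction

def norm_text(value):
    """Case-insensitive, whitespace-normalized text."""
    return " ".join(str(value or "").split()).lower()

_FEET_INCHES = re.compile(
    r"""^\s*(?:(?P<ft>\d+(?:\.\d+)?)\s*')?     # optional feet, e.g. 3'
        \s*-?\s*                               # optional dash separator
        (?:(?P<inch>\d+(?:\.\d+)?)             # optional whole/decimal inches
           (?:\s+(?P<num>\d+)/(?P<den>\d+))?   # optional fraction, e.g. 1/2
           \s*"?)?\s*$
    """,
    re.VERBOSE,
)

def to_inches(value):
    """Parse strings like 3'-6", 42" or 6 1/2" into inches; None if unparseable."""
    m = _FEET_INCHES.match(str(value or ""))
    if not m or not (m["ft"] or m["inch"]):
        return None
    total = Fraction(m["ft"] or 0) * 12 + Fraction(m["inch"] or 0)
    if m["num"]:
        if int(m["den"]) == 0:
            return None
        total += Fraction(int(m["num"]), int(m["den"]))
    return total

def dims_match(a, b):
    """Numeric compare in inches; fall back to text compare if parsing fails."""
    ia, ib = to_inches(a), to_inches(b)
    if ia is None or ib is None:
        return norm_text(a) == norm_text(b)
    return ia == ib

def brand_match(a, b):
    """Token containment: one brand's tokens are a subset of the other's."""
    ta, tb = set(norm_text(a).split()), set(norm_text(b).split())
    return bool(ta and tb) and (ta <= tb or tb <= ta)
```

Using `Fraction` rather than floats keeps values such as 6 1/2" exact, so equal dimensions written in different notations compare equal without a tolerance.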

Reporting policy

Testing documentation uses normalized comparator metrics as the release-quality KPI. Strict exact-string comparison may still be used for internal diagnostics but is intentionally not surfaced in benchmark summaries.

Score calculation

For a run with common_rows aligned rows and F comparable fields per row:

total_comparable_fields = common_rows * F
total_field_mismatches  = count of normalized non-matching field comparisons
field_accuracy          = (total_comparable_fields - total_field_mismatches) / total_comparable_fields

row_accuracy(row_i)     = matched_fields(row_i) / F
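As a worked check of these formulas, the sketch below recomputes two of the field-accuracy values reported on this page. The values of F are not stated in the page; F = 14 (Parklane) and F = 10 (Kingsbrook) are inferred here because they reproduce the published figures from the listed row and mismatch counts.

```python
def field_accuracy(common_rows, fields_per_row, total_field_mismatches):
    """(total_comparable_fields - total_field_mismatches) / total_comparable_fields"""
    total = common_rows * fields_per_row
    if total == 0:
        raise ValueError("no comparable fields")
    return (total - total_field_mismatches) / total

def row_accuracy(matched_fields, fields_per_row):
    """matched_fields(row_i) / F"""
    return matched_fields / fields_per_row

# Parklane: 19 rows, 19 + 19 = 38 mismatches, F = 14 (inferred).
parklane = round(field_accuracy(19, 14, 38), 6)    # 0.857143

# Kingsbrook: 50 rows, 50 + 47 = 97 mismatches, F = 10 (inferred).
kingsbrook = round(field_accuracy(50, 10, 97), 6)  # 0.806

# A Parklane row missing only Glass Layer and Glass Width matches 12 of 14 fields.
example_row = row_accuracy(12, 14)
```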

Gate metrics:

  • row_completeness_pass: missing_row_ids == [] and extra_row_ids == [] after configured ignore rules.
  • no_high_conf_severe_errors_pass: no row with confidence >= 0.9 and row_accuracy <= 0.4.
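Both gates reduce to simple predicates. The sketch below assumes each row record exposes `confidence` and `row_accuracy` keys, which is an assumption about the report structure rather than a documented schema; the thresholds are the ones stated above.

```python
def row_completeness_pass(missing_row_ids, extra_row_ids):
    """Pass only when no rows are missing or extra after ignore rules."""
    return not missing_row_ids and not extra_row_ids

def no_high_conf_severe_errors_pass(rows, conf_min=0.9, acc_max=0.4):
    """Fail if any row is both high-confidence and severely inaccurate."""
    return not any(
        r["confidence"] >= conf_min and r["row_accuracy"] <= acc_max
        for r in rows
    )

# Illustrative rows: a confident accurate row and an uncertain inaccurate one.
sample_rows = [
    {"row_id": "W1", "confidence": 0.95, "row_accuracy": 0.86},
    {"row_id": "W2", "confidence": 0.50, "row_accuracy": 0.30},
]
gates = {
    "row_completeness_pass": row_completeness_pass([], []),
    "no_high_conf_severe_errors_pass": no_high_conf_severe_errors_pass(sample_rows),
}
```

The second gate only penalizes rows that are wrong while claiming high confidence, so W2 above does not trip it despite its low accuracy.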

Type 1 + Type 2 benchmark (Parklane)

Fixture: CRPA-Park Lane T1.pdf
Pages tested: 1
Template: Glass Schedule
Strategies fired: auxiliary_table, text_rule, dimension_card
Run: debug/run_20260412_011036/
Flags: High Accuracy Tables on, Monolithic mode off

Current result summary

| Metric | Value |
| --- | --- |
| Expected rows / Actual rows | 19 / 19 |
| Row completeness gate | Pass |
| High-confidence severe-error gate | Pass |
| Field accuracy (normalized comparator) | 0.857143 |
| Remaining normalized mismatches | Glass Layer (19), Glass Width (19) |

Interpretation

  • Type 1 behavior is functioning (context table grounding and row mapping).
  • Type 2 behavior is functioning (text-note metadata grounding).
  • The remaining mismatch cluster is currently attributed to ambiguous field contracts in the template source (Glass Layer and Glass Width share the same rule text), not to row loss.

Regression breakdown (normalized comparator, GT-aligned)

Glass Layer — 19 mismatches

Cause

Thickness text (1" IGU) is being written into a layer-count slot. This behavior is shaped by the CATO2 template field-rule contracts that are injected into specialist instructions. In the current template source, Glass Layer and Glass Width carry the same verbatim rule text (Glass thickness if determinable), so the two fields receive indistinguishable guidance. This is being investigated as a likely upstream rule-definition issue.

Examples:

C1
  actual: 1" IGU
  gt:     2

P1
  actual: 1" IGU
  gt:     2

Glass Width — 19 mismatches

Cause

The same thickness token leaks into Glass Width for the same contract-level reason: CATO2-provided field-rule contracts are injected into specialist prompts, and Glass Width shares its verbatim rule text with Glass Layer (Glass thickness if determinable). This overlap in upstream rule text is under investigation.

Examples:

C1
  actual: 1" IGU
  gt:

P10
  actual: 1" IGU
  gt:

Type 1-5 broad benchmark (Kingsbrook)

Fixture: Pages from kingsbrook_takeOFF T2.pdf
Pages tested: 1-2
Template: Standard Takeoff
Strategies fired: auxiliary_table, text_rule, image_legend, dimension_card, multi_label
Run: debug/run_20260412_010342/
Flags: High Accuracy Tables on, Monolithic mode off

Current result summary

| Metric | Value |
| --- | --- |
| Expected rows / Actual rows (raw) | 50 / 51 |
| Expected rows / Actual rows (after fixture suppression) | 50 / 50 |
| Row completeness gate | Pass |
| High-confidence severe-error gate | Pass |
| Field accuracy (normalized comparator) | 0.806 |
| Remaining normalized mismatches | Material (50), Special Notes (47) |

Known non-deterministic extraction issue

The obscured lower portion of the main schedule can trigger hallucinated rows in the last ~5 rows of the detected main table.

  • This issue is not fixed.
  • W75B was one observed manifestation in this run, but hallucinated row IDs/values can differ run-to-run.
  • Treat this as an extraction instability zone, not a single deterministic row bug.

Regression breakdown (normalized comparator, GT-aligned)

Material — 50 mismatches

Cause

The document usually implies frame material through the product family (Starr Rehau Artevo) rather than stating it explicitly. The current pipeline does not consistently infer that implicit mapping: Material is either left empty or filled with a glazing descriptor (Triple Glazed).

Examples:

W11B
  actual:
  gt:     uPVC/vinyl

W1A
  actual: Triple Glazed
  gt:     uPVC/vinyl

W77A
  actual: Triple Glazed
  gt:     uPVC/vinyl

Special Notes — 47 mismatches

Cause

Notes synthesis captures many facts but often drops key constraint clauses. Dominant missing GT clusters are design_pressure, safety_glazing, air_leakage, head_height_rule, and some u_factor tokens.

Examples:

W11B
  actual: 1 3/4" Triple Glazed IGU ... Warm-edge spacer ... (shortened)
  gt:     ... U-0.18 ... Min head height 7'-0" ... +/- 48 PSF design pressure ... safety glazing ...

W13A
  actual: 1 3/4" Triple Glazed IGU ... (shortened)
  gt:     ... design pressure + safety glazing clauses present ...

Type 3 benchmark (Fairway)

Fixture: VTT - Fairway T3.pdf
Pages tested: 1
Template: Standard Takeoff
Strategies fired: text_rule, image_legend
Run: debug/ui_01/run_20260413_003852_502094_4873242e__ui_01/
Flags: High Accuracy Tables on, Monolithic mode off

Current result summary

| Metric | Value |
| --- | --- |
| Expected rows / Actual rows | 21 / 21 |
| Row completeness gate | Pass |
| High-confidence severe-error gate | Pass |
| Field accuracy (normalized comparator) | 0.948413 |
| Remaining normalized mismatches | Product Type (7), Operability (3), Frame Material (1), Glass Type (1) |

Interpretation

  • Core Type 3 legend grounding is stable on this rerun (row completeness holds, no severe high-confidence failures).
  • Special Notes semantic match is fully passing under normalized comparison for this run.
  • Residual errors are concentrated in taxonomy/classification (Product Type) and a small operability cluster.

Regression breakdown (normalized comparator, GT-aligned)

Product Type — 7 mismatches

Cause

Subtype extraction coverage is incomplete on a subset of window rows. Even when subtype evidence is present, Product Type is sometimes left blank.

Examples:

W1
  actual:
  gt:     Other: SLD. GLS.

W3
  actual:
  gt:     Other: CASEMENT

Operability — 3 mismatches

Cause

A small subset of door styles in the image legend are visually similar and semantically close, which can introduce ambiguity for Gemini when mapping style cards to exact operability labels. D11 and D12 are representative of this legend-level ambiguity.

Examples:

D11
  actual: Swing Double
  gt:     Sliding Door

D12
  actual: Sliding Door
  gt:     Folding

Frame Material — 1 mismatch

Cause

One door row still drops explicit frame material.

Example:

D2
  actual:
  gt:     Metal

Glass Type — 1 mismatch

Cause

One row carries a broader composite phrase than GT: the actual value is "double glazed insulated glass; Tempered glass", while GT is "Tempered glass".


Planned testing upgrades

Dissent-based confidence signal (planned)

A planned enhancement is to introduce a dissent-based signal so Gemini can lower confidence when extraction evidence is weak, ambiguous, or internally inconsistent.

Expected effect:

  • better confidence calibration in obscured/low-quality regions
  • clearer separation between high-confidence errors and uncertain extractions
  • improved triage for manual audit workflows