Merge algorithm

The enrichment pipeline uses a field-level merge architecture:

MergeResolver in src/pipeline/merge_resolver.py resolves structured fields.
SpecialNotesAdjudicator in src/pipeline/adjudicator.py synthesizes Special Notes.

Staged enrichment resolves structured fields with MergeResolver.resolve() and then applies SpecialNotesAdjudicator for semantic note synthesis.

Field-level authority matrix

Merge precedence is configuration-driven via src/pipeline/field_authority.yaml.

The matrix supports:

global: default per-field ranked source list
template_overrides: per-template ranked source overrides
field_aliases: canonical mapping for misspelled/variant column names

Example concepts:

Width can prioritize dimension_card or main_table_seed depending on template.
Material can prioritize text_rule over table-derived values.
Special Notes is delegated to the adjudicator instead of direct winner-takes-all.
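
The lookup order implied above (alias, then template override, then global default) can be sketched as follows. This is a minimal illustration, not the resolver's code: the in-memory AUTHORITY dict, the door_schedule template name, the alias entries, and the exact rankings signature are all assumptions. Only the three section names and the source names come from this page.

```python
from typing import List

# Hypothetical in-memory form of field_authority.yaml. Only the three
# top-level section names are documented; everything below them is made up.
AUTHORITY = {
    "global": {
        "Width": ["main_table_seed", "dimension_card"],
        "Material": ["text_rule", "main_table_seed"],
    },
    "template_overrides": {
        # "door_schedule" is an invented template name for illustration.
        "door_schedule": {"Width": ["dimension_card", "main_table_seed"]},
    },
    "field_aliases": {"Widht": "Width", "Mat.": "Material"},
}


def rankings(template: str, field: str, matrix: dict = AUTHORITY) -> List[str]:
    """Return the ranked source list for a field (best source first)."""
    # 1. Map misspelled or variant column names to the canonical field.
    canonical = matrix.get("field_aliases", {}).get(field, field)
    # 2. A per-template override wins over the global default.
    override = matrix.get("template_overrides", {}).get(template, {})
    if canonical in override:
        return list(override[canonical])
    # 3. Otherwise fall back to the global ranking (empty if unknown).
    return list(matrix.get("global", {}).get(canonical, []))
```

With this data, rankings("door_schedule", "Widht") resolves the alias and returns the template-specific order, while any other template gets the global order for Width.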

Candidate collection

For each row and field, MergeResolver builds FieldCandidate entries from:

Main schedule seed (main_table_seed): direct values from main tables (with __row_id__) are always added as candidates.
Specialist outputs: every selected strategy contributes candidates from GeminiEnrichedRowResult.

Each candidate includes:

strategy
value
field_source
row_conf
optional field_claim_conf
stage_index
reasoning
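
As a rough picture of the candidate record, the attributes listed above could be carried by a small dataclass. The attribute names come from this page; the types, defaults, and sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldCandidate:
    """Illustrative shape only; the types here are assumed, not documented."""

    strategy: str            # e.g. "main_table_seed" or a specialist name
    value: str               # proposed value for the field
    field_source: str        # where the value was read from
    row_conf: float          # confidence for the whole row
    stage_index: int         # stage that produced the candidate
    reasoning: str = ""      # free-text justification
    field_claim_conf: Optional[float] = None  # optional per-field confidence


# A seed candidate typically has no per-field claim confidence.
seed = FieldCandidate(
    strategy="main_table_seed",
    value="900",
    field_source="MAIN_TABLE",
    row_conf=1.0,
    stage_index=0,
    reasoning="copied from main table",
)
```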

Winner selection

For each non-note field, MergeResolver scores candidates with this precedence:

authority rank from matrix (rankings(template, field))
enum validity penalty (valid values preferred)
higher field_claim_conf
higher row_conf
lower stage_index (earlier-stage tie-break)

This gives deterministic per-field selection without global strategy lock-in.
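
The precedence list above maps naturally onto a tuple sort key. The sketch below assumes that strategies missing from the ranking sort last and that a missing field_claim_conf counts as 0.0; neither rule is stated on this page. Cand and pick_winner are hypothetical names, not the resolver's API.

```python
from typing import List, NamedTuple, Optional, Sequence, Set


class Cand(NamedTuple):
    """Cut-down candidate with only the attributes used for scoring."""

    strategy: str
    value: str
    row_conf: float
    stage_index: int
    field_claim_conf: Optional[float] = None


def pick_winner(
    cands: Sequence[Cand],
    ranked: List[str],
    allowed: Optional[Set[str]] = None,
) -> Cand:
    """Pick one candidate; smaller key tuples win."""

    def key(c: Cand):
        # 1. Authority rank (assumption: unranked strategies go last).
        rank = ranked.index(c.strategy) if c.strategy in ranked else len(ranked)
        # 2. Enum validity penalty; non-enum fields pass allowed=None.
        invalid = 0 if allowed is None or c.value in allowed else 1
        # 3. Higher field_claim_conf first (assumption: None counts as 0.0).
        claim = c.field_claim_conf if c.field_claim_conf is not None else 0.0
        # 4. Higher row_conf first, then 5. lower stage_index.
        return (rank, invalid, -claim, -c.row_conf, c.stage_index)

    return min(cands, key=key)
```

Because the key is a plain tuple, the same inputs always yield the same winner, which is the determinism the section describes.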

Computed Source Type

Source Type is configured as a computed field ([computed]) and is not selected through normal candidate ranking.

After the non-note winners are selected, the resolver computes Source Type from the winning field_sources:

any FieldSource.IMAGE_CONTEXT present -> Image
else any of FieldSource.MAIN_TABLE / FieldSource.AUXILIARY_TABLE / FieldSource.TEXT_CONTEXT present -> Table
else -> empty string
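
The three rules above amount to a short function. In this sketch the FieldSource member names are taken from the text, while the enum's string values and the compute_source_type name are placeholders.

```python
from enum import Enum
from typing import Iterable


class FieldSource(Enum):
    # Member names follow the text; the string values are placeholders.
    IMAGE_CONTEXT = "image_context"
    MAIN_TABLE = "main_table"
    AUXILIARY_TABLE = "auxiliary_table"
    TEXT_CONTEXT = "text_context"


_TABLE_LIKE = {
    FieldSource.MAIN_TABLE,
    FieldSource.AUXILIARY_TABLE,
    FieldSource.TEXT_CONTEXT,
}


def compute_source_type(winning_sources: Iterable[FieldSource]) -> str:
    """Derive Source Type from the sources of the winning fields."""
    sources = set(winning_sources)
    if FieldSource.IMAGE_CONTEXT in sources:
        return "Image"  # any image-derived winner marks the row as Image
    if sources & _TABLE_LIKE:
        return "Table"
    return ""  # no recognised source among the winners
```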

Enum-aware preference

If an enum field has both valid and invalid candidates, the resolver prefers valid candidates before any fallback to invalid values. Final coercion to Other: ... still happens later, in the enricher's enum validation.

Special Notes adjudication

Special Notes uses semantic adjudication rather than substring deduplication. In benchmark runs, lexical dedupe dropped or conflated valid clauses and reduced note recall. This follows Cartex's guardrail model: open-text fields are governed by soft semantic controls, while hard enforcement is reserved for bounded fields (see Guardrails: hard vs soft).

Flow:

Resolver collects note observations from main_table_seed and all specialists.
SpecialNotesAdjudicator groups by row_id.
Single-observation rows become one bullet immediately.
Multi-observation rows are batched and sent to fast-model synthesis (GeminiSpecialNotesSynthesisResult).
On model failure, deterministic fallback emits stable unique bullets from raw observations.

Output format in EnrichedRow.data["Special Notes"] is newline-separated bullet lines.
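
The flow above can be approximated in a few lines. This is a sketch under stated assumptions: synthesize stands in for the fast-model call and is a hypothetical callable, the "- " bullet marker is a guess at the output style, and the fallback uses exact-text uniqueness in first-seen order as one way to get "stable unique bullets".

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Batch = Dict[str, List[str]]


def fallback_bullets(observations: List[str]) -> List[str]:
    """Deterministic fallback: unique observations in first-seen order."""
    seen = set()
    out = []
    for obs in observations:
        text = obs.strip()
        if text and text not in seen:
            seen.add(text)
            out.append(text)
    return out


def adjudicate(
    observations: Iterable[Tuple[str, str]],
    synthesize: Callable[[Batch], Batch],  # hypothetical model wrapper
) -> Dict[str, str]:
    """Return newline-separated bullet text per row_id."""
    grouped: Batch = defaultdict(list)
    for row_id, note in observations:
        grouped[row_id].append(note)

    bullets: Batch = {}
    batch: Batch = {}
    for row_id, notes in grouped.items():
        if len(notes) == 1:
            bullets[row_id] = [notes[0].strip()]  # no model call needed
        else:
            batch[row_id] = notes

    if batch:
        try:
            synthesized = synthesize(batch)
        except Exception:
            synthesized = {}  # model failure: fall back for every row
        for row_id, notes in batch.items():
            bullets[row_id] = synthesized.get(row_id) or fallback_bullets(notes)

    return {rid: "\n".join(f"- {b}" for b in bs) for rid, bs in bullets.items()}
```

Passing a synthesize callable that raises exercises the fallback path, which is a convenient way to check that rows never lose their notes when the model is unavailable.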

Confidence, sources, and reasoning

For resolved rows:

confidence: minimum row_conf across the winning non-note field candidates, starting from 1.0 (a row with no such candidates keeps 1.0)
field_sources: copied from winning candidate source metadata
reasoning: deduplicated concatenation of winner reasonings plus adjudicator summary for synthesized notes
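
A small sketch of how the confidence and reasoning values could be derived from the winners. The function names and the " | " separator are assumptions; the min-with-1.0 rule and the order-preserving deduplication follow the description above.

```python
from typing import Iterable, Optional


def row_confidence(winner_row_confs: Iterable[float]) -> float:
    """Minimum row_conf of the winners, starting from 1.0."""
    return min([1.0, *winner_row_confs])


def merge_reasoning(
    winner_reasonings: Iterable[str],
    adjudicator_summary: Optional[str] = None,
    sep: str = " | ",  # separator is an assumption
) -> str:
    """Concatenate reasonings once each, keeping first-seen order."""
    parts = list(winner_reasonings)
    if adjudicator_summary:
        parts.append(adjudicator_summary)
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique = dict.fromkeys(p.strip() for p in parts if p and p.strip())
    return sep.join(unique)
```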

Row recovery

Authoritative row IDs are prepared in Enricher before resolver execution:

tables whose IDs are purely index-based fallbacks ({table_id}_row_{n}) are excluded from the authoritative set to avoid false positives

The resolver then guarantees row completeness against that authoritative set:

if a row ID has no resolved output, it is recovered as an empty EnrichedRow (confidence=0.0)
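
Both halves of row recovery can be sketched together. Assumptions here: the {n} in the fallback pattern is a non-negative integer, "purely index-based" means every ID in the table matches the pattern, and an EnrichedRow is stood in for by a plain dict. All function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List


def is_index_fallback_id(row_id: str, table_id: str) -> bool:
    """True for IDs shaped like {table_id}_row_{n}."""
    return re.fullmatch(re.escape(table_id) + r"_row_\d+", row_id) is not None


def authoritative_ids(tables: Dict[str, List[str]]) -> List[str]:
    """Collect row IDs, skipping tables that only have index-based IDs."""
    ids: List[str] = []
    for table_id, row_ids in tables.items():
        if row_ids and all(is_index_fallback_id(r, table_id) for r in row_ids):
            continue  # fallback-only table: not authoritative
        ids.extend(row_ids)
    return ids


def recover_missing(
    resolved: Dict[str, dict], authoritative: Iterable[str]
) -> Dict[str, dict]:
    """Add an empty row with confidence 0.0 for each unresolved row ID."""
    out = dict(resolved)
    for row_id in authoritative:
        out.setdefault(row_id, {"data": {}, "confidence": 0.0})
    return out
```

Excluding fallback-only tables first matters because their synthetic IDs would otherwise be "recovered" as empty rows even though no real row was lost.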

Merge flow

flowchart TB
    subgraph candidates["Candidate collection"]
        MAIN["main_table_seed candidates"]
        S1["stage specialists (all completed stages)"]
        MAIN --> CAND["FieldCandidate pool by row_id + field"]
        S1 --> CAND
    end

    subgraph resolve["MergeResolver"]
        CAND --> PICK["per-field winner scoring<br/>(rank, validity, confidence, stage)"]
        PICK --> ROWS["list[EnrichedRow] without final notes"]
    end

    subgraph notes["SpecialNotesAdjudicator"]
        CAND --> OBS["note observations by row_id"]
        OBS --> LLM["batched semantic synthesis<br/>(FAST model)"]
        LLM --> APPLY["bullet notes + fallback on failure"]
    end

    ROWS --> APPLY --> FINAL["final EnrichedRow list"]