Merge algorithm

The enrichment pipeline uses a field-level merge architecture:

  • MergeResolver in src/pipeline/merge_resolver.py resolves structured fields.
  • SpecialNotesAdjudicator in src/pipeline/adjudicator.py synthesizes Special Notes.

Staged enrichment resolves structured fields with MergeResolver.resolve() and then applies SpecialNotesAdjudicator for semantic note synthesis.

Field-level authority matrix

Merge precedence is configuration-driven via src/pipeline/field_authority.yaml.

The matrix supports:

  • global: default per-field ranked source list
  • template_overrides: per-template ranked source overrides
  • field_aliases: canonical mapping for misspelled/variant column names

Example concepts:

  • Width can prioritize dimension_card or main_table_seed depending on template.
  • Material can prioritize text_rule over table-derived values.
  • Special Notes is delegated to the adjudicator instead of direct winner-takes-all selection.
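
The lookup described above can be sketched as follows. This is a minimal illustration, not the project's code: the matrix is shown as a Python dict rather than loaded from YAML, and the template name door_schedule, the alias Widht, and the exact ranked lists are hypothetical. Only the three top-level keys and the source names come from the text above.

```python
from typing import Dict, List

# Hypothetical in-memory form of field_authority.yaml. The template name,
# alias entries, and ranked lists are illustrative assumptions.
AUTHORITY: Dict[str, dict] = {
    "global": {
        "Width": ["main_table_seed", "dimension_card"],
        "Material": ["text_rule", "main_table_seed"],
    },
    "template_overrides": {
        "door_schedule": {"Width": ["dimension_card", "main_table_seed"]},
    },
    "field_aliases": {"Widht": "Width", "width": "Width"},
}


def rankings(template: str, field: str, matrix: Dict[str, dict] = AUTHORITY) -> List[str]:
    """Return the ranked source list for a field.

    Aliases are resolved first, then a template override wins over the
    global default. Unknown fields yield an empty ranking.
    """
    canonical = matrix["field_aliases"].get(field, field)
    override = matrix["template_overrides"].get(template, {})
    if canonical in override:
        return override[canonical]
    return matrix["global"].get(canonical, [])
```

With this sketch, rankings("door_schedule", "Widht") resolves the alias to Width and returns the template override, while any other template falls back to the global list.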

Candidate collection

For each row and field, MergeResolver builds FieldCandidate entries from:

  1. Main schedule seed (main_table_seed)
    Direct values from main tables (with __row_id__) are always added as candidates.
  2. Specialist outputs
    Every selected strategy contributes candidates from GeminiEnrichedRowResult.

Each candidate includes:

  • strategy
  • value
  • field_source
  • row_conf
  • optional field_claim_conf
  • stage_index
  • reasoning
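
As a rough picture of the candidate record, the fields listed above could be modeled like this. The field names come from the list; the types, defaults, and the example values are assumptions, and the real FieldCandidate may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldCandidate:
    """Sketch of one candidate value for a (row, field) pair. Types are assumed."""

    strategy: str            # which strategy produced the value, e.g. "main_table_seed"
    value: str               # the proposed field value
    field_source: str        # provenance label for the value
    row_conf: float          # row-level confidence reported by the strategy
    stage_index: int         # pipeline stage that produced the candidate
    reasoning: str = ""      # free-text justification
    field_claim_conf: Optional[float] = None  # optional per-field confidence


# Illustrative seed candidate taken directly from a main table.
seed = FieldCandidate(
    strategy="main_table_seed",
    value="900",
    field_source="MAIN_TABLE",
    row_conf=1.0,
    stage_index=0,
)
```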

Winner selection

For each non-note field, MergeResolver ranks candidates by the following criteria, in order of precedence:

  1. authority rank from matrix (rankings(template, field))
  2. enum validity penalty (valid values preferred)
  3. higher field_claim_conf
  4. higher row_conf
  5. lower stage_index (earlier-stage tie-break)

This gives deterministic per-field selection without locking an entire row to a single strategy.
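
One way to express this precedence is a tuple sort key, where earlier elements dominate later ones. The sketch below is illustrative only: the Candidate class is a trimmed stand-in for FieldCandidate, and two behaviors are assumptions rather than documented facts: strategies missing from the ranking sort after all ranked ones, and a missing field_claim_conf is treated as 0.0.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Trimmed stand-in for FieldCandidate (illustrative)."""

    strategy: str
    value: str
    row_conf: float
    stage_index: int
    field_claim_conf: Optional[float] = None


def sort_key(
    c: Candidate, ranked: Sequence[str], is_valid: Callable[[str], bool]
) -> Tuple[int, int, float, float, int]:
    # 1. authority rank; unranked strategies go last (assumption)
    rank = ranked.index(c.strategy) if c.strategy in ranked else len(ranked)
    # 2. enum validity penalty: valid values sort first
    invalid = 0 if is_valid(c.value) else 1
    # 3. higher field_claim_conf first; missing treated as 0.0 (assumption)
    claim = c.field_claim_conf if c.field_claim_conf is not None else 0.0
    # 4. higher row_conf first, 5. lower stage_index first
    return (rank, invalid, -claim, -c.row_conf, c.stage_index)


def pick_winner(
    candidates: Iterable[Candidate],
    ranked: Sequence[str],
    is_valid: Callable[[str], bool] = lambda value: True,
) -> Optional[Candidate]:
    """Return the best candidate under the precedence above, or None if empty."""
    return min(candidates, key=lambda c: sort_key(c, ranked, is_valid), default=None)
```

Because the key is a plain tuple, each comparison only falls through to the next criterion when all earlier ones are equal.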

Computed Source Type

Source Type is configured as a computed field ([computed]) and is not selected through normal candidate ranking.

After non-note winners are selected, the resolver computes Source Type from the winning field_sources:

  • any FieldSource.IMAGE_CONTEXT present -> Image
  • else any of FieldSource.MAIN_TABLE / FieldSource.AUXILIARY_TABLE / FieldSource.TEXT_CONTEXT present -> Table
  • else -> empty string
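
The three rules above map directly onto a small function. In this sketch the FieldSource member names are taken from the text, but the enum's string values and the function name compute_source_type are assumptions.

```python
from enum import Enum
from typing import Iterable


class FieldSource(Enum):
    # Member names follow the text above; the string values are assumed.
    MAIN_TABLE = "main_table"
    AUXILIARY_TABLE = "auxiliary_table"
    TEXT_CONTEXT = "text_context"
    IMAGE_CONTEXT = "image_context"


# Sources that map to "Table" when no image context is present.
TABLE_LIKE = {
    FieldSource.MAIN_TABLE,
    FieldSource.AUXILIARY_TABLE,
    FieldSource.TEXT_CONTEXT,
}


def compute_source_type(winning_sources: Iterable[FieldSource]) -> str:
    """Derive Source Type from the sources of the winning field candidates."""
    sources = set(winning_sources)
    if FieldSource.IMAGE_CONTEXT in sources:
        return "Image"
    if sources & TABLE_LIKE:
        return "Table"
    return ""
```

Note that a single image-derived winner is enough to label the whole row Image, even when every other field came from a table.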

Enum-aware preference

If an enum field has both valid and invalid candidates, the resolver prefers valid candidates before any fallback to invalid values. Final coercion to Other: ... still happens later in enricher enum validation.

Special Notes adjudication

Special Notes uses semantic adjudication rather than substring deduplication. In benchmark runs, lexical dedupe dropped or conflated valid clauses and reduced note recall. This follows Cartex's guardrail model: open-text fields are governed by soft semantic controls, while hard enforcement is reserved for bounded fields (see Guardrails: hard vs soft).

Flow:

  1. Resolver collects note observations from main_table_seed and all specialists.
  2. SpecialNotesAdjudicator groups by row_id.
  3. Single-observation rows become one bullet immediately.
  4. Multi-observation rows are batched and sent to fast-model synthesis (GeminiSpecialNotesSynthesisResult).
  5. On model failure, deterministic fallback emits stable unique bullets from raw observations.

Output format in EnrichedRow.data["Special Notes"] is newline-separated bullet lines.
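
The flow above can be sketched as a single function with the model call injected as a callable. This is an illustration under stated assumptions, not the adjudicator's actual code: the function names, the synthesize callable's shape (a batch of row_id to notes in, row_id to bullets out), and the "- " bullet marker are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed shape of the model call: {row_id: [notes]} -> {row_id: [bullets]}.
Synthesizer = Callable[[Dict[str, List[str]]], Dict[str, List[str]]]


def fallback_bullets(observations: List[str]) -> List[str]:
    """Deterministic fallback: unique observations in first-seen order."""
    seen = set()
    out: List[str] = []
    for obs in observations:
        text = obs.strip()
        if text and text not in seen:
            seen.add(text)
            out.append(text)
    return out


def adjudicate(
    observations: Iterable[Tuple[str, str]], synthesize: Synthesizer
) -> Dict[str, str]:
    """Group notes by row_id and return newline-separated bullets per row."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row_id, note in observations:
        grouped[row_id].append(note)

    bullets: Dict[str, List[str]] = {}
    batch: Dict[str, List[str]] = {}
    for row_id, notes in grouped.items():
        if len(notes) == 1:
            bullets[row_id] = [notes[0].strip()]  # single observation: no model call
        else:
            batch[row_id] = notes  # multiple observations: defer to synthesis

    if batch:
        try:
            synthesized = synthesize(batch)
        except Exception:
            synthesized = {}  # model failure: every batched row uses the fallback
        for row_id, notes in batch.items():
            bullets[row_id] = synthesized.get(row_id) or fallback_bullets(notes)

    # The "- " marker is an assumption; the text only says "bullet lines".
    return {rid: "\n".join(f"- {b}" for b in items) for rid, items in bullets.items()}
```

Injecting the synthesizer keeps the fallback path testable without a model: passing a callable that raises exercises exactly the failure branch.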

Confidence, sources, and reasoning

For resolved rows:

  • confidence: minimum row_conf of winning non-note field candidates (defaults to 1.0 if no candidates lower it)
  • field_sources: copied from winning candidate source metadata
  • reasoning: deduplicated concatenation of winner reasonings plus adjudicator summary for synthesized notes
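
A minimal sketch of the confidence and reasoning rules, assuming confidences lie in [0, 1]. The helper names and the " | " separator are hypothetical; the text above does not specify how reasonings are joined.

```python
from typing import Iterable


def row_confidence(winner_row_confs: Iterable[float]) -> float:
    """Minimum row_conf across winning candidates, starting from 1.0."""
    return min([1.0, *winner_row_confs])


def merge_reasoning(reasonings: Iterable[str], adjudicator_summary: str = "") -> str:
    """Concatenate unique reasonings in first-seen order, then the note summary."""
    cleaned = (r.strip() for r in reasonings if r and r.strip())
    parts = list(dict.fromkeys(cleaned))  # dict preserves insertion order
    summary = adjudicator_summary.strip()
    if summary and summary not in parts:
        parts.append(summary)
    return " | ".join(parts)  # separator is an assumption
```

Taking the minimum means one low-confidence winning field pulls down the confidence of the whole row, which is the conservative choice.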

Row recovery

Authoritative row IDs are prepared in Enricher before resolver execution:

  • tables whose IDs are purely index-based fallbacks ({table_id}_row_{n}) are excluded from the authoritative set to avoid false positives

The resolver then guarantees row completeness using that authoritative set:

  • if a row ID has no resolved output, it is recovered as an empty EnrichedRow (confidence=0.0)
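
The two steps can be sketched together. Everything here is illustrative: EnrichedRow is a minimal stand-in for the real class, the helper names are hypothetical, and {n} in the fallback pattern is assumed to be a decimal integer.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class EnrichedRow:
    """Minimal stand-in for the real EnrichedRow (fields assumed)."""

    row_id: str
    data: Dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0


def is_index_fallback(row_id: str, table_id: str) -> bool:
    """True if row_id has the index-based form {table_id}_row_{n}."""
    return re.fullmatch(rf"{re.escape(table_id)}_row_\d+", row_id) is not None


def authoritative_ids(tables: Dict[str, List[str]]) -> List[str]:
    """Collect row IDs, skipping tables whose IDs are all index-based fallbacks."""
    ids: List[str] = []
    for table_id, row_ids in tables.items():
        if all(is_index_fallback(r, table_id) for r in row_ids):
            continue
        ids.extend(row_ids)
    return ids


def recover_rows(
    resolved: List[EnrichedRow], authoritative: Iterable[str]
) -> List[EnrichedRow]:
    """Append an empty row (confidence=0.0) for each authoritative ID with no output."""
    present = {row.row_id for row in resolved}
    recovered = [EnrichedRow(row_id=rid) for rid in authoritative if rid not in present]
    return resolved + recovered
```

Excluding fallback-only tables matters because their synthetic IDs would otherwise each be "recovered" as an empty row, inflating the output with rows that never existed in a real schedule.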

Merge flow

flowchart TB
    subgraph candidates["Candidate collection"]
        MAIN["main_table_seed candidates"]
        S1["stage specialists (all completed stages)"]
        MAIN --> CAND["FieldCandidate pool by row_id + field"]
        S1 --> CAND
    end

    subgraph resolve["MergeResolver"]
        CAND --> PICK["per-field winner scoring<br/>(rank, validity, confidence, stage)"]
        PICK --> ROWS["list[EnrichedRow] without final notes"]
    end

    subgraph notes["SpecialNotesAdjudicator"]
        CAND --> OBS["note observations by row_id"]
        OBS --> LLM["batched semantic synthesis<br/>(FAST model)"]
        LLM --> APPLY["bullet notes + fallback on failure"]
    end

    ROWS --> APPLY --> FINAL["final EnrichedRow list"]