Merge algorithm
The enrichment pipeline uses a field-level merge architecture:
- MergeResolver in src/pipeline/merge_resolver.py resolves structured fields.
- SpecialNotesAdjudicator in src/pipeline/adjudicator.py synthesizes Special Notes.
Staged enrichment resolves structured fields with MergeResolver.resolve() and then applies SpecialNotesAdjudicator for semantic note synthesis.
Field-level authority matrix
Merge precedence is configuration-driven via src/pipeline/field_authority.yaml.
The matrix supports:
- global: default per-field ranked source list
- template_overrides: per-template ranked source overrides
- field_aliases: canonical mapping for misspelled/variant column names
Example concepts:
- Width can prioritize dimension_card or main_table_seed depending on template.
- Material can prioritize text_rule over table-derived values.
- Special Notes is delegated to the adjudicator instead of direct winner-takes-all.
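As a rough illustration of how such a matrix could be consulted, the sketch below mirrors the three top-level keys in a Python dict and resolves a ranked source list for a template and field. The template name, the source orderings, the alias entries, and the body of the rankings helper are assumptions made for illustration; the actual contents of field_authority.yaml are not reproduced here.

```python
# Hypothetical in-memory mirror of field_authority.yaml. Only the three
# top-level keys come from the docs; every value below is illustrative.
AUTHORITY = {
    "global": {
        "Width": ["main_table_seed", "dimension_card"],
        "Material": ["text_rule", "main_table_seed"],
    },
    "template_overrides": {
        "example_template": {"Width": ["dimension_card", "main_table_seed"]},
    },
    "field_aliases": {"Widht": "Width", "Materail": "Material"},
}


def rankings(template: str, field: str, matrix: dict = AUTHORITY) -> list:
    """Return the ranked source list for a field, best source first."""
    # Map misspelled or variant column names to the canonical field name.
    canonical = matrix["field_aliases"].get(field, field)
    # A per-template override wins over the global default when present.
    override = matrix["template_overrides"].get(template, {})
    if canonical in override:
        return list(override[canonical])
    return list(matrix["global"].get(canonical, []))


print(rankings("example_template", "Widht"))  # alias resolved, then override applied
print(rankings("other_template", "Width"))    # no override, global list used
```

Returning an empty list for unknown fields is one possible design choice; a real resolver might instead raise or fall back to a default ordering.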
Candidate collection
For each row and field, MergeResolver builds FieldCandidate entries from:
- Main schedule seed (main_table_seed): direct values from main tables (with __row_id__) are always added as candidates.
- Specialist outputs: every selected strategy contributes candidates from GeminiEnrichedRowResult.
Each candidate includes:
- strategy
- value
- field_source
- row_conf
- field_claim_conf (optional)
- stage_index
- reasoning
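For orientation, a candidate record with these fields might look like the dataclass below. The field names follow the list above; the types, defaults, and example values are assumptions, and the real FieldCandidate definition may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FieldCandidate:
    """One proposed value for a single (row, field) pair. Types are assumed."""

    strategy: str          # strategy that produced the value, e.g. "main_table_seed"
    value: Any             # the proposed field value
    field_source: str      # provenance label for the value
    row_conf: float        # row-level confidence
    stage_index: int       # enrichment stage that produced the candidate
    reasoning: str = ""    # free-text justification
    field_claim_conf: Optional[float] = None  # optional per-field confidence


seed = FieldCandidate(
    strategy="main_table_seed",
    value="900",
    field_source="MAIN_TABLE",
    row_conf=1.0,
    stage_index=0,
)
print(seed.field_claim_conf)  # stays None when no per-field claim was made
```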
Winner selection
For each non-note field, MergeResolver scores candidates with this precedence:
- authority rank from matrix (rankings(template, field))
- enum validity penalty (valid values preferred)
- higher field_claim_conf
- higher row_conf
- lower stage_index (earlier-stage tie-break)
This gives deterministic per-field selection without global strategy lock-in.
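One way to express this precedence is a tuple sort key, compared element by element. The sketch below is not the resolver's actual code: the Candidate class is a trimmed-down stand-in, matching the authority list against the strategy name is an assumption, and so is treating a missing field_claim_conf as 0.0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:  # hypothetical, trimmed-down stand-in for FieldCandidate
    strategy: str
    value: str
    row_conf: float
    stage_index: int
    field_claim_conf: Optional[float] = None


def pick_winner(candidates, ranked_sources, valid_values=None):
    """Pick one candidate using the documented precedence order."""

    def key(c):
        # 1. Authority rank: position in the ranked list; unranked sources go last.
        if c.strategy in ranked_sources:
            rank = ranked_sources.index(c.strategy)
        else:
            rank = len(ranked_sources)
        # 2. Enum validity penalty: 0 for valid (or non-enum) values, 1 otherwise.
        penalty = 0 if valid_values is None or c.value in valid_values else 1
        # 3-4. Higher confidence should win, so negate; missing claim counts as 0.0.
        claim = c.field_claim_conf if c.field_claim_conf is not None else 0.0
        # 5. Lower stage_index breaks any remaining tie.
        return (rank, penalty, -claim, -c.row_conf, c.stage_index)

    # min() returns the first minimal element, so the result is deterministic.
    return min(candidates, key=key)


pool = [
    Candidate("dimension_card", "900", row_conf=0.8, stage_index=1, field_claim_conf=0.9),
    Candidate("main_table_seed", "910", row_conf=1.0, stage_index=0),
]
winner = pick_winner(pool, ranked_sources=["dimension_card", "main_table_seed"])
print(winner.value)
```

Because the key is evaluated per field, swapping the ranked list for another field or template changes the winner without affecting any other field in the row.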
Computed Source Type
Source Type is configured as a computed field ([computed]) and is not selected through normal candidate ranking.
After non-note winners are selected, the resolver computes Source Type from the winning field_sources:
- any FieldSource.IMAGE_CONTEXT present -> Image
- else any of FieldSource.MAIN_TABLE / FieldSource.AUXILIARY_TABLE / FieldSource.TEXT_CONTEXT present -> Table
- else -> empty string
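The rule above reduces to a short function. In this sketch the FieldSource member names come from the text, while the enum values and the function name compute_source_type are assumptions; the real enum may define additional members.

```python
from enum import Enum


class FieldSource(Enum):  # member names from the docs; values are assumed
    IMAGE_CONTEXT = "image_context"
    MAIN_TABLE = "main_table"
    AUXILIARY_TABLE = "auxiliary_table"
    TEXT_CONTEXT = "text_context"


TABLE_LIKE = {
    FieldSource.MAIN_TABLE,
    FieldSource.AUXILIARY_TABLE,
    FieldSource.TEXT_CONTEXT,
}


def compute_source_type(winning_sources) -> str:
    """Derive Source Type from the field_sources of the winning candidates."""
    sources = set(winning_sources)
    # Any image-derived winner makes the whole row an Image row.
    if FieldSource.IMAGE_CONTEXT in sources:
        return "Image"
    # Otherwise any table or text source marks the row as Table.
    if sources & TABLE_LIKE:
        return "Table"
    return ""


print(compute_source_type([FieldSource.MAIN_TABLE, FieldSource.IMAGE_CONTEXT]))
```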
Enum-aware preference
If an enum field has both valid and invalid candidates, the resolver prefers valid candidates before any fallback to invalid values. Final coercion to Other: ... still happens later in enricher enum validation.
Special Notes adjudication
Special Notes uses semantic adjudication rather than substring deduplication. In benchmark runs, lexical dedupe dropped or conflated valid clauses and reduced note recall. This follows Cartex's guardrail model: open-text fields are governed by soft semantic controls, while hard enforcement is reserved for bounded fields (see Guardrails: hard vs soft).
Flow:
- Resolver collects note observations from main_table_seed and all specialists.
- SpecialNotesAdjudicator groups by row_id.
- Single-observation rows become one bullet immediately.
- Multi-observation rows are batched and sent to fast-model synthesis (GeminiSpecialNotesSynthesisResult).
- On model failure, deterministic fallback emits stable unique bullets from raw observations.
Output format in EnrichedRow.data["Special Notes"] is newline-separated bullet lines.
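The control flow can be sketched as follows, with the model call abstracted behind a callable. The function names, the shape of the synthesize callable, and the "- " bullet prefix are all assumptions; the sketch only shows the grouping, the single-observation shortcut, and the deterministic fallback.

```python
from collections import defaultdict


def fallback_bullets(observations):
    """Deterministic fallback: unique observations in first-seen order."""
    unique = dict.fromkeys(o.strip() for o in observations if o.strip())
    return "\n".join(f"- {text}" for text in unique)


def adjudicate(observations, synthesize):
    """observations: iterable of (row_id, note); synthesize: batch callable."""
    by_row = defaultdict(list)
    for row_id, note in observations:
        by_row[row_id].append(note)

    notes, batch = {}, {}
    for row_id, obs in by_row.items():
        if len(obs) == 1:
            notes[row_id] = f"- {obs[0].strip()}"  # single observation: one bullet
        else:
            batch[row_id] = obs  # defer to model synthesis

    if batch:
        try:
            # Hypothetical contract: {row_id: [notes]} -> {row_id: bullet text}.
            notes.update(synthesize(batch))
        except Exception:
            # Model failure: stable unique bullets built from raw observations.
            notes.update({rid: fallback_bullets(obs) for rid, obs in batch.items()})
    return notes


def failing_model(batch):
    raise RuntimeError("model unavailable")


result = adjudicate(
    [
        ("r1", "Fire rated"),
        ("r2", "Tempered glass"),
        ("r2", "Tempered glass"),
        ("r2", "Field verify"),
    ],
    failing_model,
)
print(result["r2"])
```

Note that the fallback here only removes exact duplicates; semantic merging of near-identical clauses is left to the model path.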
Confidence, sources, and reasoning
For resolved rows:
- confidence: minimum row_conf of winning non-note field candidates (defaults to 1.0 if no candidates lower it)
- field_sources: copied from winning candidate source metadata
- reasoning: deduplicated concatenation of winner reasonings plus adjudicator summary for synthesized notes
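A minimal sketch of this aggregation, assuming winners are passed as a mapping from field name to a (row_conf, field_source, reasoning) tuple. The function name, the tuple layout, and the "; " separator are illustrative choices, not the resolver's actual interface.

```python
def summarize_row(winners, adjudicator_summary=""):
    """winners: {field: (row_conf, field_source, reasoning)} for non-note fields."""
    # Start from 1.0 and let any winning candidate lower it.
    confidence = min([1.0, *(w[0] for w in winners.values())])
    # Source metadata is copied straight from each winning candidate.
    field_sources = {field: w[1] for field, w in winners.items()}
    # Deduplicate reasonings in first-seen order, then append the note summary.
    parts = list(dict.fromkeys(w[2] for w in winners.values() if w[2]))
    if adjudicator_summary:
        parts.append(adjudicator_summary)
    return confidence, field_sources, "; ".join(parts)


winners = {
    "Width": (0.8, "MAIN_TABLE", "seed value kept"),
    "Material": (0.6, "TEXT_CONTEXT", "seed value kept"),
}
print(summarize_row(winners, "notes synthesized from 2 observations"))
```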
Row recovery
Authoritative row IDs are prepared in Enricher before resolver execution:
- tables whose IDs are purely index-based fallbacks ({table_id}_row_{n}) are excluded from the authoritative set to avoid false positives
Resolver then keeps row completeness guarantees using that authoritative set:
- if a row ID has no resolved output, it is recovered as an empty EnrichedRow (confidence=0.0)
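The two steps can be sketched as a pair of helpers. Everything here is hypothetical scaffolding: the helper names, the use of plain dicts in place of EnrichedRow, and the way "purely index-based" is detected (every ID in the table matches {table_id}_row_{n}) are assumptions.

```python
import re


def authoritative_ids(tables):
    """tables: {table_id: [row_id, ...]}; returns the trusted row-ID set."""
    ids = set()
    for table_id, row_ids in tables.items():
        pattern = re.compile(rf"^{re.escape(table_id)}_row_\d+$")
        # Skip tables whose IDs are all index-based fallbacks.
        if row_ids and all(pattern.match(r) for r in row_ids):
            continue
        ids.update(row_ids)
    return ids


def recover_rows(resolved, authoritative):
    """Add an empty row with confidence 0.0 for each unresolved authoritative ID."""
    out = dict(resolved)
    for row_id in sorted(authoritative - set(out)):
        out[row_id] = {"data": {}, "confidence": 0.0}  # stand-in for EnrichedRow
    return out


tables = {"t1": ["t1_row_0", "t1_row_1"], "t2": ["D-101", "D-102"]}
trusted = authoritative_ids(tables)
rows = recover_rows({"D-101": {"data": {"Width": "900"}, "confidence": 0.9}}, trusted)
print(sorted(rows))
```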
Merge flow
flowchart TB
    subgraph candidates["Candidate collection"]
        MAIN["main_table_seed candidates"]
        S1["stage specialists (all completed stages)"]
        MAIN --> CAND["FieldCandidate pool by row_id + field"]
        S1 --> CAND
    end
    subgraph resolve["MergeResolver"]
        CAND --> PICK["per-field winner scoring<br/>(rank, validity, confidence, stage)"]
        PICK --> ROWS["list[EnrichedRow] without final notes"]
    end
    subgraph notes["SpecialNotesAdjudicator"]
        CAND --> OBS["note observations by row_id"]
        OBS --> LLM["batched semantic synthesis<br/>(FAST model)"]
        LLM --> APPLY["bullet notes + fallback on failure"]
    end
    ROWS --> APPLY --> FINAL["final EnrichedRow list"]