Framework & Methodology

How the Product Knowledge Graph enriches, scores, and validates product data

6-Stage Enrichment Pipeline

From ~19 raw catalog attributes to ~79 enriched attributes across 12 schema layers. Total pipeline cost: $96.05 for 5,170 products, or roughly $0.019 per product.

Deterministic Extraction (Tier A: Verified)

Direct field mapping from source catalog. Brand, name, department, class — extracted without inference.

~19 raw attributes
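As a rough illustration of what direct field mapping can look like, here is a minimal sketch. The source column names (`BRAND_NM`, `PROD_NM`, `DEPT`, `CLASS`) and the output shape are assumptions made for the example, not the catalog's actual schema.

```python
# Hypothetical mapping from source-catalog columns to graph attribute names.
FIELD_MAP = {
    "BRAND_NM": "brand",
    "PROD_NM": "name",
    "DEPT": "department",
    "CLASS": "class",
}


def extract_tier_a(raw_row):
    """Copy mapped fields verbatim, with no inference.

    Empty or missing source fields are left out rather than guessed.
    """
    out = {}
    for src, dst in FIELD_MAP.items():
        value = raw_row.get(src)
        if value is not None and str(value).strip():
            out[dst] = {"value": str(value).strip(), "tier": "A"}
    return out
```

Unmapped columns are ignored and blank fields are skipped, so every value carrying the Tier A label in this sketch is a straight copy from the source row.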

LLM Explicit Extraction (Tier B: Extracted)

GPT-4o-mini extracts factual attributes from product names. Size, unit, container type, allergens — stated or implied by the name.

+23 attributes

LLM Implicit Inference (Tier C: Inferred)

GPT-4o-mini infers contextual attributes. Premium tier, occasions, dayparts, health positioning — not stated but inferrable.

+18 attributes

Variant Detection (Tier B: Computed)

Algorithmic grouping by name similarity. 484 variant groups identified, 1,522 products grouped by size or flavour.

+5 attributes
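The grouping algorithm itself is not described here, so the following is only one plausible sketch of name-similarity grouping: strip size tokens from each name, then greedily cluster names whose remaining text is nearly identical. The size regex, the 0.9 threshold, and the function names are all assumptions, and the sketch covers size variants only, not flavour variants.

```python
import re
from difflib import SequenceMatcher

# Assumed size pattern: a number followed by a common unit, e.g. "330ml", "2 L".
SIZE_TOKEN = re.compile(r"\b\d+(?:\.\d+)?\s*(?:ml|l|g|kg|pk|pack)\b", re.IGNORECASE)


def base_name(name):
    """Lower-case the name and drop size tokens, collapsing whitespace."""
    stripped = SIZE_TOKEN.sub(" ", name.lower())
    return " ".join(stripped.split())


def group_variants(names, threshold=0.9):
    """Greedy clustering: join the first group whose representative is similar enough."""
    groups = []  # list of (representative base name, [original names])
    for name in names:
        base = base_name(name)
        for rep, members in groups:
            if SequenceMatcher(None, rep, base).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((base, [name]))
    # Only groups with at least two members count as variant groups.
    return [members for _, members in groups if len(members) > 1]
```

For example, `group_variants(["Acme Cola 330ml", "Acme Cola 2L", "Acme Lemonade 330ml"])` puts the two colas in one group and leaves the lemonade ungrouped.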

Governance & Validation (Tier A: Validated)

Cross-layer validation catches conflicts. Vegan products with dairy, allergen mismatches, completeness scoring.

+7 attributes
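To make the kinds of checks concrete, here is a small sketch of cross-layer validation. The field names (`is_vegan`, `allergens`, `declared_allergens`), the dairy term list, and the expected-field list used for completeness are illustrative assumptions rather than the actual schema or rule set.

```python
DAIRY_TERMS = {"milk", "dairy"}  # assumed allergen labels
EXPECTED_FIELDS = ("brand", "name", "department", "size", "unit")  # hypothetical


def validate(product):
    """Return a list of conflict codes found across attribute layers."""
    issues = []
    extracted = {a.lower() for a in product.get("allergens", [])}
    declared = {a.lower() for a in product.get("declared_allergens", [])}
    # A vegan claim conflicts with any dairy allergen, extracted or declared.
    if product.get("is_vegan") and (extracted | declared) & DAIRY_TERMS:
        issues.append("vegan_with_dairy")
    # Allergens found by extraction but missing from the declared list.
    undeclared = sorted(extracted - declared)
    if undeclared:
        issues.append("undeclared_allergens:" + ",".join(undeclared))
    return issues


def completeness(product):
    """Fraction of expected fields that are populated."""
    filled = sum(1 for f in EXPECTED_FIELDS if product.get(f) not in (None, "", []))
    return filled / len(EXPECTED_FIELDS)
```

Both functions are plain rules with no model in the loop, so the same product always yields the same result.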

Relationship Engine (Tier D: Scored)

Category-specific weighted scoring. Substitutes, complements, variants, upgrades — each scored with 6 signals.

154K relationships

6 Scoring Signals

Every relationship scored against these dimensions

Brand Affinity

Same parent company or brand tier similarity

Size/Format Match

Physical compatibility — same size, unit, container

Dietary Alignment

Allergen, vegan, gluten-free compatibility

Flavour Profile

Flavour group and taste characteristic matching

Price Proximity

Price tier and absolute price distance

Occasion Overlap

Shared usage occasions and dayparts

Category-Specific Weights

Scoring weights tuned per category — brand matters more for chocolate, dietary matters more for baby food

Baby Formula

Dietary: 0.35

Safety first — age stage matching, allergen strictness

Chocolate

Brand: 0.30

Brand loyalty matters more

Bread

Dietary: 0.30

Dietary restrictions are paramount

Beer & Cider

Flavour: 0.25

Flavour profile drives substitution

Ready Meals

Occasion: 0.25

Occasion and dietary balance

The weights shown are illustrative. Production weights are calibrated per category across all six scoring signals.
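A category-weighted score can be sketched as a weighted sum of the six signal scores. Only the headline weight per category is given above; spreading the remaining weight evenly across the other five signals is an assumption made purely so the example runs, as are the signal keys and function names.

```python
SIGNALS = ("brand", "size_format", "dietary", "flavour", "price", "occasion")

# Headline weights quoted above; all other weights are NOT given in the text.
HEADLINE = {
    "baby_formula": ("dietary", 0.35),
    "chocolate": ("brand", 0.30),
    "bread": ("dietary", 0.30),
    "beer_cider": ("flavour", 0.25),
    "ready_meals": ("occasion", 0.25),
}


def weights_for(category):
    """Headline weight for one signal; remainder split evenly (assumption)."""
    signal, w = HEADLINE[category]
    rest = (1.0 - w) / (len(SIGNALS) - 1)
    return {s: (w if s == signal else rest) for s in SIGNALS}


def relationship_score(category, signal_scores):
    """Weighted sum of per-signal scores, each expected in [0, 1]."""
    weights = weights_for(category)
    return sum(weights[s] * signal_scores.get(s, 0.0) for s in SIGNALS)
```

With these placeholder weights, a chocolate pair that matches only on brand scores 0.30, while the same single-signal match on price scores 0.14.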

Safety & Governance Layer

Deterministic rules that override scoring — safety is never a trade-off

4,686

Baby Safety Blocks

Age-stage mismatches blocked deterministically. Stage 1 formula never substituted with Stage 2+.
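One way to express a rule that overrides scoring is to apply it before the score is ever used. The sketch below assumes a hypothetical `age_stage` attribute and treats a missing stage on only one side as a mismatch, which is a conservative choice made for the example rather than a documented rule.

```python
def is_blocked(source, candidate):
    """Deterministic age-stage rule: stages must match exactly."""
    s = source.get("age_stage")
    c = candidate.get("age_stage")
    if s is None and c is None:
        return False  # rule does not apply to unstaged products
    return s != c


def final_score(source, candidate, score):
    """Return None for blocked pairs, regardless of how high the score is."""
    if is_blocked(source, candidate):
        return None
    return score
```

Because the block is checked first, a Stage 1 to Stage 2 pair returns `None` even with a score of 0.99, so no weighting can trade the rule away.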

576

Allergen Conflicts

Cross-validated allergen mismatches. 46 vegan products with dairy. 565 undeclared allergens caught.

484

Variant Groups

1,522 products grouped by size or flavour. Same product, different pack — identified algorithmically.

Design Principles

Deterministic First

Rules before ML. Tier A attributes are directly mapped. LLM inference only where humans would also need to infer.

Confidence-Tracked

Every attribute has a tier. 30% Tier A (deterministic), 21% Tier B (LLM explicit), 49% Tier C (LLM implicit).
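Tier tracking can be pictured as a tier label stored next to every attribute value, from which a breakdown like the one above is computed. The `Attribute` class and `tier_share` function below are hypothetical names for a minimal sketch, not the actual data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    value: object
    tier: str  # confidence tier label, e.g. "A", "B" or "C"


def tier_share(attributes):
    """Fraction of attributes in each tier, keyed by tier label."""
    counts = Counter(a.tier for a in attributes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in sorted(counts.items())}
```

Keeping the tier on the attribute itself means downstream consumers can filter by confidence, for instance by using only Tier A values in safety rules.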

Category-Aware

Scoring weights tuned per category. What matters for chocolate is different from what matters for baby formula.