Framework & Methodology

How the Product Knowledge Graph enriches, scores, and validates product data

6-Stage Enrichment Pipeline

The pipeline expands ~19 raw catalog attributes into ~79 enriched attributes across 12 schema layers. Total pipeline cost: $96.05 for 5,170 products (roughly 1.9 cents per product).

1
Deterministic Extraction (Tier A, Verified)

Direct field mapping from source catalog. Brand, name, department, class — extracted without inference.

~19 raw attributes
2
LLM Explicit Extraction (Tier B, Extracted)

GPT-4o-mini extracts factual attributes from product names. Size, unit, container type, allergens — stated or implied by the name.

+23 attributes
3
LLM Implicit Inference (Tier C, Inferred)

GPT-4o-mini infers contextual attributes. Premium tier, occasions, dayparts, health positioning — not stated but inferrable.

+18 attributes
4
Variant Detection (Tier B, Computed)

Algorithmic grouping by name similarity. 484 variant groups identified, 1,522 products grouped by size or flavour.

+5 attributes
5
Governance & Validation (Tier A, Validated)

Cross-layer validation catches conflicts such as vegan products with dairy and allergen mismatches, and assigns each product a completeness score.

+7 attributes
6
Relationship Engine (Tier D, Scored)

Category-specific weighted scoring. Substitutes, complements, variants, upgrades — each scored with 6 signals.

154K relationships
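
To make stage 4 concrete, here is a minimal sketch of grouping variants by product name. It is an illustration, not the pipeline's actual method: the source says only that grouping is "by name similarity", so this sketch substitutes an exact match on a normalised name with pack-size tokens removed. The `SIZE_TOKEN` pattern, the function names, and the sample product names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pattern for pack-size tokens such as "330ml", "1.5 L", "6 x 330ml", "500g".
SIZE_TOKEN = re.compile(
    r"\b(?:\d+\s*x\s*)?\d+(?:\.\d+)?\s*(?:ml|cl|l|g|kg|pk|pack)\b",
    re.IGNORECASE,
)


def variant_key(name: str) -> str:
    """Normalise a product name: drop size tokens, lowercase, collapse spaces."""
    stripped = SIZE_TOKEN.sub(" ", name)
    return " ".join(stripped.lower().split())


def group_variants(names: list[str]) -> list[list[str]]:
    """Return groups of two or more names that share a normalised key."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for name in names:
        buckets[variant_key(name)].append(name)
    return [group for group in buckets.values() if len(group) > 1]


# Made-up catalog names, for illustration only.
catalog = [
    "Acme Cola 330ml",
    "Acme Cola 1.5 L",
    "Acme Cola 6 x 330ml",
    "Acme Lemonade 330ml",
]
groups = group_variants(catalog)
```

This only captures size variants. Grouping flavour variants would need a fuzzier comparison than an exact key match.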
6 Scoring Signals

Every relationship scored against these dimensions

1

Brand Affinity

Same parent company or brand tier similarity

2

Size/Format Match

Physical compatibility — same size, unit, container

3

Dietary Alignment

Allergen, vegan, gluten-free compatibility

4

Flavour Profile

Flavour group and taste characteristic matching

5

Price Proximity

Price tier and absolute price distance

6

Occasion Overlap

Shared usage occasions and dayparts
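
As an illustration of how two of these signals might be reduced to a number between 0 and 1, the sketch below uses Jaccard overlap for occasions and a capped absolute distance for price. Both formulas are assumptions chosen for simplicity, since the source does not specify how any signal is computed. The function names and the `max_gap` parameter are hypothetical.

```python
def occasion_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two occasion sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def price_proximity(price_a: float, price_b: float, max_gap: float = 5.0) -> float:
    """1.0 for identical prices, falling linearly to 0.0 at max_gap or beyond.

    max_gap is a hypothetical tuning parameter in the catalog's currency.
    """
    gap = abs(price_a - price_b)
    return 1.0 - min(gap / max_gap, 1.0)


# Made-up inputs, for illustration only.
overlap = occasion_overlap({"breakfast", "snack"}, {"snack", "lunch"})
proximity = price_proximity(2.00, 2.50)
```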

Category-Specific Weights

Scoring weights are tuned per category: brand matters more for chocolate, dietary alignment matters more for baby formula.

Baby Formula

Dietary: 0.35

Safety first — age stage matching, allergen strictness

Chocolate

Brand: 0.30

Brand loyalty matters more

Bread

Dietary: 0.30

Dietary restrictions are paramount

Beer & Cider

Flavour: 0.25

Flavour profile drives substitution

Ready Meals

Occasion: 0.25

Occasion and dietary balance

The weights shown are illustrative; the real weights are calibrated per category across all six scoring signals.
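
The following sketch shows one way per-category weights could combine six signal values into a single relationship score. Only the lead weight for each category is taken from the text above. Spreading the remaining weight evenly across the other five signals is an assumption made so the example sums to 1.0, and the signal keys and function names are hypothetical.

```python
SIGNALS = ("brand", "size_format", "dietary", "flavour", "price", "occasion")

# Lead signal and weight per category, as listed above.
LEAD_WEIGHTS = {
    "Baby Formula": ("dietary", 0.35),
    "Chocolate": ("brand", 0.30),
    "Bread": ("dietary", 0.30),
    "Beer & Cider": ("flavour", 0.25),
    "Ready Meals": ("occasion", 0.25),
}


def weights_for(category: str) -> dict[str, float]:
    """Build a full weight vector; the even split of the remainder is assumed."""
    lead, lead_weight = LEAD_WEIGHTS[category]
    rest = (1.0 - lead_weight) / (len(SIGNALS) - 1)
    return {s: (lead_weight if s == lead else rest) for s in SIGNALS}


def relationship_score(category: str, signals: dict[str, float]) -> float:
    """Weighted sum of signal values in [0, 1]; missing signals count as 0."""
    weights = weights_for(category)
    return sum(weights[s] * signals.get(s, 0.0) for s in SIGNALS)


# A pair that matches on brand only, scored under the Chocolate weights.
brand_only = relationship_score("Chocolate", {"brand": 1.0})
```

Because each weight vector sums to 1.0, a pair that scores 1.0 on every signal gets a relationship score of 1.0 in any category.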

Safety & Governance Layer

Deterministic rules that override scoring — safety is never a trade-off

4,686

Baby Safety Blocks

Age-stage mismatches blocked deterministically. Stage 1 formula never substituted with Stage 2+.

576

Allergen Conflicts

Cross-validated allergen mismatches. 46 vegan products with dairy. 565 undeclared allergens caught.

484

Variant Groups

1,522 products grouped by size or flavour. Same product, different pack — identified algorithmically.
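
The override behaviour described in this section can be sketched as rules that run before any score is used. This is a minimal illustration under stated assumptions: the `Product` fields, the `"milk"` allergen label, and the function names are hypothetical, and only two of the rules mentioned above (age-stage mismatch and vegan products with dairy) are modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Product:
    name: str
    category: str
    formula_stage: Optional[int] = None  # hypothetical field for baby formula
    vegan: bool = False
    allergens: set[str] = field(default_factory=set)


def validation_issues(product: Product) -> list[str]:
    """Product-level check: flag a vegan claim that conflicts with a dairy allergen."""
    issues = []
    if product.vegan and "milk" in product.allergens:
        issues.append("vegan product lists a dairy allergen")
    return issues


def block_reason(source: Product, candidate: Product) -> Optional[str]:
    """Pair-level check: block substitution across formula age stages."""
    if (
        source.formula_stage is not None
        and candidate.formula_stage is not None
        and source.formula_stage != candidate.formula_stage
    ):
        return "age-stage mismatch"
    return None


def substitute_score(source: Product, candidate: Product, raw_score: float) -> Optional[float]:
    """Return None when a rule blocks the pair, however high the raw score is."""
    if block_reason(source, candidate) is not None:
        return None
    return raw_score
```

The rule check comes first and returns early, so a blocked pair never has its score weighed against the block.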

Design Principles

Deterministic First

Rules before ML. Tier A attributes are directly mapped. LLM inference is used only where a human would also need to infer.

Confidence-Tracked

Every attribute has a tier. 30% Tier A (deterministic), 21% Tier B (LLM explicit), 49% Tier C (LLM implicit).

Category-Aware

Scoring weights tuned per category. What matters for chocolate is different from what matters for baby formula.
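
One way to picture the Confidence-Tracked principle is an attribute record that carries its tier, from which the share of each tier can be computed. The `Attribute` class, the `tier_share` function, and the sample values below are hypothetical and are not the project's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    value: object
    tier: str  # "A", "B", or "C", matching the tiers described above


def tier_share(attributes: list[Attribute]) -> dict[str, float]:
    """Fraction of attributes in each tier; empty dict for no attributes."""
    counts = Counter(a.tier for a in attributes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: counts[tier] / total for tier in sorted(counts)}


# Made-up attributes for a single product, for illustration only.
attrs = [
    Attribute("brand", "Acme", "A"),
    Attribute("size", "330ml", "B"),
    Attribute("occasion", "party", "C"),
    Attribute("daypart", "evening", "C"),
]
shares = tier_share(attrs)
```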