Framework & Methodology
How the Product Knowledge Graph enriches, scores, and validates product data
From ~19 raw catalog attributes to ~79 enriched attributes across 12 schema layers. Total pipeline cost: $96.05 for 5,170 products, or roughly $0.019 per product.
Direct field mapping from source catalog. Brand, name, department, class — extracted without inference.
GPT-4o-mini extracts factual attributes from product names. Size, unit, container type, allergens — stated or implied by the name.
GPT-4o-mini infers contextual attributes. Premium tier, occasions, dayparts, health positioning — not stated but inferrable.
Algorithmic grouping by name similarity. 484 variant groups identified, 1,522 products grouped by size or flavour.
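A minimal sketch of how size variants might be grouped, assuming product names differ only by a size token. It uses exact matching on a normalised base name as a simple stand-in for name similarity; flavour variants would need a fuzzier comparison. The `SIZE_TOKEN` pattern, the function names, and the sample catalog are all hypothetical and not taken from the pipeline.

```python
import re
from collections import defaultdict

# Assumed size tokens: "330ml", "1.5L", "500 g", "6 x 330ml", "4 pack".
SIZE_TOKEN = re.compile(
    r"\b(?:\d+\s*x\s*)?\d+(?:\.\d+)?\s*(?:ml|cl|l|kg|g|pack|pk)\b",
    re.IGNORECASE,
)


def base_name(name: str) -> str:
    """Drop size tokens, then lowercase and collapse whitespace."""
    without_size = SIZE_TOKEN.sub(" ", name)
    return " ".join(without_size.lower().split())


def group_variants(names: list[str]) -> dict[str, list[str]]:
    """Group names sharing a base name; singletons are not variant groups."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[base_name(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical catalog rows, for illustration only.
catalog = [
    "Acme Cola 330ml",
    "Acme Cola 1.5L",
    "Acme Cola 6 x 330ml",
    "Acme Lemonade 330ml",
]
variant_groups = group_variants(catalog)
```

Here the three cola packs fall into one group keyed by their shared base name, while the single lemonade row is left ungrouped.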
Cross-layer validation catches conflicts. Vegan products with dairy, allergen mismatches, completeness scoring.
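The validation checks described above could look roughly like this. It is a sketch under stated assumptions: the `ProductRecord` fields, the `ANIMAL_DERIVED` allergen set, the conflict labels, and the list of required fields are illustrative choices, not the pipeline's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: allergen labels that contradict a vegan claim.
ANIMAL_DERIVED = {"milk", "egg", "fish", "crustaceans", "molluscs"}


@dataclass
class ProductRecord:
    name: str
    is_vegan: Optional[bool] = None
    declared_allergens: set = field(default_factory=set)   # from the source catalog
    extracted_allergens: set = field(default_factory=set)  # from LLM extraction
    brand: Optional[str] = None
    size: Optional[str] = None


def find_conflicts(p: ProductRecord) -> list:
    """Return conflict labels found by comparing attributes across layers."""
    conflicts = []
    animal = (p.declared_allergens | p.extracted_allergens) & ANIMAL_DERIVED
    if p.is_vegan and animal:
        conflicts.append("vegan_with_animal_allergen:" + ",".join(sorted(animal)))
    undeclared = p.extracted_allergens - p.declared_allergens
    if undeclared:
        conflicts.append("undeclared_allergen:" + ",".join(sorted(undeclared)))
    return conflicts


def completeness(p: ProductRecord, required=("brand", "size", "is_vegan")) -> float:
    """Share of required fields that are populated (hypothetical field list)."""
    filled = sum(getattr(p, name) is not None for name in required)
    return filled / len(required)


# Hypothetical record: marked vegan, but extraction found milk.
record = ProductRecord(
    name="Acme Oat Drink Chocolate",
    is_vegan=True,
    declared_allergens={"oats"},
    extracted_allergens={"oats", "milk"},
    brand="Acme",
)
issues = find_conflicts(record)
score = completeness(record)
```

The example record raises both a vegan conflict and an undeclared-allergen flag, and scores two of three required fields as complete.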
Category-specific weighted scoring. Substitutes, complements, variants, and upgrades are each scored on six signals.
Every relationship scored against these dimensions
Brand Affinity
Same parent company or brand tier similarity
Size/Format Match
Physical compatibility — same size, unit, container
Dietary Alignment
Allergen, vegan, gluten-free compatibility
Flavour Profile
Flavour group and taste characteristic matching
Price Proximity
Price tier and absolute price distance
Occasion Overlap
Shared usage occasions and dayparts
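To make two of the dimensions above concrete, here is one plausible way to turn them into signals between 0 and 1. Both formulas are assumptions chosen for illustration: Jaccard overlap for occasions, and a relative price distance in place of the price-tier logic, which the source does not spell out.

```python
def occasion_overlap(a: set, b: set) -> float:
    """Jaccard overlap of two occasion/daypart sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def price_proximity(price_a: float, price_b: float) -> float:
    """1.0 for identical prices, falling towards 0.0 as prices diverge."""
    if price_a <= 0 or price_b <= 0:
        return 0.0
    return 1.0 - abs(price_a - price_b) / max(price_a, price_b)


# Hypothetical inputs for a pair of products.
overlap = occasion_overlap({"breakfast", "snack"}, {"snack", "lunch"})
proximity = price_proximity(2.00, 2.50)
```

Keeping every signal on the same 0 to 1 scale is what lets a per-category weight vector combine them directly.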
Scoring weights tuned per category — brand matters more for chocolate, dietary matters more for baby food
Baby Formula
Dietary: 0.35. Safety first: age-stage matching, allergen strictness
Chocolate
Brand: 0.30. Brand loyalty matters more
Bread
Dietary: 0.30. Dietary restrictions are paramount
Beer & Cider
Flavour: 0.25. Flavour profile drives substitution
Ready Meals
Occasion: 0.25. Occasion and dietary balance
Weights are illustrative; real weights are calibrated per category across the six scoring signals.
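The category-weighted score can be sketched as a weighted sum over the six signals. Only the Brand weight for chocolate (0.30) and the Dietary weight for baby formula (0.35) come from the figures above; every other weight, the signal keys, and the function name are assumptions filled in so the example runs.

```python
SIGNALS = ("brand", "size", "dietary", "flavour", "price", "occasion")

# Only brand=0.30 (chocolate) and dietary=0.35 (baby formula) come from the
# text above; the remaining weights are placeholders that sum to 1.0.
CATEGORY_WEIGHTS = {
    "chocolate": {
        "brand": 0.30, "size": 0.10, "dietary": 0.15,
        "flavour": 0.20, "price": 0.15, "occasion": 0.10,
    },
    "baby_formula": {
        "brand": 0.20, "size": 0.15, "dietary": 0.35,
        "flavour": 0.05, "price": 0.15, "occasion": 0.10,
    },
}


def relationship_score(category: str, signals: dict) -> float:
    """Weighted average of signal values (each in 0..1) for one category."""
    weights = CATEGORY_WEIGHTS[category]
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * signals.get(name, 0.0) for name in SIGNALS)
    return weighted / total_weight


# Hypothetical signal values for one candidate pair.
pair_signals = {
    "brand": 1.0, "size": 0.5, "dietary": 1.0,
    "flavour": 0.8, "price": 0.6, "occasion": 0.5,
}
chocolate_score = relationship_score("chocolate", pair_signals)
formula_score = relationship_score("baby_formula", pair_signals)
```

The same pair of products scores differently per category because the weight vector changes, not the signals themselves.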
Deterministic rules that override scoring — safety is never a trade-off
4,686
Baby Safety Blocks
Age-stage mismatches blocked deterministically. Stage 1 formula never substituted with Stage 2+.
576
Allergen Conflicts
Cross-validated allergen mismatches. 46 vegan products with dairy. 565 undeclared allergens caught.
484
Variant Groups
1,522 products grouped by size or flavour. Same product, different pack — identified algorithmically.
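The baby safety rule above is a hard filter that runs before ranking, so a high score can never outvote it. A minimal sketch, assuming each product carries an optional integer `baby_stage`; the `Item` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    sku: str
    baby_stage: Optional[int] = None  # None for products with no age stage


def is_blocked(source: Item, candidate: Item) -> bool:
    """Block any substitution where the age stages differ."""
    if source.baby_stage is None and candidate.baby_stage is None:
        return False
    return source.baby_stage != candidate.baby_stage


def allowed_substitutes(source: Item, scored: list) -> list:
    """Apply the deterministic block first, then rank survivors by score."""
    kept = [(cand, score) for cand, score in scored if not is_blocked(source, cand)]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates: the stage 2 item scores highest but is blocked.
stage1 = Item("formula-s1-800g", baby_stage=1)
candidates = [
    (Item("formula-s2-800g", baby_stage=2), 0.95),
    (Item("formula-s1-400g", baby_stage=1), 0.70),
]
ranked = allowed_substitutes(stage1, candidates)
```

Filtering before sorting keeps the rule independent of the scoring weights, which matches the idea that safety is never traded against similarity.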
Deterministic First
Rules before ML. Tier A attributes are directly mapped. LLM inference only where humans would also need to infer.
Confidence-Tracked
Every attribute has a tier. 30% Tier A (deterministic), 21% Tier B (LLM explicit), 49% Tier C (LLM implicit).
Category-Aware
Scoring weights tuned per category. What matters for chocolate is different from what matters for baby formula.
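Confidence tracking, as described above, amounts to storing a tier next to every attribute value. The sketch below shows one possible shape for that record and how tier shares could be computed; the class names and sample attributes are assumptions, and the sample is not meant to reproduce the 30/21/49 split.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    A = "deterministic"
    B = "llm_explicit"
    C = "llm_implicit"


@dataclass
class Attribute:
    name: str
    value: object
    tier: Tier


def tier_shares(attributes: list) -> dict:
    """Fraction of attributes in each tier; empty dict for no attributes."""
    if not attributes:
        return {}
    counts = Counter(attr.tier for attr in attributes)
    return {tier: counts[tier] / len(attributes) for tier in Tier}


# Hypothetical attributes for a single product.
attrs = [
    Attribute("brand", "Acme", Tier.A),
    Attribute("size", "330ml", Tier.B),
    Attribute("premium_tier", "mainstream", Tier.C),
    Attribute("occasions", ["snack"], Tier.C),
]
shares = tier_shares(attrs)
```

Carrying the tier on each attribute lets downstream consumers decide for themselves whether to trust inferred values.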