MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Context Deduplication

The ContextDeduplicationService removes duplicate and near-duplicate contexts from retrieved results before they are assembled into LLM prompts. It supports several deduplication strategies: exact hash matching, source-ID deduplication, semantic similarity, entity-based deduplication, and a hybrid approach that combines them.


Overview

When the GraphRAG pipeline retrieves contexts from multiple sources, duplicates inevitably appear. Sending duplicate contexts to the LLM wastes tokens and can confuse the model. The deduplication service ensures that only unique, high-quality contexts are included in the final prompt.

Source: data-plane/ai-service/src/context_graph/services/context_deduplication_service.py


Deduplication Strategies

| Strategy | Description | Speed | Accuracy |
|---|---|---|---|
| EXACT | SHA-256 hash of content | Fastest | Catches only identical duplicates |
| SOURCE_ID | Match by source identifier | Fast | Good for structured data |
| SEMANTIC | Cosine similarity of embeddings | Moderate | Catches paraphrased duplicates |
| ENTITY_BASED | Match by linked entity URNs | Fast | Good for entity-centric dedup |
| HYBRID | Combine all strategies | Slowest | Most comprehensive |
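As an illustration of the fastest strategy, exact-hash deduplication can be sketched as a standalone function (a minimal sketch; the service's actual internals are not shown here):

```python
import hashlib

def exact_dedup(contexts, content_field="content"):
    """Keep only the first context for each distinct SHA-256 content digest."""
    seen = set()
    unique = []
    for ctx in contexts:
        digest = hashlib.sha256(ctx[content_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ctx)
    return unique
```

Because it compares digests rather than embeddings, this stage is nearly free, but it only catches byte-identical duplicates.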

Merge Policies

When duplicates are found, the merge policy determines which version to keep:

| Policy | Description |
|---|---|
| KEEP_FIRST | Keep the first encountered context |
| KEEP_LATEST | Keep the most recently created context |
| KEEP_HIGHEST_SCORE | Keep the context with the highest relevance score |
| MERGE_CONTENT | Merge content from all duplicates |
| UNION_METADATA | Keep best content, merge metadata from all |
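Two of these policies can be sketched as plain functions over a duplicate group (hypothetical helpers for illustration, not the service's API):

```python
def keep_highest_score(duplicates, score_field="relevance_score"):
    # KEEP_HIGHEST_SCORE: resolve a duplicate group by relevance score.
    return max(duplicates, key=lambda ctx: ctx.get(score_field, 0.0))

def union_metadata(duplicates, score_field="relevance_score"):
    # UNION_METADATA: keep the best-scoring content, merge metadata from all.
    best = dict(keep_highest_score(duplicates, score_field))
    merged_meta = {}
    for ctx in duplicates:
        merged_meta.update(ctx.get("metadata", {}))
    best["metadata"] = merged_meta
    return best
```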

Configuration

from context_graph.services.context_deduplication_service import (
    DeduplicationConfig,
    DeduplicationStrategy,
    MergePolicy,
)
 
config = DeduplicationConfig(
    strategy=DeduplicationStrategy.HYBRID,
    semantic_threshold=0.92,          # Cosine similarity threshold
    merge_policy=MergePolicy.KEEP_HIGHEST_SCORE,
    max_contexts=100,                 # Maximum contexts to process
    content_field="content",          # Field name for content
    score_field="relevance_score",    # Field name for score
    source_id_field="source_id",      # Field name for source ID
)
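The semantic_threshold controls how aggressively the SEMANTIC stage collapses paraphrases: two contexts are treated as duplicates when the cosine similarity of their embeddings meets or exceeds it. A minimal sketch of that comparison (assuming embeddings as plain float lists; how embeddings are produced is not shown here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_semantic_duplicate(emb_a, emb_b, threshold=0.92):
    # Duplicate when similarity >= threshold (0.92 as configured above).
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Raising the threshold toward 1.0 keeps more near-duplicates; lowering it risks merging contexts that are merely related.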

Deduplication Pipeline

  1. Exact Hash -- Remove contexts with identical content hashes
  2. Source ID -- Remove contexts from the same source
  3. Semantic -- Remove contexts with cosine similarity above the threshold
  4. Entity -- Remove contexts about the same entities
  5. Merge -- Apply the merge policy to combine remaining duplicates
  6. Rank -- Re-rank the deduplicated contexts by relevance score

API Usage

from context_graph.services.context_deduplication_service import ContextDeduplicationService

service = ContextDeduplicationService(config=config)
 
deduplicated = await service.deduplicate(contexts=[
    {"content": "Sales data is updated daily", "relevance_score": 0.9, "source_id": "doc-1"},
    {"content": "Sales data updates every day", "relevance_score": 0.85, "source_id": "doc-2"},
    {"content": "Customer churn is trending up", "relevance_score": 0.7, "source_id": "doc-3"},
])
# Returns 2 contexts: the sales data one (merged) and the churn one

Performance

| Strategy | Latency (100 contexts) | Notes |
|---|---|---|
| EXACT | Under 1ms | Hash computation only |
| SOURCE_ID | Under 1ms | Dictionary lookup |
| SEMANTIC | 50-200ms | Requires embedding generation |
| ENTITY_BASED | 5-10ms | Set intersection |
| HYBRID | 60-220ms | Sum of all strategies |