Context Deduplication
The ContextDeduplicationService removes duplicate and near-duplicate contexts from retrieved results before assembling them into LLM prompts. It supports multiple deduplication strategies including exact hash matching, source ID deduplication, semantic similarity, entity-based deduplication, and hybrid approaches.
Overview
When the GraphRAG pipeline retrieves contexts from multiple sources, duplicates inevitably appear. Sending duplicate contexts to the LLM wastes tokens and can confuse the model. The deduplication service ensures that only unique, high-quality contexts are included in the final prompt.
Source: data-plane/ai-service/src/context_graph/services/context_deduplication_service.py
Deduplication Strategies
| Strategy | Description | Speed | Accuracy |
|---|---|---|---|
| EXACT | SHA-256 hash of content | Fastest | Catches only identical duplicates |
| SOURCE_ID | Match by source identifier | Fast | Good for structured data |
| SEMANTIC | Cosine similarity of embeddings | Moderate | Catches paraphrased duplicates |
| ENTITY_BASED | Match by linked entity URNs | Fast | Good for entity-centric dedup |
| HYBRID | Combine all strategies | Slowest | Most comprehensive |
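The EXACT strategy is the simplest to illustrate. A minimal sketch, assuming contexts are plain dictionaries (the function name `dedupe_exact` is illustrative, not the service's API):

```python
import hashlib

def dedupe_exact(contexts, content_field="content"):
    """Keep only the first context for each distinct SHA-256 content digest."""
    seen = set()
    unique = []
    for ctx in contexts:
        digest = hashlib.sha256(ctx[content_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ctx)
    return unique
```

Hashing makes the comparison O(1) per context, which is why this pass is the fastest, but any wording change produces a different digest, so it catches only byte-identical duplicates.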
Merge Policies
When duplicates are found, the merge policy determines which version to keep:
| Policy | Description |
|---|---|
| KEEP_FIRST | Keep the first encountered context |
| KEEP_LATEST | Keep the most recently created context |
| KEEP_HIGHEST_SCORE | Keep the context with the highest relevance score |
| MERGE_CONTENT | Merge content from all duplicates |
| UNION_METADATA | Keep best content, merge metadata from all |
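Two of these policies can be sketched in a few lines. This is illustrative only: the function names and the `metadata` field are assumptions, not the service's actual API.

```python
def merge_keep_highest_score(duplicates, score_field="relevance_score"):
    """KEEP_HIGHEST_SCORE: retain the duplicate with the best relevance score."""
    return max(duplicates, key=lambda ctx: ctx.get(score_field, 0.0))

def merge_union_metadata(duplicates, score_field="relevance_score"):
    """UNION_METADATA: keep the best-scoring content, merge metadata from all."""
    best = max(duplicates, key=lambda ctx: ctx.get(score_field, 0.0))
    merged_meta = {}
    for ctx in duplicates:
        merged_meta.update(ctx.get("metadata", {}))  # later duplicates win ties
    return {**best, "metadata": merged_meta}
```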
Configuration
```python
from context_graph.services.context_deduplication_service import (
    DeduplicationConfig,
    DeduplicationStrategy,
    MergePolicy,
)

config = DeduplicationConfig(
    strategy=DeduplicationStrategy.HYBRID,
    semantic_threshold=0.92,        # Cosine similarity threshold
    merge_policy=MergePolicy.KEEP_HIGHEST_SCORE,
    max_contexts=100,               # Maximum contexts to process
    content_field="content",        # Field name for content
    score_field="relevance_score",  # Field name for score
    source_id_field="source_id",    # Field name for source ID
)
```

Deduplication Pipeline
- Exact Hash -- Remove contexts with identical content hashes
- Source ID -- Remove contexts from the same source
- Semantic -- Remove contexts with cosine similarity above the threshold
- Entity -- Remove contexts about the same entities
- Merge -- Apply the merge policy to combine remaining duplicates
- Rank -- Re-rank the deduplicated contexts by relevance score
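Of the stages above, the semantic pass is the least obvious. A minimal greedy sketch, assuming an embedding vector is supplied alongside each context (the function names are illustrative, not the service's API):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_semantic(contexts, embeddings, threshold=0.92):
    """Keep a context only if it is below `threshold` cosine similarity
    to every context kept so far (greedy, order-dependent)."""
    kept, kept_vecs = [], []
    for ctx, vec in zip(contexts, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(ctx)
            kept_vecs.append(vec)
    return kept
```

Because the pass compares each candidate against every kept context, it is quadratic in the worst case, which is consistent with the higher latency shown for SEMANTIC below.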
API Usage
```python
from context_graph.services.context_deduplication_service import (
    ContextDeduplicationService,
)

service = ContextDeduplicationService(config=config)

deduplicated = await service.deduplicate(contexts=[
    {"content": "Sales data is updated daily", "relevance_score": 0.9, "source_id": "doc-1"},
    {"content": "Sales data updates every day", "relevance_score": 0.85, "source_id": "doc-2"},
    {"content": "Customer churn is trending up", "relevance_score": 0.7, "source_id": "doc-3"},
])
# Returns 2 contexts: the sales data one (merged) and the churn one
```

Performance
| Strategy | Latency (100 contexts) | Notes |
|---|---|---|
| EXACT | Under 1 ms | Hash computation only |
| SOURCE_ID | Under 1 ms | Dictionary lookup |
| SEMANTIC | 50-200 ms | Requires embedding generation |
| ENTITY_BASED | 5-10 ms | Set intersection |
| HYBRID | 60-220 ms | Sum of all strategies |