Context Deduplication
The ContextDeduplicationService removes duplicate and near-duplicate contexts from retrieved results before assembling them into LLM prompts. It supports multiple deduplication strategies including exact hash matching, source ID deduplication, semantic similarity, entity-based deduplication, and hybrid approaches.
Overview
When the GraphRAG pipeline retrieves contexts from multiple sources, duplicates inevitably appear. Sending duplicate contexts to the LLM wastes tokens and can confuse the model. The deduplication service ensures that only unique, high-quality contexts are included in the final prompt.
Source: data-plane/ai-service/src/context_graph/services/context_deduplication_service.py
Deduplication Strategies
| Strategy | Description | Speed | Accuracy |
|---|---|---|---|
| EXACT | SHA-256 hash of content | Fastest | Catches only identical duplicates |
| SOURCE_ID | Match by source identifier | Fast | Good for structured data |
| SEMANTIC | Cosine similarity of embeddings | Moderate | Catches paraphrased duplicates |
| ENTITY_BASED | Match by linked entity URNs | Fast | Good for entity-centric dedup |
| HYBRID | Combine all strategies | Slowest | Most comprehensive |
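The EXACT strategy is the simplest to illustrate. A minimal sketch, assuming contexts are plain dictionaries (the function name `dedupe_exact` is illustrative, not the service's API):

```python
import hashlib

def dedupe_exact(contexts, content_field="content"):
    """Keep only the first context for each distinct SHA-256 content digest."""
    seen = set()
    unique = []
    for ctx in contexts:
        digest = hashlib.sha256(ctx[content_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ctx)
    return unique
```

Hashing makes the comparison O(1) per context, which is why this pass is the fastest, but any wording change produces a different digest, so it catches only byte-identical duplicates.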
Merge Policies
When duplicates are found, the merge policy determines which version to keep:
| Policy | Description |
|---|---|
| KEEP_FIRST | Keep the first encountered context |
| KEEP_LATEST | Keep the most recently created context |
| KEEP_HIGHEST_SCORE | Keep the context with the highest relevance score |
| MERGE_CONTENT | Merge content from all duplicates |
| UNION_METADATA | Keep best content, merge metadata from all |
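Two of these policies can be sketched in a few lines. This is illustrative only: the function names and the `metadata` field are assumptions, not the service's actual API.

```python
def merge_keep_highest_score(duplicates, score_field="relevance_score"):
    """KEEP_HIGHEST_SCORE: retain the duplicate with the best relevance score."""
    return max(duplicates, key=lambda ctx: ctx.get(score_field, 0.0))

def merge_union_metadata(duplicates, score_field="relevance_score"):
    """UNION_METADATA: keep the best-scoring content, merge metadata from all."""
    best = max(duplicates, key=lambda ctx: ctx.get(score_field, 0.0))
    merged_meta = {}
    for ctx in duplicates:
        merged_meta.update(ctx.get("metadata", {}))  # later duplicates win ties
    return {**best, "metadata": merged_meta}
```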
Configuration
```python
from context_graph.services.context_deduplication_service import (
    DeduplicationConfig,
    DeduplicationStrategy,
    MergePolicy,
)

config = DeduplicationConfig(
    strategy=DeduplicationStrategy.HYBRID,
    semantic_threshold=0.92,        # Cosine similarity threshold
    merge_policy=MergePolicy.KEEP_HIGHEST_SCORE,
    max_contexts=100,               # Maximum contexts to process
    content_field="content",        # Field name for content
    score_field="relevance_score",  # Field name for score
    source_id_field="source_id",    # Field name for source ID
)
```

Deduplication Pipeline
- Exact Hash -- Remove contexts with identical content hashes
- Source ID -- Remove contexts from the same source
- Semantic -- Remove contexts with cosine similarity above the threshold
- Entity -- Remove contexts about the same entities
- Merge -- Apply the merge policy to combine remaining duplicates
- Rank -- Re-rank the deduplicated contexts by relevance score
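Of the stages above, the semantic pass is the least obvious. A minimal greedy sketch, assuming an embedding vector is supplied alongside each context (the function names are illustrative, not the service's API):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_semantic(contexts, embeddings, threshold=0.92):
    """Keep a context only if it is below `threshold` cosine similarity
    to every context kept so far (greedy, order-dependent)."""
    kept, kept_vecs = [], []
    for ctx, vec in zip(contexts, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(ctx)
            kept_vecs.append(vec)
    return kept
```

Because the pass compares each candidate against every kept context, it is quadratic in the worst case, which is consistent with the higher latency shown for SEMANTIC below.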
API Usage
```python
from context_graph.services.context_deduplication_service import (
    ContextDeduplicationService,
)

service = ContextDeduplicationService(config=config)

deduplicated = await service.deduplicate(contexts=[
    {"content": "Sales data is updated daily", "relevance_score": 0.9, "source_id": "doc-1"},
    {"content": "Sales data updates every day", "relevance_score": 0.85, "source_id": "doc-2"},
    {"content": "Customer churn is trending up", "relevance_score": 0.7, "source_id": "doc-3"},
])
# Returns 2 contexts: the sales data one (merged) and the churn one
```

Performance
| Strategy | Latency (100 contexts) | Notes |
|---|---|---|
| EXACT | Under 1 ms | Hash computation only |
| SOURCE_ID | Under 1 ms | Dictionary lookup |
| SEMANTIC | 50-200 ms | Requires embedding generation |
| ENTITY_BASED | 5-10 ms | Set intersection |
| HYBRID | 60-220 ms | Sum of all strategies |