Sensitive Data Detection
Sensitive Data Detection in the Data Catalog automatically identifies and classifies data that may contain personally identifiable information (PII), financial data, health records, or other sensitive content. Detection results feed into governance policies for automatic enforcement of access controls, masking, and audit rules.
Classification Levels
The governance system supports the following classification levels, each of which can be assigned to data entities.
| Classification | Description | Typical Content |
|---|---|---|
| PUBLIC | No restrictions on access | Marketing data, public metrics |
| INTERNAL | Restricted to organization members | Internal reports, team metrics |
| CONFIDENTIAL | Limited to authorized personnel | Business strategies, contracts |
| SENSITIVE | Requires explicit authorization | Financial records, HR data |
| PII | Personally identifiable information | Names, emails, SSNs, phone numbers |
| PHI | Protected health information | Medical records, diagnoses |
| PCI | Payment card industry data | Credit card numbers, CVVs |
Detection Rule Types
Governance policies of type CLASSIFICATION define the rules used for automatic detection and classification.
| Rule Type | Description |
|---|---|
| AUTO_CLASSIFY | Automatically classify data based on content patterns |
| REQUIRES_CLASSIFICATION | Enforce that data must be classified before access |
| CLASSIFICATION_INHERITANCE | Propagate classification from parent to child entities |
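
To make the idea behind CLASSIFICATION_INHERITANCE concrete, here is a minimal Python sketch of propagating a classification from a parent entity to its children. The `Entity` class and `propagate` function are hypothetical and are not part of the Data Catalog API. The sketch also assumes that a child's explicit classification takes precedence over the inherited one, which the actual rule may or may not do.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """Hypothetical catalog entity; not the Data Catalog's actual model."""
    name: str
    classification: Optional[str] = None
    children: List["Entity"] = field(default_factory=list)


def propagate(entity: Entity, inherited: Optional[str] = None) -> None:
    """Give unclassified entities the level of their nearest classified ancestor."""
    if entity.classification is None:
        entity.classification = inherited
    # Assumption: an explicitly classified child keeps its own level.
    for child in entity.children:
        propagate(child, entity.classification)
```

For example, calling `propagate` on a table classified as PII would mark its unclassified columns as PII while leaving a column already marked INTERNAL unchanged.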
Pattern-Based Detection
The PATTERN_MATCH rule type identifies sensitive data through regular expression patterns.
| Pattern Name | Detects | Example Match |
|---|---|---|
| Email | Email addresses | user@example.com |
| SSN | Social Security Numbers | 123-45-6789 |
| Credit Card | Payment card numbers | 4111-1111-1111-1111 |
| Phone | Phone numbers | +1 (555) 123-4567 |
| IP Address | IP addresses | 192.168.1.1 |
| Date of Birth | Birth dates | 1990-01-15 |
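
As an illustration of how pattern-based detection with a match threshold could work, the Python sketch below tests a column's values against a few regular expressions and reports the patterns whose match percentage meets a minimum. The regexes, the `PATTERNS` dictionary, and both function names are assumptions made for this example; the catalog's built-in patterns are not documented here. Excluding null values from the denominator is also an assumption.

```python
import re

# Illustrative regexes only; not the catalog's built-in patterns.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "Credit Card": re.compile(r"\d{4}(?:[ -]?\d{4}){3}"),
    "IP Address": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
}


def pattern_match_percent(values, pattern):
    """Percentage of non-null values that fully match the pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if pattern.fullmatch(str(v)))
    return 100.0 * hits / len(non_null)


def detect(values, min_match_percent=80.0):
    """Return the names of patterns whose match percentage meets the threshold."""
    return [
        name
        for name, pattern in PATTERNS.items()
        if pattern_match_percent(values, pattern) >= min_match_percent
    ]
```

A threshold below 100 percent tolerates a few malformed or placeholder values (such as `n/a`) in an otherwise sensitive column.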
Example Detection Policy

    {
      "name": "PII Auto-Detection",
      "policyType": "CLASSIFICATION",
      "scopeType": "GLOBAL",
      "enforcementMode": "MONITOR",
      "rules": [
        {
          "name": "Require Classification",
          "ruleType": "REQUIRES_CLASSIFICATION",
          "parameters": {},
          "enabled": true,
          "order": 1
        },
        {
          "name": "Email Pattern",
          "ruleType": "PATTERN_MATCH",
          "parameters": {
            "column": "email",
            "pattern": "^[\\w.+-]+@[\\w-]+\\.[\\w.]+$",
            "minMatchPercent": 80.0
          },
          "enabled": true,
          "order": 2
        }
      ],
      "enforcementActions": [
        {
          "actionType": "LOG",
          "parameters": {
            "logLevel": "WARN"
          },
          "order": 1
        },
        {
          "actionType": "NOTIFY",
          "parameters": {
            "recipients": ["data-stewards"],
            "message": "Unclassified sensitive data detected"
          },
          "order": 2
        }
      ]
    }

Data Quality Integration
Sensitive data detection integrates with data quality metrics provided through the evaluation context.
| Metric | Description |
|---|---|
| column.null_percent | Percentage of null values in the column |
| column.uniqueness_percent | Percentage of unique values |
| column.pattern_match_percent | Percentage of values matching a detection pattern |
| column.min | Minimum value for numeric columns |
| column.max | Maximum value for numeric columns |
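
The following Python sketch shows one plausible way to compute several of these metrics for a single column. The function name `column_metrics` is hypothetical, and the exact definitions are assumptions: here uniqueness is measured over non-null values only, and `column.min` / `column.max` ignore non-numeric entries. The evaluation context may define them differently.

```python
def column_metrics(values):
    """Compute a few column-level metrics over a list of values (None means null)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "column.null_percent": 100.0 * (total - len(non_null)) / total if total else 0.0,
        "column.uniqueness_percent": (
            100.0 * len(set(non_null)) / len(non_null) if non_null else 0.0
        ),
        "column.min": min(numeric) if numeric else None,
        "column.max": max(numeric) if numeric else None,
    }
```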
Freshness Monitoring
The FRESHNESS rule type monitors data age to ensure sensitive data is current and valid.
| Parameter | Description |
|---|---|
| maxAgeMinutes | Maximum allowed age of the data in minutes |
When data exceeds the configured freshness threshold, the policy evaluator flags a violation. This is particularly important for sensitive data that must be kept up to date for compliance reasons.
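
A freshness check of this kind can be sketched in a few lines of Python. The function `is_stale` and its `last_updated` argument are hypothetical; how the policy evaluator actually obtains the data's timestamp is not described here. Only `maxAgeMinutes` comes from the parameter table above.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age_minutes, now=None):
    """Return True when the data is older than max_age_minutes."""
    # `now` is injectable so the check can be tested deterministically.
    now = now or datetime.now(timezone.utc)
    return now - last_updated > timedelta(minutes=max_age_minutes)
```

Passing timezone-aware timestamps for both `last_updated` and `now` avoids errors from mixing naive and aware datetimes.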
Enforcement on Detection
When sensitive data is detected, the following actions can be triggered automatically.
| Action | Description |
|---|---|
| MASK | Apply automatic masking to detected sensitive columns |
| QUARANTINE | Quarantine the data for review before access |
| ALERT | Alert data stewards about the detection |
| BLOCK | Block access until classification is assigned |
| WORKFLOW | Trigger a classification review workflow |
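
As a rough illustration of how triggered actions might be dispatched, the Python sketch below runs a list of action definitions in ascending `order` and calls a handler for each action type. The `run_actions` function and the `handlers` mapping are hypothetical. The `actionType`, `parameters`, and `order` field names follow the example policy, but the real evaluator's dispatch logic is not documented here.

```python
def run_actions(actions, handlers, column):
    """Run enforcement actions in ascending `order`; return the action types executed."""
    executed = []
    for action in sorted(actions, key=lambda a: a.get("order", 0)):
        handler = handlers.get(action["actionType"])
        if handler is None:
            continue  # assumption: action types without a handler are skipped
        handler(column, action.get("parameters", {}))
        executed.append(action["actionType"])
    return executed
```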
Best Practices
- Run detection scans regularly on newly ingested data
- Combine `REQUIRES_CLASSIFICATION` with `AUTO_CLASSIFY` for comprehensive coverage
- Use `MONITOR` mode during initial deployment to tune detection patterns
- Review detection results before switching to `HARD_ENFORCE` mode
- Maintain a list of known sensitive column name patterns (e.g., `ssn`, `email`, `phone`)
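
A list of sensitive column name patterns can be kept as plain data and checked with a single compiled regex, as in this Python sketch. The list contents and the `looks_sensitive` helper are hypothetical starting points, not catalog features. Substring matching is an assumption and will also flag names such as `telephone`.

```python
import re

# Hypothetical starter list; extend it with the names used in your own schemas.
SENSITIVE_NAME_PATTERNS = [r"ssn", r"email", r"phone"]
_NAME_RE = re.compile("|".join(SENSITIVE_NAME_PATTERNS), re.IGNORECASE)


def looks_sensitive(column_name):
    """True when the column name contains any known sensitive name fragment."""
    return _NAME_RE.search(column_name) is not None
```

A name-based check like this is cheap enough to run on every newly ingested schema and can flag columns for content-based scanning.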