Sensitive Data Detection

Sensitive Data Detection in the Data Catalog automatically identifies and classifies data that may contain personally identifiable information (PII), financial data, health records, or other sensitive content. Detection results feed into governance policies for automatic enforcement of access controls, masking, and audit rules.

Classification Levels

The governance system supports the following classification levels, which can be assigned to data entities.

Classification	Description	Typical Content
`PUBLIC`	No restrictions on access	Marketing data, public metrics
`INTERNAL`	Restricted to organization members	Internal reports, team metrics
`CONFIDENTIAL`	Limited to authorized personnel	Business strategies, contracts
`SENSITIVE`	Requires explicit authorization	Financial records, HR data
`PII`	Personally identifiable information	Names, emails, SSNs, phone numbers
`PHI`	Protected health information	Medical records, diagnoses
`PCI`	Payment card industry data	Credit card numbers, CVVs

Detection Rule Types

Governance policies of type CLASSIFICATION define the rules used for automatic detection and classification. The following rule types are available.

Rule Type	Description
`AUTO_CLASSIFY`	Automatically classify data based on content patterns
`REQUIRES_CLASSIFICATION`	Enforce that data must be classified before access
`CLASSIFICATION_INHERITANCE`	Propagate classification from parent to child entities
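
To illustrate the idea behind CLASSIFICATION_INHERITANCE, the following is a minimal sketch of parent-to-child propagation over an in-memory entity tree. The dictionary layout (`name`, `classification`, `children`) and the `propagate` function are hypothetical and not part of the catalog API. The sketch also assumes that a classification set explicitly on a child takes precedence over the inherited one, which the product may or may not do.

```python
def propagate(entity, inherited=None):
    """Return {entity name: effective classification} for a tree of entities.

    Assumption: an explicit classification on an entity wins over the one
    inherited from its parent. Entities without one inherit the parent's.
    """
    effective = entity.get("classification") or inherited
    result = {entity["name"]: effective}
    for child in entity.get("children", []):
        result.update(propagate(child, effective))
    return result


# Hypothetical table with two columns, one of which has its own classification.
tree = {
    "name": "customers",
    "classification": "PII",
    "children": [
        {"name": "customers.email"},
        {"name": "customers.notes", "classification": "CONFIDENTIAL"},
    ],
}
effective = propagate(tree)
```

Here `customers.email` ends up with the parent's PII classification, while `customers.notes` keeps CONFIDENTIAL.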

Pattern-Based Detection

The PATTERN_MATCH rule type identifies sensitive data through regular expression patterns.

Pattern Name	Detects	Example Match
Email	Email addresses	user@example.com
SSN	Social Security Numbers	123-45-6789
Credit Card	Payment card numbers	4111-1111-1111-1111
Phone	Phone numbers	+1 (555) 123-4567
IP Address	IP addresses	192.168.1.1
Date of Birth	Birth dates	1990-01-15
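
As a rough illustration of how such patterns behave, the sketch below defines one regular expression per row of the table and checks a value against all of them. The expressions and the `detect` helper are assumptions written for this example. They are not the patterns shipped with the product, and they are deliberately simple: for instance, the IP Address expression accepts octets above 255 and the Date of Birth expression matches any ISO-formatted date.

```python
import re

# Hypothetical, simplified patterns keyed by the pattern names in the table.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "Credit Card": re.compile(r"\d{4}(?:[- ]?\d{4}){3}"),
    "Phone": re.compile(r"\+?\d{1,3}[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}"),
    "IP Address": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
    "Date of Birth": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def detect(value: str) -> list[str]:
    """Return the names of all patterns that match the entire value."""
    return [name for name, rx in PATTERNS.items() if rx.fullmatch(value)]
```

With these expressions, each example value in the table is matched only by its own pattern, so `detect("123-45-6789")` returns `["SSN"]`.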

Example Detection Policy

{
  "name": "PII Auto-Detection",
  "policyType": "CLASSIFICATION",
  "scopeType": "GLOBAL",
  "enforcementMode": "MONITOR",
  "rules": [
    {
      "name": "Require Classification",
      "ruleType": "REQUIRES_CLASSIFICATION",
      "parameters": {},
      "enabled": true,
      "order": 1
    },
    {
      "name": "Email Pattern",
      "ruleType": "PATTERN_MATCH",
      "parameters": {
        "column": "email",
        "pattern": "^[\\w.+-]+@[\\w-]+\\.[\\w.]+$",
        "minMatchPercent": 80.0
      },
      "enabled": true,
      "order": 2
    }
  ],
  "enforcementActions": [
    {
      "actionType": "LOG",
      "parameters": {
        "logLevel": "WARN"
      },
      "order": 1
    },
    {
      "actionType": "NOTIFY",
      "parameters": {
        "recipients": ["data-stewards"],
        "message": "Unclassified sensitive data detected"
      },
      "order": 2
    }
  ]
}
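
The `pattern` and `minMatchPercent` parameters of the PATTERN_MATCH rule suggest a threshold check over the values of a column. The sketch below shows one plausible reading of that check. The function names are hypothetical, and excluding null values from the denominator is an assumption, since the policy format does not say how nulls are counted.

```python
import re


def pattern_match_percent(values, pattern):
    """Percentage of non-null values that match the pattern.

    Assumption: null values are ignored rather than counted as misses.
    """
    rx = re.compile(pattern)
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if rx.search(v))
    return 100.0 * hits / len(non_null)


def rule_matches(rule, values):
    """True when the match percentage reaches the rule's minMatchPercent."""
    params = rule["parameters"]
    percent = pattern_match_percent(values, params["pattern"])
    return percent >= params["minMatchPercent"]


# The "Email Pattern" rule from the policy, with the regex unescaped from JSON.
email_rule = {
    "ruleType": "PATTERN_MATCH",
    "parameters": {
        "column": "email",
        "pattern": r"^[\w.+-]+@[\w-]+\.[\w.]+$",
        "minMatchPercent": 80.0,
    },
}
sample = ["a@x.com", "b@y.org", "c@z.net", "d@w.io", "not-an-email"]
```

Four of the five sample values match, so the column sits exactly at the 80 percent threshold and `rule_matches(email_rule, sample)` is true under this reading.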

Data Quality Integration

Sensitive data detection integrates with the following data quality metrics, which are provided through the evaluation context.

Metric	Description
`column.null_percent`	Percentage of null values in the column
`column.uniqueness_percent`	Percentage of unique values
`column.pattern_match_percent`	Percentage of values matching a detection pattern
`column.min`	Minimum value for numeric columns
`column.max`	Maximum value for numeric columns
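
For orientation, the sketch below computes several of these metrics for a single column held in memory. It is not how the catalog computes them. The `column_metrics` function is hypothetical, and the exact definitions are assumptions: null percentage is taken over all values, while uniqueness is taken as distinct non-null values over non-null values.

```python
def column_metrics(values):
    """Compute a subset of the metrics above for one column of values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    metrics = {
        # Assumption: nulls are measured against the full row count.
        "column.null_percent": (
            100.0 * (total - len(non_null)) / total if total else 0.0
        ),
        # Assumption: uniqueness ignores nulls.
        "column.uniqueness_percent": (
            100.0 * len(set(non_null)) / len(non_null) if non_null else 0.0
        ),
    }
    # min and max are only reported for numeric columns (booleans excluded).
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        metrics["column.min"] = min(numeric)
        metrics["column.max"] = max(numeric)
    return metrics


metrics = column_metrics([1, 2, 2, None])
```

For the four sample values, one is null (25 percent) and two of the three non-null values are distinct.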

Freshness Monitoring

The FRESHNESS rule type monitors data age to ensure sensitive data is current and valid.

Parameter	Description
`maxAgeMinutes`	Maximum allowed age of the data in minutes

When data exceeds the configured freshness threshold, the policy evaluator flags a violation. This is particularly important for sensitive data that must be kept up to date for compliance reasons.
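
The freshness check can be pictured as a comparison between the age of the data and `maxAgeMinutes`. The sketch below is a minimal version of that comparison. The `is_stale` function and the idea of a last-updated timestamp per entity are assumptions made for the example; the policy evaluator may determine data age differently.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age_minutes, now=None):
    """True when the data is older than max_age_minutes.

    last_updated must be a timezone-aware datetime. Data that is exactly
    max_age_minutes old does not count as a violation in this sketch.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_updated > timedelta(minutes=max_age_minutes)


# Fixed reference time so the example is reproducible.
reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
violation = is_stale(reference - timedelta(minutes=61), 60, now=reference)
```

With a 60-minute limit, data last updated 61 minutes before the reference time is flagged as a violation.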

Enforcement on Detection

When sensitive data is detected, the following actions can be triggered automatically.

Action	Description
`MASK`	Apply automatic masking to detected sensitive columns
`QUARANTINE`	Quarantine the data for review before access
`ALERT`	Alert data stewards about the detection
`BLOCK`	Block access until classification is assigned
`WORKFLOW`	Trigger a classification review workflow

Best Practices

Run detection scans regularly on newly ingested data
Combine REQUIRES_CLASSIFICATION with AUTO_CLASSIFY for comprehensive coverage
Use MONITOR mode during initial deployment to tune detection patterns
Review detection results before switching to HARD_ENFORCE mode
Maintain a list of known sensitive column name patterns (e.g., ssn, email, phone)
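
The last practice can be supported by a small helper that flags column names containing a known hint. The sketch below uses only the three hints mentioned above. The `flag_columns` function and the case-insensitive substring matching are assumptions for illustration, and a real list would be longer and maintained by the data stewards.

```python
import re

# Name fragments taken from the examples above; extend as needed.
SENSITIVE_NAME_HINTS = ["ssn", "email", "phone"]

_HINT_RX = re.compile(
    "|".join(re.escape(hint) for hint in SENSITIVE_NAME_HINTS),
    re.IGNORECASE,
)


def flag_columns(column_names):
    """Return the column names that contain any known sensitive hint."""
    return [name for name in column_names if _HINT_RX.search(name)]


flagged = flag_columns(["id", "user_email", "SSN", "created_at", "phone_number"])
```

Name-based flagging is only a first filter. Columns it flags are candidates for content-based detection, not confirmed sensitive data.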

Query Audit Compliance Reporting