Tags and Classification
Data classification is the foundation of the MATIH security and governance model. Every column, table, and data asset can carry classification tags that drive data masking rules, access policies, retention schedules, and compliance controls. This section covers the classification taxonomy, automatic PII detection, tag management, and the classification rules engine.
Classification Taxonomy
MATIH uses a multi-dimensional classification system:
Sensitivity Levels
| Level | Label | Description | Example Data |
|---|---|---|---|
| 0 | PUBLIC | No restrictions on access | Product names, public company info |
| 1 | INTERNAL | Internal use only | Employee names, department structures |
| 2 | CONFIDENTIAL | Business-sensitive | Revenue figures, customer counts |
| 3 | RESTRICTED | Personally Identifiable Information | Email, phone, date of birth |
| 4 | SECRET | Highly sensitive | SSN, financial account numbers, passwords |
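Because the levels are numbered, they form a total order that other components can compare against. A minimal sketch of how that order might be modeled, assuming a hypothetical `SensitivityLevel` enum (the name and its methods are illustrative and not taken from the MATIH codebase):

```java
// Hypothetical enum mirroring the Sensitivity Levels table; not a confirmed MATIH type.
enum SensitivityLevel {
    PUBLIC(0), INTERNAL(1), CONFIDENTIAL(2), RESTRICTED(3), SECRET(4);

    private final int level;

    SensitivityLevel(int level) {
        this.level = level;
    }

    int level() {
        return level;
    }

    // Resolve a label from its numeric level in the table.
    static SensitivityLevel fromLevel(int level) {
        for (SensitivityLevel s : values()) {
            if (s.level == level) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown sensitivity level: " + level);
    }

    // True when this label is at least as sensitive as the other one.
    boolean isAtLeast(SensitivityLevel other) {
        return this.level >= other.level;
    }
}
```

With this shape, a check such as `column.isAtLeast(SensitivityLevel.RESTRICTED)` reads the same way the table does: levels 3 and 4 pass, levels 0 through 2 do not.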
PII Categories
| Category | Tag | Example Columns |
|---|---|---|
| Email address | PII:EMAIL | email, contact_email, user_email |
| Phone number | PII:PHONE | phone, mobile, contact_phone |
| National ID | PII:NATIONAL_ID | ssn, sin, national_id |
| Date of birth | PII:DOB | date_of_birth, dob, birth_date |
| Physical address | PII:ADDRESS | address, street, zip_code |
| Financial account | PII:FINANCIAL | account_number, iban, routing_number |
| Health information | PHI:MEDICAL | diagnosis, prescription, medical_record |
| Payment card | PCI:CARD | card_number, cvv, expiry_date |
| IP address | PII:IP | ip_address, client_ip, source_ip |
| Biometric data | PII:BIOMETRIC | fingerprint_hash, face_encoding |
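The tags follow a `NAMESPACE:CATEGORY` shape, and the API examples in this section pair a category tag with its bare namespace (for example `["PII", "PII:EMAIL"]`). A small sketch of that expansion, assuming the namespace is simply the text before the first colon; the `TagNames` helper is hypothetical, and the source only shows this pairing for `PII` tags:

```java
import java.util.List;

// Hypothetical helper; illustrates the tag naming convention only.
final class TagNames {
    private TagNames() {}

    // Expand "PII:EMAIL" into ["PII", "PII:EMAIL"].
    // A tag with no colon, or with an empty namespace or category, is returned unchanged.
    static List<String> withNamespace(String tag) {
        int colon = tag.indexOf(':');
        if (colon <= 0 || colon == tag.length() - 1) {
            return List.of(tag);
        }
        return List.of(tag.substring(0, colon), tag);
    }
}
```

Business domain tags such as `finance` have no namespace, so they pass through as a single-element list.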
Business Domain Tags
| Domain | Tags |
|---|---|
| Finance | finance, revenue, billing, payment |
| Sales | sales, orders, customers, pipeline |
| Marketing | marketing, campaigns, leads, attribution |
| Engineering | engineering, metrics, logs, infrastructure |
| HR | hr, employees, compensation, recruitment |
Automatic PII Detection
The PiiDetectionService automatically scans columns for PII patterns:
@Service
public class PiiDetectionService {
    public List<PiiDetectionResult> detectPii(UUID tenantId, String tableFqn, int sampleSize) {
        // 1. Fetch sample data from the table (default: 1000 rows)
        // 2. For each column, apply pattern matching
        // 3. Score confidence for each PII type
        // 4. Return results above confidence threshold
    }
}
Detection Methods
| Method | Approach | Accuracy |
|---|---|---|
| Column name heuristics | Match column names against known PII patterns | High for standard names |
| Regex pattern matching | Apply regex patterns to sample data | High for structured PII (SSN, email) |
| Statistical analysis | Analyze value distributions for PII characteristics | Medium for unstructured PII |
| Data type analysis | Correlate column type with PII likelihood | Low (supplementary signal) |
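The regex method comes down to counting how many sampled values match a pattern. The sketch below illustrates that idea with the email pattern from this section. `SampleMatcher` and its methods are hypothetical, and requiring a full match of the trimmed value is an assumption rather than documented behavior of `PiiDetectionService`:

```java
import java.util.List;
import java.util.Objects;
import java.util.regex.Pattern;

// Hypothetical helper; the real PiiDetectionService internals are not shown in this section.
final class SampleMatcher {
    // Email pattern as listed under Detection Patterns.
    static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");

    private SampleMatcher() {}

    // Count sampled values that match the pattern in full; null values are skipped.
    static long countMatches(List<String> sample, Pattern pattern) {
        return sample.stream()
                .filter(Objects::nonNull)
                .filter(v -> pattern.matcher(v.trim()).matches())
                .count();
    }

    // Fraction of the sample that matched, in [0, 1]; 0 for an empty sample.
    static double matchFraction(List<String> sample, Pattern pattern) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        return (double) countMatches(sample, pattern) / sample.size();
    }
}
```

A full match suits columns that hold one value per cell. Free-text columns would need `Matcher.find()` instead, at the cost of more false positives.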
Detection Patterns
// Email detection
Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
// SSN detection (US, dash-separated form only)
Pattern.compile("\\d{3}-\\d{2}-\\d{4}")
// Phone detection (US)
Pattern.compile("\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}")
// Credit card detection (13-19 digit candidates; the regex alone does not verify the Luhn checksum)
Pattern.compile("\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{1,7}")
// IP address detection (IPv4 dotted quad; octets are not range-checked)
Pattern.compile("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}")
Detection Confidence Scoring
Each detection produces a confidence score:
| Score Range | Interpretation | Action |
|---|---|---|
| 0.95 - 1.00 | Definite PII | Auto-classify and apply masking |
| 0.80 - 0.95 | Likely PII | Auto-classify, flag for review |
| 0.50 - 0.80 | Possible PII | Flag for manual review |
| 0.00 - 0.50 | Unlikely PII | No action |
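The bands in the table share their boundary values (0.95, 0.80, 0.50), so any implementation has to decide which band owns each boundary. The sketch below assumes each band includes its lower bound; that choice, along with the `ReviewAction` and `ConfidencePolicy` names, is illustrative and not confirmed by this section:

```java
// Hypothetical action names derived from the Action column of the table.
enum ReviewAction {
    AUTO_CLASSIFY_AND_MASK,
    AUTO_CLASSIFY_AND_FLAG,
    MANUAL_REVIEW,
    NONE
}

final class ConfidencePolicy {
    private ConfidencePolicy() {}

    // Assumption: each band includes its lower bound, so 0.95 counts as "Definite PII".
    static ReviewAction actionFor(double confidence) {
        if (Double.isNaN(confidence) || confidence < 0.0 || confidence > 1.0) {
            throw new IllegalArgumentException("confidence must be in [0, 1]: " + confidence);
        }
        if (confidence >= 0.95) {
            return ReviewAction.AUTO_CLASSIFY_AND_MASK;
        }
        if (confidence >= 0.80) {
            return ReviewAction.AUTO_CLASSIFY_AND_FLAG;
        }
        if (confidence >= 0.50) {
            return ReviewAction.MANUAL_REVIEW;
        }
        return ReviewAction.NONE;
    }
}
```

Under this reading, a score of 0.99 or 0.97 leads to auto-classification with masking, while a score of 0.35 leads to no action.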
PII Detection API
POST /v1/catalog/classification/detect-pii
Request:
{
  "tableFqn": "analytics.public.customers",
  "sampleSize": 1000,
  "autoClassify": false
}
Response:
{
  "tableFqn": "analytics.public.customers",
  "detectedAt": "2026-02-12T10:30:00Z",
  "results": [
    {
      "column": "email",
      "piiType": "PII:EMAIL",
      "confidence": 0.99,
      "sampleMatches": 987,
      "sampleSize": 1000,
      "currentClassification": null,
      "suggestedClassification": "RESTRICTED",
      "suggestedTags": ["PII", "PII:EMAIL"]
    },
    {
      "column": "ssn",
      "piiType": "PII:NATIONAL_ID",
      "confidence": 0.97,
      "sampleMatches": 965,
      "sampleSize": 1000,
      "currentClassification": null,
      "suggestedClassification": "SECRET",
      "suggestedTags": ["PII", "PII:NATIONAL_ID"]
    },
    {
      "column": "notes",
      "piiType": "PII:EMAIL",
      "confidence": 0.35,
      "sampleMatches": 45,
      "sampleSize": 1000,
      "currentClassification": null,
      "suggestedClassification": null,
      "suggestedTags": []
    }
  ]
}
Classification Rules Engine
The ClassificationRulesEngine applies rule-based classification to data assets:
@Service
public class ClassificationRulesEngine {
    public List<ClassificationResult> applyRules(UUID tenantId, String tableFqn) {
        // 1. Fetch column metadata from catalog
        // 2. Apply name-based rules
        // 3. Apply type-based rules
        // 4. Apply PII detection rules
        // 5. Apply custom tenant rules
        // 6. Merge and resolve conflicts (highest sensitivity wins)
    }
}
Rule Types
| Rule Type | Input | Description |
|---|---|---|
| Name pattern | Column name | Classify based on column name matching regex |
| Data type | Column type | Classify based on SQL data type |
| PII detection | Sample data | Classify based on PII detection results |
| Table pattern | Table name | Classify all columns in matching tables |
| Custom expression | Column metadata | Tenant-defined rules with custom logic |
| Inheritance | Table/schema tag | Propagate classification from parent to child |
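Several rule types can apply to the same column, which is where the engine's "highest sensitivity wins" merge step matters. The sketch below covers only name-pattern and table-pattern rules and keeps the most sensitive result. All names (`RuleSketch`, `Rule`, `classify`) are hypothetical. Two choices are assumptions: each regex must match the whole name, and the `priority` field is ignored because this section does not define how it interacts with sensitivity:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative sketch only; class and member names are hypothetical, not MATIH APIs.
final class RuleSketch {
    // Declared from least to most sensitive, so natural enum order reflects sensitivity.
    enum Level { PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED, SECRET }

    enum RuleType { NAME_PATTERN, TABLE_PATTERN }

    static final class Rule {
        final RuleType type;
        final Pattern pattern;
        final Level classification;

        Rule(RuleType type, String regex, Level classification) {
            this.type = type;
            this.pattern = Pattern.compile(regex);
            this.classification = classification;
        }

        // Assumption: the regex must match the whole name, not just a substring.
        boolean applies(String table, String column) {
            String target = (type == RuleType.TABLE_PATTERN) ? table : column;
            return pattern.matcher(target).matches();
        }
    }

    private RuleSketch() {}

    // Collect every applicable rule and keep the most sensitive classification.
    static Optional<Level> classify(String table, String column, List<Rule> rules) {
        return rules.stream()
                .filter(r -> r.applies(table, column))
                .map(r -> r.classification)
                .max(Comparator.naturalOrder());
    }
}
```

With the three example rules from this section, an `email` column in a `payments` table matches both the table rule (CONFIDENTIAL) and the name rule (RESTRICTED), and RESTRICTED wins. The whole-name match matters for a pattern containing `sin`: a substring search would also flag a column such as `business_name`.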
Rule Configuration
{
  "rules": [
    {
      "name": "email-columns",
      "type": "NAME_PATTERN",
      "pattern": "(?i)(email|e_mail|email_address|contact_email)",
      "classification": "RESTRICTED",
      "tags": ["PII", "PII:EMAIL"],
      "priority": 100
    },
    {
      "name": "ssn-columns",
      "type": "NAME_PATTERN",
      "pattern": "(?i)(ssn|social_security|sin|national_id)",
      "classification": "SECRET",
      "tags": ["PII", "PII:NATIONAL_ID"],
      "priority": 200
    },
    {
      "name": "financial-tables",
      "type": "TABLE_PATTERN",
      "pattern": "(?i)(transactions|payments|invoices|billing)",
      "classification": "CONFIDENTIAL",
      "tags": ["finance"],
      "priority": 50
    }
  ]
}
Tag Management
Applying Tags
POST /v1/catalog/tags
Request:
{
  "entityType": "COLUMN",
  "entityFqn": "analytics.public.customers.email",
  "tags": ["PII", "PII:EMAIL"],
  "classification": "RESTRICTED",
  "appliedBy": "auto-classification"
}
Listing Tags
GET /v1/catalog/tags?entityType=TABLE&entityFqn=analytics.public.customers
Response:
{
  "entityFqn": "analytics.public.customers",
  "tags": [
    {"tag": "customer-data", "source": "manual", "appliedBy": "data-steward", "appliedAt": "2026-01-15"},
    {"tag": "PII-containing", "source": "auto-classification", "appliedBy": "system", "appliedAt": "2026-02-01"}
  ],
  "columns": [
    {
      "column": "email",
      "classification": "RESTRICTED",
      "tags": ["PII", "PII:EMAIL"],
      "source": "auto-classification"
    },
    {
      "column": "ssn",
      "classification": "SECRET",
      "tags": ["PII", "PII:NATIONAL_ID"],
      "source": "auto-classification"
    }
  ]
}
Tag Propagation
Classification tags propagate through the governance system:
Classification applied to column
        |
        v
[Catalog Service] -- CatalogEvent --> [Query Engine]
        |                                   |
        |                          Update masking rules
        |
        v
[Governance Service] -- Policy update --> [OPA]
                                            |
                                            |
                                   Update access policies
Classification and Downstream Effects
| Classification Tag | Query Engine Effect | Governance Effect | Quality Effect |
|---|---|---|---|
| PUBLIC | No masking | No access restriction | Standard validation |
| INTERNAL | No masking for internal users | Internal role required | Standard validation |
| CONFIDENTIAL | Partial masking for viewers | Department match required | Enhanced monitoring |
| RESTRICTED | Heavy masking for non-stewards | Data steward or admin required | PII monitoring rules |
| SECRET | Full redaction for non-admins | Admin only, audit trail | Critical validation rules |
| PII:EMAIL | Email masking function | GDPR controls apply | Email format validation |
| PII:FINANCIAL | Account number masking | PCI DSS controls apply | Financial integrity checks |
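To make the "Email masking function" entry concrete, here is one possible shape for such a function. The output format (first character of the local part kept, the rest replaced, domain preserved) is an assumption chosen for illustration, and `EmailMasker` is a hypothetical name. The masking functions MATIH actually uses are covered in the Data Masking section:

```java
// Hypothetical masking helper; the output format is an assumption, not MATIH's documented behavior.
final class EmailMasker {
    private static final String REDACTED = "****";

    private EmailMasker() {}

    // Keep the first character of the local part and the full domain.
    // Values with no local part before '@' (or no '@' at all) are redacted entirely.
    static String mask(String email) {
        int at = email.lastIndexOf('@');
        if (at <= 0) {
            return REDACTED;
        }
        return email.charAt(0) + REDACTED + email.substring(at);
    }
}
```

Keeping the domain preserves some analytical value, such as grouping by mail provider, while hiding the identifying local part. A stricter policy could redact the domain as well.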
Related Sections
- Data Masking -- Masking driven by classification tags
- Governance Service -- Access policies based on classification
- Metadata Management -- Profiling that feeds classification
- API Reference -- Classification endpoints