MATIH Platform is in active MVP development. Documentation reflects current implementation status.
9. Query Engine & SQL
Data Masking

Data Masking

Data masking protects sensitive column data by transforming column values in query results based on the user's role and the column's classification level. Unlike Row-Level Security (which restricts which rows a user can see), data masking restricts how column values are displayed, allowing users to see the existence of data without revealing its actual content.


Masking Architecture

Data masking operates between the RLS injection phase and the engine execution phase:

SQL Query
    |
    v
[RLS Filter Injection]  -- Row-level restrictions
    |
    v
[Data Masking Service]  -- Column-level transformations
    |
    v
[Engine Execution]      -- Executes modified SQL
    |
    v
[Masked Results]        -- User sees transformed values

The masking service modifies the SELECT clause of SQL queries, replacing column references with masking function expressions. This approach ensures masking is applied at the database level rather than post-processing, preventing raw data from ever reaching the application layer.


Classification-Driven Masking

Masking rules are driven by data classification tags assigned in the Catalog Service. Each column can carry a classification tag that determines the masking strategy:

ClassificationLevelDescriptionDefault Masking
PUBLIC0No restrictionsNone (visible to all)
INTERNAL1Internal use onlyNone for internal users; redacted for external
CONFIDENTIAL2Business-sensitive dataPartial masking
RESTRICTED3Personally Identifiable Information (PII)Heavy masking or hashing
SECRET4Credentials, financial account numbersFull redaction

Role-Based Visibility Matrix

Each user role has a visibility level that determines which classifications they can see unmasked:

RoleVisibility LevelSees Unmasked
VIEWER0PUBLIC only
ANALYST1PUBLIC, INTERNAL
DATA_ENGINEER2PUBLIC, INTERNAL, CONFIDENTIAL
DATA_STEWARD3PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
ADMIN4All classifications

If a user's visibility level is lower than the column's classification level, masking is applied.


Masking Strategies

The masking service supports multiple masking strategies, each appropriate for different data types:

Full Redaction

Replaces the entire value with a constant:

-- Original
SELECT ssn FROM customers
 
-- Masked (for ANALYST role)
SELECT '***-**-****' AS ssn FROM customers

Partial Masking

Reveals a portion of the value while hiding the rest:

-- Email: show first 3 characters and domain
SELECT CONCAT(SUBSTR(email, 1, 3), '***@', SPLIT_PART(email, '@', 2)) AS email FROM customers
 
-- Phone: show last 4 digits
SELECT CONCAT('***-***-', SUBSTR(phone, -4)) AS phone FROM customers
 
-- Credit card: show last 4 digits
SELECT CONCAT('****-****-****-', SUBSTR(card_number, -4)) AS card_number FROM transactions

Hashing

Replaces the value with a deterministic hash, preserving equality comparisons but hiding actual values:

-- SHA-256 hash preserves grouping and joining on masked columns
SELECT SHA256(CAST(customer_id AS VARCHAR)) AS customer_id_hash, SUM(amount) AS total
FROM orders GROUP BY SHA256(CAST(customer_id AS VARCHAR))

Bucketing

Replaces numeric values with range buckets:

-- Salary: replace with range
SELECT CASE
    WHEN salary < 50000 THEN '< 50K'
    WHEN salary < 100000 THEN '50K-100K'
    WHEN salary < 150000 THEN '100K-150K'
    ELSE '150K+'
END AS salary_range FROM employees

Date Truncation

Reduces date precision:

-- Date of birth: show only year
SELECT DATE_TRUNC('year', date_of_birth) AS birth_year FROM customers

Nullification

Replaces the value with NULL:

-- Completely hide the column value
SELECT NULL AS medical_record_number FROM patients

Masking Rule Configuration

Masking rules are stored in the catalog service and associated with column metadata:

{
  "table": "customers",
  "column": "ssn",
  "classification": "SECRET",
  "maskingRules": [
    {
      "roleLevel": 3,
      "strategy": "PARTIAL",
      "config": {
        "showLast": 4,
        "maskChar": "*",
        "format": "***-**-{last4}"
      }
    },
    {
      "roleLevel": 0,
      "strategy": "REDACT",
      "config": {
        "replacement": "***-**-****"
      }
    }
  ]
}

Rules are evaluated in order of decreasing role level. The first rule where the user's role level meets or exceeds the rule's roleLevel is applied.


Masking Audit Trail

Every masking operation is logged to the MaskingAuditLogRepository:

@Entity
public class MaskingAuditLog {
    private UUID id;
    private UUID tenantId;
    private UUID userId;
    private UUID executionId;
    private String tableName;
    private String columnName;
    private String classification;
    private String maskingStrategy;
    private String userRole;
    private Instant timestamp;
}

The MaskingAuditController exposes endpoints for querying the audit trail:

GET /v1/queries/masking/audit?tenantId={tenantId}&table=customers&startDate=2026-02-01

Response:
{
  "content": [
    {
      "id": "...",
      "userId": "user-456",
      "tableName": "customers",
      "columnName": "ssn",
      "classification": "SECRET",
      "maskingStrategy": "PARTIAL",
      "userRole": "DATA_STEWARD",
      "timestamp": "2026-02-12T10:30:00Z"
    }
  ],
  "totalElements": 234
}

Masking and Query Semantics

Aggregation Compatibility

When masking is applied to columns used in GROUP BY or aggregate functions, the masking service must ensure query semantics are preserved:

OperationBehavior with Masking
GROUP BY masked_columnUses hash masking to preserve grouping
COUNT(DISTINCT masked_column)Applies hash to preserve count accuracy
SUM(masked_numeric_column)Bucketing applied; sum is approximate
WHERE masked_column = valueFilter evaluated before masking; masking is SELECT-only
ORDER BY masked_columnOrdering preserved if hash masking; lost if redaction

JOIN Key Handling

When a masked column is used as a JOIN key, the masking service applies consistent hashing on both sides of the join to preserve join correctness:

-- Both sides use the same hash function
SELECT a.*, b.order_total
FROM customers a
JOIN orders b ON SHA256(CAST(a.customer_id AS VARCHAR)) = SHA256(CAST(b.customer_id AS VARCHAR))

Performance Considerations

ConcernImpactMitigation
Masking function overhead5-15% execution time increaseFunctions are simple string/hash operations
Cache partitioningDifferent masking per role means different cache entriesRole hash included in cache key
Query plan complexityAdditional function calls in SELECTQuery plan cache stores post-masking plan

Related Sections