MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Exploratory Data Analysis

The Exploratory Data Analysis (EDA) integration enables users to profile datasets, analyze distributions, detect correlations, and identify data quality issues through the AI Service conversational interface. EDA results are presented as statistical summaries, visualizations, and natural language insights.


EDA Capabilities

| Capability | Description | Output |
|---|---|---|
| Statistical profiling | Summary statistics for all columns | Mean, median, std, min, max, percentiles |
| Distribution analysis | Histogram and density estimation | Distribution plots, skewness, kurtosis |
| Correlation analysis | Pairwise feature correlations | Correlation matrix, top correlated pairs |
| Missing value analysis | Null and missing data patterns | Missing counts, percentages, patterns |
| Outlier detection | Statistical outlier identification | IQR, z-score, isolation forest results |
| Cardinality analysis | Unique value counts and frequencies | Value counts, entropy, cardinality ratio |
| Target analysis | Relationship between features and target | Feature importance, mutual information |
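
To make the cardinality outputs in the table concrete, here is a minimal local sketch of value counts, Shannon entropy, and cardinality ratio for one categorical column. The function name `cardinality_summary` and the exact definitions (entropy in bits, ratio as unique values divided by non-missing rows) are assumptions for illustration; the service may define these metrics differently.

```python
import math
from collections import Counter


def cardinality_summary(values):
    """Return value counts, Shannon entropy (bits), and cardinality ratio.

    Hypothetical helper: the definitions here are assumptions, not the
    service's documented formulas.
    """
    present = [v for v in values if v is not None]  # skip missing values
    counts = Counter(present)
    total = len(present)
    if total == 0:
        return {"value_counts": {}, "entropy": 0.0, "cardinality_ratio": 0.0}
    # Shannon entropy: -sum(p * log2(p)) over the observed categories.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "value_counts": dict(counts),
        "entropy": entropy,
        # Fraction of non-missing rows that hold a distinct value.
        "cardinality_ratio": len(counts) / total,
    }
```

A ratio close to 1 usually points to an identifier-like column, while entropy near 0 means a single value dominates.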

Profile Dataset

Generates a comprehensive statistical profile for a dataset:

POST /api/v1/ml/eda/profile
{
  "source": "sql",
  "query": "SELECT * FROM analytics.customer_features LIMIT 10000",
  "tenant_id": "acme-corp",
  "options": {
    "correlations": true,
    "missing_analysis": true,
    "outlier_detection": true,
    "sample_size": 10000
  }
}

Response

{
  "row_count": 10000,
  "column_count": 15,
  "columns": [
    {
      "name": "tenure",
      "type": "numeric",
      "stats": {
        "mean": 32.4,
        "median": 29.0,
        "std": 24.6,
        "min": 1,
        "max": 120,
        "missing": 0,
        "missing_pct": 0.0
      },
      "distribution": "right_skewed",
      "outliers": 23
    },
    {
      "name": "contract_type",
      "type": "categorical",
      "stats": {
        "unique": 3,
        "top": "month-to-month",
        "top_freq": 0.55,
        "missing": 0
      }
    }
  ],
  "correlations": {
    "top_pairs": [
      {"feature_a": "tenure", "feature_b": "total_charges", "correlation": 0.83},
      {"feature_a": "monthly_charges", "feature_b": "total_charges", "correlation": 0.65}
    ]
  }
}
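
A client will typically walk this response to produce a short summary. The sketch below assumes only the fields shown in the example response above; `summarize_profile` is a hypothetical helper, not part of the platform.

```python
def summarize_profile(profile):
    """Turn a profile response (a parsed JSON dict) into readable lines.

    Assumes the field names from the example response; any field missing
    from the payload falls back to a default instead of raising.
    """
    lines = [f"{profile['row_count']} rows, {profile['column_count']} columns"]
    for col in profile.get("columns", []):
        stats = col.get("stats", {})
        if col.get("type") == "numeric":
            lines.append(
                f"{col['name']} (numeric): mean={stats.get('mean')}, "
                f"median={stats.get('median')}, missing={stats.get('missing', 0)}, "
                f"outliers={col.get('outliers', 0)}"
            )
        else:
            lines.append(
                f"{col['name']} ({col.get('type')}): "
                f"{stats.get('unique')} unique, top={stats.get('top')!r}"
            )
    # Correlated pairs are optional: they appear only when requested.
    for pair in profile.get("correlations", {}).get("top_pairs", []):
        lines.append(
            f"{pair['feature_a']} ~ {pair['feature_b']}: r={pair['correlation']}"
        )
    return lines
```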

Analyze Distribution

Generates a distribution analysis for a single column:

POST /api/v1/ml/eda/distribution
{
  "query": "SELECT monthly_charges FROM analytics.customer_features",
  "column": "monthly_charges",
  "bins": 30,
  "tenant_id": "acme-corp"
}
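
As a rough illustration of what the `bins` parameter controls, the following sketch builds an equal-width histogram and computes sample skewness locally. It assumes equal-width binning and the moment-based skewness formula; the service's actual binning and estimators are not documented here, and both function names are hypothetical.

```python
def histogram(values, bins=30):
    """Equal-width histogram as a list of (lower_edge, upper_edge, count)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single degenerate bin.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0
```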

Detect Outliers

Identifies statistical outliers in the dataset:

POST /api/v1/ml/eda/outliers
{
  "query": "SELECT * FROM analytics.customer_features",
  "columns": ["monthly_charges", "total_charges"],
  "method": "iqr",
  "threshold": 1.5,
  "tenant_id": "acme-corp"
}
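
For reference, the IQR rule named by `"method": "iqr"` can be sketched as follows, treating `threshold` as the multiplier applied to the interquartile range. That interpretation of `threshold`, the quartile interpolation method, and the function name `iqr_outliers` are assumptions; the service may compute quartiles differently.

```python
import statistics


def iqr_outliers(values, threshold=1.5):
    """Return values outside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    # method="inclusive" treats the data as the full population; other
    # quartile conventions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [v for v in values if v < lower or v > upper]
```

With `threshold` at 1.5 this is the conventional Tukey fence; a larger value such as 3.0 flags only more extreme points.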

Correlation Matrix

Computes pairwise correlations for numeric columns:

POST /api/v1/ml/eda/correlations
{
  "query": "SELECT tenure, monthly_charges, total_charges FROM analytics.customer_features",
  "method": "pearson",
  "tenant_id": "acme-corp"
}
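
The `pearson` method corresponds to the standard Pearson correlation coefficient. The sketch below computes it for every column pair and keeps pairs whose absolute correlation meets a threshold, shaped like the `top_pairs` entries in the profile response. The helper names and the use of the threshold as a filter are assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for constant columns
    return cov / (sx * sy)


def top_pairs(columns, threshold=0.7):
    """Return pairs with |r| >= threshold, strongest first.

    `columns` maps a column name to its list of values.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:  # NaN compares False and is skipped
            pairs.append({"feature_a": a, "feature_b": b, "correlation": round(r, 2)})
    return sorted(pairs, key=lambda p: abs(p["correlation"]), reverse=True)
```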

Conversational EDA

Users can request EDA through natural language:

| User Query | EDA Action |
|---|---|
| "Profile the customer dataset" | Full statistical profiling |
| "What does the revenue distribution look like?" | Distribution analysis |
| "Are there any outliers in monthly charges?" | Outlier detection |
| "Which features are most correlated?" | Correlation matrix |
| "How much missing data is there?" | Missing value analysis |

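The mapping from user queries to EDA actions can be pictured with a toy keyword router. This is only an illustration of the query-to-action idea: the AI Service's real intent handling is not described in this section, and the rule list, action names, and `route_eda_query` function are all hypothetical.

```python
# Hypothetical keyword rules. Order matters: the first matching rule wins.
INTENT_RULES = [
    (("outlier", "anomal"), "outlier_detection"),
    (("correlat",), "correlation_matrix"),
    (("missing", "null"), "missing_value_analysis"),
    (("distribution", "histogram"), "distribution_analysis"),
    (("profile", "summary", "summarize"), "statistical_profiling"),
]


def route_eda_query(text):
    """Map a natural language query to an EDA action name, or None."""
    lowered = text.lower()
    for keywords, action in INTENT_RULES:
        # Substring match so "outliers" and "correlated" also hit.
        if any(keyword in lowered for keyword in keywords):
            return action
    return None
```

A production router would more likely rely on a language model or a trained intent classifier, since keyword matching breaks down on paraphrases.
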
Configuration

| Environment Variable | Default | Description |
|---|---|---|
| EDA_MAX_ROWS | 100000 | Maximum rows to profile |
| EDA_SAMPLE_SIZE | 10000 | Default sample size for profiling |
| EDA_OUTLIER_METHOD | iqr | Default outlier detection method |
| EDA_CORRELATION_THRESHOLD | 0.7 | Threshold for highlighting correlations |
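
One way a Python process might read these variables with the documented defaults is sketched below. The `EdaSettings` class and `load_eda_settings` function are hypothetical names; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EdaSettings:
    """Hypothetical settings container using the documented defaults."""
    max_rows: int = 100000
    sample_size: int = 10000
    outlier_method: str = "iqr"
    correlation_threshold: float = 0.7


def load_eda_settings(env=None):
    """Read EDA_* variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return EdaSettings(
        max_rows=int(env.get("EDA_MAX_ROWS", 100000)),
        sample_size=int(env.get("EDA_SAMPLE_SIZE", 10000)),
        outlier_method=env.get("EDA_OUTLIER_METHOD", "iqr"),
        correlation_threshold=float(env.get("EDA_CORRELATION_THRESHOLD", 0.7)),
    )
```

Passing a plain dictionary as `env` keeps the loader easy to test without touching the real process environment.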