Exploratory Data Analysis
The Exploratory Data Analysis (EDA) integration enables users to profile datasets, analyze distributions, detect correlations, and identify data quality issues through the AI Service conversational interface. EDA results are presented as statistical summaries, visualizations, and natural language insights.
EDA Capabilities
| Capability | Description | Output |
|---|---|---|
| Statistical profiling | Summary statistics for all columns | Mean, median, std, min, max, percentiles |
| Distribution analysis | Histogram and density estimation | Distribution plots, skewness, kurtosis |
| Correlation analysis | Pairwise feature correlations | Correlation matrix, top correlated pairs |
| Missing value analysis | Null and missing data patterns | Missing counts, percentages, patterns |
| Outlier detection | Statistical outlier identification | IQR, z-score, isolation forest results |
| Cardinality analysis | Unique value counts and frequencies | Value counts, entropy, cardinality ratio |
| Target analysis | Relationship between features and target | Feature importance, mutual information |
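
As a rough illustration of the IQR method listed under outlier detection, the sketch below flags values that fall more than `threshold` interquartile ranges outside the first and third quartiles. The function name `iqr_outliers` and the sample values are hypothetical, and the quartile convention (Python's `inclusive` method) is an assumption; the service may compute quartiles differently.

```python
import statistics


def iqr_outliers(values, threshold=1.5):
    """Return the values lying outside [Q1 - t*IQR, Q3 + t*IQR].

    Illustrative only: the quartile method used by the EDA service is
    not documented here, so "inclusive" is an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - threshold * iqr
    high = q3 + threshold * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: six typical values and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(sample)
print(flagged)
```

A larger `threshold` widens the accepted range and flags fewer points; 1.5 is the conventional default for the IQR rule.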
Profile Dataset
Generates a comprehensive statistical profile for a dataset:
POST /api/v1/ml/eda/profile
{
"source": "sql",
"query": "SELECT * FROM analytics.customer_features LIMIT 10000",
"tenant_id": "acme-corp",
"options": {
"correlations": true,
"missing_analysis": true,
"outlier_detection": true,
"sample_size": 10000
}
}
Response
{
"row_count": 10000,
"column_count": 15,
"columns": [
{
"name": "tenure",
"type": "numeric",
"stats": {
"mean": 32.4,
"median": 29.0,
"std": 24.6,
"min": 1,
"max": 120,
"missing": 0,
"missing_pct": 0.0
},
"distribution": "right_skewed",
"outliers": 23
},
{
"name": "contract_type",
"type": "categorical",
"stats": {
"unique": 3,
"top": "month-to-month",
"top_freq": 0.55,
"missing": 0
}
}
],
"correlations": {
"top_pairs": [
{"feature_a": "tenure", "feature_b": "total_charges", "correlation": 0.83},
{"feature_a": "monthly_charges", "feature_b": "total_charges", "correlation": 0.65}
]
}
}
Analyze Distribution
Generates distribution analysis for a specific column:
POST /api/v1/ml/eda/distribution
{
"query": "SELECT monthly_charges FROM analytics.customer_features",
"column": "monthly_charges",
"bins": 30,
"tenant_id": "acme-corp"
}
Detect Outliers
Identifies statistical outliers in the dataset:
POST /api/v1/ml/eda/outliers
{
"query": "SELECT * FROM analytics.customer_features",
"columns": ["monthly_charges", "total_charges"],
"method": "iqr",
"threshold": 1.5,
"tenant_id": "acme-corp"
}
Correlation Matrix
Computes pairwise correlations for numeric columns:
POST /api/v1/ml/eda/correlations
{
"query": "SELECT tenure, monthly_charges, total_charges FROM analytics.customer_features",
"method": "pearson",
"tenant_id": "acme-corp"
}
Conversational EDA
Users can request EDA through natural language:
| User Query | EDA Action |
|---|---|
| "Profile the customer dataset" | Full statistical profiling |
| "What does the revenue distribution look like?" | Distribution analysis |
| "Are there any outliers in monthly charges?" | Outlier detection |
| "Which features are most correlated?" | Correlation matrix |
| "How much missing data is there?" | Missing value analysis |
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| EDA_MAX_ROWS | 100000 | Maximum rows to profile |
| EDA_SAMPLE_SIZE | 10000 | Default sample size for profiling |
| EDA_OUTLIER_METHOD | iqr | Default outlier detection method |
| EDA_CORRELATION_THRESHOLD | 0.7 | Threshold for highlighting correlations |
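
The sketch below shows one way these variables could be read with their documented defaults, and how the correlation threshold might be applied to the `top_pairs` from a profile response. The `EdaConfig` class and both helper functions are hypothetical, and treating "highlighting" as `abs(correlation) >= threshold` is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EdaConfig:
    max_rows: int
    sample_size: int
    outlier_method: str
    correlation_threshold: float


def load_eda_config(env=None):
    """Read the EDA_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return EdaConfig(
        max_rows=int(env.get("EDA_MAX_ROWS", "100000")),
        sample_size=int(env.get("EDA_SAMPLE_SIZE", "10000")),
        outlier_method=env.get("EDA_OUTLIER_METHOD", "iqr"),
        correlation_threshold=float(env.get("EDA_CORRELATION_THRESHOLD", "0.7")),
    )


def highlighted_pairs(top_pairs, threshold):
    # Assumption: a pair is highlighted when |correlation| >= threshold.
    return [p for p in top_pairs if abs(p["correlation"]) >= threshold]


# An empty mapping stands in for an environment with no overrides.
config = load_eda_config({})

# The two pairs from the example profile response.
pairs = [
    {"feature_a": "tenure", "feature_b": "total_charges", "correlation": 0.83},
    {"feature_a": "monthly_charges", "feature_b": "total_charges", "correlation": 0.65},
]
strong = highlighted_pairs(pairs, config.correlation_threshold)
print(strong)
```

Under these assumptions and the default threshold of 0.7, only the tenure and total_charges pair (0.83) is highlighted; the 0.65 pair falls below the cutoff.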