Chat Interface
The Chat Interface page provides an AI-assisted conversational tool for platform operations. Operators can ask questions about system health, investigate issues, query metrics using natural language, and receive guided troubleshooting assistance powered by the Ops Agent Service.
Capabilities
| Capability | Description | Example Query |
|---|---|---|
| Health queries | Check service status and health | "Is the AI service healthy?" |
| Metric queries | Query Prometheus metrics via natural language | "What is the p95 latency for the query engine?" |
| Log search | Search logs across services | "Show me errors from the AI service in the last hour" |
| Incident assistance | Get troubleshooting guidance | "The AI service is returning 503 errors" |
| Runbook execution | Execute operational runbooks | "Run the connection pool diagnostic" |
| Resource queries | Check resource utilization | "Which pods are using the most memory?" |
Chat Architecture
    Operator --> Chat Interface --> Ops Agent Service --> Observability API
                                            |                     |
                                            |            +--------+--------+
                                            |            |        |        |
                                            |       Prometheus  Loki     Tempo
                                            |
                                            +--> Kubernetes API
                                            |
                                            +--> AI Service (LLM)

Message Types
User Messages
Plain text natural language queries about platform operations.
Assistant Responses
Structured responses with embedded data:
| Response Type | Content |
|---|---|
| Text | Natural language explanation |
| Metric chart | Inline time-series visualization |
| Log excerpt | Formatted log lines with highlighting |
| Table | Tabular data (pod list, service status) |
| Action suggestion | Clickable remediation actions |
| Runbook steps | Step-by-step troubleshooting guide |
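
One way to model the response types above on the client is a discriminated union, so that each part carries only the fields its renderer needs. This is a minimal sketch: the `type` tags and field names (`points`, `lines`, `label`, and so on) are assumptions chosen for illustration, not the Ops Agent Service's actual wire format.

```typescript
// Hypothetical payload shapes, one per row of the response type table.
type AssistantResponsePart =
  | { type: 'text'; text: string }
  | { type: 'metric_chart'; query: string; points: Array<[number, number]> }
  | { type: 'log_excerpt'; lines: string[] }
  | { type: 'table'; columns: string[]; rows: string[][] }
  | { type: 'action_suggestion'; label: string; action: string }
  | { type: 'runbook_steps'; steps: string[] };

// Produce a one-line plain-text summary of a part, e.g. for a transcript export.
// The switch is exhaustive, so adding a new part type becomes a compile error here.
function summarizePart(part: AssistantResponsePart): string {
  switch (part.type) {
    case 'text':
      return part.text;
    case 'metric_chart':
      return `chart: ${part.query} (${part.points.length} points)`;
    case 'log_excerpt':
      return `logs: ${part.lines.length} lines`;
    case 'table':
      return `table: ${part.rows.length} rows x ${part.columns.length} columns`;
    case 'action_suggestion':
      return `action: ${part.label}`;
    case 'runbook_steps':
      return `runbook: ${part.steps.length} steps`;
  }
}
```

A renderer can switch on `part.type` in the same way to pick a chart, log, or table component.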
Conversational Context
The chat maintains session context for multi-turn investigations:
    interface OpsSessionContext {
      session_id: string;
      services_discussed: string[];
      metrics_queried: string[];
      time_range: { from: string; to: string };
      active_incident?: string;
    }

This allows follow-up questions without repeating context:
| Turn | Message |
|---|---|
| 1 | "What is the AI service error rate?" |
| 2 | "Show me the logs for those errors" |
| 3 | "Is the database connection pool full?" |
| 4 | "Restart the affected pods" |
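
The multi-turn flow above can be sketched as two small helpers: one that records the service mentioned in a turn, and one that resolves an implicit reference such as "those errors" to the most recently discussed service. The "most recent service wins" rule and both function names are assumptions made for illustration; the actual agent may resolve references differently, for example through the LLM.

```typescript
interface OpsSessionContext {
  session_id: string;
  services_discussed: string[];
  metrics_queried: string[];
  time_range: { from: string; to: string };
  active_incident?: string;
}

// Record a service mentioned in a turn. The most recent mention is kept last,
// and the input context is not mutated.
function noteService(ctx: OpsSessionContext, service: string): OpsSessionContext {
  const others = ctx.services_discussed.filter((s) => s !== service);
  return { ...ctx, services_discussed: [...others, service] };
}

// Resolve the subject of a follow-up: an explicitly named service wins,
// otherwise fall back to the most recently discussed one (hypothetical rule).
function resolveService(ctx: OpsSessionContext, explicit?: string): string | undefined {
  return explicit ?? ctx.services_discussed[ctx.services_discussed.length - 1];
}
```

With this rule, turn 2 ("Show me the logs for those errors") would inherit the service noted in turn 1 along with the session's `time_range`.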
Tool Execution
The Ops Agent has access to operational tools:
| Tool | Action | Confirmation Required |
|---|---|---|
| `query_prometheus` | Execute a PromQL query | No |
| `search_logs` | Search Loki logs | No |
| `get_pod_status` | List pod status | No |
| `get_service_health` | Check service health | No |
| `describe_pod` | Get pod details | No |
| `restart_pod` | Restart a specific pod | Yes |
| `scale_deployment` | Change replica count | Yes |
| `run_runbook` | Execute a runbook | Yes |
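
The "Confirmation Required" column can be enforced with a lookup table and a small dispatch gate. The sketch below is illustrative only: the tool names come from the table above, while `REQUIRES_CONFIRMATION`, `DispatchResult`, and `dispatchTool` are hypothetical names, and the real Ops Agent may gate calls server-side instead.

```typescript
type ToolName =
  | 'query_prometheus'
  | 'search_logs'
  | 'get_pod_status'
  | 'get_service_health'
  | 'describe_pod'
  | 'restart_pod'
  | 'scale_deployment'
  | 'run_runbook';

// Mirrors the "Confirmation Required" column: read-only tools are false.
const REQUIRES_CONFIRMATION: Record<ToolName, boolean> = {
  query_prometheus: false,
  search_logs: false,
  get_pod_status: false,
  get_service_health: false,
  describe_pod: false,
  restart_pod: true,
  scale_deployment: true,
  run_runbook: true,
};

type DispatchResult =
  | { status: 'executed'; tool: ToolName }
  | { status: 'needs_confirmation'; tool: ToolName };

// Run a tool only if it is read-only or the operator has already confirmed it.
function dispatchTool(
  tool: ToolName,
  confirmed: boolean,
  run: (tool: ToolName) => void,
): DispatchResult {
  if (REQUIRES_CONFIRMATION[tool] && !confirmed) {
    return { status: 'needs_confirmation', tool };
  }
  run(tool);
  return { status: 'executed', tool };
}
```

Typing the table as `Record<ToolName, boolean>` means a newly added tool cannot be dispatched until someone decides whether it needs confirmation.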
Action Confirmation
Destructive or mutating actions require explicit operator confirmation:
    interface ActionConfirmation {
      action: string;
      target: string;
      description: string;
      risk_level: 'low' | 'medium' | 'high';
      requires_approval: boolean;
    }

Suggested Queries
The chat interface offers contextual suggestions based on current platform state:
| Context | Suggestions |
|---|---|
| Alert firing | "Investigate the current alert for [service]" |
| High latency | "What is causing latency in [service]?" |
| Pod restarts | "Why is [pod] restarting?" |
| General | "Show platform health summary" |
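
The suggestion table can be read as a set of templates keyed by platform context, with `[service]` and `[pod]` as placeholders. The following sketch assumes that reading; the `PlatformContext` keys and the `buildSuggestion` helper are hypothetical names, and how the interface actually detects the current context is not specified here.

```typescript
type PlatformContext = 'alert_firing' | 'high_latency' | 'pod_restarts' | 'general';

// Templates copied from the table above; bracketed tokens are placeholders.
const SUGGESTION_TEMPLATES: Record<PlatformContext, string> = {
  alert_firing: 'Investigate the current alert for [service]',
  high_latency: 'What is causing latency in [service]?',
  pod_restarts: 'Why is [pod] restarting?',
  general: 'Show platform health summary',
};

// Fill in the placeholders that have known values; unknown ones are left as-is
// so the operator can see what still needs to be specified.
function buildSuggestion(
  context: PlatformContext,
  values: { service?: string; pod?: string } = {},
): string {
  return SUGGESTION_TEMPLATES[context]
    .replace('[service]', values.service ?? '[service]')
    .replace('[pod]', values.pod ?? '[pod]');
}
```

For example, `buildSuggestion('high_latency', { service: 'query-engine' })` yields "What is causing latency in query-engine?".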
WebSocket Connection
The chat uses WebSocket for real-time streaming of assistant responses:
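    // useWebSocket is assumed to be an application-level hook that exposes
    // send() and a messages collection; it is not defined on this page.
    const useOpsChat = (sessionId: string) => {
      // Each chat session has its own WebSocket endpoint.
      const ws = useWebSocket(`/ws/ops/chat/${sessionId}`);
      // Send an operator message as a JSON frame.
      const sendMessage = (message: string) => {
        ws.send(JSON.stringify({ type: 'message', content: message }));
      };
      return { sendMessage, messages: ws.messages };
    };

Configuration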
| Setting | Default | Description |
|---|---|---|
| Session timeout | 30 minutes | Idle session expiration |
| Max message length | 2000 characters | Maximum user message length |
| Response streaming | Enabled | Stream responses in real time |
| Action confirmation | Enabled | Require confirmation for mutating actions |
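
As a rough illustration of how a client might apply these defaults, the sketch below encodes the table as a typed object and adds two checks: message length validation and idle session expiry. The `ChatConfig` field names and both helper functions are assumptions; the defaults themselves come from the table above.

```typescript
interface ChatConfig {
  sessionTimeoutMs: number;
  maxMessageLength: number;
  responseStreaming: boolean;
  actionConfirmation: boolean;
}

// Defaults taken from the configuration table.
const DEFAULT_CHAT_CONFIG: ChatConfig = {
  sessionTimeoutMs: 30 * 60 * 1000, // 30 minutes
  maxMessageLength: 2000,
  responseStreaming: true,
  actionConfirmation: true,
};

// Return an error string for an invalid message, or null if it can be sent.
// Note: string length counts UTF-16 code units, which may differ from the
// character count the server enforces.
function validateMessage(
  message: string,
  config: ChatConfig = DEFAULT_CHAT_CONFIG,
): string | null {
  if (message.trim().length === 0) {
    return 'Message is empty';
  }
  if (message.length > config.maxMessageLength) {
    return `Message exceeds ${config.maxMessageLength} characters`;
  }
  return null;
}

// A session is treated as expired once it has been idle for the full timeout.
function isSessionExpired(
  lastActivityMs: number,
  nowMs: number,
  config: ChatConfig = DEFAULT_CHAT_CONFIG,
): boolean {
  return nowMs - lastActivityMs >= config.sessionTimeoutMs;
}
```

Validating on the client gives the operator immediate feedback, though the server would still need to enforce the same limits.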