Chat Interface

The Chat Interface page provides an AI-assisted conversational tool for platform operations. Operators can ask questions about system health, investigate issues, query metrics using natural language, and receive guided troubleshooting assistance powered by the Ops Agent Service.

Capabilities

Capability	Description	Example Query
Health queries	Check service status and health	"Is the AI service healthy?"
Metric queries	Query Prometheus metrics via natural language	"What is the p95 latency for the query engine?"
Log search	Search logs across services	"Show me errors from the AI service in the last hour"
Incident assistance	Get troubleshooting guidance	"The AI service is returning 503 errors"
Runbook execution	Execute operational runbooks	"Run the connection pool diagnostic"
Resource queries	Check resource utilization	"Which pods are using the most memory?"

Chat Architecture

Operator --> Chat Interface --> Ops Agent Service --> Observability API
                                     |                       |
                                     |              +--------+--------+
                                     |              |        |        |
                                     |          Prometheus  Loki    Tempo
                                     |
                                     +--> Kubernetes API
                                     |
                                     +--> AI Service (LLM)

Message Types

User Messages

Plain text natural language queries about platform operations.

Assistant Responses

Structured responses with embedded data:

Response Type	Content
Text	Natural language explanation
Metric chart	Inline time-series visualization
Log excerpt	Formatted log lines with highlighting
Table	Tabular data (pod list, service status)
Action suggestion	Clickable remediation actions
Runbook steps	Step-by-step troubleshooting guide
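
One way to model these response types on the client is a discriminated union, so each part can be rendered by switching on its `type`. The sketch below is illustrative only: the field names (`query`, `points`, `lines`, `label`, and so on) are assumptions, not the Ops Agent Service's documented wire format.

```typescript
// Hypothetical shapes for the six response types listed above.
// Field names are illustrative assumptions, not a documented schema.
type AssistantResponsePart =
  | { type: 'text'; text: string }
  | { type: 'metric_chart'; query: string; points: Array<[number, number]> }
  | { type: 'log_excerpt'; lines: string[]; highlight?: string }
  | { type: 'table'; columns: string[]; rows: string[][] }
  | { type: 'action_suggestion'; label: string; action: string; target: string }
  | { type: 'runbook_steps'; title: string; steps: string[] };

// Produce a one-line plain-text summary of a part, e.g. for a fallback view.
function summarizePart(part: AssistantResponsePart): string {
  switch (part.type) {
    case 'text':
      return part.text;
    case 'metric_chart':
      return `Chart of ${part.query} (${part.points.length} points)`;
    case 'log_excerpt':
      return `${part.lines.length} log line(s)`;
    case 'table':
      return `Table with ${part.rows.length} row(s) and ${part.columns.length} column(s)`;
    case 'action_suggestion':
      return `Suggested action: ${part.label}`;
    case 'runbook_steps':
      return `${part.title}: ${part.steps.length} step(s)`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is missed.
      const unreachable: never = part;
      return unreachable;
    }
  }
}
```

Because the union is closed, adding a seventh response type forces every `switch` over it to be updated before the code compiles.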

Conversational Context

The chat maintains session context for multi-turn investigations:

interface OpsSessionContext {
  session_id: string;
  services_discussed: string[];
  metrics_queried: string[];
  time_range: { from: string; to: string };
  active_incident?: string;
}

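This lets operators ask follow-up questions without restating the service, metric, or time range. In the exchange below, "those errors" in turn 2 refers to the AI service errors from turn 1: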

Turn	Message
1	"What is the AI service error rate?"
2	"Show me the logs for those errors"
3	"Is the database connection pool full?"
4	"Restart the affected pods"
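
As a rough sketch of how the session context might accumulate across turns like these, the code below merges what each turn mentions into `OpsSessionContext`. The `TurnUpdate` type and the `applyTurn` and `mostRecentService` helpers are hypothetical names introduced for illustration. How the Ops Agent Service actually tracks context is not specified here.

```typescript
interface OpsSessionContext {
  session_id: string;
  services_discussed: string[];
  metrics_queried: string[];
  time_range: { from: string; to: string };
  active_incident?: string;
}

// Hypothetical: what a single turn contributes to the context.
interface TurnUpdate {
  services?: string[];
  metrics?: string[];
  time_range?: { from: string; to: string };
}

// Append items that are not already present, preserving insertion order.
function mergeUnique(existing: string[], added: string[]): string[] {
  const result = existing.slice();
  for (const item of added) {
    if (result.indexOf(item) === -1) result.push(item);
  }
  return result;
}

// Return a new context with the turn's services, metrics, and time range folded in.
function applyTurn(ctx: OpsSessionContext, update: TurnUpdate): OpsSessionContext {
  return {
    ...ctx,
    services_discussed: mergeUnique(ctx.services_discussed, update.services ?? []),
    metrics_queried: mergeUnique(ctx.metrics_queried, update.metrics ?? []),
    time_range: update.time_range ?? ctx.time_range,
  };
}

// Resolve an implicit reference such as "those errors" to the service
// most recently added to the context (undefined if none has been discussed).
function mostRecentService(ctx: OpsSessionContext): string | undefined {
  return ctx.services_discussed[ctx.services_discussed.length - 1];
}
```

With this shape, turn 1 would add the AI service and its error-rate metric, and turn 2 could fall back to `mostRecentService` when the message names no service.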

Tool Execution

The Ops Agent has access to operational tools:

Tool	Action	Confirmation Required
`query_prometheus`	Execute a PromQL query	No
`search_logs`	Search Loki logs	No
`get_pod_status`	List pod status	No
`get_service_health`	Check service health	No
`describe_pod`	Get pod details	No
`restart_pod`	Restart a specific pod	Yes
`scale_deployment`	Change replica count	Yes
`run_runbook`	Execute a runbook	Yes
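
The table above can be mirrored in code as a lookup that gates mutating tools behind operator confirmation. This is a minimal sketch: the tool names and confirmation flags come from the table, while `gateToolCall` and its return values are hypothetical.

```typescript
type OpsToolName =
  | 'query_prometheus'
  | 'search_logs'
  | 'get_pod_status'
  | 'get_service_health'
  | 'describe_pod'
  | 'restart_pod'
  | 'scale_deployment'
  | 'run_runbook';

// Confirmation flags copied from the tool table: read-only tools are false.
const CONFIRMATION_REQUIRED: Record<OpsToolName, boolean> = {
  query_prometheus: false,
  search_logs: false,
  get_pod_status: false,
  get_service_health: false,
  describe_pod: false,
  restart_pod: true,
  scale_deployment: true,
  run_runbook: true,
};

type ToolGateResult = 'execute' | 'await_confirmation';

// Decide whether a tool call may run now or must wait for the operator.
function gateToolCall(tool: OpsToolName, operatorConfirmed: boolean): ToolGateResult {
  if (CONFIRMATION_REQUIRED[tool] && !operatorConfirmed) {
    return 'await_confirmation';
  }
  return 'execute';
}
```

Typing the lookup as `Record<OpsToolName, boolean>` means a newly added tool cannot be left without an explicit confirmation decision.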

Action Confirmation

Destructive or mutating actions require explicit operator confirmation:

interface ActionConfirmation {
  action: string;
  target: string;
  description: string;
  risk_level: 'low' | 'medium' | 'high';
  requires_approval: boolean;
}
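
For illustration, a confirmation for a pod restart might be rendered to the operator as shown below. The pod name, the `medium` risk level, and the reading of `requires_approval` as "the action waits for the operator before running" are assumptions, and `formatConfirmationPrompt` is a hypothetical helper.

```typescript
interface ActionConfirmation {
  action: string;
  target: string;
  description: string;
  risk_level: 'low' | 'medium' | 'high';
  requires_approval: boolean;
}

// Render a confirmation as a single prompt line for the operator.
function formatConfirmationPrompt(c: ActionConfirmation): string {
  const approval = c.requires_approval ? ' Approval is required before it runs.' : '';
  return `[${c.risk_level.toUpperCase()} RISK] ${c.action} on ${c.target}: ${c.description}.${approval}`;
}

// Example values are illustrative only.
const restartExample: ActionConfirmation = {
  action: 'restart_pod',
  target: 'ai-service-0', // hypothetical pod name
  description: 'Restart the pod to clear a stuck connection pool',
  risk_level: 'medium', // assumed risk level for a single-pod restart
  requires_approval: true,
};

const restartPrompt = formatConfirmationPrompt(restartExample);
```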

Suggested Queries

The chat interface offers contextual suggestions based on current platform state:

Context	Suggestions
Alert firing	"Investigate the current alert for [service]"
High latency	"What is causing latency in [service]?"
Pod restarts	"Why is [pod] restarting?"
General	"Show platform health summary"
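
The suggestion templates above could be filled in as sketched here. The templates are taken from the table. The `SuggestionContext` keys and the `buildSuggestion` helper are hypothetical, and how the platform detects each context is not covered.

```typescript
type SuggestionContext = 'alert_firing' | 'high_latency' | 'pod_restarts' | 'general';

// Templates copied from the table; [service] and [pod] are placeholders.
const SUGGESTION_TEMPLATES: Record<SuggestionContext, string> = {
  alert_firing: 'Investigate the current alert for [service]',
  high_latency: 'What is causing latency in [service]?',
  pod_restarts: 'Why is [pod] restarting?',
  general: 'Show platform health summary',
};

// Fill in the placeholders; a placeholder is left as-is when no name is given.
function buildSuggestion(
  context: SuggestionContext,
  names: { service?: string; pod?: string } = {},
): string {
  return SUGGESTION_TEMPLATES[context]
    .replace('[service]', names.service ?? '[service]')
    .replace('[pod]', names.pod ?? '[pod]');
}
```

For example, `buildSuggestion('high_latency', { service: 'query-engine' })` yields "What is causing latency in query-engine?".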

WebSocket Connection

The chat uses WebSocket for real-time streaming of assistant responses:

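// useWebSocket is assumed to be an application-level hook (its import is not shown)
// that exposes send() and the list of messages received so far.
const useOpsChat = (sessionId: string) => {
  // One socket per chat session.
  const ws = useWebSocket(`/ws/ops/chat/${sessionId}`);

  // Send a user message as a JSON text frame.
  const sendMessage = (message: string) => {
    ws.send(JSON.stringify({ type: 'message', content: message }));
  };

  return { sendMessage, messages: ws.messages };
};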
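
On the receiving side, streamed responses arrive in pieces and need to be assembled into complete messages. The sketch below shows one way to do that as a pure function. The server frame format (`chunk` and `done` frames keyed by `message_id`) is an assumption made for illustration. The hook above only defines the outgoing `message` frame.

```typescript
// Assumed server-to-client frames; not a documented protocol.
type StreamFrame =
  | { type: 'chunk'; message_id: string; content: string }
  | { type: 'done'; message_id: string };

interface StreamedMessage {
  id: string;
  content: string;
  complete: boolean;
}

// Apply one raw WebSocket text frame to the message list and return a new list.
function applyFrame(messages: StreamedMessage[], raw: string): StreamedMessage[] {
  const frame = JSON.parse(raw) as StreamFrame;
  const exists = messages.some((m) => m.id === frame.message_id);

  if (!exists) {
    // The first chunk starts a new message; a stray 'done' frame is ignored.
    return frame.type === 'chunk'
      ? messages.concat([{ id: frame.message_id, content: frame.content, complete: false }])
      : messages;
  }

  return messages.map((m) => {
    if (m.id !== frame.message_id) return m;
    return frame.type === 'chunk'
      ? { ...m, content: m.content + frame.content } // append streamed text
      : { ...m, complete: true }; // mark the message as finished
  });
}
```

Keeping the reducer free of socket and UI code makes it straightforward to unit test and to reuse from a React hook or any other client.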

Configuration

Setting	Default	Description
Session timeout	30 minutes	Idle session expiration
Max message length	2000 characters	Maximum user message length
Response streaming	Enabled	Stream responses in real-time
Action confirmation	Enabled	Require confirmation for mutating actions
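
A client could enforce these settings before sending a message, as in the following sketch. The default values come from the table. The `OpsChatConfig` field names and the two helper functions are hypothetical, and message length is measured with `String.length` (UTF-16 code units), which may differ from how the service counts characters.

```typescript
interface OpsChatConfig {
  sessionTimeoutMs: number;
  maxMessageLength: number;
  responseStreaming: boolean;
  actionConfirmation: boolean;
}

// Defaults taken from the configuration table.
const DEFAULT_OPS_CHAT_CONFIG: OpsChatConfig = {
  sessionTimeoutMs: 30 * 60 * 1000, // 30 minutes
  maxMessageLength: 2000,
  responseStreaming: true,
  actionConfirmation: true,
};

type MessageCheck = { ok: true } | { ok: false; reason: string };

// Reject empty or over-long messages before they are sent.
function checkMessage(
  message: string,
  config: OpsChatConfig = DEFAULT_OPS_CHAT_CONFIG,
): MessageCheck {
  if (message.trim().length === 0) {
    return { ok: false, reason: 'Message is empty' };
  }
  if (message.length > config.maxMessageLength) {
    return { ok: false, reason: `Message exceeds ${config.maxMessageLength} characters` };
  }
  return { ok: true };
}

// A session is expired once it has been idle for at least the timeout.
function isSessionExpired(
  lastActivityMs: number,
  nowMs: number,
  config: OpsChatConfig = DEFAULT_OPS_CHAT_CONFIG,
): boolean {
  return nowMs - lastActivityMs >= config.sessionTimeoutMs;
}
```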
