# Wicklee

> Sovereign GPU fleet monitor for local AI inference.
> One Rust binary per node, React dashboard at localhost:7700, fleet aggregation at wicklee.dev.
> WES (Wicklee Efficiency Score) = tok/s / (Watts × ThermalPenalty), the MPG for local AI.

## For LLMs: Start Here

If you are an AI agent or LLM reviewing Wicklee, read these files in order:
1. **This file** (llms.txt) — overview, API endpoints, key metrics
2. **[llms-full.txt](https://wicklee.dev/llms-full.txt)** — complete reference with JSON schemas, all endpoints with examples
3. **[docs.md](https://wicklee.dev/docs.md)** — full documentation: WES formula, inference states, 20 patterns, alerting, MCP tools, proxy, OTel
4. **[api.md](https://wicklee.dev/api.md)** — API reference with endpoint tables and response examples
5. **[metrics.md](https://wicklee.dev/metrics.md)** — dashboard metrics, visual indicators, formulas
6. **[openapi.json](https://wicklee.dev/openapi.json)** — OpenAPI 3.0 spec for structured integration
7. **[Interactive docs](https://wicklee.dev/docs)** — HTML documentation page

## Install

```bash
curl -fsSL https://wicklee.dev/install.sh | bash
```

Windows: `irm https://wicklee.dev/install.ps1 | iex`

## Localhost API (no auth, any tier)

- `GET /api/metrics` — SSE stream, 1 Hz telemetry
- `GET /ws` — WebSocket, 1 Hz telemetry (same payload as SSE, fallback transport)
- `GET /api/observations` — 18 server-side observation patterns (10-min DuckDB buffer)
- `GET /api/profile?minutes=60` — Inference Profiler: correlated TTFT/KV/queue/thermal/power timeline
- `GET /api/sla?window_min=60&target_ttft_ms=500` — Inference SLA Monitor: p50/p95/p99 for TTFT/E2E/TPOT, compliance vs target, per-model breakdown, recent violations
- `GET /api/v1/thermal-budget?node_id=X` — Thermal Budget Calculator (Pro+, cloud): sustainable_tps, push_threshold_tps, time_to_fair_min, fair_penalized_tps, plain-English backfire advice over a 7-day window
- `POST/GET/DELETE /api/v1/webhooks[/:id[/test]]` — Threshold Webhooks (Pro+): register HMAC-SHA256 signed push notifications for thermal_state_changed, inference_state_changed, wes_below, wes_above
- `GET /api/health` (localhost agent only) — diagnostic: agent_version, build_target, store_healthy bool, routes_available/unavailable lists. Returns store_failure_hint when DuckDB store init failed (which silently strips ~12 /api/* routes from the router).
- `GET /api/cost-by-model?hours=24` — Cost attribution per model (daily breakdown)
- `GET /api/explain-slowdown?ts_ms=N` — Root cause analysis for slow inference requests
- `GET /api/model-comparison?hours=168` — Side-by-side model efficiency (WES, tok/s, watts, cost)
- `GET /api/model-switches?hours=24` — Model switching cost (swap frequency, idle gap)
- `GET /api/model-candidates?search=llama&limit=20` — Model discovery: GGUF models from HuggingFace scored against local hardware
- `GET /api/history?node_id=WK-XXXX` — DuckDB metric history (1h buffer)
- `GET /api/traces` — Proxy inference traces
- `GET /api/events/history` — Node event log
- `GET /api/events/recent` — Recent in-memory events
- `GET /api/export?format=json|csv` — Data export
- `GET /api/tags` — Ollama model tags
- `GET /api/pair/status` — Pairing status
- `POST /mcp` — MCP (Model Context Protocol) JSON-RPC 2.0 endpoint
- `GET /.well-known/mcp.json` — MCP server manifest
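
As a rough illustration of consuming the 1 Hz telemetry stream, the sketch below parses Server-Sent Events and decodes each event body as JSON. It assumes the agent listens on the same `localhost:7700` address as the dashboard and that every event carries one JSON payload in its `data:` field. The helper names `iter_sse_json` and `stream_metrics` are hypothetical and not part of Wicklee.

```python
import json
import urllib.request
from typing import Iterable, Iterator, List


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event found in `lines`."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format drops a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Comment lines (": ...") and other fields (event:, id:) are ignored.


def stream_metrics(url: str = "http://localhost:7700/api/metrics") -> Iterator[dict]:
    """Open the SSE endpoint (assumed URL) and yield payloads as they arrive."""
    with urllib.request.urlopen(url) as resp:
        yield from iter_sse_json(chunk.decode("utf-8") for chunk in resp)
```

A caller would typically loop over `stream_metrics()` and read fields such as `inference_state` from each payload, treating any field as optional until checked against the published schema.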

### MCP Tools (via POST /mcp)

- `get_node_status` — Full hardware + inference metrics snapshot
- `get_inference_state` — Live/idle/busy state with sensor context
- `get_active_models` — Running models with context_length, parameter_count, quantization, tok/s
- `get_observations` — 18 patterns with routing_hint (steer_away/reduce_batch/monitor) per observation + node-level aggregate
- `get_metrics_history` — 1-hour rolling telemetry buffer
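
A minimal sketch of invoking one of these tools over JSON-RPC 2.0 follows. It uses the standard MCP `tools/call` method and assumes the local endpoint accepts a plain JSON POST at `http://localhost:7700/mcp` without a prior `initialize` handshake, which may not hold for every agent version. `build_tool_call` and `call_tool` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional


def build_tool_call(tool_name: str, arguments: Optional[dict] = None,
                    request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 envelope for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


def call_tool(tool_name: str, base_url: str = "http://localhost:7700") -> dict:
    """POST the tool call to the (assumed) local MCP endpoint and decode the reply."""
    body = json.dumps(build_tool_call(tool_name)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/mcp",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `call_tool("get_node_status")` would request the hardware and inference snapshot. The shape of the result is not shown here because it depends on the server.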

### MCP Resources

- `wicklee://node/metrics` — Live MetricsPayload JSON
- `wicklee://node/thermal` — Thermal state + WES penalty values

## Cloud MCP Server (Team+, Bearer auth)

`POST wicklee.dev/mcp` — fleet-aggregated MCP. 8 tools:
- `get_fleet_status` — all nodes with metrics + WES
- `get_fleet_wes` — compact WES scores
- `get_node_detail` — full metrics for a specific node
- `get_best_route` — routing recommendation by throughput/efficiency
- `get_fleet_insights` — fleet health summary + observation count
- `get_fleet_observations` — active/resolved observations
- `get_inference_profile` — correlated profiler timeline
- `explain_slowdown` — root cause analysis for slow requests

## Fleet API v1 (X-API-Key auth)

Base URL: `https://wicklee.dev/api/v1`

- `GET /api/v1/fleet` — All nodes with full MetricsPayload
- `GET /api/v1/fleet/wes` — WES scores ranked
- `GET /api/v1/nodes/{id}` — Single node deep dive
- `GET /api/v1/route/best` — Routing recommendation (latency or efficiency). `?model=qwen2.5:7b` for per-model routing
- `GET /api/v1/models/discover` — Model discovery: browse (`?search=`), simulate (`?simulate_hw=nvidia_4090`, Pro+), fleet match (`?fleet=true&model_id=X`, Team+)
- `GET /api/v1/insights/latest` — Fleet intelligence snapshot (Team+)
- `POST /api/v1/keys` — Create API key
- `GET /api/v1/keys` — List API keys
- `DELETE /api/v1/keys/{id}` — Revoke API key
- `GET /metrics` — Prometheus scrape endpoint (Team+, X-API-Key auth)
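
The sketch below shows one way to build an authenticated request for the routing endpoint. It assumes the listed paths are resolved against the host root `https://wicklee.dev` (they already include `/api/v1`) and that the key is sent in an `X-API-Key` header as stated above. The function names are hypothetical, and the response is returned as decoded JSON without assuming its schema.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

HOST = "https://wicklee.dev"  # assumed host root for the listed paths


def build_best_route_request(api_key: str,
                             model: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for /api/v1/route/best, optionally filtered by model."""
    url = f"{HOST}/api/v1/route/best"
    if model:
        # urlencode percent-escapes characters such as ':' in model tags.
        url += "?" + urllib.parse.urlencode({"model": model})
    return urllib.request.Request(url, headers={"X-API-Key": api_key})


def fetch_best_route(api_key: str, model: Optional[str] = None) -> dict:
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(build_best_route_request(api_key, model)) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes the URL and header logic easy to check without a live key.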

## Multi-Model Monitoring

When two or more models are loaded in Ollama, the SSE/WS payload includes an `active_models` array with per-model tok/s, WES, VRAM, average TTFT, average latency, request count, size, and quantization.
Per-model WES attributes power by proportional VRAM share: `model_tok_s / (total_watts * vram_share * thermal_penalty)`.
Per-model tok/s and latency require the proxy. Without the proxy, VRAM and model identity are still tracked via /api/ps.
The singular fields (`ollama_active_model`, `ollama_tokens_per_second`) report the most recently active model for backward compatibility.
- `GET /api/model-switches?hours=24` — model swap frequency and idle overhead
- `GET /api/v1/route/best?model=qwen2.5:7b` — per-model routing: filters to nodes with the target model loaded, uses per-model WES
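
The per-model WES formula above can be worked through with a small function. This is a sketch under the assumption that `vram_share` is the model's VRAM divided by the total VRAM held by all loaded models; the agent may define the share differently. `per_model_wes` and its parameter names are hypothetical.

```python
def per_model_wes(model_tok_s: float, total_watts: float,
                  model_vram_gb: float, total_model_vram_gb: float,
                  thermal_penalty: float = 1.0) -> float:
    """model_tok_s / (total_watts * vram_share * thermal_penalty)."""
    if total_model_vram_gb <= 0:
        return 0.0
    # Assumed definition: this model's fraction of all VRAM held by loaded models.
    vram_share = model_vram_gb / total_model_vram_gb
    attributed_watts = total_watts * vram_share * thermal_penalty
    if attributed_watts <= 0:
        return 0.0
    return model_tok_s / attributed_watts


# Two models share a 100 W node; this one holds 8 of 16 GB and runs at 25 tok/s.
# Attributed power is 100 * 0.5 * 1.0 = 50 W, so WES = 25 / 50 = 0.5.
example = per_model_wes(25.0, 100.0, 8.0, 16.0)
```

Under this attribution, a model that holds a larger share of VRAM is charged more of the node's power, so its WES drops unless its throughput rises in proportion.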

## Key Metrics

- `inference_state`: "live" | "idle-spd" | "busy" | "idle"
- `ollama_tokens_per_second`: tok/s from 20-token probe (~30s cadence)
- `apple_soc_power_w`: Combined CPU+GPU+ANE power (Apple Silicon)
- `nvidia_power_draw_w`: Board power (NVIDIA)
- `thermal_state`: "Normal" | "Fair" | "Serious" | "Critical"
- `penalty_avg`: Thermal penalty multiplier (1.0 = no penalty)
- `vllm_requests_waiting`: Queue depth (vLLM)
- `ollama_ttft_ms`: Time to first token (Ollama probe baseline)
- `vllm_avg_ttft_ms`: Time to first token (vLLM production histogram)
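
To connect these fields to the WES formula, here is a hedged sketch that derives a score from a single payload. It assumes power comes from whichever of `apple_soc_power_w` or `nvidia_power_draw_w` is present and that `penalty_avg` is the thermal penalty term. The agent's own calculation may smooth or combine values differently, and `wes_from_payload` is a hypothetical name.

```python
from typing import Optional


def wes_from_payload(payload: dict) -> Optional[float]:
    """Approximate WES = tok/s / (watts * penalty) from one metrics payload."""
    tok_s = payload.get("ollama_tokens_per_second")
    # Use whichever platform power field is populated (assumption).
    watts = payload.get("apple_soc_power_w") or payload.get("nvidia_power_draw_w")
    # Treat a missing penalty as 1.0, which the field list describes as "no penalty".
    penalty = payload.get("penalty_avg") or 1.0
    if not tok_s or not watts:
        return None  # not enough data to compute a score
    return tok_s / (watts * penalty)


sample = {
    "ollama_tokens_per_second": 42.0,
    "nvidia_power_draw_w": 210.0,
    "penalty_avg": 1.0,
}
score = wes_from_payload(sample)  # 42 / (210 * 1.0) = 0.2
```

Returning `None` instead of zero keeps "no data" distinguishable from a genuinely poor score.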

## 20 Observation Patterns + 5 Fleet Alerts

Agent-evaluated (18 patterns, 10-min DuckDB buffer, every 10s):
- Community (9): thermal_drain, phantom_load, wes_velocity_drop, memory_trajectory, power_jitter, swap_io_pressure, clock_drift, nvidia_thermal_redline, vram_overcommit
- Pro (9): power_gpu_decoupling, bandwidth_saturation, efficiency_drag, pcie_lane_degradation, vllm_kv_cache_saturation, ttft_regression, latency_spike, vllm_queue_saturation, bandwidth_ceiling_reached

Cloud-evaluated (2 patterns, Pro):
- fleet_load_imbalance — node WES > 20% below best healthy peer
- wes_long_term_drift — recent 24h avg ≥15% below 6-day baseline (gradual degradation)
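
The two cloud-evaluated thresholds can be expressed as simple relative-drop checks. The sketch below is an illustration of the stated rules only. The function names are hypothetical, and the real evaluator presumably adds conditions not described here, such as what counts as a healthy peer.

```python
def is_load_imbalanced(node_wes: float, best_peer_wes: float,
                       threshold: float = 0.20) -> bool:
    """True when the node's WES is more than `threshold` below the best healthy peer."""
    if best_peer_wes <= 0:
        return False
    return (best_peer_wes - node_wes) / best_peer_wes > threshold


def has_long_term_drift(recent_24h_avg: float, baseline_6d_avg: float,
                        threshold: float = 0.15) -> bool:
    """True when the recent 24h average is at least `threshold` below the 6-day baseline."""
    if baseline_6d_avg <= 0:
        return False
    return (baseline_6d_avg - recent_24h_avg) / baseline_6d_avg >= threshold
```

For instance, a node at WES 7.0 against a best peer at 10.0 sits 30% below and would be flagged as imbalanced, while a node at 9.0 (10% below) would not.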

Fleet alerts (5, cloud, all tiers): zombied_engine, thermal_redline, oom_warning,
wes_cliff, agent_version_mismatch

## Runtimes Supported

- Ollama (macOS, Linux, Windows)
- vLLM (Linux)
- llama.cpp / llama-box (macOS, Linux)

## Pricing

- Community: Free forever, 3 nodes, 24h history, 9 patterns, local MCP, Ollama proxy
- Pro: $29/mo, 10 nodes, 7-day history, 20 patterns, Slack+Email alerts, custom thresholds
- Team: $49/seat/mo, 25 nodes (+$2/node over 25), 90-day history, shared dashboards, PagerDuty, Cloud MCP, OTel+Prometheus
- Business: $499/mo, unlimited seats, 100 nodes, 365-day history, everything in Team + SSO/SAML + audit logging
- Enterprise: Contact sales, unlimited nodes, sovereign deployment, custom SLA