# Wicklee

> Sovereign GPU fleet monitor for local AI inference.
> One Rust binary per node, React dashboard at localhost:7700, fleet aggregation at wicklee.dev.
> WES (Wicklee Efficiency Score) = tok/s / (Watts x ThermalPenalty) — the MPG for local AI.

## For LLMs: Start Here

If you are an AI agent or LLM reviewing Wicklee, read these files in order:

1. **This file** (llms.txt) — overview, API endpoints, key metrics
2. **[llms-full.txt](https://wicklee.dev/llms-full.txt)** — complete reference with JSON schemas, all endpoints with examples
3. **[docs.md](https://wicklee.dev/docs.md)** — full documentation: WES formula, inference states, 20 patterns, alerting, MCP tools, proxy, OTel
4. **[api.md](https://wicklee.dev/api.md)** — API reference with endpoint tables and response examples
5. **[metrics.md](https://wicklee.dev/metrics.md)** — dashboard metrics, visual indicators, formulas
6. **[openapi.json](https://wicklee.dev/openapi.json)** — OpenAPI 3.0 spec for structured integration
7. **[Interactive docs](https://wicklee.dev/docs)** — HTML documentation page

## Install

```bash
curl -fsSL https://wicklee.dev/install.sh | bash
```

Windows: `irm https://wicklee.dev/install.ps1 | iex`

## Localhost API (no auth, any tier)

- `GET /api/metrics` — SSE stream, 1 Hz telemetry
- `GET /ws` — WebSocket, 1 Hz telemetry (same payload as SSE, fallback transport)
- `GET /api/observations` — 18 server-side observation patterns (10-min DuckDB buffer)
- `GET /api/profile?minutes=60` — Inference Profiler: correlated TTFT/KV/queue/thermal/power timeline
- `GET /api/sla?window_min=60&target_ttft_ms=500` — Inference SLA Monitor: p50/p95/p99 for TTFT/E2E/TPOT, compliance vs target, per-model breakdown, recent violations
- `GET /api/v1/thermal-budget?node_id=X` — Thermal Budget Calculator (Pro+, cloud): sustainable_tps, push_threshold_tps, time_to_fair_min, fair_penalized_tps, plain-English backfire advice over a 7-day window
- `POST/GET/DELETE /api/v1/webhooks[/:id[/test]]` — Threshold
Webhooks (Pro+): register HMAC-SHA256 signed push notifications for thermal_state_changed, inference_state_changed, wes_below, wes_above
- `GET /api/health` (localhost agent only) — diagnostic: agent_version, build_target, store_healthy bool, routes_available/unavailable lists. Returns store_failure_hint when DuckDB store init failed (which silently strips ~12 /api/* routes from the router).
- `GET /api/cost-by-model?hours=24` — Cost attribution per model (daily breakdown)
- `GET /api/explain-slowdown?ts_ms=N` — Root cause analysis for slow inference requests
- `GET /api/model-comparison?hours=168` — Side-by-side model efficiency (WES, tok/s, watts, cost)
- `GET /api/model-switches?hours=24` — Model switching cost (swap frequency, idle gap)
- `GET /api/model-candidates?search=llama&limit=20` — Model discovery: GGUF models from HuggingFace scored against local hardware
- `GET /api/history?node_id=WK-XXXX` — DuckDB metric history (1h buffer)
- `GET /api/traces` — Proxy inference traces
- `GET /api/events/history` — Node event log
- `GET /api/events/recent` — Recent in-memory events
- `GET /api/export?format=json|csv` — Data export
- `GET /api/tags` — Ollama model tags
- `GET /api/pair/status` — Pairing status
- `POST /mcp` — MCP (Model Context Protocol) JSON-RPC 2.0 endpoint
- `GET /.well-known/mcp.json` — MCP server manifest

### MCP Tools (via POST /mcp)

- `get_node_status` — Full hardware + inference metrics snapshot
- `get_inference_state` — Live/idle/busy state with sensor context
- `get_active_models` — Running models with context_length, parameter_count, quantization, tok/s
- `get_observations` — 18 patterns with routing_hint (steer_away/reduce_batch/monitor) per observation + node-level aggregate
- `get_metrics_history` — 1-hour rolling telemetry buffer

### MCP Resources

- `wicklee://node/metrics` — Live MetricsPayload JSON
- `wicklee://node/thermal` — Thermal state + WES penalty values

## Cloud MCP Server (Team+, Bearer auth)

`POST wicklee.dev/mcp` —
fleet-aggregated MCP. 8 tools:

- `get_fleet_status` — all nodes with metrics + WES
- `get_fleet_wes` — compact WES scores
- `get_node_detail` — full metrics for a specific node
- `get_best_route` — routing recommendation by throughput/efficiency
- `get_fleet_insights` — fleet health summary + observation count
- `get_fleet_observations` — active/resolved observations
- `get_inference_profile` — correlated profiler timeline
- `explain_slowdown` — root cause analysis for slow requests

## Fleet API v1 (X-API-Key auth)

Base URL: `https://wicklee.dev/api/v1`

- `GET /api/v1/fleet` — All nodes with full MetricsPayload
- `GET /api/v1/fleet/wes` — WES scores ranked
- `GET /api/v1/nodes/{id}` — Single node deep dive
- `GET /api/v1/route/best` — Routing recommendation (latency or efficiency). `?model=qwen2.5:7b` for per-model routing
- `GET /api/v1/models/discover` — Model discovery: browse (`?search=`), simulate (`?simulate_hw=nvidia_4090`, Pro+), fleet match (`?fleet=true&model_id=X`, Team+)
- `GET /api/v1/insights/latest` — Fleet intelligence snapshot (Team+)
- `POST /api/v1/keys` — Create API key
- `GET /api/v1/keys` — List API keys
- `DELETE /api/v1/keys/{id}` — Revoke API key
- `GET /metrics` — Prometheus scrape endpoint (Team+, X-API-Key auth)

## Multi-Model Monitoring

When two or more models are loaded in Ollama, the SSE/WS payload includes an `active_models` array with per-model tok/s, WES, VRAM, avg TTFT, avg latency, request count, size, and quantization. Per-model WES attributes power by proportional VRAM share: `model_tok_s / (total_watts * vram_share * thermal_penalty)`. Per-model tok/s and latency require the proxy; without it, VRAM and model identity are still tracked via `/api/ps`. The singular fields (`ollama_active_model`, `ollama_tokens_per_second`) report the most recently active model for backward compatibility.
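
As a rough illustration of the per-model WES formula above, the sketch below computes VRAM shares and the resulting score for two models. It is not Wicklee's implementation: the function names `vram_shares` and `per_model_wes` are hypothetical, and it assumes `vram_share` means a model's VRAM divided by the total VRAM of all loaded models, which the text only describes as "proportional VRAM share".

```python
from typing import Dict, Optional


def vram_shares(vram_gb_by_model: Dict[str, float]) -> Dict[str, float]:
    """Assumed definition: each model's VRAM as a fraction of all loaded models' VRAM."""
    total = sum(vram_gb_by_model.values())
    if total <= 0:
        return {name: 0.0 for name in vram_gb_by_model}
    return {name: gb / total for name, gb in vram_gb_by_model.items()}


def per_model_wes(
    model_tok_s: float,
    total_watts: float,
    vram_share: float,
    thermal_penalty: float = 1.0,  # 1.0 = no penalty, per the Key Metrics section
) -> Optional[float]:
    """model_tok_s / (total_watts * vram_share * thermal_penalty)."""
    denominator = total_watts * vram_share * thermal_penalty
    if denominator <= 0:
        return None  # no power attributed to this model, score undefined
    return model_tok_s / denominator


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    vram = {"model-a": 6.0, "model-b": 2.0}
    tok_s = {"model-a": 30.0, "model-b": 10.0}
    shares = vram_shares(vram)
    for name in vram:
        wes = per_model_wes(tok_s[name], total_watts=40.0, vram_share=shares[name])
        print(name, wes)
```

With these sample numbers both models are attributed power in proportion to their VRAM, so a model that holds a quarter of the VRAM is charged a quarter of the node's watts.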
- `GET /api/model-switches?hours=24` — model swap frequency and idle overhead
- `GET /api/v1/route/best?model=qwen2.5:7b` — per-model routing: filters to nodes with the target model loaded, uses per-model WES

## Key Metrics

- `inference_state`: "live" | "idle-spd" | "busy" | "idle"
- `ollama_tokens_per_second`: tok/s from 20-token probe (~30s cadence)
- `apple_soc_power_w`: Combined CPU+GPU+ANE power (Apple Silicon)
- `nvidia_power_draw_w`: Board power (NVIDIA)
- `thermal_state`: "Normal" | "Fair" | "Serious" | "Critical"
- `penalty_avg`: Thermal penalty multiplier (1.0 = no penalty)
- `vllm_requests_waiting`: Queue depth (vLLM)
- `ollama_ttft_ms`: Time to first token (Ollama probe baseline)
- `vllm_avg_ttft_ms`: Time to first token (vLLM production histogram)

## 20 Observation Patterns + 5 Fleet Alerts

Agent-evaluated (18 patterns, 10-min DuckDB buffer, every 10s):

Community (9): thermal_drain, phantom_load, wes_velocity_drop, memory_trajectory, power_jitter, swap_io_pressure, clock_drift, nvidia_thermal_redline, vram_overcommit

Pro (9): power_gpu_decoupling, bandwidth_saturation, efficiency_drag, pcie_lane_degradation, vllm_kv_cache_saturation, ttft_regression, latency_spike, vllm_queue_saturation, bandwidth_ceiling_reached

Cloud-evaluated (2 patterns, Pro):

- fleet_load_imbalance — node WES more than 20% below best healthy peer
- wes_long_term_drift — recent 24h avg ≥15% below 6-day baseline (gradual degradation)

Fleet alerts (5, cloud, all tiers): zombied_engine, thermal_redline, oom_warning, wes_cliff, agent_version_mismatch

## Runtimes Supported

- Ollama (macOS, Linux, Windows)
- vLLM (Linux)
- llama.cpp / llama-box (macOS, Linux)

## Pricing

- Community: Free forever, 3 nodes, 24h history, 9 patterns, local MCP, Ollama proxy
- Pro: $29/mo, 10 nodes, 7-day history, 20 patterns, Slack+Email alerts, custom thresholds
- Team: $49/seat/mo, 25 nodes (+$2/node over 25), 90-day history, shared dashboards, PagerDuty,
Cloud MCP, OTel+Prometheus
- Business: $499/mo, unlimited seats, 100 nodes, 365-day history, everything in Team + SSO/SAML + audit logging
- Enterprise: Contact sales, unlimited nodes, sovereign deployment, custom SLA