All Systems Operational

Last updated: Just now

Global Uptime (90 Days)
99.998%

Inference & APIs

Global REST API Routing
100.00%
90 days ago100% uptimeToday
LLM Inference Clusters (H100)
99.98%
90 days ago99.98% uptimeToday

Compute & Training

Fine-Tuning Jobs Manager
99.95%
90 days ago99.95% uptimeToday

Past Incidents

May 14, 2026
Elevated Latency in EU-West-1
Resolved - We observed a brief spike in Time To First Token (TTFT) for Llama-3-70B deployments in our Frankfurt region due to a Top-of-Rack switch failure. Redundant RDMA networks took over within 45 seconds. No API errors were returned, but p99 latency spiked to 2.4s for approximately 3 minutes.
April 02, 2026
Fine-Tuning Queue Delay
Resolved - A large influx of multi-node training jobs caused a scheduling backlog in our US-East-1 region. New jobs remained in the "Pending" state for up to 15 minutes instead of the standard <1 minute. We dynamically provisioned an additional 64 A100 nodes to clear the queue. Inference was not affected.