May 14, 2026
Elevated Latency in EU-West-1
Resolved - A Top-of-Rack switch failure caused a brief spike in Time To First Token (TTFT) for Llama-3-70B deployments in our Frankfurt region. Redundant RDMA networks took over within 45 seconds. No API errors were returned, but p99 latency rose to 2.4 s for approximately 3 minutes.
April 02, 2026
Fine-Tuning Queue Delay
Resolved - A large influx of multi-node training jobs caused a scheduling backlog in our US-East-1 region. New jobs remained in the "Pending" state for up to 15 minutes, compared with the usual wait of under 1 minute. We dynamically provisioned an additional 64 A100 nodes to clear the queue. Inference was not affected.