Major malfunction of one of the nodes

Resolved

Major outage

Started about 1 year ago

Lasted 4 days

Affected

Frida

Compute

Updates

Resolved
May 10, 2025 at 10:15:46
GPU0 on node ixh has been replaced and the node is back in production. Please benchmark your runs against earlier ones and report any discrepancies.

Thank you for your patience.
Identified
May 07, 2025 at 12:46:06
Node ixh is down due to overheating of GPU0. We are working on a resolution with support.
Thank you for your patience.
Monitoring
May 07, 2025 at 10:56:40
We are reinstating the node and will monitor its status.
Investigating
May 06, 2025 at 09:30:00
We are currently investigating this incident.
