rdc@fri - Major malfunction of one of the nodes – Incident details

Major malfunction of one of the nodes

Resolved
Major outage
Started 7 months ago
Lasted 4 days

Affected

Frida

Major outage from 9:30 AM to 10:56 AM, Degraded performance from 10:56 AM to 12:46 PM, Operational from 12:46 PM to 10:15 AM

Compute

Major outage from 9:30 AM to 10:56 AM, Degraded performance from 10:56 AM to 12:46 PM, Operational from 12:46 PM to 10:15 AM

Updates
  • Resolved
    GPU0 on node ixh has been successfully replaced and the node is back in production. Please benchmark your runs against earlier ones and report any discrepancies.

    Thank you for your patience.

  • Identified
    Node ixh is down due to overheating of GPU0. We are working on a resolution with support.

    Thank you for your patience.

  • Monitoring
    We are reinstating the node and will monitor its status.

  • Investigating
    We are currently investigating this incident.