rdc@fri - RDC cooling malfunction – Incident details

Storage: Experiencing degraded performance

RDC cooling malfunction

Resolved
Major outage
Started 6 months ago, lasted 5 months

Affected services

Frida

Partial outage from 1:43 PM to 3:13 PM, Operational from 1:43 PM to 3:13 PM, Major outage from 3:13 PM to 2:30 PM, Operational from 2:30 PM to 2:37 PM, Under maintenance from 2:37 PM to 4:52 PM, Operational from 2:37 PM to 4:52 PM, Degraded performance from 4:52 PM to 7:23 AM, Operational from 4:52 PM to 7:23 AM, Major outage from 7:23 AM to 9:05 PM, Operational from 7:23 AM to 9:05 PM

Login

Operational from 1:43 PM to 3:13 PM, Major outage from 3:13 PM to 2:30 PM, Operational from 2:30 PM to 9:05 PM

Storage

Operational from 1:43 PM to 3:13 PM, Major outage from 3:13 PM to 2:30 PM, Operational from 2:30 PM to 9:05 PM

Compute

Partial outage from 1:43 PM to 3:13 PM, Major outage from 3:13 PM to 2:30 PM, Operational from 2:30 PM to 2:37 PM, Under maintenance from 2:37 PM to 4:52 PM, Degraded performance from 4:52 PM to 7:23 AM, Major outage from 7:23 AM to 9:05 PM, Operational from 12:29 PM to 9:05 PM

Updates
  • Resolved

    FRIDA was successfully relocated to a new facility.

  • Monitoring

    The evaporator of one of the cooling units has been successfully repaired; the remaining repairs will be performed at a later time. We are bringing the cluster back online and will continue to monitor the cooling behaviour.

    We appreciate your patience.

  • Identified

    Due to a cooling malfunction, we need to stop all operations. We are actively working on a fix.

    We appreciate your patience.

  • Update

    The maintenance of the RDC cooling has been partially completed. Some procedures are more extensive and will need to be performed in the upcoming week. We will keep the operation of FRIDA under close monitoring and try to keep disruptive actions to a minimum.

    We appreciate your patience.

  • Update

    Due to the continuing cooling malfunction, we are forced to cancel all jobs and shut down the cluster. More extensive maintenance of the RDC cooling, which will hopefully resolve the issues, is planned for tomorrow morning.

    We appreciate your patience and will keep you posted on the progress.

  • Monitoring

    The RDC cooling has returned to normal values, and the cluster has been put back into operation. We will keep monitoring the cooling while we work on resolving the issue. Note that in case of high external temperatures we may be forced to temporarily pause the cluster or cancel jobs to prevent further outages.

    We appreciate your patience.

  • Identified

    We are continuing to work on a fix for this incident.

  • Investigating

    The RDC cooling is experiencing a malfunction. As a preventative measure we are forced to shut down the cluster; all running jobs were cancelled. The cluster has been temporarily put into maintenance, with a planned release after 22:00 or once the datacenter's cooling resumes normal operation and temperatures reach acceptable levels. Due to high external temperatures and recurring cooling malfunctions, cancellations of the most demanding jobs may become more frequent. We therefore advise running demanding jobs overnight and checkpointing regularly and frequently. Before and while running jobs, please consult the dashboard “RDC heat load”.

    We appreciate your patience.
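
    The checkpointing advice above can be sketched as follows. This is a minimal illustration only, not an official FRIDA recipe: the file name, the `run` loop, and the `checkpoint_every` parameter are hypothetical, and the "work" is a placeholder sum. It shows a job that saves its state periodically and resumes from the last saved state if it is cancelled and restarted.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state atomically so a cancellation mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(num_steps, checkpoint_every=10, path=CHECKPOINT_PATH):
    """Resume from the last checkpoint and save again every few steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        state["total"] += step  # placeholder for the real computation
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

    With this pattern, a job cancelled because of a cooling incident loses at most `checkpoint_every` steps of work when it is resubmitted. A suitable interval depends on how expensive each step and each save is.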