Docs - Operational
Frida
Login - Operational
Storage - Degraded performance
Compute - Operational
We are planning an upgrade of the storage software. The process should go largely unnoticed, but in case of issues we may be forced to cancel active jobs. We appreciate your patience.
FRIDA was successfully relocated to a new facility.
The evaporator of one of the cooling units has been successfully repaired; some repairs will need to be performed at a later time. We're bringing the cluster back online, but will continue monitoring the cooling behaviour. We appreciate your patience.
Due to a cooling malfunction we need to stop all operations. We're actively working on a fix. We appreciate your patience.
The maintenance of the RDC cooling has been partially completed. Some procedures are more extensive and will need to be performed in the upcoming week. We'll keep the operation of FRIDA under close monitoring and try to keep disruptive actions to a minimum. We appreciate your patience.
Due to continued cooling malfunctions we're forced to cancel all jobs and shut down the cluster. A larger maintenance of the RDC cooling, which will hopefully resolve the issues, is planned for tomorrow morning. We appreciate your patience and we'll keep you posted on the progress.
The RDC cooling has returned to normal values, and the cluster has been put back into operation. We will keep monitoring the cooling while we work on resolving the underlying issue. Note that in case of high external temperatures we may be forced to temporarily pause the cluster or cancel jobs to prevent further outages. We appreciate your patience.
The RDC cooling is experiencing a malfunction. As a preventative measure we're forced to shut down the cluster; all running jobs were canceled. The cluster has been temporarily put into maintenance, with a planned release after 22:00 or once the datacenter's cooling resumes normal operation and temperatures reach acceptable levels. Due to high external temperatures and regular cooling malfunctions, cancelations of the most demanding jobs may become more frequent. We therefore advise running demanding jobs overnight and performing regular and frequent checkpoints. Before and while running jobs, please consult the dashboard “RDC heat load”. We appreciate your patience.
We implemented a fix and we're bringing the cluster back into operation. We'll be monitoring the cooling. Please make sure to perform regular checkpoints to avoid data loss in case of additional shutdowns. We appreciate your patience.
The RDC cooling is experiencing a malfunction. As a preventative measure we're forced to shut down the cluster; all jobs will be canceled. While working to fix the issue we'll also perform the scheduled FRIDA maintenance to keep the downtime to a minimum. We appreciate your patience.
GPU0 on node ixh has been successfully replaced and the node is back in production. Please benchmark your runs against earlier ones and report any discrepancies. Thank you for your patience.
Node ixh is down due to overheating of GPU 0. We are working with support to find a solution.
Thank you for your patience.
We are reinstating the node and we'll monitor the status.
Maintenance has completed successfully. We performed the full set of updates and upgrades on all cluster nodes. The storage cluster has also undergone the full set of updates and upgrades. Most of the cluster is operational and ready to accept jobs. One node is currently kept in maintenance mode due to hardware issues that require physical maintenance; it should resume operation in the next few days.
Cluster cannot be accessed at the moment. This incident was created by an automated monitoring service.
The RDC cooling was partially fixed and we're bringing the cluster back to production. We'll be monitoring the status. The remaining RDC cooling issues will be resolved during the week. We appreciate your patience.
The RDC cooling is experiencing a malfunction. As a preventative measure we're forced to shut down the cluster; all jobs will be canceled. We appreciate your patience.
The RDC cooling was fixed. We're currently bringing the cluster back online, performing the scheduled FRIDA maintenance, and monitoring the status.
Apr 2025 to Jun 2025