Notice history

Jul 2025

Resolved
July 30, 2025 at 10:40:43
The GPUs on node ixh have been successfully replaced and the node is back in production. Please benchmark your runs against earlier ones and report any discrepancies.

Thank you for your patience.
Update
July 24, 2025 at 16:33:20
Node ixh remains down: during the replacement of GPU6, issues with GPU0 were detected. We're coordinating a resolution with support.

Thank you for your patience.
Identified
July 17, 2025 at 12:42:58
Node ixh is down due to overheating of GPU6. We are working on a resolution with support.
Thank you for your patience.
Investigating
July 17, 2025 at 12:21:17
We are currently investigating this incident.

Resolved
July 03, 2025 at 10:31:10
This incident has been resolved. Access to all affected files should be restored.
Identified
July 03, 2025 at 08:10:00
We are currently experiencing an outage of the external Ceph tiered storage; your /shared/home folders may be affected. We are investigating the issue and trying to resolve it as swiftly as possible.

We apologise for the inconvenience.

Jun 2025

Storage maintenance

Completed
June 18, 2025 at 13:00:00
Maintenance has completed successfully.
In progress
June 18, 2025 at 07:00:01
Maintenance is now in progress.
Planned
June 18, 2025 at 07:00:00
We are planning an upgrade of the storage software. The process should go largely unnoticed, but in case of issues we may be forced to cancel active jobs.

We appreciate your patience.

Resolved
November 24, 2025 at 21:05:53
FRIDA was successfully relocated to a new facility.
Monitoring
August 11, 2025 at 12:29:57
The evaporator of one of the cooling units has been successfully repaired; some repairs will need to be performed at a later time. We're bringing the cluster back online, but will continue monitoring the cooling behaviour.

We appreciate your patience.
Identified
August 11, 2025 at 07:23:04
Due to a cooling malfunction we need to stop all operations. We're actively working on a fix.

We appreciate your patience.
Update
June 20, 2025 at 16:52:20
The maintenance of the RDC cooling has been partially completed. Some procedures are more extensive and will need to be performed in the upcoming week. We'll keep the operation of FRIDA under close monitoring and try to keep disruptive actions to a minimum.

We appreciate your patience.
Update
June 19, 2025 at 14:37:56
Due to continuing cooling malfunctions we're forced to cancel all jobs and shut down the cluster. A larger maintenance of the RDC cooling, which will hopefully resolve the issues, is planned for tomorrow morning.

We appreciate your patience and we'll keep you posted on the progress.
Monitoring
June 16, 2025 at 14:30:53
The RDC cooling has returned to normal values, and the cluster has been put back into operation. We will keep monitoring the cooling while we work on resolving the underlying issue. Note that in case of high external temperatures we may be forced to temporarily pause the cluster or cancel jobs to prevent further outages.

We appreciate your patience.
Identified
June 15, 2025 at 15:13:43
We are continuing to work on a fix for this incident.
Investigating
June 15, 2025 at 13:43:16
The RDC cooling is experiencing a malfunction. As a preventative measure we're forced to shut down the cluster; all running jobs were canceled. The cluster has been temporarily put into maintenance, with a planned release after 22:00 or once the datacenter's cooling resumes normal operation and temperatures reach acceptable levels. Due to high external temperatures and regular cooling malfunctions, cancelations of the most demanding jobs may become more frequent. We therefore advise running demanding jobs overnight and checkpointing regularly and frequently. Before and while running jobs, please consult the dashboard “RDC heat load”.

We appreciate your patience.

Resolved
June 11, 2025 at 21:35:47
We implemented a fix and are bringing the cluster back into operation. We'll be monitoring the cooling. Please make sure to perform regular checkpoints to avoid data loss in case of additional shutdowns.

We appreciate your patience.
Identified
June 10, 2025 at 13:00:00
The RDC cooling is experiencing a malfunction. As a preventative measure we're forced to shut down the cluster; all jobs will be canceled. While working to fix the issue we'll also perform the scheduled FRIDA maintenance to keep the downtime to a minimum.

We appreciate your patience.

May 2025

Resolved
May 10, 2025 at 10:15:46
GPU0 on node ixh has been successfully replaced and the node is back in production. Please benchmark your runs against earlier ones and report any discrepancies.

Thank you for your patience.
Identified
May 07, 2025 at 12:46:06
Node ixh is down due to overheating of GPU0. We are working on a resolution with support.
Thank you for your patience.
Monitoring
May 07, 2025 at 10:56:40
We are reinstating the node and we'll monitor the status.
Investigating
May 06, 2025 at 09:30:00
We are currently investigating this incident.

May 2025 to Jul 2025

rdc@fri - Notice history

All systems operational
