In the last few days, rumors of terrible, CPU-level security vulnerabilities have been appearing in the tech news. Last night the embargo on details was broken, and it's quite a mess. The BCG will need to patch all machines in the department, probably more than once, to address the problems.
Over the next week, please:
- Log out if you will be away from a machine for more than a few hours (including remote sessions). This lets us patch a machine when we see no one is logged into it.
- Please avoid long-running compute jobs, whether run directly or through Condor. This will minimize the work lost when we reboot a machine.
One of the two vulnerabilities cannot be fully fixed short of replacing the CPU. The patches for it are workarounds that try to minimize the risk. These patches do degrade performance somewhat, ranging from nearly 20% for certain kinds of database tasks to more modest 3-5% hits for purely computational work. What the hit will be for average daily workloads is not yet clear.
Patches are already available for all three of our platforms: Windows, MacOS, and Linux. We have already begun applying them to free machines. As firmware updates become available, BCG staff will need to visit people's desktop machines and spend time with laptops.
R version 3.4.1 (with all packages current) is available as the command R34. Please give this version a test and let us know if there are any problems with it. We will make it the new default R (/s/bin/R) in a few weeks.
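While both versions coexist, a quick way to confirm which binary you are picking up is to check where each command resolves on your PATH (a sketch; R34 is the command name above, and paths will vary by machine):

```shell
# Show where `R` and `R34` resolve on this machine's PATH.
# `command -v` prints the full path of the first match, or nothing
# if the command is absent (the fallback message covers that case).
for cmd in R R34; do
  command -v "$cmd" || echo "$cmd: not on PATH here"
done
```

Once /s/bin/R is switched over, both commands should report the same version.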
Two new servers have been added to the open compute pool. Nebula-1 replaces a host that died some time ago, and nebula-2 replaces a very old machine.
If you were using nebula-2 to compile certain libraries, note that it has been renamed nebula-0. It will be taken out of service once the last few SL6 machines have been upgraded to SL7.
To address an extremely dangerous and widespread vulnerability, we will be patching all the Linux machines in the next few days.
This patch requires a reboot.
Condor users will see a slightly higher rate of job restarts as we patch and reboot the Condor-only hosts. We will try to schedule patch times with people for their desktop machines, but we don't want to spend more than a few days getting every computer patched.
After a switch to a new firewall for the Med School on April 10th, we noticed that idle ssh sessions were being killed after very short times. This morning, the SMPH Network Group made a change to address that timeout, but now we are seeing ssh negotiation fail to complete for new sessions from off campus and from VPNs. Within the Biostat network, ssh is fine.
The SMPH Network Group is aware of this and debugging the issue now. We'll update this status page when we know more.
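In the meantime, client-side keepalives may help idle sessions survive aggressive firewall timeouts. This is a sketch using standard OpenSSH client settings, not an official workaround from the Network Group, and the interval is illustrative:

```
# ~/.ssh/config on your client machine.
# Send an application-level keepalive every 60 seconds so the
# firewall sees traffic on otherwise-idle sessions; give up
# after 3 unanswered probes.
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 3
```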
The server is back up and answering.
We continue to investigate the cause of this problem. I apologize for the disruption and appreciate your patience.
Access to the main file server is intermittent. We're working on it now.
10:17pm Update: File system services to user home directories and project directories are now working again.
There will need to be a scheduled outage (perhaps more than one) to deal with additional clean-up and adjustments. The system might be a bit slower than usual for a few days.
Most machines have recovered on their own from the outage, but a few may be so confused they need a reboot. We'll be checking hosts, but please let us know if you run into any stuck machines.
There will be an outage for hardware maintenance on both the user home directory file server (/ua/, /z/Proj/) and the computational file server (/z/Comp/). The outage will last from 9pm to 11pm on Monday, January 16th.
You should not run Condor jobs during this outage.
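If you have jobs queued, the standard HTCondor client tools can check and clear them before the window. A sketch (the guard just avoids errors on machines without the Condor tools installed):

```shell
# List this user's jobs, and optionally remove them ahead of the
# outage. condor_q and condor_rm are standard HTCondor commands.
if command -v condor_q >/dev/null 2>&1; then
  condor_q "$USER"      # what do I have queued or running?
  # condor_rm "$USER"   # uncomment to remove all of your jobs
else
  echo "HTCondor client tools are not installed on this host"
fi
```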
The main home directory file server will be down for two hours Tuesday evening to install urgent software updates.
Some project directories are on the same file server, and we recommend against scheduling Condor jobs through the outage window.
Update, Sep 14: all upgrades were successfully installed.