BCG Status Updates

July 18 2018: Directory Difficulties

Today one of our file servers had a drive fail. The normal failover process worked, but unfortunately took long enough to confuse a bunch of machines.

The usual symptom is that directories (home directories, project directories) will not attach. Often your login session will just hang.

Unfortunately, this takes a hard reboot to fix reliably. We are checking a bunch of the open login and compute servers (Wednesday afternoon), and rebooting as we can.

Jan 4 2018: Hardware Vulnerability Patching Schedule

In the last few days rumors of terrible, CPU-level security vulnerabilities have been appearing in the tech news. Last night the embargo on details was broken, and it's quite a mess. The BCG will need to patch all machines in the department, probably more than once, to address the problems.

Over the next week, please:

  1. Log out if you are away from a machine more than a few hours (including remotely). This lets us patch machines when we see no one is logged into them.
  2. Avoid long-running compute jobs, either directly or through condor. This will minimize work lost when we reboot a machine.

One of the two vulnerabilities cannot be fully solved short of replacing the CPU. The patches for that are work-arounds which try to minimize the risk of the vulnerability. These patches do degrade the performance somewhat, from nearly 20% for certain kinds of database tasks, to more modest 3-5% hits for purely computational work. What the hit will be like for average, daily workloads is not yet clear.

Patches are already available for all three of our platforms: Windows, MacOS, and Linux. We have already begun to apply patches on free machines. As firmware updates become available BCG staff will need to visit people's desktop machines and spend time with laptops.

There are two different exploits: Meltdown and Spectre. Meltdown can be patched. Spectre is going to be harder to fix.

These vulnerabilities can be exploited by any software running on your computer. That includes the Javascript running in your web browser, which makes remote exploitation trivial. We are not sure if these are being used in the wild yet. We can expect that they will be soon.

We strongly recommend everyone update their personal machines (desktops, laptops, mobile) as well. Be aware that some Antivirus software on Windows has been blocking the Windows patches. [ZDNet]

If you are using Microsoft Windows Defender, Symantec Endpoint Protection, Kaspersky, ESET, AVAST, or F-Secure SAFE, this is not a problem.

However, McAfee Endpoint Protection, Trend Micro, Sophos Anti-Virus and Central, Cyren F-PROT, EMSI Anti-Malware, Bitdefender, Carbon Black, Cylance PROTECT, CrowdStrike Falcon, and Webroot do have this problem until they release a patch.

For more details, see this table (Google Docs).

If you want to know whether your Windows 10 machine has the Microsoft patch, check PC Settings > Update and Security > click on Update history, and look for KB4056892. If it is not there, let it install updates.
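For those who prefer a script to clicking through Settings, the same check can be sketched in Python. This is only an illustration, not a BCG-supported tool: it assumes PowerShell is on the PATH and relies on the standard Get-HotFix cmdlet, which may not list every kind of update on every system. The helper names (has_hotfix, list_installed_hotfixes) are made up for this example.

```python
import platform
import subprocess

TARGET_KB = "KB4056892"  # KB number taken from the notice above


def has_hotfix(installed_ids, kb=TARGET_KB):
    """Return True if kb appears in installed_ids (case-insensitive)."""
    wanted = kb.strip().upper()
    return any(item.strip().upper() == wanted for item in installed_ids)


def list_installed_hotfixes():
    """Ask PowerShell's Get-HotFix cmdlet for installed update IDs (Windows only)."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "Get-HotFix | Select-Object -ExpandProperty HotFixID"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    if platform.system() == "Windows":
        found = has_hotfix(list_installed_hotfixes())
        print(f"{TARGET_KB} installed: {found}")
    else:
        print("This check only applies to Windows machines.")
```

If the script reports that the update is missing, the Update history page described above remains the authoritative place to look.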

Updates
  • Jan 4 2018. There are already proof-of-concept Javascript attacks. Browser vendors are releasing patches, so be sure to update the browsers on your personal devices, too. Chrome, IE11 and Firefox all have patches available.

Nov 30 2017 - R version 3.4.1 available for testing

R version 3.4.1 (with all packages current) is available as the command R34. Please give this version a test and let us know if there are any problems with it. We will make this the new default R (/s/bin/R) in a few weeks.

Power Outage at CSC - 11/29/17

On 11/29/17 between approximately 4:00AM and 6:00AM, all department desktops located in the CSC were impacted by a brief power outage due to maintenance by hospital staff. This means all desktops were restarted, and any unsaved work would have been lost.

We have addressed the majority of desktops and issues caused by this outage; however, if you see any abnormal behavior, please contact the BCG.

Network Outage - 10/13/17

At approximately 11:30AM on 10/13, we were made aware of network-related issues affecting the BMI department file and compute servers. The issue was the result of failed network updates made by UW DoIT, and has since been resolved (by ~1:00PM).

This affected all Linux users, and network shared drives for Mac and Windows users. If you continue to have issues after 1:00PM, please try restarting your desktop. If this does not resolve it, please contact BCG support.

Power Outage at CSC Saturday Sept 30 3AM-5AM

There is going to be a power outage 3AM - 5AM in the CSC which will affect all BMI offices located there. It's asking for trouble to let a computer be rudely cut off from power without a chance to politely shut down, so before you leave for the weekend, please shut down your desktop computers.

The CSC server room is on backup generator power, so the servers should be unaffected.

One of us will be in on Saturday to check on your desktops and turn them back on if necessary. If you have any questions, please don't hesitate to ask. You can reach us by email at support@biostat.wisc.edu and by phone at 265-5757.

The part of the notice we got that affects BMI is below:

From: Barrett Matthew R [MBarrett@uwhealth.org]
Sent: Tuesday, September 26, 2017 3:40 PM
Subject: Electrical ShutDown

All,

We are planning an electrical shut down at 0300 that will take approx. 2 hours Saturday 9/30/17 that will affect the following areas.
Please let me know if you have any questions as soon as possible.

During the shutdown normal lighting will be lost in:
1. K4 floors 1-9
2. K6 floors 1-5
3. H4 floors 1-8
4. H6 floors 1-5
5. Basement lighting in E5, H6, and tunnels.

Normal power (receptacles and equipment fed from 120/208 systems) will be lost in:
1. K4 basement - 9
2. K6 floors 1-5
3. H4 floors 1-9
4. H6 basement - 5
5. J5 basement - 1
6. G5 floors 1-3

Wed, July 19, 2017: UW Campus Network Outage [RESOLVED]

Starting on July 18th at 8PM, the entire UW campus has been experiencing widespread network outages that are affecting many different buildings and departments including the SMPH. Any desktop or server connected to the Biostatistics and Medical Informatics network may have internet or connection issues preventing any outside connection. There are reports that campus WiFi currently works in most areas.

As a temporary or emergency workaround, you may be able to connect your laptop to the campus wireless network, then connect to the BMI network resources (including your network shares or compute servers) via your VPN. This is not guaranteed to work in all areas, depending on the outage; however, it has worked for some.

More information on the outage, and more up-to-date resolution notifications, can be found on the DoIT outage site.

Tu, July 18, 2017: nebula-1 and nebula-2

Two new servers have been added to the open compute pool. Nebula-1 replaces a host that died some time ago, and nebula-2 replaces a very old machine.

If you were using the old nebula-2 for compiling certain libraries, it has been renamed to nebula-0. It will be taken out of service once the last few SL6 machines have been upgraded to SL7.

Wed, June 21, 2017: Linux machine security issue and reboots

To address an extremely dangerous and widespread vulnerability, we will be patching all the Linux machines in the next few days.

This patch requires a reboot.

Condor users will see a slightly higher rate of job restarts as we patch and reboot the condor-only hosts. We will try to schedule patch times with people for their desktop machines, but we don't want to spend more than a few days on getting every computer patched.
