BCG Status Updates

Scheduled File Server Outage: Wed Mar 13

Introduction:
We have scheduled a two-hour file server outage for Wed March 13 @ 9pm to install hardware in preparation for rolling out an enhanced-speed network for the computing cluster.

If you use Condor, or have long-running computational jobs, you should shut down your jobs during the outage.

Background:
Recently, our file servers have saturated the network connections between the file servers and our core infrastructure switches. When those connections are saturated, every connected computer slows down. The majority of the network traffic is bound for our computational cluster.

Method:
Offloading cluster compute traffic onto a private network will benefit the department by reducing network saturation levels to the file servers on the biostat network. Additionally, having a dedicated network for cluster compute traffic allows for better management and higher throughput of departmental core services and computational cluster resources.

Preliminary Data:
We are experimenting with a new 40 gigabit cluster computing network infrastructure, which has the potential to speed up our computational cluster by up to 10x per node, and by up to 80x for some data-intensive cluster computing jobs. The new network infrastructure is currently in testing and has shown throughput improvements on the test compute nodes.
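For a rough sense of scale, the best-case transfer time over a link can be worked out from its nominal speed. The following is a back-of-the-envelope sketch in Python, not a measurement from our testing. It assumes a fully utilized link with no protocol, disk, or file server overhead, so real transfers will be slower. The function name and the 100 GB example size are illustrative assumptions.

```python
def ideal_transfer_seconds(size_bytes, link_gbit_per_s):
    """Best-case time to move size_bytes over a link of the given nominal speed.

    Assumption: the link is fully utilized and there is no overhead of any
    kind. Treat the result as a lower bound, not a prediction.
    """
    bits = size_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


# Illustrative only: a 100 GB (100e9 byte) data set on a 40 gigabit link.
example_seconds = ideal_transfer_seconds(100e9, 40)
```

Calling the same function with a different link speed gives a quick comparison between two networks for the same data set.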

Dec 13 2018: New Bitcoin Scam Seen on Campus

(From the Office of Cybersecurity)

Colleagues,

Please be aware that the Office of Cybersecurity and the UW Police Department have received reports of a new Bitcoin (cryptocurrency) email scam. The message claims that a bomb has been placed in the building that the recipient is in and demands a large Bitcoin payment. This scam has also been seen nationwide at businesses and other educational institutions, including the University of Washington at Seattle and Penn State.

If you or others in your department receive this message, please delete it immediately. The Office of Cybersecurity is aware of the scam and is working with UWPD to investigate. Additionally, the FBI is aware of the threats and is working with law enforcement to provide assistance.

Aug 23, 2018: Central File Server Maintenance is complete

The hardware upgrade for the central userspace file server is now complete (1:40pm).

It took us a bit longer than anticipated, but we wanted to be sure certain configuration settings were properly persisting across a reboot.

Aug 20, 2018: Email delivery problems

Late last week the BCG made some changes to our name server configuration at the request of DoIT (Outlook integration). Unfortunately, in making those changes we introduced a bug into how email addresses are resolved. This error caused email sent from outside the UW Outlook system to @biostat.wisc.edu addresses to fail to deliver. Email sent to your @wisc.edu address, or sent entirely from within UW's Outlook, would still be delivered as normal.

We have fixed the problem (at about 9:45am this morning), and made additional changes to the configuration to prevent this particular sort of problem from happening again.

Many mail servers will keep trying to deliver failed email for several days (up to five, but sometimes fewer). So, some of you will be getting email out of normal time order. If a mail server was unable to deliver within its retry period, the person trying to send you email will get a notice that the mail wasn't delivered, so there should be no silent losses of email.

I apologize for the inconveniences this has caused.

--
William S. Annis

Scheduled File Server Outages: Wed Aug 22, Thu Aug 23

We have scheduled two, two-hour file server outages for next week to address hardware issues we need to handle before classes start.

If you use Condor, or have long-running computational jobs, you should shut down your jobs during both outages.

First outage: Wednesday, Aug 22, 10:00am-12:00 noon.

This outage will affect the Computational file server, for data under /z/Comp and /z/DW. Your home directories should continue to be available during this outage, but regular users of computational space with customized environments might run into timeouts.

Second outage: Thursday, Aug 23, 11:00am-1:00pm.

This updates the main userspace and project directories under /ua (including the Z: and Q: drives) and /z/Proj. Normal work will not be possible during this outage.

August 13 2018: default R version updated to 3.5.1 tomorrow AM

The default version of R on the Linux machines will switch to 3.5.1 tomorrow morning (Aug 14, at 3:00am).

You can use the command R331 if you need to use the previous version after the switch has happened.

July 31 2018: R 3.5.1 ready

The newest version of R, 3.5.1, is ready and usable with the path /s/pkg/linux64/R/3.5.1/bin/R. After some testing, we will make this the new default, in a few weeks.
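Scripts that need a specific R version, rather than whatever the default happens to be, can build the versioned path explicitly. The following is a minimal sketch in Python. It assumes that other versions follow the same /s/pkg/linux64/R/<version>/bin/R layout as the 3.5.1 path above, which is an assumption (only the 3.5.1 path is confirmed here). The helper names are illustrative.

```python
import os
import subprocess

# Assumed install root, inferred from the 3.5.1 path given in this post.
R_ROOT = "/s/pkg/linux64/R"


def r_binary(version, root=R_ROOT):
    """Return the expected path of the R executable for a given version."""
    return os.path.join(root, version, "bin", "R")


def r_version_line(version, root=R_ROOT):
    """Run `R --version` for that install and return the first output line.

    Returns None if the install is not present on this machine.
    """
    path = r_binary(version, root)
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None
```

For example, r_version_line("3.5.1") should report the R version on a machine where that install is mounted, and None elsewhere.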

Some packages have dropped maintenance since the last update:

  • deal - very old
  • msgl
  • ncdf - replaced with ncdf4 package, which is now installed
  • ReporteRs
  • epicalc

If you need a particular replacement package for one of these, please let us know.

July 18 2018: Directory Difficulties

Today one of our file servers had a drive fail. The normal failover process worked, but unfortunately took long enough to confuse a bunch of machines.

The usual symptom is that directories (home directories, project directories) will not attach. Often your login session will just hang.

Unfortunately, this takes a hard reboot to fix reliably. We are checking a bunch of the open login and compute servers (Wednesday afternoon), and rebooting as we can.

Jan 4 2018: Hardware Vulnerability Patching Schedule

In the last few days rumors of terrible, CPU-level security vulnerabilities have been appearing in the tech news. Last night the embargo on details was broken, and it's quite a mess. The BCG will need to patch all machines in the department, probably more than once, to address the problems.

Over the next week, please:

  1. Log out if you are away from a machine more than a few hours (including remotely). This lets us patch machines when we see no one is logged into them.
  2. Avoid long-running compute jobs, either directly or through Condor. This will minimize work lost when we reboot a machine.
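Before a patch window it can help to see which of your own processes have been running for a long time. The following is a sketch in Python, assuming a Linux machine with the standard ps command. The function names and the six-hour threshold are illustrative assumptions. It only looks at processes on the local machine and says nothing about jobs queued in Condor.

```python
import getpass
import subprocess


def etime_to_seconds(etime):
    """Convert the ps elapsed-time format [[dd-]hh:]mm:ss to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running(ps_output, min_seconds):
    """Return (pid, elapsed_seconds, command) for processes at or over the threshold."""
    jobs = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, etime, command = fields
        elapsed = etime_to_seconds(etime)
        if elapsed >= min_seconds:
            jobs.append((int(pid), elapsed, command.strip()))
    return jobs


def my_long_running(min_seconds=6 * 3600):
    """List the current user's local processes older than min_seconds."""
    # Separate -o options with "=" suppress the column headers.
    result = subprocess.run(
        ["ps", "-u", getpass.getuser(),
         "-o", "pid=", "-o", "etime=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return long_running(result.stdout, min_seconds)
```

Anything this reports is a candidate to checkpoint or stop before logging out.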

One of the two vulnerabilities cannot be fully solved short of replacing the CPU. The patches for that are work-arounds which try to minimize the risk of the vulnerability. These patches do degrade the performance somewhat, from nearly 20% for certain kinds of database tasks, to more modest 3-5% hits for purely computational work. What the hit will be like for average, daily workloads is not yet clear.

Patches are already available for all three of our platforms: Windows, MacOS, and Linux. We have already begun to apply patches on free machines. As firmware updates become available BCG staff will need to visit people's desktop machines and spend time with laptops.

There are two different exploits: Meltdown and Spectre. Meltdown can be patched. Spectre is going to be harder to fix.

These vulnerabilities can be exploited by any software running on your computer. That includes the JavaScript running in your web browser, which makes remote exploitation trivial. We are not sure if these are being used in the wild yet, but we can expect that they will be soon.

We strongly recommend everyone update their personal machines (desktops, laptops, mobile) as well. Be aware that some antivirus software on Windows has been blocking the Windows patches. [ZDNet]

If you are using Microsoft Windows Defender, Symantec Endpoint Protection, Kaspersky, ESET, AVAST, or F-Secure SAFE, this is not a problem.

However, McAfee Endpoint Protection, Trend Micro, Sophos Anti-Virus and Central, Cyren F-PROT, EMSI Anti-Malware, Bitdefender, Carbon Black, Cylance PROTECT, CrowdStrike Falcon, and Webroot do have this problem until they release a patch.

For more details, see this table (Google Docs).

If you want to know whether your Windows 10 machine has the Microsoft patch, check PC Settings > Update and Security > click on Update history, and look for KB4056892. If it is not there, let it install updates.
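The same check can be scripted. The following is a sketch in Python, assuming a Windows machine with PowerShell, whose Get-HotFix cmdlet lists installed updates. The helper names are illustrative. Get-HotFix may not list every kind of update, and a later cumulative update can replace this KB, so treat a missing entry as a reason to look at Update history rather than as proof the machine is unpatched.

```python
import re
import subprocess


def has_hotfix(hotfix_listing, kb_id):
    """Return True if kb_id appears as a whole KB identifier in the listing."""
    found = {
        match.upper()
        for match in re.findall(r"KB\d+", hotfix_listing, flags=re.IGNORECASE)
    }
    return kb_id.upper() in found


def installed_hotfixes():
    """Return the HotFixID column of Get-HotFix as text (Windows only)."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "Get-HotFix | Select-Object -ExpandProperty HotFixID"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a Windows machine, has_hotfix(installed_hotfixes(), "KB4056892") performs the check described above.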

Updates
  • Jan 4 2018. There are already proof-of-concept JavaScript attacks. Browser vendors are releasing patches, so be sure to update the browsers on your personal devices, too. Chrome, IE11 and Firefox all have patches available.
