

{{DowntimeRCA|Windows Update Failure|cause=As background, Windows was originally installed as Windows 7 Home. It was later upgraded to Windows 10 Home and then to Windows 10 Pro to add Hyper-V and Remote Desktop (the latter restricted to targeted IPs, since this service is considered both a convenience and a vulnerability).

The AniNIX was brought down after a Windows update on 8/24/2016. The AniNIX had recently had other downtimes while on calls with Microsoft, because the Windows cdrom.sys driver was not recognizing drives that were ostensibly plug-and-play. Many fixes had been tried without success, including reordering SATA slots, registry updates, calls with drive manufacturers for drivers, and adding and removing the hardware in Device Manager. This issue was on hold at the time of the update.

When Windows started again, it would only flash a "CRITICAL_PROCESS_DIED" error message. Booting with Holocron showed no memory issues or other hardware problems. Using a Windows installation medium (on USB), we were not able to restore to a system restore point before the update or correct the issue with the installation. We were also not able to access the prior system images taken with the Windows backup utility.

Discussions with Microsoft indicated that the upgrades from Windows 7 to Windows 10 Pro had suffered silent failures that were not reported to the user. While the operating system would limp along in functionality, it could not be repaired and required a reinstall. |length=~24 hours |resolution=

At this point, the Forge2 frame was disconnected and all SATA lines except for the 60GB SSD were severed. Windows 10 Pro was re-installed to this disk, and display drivers were re-installed. The standard and Hypervisor packages were re-installed. Games were left to be rebuilt over time.

The frame was then brought down again and the other drives reconnected, with the exception of the original Windows 10 drive. This drive will be left disconnected as a form of backup, and system images will be abandoned in the short term. Instead, backups will be taken of all installers used during the recovery process, along with the Windows 10 Pro product key and registry. |commits= }}

{{DowntimeRCA|Windows Virtual Disk Failure|cause=On 9/17/2016, the [[Forge2#Hypervisor_Notes|Hypervisor]] logged error messages about a reset being sent to the RAID1 device: a Windows warning-level event with an ID of 129 and source storahci. From then on, every five seconds, error messages were logged about disks: "An error was detected on device \Device\Harddisk1\DR1 during a paging operation." with an event ID of 51 and source of disk. Also every five seconds, an NTFS error with ID 140 was logged with the message "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: F:, DeviceName: \Device\HarddiskVolume1. (A device which does not exist was specified.)".

These error messages occurred silently for several hours. Finally, on trying to access a mounted virtual disk inside Core on 9/18/2016, ArchLinux displayed a cascade of block update failures and required a reboot. All virtual machines on the Forge2 were to be rebooted, but none came up as Hyper-V could not detect the disks. The entire frame was restarted, VM definitions repaired manually, and services restored.

This issue recurred a few days later and perhaps once a week after. |length=6 hours over a series of downtimes |resolution=Hyper-V resource files and executables were excluded from antivirus scanning. When the antivirus didn't respect this, we dropped antivirus from the Hyper-V host. |commits=[https://aninix.net/mediawiki/index.php?title=Forge2&type=revision&diff=722&oldid=712 Wiki change] }}
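
The event signatures in this RCA (129 from storahci, 51 from disk, 140 from NTFS) recurred silently for hours, so a periodic scan for them could surface the problem sooner. The following is only a minimal sketch, assuming event records have already been exported from the Windows event log into dictionaries; the `source` and `event_id` keys, the function names, and the alert threshold are hypothetical and not part of any AniNIX tooling.

```python
from collections import Counter

# Event signatures taken from the RCA above: (source, event ID).
# Source names are compared case-insensitively.
STORAGE_SIGNATURES = {
    ("storahci", 129),  # reset sent to the RAID1 device
    ("disk", 51),       # error during a paging operation
    ("ntfs", 140),      # failed to flush data to the transaction log
}


def count_storage_warnings(events):
    """Count events matching the known storage-failure signatures.

    `events` is an iterable of dicts with hypothetical keys
    "source" (str) and "event_id" (int).
    """
    counts = Counter()
    for event in events:
        key = (str(event["source"]).lower(), int(event["event_id"]))
        if key in STORAGE_SIGNATURES:
            counts[key] += 1
    return counts


def should_alert(events, threshold=3):
    """Return True once matching events reach `threshold` in total."""
    return sum(count_storage_warnings(events).values()) >= threshold
```

Because the errors repeated every five seconds, even a low threshold over a short window would have been reached long before the virtual disks failed inside Core.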

{{DowntimeRCA|Windows Update Service Reboots|cause=The Windows Update Service will reboot the host to install updates if the host has not been rebooted recently. This causes downtimes on the Core VM and other service VMs; remote recovery has also been made difficult by the BIOS randomly changing the boot device.

Thanks to Mathisen from ##windows on Freenode, we found a GPO that we are testing. In gpedit.msc, set "Computer Configuration \ Administrative Templates \ Windows Components \ Windows Update \ Configure Automatic Updates" to Enabled with the option "Notify to download / Notify to install". This should stop automatic reboots but requires a Pro or Enterprise license. |length=2 hours from unattended host |resolution=Apply Group Policy. |commits=https://aninix.net/mediawiki/index.php?title=Forge2&type=revision&diff=767&oldid=722 }}
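
The "Configure Automatic Updates" policy is stored under the Windows Update `AU` policy registry key, so the same setting can be captured as a `.reg` file for reuse after a reinstall. This is a sketch only: it assumes the `AUOptions` value 2 (notify before download) corresponds to the option chosen in the RCA, the function name is hypothetical, and the output should be verified on the target host before importing.

```python
# Policy key that backs "Configure Automatic Updates" (assumption: the
# standard Windows Update policy location; verify on the target host).
AU_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"


def render_reg_file(au_options=2):
    """Build .reg text enabling the policy with the given AUOptions value.

    AUOptions 2 means "notify before download"; NoAutoUpdate 0 keeps
    Automatic Updates enabled so the notify option takes effect.
    """
    no_auto_update = 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{AU_KEY}]",
        f'"NoAutoUpdate"=dword:{no_auto_update:08x}',
        f'"AUOptions"=dword:{au_options:08x}',
        "",
    ]
    # .reg files conventionally use Windows line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(render_reg_file())
```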

{{DowntimeRCA|Charter ISP Outage|cause=At 0043 CST on 11/18/2016, the Charter Residential modem lost connectivity with the wider Internet. Service was not restored until 0645, and the admin was physically away from the system. At 1045 CST the same day, a direct check of the IP address showed that the AniNIX had, luckily, regained the IP address it held before the outage. |resolution=Admins have changed ISP contracting to be notified immediately of outages and resumption of service.

The upcoming changes to the Cerberus project will include detection of changes in WAN IP and will notify admins so that the remote DNS can be updated. Admins will also investigate dynamic DNS services. [[Category:TODO]] |length=6 hours of technical downtime, extended to 10 hours of practical downtime by a lack of reporting tools on ISP outages. |commits= }}
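
The WAN IP change detection planned for Cerberus could be as simple as comparing the current public address with the last one recorded. The sketch below is illustrative and is not the Cerberus implementation: it assumes the current address has already been obtained elsewhere and is passed in as a string, and the function name and state file are hypothetical.

```python
import ipaddress
from pathlib import Path


def check_wan_ip(current_ip, state_file):
    """Compare `current_ip` with the address saved in `state_file`.

    Returns (changed, previous_ip). The state file is updated whenever
    the address differs, so the next call compares against the new one.
    The first run records the address and reports no change.
    """
    # Validate input so a garbage lookup result never overwrites state.
    current = str(ipaddress.ip_address(current_ip.strip()))
    path = Path(state_file)
    previous = path.read_text().strip() if path.exists() else None
    changed = previous is not None and previous != current
    if previous != current:
        path.write_text(current + "\n")
    return changed, previous
```

When `changed` is true, a caller would notify admins so that the remote DNS record can be updated to the new address.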

{{DowntimeRCA|Windows Sleep Hangs Core VM|cause=Admin Error -- accidentally clicked sleep during RDP disconnect |length=8 hours |resolution=I was able to access the Shadowfeed and Forge2 entities by port forwarding through Bastion from my Tricorder, as I was offsite. I was able to restart Core and supply passphrases to unlock the storage.

This downtime was exacerbated as I did not check Heartbeat after resuming the hypervisor.

We've added notes for removing sleep options on servers and remote Heartbeat monitoring in response to this incident. |commits=* [https://aninix.net/mediawiki/index.php?title=Forge2&curid=64&diff=984&oldid=932 Wiki for Forge2 notes] }}

{{DowntimeRCA|Windows Wake Hangs Hypervisor|cause=Code issue|length=30 minutes|resolution=All services were restarted.|commits=None yet. I plan to move Bastion to its own host, and to ensure Forge2 runs VMware ESXi rather than Hyper-V. This will take time and planning.}}

{{DowntimeRCA|Alliant Energy Outage|cause=Power company|length=9 hours|resolution=AniNIX staff noticed connections drop to AniNIX services on 2018-09-18 12:05 CDT. At this point, power had already been out for 22 minutes and UPS power was exhausted, resulting in a shutdown of AniNIX hardware. Alliant notified AniNIX on-site admins 40 minutes later of the outage, and power was restored by 16:00. Unfortunately, AniNIX staff were not able to resume service operation until 21:15.|commits=We will add monitoring either through Nazara or Forge2 of the UPS, but this will not prevent an outage. The Forge3 may improve uptime capacity, but until a generator is on-premise, this outage is unpreventable. That generator cost will take significant time to defray.}}
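
The 9-hour length for the Alliant outage can be checked from the timestamps in the RCA. This is a worked example only; the variable names are illustrative, and it assumes service downtime is measured from the moment staff noticed the dropped connections.

```python
from datetime import datetime, timedelta

# Timestamps from the RCA above (local time, CDT).
noticed = datetime(2018, 9, 18, 12, 5)         # staff noticed dropped connections
power_lost = noticed - timedelta(minutes=22)   # power had already been out 22 minutes
power_restored = datetime(2018, 9, 18, 16, 0)  # "restored by 16:00"
service_resumed = datetime(2018, 9, 18, 21, 15)

power_outage = power_restored - power_lost
service_downtime = service_resumed - noticed

print(power_lost.strftime("%H:%M"))  # 11:43
print(power_outage)                  # 4:17:00
print(service_downtime)              # 9:10:00
```

Roughly half of the service downtime fell after power was restored, which is the portion that monitoring and faster restart procedures could shorten.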

{{DowntimeRCA|Charter ISP Outage|cause=ISP|length=4 hours|resolution=The AniNIX on-site admin detected the outage at 2018-10-04 00:05 CDT and restarted the modem hardware on-site several times. When this didn't work, the admin contacted the ISP and reported the outage. The ISP acknowledged the outage, and field techs were sent to repair it in the area. Service was restored at 04:10 CDT.|commits=None. Our Zapier/Freshping alerting service worked appropriately, and Sharingan sent out notifications as designed when service resumed.}}

[[Category:RCA Archive]]