We ran into the Purple Screen of Death on one of our ESX 4.1 boxes yesterday. It is a Dell R610 that apparently had a hardware hiccup and kicked out these errors:
| | Tue Apr 19 21:09:15 2011 | | PCIE Fatal Err: Critical Event sensor, bus fatal error (Bus 1 Device 0 Function 1) was asserted | | 0xA10002FBF9AD4DB1000413186FAA0101h |
| | Tue Apr 19 21:09:15 2011 | | Err Reg Pointer: OEM sensor, OEM Diagnostic data event was asserted | | 0xA00002FBF9AD4DB10004C11A7E011610h |
| | Tue Apr 19 21:09:15 2011 | | PCIE Fatal Err: Critical Event sensor, bus fatal error (Bus 1 Device 0 Function 0) was asserted | | 0x9F0002FBF9AD4DB1000413186FAA0001h |
| | Tue Apr 19 21:09:15 2011 | | Err Reg Pointer: OEM sensor, OEM Diagnostic data event was asserted | | 0x9E0002FBF9AD4DB10004C11A7E011610h |
We rebooted the box, and it came back online just fine, but we didn't feel comfortable with it, so we stuck it in maintenance mode and had someone contact Dell. Dell reports that we need to update the BIOS on it:
-----
Yes, it appears your system is affected by some of the microcode updates released from Intel on the 5500 and 5600 series processors. That is likely the cause of these PCI errors. The course of action we need to take is:
· Update the BIOS
· Update the iDRAC
· Clear out the old log entries
· Monitor for recurrence.
------
So it's sitting in maintenance mode until someone has some time to love on it. The awesome thing is that we run N+1 (one more box than we need), so we have that luxury. I know plenty of people who refuse to listen to why you should go N+1, and they would be scrambling to find a maintenance window to update it.
The downside to this whole fiasco was that when it hiccupped, it stayed online (as is the default with ESX) and held onto the storage of its VMs. Therefore, HA couldn't restart them on another box until someone manually SHUT OFF the pretty Purple-VM-Eater. As soon as they did that, all was well in the world and the phone stopped ringing.
Since I'm not fond of relying on manual intervention to make HA work, I found the command for auto-restart when a PSoD happens and applied it to ALL our hosts:
esxcfg-advcfg -s X /Misc/BlueScreenTimeout
Where X = number of seconds before restart
I went with 30 seconds. That way I still have the opportunity to see the screen if I happen to be looking at it when it dies.
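For anyone who wants to push the same setting to several hosts at once, here is a rough sketch of how that could look. It is not the exact procedure we used. The host names (esx01 through esx03) are made up, and it assumes you can SSH to each host's service console as root, which is not enabled by default on ESX. The script only prints the commands unless you run it with DRY_RUN=0. The -g flag of esxcfg-advcfg reads the value back so you can confirm it took.

```shell
#!/bin/sh
# Sketch only: host names are hypothetical, replace with your own ESX hosts.
HOSTS="esx01 esx02 esx03"
TIMEOUT=30   # seconds the purple screen stays up before the host restarts

# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

apply_timeout() {
    host="$1"
    # -s sets the advanced option, -g reads it back for verification.
    cmd="esxcfg-advcfg -s $TIMEOUT /Misc/BlueScreenTimeout && esxcfg-advcfg -g /Misc/BlueScreenTimeout"
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@$host '$cmd'"
    else
        # Assumes root SSH access to the service console is enabled.
        ssh "root@$host" "$cmd"
    fi
}

for host in $HOSTS; do
    apply_timeout "$host"
done
```

Run it once as-is to eyeball the commands, then again with DRY_RUN=0 when you are happy with the host list.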
------
Dustin Shaw
VCP