Tuesday, 1 March 2016

How to Analyze Purple Screen of Death (PSOD)

The Purple Screen of Death, commonly known as PSOD, is the crash screen we see on an ESXi host when the VMkernel hits a fatal error.

Usually when we hit a PSOD, we take a screenshot of it, reboot the host, then collect the logs and upload them to VMware support for analysis.
But why not analyze the dump ourselves?

Step 1: Sometimes we might miss the screenshot of the PSOD. Well, that's alright! If a core dump is configured for the ESXi host, we can extract the dump files to gather the crash logs.
Once the host is back up from the unplanned reboot after the PSOD, log in to the host over SSH (with PuTTY, for example) and go to the core directory. The core directory is where the PSOD logging goes.
This “core” folder is the important one because it holds the kernel dump: the PSOD writes what was in memory to a file called vmkernel-zdump.1, or something to that effect, and places it in that directory.
So go to
# cd /var/core

Then list the files here using ls -ltr. You should see the dump file, vmkernel-zdump.1 or similar.
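If the directory is cluttered, a small filter helps pick out the dump files. This is only a sketch: list_zdumps is a made-up helper name, and it assumes the dump files have "zdump" in their names as described above.

```shell
# Sketch: list crash dump files in a core directory, oldest first.
# list_zdumps is a hypothetical helper name, not an ESXi command.
list_zdumps() {
    core_dir="${1:-/var/core}"
    # -t sorts by modification time, -r reverses it so the newest dump is last
    ls -tr "$core_dir" | grep 'zdump'
}

# Usage on the host: list_zdumps /var/core
```

The last name printed is the most recent dump, which is normally the one you want to extract.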

Step 2: 
How do we extract it?
Well, we have a handy extraction tool that does the whole job, “vmkdump_extract”. Run it against the zdump.1 file, like this:

# vmkdump_extract vmkernel-zdump.1 

It extracts several files from the dump.

Note: All we need for analysis is the vmkernel-log.1 file.
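The extraction step can be wrapped with a couple of sanity checks. Treat this as a rough sketch: extract_zdump is a hypothetical helper name, and it assumes the extracted log is called vmkernel-log.1, as it is in this post. Your file names may differ.

```shell
# Sketch: extract a kernel dump and confirm the log file was produced.
# extract_zdump is a hypothetical helper name; vmkdump_extract is the ESXi
# tool named in this post and must be available on the PATH.
# The body runs in a subshell ( ... ) so the cd does not affect the caller.
extract_zdump() (
    dump_file="${1:-vmkernel-zdump.1}"
    core_dir="${2:-/var/core}"
    cd "$core_dir" || exit 1
    if [ ! -f "$dump_file" ]; then
        echo "dump file not found: $core_dir/$dump_file" >&2
        exit 1
    fi
    vmkdump_extract "$dump_file" || exit 1
    # Assumption: the log we need is named vmkernel-log.1, as in this post
    [ -f vmkernel-log.1 ] && echo "extracted: $core_dir/vmkernel-log.1"
)

# Usage on the host: extract_zdump vmkernel-zdump.1 /var/core
```

A non-zero exit status means either the dump file was missing, the extraction failed, or the expected log file did not appear.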

Step 3: Open the vmkernel-log.1 file using one of the methods below:
a. WinSCP (GUI)
b. less vmkernel-log.1 (command line)
I am a Windows and VMware support engineer, so I definitely prefer the GUI method for analyzing the log file :)

Let’s use WinSCP:
Step 4. Connect to your ESXi host using WinSCP, browse to /var/core, and copy vmkernel-log.1 to your local machine.

Step 5. Now that vmkernel-log.1 is on your local machine, open it in an editor such as Notepad++ (right-click the file and choose to edit it with Notepad++), then search for the keyword “BlueScreen”. That takes you to the crash events.

The first line, @BlueScreen:, gives the crash exception, such as Exception 13 or 14. In my case it points to “LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor”, which indicates a hardware issue.
The VMK uptime line gives the kernel uptime before the crash.
The logging after that is what we need to look at to find out why the crash occurred.

Note: The crash dump is different for every crash. The cause can range from hardware errors to driver issues, problems with the ESXi build, and a lot more.

When using method b, jump to the end of the file by pressing Shift+G and work your way back up with Page Up. You will come across a line that says @BlueScreen: <event>, and from there you know what needs to be checked.
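Instead of paging through the file by hand, you can also let grep find the line for you. This is a minimal sketch: show_bluescreen is a made-up helper name, the default of 20 trailing lines is arbitrary, and it assumes the grep on your system supports the -n and -A options (if the one on the host does not, run it against the copy on your workstation).

```shell
# Sketch: show the @BlueScreen line and the lines that follow it.
# show_bluescreen is a hypothetical helper name; 20 is an arbitrary default.
show_bluescreen() {
    log_file="${1:-vmkernel-log.1}"
    context="${2:-20}"
    # -n prints line numbers, -A prints that many lines of trailing context
    grep -n -A "$context" '@BlueScreen' "$log_file"
}

# Usage: show_bluescreen vmkernel-log.1 40
```

The line number in the output also tells you where to jump to inside less if you want to read more of the surrounding log.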

Each dump analysis will be different, but the fundamentals are the same.
Hope this doc is helpful. You can try analyzing the dumps yourself now :)