Purple Screen of Death, commonly known as PSOD, is something most of us have seen on an ESXi host.
Usually when we hit a PSOD, we take a screenshot of it, reboot the host, then capture the logs and upload them to VMware support for analysis.
Why not analyze the dumps yourself?
Step 1: Sometimes we might miss out on the screenshot of the PSOD. Well, that's alright! If a core dump is configured for the ESXi host, we can extract the dump files to gather the crash logs.
Once the host is back up from the reboot after the PSOD, log in to the host over SSH (PuTTY) and go to the core directory. The core directory is where your PSOD logging goes.
The most important one is the “core” folder, which contains the kernel dump. The PSOD dumps what was in memory to a file called vmkernel-zdump.1, or something to that effect, and places it in that directory.
So go to
# cd /var/core
Then list the files here using ls -ltr. You will see the file below.
vmkernel-zdump.1
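If the directory holds more than one dump, a small shell helper can pick out the newest one. This is only a sketch: latest_zdump is a name made up for this post, and it assumes the dumps follow the vmkernel-zdump.N naming shown above.

```shell
# latest_zdump: print the most recently modified vmkernel-zdump.* file
# in a directory (defaults to /var/core, the path used in this post).
# Hypothetical helper, not part of ESXi.
latest_zdump() {
    dir="${1:-/var/core}"
    # ls -t sorts newest first; head keeps only the first entry
    newest=$(ls -t "$dir"/vmkernel-zdump.* 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no vmkernel-zdump files found in $dir" >&2
        return 1
    fi
    echo "$newest"
}
```

Running latest_zdump with no argument checks /var/core; pass another directory as the first argument if your dumps live elsewhere.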
Step 2: How do we extract it?
Well, we have a nice extraction script that does the whole job, “vmkdump_extract”. This command must be run against the zdump.1 file, like this:
# vmkdump_extract vmkernel-zdump.1
It creates several files, as shown in the screenshot.
Note: All we require for analysis is the vmkernel-log.1 file.
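If you do this often, the extraction step can be wrapped in a small helper. A rough sketch, assuming vmkdump_extract writes its output into the current directory, as the steps above suggest. The name extract_zdump is hypothetical and not a VMware tool.

```shell
# extract_zdump: run vmkdump_extract on a dump file from inside the dump's
# own directory, then list the vmkernel-log.* files that resulted.
# Hypothetical wrapper; assumes vmkdump_extract writes to the current directory.
extract_zdump() {
    dump="$1"
    if [ ! -f "$dump" ]; then
        echo "dump file not found: $dump" >&2
        return 1
    fi
    # Use a subshell so the caller's working directory is left unchanged
    (
        cd "$(dirname "$dump")" || exit 1
        vmkdump_extract "$(basename "$dump")" || exit 1
        # Fails (non-zero status) if no log file was produced
        ls -ltr vmkernel-log.*
    )
}
```

For the file from Step 1 this would be called as extract_zdump /var/core/vmkernel-zdump.1.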
Step 3: Open the vmkernel-log.1 file using one of the methods below:
a. WinSCP (GUI)
b. less vmkernel-log.1 (command line)
I am a Windows plus VMware support engineer, so I definitely prefer the GUI method for analyzing the log file :)
Let’s use WinSCP:
Step 4: Connect to your ESXi host using WinSCP, browse to the /var/core path, and copy vmkernel-log.1 to your local machine.
Step 5: Now that vmkernel-log.1 is on your local machine, open it in an editor such as Notepad++ (right-click the file and choose to edit it with Notepad++), then search for the keyword “BlueScreen”. That takes you to the events described below.
The first line, @BlueScreen:, tells you the crash exception, such as Exception 13 or Exception 14. In my case it reads “LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor”, which points to a hardware issue.
The VMK uptime line tells you how long the kernel had been up before the crash.
The logging after that is the information we need to look at to find out why the crash occurred.
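The same search can be done from the command line instead of an editor. Here is a minimal sketch that prints the first @BlueScreen line and a number of lines after it. The function name show_psod and the default of 40 lines are arbitrary choices for this example, not anything defined by ESXi.

```shell
# show_psod: print the first @BlueScreen line in a log plus the lines after it.
# Usage: show_psod LOGFILE [LINES_AFTER]
# Hypothetical helper; the default of 40 trailing lines is an arbitrary pick.
show_psod() {
    log="$1"
    after="${2:-40}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    if ! grep -q '@BlueScreen' "$log"; then
        echo "no @BlueScreen marker in $log" >&2
        return 2
    fi
    # Start printing at the first match and stop after $after extra lines
    awk -v n="$after" '
        /@BlueScreen/ && !found { found = 1; left = n + 1 }
        left > 0 { print; left-- }
    ' "$log"
}
```

For example, show_psod vmkernel-log.1 20 shows the marker line and the 20 lines that follow it.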
Note: The crash dump varies for every crash. The causes can range from hardware errors and driver issues to problems with the ESXi build, and a lot more.
If you use method b, skip to the end of the file by pressing Shift+G and slowly work your way up by pressing Page Up. You will come across a line that says @BlueScreen: <event>, and after that you know exactly what to check :)
Each dump analysis will be different, but the fundamentals are the same.
Hope this doc is helpful. You can try analyzing the dumps yourself now :)