Introduction
Windows* black screen hangs and crashes are difficult to debug because the system doesn't display any status or debug information, and the regular WinDbg* connection methods are frequently unusable. Intel® Debug Extensions for WinDbg*, included with Intel® System Debugger, can help with the debug process by providing a debug connection method to an otherwise unresponsive Windows* target.
This article shows how to use Intel® Debug Extensions for WinDbg* to analyze black screen hangs and crashes. It is assumed that you are familiar with Intel® DCI debug, and that you have installed Intel® System Debugger and WinDbg* on your host system, enabled DCI on your target system, and connected your host system to your target using a supported DCI method, such as an Intel® SVT DCI DbC cable or an Intel® SVT Closed Chassis Adapter. If you are not familiar with Intel® System Debugger, please review the Intel® System Debugger User Guide.
Loading Debug Symbols
The first step in debugging with Intel® Debug Extensions for WinDbg* is loading the debug symbols. This process takes some time because the debugger has to enumerate all modules and download the PDB files from the Microsoft* symbol server. To see additional details about the download status, run .sym noisy before the .reload /f command. The status word BUSY in the lower-left corner indicates that the command is still executing. Once the symbols are loaded, WinDbg* shows the kd> command prompt. At this point you can run the lm command to see the module list:
0: kd> lm
start end module name
fffff800`b4400000 fffff800`b445c000 msrpc (pdb symbols) c:\symbols\msrpc.pdb\D1A2C906531046A3B06666B1678B02DF1\msrpc.pdb
fffff800`b4460000 fffff800`b44c2000 FLTMGR (pdb symbols) c:\symbols\fltMgr.pdb\620A988036C34BAFAD3FA05B3C5E27FF1\fltMgr.pdb
fffff800`b44d0000 fffff800`b44f5000 ksecdd (pdb symbols) c:\symbols\ksecdd.pdb\A189E34E147D4EF784C97559FD6C40F61\ksecdd.pdb
fffff800`b89c0000 fffff800`b89e2000 WdNisDrv (pdb symbols) c:\symbols\WdNisDrv.pdb\0DF82E576FFE483E932D1BE92599179F1\WdNisDrv.pdb
fffff803`d0f56000 fffff803`d0f61000 kdcom (pdb symbols) c:\symbols\kd1394.pdb\13106436C8574DFBB424C696AF2BC1632\kd1394.pdb
fffff803`d2007000 fffff803`d27d3000 nt (pdb symbols) c:\symbols\ntkrnlmp.pdb\0DE6DC238E194BB78608D54B1E6FA3791\ntkrnlmp.pdb
fffff803`d27d3000 fffff803`d2846000 hal (pdb symbols) c:\symbols\hal.pdb\81C1AF690083498BA941D5EC628CDCF41\hal.pdb
fffff961`10a00000 fffff961`10d82000 win32kfull (pdb symbols) c:\symbols\win32kfull.pdb\7E008E1CFF454261A8C9F045658183421\win32kfull.pdb
fffff961`10d90000 fffff961`10ef2000 win32kbase (pdb symbols) c:\symbols\win32kbase.pdb\4DCD0ED713B74A56A031ECC9E0D3278F1\win32kbase.pdb
fffff961`10f10000 fffff961`10f1a000 TSDDD (pdb symbols) c:\symbols\tsddd.pdb\1D46FEC592A447A08B2DEBBA6ED270191\tsddd.pdb
fffff961`10f20000 fffff961`10f5c000 cdd (pdb symbols) c:\symbols\cdd.pdb\14C9A1BFDBE84658B69AFE70FE9BF0B11\cdd.pdb
fffff961`113c0000 fffff961`113e3000 win32k (pdb symbols) c:\symbols\win32k.pdb\770C6601DF3E461C95D324075C65528F1\win32k.pdb
Unloaded modules:
fffff800`b5770000 fffff800`b577f000 dump_storport.sys
fffff800`b57b0000 fffff800`b57d5000 dump_storahci.sys
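With a long module list, it can help to check programmatically which modules actually received PDB symbols. The following Python sketch assumes the lm output has been saved to a text file (for example with .logopen) and is passed in as a string. The function names parse_lm and without_pdb, and the exampledrv line in the sample, are illustrative only; they are not part of WinDbg* or of the output above.

```python
import re

# One module line of `lm` output: start, end, name, (symbol status), optional path.
LM_LINE = re.compile(
    r"^([0-9a-fA-F`]+)\s+([0-9a-fA-F`]+)\s+(\S+)\s+\(([^)]*)\)\s*(.*)$"
)

def parse_lm(text):
    """Return one dict per module line found in saved `lm` output."""
    modules = []
    for line in text.splitlines():
        m = LM_LINE.match(line.strip())
        if not m:
            continue  # header, "Unloaded modules:" entries, prompts, etc.
        start, end, name, status, path = m.groups()
        modules.append({
            "start": int(start.replace("`", ""), 16),
            "end": int(end.replace("`", ""), 16),
            "name": name,
            "status": status,
            "path": path,
        })
    return modules

def without_pdb(modules):
    """Names of modules whose symbol status is not 'pdb symbols'."""
    return [m["name"] for m in modules if m["status"] != "pdb symbols"]

# Two lines from the listing above plus one hypothetical module whose
# symbols have not been loaded yet.
SAMPLE = r"""
fffff803`d2007000 fffff803`d27d3000 nt (pdb symbols) c:\symbols\ntkrnlmp.pdb\0DE6DC238E194BB78608D54B1E6FA3791\ntkrnlmp.pdb
fffff803`d27d3000 fffff803`d2846000 hal (pdb symbols) c:\symbols\hal.pdb\81C1AF690083498BA941D5EC628CDCF41\hal.pdb
fffff800`b5770000 fffff800`b577f000 exampledrv (deferred)
"""
```

Calling without_pdb(parse_lm(SAMPLE)) lists the modules that may still need a .reload /f before their stack frames resolve to symbol names.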
Issue Analysis
When symbols are loaded, the stack trace becomes more informative, and you can analyze the current state of each processor core, using ~<number> to switch cores.
Black screen hangs and hardware-related BSODs have several possible causes:
- Dead loops and deadlocks
- Kernel Debug transport configuration issues
- Memory corruption issues, invalid opcodes in key Windows processes
- Bug Check 0x124: WHEA, NMI interrupt, Machine Check
A common method for root-causing Windows* issues is to use the !analyze -v extension command. This extension performs a large amount of automated analysis and displays the results in the Debugger Command window.
If the !analyze command fails with the “The debuggee is ready to run” message, you can force the analysis to take place as if a crash had occurred. Use !analyze -v -f to accomplish this.
Dead Loops and Deadlocks in Windows*
Let’s start with hangs caused by pure software issues. Fortunately, Windows* comes with an embedded Driver Verifier tool that can profile spinlocks. Once deadlock profiling is enabled, the tool produces verbose lock state information in a crashdump. When the debug connection is established, the !deadlock extension can be used in conjunction with Driver Verifier to detect inconsistent use of locks in your code that has the potential to cause deadlocks.
Driver Verifier doesn't support APC-level locks: mutexes (fast, guarded) and resources. These locks can be analyzed using the !analyze -hang and !locks commands. If needed, the !thread extension command can be used to obtain the thread information.
For example, here is typical output of the !locks command:
0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks................................................................................
Resource @ 0xffff8186e25a6d90 Exclusively owned
Contention Count = 123641
NumberOfExclusiveWaiters = 5
Threads: ffff8186ee782080-01<*>
Threads Waiting On Exclusive Access:
ffff8186edf95080 ffff8186ef3e9080 ffff8186ce4887c0 ffff8186ea1cb080 ffff8186ee7f97c0
KD: Scanning for held locks....................................
Resource @ 0xffff8186e9bd3d40 Shared 1 owning threads
Threads: ffff8186ea99a7c0-01<*>
KD: Scanning for held locks..........
12517 total locks, 2 locks currently held
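On a system with thousands of locks, it can be useful to pick out the contended resources and their owners automatically. The sketch below is one possible way to do that in Python, assuming the !locks output was saved as text in the layout shown above. The helper names parse_locks and thread_commands are hypothetical; only the !thread command they emit comes from the debugger.

```python
import re

HEX_ADDR = r"[0-9a-fA-F]{8,16}"

def parse_locks(text):
    """Parse saved `!locks` output into a list of per-resource dicts."""
    resources, current, in_waiters = [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Resource @ (0x[0-9a-fA-F]+)\s+(.*)", line)
        if m:
            current = {"address": m.group(1), "state": m.group(2),
                       "contention": 0, "exclusive_waiters": 0,
                       "owners": [], "waiters": []}
            resources.append(current)
            in_waiters = False
            continue
        if current is None:
            continue  # banner lines before the first resource
        if line.startswith("Threads Waiting On"):
            in_waiters = True
            continue
        if in_waiters and re.fullmatch(r"(?:%s\s*)+" % HEX_ADDR, line):
            current["waiters"] += re.findall(HEX_ADDR, line)
            continue
        in_waiters = False
        m = re.match(r"Contention Count = (\d+)", line)
        if m:
            current["contention"] = int(m.group(1))
        m = re.match(r"NumberOfExclusiveWaiters = (\d+)", line)
        if m:
            current["exclusive_waiters"] = int(m.group(1))
        m = re.match(r"Threads:\s*(.*)", line)
        if m:
            # Owner entries look like ffff8186ee782080-01<*>; keep the address.
            current["owners"] = re.findall(r"(%s)-\d+" % HEX_ADDR, m.group(1))
    return resources

def thread_commands(resources):
    """`!thread` commands for owners of resources that have waiters,
    most contended resource first."""
    ordered = sorted(resources, key=lambda r: r["contention"], reverse=True)
    return ["!thread " + t for r in ordered if r["waiters"] for t in r["owners"]]

SAMPLE = """
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks........
Resource @ 0xffff8186e25a6d90 Exclusively owned
Contention Count = 123641
NumberOfExclusiveWaiters = 5
Threads: ffff8186ee782080-01<*>
Threads Waiting On Exclusive Access:
ffff8186edf95080 ffff8186ef3e9080 ffff8186ce4887c0 ffff8186ea1cb080 ffff8186ee7f97c0
KD: Scanning for held locks........
Resource @ 0xffff8186e9bd3d40 Shared 1 owning threads
Threads: ffff8186ea99a7c0-01<*>
12517 total locks, 2 locks currently held
"""
```

For the sample, thread_commands(parse_locks(SAMPLE)) suggests inspecting the single owner thread that five other threads are waiting on.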
Dead loops can be identified by examining the instruction pointer and the stack, by setting breakpoints, or by stepping through the code with the Step Over command.
Kernel Debug Transport Configuration Issues
A software trap combined with a misconfiguration of the debug transport methods might cause Windows* to wait for the kernel debugger to connect instead of generating a BSOD and a crashdump, giving the appearance of an unresponsive black screen hang.
Here are some examples of such configuration issues:
- Network Kernel Debugging is configured, but a supported NIC is not installed in the system
- Kernel-Mode Debugging over a 1394 (FireWire) Cable is configured, but a FireWire controller is not installed in the system
- Kernel-Mode USB Debugging is configured, but it conflicts with Intel® DCI
This might happen when you are debugging a difficult-to-reproduce issue, and in this case it is important to collect the debug data.
When Windows* is waiting for the kernel debugger connection, at least one thread will have a TrapFrame. The stack looks like this:
fffff880`0c440f50 fffff800`035d0a96 nt!ObpLookupObjectName+0x588
fffff880`0c441040 fffff800`035aef66 nt!ObOpenObjectByName+0x306
fffff880`0c441110 fffff800`032d2e53 nt!NtQueryAttributesFile+0x145
fffff880`0c4413a0 00000000`7754168a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`0c4413a0)
00000000`0008c9f8 00000000`73a2ae19 ntdll!NtQueryAttributesFile+0xa
00000000`0008ca00 00000000`73a1d18f wow64!whNtQueryAttributesFile+0x91
00000000`0008ca80 00000000`750c2776 wow64!Wow64SystemServiceEx+0xd7
00000000`0008d340 00000000`73a1d286 wow64cpu!ServiceNoTurbo+0x2d
00000000`0008d400 00000000`73a18a90 wow64!RunCpuSimulation+0xa
00000000`0008d450 00000000`739e2c52 wow64!Wow64KiUserCallbackDispatcher+0x204
00000000`0008d7a0 00000000`775411f5 wow64win!whcbfnDWORD+0xe2
00000000`0008e190 00000000`739efe4a ntdll!KiUserCallbackDispatcherContinue (TrapFrame @ 00000000`0008e058)
00000000`0008e218 00000000`739caf02 wow64win!ZwUserMessageCall+0xa
In this case you can restore the register context from the trap information using the .trap [Address] command. For example:
0: kd> .trap 0xfffffffff3a4ea60
ErrCode = 00000000
eax=0000e470 ebx=e1fdb600 ecx=e1753c2c edx=00000011 esi=00740065 edi=e1753be8
eip=8092e20b esp=f3a4ead4 ebp=f3a4eaec iopl=0 nv up ei pl nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010206
nt!ObpLookupDirectoryEntry+0xee:
8092e20b 394608 cmp dword ptr [esi+8],eax ds:0023:0074006d=????????
The next steps depend on the type of issue: search MSDN for the exception type, generate a minidump after restoring the trap context, and run the !analyze -f -v command to analyze the crash further.
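When many threads have to be checked, the TrapFrame addresses can be pulled out of saved stack listings and turned into .trap commands. This is a minimal Python sketch under the assumption that the stack text carries the "(TrapFrame @ address)" annotation shown earlier. The name trap_commands and the kernel_only filter are illustrative, and the filter assumes a 64-bit target where kernel addresses lie in the upper half of the address space.

```python
import re

# Stack listings annotate frames with "(TrapFrame @ <address>)".
TRAP_FRAME = re.compile(r"\(TrapFrame @ ([0-9a-fA-F`]+)\)")

# Assumption: 64-bit target, kernel addresses in the upper canonical half.
KERNEL_BASE = 0xFFFF800000000000

def trap_commands(stack_text, kernel_only=False):
    """Return a `.trap <address>` command for each TrapFrame annotation."""
    commands = []
    for addr in TRAP_FRAME.findall(stack_text):
        value = int(addr.replace("`", ""), 16)
        if kernel_only and value < KERNEL_BASE:
            continue  # skip user-mode trap frames
        commands.append(".trap " + addr)
    return commands

# Two frames taken from the stack shown earlier in this section.
SAMPLE = """
fffff880`0c4413a0 00000000`7754168a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`0c4413a0)
00000000`0008e190 00000000`739efe4a ntdll!KiUserCallbackDispatcherContinue (TrapFrame @ 00000000`0008e058)
"""

for command in trap_commands(SAMPLE, kernel_only=True):
    print(command)
```

The printed commands can be pasted into the debugger one at a time; each one restores the register context for that trap frame.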
Memory Corruption Issues and Invalid Opcodes in Key Windows Processes
The most complex issues to debug are memory corruption issues. In this case the system might crash with seemingly random errors, and because of corrupted PTEs (page table entries) the crash dumps might not contain useful information. The recommendation is to run Driver Verifier on all non-Microsoft drivers. If it doesn't find any violations, run !chkimg. If memory corruption happens in a non-writable area, protected by NX bit, it might be caused by a BIOS issue, a memory controller issue, or malware. For example:
0: kd> !chkimg -lo 50 -d !nt
fffff803ae41f594-fffff803ae41f595 2 bytes - nt!MiDuplicateCloneLeaf+38
[ 80 fa:00 ec ]
fffff803ae42024f-fffff803ae420250 2 bytes - nt!MiExpandPagedPool+83 (+0xcbb)
[ 80 f6:00 ea ]
fffff803ae420461-fffff803ae420462 2 bytes - nt!MiExpandSystemCache+85 (+0x212)
[ 80 f6:00 ea ]
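To compare many mismatches at once, the !chkimg output can be reduced to a list of records. The following Python sketch assumes the two-line layout shown above (an address range with a symbol, then the two byte sequences in brackets). It makes no claim about which side of the colon is the expected value; it only records both sides and their XOR so that repeating bit patterns stand out. The name parse_chkimg is hypothetical.

```python
import re

RANGE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+)\s+(\d+) bytes? - (\S+)")
BYTES = re.compile(r"^\[\s*([0-9a-fA-F ]+):([0-9a-fA-F ]+?)\s*\]$")

def parse_chkimg(text):
    """Return one dict per mismatch reported in saved `!chkimg` output."""
    mismatches, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        m = RANGE.match(line)
        if m:
            current = {"start": int(m.group(1), 16),
                       "size": int(m.group(3)),
                       "symbol": m.group(4)}
            continue
        m = BYTES.match(line)
        if m and current is not None:
            left = bytes.fromhex(m.group(1))
            right = bytes.fromhex(m.group(2))
            current["left"], current["right"] = left, right
            # XOR shows which bits differ between the two byte sequences.
            current["xor"] = bytes(a ^ b for a, b in zip(left, right))
            mismatches.append(current)
            current = None
    return mismatches

SAMPLE = """
fffff803ae41f594-fffff803ae41f595 2 bytes - nt!MiDuplicateCloneLeaf+38
[ 80 fa:00 ec ]
fffff803ae42024f-fffff803ae420250 2 bytes - nt!MiExpandPagedPool+83 (+0xcbb)
[ 80 f6:00 ea ]
fffff803ae420461-fffff803ae420462 2 bytes - nt!MiExpandSystemCache+85 (+0x212)
[ 80 f6:00 ea ]
"""

for item in parse_chkimg(SAMPLE):
    print(item["symbol"], item["xor"].hex())
```

In this sample all three mismatches differ in the same bit of their first byte, which is the kind of regularity that is easier to notice in a summary than in the raw listing.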
Bug Check 0x124: WHEA, NMI Interrupt, Machine Check
The most interesting issues to analyze are the hardware issues that lead to unrecoverable errors. In this case it is not guaranteed that the system will fail with the 0x124 error. It also might not be able to write the crashdump to the disk. The system might freeze after the second NMI interrupt, but before the BSOD screen is shown. In such a case, first run !analyze -v to confirm that the issue is an uncorrectable hardware error. Next, run the !whea and !errrec extensions to obtain the crash details. Here is an example:
1: kd> !whea
Error Source Table @ fffff801a632f4a0
5 Error Sources
Error Source 0 @ ffff8200b72b0bd0
Notify Type : {b7f99bd0-8200-ffff-a8f4-32a601f8ffff}
Type : 0x4 (PCIe)
Error Count : 1
Record Count : 1
Record Length : 750
Error Records : wrapper @ ffff8200b733a010 record @ ffff8200b733a038
Descriptor : @ ffff8200b72b0c29
Length : 3cc
Max Raw Data Length : d0
Num Records To Preallocate : 1
Max Sections Per Record : 3
Error Source ID : 0
Flags : 00000000
Error Source 1 @ ffff8200b7f99bd0
Notify Type : {b7f9cbd0-8200-ffff-d00b-2bb70082ffff}
Type : 0x0 (MCE)
Error Count : 0
Record Count : 4
Record Length : 728
Error Records : wrapper @ ffff8200b91f3000 record @ ffff8200b91f3028
: wrapper @ ffff8200b91f3728 record @ ffff8200b91f3750
: wrapper @ ffff8200b91f3e50 record @ ffff8200b91f3e78
: wrapper @ ffff8200b91f4578 record @ ffff8200b91f45a0
........
Error Source 4 @ ffff8200b1a7fb60
Notify Type : {b7f65bd0-8200-ffff-d0cb-f9b70082ffff}
Type : 0x3 (NMI)
Error Count : 0
Record Count : 1
Record Length : 6c0
Error Records : wrapper @ ffff8200b7f91940 record @ ffff8200b7f91968
Descriptor : @ ffff8200b1a7fbb9
Length : 3cc
Max Raw Data Length : 100
Num Records To Preallocate : 1
Max Sections Per Record : 3
Error Source ID : 3
Flags : 00000000
Passing the record address reported for Error Source 0 to !errrec gives:
1: kd> !errrec ffff8200b733a038
===============================================================================
Common Platform Error Record @ ffff8200b733a038
-------------------------------------------------------------------------------
Record Id : 01d22bad8598f73b
Severity : Fatal (1)
Length : 672
Creator : Microsoft
Notify Type : PCI Express Error
Timestamp : 10/21/2016 15:13:03 (UTC)
Flags : 0x00000000
===============================================================================
Section 0 : PCI Express
-------------------------------------------------------------------------------
Descriptor @ ffff8200b733a0b8
Section @ ffff8200b733a148
Offset : 272
Length : 208
Flags : 0x00000001 Primary
Severity : Fatal
Port Type : Root Port
Version : 1.1
Command/Status: 0x4010/0x0504
Device Id :
VenId:DevId : 8086:a296
Class code : 030400
Function No : 0x00
Device No : 0x1c
Segment : 0x0000
Primary Bus : 0x00
Second. Bus : 0x00
Slot : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ ffff8200b733a17c
Device Caps : 00008001 Role-Based Error Reporting: 1
Device Ctl : 0007 ur FE NF CE
Dev Status : 0014 ur FE nf ce
Root Ctl : 0008 fs nfs cs
AER Information @ ffff8200b733a1b8
Uncorrectable Error Status : 00000010 ur ecrc mtlp rof uc ca cto fcp ptlp sd DLP und
Uncorrectable Error Mask : 00010000 ur ecrc mtlp rof UC ca cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00060011 ur ecrc MTLP ROF uc ca cto fcp ptlp sd DLP UND
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 00002000 ADV rtto rnro dllp tlp re
Caps & Control : 00000004 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 00,00,00
The output of these two commands contains enough information to find the actual problem. In this example, a PCI Express (PCIe) Advanced Error Reporting (AER) structure is provided for device 8086:a296 (South Bridge); flags that are set appear in upper case in the output. From the PCIe documentation, it appears that the 0x124 BSOD is triggered by the “Data Link Protocol Error” uncorrectable error (UCE). Further analysis can then be done by the PCIe team.
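The raw register values in the AER section can also be decoded by hand. The Python sketch below maps the bits of the Uncorrectable Error Status register to their names, using the bit positions defined in the PCI Express Base Specification and the abbreviations printed by !errrec. The table and function names are illustrative and not part of any debugger API.

```python
# Bit positions of the AER Uncorrectable Error Status, Mask and Severity
# registers (PCI Express Base Specification). The short names follow the
# abbreviations used in the !errrec output.
UNCORRECTABLE_BITS = {
    0: ("und", "Undefined"),
    4: ("dlp", "Data Link Protocol Error"),
    5: ("sd", "Surprise Down Error"),
    12: ("ptlp", "Poisoned TLP"),
    13: ("fcp", "Flow Control Protocol Error"),
    14: ("cto", "Completion Timeout"),
    15: ("ca", "Completer Abort"),
    16: ("uc", "Unexpected Completion"),
    17: ("rof", "Receiver Overflow"),
    18: ("mtlp", "Malformed TLP"),
    19: ("ecrc", "ECRC Error"),
    20: ("ur", "Unsupported Request Error"),
}

def decode_uncorrectable(value):
    """Return the names of all error bits set in `value`, lowest bit first."""
    return [name
            for bit, (_, name) in sorted(UNCORRECTABLE_BITS.items())
            if value & (1 << bit)]

# Values taken from the AER Information section above.
print(decode_uncorrectable(0x00000010))  # Uncorrectable Error Status
print(decode_uncorrectable(0x00010000))  # Uncorrectable Error Mask
print(decode_uncorrectable(0x00060011))  # Uncorrectable Error Severity
```

For the status value 0x00000010 only bit 4 is set, which corresponds to the DLP flag shown in upper case in the !errrec output.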
Conclusion
While debugging Windows* black screen hangs and crashes is a difficult task, Intel® Debug Extensions for WinDbg*, included with Intel® System Debugger, simplifies the debug process by providing an Intel® DCI connection method to an otherwise unresponsive Windows* target.