这是我自己机器上遇到的一个问题.
问题描述:Windows Xp在启动过程中,进入Welcome screen之前,一切正常. Welcome screen出来后,一般来说,第一步操作当然是移动鼠标到用户的图标上,然后点击,进行登录. 现在移动鼠标正常,但是,当鼠标移动到用户的图标上面去的时候,正常情况下,图标应该会由暗变亮,但是,现在鼠标在上面的时候,没有任何反应.此时再继续移动鼠标和击键,桌面仍然没有反应,持续鼠标键若干次(总共约是6次),最后一次,出beep音后,鼠标指针也hang住,没有反应.至此系统就完全hang了.
以上情况只是偶尔出现,不是一个必然现象.
分析:由于user input到最后都被hang住了, local debug就不方便了. 先考虑用键盘手动制造一个bugcheck,当然,这要正常登录进系统,改了注册表,下次再hang的时候就可以capture一个full memory dump. 这只要加入一个DWORD注册表值HKLM\System\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScroll,并将其设置为1,下次hang的时候,按下Ctrl,同时候再按两个ScrollLock,就可以得到dump了.
得到full memory dump, 第一步使用!analyze,但并没有得到很有用的信息, 仅提示MANUALLY_INITIATED_CRASH.
由于登录UI是由winlogon.exe维护的,接下来是看Winlogon.exe进程的message loop, 从这里看,似乎在正常的,它正在等待input message.到这里似乎一下子无从下手了.
接下来考虑,从键盘输入然后到Winlogon.exe接收到WM_KEYDOWN之类的消息这个过程,中间有哪些地方可能被卡住了.能手动制造crash,说明ISR还是能响应的. 鼠标消息基本上是从ISR到DPC到Desktop Thread再到RIT(Raw Input Thread)最后发到线程message queue,下面先检查Desktop Thread和RIT.
// stack of Raw Input Thread
THREAD 82e5c560 Cid 02b4.02d8 Teb: 7ffd8000 Win32Thread: e1a20710 WAIT: (Executive) KernelMode Non-Alertable 82e6cfc0 SynchronizationEvent 82e5c650 NotificationTimer IRP List: 83192bb0: (0006,0190) Flags: 00000970 Mdl: 00000000 Not impersonating DeviceMap e1005450 Owning Process 82eb3020 Image: csrss.exe Attached Process N/A Image: N/A Wait Start TickCount 4307 Ticks: 229 (0:00:00:03.578) Context Switch Count 848 LargeStack UserTime 00:00:00.000 KernelTime 00:00:00.000 Start Address winsrv!StartCreateSystemThreads (0x764e7cdf) Stack Init f828f000 Current f828e9e8 Base f828f000 Limit f828c000 Call 0 Priority 15 BasePriority 13 PriorityDecrement 0 DecrementCount 0 ChildEBP RetAddr f828ea00 804dd0f7 nt!KiSwapContext+0x2e f828ea0c 804dd143 nt!KiSwapThread+0x46 f828ea34 80517972 nt!KeWaitForSingleObject+0x1c2 f828ea70 804e589e nt!ExpWaitForResource+0xd2 f828ea80 bf800aee nt!ExAcquireResourceExclusiveLite+0x6f f828ea8c bf87b1d7 win32k!EnterCrit+0x14 f828ead4 bf895584 win32k!ProcessQueuedMouseEvents+0x23 f828ed30 bf879fa3 win32k!RawInputThread+0x5b9 f828ed40 bf80108a win32k!xxxCreateSystemThreads+0x60 f828ed54 804df7ec win32k!NtUserCallOneParam+0x23 f828ed54 7c92eb94 nt!KiFastCallEntry+0xf8 0075ffe0 764eba1a ntdll!KiFastSystemCallRet
// stack of Desktop Thread. THREAD 82e6c940 Cid 02b4.02dc Teb: 7ffd7000 Win32Thread: e1a00870 WAIT: (Executive) KernelMode Non-Alertable 82e6cfc0 SynchronizationEvent 82e6ca30 NotificationTimer IRP List: 83186728: (0006,0190) Flags: 00000970 Mdl: 00000000 82e51278: (0006,0190) Flags: 00000970 Mdl: 00000000 Not impersonating DeviceMap e1005450 Owning Process 82eb3020 Image: csrss.exe Attached Process N/A Image: N/A Wait Start TickCount 4373 Ticks: 163 (0:00:00:02.546) Context Switch Count 799 LargeStack UserTime 00:00:00.000 KernelTime 00:00:00.015 Start Address winsrv!StartCreateSystemThreads (0x764e7cdf) Stack Init f829f000 Current f829e834 Base f829f000 Limit f829c000 Call 0 Priority 15 BasePriority 13 PriorityDecrement 0 DecrementCount 16 ChildEBP RetAddr f829e84c 804dd0f7 nt!KiSwapContext+0x2e f829e858 804dd143 nt!KiSwapThread+0x46 f829e880 80517972 nt!KeWaitForSingleObject+0x1c2 f829e8bc 804e589e nt!ExpWaitForResource+0xd2 f829e8cc bf800aee nt!ExAcquireResourceExclusiveLite+0x6f f829e8d8 bf90fa59 win32k!EnterCrit+0x14 f829e928 bf87b52a win32k!UserBeep+0xd4 f829e938 bf87bfd7 win32k!QueueMouseEvent+0x54 f829e97c bf87aef9 win32k!ProcessMouseInput+0x1b1 f829e98c 804f0b50 win32k!InputApc+0x4e f829e9d4 804efed9 nt!KiDeliverApc+0x124 f829e9ec 804e4b7d nt!KiSwapThread+0x64 f829ea24 bf807c60 nt!KeWaitForMultipleObjects+0x284 f829ea5c bf890fec win32k!xxxMsgWaitForMultipleObjects+0xb0 f829ed30 bf87a011 win32k!xxxDesktopThread+0x333 f829ed40 bf80108a win32k!xxxCreateSystemThreads+0x6a f829ed54 804df7ec win32k!NtUserCallOneParam+0x23 f829ed54 7c92eb94 nt!KiFastCallEntry+0xf8 0079ffe0 764eba1a ntdll!KiFastSystemCallRet
从上面stack可以看到,DeskTop thread和RIT都在wait,并且都在等待win32k!EnterCrit这个critical section. 由于这时候没有什么I/O(从指示灯看的),
从这里可以猜测系统里面发生了deadlock.下面再确认是不是这样,
kd> !locks**** DUMP OF ALL RESOURCE OBJECTS ****KD: Scanning for held locks....
Resource @ nt!CmpRegistryLock (0x80558360) Exclusively owned Contention Count = 41 NumberOfSharedWaiters = 6 Threads: 82e1cb28-01<*> 837b6020-01 82e06ae0-01 82e7c020-01 837b87a8-01 82dd84f0-01 82eb2020-01 KD: Scanning for held locks.
Resource @ 0x837b27a0 Exclusively owned Contention Count = 2 NumberOfExclusiveWaiters = 1 Threads: 837b6020-01<*> Threads Waiting On Exclusive Access: 82e1cb28
KD: Scanning for held locks...............
Resource @ 0x82eb27a0 Exclusively owned Contention Count = 174 NumberOfExclusiveWaiters = 2 Threads: 82eb2020-01<*> Threads Waiting On Exclusive Access: 82e5c560 82e6c940
KD: Scanning for held locks........................
Resource @ 0x82dff808 Shared 1 owning threads Threads: 837b8c9b-01<*> *** Actual Thread 837b8c98KD: Scanning for held locks....
Resource @ 0x82ec4a50 Shared 1 owning threads Threads: 837b8c9b-01<*> *** Actual Thread 837b8c98KD: Scanning for held locks.
Resource @ 0x831e0b88 Shared 1 owning threads Threads: 831864cb-01<*> *** Actual Thread 831864c81440 total locks, 6 locks currently held
从上面的输出很容易找到,thread(82e1cb28)(后面简称thread1)和thread(837b6020)(后面简称thread2)确实发生了deadlock,两个线程都以exclusive的方式分别持有nt!CmpRegistryLock (0x80558360) (后称lock1)和0x837b27a0(后称lock2), 同时,thread1请求lock2的exclusive access,thread2请求lock1的share access.
下面再看这两个线程的stack,//thread1kd> !thread 82e1cb28THREAD 82e1cb28 Cid 03a4.03a8 Teb: 7ffdf000 Win32Thread: e1b90540 WAIT: (Executive) KernelMode Non-Alertable 82df2f00 SynchronizationEvent 82e1cc18 NotificationTimerNot impersonatingDeviceMap e1005450Owning Process 82e1cda0 Image: logonui.exeAttached Process N/A Image: N/AWait Start TickCount 4471 Ticks: 65 (0:00:00:01.015)Context Switch Count 1399 LargeStackUserTime 00:00:00.031KernelTime 00:00:01.296Win32 Start Address logonui!WinMainCRTStartup (0x0100b95d)Start Address kernel32!BaseProcessStartThunk (0x7c810665)Stack Init f77a7000 Current f77a691c Base f77a7000 Limit f77a2000 Call 0Priority 14 BasePriority 8 PriorityDecrement 6 DecrementCount 16ChildEBP RetAddr Args to Child f77a6934 804dd0f7 82e1cb98 82e1cb28 804dd143 nt!KiSwapContext+0x2ef77a6940 804dd143 00000000 837b27a0 82e1cb28 nt!KiSwapThread+0x46f77a6968 80517972 00000000 00000000 00000000 nt!KeWaitForSingleObject+0x1c2f77a69a4 804e589e e1c002a0 e1c002b8 f77a6a58 nt!ExpWaitForResource+0xd2f77a69b4 8056510e 837b27a0 00000001 82de9390 nt!ExAcquireResourceExclusiveLite+0x6ff77a6a58 805652ad 00000000 82e1cda0 e1c002b8 nt!ObpIncrementHandleCount+0x5bf77a6ac0 80568527 00000000 e1c002b8 837b2680 nt!ObpCreateHandle+0x17df77a6b10 80568dfd 00000000 837b2680 ffffff00 nt!ObOpenObjectByName+0x28cf77a6be4 804df7ec f77a6ca4 0002001f f77a6c7c nt!NtOpenKey+0x1aff77a6be4 804de019 f77a6ca4 0002001f f77a6c7c nt!KiFastCallEntry+0xf8f77a6c68 806573f2 f77a6ca4 0002001f f77a6c7c nt!ZwOpenKey+0x11f77a6ca8 80653b84 e1d80b60 00000020 e1c5ffb8 nt!CmpRemoveFromHiveFileList+0x48f77a6cc0 8064d4a3 e1d80b60 00000020 e1664bd8 nt!CmUnloadKey+0x55f77a6d58 804df7ec 0007ebc0 0007ebd8 7c92eb94 nt!NtUnloadKey+0x18cf77a6d58 7c92eb94 0007ebc0 0007ebd8 7c92eb94 nt!KiFastCallEntry+0xf80007ebb0 7c92e918 77e0bbc0 0007ebc0 00000018 ntdll!KiFastSystemCallRet0007ebb4 77e0bbc0 0007ebc0 00000018 00000250 ntdll!NtUnloadKey+0xc0007ebd8 77e0527a 00000250 0007ebf0 0007ec48 ADVAPI32!LocalBaseRegUnLoadKey+0x510007ebfc 73ce9a58 80000003 000a4b68 00000003 ADVAPI32!RegUnLoadKeyW+0x680007ec28 73ce5fd9 000000d8 00000009 000a07d8 shgina!CUserProfile::~CUserProfile+0x390007ec88 73ce6ac9 0007ecd8 77d19e36 0067f770 shgina!CLogonUser::_GetUnreadMail+0xfc0007eca4 73ce6af8 01001d8c 0007ecd8 00000000 shgina!CLogonUser::_UserSettingAccessor+0x2260007ecb8 0100a492 000c7834 01001d8c 0007ecd8 shgina!CLogonUser::get_setting+0x180007fd10 0100ac72 00000001 000003f6 00000001 logonui!LogonAccount::UpdateNotifications+0x2090007fd2c 0100784a 00000001 0007fdcc 01007538 logonui!LogonFrame::UpdateUserStatus+0x650007fd64 77d18734 00010038 00000113 000003f6 logonui!LogonWindowProc+0x3120007fd90 77d18816 01007538 00010038 00000113 USER32!InternalCallWinProc+0x280007fdf8 77d189cd 00000000 01007538 00010038 USER32!UserCallWinProcCheckWow+0x1500007fe58 77d18a10 0007fe7c 00000000 0007fe98 USER32!DispatchMessageWorker+0x3060007fe68 01017799 0007fe7c 77d1d8a4 00000000 USER32!DispatchMessageW+0xf0007fe98 0100b672 7c80b6a1 000a231c 00000000 logonui!DirectUI::StartMessagePump+0x480007ff1c 0100bad1 01000000 00000000 00653ac0 logonui!WinMain+0x32d0007ffc0 7c816fd7 00166860 0006e9fc 7ffd8000 logonui!WinMainCRTStartup+0x1740007fff0 00000000 0100b95d 00000000 78746341 kernel32!BaseProcessStart+0x23
//thread2kd> !thread 837b6020THREAD 837b6020 Cid 0004.003c Teb: 00000000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable 82e7b988 Semaphore Limit 0x7fffffff 837b6110 NotificationTimerNot impersonatingDeviceMap e1005450Owning Process 837b9660 Image: SystemAttached Process N/A Image: N/AWait Start TickCount 4469 Ticks: 67 (0:00:00:01.046)Context Switch Count 750 UserTime 00:00:00.000KernelTime 00:00:00.062Start Address nt!ExpWorkerThread (0x804e5196)Stack Init f84d3000 Current f84d244c Base f84d3000 Limit f84d0000 Call 0Priority 14 BasePriority 12 PriorityDecrement 2 DecrementCount 16*** ERROR: Module load completed but symbols could not be loaded for sptd.sysChildEBP RetAddr Args to Child f84d2464 804dd0f7 837b6090 837b6020 804dd143 nt!KiSwapContext+0x2ef84d2470 804dd143 00000000 80558360 837b6020 nt!KiSwapThread+0x46f84d2498 80517972 00000000 00000000 00000000 nt!KeWaitForSingleObject+0x1c2f84d24d4 804fb40f f84d2590 837b2680 f84d2584 nt!ExpWaitForResource+0xd2f84d24e8 80567012 80558360 00000001 8056c74c nt!ExAcquireResourceSharedLite+0xb2f84d24f4 8056c74c f84d2590 837b2680 f84d2584 nt!CmpLockRegistry+0x27f84d252c 8056c8a1 e164e250 00000001 f84d2560 nt!CmpSecurityMethod+0x1ef84d2568 80566006 e164e250 f84d2590 e164e24c nt!ObGetObjectSecurity+0x96f84d2594 80565166 e164e250 f84d26e8 00000001 nt!ObCheckObjectAccess+0x29f84d2644 805652ad 00000001 837b9660 e164e250 nt!ObpIncrementHandleCount+0x1f7f84d26ac 8056cc49 00000001 e164e250 00000000 nt!ObpCreateHandle+0x17df84d277c f7f7f309 e164e250 00000040 00000000 nt!ObOpenObjectByPointer+0xa4WARNING: Stack unwind information not available. Following frames may be wrong.f84d2848 80567b09 e101df24 00000300 e101df28 sptd+0x8309f84d28e4 f7f9165c f84d290c f84d2914 00000000 nt!ObpDecrementHandleCount+0x121f84d2958 8054bb38 0000005c f7fd59e0 00000001 sptd+0x1a65cf84d29bc f7f98cb2 ed952000 00068000 f84d2a7c nt!ExFreePoolWithTag+0x676f84d2aa8 8062d1c9 f84d2cf0 00000000 f84d2b44 sptd+0x21cb2f84d2ac8 805f16ce f84d2cf0 00000000 f84d2b44 nt!PsCallImageNotifyRoutines+0x36f84d2c6c 805a4d14 f84d2cf0 00000000 00000000 nt!MmLoadSystemImage+0x999f84d2d4c 805a51fc 00000308 00000001 00000000 nt!IopLoadDriver+0x370f84d2d74 804e526b 00000308 00000000 837b6020 nt!IopLoadUnloadDriver+0x45f84d2dac 8057e0f1 f7816cf4 00000000 00000000 nt!ExpWorkerThread+0x100f84d2ddc 804f927a 804e5196 00000001 00000000 nt!PspSystemThreadStartup+0x3400000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
kd> dS f84d2cf0e1c04290 "\??\C:\WINDOWS\system32\Drivers\"e1c042d0 "vmx86.sys"
可thread1是logonui.exe的主线程,正在执行ntdll!NtUnloadKey,thread2是一个System thread,正在load driver,从其参数可以找到,正在load的driver就是vmx86.sys,在load的过程中会调用registered image notify functions,应该是sptd.sys注册了这个function,所以进入了sptd.sys执行. 在sptd.sys里会调用nt!ObOpenObjectByPointer,这个方法再去调用nt!CmpLockRegistry对registry进行lock,但是这个lock已经被thread1以exclusive的方式占用了, 而thread1也在等待thread2以exclusive方式持有的与Object相关的lock, 这里就发生了deadlock.
下面如果再要继续找原因,可以参考WRK的code,几个lock都发生在了nt!ObpIncrementHandleCount里面.
到这里解决方法也就比较明显了,问题是由sptd.sys导致的,可以search到sptd.sys是一个光驱软件的驱动,我机器上装了alcohol 120%,卸载它,应该可以解决这个问题,另外还可以尝试升级这个软件.
文中肯定有不足或者不对的分析,欢迎补充,指正.