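I recently ran into a very strange crash. The application is a 32-bit process running on a 64-bit Windows Server 2008 machine, and it crashes after running for a while. The post-mortem debugger is set to ADPlus, and the dump it captured is a 1st chance mini dump. The ADPlus log file records the following: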
--- 1st chance Process_Shut_Down exception ----
---------------------------------------------------------------
This process is shutting down!
This can happen for the following reasons
1) Someone killed the process with Task Manager or the kill command
2.) If this process is an MTS or COM+ server package, it could be
* exiting because an MTS/COM+ server package idle limit was reached.
3.) If this process is an MTS or COM+ server package,
* someone may have shutdown the package via the MTS Explorer or
* Component Services MMC snap-in.
4.) If this process is an MTS or COM+ server package,
* MTS or COM+ could be shutting down the process because an internal
* error was detected in the process (MTS/COM+ fail fast condition).
---------------------------------------------------------------
Occurrence happened at:
Debug session time: Wed Jan 4 15:22:46.633 2012 (GMT+8)
System Uptime: 8 days 0:59:56.555
Process Uptime: 0 days 0:09:04.503
  Kernel time: 0 days 0:01:18.453
  User time: 0 days 0:02:29.875
All thread stacks below ---
.101 Id: 1d98.ce0 Suspend: 0 Teb: 7ee89000 Unfrozen
 # ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
00 1835ff88 75a5339a 0060bb58 1835ffd4 77079ed2 ntdll!ZwWaitForWorkViaWorkerFactory+0x12
01 1835ff94 77079ed2 0060bb58 4640ad42 00000000 kernel32!BaseThreadInitThunk+0x12
02 1835ffd4 77079ea5 77086679 0060bb58 ffffffff ntdll!RtlInitializeExceptionChain+0x63
03 1835ffec 00000000 77086679 0060bb58 00000000 ntdll!RtlInitializeExceptionChain+0x36
Creating c:\\Crash_Mode__Date_01-04-2012__Time_15-15-1717\PID-7576__SERVMYHOST.EXE__1st_chance_Process_Shut_Down__full_214c_2012-01-04_15-22-46-637_1d98.dmp - mini user dump
Dump successfully written
=========================
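After opening the dump file in WinDbg, I found that there really is only one thread, just as the log shows, and its frames all look like perfectly normal system calls. The output of !analyze -v is: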
FAULTING_IP:
+1902faf00a4df5c
00000000 ?? ???
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 00000000
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 0
FAULTING_THREAD: 00000ce0
DEFAULT_BUCKET_ID: STATUS_BREAKPOINT
PROCESS_NAME: servMyHost.exe
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid
MOD_LIST: <ANALYSIS/>
NTGLOBALFLAG: 400
APPLICATION_VERIFIER_FLAGS: 0
PRIMARY_PROBLEM_CLASS: STATUS_BREAKPOINT
BUGCHECK_STR: APPLICATION_FAULT_STATUS_BREAKPOINT
LAST_CONTROL_TRANSFER: from 7708471e to 77061f36
STACK_TEXT:
1835fe28 7708471e 000001c0 1835fedc 4640ad1e ntdll!ZwWaitForWorkViaWorkerFactory+0x12
1835ff88 75a5339a 0060bb58 1835ffd4 77079ed2 ntdll!TppWorkerThread+0x216
1835ff94 77079ed2 0060bb58 4640ad42 00000000 kernel32!BaseThreadInitThunk+0xe
1835ffd4 77079ea5 77086679 0060bb58 ffffffff ntdll!__RtlUserThreadStart+0x70
1835ffec 00000000 77086679 0060bb58 00000000 ntdll!_RtlUserThreadStart+0x1b
STACK_COMMAND: ~0s; .ecxr ; kb
FOLLOWUP_IP:
ntdll!ZwWaitForWorkViaWorkerFactory+12
77061f36 83c404 add esp,4
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: ntdll!ZwWaitForWorkViaWorkerFactory+12
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: ntdll
IMAGE_NAME: ntdll.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4ce7ba58
FAILURE_BUCKET_ID: STATUS_BREAKPOINT_80000003_ntdll.dll!ZwWaitForWorkViaWorkerFactory
BUCKET_ID: APPLICATION_FAULT_STATUS_BREAKPOINT_ntdll!ZwWaitForWorkViaWorkerFactory+12
WATSON_STAGEONE_URL: http://watson.microsoft.com/StageOne/servIMSHost_exe/1_51_0_1026/4efda2a8/unknown/0_0_0_0/bbbbbbb4/80000003/00000000.htm?Retriage=1
Followup: MachineOwner
---------
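The strangest part is that the exception is of type Breakpoint. As far as I know, a breakpoint (int 3) exception only shows up on an assert or when a debugger breaks in, but asserts have no effect in a release build, and no debugger was in use...
The same problem has occurred many times. Every dump contains only one thread; the call stacks differ, but they all look normal. Can any expert explain this?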
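For reading the exception code in the !analyze output, it may help to split 0x80000003 into the standard NTSTATUS bit fields. The following is only a minimal sketch; the function name decode_ntstatus is made up for illustration, and the field layout (2-bit severity, customer bit, 12-bit facility, 16-bit code) is the documented NTSTATUS layout.

```python
def decode_ntstatus(code):
    """Split a 32-bit NTSTATUS value into its bit fields."""
    severity_names = ["SUCCESS", "INFORMATIONAL", "WARNING", "ERROR"]
    return {
        "severity": severity_names[(code >> 30) & 0x3],  # bits 31-30
        "customer": bool((code >> 29) & 0x1),            # bit 29
        "facility": (code >> 16) & 0xFFF,                # bits 27-16
        "code": code & 0xFFFF,                           # bits 15-0
    }


# STATUS_BREAKPOINT, the code reported in the dump
print(decode_ntstatus(0x80000003))
# {'severity': 'WARNING', 'customer': False, 'facility': 0, 'code': 3}
```

So 0x80000003 carries WARNING severity, unlike a real fault such as an access violation (0xC0000005), which decodes with ERROR severity. That would be consistent with the breakpoint being a debugger-side notification rather than the actual cause of the exit, though the dump alone does not prove it.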
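The ADPlus script has a clear shortcoming in how it handles the process exit event, so it misses the best moment to write the dump. In principle, the dump should be written when the process exit event is received. With a small change, the process exit path can be captured: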
0:000> k
ChildEBP RetAddr
0018fe70 7785db8c ntdll!NtTerminateProcess+0x12
0018fe8c 76667362 ntdll!RtlExitUserProcess+0x85
0018fea0 76a336dc kernel32!ExitProcessStub+0x12
0018feac 76a33371 msvcrt!__crtExitProcess+0x17
0018fee4 76a336bb msvcrt!_cinit+0xea
*** WARNING: Unable to verify checksum for BadBoy.exe
0018fef8 00402176 msvcrt!exit+0x11
0018ff88 76663677 BadBoy!WinMainCRTStartup+0x13e
0018ff94 77849d72 kernel32!BaseThreadInitThunk+0xe
0018ffd4 77849d45 ntdll!__RtlUserThreadStart+0x70
0018ffec 00000000 ntdll!_RtlUserThreadStart+0x1b
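One way to read a stack like this is to walk up from NtTerminateProcess until the first frame outside the system modules, since that frame is the application code that asked for the exit. A rough sketch of that triage follows; the SYSTEM_MODULES set and the function name first_app_frame are assumptions made for illustration, and the frames are copied from the k output of the BadBoy test program.

```python
import re

# Frames from the "k" output (ChildEBP, RetAddr, module!symbol+offset)
STACK = """\
0018fe70 7785db8c ntdll!NtTerminateProcess+0x12
0018fe8c 76667362 ntdll!RtlExitUserProcess+0x85
0018fea0 76a336dc kernel32!ExitProcessStub+0x12
0018feac 76a33371 msvcrt!__crtExitProcess+0x17
0018fee4 76a336bb msvcrt!_cinit+0xea
*** WARNING: Unable to verify checksum for BadBoy.exe
0018fef8 00402176 msvcrt!exit+0x11
0018ff88 76663677 BadBoy!WinMainCRTStartup+0x13e
0018ff94 77849d72 kernel32!BaseThreadInitThunk+0xe
0018ffd4 77849d45 ntdll!__RtlUserThreadStart+0x70
0018ffec 00000000 ntdll!_RtlUserThreadStart+0x1b
"""

# Assumption: modules treated as "system" for this triage
SYSTEM_MODULES = {"ntdll", "kernel32", "msvcrt"}

FRAME_RE = re.compile(
    r"^[0-9a-f]{8} [0-9a-f]{8} (?P<module>\w+)!(?P<symbol>\w+)\+0x[0-9a-f]+$"
)


def first_app_frame(stack_text):
    """Return (module, symbol) of the first non-system frame, or None."""
    for line in stack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip warnings and header lines
        if match.group("module").lower() not in SYSTEM_MODULES:
            return match.group("module"), match.group("symbol")
    return None


print(first_app_frame(STACK))  # ('BadBoy', 'WinMainCRTStartup')
```

Here the caller is the CRT startup code of the test program, which is what an ordinary return from the entry point looks like. In the real process the interesting question is which of its own frames shows up in that position.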
The modified script is attached. I changed two places, both related to the process exit event (epr). For reference only...
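I used the new ADPlus script. This time no dump was captured when it crashed, only the log, and its content is the same as before.
I also tried attaching WinDbg and letting it run. After the crash broke into WinDbg, I could still see only one thread, which is strange. For now all I can do is work by elimination: roll back the recent changes one at a time and run again...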
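An automatically saved workspace setting may be interfering. You could try:
1) Delete the automatically saved workspace settings
2) Attach manually with WinDbg, break in, choose Debug > Event Filters, and set the Process Exit event to Enabled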
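The Event Filters step can probably also be scripted instead of clicked through. Below is a hedged sketch that only builds the text, it does not run a debugger. It assumes the console debugger cdb is available, uses sxe with the epr event code to enable the break on process exit, and uses -cf to run a script file at startup. The helper names, the dump file name and the script file name are all made up for illustration, and the quoting of a full path inside the -c string may need adjusting in your environment.

```python
def exit_filter_script(dump_path):
    """Return debugger commands that break on process exit and save a dump."""
    commands = [
        # sxe enables the break for an event; epr is the process exit event.
        # The -c string is run by the debugger when the event fires.
        'sxe -c ".dump /ma {}; ~*kb" epr'.format(dump_path),
        # Resume the target after the filter is set.
        "g",
    ]
    return "\n".join(commands) + "\n"


def cdb_command_line(pid, script_file):
    """Return an argument list that attaches cdb to pid and runs the script."""
    return ["cdb", "-p", str(pid), "-cf", script_file]


# PID 7576 is the one that appears in the ADPlus dump file name above.
print(exit_filter_script("exit.dmp"), end="")
print(" ".join(cdb_command_line(7576, "exit_filter.txt")))
```

Since the other threads are usually gone by the time the process exit event fires, the stack of the remaining thread is the useful part of what this captures.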