How can I know whether an app hung and was then killed by the user, was killed normally by the user, or crashed? - windbg

I want to differentiate between crashing, hanging and a normal kill of an app. For example, we have to set registry keys for WER to create a crash dump, and the OS sends some signal to the process if anything happens, so how can I handle all of this and create a library that would assist in bucketing by crash, hang or simple kill? Is there

I want to differentiate between crashing, hanging and normal kill of an app?
You're missing the following options:
app works normally
app is about to crash, but maybe doesn't
And these two make it really hard to distinguish the states. In order to understand that, you need to know two things:
exception dispatching
how a crash dump is generated
Exception Dispatching
A crash is caused by an exception. But not all exceptions will cause a crash, because exceptions can be handled. Handling of an exception is typically done in a catch{} block.
So, imagine an exception occurs in your application. The following process begins:
If a debugger is attached, ask the debugger whether it wants to react to the exception. This is the first chance for the debugger to do something.
If the debugger did not want to react, check for a catch{} block which might want to react to the exception.
If there was no catch{} block, check for a so-called "unhandled exception handler" which might want to react to the exception.
If still nobody wanted to handle the exception, ask the debugger again. This is now the second chance for the debugger to do something.
If the debugger doesn't do anything, the OS needs to handle the situation. If some WER settings are enabled, it might save a crash dump now. After that, it will terminate the process and free the resources that were allocated by the app.
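The order of the five steps above can be sketched as a toy model. This is only an illustration of who gets asked when; the function name dispatch and its parameters are made up for this sketch and are not part of any Windows API.

```python
def dispatch(*, debugger_first=False, catch_block=False,
             unhandled_handler=False, debugger_second=False,
             wer_dumps=False):
    """Walk the dispatch steps in order; return (who_handled, dump_saved)."""
    if debugger_first:
        # Step 1: the debugger reacts on its first chance.
        return "debugger (first chance)", False
    if catch_block:
        # Step 2: a catch{} block in the app handles the exception.
        return "catch block", False
    if unhandled_handler:
        # Step 3: the unhandled exception handler takes over.
        return "unhandled exception handler", False
    if debugger_second:
        # Step 4: the debugger reacts on its second chance.
        return "debugger (second chance)", False
    # Step 5: nobody handled it, so the OS terminates the process
    # and a dump is saved only if WER is configured to do so.
    return "OS terminates process", wer_dumps


print(dispatch(catch_block=True))
print(dispatch(wer_dumps=True))
```

The point of the model is that a crash dump written by WER only appears at the very end of the chain, after every other party has declined.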
The terms "first chance exception" and "second chance exception" are important.
WinDbg tells you about this:
0:006> g
(2db0.2908): CLR exception - code e0434352 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=0098ebe0 ebx=00000005 ecx=00000005 edx=00000000 esi=0098eca4 edi=00000001
eip=76c44402 esp=0098ebe0 ebp=0098ec3c iopl=0 nv up ei pl nz ac po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000212
KERNELBASE!RaiseException+0x62:
76c44402 8b4c2454 mov ecx,dword ptr [esp+54h] ss:002b:0098ec34=5d02fd68
0:000>
As you can see, this exception is a first chance exception. WinDbg says
First chance exceptions are reported before any exception handling.
This means: the debugger has reacted before any catch{} block was run. And:
This exception may be expected and handled.
This means: the code may have a catch{} block, which does something useful so that the application might not crash.
A second chance exception looks like this:
0:000> g
(3e34.36c0): C++ EH exception - code e06d7363 (first chance)
(3e34.36c0): C++ EH exception - code e06d7363 (!!! second chance !!!)
eax=00daf940 ebx=00000000 ecx=00000003 edx=00000000 esi=00000001 edi=00000000
eip=76c44402 esp=00daf940 ebp=00daf998 iopl=0 nv up ei pl nz ac po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000212
KERNELBASE!RaiseException+0x62:
76c44402 8b4c2454 mov ecx,dword ptr [esp+54h] ss:002b:00daf994=0754642c
As you can see, there was a first chance exception before, but I instructed the debugger not to do anything at that point. The application had neither a catch{} block nor an unhandled exception handler. Without a debugger, this application would crash and terminate.
How crash dumps are created
Crash dumps of a running process are created in much the same way a debugger would create them:
Attach a debugger to the process
Start a new thread
In that thread, force a known exception
When the debugger is informed about the first chance exception, create the crash dump file
The exception that is forced is typically an INT 3 instruction, which is a debugging breakpoint with exception code 0x80000003.
Identifying a crash
You have a crash when there is an exception and the exception cannot be continued.
In WinDbg you can use .exr -1 to get information about the last exception.
0:000> .exr -1
ExceptionAddress: 76c44402 (KERNELBASE!RaiseException+0x00000062)
ExceptionCode: e06d7363 (C++ EH exception)
ExceptionFlags: 00000001
With ExceptionFlags being 1, the exception is non-continuable.
Identifying a potential crash (but maybe it doesn't)
As before, but ExceptionFlags is 0.
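For a bucketing library, the ExceptionFlags check could be automated along these lines. This is a minimal sketch that assumes you have captured the text of .exr -1 somehow; the helper names parse_exr and is_crash are hypothetical, and the sample text is the output shown above.

```python
import re

EXCEPTION_NONCONTINUABLE = 0x1  # the flag value the text refers to


def parse_exr(text):
    """Extract ExceptionCode and ExceptionFlags (hex) from `.exr -1` output."""
    values = {}
    for name in ("ExceptionCode", "ExceptionFlags"):
        match = re.search(rf"{name}:\s*([0-9a-fA-F]+)", text)
        if match is None:
            raise ValueError(f"{name} not found in .exr output")
        values[name] = int(match.group(1), 16)
    return values["ExceptionCode"], values["ExceptionFlags"]


def is_crash(flags):
    """True if the exception is marked non-continuable."""
    return bool(flags & EXCEPTION_NONCONTINUABLE)


sample = """\
ExceptionAddress: 76c44402 (KERNELBASE!RaiseException+0x00000062)
ExceptionCode: e06d7363 (C++ EH exception)
ExceptionFlags: 00000001
"""
code, flags = parse_exr(sample)
print(hex(code), "crash" if is_crash(flags) else "potential crash")
```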
Identifying a kill
This is not easily possible. The OS will terminate the process. There's no exception. You'll typically not have a crash dump of this situation.
However, there are tools that can stop when a process terminates. But there's not much to analyze then. You would identify such a situation by having a look at the call stack:
0:000> k L1
# Child-SP RetAddr Call Site
00 0000003a`d2d3f968 00007fff`3b16a938 ntdll!NtTerminateProcess+0x14
Typically, there is just one thread left:
0:000> ~
. 0 Id: 2078.34ec Suspend: 0 Teb: 0000003a`d2e03000 Unfrozen
App works normally
In this case, the exception code will be 0x80000003, because a breakpoint was injected in order to generate the crash dump.
0:004> .exr -1
ExceptionAddress: 77964120 (ntdll!DbgBreakPoint)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 1
Parameter[0]: 00000000
From the call stack, you typically see that it was injected by a debugger:
0:004> k L2
# ChildEBP RetAddr
00 0666fd34 7799ace9 ntdll!DbgBreakPoint
01 0666fd64 754c6359 ntdll!DbgUiRemoteBreakin+0x39
The main thread is typically doing nothing, i.e. it's waiting for user input:
0:004> ~0k L1
# ChildEBP RetAddr
00 008fef50 6437a188 win32u!NtUserWaitMessage+0xc
App is hanging
A hang looks very much like a normal running app, because the process of generating the crash dump does the same:
0:004> .exr -1
ExceptionAddress: 77964120 (ntdll!DbgBreakPoint)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 1
Parameter[0]: 00000000
0:004> k L2
# ChildEBP RetAddr
00 0666fd34 7799ace9 ntdll!DbgBreakPoint
01 0666fd64 754c6359 ntdll!DbgUiRemoteBreakin+0x39
There are two types of hang: a high CPU hang (the app is busy, maybe in an endless loop) or a low CPU hang (the app has deadlocked).
A high CPU hang can be identified by its call stack: the busy thread will not have a WaitForSingleObject() or WaitForMultipleObjects() call on top of the stack.
A low CPU hang may look exactly like a working app, because it is waiting as well. The only difference is: the working app is waiting for user input (which may occur soon) and the hanging app is waiting for something else (which it may never get, hence the deadlock).
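Putting the sections above together, a first bucketing pass might look like the sketch below. It is a heuristic only: the function name bucket, the WAIT_FRAMES list and the frame myapp!Spin+0x10 are assumptions made for illustration, and the other inputs are the exception code, the exception flags and the top stack frame as shown in the transcripts above.

```python
BREAKPOINT = 0x80000003  # code seen when a breakpoint was injected for the dump
WAIT_FRAMES = ("WaitForSingleObject", "WaitForMultipleObjects", "NtUserWaitMessage")


def bucket(exception_code, exception_flags, top_frame):
    """Return a rough bucket for one dump, based on its top stack frame."""
    if "NtTerminateProcess" in top_frame:
        return "kill"
    if exception_code != BREAKPOINT:
        # A real exception: non-continuable means crash.
        return "crash" if exception_flags & 1 else "potential crash"
    if any(name in top_frame for name in WAIT_FRAMES):
        # Cannot be told apart from the dump alone.
        return "waiting (normal or low CPU hang)"
    return "busy (possible high CPU hang)"


print(bucket(0xE06D7363, 1, "KERNELBASE!RaiseException+0x62"))
print(bucket(0x80000003, 0, "win32u!NtUserWaitMessage+0xc"))
print(bucket(0x80000003, 0, "myapp!Spin+0x10"))  # hypothetical busy frame
```

Telling a normal wait from a low CPU hang needs information the dump does not contain, for example how long the app has been unresponsive, so the sketch deliberately leaves those two in one bucket.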
The reality
The reality can be much more complex, depending on whether .NET is involved, you have multiple UI threads, etc. But IMHO, in a straight-forward app, this approach should work in ~70% of the cases.

Related

Computer Reboots After "sti" Instruction

I am trying to implement interrupts in an x86 operating system project. However, after loading the interrupt descriptor table with lidt, I issue an "sti" instruction and it reboots the computer. I am in protected mode. Any idea what might be happening?
Some things cause exceptions. When the CPU can't start the corresponding exception handler it falls back to a generic "double fault" exception, and when the CPU can't start that exception handler the CPU falls back to a "triple fault" condition which mostly means that the computer is reset.
It's likely that there are pending IRQs (that occurred while interrupts were masked with "cli" and have been waiting for CPU to be ready to receive them); so when you do "sti" the interrupt controller sees the CPU is ready to receive an IRQ now and immediately sends one to the CPU; and likely that the interrupt handler for whichever IRQ the CPU receives is causing an exception (that leads to double fault, that leads to triple fault/reset).
The easiest way to figure out what is happening is to run it under an emulator that tells you what happened in its logs. The alternative is to write usable exception handler/s for any exceptions that are involved (most likely, a general protection fault exception handler); so that the exception handler can give you information about what went wrong (e.g. the "error code" provided by the CPU to the general protection fault handler may indicate which IDT entry the CPU tried to use for the IRQ).
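To read the "error code" mentioned above, it helps to split it into its fields. The following is a small sketch of the selector-style error code layout (EXT in bit 0, IDT in bit 1, TI in bit 2, index in bits 3 to 15); the function name is made up and 0x103 is only a hypothetical value that a handler might print.

```python
def decode_selector_error_code(error_code):
    """Split an x86 selector-style exception error code into its fields."""
    return {
        # EXT: the fault was triggered by an external event such as an IRQ.
        "external": bool(error_code & 0x1),
        # IDT: the index refers to an entry in the IDT.
        "idt": bool(error_code & 0x2),
        # TI: LDT instead of GDT; only meaningful when "idt" is False.
        "ldt": bool(error_code & 0x4),
        # Selector index, or the interrupt vector number when "idt" is True.
        "index": (error_code >> 3) & 0x1FFF,
    }


fields = decode_selector_error_code(0x103)
print(fields)
```

For the hypothetical value 0x103 this reports an external event and IDT entry 32 (0x20), which would point at a missing or invalid gate for that vector.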
Note that during boot the best sequence is to mask all IRQs in the interrupt controller/s, then let firmware handle any pending IRQs (e.g. with interrupts enabled, do some "NOP" instructions). That way there can't be any pending IRQs when you "sti" later (and you can unmask individual IRQ sources when you actually want them unmasked - e.g. when you install a device driver that uses a specific IRQ). Sadly most people (tutorials, GRUB, etc) do everything wrong and just "cli" without masking IRQs in the interrupt controller/s (and then do things like remap the PIC chips, etc; which makes things even more confusing), and then end up having to cope with the consequences of doing everything wrong. ;-)

Breaking a stack/call frame information chain on ELF/Linux?

I'm trying to do a rather niche thing which is essentially breaking the CFI (Call Frame Information in DWARF EH info) and rbp & rsp links between frames. The main reason is that past a certain point in thread control flow I want to do a call continuation, which is basically a one-way tailcall combined with a yield that should clean up the stack and then return to the top of the stack, ready to be executed again at the continuation point.
Here is the idea in principle, which works as long as I keep the lines that mess with the stack commented out:
/*
* x86_64 SysV:
* rdi, rsi, rdx, rcx, r8, r9, xmm0-xmm7
*/
__asm {
mov rax, TCB
mov rax, qword ptr [rax] OSThreadControlBlock.StartFn;
call rax;
mov rax, 0;
// end of stack
//push rax;
//push rax;
//push rbx;
// last "real" frame
//push rbp;
//mov rbp, rsp;
//push rbx;
// make the call
mov rdi, RL;
lea rax, qword ptr __OS_RUNLOOP_START__;
call rax;
// trap if it returns
//int 3;
}
I'm aware of the general principles behind SP/BP registers, I'm specifically using -fno-omit-frame-pointer. My question is, after having spent hours trying to get it to work, what am I missing? It seems that any alteration to the stack layout, even as simple as a push before a call while keeping it aligned will cause a snowball crash starting with something like this (custom signal handler):
Received fatal signal: Segmentation fault (11) [thread: 10298 ctl-thrd]
* Unknown error at address 0x0 Regs:
%rip=0x00000000003E2D91 %rbp=0x00007F820A547EA8 %rsp=0x00007F820A547DE8 %rax=0x00007F820A547DE8 %rbx=0x00007F820A547F38
%rdi=0x00000000002121E1 %rsi=0x000000000000007B %rcx=0x000000000000000A %r8=0x0000000000000900 %r9=0x00007F820A5490C0
The ABI in question is libc++/libc++abi on x86_64 Linux, with a LLVM/Clang 6.0.X based toolchain. I tried practically everything, I know the above looks weird but it's an MS extension for inline assembly, I checked multiple times in disassemblies that it generates perfectly sane code. As far as I understand this is some weird conflict between CFI and frame pointer based stuff but I'm not that amazingly good at x86_64 so I'm not really sure what I'm missing. I know the unwinding process is meant to be terminated by a sentinel (null SP/FP on the last frame) but at this point I'm honestly lost because even the debugger gets completely thrown off by this.
If anyone has any suggestions that would be really appreciated, I tried various things but the core problem is the same, as soon as I touch the stack, even if I return it to normal, everything goes haywire. Clobber beyond the asm block doesn't matter since the last call is not meant to conventionally return. One thing I did notice is that it seems this is somehow related to TLVs but I'm not sure how since NPTL is meant to configure that.
Any help or suggestions would be immensely appreciated.
Edit:
Looks like this comment from Valgrind may explain what is happening:
/* NB 9 Sept 07. There is a nasty kludge here in all these CALL_FN_
macros. In order not to trash the stack redzone, we need to drop
%rsp by 128 before the hidden call, and restore afterwards. The
nastyness is that it is only by luck that the stack still appears
to be unwindable during the hidden call - since then the behaviour
of any routine using this macro does not match what the CFI data
says. Sigh.
Why is this important? Imagine that a wrapper has a stack
allocated local, and passes to the hidden call, a pointer to it.
Because gcc does not know about the hidden call, it may allocate
that local in the redzone. Unfortunately the hidden call may then
trash it before it comes to use it. So we must step clear of the
redzone, for the duration of the hidden call, to make it safe.
Probably the same problem afflicts the other redzone-style ABIs too
(ppc64-linux, ppc32-aix5, ppc64-aix5); but for those, the stack is
self describing (none of this CFI nonsense) so at least messing
with the stack pointer doesn't give a danger of non-unwindable
stack. */

FreeRTOS ARM cortex hardfault escalation from systick

Under a special condition I'm experiencing a hardfault exception. The ICSR indicates that it's an escalation from systick (pending exception = 15).
Any ideas how this would happen?
My guess is, that it's some kind of dead-lock.
Any recommendations how to trace this (without Atmel Studio)?
I'm using FreeRTOS 7.5.2.
UPDATE:
I added some more fault registers to the output dump. So it's indeed a bus fault with a systick interrupt pending:
EXCEPTION HANDLER
- ICSR active exception: 3
- ICSR pending exception: 15
- ICSR pending interrupt: 0
- Hardfault status: 0x40000000
- Memory fault status: 0x00
- Bus fault status: 0x04
- Usage fault status: 0x0000
I was able to track down the exception to a FreeRTOS call:
vTaskDelay(10/portTICK_RATE_MS);
The application has 2 tasks:
Task with priority 2 (parameter to xTaskCreate)
Task with priority 1
Task 1 enters an area locked with a semaphore and hits the line mentioned above. Task 2 should wake up and run until it also wants to enter the locked area.
I think you have misunderstood the ICSR. It does not say that the exception has escalated from SysTick; the pending SysTick has nothing to do with the hard fault.
Firstly you need to look in the HFSR (hard fault status register). If the FORCED bit is set, it means the fault was escalated from a bus fault, memory management fault or usage fault (I suspect it will be set). If it is, then look in the CFSR to see what kind of error you have.
You can then debug further from here. If it is a type of bus error (again quite likely) then you need to look at the BFARVALID bit in the CFSR. If this is set then you are lucky, as the BFAR register will contain the address of the fault. If this is not set then things get a bit more difficult! Bear in mind that the CFSR is actually several registers in one, so it needs decoding correctly: some of the bits are types of bus error and others are memory management faults, etc.
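The decoding described above can be sketched offline on the values from a fault dump. This is only an illustration using the Cortex-M bit layout (FORCED is bit 30 of the HFSR; the CFSR packs MMFSR, BFSR and UFSR); the helper names are made up, and the sample values are the ones from the dump in the question.

```python
HFSR_FORCED = 1 << 30  # fault escalated from a configurable fault

# Bit positions inside the 8-bit bus fault status register (BFSR).
BFSR_BITS = {
    0: "IBUSERR",
    1: "PRECISERR",
    2: "IMPRECISERR",
    3: "UNSTKERR",
    4: "STKERR",
    5: "LSPERR",
    7: "BFARVALID",
}


def split_cfsr(cfsr):
    """Return (MMFSR, BFSR, UFSR) from the combined 32-bit CFSR."""
    return cfsr & 0xFF, (cfsr >> 8) & 0xFF, (cfsr >> 16) & 0xFFFF


def decode_bfsr(bfsr):
    """List the names of the bus fault bits that are set."""
    return [name for bit, name in BFSR_BITS.items() if bfsr & (1 << bit)]


hfsr = 0x40000000          # "Hardfault status" from the dump
bus_fault_status = 0x04    # "Bus fault status" from the dump
print("FORCED:", bool(hfsr & HFSR_FORCED))
print("Bus fault bits:", decode_bfsr(bus_fault_status))
```

For these values FORCED is set and the only bus fault bit is IMPRECISERR. BFARVALID is clear, so this appears to be the more difficult case where BFAR does not hold the faulting address.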
I'm not sure why you would think a [software?] deadlock would cause a hardware hardfault, but some information on debugging hard faults can be found here: http://www.freertos.org/Debugging-Hard-Faults-On-Cortex-M-Microcontrollers.html
I would also recommend updating to a newer version of FreeRTOS: the newer the version, the more assert() statements are included to catch interrupt priority and other interrupt-related misuse and misconfiguration.

Exception Types in iOS crash logs

I've seen a few different types of crash logs since I began learning iOS development.
I know that:
Exception Type: EXC_BAD_ACCESS (SIGSEGV) means we are accessing a released object.
but don't know about:
Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Type: EXC_CRASH (SIGABRT)
Exception Type: EXC_BREAKPOINT (SIGTRAP)
Do you know how many Exception Types there are in iOS crash logs and what they mean?
I know that: Exception Type: EXC_BAD_ACCESS (SIGSEGV) means we are accessing a released object.
No.
A SIGSEGV is a segmentation fault, meaning you are trying to access an invalid memory address.
Those exceptions (in fact, they are signals) are not related to Objective-C, but C.
So you can get such an exception without Objective-C objects.
Note that a signal is not an exception, meaning you can't catch it with @try and @catch blocks.
You may set a signal handler with the signal and sigaction functions. Keep in mind that some signals (SIGKILL and SIGSTOP) cannot be caught or blocked, and that a process which calls abort() still terminates after its SIGABRT handler returns.
You can check the Wikipedia page about signals if you want more information.
That said, to summarize:
SIGSEGV (Segmentation fault)
Access to an invalid memory address. The address exists, but your program does not have access to it.
SIGBUS (Bus error)
Access to an invalid memory address. The address does not exist, or the alignment is invalid.
SIGFPE (Floating point exception)
Invalid arithmetic operation. Can be related to integer operations, despite the name.
SIGPIPE
Broken pipe.
SIGILL
Illegal processor instruction.
SIGTRAP
Debugger related
SIGABRT
Program crash, not related to one of the preceding signals.
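If you want to apply this summary to many crash logs, the "Exception Type" line can be picked apart mechanically. This is a small sketch, assuming the line format shown in the question; the function names and the wording of the table are made up for illustration and simply restate the list above.

```python
import re

# Short meanings taken from the summary above.
SIGNAL_MEANINGS = {
    "SIGSEGV": "access to an invalid memory address (no access rights)",
    "SIGBUS": "access to a nonexistent or misaligned address",
    "SIGFPE": "invalid arithmetic operation",
    "SIGPIPE": "broken pipe",
    "SIGILL": "illegal processor instruction",
    "SIGTRAP": "debugger related",
    "SIGABRT": "program crash not covered by the other signals",
}


def parse_exception_type(line):
    """Return (exception, signal) from an 'Exception Type:' line; signal may be None."""
    match = re.match(r"\s*Exception Type:\s*(\w+)(?:\s*\((\w+)\))?", line)
    if match is None:
        raise ValueError("not an Exception Type line")
    return match.group(1), match.group(2)


def describe(line):
    """Return a one-line description for an 'Exception Type:' line."""
    exception, signal_name = parse_exception_type(line)
    meaning = SIGNAL_MEANINGS.get(signal_name, "unknown")
    return f"{exception} / {signal_name}: {meaning}"


print(describe("Exception Type: EXC_BAD_ACCESS (SIGSEGV)"))
print(describe("Exception Type: EXC_CRASH (SIGABRT)"))
```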
SIGSEGV literally means you're accessing an address you don't own. So it's not necessarily that you're accessing a released object; you could be accessing an object that never existed, as in:
UIView *view; // uninitialised, could point to anything
[view setFrame:someFrame];
Or even just making an error in C-level non-object stuff, such as:
int array[100];
array[1000] = 23; // out-of-bounds access
SIGBUS is very similar to SIGSEGV, the difference being at the hardware level (usually the difference between trying to access an address that does exist but which you don't own and trying to access an address that doesn't have anything behind it, but that's not a strict definition), but is usually associated with the same sort of errors, though a SIGBUS is much more likely to be to do with an uninitialised variable than a SIGSEGV.
If you're trying to map to errors you probably made in Objective-C, you probably just want to read SIGSEGV and SIGBUS together as meaning "a memory access I didn't have the right to make".
SIGABRT is a program attempting to abort itself, so it usually means that some sort of internal consistency check has failed. For example, SIGABRT is raised if you try to free the same memory twice, or — at the Cocoa level — if you raise an NSException that isn't caught. If you get a SIGABRT, you've done something wrong that is detected by the system software (in contrast to SEGV and BUS, which arise in hardware).
SIGTRAP is a call out from the program to a debugger. Anecdotally, Apple seem to use these when you do something wrong that can be detected in software but relates to the environment rather than your specific code. So, for example, you call a C function that exists in the SDK you built with but not on the device you are running on (such as when you build against the latest SDK with a lower deployment target), or do a similar thing with an object.
These messages are from gdb, and they are not exclusive to Objective-C.
To get info about the signals, all you have to do is enter info signals at the debugger console; this is an example output. Sorry for not posting it here, but the format of the console output is awful.
Source and more info about signals
I've recently studied this topic area and here is my summary:
EXC_BAD_ACCESS (SIGSEGV) or
EXC_BAD_ACCESS (SIGBUS)
Our program most likely tried to access a bad memory location or the address was good but we did not have the privilege to access it. The memory might have been deallocated due to memory pressure.
EXC_BREAKPOINT (SIGTRAP)
This is due to an NSException being raised (possibly by a library on our behalf) or _NSLockError or objc_exception_throw being called. For example, this can be the Swift environment detecting an anomaly such as force unwrapping a nil optional.
EXC_BAD_INSTRUCTION (SIGILL)
This is when the program code itself is faulty, not the memory it might be accessing. This should be rare on iOS devices; perhaps a compiler or optimizer bug, or faulty hand written assembly code. On Simulator, it is a different story as using an undefined opcode is a technique used by the Swift runtime to stop on access to zombie objects (deallocated objects).
EXC_GUARD
This is when the program closed a file descriptor that was guarded. An example is the SQLite database used by the system.

Data Formatters temporarily unavailable, not low on memory

I'm running a computationally intensive task that reads data from the viewfinder using UIGetScreenImage and does computations on it, repeatedly. After about 60 seconds (on 3GS) I'm getting a crash every time. But I can't debug it, because I get this:
Program received signal: “0”.
Data Formatters temporarily unavailable, will re-try after a 'continue'. (Unknown error loading shared library "/Developer/usr/lib/libXcodeDebuggerSupport.dylib")
(gdb) continue
The program is not being run.
And at this point I'm toast, the stack trace is all blank.
I've used Instruments, object allocations, allocations, activity monitor, and they all show that I'm not leaking. In activity monitor, for example, physical memory used rises from 77 MB to 112 MB and stays there (up and down a bit) until the crash.
Anyone have an idea of what to try?
You might have some recursion that's got a bit out of control?
I've seen your symptoms happen when I've accidentally called a setter from within a setter, i.e.
-(void)setX:(int)value {
self.x = value; //!< Oops, accidentally called this method again :(
}
and you get odd errors from the debugger because you've broken the stack. Don't know how this answer helps you find the error though :(
Are you using any version control at all? I'd fix this by stepping back through your changes and finding the change that causes the bug.