FreeRTOS ARM Cortex hard fault escalation from SysTick - Cortex-M3

Under a special condition I'm experiencing a hard fault exception. The ICSR indicates that it is an escalation from SysTick (pending exception = 15).
Any ideas how this would happen?
My guess is that it's some kind of deadlock.
Any recommendations on how to trace this (without Atmel Studio)?
I'm using FreeRTOS 7.5.2.
UPDATE:
I added some more fault registers to the output dump. So it is indeed a bus fault with a SysTick interrupt pending:
EXCEPTION HANDLER
- ICSR active exception: 3
- ICSR pending exception: 15
- ICSR pending interrupt: 0
- Hardfault status: 0x40000000
- Memory fault status: 0x00
- Bus fault status: 0x04
- Usage fault status: 0x0000
I was able to track down the exception to a FreeRTOS call:
vTaskDelay(10/portTICK_RATE_MS);
The application has 2 tasks:
Task with priority 2 (parameter to xTaskCreate)
Task with priority 1
Task 1 enters an area locked with a semaphore and hits the line mentioned above. Task 2 should wake up and run until it also wants to enter the locked area.
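For reference, a minimal sketch of the structure described above, using the v7-style FreeRTOS API; the task and semaphore names are hypothetical, and the mutex is assumed to be created with xSemaphoreCreateMutex() before the scheduler starts:

#include "FreeRTOS.h"
#include "task.h"
#include "semphr.h"

static xSemaphoreHandle xLock;                        /* guards the locked area described above */

static void vTask1( void *pvParameters )              /* created with priority 2 */
{
    ( void ) pvParameters;
    for( ;; )
    {
        xSemaphoreTake( xLock, portMAX_DELAY );       /* enter the locked area */
        vTaskDelay( 10 / portTICK_RATE_MS );          /* the call that faults */
        xSemaphoreGive( xLock );
    }
}

static void vTask2( void *pvParameters )              /* created with priority 1 */
{
    ( void ) pvParameters;
    for( ;; )
    {
        xSemaphoreTake( xLock, portMAX_DELAY );       /* blocks until Task 1 releases the lock */
        /* ... work inside the locked area ... */
        xSemaphoreGive( xLock );
    }
}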

I think you have misunderstood the ICSR. The pending-exception field is not saying the exception escalated from SysTick, and the pending SysTick has nothing to do with the hard fault.
Firstly you need to look in the HFSR (hard fault status register). If FORCED is set, it means the fault escalated from a bus fault, memory management fault, or usage fault (I suspect it will be forced). If it is, then look in the CFSR to see what kind of error you have.
You can then debug further from there. If it is a type of bus error (again, quite likely) then you need to look at the BFARVALID bit in the CFSR. If this is set then you are lucky, as the BFAR register will contain the address of the fault. If it is not set then things get a bit more difficult! Bear in mind that the CFSR is actually several registers in one, so it needs decoding correctly: some of the bits are types of bus fault, others are memory management faults, and so on.
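As a starting point for that decoding, here is a minimal sketch of a handler that dumps the relevant Cortex-M3 fault registers; debug_print is a placeholder for whatever output you have (UART, ITM, a RAM buffer), and the addresses are the standard System Control Block locations:

#include <stdint.h>

extern void debug_print( const char *fmt, ... );     /* placeholder output routine */

#define SCB_CFSR   (*(volatile uint32_t *)0xE000ED28)  /* Configurable fault status (UFSR|BFSR|MMFSR) */
#define SCB_HFSR   (*(volatile uint32_t *)0xE000ED2C)  /* Hard fault status (bit 30 = FORCED)         */
#define SCB_MMFAR  (*(volatile uint32_t *)0xE000ED34)  /* MemManage fault address                      */
#define SCB_BFAR   (*(volatile uint32_t *)0xE000ED38)  /* Bus fault address                            */

void HardFault_Handler( void )
{
    uint32_t cfsr = SCB_CFSR;

    debug_print( "HFSR  %08lx\n", (unsigned long)SCB_HFSR );
    debug_print( "CFSR  %08lx\n", (unsigned long)cfsr );

    if( cfsr & ( 1UL << 15 ) )                         /* BFARVALID (bit 7 of the BFSR byte)  */
        debug_print( "BFAR  %08lx\n", (unsigned long)SCB_BFAR );
    if( cfsr & ( 1UL << 7 ) )                          /* MMARVALID (bit 7 of the MMFSR byte) */
        debug_print( "MMFAR %08lx\n", (unsigned long)SCB_MMFAR );

    for( ;; )
        ;                                              /* halt so a debugger can inspect the state */
}

(For what it's worth, the dump above already decodes this way: HFSR 0x40000000 is FORCED, and a bus fault status of 0x04 is IMPRECISERR, which also means BFAR will not be valid.)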

I'm not sure why you would think a [software?] deadlock would cause a hardware hard fault, but some information on debugging hard faults can be found here: http://www.freertos.org/Debugging-Hard-Faults-On-Cortex-M-Microcontrollers.html
I would also recommend updating to a newer version of FreeRTOS: the newer the version, the more assert() statements are included to catch interrupt priority and other interrupt-related misuse and misconfiguration.
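Those assert checks only do something if configASSERT() is defined in FreeRTOSConfig.h; a minimal definition, along the lines of the example in the FreeRTOS documentation, is:

/* Halt at the failing line with interrupts off so the debugger shows exactly
   where the assertion (e.g. an interrupt priority check) fired. */
#define configASSERT( x )    if( ( x ) == 0 ) { taskDISABLE_INTERRUPTS(); for( ;; ); }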

Related

Write to NVIC_ICPR on Cortex M0 not clearing pending status for TIM2 interrupt

I'm working with TIM2 on the STM32L068K, which is a Cortex-M0 processor. When I write to the timer enable bit, all the interrupt flags immediately get set. This in itself is a known issue and apparently endemic to the processor design, based on the online commentary I've read.
I can clear out the interrupt flags by writing to the status register, but the problem is that the NVIC pending IRQ bit for this source (#15) is also set. This means that the second I execute cpsie i I get vectored to the ISR for source #15 (confirmed by seeing that this is the reported source in IPSR). I've tried multiple techniques for writing to NVIC_ICPR, but the bit remains set. As one example of the many things I've tried, see this site: https://www.sciencedirect.com/topics/engineering/pending-interrupt. I've also tried the CMSIS calls to no good effect. Do writes to this register only work in handler mode, not thread mode? And if so, how can you stop a spurious interrupt from happening? Is it possible to manually enter handler mode without triggering an exception?
Note that this website does say "If the interrupt source generates an interrupt request continuously (level output), then the pending status could remain high even if you try to clear it at the NVIC." I wouldn't expect the TIM2 IRQ to fall into this category as it should only be triggering when the count reaches zero, which is not happening here, and the interrupt flags for it have already been cleared anyway.
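For reference, the ordering that is usually recommended is: clear the peripheral's own status flags first (so its request line drops), then clear the NVIC pending bit, then unmask. A hedged sketch using CMSIS calls and STM32L0 register names follows; the header name and exact bit choices are assumptions, and note that the NVIC registers are writable from privileged thread mode, so handler mode is not required:

#include "stm32l0xx.h"                       /* CMSIS device header (name assumed) */

void start_tim2_cleanly( void )
{
    TIM2->DIER &= ~TIM_DIER_UIE;             /* keep the update interrupt masked at the peripheral */
    TIM2->CR1  |= TIM_CR1_CEN;               /* enabling the counter spuriously sets the flags     */

    TIM2->SR = 0;                            /* clear the peripheral status flags first            */
    NVIC_ClearPendingIRQ( TIM2_IRQn );       /* then the stale NVIC pending bit can actually clear */

    TIM2->DIER |= TIM_DIER_UIE;              /* now allow the interrupt source                     */
    NVIC_EnableIRQ( TIM2_IRQn );
    __enable_irq();                          /* cpsie i */
}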

Computer Reboots After "sti" Instruction

I am trying to implement interrupts in an x86 operating system project. However, after loading the interrupt descriptor table with lidt, I issue the sti instruction and it reboots the computer. Also, I am in protected mode. Any idea what might be happening?
Some things cause exceptions. When the CPU can't start the corresponding exception handler it falls back to a generic "double fault" exception, and when the CPU can't start that exception handler either, the CPU falls back to a "triple fault" condition, which mostly means that the computer is reset.
It's likely that there are pending IRQs (that occurred while interrupts were masked with "cli" and have been waiting for the CPU to be ready to receive them); so when you do "sti" the interrupt controller sees the CPU is ready to receive an IRQ and immediately sends one, and it is likely that the interrupt handler for whichever IRQ the CPU receives is causing an exception (which leads to a double fault, which leads to a triple fault/reset).
The easiest way to figure out what is happening is to run it under an emulator that tells you what happened in its logs. The alternative is to write usable exception handler/s for any exceptions that are involved (most likely, a general protection fault exception handler); so that the exception handler can give you information about what went wrong (e.g. the "error code" provided by the CPU to the general protection fault handler may indicate which IDT entry the CPU tried to use for the IRQ).
Note that during boot the best sequence is to mask all IRQs in the interrupt controller/s, then let firmware handle any pending IRQs (e.g. with interrupts enabled, do some "NOP" instructions). That way there can't be any pending IRQs when you "sti" later (and you can unmask individual IRQ sources when you actually want them unmasked - e.g. when you install a device driver that uses a specific IRQ). Sadly most people (tutorials, GRUB, etc.) do everything wrong and just "cli" without masking IRQs in the interrupt controller/s (and then do things like remap the PIC chips, etc., which makes things even more confusing), and then end up having to cope with the consequences of doing everything wrong. ;-)
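A minimal sketch of that early masking step, assuming the standard 8259A ports and a small outb wrapper (the names here are illustrative, not from the question):

#include <stdint.h>

#define PIC1_DATA 0x21                    /* master PIC interrupt mask register */
#define PIC2_DATA 0xA1                    /* slave PIC interrupt mask register  */

static inline void outb(uint16_t port, uint8_t value)
{
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}

void mask_all_irqs(void)
{
    outb(PIC1_DATA, 0xFF);                /* mask IRQ 0-7 at the master */
    outb(PIC2_DATA, 0xFF);                /* mask IRQ 8-15 at the slave */
    /* unmask individual lines later, as each driver is installed */
}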

Is low latency mode safe to use with Linux serial ports?

Is it safe to use the low_latency tty mode with Linux serial ports? The documentation for tty_flip_buffer_push says that it "must not be called from IRQ context if port->low_latency is set." Nevertheless, many low-level serial port drivers call it from an ISR whether or not the flag is set. For example, the mpc52xx driver calls tty_flip_buffer_push unconditionally after each read from its FIFO.
A consequence of pushing the flip buffer from the ISR with low_latency set is that the line discipline driver is entered in IRQ context. My goal is to get a latency of one millisecond or less when reading from a high-speed mpc52xx serial port. Setting low_latency achieves the latency goal, but it also violates the documented precondition for tty_flip_buffer_push.
This question was asked on linux-serial on Fri, 19 Aug 2011.
No, low latency is not safe in general.
However, in the particular case of 3.10.5 low_latency is safe.
The comments above tty_flip_buffer_push read:
"This function must not be called from IRQ context if port->low_latency is set."
However, the code (3.10.5, drivers/tty/tty_buffer.c) contradicts this:
void tty_flip_buffer_push(struct tty_port *port)
{
        struct tty_bufhead *buf = &port->buf;
        unsigned long flags;

        spin_lock_irqsave(&buf->lock, flags);
        if (buf->tail != NULL)
                buf->tail->commit = buf->tail->used;
        spin_unlock_irqrestore(&buf->lock, flags);

        if (port->low_latency)
                flush_to_ldisc(&buf->work);
        else
                schedule_work(&buf->work);
}
EXPORT_SYMBOL(tty_flip_buffer_push);
The use of spin_lock_irqsave/spin_unlock_irqrestore makes this code safe to call from interrupt context.
There is a test for low_latency and if it is set, flush_to_ldisc is called directly. This flushes the flip buffer to the line discipline immediately, at the cost of making the interrupt processing longer. The flush_to_ldisc routine is also coded to be safe for use in interrupt context. I guess that an earlier version was unsafe.
If low_latency is not set, then schedule_work is called. Calling schedule_work is the classic way to invoke the "bottom half" handler from the "top half" in interrupt context. This causes flush_to_ldisc to be called later, from a kernel worker thread, rather than in the interrupt itself.
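For context, a typical (simplified, hypothetical) driver receive path on the 3.10-era API looks like the sketch below; the low_latency test above then decides whether the line discipline work runs right here in IRQ context or later in a worker:

#include <linux/interrupt.h>
#include <linux/tty.h>
#include <linux/tty_flip.h>

struct my_uart {                              /* hypothetical driver state */
        struct tty_port port;
        /* ... device registers, FIFO state, ... */
};

/* hypothetical FIFO helpers, defined elsewhere in the driver */
int rx_fifo_not_empty(struct my_uart *uart);
unsigned char rx_fifo_read(struct my_uart *uart);

static irqreturn_t my_uart_rx_isr(int irq, void *dev_id)
{
        struct my_uart *uart = dev_id;

        while (rx_fifo_not_empty(uart))
                tty_insert_flip_char(&uart->port, rx_fifo_read(uart), TTY_NORMAL);

        /* With port->low_latency set, this calls flush_to_ldisc() directly,
         * i.e. the line discipline executes in IRQ context. */
        tty_flip_buffer_push(&uart->port);

        return IRQ_HANDLED;
}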
Looking a little deeper, both the comment and the test seem to be in Alan Cox's original e0495736 commit of tty_buffer.c. This commit was a re-write of earlier code, so it seems that at one time there wasn't a test. Whoever added the test and fixed flush_to_ldisc to be interrupt-safe did not bother to fix the comment.
So, always believe the code, not the comments.
However, in the same code in 3.12-rc* (as of October 23, 2013) it looks like the problem was opened again when the spin_lock_irqsave's in flush_to_ldisc were removed and mutex_locks were added. That is, setting UPF_LOW_LATENCY in the serial_struct flags and calling the TIOCSSERIAL ioctl will again cause "scheduling while atomic".
The latest update from the maintainer is:
On 10/19/2013 07:16 PM, Jonathan Ben Avraham wrote:
> Hi Peter,
> "tty_flip_buffer_push" is called from IRQ handlers in most drivers/tty/serial UART drivers.
>
> "tty_flip_buffer_push" calls "flush_to_ldisc" if low_latency is set.
> "flush_to_ldisc" calls "mutex_lock" in 3.12-rc5, which cannot be used in interrupt context.
>
> Does this mean that setting "low_latency" cannot be used safely in 3.12-rc5?
Yes, I broke low_latency.
Part of the problem is that the 3.11- use of low_latency was unsafe; too many shared
data areas were simply accessed without appropriate safeguards.
I'm working on fixing it but probably won't make it for 3.12 final.
Regards,
Peter Hurley
So, it looks like you should not depend on low_latency unless you are sure that you are never going to change your kernel from a version that supports it.
Update: February 18, 2014, kernel 3.13.2
Stanislaw Gruszka wrote:
Hi,
setserial has low_latency option which should minimize receive latency
(scheduler delay). AFAICT it is used if someone talk to external device
via RS-485/RS-232 and need to have quick requests and responses . On
kernel this feature was implemented by direct tty processing from
interrupt context:
void tty_flip_buffer_push(struct tty_port *port)
{
        struct tty_bufhead *buf = &port->buf;

        buf->tail->commit = buf->tail->used;
        if (port->low_latency)
                flush_to_ldisc(&buf->work);
        else
                schedule_work(&buf->work);
}
But after 3.12 tty locking changes, calling flush_to_ldisc() from
interrupt context is a bug (we got scheduling while atomic bug report
here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 )
I'm not sure how this should be solved. After Peter get rid all of those
race condition in tty layer, we probably don't want go back to use
spin_lock's there. Maybe we can create WQ_HIGHPRI workqueue and schedule
flush_to_ldisc() work there. Or perhaps users that need to low latency,
should switch to thread irq and prioritize serial irq to meat
retirements. Anyway setserial low_latency is now broken and all who use
this feature in the past can not do this any longer on 3.12+ kernels.
Thoughts ?
Stanislaw
A patch has been posted to LKML to address the problem. It removes the generic code for handling low_latency but keeps the parameter for the low-level drivers to use.
http://www.kernelhub.org/?p=2&msg=419071
I tried forcing low_latency on Linux 3.12 with a serial console. The kernel was very unstable. If preemption was enabled, it would hang after a few minutes of use.
So the answer for now is to stay away.

Interrupt masking: why?

I was reading up on interrupts. It is possible to suspend non-critical interrupts via a special interrupt mask. This is called interrupt masking. What I don't know is when/why you might want to, or need to, temporarily suspend interrupts. Possibly for semaphores, or when programming in a multi-processor environment?
The OS does that when it prepares to run its own "let's orchestrate the world" code.
For example, at some point the OS thread scheduler has control. It prepares the processor registers and everything else that needs to be done before it lets a thread run so that the environment for that process and thread is set up. Then, before letting that thread run, it sets a timer interrupt to be raised after the time it intends to let the thread have on the CPU elapses.
After that time period (quantum) has elapsed, the interrupt is raised and the OS scheduler takes control again. It has to figure out what needs to be done next. To do that, it needs to save the state of the CPU registers so that the thread can be resumed later exactly where it left off. If another interrupt is raised for any reason (e.g. some async I/O completes) while that state is being saved, this would leave the OS in a situation where its world is not in a valid state (in effect, saving the state needs to be an atomic operation).
To avoid being caught in that situation, the OS kernel therefore disables interrupts while any such operations that need to be atomic are performed. After it has done whatever needs doing and the system is in a known state again, it reenables interrupts.
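As a concrete illustration of the pattern, here is a Linux-flavoured sketch; the context structure and the capture step are placeholders, not real kernel internals:

#include <linux/irqflags.h>

struct cpu_context {                      /* hypothetical per-thread register snapshot */
        unsigned long regs[16];
};

void capture_registers(struct cpu_context *ctx);   /* placeholder, defined elsewhere */

static void save_current_context(struct cpu_context *ctx)
{
        unsigned long flags;

        local_irq_save(flags);            /* mask interrupts on this CPU           */
        capture_registers(ctx);           /* placeholder: must complete atomically */
        local_irq_restore(flags);         /* restore the previous interrupt state  */
}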
I used to program on an ARM board that had about 10 interrupts that could occur. Each particular program that I wrote was never interested in more than 4 of them. For instance, there were 2 timers on the board, but my programs only used 1, so I would mask the 2nd timer's interrupt. If I didn't mask that timer, it might have been enabled and kept generating interrupts, which would slow down my code.
Another example was that I would use the UART receive REGISTER full interrupt and so would never need the UART receive BUFFER full interrupt to occur.
I hope this gives you some insight as to why you might want to disable interrupts.
In addition to answers already given, there's an element of priority to it. There are some interrupts you need or want to be able to respond to as quickly as possible and others you'd like to know about but only when you're not so busy. The most obvious example might be refilling the write buffer on a DVD writer (where, if you don't do so in time, some hardware will simply write the DVD incorrectly) versus processing a new packet from the network. You'd disable the interrupt for the latter upon receiving the interrupt for the former, and keep it disabled for the duration of filling the buffer.
In practice, quite a lot of CPUs have interrupt priority built directly into the hardware. When an interrupt occurs, the lower-priority interrupts (and, often, that interrupt itself) are masked at the same time as the interrupt vector is read and the jump to the relevant address is made. Dictating that receipt of an interrupt also implicitly masks that interrupt until the end of the interrupt handler has the nice side effect of loosening restrictions on the interrupting hardware. E.g. you can simply say that a high signal level triggers the interrupt and leave the external hardware to decide how long it wants to hold the line high, without worrying about inadvertently triggering multiple interrupts.
In many antiquated systems (including the Z80 and 6502) there tend to be only two levels of interrupt: maskable and non-maskable, which I think is where the language of enabling or disabling interrupts comes from. But even as far back as the original 68000 you had eight levels of interrupt and a current priority level in the CPU that dictates which levels of incoming interrupt are actually allowed to take effect.
Imagine your CPU is in the "int3" handler and at that moment "int2" arrives, where "int2" has a lower priority than "int3". How do we handle this situation?
One way is that while handling "int3" we mask out the lower-priority interrupt sources. That is, we can see "int2" signaling the CPU, but the CPU is not interrupted by it. After we finish handling "int3", we return from "int3" and unmask the lower-priority sources.
The place we return to can be:
1. Another process (in a preemptive system)
2. The process that was interrupted by "int3" (in a non-preemptive or preemptive system)
3. An interrupt handler that was itself interrupted by "int3", say "int1"'s handler.
In cases 1 and 2, because we unmasked the lower-priority sources and "int2" is still signaling the CPU ("hi, there is something for you to handle immediately"), the CPU is interrupted again, while it is executing instructions from a process, to handle "int2".
In case 3, if the priority of "int2" is higher than "int1"'s, then the CPU is interrupted again, while it is executing instructions from "int1"'s handler, to handle "int2".
Otherwise, "int1"'s handler runs without being interrupted (because we are also masking out the sources with priority lower than "int1"), and the CPU returns to a process after handling "int1" and unmasking. At that point "int2" is handled.

How could an assembly OUTB function cause a triple fault?

In my systems programming class we are working on a small, simple hobby OS. Personally I have been working on an ATA hard disk driver. I have discovered that a single line of code seems to cause a fault which then immediately reboots the system. The code in question is at the end of my interrupt service routine for the IDE interrupts. Since I was using the IDE channels, their interrupts arrive through the slave PIC (which is cascaded through the master). Originally my code was only sending the end-of-interrupt byte to the slave, but then my professor told me that I should be sending it to the master PIC as well.
So here is my problem: when I un-comment the line which sends the EOI byte to the master PIC, the system triple faults and then reboots. Likewise, if I leave it commented the system stays running.
_outb( PIC_MASTER_CMD_PORT, PIC_EOI ); // this causes (or at least sets off) a triple fault reboot
_outb( PIC_SLAVE_CMD_PORT, PIC_EOI );
Without seeing the rest of the system, is it possible for someone to explain what could possibly be happening here?
NOTE: Just as a shot in the dark, I replaced the _outb() call with another _outb() call which just made sure that the interrupts were enabled for the IDE controller; the generated assembly would have been almost identical. This did not cause a fault.
*_outb() is a wrapper for the x86 OUTB instruction.
What is so special about my function sending EOI to the master PIC that it causes an issue?
I realize without seeing the code this may be impossible to answer, but thanks for looking!
Triple faults usually point to a stack overflow or an odd stack pointer. When a fault or interrupt occurs, the system immediately tries to push some more junk onto the stack (before invoking the fault handler). If the stack is hosed, this causes another fault, which then tries to push more stuff onto the stack, which causes another fault. At this point, the system gives up on you and reboots.
I know this because I actually have a silly patent (while working at Dell about 20 years ago) on a way to cause a CPU reset without external hardware (used to be done through the keyboard controller):
MOV ESP,1
PUSH EAX ; triple fault and reset!
An OUTB instruction can't cause a fault on its own. My guess is you are re-enabling an interrupt, and the interrupt gets triggered while something is wrong with your stack.
When you re-enable the PIC, are you doing it with the CPU's interrupt flag set or cleared (i.e. are you doing it sometime after a CLI opcode, or sometime after an STI opcode)?
Assuming that the CPU's interrupt flag is enabled, your act of re-enabling the PIC allows any pending interrupts to reach the CPU: which would interrupt your code, dispatch to a vector specified by the IDT, etc.
So I expect that it's not your opcode that's directly causing the fault: rather, what's faulting is code that's run as the result of an interrupt which happens as a result of your re-enabling the PIC.
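For reference, the conventional EOI sequence for an IRQ that arrives via the slave PIC is a non-specific EOI to both controllers, usually slave first and then the cascade/master, issued while the CPU's interrupt flag is still clear inside the ISR. A hedged sketch reusing the question's _outb(port, value) wrapper; the port constants are the standard 8259A command ports and may differ from the question's macros:

#define PIC_MASTER_CMD_PORT 0x20             /* standard 8259A command ports  */
#define PIC_SLAVE_CMD_PORT  0xA0
#define PIC_EOI             0x20             /* non-specific end-of-interrupt */

static void ide_irq_eoi( void )
{
    _outb( PIC_SLAVE_CMD_PORT, PIC_EOI );    /* acknowledge at the slave          */
    _outb( PIC_MASTER_CMD_PORT, PIC_EOI );   /* then at the master (cascade line) */
}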