I'm using FreeRTOS with STM32F407. I have problem with wrong context switch restoration. The code goes like this inside task code:
char *ptr = pvPortMalloc(sizeof(char) * size);
memcpy(ptr, buf, size);
...
log("Before:");
logItoa((int)ptr);
blockingFunction(); // Here preemption will occur
log("After:");
logItoa((int)ptr);
blockingFunction() does not use ptr.
When I debug I can see, that the address pointed by the ptr is stored with instruction:
STR R0, [R7, #24]
so I check the value under the address (R7 + 24)(^1) in data memory and see that the address to dynamically allocated data is successfully saved.
After context restoration I check the variable ptr and see, that it isn't pointing on my newly allocated data, so I check the value under the address (^1) and see, that the value remains unchanged, but value in R7 register (used for address counting) isn't same as before preemption.
It leads to situation when each of my local variables aren't the same, because they're wrongly fetched from data memory.
If it's kind of stack overflow issues, how can I debug it?
Most issues on Cortex-M come down to incorrect interrupt priority assignments and stack overflow, so later versions of FreeRTOS have lots of traps for both of these errors to let you know immediately if they occur - but you have to turn on the ability to trap these common errors, as per below:
Do you have configASSERT() defined, and which version of FreeRTOS are you using? The later the version the more helpful configASSERT() will be.
Do you have configCHECK_FOR_STACK_OVERFLOW set to 2 and a stack overflow hook defined?
I found the source of problem. Inside the blockingFunction() there were buffer explicitly filled. The buffer was too small and overwrote my task's TCB.
Related
In the following RISC-V assembly code:
...
#Using some temporary (t) registers
...
addi a7,zero,1 #Printint system call code
addi a0,zero,100
ecall
...
Should any temporary (t) registers be saved to the stack before using ecall? When ecall is used, an exception occurs, kernel mode is on and code is executed from an exception handler. Some information is saved when exceptions occur, such as EPC and CAUSE , but what about the temporary registers? Environment calls are considered not to be like procedures for safety reasons, but they look like it. Do the procedure calling conventions still apply in this case?
You are correct that the hardware captures some information, such as the epc (if it didn't the interrupted pc would be lost during transfer of control to the exception handler).
However that's all the hardware does — the rest is up to software. Here, RARS is providing the exception handler for ecall.
RARS documentation states (from the help section on syscalls, which is what these are called on MIPS (and RARS came from MARS)):
Register contents are not affected by a system call, except for result registers as specified in the table below.
And below this quote in the help is ecall function code table labeled "Table of Available Services", in which 1 is PrintInt.
Should any temporary (t) registers be saved to the stack before using ecall?
No, it is not necessary, the $t registers will be unaffected by an ecall.
This also means that we can add ecalls to do printf-style debugging without concern for preserving the $t registers, which is nice. However, keeping that in mind, we might generally avoid $a0 and $a7 for our own variables, since putting in an ecall for debugging will need those registers.
In addition, you can write your own exception handler, and it would not have to follow any particular conventions for parameter passing or even register preservation.
Is it safe to use the low_latency tty mode with Linux serial ports? The tty_flip_buffer_push function is documented that it "must not be called from IRQ context if port->low_latency is set." Nevertheless, many low-level serial port drivers call it from an ISR whether or not the flag is set. For example, the mpc52xx driver calls flip buffer unconditionally after each read from its FIFO.
A consequence of the low latency flip buffer in the ISR is that the line discipline driver is entered within the IRQ context. My goal is to get latency of one millisecond or less, reading from a high speed mpc52xx serial port. Setting low_latency acheives the latency goal, but it also violates the documented precondition for tty_flip_buffer_push.
This question was asked on linux-serial on Fri, 19 Aug 2011.
No, low latency is not safe in general.
However, in the particular case of 3.10.5 low_latency is safe.
The comments above tty_flip_buffer_push read:
"This function must not be called from IRQ context if port->low_latency is set."
However, the code (3.10.5, drivers/tty/tty_buffer.c) contradicts this:
void tty_flip_buffer_push(struct tty_port *port)
{
struct tty_bufhead *buf = &port->buf;
unsigned long flags;
spin_lock_irqsave(&buf->lock, flags);
if (buf->tail != NULL)
buf->tail->commit = buf->tail->used;
spin_unlock_irqrestore(&buf->lock, flags);
if (port->low_latency)
flush_to_ldisc(&buf->work);
else
schedule_work(&buf->work);
}
EXPORT_SYMBOL(tty_flip_buffer_push);
The use of spin_lock_irqsave/spin_unlock_irqrestore makes this code safe to call from interrupt context.
There is a test for low_latency and if it is set, flush_to_ldisc is called directly. This flushes the flip buffer to the line discipline immediately, at the cost of making the interrupt processing longer. The flush_to_ldisc routine is also coded to be safe for use in interrupt context. I guess that an earlier version was unsafe.
If low_latency is not set, then schedule_work is called. Calling schedule_work is the classic way to invoke the "bottom half" handler from the "top half" in interrupt context. This causes flush_to_ldisc to be called from the "bottom half" handler at the next clock tick.
Looking a little deeper, both the comment and the test seem to be in Alan Cox's original e0495736 commit of tty_buffer.c. This commit was a re-write of earlier code, so it seems that at one time there wasn't a test. Whoever added the test and fixed flush_to_ldisc to be interrupt-safe did not bother to fix the comment.
So, always believe the code, not the comments.
However, in the same code in 3.12-rc* (as of October 23, 2013) it looks like the problem was opened again when the spin_lock_irqsave's in flush_to_ldisc were removed and mutex_locks were added. That is, setting UPF_LOW_LATENCY in the serial_struct flags and calling the TIOCSSERIAL ioctl will again cause "scheduling while atomic".
The latest update from the maintainer is:
On 10/19/2013 07:16 PM, Jonathan Ben Avraham wrote:
> Hi Peter,
> "tty_flip_buffer_push" is called from IRQ handlers in most drivers/tty/serial UART drivers.
>
> "tty_flip_buffer_push" calls "flush_to_ldisc" if low_latency is set.
> "flush_to_ldisc" calls "mutex_lock" in 3.12-rc5, which cannot be used in interrupt context.
>
> Does this mean that setting "low_latency" cannot be used safely in 3.12-rc5?
Yes, I broke low_latency.
Part of the problem is that the 3.11- use of low_latency was unsafe; too many shared
data areas were simply accessed without appropriate safeguards.
I'm working on fixing it but probably won't make it for 3.12 final.
Regards,
Peter Hurley
So, it looks like you should not depend on low_latency unless you are sure that you are never going to change your kernel from a version that supports it.
Update: February 18, 2014, kernel 3.13.2
Stanislaw Gruszka wrote:
Hi,
setserial has low_latency option which should minimize receive latency
(scheduler delay). AFAICT it is used if someone talk to external device
via RS-485/RS-232 and need to have quick requests and responses . On
kernel this feature was implemented by direct tty processing from
interrupt context:
void tty_flip_buffer_push(struct tty_port *port)
{
struct tty_bufhead *buf = &port->buf;
buf->tail->commit = buf->tail->used;
if (port->low_latency)
flush_to_ldisc(&buf->work);
else
schedule_work(&buf->work);
}
But after 3.12 tty locking changes, calling flush_to_ldisc() from
interrupt context is a bug (we got scheduling while atomic bug report
here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 )
I'm not sure how this should be solved. After Peter get rid all of those
race condition in tty layer, we probably don't want go back to use
spin_lock's there. Maybe we can create WQ_HIGHPRI workqueue and schedule
flush_to_ldisc() work there. Or perhaps users that need to low latency,
should switch to thread irq and prioritize serial irq to meat
retirements. Anyway setserial low_latency is now broken and all who use
this feature in the past can not do this any longer on 3.12+ kernels.
Thoughts ?
Stanislaw
A patch has been posted to LKML to address the problem. It removes the generic code for handling low_latency but keeps the parameter for the low-level drivers to use.
http://www.kernelhub.org/?p=2&msg=419071
I tried forcing low_latency on Linux 3.12 with serial console. The kernel was very unstable. If preemption was enabled, it would hang after a few minutes of use.
So the answer for now is to stay away.
I'm looking at a crash dump. Some variables seem perfectly viewable in windbg, while others just say "memory access error". What causes this? Why do some variables have sensical values while others simply list ?
It appears that all the problems are associated with following pointers. I'm certain that while many of these pointers are uninitialized the vast majority of them should be pointing somewhere valid. Based on the nature of this crash (a simple null ptr dereference) I'm fairly certain the whole process hasn't gone out to lunch.
Mini-dumps are fairly useless, they don't contain a snapshot of all in use memory. Instead, all they contain are some critical structures/lists (e.g. the loaded module list) and the contents of the crashing stack.
So, any pointer that you try to follow in the dump will just give you question marks. Grab a full memory dump instead and you'll be able to see what these buffers point to.
-scott
If they are local pointer variables, what is most likely happening is that the pointers are not initialized, or that stack location has been reused to contain another variable, that may not be a pointer. In both cases, the pointer value may point to a random, unreadable portion of memory.
Since I'm not a native English speaker I might be missing something so maybe someone here knows better than me.
Taken from WSASend's doumentation at MSDN:
lpBuffers [in]
A pointer to an array of WSABUF
structures. Each WSABUF structure
contains a pointer to a buffer and the
length, in bytes, of the buffer. For a
Winsock application, once the WSASend
function is called, the system owns
these buffers and the application may
not access them. This array must
remain valid for the duration of the
send operation.
Ok, can you see the bold text? That's the unclear spot!
I can think of two translations for this line (might be something else, you name it):
Translation 1 - "buffers" refers to the OVERLAPPED structure that I pass this function when calling it. I may reuse the object again only when getting a completion notification about it.
Translation 2 - "buffers" refer to the actual buffers, those with the data I'm sending. If the WSABUF object points to one buffer, then I cannot touch this buffer until the operation is complete.
Can anyone tell what's the right interpretation to that line?
And..... If the answer is the second one - how would you resolve it?
Because to me it implies that for each and every data/buffer I'm sending I must retain a copy of it at the sender side - thus having MANY "pending" buffers (in different sizes) on an high traffic application, which really going to hurt "scalability".
Statement 1:
In addition to the above paragraph (the "And...."), I thought that IOCP copies the data to-be-sent to it's own buffer and sends from there, unless you set SO_SNDBUF to zero.
Statement 2:
I use stack-allocated buffers (you know, something like char cBuff[1024]; at the function body - if the translation to the main question is the second option (i.e buffers must stay as they are until the send is complete), then... that really screws things up big-time! Can you think of a way to resolve it? (I know, I asked it in other words above).
The answer is that the overlapped structure and the data buffer itself cannot be reused or released until the completion for the operation occurs.
This is because the operation is completed asynchronously so even if the data is eventually copied into operating system owned buffers in the TCP/IP stack that may not occur until some time in the future and you're notified of when by the write completion occurring. Note that with write completions these may be delayed for a surprising amount of time if you're sending without explicit flow control and relying on the the TCP stack to do flow control for you (see here: some OVERLAPS using WSASend not returning in a timely manner using GetQueuedCompletionStatus?) ...
You can't use stack allocated buffers unless you place an event in the overlapped structure and block on it until the async operation completes; there's not a lot of point in doing that as you add complexity over a normal blocking call and you don't gain a great deal by issuing the call async and then waiting on it.
In my IOCP server framework (which you can get for free from here) I use dynamically allocated buffers which include the OVERLAPPED structure and which are reference counted. This means that the cleanup (in my case they're returned to a pool for reuse) happens when the completion occurs and the reference is released. It also means that you can choose to continue to use the buffer after the operation and the cleanup is still simple.
See also here: I/O Completion Port, How to free Per Socket Context and Per I/O Context?
In my systems programming class we are working on a small, simple hobby OS. Personally I have been working on an ATA hard disk driver. I have discovered that a single line of code seems to cause a fault which then immediately reboots the system. The code in question is at the end of my interrupt service routine for the IDE interrupts. Since I was using the IDE channels, they are sent through the slave PIC (which is cascaded through the master). Originally my code was only sending the end-of-interrupt byte to the slave, but then my professor told me that I should be sending it to the master PIC as well.
SO here is my problem, when I un-comment the line which sends the EOI byte to the master PIC, the systems triple faults and then reboots. Likewise, if I leave it commented the system stays running.
_outb( PIC_MASTER_CMD_PORT, PIC_EOI ); // this causes (or at least sets off) a triple fault reboot
_outb( PIC_SLAVE_CMD_PORT, PIC_EOI );
Without seeing the rest of the system, is it possible for someone to explain what could possibly be happening here?
NOTE: Just as a shot in the dark, I replaced the _outb() call with another _outb() call which just made sure that the interrupts were enable for the IDE controller, however, the generated assembly would have been almost identical. This did not cause a fault.
*_outb() is a wrapper for the x86 OUTB instruction.
What is so special about my function to send EOI to the master PIC that is an issue?
I realize without seeing the code this may be impossible to answer, but thanks for looking!
Triple faults usually point to a stack overflow or odd stack pointer. When a fault or interrupt occurs, the system immediately tries to push some more junk onto the stack (before invoking the fault handler). If the stack is hosed, this will cause another fault, which then tries to push more stuff on the stack, which causes another fault. At this point, the system gives up on you and reboots.
I know this because I actually have a silly patent (while working at Dell about 20 years ago) on a way to cause a CPU reset without external hardware (used to be done through the keyboard controller):
MOV ESP,1
PUSH EAX ; triple fault and reset!
An OUTB instruction can't cause a fault on its own. My guess is you are re-enabling an interrupt, and the interrupt gets triggered while something is wrong with your stack.
When you re-enable the PIC, are you doing it with the CPU's interrupt flag set, or cleared (ie. are you doing it sometime after a CLI opcode, or, sometime after an STI opcode)?
Assuming that the CPU's interrupt flag is enabled, your act of re-enabling the PIC allows any pending interrupts to reach the CPU: which would interrupt your code, dispatch to a vector specified by the IDT, etc.
So I expect that it's not your opcode that's directly causing the fault: rather, what's faulting is code that's run as the result of an interrupt which happens as a result of your re-enabling the PIC.