MPI_SEND stops working after MPI_BARRIER - webserver

I'm building a distributed web server in C/MPI, and it seems like point-to-point communication completely stops working after the first MPI_BARRIER in my code. Plain C code works after the barrier, so I know that each of the processes makes it through the barrier. Point-to-point communication also works just fine before the barrier. However, when I copy-paste the same code that worked the line before the barrier into the line after the barrier, it stops working entirely. The SEND just waits forever. When I try using an ISEND instead, the call returns, but the message is never received. I've been googling this problem a lot, and everyone who has problems with MPI_BARRIER is told the barrier works correctly and their code is wrong, but I cannot for the life of me figure out why my code is wrong. What could be causing this behavior?
Here is a sample program that demonstrates this:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int procID;
    int val;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &procID);
    MPI_Barrier(MPI_COMM_WORLD);

    if (procID == 0)
    {
        val = 4;
        printf("Before send\n");
        MPI_Send(&val, 1, MPI_INT, 1, 4, MPI_COMM_WORLD);
        printf("after send\n");
    }

    if (procID == 1)
    {
        val = 1;
        printf("before: val = %d\n", val);
        MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("after: val = %d\n", val);
    }

    MPI_Finalize();
    return 0;
}
Moving the two if statements before the barrier causes this program to run correctly.
EDIT - It appears that the first communication, regardless of type, works, and all future communications fail. This is much more general than I thought at first: it doesn't matter whether the first communication is a barrier or some other message; no future communications work properly.

Open MPI has a known feature when it uses TCP/IP for communications: it tries to use all configured network interfaces that are in the "UP" state. This becomes a problem if some of the other nodes are not reachable through all of those interfaces. It is part of the greedy communication optimisation that Open MPI employs, and sometimes, as in your case, it leads to problems.
It seems that at least the second node has more than one interface up, and that this fact was announced to the first node during the negotiation phase:
one configured with 128.2.100.167
one configured with 192.168.109.1 (do you have a tunnel or Xen running on the machine?)
The barrier communication happens over the first network, and then the next MPI_Send tries to send to the second address over the second network, which obviously does not connect all nodes.
The easiest solution is to tell Open MPI to use only the network that connects your nodes. You can tell it to do so using the following MCA parameter:
--mca btl_tcp_if_include 128.2.100.0/24
(or whatever your communication network is)
You can also specify the list of network interfaces if it is the same on all machines, e.g.
--mca btl_tcp_if_include eth0
or you can tell Open MPI to specifically exclude certain interfaces (but you must always tell it to exclude the loopback "lo" if you do so):
--mca btl_tcp_if_exclude lo,virt0
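These parameters go on the mpirun command line. For example (the hostfile and program names here are placeholders, not from your setup):
mpirun --mca btl_tcp_if_include eth0 -np 2 --hostfile hosts ./webserver
To avoid retyping them, you can also put btl_tcp_if_include = eth0, one parameter per line, in $HOME/.openmpi/mca-params.conf.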
Hope that helps you and the many others who appear to have the same problem here on SO. It seems that recently almost all Linux distros have started bringing up various network interfaces by default, and that is likely to cause problems with Open MPI.
P.S. Put those nodes behind a firewall, please!

Share information between functions (BPF/XDP)

Objective: if process id/name = xxx, then drop the packet.
So, I am a bit confused. So far I know that you can't extract process information from XDP, but BPF tracing lets you trace it.
Here's my tentative solution: use a BPF hash map to share information between the two functions; if the process name == xxx, then XDP_DROP. (This may be wrong, but it's what I was trying.)
But I am confused about how to use BPF hash maps, even after reading the BCC documentation.
Example: with this hello function I can trace events:
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);

int hello(struct pt_regs *ctx) {
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid();
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
And this XDP function drops packets:
int udpfilter(struct xdp_md *ctx) {
    bpf_trace_printk("got a packet\n");
    //u32 cpu = bpf_get_smp_processor_id();
    //bpf_trace_printk("%s looking\n", cpu);
    //u32 pid = bpf_get_current_pid_tgid();
    return XDP_DROP;
}
Now, how do I fetch the pid value and use it in the XDP function? And does the solution even make any sense? Thanks for the help, really appreciated.
So, as you know, eBPF programs can be loaded into the kernel at different locations. XDP programs are loaded just after the network driver and just before the network stack. At this point the kernel doesn't yet know which process a packet is for, since it will figure all of that out in the network stack.
The hello program you are showing is an example of a kprobe (kernel probe). It attaches to whatever kernel function you specify, but it is a tracing tool; it can't make changes.
Also, some helper functions like bpf_get_current_pid_tgid are program-type dependent. bpf_get_current_pid_tgid only works in kprobe, uprobe and tracepoint programs (perf programs); it may actually also work in socket and cgroup programs. The issue is that there is no very clear list or overview of which helpers work where; these are two good but non-comprehensive links:
https://blogs.oracle.com/linux/post/bpf-in-depth-bpf-helper-functions
https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#program-types
In the end it comes down to logic. The kernel can only give you access to data and actions it has access to itself. So if you want to do network-related things based on process IDs, you might need to use an eBPF program attached at a location where such info is available (keep in mind that this is obviously also slower).
So, depending on what exactly you want to do, you have a few options:
Attach an eBPF program to a network socket (BPF_PROG_TYPE_SOCKET_FILTER) so you can filter packets at the socket level. This does require the program that creates the socket to attach the eBPF program to it.
Use a cgroup and a BPF_PROG_TYPE_CGROUP_SKB program to block packets. Since you attach the program to the cgroup, this doesn't require cooperation from the target program.
Use a TC program (BPF_PROG_TYPE_SCHED_ACT); at this level the packet is already parsed, but you still need to match it to a process.
An XDP program (BPF_PROG_TYPE_XDP) can still be used, but it requires you to parse all layers of the network packet (Ethernet, VLAN, IP, UDP/TCP) and manually extract the protocol, destination IP, and destination port. Just like in the TC program, you then need to match the packet to a PID using a lookup table.
When going the XDP or TC route you need to create this lookup table yourself. As far as I know you can't access the kernel's socket table via helper functions. A few approaches are (see the sketch after this list):
Parsing the output of netstat -lpn (protocol, destination IP, destination port and PID) and setting the data in a map to be used by the program.
Getting the same data directly from /sys or /proc (I don't know where exactly the data is stored).
Recording which PIDs have which sockets at creation time (using a second program, a kprobe or tracepoint) and putting this data in a map shared by both the XDP/TC program and the trace program. (I'm not quite sure how to share maps between programs in BCC, but it is certainly possible when using libbpf.)
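To make that last approach concrete, here is a hedged BCC-style sketch, not a tested implementation: a kprobe on inet_bind records which PID bound which UDP port, and the XDP program consults the shared map. The names (port2pid, trace_bind) are my own, and a real version needs IPv6, TCP and error handling:

#define KBUILD_MODNAME "pidfilter"
#include <uapi/linux/bpf.h>
#include <linux/ptrace.h>
#include <linux/net.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>

BPF_HASH(port2pid, u16, u32);   // local UDP port -> PID that bound it

// Kprobe side: runs in process context, so the PID helpers work here.
int trace_bind(struct pt_regs *ctx, struct socket *sock,
               struct sockaddr *addr, int addrlen) {
    struct sockaddr_in sin = {};
    bpf_probe_read(&sin, sizeof(sin), addr);
    if (sin.sin_family != AF_INET)
        return 0;
    u16 port = ntohs(sin.sin_port);
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    port2pid.update(&port, &pid);
    return 0;
}

// XDP side: no process context, so look the owner up in the shared map.
int udpfilter(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;
    struct udphdr *udp = (void *)(ip + 1);   // assumes no IP options
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    u16 dport = ntohs(udp->dest);
    u32 *pid = port2pid.lookup(&dport);
    if (pid)                  // a traced process owns this port;
        return XDP_DROP;      // you could also compare *pid to a target here
    return XDP_PASS;
}

In the Python part you would attach trace_bind with attach_kprobe(event="inet_bind", fn_name="trace_bind") and load udpfilter as the XDP function; because both live in the same BCC module, they share the port2pid map.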

Unaligned accesses are not detected by Raspberry PI version 1

I'm performing a set of activities to make sure Redis runs well on a set of embedded systems, including the Raspberry PI. In order to fix certain code paths of Redis where unaligned memory accesses are performed (due to a change introduced in Redis 3.2), I'm trying to force the PI to either log a message on unaligned memory accesses or send a signal to the process when they happen. This way I can both make sure that Redis will run well where unaligned accesses are a violation, and that it will run faster on platforms where such accesses can be performed but are slower. ARM v6, the one used in the PI v1, is apparently able to deal with unaligned memory accesses, so when I use the following command to configure Linux to send a signal to the process performing the unaligned access:
echo 4 > /proc/cpu/alignment
And then run the following program:
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    char *buf = "foobareklsjdfklsjdfslkjfskdljfskdfjdslkjfdslkjfsd";
    uint32_t *l = (uint32_t *) (buf+1);   /* deliberately misaligned pointer */
    printf("%p\n", l);
    printf("%d\n", (int)*l);
    return 0;
}
I can't see any signal received by the process, nor the counters at /proc/cpu/alignment incrementing.
My guess is that this is due to ARM v6's ability to deal with unaligned addresses automatically when a given CPU configuration flag is set. My question is: is my hypothesis correct? And if so, how can I force a PI version 1 to actually raise an exception on unaligned accesses, so that the Linux kernel can trap them and send a signal, log the access, and so forth, according to the /proc/cpu/alignment settings?
EDIT: It is worth noting that not all instructions can perform unaligned accesses even on ARM v6. For instance, STMDB, STMFD, LDMDB, LDMEA and similar multiple-word instructions will indeed raise an exception and be trapped by the Linux kernel.
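For instance (a hedged example of my own: whether the compiler actually emits LDRD/LDM here depends on the GCC version and optimisation flags, so check the disassembly), a misaligned 64-bit load is a likely way to hit one of these trapping instructions:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    char buf[16] = "abcdefghijklmno";
    uint64_t *p = (uint64_t *) (buf + 1);   /* misaligned by one byte */
    /* If GCC emits LDRD or LDM for this load, ARM v6 traps it and the
       /proc/cpu/alignment machinery gets involved. */
    printf("%llu\n", (unsigned long long)*p);
    return 0;
}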
I think I eventually found my answers:
1. Yes, I'm correct: up to the word size, ARM v6 (or greater) can silently handle unaligned accesses, so no trap is generated and the whole thing is completely transparent to the Linux kernel. Nothing will be logged, nor will the trap counter in /proc/cpu/alignment be incremented.
2. AFAIK there is no way to force the kernel to trap word-sized unaligned accesses, since for that the CPU would have to be configured to trap unaligned addresses in every case, and the Linux kernel does not do that, probably because there is alignment-unsafe code inside the kernel itself. Checking the Linux kernel source code, indeed one can see:
if (cpu_is_v6_unaligned()) {
    set_cr(__clear_cr(CR_A));
    ai_usermode = safe_usermode(ai_usermode, false);
}
What this means is that the SCTLR.A bit is always cleared, so no trap will be generated for the unaligned accesses that ARM v6 can handle.
3. There are a great many instructions that will still generate traps when used with unaligned addresses, for example multi-store/load instructions and loads and stores of double values.
4. However, there are instructions that GCC (the version shipped in the default Raspberry Pi Linux distribution) will happily produce that are not handled correctly by the Linux kernel; they result in a SIGBUS even when /proc/cpu/alignment is set to fix up the access.
So point number 4 basically means that it is not a good idea to fix programs to run on ARM v6 by just letting the Linux kernel handle unaligned addresses for us, even when the performance implications of unaligned accesses are not a problem: the program can still crash, since not all instructions are handled.
How to reliably find all the unaligned accesses in a program remains an open question AFAIK, since unfortunately the otherwise wonderful valgrind never implemented this feature. In the past I had to use QEMU emulating Sparc, but that is a very slow process. Valgrind would be the natural way to do it.

Is low latency mode safe to use with Linux serial ports?

Is it safe to use the low_latency tty mode with Linux serial ports? The documentation for tty_flip_buffer_push says that it "must not be called from IRQ context if port->low_latency is set." Nevertheless, many low-level serial port drivers call it from an ISR whether or not the flag is set. For example, the mpc52xx driver calls it unconditionally after each read from its FIFO.
A consequence of flipping the buffer in the ISR with low_latency set is that the line discipline driver is entered within IRQ context. My goal is to get a latency of one millisecond or less when reading from a high-speed mpc52xx serial port. Setting low_latency achieves the latency goal, but it also violates the documented precondition for tty_flip_buffer_push.
This question was asked on linux-serial on Fri, 19 Aug 2011.
No, low latency is not safe in general.
However, in the particular case of 3.10.5, low_latency is safe.
The comments above tty_flip_buffer_push read:
"This function must not be called from IRQ context if port->low_latency is set."
However, the code (3.10.5, drivers/tty/tty_buffer.c) contradicts this:
void tty_flip_buffer_push(struct tty_port *port)
{
    struct tty_bufhead *buf = &port->buf;
    unsigned long flags;

    spin_lock_irqsave(&buf->lock, flags);
    if (buf->tail != NULL)
        buf->tail->commit = buf->tail->used;
    spin_unlock_irqrestore(&buf->lock, flags);

    if (port->low_latency)
        flush_to_ldisc(&buf->work);
    else
        schedule_work(&buf->work);
}
EXPORT_SYMBOL(tty_flip_buffer_push);
The use of spin_lock_irqsave/spin_unlock_irqrestore makes this code safe to call from interrupt context.
There is a test for low_latency, and if it is set, flush_to_ldisc is called directly. This flushes the flip buffer to the line discipline immediately, at the cost of making interrupt processing longer. The flush_to_ldisc routine is also coded to be safe for use in interrupt context. I guess that an earlier version was unsafe.
If low_latency is not set, then schedule_work is called. Calling schedule_work is the classic way to invoke the "bottom half" handler from the "top half" in interrupt context. This causes flush_to_ldisc to be called from the "bottom half" handler at the next clock tick.
Looking a little deeper, both the comment and the test seem to be in Alan Cox's original e0495736 commit of tty_buffer.c. This commit was a re-write of earlier code, so it seems that at one time there wasn't a test. Whoever added the test and fixed flush_to_ldisc to be interrupt-safe did not bother to fix the comment.
So, always believe the code, not the comments.
However, in the same code in 3.12-rc* (as of October 23, 2013), it looks like the problem has been reopened: the spin_lock_irqsave calls in flush_to_ldisc were removed and mutex_locks were added. That is, setting UPF_LOW_LATENCY in the serial_struct flags and calling the TIOCSSERIAL ioctl will again cause "scheduling while atomic".
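For reference, the flag mentioned above is set from user space roughly like this; a minimal sketch, assuming /dev/ttyS0 as the port (ASYNC_LOW_LATENCY is the user-space name for UPF_LOW_LATENCY):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/serial.h>

int main(void) {
    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);
    if (fd < 0) { perror("open"); return 1; }

    struct serial_struct ss;
    if (ioctl(fd, TIOCGSERIAL, &ss) < 0) { perror("TIOCGSERIAL"); return 1; }
    ss.flags |= ASYNC_LOW_LATENCY;          /* request immediate flushing */
    if (ioctl(fd, TIOCSSERIAL, &ss) < 0) { perror("TIOCSSERIAL"); return 1; }

    close(fd);
    return 0;
}

(setserial /dev/ttyS0 low_latency does the same thing.) This is exactly the path that triggers the "scheduling while atomic" bug on 3.12.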
The latest update from the maintainer is:
On 10/19/2013 07:16 PM, Jonathan Ben Avraham wrote:
> Hi Peter,
> "tty_flip_buffer_push" is called from IRQ handlers in most drivers/tty/serial UART drivers.
>
> "tty_flip_buffer_push" calls "flush_to_ldisc" if low_latency is set.
> "flush_to_ldisc" calls "mutex_lock" in 3.12-rc5, which cannot be used in interrupt context.
>
> Does this mean that setting "low_latency" cannot be used safely in 3.12-rc5?
Yes, I broke low_latency.
Part of the problem is that the 3.11- use of low_latency was unsafe; too many shared
data areas were simply accessed without appropriate safeguards.
I'm working on fixing it but probably won't make it for 3.12 final.
Regards,
Peter Hurley
So, it looks like you should not depend on low_latency unless you are sure that you are never going to change your kernel from a version that supports it.
Update: February 18, 2014, kernel 3.13.2
Stanislaw Gruszka wrote:
Hi,
setserial has low_latency option which should minimize receive latency
(scheduler delay). AFAICT it is used if someone talk to external device
via RS-485/RS-232 and need to have quick requests and responses . On
kernel this feature was implemented by direct tty processing from
interrupt context:
void tty_flip_buffer_push(struct tty_port *port)
{
    struct tty_bufhead *buf = &port->buf;

    buf->tail->commit = buf->tail->used;
    if (port->low_latency)
        flush_to_ldisc(&buf->work);
    else
        schedule_work(&buf->work);
}
But after the 3.12 tty locking changes, calling flush_to_ldisc() from interrupt context is a bug (we got a scheduling-while-atomic bug report here: https://bugzilla.redhat.com/show_bug.cgi?id=1065087 ).
I'm not sure how this should be solved. After Peter got rid of all those race conditions in the tty layer, we probably don't want to go back to using spin_locks there. Maybe we can create a WQ_HIGHPRI workqueue and schedule the flush_to_ldisc() work there. Or perhaps users that need low latency should switch to threaded irqs and prioritize the serial irq to meet their requirements. Anyway, setserial low_latency is now broken, and those who used this feature in the past can no longer do so on 3.12+ kernels.
Thoughts?
Stanislaw
A patch has been posted to LKML to address the problem. It removes the generic code for handling low_latency but keeps the parameter for the low-level drivers to use.
http://www.kernelhub.org/?p=2&msg=419071
I tried forcing low_latency on Linux 3.12 with a serial console. The kernel was very unstable: if preemption was enabled, it would hang after a few minutes of use.
So the answer for now is to stay away.

If a close is interrupted or fails, what is the state of the fd?

Reading the man page for close, if it's interrupted by a signal then the fd's state is unspecified. Is there a best practice for handling this case, or is it assumed to become the OS's problem afterwards?
I assume that a failure with EIO closes the fd appropriately.
If you want your program to run for a long time, a possible file descriptor leak is never only the operating system's problem. Short-lived programs which don't use many file descriptors can of course simply terminate with the descriptors unclosed and rely on the kernel closing them when the program exits. So, for the rest of my answer I'll assume your program is long-running.
If your program is not multi-threaded, you have a very easy situation:
int close_some_fd(int fd, int *was_interrupted) {
    *was_interrupted = 0;
    /* this is point X to which I will draw your attention later. */
    for (;;) {
        if (0 == close(fd))
            return 0;  /* Success. */
        if (EINTR != errno)
            return -1; /* Some failure, not interrupted by a signal. */

        /* Our close attempt was interrupted. */
        *was_interrupted = 1;
        /* Just use the fd to find out if it is still open:
           F_GETFD returns the flags, or -1 with EBADF once fd is closed. */
        int fdflags = fcntl(fd, F_GETFD);
        if (-1 == fdflags)
            return 0;  /* Interrupted, but the file is also closed. So we are done. */
    }
}
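A hypothetical call site (sock_fd stands for whatever descriptor you own):

int was_interrupted;
if (0 != close_some_fd(sock_fd, &was_interrupted))
    perror("close_some_fd");   /* a real failure such as EBADF or EIO */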
On the other hand, if your code is multi-threaded, some other thread (perhaps one you don't control, such as a name service cache) may have called dup, dup2, socket, open, accept or some other function that makes the kernel allocate a file descriptor.
To make a similar approach work in such an environment, you need to be able to tell the difference between the file descriptor you started with and a file descriptor newly opened by another thread. Knowing that there is already a lower-numbered fd which still isn't open would be enough to discount many of those possibilities, but in a multi-threaded environment you have no simple way of verifying that this is still the case.
One option is to rely on some common aspect of all the file descriptors your program works with. For example, if it never uses the close-on-exec flag, you can use fcntl(fd, F_SETFD, FD_CLOEXEC) to set that flag at the point marked X in the code, and then you just change the existing fcntl check like this:
int fdflags = fcntl(fd, F_GETFD);
if (-1 != fdflags) {
    if (fdflags & FD_CLOEXEC) {
        /* Open, and still marked close-on-exec: still our fd, so retry the close. */
        continue;
    } else {
        /* The number was closed and reused by another thread. So we are done. */
        return 0;
    }
}
return 0; /* Interrupted, but the file is also closed. So we are done. */
You can adjust this approach to use something other than FD_CLOEXEC (fstat, for example, recording st_dev and st_ino, as sketched below), but if you can't be sure what the rest of your multithreaded program is doing, this general idea is likely to be unsatisfying.
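Here is a hedged sketch of that fstat variant (my own illustration, not from the original answer; note it can still be fooled if another thread opens the very same file): record the file's identity before closing and, after an interruption, check whether the fd still names the same file:

#include <sys/stat.h>
#include <errno.h>
#include <unistd.h>

int close_checking_identity(int fd) {
    struct stat before, after;
    if (0 != fstat(fd, &before))
        return -1;                 /* cannot record the identity; give up */
    for (;;) {
        if (0 == close(fd))
            return 0;
        if (EINTR != errno)
            return -1;
        if (0 != fstat(fd, &after))
            return 0;              /* the fd is gone: the close completed */
        if (after.st_dev != before.st_dev || after.st_ino != before.st_ino)
            return 0;              /* reused by another thread for a different file */
        /* Still apparently our file: retry the close. */
    }
}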
There's another approach, which is also a bit problematic but might serve: instead of closing your file descriptor, use sendmsg to pass it across a Unix-domain socket to a separate, single-threaded, special-purpose server whose only job is to close the file descriptor. Yes, this is a bit icky. Some entertaining properties of this approach are:
In order to avoid uncertainty over whether your fd was really passed to the server and closed successfully, you should probably read from a return channel of fd-closed-OK messages coming back from the server. This also avoids needing to block signal delivery while you are executing sendmsg. However, it means a user-space context switch for every file descriptor close (unless you batch them up to amortise the cost).
You need to avoid a situation where thread A may read an fd-closed-OK report corresponding to a request made by thread B. You can avoid that problem by serialising close operations (which will limit performance) or by demultiplexing the responses (which is complex). Alternatively, you could use some other IPC mechanism that avoids the need to serialise or demultiplex (SYSV semaphores, for example).
For a high-volume process this will place an entertainingly high load on the kernel's file descriptor garbage collector apparatus, which is normally not highly stressed and may therefore give you some interesting symptoms.
As for what I'd do personally in the applications I work with most often: I'd figure out to what extent I could make assumptions about what kind of file descriptor I was closing. If, for example, they were normally sockets, I'd just try this out with a test program and figure out whether EINTR normally leaves the file descriptor closed or not. That is, determine whether the theoretically unspecified state of the file descriptor on EINTR is, in practice, predictable. This will not work well if the fd could be anything (e.g. disk file, socket, tty, ...). Of course, if you're using an open-source operating system you may just be able to read the kernel source and determine what the possible outcomes are.
Certainly I'd try the experiment-based approach above before worrying about sharding the fd-close servers to scale out fd closing.

How to set a timeout in connect/send? (AS400 iSeries V5R4, RPG)

From this RPG socket tutorial we created a socket client in RPG that calls a Java server socket.
The problem is that the connect()/send() operations block, and we have a requirement that if the connect/send can't be done within, say, a second, we have to just log it and finish.
If I set the socket to non-blocking mode (I think with fcntl), we don't fully understand how to proceed, and we can't find any useful documentation with examples for it.
I think that if I connect with a non-blocking socket, I have to do select(..., timeout), which tells us whether the connect succeeded and we are able to send(bytes). But if we then send(bytes) on what is now a non-blocking socket (so the call returns immediately), how do I know that send() actually delivered the bytes to the server before I close the socket?
I could fall back to writing the client socket on the AS400 as a Java or C procedure, but I really want to keep it in a simple RPG program.
Would somebody help me understand how to do that, please?
Thanks!
In my opinion, that RPG tutorial you mention has a slight defect. What I believe is causing your confusion is the following section's code:
...
Consequently, we typically call the send() API like this:
D miscdata S 25A
D rc S 10I 0
C eval miscdata = 'The data to send goes here'
C eval rc = send(s: %addr(miscdata): 25: 0)
c if rc < 25
C* for some reason we weren't able to send all 25 bytes!
C endif
...
If you read the documentation for send(), you will see that a return value greater than -1 does not indicate an error, yet the code above treats any send of fewer than 25 bytes as a failure. In fact, a short send just means you must call send() again, advancing the pointer into the buffer past what has already been sent, until the sum of the return values equals the size of the buffer. Look at Beej's Guide to Network Programming. You might also like to look at Richard Stevens' book UNIX Network Programming, Volume 1 for really detailed explanations.
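A hedged C sketch of that resend loop (the classic pattern from Beej's guide; the RPG version makes the same calls, only the syntax differs):

int send_all(int s, const char *buf, int len) {
    int total = 0;                 /* bytes sent so far */
    while (total < len) {
        int n = send(s, buf + total, len - total, 0);
        if (n == -1)
            return -1;             /* caller can inspect errno */
        total += n;                /* advance past what was already sent */
    }
    return total;                  /* == len on success */
}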
As to the problem of determining whether the last send before close() actually sent the data: the paragraph above explains how to determine what portion of the data was sent, and calling close() will attempt to send all unsent data unless SO_LINGER is set.
fcntl() is used to control blocking, while setsockopt() is used to set SO_LINGER.
The abstraction of network communications being used here is BSD sockets. There are some slight differences in implementations across OSes, but it is generally quite homogeneous, which means that documentation written for other OSes usually works for the broad overview. Most of the time.
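Since the question also asks how to proceed once the socket is non-blocking, here is a hedged C sketch of the usual connect-with-timeout idiom (my own illustration, not from the tutorial; the same fcntl/connect/select/getsockopt sequence translates call-for-call into RPG):

#include <sys/socket.h>
#include <sys/select.h>
#include <fcntl.h>
#include <errno.h>

/* Returns 0 on success, -1 on failure or timeout. */
int connect_with_timeout(int s, const struct sockaddr *addr,
                         socklen_t len, int seconds) {
    int flags = fcntl(s, F_GETFL, 0);
    fcntl(s, F_SETFL, flags | O_NONBLOCK);     /* switch to non-blocking */

    if (connect(s, addr, len) == 0) {
        fcntl(s, F_SETFL, flags);              /* connected immediately */
        return 0;
    }
    if (errno != EINPROGRESS)
        return -1;                             /* immediate, real failure */

    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(s, &wfds);
    struct timeval tv = { seconds, 0 };
    if (select(s + 1, NULL, &wfds, NULL, &tv) <= 0)
        return -1;                             /* timed out, or select failed */

    /* The socket also becomes writable on failure, so check SO_ERROR. */
    int err = 0;
    socklen_t elen = sizeof(err);
    if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &elen) != 0 || err != 0)
        return -1;                             /* connect failed asynchronously */

    fcntl(s, F_SETFL, flags);                  /* restore blocking mode */
    return 0;
}

The same select-for-writability test answers the send() half of the question: select tells you when you may call send(), and the return value of send() tells you how many bytes were accepted, to be handled with the loop shown earlier.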