Zero-copy TCP message sending from a kernel module gives EFAULT

I am working on a kernel module which receives data over DMA from an FPGA and stores it in a ring buffer allocated with dma_alloc_attrs(dev, size, &data->dma_addr, GFP_KERNEL, DMA_ATTR_FORCE_CONTIGUOUS). Every time new data is available in the ring buffer, a completion is fired.
In the same kernel module, I am running a TCP server, and during the lifetime of the kernel module only one client (on a different machine) connects to the server (and stays connected). A separate thread in the kernel module sends the data received in the ring buffer to the connected client whenever the completion fires. The idea behind having a TCP server in kernel space is to get rid of the unnecessary context switches between kernel space and user space whenever data should be sent to the client, thus increasing performance. So far everything works, but the performance isn't as expected (on the TCP side).
After looking a bit into how to increase performance, I found the ZEROCOPY option.
I changed the settings of the server socket to set the SO_ZEROCOPY flag with kernel_setsockopt(socket, SOL_SOCKET, SO_ZEROCOPY, (char *)&one, sizeof(one)) and changed the implementation of sending to the client to:
static DEFINE_MUTEX(tcp_send_mtx);

static int send(struct socket *sock, const char *buf,
                const size_t length, unsigned long flags)
{
    struct msghdr msg;
    struct kvec vec;
    int len, written = 0;
    int left = length;

    if (sock == NULL) {
        printk(KERN_ERR MODULE_NAME ": tcp server send socket is NULL\n");
        return -EFAULT;
    }

    msg.msg_name = NULL;
    msg.msg_namelen = 0;
    msg.msg_control = NULL;
    msg.msg_controllen = 0;
    msg.msg_flags = MSG_ZEROCOPY;

repeat_send:
    vec.iov_len = left;
    vec.iov_base = (char *)buf + written;

    len = kernel_sendmsg(sock, &msg, &vec, left, left);
    if ((len == -ERESTARTSYS) ||
        (!(flags & MSG_DONTWAIT) && (len == -EAGAIN)))
        goto repeat_send;
    if (len > 0) {
        written += len;
        left -= len;
        if (left)
            goto repeat_send;
    }

    return written ? written : len;
}
Note the msg.msg_flags = MSG_ZEROCOPY; assignment in the send function.
Now when I try to use this, I get an EFAULT (-14) error code from kernel_sendmsg just by adding the MSG_ZEROCOPY flag.
UPDATE:
I understand now that the MSG_ZEROCOPY flag is wrongly used in kernel space, since it's designed to remove the additional copy between user space and kernel space.
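For reference, the pattern MSG_ZEROCOPY was actually designed for looks roughly like this in user space. This is a minimal sketch based on the kernel's MSG_ZEROCOPY documentation (fd is assumed to be a connected TCP socket, and real code must parse the completion notifications instead of discarding them):

#include <linux/errqueue.h>
#include <string.h>
#include <sys/socket.h>

static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    char control[128];
    struct msghdr msg;
    ssize_t n;

    /* Opt in once per socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;

    /* The pages of buf are pinned and sent without copying. */
    n = send(fd, buf, len, MSG_ZEROCOPY);

    /* buf must not be reused until a completion notification
     * (ee_origin == SO_EE_ORIGIN_ZEROCOPY) arrives on the error queue. */
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    recvmsg(fd, &msg, MSG_ERRQUEUE);

    return n;
}

In kernel space there is no user-to-kernel copy to eliminate, which is presumably why the MSG_ZEROCOPY path fails with EFAULT when given kernel buffers.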
My initial problem still exists: the TCP transfer is still slow, and the ring buffer overflows when the DMA transfer speed exceeds 120 MB/s. The thread that forwards the messages to the client is not able to send the 8 KB messages faster than 120 MB/s.
Does anyone know what is wrong here? Maybe the idea is wrong in the first place.

Related

eBPF newbie: need help, facing an error while loading eBPF code

I wrote a BPF program and compiled it with clang; while trying to load it, I get an error. I am not able to understand why or how to resolve it, and need expert advice.
I am running this code in a VM.
OS: Ubuntu 18.04.2
Kernel: Linux 4.18.0-15-generic x86_64
I tried simple programs and was able to load them, but not this program.
static __inline int clone_netflow_record(struct __sk_buff *skb, unsigned long dstIpAddr)
{
    return XDP_PASS;
}

static __inline int process_netflow_records(struct __sk_buff *skb)
{
    int i = 0;

#pragma clang loop unroll(full)
    for (i = 0; i < MAX_REPLICATIONS; i++) {
        clone_netflow_record(skb, ipAddr[i]);
    }
    return XDP_DROP;
}
__section("action")
static int probe_packets(struct __sk_buff *skb)
{
/* We will access all data through pointers to structs */
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
if (data > data_end)
return XDP_DROP;
/* for easy access we re-use the Kernel's struct definitions */
struct ethhdr *eth = data;
struct iphdr *ip = (data + sizeof(struct ethhdr));
/* Only actual IP packets are allowed */
if (eth->h_proto != __constant_htons(ETH_P_IP))
return XDP_DROP;
/* If Netflow packets process it */
if (ip->protocol != IPPROTO_ICMP)
{
process_netflow_records (skb);
}
return XDP_PASS;
}
ERROR Seen:
$ sudo ip link set dev enp0s8 xdp object clone.o sec action
Prog section 'action' rejected: Permission denied (13)!
- Type: 6
- Instructions: 41 (0 over limit)
- License: GPL
Verifier analysis:
0: (bf) r2 = r1
1: (7b) *(u64 *)(r10 -16) = r1
2: (79) r1 = *(u64 *)(r10 -16)
3: (61) r1 = *(u32 *)(r1 +76)
invalid bpf_context access off=76 size=4
Error fetching program/map!
The kernel verifier that enforces checks on your program ensures that no out-of-bounds access is attempted. Your program is rejected because it may trigger such an out-of-bounds access.
If we have a closer look at your snippet:
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
So here we get pointers to data (start of packet) and data_end.
if (data > data_end)
return XDP_DROP;
The above check is unnecessary (data will not be higher than data_end). But there's another check you should do here instead. Let's see below:
/* for easy access we re-use the Kernel's struct definitions */
struct ethhdr *eth = data;
struct iphdr *ip = (data + sizeof(struct ethhdr));
/* Only actual IP packets are allowed */
if (eth->h_proto != __constant_htons(ETH_P_IP))
return XDP_DROP;
What you do here is, first, make eth and ip point to the start of the packet and (supposedly) the start of the IP header. That step is fine. But then you try to dereference eth to access its h_proto field.
Now, what would happen if the packet was not Ethernet, and was not long enough to have an h_proto field in it? You would try to read some data outside of the bounds of the packet; this is the out-of-bounds access I mentioned earlier. Note that it does not mean your program actually tried to read this data (as a matter of fact, I don't see how you could get a packet shorter than 14 bytes). But from the verifier's point of view, it is technically possible that this forbidden access could occur, so it rejects your program. This is what the invalid bpf_context access message means: your code tries to access the context (for XDP: packet data) in an invalid way.
So how do we fix that? The check that you should have before trying to dereference the pointer should not be on data > data_end, it should be instead:
if (data + sizeof(struct ethhdr) > data_end)
return XDP_DROP;
So if we pass the check without returning XDP_DROP, we are sure that the packet is long enough to contain a full struct ethhdr (and hence an h_proto field).
Note that a similar check on data + sizeof(struct ethhdr) + sizeof(struct iphdr) will be necessary before trying to dereference ip, for the same reason. Each time you try to access data from the packet (the context), you should make sure that your packet is long enough to dereference the pointer safely.
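Putting both checks together, the access pattern for the first two headers could look like this (a sketch of just the bounds-checking logic, not your full program):

__section("action")
static int probe_packets(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip = data + sizeof(struct ethhdr);

    /* Is the packet long enough to hold a full Ethernet header? */
    if (data + sizeof(struct ethhdr) > data_end)
        return XDP_DROP;

    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_DROP;

    /* Is it long enough to hold the IP header as well? */
    if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
        return XDP_DROP;

    if (ip->protocol != IPPROTO_ICMP)
        process_netflow_records(skb);

    return XDP_PASS;
}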

Implementation of Poll Mechanism in Char Device Driver

Hello dear participants of Stack Overflow,
I'm new to kernel-space development and still at the beginning of the road.
I developed a basic char device driver that can handle open, close, read, and so on, but I couldn't find a proper source or how-to tutorial with a sample of the poll/select mechanism.
I've written the sample code for the poll function below:
static unsigned int dev_poll(struct file *file, poll_table *wait)
{
    poll_wait(file, &dev_wait, wait);
    if (size_of_message > 0) {
        printk(KERN_INFO "size_of_message > 0 returning POLLIN | POLLRDNORM\n");
        return POLLIN | POLLRDNORM;
    } else {
        printk(KERN_INFO "dev_poll return 0\n");
        return 0;
    }
}
It works fine, but I couldn't understand a few things.
When I call select from a user-space program as
struct timeval time = { 5, 0 };
select(fd + 1, &readfs, NULL, NULL, &time);
the dev_poll function in the driver is called once and returns zero or POLLIN depending on the buffer size, and then it is never called again. In user space, the program continues after 5 seconds if dev_poll returned 0.
What I couldn't understand is this: how will the driver code decide, and let the user-space program know, whether there is something readable in the buffer within these 5 seconds, if dev_poll is called once and returns immediately?
Is there any way in a kernel module to gather the information of the timeval parameter that comes from user space?
Thank you in advance.
Regards,
Calling poll_wait() actually places a wait object into the waitqueue specified as the second parameter. When the wait object is fired (via the waitqueue's wake_up or a similar function), the poll function is evaluated again.
The kernel driver needn't bother about timeouts: when the time is out, the wait object will be removed from the waitqueue automatically.
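In code, the pattern boils down to a poll_wait() in the poll handler paired with a wake_up() wherever data becomes available. A minimal sketch (the identifiers are placeholders):

#include <linux/poll.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(dev_wait);
static size_t size_of_message;

static unsigned int dev_poll(struct file *file, poll_table *wait)
{
    /* Registers the caller on the waitqueue; does not block. */
    poll_wait(file, &dev_wait, wait);

    if (size_of_message > 0)
        return POLLIN | POLLRDNORM;
    return 0;
}

/* Producer side: call this whenever new data has been buffered. */
static void data_arrived(void)
{
    /* Wakes the waitqueue, so select()/poll() re-evaluates dev_poll(). */
    wake_up_interruptible(&dev_wait);
}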
Hello dear people who, like me, are curious about poll. I came up with a solution.
In another topic on Stack Overflow, someone said that the poll function is called multiple times if the kernel needs the latest status. So basically I implemented that:
when poll is called, call poll_wait(wait_queue_head);
when the device has buffered data (this is usually in the driver's write function), call the wake_up macro with the wait_queue_head parameter.
After this step, the poll function of the driver is called again, and here you can return whatever you want to return; in this case POLLIN | POLLRDNORM.
Here is my sample code for write and poll in the driver.
static unsigned int dev_poll(struct file *file, poll_table *wait)
{
    static int dev_poll_called_count = 0;

    dev_poll_called_count++;
    poll_wait(file, &dev_wait, wait);
    read_wait_queue_length++;
    printk(KERN_INFO "Inside dev_poll called time is: %d read_wait_queue_length %d\n",
           dev_poll_called_count, read_wait_queue_length);
    printk(KERN_INFO "After poll_wait wake_up called\n");
    if (size_of_message > 0) {
        printk(KERN_INFO "size_of_message > 0 returning POLLIN | POLLRDNORM\n");
        return POLLIN | POLLRDNORM;
    } else {
        printk(KERN_INFO "dev_poll return 0\n");
        return 0;
    }
}

static ssize_t dev_write(struct file *filep, const char *buffer, size_t len, loff_t *offset)
{
    int ret;

    printk(KERN_INFO "Inside write\n");
    ret = copy_from_user(message, buffer, len);
    size_of_message = len;
    printk(KERN_INFO "EBBChar: Received %zu characters from the user\n", size_of_message);
    if (ret)
        return -EFAULT;
    message[len] = '\0';
    printk(KERN_INFO "received string %s", message);
    if (read_wait_queue_length) {
        wake_up(&dev_wait);
        read_wait_queue_length = 0;
    }
    return len;
}

MSG_PROXY not working to provide/specify alternate addresses for transparent proxying

I'm trying to write a transparent proxy that translates arbitrary UDP packets to a custom protocol and back again. I'm using transparent proxying to read the incoming UDP packets that need translation, and to write the outgoing UDP packets that have just been reverse-translated.
My setup for the socket I use for both flavors of UDP sockets is as follows:
static int
setup_clear_sock(uint16_t proxy_port)
{
    struct sockaddr_in saddr;
    int sock;
    int val = 1;
    socklen_t ttllen = sizeof(std_ttl);

    sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (sock < 0)
    {
        perror("Failed to create clear proxy socket");
        return -1;
    }
    if (getsockopt(sock, IPPROTO_IP, IP_TTL, &std_ttl, &ttllen) < 0)
    {
        perror("Failed to read IP TTL option on clear proxy socket");
        return -1;
    }
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val)) < 0)
    {
        perror("Failed to set reuse address option on clear socket");
        return -1;
    }
    if (setsockopt(sock, IPPROTO_IP, IP_TRANSPARENT, &val, sizeof(val)) < 0)
    {
        perror("Failed to set transparent proxy option on clear socket");
        return -1;
    }
    saddr.sin_family = AF_INET;
    saddr.sin_port = htons(proxy_port);
    saddr.sin_addr.s_addr = INADDR_ANY;
    if (bind(sock, (struct sockaddr *) &saddr, sizeof(saddr)) < 0)
    {
        perror("Failed to bind local address to clear proxy socket");
        return -1;
    }
    return sock;
}
I have two distinct, but possibly related problems. First, when I read an incoming UDP packet from this socket, using this code:
struct sock_double_addr_in
{
    __SOCKADDR_COMMON (sin_);
    in_port_t sin_port_a;
    struct in_addr sin_addr_a;
    sa_family_t sin_family_b;
    in_port_t sin_port_b;
    struct in_addr sin_addr_b;
    unsigned char sin_zero[sizeof(struct sockaddr) - __SOCKADDR_COMMON_SIZE - 8
                           - sizeof(struct in_addr) - sizeof(in_port_t)];
};

void
handle_clear_sock(void)
{
    ssize_t rcvlen;
    uint16_t nbo_udp_len, coded_len;
    struct sockaddr_in saddr;
    struct sock_double_addr_in sdaddr;
    bch_coding_context_t ctx;
    socklen_t addrlen = sizeof(sdaddr);

    rcvlen = recvfrom(sock_clear, &clear_buf, sizeof(clear_buf),
                      MSG_DONTWAIT | MSG_PROXY,
                      (struct sockaddr *) &sdaddr, &addrlen);
    if (rcvlen < 0)
    {
        perror("Failed to receive a packet from clear socket");
        return;
    }
    ....
I don't see a destination address come back in sdaddr. The sin_family_b, sin_addr_b, and sin_port_b fields are all zero. I've done a block memory dump of the structure in gdb, and indeed the bytes are coming back zero from the kernel (it's not a bad placement of the field in my structure definition).
Temporarily working around this by hard-coding a fixed IP address and port for testing purposes, I can debug the rest of my proxy application until I get to the point of sending an outgoing UDP packet that has just been reverse-translated. That happens with this code:
    ....
    udp_len = ntohs(clear_buf.u16[2]);
    if (udp_len + 6 > decoded_len)
        fprintf(stderr, "Decoded fewer bytes (%u) than outputting in clear "
                "(6 + %u)!\n", decoded_len, udp_len);

    sdaddr.sin_family = AF_INET;
    sdaddr.sin_port_a = clear_buf.u16[0];
    sdaddr.sin_addr_a.s_addr = coded_buf.u32[4];
    sdaddr.sin_family_b = AF_INET;
    sdaddr.sin_port_b = clear_buf.u16[1];
    sdaddr.sin_addr_b.s_addr = coded_buf.u32[3];

    if (sendto(sock_clear, &(clear_buf.u16[3]), udp_len, MSG_PROXY,
               (struct sockaddr *) &sdaddr, sizeof(sdaddr)) < 0)
        perror("Failed to send a packet on clear socket");
}
and the packet never shows up. I've checked the entire contents of the sdaddr structure I've built, and all fields look good. The UDP payload data looks good. There's no error coming back from the sendto() syscall; indeed, it returns zero. Yet the packet never appears in Wireshark.
So what's going on with my transparent proxying? How do I get this to work? (FWIW: the development host is a generic x86_64 Ubuntu 14.04 LTS box.) Thanks!
Alright, so I've got half an answer.
It turns out that if I just use a raw IP socket with the IP_HDRINCL option turned on, and build the outgoing UDP packet in user space with a full IP header, the kernel will honor the non-local source address and send the packet that way.
I'm now using a third socket, sock_output, for that purpose, and decoded UDP packets are coming out correctly. (Interesting side note: the UDP checksum field must either be zero, or the correct checksum value. Anything else causes the kernel to silently drop the packet, and you'll never see it go out. The kernel won't fill in the proper checksum for you if you zero it out, but if you specify it, it will verify that it's correct. No sending UDP with intentionally bad checksums this way.)
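For illustration, the construction of such a packet might look like the sketch below (hypothetical names, not my exact code; with IPPROTO_RAW the IP_HDRINCL behavior is implied, and the kernel fills in the IP header checksum):

#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send a UDP datagram with an arbitrary (possibly non-local) source.
 * Addresses are in network byte order, ports in host byte order. */
static ssize_t send_udp_raw(uint32_t src_ip, uint16_t src_port,
                            uint32_t dst_ip, uint16_t dst_port,
                            const char *payload, size_t plen)
{
    char pkt[1500];
    struct iphdr *ip = (struct iphdr *) pkt;
    struct udphdr *udp = (struct udphdr *) (pkt + sizeof(*ip));
    struct sockaddr_in dst;
    ssize_t ret;
    int sock;

    sock = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
    if (sock < 0)
        return -1;

    memset(pkt, 0, sizeof(pkt));
    ip->version  = 4;
    ip->ihl      = 5;
    ip->ttl      = 64;
    ip->protocol = IPPROTO_UDP;
    ip->saddr    = src_ip;          /* non-local source is honored */
    ip->daddr    = dst_ip;
    ip->tot_len  = htons(sizeof(*ip) + sizeof(*udp) + plen);
    /* ip->check is computed by the kernel when IP_HDRINCL is in effect */

    udp->source = htons(src_port);
    udp->dest   = htons(dst_port);
    udp->len    = htons(sizeof(*udp) + plen);
    udp->check  = 0;                /* 0 = "no checksum", legal for UDP/IPv4 */

    memcpy(pkt + sizeof(*ip) + sizeof(*udp), payload, plen);

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = dst_ip;

    ret = sendto(sock, pkt, sizeof(*ip) + sizeof(*udp) + plen, 0,
                 (struct sockaddr *) &dst, sizeof(dst));
    close(sock);
    return ret;
}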
So the first half of the question remains: when I read a packet from sock_clear with the MSG_PROXY flag to recvfrom(), why do I not get the actual destination address in the second half of sdaddr?

Flexible socket application

I'm writing a game which is played over the LAN using sockets. I use a 4-byte length prefix to know how much data follows it, like this:
void trust_recv(int sock, int length, char *buffer)
{
    int recved = 0;
    int justRecv;

    while (recved < length) {
        justRecv = recv(sock, buffer + recved, length - recved, 0);
        if (justRecv < 0) return;
        recved += justRecv;
    }
}

void onDataArrival(int sock)
{
    int length;
    char *data;

    trust_recv(sock, 4, (char *) &length);
    data = new char[length];
    trust_recv(sock, length, data);
    do_somethings_with_data(data);
}
The problem is that if someone (an intruder or hacker, for example) sends data in a different format (maybe only 2 bytes, or fewer bytes than the 4-byte prefix announces), or a network problem occurs, my application will go into a "not responding" state and has to be closed (because I use blocking sockets). How can I make my socket application more robust against this without switching the socket to non-blocking mode? (Or any ideas for organizing the data or algorithms as well.)
You can set a receive timeout during the socket setup phase, with a setsockopt() call and the SO_RCVTIMEO parameter:
struct timeval tv;

tv.tv_sec = 8;
tv.tv_usec = 0;

if (setsockopt(your_sock_fd, SOL_SOCKET, SO_RCVTIMEO, (char *)&tv, sizeof tv) < 0)
    perror("setsockopt error");
then test the return value of recv() and its errno:
if (justRecv < 0)
{
if (errno == EAGAIN)
perror("TIMEOUT!");
return;
}

Sending UDP packet in Linux Kernel

For a project, I'm trying to send UDP packets from Linux kernel space. I'm currently 'hard-coding' my code into the kernel (which I appreciate isn't the best/neatest way), but I'm trying to get a simple test to work (sending "TEST"). It should be mentioned that I'm a newbie to kernel hacking; I'm not that clued up on many principles and techniques!
Every time my code runs, the system hangs and I have to reboot: no mouse/keyboard response, and the Scroll Lock and Caps Lock key lights flash together. I'm not sure what this means, but I'm assuming it's a kernel panic?
The repeat_send code is unnecessary for this test, yet once it's working I want to send large messages that may require multiple 'send's; I'm not sure whether that could be a cause of my issues.
N.B. This code is being inserted into neighbour.c of the linux-source/net/core/ origin, hence the use of NEIGH_PRINTK1; it's just a macro wrapper around printk.
I'm really banging my head against a brick wall here; I can't spot anything obvious. Can anyone point me in the right direction (or spot that blindingly obvious error)?
Here's what I have so far:
void mymethod()
{
    struct socket sock;
    struct sockaddr_in addr_in;
    int ret_val;
    unsigned short port = htons(2048);
    unsigned int host = in_aton("192.168.1.254");
    unsigned int length = 5;
    char *buf = "TEST\0";
    struct msghdr msg;
    struct iovec iov;
    int len = 0, written = 0, left = length;
    mm_segment_t oldmm;

    NEIGH_PRINTK1("forwarding sk_buff at: %p.\n", skb);

    if ((ret_val = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
        NEIGH_PRINTK1("Error during creation of socket; terminating. code: %d\n", ret_val);
        return;
    }

    memset(&addr_in, 0, sizeof(struct sockaddr_in));
    addr_in.sin_family = AF_INET;
    addr_in.sin_port = port;
    addr_in.sin_addr.s_addr = host;

    if ((ret_val = sock.ops->bind(&sock, (struct sockaddr *)&addr_in, sizeof(struct sockaddr_in))) < 0) {
        NEIGH_PRINTK1("Error trying to bind socket. code: %d\n", ret_val);
        goto close;
    }

    memset(&msg, 0, sizeof(struct msghdr));
    msg.msg_flags = 0;
    msg.msg_name = &addr_in;
    msg.msg_namelen = sizeof(struct sockaddr_in);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = NULL;
    msg.msg_controllen = 0;

repeat_send:
    msg.msg_iov->iov_len = left;
    msg.msg_iov->iov_base = (char *)buf + written;

    oldmm = get_fs();
    set_fs(KERNEL_DS);
    len = sock_sendmsg(&sock, &msg, left);
    set_fs(oldmm);

    if (len == -ERESTARTSYS)
        goto repeat_send;
    if (len > 0) {
        written += len;
        left -= len;
        if (left)
            goto repeat_send;
    }

close:
    sock_release(&sock);
}
Any help would be hugely appreciated, thanks!
You may find it easier to use the netpoll API for UDP. Take a look at netconsole for an example of how it's used. The APIs you're using are intended more for user space (you should never have to play with segment descriptors to send network data!).
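A rough sketch of what that looks like (untested; the netpoll field types, e.g. local_ip/remote_ip, have changed across kernel versions, so treat the field assignments as an outline rather than exact code):

#include <linux/inet.h>
#include <linux/netpoll.h>
#include <linux/string.h>

static struct netpoll np;

/* One-time setup; the interface name, addresses and ports are placeholders. */
static int my_netpoll_init(void)
{
    memset(&np, 0, sizeof(np));
    strlcpy(np.dev_name, "eth0", IFNAMSIZ);
    np.local_ip = in_aton("192.168.1.1");
    np.remote_ip = in_aton("192.168.1.254");
    np.local_port = 6666;
    np.remote_port = 2048;
    memset(np.remote_mac, 0xff, ETH_ALEN);  /* broadcast; use the peer's MAC if known */
    return netpoll_setup(&np);
}

static void my_send_udp(const char *buf, int len)
{
    /* netpoll output is self-contained, which avoids the pitfalls above. */
    netpoll_send_udp(&np, buf, len);
}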
Run your code when you're in a text mode console (i.e. press Ctrl+Alt+F1 to go to the text console). This way a kernel panic will print out the stack trace and any extra information about what went wrong.
If that doesn't help you, update your question with the stack trace.
I'm not much of a Linux kernel developer, but can you throw some printk's in there and watch dmesg before it goes down? Or have you thought about hooking up a kernel debugger?
I think you should try to put all variables outside the mymethod() function and make them static. Remember that the size of the kernel stack is limited to 8 KiB, so too many or too large local variables may cause a stack overflow and a system hang-up.