eBPF newbie: need help, facing an error while loading eBPF code

I wrote a BPF program and compiled it with clang, but I get an error when I try to load it. I don't understand why it fails or how to resolve it, and would appreciate expert advice.
I am running this code in a VM:
OS: Ubuntu 18.04.2
Kernel: Linux 4.18.0-15-generic x86_64
I tried simple programs and was able to load them, but not this one.
static __inline int clone_netflow_record (struct __sk_buff *skb, unsigned long dstIpAddr)
{
return XDP_PASS;
}
static __inline int process_netflow_records( struct __sk_buff *skb)
{
int i = 0;
#pragma clang loop unroll(full)
for (i = 0; i < MAX_REPLICATIONS; i++) {
clone_netflow_record (skb, ipAddr[i]);
}
return XDP_DROP;
}
__section("action")
static int probe_packets(struct __sk_buff *skb)
{
/* We will access all data through pointers to structs */
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
if (data > data_end)
return XDP_DROP;
/* for easy access we re-use the Kernel's struct definitions */
struct ethhdr *eth = data;
struct iphdr *ip = (data + sizeof(struct ethhdr));
/* Only actual IP packets are allowed */
if (eth->h_proto != __constant_htons(ETH_P_IP))
return XDP_DROP;
/* If Netflow packets process it */
if (ip->protocol != IPPROTO_ICMP)
{
process_netflow_records (skb);
}
return XDP_PASS;
}
ERROR Seen:
$ sudo ip link set dev enp0s8 xdp object clone.o sec action
Prog section 'action' rejected: Permission denied (13)!
- Type: 6
- Instructions: 41 (0 over limit)
- License: GPL
Verifier analysis:
0: (bf) r2 = r1
1: (7b) *(u64 *)(r10 -16) = r1
2: (79) r1 = *(u64 *)(r10 -16)
3: (61) r1 = *(u32 *)(r1 +76)
invalid bpf_context access off=76 size=4
Error fetching program/map!

The kernel verifier, which checks your program before the kernel accepts it, ensures that no out-of-bounds accesses are attempted. Your program is rejected because it may trigger such an out-of-bounds access.
If we have a closer look at your snippet:
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
So here we get pointers to data (start of packet) and data_end.
if (data > data_end)
return XDP_DROP;
The above check is unnecessary (data will not be higher than data_end). But there's another check you should do here instead. Let's see below:
/* for easy access we re-use the Kernel's struct definitions */
struct ethhdr *eth = data;
struct iphdr *ip = (data + sizeof(struct ethhdr));
/* Only actual IP packets are allowed */
if (eth->h_proto != __constant_htons(ETH_P_IP))
return XDP_DROP;
What you do here is, first, make eth and ip point to the start of the packet and (supposedly) the start of the IP header. This step is fine. But then, you try to dereference eth to access its h_proto field.
Now, what would happen if the packet was not Ethernet, and it was not long enough to have an h_proto field in it? You would try to read data outside the bounds of the packet; this is the out-of-bounds access I mentioned earlier. Note that it does not mean your program actually tried to read this data (as a matter of fact, I don't see how you could get a packet shorter than 14 bytes). But from the verifier's point of view, it is technically possible that this forbidden access could occur, so it rejects your program. As for the exact message in your log, invalid bpf_context access off=76 size=4 is raised by the skb->data load itself: the program is attached as XDP (Type: 6 in the output), whose context is struct xdp_md rather than struct __sk_buff, and offset 76 (the data field of struct __sk_buff) lies outside that much smaller structure. So the program argument should be a struct xdp_md *, in addition to the bounds checks described here.
So how do we fix that? The check that you should have before trying to dereference the pointer should not be on data > data_end, it should be instead:
if (data + sizeof(struct ethhdr) > data_end)
return XDP_DROP;
So if we pass the check without returning XDP_DROP, we are sure that the packet is long enough to contain a full struct ethhdr (and hence an h_proto field).
Note that a similar check on data + sizeof(struct ethhdr) + sizeof(struct iphdr) will be necessary before trying to dereference ip, for the same reason. Each time you try to access data from the packet (the context), you should make sure that your packet is long enough to dereference the pointer safely.
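The two bounds checks described above can be tried out in plain userspace C before moving them into the BPF program. The following is only a sketch of the pointer logic: classify_packet and the verdict enum are hypothetical names, and in a real XDP program data and data_end would come from the struct xdp_md context and the return values would be XDP_PASS / XDP_DROP.

```c
#include <stddef.h>
#include <arpa/inet.h>      /* htons */
#include <netinet/in.h>     /* IPPROTO_ICMP */
#include <linux/if_ether.h> /* struct ethhdr, ETH_P_IP */
#include <linux/ip.h>       /* struct iphdr */

/* Hypothetical verdicts standing in for the XDP return codes. */
enum verdict { VERDICT_DROP = 0, VERDICT_PASS = 1, VERDICT_ICMP = 2 };

/* Every header is bounds-checked against data_end before it is read. */
static enum verdict classify_packet(const void *data, const void *data_end)
{
    const struct ethhdr *eth = data;
    const struct iphdr *ip;

    /* Is there room for a full Ethernet header? */
    if ((const char *)data + sizeof(*eth) > (const char *)data_end)
        return VERDICT_DROP;

    /* Only now is it safe to read h_proto. */
    if (eth->h_proto != htons(ETH_P_IP))
        return VERDICT_DROP;

    /* Is there room for a full IPv4 header behind it? */
    ip = (const struct iphdr *)((const char *)data + sizeof(*eth));
    if ((const char *)(ip + 1) > (const char *)data_end)
        return VERDICT_DROP;

    return ip->protocol == IPPROTO_ICMP ? VERDICT_ICMP : VERDICT_PASS;
}
```

Calling classify_packet(buf, buf + len) with a len shorter than either header returns VERDICT_DROP without touching the missing bytes, which is the property the verifier wants to be able to prove.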

Related

Zerocopy TCP message sending from kernel module error EFAULT

I am working on a kernel module which receives data over DMA from an FPGA and stores it in a ring buffer allocated with dma_alloc_attrs(dev, size, &data->dma_addr, GFP_KERNEL, DMA_ATTR_FORCE_CONTIGUOUS). Every time new data is available in the ring buffer, a completion is fired.
In the same kernel module, I am running a TCP server, and during the lifetime of the kernel module only one client (on a different machine) connects to the server (and stays connected). A separate thread in the kernel module sends data received in the ring buffer to the connected client whenever the completion is fired. The idea behind having a TCP server in kernel space is to avoid the context switches between kernel space and user space whenever data should be sent to the client, thus increasing performance. So far everything works, but the performance isn't as expected (on the TCP side).
After looking a bit into how to increase performance, I found the ZEROCOPY option.
I changed the settings of the server socket to set the SO_ZEROCOPY flag: kernel_setsockopt(socket, SOL_SOCKET, SO_ZEROCOPY, (char *)&one, sizeof(one)) and the implementation of the sending to client to:
static DEFINE_MUTEX(tcp_send_mtx);
static int send(struct socket *sock, const char *buf,
const size_t length, unsigned long flags)
{
struct msghdr msg;
struct kvec vec;
int len, written = 0;
int left = length;
if(sock == NULL)
{
printk(KERN_ERR MODULE_NAME ": tcp server send socket is NULL\n");
return -EFAULT;
}
msg.msg_name = 0;
msg.msg_namelen = 0;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_flags = MSG_ZEROCOPY;
repeat_send:
vec.iov_len = left;
vec.iov_base = (char *)buf + written;
len = kernel_sendmsg(sock, &msg, &vec, left, left);
if((len == -ERESTARTSYS) || (!(flags & MSG_DONTWAIT) && (len == -EAGAIN)))
goto repeat_send;
if(len > 0)
{
written += len;
left -= len;
if(left)
goto repeat_send;
}
return written?written:len;
}
Note the msg.msg_flags = MSG_ZEROCOPY; assignment in the send function.
Now when I try to use this, I get an EFAULT (-14) error code from kernel_sendmsg just by adding the MSG_ZEROCOPY flag.
UPDATE:
I understand now that the ZEROCOPY flag is wrongly used in the kernel space since it's designed to remove the additional copy between the user-space and kernel-space.
My initial problem still exists. TCP transfer is still slow and the ring buffer overflows when the DMA transfer speed exceeds 120mb/s. The thread that forwards the messages to the client is not able to send the 8kb messages faster than 120mb/s.
Does anyone know what is wrong here? Maybe the idea is wrong in the first place.

Linux socket hardware timestamping

I'm working on a project researching network synchronisation. Since I want to achieve the best performance, I'm trying to compare software timestamping results with hardware timestamping ones.
I have followed this previously discussed question: Linux kernel UDP reception timestamp, but after several tests I ran into problems when trying to get hardware reception timestamps.
My scenario is composed of 2 devices, a PC and a Gateworks Ventana board. Both devices are supposed to wait for packets broadcast on their network and timestamp their reception times. I have tried this code (some parts have been omitted):
int rc=1;
int flags;
flags = SOF_TIMESTAMPING_RX_HARDWARE
| SOF_TIMESTAMPING_RAW_HARDWARE;
rc = setsockopt(sock, SOL_SOCKET,SO_TIMESTAMPING, &flags, sizeof(flags));
rc = bind(sock, (struct sockaddr *) &serv_addr, sizeof(serv_addr));
struct msghdr msg;
struct iovec iov;
char pktbuf[2048];
char ctrl[CMSG_SPACE(sizeof(struct timespec))];
struct cmsghdr *cmsg = (struct cmsghdr *) &ctrl;
msg.msg_control = (char *) ctrl;
msg.msg_controllen = sizeof(ctrl);
msg.msg_name = &serv_addr;
msg.msg_namelen = sizeof(serv_addr);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
iov.iov_base = pktbuf;
iov.iov_len = sizeof(pktbuf);
//struct timeval time_kernel, time_user;
//int timediff = 0;
FILE *f = fopen("server.csv", "w");
if (f == NULL) {
error("Error opening file!\n");
exit(1);
}
fprintf(f, "Time\n");
struct timespec ts;
int level, type;
int i;
for (i = 0; i < 10; i++) {
rc = recvmsg(sock, &msg, 0);
for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; cmsg = CMSG_NXTHDR(&msg, cmsg))
{
level = cmsg->cmsg_level;
type = cmsg->cmsg_type;
if (SOL_SOCKET == level && SO_TIMESTAMPING == type) {
//ts = (struct timespec *) CMSG_DATA(cmsg);
memcpy(&ts, CMSG_DATA(cmsg), sizeof(ts));
printf("HW TIMESTAMP %ld.%09ld\n", (long)ts.tv_sec, (long)ts.tv_nsec);
}
}
}
printf("COMPLETED\n");
fclose(f);
close(sock);
return 0;
}
In both devices the output I get after receiving a packet:
HW TIMESTAMP 0.000000000
On the other hand if with the same code my flags are:
flags = SOF_TIMESTAMPING_RX_HARDWARE
| SOF_TIMESTAMPING_RX_SOFTWARE
| SOF_TIMESTAMPING_SOFTWARE;
I get proper timestamps:
HW TIMESTAMP 1551721801.970270543
However, they seem to be software-timestamping ones. What would be the correct solution / method to handle hardware timestamping for packets received?
First of all, use ethtool -T "your NIC" to make sure your hardware supports the hardware timestamping feature.
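You need to explicitly tell Linux to enable the hardware timestamping feature of your NIC. To do that, you need an ioctl() call.
Call it with SIOCSHWTSTAMP, the device request code that sets the hardware timestamping configuration of a network interface. (Request codes tell ioctl() what to do with a device; for example, there is a code called CDROMSTOP to stop the CD-ROM drive.)
You also need an ifreq struct to name your NIC. Its ifr_data field must point to a struct hwtstamp_config (from linux/net_tstamp.h) that selects what to timestamp; the rx_filter value has to be one your driver supports, which ethtool -T lists.
You need something like this:
struct ifreq ifconfig;
struct hwtstamp_config hwconfig = { .tx_type = HWTSTAMP_TX_OFF, .rx_filter = HWTSTAMP_FILTER_ALL };
memset(&ifconfig, 0, sizeof(ifconfig));
strncpy(ifconfig.ifr_name, "your NIC name", sizeof(ifconfig.ifr_name) - 1);
ifconfig.ifr_data = (void *)&hwconfig;
ioctl("your file descriptor", SIOCSHWTSTAMP, &ifconfig);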
Here are some pages you can consult:
ioctl manual page,
ifreq manual page,
Read part 3.
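One more thing may be worth checking in the receive loop of the question. The SO_TIMESTAMPING control message carries a struct scm_timestamping, which holds three struct timespec values: ts[0] is the software timestamp and ts[2] is the raw hardware timestamp. Copying only the first timespec, as the question does, therefore always shows the software slot, and a control buffer sized for a single timespec is too small for the full structure. A minimal sketch of reading the hardware slot follows; extract_hw_timestamp is a hypothetical helper name, not part of any API.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <linux/errqueue.h> /* struct scm_timestamping */

/*
 * Hypothetical helper: scans the control messages of a received msghdr.
 * Returns 1 and fills *hw when a non-zero raw hardware timestamp is present,
 * 0 otherwise.
 */
static int extract_hw_timestamp(struct msghdr *msg, struct timespec *hw)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        struct scm_timestamping tss;

        if (cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SO_TIMESTAMPING)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(tss)))
            continue; /* truncated control message */

        memcpy(&tss, CMSG_DATA(cmsg), sizeof(tss));

        /* ts[0]: software, ts[1]: deprecated, ts[2]: raw hardware */
        if (tss.ts[2].tv_sec == 0 && tss.ts[2].tv_nsec == 0)
            return 0; /* the NIC did not stamp this packet */

        *hw = tss.ts[2];
        return 1;
    }
    return 0;
}
```

With this helper, the control buffer passed to recvmsg() would be declared as char ctrl[CMSG_SPACE(sizeof(struct scm_timestamping))], and the loop body reduces to calling extract_hw_timestamp(&msg, &ts) after each recvmsg().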

Get ip address of client from the server

I'm trying to get the ip address of each of my clients that connect to my server. I save this into fields of a struct which I sent to a thread. I'm noticing that sometimes I get the right ip and sometimes the wrong one. My first peer to connect usually has an incorrect ip...
The problem is that inet_ntoa() returns a pointer to static memory that is overwritten each time you call inet_ntoa(). You need to make a copy of the data before calling inet_ntoa() again:
struct peerInfo{
char ip[16];
int socket;
};
while((newsockfd = accept(sockfd,(struct sockaddr *)&clt_addr, &addrlen)) >= 0)
{
struct peerInfo *p = (struct peerInfo *) malloc(sizeof(struct peerInfo));
strncpy(p->ip, inet_ntoa(clt_addr.sin_addr), 16);
p->socket = newsockfd;
printf("A peer connection was accepted from %s:%hu\n", p->ip, ntohs(clt_addr.sin_port));
if (pthread_create(&thread_id , NULL, peer_handler, (void*)p) != 0) /* returns a positive error number on failure */
{
syserr("could not create thread\n");
free(p);
return 1;
}
printf("Thread created for the peer.\n");
pthread_detach(thread_id);
}
if (newsockfd < 0)
{
syserr("Accept failed.\n");
}
From http://linux.die.net/man/3/inet_ntoa:
The inet_ntoa() function converts the Internet host address in, given
in network byte order, to a string in IPv4 dotted-decimal notation.
The string is returned in a statically allocated buffer, which
subsequent calls will overwrite.
Emphasis added.
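An alternative to copying the result of inet_ntoa() is inet_ntop(), which writes into a buffer supplied by the caller and so has no shared static state. This swaps in a different function from the one used above; format_peer_ip is a hypothetical helper name used only for this sketch.

```c
#include <stddef.h>
#include <arpa/inet.h>  /* inet_ntop, INET_ADDRSTRLEN */
#include <netinet/in.h> /* struct sockaddr_in */
#include <sys/socket.h> /* AF_INET, socklen_t */

/*
 * Hypothetical helper: formats the IPv4 address of a peer into out.
 * Returns 0 on success, -1 if the buffer is too small or conversion fails.
 */
static int format_peer_ip(const struct sockaddr_in *addr, char *out, size_t out_len)
{
    if (inet_ntop(AF_INET, &addr->sin_addr, out, (socklen_t)out_len) == NULL)
        return -1;
    return 0;
}
```

In the accept loop this would replace the strncpy() line with format_peer_ip(&clt_addr, p->ip, sizeof(p->ip)); each peer then gets its address written directly into its own struct.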

error getting interface index using SIOCGIFINDEX

Hi, I am trying to do packet injection using raw sockets. I have a problem getting the interface index using the SIOCGIFINDEX command of ioctl. I am using Ubuntu 12.04 as my OS. Please help; the code is:
int BindRawSocketToInterface(char *device, int rawsock, int protocol)
{
struct sockaddr_ll sll;
struct ifreq ifr;
bzero(&sll, sizeof(sll));
bzero(&ifr, sizeof(ifr));
/* First Get the Interface Index */
strncpy ((char*) ifr.ifr_name, device, IFNAMSIZ);
if ((ioctl(rawsock, SIOCGIFINDEX, &ifr))== -1)
{
printf ("Error getting interface index!\n");
exit(-1);
}
/* Bind our rawsocket to this interface */
sll.sll_family = AF_PACKET;
sll.sll_ifindex = ifr.ifr_ifindex;
sll.sll_protocol = htons(protocol);
if ((bind(rawsock, (struct sockaddr*)&sll,sizeof(sll)))== -1)
{
perror("Error binding raw socket to interface \n");
exit(-1);
}
return 1;
}
Here is an example:
http://austinmarton.wordpress.com/2011/09/14/sending-raw-ethernet-packets-from-a-specific-interface-in-c-on-linux/
I hope this helps
As a reminder for anyone searching for such a function: I've seen many variants of this function, and many of them have the following bug, so it's probably a copy-paste bug worth warning about:
strncpy ((char*) ifr.ifr_name, device, IFNAMSIZ);
This line has an OBOE (off-by-one error) and an unnecessary cast to char *.
strncpy (ifr.ifr_name, device, sizeof ifr.ifr_name - 1);
should be used instead.

Sending UDP packet in Linux Kernel

For a project, I'm trying to send UDP packets from Linux kernel-space. I'm currently 'hard-coding' my code into the kernel (which I appreciate isn't the best/neatest way) but I'm trying to get a simple test to work (sending "TEST"). It should be mentioned I'm a newbie to kernel hacking - I'm not that clued up on many principles and techniques!
Every time my code gets run the system hangs and I have to reboot - no mouse/keyboard response and the scroll and caps lock key lights flash together - I'm not sure what this means, but I'm assuming it's a kernel panic?
The repeat_send code is unnecessary for this test code, but once it's working I want to send large messages that may require multiple 'send's. I'm not sure whether that could be a cause of my issues?
N.B. This code is being inserted into neighbour.c of linux-source/net/core/ origin, hence the use of NEIGH_PRINTK1, it's just a macro wrapper round printk.
I'm really banging my head against a brick wall here, I can't spot anything obvious, can anyone point me in the right direction (or spot that blindingly obvious error!)?
Here's what I have so far:
void mymethod()
{
struct socket sock;
struct sockaddr_in addr_in;
int ret_val;
unsigned short port = htons(2048);
unsigned int host = in_aton("192.168.1.254");
unsigned int length = 5;
char *buf = "TEST\0";
struct msghdr msg;
struct iovec iov;
int len = 0, written = 0, left = length;
mm_segment_t oldmm;
NEIGH_PRINTK1("forwarding sk_buff at: %p.\n", skb);
if ((ret_val = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
NEIGH_PRINTK1("Error during creation of socket; terminating. code: %d\n", ret_val);
return;
}
memset(&addr_in, 0, sizeof(struct sockaddr_in));
addr_in.sin_family=AF_INET;
addr_in.sin_port = port;
addr_in.sin_addr.s_addr = host;
if((ret_val = sock.ops->bind(&sock, (struct sockaddr *)&addr_in, sizeof(struct sockaddr_in))) < 0) {
NEIGH_PRINTK1("Error trying to bind socket. code: %d\n", ret_val);
goto close;
}
memset(&msg, 0, sizeof(struct msghdr));
msg.msg_flags = 0;
msg.msg_name = &addr_in;
msg.msg_namelen = sizeof(struct sockaddr_in);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = NULL;
msg.msg_controllen = 0;
repeat_send:
msg.msg_iov->iov_len = left;
msg.msg_iov->iov_base = (char *)buf + written;
oldmm = get_fs();
set_fs(KERNEL_DS);
len = sock_sendmsg(&sock, &msg, left);
set_fs(oldmm);
if (len == -ERESTARTSYS)
goto repeat_send;
if (len > 0) {
written += len;
left -= len;
if (left)
goto repeat_send;
}
close:
sock_release(&sock);
}
Any help would be hugely appreciated, thanks!
You may find it easier to use the netpoll API for UDP. Take a look at netconsole for an example of how it's used. The APIs you're using are more intended for userspace (you should never have to play with segment descriptors to send network data!)
Run your code when you're in a text mode console (i.e. press Ctrl+Alt+F1 to go to the text console). This way a kernel panic will print out the stack trace and any extra information about what went wrong.
If that doesn't help you, update your question with the stack trace.
I'm not much of a Linux Kernel developer, but can you throw some printk's in there and watch dmesg before it goes down? Or have you thought about hooking up with a kernel debugger?
I think you should try to put all variables outside the mymethod() function and make them static. Remember that the kernel stack is small (8 KiB on many configurations), so too many or too large local variables may cause a stack overflow and a system hang.