eBPF tools - skb_network_header crashes in a BPF Kernel trace function - ebpf

I am looking to trace ip_forward_finish. The intent is to trace latency of all TCP connections going through a linux based gateway router. Hence thought of tracing ip_forward_finish kernel function. And capture the time-stamp of SYN, SYN-ACK and ACK messages at the router.
The issue is accessing iphdr inside the trace function makes the verifier complain with the following error:
bpf: Failed to load program: Permission denied
0: (79) r6 = *(u64 *)(r1 +96)
1: (b7) r1 = 0
2: (6b) *(u16 *)(r10 -24) = r1
3: (bf) r3 = r6
4: (07) r3 += 192
5: (bf) r1 = r10
6: (07) r1 += -24
7: (b7) r2 = 2
8: (85) call bpf_probe_read#4
9: (69) r1 = *(u16 *)(r10 -24)
10: (55) if r1 != 0x8 goto pc+7
R0=inv(id=0) R1=inv8 R6=inv(id=0) R10=fp0
11: (69) r1 = *(u16 *)(r6 +196)
R6 invalid mem access 'inv'
HINT: The invalid mem access 'inv' error can happen if you try to dereference
memory without first using bpf_probe_read() to copy it to the BPF stack.
Sometimes the bpf_probe_read is automatic by the bcc rewriter, other times
you'll need to be explicit.
The code fragment I originally had was as below and the crash occurs when an access to ip_Hdr->protocol is made. And I also checked that ip_Hdr is not null.
int trace_forward_finish(struct pt_regs *ctx,struct net *net,
struct sock *sk, struct sk_buff *skb)
{
if (skb->protocol != htons(ETH_P_IP))
return 0;
struct iphdr* ip_Hdr = (struct iphdr *) skb_network_header(skb);
if (ip_Hdr->protocol != IPPROTO_TCP)
return 0;
/// More code
}
Per the HINT in the message, I did try to change to bpf_probe_read but still the same outcome
int trace_forward_finish(struct pt_regs *ctx,struct net *net,
struct sock *sk, struct sk_buff *skb)
{
if (skb->protocol != htons(ETH_P_IP))
return 0;
struct iphdr ip_Hdr;
bpf_probe_read(&ip_Hdr, sizeof(ip_Hdr), (void*)ip_hdr(skb));
if (ip_Hdr.protocol != IPPROTO_TCP)
return 0;
return 0;
}
Any help would be appreciated.

bcc will try to transform your dereferences of kernel pointers into calls to bpf_probe_read. You can see that happening by passing debug=4 to the BPF() call.
In your case, I suspect that you would need to include function skb_network_header in your code so that bcc is able to rewrite it.
If that's not sufficient, then you might need a manual call to bpf_probe_read to retrieve structure struct iphdr from skb_network_header's pointer.

Related

Strange problem with PIO state machine of RapsberryPi-Pico

I try to implement TFT control using PIO. I have written 4 PIO state machines for every sync signal of my TFT (CLOCK, DE, HSYNC, VSYNC). 3 of them work seamless, but if I add the VSYNC module, the whole Pico freezes. It doesn't change pins and doesn't blink with a LED using repeating timer.
Here is how my initialization looks like:
PIO pio = pio0;
uint h_offset = pio_add_program(pio, &hsync_program);
uint v_offset = pio_add_program(pio, &vsync_program);
uint c_offset = pio_add_program(pio, &clock_program);
uint d_offset = pio_add_program(pio, &de_program);
hsync_program_init(pio, 0, h_offset, HSYNC_PIN);
vsync_program_init(pio, 1, v_offset, VSYNC_PIN);
clock_program_init(pio, 2, c_offset, CLOCK_PIN);
de_program_init(pio, 3, d_offset, DE_PIN);
pio_sm_set_enabled(pio, 0, true);
pio_sm_set_enabled(pio, 1, true);
pio_sm_set_enabled(pio, 2, true);
pio_sm_set_enabled(pio, 3, true);
pio_sm_put_blocking(pio, 0, TFT_WIDTH);
pio_sm_put_blocking(pio, 1, TFT_HEIGHT);
Here is the content of the vsync.pio:
.define v_back_porch 12
.define v_front_porch 8
.define v_sync_len 4
.program vsync
pull block
mov y, osr
.wrap_target
set pins, 0
set x, v_sync_len
vactive:
wait 1 irq 2
jmp x-- vactive
; back porch
set pins, 1
set x, (v_back_porch - v_sync_len)
vbporch:
wait 1 irq 2
jmp x-- vbporch
; main cycle
mov x, y
vmain:
wait 1 irq 2
jmp x-- vmain
set x, v_front_porch
vfporch:
wait 1 irq 2
jmp x-- vfporch
; set sync interrupt for RGB
; irq 3
.wrap ; sync forever!
% c-sdk {
// this is a raw helper function for use by the user which sets up the GPIO output, and configures the SM to output on a particular pin
void vsync_program_init(PIO pio, uint sm, uint offset, uint pin) {
pio_gpio_init(pio, pin);
pio_sm_set_consecutive_pindirs(pio, sm, pin, 1, true);
pio_sm_config c = vsync_program_get_default_config(offset);
sm_config_set_set_pins(&c, pin, 1);
pio_sm_init(pio, sm, offset, &c);
}
%}
If I remove every line of code for the vsync state machine, everything works and generates exactly that, what I want.
After adding uint v_offset = pio_add_program(pio, &vsync_program); it doesn't work anymore. There are no errors or comments during the compilation. I have already tried almost everything. It seems, that registers x and y are faulty, but I can't make a counter without using them.
I have pretty the same code in the hsync.pio, but I don't have any problems with it.
Here is a compiled result for the vsync.pio:
static const uint16_t vsync_program_instructions[] = {
0x80a0, // 0: pull block
0xa047, // 1: mov y, osr
// .wrap_target
0xe000, // 2: set pins, 0
0xe024, // 3: set x, 4
0x20c2, // 4: wait 1 irq, 2
0x0044, // 5: jmp x--, 4
0xe001, // 6: set pins, 1
0xe028, // 7: set x, 8
0x20c2, // 8: wait 1 irq, 2
0x0048, // 9: jmp x--, 8
0xa022, // 10: mov x, y
0x20c2, // 11: wait 1 irq, 2
0x004b, // 12: jmp x--, 11
0xe028, // 13: set x, 8
0x20c2, // 14: wait 1 irq, 2
0x004e, // 15: jmp x--, 14
// .wrap
};
and for the hsync.pio to compare:
static const uint16_t hsync_program_instructions[] = {
0x80a0, // 0: pull block
0xa047, // 1: mov y, osr
// .wrap_target
0xe000, // 2: set pins, 0
0xe022, // 3: set x, 2
0x20c0, // 4: wait 1 irq, 0
0x0044, // 5: jmp x--, 4
0xe001, // 6: set pins, 1
0xe022, // 7: set x, 2
0x20c0, // 8: wait 1 irq, 0
0x0048, // 9: jmp x--, 8
0xc001, // 10: irq nowait 1
0xa022, // 11: mov x, y
0x20c0, // 12: wait 1 irq, 0
0x004c, // 13: jmp x--, 12
0xc001, // 14: irq nowait 1
0xe024, // 15: set x, 4
0x20c0, // 16: wait 1 irq, 0
0x0050, // 17: jmp x--, 16
0xc002, // 18: irq nowait 2
// .wrap
};
I don't see any significant differences there.
What could be the reasong for such a behavior?
It's very pitty, but every PIO instance (not every state machine in the instance!) allows 32 instructions only. So I have to use PIO1, but there is another problem - IRQs from one PIO don't interact with another PIO.
The structure of the Pico PIO makes me really sad :(

BPF verifier rejects when try to access __sk_buff member

I'm trying to write a sample eBPF program which can access __sk_buff member and dump it into
/sys/kernel/debug/tracing/trace.
#include <uapi/linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
SEC("dump_skb_member")
int test_prog(struct __sk_buff *skb)
{
char fmt[] = "packet: local %u remote %u\n";
__u32 local_ip4 = bpf_htonl(skb->local_ip4);
__u32 remote_ip4 = bpf_htonl(skb->remote_ip4);
bpf_trace_printk(fmt, sizeof(fmt), local_ip4, remote_ip4);
return BPF_OK;
}
char _license[] SEC("license") = "GPL";
When i compile this code, and load this program
ip route add 192.168.56.104 encap bpf out obj sample.o section dump_skb_member dev enp0s8
An error is thrown.
Prog section 'dump_skb_member' rejected: Permission denied (13)!
- Type: 11
- Instructions: 21 (0 over limit)
- License: GPL
Verifier analysis:
0: (b7) r2 = 685349
1: (63) *(u32 *)(r10 -8) = r2
2: (18) r2 = 0x2065746f6d657220
4: (7b) *(u64 *)(r10 -16) = r2
5: (18) r2 = 0x7525206c61636f6c
7: (7b) *(u64 *)(r10 -24) = r2
8: (18) r2 = 0x203a74656b636170
10: (7b) *(u64 *)(r10 -32) = r2
11: (61) r4 = *(u32 *)(r1 +92)
invalid bpf_context access off=92 size=4
processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
Error fetching program/map!
But if i don't call bpf_trace_printk to dump member, it can be loaded.
My question is why the error is caused by calling bpf_trace_printk?
The error is not caused by bpf_trace_prink(), but by the skb accesses that are present in your bytecode only when you call bpf_trace_printk().
Accessing skb->local_ip4 and skb->remote_ip4 is not allowed for programs of types BPF_PROG_TYPE_LWT_OUT, that you use.
See kernel code: The function that checks for valid access for this type returns false for certain offsets or range in skb:
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
[...]
return false;
This corresponds to the range where local_ip4 and remote_ip4 are defined:
struct __sk_buff {
[...]
/* Accessed by BPF_PROG_TYPE_sk_skb types from here to ... */
__u32 family;
__u32 remote_ip4; /* Stored in network byte order */
__u32 local_ip4; /* Stored in network byte order */
__u32 remote_ip6[4]; /* Stored in network byte order */
__u32 local_ip6[4]; /* Stored in network byte order */
__u32 remote_port; /* Stored in network byte order */
__u32 local_port; /* stored in host byte order */
/* ... here. */
When you remove your call to the bpf_trace_printk() helper, your local variables are no longer needed and clang compiles your code out of the program. The attempt to read at forbidden offsets is no longer part of your bytecode, so the program loads successfully.

How to find the values of the arguments in socket programming in Assembly?

I am trying socket programming for ARM, however I am not able to understand how the values for the arguments are decided.
For example
this is the link for Azeria Labs
I understand that sys call for ARM register R7 gets it hence its 281 in this case and arguments are passed using R0, R1, R2, R3. But here how do you decide the values for R0(AF_INET) as 2 and R1(SOCK_STREAM) as 1 while creating socket(AF_INET, SOCK_STREAM, 0)
Finding system call was easy
$ grep socket /usr/include/asm/unistd-common.h
#define __NR_socket (__NR_SYSCALL_BASE+281)
#define __NR_socketpair (__NR_SYSCALL_BASE+288)
Similarly is there a way to find the values for the arguments?
I found an another resource which was for X86 Assembly which also has similar approach.
%assign SOCK_STREAM 1
%assign AF_INET 2
%assign SYS_socketcall 102
%assign SYS_SOCKET 1
%assign SYS_CONNECT 3
%assign SYS_SEND 9
%assign SYS_RECV 10
section .text
global _start
;--------------------------------------------------
;Functions to make things easier. :]
;--------------------------------------------------
_socket:
mov [cArray+0], dword AF_INET
mov [cArray+4], dword SOCK_STREAM
mov [cArray+8], dword 0
mov eax, SYS_socketcall
mov ebx, SYS_SOCKET
mov ecx, cArray
int 0x80
ret
Kindly let me know.
Thank you.
Linux alarmpi 4.4.34+ #3 Thu Dec 1 14:44:23 IST 2016 armv6l GNU/Linux

Analog read from all 7 inputs using PRU and a host C program for beaglebone black

I'm kinda new to the beaglebone black world running on a AM335X Cortex A8 processor and I would like to use the PRU for fast analog read with the maximum sampling rate possible.
I would like to read all 7 inputs in a loop form like:
while( n*7 < sampling_rate){ //initial value for n = 0
read(AIN0); //and store it in shared memory(7*n + 0)
read(AIN1); //and store it in shared memory(7*n + 1)
read(AIN2); //and store it in shared memory(7*n + 2)
read(AIN3); //and store it in shared memory(7*n + 3)
read(AIN4); //and store it in shared memory(7*n + 4)
read(AIN5); //and store it in shared memory(7*n + 5)
read(AIN6); //and store it in shared memory(7*n + 6)
n++;
}
so that I can read them from a host program running on the main processor. Any idea how to do so? I tried using a ready code called ADCCollector.c from a package named AM335x_pru_package but I can't figure out how to get all the addresses and values of the registers used.
This is the code I was trying to modify (ADCCollector.p):
.origin 0 // offset of the start of the code in PRU memory
.entrypoint START // program entry point, used by debugger only
#include "ADCCollector.hp"
#define BUFF_SIZE 0x00000fa0 //Total buff size: 4kbyte(Each buffer has 2kbyte: 500 piece of data
#define HALF_SIZE BUFF_SIZE / 2
#define SAMPLING_RATE 1 //Sampling rate(16khz) //***//16000
#define DELAY_MICRO_SECONDS (1000000 / SAMPLING_RATE) //Delay by sampling rate
#define CLOCK 200000000 // PRU is always clocked at 200MHz
#define CLOCKS_PER_LOOP 2 // loop contains two instructions, one clock each
#define DELAYCOUNT DELAY_MICRO_SECONDS * CLOCK / CLOCKS_PER_LOOP / 1000 / 1000 * 3 //if sampling rate = 98000 --> = 3061.224
.macro DELAY
MOV r10, DELAYCOUNT
DELAY:
SUB r10, r10, 1
QBNE DELAY, r10, 0
.endm
.macro READADC
//Initialize buffer status (0: empty, 1: first buffer is ready, 2: second buffer is ready)
MOV r2, 0
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
INITV:
MOV r5, 0 //Shared RAM address of ADC Saving position
MOV r6, BUFF_SIZE //Counting variable
READ:
//Read ADC from FIFO0DATA
MOV r2, 0x44E0D100
LBBO r3, r2, 0, 4
//Add address counting
ADD r5, r5, 4
//Write ADC to PRU Shared RAM
SBCO r3, CONST_PRUSHAREDRAM, r5, 4
DELAY
SUB r6, r6, 4
MOV r2, HALF_SIZE
QBEQ CHBUFFSTATUS1, r6, r2 //If first buffer is ready
QBEQ CHBUFFSTATUS2, r6, 0 //If second buffer is ready
QBA READ
//Change buffer status to 1
CHBUFFSTATUS1:
MOV r2, 1
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
QBA READ
//Change buffer status to 2
CHBUFFSTATUS2:
MOV r2, 2
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
QBA INITV
//Send event to host program
MOV r31.b0, PRU0_ARM_INTERRUPT+16
HALT
.endm
// Starting point
START:
// Enable OCP master port
LBCO r0, CONST_PRUCFG, 4, 4 //#define CONST_PRUCFG C4 taken from ADCCollector.hp
CLR r0, r0, 4
SBCO r0, CONST_PRUCFG, 4, 4
//C28 will point to 0x00012000 (PRU shared RAM)
MOV r0, 0x00000120
MOV r1, CTPPR_0
ST32 r0, r1
//Init ADC CTRL register
MOV r2, 0x44E0D040
MOV r3, 0x00000005
SBBO r3, r2, 0, 4
//Enable ADC STEPCONFIG 1
MOV r2, 0x44E0D054
MOV r3, 0x00000002
SBBO r3, r2, 0, 4
//Init ADC STEPCONFIG 1
MOV r2, 0x44E0D064
MOV r3, 0x00000001 //continuous mode
SBBO r3, r2, 0, 4
//Read ADC and FIFOCOUNT
READADC
Another question is: if I simply changed the #define Sampling_rate from 16000 to any other number below or equal to 200000 in the (.p) file, I will get that sampling rate? or should I change other things?
Thanks in advance.
I used the c wrappers from libpruio: http://www.freebasic.net/forum/viewtopic.php?f=14&t=22501
and then use this code to get all my ADC values:
#include "stdio.h"
#include "c_wrapper/pruio.h" // include header
#include "sys/time.h"
//! The main function.
int main(int argc, char **argv) {
struct timeval start, now;
long mtime, seconds, useconds;
gettimeofday(&start, NULL);
int i,x;
pruIo *io = pruio_new(PRUIO_DEF_ACTIVE, 0x98, 0, 1); //! create new driver structure
if (pruio_config(io, 1, 0x1FE, 0, 4)){ // upload (default) settings, start IO mode
printf("config failed (%s)\n", io->Errr);}
else {
do {
gettimeofday(&now, NULL);
seconds = now.tv_sec - start.tv_sec;
useconds = now.tv_usec - start.tv_usec;
mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
printf("%lu",mtime);
for(i = 1; i < 9; i++) {
printf(",%d", io->Adc->Value[i]); //0-66504 for 0-1.8v
}
printf("\n");
x++;
}while (mtime < 100);
printf("count: %d \n", x);
pruio_destroy(io); /* destroy driver structure */
}
return 0;
}
In your example you use libpruio in IO mode (synchronous) and therefore you have no control over the sampling rate, since the host CPU doesn't work in real-time.
To get the maximum sampling rate (as mentioned in the OP) you have to use either RB or MM mode. In those modes libpruio buffers the samples in memory and the host can access them asynchronously. See example rb_file.c (or triggers.bas) in the libpruio package.

linker script load vs. virtual address

I've got the following linker script that is supposed to link code to run on a flash based micrcontroller. The uC has flash at address 0x0, and RAM at 0x40000000. I want to put the data section into flash, but link the program so that access to the data section is done in RAM. The point being, I'll manually copy it out of flash into the proper RAM location when the controller starts.
MEMORY
{
flash : ORIGIN = 0x00000000, LENGTH = 512K
ram : ORIGIN = 0x40000000, LENGTH = 32K
usbram : ORIGIN = 0x7FD00000, LENGTH = 8K
ethram : ORIGIN = 0x7FE00000, LENGTH = 16K
}
SECTIONS
{
.text : { *(.text) } >flash
__end_of_text__ = .;
.data :
{
__data_beg__ = .;
__data_beg_src__ = __end_of_text__;
*(.data)
__data_end__ = .;
} >ram AT>flash
.bss :
{
__bss_beg__ = .;
*(.bss)
} >ram
}
The code as shown above generates the following output:
40000000 <__data_beg__>:
40000000: 00000001 andeq r0, r0, r1
40000004: 00000002 andeq r0, r0, r2
40000008: 00000003 andeq r0, r0, r3
4000000c: 00000004 andeq r0, r0, r4
40000010: 00000005 andeq r0, r0, r5
40000014: 00000006 andeq r0, r0, r6
which represents an array of the form
int foo[] = {1,2,3,4,5,6};
Problem is that it's linked to 0x40000000, and not the flash region as I wanted. I expected the AT>flash part of the linker script to specify linking into flash, as explained in the LD manual.
http://sourceware.org/binutils/docs/ld/Output-Section-Attributes.html#Output-Section-Attributes
and here is my ld invocation:
arm-elf-ld -T ./lpc2368.ld entry.o main.o -o binary.elf
Thanks.
Your linker script LINKS .data to RAM and LOADS .data into FLASH. This is due to AT command.
Why do you want to link volatile data to flash? Data and Bss must ALWAYS be linked to RAM. Your linker script is quite correct. You would only link text and constand data into flash.
Please look at your map file. It would necessarily have assigned a RAM address to data variables.
The program loader code then copies copies (data_end - data_beg) bytes from data_beg_src to data_beg.
So the first data which is the array is copied into the begining of SRAM.
If you need to link data to flash:
Data :
{
*(.data);
} > flash
Linker will now LINK and LOAD .data into flash. But if I were you, I wouldnt do that.
Your .data virtual address = 0x40000000
Your .data logical address = 0x00000000
This can be seen with command
objdump -h file.elf
Sections:
Idx Name Size VMA LMA File off Algn
8 .data 00000014 40000000 00000000 00001000 2**2
CONTENTS, ALLOC, LOAD, DATA