eBPF / XDP map not getting created

I have an implementation in BPF for XDP, wherein I specify five maps to be created as follows:
struct bpf_map_def SEC("maps") servers = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct ip_key),
.value_size = sizeof(struct dest_info),
.max_entries = MAX_SERVERS,
};
struct bpf_map_def SEC("maps") server_ips = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct ip_key),
.value_size = sizeof(struct server_ip_key),
.max_entries = MAX_SERVERS,
};
struct bpf_map_def SEC("maps") client_addrs = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct port_key),
.value_size = sizeof(struct client_port_addr),
.max_entries = MAX_CLIENTS,
};
struct bpf_map_def SEC("maps") stoc_port_maps = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct port_key),
.value_size = sizeof(struct port_map),
.max_entries = MAX_FLOWS,
};
struct bpf_map_def SEC("maps") ctos_port_maps = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct port_key),
.value_size = sizeof(struct port_map),
.max_entries = MAX_FLOWS,
};
However, no matter what I do, the servers map is not getting created. When I run bpftool map show, I only get output like the following:
root@balancer:/xdp# bpftool map list
68: hash name client_addrs flags 0x0
key 8B value 16B max_entries 4096 memlock 98304B
69: hash name ctos_port_maps flags 0x0
key 8B value 20B max_entries 4096 memlock 131072B
70: hash name server_ips flags 0x0
key 8B value 8B max_entries 512 memlock 8192B
73: hash name stoc_port_maps flags 0x0
key 8B value 20B max_entries 4096 memlock 131072B
74: array name xdp_lb_k.rodata flags 0x480
key 4B value 50B max_entries 1 memlock 4096B
frozen
root@balancer:/xdp#
It is notable that each of the key and value structs has been padded to the nearest multiple of eight bytes, and there are no compile or verifier errors. I am also running the program in Docker containers. So far, I have tried moving the servers map definition around in my code, commenting out the other map definitions so that only the servers definition remains active, changing the name to other combinations, and a few other minor changes, but nothing has worked so far.
Please let me know if you would need any other portion of my code or information for a better analysis of the situation.
Appendix 1:
I am compiling the object file using this Makefile rule:
xdp_lb_kern.o: xdp_lb_kern.c
clang -S \
-target bpf \
-D __BPF_TRACING__ \
-I../../libbpf/src \
-I../../custom-headers \
-Wall \
-Wno-unused-value \
-Wno-pointer-sign \
-Wno-compare-distinct-pointer-types \
-O2 -emit-llvm -c -o ${@:.o=.ll} $<
llc -march=bpf -filetype=obj -o $@ ${@:.o=.ll}
Then, in the container's environment, I load the program using this rule:
load_balancer:
bpftool net detach xdpgeneric dev eth0
rm -f /sys/fs/bpf/xdp_lb
bpftool prog load xdp_lb_kern.o /sys/fs/bpf/xdp_lb
bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_lb dev eth0
The compilation process generates a .o and a .ll output file. The beginning lines of the .ll output file, where the map definitions are visible, are shown below:
; ModuleID = 'xdp_lb_kern.c'
source_filename = "xdp_lb_kern.c"
target datalayout = "e-m:e-p:64:64-i64:64-n32:64-S128"
target triple = "bpf"
%struct.bpf_map_def = type { i32, i32, i32, i32, i32 }
%struct.xdp_md = type { i32, i32, i32, i32, i32 }
%struct.ip_key = type { i32, i32 }
%struct.port_key = type { i16, [3 x i16] }
%struct.ethhdr = type { [6 x i8], [6 x i8], i16 }
%struct.iphdr = type { i8, i8, i16, i16, i16, i8, i8, i16, i32, i32 }
@servers = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 32, i32 512, i32 0 }, section "maps", align 4
@server_ips = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 8, i32 512, i32 0 }, section "maps", align 4
@client_addrs = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 16, i32 4096, i32 0 }, section "maps", align 4
@stoc_port_maps = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 20, i32 4096, i32 0 }, section "maps", align 4
@ctos_port_maps = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 20, i32 4096, i32 0 }, section "maps", align 4
@loadbal.____fmt = internal constant [24 x i8] c"balancer got something!\00", align 1
@_license = dso_local global [4 x i8] c"GPL\00", section "license", align 1
@process_packet.____fmt = internal constant [26 x i8] c"it's an ip packet from %x\00", align 1
@llvm.used = appending global [7 x i8*] [i8* getelementptr inbounds ([4 x i8], [4 x i8]* @_license, i32 0, i32 0), i8* bitcast (%struct.bpf_map_def* @client_addrs to i8*), i8* bitcast (%struct.bpf_map_def* @ctos_port_maps to i8*), i8* bitcast (i32 (%struct.xdp_md*)* @loadbal to i8*), i8* bitcast (%struct.bpf_map_def* @server_ips to i8*), i8* bitcast (%struct.bpf_map_def* @servers to i8*), i8* bitcast (%struct.bpf_map_def* @stoc_port_maps to i8*)], section "llvm.metadata"
; Function Attrs: nounwind
define dso_local i32 @loadbal(%struct.xdp_md* nocapture readonly %0) #0 section "xdp" {
%2 = alloca %struct.ip_key, align 4
%3 = alloca %struct.port_key, align 2
%4 = alloca %struct.port_key, align 2
%5 = getelementptr inbounds %struct.xdp_md, %struct.xdp_md* %0, i64 0, i32 1
%6 = load i32, i32* %5, align 4, !tbaa !2
%7 = zext i32 %6 to i64
%8 = inttoptr i64 %7 to i8*
%9 = getelementptr inbounds %struct.xdp_md, %struct.xdp_md* %0, i64 0, i32 0
%10 = load i32, i32* %9, align 4, !tbaa !7

As per the discussion in the comments, the map is not created because it is not actually used in your eBPF code (not provided in the question).
As you realised yourself, the branch in your code that was calling the map was in fact unreachable. Based on that, it's likely that clang compiled out this portion of code, and that the map is not used in the resulting eBPF bytecode. When preparing to load your program, bpftool (libbpf) looks at what maps are necessary, and only creates the ones that are needed for your program. It may skip maps that are defined in the ELF file if no program uses them.
One hint here is that, if the program effectively used the map, it could not load successfully with the map missing: given that your program does load, any map it needs would necessarily be present. Note that bpftool prog show will show you the ids of the maps used by a program.
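As an illustration, here is a minimal sketch of a reachable map reference that would keep the servers map in the loaded object. The struct layouts are guessed (the program source is not in the question), and it assumes libbpf's bpf_helpers.h; the point is only that the lookup sits on a path the compiler cannot prove dead:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical key/value layouts; the real ones are not in the question. */
struct ip_key    { __u32 saddr; __u32 pad; };
struct dest_info { __u32 daddr; __u16 dport; __u16 pad; };

struct bpf_map_def SEC("maps") servers = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(struct ip_key),
    .value_size  = sizeof(struct dest_info),
    .max_entries = 512,
};

SEC("xdp")
int loadbal(struct xdp_md *ctx)
{
    struct ip_key key = {};

    /* This lookup must be reachable: if clang can prove the branch
       leading here is dead, the reference is compiled out and
       libbpf/bpftool will not create the map at load time. */
    struct dest_info *dst = bpf_map_lookup_elem(&servers, &key);
    if (!dst)
        return XDP_PASS;

    return XDP_TX;
}

char _license[] SEC("license") = "GPL";
```

This is a kernel-side program fragment, so it only compiles with the BPF toolchain (clang -target bpf) and cannot be run standalone.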

Related

Swift compiler LLVM IR optimization

For a project I am currently using the swift compiler front end to generate LLVM IR. I need to analyze the IR to find run time dependencies between variable read/writes with the end goal of finding parallelism.
For this it is important that I am able to generate LLVM IR that is unoptimized and all the load and store instructions are present. It works well with clang with the -O0 flag which prevents optimization.
However with swiftc I have the following issue on this simple example:
func returnInputSum(test: Int) -> Int {
var a = 5
var b = 10
var c = a + b
return c
}
with swiftc -Onone this becomes (debug info etc. removed):
entry:
%a = alloca i64, align 8
%b = alloca i64, align 8
%c = alloca i64, align 8
store i64 5, i64* %a, align 8
store i64 10, i64* %b, align 8
store i64 15, i64* %c, align 8
ret i64 15
So instead of loading a and b, inserting an "add" instruction, and storing the result in c, the compiler seems to evaluate 5 + 10 directly and store the 15 in c. The return does not even return c but an immediate value. This is a problem for me because I want to see all load/store instructions.
when I use clang with c/c++ code it works just fine and I get what I expect there:
%a = alloca i32, align 4
%b = alloca i32, align 4
%c = alloca i32, align 4
store i32 5, i32* %a, align 4
store i32 10, i32* %b, align 4
%0 = load i32, i32* %b, align 4
%1 = load i32, i32* %a, align 4
%add = add nsw i32 %0, %1
store i32 %add, i32* %c, align 4
%2 = load i32, i32* %c, align 4
ret i32 %2
Is there any way to make the compiler perform no optimization at all on the LLVM IR? Or am I misunderstanding something important here? I know that before LLVM IR is generated there is a SIL pass with some optimization, but I am under the assumption that the -Onone flag should turn that off as well.
thank you!

How do I write MIPS code for vector multiplication

Define vector mul(vector v, float t). It returns a vector by multiplying it by t.
If a=4i+3j+12k then mul(a,0.5) will return 2i+1.5j+6k.
Here's the code I've written:
.globl main
.text
main:
la $s0,t #loading t into s1
lw $s1,0($s0)
ori $s2,$zero,0
la $s3,v
#la $s0,v
#lw $s3,0($s0)
la $s0,s
lw $s4,0($s0)
jal f
f:
#if <cond>
bge $s2,$s4,DONE
#<for body>
lw $s5, 0($s3)
mul $s3,$s3,$s1
li $v0,10
syscall
j UPDATE
UPDATE:
addi $s2,$s2,1 #i=i+1
addi $s3,$s3,4 #moving address 4 bytes since int
j f
DONE:
li $v0,10
syscall
.data
s: .word 3
v: .word 4 3 12 #hard coding vector coefficients
t: .word 2 #value to be multiplied by
When I run this on SPIM simulator, the registers don't produce any value. Is my code wrong or do I need to add something?
mul $s3,$s3,$s1: this instruction is wrong because the $s3 register contains the address of the vector element, not its value.
Also remove the li $v0,10 ; syscall lines just before the jump to UPDATE; otherwise the program will multiply only once before exiting.
.data
s: .word 3
v: .word 4 3 12 #hard coding vector coefficients
t: .word 2 #value to be multiplied by
.globl main
.text
main:
la $s0,t #loading t into $s0
lw $s1,0($s0) # $s1=2
ori $s2,$zero,0 # $s2=0
la $s3,v # loading v into $s3
li $s7,0
la $s0,s # loading s into $s0
lw $s4,0($s0) # $s4 = 3
j f
f:
#if <cond>
bge $s2,$s4,DONE
#<for body>
lw $s5, ($s3) # $s5= 4
mulu $s5,$s5,$s1
addu $s7,$s7,$s5 # result stored into $s7
j UPDATE
UPDATE:
addiu $s2,$s2,1 #i=i+1
addiu $s3,$s3,4 #moving address 4 bytes since int
j f
DONE:
li $v0,10
syscall

Why are mmap syscall flags not set

I am trying to call mmap with a direct syscall.
#include <sys/mman.h>
int main() {
__asm__("mov $0x0, %r9;"
"mov $0xffffffffffffffff, %r8;"
"mov $0x32, %rcx;"
"mov $0x7, %rdx;"
"mov $0x1000, %rsi;"
"mov $0x303000, %rdi;"
"mov $0x9, %rax;"
"syscall;");
return 0;
}
I compiled the program statically:
$ gcc -static -o foo foo.c
But the syscall fails, as shown by strace:
$ strace ./foo
mmap(0x303000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, -1, 0) = -1 EBADF (Bad file descriptor)
We can see that mmap flags are wrongly set. 0x32 should be MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS.
The thing is, if I put another mmap call using mmap from libc first:
int main() {
mmap(0x202000, 4096, 0x7, 0x32, -1, 0);
__asm__("mov $0x0, %r9;"
"mov $0xffffffffffffffff, %r8;"
"mov $0x32, %rcx;"
"mov $0x7, %rdx;"
"mov $0x1000, %rsi;"
"mov $0x303000, %rdi;"
"mov $0x9, %rax;"
"syscall;");
return 0;
}
Then both mmap work:
$ strace ./foo
mmap(0x202000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x202000
mmap(0x303000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x303000
So it seems that using libc, mmap flags are "resolved" or something. But I can't really understand what is happening.
Why does the mmap syscall example only work if I put a libc mmap call before it?
The kernel syscall interface on AMD64 uses the r10 register for the fourth argument, not rcx (the syscall instruction itself clobbers rcx with the return address):
mov $0x32, %r10
See linked question for more details.

Which is more efficient: Creating a "var" and re-using it, or creating several "let"s?

Just curious which is more efficient/better in swift:
Creating three temporary constants (using let) and using those constants to define other variables
Creating one temporary variable (using var) and using that variable to hold three different values which will then be used to define other variables
This is perhaps better explained through an example:
var one = Object()
var two = Object()
var three = Object()
func firstFunction() {
let tempVar1 = //calculation1
one = tempVar1
let tempVar2 = //calculation2
two = tempVar2
let tempVar3 = //calculation3
three = tempVar3
}
func seconFunction() {
var tempVar = //calculation1
one = tempVar
tempVar = //calculation2
two = tempVar
tempVar = //calculation3
three = tempVar
}
Which of the two functions is more efficient? Thank you for your time!
Not to be too cute about it, but the most efficient version of your code above is:
var one = Object()
var two = Object()
var three = Object()
That is logically equivalent to all the code you've written since you never use the results of the computations (assuming the computations have no side-effects). It is the job of the optimizer to get down to this simplest form. Technically the simplest form is:
func main() {}
The optimizer isn't quite that smart, but it really is smart enough to get to my first example. Consider this program:
var one = 1
var two = 2
var three = 3
func calculation1() -> Int { return 1 }
func calculation2() -> Int { return 2 }
func calculation3() -> Int { return 3 }
func firstFunction() {
let tempVar1 = calculation1()
one = tempVar1
let tempVar2 = calculation2()
two = tempVar2
let tempVar3 = calculation3()
three = tempVar3
}
func secondFunction() {
var tempVar = calculation1()
one = tempVar
tempVar = calculation2()
two = tempVar
tempVar = calculation3()
three = tempVar
}
func main() {
firstFunction()
secondFunction()
}
Run it through the compiler with optimizations:
$ swiftc -O -wmo -emit-assembly x.swift
Here's the whole output:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 9
.globl _main
.p2align 4, 0x90
_main:
pushq %rbp
movq %rsp, %rbp
movq $1, __Tv1x3oneSi(%rip)
movq $2, __Tv1x3twoSi(%rip)
movq $3, __Tv1x5threeSi(%rip)
xorl %eax, %eax
popq %rbp
retq
.private_extern __Tv1x3oneSi
.globl __Tv1x3oneSi
.zerofill __DATA,__common,__Tv1x3oneSi,8,3
.private_extern __Tv1x3twoSi
.globl __Tv1x3twoSi
.zerofill __DATA,__common,__Tv1x3twoSi,8,3
.private_extern __Tv1x5threeSi
.globl __Tv1x5threeSi
.zerofill __DATA,__common,__Tv1x5threeSi,8,3
.private_extern ___swift_reflection_version
.section __TEXT,__const
.globl ___swift_reflection_version
.weak_definition ___swift_reflection_version
.p2align 1
___swift_reflection_version:
.short 1
.no_dead_strip ___swift_reflection_version
.linker_option "-lswiftCore"
.linker_option "-lobjc"
.section __DATA,__objc_imageinfo,regular,no_dead_strip
L_OBJC_IMAGE_INFO:
.long 0
.long 1088
Your functions aren't even in the output because they don't do anything. main is simplified to:
_main:
pushq %rbp
movq %rsp, %rbp
movq $1, __Tv1x3oneSi(%rip)
movq $2, __Tv1x3twoSi(%rip)
movq $3, __Tv1x5threeSi(%rip)
xorl %eax, %eax
popq %rbp
retq
This sticks the values 1, 2, and 3 into the globals, and then exits.
My point here is that if it's smart enough to do that, don't try to second-guess it with temporary variables. Its job is to figure that out. In fact, let's see how smart it is. We'll turn off Whole Module Optimization (-wmo). Without that, it won't strip the functions, because it doesn't know whether something else will call them. And then we can see how it writes these functions.
Here's firstFunction():
__TF1x13firstFunctionFT_T_:
pushq %rbp
movq %rsp, %rbp
movq $1, __Tv1x3oneSi(%rip)
movq $2, __Tv1x3twoSi(%rip)
movq $3, __Tv1x5threeSi(%rip)
popq %rbp
retq
Since it can see that the calculation methods just return constants, it inlines those results and writes them to the globals.
Now how about secondFunction():
__TF1x14secondFunctionFT_T_:
pushq %rbp
movq %rsp, %rbp
popq %rbp
jmp __TF1x13firstFunctionFT_T_
Yes. It's that smart. It realized that secondFunction() is identical to firstFunction() and it just jumps to it. Your functions literally could not be more identical and the optimizer knows that.
So what's the most efficient? The one that is simplest to reason about. The one with the fewest side-effects. The one that is easiest to read and debug. That's the efficiency you should be focused on. Let the optimizer do its job. It's really quite smart. And the more you write in nice, clear, obvious Swift, the easier it is for the optimizer to do its job. Every time you do something clever "for performance," you're just making the optimizer work harder to figure out what you've done (and probably undo it).
Just to finish the thought: the local variables you create are barely hints to the compiler. The compiler generates its own local variables when it converts your code to its internal representation (IR). IR is in static single assignment form (SSA), in which every variable can only be assigned one time. Because of this, your second function actually creates more local variables than your first function. Here's function one (created using swiftc -emit-ir x.swift):
define hidden void @_TF1x13firstFunctionFT_T_() #0 {
entry:
%0 = call i64 @_TF1x12calculation1FT_Si()
store i64 %0, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3oneSi, i32 0, i32 0), align 8
%1 = call i64 @_TF1x12calculation2FT_Si()
store i64 %1, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3twoSi, i32 0, i32 0), align 8
%2 = call i64 @_TF1x12calculation3FT_Si()
store i64 %2, i64* getelementptr inbounds (%Si, %Si* @_Tv1x5threeSi, i32 0, i32 0), align 8
ret void
}
In this form, variables have a % prefix. As you can see, there are 3.
Here's your second function:
define hidden void @_TF1x14secondFunctionFT_T_() #0 {
entry:
%0 = alloca %Si, align 8
%1 = bitcast %Si* %0 to i8*
call void @llvm.lifetime.start(i64 8, i8* %1)
%2 = call i64 @_TF1x12calculation1FT_Si()
%._value = getelementptr inbounds %Si, %Si* %0, i32 0, i32 0
store i64 %2, i64* %._value, align 8
store i64 %2, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3oneSi, i32 0, i32 0), align 8
%3 = call i64 @_TF1x12calculation2FT_Si()
%._value1 = getelementptr inbounds %Si, %Si* %0, i32 0, i32 0
store i64 %3, i64* %._value1, align 8
store i64 %3, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3twoSi, i32 0, i32 0), align 8
%4 = call i64 @_TF1x12calculation3FT_Si()
%._value2 = getelementptr inbounds %Si, %Si* %0, i32 0, i32 0
store i64 %4, i64* %._value2, align 8
store i64 %4, i64* getelementptr inbounds (%Si, %Si* @_Tv1x5threeSi, i32 0, i32 0), align 8
%5 = bitcast %Si* %0 to i8*
call void @llvm.lifetime.end(i64 8, i8* %5)
ret void
}
This one has 6 local variables! But, just like the local variables in the original source code, this tells us nothing about final performance. The compiler just creates this version because it's easier to reason about (and therefore optimize) than a version where variables can change their values.
(Even more dramatic is this code in SIL (-emit-sil), which creates 16 local variables for function 1 and 17 for function 2! If the compiler is happy to invent 16 local variables just to make it easier for it to reason about 6 lines of code, you certainly shouldn't be worried about the local variables you create. They're not just a minor concern; they're completely free.)
Unless you're dealing with a VERY specialized use case, this should never make a meaningful performance difference.
It's likely the compiler can easily simplify things to direct assignments in firstFunction; I'm unsure whether secondFunction lends itself as easily to the same optimization. You would either have to be an expert on the compiler or do some performance tests to find any differences.
Regardless, unless you're doing this at a scale of hundreds of thousands or millions it's not something to worry about.
I personally think re-using variables in that way of the secondFunction is unnecessarily confusing, but to each their own.
Note: it looks like you're dealing with classes, but be aware that struct copy semantics mean re-using variables is useless anyway.
You should really just inline the local variables:
var one: Object
var two: Object
var three: Object
func firstFunction() {
one = //calculation1
two = //calculation2
three = //calculation3
}
One exception to this is if you end up writing something like this:
var someOptional: Foo?
init() {
self.someOptional = Foo()
self.someOptional?.a = a
self.someOptional?.b = b
self.someOptional?.c = c
}
In which case it would be better to do:
init() {
let foo = Foo()
foo.a = a
foo.b = b
foo.c = c
self.someOptional = foo
}
or perhaps:
init() {
self.someOptional = {
let foo = Foo()
foo.a = a
foo.b = b
foo.c = c
return foo
}()
}

How to extract information from the llvm metadata

I have the following piece of code:
int main(int argc, char *argv[])
{
int a = 2;
int b = 5;
int soma = a + b;
//...}
The resulting llvm bitcode is:
define i32 @main(i32 %argc, i8** %argv) #0 {
entry:
...
%a = alloca i32, align 4
%b = alloca i32, align 4
%soma = alloca i32, align 4
...
call void @llvm.dbg.declare(metadata !{i32* %a}, metadata !15), !dbg !16
store i32 2, i32* %a, align 4, !dbg !16
call void @llvm.dbg.declare(metadata !{i32* %b}, metadata !17), !dbg !18
store i32 5, i32* %b, align 4, !dbg !18
call void @llvm.dbg.declare(metadata !{i32* %soma}, metadata !19), !dbg !20
%0 = load i32* %a, align 4, !dbg !20
%1 = load i32* %b, align 4, !dbg !20
%add = add nsw i32 %0, %1, !dbg !20
store i32 %add, i32* %soma, align 4, !dbg !20
...
!1 = metadata !{i32 0}
!2 = metadata !{metadata !3}
...
!15 = metadata !{i32 786688, metadata !3, metadata !"a", metadata !4, i32 6, metadata !7, i32 0, i32 0} ; [ DW_TAG_auto_variable ] [a] [line 6]
!16 = metadata !{i32 6, i32 0, metadata !3, null}
!17 = metadata !{i32 786688, metadata !3, metadata !"b", metadata !4, i32 7, metadata !7, i32 0, i32 0} ; [ DW_TAG_auto_variable ] [b] [line 7]
!18 = metadata !{i32 7, i32 0, metadata !3, null}
!19 = metadata !{i32 786688, metadata !3, metadata !"soma", metadata !4, i32 8, metadata !7, i32 0, i32 0} ; [ DW_TAG_auto_variable ] [soma] [line 8]
!20 = metadata !{i32 8, i32 0, metadata !3, null}
From the bitcode I need to get the following text:
a = 2
b = 5
soma = a + b
My question is how to extract the information I need from the metadata (dbg).
Right now I only have the name of the instructions, via I->getName(), and the name of the operands, via Value *valueOp = I->getOperand(i); valueOp->getName().str();
The metadata is very extensive. How do I get this information from the metadata?
Relying on I->getName() for finding the variable name is not a good idea; you have the debug info for that. The proper approach for finding out the names of all the C/C++ local variables is to traverse the IR and look for all calls to @llvm.dbg.declare, then go to their 2nd operand (the debug metadata), and retrieve the variable name from there.
Use the source-level debugging guide to find out how the debug metadata is laid out. In particular, for local variables, the 3rd argument will be a metadata string with the variable name in the C/C++ source.
So the remaining thing is to find out what the variables are initialized to. For that, follow the 1st argument of @llvm.dbg.declare to get the actual LLVM value used, then locate the 1st store instruction into it, and check what data is stored there.
If it's a constant, you now have everything you need to output a = 5-style information. If it's another instruction, you have to follow it yourself and "decode" it - e.g., if it's an "add", then you need to print its two operands with a "+" in-between, etc. And of course the prints have to be recursive... not simple. But it will provide you with the accurate initialization value.
If you're looking for something a lot more coarse and you have access to the original source, you can just get the line number in which the variable is declared (the 5th operand in the debug metadata, assuming the tag (1st operand) is indeed DW_TAG_auto_variable and not DW_TAG_arg_variable, which indicates a parameter). Then print out that line from the original source. But this will not print all the relevant information (if the initialization value is constructed from multiple lines) and can print irrelevant information (if there are multiple statements in that line, for example).
Finally, remember optimization can seriously screw with debug information. If getting those print-outs is important, be careful with your -O option, maybe stick to -O0.
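For concreteness, the traversal described above can be sketched against the modern C++ API. Names like DbgDeclareInst and DILocalVariable are from current LLVM and postdate the 3.x-era textual metadata shown in the question, so treat this as an outline of the approach rather than drop-in code:

```cpp
#include "llvm/IR/Constants.h"       // ConstantInt
#include "llvm/IR/Instructions.h"    // AllocaInst, StoreInst
#include "llvm/IR/IntrinsicInst.h"   // DbgDeclareInst
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Walk every instruction, pick out @llvm.dbg.declare calls, and print
// "<name> = <constant>" when a store into the variable's alloca is a
// constant integer.
void dumpLocalInits(Module &M) {
  for (Function &F : M)
    for (BasicBlock &BB : F)
      for (Instruction &I : BB) {
        auto *DDI = dyn_cast<DbgDeclareInst>(&I);
        if (!DDI)
          continue;
        DILocalVariable *Var = DDI->getVariable(); // source name + line
        auto *Slot = dyn_cast_or_null<AllocaInst>(DDI->getAddress());
        if (!Slot)
          continue;
        // Scan the alloca's users for a store. Note: users() is not in
        // program order; a real pass should walk the function to find
        // the genuinely first store.
        for (User *U : Slot->users())
          if (auto *SI = dyn_cast<StoreInst>(U)) {
            if (auto *CI = dyn_cast<ConstantInt>(SI->getValueOperand()))
              outs() << Var->getName() << " = "
                     << CI->getSExtValue() << "\n";
            // Non-constant stores (e.g. the result of an add) need the
            // recursive decoding described above.
            break;
          }
      }
}
```

This only compiles against LLVM's development headers, and exact class names vary across LLVM versions, so check it against the headers of the release you are using.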