Unnecessary load and store instructions in Scala's bytecode

I just did some investigation into pattern matching and its corresponding bytecode.
val a = Array(1, 2, 3, 4)
a.map {
  case i => i + 1
}
For the code above, I used javap and got the bytecode for the anonymous function inside map:
public int apply$mcII$sp(int);
0: iload_1
1: istore_2
2: iload_2
3: iconst_1
4: iadd
5: ireturn
So it seems to me that at offset 0 we push an int (the parameter) onto the stack, at offset 1 we store it into a local variable, and at offset 2 we push it back onto the stack. What's the purpose of this?

Dude, try -optimise. With that flag the redundant istore/iload pair is gone:
public int apply$mcII$sp(int);
stack=2, locals=2, args_size=2
0: iload_1
1: iconst_1
2: iadd
3: ireturn
You can inspect this from the REPL with
scala> :javap -prv -
and then something like
scala> :javap -prv $line4/$read$$iw$$iw$$anonfun$1

This is not really an answer, since I couldn't figure out why this happens. I'm hoping that these observations will be at least helpful :)
I'm seeing the following bytecode in Scala 2.10:
public int apply$mcII$sp(int);
0: iload_1 ; var1 -> stack
1: istore_2 ; var2 <- stack
2: iload_2 ; var2 -> stack
3: iconst_1 ; 1 -> stack
4: iadd
5: istore_3 ; var3 <- stack
6: iload_3 ; var3 -> stack
7: ireturn ; return <- stack
The first two instructions seem to simply copy the value of var1 to var2 and then push var2 back onto the stack as an operand. The same can be observed after iadd, where the result is stored in var3 for no apparent reason, since ireturn returns the value from the top of the stack anyway.
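To check that the extra store/load pairs really are redundant, here is a minimal sketch of a toy stack machine, written in Python rather than Scala. It only understands the handful of opcodes that appear in the listings above, ignores 32-bit overflow, and the function and variable names are made up for illustration.

```python
def run(instructions, arg):
    """Interpret a tiny subset of JVM int bytecode for an instance method."""
    local_vars = {1: arg}  # slot 0 would hold `this`; slot 1 is the int parameter
    stack = []
    for ins in instructions:
        if ins.startswith("iload_"):
            stack.append(local_vars[int(ins[-1])])   # local variable -> stack
        elif ins.startswith("istore_"):
            local_vars[int(ins[-1])] = stack.pop()   # stack -> local variable
        elif ins.startswith("iconst_"):
            stack.append(int(ins[-1]))               # small constant -> stack
        elif ins == "iadd":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif ins == "ireturn":
            return stack.pop()
    raise ValueError("fell off the end without ireturn")


# The Scala 2.10 listing and the -optimise listing from this thread.
unoptimised = ["iload_1", "istore_2", "iload_2", "iconst_1", "iadd",
               "istore_3", "iload_3", "ireturn"]
optimised = ["iload_1", "iconst_1", "iadd", "ireturn"]

for n in (0, 1, 41):
    assert run(unoptimised, n) == run(optimised, n) == n + 1
```

Both sequences leave the same value on top of the stack at ireturn, so the istore/iload pairs only cost a few extra instructions.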


Strange problem with PIO state machine of Raspberry Pi Pico

I am trying to implement TFT control using PIO. I have written 4 PIO state machines, one for every sync signal of my TFT (CLOCK, DE, HSYNC, VSYNC). 3 of them work seamlessly, but if I add the VSYNC module, the whole Pico freezes. It doesn't change pins and doesn't even blink an LED driven by a repeating timer.
Here is what my initialization looks like:
PIO pio = pio0;
uint h_offset = pio_add_program(pio, &hsync_program);
uint v_offset = pio_add_program(pio, &vsync_program);
uint c_offset = pio_add_program(pio, &clock_program);
uint d_offset = pio_add_program(pio, &de_program);
hsync_program_init(pio, 0, h_offset, HSYNC_PIN);
vsync_program_init(pio, 1, v_offset, VSYNC_PIN);
clock_program_init(pio, 2, c_offset, CLOCK_PIN);
de_program_init(pio, 3, d_offset, DE_PIN);
pio_sm_set_enabled(pio, 0, true);
pio_sm_set_enabled(pio, 1, true);
pio_sm_set_enabled(pio, 2, true);
pio_sm_set_enabled(pio, 3, true);
pio_sm_put_blocking(pio, 0, TFT_WIDTH);
pio_sm_put_blocking(pio, 1, TFT_HEIGHT);
Here is the content of the vsync.pio:
.define v_back_porch 12
.define v_front_porch 8
.define v_sync_len 4
.program vsync
    pull block
    mov y, osr
.wrap_target
    set pins, 0
    set x, v_sync_len
vactive:
    wait 1 irq 2
    jmp x-- vactive
    ; back porch
    set pins, 1
    set x, (v_back_porch - v_sync_len)
vbporch:
    wait 1 irq 2
    jmp x-- vbporch
    ; main cycle
    mov x, y
vmain:
    wait 1 irq 2
    jmp x-- vmain
    set x, v_front_porch
vfporch:
    wait 1 irq 2
    jmp x-- vfporch
    ; set sync interrupt for RGB
    ; irq 3
.wrap ; sync forever!
% c-sdk {
// this is a raw helper function for use by the user which sets up the GPIO output, and configures the SM to output on a particular pin
void vsync_program_init(PIO pio, uint sm, uint offset, uint pin) {
    pio_gpio_init(pio, pin);
    pio_sm_set_consecutive_pindirs(pio, sm, pin, 1, true);
    pio_sm_config c = vsync_program_get_default_config(offset);
    sm_config_set_set_pins(&c, pin, 1);
    pio_sm_init(pio, sm, offset, &c);
}
%}
If I remove every line of code for the vsync state machine, everything works and generates exactly what I want.
After adding uint v_offset = pio_add_program(pio, &vsync_program); it doesn't work anymore. There are no errors or warnings during compilation. I have already tried almost everything. It seems that registers x and y are faulty, but I can't make a counter without using them.
I have pretty much the same code in hsync.pio, but I don't have any problems with it.
Here is the compiled result for vsync.pio:
static const uint16_t vsync_program_instructions[] = {
0x80a0, // 0: pull block
0xa047, // 1: mov y, osr
// .wrap_target
0xe000, // 2: set pins, 0
0xe024, // 3: set x, 4
0x20c2, // 4: wait 1 irq, 2
0x0044, // 5: jmp x--, 4
0xe001, // 6: set pins, 1
0xe028, // 7: set x, 8
0x20c2, // 8: wait 1 irq, 2
0x0048, // 9: jmp x--, 8
0xa022, // 10: mov x, y
0x20c2, // 11: wait 1 irq, 2
0x004b, // 12: jmp x--, 11
0xe028, // 13: set x, 8
0x20c2, // 14: wait 1 irq, 2
0x004e, // 15: jmp x--, 14
// .wrap
};
and for hsync.pio, for comparison:
static const uint16_t hsync_program_instructions[] = {
0x80a0, // 0: pull block
0xa047, // 1: mov y, osr
// .wrap_target
0xe000, // 2: set pins, 0
0xe022, // 3: set x, 2
0x20c0, // 4: wait 1 irq, 0
0x0044, // 5: jmp x--, 4
0xe001, // 6: set pins, 1
0xe022, // 7: set x, 2
0x20c0, // 8: wait 1 irq, 0
0x0048, // 9: jmp x--, 8
0xc001, // 10: irq nowait 1
0xa022, // 11: mov x, y
0x20c0, // 12: wait 1 irq, 0
0x004c, // 13: jmp x--, 12
0xc001, // 14: irq nowait 1
0xe024, // 15: set x, 4
0x20c0, // 16: wait 1 irq, 0
0x0050, // 17: jmp x--, 16
0xc002, // 18: irq nowait 2
// .wrap
};
I don't see any significant differences there.
What could be the reason for such behavior?
It's a great pity, but every PIO instance (not every state machine in the instance!) has room for only 32 instructions. So I have to use PIO1, but that raises another problem: IRQs from one PIO don't interact with the other PIO.
The structure of the Pico PIO makes me really sad :(
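The 32-instruction limit can be checked against the compiled listings in the question. Below is a minimal sketch in Python (not the Pico SDK). The program lengths are read off the last instruction index in each listing, the lengths of the clock and de programs are not shown in the question and are therefore left out, and the helper name fits is hypothetical.

```python
PIO_INSTRUCTION_SLOTS = 32  # shared by all four state machines of one PIO instance

# Lengths taken from the compiled listings above (last instruction index + 1).
program_lengths = {"hsync": 19, "vsync": 16}


def fits(lengths, capacity=PIO_INSTRUCTION_SLOTS):
    """Return True if all programs fit into one PIO instruction memory."""
    return sum(lengths.values()) <= capacity


total = sum(program_lengths.values())
print(f"hsync + vsync = {total} of {PIO_INSTRUCTION_SLOTS} slots")
print("fits" if fits(program_lengths) else "does not fit")
```

hsync and vsync alone already need 35 slots, before clock and de are counted. As far as I can tell from the SDK documentation, pio_add_program panics when a program cannot be loaded, which would be consistent with the board freezing without any compile-time error. Checking with pio_can_add_program first should make the failure visible.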

Ceph OSD crashes

I am looking for a way to bring up our 3 Ceph OSDs, which are down. In the log I can see the following output, and my guess is that it happens because the OSDs are full.
At the moment 3 of the 4 OSDs are down, so the Ceph cluster is down as well, and I have no clue about the following error.
[root@cephosd02 ~]# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el7/BUILD/ceph-15.2.13/src/os/bluestore/AvlAllocator.cc: In function 'virtual void AvlAllocator::_add_to_tree(uint64_t, uint64_t)' thread 7f633fca3bc0 time 2021-11-28T12:02:33.118576+0330
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el7/BUILD/ceph-15.2.13/src/os/bluestore/AvlAllocator.cc: 60: FAILED ceph_assert(size != 0)
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55b6ed7219a3]
2: (()+0x4deb6b) [0x55b6ed721b6b]
3: (AvlAllocator::_add_to_tree(unsigned long, unsigned long)+0x2db9) [0x55b6edd9af89]
4: (AvlAllocator::init_add_free(unsigned long, unsigned long)+0x69) [0x55b6edd9e8d9]
5: (BlueStore::_open_alloc()+0x1d3) [0x55b6edc61ad3]
6: (BlueStore::_open_db_and_around(bool)+0xa7) [0x55b6edc7a507]
7: (BlueStore::_mount(bool, bool)+0x5c2) [0x55b6edcc6ba2]
8: (OSD::init()+0x35d) [0x55b6ed82f1bd]
9: (main()+0x1b5e) [0x55b6ed78372e]
10: (__libc_start_main()+0xf5) [0x7f633caf5555]
11: (()+0x575dd5) [0x55b6ed7b8dd5]
*** Caught signal (Aborted) **
in thread 7f633fca3bc0 thread_name:ceph-osd
2021-11-28T12:02:33.125+0330 7f633fca3bc0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el7/BUILD/ceph-15.2.13/src/os/bluestore/AvlAllocator.cc: In function 'virtual void AvlAllocator::_add_to_tree(uint64_t, uint64_t)' thread 7f633fca3bc0 time 2021-11-28T12:02:33.118576+0330
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el7/BUILD/ceph-15.2.13/src/os/bluestore/AvlAllocator.cc: 60: FAILED ceph_assert(size != 0)
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55b6ed7219a3]
2: (()+0x4deb6b) [0x55b6ed721b6b]
3: (AvlAllocator::_add_to_tree(unsigned long, unsigned long)+0x2db9) [0x55b6edd9af89]
4: (AvlAllocator::init_add_free(unsigned long, unsigned long)+0x69) [0x55b6edd9e8d9]
5: (BlueStore::_open_alloc()+0x1d3) [0x55b6edc61ad3]
6: (BlueStore::_open_db_and_around(bool)+0xa7) [0x55b6edc7a507]
7: (BlueStore::_mount(bool, bool)+0x5c2) [0x55b6edcc6ba2]
8: (OSD::init()+0x35d) [0x55b6ed82f1bd]
9: (main()+0x1b5e) [0x55b6ed78372e]
10: (__libc_start_main()+0xf5) [0x7f633caf5555]
11: (()+0x575dd5) [0x55b6ed7b8dd5]
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (()+0xf630) [0x7f633dd16630]
2: (gsignal()+0x37) [0x7f633cb09387]
3: (abort()+0x148) [0x7f633cb0aa78]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55b6ed7219f2]
5: (()+0x4deb6b) [0x55b6ed721b6b]
6: (AvlAllocator::_add_to_tree(unsigned long, unsigned long)+0x2db9) [0x55b6edd9af89]
7: (AvlAllocator::init_add_free(unsigned long, unsigned long)+0x69) [0x55b6edd9e8d9]
8: (BlueStore::_open_alloc()+0x1d3) [0x55b6edc61ad3]
9: (BlueStore::_open_db_and_around(bool)+0xa7) [0x55b6edc7a507]
10: (BlueStore::_mount(bool, bool)+0x5c2) [0x55b6edcc6ba2]
11: (OSD::init()+0x35d) [0x55b6ed82f1bd]
12: (main()+0x1b5e) [0x55b6ed78372e]
13: (__libc_start_main()+0xf5) [0x7f633caf5555]
14: (()+0x575dd5) [0x55b6ed7b8dd5]
2021-11-28T12:02:33.133+0330 7f633fca3bc0 -1 *** Caught signal (Aborted) **
in thread 7f633fca3bc0 thread_name:ceph-osd
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (()+0xf630) [0x7f633dd16630]
2: (gsignal()+0x37) [0x7f633cb09387]
3: (abort()+0x148) [0x7f633cb0aa78]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char

eBPF tools - skb_network_header crashes in a BPF Kernel trace function

I am looking to trace ip_forward_finish. The intent is to measure the latency of all TCP connections going through a Linux-based gateway router, so I thought of tracing the ip_forward_finish kernel function and capturing the timestamps of the SYN, SYN-ACK and ACK messages at the router.
The issue is that accessing the iphdr inside the trace function makes the verifier complain with the following error:
bpf: Failed to load program: Permission denied
0: (79) r6 = *(u64 *)(r1 +96)
1: (b7) r1 = 0
2: (6b) *(u16 *)(r10 -24) = r1
3: (bf) r3 = r6
4: (07) r3 += 192
5: (bf) r1 = r10
6: (07) r1 += -24
7: (b7) r2 = 2
8: (85) call bpf_probe_read#4
9: (69) r1 = *(u16 *)(r10 -24)
10: (55) if r1 != 0x8 goto pc+7
R0=inv(id=0) R1=inv8 R6=inv(id=0) R10=fp0
11: (69) r1 = *(u16 *)(r6 +196)
R6 invalid mem access 'inv'
HINT: The invalid mem access 'inv' error can happen if you try to dereference
memory without first using bpf_probe_read() to copy it to the BPF stack.
Sometimes the bpf_probe_read is automatic by the bcc rewriter, other times
you'll need to be explicit.
The code fragment I originally had is below. The program is rejected when ip_Hdr->protocol is accessed. I also checked that ip_Hdr is not null.
int trace_forward_finish(struct pt_regs *ctx, struct net *net,
                         struct sock *sk, struct sk_buff *skb)
{
    if (skb->protocol != htons(ETH_P_IP))
        return 0;
    struct iphdr *ip_Hdr = (struct iphdr *) skb_network_header(skb);
    if (ip_Hdr->protocol != IPPROTO_TCP)
        return 0;
    /// More code
}
Per the HINT in the message, I did try changing it to use bpf_probe_read, but the outcome is still the same:
int trace_forward_finish(struct pt_regs *ctx, struct net *net,
                         struct sock *sk, struct sk_buff *skb)
{
    if (skb->protocol != htons(ETH_P_IP))
        return 0;
    struct iphdr ip_Hdr;
    bpf_probe_read(&ip_Hdr, sizeof(ip_Hdr), (void *) ip_hdr(skb));
    if (ip_Hdr.protocol != IPPROTO_TCP)
        return 0;
    return 0;
}
Any help would be appreciated.
bcc will try to transform your dereferences of kernel pointers into calls to bpf_probe_read. You can see that happening by passing debug=4 to the BPF() call.
In your case, I suspect that you would need to include the body of the function skb_network_header in your own code so that bcc is able to rewrite it.
If that's not sufficient, then you might need a manual call to bpf_probe_read to retrieve the struct iphdr from the pointer that skb_network_header returns.
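The manual variant could look roughly like the sketch below. It is untested against the asker's kernel and rests on a few assumptions: bcc is installed, the script runs as root, and skb_network_header() amounts to skb->head + skb->network_header on that kernel (sk_buff is not a stable API, so this should be verified). The names iph, BPF_PROGRAM and attach_and_trace are made up for illustration.

```python
# Sketch only: the C text below is compiled by bcc when attach_and_trace() is called.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/if_ether.h>

int trace_forward_finish(struct pt_regs *ctx, struct net *net,
                         struct sock *sk, struct sk_buff *skb)
{
    unsigned char *head = NULL;
    u16 network_header = 0;
    u16 protocol = 0;
    struct iphdr iph = {};

    // Copy each sk_buff field onto the BPF stack before using it.
    bpf_probe_read(&protocol, sizeof(protocol), &skb->protocol);
    if (protocol != htons(ETH_P_IP))
        return 0;

    // Assumed equivalent of skb_network_header(): head + network_header.
    bpf_probe_read(&head, sizeof(head), &skb->head);
    bpf_probe_read(&network_header, sizeof(network_header), &skb->network_header);
    bpf_probe_read(&iph, sizeof(iph), head + network_header);

    if (iph.protocol != IPPROTO_TCP)
        return 0;

    bpf_trace_printk("tcp packet forwarded\n");
    return 0;
}
"""


def attach_and_trace():
    """Compile the program, attach it to ip_forward_finish and print trace output."""
    from bcc import BPF  # imported lazily so the program text can be read without bcc

    b = BPF(text=BPF_PROGRAM)
    b.attach_kprobe(event="ip_forward_finish", fn_name="trace_forward_finish")
    b.trace_print()
```

Calling attach_and_trace() as root would load the program. The point of the sketch is that every value read through skb is first copied to the stack with bpf_probe_read, so the verifier never sees a direct dereference of a kernel pointer.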

Huge amount of memory used by Flink

For the last couple of weeks I have been building a DataStream program in Flink with Scala.
But I see strange behavior: Flink uses a lot more memory than I expected.
I have four ListStates of tuple (Int, Long) in my ProcessFunction, keyed by Int. I use them to compute different unique counters over different time frames, and I expected most of the memory to be used by these lists.
But that's not the case.
So I printed a live histogram of the JVM heap.
And I was surprised by how much memory is used.
num #instances #bytes class name
1: 138920685 6668192880 java.util.HashMap$Node
2: 138893041 5555721640 org.apache.flink.streaming.api.operators.InternalTimer
3: 149680624 3592334976 java.lang.Integer
4: 48313229 3092046656 org.apache.flink.runtime.state.heap.CopyOnWriteStateTable$StateTableEntry
5: 14042723 2579684280 [Ljava.lang.Object;
6: 4492 2047983264 [Ljava.util.HashMap$Node;
7: 41686732 1333975424 com.myJob.flink.tupleState
8: 201 784339688 [Lorg.apache.flink.runtime.state.heap.CopyOnWriteStateTable$StateTableEntry;
9: 17230300 689212000 com.myJob.flink.uniqStruct
10: 14025040 561001600 java.util.ArrayList
11: 8615581 413547888 com.myJob.flink.Data$FingerprintCnt
12: 6142006 393088384 com.myJob.flink.ProcessCountStruct
13: 4307549 172301960 com.myJob.flink.uniqresult
14: 4307841 137850912 com.myJob.flink.Data$FingerprintUniq
15: 2153904 137849856 com.myJob.flink.Data$StreamData
16: 1984742 79389680 scala.collection.mutable.ListBuffer
17: 1909472 61103104 scala.collection.immutable.$colon$colon
18: 22200 21844392 [B
19: 282624 9043968 org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
20: 59045 6552856 [C
21: 33194 2655520 java.nio.DirectByteBuffer
22: 32804 2361888 sun.misc.Cleaner
23: 35 2294600 [Lscala.concurrent.forkjoin.ForkJoinTask;
24: 640 2276352 [Lorg.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry;
25: 32768 2097152 org.apache.flink.core.memory.HybridMemorySegment
26: 12291 2082448 java.lang.Class
27: 58591 1874912 java.lang.String
28: 8581 1372960 java.lang.reflect.Method
29: 32790 1311600 java.nio.DirectByteBuffer$Deallocator
30: 18537 889776 java.util.concurrent.ConcurrentHashMap$Node
31: 4239 508680 java.lang.reflect.Field
32: 8810 493360 java.nio.HeapByteBuffer
33: 7389 472896 java.util.HashMap
34: 5208 400336 [I
The tuple (Int, Long) is com.myJob.flink.tupleState, in 7th position.
And I see that the tuples use less than 2 GB of memory.
I don't understand why Flink uses this amount of memory for the other classes.
Can anyone shed some light on this behavior? Thanks in advance.
I run my job on a standalone cluster (1 JobManager, 3 TaskManagers).
The Flink version is 1.5-SNAPSHOT, commit e4486ae.
I took the live histogram on one TaskManager node.
Update 2:
In my processFunction I use:
ctx.timerService.registerProcessingTimeTimer(ctx.timestamp + 100)
Then, in the onTimer function, I process my ListState to check all old data.
So it creates a timer for each call of processFunction.
But why is the timer still in memory after the onTimer function has been triggered?
How many windows do you end up with? Based on the top two entries, what you are seeing is the "timers" that Flink uses to track when to clean up the window. For every key in the window you effectively end up with a (key, endTimestamp) entry in the timer state. If you have a very large number of windows (perhaps because of out-of-order time or delayed watermarking) or a very large number of keys in each window, each of those will take up memory.
Note that even if you are using RocksDB state, the TimerService uses heap memory, so you have to watch out for that.
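To get a feel for how quickly per-element timers add up, here is a minimal simulation in Python (not Flink code). It assumes that Flink keeps at most one timer per key and timestamp, that no timer has fired yet, and a made-up workload of 10 keys with one element per millisecond. The one-second rounding is only an illustration of timer coalescing, and it can fire up to a second earlier than timestamp + 100.

```python
def count_timers(events, timer_for):
    """Count distinct (key, timestamp) timers; the same pair registered twice counts once."""
    timers = set()
    for key, ts in events:
        timers.add((key, timer_for(ts)))
    return len(timers)


# Hypothetical workload: 10 keys, one element per millisecond per key for 5 seconds.
events = [(key, ts) for key in range(10) for ts in range(5000)]

# One timer per element, as with registerProcessingTimeTimer(ctx.timestamp + 100).
per_element = count_timers(events, lambda ts: ts + 100)

# Coalesced: round the target time down to a full second before registering.
coalesced = count_timers(events, lambda ts: (ts + 100) // 1000 * 1000)

print(per_element)  # 50000
print(coalesced)    # 60
```

With a distinct timestamp per element, every call creates a new timer object plus its bookkeeping entries, which would match InternalTimer and HashMap$Node dominating the histogram. Rounding the timestamp lets many registrations collapse into one timer per key and second.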

Is multiple assignment of tuples slower than multiple assignment statements?

Is there a difference between assigning multiple variables using a tuple, and assigning them in multiple statements?
For example, is there a difference between the following code snippets?
// multiple assignment using tuples
val (x, y) = (str.length, str.substring(1, 2))
// multiple-statement assignment
val x = str.length
val y = str.substring(1, 2)
There is a difference. The tuple form is compiled as a pattern match (conceptually an extractor, the unapply method): a Tuple2 is allocated and then taken apart again, which incurs a cost at runtime. The second approach is certainly faster.
To get an idea of the difference, here is a decompilation of two methods showing both approaches. You can clearly see how many more instructions the first approach produces.
An important point to note here is that the first expression requires boxing the Int to a java.lang.Integer (because the Tuple2 constructor accepts objects), while the second expression uses the value without boxing.
public void m1(java.lang.String);
0: new #16 // class scala/Tuple2
3: dup
4: aload_1
5: invokevirtual #22 // Method java/lang/String.length:()I
8: invokestatic #28 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
11: aload_1
12: iconst_1
13: iconst_2
14: invokevirtual #32 // Method java/lang/String.substring:(II)Ljava/lang/String;
17: invokespecial #35 // Method scala/Tuple2."<init>":(Ljava/lang/Object;Ljava/lang/Object;)V
20: astore_3
21: aload_3
22: ifnull 75
25: aload_3
26: invokevirtual #38 // Method scala/Tuple2._1$mcI$sp:()I
29: istore 4
31: aload_3
32: invokevirtual #42 // Method scala/Tuple2._2:()Ljava/lang/Object;
35: checkcast #18 // class java/lang/String
38: astore 5
40: new #16 // class scala/Tuple2
43: dup
44: iload 4
46: invokestatic #28 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
49: aload 5
51: invokespecial #35 // Method scala/Tuple2."<init>":(Ljava/lang/Object;Ljava/lang/Object;)V
54: astore 6
56: aload 6
58: astore_2
59: aload_2
60: invokevirtual #38 // Method scala/Tuple2._1$mcI$sp:()I
63: istore 7
65: aload_2
66: invokevirtual #42 // Method scala/Tuple2._2:()Ljava/lang/Object;
69: checkcast #18 // class java/lang/String
72: astore 8
74: return
75: new #44 // class scala/MatchError
78: dup
79: aload_3
80: invokespecial #47 // Method scala/MatchError."<init>":(Ljava/lang/Object;)V
83: athrow
public void m2(java.lang.String);
0: aload_1
1: invokevirtual #22 // Method java/lang/String.length:()I
4: istore_2
5: aload_1
6: iconst_1
7: iconst_2
8: invokevirtual #32 // Method java/lang/String.substring:(II)Ljava/lang/String;
11: astore_3
12: return