Minimalmodbus: read multiple slaves on same serial port - modbus

I have a setup with 2 SDM120 kWh energy meters daisy chained on the same serial port (in the future I want to add a SDM630). I found "Using multiple instruments" in the MinimalModbus communication. I succeed in reading registers on the SDM120 on address 1, but I get an error on reading address 2. The error: minimalmodbus.NoResponseError: No communication with the instrument (no answer).
I can work around it by adding time.sleep(0.1), but I would think that RS485 allows to immediately read the registers of a second address after the first one is completed. I also tried lower values, but eg. time.sleep(0.01) also gave a NoResponseError.
I personally thought the setting instrument.serial.timeout = 1 already would have had the desired effect, but apparently I really need the time.sleep. Is the time.sleep(0.1) the correct way of doing? If so: how can I know the lowest value, so that I don't have a NoResponseError? Trial and error? Could my script be optimized? Especially when timing is important, eg. to avoid injection in the grid (pv diverter, ...). Thanks in advance!
The script:
#!/usr/bin/env python3
import minimalmodbus
import time
instrumentA = minimalmodbus.Instrument('/dev/ttyUSB0', 1, debug = True) # port name, slave address (in decimal)
instrumentA.serial.baudrate = 9600
instrumentA.serial.timeout = 1 # seconds
instrumentA.serial.bytesize = 8
instrumentA.serial.parity = minimalmodbus.serial.PARITY_NONE
instrumentA.serial.stopbits = 1
instrumentA.mode = minimalmodbus.MODE_RTU
instrumentB = minimalmodbus.Instrument('/dev/ttyUSB0', 2, debug = True)
instrumentB.mode = minimalmodbus.MODE_RTU
print ("====== SDM120 instrumentA on addres 1 ======")
print (instrumentA)
P = instrumentA.read_float(12, 4, 2)
print ("Active Power in Watts:", P)
#time.sleep(0.1) #workaround to avoid NoResponseError
print ("====== SDM120 instrumentB on addres 2 ======")
print (instrumentB)
P = instrumentB.read_float(12, 4, 2)
print ("Active Power in Watts:", P)
Output without the time.sleep(0.1):
MinimalModbus debug mode. Create serial port /dev/ttyUSB0
MinimalModbus debug mode. Serial port /dev/ttyUSB0 already exists
====== SDM120 instrumentA on addres 1 ======
minimalmodbus.Instrument<id=0x7f36e3dc0df0, address=1, mode=rtu, close_port_after_each_call=False, precalculate_read_size=True, clear_buffers_before_each_transaction=True, handle_local_echo=False, debug=True, serial=Serial<id=0x7f36e3dd90d0, open=True>(port='/dev/ttyUSB0', baudrate=9600, bytesize=8, parity='N', stopbits=1, timeout=1, xonxoff=False, rtscts=False, dsrdtr=False)>
MinimalModbus debug mode. Will write to instrument (expecting 9 bytes back): '\x01\x04\x00\x0c\x00\x02±È' (01 04 00 0C 00 02 B1 C8)
MinimalModbus debug mode. Clearing serial buffers for port /dev/ttyUSB0
MinimalModbus debug mode. No sleep required before write. Time since previous read: 190954.73 ms, minimum silent period: 4.01 ms.
MinimalModbus debug mode. Response from instrument: '\x01\x04\x04\x00\x00\x00\x00û\x84' (01 04 04 00 00 00 00 FB 84) (9 bytes), roundtrip time: 53.3 ms. Timeout for reading: 1000.0 ms.
Active Power in Watts: 0.0
====== SDM120 instrumentB on addres 2 ======
minimalmodbus.Instrument<id=0x7f36e3c55940, address=2, mode=rtu, close_port_after_each_call=False, precalculate_read_size=True, clear_buffers_before_each_transaction=True, handle_local_echo=False, debug=True, serial=Serial<id=0x7f36e3dd90d0, open=True>(port='/dev/ttyUSB0', baudrate=9600, bytesize=8, parity='N', stopbits=1, timeout=1, xonxoff=False, rtscts=False, dsrdtr=False)>
MinimalModbus debug mode. Will write to instrument (expecting 9 bytes back): '\x02\x04\x00\x0c\x00\x02±û' (02 04 00 0C 00 02 B1 FB)
MinimalModbus debug mode. Clearing serial buffers for port /dev/ttyUSB0
MinimalModbus debug mode. Sleeping 2.31 ms before sending. Minimum silent period: 4.01 ms, time since read: 1.70 ms.
MinimalModbus debug mode. Response from instrument: '' () (0 bytes), roundtrip time: 1001.3 ms. Timeout for reading: 1000.0 ms.
Traceback (most recent call last):
File "./sdm120-daisychain_v3.py", line 25, in <module>
P = instrumentB.read_float(12, 4, 2)
File "/home/mattias/.local/lib/python3.8/site-packages/minimalmodbus.py", line 662, in read_float
return self._generic_command(
File "/home/mattias/.local/lib/python3.8/site-packages/minimalmodbus.py", line 1170, in _generic_command
payload_from_slave = self._perform_command(functioncode, payload_to_slave)
File "/home/mattias/.local/lib/python3.8/site-packages/minimalmodbus.py", line 1240, in _perform_command
response = self._communicate(request, number_of_bytes_to_read)
File "/home/mattias/.local/lib/python3.8/site-packages/minimalmodbus.py", line 1406, in _communicate
raise NoResponseError("No communication with the instrument (no answer)")
minimalmodbus.NoResponseError: No communication with the instrument (no answer)
Output with the time.sleep(0.1):
MinimalModbus debug mode. Create serial port /dev/ttyUSB0
MinimalModbus debug mode. Serial port /dev/ttyUSB0 already exists
====== SDM120 instrumentA on addres 1 ======
minimalmodbus.Instrument<id=0x7f91feddcdf0, address=1, mode=rtu, close_port_after_each_call=False, precalculate_read_size=True, clear_buffers_before_each_transaction=True, handle_local_echo=False, debug=True, serial=Serial<id=0x7f91fedf50d0, open=True>(port='/dev/ttyUSB0', baudrate=9600, bytesize=8, parity='N', stopbits=1, timeout=1, xonxoff=False, rtscts=False, dsrdtr=False)>
MinimalModbus debug mode. Will write to instrument (expecting 9 bytes back): '\x01\x04\x00\x0c\x00\x02±È' (01 04 00 0C 00 02 B1 C8)
MinimalModbus debug mode. Clearing serial buffers for port /dev/ttyUSB0
MinimalModbus debug mode. No sleep required before write. Time since previous read: 176619.62 ms, minimum silent period: 4.01 ms.
MinimalModbus debug mode. Response from instrument: '\x01\x04\x04\x00\x00\x00\x00û\x84' (01 04 04 00 00 00 00 FB 84) (9 bytes), roundtrip time: 53.3 ms. Timeout for reading: 1000.0 ms.
Active Power in Watts: 0.0
====== SDM120 instrumentB on addres 2 ======
minimalmodbus.Instrument<id=0x7f91fec70940, address=2, mode=rtu, close_port_after_each_call=False, precalculate_read_size=True, clear_buffers_before_each_transaction=True, handle_local_echo=False, debug=True, serial=Serial<id=0x7f91fedf50d0, open=True>(port='/dev/ttyUSB0', baudrate=9600, bytesize=8, parity='N', stopbits=1, timeout=1, xonxoff=False, rtscts=False, dsrdtr=False)>
MinimalModbus debug mode. Will write to instrument (expecting 9 bytes back): '\x02\x04\x00\x0c\x00\x02±û' (02 04 00 0C 00 02 B1 FB)
MinimalModbus debug mode. Clearing serial buffers for port /dev/ttyUSB0
MinimalModbus debug mode. No sleep required before write. Time since previous read: 102.09 ms, minimum silent period: 4.01 ms.
MinimalModbus debug mode. Response from instrument: '\x02\x04\x04\x00\x00\x00\x00È\x84' (02 04 04 00 00 00 00 C8 84) (9 bytes), roundtrip time: 52.8 ms. Timeout for reading: 1000.0 ms.
Active Power in Watts: 0.0

There seems to be nothing wrong with your code or the library you are using (minimalmodbus).
As you probably know, Modbus works in a query-response mode over a half-duplex link. In plain English: you first send a query and the device that query is addressed to answers with the data you asked for.
Both parts of the transaction (queries and responses) travel over the same bus. But the bus is a shared medium and only one device is allowed to take control of the bus (to talk) at any time.
When you have a single master and one or multiple slaves this process works with no issues as long as you guarantee a short silent period after any device writes to the bus. The Modbus specification established this value at 3.5 characters (the time it takes to send 3 and a half characters serially on the bus at the baud rate you are using).
Unfortunately, some manufacturers do not stick to this rule. So some of those devices just take longer than 3.5 characters time to release control of the bus.
This seems to be the case at least with one of your devices. This manual can give you some clues:
My bet is out of your two devices one of them takes significantly less than the other to release the bus, but that's something you will have to confirm with the debug details. It might even be that the device takes longer to release the bus if you query 20 or 40 registers instead of 4 or 8...
What can you do about it? Well, from the device side, not much, it is what it is. On your software you can do many different things. As I said in the comments above you should not feel bad about using time.sleep() considering that's the way minimalmodbus tries to cope with the bus contention problem.
To make your code more robust you can add try: ... except:. This approach is explained in the documentation. You can keep retrying to read within a loop for a number of attempts or add a small delay to the except chunk. Maybe something like this.

The answer of Marcos G. (answered Apr 13 at 9:24) includes some background details. In short:
With some trial and error, one can have a value for time.sleep so that minimalmodbus can cope with the bus contention problem.
You can have more robust code with try: ... except:. It might be a good idea to only try a number of times, to avoid a infinite loop.
I include two scripts which use those approaches to my posted problem. Compared to my original question a for loop and an array for the addresses is used.
The first one is with time.sleep
#!/usr/bin/env python3
import minimalmodbus
import time
addr = 1
instrument = minimalmodbus.Instrument('/dev/ttyUSB0', addr) # port name, slave address (in decimal)
instrument.serial.baudrate = 9600 # Baud
instrument.serial.bytesize = 8
instrument.serial.parity = minimalmodbus.serial.PARITY_NONE
instrument.serial.stopbits = 1
instrument.serial.timeout = 1 # seconds
instrument.mode = minimalmodbus.MODE_RTU # rtu or ascii mode
addresses = [1,2]
registers = [ 0, 6, 12, 18, 24, 30, 70, 72, 74, 76, 78, 84, 86, 88, 90, 92, 94, 258, 264, 342, 344]
names = ["V","I","P","S", "Q","PF","f","IAE","EAE", "IRE", "ERE","TSP","MSP","ISP","MIP","ESP","MEP","ID","MID","TAE", "TRE"]
units = ["V","A","W","VA","var", "","Hz","kWh","kWh","kvarh","kvarh", "W", "W", "W", "W", "W", "W", "A", "A","kWh","kvarh"]
info = [
"(V for Voltage in volt)",
"(I for Current in ampere)",
"(P for Active Power in watt)",
"(S for Apparent power in volt-ampere)",
"(Q for Reactive power in volt-ampere reactive)",
"(PF for Power Factor)",
"(f for Frequency in hertz)",
"(IAE for Import active energy in kilowatt-hour)",
"(EAE for Export active energy in kilowatt-hour)",
"(IRE for Import reactive energy in kilovolt-ampere reactive hours)",
"(ERE for Export reactive energy in kilovolt-ampere reactive hours)",
"(TSP for Total system power demand in watt)",
"(MSP for Maximum total system power demand in watt)",
"(ISP for Import system power demand in watt)",
"(MIP for Maximum import system power demand in watt)",
"(ESP for Export system power demand in watt)",
"(MEP for MaximumExport system power demand in watt)",
"(ID for current demand in ampere)",
"(MID for Maximum current demand in ampere)",
"(TAE for Total active energy in kilowatt-hour)",
"(TRE for Total reactive energy in kilovolt-ampere reactive hours)",
]
for addr in addresses:
instrument.address = addr
print ("=== General info about address", addr, "===")
print (instrument)
print ("=== The registers for address", addr, "===")
for i in range(len(registers)):
value = instrument.read_float(registers[i], 4, 2)
print (str(registers[i]).rjust(3), str(value).rjust(20), units[i].ljust(5), info[i])
time.sleep(0.1) # To avoid minimalmodbus.NoResponseError
print ("")
The second one with try: ... except:
#!/usr/bin/env python3
import minimalmodbus
# This alternative script `sdm120-daisy-alt.py` will try to reread a device an extra 9 times before giving up and continuing with the other addresses in the array.
addr = 1
instrument = minimalmodbus.Instrument('/dev/ttyUSB0', addr) # port name, slave address (in decimal)
instrument.serial.baudrate = 9600 # Baud
instrument.serial.bytesize = 8
instrument.serial.parity = minimalmodbus.serial.PARITY_NONE
instrument.serial.stopbits = 1
instrument.serial.timeout = 1 # seconds
instrument.mode = minimalmodbus.MODE_RTU # rtu or ascii mode
addresses = [1,2]
registers = [ 0, 6, 12, 18, 24, 30, 70, 72, 74, 76, 78, 84, 86, 88, 90, 92, 94, 258, 264, 342, 344]
names = ["V","I","P","S", "Q","PF","f","IAE","EAE", "IRE", "ERE","TSP","MSP","ISP","MIP","ESP","MEP","ID","MID","TAE", "TRE"]
units = ["V","A","W","VA","var", "","Hz","kWh","kWh","kvarh","kvarh", "W", "W", "W", "W", "W", "W", "A", "A","kWh","kvarh"]
info = [
"(V for Voltage in volt)",
"(I for Current in ampere)",
"(P for Active Power in watt)",
"(S for Apparent power in volt-ampere)",
"(Q for Reactive power in volt-ampere reactive)",
"(PF for Power Factor)",
"(f for Frequency in hertz)",
"(IAE for Import active energy in kilowatt-hour)",
"(EAE for Export active energy in kilowatt-hour)",
"(IRE for Import reactive energy in kilovolt-ampere reactive hours)",
"(ERE for Export reactive energy in kilovolt-ampere reactive hours)",
"(TSP for Total system power demand in watt)",
"(MSP for Maximum total system power demand in watt)",
"(ISP for Import system power demand in watt)",
"(MIP for Maximum import system power demand in watt)",
"(ESP for Export system power demand in watt)",
"(MEP for MaximumExport system power demand in watt)",
"(ID for current demand in ampere)",
"(MID for Maximum current demand in ampere)",
"(TAE for Total active energy in kilowatt-hour)",
"(TRE for Total reactive energy in kilovolt-ampere reactive hours)",
]
for addr in addresses:
instrument.address = addr
print ("=== General info about address", addr, "===")
print (instrument)
print ("=== The registers for address", addr, "===")
for attempt in range(10):
try:
for i in range(len(registers)):
value = instrument.read_float(registers[i], 4, 2)
print (str(registers[i]).rjust(3), str(value).rjust(20), units[i].ljust(5), info[i])
except minimalmodbus.NoResponseError:
print("NoResponseError: did't work on attempt ", attempt+1, "I will retry")
else:
print ("Succeeded on attempt", attempt+1)
break
print ("")

Related

How do I synchronize clocks on two Raspberry Pi Picos?

I'm sending bits as LED blinks from one Pi Pico, and using another Pico to receive voltage from a photodiode and read the bits.
When sending/receiving from the same PiPico, it works accurately at 60us a bit. However, when sending from one pico and receiving from a second pico, I can only send/receive bits accurately at 0.1s a bit. If I move to sending/receiving at 0.01s, I start losing information. This leads me to believe my issue is with the clocks or sampling rates on the two pipicos, that they are slightly different.
Is there a way to synchronize two pipico clocks using Thonny/Micropython?
Code for Pico #1 sending blinks/bits:
led=Pin(13,Pin.OUT) #Led to blink
SaqVoltage = machine.ADC(28) #receive an input signal, when over a certain voltage, send the time using bits
#Ommitted code here to grab first low voltage values from SaqVoltage signal and use to determine if there is a high voltage received, when high voltage is received, send "hi" in led bits
#More omitted calibration code here for led/photodiode
hi = "0110100001101001"
while True:
SaqVoltage_value = SaqVoltage.read_u16()*3.3 / 65536 #read input signal to see if high voltage is sent
finalBitString = ""
if SaqVoltage_value > saQCutOff: #high voltage found, send "hi" sequence
finalBitString = #hi plus some start/stop sequences
for i in range (0, len(finalBitString)):
if finalBitString[i]=="1":
led(1)
elif finalBitString[i]=="0":
led(0)
utime.sleep(.01)
Code for Pico#2 receiving bits:
SaqDiode = machine.ADC(28) #read photodiode
#ommitted code here to calibrate leds/photodiodes
startSeqSaq = []
wordBits=[]
while True:
utime.sleep(.01) #sample diode every 0.01 seconds
SaqDiode_value = SaqDiode.read_u16()*3.3 / 65536
if (saqDiodeHi - saqCutOff <= SaqDiode_value): #read saq diode and determine 1 or 0
bit=1
elif (SaqDiode_value <= saqDiodeLo + saqCutOff):
bit=0
if len(startSeqSaq)==10: #record last 10 received bits to check if start sequence
startSeqSaq.pop(0)
startSeqSaq.append(bit)
elif len(startSeqSaq)<10:
startSeqSaq.append(bit)
if startSeqSaq == startSeq: #found start sequence, start reading bits
while True:
utime.sleep(.01)
SaqDiode_value = SaqDiode.read_u16()*3.3 / 65536
if (saqDiodeHi - saqCutOff <= SaqDiode_value):
bit=1
wordBits.append(bit)
elif (SaqDiode_value < saqDiodeLo + saqCutOff):
bit=0
wordBits.append(bit)
if len(wordBits)>10: #check for stop sequence
last14=wordBits[-10:]
else:
last14 = []
if last14==endSeq:
char = frombits(wordBits[:-10])
wordBits=[]
print("Function Generator Reset Signal Time in ms from Start: ", char)
break

Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?

All benchmarks are run on either
Icelake
or Whiskey Lake (In Skylake Family).
Summary
I am seeing a strange phenomina where it appears that when a loop
transitions from running out of the Uop Cache to running out of
the LSD (Loop Stream Detector) there is a spike in Branch
Misses that can cause severe performance hits. I tested on both
Icelake and Whiskey Lake by benchmarking a nested loop with the outer loop
having a sufficiently large body s.t the entire thing did not fit in
the LSD itself, but with an inner loop small enough to fit in the
LSD.
Basically once the inner loop reaches some iteration count decoding
seems to switch for idq.dsb_uops (Uop Cache) to lsd.uops (LSD)
and at that point there is a large increase in branch-misses
(without a corresponding jump in branches) causing a severe
performance drop. Note: This only seems to occur for nested
loops. Travis Down's Loop
Test for example will
not show any meaningful variation in branch misses. AFAICT this has
something to do with when a loop transitions from running out of the
Uop Cache to running out of the LSD.
Questions
What is happening when the loop transitions from running out of the
Uop Cache to running out of the LSD that causes this spike in
Branch Misses?
Is there a way to avoid this?
Benchmark
This is the minimum reproducible example I could come up with:
Note: If the .p2align statements are removed both loops will fit in
the LSD and there will not be a transitions.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"movl %k[inner_N], %k[inner_loop_cnt]\n"
// Extra align surrounding inner loop so that the entire thing
// doesn't execute out of LSD.
".p2align 10\n"
"2:\n"
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
:);
}
int
main(int argc, char ** argv) {
assert(argc > 1);
uint64_t inner_N = atoi(argv[1]);
bench(inner_N);
}
Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>
Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Graphs
Edit: x label is N iterations of inner loop.
Below is a graph of Branch Misses, Branches, and LSD Uops.
Generally you can see that 1) there is no corresponding jump in Branches. 2) that the number of added Branch Misses stabilizes at a constant. And 3) That there is a strong relationship between the Branch Misses and LSD Uops.
Icelake Graph:
Whiskey Lake Graph:
Below is a graph of Branch Misses, Cycles, and LSD Uops
for Icelake only as performance is not affected nearly as much on:
Analysis
Hard numbers below.
For Icelake starting at N = 22 and finishing at N = 27 there
is some fluctuation in the number of uops coming from the LSD vs
Uop Cache and during that time Branch Misses increases by
roughly 3 order of magnitude from 10^4 -> 10^7. During this period
Cycles also increased by a factor of 2. For all N > 27
Branch Misses stays at around 1.67 x 10^7 (roughly outer_loop_N). For N = [17, 40]
branches continues to only increase linearly.
The results for Whiskey Lake look different in that 1) N begins fluctuating at N = 35 and continues to fluctuate until N = 49. And 2) there is less of a
performance impact and more fluctuation in the data. That being said
the increase in Branch-Misses corresponding with transitions from
uops being fed by Uop Cache to being fed by LSD still exists.
Results
Data is mean result for 25 runs.
Icelake Results:
N
cycles
branches
branch-misses
idq.ms_uops
idq.dsb_uops
lsd.uops
1
33893260
67129521
1590
43163
115243
83908732
2
42540891
83908928
1762
49023
142909
100690381
3
50725933
100686143
1782
47656
142506
117440256
4
67533597
117461172
1655
52538
186123
134158311
5
68022910
134238387
1711
53405
204481
150954035
6
85543126
151018722
1924
62445
141397
167633971
7
84847823
167799220
1935
60248
160146
184563523
8
101532158
184570060
1709
60064
361208
201100179
9
101864898
201347253
1773
63827
459873
217780207
10
118024033
218124499
1698
59480
177223
234834304
11
118644416
234908571
2201
62514
422977
251503052
12
134627567
251678909
1679
57262
133462
268435650
13
285607942
268456135
1770
74070
285032524
315423
14
302717754
285233352
1731
74663
302101097
15953
15
321627434
302010569
81796
77831
319192830
1520819
16
337876736
318787786
71638
77056
335904260
1265766
17
353054773
335565563
1798
79839
352434780
15879
18
369800279
352344970
1978
79863
369229396
16790
19
386921048
369119438
1972
84075
385984022
16115
20
404248461
385896655
29454
85348
402790977
510176
21
421100725
402673872
37598
83400
419537730
729397
22
519623794
419451095
4447767
91209
431865775
97827331
23
702206338
436228323
12603617
109064
427880075
327661987
24
710626194
453005538
12316933
106929
432926173
344838509
25
863214037
469782765
14641887
121776
428085132
614871430
26
761037251
486559974
13067814
113011
438093034
418124984
27
832686921
503337195
16381350
113953
421924080
556915419
28
854713119
520114412
16642396
124448
420515666
598907353
29
869873144
536891629
16572581
119280
421188631
629696780
30
889642335
553668847
16717446
120116
420086570
668628871
31
906912275
570446064
16735759
126094
419970933
702822733
32
923023862
587223281
16706519
132498
420332680
735003892
33
940308170
604000498
16744992
124365
419945191
770185745
34
957075000
620777716
16726856
133675
420215897
802779119
35
974557538
637554932
16763071
134871
419764866
838012637
36
991110971
654332149
16772560
130903
419641144
872037131
37
1008489575
671109367
16757219
138788
419900997
904638287
38
1024971256
687886583
16772585
139782
419663863
938988917
39
1041404669
704669411
16776722
137681
419617131
972896126
40
1058594326
721441018
16773492
142959
419662133
1006109192
41
1075179100
738218235
16776636
141185
419601996
1039892900
42
1092093726
754995452
16776233
142793
419611902
1073373451
43
1108706464
771773224
16776359
139500
419610885
1106976114
44
1125413652
788549886
16764637
143717
419889127
1139628280
45
1142023614
805327103
16778640
144397
419558217
1174329696
46
1158833317
822104321
16765518
148045
419889914
1206833484
47
1175665684
838881537
16778437
148347
419562885
1241397845
48
1192454164
855658755
16778865
153651
419552747
1275006511
49
1210199084
872436025
16778287
152468
419599314
1307945613
50
1226321832
889213188
16778464
155552
419572344
1341893668
51
1242886388
905990406
16778745
155401
419559249
1375589883
52
1259559053
922767623
16778809
154847
419554082
1409206082
53
1276875799
939544839
16778460
162521
419576455
1442424993
54
1293113199
956322057
16778931
154913
419550955
1476316161
55
1310449232
973099274
16778534
157364
419578102
1509485876
56
1327022109
989876491
16778794
162881
419562403
1543193559
57
1344097516
1006653708
16778906
157486
419567545
1576414302
58
1362935064
1023430928
16778959
315120
419583132
1609691339
59
1381567560
1040208143
16778564
179997
419661259
1640660745
60
1394829416
1056985359
16778779
167613
419575969
1677034188
61
1411847237
1073762626
16778071
166332
419613028
1710194702
62
1428918439
1090539795
16778409
168073
419610487
1743644637
63
1445223241
1107317011
16778486
172446
419591254
1777573503
64
1461530579
1124094228
16769606
169559
419970612
1810351736
Whiskey Lake Results:
N
cycles
branches
branch-misses
idq.dsb_uops
lsd.uops
1
8332553879
35005847
37925
1799462
6019
2
8329926329
51163346
34338
1114352
5919
3
8357233041
67925775
32270
9241935
5555
4
8379609449
85364250
35667
18215077
5712
5
8394301337
101563554
33177
26392216
2159
6
8409830612
118918934
35007
35318763
5295
7
8435794672
135162597
35592
43033739
4478
8
8445843118
152636271
37802
52154850
5629
9
8459141676
168577876
30766
59245754
1543
10
8475484632
185354280
30825
68059212
4672
11
8493529857
202489273
31703
77386249
5556
12
8509281533
218912407
32133
84390084
4399
13
8528605921
236303681
33056
93995496
2093
14
8553971099
252439989
572416
99700289
2477
15
8558526147
269148605
29912
109772044
6121
16
8576658106
286414453
29839
118504526
5850
17
8591545887
302698593
28993
126409458
4865
18
8611628234
319960954
32568
136298306
5066
19
8627289083
336312187
30094
143759724
6598
20
8644741581
353730396
49458
152217853
9275
21
8685908403
369886284
1175195
161313923
7958903
22
8694494654
387336207
354008
169541244
2553802
23
8702920906
403389097
29315
176524452
12932
24
8711458401
420211718
31924
184984842
11574
25
8729941722
437299615
32472
194553843
12002
26
8743658904
453739403
28809
202074676
13279
27
8763317458
470902005
32298
211321630
15377
28
8788189716
487432842
37105
218972477
27666
29
8796580152
504414945
36756
228334744
79954
30
8821174857
520930989
39550
235849655
140461
31
8818857058
537611096
34142
648080
79191
32
8855038758
555138781
37680
18414880
70489
33
8870680446
571194669
37541
34596108
131455
34
8888946679
588222521
33724
52553756
80009
35
9256640352
604791887
16658672
132185723
41881719
36
9189040776
621918353
12296238
257921026
235389707
37
8962737456
638241888
1086663
109613368
35222987
38
9005853511
655453884
2059624
131945369
73389550
39
9005576553
671845678
1434478
143002441
51959363
40
9284680907
688991063
12776341
349762585
347998221
41
9049931865
705399210
1778532
174597773
72566933
42
9314836359
722226758
12743442
365270833
380415682
43
9072200927
739449289
1344663
205181163
61284843
44
9346737669
755766179
12681859
383580355
409359111
45
9117099955
773167996
1801713
235583664
88985013
46
9108062783
789247474
860680
250992592
43508069
47
9129892784
806871038
984804
268229102
51249366
48
9146468279
822765997
1018387
282312588
58278399
49
9476835578
840085058
13985421
241172394
809315446
50
9495578885
856579327
14155046
241909464
847629148
51
9537115189
873483093
15057500
238735335
932663942
52
9556102594
890026435
15322279
238194482
982429654
53
9589094741
907142375
15899251
234845868
1052080437
54
9609053333
923477989
16049518
233890599
1092323040
55
9628950166
940997348
16172619
235383688
1131146866
56
9650657175
957049360
16445697
231276680
1183699383
57
9666446210
973785857
16330748
233203869
1205098118
58
9687274222
990692518
16523542
230842647
1254624242
59
9706652879
1007946602
16576268
231502185
1288374980
60
9720091630
1024044005
16547047
230966608
1321807705
61
9741079017
1041285110
16635400
230873663
1362929599
62
9761596587
1057847755
16683756
230289842
1399235989
63
9782104875
1075055403
16299138
237386812
1397167324
64
9790122724
1091147494
16650471
229928585
1463076072
Edit:
2 things worth noting:
If I add padding to the inner loop so it won't fit in the uop cache I don't see this behavior until ~150 iterations.
Adding an lfence on with the padding in the outer loop changes N threshold to 31.
edit2: Benchmark which clears branch history The condition was reversed. It should be cmove not cmovne. With the fixed version any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^9). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that LSD is in some way defeating the Branch Predictor (what I believe to be the case).
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"testl $3, %k[outer_loop_cnt]\n"
"movl $1000, %k[inner_loop_cnt]\n"
THIS NEEDS TO BE CMOVE
"cmovne %k[inner_N], %k[inner_loop_cnt]\n"
// Extra align surrounding inner loop so that the entire thing
// doesn't execute out of LSD.
".p2align 10\n"
"2:\n"
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
:);
}
This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD1): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.
Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.
Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.
My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either of both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.
With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.
So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)
e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.
N
count
rate
variance
17
59,602
0.02% of all branches
+- 10.85%
20
192,307
0.05% of all branches
( +- 44.60% )
21
79,853
0.02% of all branches
( +- 14.16% )
30
136,308
0.02% of all branches
( +- 18.57% )
31..32
similar to N=34
( +- 2 or 3% )
33
22,415,089
3.71% of all branches
( +- 0.11% )
34
91,483
0.01% of all branches
( +- 2.36% )
35 (and 36..37 similar)
98,806
0.02% of all branches
( +- 2.75% )
38
33,517,630
4.87% of all branches
( +- 0.05% )
39
102,077
0.01% of all branches
( +- 1.96% )
40
33,458,267
4.64% of all branches
( +- 0.06% )
41
116,241
0.02% of all branches
( +- 6.86% )
42
22,376,562
2.96% of all branches
( +- 0.01% )
43
116,713
0.02% of all branches
( +- 5.25% )
44
174,834
0.02% of all branches
( +- 35.08% )
45
124,555
0.02% of all branches
( +- 5.36% )
46
135,838
0.02% of all branches
( +- 9.95% )
These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.
Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)
With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.
Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.
This is surprising, and different from a loop without as much padding.
Reason
The cause of the spike in Branch Misses is caused by the inner loop running out of the LSD.
The reason the LSD causes an extra branch miss for low iteration counts is that the "stop" condition on the LSD is a branch miss.
From Intel Optimization Manual Page 86.
The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only
a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern
repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number
of wasted slots during LSD.
Essentially what is happening is that when low enough iteration counts run out of the Uop Cache they are perfectly predictable. But when they run out of the LSD since the built in stop condition for the LSD is a branch mispredict, we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is don't let nested loops execute out of the LSD. Note that LSD only kicks in after ~[20, 25] iterations so an inner loop with < 20 iterations will run optimally.
Benchmark
All benchmarks are run on either Icelake
The new benchmark is essentially the same as the one in the origional post but at the advise of #PeterCordes I added a fixed byte size but varying number of nops in the inner loop. The idea is fixed length so that there is no change in how the branches may alias in the BHT (Branch History Table) but varying the number of nops to sometimes defeat the LSD.
I used 124 bytes of nop padding so that the nop padding + size of decl; jcc would be 128 bytes total.
The benchmark code is as follows:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask = (1023 << bht_shift);
#define NOP1 ".byte 0x90\n"
#define NOP2 ".byte 0x66,0x90\n"
#define NOP3 ".byte 0x0f,0x1f,0x00\n"
#define NOP4 ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5 ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6 ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7 ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8 ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9 ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10 ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11 ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"movl %k[inner_N], %k[inner_loop_cnt]\n"
".p2align 10\n"
"2:\n"
// This is defined in "inner_nops.h" with the necessary padding.
INNER_NOPS
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
[ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
:);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
assert(argc > 1);
uint64_t inner_N = atoi(argv[1]);
bench(inner_N);
return 0;
}
Tests
I tested nop count = [0, 39].
Note that nop count = 1 would not be only 1 nop in the inner loop but actually the following:
#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1
To reach the full 128 byte padding.
Results
At nop count <= 32 the inner loop is able to run out of the LSD and we consistently see elevant Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers outer loop iterations = 2^24
At nop count > 32 the loop has to many uops for the LSD and runs out of the Uop Cache. Here we do not see a consistent elevated Branch Misses until Iterations becomes to large for its BHT entry to store its entire history.
nop count > 32 (No LSD)
Once there are too many nops for the LSD the number of Branch Misses stays relatively low with a few consistent spikes until Iterations = 146 where Branch Misses spike to number of outer loop iterations (2 ^ 24 in this case) and remain constants. My guess is that is the upper bound on history the BHT is able to store.
Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:
All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.
The key point, however, is that for the most cast for Iterations = [20, 145] Branch Misses is relatively low indicating that the entire inner loop is being predicted correctly.
nop count <= 32 (LSD)
This data is a bit more noising bit all of the different nop count follow roughly the same trend of spiking initialing to within a factor of 2 of outer loop iterations Branch Misses at Iterations = [21, 25] (note this is 2-3 orders of magnitude) at the same time the lsd.oups spiked by 4-5 orders of magnitude.
There is also a trend between nop count and what iteration value Branch Misses stablize at outer loop iterations with a Pearson Correlation of 0.81. for nop count = [0, 32] the stablization point is in range iterations = [15, 34].
Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:
Generally, with some noise, all of the different nop count follow the same trend. As well they follow the same trend when compared with lsd.uops.
Below is a table with nop and the Pearson Correlation between Branch Misses and lsd.uop and idq.dsb_uops respectively.
nop
lsd
uop cache
0
0.961
-0.041
1
0.955
-0.081
2
0.919
-0.122
3
0.918
-0.299
4
0.947
-0.117
5
0.934
-0.298
6
0.894
-0.329
7
0.907
-0.308
8
0.91
-0.322
9
0.915
-0.316
10
0.877
-0.342
11
0.908
-0.28
12
0.874
-0.281
13
0.875
-0.523
14
0.87
-0.513
15
0.889
-0.522
16
0.858
-0.569
17
0.89
-0.507
18
0.858
-0.537
19
0.844
-0.565
20
0.816
-0.459
21
0.862
-0.537
22
0.848
-0.556
23
0.852
-0.552
24
0.85
-0.561
25
0.828
-0.573
26
0.857
-0.559
27
0.802
-0.372
28
0.762
-0.425
29
0.721
-0.112
30
0.736
-0.047
31
0.768
-0.174
32
0.847
-0.129
Which should generally indicate that there is a strong correlation between LSD and Branch Misses and no meaningful relationship between the Uop Cache and branch misses.
Overall
Generally I think it is clear that when the inner loop executing out of the LSD is what is causing the Branch Misses until Iterations becomes too large for the BHT entry's history. For the N = [33, 39] save the explained spikes we don't see elevated Branch Misses but we do for the N = [0, 32] case and the only difference I can tell is the LSD.

Grove I2C display not working on Raspberry Pi 3B

I only have my RPi 3B and a I²C display, I don't own any GrovePi hat and I want to show some text to the said display. Is there a way to make it work?
My display.
i2cdetect -y 1 shows this result, meaning the RPi is detecting the I²C display and it should work. But nothing seemed to be showing on the display, except for the full block on the first row.
I've tried literally everything on the internet, some library worked (as in throwing no errors) but the display still stays the same, full block on the first row.
I've installed python3, smbus, smbus2 and i2c-tools. I've changed the address to 3e.
My most recent *.py file. It does write 'LCD printing' yet nothing else works.
I don't find a way to change the contrast of the LCD (no potentiometer or whatsoever)
#!/usr/bin/python3
import smbus2 as smbus
import time
# Define some device parameters
I2C_ADDR = 0x3e # I2C device address, if any error, change this address to 0x3f
LCD_WIDTH = 16 # Maximum characters per line
# Define some device constants
LCD_CHR = 1 # Mode - Sending data
LCD_CMD = 0 # Mode - Sending command
LCD_LINE_1 = 0x80 # LCD RAM address for the 1st line
LCD_LINE_2 = 0xC0 # LCD RAM address for the 2nd line
LCD_LINE_3 = 0x94 # LCD RAM address for the 3rd line
LCD_LINE_4 = 0xD4 # LCD RAM address for the 4th line
LCD_BACKLIGHT = 0x08 # On
ENABLE = 0b00000100 # Enable bit
# Timing constants
E_PULSE = 0.0005
E_DELAY = 0.0005
# Open I2C interface
# bus = smbus.SMBus(0) # Rev 1 Pi uses 0
bus = smbus.SMBus(1) # Rev 2 Pi uses 1
def lcd_init():
# Initialise display
lcd_byte(0x33, LCD_CMD) # 110011 Initialise
lcd_byte(0x32, LCD_CMD) # 110010 Initialise
lcd_byte(0x06, LCD_CMD) # 000110 Cursor move direction
lcd_byte(0x0C, LCD_CMD) # 001100 Display On,Cursor Off, Blink Off
lcd_byte(0x28, LCD_CMD) # 101000 Data length, number of lines, font size
lcd_byte(0x01, LCD_CMD) # 000001 Clear display
time.sleep(E_DELAY)
def lcd_byte(bits, mode):
# Send byte to data pins
# bits = the data
# mode = 1 for data
# 0 for command
bits_high = mode | (bits & 0xF0) | LCD_BACKLIGHT
bits_low = mode | ((bits << 4) & 0xF0) | LCD_BACKLIGHT
# High bits
bus.write_byte(I2C_ADDR, bits_high)
lcd_toggle_enable(bits_high)
# Low bits
bus.write_byte(I2C_ADDR, bits_low)
lcd_toggle_enable(bits_low)
def lcd_toggle_enable(bits):
# Toggle enable
time.sleep(E_DELAY)
bus.write_byte(I2C_ADDR, (bits | ENABLE))
time.sleep(E_PULSE)
bus.write_byte(I2C_ADDR, (bits & ~ENABLE))
time.sleep(E_DELAY)
def lcd_string(message, line):
# Send string to display
message = message.ljust(LCD_WIDTH, " ")
lcd_byte(line, LCD_CMD)
for i in range(LCD_WIDTH):
lcd_byte(ord(message[i]), LCD_CHR)
if __name__ == '__main__':
lcd_init()
while True:
# Send some test
lcd_string("Hello ", LCD_LINE_1)
lcd_string(" World", LCD_LINE_2)
print('LCD printing!')
time.sleep(3)

How to run scapy in cooked mode?

I have a problem with capturing traffic.
My system is configured with two iterfaces - ethX and tunelX.
tunelX is a tunneling iterface.
The scapy and tcpdump are capture different count of packets.
The problem is the tcpdump runs, if the "any" iterface was set, in cooked mode but scapy don't.
cooked mode means that the SOCK_DGRAM will be created instead the SOCK_RAW. It is nessesary because some data in "tunneling packtes" in link-layer might be missing or contain not enoght data to determinate type of the packet.
When I ran strace with my scapy sctipt I saw this.
927698 socket(PF_PACKET, SOCK_RAW, 768) = 4
927689 recvfrom(3, "..some-data..."..., 65535, 0, {sa_family=AF_INET6, sin6_port=htons(53), inet_pton(AF_INET6, "...some address...", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 105
927689 recvfrom(3, "..some-data..."..., 32767, 0, {sa_family=AF_PACKET, proto=0x86dd, if4, pkttype=PACKET_HOST, addr(6)={1, 30d17e75727f}, [18]) = 246
927689 recvfrom(3, "..some-data..."..., 32767, 0, {sa_family=AF_PACKET, proto=0x86dd, if4, pkttype=PACKET_HOST, addr(6)={1, 30d17e75727f}, [18]) = 86
927689 recvfrom(3, "..some-data..."..., 32767, 0, {sa_family=AF_PACKET, proto=0x86dd, if4, pkttype=PACKET_HOST, addr(6)={1, 30d17e75727f}, [18]) = 86
927689 recvfrom(3, "..some-data..."..., 32767, 0, {sa_family=AF_PACKET, proto=0x86dd, if4, pkttype=PACKET_OUTGOING, addr(6)={1, 90e2ba55f6e8}, [18]) = 271
The only last packet was added into dump.
The question is:
Is my assumption right? :) How can I launch scapy in cooked mode? I couldn't find this in manual.
Thank you.

having trouble returning a best possible interface from a set of routing entries

so i am trying to return a best possible matching interface from routing entries. However, it is not exactly working the way i want it to. I got 5 out 6 values returned the way should be but I am pretty sure I have a million entries in a routing table my algorithm would not work.
I am using Binary Search to solve this problem. But, for example, the interface that i want to return has a ipaddress which is smaller than the ipaddress i am passing as an argument, then the binary search algorithm does not work. the structure looks like this:
struct routeEntry_t
{
uint32_t ipAddr;
uint32_t netMask;
int interface;
};
routeEntry_t routing_table[100000];
let's say the routing table looks like this:
{ 0x00000000, 0x00000000, 1 }, // [0]
{ 0x0A000000, 0xFF000000, 2 }, // [1]
{ 0x0A010000, 0xFFFF0000, 10 }, // [2]
{ 0x0D010100, 0xFFFFFF00, 4 }, // [3]
{ 0x0B100000, 0xFFFF0000, 3 }, // [4]
{ 0x0A010101, 0xFFFFFFFF, 5 }, // [5]
Example input/output:
Regular search
Input: 0x0D010101 Output: 4 (entry [3])
Input: 0x0B100505 Output: 3 (entry [4])
To find an arbitrary address, it should go to the default interface.
Input: 0x0F0F0F0F Output: 1 (entry [0])
To find an address that matches multiple entries, take the best-match.
Input: 0x0A010200 Output: 10 (entry [2])
Input: 0x0A050001 Output: 2 (entry [1])
Input: 0x0A010101 Output: 5 (entry [5])
But my output looks like 2, 3, 1, 10, 1, 5. I don't understand where I am messing things up. Could you please explain where I am doing wrong? any help would be great. Thanks in advance. However this is what my algorithm looks like (assuming the entries are sorted):
int interface(uint32_t ipAddr)
{
int start = 0;
int end = SIZE-1;
int mid = 0;
vector<int> matched_entries;
vector<int>::iterator it;
matched_entries.reserve(SIZE);
it = matched_entries.begin();
if (start > end)
return -1;
while (start <= end)
{
mid = start + ((end-start)/2);
if (routing_table[mid].ipAddr == ipAddr)
return routing_table[mid].interface;
else if (routing_table[mid].ipAddr > ipAddr)
{
uint32_t result = routing_table[mid].netMask & ipAddr;
if (result == routing_table[mid].ipAddr)
{
matched_entries.push_back(mid);
}
end = mid-1;
}
else
{
uint32_t result = routing_table[mid].netMask & ipAddr;
if (result == routing_table[mid].ipAddr)
{
matched_entries.insert(it,mid);
}
start = mid+1;
}
}
int matched_ip = matched_entries.back();
if (routing_table[matched_ip].netMask & ipAddr)
return routing_table[matched_ip].interface;
else
return routing_table[0].interface;
}
The "right" interface is the entry with the most specific netmask whose IP address is on the same subnet as your input.
Let's look at what netmasks are, and how they work, in more detail.
Notation
Although netmasks are usually written in dotted-decimal or hex notation, the binary representation of an IPv4 netmask is always 32 bits; that is, it's exactly the same length as an IP address. The netmask always starts with zero or more 1 bits and is padded with 0 bits to complete its 32-bit length. When a netmask is applied to an IP address, they're "lined up" bit by bit. The bits in the IP address that correspond to the 1 bits in the netmask determine the network number of the IP address; those corresponding to the 0 bits in the netmask determine the device number.
Purpose
Netmasks are used to divide an address space into smaller subnets. Devices on the same subnet can communicate with each other directly using the TCP/IP protocol stack. Devices on different subnets must use one or more routers to forward data between them. Because they isolate subnets from each other, netmasks are a natural way to create logical groupings of devices. For example, each location or department within a company may have its own subnet, or each type of device (printers, PCs, etc.) may have its own subnet.
Example netmasks:
255.255.255.128 → FF FF FF 10 → 1111 1111 1111 1111 1111 1111 1000 0000
This netmask specifies that the first 25 bits of an IP address determine the network number; the final 7 bits determine the device number. This means there can be 225 different subnets, each with 27 = 128 devices.*
255.255.255.0 → FF FF FF 00 → 1111 1111 1111 1111 1111 1111 0000 0000
This netmask specifies an address space with 224 subnets, each with 28 = 256 individual addresses. This is a very common configuration—so common that it's known simply as a "Class C" network.
255.255.192.0 → FF FF FC 00 → 1111 1111 1111 1111 1111 1100 0000 0000
This netmask specifies 222 subnets, each with 210 = 1024 addresses. It might be used inside a large corporation, where each department has several hundred devices that should be logically grouped together.
An invalid netmask (note the internal zeroes):
255.128.255.0 → FF 80 FF 00 → 1111 1111 1000 0000 1111 1111 0000 0000
Calculations
Here are a few examples that show how a netmask determines the network number and the device number of an IP address.
IP Address: 192.168.0.1 → C0 A8 00 01
Netmask: 255.255.255.0 → FF FF FF 00
This device is on the subnet 192.168.0.0. It can communicate directly with other devices whose IP addresses are of the form 192.168.0.x
IP Address: 192.168.0.1 → C0 A8 00 01
IP Address: 192.168.0.130 → C0 A8 00 82
Netmask: 255.255.255.128 → FF FF FF 80
These two devices are on different subnets and cannot communicate with each other without a router.
IP Address: 10.10.195.27 → 0A 0A C3 1B
Netmask: 255.255.0.0 → FF FF 00 00
This is an address on a "Class B" network that can communicate with the 216 addresses on the 10.10.0.0 network.
You can see that the more 1 bits at the beginning of a netmask, the more specific it is. That is, more 1 bits create a "smaller" subnet that consists of fewer devices.
Putting it all together
A routing table, like yours, contains triplets of netmasks, IP addresses, and interfaces. (It may also contain a "cost' metric, which indicates which of several interfaces is the "cheapest" to use, if they are both capable of routing data to a particular destination. For example, one may use an expensive dedicated line.)
In order to route a packet, the router finds the interface with the most specific match for the packet's destination. An entry with an address addr and a netmask mask matches a destination IP address dest if (addr & netmask) == (dest & netmask), where & indicates a bitwise AND operation. In English, we want the smallest subnet that is common to both addresses.
Why? Suppose you and a colleague are in a hotel that's part of a huge chain with both a corporate wired network and a wireless network. You've also connected to your company's VPN. Your routing table might look something like this:
Destination Netmask Interface Notes
----------- -------- --------- -----
Company email FFFFFF00 VPN Route ALL company traffic thru VPN
Wired network FFFF0000 Wired Traffic to other hotel addresses worldwide
Default 00000000 Wireless All other traffic
The most specific rule will route your company email safely through the VPN, even if the address happens to match the wired network also. All traffic to other addresses within the hotel chain will be routed through the wired network. And everything else will be sent through the wireless network.
* Actually, in every subnet, 2 of the addresses—the highest and the lowest—are reserved. The all-ones address is the broadcast address: this address sends data to every device on the subnet. And the all-zeroes address is used by a device to refer to itself when it doesn't yet have it's own IP address. I've ignored those for simplicity.
So the algorithm would be something like this:
initialize:
Sort routing table by netmask from most-specific to least specific.
Within each netmask, sort by IP address.
search:
foreach netmask {
Search IP addresses for (input & netmask) == (IP address & netmask)
Return corresponding interface if found
}
Return default interface
Ok so I this is what my structure and algorithm looks like now. It works and gives me the results that I want, however I still don't know how to sort ip addresses within netmasks. I used STL sort to sort the netmasks.
struct routeEntry_t
{
uint32_t ipAddr;
uint32_t netMask;
int interface;
bool operator<(const routeEntry_t& lhs) const
{
return lhs.netMask < netMask;
}
};
const int SIZE = 6;
routeEntry_t routing_table[SIZE];
void sorting()
{
//using STL sort from lib: algorithm
sort(routing_table, routing_table+SIZE);
}
int interface(uint32_t ipAddr)
{
for (int i = 0; i < SIZE; ++i)
{
if ((routing_table[i].ipAddr & routing_table[i].netMask) == (ipAddr & routing_table[i].netMask))
return routing_table[i].interface;
}
return routing_table[SIZE-1].interface;
}