packed structure size in C, is this correct? - packed

I found it in some exsiting code, it looks some problems, but the code works fine, can you help if this piece of code has any tricking things in.
why ignore two unsigned when calculate the size of the structure?
tmsg_sz = sizeof(plfm_xml_header_t) + sizeof(oid_t) + sizeof(char*)
+ sizeof(unsigned) + sizeof(snmp_varbind_t)*5 ;
tmsg = (snmp_trap_t*) malloc(tmsg_sz);
if (!tmsg) {
PRINTF("malloc failed \n");
free(trap_msg);
return -1;
}
memset (tmsg, 0, tmsg_sz);
tmsg->hdr.type = PLFM_SNMPTRAP_MSG;
copy_oid_oidt(clog_msg_gen_notif_oid, OID_LENGTH(clog_msg_gen_notif_oid), &tmsg->oid);
tmsg->trap_type = SNMP_TRAP_ENTERPRISESPECIFIC;
tmsg->trap_specific = 1;
tmsg->trapmsg = strdup("Trap Message");
tmsg->numofvar = 5;
build_snmp_varbind(&(tmsg->vars[0]), facility, STR_DATA_TYPE, sizeof(facility)+1, clog_hist_facility_oid, 14);
build_snmp_varbind(&(tmsg->vars[1]), &sev, U32_DATA_TYPE, sizeof(sev),clog_hist_severity_oid, 14);
build_snmp_varbind(&(tmsg->vars[2]), name, STR_DATA_TYPE, sizeof(name)+1, clog_hist_msgname_oid, 14);
build_snmp_varbind(&(tmsg->vars[3]), trap_msg, STR_DATA_TYPE, strlen(trap_msg)+1,clog_hist_msgtext_oid, 14);
// get system uptime
long uptime = get_uptime();
build_snmp_varbind(&(tmsg->vars[4]), (long*)&uptime, TMR_DATA_TYPE, sizeof(uptime),clog_hist_timestamp_oid, 14);
typedef struct snmp_trap_s {
plfm_xml_header_t hdr;
oid_t oid; /* trap oid */
unsigned trap_type;
unsigned trap_specific;
char *trapmsg; /* text message for this trap */
unsigned numofvar;
snmp_varbind_t vars[0];
} __attribute__((__packed__)) snmp_trap_t;

Compilers try hard to put multibyte data aligned in various ways. For example, an int variable, in an architecture where sizeof int == 4, may need to be placed in a location divisible by 4. This may be a hard requirement, or this may just make the system more efficient; it depends on the computer. So, consider
typedef struct combo {
char c;
int i;
} combo;
Depending on the architecture, sizeof combo may be 5, 6, or most often 8. Swap the two members, and the size should be 5.
typedef struct combo2 {
int i;
char c;
} combo2;
However, an array of combo2s may have a size you do not expect:
combo2 cb[2];
The size of cb could very well be 16, as 3 bytes of wasted space follow combo2[0] and combo2[1]. This lets combo2[1].i start at a location divisible by 4.
A recommendation is to order the members of a structure by size; the 8-byte members should precede the 4-byte members, then the 2-byte members, then the 1-byte members. Of course, you have to be aware of typical sizes, and you can't be working on an oddball architecture where characters are not packed into larger words. Cray? cough-cough.

Related

Unaligned access performance on Intel x86 vs AMD x86 CPUs

I have implemented a simple linear probing hash map with an array of structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers, but the entry only takes up 8 bools. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25 % and 50 %, for different successful query rates on the x axis (0,25,50,75,100): As you can see, on Intel, the grey and blue (circle and square) lines almost overlap, the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we have more throughput.
In order to investigate this, I tried to built a smaller microbenchmark. In this microbenchmark, we perform a similar benchmark, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
uint64_t key;
uint64_t value;
bool is_valid;
PaddedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
};
struct PackedStruct {
uint64_t key;
uint64_t value;
uint8_t is_valid;
PackedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
} __attribute__((__packed__));
int main() {
const uint64_t size = 134217728;
uint16_t repetitions = 0;
uint16_t advancement = 0;
std::cin >> repetitions;
std::cout << "Got " << repetitions << std::endl;
std::cin >> advancement;
std::cout << "Got " << advancement << std::endl;
std::cout << "Initializing." << std::endl;
std::vector<PaddedStruct> padded(size);
std::vector<PackedStruct> unaligned(size);
std::vector<uint64_t> queries(size);
// Initialize the structs with random values + prefault
std::random_device rd;
std::mt19937 gen{rd()};
std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
for (uint64_t i = 0; i < padded.size(); ++i) {
padded[i].key = dist(gen);
padded[i].value = dist(gen);
padded[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
unaligned[i].key = padded[i].key;
unaligned[i].value = padded[i].value;
unaligned[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
queries[i] = dist2(gen);
}
std::cout << "Running benchmark." << std::endl;
ClobberMemory();
auto start_padded = std::chrono::high_resolution_clock::now();
PaddedStruct* padded_ptr = nullptr;
uint64_t sum = 0;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
padded_ptr = &padded[query + i];
if (padded_ptr->is_valid) [[likely]] {
sum += padded_ptr->value;
}
}
doNotOptimize(sum);
}
}
ClobberMemory();
auto end_padded = std::chrono::high_resolution_clock::now();
uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl; // print sum to avoid that it gets optimized out
ClobberMemory();
auto start_unaligned = std::chrono::high_resolution_clock::now();
uint64_t sum2 = 0;
PackedStruct* packed_ptr = nullptr;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
packed_ptr = &unaligned[query + i];
if (packed_ptr->is_valid) [[likely]] {
sum2 += packed_ptr->value;
}
}
doNotOptimize(sum2);
}
}
ClobberMemory();
auto end_unaligned = std::chrono::high_resolution_clock::now();
uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press newline) and then enter 5 and press enter/newline. I updated the source code to (a) avoid loop unrolling by the compiler because repetition/advancement were hardcoded and (b) switch to pointers into that vector as it more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while in this microbenchmark, Intel still benefits a little from the unaligned access, for the AMD CPU, both the absolute and relative improvement is higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds mail, I could not find any other documentation of cache fetch widths in processors and especially not for my concrete two CPUs to figure out if that might somehow have to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structure.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/

EditorGuiLayout.MaskField issue with large enums

I'm working on an input system that would allow the user to translate input mappings between different input devices and operating systems and potentially define their own.
I'm trying to create a MaskField for an editor window where the user can select from a list of RuntimePlatforms, but selecting individual values results in multiple values being selected.
Mainly for debugging I set it up to generate an equivalent enum RuntimePlatformFlags that it uses instead of RuntimePlatform:
[System.Flags]
public enum RuntimePlatformFlags: long
{
OSXEditor=(0<<0),
OSXPlayer=(0<<1),
WindowsPlayer=(0<<2),
OSXWebPlayer=(0<<3),
OSXDashboardPlayer=(0<<4),
WindowsWebPlayer=(0<<5),
WindowsEditor=(0<<6),
IPhonePlayer=(0<<7),
PS3=(0<<8),
XBOX360=(0<<9),
Android=(0<<10),
NaCl=(0<<11),
LinuxPlayer=(0<<12),
FlashPlayer=(0<<13),
LinuxEditor=(0<<14),
WebGLPlayer=(0<<15),
WSAPlayerX86=(0<<16),
MetroPlayerX86=(0<<17),
MetroPlayerX64=(0<<18),
WSAPlayerX64=(0<<19),
MetroPlayerARM=(0<<20),
WSAPlayerARM=(0<<21),
WP8Player=(0<<22),
BB10Player=(0<<23),
BlackBerryPlayer=(0<<24),
TizenPlayer=(0<<25),
PSP2=(0<<26),
PS4=(0<<27),
PSM=(0<<28),
XboxOne=(0<<29),
SamsungTVPlayer=(0<<30),
WiiU=(0<<31),
tvOS=(0<<32),
Switch=(0<<33),
Lumin=(0<<34),
BJM=(0<<35),
}
In this linked screenshot, only the first 4 options were selected. The integer next to "Platforms: " is the mask itself.
I'm not a bitwise wizard by a large margin, but my assumption is that this occurs because EditorGUILayout.MaskField returns a 32bit int value, and there are over 32 enum options. Are there any workarounds for this or is something else causing the issue?
First thing I've noticed is that all values inside that Enum is the same because you are shifting 0 bits to left. You can observe this by logging your values with this script.
// Shifts 0 bits to the left, printing "0" 36 times.
for(int i = 0; i < 36; i++){
Debug.Log(System.Convert.ToString((0 << i), 2));
}
// Shifts 1 bits to the left, printing values up to 2^35.
for(int i = 0; i < 36; i++){
Debug.Log(System.Convert.ToString((1 << i), 2));
}
The reason inheriting from long does not work alone, is because of bit shifting. Check out this example I found about the issue:
UInt32 x = ....;
UInt32 y = ....;
UInt64 result = (x << 32) + y;
The programmer intended to form a 64-bit value from two 32-bit ones by shifting 'x' by 32 bits and adding the most significant and the least significant parts. However, as 'x' is a 32-bit value at the moment when the shift operation is performed, shifting by 32 bits will be equivalent to shifting by 0 bits, which will lead to an incorrect result.
So you should also cast the shifting bits. Like this:
public enum RuntimePlatformFlags : long {
OSXEditor = (1 << 0),
OSXPlayer = (1 << 1),
WindowsPlayer = (1 << 2),
OSXWebPlayer = (1 << 3),
// With literals.
tvOS = (1L << 32),
Switch = (1L << 33),
// Or with casts.
Lumin = ((long)1 << 34),
BJM = ((long)1 << 35),
}

How to pack/unpack a byte in q/kdb

What I'm trying to do here is pack a byte like I could in c# like this:
string symbol = "T" + "\0";
byte orderTypeEnum = (byte)OrderType.Limit;
int size = -10;
byte[] packed = new byte[symbol.Length + sizeof(byte) + sizeof(int)]; // byte = 1, int = 4
Encoding.UTF8.GetBytes(symbol, 0, symbol.Length, packed, 0); // add the symbol
packed[symbol.Length] = orderTypeEnum; // add order type
Array.ConstrainedCopy(BitConverter.GetBytes(size), 0, packed, symbol.Length + 1, sizeof(int)); // add size
client.Send(packed);
Is there any way to accomplish this in q?
As for the Unpacking in C# I can easily do this:
byte[] fillData = client.Receive();
long ticks = BitConverter.ToInt64(fillData, 0);
int fillSize = BitConverter.ToInt32(fillData, 8);
double fillPrice = BitConverter.ToDouble(fillData, 12);
new
{
Timestamp = ticks,
Size = fillSize,
Price = fillPrice
}.Dump("Received response");
Thanks!
One way to do it is
symbol:"T\000"
orderTypeEnum: 123 / (byte)OrderType.Limit
size: -10i;
packed: "x"$symbol,("c"$orderTypeEnum),reverse 0x0 vs size / *
UPDATE:
To do the reverse you can use 1: function:
(8 4 8; "jif")1:0x0000000000000400000008003ff3be76c8b43958 / server data is big-endian
("jif"; 8 4 8)1:0x0000000000000400000008003ff3be76c8b43958 / server data is little-endian
/ ticks=1024j, fillSize=2048i, fillPrice=1.234
*) When using BitConverter.GetBytes() you should also check the value of BitConverter.IsLittleEndian to make sure you send bytes over the wire in a proper order. Contrary to popular belief .NET is not always little-endian. Hovewer, an internal representation in kdb+ (a value returned by 0x0 vs ...) is always big-endian. Depending on your needs you may or may not want to use reverse above.

does kyoto cabinet support key range search?

Does Kyoto Cabinet support searching for a range of keys?
If so, what types of keys do support range search?
Can I do range search on a long (64bit) key?
Thanks
RG
it supports key prefix query, however, the efficiency of prefix query depends on what internal storage structure is. If you are using hashdb, it may be not a good idea, as keys & values are scattered around in the underline file.
Yes, for integers.
B+ tree database supports sequential access in order of the keys, which realizes forward matching search for strings and range search for integers - from docs
Yes you can, you just need a forward jump.
An example using C. Stores 5 records with 64 bits keys (from 1 to 5) and then apply a filter (from 2 to 4):
#include <kclangc.h>
#include <inttypes.h>
int main(void)
{
KCDB *db;
KCCUR *cur;
char *kbuf;
size_t ksiz, vsiz;
const char *cvbuf;
int64_t i, val, min, max;
int64_t keys[] = {1, 2, 3, 4, 5};
const char *values[] = {"one", "two", "three", "four", "five"};
char i64[8]; /* A buffer to store byte sequences */
/* create the database object */
db = kcdbnew();
/* open the database */
if (!kcdbopen(db, "db64.kct", KCOWRITER | KCOCREATE)) {
fprintf(stderr, "open error: %s\n", kcecodename(kcdbecode(db)));
}
/* store records */
for (i = 0; i < 5; i++) {
memcpy(i64, &keys[i], 8);
if (!kcdbset(db, i64, 8, values[i], strlen(values[i]))) {
fprintf(stderr, "set error: %s\n", kcecodename(kcdbecode(db)));
exit(EXIT_FAILURE);
}
}
/* traverse records */
min = 2;
max = 4;
printf("Range from %" PRId64 " to %" PRId64 "\n", min, max);
memcpy(i64, &min, 8);
cur = kcdbcursor(db);
kccurjumpkey(cur, i64, 8);
while ((kbuf = kccurget(cur, &ksiz, &cvbuf, &vsiz, 1)) != NULL) {
memcpy(&val, kbuf, 8);
if (val > max) {
break;
}
printf("Found %s\n", cvbuf);
kcfree(kbuf);
}
kccurdel(cur);
/* close the database */
if (!kcdbclose(db)) {
fprintf(stderr, "close error: %s\n", kcecodename(kcdbecode(db)));
}
/* delete the database object */
kcdbdel(db);
return 0;
}
LevelDB supports binary keys and ranged queries.
Edit: I forgot to mention that in order for the range query to work, the binary value needs to be packed in a comparable way. For your long example, you need to make sure it's big-endian encoded.

Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays and have do the following bitwise AND operations among the elements of them.
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y++)
for(int x=0; x<a_lenx; x++)
{
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap ;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iphone devices.
How could I accelerate this job on iPhone ?
I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...
Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap = 0;
uint32_t* a_int = (uint32_t*)a;
uint32_t* b_int = (uint32_t*)b;
a_leny = a_leny / 4;
a_lenx = a_lenx / 4;
a_pitch = a_pitch / 4;
b_pitch = b_pitch / 4;
for(int y=0; y<a_leny_int; y++)
for(int x=0; x<a_lenx_int; x++)
{
uint32_t aVal = a_int[x + y * a_pitch_int];
uint32_t bVal = b_int[x+y*b_pitch_int];
if (aVal & 0xFF) & (bVal & 0xFF)
overlap++;
if ((aVal >> 8) & 0xFF) & ((bVal >> 8) & 0xFF)
overlap++;
if ((aVal >> 16) & 0xFF) & ((bVal >> 16) & 0xFF)
overlap++;
if ((aVal >> 24) & 0xFF) & ((bVal >> 24) & 0xFF)
overlap++;
}
return overlap ;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y+= 10)
for(int x=0; x<a_lenx; x+= 10)
{
//we compare 1% of all the pixels, and use that as the result
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.
Your code is Rambo for the CPU - its worst nightmare :
byte access. Like aroth mentioned, ARM is VERY slow reading bytes from memory
random access. Two absolutely unnecessary multiply/add operations in addition to the already steep performance penalty by its nature.
Simply put, everything is wrong that can be wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny)
{
unsigned int overlap =0;
unsigned int a_gap = (a_pitch - a_lenx)>>2;
unsigned int b_gap = (b_pitch - a_lenx)>>2;
unsigned int aval, bval, xcount;
do
{
xcount = (a_lenx>>2);
do
{
aval = *a++;
// ldr aval, [a], #4
bval = *b++;
// ldr bavl, [b], #4
aval &= bval;
// and aval, aval, bval
if (aval & 0x000000ff) overlap += 1;
// tst aval, #0x000000ff
// addne overlap, overlap, #1
if (aval & 0x0000ff00) overlap += 1;
// tst aval, #0x0000ff00
// addne overlap, overlap, #1
if (aval & 0x00ff0000) overlap += 1;
// tst aval, #0x00ff0000
// addne overlap, overlap, #1
if (aval & 0xff000000) overlap += 1;
// tst aval, #0xff000000
// addne overlap, overlap, #1
} while (--xcount);
a += a_gap;
b += b_gap;
} while (--a_leny);
return overlap;
}
First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.