C++ from unsigned char* to stringstream: Segmentation fault (core dumped) error [closed] - stringstream

I am writing C++ code that converts an unsigned char* array into a string using a stringstream.
The code snippet:
unsigned char * arr;
do{
fill(*arr);
//if I print the array here, the print operation works fine
stringstream s((const char*)arr); //I also tried other casts without success
cout<<s.str()<<endl;
//condition...
} while(condition);
The do-while is there because I have to repeat this. The problem is that I get a Segmentation fault (core dumped) error on this line: stringstream s((const char*)arr);
Here is more detailed code. The fill is a call to libusb_interrupt_transfer:
int len = 64;
int transferred;
unsigned char *pkt = new unsigned char[len];
unsigned char * arr;
int arrLen;
do {
libusb_interrupt_transfer(handle, (EP_IN | LIBUSB_ENDPOINT_IN), pkt, len, &transferred, 1000);
arrLen = pkt[6];
arr = new unsigned char[arrLen];
for (int i = 0; i < arrLen; i++) {
arr[i] = pkt[i+7];
}
stringstream s;
s << (char*) arr;
}

I would go with:
stringstream s;
s << (char*) arr;
EDIT:
OK, now that you've shown us your fill(), this is what I think causes the problem:
According to this link: http://libusb.sourceforge.net/api-1.0/group__syncio.html#gac412bda21b7ecf57e4c76877d78e6486 you can NEVER assume that the call actually received len bytes. You should use transferred to check how many bytes were transferred.
The immediate problem is the line
arrLen = pkt[6];
because pkt[6] may never have been written. In that case it holds an indeterminate value between 0 and 255, and any value above 64 - 7 = 57 makes the copy loop read past the end of pkt, which can cause an access violation.
So I would suggest something like this:
libusb_interrupt_transfer(handle, (EP_IN | LIBUSB_ENDPOINT_IN), pkt, len, &transferred, 1000);
// ACHTUNG! ATTENTION!
if (transferred < 7) // skip short packets; and you'll need even more
    continue;
arrLen = pkt[6];
arr = new unsigned char[arrLen];
for (int i = 0; i < arrLen; i++) {
arr[i] = pkt[i+7];
}
stringstream s;
s << (char*) arr;
Also, it would help to dump pkt and transferred at the moment of the error. That would make the problem easier to analyze.
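The checks suggested above can be collected into one helper. The following is a minimal sketch, not the asker's actual code: payload_to_string is a hypothetical name, and the packet layout (length byte at index 6, payload starting at index 7) is taken from the question. It also builds the string from an explicit length, because the copied bytes are not null-terminated and s << (char*) arr keeps reading until it happens to find a zero byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: extracts the payload of one packet into `out`.
// Assumed layout (from the question): pkt[6] holds the payload length and
// the payload starts at pkt[7]. `transferred` is the byte count reported by
// libusb_interrupt_transfer. Returns false if the packet is too short.
bool payload_to_string(const unsigned char* pkt, int transferred, std::string& out) {
    const int kHeaderSize = 7;
    if (pkt == nullptr || transferred < kHeaderSize) {
        return false;  // the length byte itself was never received
    }
    const int payload_len = pkt[6];
    if (payload_len > transferred - kHeaderSize) {
        return false;  // packet claims more bytes than were received
    }
    // Construct from pointer + length: no terminating '\0' is required.
    out.assign(reinterpret_cast<const char*>(pkt + kHeaderSize),
               static_cast<std::size_t>(payload_len));
    return true;
}
```

In the loop from the question, the body after the transfer could then shrink to a call to payload_to_string(pkt, transferred, text), with continue on a false result and cout << text << endl otherwise.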

Related

Unaligned access performance on Intel x86 vs AMD x86 CPUs

I have implemented a simple linear probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers, but the flag only takes up 8 bits. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25 % and 50 %, for different successful query rates on the x axis (0,25,50,75,100): As you can see, on Intel, the grey and blue (circle and square) lines almost overlap, the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we have more throughput.
In order to investigate this, I tried to build a smaller microbenchmark. It performs a similar workload, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little from there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
uint64_t key;
uint64_t value;
bool is_valid;
PaddedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
};
struct PackedStruct {
uint64_t key;
uint64_t value;
uint8_t is_valid;
PackedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
} __attribute__((__packed__));
int main() {
const uint64_t size = 134217728;
uint16_t repetitions = 0;
uint16_t advancement = 0;
std::cin >> repetitions;
std::cout << "Got " << repetitions << std::endl;
std::cin >> advancement;
std::cout << "Got " << advancement << std::endl;
std::cout << "Initializing." << std::endl;
std::vector<PaddedStruct> padded(size);
std::vector<PackedStruct> unaligned(size);
std::vector<uint64_t> queries(size);
// Initialize the structs with random values + prefault
std::random_device rd;
std::mt19937 gen{rd()};
std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
for (uint64_t i = 0; i < padded.size(); ++i) {
padded[i].key = dist(gen);
padded[i].value = dist(gen);
padded[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
unaligned[i].key = padded[i].key;
unaligned[i].value = padded[i].value;
unaligned[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
queries[i] = dist2(gen);
}
std::cout << "Running benchmark." << std::endl;
ClobberMemory();
auto start_padded = std::chrono::high_resolution_clock::now();
PaddedStruct* padded_ptr = nullptr;
uint64_t sum = 0;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
padded_ptr = &padded[query + i];
if (padded_ptr->is_valid) [[likely]] {
sum += padded_ptr->value;
}
}
doNotOptimize(sum);
}
}
ClobberMemory();
auto end_padded = std::chrono::high_resolution_clock::now();
uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl; // print sum to avoid that it gets optimized out
ClobberMemory();
auto start_unaligned = std::chrono::high_resolution_clock::now();
uint64_t sum2 = 0;
PackedStruct* packed_ptr = nullptr;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
packed_ptr = &unaligned[query + i];
if (packed_ptr->is_valid) [[likely]] {
sum2 += packed_ptr->value;
}
}
doNotOptimize(sum2);
}
}
ClobberMemory();
auto end_unaligned = std::chrono::high_resolution_clock::now();
uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press newline) and then enter 5 and press enter/newline. I updated the source code to (a) avoid loop unrolling by the compiler because repetition/advancement were hardcoded and (b) switch to pointers into that vector as it more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while Intel still benefits a little from the unaligned access in this microbenchmark, on the AMD CPU both the absolute and the relative improvement are higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' post, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might have something to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
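The claim can be checked with a little arithmetic. The sketch below is an illustration, not code from the linked post. It assumes 64-byte cache lines, an array that starts on a cache-line boundary, and reads that cover whole structs (the benchmark in the question only reads value and is_valid). The names lines_touched and average_lines are hypothetical.

```cpp
#include <cstddef>

const std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Number of distinct cache lines covered by `bytes` bytes starting at byte
// offset `offset`, measured from a cache-line-aligned base address.
std::size_t lines_touched(std::size_t offset, std::size_t bytes) {
    return (offset + bytes - 1) / kCacheLine - offset / kCacheLine + 1;
}

// Average number of cache lines touched when reading `count` consecutive
// elements of `elem_size` bytes each, averaged over kCacheLine consecutive
// start indices (the alignment pattern repeats within that many elements).
double average_lines(std::size_t elem_size, std::size_t count) {
    std::size_t total = 0;
    for (std::size_t start = 0; start < kCacheLine; ++start) {
        total += lines_touched(start * elem_size, count * elem_size);
    }
    return static_cast<double>(total) / kCacheLine;
}
```

Under these assumptions, average_lines(24, 1) and average_lines(17, 1) both come out to 1.25, so a single random lookup touches the same number of lines on average with either layout. For five consecutive elements (advancement = 5 in the question) the values are 2.75 for the padded layout and 2.3125 for the packed one.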

Getting "*** buffer overflow detected ***: terminated Aborted" after calling recvfrom( ) function for UDP communication

I am facing an error whenever I call the recvfrom() function for UDP on the server side:
*** buffer overflow detected ***: terminated Aborted
What does this error mean? I am not able to understand it.
unsigned int len;
int rv,i;
int tmp;
//char msg[200],command=0;
unsigned short *Fptr;
float Float_Temp;
//make socket blocking
FD_ZERO(&readnbs);
FD_SET(g_iUDP_datalogger_soc, &readnbs);
g_UDP_Blocktimervalue.tv_sec = 0;
g_UDP_Blocktimervalue.tv_usec = UDP_REC_BLOCKTIME;
rv = select(g_iUDP_datalogger_soc + 1, &readnbs, NULL, NULL, &g_UDP_Blocktimervalue);
len = sizeof(g_UDP_ClientAddr);
if (rv == 1) {
printf(" \n\n\n\n\n rv=%d\n\n\n\n",rv);
tmp = recvfrom(g_iUDP_datalogger_soc, &tmp, SIZE_UDP_MSG, 0,(struct sockaddr *) &g_UDP_ClientAddr, &len);
tmp = recvfrom(g_iUDP_datalogger_soc, [YOUR BUFFER ?] , SIZE_UDP_MSG, 0,(struct sockaddr *)...
You're trying to receive the data into a single int object (tmp), which you also use for the return value of recvfrom(), and you never check that return value, neither for an error nor for the number of bytes received.
You are asking recvfrom() to read SIZE_UDP_MSG number of bytes into tmp, which is an int. You did not show what SIZE_UDP_MSG is defined as, but if SIZE_UDP_MSG > sizeof(int), you are going to be writing bytes past the bounds of tmp into surrounding memory, corrupting the memory. That is a buffer overflow.
Perhaps you meant to receive the bytes into your (commented out) msg buffer instead?
char msg[SIZE_UDP_MSG];
tmp = recvfrom(..., msg, SIZE_UDP_MSG, ...);
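Put together, a receive call with a properly sized buffer and a checked return value could look like the sketch below. It is an illustration under assumptions: receive_datagram is a hypothetical helper, and SIZE_UDP_MSG is set to 200 only because the question does not show its real value. On glibc systems the "buffer overflow detected" message typically comes from the fortified recvfrom wrapper, which aborts when the requested length is larger than the destination object.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>
#include <cstdio>

// Assumed value for illustration; the question does not show the real one.
const std::size_t SIZE_UDP_MSG = 200;

// Hypothetical helper: receives one datagram into `msg` (capacity `cap`)
// and returns the number of bytes received, or -1 on error.
ssize_t receive_datagram(int sock, char* msg, std::size_t cap,
                         sockaddr* from, socklen_t* from_len) {
    // The length passed to recvfrom must never exceed the buffer's capacity.
    const ssize_t n = recvfrom(sock, msg, cap, 0, from, from_len);
    if (n < 0) {
        std::perror("recvfrom");
    }
    return n;
}
```

A caller would declare char msg[SIZE_UDP_MSG]; and pass msg together with sizeof(msg), keeping the byte count in a separate variable instead of reusing the receive buffer for it.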

scanf hangs when copy and paste many line of inputs at a time

This may be a simple question, but I'm new to C and couldn't find an answer. My program is simple: it reads 21 lines of string input in a for loop and prints them afterwards. The number of lines could be smaller or greater.
int t = 21;
char *lines[t];
for (i = 0; i < t; i++) {
lines[i] = malloc(100);
scanf("%s", lines[i]);
}
for (int i = 0; i < t; i++) {
printf("%s\n", lines[i]);
free(lines[i]);
}
...
When I copy and paste all the input at once, my program hangs: no error, no crash. It works if there are 20 lines or fewer. And if I enter the input by hand, line by line, it works normally regardless of the number of lines.
I'm using XCode 5 in Mac OS X 10.10, but I don't think this is the issue.
Update:
I tried to debug it when the program hangs, it stopped when i == 20 at the line below:
0x7fff9209430a: jae 0x7fff92094314 ; __read_nocancel + 20
The issue may be related to scanf, but I'm confused: why the number 20? Maybe I'm using it the wrong way. Many thanks for any help.
Update:
I compiled the program with gcc on the command line and it works just fine. So it turns out to be an Xcode issue after all: somehow it prevents the user from pasting multiple lines of input.
Use fgets when you want to read a string in C; see the documentation for that function:
[FGETS Function]
So you should use it like this:
fgets(lines[i], 100, stdin);
This reads the string from the user's input. You can also have a look at these two posts about reading strings in C:
Post1
Post2
I hope this helps with your problem.
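For comparison, here is a minimal sketch of the same fgets-based approach written as a C++ helper. The name read_lines and the 100-byte buffer are assumptions chosen to match the sizes in the question. Unlike scanf("%s"), fgets keeps spaces within a line and never writes more than the given number of bytes.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: reads up to max_lines lines from `in` with fgets.
// Lines longer than the buffer are split across several entries.
std::vector<std::string> read_lines(std::FILE* in, std::size_t max_lines) {
    std::vector<std::string> lines;
    char buf[100];
    while (lines.size() < max_lines &&
           std::fgets(buf, sizeof buf, in) != nullptr) {
        buf[std::strcspn(buf, "\n")] = '\0';  // drop the trailing newline, if any
        lines.push_back(buf);
    }
    return lines;
}
```

Calling read_lines(stdin, 21) would collect the 21 lines from the question, and it stops early if the input ends before that.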
Edit:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int t = 21;
    int i;
    char *lines[t];
    for (i = 0; i < t; i++) {
        lines[i] = malloc(100);
        // the size passed to fgets must not exceed the allocation
        fgets(lines[i], 100, stdin);
    }
    for (i = 0; i < t; i++) {
        printf("String %d : %s\n", i, lines[i]);
        free(lines[i]);
    }
    return 0;
}
This code gives:
As you can see, I got the 21 strings that I entered (from 0 to 20, which is why it stops when i == 20).
I tried it with your input; here are the results:
I wrote the same code and ran it. It works.
Your input might contain more than 99 characters (including the line feed) per line...
Or it might contain spaces and tabs.
scanf(3)
When one or more whitespace characters (space, horizontal tab \t, vertical tab \v, form feed \f, carriage return \r, newline or linefeed \n) occur in the format string, input data up to the first non-whitespace character is read, or until no more data remains. If no whitespace characters are found in the input data, the scanning is complete, and the function returns.
To avoid this, try
scanf("%99[^\n]%*c", lines[i]);
The whole code is:
#include <stdio.h>
int main() {
    const int T = 5;
    char lines[T][100]; // length: 99 (null terminated string)
    // if the length per line is fixed, you don't need to use malloc.
    printf("input -------\n");
    for (int i = 0; i < T; i++) {
        // the width 99 keeps scanf from writing past the 100-byte buffer
        scanf("%99[^\n]%*c", lines[i]);
    }
    printf("result -------\n");
    for (int i = 0; i < T; i++) {
        printf("%s\n", lines[i]);
    }
    return 0;
}
If you still continue to face the problem, show us the input data and more details. Best regards.

packed structure size in C, is this correct?

I found this in some existing code. It looks like it has some problems, but the code works fine. Can you tell me whether this piece of code does anything tricky?
Why are two of the unsigned members ignored when calculating the size of the structure?
tmsg_sz = sizeof(plfm_xml_header_t) + sizeof(oid_t) + sizeof(char*)
+ sizeof(unsigned) + sizeof(snmp_varbind_t)*5 ;
tmsg = (snmp_trap_t*) malloc(tmsg_sz);
if (!tmsg) {
PRINTF("malloc failed \n");
free(trap_msg);
return -1;
}
memset (tmsg, 0, tmsg_sz);
tmsg->hdr.type = PLFM_SNMPTRAP_MSG;
copy_oid_oidt(clog_msg_gen_notif_oid, OID_LENGTH(clog_msg_gen_notif_oid), &tmsg->oid);
tmsg->trap_type = SNMP_TRAP_ENTERPRISESPECIFIC;
tmsg->trap_specific = 1;
tmsg->trapmsg = strdup("Trap Message");
tmsg->numofvar = 5;
build_snmp_varbind(&(tmsg->vars[0]), facility, STR_DATA_TYPE, sizeof(facility)+1, clog_hist_facility_oid, 14);
build_snmp_varbind(&(tmsg->vars[1]), &sev, U32_DATA_TYPE, sizeof(sev),clog_hist_severity_oid, 14);
build_snmp_varbind(&(tmsg->vars[2]), name, STR_DATA_TYPE, sizeof(name)+1, clog_hist_msgname_oid, 14);
build_snmp_varbind(&(tmsg->vars[3]), trap_msg, STR_DATA_TYPE, strlen(trap_msg)+1,clog_hist_msgtext_oid, 14);
// get system uptime
long uptime = get_uptime();
build_snmp_varbind(&(tmsg->vars[4]), (long*)&uptime, TMR_DATA_TYPE, sizeof(uptime),clog_hist_timestamp_oid, 14);
typedef struct snmp_trap_s {
plfm_xml_header_t hdr;
oid_t oid; /* trap oid */
unsigned trap_type;
unsigned trap_specific;
char *trapmsg; /* text message for this trap */
unsigned numofvar;
snmp_varbind_t vars[0];
} __attribute__((__packed__)) snmp_trap_t;
Compilers try hard to align multibyte data in various ways. For example, an int variable, on an architecture where sizeof(int) == 4, may need to be placed at an address divisible by 4. This may be a hard requirement, or it may just make the system more efficient; it depends on the computer. So, consider
typedef struct combo {
char c;
int i;
} combo;
Depending on the architecture, sizeof(combo) may be 5, 6, or most often 8. Swapping the two members does not shrink the struct, even though only 5 bytes hold data:
typedef struct combo2 {
int i;
char c;
} combo2;
sizeof(combo2) is usually still 8, because the compiler adds trailing padding after c. An array shows why:
combo2 cb[2];
The size of cb is typically 16, as 3 bytes of padding follow both cb[0].c and cb[1].c. This lets cb[1].i start at a location divisible by 4.
A recommendation is to order the members of a structure by size; the 8-byte members should precede the 4-byte members, then the 2-byte members, then the 1-byte members. Of course, you have to be aware of typical sizes, and you can't be working on an oddball architecture where characters are not packed into larger words. Cray? cough-cough.
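The padding rules above can be verified at compile time. The following sketch uses hypothetical struct names and the GCC/Clang __packed__ attribute that snmp_trap_t in the question also relies on. The assertions are written so that they hold regardless of the exact size of int.

```cpp
#include <cstddef>

struct Combo  { char c; int i; };  // padding after c so that i is aligned
struct Combo2 { int i; char c; };  // trailing padding after c

// GCC/Clang extension: removes all padding between and after members.
struct PackedCombo { char c; int i; } __attribute__((__packed__));

// Trailing padding keeps every array element aligned.
static_assert(sizeof(Combo2) % alignof(int) == 0,
              "struct size is a multiple of the int alignment");
static_assert(sizeof(Combo2[2]) == 2 * sizeof(Combo2),
              "arrays add no padding of their own");
// Without padding, the size is just the sum of the member sizes.
static_assert(sizeof(PackedCombo) == sizeof(char) + sizeof(int),
              "packed struct has no padding");
static_assert(offsetof(PackedCombo, i) == 1, "i directly follows c");
```

The same reasoning applies to the allocation in the question: with a packed struct, the required size is the sum of the sizes of all members, so every member has to appear in that sum.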

Expression result unused

I got some code and I'm trying to fix some compiler errors and warnings:
StkFrames& PRCRev :: tick( StkFrames& frames, unsigned int channel )
{
#if defined(_STK_DEBUG_)
if ( channel >= frames.channels() - 1 ) {
errorString_ << "PRCRev::tick(): channel and StkFrames arguments are incompatible!";
handleError( StkError::FUNCTION_ARGUMENT );
}
#endif
StkFloat *samples = &frames[channel];
unsigned int hop = frames.channels();
for ( unsigned int i=0; i<frames.frames(); i++, samples += hop ) {
*samples = tick( *samples );
*samples++; // <-- warning: Expression result unused
*samples = lastFrame_[1];
}
return frames;
}
I don't understand what the code is trying to do. The code base is huge and I have already fixed quite a few issues, but googling didn't help with this one.
Any ideas?
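The line that triggers the warning is
*samples++;
Postfix ++ binds tighter than *, so this parses as *(samples++): the pointer is incremented, and the value it pointed to before the increment is read and then discarded. That discarded read is the unused expression result; writing samples++; has the same effect without the warning.
The next line then assigns to the new position:
*samples = lastFrame_[1];
I recommend reading the code inside the for loop more carefully. It doesn't look very logical.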
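To make the difference concrete, here is a small sketch with hypothetical helper names that writes two values through a moving pointer, once with a plain increment and once with the combined form in which the dereferenced result is actually used.

```cpp
// Writes `first`, advances the pointer, then writes `second`.
// Returns the final pointer position.
float* write_pair(float* samples, float first, float second) {
    *samples = first;
    samples++;  // same effect as `*samples++;` but with no discarded read
    *samples = second;
    return samples;
}

// Compact form: `*samples++ = first` parses as `*(samples++) = first`,
// so the old position is written and the pointer then moves on.
float* write_pair_compact(float* samples, float first, float second) {
    *samples++ = first;
    *samples = second;
    return samples;
}
```

Both functions leave the buffer in the same state. Only a bare *samples++; statement, where nothing consumes the dereferenced value, triggers the warning.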