Weird functioning of read() in socket programming - sockets

I have the following server code. I took this from a website to learn socket programming.
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#define BUFFER_SIZE 356
#define READ_SIZE 255
void error(const char *msg)
{
std::cerr << msg;
exit(1);
}
int main(int argc, char *argv[])
{
int sockfd, newsockfd, portno;
socklen_t clilen;
char buffer[BUFFER_SIZE];
struct sockaddr_in serv_addr, cli_addr;
int n;
if(argc < 2) error("ERROR, no port provided\n");
sockfd = socket(AF_INET, SOCK_STREAM, 0);
if (sockfd < 0) error("ERROR opening socket\n");
bzero((char *) &serv_addr, sizeof(serv_addr));
portno = atoi(argv[1]);
serv_addr.sin_family = AF_INET;
serv_addr.sin_addr.s_addr = INADDR_ANY;
serv_addr.sin_port = htons(portno);
if (bind(sockfd, (struct sockaddr *) &serv_addr,
sizeof(serv_addr)) < 0)
error("ERROR on binding");
listen(sockfd,5);
clilen = sizeof(cli_addr);
while(true){
newsockfd = accept(sockfd,
(struct sockaddr *) &cli_addr,
&clilen);
if (newsockfd < 0)
error("ERROR on accept");
bzero(buffer,BUFFER_SIZE);
n = read(newsockfd,buffer,READ_SIZE);
printf("Here is the message: %s\n",buffer);
std::cout << "hellow" << "\n";
const char *message = "HTTP/1.1 200 OK\r\n\r\n<html><body><h1>Hello. Please don't close</h1></body></html>";
n = write(newsockfd,message,strlen(message));
if (n < 0) error("ERROR writing to socket");
close(newsockfd);
}
close(sockfd);
return 0;
}
Problem appears to be with the read() function. When I run this server on a particular port and use Firefox as client, the browser reports Connection was reset.
Here is the output of tcpdump while closing the connection:
15:29:44.315802 IP (tos 0x0, ttl 64, id 7572, offset 0, flags [DF], proto TCP (6), length 52)
localhost.8882 > localhost.36360: Flags [R.], cksum 0xfe28 (incorrect -> 0xebb8), seq 80, ack 291, win 350, options [nop,nop,TS val 3928986 ecr 3928985], length 0
So the server sends an RST directly instead of following the normal FIN/ACK shutdown sequence.
However, if I change the value of READ_SIZE from 255 to BUFFER_SIZE-1, the code works fine.
Here is the new trace from tcpdump corresponding to connection closing:
15:30:21.353901 IP (tos 0x0, ttl 64, id 3241, offset 0, flags [DF], proto TCP (6), length 52)
localhost.8882 > localhost.36437: Flags [F.], cksum 0xfe28 (incorrect -> 0x20b7), seq 80, ack 302, win 350, options [nop,nop,TS val 3938245 ecr 3938245], length 0
15:30:21.354071 IP (tos 0x0, ttl 64, id 38322, offset 0, flags [DF], proto TCP (6), length 52)
localhost.36437 > localhost.8882: Flags [F.], cksum 0xfe28 (incorrect -> 0x20be), seq 302, ack 81, win 342, options [nop,nop,TS val 3938245 ecr 3938245], length 0
15:30:21.354093 IP (tos 0x0, ttl 64, id 3242, offset 0, flags [DF], proto TCP (6), length 52)
localhost.8882 > localhost.36437: Flags [.], cksum 0xfe28 (incorrect -> 0x20b6), ack 303, win 350, options [nop,nop,TS val 3938245 ecr 3938245], length 0
Why does the read() call cause an RST to be sent? And why does increasing the number of bytes read solve the problem?
Note: This happens every time. This is not due to some random interruption.
EDIT: As suggested by Aif, I tried calling read() multiple times. This gives a similar problem. Here is the loop:
while(true){
std::cout << toread << "\n";
readed = read(newsockfd, a, std::min(toread, 1));
// readed = read(newsockfd, a, 1);
std::cout << readed << "\n";
if(readed < 0){std::cout << "error reading"; exit(-1);}
if(readed == 0) break;
a += readed;
toread -= readed;
if(toread == 0) break;
}
I'm reading 1 byte at a time. Now this gives a very strange problem: it reads only 294 bytes and then blocks indefinitely. I don't know where that number comes from. In short, even this repeated read doesn't work. Any ideas?

Edit:
Check the return value of read()
Iterate over read() in case more data arrives than a single call reads, and keep the whole request in a larger buffer.
Oh, and by the way, this is the old way of doing sockets; you should now use getaddrinfo. For a (very, very) good tutorial about socket programming, I recommend Beej's Guide to Network Programming.
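For reference, here is a minimal sketch of the getaddrinfo() approach mentioned above, assuming a TCP listener on the wildcard address. The function name make_listener is made up for illustration, and the backlog of 5 is simply copied from the question's code:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a listening TCP socket on the given port
 * (a decimal string such as "8882"). Returns the socket fd, or -1 on failure. */
int make_listener(const char *port)
{
    assert(port != NULL);

    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 candidates */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    hints.ai_flags = AI_PASSIVE;      /* wildcard address, suitable for bind() */

    if (getaddrinfo(NULL, port, &hints, &res) != 0)
        return -1;

    int fd = -1;
    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;                 /* this address family may be unavailable */
        int yes = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
        if (bind(fd, p->ai_addr, p->ai_addrlen) == 0 && listen(fd, 5) == 0)
            break;                    /* success: keep this socket */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```

A server would then call make_listener(argv[1]) in place of the manual sockaddr_in setup and pass the returned descriptor to accept().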
Edit 2:
There is indeed something I don't understand about what happens when the buffer size is reduced. Anyway, I made some corrections to your code that make it work.
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h> // Use errno
#include <string> // std::string for the "large" buffer
#define BUFFER_SIZE 10 // Very small buffer, to test the behaviour
#define READ_SIZE 255
void error(const char *msg)
{
std::cerr << msg;
exit(1);
}
int main(int argc, char *argv[])
{
int sockfd, newsockfd, portno;
socklen_t clilen;
char buffer[BUFFER_SIZE];
std::string msg; // Use a string for the "large" buffer
struct sockaddr_in serv_addr, cli_addr;
int n;
if(argc < 2) error("ERROR, no port provided\n");
sockfd = socket(AF_INET, SOCK_STREAM, 0);
if (sockfd < 0) error("ERROR opening socket\n");
bzero((char *) &serv_addr, sizeof(serv_addr));
portno = atoi(argv[1]);
serv_addr.sin_family = AF_INET;
serv_addr.sin_addr.s_addr = INADDR_ANY;
serv_addr.sin_port = htons(portno);
if (bind(sockfd, (struct sockaddr *) &serv_addr,
sizeof(serv_addr)) < 0)
error("ERROR on binding");
listen(sockfd,5);
clilen = sizeof(cli_addr);
while(1){
int pkt = 0;
msg.clear(); // Start each connection with an empty request buffer
newsockfd = accept(sockfd,
(struct sockaddr *) &cli_addr,
&clilen);
if (newsockfd < 0)
error("ERROR on accept");
bzero(buffer,BUFFER_SIZE);
//n = read(newsockfd,buffer,sizeof(buffer));
do
{
n = recv(newsockfd, buffer, sizeof(buffer) - 1, MSG_DONTWAIT); // Use recv instead of read; leave one byte for the terminator written below
if (n>0)
{
buffer[n] = 0;
msg += buffer; // Actually increase the "large" buffer
pkt++;
}
} while (n > 0 || ((n == -1) && ((errno == EAGAIN) || (errno == EWOULDBLOCK)) && pkt == 0));
if (n<0)
{
if ((errno != EAGAIN) && (errno != EWOULDBLOCK))
error("Reading error\n");
else
std::cout << "Nothing left to read" << std::endl;
}
std::cout << "Here is the message: " << msg << std::endl;
std::cout << "hellow" << "\n";
const char *message = "HTTP/1.1 200 OK\r\n\r\n<html><body><h1>Hello. Please don't close</h1></body></html>";
n = write(newsockfd,message,strlen(message));
if (n < 0) error("ERROR writing to socket");
else std::cout << "Wrote " << n << " bytes" << std::endl;
close(newsockfd);
}
close(sockfd);
return 0;
}
The main changes are:
I use a std::string to store the whole client request, which avoids any manual allocation issues.
I use recv() instead of read() because its MSG_DONTWAIT flag makes non-blocking reads easy.
The loop condition originally tested only for errno == EAGAIN or errno == EWOULDBLOCK, but the default "RST" behaviour forced me to also count the received chunks (pkt), so that the loop keeps retrying until at least one chunk has arrived.
Hope this helps.
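A likely explanation for the RST, and for why the larger READ_SIZE helps: on Linux, calling close() on a TCP socket that still has unread data in its receive queue makes the kernel send RST instead of FIN. The tcpdump lines in the question acknowledge sequence numbers 291 and 302, so the browser's request was around 290 to 300 bytes; a single 255-byte read leaves part of it unread, while a 355-byte read consumes it all. The 1-byte loop stopping at 294 bytes fits the same picture: the request was fully read, and read() then blocked waiting for data that never came. A minimal sketch of reading up to the blank line that ends an HTTP request head, so that nothing is left behind before close(), could look like this. The helper name read_request_head is made up for illustration, and the sketch assumes a request without a body:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until the "\r\n\r\n" that ends an HTTP
 * request head has been seen, the peer closes, or buf is full.
 * buf is always NUL-terminated. Returns the number of bytes stored, or -1
 * on a receive error. If buf fills up first, unread data may remain. */
ssize_t read_request_head(int fd, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);

    size_t used = 0;
    buf[0] = '\0';
    while (used + 1 < cap) {
        ssize_t n = recv(fd, buf + used, cap - 1 - used, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;                    /* peer closed the connection */
        used += (size_t)n;
        buf[used] = '\0';
        if (strstr(buf, "\r\n\r\n") != NULL)
            break;                    /* end of the request head */
    }
    return (ssize_t)used;
}
```

Unlike a fixed-size single read, this stops at a protocol boundary rather than at an arbitrary byte count, and it blocks instead of busy-waiting on EAGAIN.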

Related

RPi Wiringpi fails to read i2c correctly

I have an AHT21 that communicates over I2C: I send 3 bytes and get back 6. The Arduino sketch works but the RPi version does not. What is wrong with my WiringPi I2C usage?
I want to convert this Arduino sketch to an RPi C++ program using WiringPi.
This works:
#include <Wire.h>
#define AHT21 0x38
void setup() {
// put your setup code here, to run once:
Wire.begin(); // the SDA and SCL
Serial.begin(9600);
uint8_t rawData[7] = {0,0,0,0,0,0,0};
Wire.beginTransmission(AHT21);
Wire.write(0xAC); //send measurement command, start measurement
Wire.write(0x33); //send measurement control
Wire.write(0x00); //send measurement NOP control
Wire.endTransmission();
delay(100);
Wire.requestFrom(AHT21, 6);
for (uint8_t i = 0; i < 6; i++)
{
rawData[i] = Wire.read();
Serial.print(i);Serial.print(": ");
Serial.println(rawData[i]);
}
}
void loop() {}
Gives:
0: 28
1: 106
2: 90
3: 117
4: 126
5: 70
This RPI code fails giving the status byte over and over:
#include <wiringPi.h>
#include <wiringPiI2C.h>
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#define Address 0x38
int main (int argc, char **argv)
{
int fd = wiringPiI2CSetup(Address);
uint8_t rawData[7] = {0,0,0,0,0,0,0};
wiringPiI2CWrite(fd,0xAC); //send measurement command, start measurement
wiringPiI2CWrite(fd,0x33); //send measurement control
wiringPiI2CWrite(fd,0x00); //send measurement NOP control
delay(100);
for (uint8_t i = 0; i < 6; i++)
{
rawData[i] = wiringPiI2CRead(fd);
printf("%d: %d\n",i,rawData[i]);
}
}
Gives:
./aht21
0: 28
1: 28
2: 28
3: 28
4: 28
5: 28
I abandoned WiringPi and went with ioctl and i2c-dev.h. Works fine:
//gcc -g -Wall -Wextra -pedantic -std=c11 -D_DEFAULT_SOURCE -D_BSD_SOURCE -o aht21 aht21.c
#include <stdio.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h> // read/write usleep
#include <stdlib.h> // exit function
#include <inttypes.h> // uint8_t, etc
#include <linux/i2c-dev.h> // I2C bus definitions
int main (int argc, char **argv)
{
float ahtTemp, ahtHum;
uint8_t rawData[7] = {0, 0, 0, 0, 0, 0, 0};
// Create I2C bus
int fd;
char *bus = "/dev/i2c-1";
if ((fd = open(bus, O_RDWR)) < 0)
{
printf("Failed to open the bus. \n");
exit(1);
}
// Get I2C device,
ioctl(fd, I2C_SLAVE, 0x38);
char TriggerCMD[3] = {0};
TriggerCMD[0] = 0xAC;
TriggerCMD[1] = 0x33;
TriggerCMD[2] = 0x00;
write(fd, TriggerCMD, 3);
sleep(1);
if (read(fd, rawData, 7) != 7)
{
printf("Error : Input/Output Error \n");
}
else
{
uint32_t humidity = rawData[1]; //20-bit raw humidity data
humidity <<= 8;
humidity |= rawData[2];
humidity <<= 4;
humidity |= rawData[3] >> 4;
uint32_t temperature = rawData[3] & 0x0F; //20-bit raw temperature data
temperature <<= 8;
temperature |= rawData[4];
temperature <<= 8;
temperature |= rawData[5];
ahtHum = ((float)humidity / 0x100000) * 100.0;
ahtTemp = (((float)temperature / 0x100000) * 200.0 - 50.0) * 1.8 + 32.0;
printf("%.2f,%.2f\n", ahtHum, ahtTemp);
}
close(fd);
}
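One plausible reason the WiringPi version repeats the status byte is that each wiringPiI2CRead() call is a separate single-byte bus transaction, so every read starts again at the sensor's first byte, whereas read(fd, rawData, 7) fetches the whole frame in one transaction. That is an assumption rather than something verified here. Independently of the transport, the bit unpacking in the working program can be isolated into a small function. The sketch below mirrors it; the names aht_reading and aht_decode are made up, and it returns Celsius rather than the Fahrenheit value printed above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative result type (not part of any library). */
typedef struct {
    double humidity_pct;  /* relative humidity in percent */
    double temp_c;        /* temperature in degrees Celsius */
} aht_reading;

/* Decode a measurement frame: raw[0] is the status byte, raw[1..5] hold
 * two packed 20-bit values (humidity first, then temperature). */
aht_reading aht_decode(const uint8_t *raw)
{
    assert(raw != NULL);

    /* Humidity: raw[1], raw[2] and the high nibble of raw[3]. */
    uint32_t h = ((uint32_t)raw[1] << 12) |
                 ((uint32_t)raw[2] << 4) |
                 ((uint32_t)raw[3] >> 4);

    /* Temperature: the low nibble of raw[3], then raw[4] and raw[5]. */
    uint32_t t = ((uint32_t)(raw[3] & 0x0F) << 16) |
                 ((uint32_t)raw[4] << 8) |
                 (uint32_t)raw[5];

    aht_reading r;
    r.humidity_pct = (double)h / 0x100000 * 100.0;
    r.temp_c = (double)t / 0x100000 * 200.0 - 50.0;
    return r;
}
```

Fed the six bytes printed by the Arduino sketch (28, 106, 90, 117, 126, 70), this yields roughly 41.5 % humidity and 18.7 °C.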

When accessing the GPIO register address of a Raspberry Pi, why is the result different between unsigned int* and char*?

Using mmap(), I want to write a value to a GPIO register of the Raspberry Pi.
I expected the register contents to be the same whether I read the mapped GPIO address through an unsigned int * or a char *, but they were not. I compared the results for both cases.
This is my code.
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#define GPIO_BASE 0x3F200000
#define GPFSEL1 0x04
#define GPSET0 0x1C
#define GPCLR0 0x28
int main()
{
int fd = open("/dev/mem", O_RDWR|O_SYNC);
// Error Handling
if (fd < 0) {
printf("Can't open /dev/mem \n");
exit(1);
}
// Map pages of memory
char *gpio_memory_map = (char*)mmap(0, 4096, PROT_READ|PROT_WRITE,
MAP_SHARED, fd, GPIO_BASE);
// Error Handling
if (gpio_memory_map == MAP_FAILED) {
printf("Error : mmap \n");
exit(-1);
}
// GPIO18
//volatile unsigned int *gpio = (volatile unsigned int*)gpio_memory_map;
//gpio[GPFSEL1/4] = (1<<24);
volatile char *gpio = (volatile char *)gpio_memory_map;
int i;
for (i = 0; i < 16; i++)
printf("gpio[%d](%p) = %#x\n", i, (void *)&gpio[i], gpio[i]);
/*
for (i = 0; i < 5; i++) {
gpio[GPCLR0 / 4] = (1 << 18);
sleep(1);
gpio[GPSET0 / 4] = (1 << 18);
sleep(1);
}
*/
// Unmap pages of memory
munmap(gpio_memory_map, 4096);
return 0;
}
And those below are the results.
volatile unsigned int *gpio = (volatile unsigned int *)gpio_memory_map;
gpio[0](0x76f12000) = 0x1
gpio[1](0x76f12004) = 0x1000000
gpio[2](0x76f12008) = 0
gpio[3](0x76f1200c) = 0x3fffffc0
gpio[4](0x76f12010) = 0x24000924
gpio[5](0x76f12014) = 0x924
gpio[6](0x76f12018) = 0
gpio[7](0x76f1201c) = 0x6770696f
gpio[8](0x76f12020) = 0x6770696f
gpio[9](0x76f12024) = 0x6770696f
gpio[10](0x76f12028) = 0x6770696f
gpio[11](0x76f1202c) = 0x6770696f
gpio[12](0x76f12030) = 0x6770696f
gpio[13](0x76f12034) = 0x2ffbbfff
gpio[14](0x76f12038) = 0x3ef4ff
gpio[15](0x76f1203c) = 0
volatile char *gpio = (volatile char *)gpio_memory_map;
Based on result #1 above, I expected gpio[1], gpio[2], and gpio[3] to be 0, but they are not. And even if I try to write a new value to gpio[1], gpio[2], or gpio[3], it stays the same. Why are the results different when accessing through char * and unsigned int *?
gpio[0](0x76f47000) = 0x1
gpio[1](0x76f47001) = 0x69
gpio[2](0x76f47002) = 0x70
gpio[3](0x76f47003) = 0x67
gpio[4](0x76f47004) = 0
gpio[5](0x76f47005) = 0x69
gpio[6](0x76f47006) = 0x70
gpio[7](0x76f47007) = 0x67
gpio[8](0x76f47008) = 0
gpio[9](0x76f47009) = 0x69
gpio[10](0x76f4700a) = 0x70
gpio[11](0x76f4700b) = 0x67
gpio[12](0x76f4700c) = 0xc0
gpio[13](0x76f4700d) = 0x69
gpio[14](0x76f4700e) = 0x70
gpio[15](0x76f4700f) = 0x67
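The register offsets defined in the code above (GPFSEL1, GPSET0, GPCLR0) describe 32-bit registers, and the commented-out lines index them as words (gpio[GPFSEL1/4]). A sketch of word-sized access helpers follows, under the assumption that these peripheral registers should only be read and written as whole 32-bit words; if that holds, byte-wide access through a char * would not be expected to return one byte of the register. The helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets from the GPIO base, as in the question's #defines. */
enum { GPSET0_OFFSET = 0x1C, GPCLR0_OFFSET = 0x28 };

/* Each GPFSELn register holds 3 function-select bits for 10 pins. */
unsigned gpfsel_word(unsigned pin)  { assert(pin < 54); return pin / 10; }
unsigned gpfsel_shift(unsigned pin) { assert(pin < 54); return (pin % 10) * 3; }

/* Configure a pin as output (function code 001) with one 32-bit
 * read-modify-write, leaving the other pins in the register untouched. */
void gpio_set_output(volatile uint32_t *gpio, unsigned pin)
{
    uint32_t v = gpio[gpfsel_word(pin)];
    v &= ~(UINT32_C(7) << gpfsel_shift(pin));
    v |= UINT32_C(1) << gpfsel_shift(pin);
    gpio[gpfsel_word(pin)] = v;
}

/* Drive a pin high or low by writing a single bit to GPSETn or GPCLRn. */
void gpio_write(volatile uint32_t *gpio, unsigned pin, int high)
{
    unsigned offset = high ? GPSET0_OFFSET : GPCLR0_OFFSET;
    gpio[offset / 4 + pin / 32] = UINT32_C(1) << (pin % 32);
}
```

For GPIO18 this selects word 1 (GPFSEL1) and bit position 24, matching the (1<<24) in the commented-out line of the question. The pointer passed in would be the mmap() result cast to volatile uint32_t *.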

PMC to count if software prefetch hit L1 cache

I am trying to find a PMC (Performance Monitoring Counter) that counts the number of times a prefetcht0 instruction hits (or misses) the L1 dcache.
icelake-client: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
I am trying to make this fine-grained, i.e. something like the following (note: there should be lfence instructions around the prefetcht0):
xorl %ecx, %ecx
rdpmc
movl %eax, %edi
prefetcht0 (%rsi)
rdpmc
testl %eax, %edi
// jump depending on if it was a miss or not
The goal is to check whether a prefetch hit L1. If it didn't, execute some code that is ready; otherwise proceed.
It seems that it will have to be a miss event just based on what is available.
I have tried a few events from libpfm4 and the Intel manual with no luck:
L1-DCACHE-LOAD-MISSES, emask=0x00, umask=0x10000
L1D.REPLACEMENT, emask=0x51, umask=0x1
L2_RQSTS.SWPF_HIT, emask=0x24, umask=0xc8
L2_RQSTS.SWPF_MISS, emask=0x24, umask=0x28
LOAD_HIT_PREFETCH.SWPF, emask=0x01, umask=0x4c (this very misleadingly is non-sw prefetch hits)
L1D.REPLACEMENT and L1-DCACHE-LOAD-MISSES kind of work: they work if I delay the second rdpmc, but if the two rdpmc instructions come right after one another the result seems unreliable at best. The other ones are complete busts.
Questions:
Should any of these work for detecting whether prefetches hit the L1 dcache? (i.e. is my testing bad?)
If not, what events could be used to detect whether a prefetch hit the L1 dcache?
Edit: MEM_LOAD_RETIRED.L1_HIT does not appear to work for software prefetch.
Here is the code I am using to do test:
#include <asm/unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#define HIT 0
#define MISS 1
#define TODO MISS
#define PAGE_SIZE 4096
// to force hit make TSIZE low
#define TSIZE 10000
#define err_assert(cond) \
if (__builtin_expect(!(cond), 0)) { \
fprintf(stderr, "%d:%d: %s\n", __LINE__, errno, strerror(errno)); \
exit(-1); \
}
uint64_t
get_addr() {
uint8_t * addr =
(uint8_t *)mmap(NULL, TSIZE * PAGE_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
err_assert(addr != MAP_FAILED); // mmap reports failure with MAP_FAILED, not NULL
for (uint32_t i = 0; i < TSIZE; ++i) {
addr[i * PAGE_SIZE + (PAGE_SIZE - 1)] = 0;
#if TODO == HIT
addr[i * PAGE_SIZE] = 0;
#endif
}
return uint64_t(addr);
}
int
perf_event_open(struct perf_event_attr * hw_event,
pid_t pid,
int cpu,
int group_fd,
unsigned long flags) {
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
return ret;
}
void
init_perf_event_struct(struct perf_event_attr * pe,
const uint32_t type,
const uint64_t ev_config,
int lead) {
__builtin_memset(pe, 0, sizeof(struct perf_event_attr));
pe->type = type;
pe->size = sizeof(struct perf_event_attr);
pe->config = ev_config;
pe->disabled = !!lead;
pe->exclude_kernel = 1;
pe->exclude_hv = 1;
}
/* Fixed Counters */
static constexpr uint32_t core_instruction_ev = 0x003c;
static constexpr uint32_t core_instruction_idx = (1 << 30) + 0;
static constexpr uint32_t core_cycles_ev = 0x00c0;
static constexpr uint32_t core_cycles_idx = (1 << 30) + 1;
static constexpr uint32_t ref_cycles_ev = 0x0300;
static constexpr uint32_t ref_cycles_idx = (1 << 30) + 2;
/* programmable counters */
static constexpr uint32_t mem_load_retired_l1_hit = 0x01d1;
static constexpr uint32_t mem_load_retired_l1_miss = 0x08d1;
int
init_perf_tracking() {
struct perf_event_attr pe;
init_perf_event_struct(&pe, PERF_TYPE_RAW, core_instruction_ev, 1);
int leadfd = perf_event_open(&pe, 0, -1, -1, 0);
err_assert(leadfd >= 0);
init_perf_event_struct(&pe, PERF_TYPE_RAW, core_cycles_ev, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
init_perf_event_struct(&pe, PERF_TYPE_RAW, ref_cycles_ev, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
init_perf_event_struct(&pe, PERF_TYPE_RAW, mem_load_retired_l1_hit, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
return leadfd;
}
void
start_perf_tracking(int leadfd) {
ioctl(leadfd, PERF_EVENT_IOC_RESET, 0);
ioctl(leadfd, PERF_EVENT_IOC_ENABLE, 0);
}
#define _V_TO_STR(X) #X
#define V_TO_STR(X) _V_TO_STR(X)
//#define DO_PREFETCH
#ifdef DO_PREFETCH
#define DO_MEMORY_OP(addr) "prefetcht0 (%[" V_TO_STR(addr) "])\n\t"
#else
#define DO_MEMORY_OP(addr) "movl (%[" V_TO_STR(addr) "]), %%eax\n\t"
#endif
int
main() {
int fd = init_perf_tracking();
start_perf_tracking(fd);
uint64_t addr = get_addr();
uint32_t prefetch_miss, cycles_to_detect;
asm volatile(
"lfence\n\t"
"movl %[core_cycles_idx], %%ecx\n\t"
"rdpmc\n\t"
"movl %%eax, %[cycles_to_detect]\n\t"
"xorl %%ecx, %%ecx\n\t"
"rdpmc\n\t"
"movl %%eax, %[prefetch_miss]\n\t"
"lfence\n\t"
DO_MEMORY_OP(prefetch_addr)
"lfence\n\t"
"xorl %%ecx, %%ecx\n\t"
"rdpmc\n\t"
"subl %[prefetch_miss], %%eax\n\t"
"movl %%eax, %[prefetch_miss]\n\t"
"movl %[core_cycles_idx], %%ecx\n\t"
"rdpmc\n\t"
"subl %[cycles_to_detect], %%eax\n\t"
"movl %%eax, %[cycles_to_detect]\n\t"
"lfence\n\t"
: [ prefetch_miss ] "=&r"(prefetch_miss),
[ cycles_to_detect ] "=&r"(cycles_to_detect)
: [ prefetch_addr ] "r"(addr), [ core_cycles_idx ] "i"(core_cycles_idx)
: "eax", "edx", "ecx");
fprintf(stderr, "Hit : %d\n", prefetch_miss);
fprintf(stderr, "Cycles : %d\n", cycles_to_detect);
}
If I define DO_PREFETCH, the result for MEM_LOAD_RETIRED.L1_HIT is always 1 (it always appears to get a hit). If I comment out DO_PREFETCH, the results correspond with what I would expect (a miss when the address is clearly not in cache, a hit when it clearly is).
With DO_PREFETCH:
g++ -DDO_PREFETCH -O3 -march=native -mtune=native prefetch_hits.cc -o prefetch_hits
$> ./prefetch_hits
Hit : 1
Cycles : 554
and without DO_PREFETCH
g++ -O3 -march=native -mtune=native prefetch_hits.cc -o prefetch_hits
$> ./prefetch_hits
Hit : 0
Cycles : 888
With L2_RQSTS.SWPF_HIT and L2_RQSTS.SWPF_MISS I was able to get it to work. Big thanks to Hadi Brais. It is worth noting that the reason L1D_PEND_MISS.PENDING didn't work might be related to Ice Lake; Hadi Brais reported getting it to work for predicting L1D cache misses on Haswell.
In the interest of determining why L1D_PEND_MISS.PENDING and MEM_LOAD_RETIRED.L1_HIT do not work, here is the exact code I'm using to test them:
#include <asm/unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#define HIT 0
#define MISS 1
#define TODO MISS
#define PAGE_SIZE 4096
#define TSIZE 1000
#define err_assert(cond) \
if (__builtin_expect(!(cond), 0)) { \
fprintf(stderr, "%d:%d: %s\n", __LINE__, errno, strerror(errno)); \
exit(-1); \
}
uint64_t
get_addr() {
uint8_t * addr =
(uint8_t *)mmap(NULL, TSIZE * PAGE_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
err_assert(addr != MAP_FAILED); // mmap reports failure with MAP_FAILED, not NULL
__builtin_memset(addr, -1, TSIZE * PAGE_SIZE);
return uint64_t(addr);
}
int
perf_event_open(struct perf_event_attr * hw_event,
pid_t pid,
int cpu,
int group_fd,
unsigned long flags) {
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
return ret;
}
void
init_perf_event_struct(struct perf_event_attr * pe,
const uint32_t type,
const uint64_t ev_config,
int lead) {
__builtin_memset(pe, 0, sizeof(struct perf_event_attr));
pe->type = type;
pe->size = sizeof(struct perf_event_attr);
pe->config = ev_config;
pe->disabled = !!lead;
pe->exclude_kernel = 1;
pe->exclude_hv = 1;
}
/* Fixed Counters */
static constexpr uint32_t core_instruction_ev = 0x003c;
static constexpr uint32_t core_instruction_idx = (1 << 30) + 0;
static constexpr uint32_t core_cycles_ev = 0x00c0;
static constexpr uint32_t core_cycles_idx = (1 << 30) + 1;
static constexpr uint32_t ref_cycles_ev = 0x0300;
static constexpr uint32_t ref_cycles_idx = (1 << 30) + 2;
/* programmable counters */
static constexpr uint32_t mem_load_retired_l1_hit = 0x01d1;
static constexpr uint32_t mem_load_retired_l1_miss = 0x08d1;
static constexpr uint32_t l1d_pending = 0x0148;
static constexpr uint32_t swpf_hit = 0xc824;
static constexpr uint32_t swpf_miss = 0x2824;
static constexpr uint32_t ev0 = l1d_pending;
#define NEVENTS 1
#if NEVENTS > 1
static constexpr uint32_t ev1 = swpf_miss;
#endif
int
init_perf_tracking() {
struct perf_event_attr pe;
init_perf_event_struct(&pe, PERF_TYPE_RAW, core_instruction_ev, 1);
int leadfd = perf_event_open(&pe, 0, -1, -1, 0);
err_assert(leadfd >= 0);
init_perf_event_struct(&pe, PERF_TYPE_RAW, core_cycles_ev, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
init_perf_event_struct(&pe, PERF_TYPE_RAW, ref_cycles_ev, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
init_perf_event_struct(&pe, PERF_TYPE_RAW, ev0, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
#if NEVENTS > 1
init_perf_event_struct(&pe, PERF_TYPE_RAW, ev1, 0);
err_assert(perf_event_open(&pe, 0, -1, leadfd, 0) >= 0);
#endif
return leadfd;
}
void
start_perf_tracking(int leadfd) {
ioctl(leadfd, PERF_EVENT_IOC_RESET, 0);
ioctl(leadfd, PERF_EVENT_IOC_ENABLE, 0);
}
#define _V_TO_STR(X) #X
#define V_TO_STR(X) _V_TO_STR(X)
//#define LFENCE
#ifdef LFENCE
#define SERIALIZER() "lfence\n\t"
#else
#define SERIALIZER() \
"xorl %%ecx, %%ecx\n\t" \
"xorl %%eax, %%eax\n\t" \
"cpuid\n\t"
#endif
#define DO_PREFETCH
#ifdef DO_PREFETCH
#define DO_MEMORY_OP(addr) "prefetcht0 (%[" V_TO_STR(addr) "])\n\t"
#else
#define DO_MEMORY_OP(addr) "movl (%[" V_TO_STR(addr) "]), %%eax\n\t"
#endif
int
main() {
int fd = init_perf_tracking();
start_perf_tracking(fd);
uint64_t addr = get_addr();
// to ensure page in TLB
*((volatile uint64_t *)(addr + (PAGE_SIZE - 8))) = 0;
#if TODO == HIT
// loading from 0 offset to check cache miss / hit
*((volatile uint64_t *)addr) = 0;
#endif
uint32_t ecount0 = 0, ecount1 = 0, cycles_to_detect = 0;
asm volatile(
SERIALIZER()
"movl %[core_cycles_idx], %%ecx\n\t"
"rdpmc\n\t"
"movl %%eax, %[cycles_to_detect]\n\t"
"xorl %%ecx, %%ecx\n\t"
"rdpmc\n\t"
"movl %%eax, %[ecount0]\n\t"
#if NEVENTS > 1
"movl $1, %%ecx\n\t"
"rdpmc\n\t"
"movl %%eax, %[ecount1]\n\t"
#endif
SERIALIZER()
DO_MEMORY_OP(prefetch_addr)
SERIALIZER()
"xorl %%ecx, %%ecx\n\t"
"rdpmc\n\t"
"subl %[ecount0], %%eax\n\t"
"movl %%eax, %[ecount0]\n\t"
#if NEVENTS > 1
"movl $1, %%ecx\n\t"
"rdpmc\n\t"
"subl %[ecount1], %%eax\n\t"
"movl %%eax, %[ecount1]\n\t"
#endif
"movl %[core_cycles_idx], %%ecx\n\t"
"rdpmc\n\t"
"subl %[cycles_to_detect], %%eax\n\t"
"movl %%eax, %[cycles_to_detect]\n\t"
SERIALIZER()
: [ ecount0 ] "=&r"(ecount0),
#if NEVENTS > 1
[ ecount1 ] "=&r"(ecount1),
#endif
[ cycles_to_detect ] "=&r"(cycles_to_detect)
: [ prefetch_addr ] "r"(addr), [ core_cycles_idx ] "i"(core_cycles_idx)
: "eax", "edx", "ecx");
fprintf(stderr, "E0 : %d\n", ecount0);
fprintf(stderr, "E1 : %d\n", ecount1);
fprintf(stderr, "Cycles : %d\n", cycles_to_detect);
}
The rdpmc is not ordered with the events that may occur before it or after it in program order. A fully serializing instruction, such as cpuid, is required to obtain the desired ordering guarantees with respect to prefetcht0. The code should be as follows:
xor %eax, %eax # CPUID leaf eax=0 should be fast. Doing this before each CPUID might be a good idea, but omitted for clarity
cpuid
xorl %ecx, %ecx
rdpmc
movl %eax, %edi # save RDPMC result before CPUID overwrites EAX..EDX
cpuid
prefetcht0 (%rsi)
cpuid
xorl %ecx, %ecx
rdpmc
testl %eax, %edi # CPUID doesn't affect FLAGS
cpuid
Each of the rdpmc instructions is sandwiched between cpuid instructions. This ensures that the events that occur between the two rdpmc instructions, and only those events, are counted.
The prefetch operation of the prefetcht0 instruction may either be ignored or performed. If it was performed, it may either hit in a cache line that is in a valid state in the L1D or not. These are the cases that have to be considered.
The sum of L2_RQSTS.SWPF_HIT and L2_RQSTS.SWPF_MISS cannot be used to count or derive the number of prefetcht0 hits in the L1D, but their sum can be subtracted from SW_PREFETCH_ACCESS.T0 to get an upper bound on the number of prefetcht0 hits in the L1D. With the properly serialized sequence shown above, I think the only case where a non-ignored prefetcht0 doesn't hit in the L1D and is not counted by the sum SWPF_HIT+SWPF_MISS is if the software prefetch operation hits in an LFB allocated for a hardware prefetch.
L1-DCACHE-LOAD-MISSES is just another name for L1D.REPLACEMENT. The event code and umask you've shown for L1-DCACHE-LOAD-MISSES are incorrect. The L1D.REPLACEMENT event only occurs if the prefetch operation misses in the L1D (which causes a request to be sent to the L2) and causes a valid line in the L1D to be replaced. Usually most fills cause a replacement, but the event still cannot be used to distinguish between a prefetcht0 that hits in the L1D, a prefetcht0 that hits in an LFB allocated for a hardware prefetch, and an ignored prefetcht0.
The event LOAD_HIT_PREFETCH.SWPF occurs when a demand load hits in an LFB allocated for a software prefetch. This is obviously not useful here.
The event L1D_PEND_MISS.PENDING (event=0x48, umask=0x01) should work. According to the documentation, this event increments the counter by the number of pending L1D misses every cycle. I think it works for demand loads and prefetches. This is really an approximation, so it may count even if there are zero pending L1D misses. But I think it can still be used to determine with very high confidence whether a single prefetcht0 missed in the L1D by following these steps:
First, add the line uint64_t value = *(volatile uint64_t*)addr; just before the inline assembly. This is to increase the probability to near 100% that the line to be prefetched is in the L1D.
Second, measure the minimum increment of L1D_PEND_MISS.PENDING for a prefetcht0 that is very highly likely to hit in the L1D.
Run the experiment many times to build high confidence that the minimum increment is highly stable, to the extent that the exact same value is observed in almost every run.
Comment out the line added in the first step so that the prefetcht0 misses and check that the event count change is always or almost always larger than the minimum increment measured previously.
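The bookkeeping behind these calibration steps can be sketched as follows. This is only an illustration of the procedure, not measured behaviour: the function names and the idea of an explicit margin are assumptions, and the deltas are whatever differences between two rdpmc readings of L1D_PEND_MISS.PENDING the experiment collected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 2: smallest counter increment seen across runs in which the
 * prefetched line was very likely already in the L1D. */
uint32_t min_delta(const uint32_t *deltas, size_t n)
{
    assert(deltas != NULL && n > 0);
    uint32_t m = deltas[0];
    for (size_t i = 1; i < n; ++i)
        if (deltas[i] < m)
            m = deltas[i];
    return m;
}

/* Step 3: how many runs produced exactly the baseline value; a stable
 * baseline should account for almost all of them. */
size_t count_equal(const uint32_t *deltas, size_t n, uint32_t baseline)
{
    assert(deltas != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        if (deltas[i] == baseline)
            ++hits;
    return hits;
}

/* Step 4: classify a single measurement as a probable L1D miss when it
 * exceeds the hit baseline by more than a chosen margin. */
int looks_like_miss(uint32_t delta, uint32_t hit_baseline, uint32_t margin)
{
    return delta > hit_baseline && delta - hit_baseline > margin;
}
```

The margin is a tuning knob: it would be chosen from the spread observed in the hit runs, and the classification is only as reliable as that spread is narrow.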
So far, I've only been concerned with making a distinction between a prefetch that hits in the L1D and a non-ignored prefetch that misses in both the L1D and the LFBs. Now I'll consider the rest of the cases:
If the prefetch results in a page fault or if the memory type of the target cache line is WC or UC, the prefetch is ignored. I don't know whether the L1D_PEND_MISS.PENDING event can be used to distinguish between a hit and this case. You can run an experiment in which the target address of the prefetch instruction is in a virtual page with no valid mapping or in one mapped to a kernel page, and check whether the change in the event count is unique with high probability.
If no LFBs are available, the prefetch is ignored. This case can be eliminated by switching off the sibling logical core and using cpuid instead of lfence before the first rdpmc.
If the prefetch hits in an LFB allocated for an RFO, ItoM, or a hardware prefetch request, then the prefetch is effectively redundant. For all of these types of requests, the change in the L1D_PEND_MISS.PENDING count may or may not be distinguishable from a hit in the L1D. This case can be eliminated by using cpuid instead of lfence before the first rdpmc and turning off the two L1D hardware prefetchers.
I don't think a prefetch to a prefetchable memory type can hit in a WCB because changing the memory type of a location is a fully serializing operation, so this case is not a problem.
One obvious advantage of using L1D_PEND_MISS.PENDING instead of the sum SWPF_HIT+SWPF_MISS is the smaller number of events. Another advantage is that L1D_PEND_MISS.PENDING is supported on some of the earlier microarchitectures. Also, as discussed above, it can be more powerful. It works on my Haswell with a threshold of 69-70 cycles.
If the L1D_PEND_MISS.PENDING event changes in different cases are not distinguishable, then the sum SWPF_HIT+SWPF_MISS can be used. These two events occur at the L2, so they only tell you whether the prefetch missed in the L1D and a request was sent to and accepted by the L2. If the request is rejected or hits in the L2's SQ, neither of the two events may occur. In addition, none of the aforementioned cases will be distinguishable from an L1D hit.
For normal demand loads, you can use MEM_LOAD_RETIRED.L1_HIT. If the load hits in the L1D, a single L1_HIT occurs. Otherwise, in any other case, no L1_HIT events occur, assuming that no other instruction between the two rdpmcs, such as cpuid, can generate L1_HIT events. You'll have to verify that cpuid doesn't generate L1_HIT events. Don't forget to count only user-mode events because an interrupt can occur between any two instructions and the interrupt handler may generate one or more L1_HIT events in kernel mode. While it's very unlikely, if you want to be 100% sure, check also whether the occurrence of an interrupt itself generates L1_HIT events.
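The raw config values in the test code above (0x0148, 0xc824, 0x2824, 0x01d1) follow the usual x86 PERF_TYPE_RAW layout: the event code in bits 0-7 and the umask in bits 8-15. The small sketch below makes that encoding explicit. The helper names are made up, other fields of the config word such as cmask, inv, and edge are deliberately ignored, and the fixed-counter index simply mirrors the (1 << 30) + n constants used in the question:

```c
#include <assert.h>
#include <stdint.h>

/* Build a PERF_TYPE_RAW config from an event code and a unit mask. */
uint64_t raw_config(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8);
}

/* ECX value for rdpmc to read fixed-function counter n (0, 1 or 2). */
uint32_t fixed_counter_index(unsigned n)
{
    assert(n < 3);
    return (UINT32_C(1) << 30) + n;
}
```

For example, L1D_PEND_MISS.PENDING (event=0x48, umask=0x01) encodes to 0x0148, and L2_RQSTS.SWPF_HIT (event 0x24, umask 0xc8) encodes to 0xc824, the same constants that appear as l1d_pending and swpf_hit in the question's code.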

WinSock: How to send() a PByte type?

First, I want to know whether the Delphi PByte type is equivalent to BYTE* (a byte pointer) in C++. If not, what is the closest Delphi equivalent of C++'s BYTE*?
Assuming I'm right that PByte corresponds to BYTE* in C++, how do I correctly send() this data type (PByte) using native WinSock, based on the following C++ code?
See:
C++:
SOCKET sock;
BITMAPINFO bmpInfo;
BYTE *bytes = NULL;
BYTE *temp_bytes = NULL;
DWORD workSpaceSize, fragmntWorkSpaceSize, size;
RtlGetCompressionWorkSpaceSize(COMPRESSION_FORMAT_LZNT1, &workSpaceSize, &fragmntWorkSpaceSize);
bytes = (BYTE *) Alloc(bmpInfo.bmiHeader.biSizeImage);
temp_bytes = (BYTE *) Alloc(bmpInfo.bmiHeader.biSizeImage);
BYTE *memory = (BYTE *) Alloc(workSpaceSize);
RtlCompressBuffer(COMPRESSION_FORMAT_LZNT1,
bytes,
bmpInfo.bmiHeader.biSizeImage,
temp_bytes,
bmpInfo.bmiHeader.biSizeImage,
2048,
&size,
memory);
free(bytes);
free(memory);
if(Send(sock, (char *) temp_bytes, size, 0) <= 0) return;
free(temp_bytes);
Delphi:
var
Sock: TSocket;
bmpInfo: TBitMapInfo;
bytes: PByte = nil;
temp_bytes: PByte = nil;
memory: PByte;
workSpaceSize, fragmntWorkSpaceSize, Size: Cardinal;
//...
RtlGetCompressionWorkSpaceSize(COMPRESSION_FORMAT_LZNT1, @workSpaceSize, @fragmntWorkSpaceSize);
bytes := AllocMem(bmpInfo.bmiHeader.biSizeImage);
temp_bytes := AllocMem(bmpInfo.bmiHeader.biSizeImage);
memory := AllocMem(workSpaceSize);
RtlCompressBuffer(COMPRESSION_FORMAT_LZNT1, bytes, bmpInfo.bmiHeader.biSizeImage,
temp_bytes, bmpInfo.bmiHeader.biSizeImage, 2048, @Size, memory);
FreeMem(bytes);
FreeMem(memory);
if send(Sock, temp_bytes^, Size, 0) <= 0 then Exit;
FreeMem(temp_bytes);
Reference to RtlGetCompressionWorkSpaceSize() and RtlCompressBuffer() functions in C++.
Reference to RtlGetCompressionWorkSpaceSize() and RtlCompressBuffer() functions in Delphi.

snprintf() return value when size=0

Thanks in advance for your time. I have a question about snprintf() when size=0, using the code below:
#include <stdio.h>
#include <stdlib.h>
int main(int ac, char **av)
{
char *str;
int len;
len = snprintf(NULL, 0, "%s %d", *av, ac);
printf("this string has length %d\n", len);
if (!(str = malloc((len + 1) * sizeof(char))))
return EXIT_FAILURE;
len = snprintf(str, len + 1, "%s %d", *av, ac);
printf("%s %d\n", str, len);
free(str);
return EXIT_SUCCESS;
}
When I run:
momo@xue5:~/TestCode$ ./Test_snprintf
The result is:
this string has length 17
./Test_snprintf 1 17
What confuses me is that the size passed to snprintf() is 0, so why is 17 displayed?
What did I miss?
Thanks~~
The answer can be found in the man page under "Return value":
The functions snprintf() and vsnprintf() do not write more than size bytes (including the terminating null byte ('\0')). If the output was truncated due to this limit then the return value is the number of characters (excluding the terminating null byte) which would have been written to the final string if enough space had been available.
This is so that you can do exactly what you do, a "trial print" to get the correct length, then allocate the buffer dynamically to get the whole output when you snprintf again to the allocated buffer.
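The same "trial print, then allocate" pattern can be wrapped in a reusable variadic helper. The sketch below is one way to do it; the name format_alloc is made up, and va_copy is needed because the argument list is consumed by the first vsnprintf() call:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly malloc'ed string of exactly
 * the right size. Returns NULL on a formatting or allocation failure.
 * The caller owns the result and must free() it. */
char *format_alloc(const char *fmt, ...)
{
    assert(fmt != NULL);

    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);  /* keep a second copy for the real print */

    /* Trial print: size 0 writes nothing and returns the needed length. */
    int len = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);

    char *out = NULL;
    if (len >= 0) {
        out = malloc((size_t)len + 1);  /* +1 for the terminating '\0' */
        if (out != NULL)
            vsnprintf(out, (size_t)len + 1, fmt, ap2);
    }
    va_end(ap2);
    return out;
}
```

A call such as format_alloc("%s %d", *av, ac) would replace both snprintf() calls and the malloc() in the program above.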