I'm trying to capture my desktop using the Desktop Duplication API, encode the D3DTexture2D using NVENC, and send it over the local network. Performance is very good until I reach the part where I need to lock the bitstream and extract the data. Below is the code used:
lockBitstreamData.outputBitstream = vOutputBuffer[m_iGot % m_nEncoderBuffer];
lockBitstreamData.doNotWait = false;
auto starti = std::chrono::system_clock::now();
NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - starti;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
std::cout << "finished computation at " << std::ctime(&end_time)
<< "elapsed time: " << elapsed_seconds.count() << "s\n";
uint8_t *pData = (uint8_t *)lockBitstreamData.bitstreamBufferPtr;
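if (vPacket.size() < i + 1) {
    // Grow the packet list before indexing into it.
    vPacket.push_back(std::vector<uint8_t>());
}
vPacket[i].clear();
vPacket[i].insert(vPacket[i].end(), &pData[0], &pData[lockBitstreamData.bitstreamSizeInBytes]);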
NVENC_API_CALL(m_nvenc.nvEncUnlockBitstream(m_hEncoder, lockBitstreamData.outputBitstream));
The "NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));" takes anything from under 10ms when under desktop at low load to an average of 90ms when I run a game in full screen under heavy load. Our constraints require "real-time" 60fps so anything over 16ms is too high. Is there a way to get that down?
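When timing a single blocking call like the lock above, a monotonic clock is usually a safer choice than std::chrono::system_clock, which can jump if the wall clock is adjusted. A minimal sketch, assuming a hypothetical helper named time_call_ms (it is not part of the NVENC SDK):

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of NVENC): runs `fn` once and returns the
// elapsed time in milliseconds, measured with a monotonic clock.
template <typename Fn>
double time_call_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The lock call could then be wrapped in a lambda and passed to time_call_ms, which keeps the measurement separate from the printing code.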
I have implemented a simple linear probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers, but the validity flag only takes up 8 bits. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25 % and 50 %, for different successful query rates on the x axis (0,25,50,75,100): As you can see, on Intel, the grey and blue (circle and square) lines almost overlap, the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we have more throughput.
In order to investigate this, I tried to build a smaller microbenchmark. It performs a similar workload, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little from there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
  asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;
  PaddedStruct() { reset(); }
  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
};
struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;
  PackedStruct() { reset(); }
  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
} __attribute__((__packed__));
int main() {
const uint64_t size = 134217728;
uint16_t repetitions = 0;
uint16_t advancement = 0;
std::cin >> repetitions;
std::cout << "Got " << repetitions << std::endl;
std::cin >> advancement;
std::cout << "Got " << advancement << std::endl;
std::cout << "Initializing." << std::endl;
std::vector<PaddedStruct> padded(size);
std::vector<PackedStruct> unaligned(size);
std::vector<uint64_t> queries(size);
// Initialize the structs with random values + prefault
std::random_device rd;
std::mt19937 gen{rd()};
std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
for (uint64_t i = 0; i < padded.size(); ++i) {
  padded[i].key = dist(gen);
  padded[i].value = dist(gen);
  padded[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
  unaligned[i].key = padded[i].key;
  unaligned[i].value = padded[i].value;
  unaligned[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
  queries[i] = dist2(gen);
}
std::cout << "Running benchmark." << std::endl;
auto start_padded = std::chrono::high_resolution_clock::now();
PaddedStruct* padded_ptr = nullptr;
uint64_t sum = 0;
for (uint16_t j = 0; j < repetitions; j++) {
  for (const uint64_t& query : queries) {
    for (uint16_t i = 0; i < advancement; i++) {
      padded_ptr = &padded[query + i];
      if (padded_ptr->is_valid) [[likely]] {
        sum += padded_ptr->value;
      }
    }
  }
}
auto end_padded = std::chrono::high_resolution_clock::now();
uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl; // print sum to avoid that it gets optimized out
auto start_unaligned = std::chrono::high_resolution_clock::now();
uint64_t sum2 = 0;
PackedStruct* packed_ptr = nullptr;
for (uint16_t j = 0; j < repetitions; j++) {
  for (const uint64_t& query : queries) {
    for (uint16_t i = 0; i < advancement; i++) {
      packed_ptr = &unaligned[query + i];
      if (packed_ptr->is_valid) [[likely]] {
        sum2 += packed_ptr->value;
      }
    }
  }
}
auto end_unaligned = std::chrono::high_resolution_clock::now();
uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press newline) and then enter 5 and press enter/newline. I updated the source code to (a) avoid loop unrolling by the compiler because repetition/advancement were hardcoded and (b) switch to pointers into that vector as it more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while Intel still benefits a little from the unaligned access in this microbenchmark, both the absolute and the relative improvement are higher on the AMD CPU. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access to a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' mail, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might have something to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
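The arithmetic behind that claim can be checked with a short sketch. It assumes 64-byte cache lines, an array that starts on a line boundary, and accesses that touch the whole element (the microbenchmark in the question only reads two fields, so this is an approximation). The function name straddle_fraction is hypothetical.

```cpp
#include <cstddef>

// Fraction of densely packed elements of `elem_size` bytes that straddle a
// cache-line boundary, assuming the array starts on a line boundary.
double straddle_fraction(std::size_t elem_size, std::size_t line_size = 64) {
    // Element offsets modulo the line size repeat after at most `line_size`
    // elements, so scanning that many elements covers whole periods.
    std::size_t straddling = 0;
    for (std::size_t i = 0; i < line_size; ++i) {
        const std::size_t first_line = (i * elem_size) / line_size;
        const std::size_t last_line = (i * elem_size + elem_size - 1) / line_size;
        if (first_line != last_line) ++straddling;
    }
    return static_cast<double>(straddling) / static_cast<double>(line_size);
}
```

Under these assumptions, both 24-byte and 17-byte elements straddle a line in a quarter of the cases, so a random whole-element access touches 1.25 lines on average for either layout.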
I'm starting my journey with microcontrollers and finding my way around the STM32F1 (Nucleo board with STM32F103RB). I am trying to learn to program it through direct register access, and I'm stuck on the first 'task': blinking an LED. I managed to turn the LED on, but I can't make it blink. What's strange is that when I debug (I work in Keil uVision) and look at the GPIOA peripheral, pin 5 (the LED is on PA5) has its tick going on and off, which means it should blink in reality. But it does not. I tried changing the delay and nothing happens. I'm stuck.
What am I doing wrong?
Here's my code:
#include "stm32f10x.h"
void delay(unsigned int ms){
    unsigned int i, j;
    for(i = 0; i < ms; i++)
        for(j = 0; j < 20000; j++);
}
int main(void){
    RCC->APB2ENR |= (1<<2);
    GPIOA->CRL |= ( (1<<21));
    GPIOA->CRL &= ~( (1<<22) | (1<<23) | (1<<20) );
    GPIOA->BSRR |= (1<<5);
    GPIOA->BSRR |= (1<<21);
}
As dunajski put it, are you sure that your delay function delays for 200 ms? Or to put it more directly: don't ever use NOPs as a delay. This might work in some specific cases on a specific chip/system, but you should assume it won't delay anything. NOP loops will be optimized out depending on the compiler, and even if they are not, they will have different runtimes on different frequencies/architectures.
Use sleep() or usleep() instead if you just want a delay. This will 'block' the controller for that time, so you can't do anything in the meantime, but for your testing this will suffice. Use a SysTick callback (or interrupts) if you want concurrent timing.
After enabling the clock you need to wait for this operation to propagate over the bus. This can be achieved by using a barrier instruction or by reading back from the register.
Do not use magic numbers. Use the definitions from CMSIS. Check that you set the correct mode.
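To check the mode without counting bits by hand, the CRL field for a pin can be computed. The sketch below is a host-side illustration only: the helper names are hypothetical (they are not CMSIS definitions), and it relies on the STM32F1 layout where each pin 0..7 owns four bits of GPIOx_CRL, MODE[1:0] in the low pair and CNF[1:0] in the high pair.

```cpp
#include <cstdint>

// Hypothetical helpers (not CMSIS). Valid for pins 0..7, which live in CRL.
constexpr std::uint32_t crl_field_mask(unsigned pin) {
    return 0xFu << (pin * 4u);
}

// mode: 0 = input, 1 = output 10 MHz, 2 = output 2 MHz, 3 = output 50 MHz.
// cnf:  0 = general-purpose push-pull when the pin is an output.
constexpr std::uint32_t crl_field_value(unsigned pin, unsigned mode, unsigned cnf) {
    return (((cnf & 0x3u) << 2u) | (mode & 0x3u)) << (pin * 4u);
}

// Returns the new CRL value with only the field of `pin` replaced.
constexpr std::uint32_t crl_configure(std::uint32_t crl, unsigned pin,
                                      unsigned mode, unsigned cnf) {
    return (crl & ~crl_field_mask(pin)) | crl_field_value(pin, mode, cnf);
}
```

For PA5 as a 2 MHz push-pull output this yields bit 21 set and bits 20, 22 and 23 cleared, which matches the configuration in the question.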
This is invalid:
GPIOA->BSRR |= (1<<5); // Set bit 5
delay(200); // delay some time
GPIOA->BSRR |= (1<<21); // Set bit 21 (not 5)
BSRR is write only and you should not read from it.
GPIOA->BSRR = (1<<5); // Set bit 5
delay(200); // delay some time
GPIOA->BSRR = (1<<21); // Set bit 21 (not 5)
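The reason a plain assignment is enough can be shown with a small host-side model of BSRR. This is a sketch with hypothetical helper names, assuming the usual STM32 behaviour: bits 0..15 set the matching pin, bits 16..31 reset it, zero bits are ignored, and set wins if both are written.

```cpp
#include <cstdint>

// Hypothetical helpers: BSRR bit that sets / resets a given pin (0..15).
constexpr std::uint32_t bsrr_set(unsigned pin)   { return 1u << pin; }
constexpr std::uint32_t bsrr_reset(unsigned pin) { return 1u << (pin + 16u); }

// Host-side model of what one BSRR write does to the 16-bit ODR value.
constexpr std::uint32_t apply_bsrr(std::uint32_t odr, std::uint32_t bsrr) {
    return ((odr & ~((bsrr >> 16u) & 0xFFFFu)) | (bsrr & 0xFFFFu)) & 0xFFFFu;
}
```

In this model, writing bit 21 clears pin 5 and leaves every other pin alone, so no read-modify-write is needed.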
If you want to use loops for delays do it a different way:
void delay(unsigned int ms)
{
    // volatile counters force the compiler to keep the loops
    volatile unsigned int i, j;
    for(i = 0; i < ms; i++)
        for(j = 0; j < 20000; j++)
            ;
}
Your function will be optimized out to the single return if you enable optimizations.
To be clear, I'm very new to STM32 and MBED programming, but is it possible to create a valid VGA signal using an STM32 Nucleo-F070RB board? I've picked this standard, and my goal is to display "something" on screen. By this I mean I should have control over which pixel I want to, well, turn on.
For demo, here is my (very crude) sketch:
#include <mbed.h>
DigitalOut led(LED1);
DigitalOut h_sync(PC_3);
DigitalOut v_sync(PC_2);
DigitalOut c_red(PC_0);
int main() {
    h_sync = 1;
    v_sync = 1;
    int line_count = 0;
    int color_red = 0;
    while(1) {
        // One loop iteration per scan line: pulse H-sync, then count the line.
        h_sync = 0;
        h_sync = 1;
        line_count++;
        color_red++;
        if (color_red == 16) { color_red = 0; c_red = !c_red; }
        if (line_count == 601) v_sync = 0;
        if (line_count == 605) v_sync = 1;
        if (line_count == 628) { line_count = 0; c_red = 0; color_red = 0; }
    }
}
I have connected V-Sync to the V-Sync of my monitor, the same with H-Sync, and c_red (color red) through a 560 Ohm resistor to the RED signal. And it (kind of) worked! It displayed red stripes every 16 lines. Perfect, but I need to be able to control every pixel (if that's possible). I've seen some VGA libraries, but I really need to write it myself, something very crude. I only want to have some sort of control over pixels, nothing fancy (just for me to learn :) ). Because I don't have much experience with STMs, after hours I was not able to "convince" my board to generate such "high-speed" signals. So, is it possible after all?
I was using MBED's Ticker function to generate the timing for each pixel, but it did not work: the fastest the Ticker went for me was around a few milliseconds, far too slow. Can I use timer interrupts? Or something else?
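A rough timing budget shows how tight this is. The sketch below assumes the mode is VESA 800x600 at 60 Hz (40 MHz pixel clock, 1056 pixel clocks per line, 628 lines per frame; only the 628 appears in the code above, the rest is an assumption), and the type and function names are hypothetical.

```cpp
#include <cstdint>

// Assumed timing parameters; only lines_per_frame = 628 comes from the sketch.
struct VideoTiming {
    std::uint64_t pixel_clock_hz;
    std::uint32_t clocks_per_line;
    std::uint32_t lines_per_frame;
};

constexpr VideoTiming kSvga60{40000000ull, 1056u, 628u};

// Duration of one scan line in nanoseconds (integer division).
constexpr std::uint64_t line_period_ns(const VideoTiming& t) {
    return t.clocks_per_line * 1000000000ull / t.pixel_clock_hz;
}

// CPU cycles available per scan line at a given core clock (rounded down).
constexpr std::uint64_t cpu_cycles_per_line(const VideoTiming& t, std::uint64_t cpu_hz) {
    return t.clocks_per_line * cpu_hz / t.pixel_clock_hz;
}
```

With these numbers a line lasts 26400 ns, and a core assumed to run at 48 MHz gets about 1267 cycles per line, i.e. barely more than one cycle per pixel clock. That suggests a per-pixel Ticker or interrupt cannot keep up, and that a crude design would hold each output value for many pixel clocks and let a hardware timer produce the sync pulses.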
I am trying to write 16-bit data to two selected 8-bit GPIO ports. I must split the data into LSB and MSB:
void LCD_write_command(uint16_t cmd) {
    GPIOD->ODR = cmd & 0x00ff; //lsb
    GPIOA->ODR = (GPIOA->ODR & 0x00ff) | (cmd >> 8); //msb
}
and read data :
uint16_t LCD_read_data(void) {
    // (here the GPIO is configured as input)
    volatile uint16_t data = 0;
    data = (uint16_t)GPIOD->IDR & 0x00ff; //lsb
    data |= (uint16_t)GPIOA->IDR << 8 ; // msb
    // (here the GPIO is configured as output)
    return data;
}
When I use one 16-bit GPIO port to write and read, everything is fine:
void LCD_write_command(uint16_t cmd) {
    GPIOD->ODR = cmd & 0xffff;
}
uint16_t LCD_read_data(void) {
    volatile uint16_t data = 0;
    data = (uint16_t)GPIOD->IDR & 0xffff;
    return data;
}
I really don't know what I'm missing.
void write_bits(uint16_t cmd)
{
    uint32_t data = GPIOA->ODR;
    data &= ~(0x1fff);
    data |= cmd & 0x1fff;           // cmd bits 0..12 -> PA0..PA12
    GPIOA->ODR = data;
    data = GPIOB->ODR;
    data &= ~(0x0007);
    data |= (cmd & 0xe000) >> 13;   // cmd bits 13..15 -> PB0..PB2
    GPIOB->ODR = data;
}
This preserves the other bits in each register.
You need to learn a bit more about bitwise operations.
void LCD_write_command(uint16_t cmd) {
    uint32_t tmp = GPIOD->ODR;
    tmp &= ~(0xff);
    tmp |= (cmd & 0x00ff);
    GPIOD->ODR = tmp; //lsb
    tmp = GPIOA->ODR;
    tmp &= ~(0xff);
    tmp |= (cmd >> 8);
    GPIOA->ODR = tmp; //msb
}
void LCD_write_command(uint16_t cmd) {
    *(volatile uint8_t *)&GPIOD->ODR = cmd & 0xff; //lsb
    *(volatile uint8_t *)&GPIOA->ODR = cmd >> 8; //msb
}
This version forces the compiler to use 8-bit store instructions.
Before using non-word access to the registers, check in the reference manual (RM) whether your micro allows it.
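The masking logic can be tried on a host machine before touching the hardware. The sketch below models the port registers as plain integers. The function and type names are hypothetical, and it assumes the low byte of each port carries the data, as in the question.

```cpp
#include <cstdint>

// Returns `odr` with only its low byte replaced (read-modify-write model).
constexpr std::uint32_t with_low_byte(std::uint32_t odr, std::uint8_t byte) {
    return (odr & ~0xFFu) | byte;
}

struct SplitWrite {
    std::uint32_t port_d;  // new ODR value for the LSB port
    std::uint32_t port_a;  // new ODR value for the MSB port
};

// Splits a 16-bit command across the low bytes of two ports.
inline SplitWrite split_command(std::uint32_t odr_d, std::uint32_t odr_a,
                                std::uint16_t cmd) {
    return SplitWrite{with_low_byte(odr_d, static_cast<std::uint8_t>(cmd & 0xFFu)),
                      with_low_byte(odr_a, static_cast<std::uint8_t>(cmd >> 8))};
}

// Rebuilds the 16-bit value from the low bytes of the two input registers.
constexpr std::uint16_t join_data(std::uint32_t idr_d, std::uint32_t idr_a) {
    return static_cast<std::uint16_t>((idr_d & 0xFFu) | ((idr_a & 0xFFu) << 8u));
}
```

Note that both ports are masked with ~0xFF before the new byte is OR-ed in, so the upper pins keep their state and stale data bits are cleared.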
None of your suggested code works for me. Below is the original source code for handling the LCD:
void LCD_write_command(uint16_t cmd) {
    GPIOB->BRR = LCD_CS; // LCD_CS low (chip select pull)
    GPIOB->BRR = LCD_RS; // LCD_RS low (register select = instruction)
    //GPIOA->ODR = cmd; // put cmd to PortA (full length)
    // put cmd [0..12] bits to PortA (actual LCD_DB00..LCD_DB12)
    // put cmd [13..15] bits to PortB (actual LCD_DB13..LCD_DB15)
    GPIOA->ODR = cmd & 0x1fff;
    GPIOB->ODR = (GPIOB->ODR & 0xfff8) | (cmd >> 13);
    GPIOB->BRR = LCD_WR; // pull LCD_WR to low (write strobe start)
    // Write strobe 66ns long by datasheet. GPIO speed on STM32F103 at 72MHz slower -> delay is unnecessary
    // asm volatile ("nop");
    GPIOB->BSRR = LCD_WR; // pull LCD_WR to high (write strobe end)
    GPIOB->BSRR = LCD_CS; // LCD_CS high (chip select release)
}
I've checked the RM and I can operate on bytes, half-words and words.
I had a similar problem: I needed to write several bits to a port at the same time.
My output bits are on port B: 0, 1, 2, 4, 5, 6, 7, 8, 9 (note the missing bit 3).
It is possible to read the whole port (ODR), modify the bits, and write it back.
A faster version is to set all bits:
GPIOB->BSRR = 0b00000000000000000000001111110111;
and then clear only the ones needed:
GPIOB->BRR = some_bits_to_clear;
It behaved like a custom 74LS154 chip.
Or even clear and set bits at the same time using:
GPIOB->BSRR = (bits_to_clear << 16) | bits_to_set;
Depending on the situation, there are many ways to optimize the code.
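The combined set-and-clear write can be sketched as follows. This is a host-side illustration with hypothetical names, assuming the pin layout described above (port B bits 0, 1, 2 and 4..9 carry a 9-bit value, bit 3 is skipped).

```cpp
#include <cstdint>

// Port bits used for data: 0, 1, 2, 4..9 (bit 3 belongs to something else).
constexpr std::uint32_t kDataMask = 0x3F7u;

// Spreads a 9-bit value over the data pins, skipping port bit 3.
constexpr std::uint32_t spread_bits(std::uint32_t value) {
    return (value & 0x007u) | ((value & 0x1F8u) << 1u);
}

// One BSRR word: sets the 1-bits and resets the 0-bits of the data pins,
// leaving every other pin of the port untouched.
constexpr std::uint32_t bsrr_word(std::uint32_t value) {
    return ((kDataMask & ~spread_bits(value)) << 16u) | spread_bits(value);
}
```

A single assignment of bsrr_word(value) to GPIOB->BSRR would then update all nine pins in one bus write, without reading ODR first.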
I am currently in the process of building an application that reads in audio from my iPhone's microphone, and then does some processing and visuals. Of course I am starting with the audio stuff first, but am having one minor problem.
I am defining my sampling rate to be 44100 Hz and defining my buffer to hold 4096 samples, which it does. However, when I print this data out and copy it into MATLAB to double-check accuracy, the sample rate I have to use is half of my iPhone-defined rate, or 22050 Hz, for it to be correct.
I think it has something to do with the following code and how it is putting 2 bytes per packet, and when I am looping through the buffer, the buffer is spitting out the whole packet, which my code assumes is a single number. So what I am wondering is how to split up those packets and read them as individual numbers.
- (void)setupAudioFormat {
memset(&dataFormat, 0, sizeof(dataFormat));
dataFormat.mSampleRate = kSampleRate;
dataFormat.mFormatID = kAudioFormatLinearPCM;
dataFormat.mFramesPerPacket = 1;
dataFormat.mChannelsPerFrame = 1;
// dataFormat.mBytesPerFrame = 2;
// dataFormat.mBytesPerPacket = 2;
dataFormat.mBitsPerChannel = 16;
dataFormat.mReserved = 0;
dataFormat.mBytesPerPacket = dataFormat.mBytesPerFrame = (dataFormat.mBitsPerChannel / 8) * dataFormat.mChannelsPerFrame;
dataFormat.mFormatFlags =
kLinearPCMFormatFlagIsSignedInteger |
If what I described is unclear, please let me know. Thanks!
Adding the code that I used to print the data
float *audioFloat = (float *)malloc(numBytes * sizeof(float));
int *temp = (int*)inBuffer->mAudioData;
int i;
float power = pow(2, 31);
for (i = 0;i<numBytes;i++) {
    audioFloat[i] = temp[i]/power;
    printf("%f ",audioFloat[i]);
}
I found the problem with what I was doing. It was a C pointer issue, and since I have never really programmed in C before, I of course got it wrong.
You cannot directly cast inBuffer->mAudioData to an int array, because the samples are 16 bits wide. What I did instead was the following:
SInt16 *buffer = (SInt16 *)inBuffer->mAudioData;
This worked out just fine and now my data is of correct length and the data is represented properly.
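The same idea, reading the buffer as 16-bit samples and deriving the sample count from the byte count, can be written as a small helper. This is a sketch in C++ (usable from Objective-C++) with a hypothetical function name. It assumes mono, native-endian, 16-bit signed PCM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interprets `num_bytes` of raw 16-bit signed PCM and returns the samples
// scaled to the range [-1.0, 1.0).
std::vector<float> pcm16_to_float(const void* data, std::size_t num_bytes) {
    const auto* samples = static_cast<const std::int16_t*>(data);
    const std::size_t count = num_bytes / sizeof(std::int16_t);  // 2 bytes each
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = static_cast<float>(samples[i]) / 32768.0f;
    }
    return out;
}
```

Reading the same bytes through a 32-bit int pointer yields half as many values, which is consistent with the halved sample rate observed in MATLAB.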
I saw your answer. There is also an underlying issue that gives wrong sample data: the bytes of each sample are swapped because of an endianness issue.
-(void)feedSamplesToEngine:(UInt32)audioDataBytesCapacity audioData:(void *)audioData {
int sampleCount = audioDataBytesCapacity / sizeof(SAMPLE_TYPE);
SAMPLE_TYPE *samples = (SAMPLE_TYPE*)audioData;
//SAMPLE_TYPE *sample_le = (SAMPLE_TYPE *)malloc(sizeof(SAMPLE_TYPE)*sampleCount );//for swapping endians
std::string shorts;
double power = pow(2,10);
for(int i = 0; i < sampleCount; i++)
SAMPLE_TYPE sample_le = (0xff00 & (samples[i] << 8)) | (0x00ff & (samples[i] >> 8)) ; //Endianess issue
char dataInterim[30];
sprintf(dataInterim,"%f ", sample_le/power); // normalize it.