Unaligned access performance on Intel x86 vs AMD x86 CPUs - x86-64

I have implemented a simple linear probing hash map with an array of structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers, but the entry only takes up 8 bools. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25 % and 50 %, for different successful query rates on the x axis (0,25,50,75,100): As you can see, on Intel, the grey and blue (circle and square) lines almost overlap, the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we have more throughput.
In order to investigate this, I tried to built a smaller microbenchmark. In this microbenchmark, we perform a similar benchmark, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
uint64_t key;
uint64_t value;
bool is_valid;
PaddedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
};
struct PackedStruct {
uint64_t key;
uint64_t value;
uint8_t is_valid;
PackedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
} __attribute__((__packed__));
int main() {
const uint64_t size = 134217728;
uint16_t repetitions = 0;
uint16_t advancement = 0;
std::cin >> repetitions;
std::cout << "Got " << repetitions << std::endl;
std::cin >> advancement;
std::cout << "Got " << advancement << std::endl;
std::cout << "Initializing." << std::endl;
std::vector<PaddedStruct> padded(size);
std::vector<PackedStruct> unaligned(size);
std::vector<uint64_t> queries(size);
// Initialize the structs with random values + prefault
std::random_device rd;
std::mt19937 gen{rd()};
std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
for (uint64_t i = 0; i < padded.size(); ++i) {
padded[i].key = dist(gen);
padded[i].value = dist(gen);
padded[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
unaligned[i].key = padded[i].key;
unaligned[i].value = padded[i].value;
unaligned[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
queries[i] = dist2(gen);
}
std::cout << "Running benchmark." << std::endl;
ClobberMemory();
auto start_padded = std::chrono::high_resolution_clock::now();
PaddedStruct* padded_ptr = nullptr;
uint64_t sum = 0;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
padded_ptr = &padded[query + i];
if (padded_ptr->is_valid) [[likely]] {
sum += padded_ptr->value;
}
}
doNotOptimize(sum);
}
}
ClobberMemory();
auto end_padded = std::chrono::high_resolution_clock::now();
uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl; // print sum to avoid that it gets optimized out
ClobberMemory();
auto start_unaligned = std::chrono::high_resolution_clock::now();
uint64_t sum2 = 0;
PackedStruct* packed_ptr = nullptr;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
packed_ptr = &unaligned[query + i];
if (packed_ptr->is_valid) [[likely]] {
sum2 += packed_ptr->value;
}
}
doNotOptimize(sum2);
}
}
ClobberMemory();
auto end_unaligned = std::chrono::high_resolution_clock::now();
uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press newline) and then enter 5 and press enter/newline. I updated the source code to (a) avoid loop unrolling by the compiler because repetition/advancement were hardcoded and (b) switch to pointers into that vector as it more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while in this microbenchmark, Intel still benefits a little from the unaligned access, for the AMD CPU, both the absolute and relative improvement is higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds mail, I could not find any other documentation of cache fetch widths in processors and especially not for my concrete two CPUs to figure out if that might somehow have to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779

The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.

The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structure.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/

Related

nvEncLockBitstream low performance on NVENC

I'm trying to capture my desktop using Desktop Duplication API, encode the D3DTexture2D using NVENC and send it over the local network. The performance of everything is very high until I reach the part where we need to lock the bitstream and extract the data. Below is the code used:
NV_ENC_LOCK_BITSTREAM lockBitstreamData = { NV_ENC_LOCK_BITSTREAM_VER };
lockBitstreamData.outputBitstream = vOutputBuffer[m_iGot % m_nEncoderBuffer];
lockBitstreamData.doNotWait = false;
auto starti = std::chrono::system_clock::now();
NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - starti;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
std::cout << "finished computation at " << std::ctime(&end_time)
<< "elapsed time: " << elapsed_seconds.count() << "s\n";
uint8_t *pData = (uint8_t *)lockBitstreamData.bitstreamBufferPtr;
if (vPacket.size() < i + 1)
{
vPacket.push_back(std::vector<uint8_t>());
}
vPacket[i].clear();
vPacket[i].insert(vPacket[i].end(), &pData[0], &pData[lockBitstreamData.bitstreamSizeInBytes]);
i++;
NVENC_API_CALL(m_nvenc.nvEncUnlockBitstream(m_hEncoder, lockBitstreamData.outputBitstream));
The "NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));" takes anything from under 10ms when under desktop at low load to an average of 90ms when I run a game in full screen under heavy load. Our constraints require "real-time" 60fps so anything over 16ms is too high. Is there a way to get that down?

EditorGuiLayout.MaskField issue with large enums

I'm working on an input system that would allow the user to translate input mappings between different input devices and operating systems and potentially define their own.
I'm trying to create a MaskField for an editor window where the user can select from a list of RuntimePlatforms, but selecting individual values results in multiple values being selected.
Mainly for debugging I set it up to generate an equivalent enum RuntimePlatformFlags that it uses instead of RuntimePlatform:
[System.Flags]
public enum RuntimePlatformFlags: long
{
OSXEditor=(0<<0),
OSXPlayer=(0<<1),
WindowsPlayer=(0<<2),
OSXWebPlayer=(0<<3),
OSXDashboardPlayer=(0<<4),
WindowsWebPlayer=(0<<5),
WindowsEditor=(0<<6),
IPhonePlayer=(0<<7),
PS3=(0<<8),
XBOX360=(0<<9),
Android=(0<<10),
NaCl=(0<<11),
LinuxPlayer=(0<<12),
FlashPlayer=(0<<13),
LinuxEditor=(0<<14),
WebGLPlayer=(0<<15),
WSAPlayerX86=(0<<16),
MetroPlayerX86=(0<<17),
MetroPlayerX64=(0<<18),
WSAPlayerX64=(0<<19),
MetroPlayerARM=(0<<20),
WSAPlayerARM=(0<<21),
WP8Player=(0<<22),
BB10Player=(0<<23),
BlackBerryPlayer=(0<<24),
TizenPlayer=(0<<25),
PSP2=(0<<26),
PS4=(0<<27),
PSM=(0<<28),
XboxOne=(0<<29),
SamsungTVPlayer=(0<<30),
WiiU=(0<<31),
tvOS=(0<<32),
Switch=(0<<33),
Lumin=(0<<34),
BJM=(0<<35),
}
In this linked screenshot, only the first 4 options were selected. The integer next to "Platforms: " is the mask itself.
I'm not a bitwise wizard by a large margin, but my assumption is that this occurs because EditorGUILayout.MaskField returns a 32bit int value, and there are over 32 enum options. Are there any workarounds for this or is something else causing the issue?
First thing I've noticed is that all values inside that Enum is the same because you are shifting 0 bits to left. You can observe this by logging your values with this script.
// Shifts 0 bits to the left, printing "0" 36 times.
for(int i = 0; i < 36; i++){
Debug.Log(System.Convert.ToString((0 << i), 2));
}
// Shifts 1 bits to the left, printing values up to 2^35.
for(int i = 0; i < 36; i++){
Debug.Log(System.Convert.ToString((1 << i), 2));
}
The reason inheriting from long does not work alone, is because of bit shifting. Check out this example I found about the issue:
UInt32 x = ....;
UInt32 y = ....;
UInt64 result = (x << 32) + y;
The programmer intended to form a 64-bit value from two 32-bit ones by shifting 'x' by 32 bits and adding the most significant and the least significant parts. However, as 'x' is a 32-bit value at the moment when the shift operation is performed, shifting by 32 bits will be equivalent to shifting by 0 bits, which will lead to an incorrect result.
So you should also cast the shifting bits. Like this:
public enum RuntimePlatformFlags : long {
OSXEditor = (1 << 0),
OSXPlayer = (1 << 1),
WindowsPlayer = (1 << 2),
OSXWebPlayer = (1 << 3),
// With literals.
tvOS = (1L << 32),
Switch = (1L << 33),
// Or with casts.
Lumin = ((long)1 << 34),
BJM = ((long)1 << 35),
}

Understanding CRC32 value as division remainder

I'm struggling with understanding CRC algorithm. I've been reading this tutorial and if I got it correctly a CRC value is just a remainder of a division where message serves as the dividend and the divisor is a predefined value - carried out in a special kind of polynomial arithmetic. It looked quote simple so I tried implementing CRC-32:
public static uint Crc32Naive(byte[] bytes)
{
uint poly = 0x04c11db7; // (Poly)
uint crc = 0xffffffff; // (Init)
foreach (var it in bytes)
{
var b = (uint)it;
for (var i = 0; i < 8; ++i)
{
var prevcrc = crc;
// load LSB from current byte into LSB of crc (RefIn)
crc = (crc << 1) | (b & 1);
b >>= 1;
// subtract polynomial if we've just popped out 1
if ((prevcrc & 0x80000000) != 0)
crc ^= poly;
}
}
return Reverse(crc ^ 0xffffffff); // (XorOut) (RefOut)
}
(where Reverese function reverses bit order)
It is supposed to be analogous to following method of division (with some additional adjustments):
1100001010
_______________
10011 ) 11010110110000
10011,,.,,....
-----,,.,,....
10011,.,,....
10011,.,,....
-----,.,,....
00001.,,....
00000.,,....
-----.,,....
00010,,....
00000,,....
-----,,....
00101,....
00000,....
-----,....
01011....
00000....
-----....
10110...
10011...
-----...
01010..
00000..
-----..
10100.
10011.
-----.
01110
00000
-----
1110 = Remainder
For: 0x00 function returns 0xd202ef8d which is correct, but for 0x01 - 0xd302ef8d instead of 0xa505df1b (I've been using this page to verify my results).
Solution from my implementation has more sense to me: incrementing dividend by 1 should only change reminder by 1, right? But it turns out that the result should look completely different. So apparently I am missing something obvious. What is it? How can changing the least significant number in a dividend influence the result this much?
This is an example of a left shifting CRC that emulates division, with the CRC initialized = 0, and no complementing or reversing of the crc. The example code is emulating a division where 4 bytes of zeroes are appended to bytes[] ({bytes[],0,0,0,0} is the dividend, the divisor is 0x104c11db7, the quotient is not used, and the remainder is the CRC).
public static uint Crc32Naive(byte[] bytes)
{
uint poly = 0x04c11db7; // (Poly is actually 0x104c11db7)
uint crc = 0; // (Init)
foreach (var it in bytes)
{
crc ^= (((int)it)<<24); // xor next byte to upper 8 bits of crc
for (var i = 0; i < 8; ++i) // cycle the crc 8 times
{
var prevcrc = crc;
crc = (crc << 1);
// subtract polynomial if we've just popped out 1
if ((prevcrc & 0x80000000) != 0)
crc ^= poly;
}
}
return crc;
}
It's common to initialize the CRC to something other than zero, but it's not that common to post-complement the CRC, and I'm not aware of any CRC that does a post bit reversal of the CRC.
Another variations of CRC is one that right shifts, normally used to emulate hardware where data is sent in bytes least significant bit first.

packed structure size in C, is this correct?

I found it in some exsiting code, it looks some problems, but the code works fine, can you help if this piece of code has any tricking things in.
why ignore two unsigned when calculate the size of the structure?
tmsg_sz = sizeof(plfm_xml_header_t) + sizeof(oid_t) + sizeof(char*)
+ sizeof(unsigned) + sizeof(snmp_varbind_t)*5 ;
tmsg = (snmp_trap_t*) malloc(tmsg_sz);
if (!tmsg) {
PRINTF("malloc failed \n");
free(trap_msg);
return -1;
}
memset (tmsg, 0, tmsg_sz);
tmsg->hdr.type = PLFM_SNMPTRAP_MSG;
copy_oid_oidt(clog_msg_gen_notif_oid, OID_LENGTH(clog_msg_gen_notif_oid), &tmsg->oid);
tmsg->trap_type = SNMP_TRAP_ENTERPRISESPECIFIC;
tmsg->trap_specific = 1;
tmsg->trapmsg = strdup("Trap Message");
tmsg->numofvar = 5;
build_snmp_varbind(&(tmsg->vars[0]), facility, STR_DATA_TYPE, sizeof(facility)+1, clog_hist_facility_oid, 14);
build_snmp_varbind(&(tmsg->vars[1]), &sev, U32_DATA_TYPE, sizeof(sev),clog_hist_severity_oid, 14);
build_snmp_varbind(&(tmsg->vars[2]), name, STR_DATA_TYPE, sizeof(name)+1, clog_hist_msgname_oid, 14);
build_snmp_varbind(&(tmsg->vars[3]), trap_msg, STR_DATA_TYPE, strlen(trap_msg)+1,clog_hist_msgtext_oid, 14);
// get system uptime
long uptime = get_uptime();
build_snmp_varbind(&(tmsg->vars[4]), (long*)&uptime, TMR_DATA_TYPE, sizeof(uptime),clog_hist_timestamp_oid, 14);
typedef struct snmp_trap_s {
plfm_xml_header_t hdr;
oid_t oid; /* trap oid */
unsigned trap_type;
unsigned trap_specific;
char *trapmsg; /* text message for this trap */
unsigned numofvar;
snmp_varbind_t vars[0];
} __attribute__((__packed__)) snmp_trap_t;
Compilers try hard to put multibyte data aligned in various ways. For example, an int variable, in an architecture where sizeof int == 4, may need to be placed in a location divisible by 4. This may be a hard requirement, or this may just make the system more efficient; it depends on the computer. So, consider
typedef struct combo {
char c;
int i;
} combo;
Depending on the architecture, sizeof combo may be 5, 6, or most often 8. Swap the two members, and the size should be 5.
typedef struct combo2 {
int i;
char c;
} combo2;
However, an array of combo2s may have a size you do not expect:
combo2 cb[2];
The size of cb could very well be 16, as 3 bytes of wasted space follow combo2[0] and combo2[1]. This lets combo2[1].i start at a location divisible by 4.
A recommendation is to order the members of a structure by size; the 8-byte members should precede the 4-byte members, then the 2-byte members, then the 1-byte members. Of course, you have to be aware of typical sizes, and you can't be working on an oddball architecture where characters are not packed into larger words. Cray? cough-cough.

Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays and have do the following bitwise AND operations among the elements of them.
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y++)
for(int x=0; x<a_lenx; x++)
{
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap ;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iphone devices.
How could I accelerate this job on iPhone ?
I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...
Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap = 0;
uint32_t* a_int = (uint32_t*)a;
uint32_t* b_int = (uint32_t*)b;
a_leny = a_leny / 4;
a_lenx = a_lenx / 4;
a_pitch = a_pitch / 4;
b_pitch = b_pitch / 4;
for(int y=0; y<a_leny_int; y++)
for(int x=0; x<a_lenx_int; x++)
{
uint32_t aVal = a_int[x + y * a_pitch_int];
uint32_t bVal = b_int[x+y*b_pitch_int];
if (aVal & 0xFF) & (bVal & 0xFF)
overlap++;
if ((aVal >> 8) & 0xFF) & ((bVal >> 8) & 0xFF)
overlap++;
if ((aVal >> 16) & 0xFF) & ((bVal >> 16) & 0xFF)
overlap++;
if ((aVal >> 24) & 0xFF) & ((bVal >> 24) & 0xFF)
overlap++;
}
return overlap ;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y+= 10)
for(int x=0; x<a_lenx; x+= 10)
{
//we compare 1% of all the pixels, and use that as the result
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.
Your code is Rambo for the CPU - its worst nightmare :
byte access. Like aroth mentioned, ARM is VERY slow reading bytes from memory
random access. Two absolutely unnecessary multiply/add operations in addition to the already steep performance penalty by its nature.
Simply put, everything is wrong that can be wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny)
{
unsigned int overlap =0;
unsigned int a_gap = (a_pitch - a_lenx)>>2;
unsigned int b_gap = (b_pitch - a_lenx)>>2;
unsigned int aval, bval, xcount;
do
{
xcount = (a_lenx>>2);
do
{
aval = *a++;
// ldr aval, [a], #4
bval = *b++;
// ldr bavl, [b], #4
aval &= bval;
// and aval, aval, bval
if (aval & 0x000000ff) overlap += 1;
// tst aval, #0x000000ff
// addne overlap, overlap, #1
if (aval & 0x0000ff00) overlap += 1;
// tst aval, #0x0000ff00
// addne overlap, overlap, #1
if (aval & 0x00ff0000) overlap += 1;
// tst aval, #0x00ff0000
// addne overlap, overlap, #1
if (aval & 0xff000000) overlap += 1;
// tst aval, #0xff000000
// addne overlap, overlap, #1
} while (--xcount);
a += a_gap;
b += b_gap;
} while (--a_leny);
return overlap;
}
First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.