Optimizing NSNumber numberWithInt: - iphone

I am profiling an iPhone app and I noticed a strange pattern. In a certain block of code that's called quite frequently...
[item setQuadrant:[NSNumber numberWithInt:a]];
[item setIndex:[NSNumber numberWithInt:b]];
[item setTimestamp:[NSNumber numberWithInt:c]];
[item setState:[NSNumber numberWithInt:d]];
[item setCompletionPercentage:[NSNumber numberWithInt:e]];
[item setId_:[NSNumber numberWithInt:f]];
...the first call to [NSNumber numberWithInt:] takes an inordinate amount of time, on the order of 10-15x that of the remaining calls. I've verified that the results are consistent if I shuffle the lines (the first line is always the slow one, by the same ratio). Is there something going on that I'm not aware of?
Perhaps this happens because this block is inside a try/catch?

If I had to guess, NSNumber runs some code in its +initialize implementation (which the runtime invokes lazily, just before the first message is sent to a class), and that slows down the initial call to the class. Also note that NSNumber caches its return value, so later calls to +numberWithInt: with the same value are faster than the first one; that could be part of the issue.

Maybe the first value is much larger? Apart from CoreFoundation's CFNumber caching, the "new" runtime uses tagged pointers, allowing integers that fit in 24 bits to be encoded directly into the pointer. The runtime then figures out that it's a tagged pointer by looking at the last bit (and that it's a CFNumber by looking at the 3 bits before the last bit, and the target number size - 8, 16, 32, 64 bit - using the next 4 bits before that).
On a 32-bit system (current iPhones), that means that for 32-bit numbers outside that 24-bit range (very negative or large positive values), CoreFoundation will allocate an object. For everything else, it uses the following fast path, which is much cheaper:
case kCFNumberSInt32Type: {
int32_t value = *(int32_t *)valuePtr; // this loads the actual numerical value passed to CFNumberCreate()
#if !__LP64__
// We don't bother allowing the min 24-bit integer -2^23 to also be fast-pathed;
// tell anybody that complains about that to go ... hang.
int32_t limit = (1L << 23);
if (value <= -limit || limit <= value) break;
#endif
uintptr_t ptr_val = ((uintptr_t)((intptr_t)value << 8) | (2 << 6) | kCFTaggedObjectID_Integer);
return (CFNumberRef)ptr_val;
}
(note that !__LP64__ is true for 32-bit systems)
Taken from: http://www.opensource.apple.com/source/CF/CF-744.12/CFNumber.c
Also, there is a caching mechanism that prevents a range of numbers from being re-created multiple times, just search for "__CFNumberCache" in the same source file.
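To make the fast path above easier to follow, here is a minimal, standalone C sketch of the same range check and packing. The names fits_fast_path, encode_tagged and decode_tagged are made up for illustration, and TAG_INTEGER is a placeholder standing in for kCFTaggedObjectID_Integer; its numeric value here is an assumption, not something taken from the quoted source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for kCFTaggedObjectID_Integer; the value is an assumption. */
#define TAG_INTEGER ((uintptr_t)0x7)

/* True when value lies strictly between -2^23 and 2^23, mirroring the
   range check in the quoted 32-bit branch. */
bool fits_fast_path(int32_t value)
{
    const int32_t limit = (int32_t)1 << 23;
    return !(value <= -limit || limit <= value);
}

/* Pack the value into the upper bits of a pointer-sized word.
   Multiplying by 256 has the same effect as shifting left by 8, but
   stays well-defined for negative values in standard C. */
uintptr_t encode_tagged(int32_t value)
{
    return (uintptr_t)((intptr_t)value * 256) | ((uintptr_t)2 << 6) | TAG_INTEGER;
}

/* Recover the value. This assumes >> on a negative signed integer is an
   arithmetic shift, which holds on mainstream compilers. */
int32_t decode_tagged(uintptr_t word)
{
    return (int32_t)((intptr_t)word >> 8);
}
```

Values that fail fits_fast_path are the ones for which CoreFoundation falls back to allocating a real object on 32-bit systems, which would be consistent with one call being noticeably slower than the others.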

Why are the hex numbers for big endian different than little endian?

#include <stdio.h>

typedef unsigned char *byte_pointer;

/* Print len bytes starting at start, in increasing address order. */
void show_bytes(byte_pointer start, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
    {
        printf(" %.2x", start[i]);
    }
    printf("\n");
}
void show_int(int x)
{
    show_bytes((byte_pointer) &x, sizeof(int));
}
void show_float(float x)
{
    show_bytes((byte_pointer) &x, sizeof(float));
}
void show_pointer(void *x)
{
    show_bytes((byte_pointer) &x, sizeof(void *));
}
int main(void)
{
    int a = 0x12345678;
    byte_pointer ap = (byte_pointer) &a;
    show_bytes(ap, 3); /* print the first three bytes of a */
    return 0;
}
(Solutions according to the CS:APP book)
Big endian: 12 34 56
Little endian: 78 56 34
I know systems have different conventions for storage allocation but if two systems use the same convention but are different endian why are the hex values different?
Endianness is an issue that arises when we use more than one storage location for a value/type, which we do because some things won't fit in a single storage location.
As soon as we use multiple storage locations for a single value that gives rise to the question of:  What part of the value will we store in each storage location?
The first byte of a two-byte item will have a lower address than the second byte, and in particular, the address of the second byte will be at +1 from the address of the first byte.
Storing a two-byte item in two bytes of storage, do we store the most significant byte first and the least significant byte second, or vice versa?
We choose to use directly consecutive bytes for the two bytes of the two-byte item, so no matter which (endian) way we choose to store such an item, we refer to the whole two-byte item by the lower address (the address of its first byte).
We can express these storage choices with formulas; here item[0] refers to the first byte while item[1] refers to the second byte.
item[0] = value >> 8 // also value / 256
item[1] = value & 0xFF // also value % 256
value = (item[0]<<8) | item[1] // also item[0]*256 | item[1]
--vs--
item[0] = value & 0xFF // also value % 256
item[1] = value >> 8 // also value / 256
value = item[0] | (item[1]<<8) // also item[0] | item[1]*256
The first set of formulas is for big endian, and the second for little endian.
By these formulas, it doesn't matter in what order we access memory (item[0] first, then item[1], or vice versa, or both at the same time, as is common in hardware), as long as the formulas for one endianness are used consistently.
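The two sets of formulas translate directly into C. The following is a small sketch; the function names are chosen here for illustration and do not come from the question or the book.

```c
#include <stdint.h>

/* Big endian: most significant byte goes to the lower address. */
void store_be16(uint8_t item[2], uint16_t value)
{
    item[0] = (uint8_t)(value >> 8);
    item[1] = (uint8_t)(value & 0xFF);
}

uint16_t load_be16(const uint8_t item[2])
{
    return (uint16_t)((item[0] << 8) | item[1]);
}

/* Little endian: least significant byte goes to the lower address. */
void store_le16(uint8_t item[2], uint16_t value)
{
    item[0] = (uint8_t)(value & 0xFF);
    item[1] = (uint8_t)(value >> 8);
}

uint16_t load_le16(const uint8_t item[2])
{
    return (uint16_t)(item[0] | (item[1] << 8));
}
```

Storing with one convention and loading with the other swaps the two bytes, which is why the same convention has to be used on both sides.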
If the item in question is a four-byte value, then there are 4! = 24 possible orderings, though only two of them are truly sensible.
For efficiency, the hardware offers us multibyte memory access in one instruction (and with one reference, namely to the lowest address of the multibyte item), and therefore, the hardware itself needs to define and consistently use one of the two possible/reasonable orderings.
If the hardware did not offer multibyte memory access, then the ordering would be entirely up to the software program itself to define (accessing memory one byte at a time), and the program could choose big or little endian, even differently for each variable, as long as it consistently accesses the multiple bytes of memory in the same manner to reassemble the values stored there.
In a similar manner, when we define a structure of multiple items (e.g. struct point { int x; int y; }), software chooses whether x comes first or y comes first in memory ordering.  However, since programmers (and compilers) will still choose to use hardware instructions to access individual fields such as x in one go, the hardware's endian configuration remains necessary.
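Applied to the four-byte value from the question, the same idea can be sketched in C as follows. The helpers store_be32, store_le32 and host_is_little_endian are illustrative names, not part of the original code; the first two lay the bytes out in software, independent of the machine they run on.

```c
#include <stdint.h>
#include <string.h>

/* Most significant byte first. */
void store_be32(uint8_t out[4], uint32_t value)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * (3 - i)));
}

/* Least significant byte first. */
void store_le32(uint8_t out[4], uint32_t value)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * i));
}

/* Ask the hardware which layout it uses for its own multibyte stores. */
int host_is_little_endian(void)
{
    const uint32_t probe = 0x12345678;
    uint8_t first;
    memcpy(&first, &probe, 1); /* byte at the lowest address */
    return first == 0x78;
}
```

For 0x12345678, the first three bytes written by store_be32 are 12 34 56 and those written by store_le32 are 78 56 34, matching the two answers quoted in the question.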

Transferring arrays/classes/records between locales

In a typical N-Body simulation, at the end of each epoch, each locale would need to share its own portion of the world (i.e. all bodies) to the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I encountered some strange behaviours that I couldn't make sense out of, so I decided to make a test program, in which things got more complicated. Here's the code to replicate the experiment.
proc log(args...?n) {
writeln("[locale = ", here.id, "] [", datetime.now(), "] => ", args);
}
const max: int = 50000;
record stuff {
var x1: int;
var x2: int;
proc init() {
this.x1 = here.id;
this.x2 = here.id;
}
}
class ctuff {
var x1: int;
var x2: int;
proc init() {
this.x1 = here.id;
this.x2 = here.id;
}
}
class wrapper {
// The point is that total size (in bytes) of data in `r`, `c` and `a` are the same here, because the record and the class hold two ints per index.
var r: [{1..max / 2}] stuff;
var c: [{1..max / 2}] owned ctuff?;
var a: [{1..max}] int;
proc init() {
this.a = here.id;
}
}
proc test() {
var wrappers: [LocaleSpace] owned wrapper?;
coforall loc in LocaleSpace {
on Locales[loc] {
wrappers[loc] = new owned wrapper();
}
}
// rest of the experiment further down.
}
Two interesting behaviours happen here.
1. Moving data
Now, each instance of wrapper in array wrappers should live in its locale. Specifically, the references (wrappers) will live in locale 0, but the internal data (r, c, a) should live in the respective locale. So we try to move some from locale 1 to locale 3, as such:
on Locales[3] {
var timer: Timer;
timer.start();
var local_stuff = wrappers[1]!.r;
timer.stop();
log("get r from 1", timer.elapsed());
log(local_stuff);
}
on Locales[3] {
var timer: Timer;
timer.start();
var local_c = wrappers[1]!.c;
timer.stop();
log("get c from 1", timer.elapsed());
}
on Locales[3] {
var timer: Timer;
timer.start();
var local_a = wrappers[1]!.a;
timer.stop();
log("get a from 1", timer.elapsed());
}
Surprisingly, my timings show that:
Regardless of the size (const max), the time to send the array and the record stays constant, which doesn't make sense to me. I even checked with chplvis, and the size of the GET actually increases, but the time stays the same.
The time to send the class field increases with size, which makes sense, but it is quite slow and I don't know which case to trust here.
2. Querying the locales directly.
To demystify the problem, I also query the .locale.id of some variables directly. First, we query the data, which we expect to live in locale 2, from locale 2:
on Locales[2] {
var wrappers_ref = wrappers[2]!; // This is always 1 GET from 0, okay.
log("array",
wrappers_ref.a.locale.id,
wrappers_ref.a[1].locale.id
);
log("record",
wrappers_ref.r.locale.id,
wrappers_ref.r[1].locale.id,
wrappers_ref.r[1].x1.locale.id,
);
log("class",
wrappers_ref.c.locale.id,
wrappers_ref.c[1]!.locale.id,
wrappers_ref.c[1]!.x1.locale.id
);
}
And the result is:
[locale = 2] [2020-12-26T19:36:26.834472] => (array, 2, 2)
[locale = 2] [2020-12-26T19:36:26.894779] => (record, 2, 2, 2)
[locale = 2] [2020-12-26T19:36:27.023112] => (class, 2, 2, 2)
Which is expected. Yet, if we query the locale of the same data on locale 1, then we get:
[locale = 1] [2020-12-26T19:34:28.509624] => (array, 2, 2)
[locale = 1] [2020-12-26T19:34:28.574125] => (record, 2, 2, 1)
[locale = 1] [2020-12-26T19:34:28.700481] => (class, 2, 2, 2)
Implying that wrappers_ref.r[1].x1 lives in locale 1, even though it should clearly be on locale 2. My only guess is that by the time .locale.id is executed, the data (i.e. the .x1 field of the record) has already been moved to the querying locale (1).
So all in all, the second part of the experiment led to a secondary question, whilst not answering the first part.
NOTE: all experiments were run with -nl 4 in the chapel/chapel-gasnet docker image.
Good observations, let me see if I can shed some light.
As an initial note, any timings taken with the gasnet Docker image should be taken with a grain of salt since that image simulates the execution across multiple nodes using your local system rather than running each locale on its own compute node as intended in Chapel. As a result, it is useful for developing distributed memory programs, but the performance characteristics are likely to be very different than running on an actual cluster or supercomputer. That said, it can still be useful for getting coarse timings (e.g., your "this is taking a much longer time" observation) or for counting communications using chplvis or the CommDiagnostics module.
With respect to your observations about timings, I also observe that the array-of-class case is much slower, and I believe I can explain some of the behaviors:
First, it's important to understand that any cross-node communications can be characterized using a formula like alpha + beta*length. Think of alpha as representing the basic cost of performing the communication, independent of length. This represents the cost of calling down through the software stack to get to the network, putting the data on the wire, receiving it on the other side, and getting it back up through the software stack to the application there. The precise value of alpha will depend on factors like the type of communication, choice of software stack, and physical hardware. Meanwhile, think of beta as representing the per-byte cost of the communication where, as you intuit, longer messages necessarily cost more because there's more data to put on the wire, or potentially to buffer or copy, depending on how the communication is implemented.
In my experience, the value of alpha typically dominates beta for most system configurations. That's not to say that it's free to do longer data transfers, but that the variance in execution time tends to be much smaller for longer vs. shorter transfers than it is for performing a single transfer versus many. As a result, when choosing between performing one transfer of n elements vs. n transfers of 1 element, you'll almost always want the former.
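The alpha + beta*length model can be written down as a tiny C sketch to see why one large transfer beats many small ones. The function names are invented for illustration, and any alpha and beta values you plug in are hypothetical, not measurements of a real network.

```c
#include <stddef.h>

/* Cost of a single communication of `length` bytes. */
double transfer_cost(double alpha, double beta, size_t length)
{
    return alpha + beta * (double)length;
}

/* One transfer carrying all n elements: alpha is paid once. */
double bulk_cost(double alpha, double beta, size_t n, size_t elem_size)
{
    return transfer_cost(alpha, beta, n * elem_size);
}

/* n transfers of one element each: alpha is paid n times. */
double elementwise_cost(double alpha, double beta, size_t n, size_t elem_size)
{
    return (double)n * transfer_cost(alpha, beta, elem_size);
}
```

With made-up values alpha = 1000 and beta = 1 (in arbitrary time units), moving 25,000 elements of 8 bytes costs 201,000 units in bulk but 25,200,000 units element by element. The beta term is the same in both cases; the difference comes entirely from paying alpha 25,000 times.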
To investigate your timings, I bracketed your timed code portions with calls to the CommDiagnostics module as follows:
resetCommDiagnostics();
startCommDiagnostics();
...code to time here...
stopCommDiagnostics();
printCommDiagnosticsTable();
and found, as you did with chplvis, that the number of communications required to localize the array of records or array of ints was constant as I varied max, for example:
| locale | get | execute_on |
|--------|-----|------------|
| 0      | 0   | 0          |
| 1      | 0   | 0          |
| 2      | 0   | 0          |
| 3      | 21  | 1          |
This is consistent with what I'd expect from the implementation: That for an array of value types, we perform a fixed number of communications to access array meta-data, and then communicate the array elements themselves in a single data transfer to amortize the overheads (avoid paying multiple alpha costs).
In contrast, I found that the number of communications for localizing the array of classes was proportional to the size of the array. For example, for the default value of 50,000 for max, I saw:
| locale | get   | put   | execute_on |
|--------|-------|-------|------------|
| 0      | 0     | 0     | 0          |
| 1      | 0     | 0     | 0          |
| 2      | 0     | 0     | 0          |
| 3      | 25040 | 25000 | 1          |
I believe the reason for this distinction relates to the fact that c is an array of owned classes, in which only a single class variable can "own" a given ctuff object at a time. As a result, when copying the elements of array c from one locale to another, you're not just copying raw data, as with the record and integer cases, but also performing an ownership transfer per element. This essentially requires setting the remote value to nil after copying its value to the local class variable. In our current implementation, this seems to be done using a remote get to copy the remote class value to the local one, followed by a remote put to set the remote value to nil, hence, we have a get and put per array element, resulting in O(n) communications rather than O(1) as in the previous cases. With additional effort, we could potentially have the compiler optimize this case, though I believe it will always be more expensive than the others due to the need to perform the ownership transfer.
I tested the hypothesis that owned classes were resulting in the additional overhead by changing your ctuff objects from being owned to unmanaged, which removes any ownership semantics from the implementation. When I do this, I see a constant number of communications, as in the value cases:
| locale | get | execute_on |
|--------|-----|------------|
| 0      | 0   | 0          |
| 1      | 0   | 0          |
| 2      | 0   | 0          |
| 3      | 21  | 1          |
I believe this represents the fact that once the language has no need to manage the ownership of the class variables, it can simply transfer their pointer values in a single transfer again.
Beyond these performance notes, it's important to understand a key semantic difference between classes and records when choosing which to use. A class object is allocated on the heap, and a class variable is essentially a reference or pointer to that object. Thus, when a class variable is copied from one locale to another, only the pointer is copied, and the original object remains where it was (for better or worse). In contrast, a record variable represents the object itself, and can be thought of as being allocated "in place" (e.g., on the stack for a local variable). When a record variable is copied from one locale to the other, it's the object itself (i.e., the record's fields' values) which are copied, resulting in a new copy of the object itself. See this SO question for further details.
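As a rough analogy only (C has no locales, and this is not how Chapel is implemented), the class-versus-record distinction resembles copying a pointer versus copying a struct in C. The struct and function names below are made up for the illustration.

```c
#include <stdlib.h>

struct stuff { int x1; int x2; };

/* Record-like behaviour: assigning a struct duplicates its fields,
   so changing the copy leaves the original untouched. */
int record_copy_is_independent(void)
{
    struct stuff original = {1, 1};
    struct stuff copy = original; /* field values are copied */
    copy.x1 = 99;
    return original.x1 == 1 && copy.x1 == 99;
}

/* Class-like behaviour: copying the pointer copies only the reference,
   so both names still refer to the one heap object. */
int class_copy_aliases_object(void)
{
    struct stuff *object = malloc(sizeof *object);
    if (object == NULL)
        return -1;
    object->x1 = 1;
    object->x2 = 1;
    struct stuff *alias = object; /* only the pointer is copied */
    alias->x1 = 99;
    int aliased = (object->x1 == 99);
    free(object);
    return aliased;
}
```

In the Chapel case, the heap object additionally stays on the locale where it was allocated, so every later field access through the copied class variable may be a remote access.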
Moving on to your second observation, I believe that your interpretation is correct, and that this may be a bug in the implementation (I need to stew on it a bit more to be confident). Specifically, I think you're correct that what's happening is that wrappers_ref.r[1].x1 is being evaluated, with the result being stored in a local variable, and that the .locale.id query is being applied to the local variable storing the result rather than the original field. I tested this theory by taking a ref to the field and then printing locale.id of that ref, as follows:
ref x1loc = wrappers_ref.r[1].x1;
...x1loc.locale.id...
and that seemed to give the right result. I also looked at the generated code which seemed to indicate that our theories were correct. I don't believe that the implementation should behave this way, but need to think about it a bit more before being confident. If you'd like to open a bug against this on Chapel's GitHub issues page, for further discussion there, we'd appreciate that.

How to read and write bits in a chunk of memory in Swift

I would like to know how to read a binary file into memory (writing it to memory like an "Array Buffer" from JavaScript), and write to different parts of memory 8-bit, 16-bit, 32-bit etc. values, even 5 bit or 10 bit values.
extension Binary {
static func readFileToMemory(_ file) -> ArrayBuffer {
let data = NSData(contentsOfFile: "/path/to/file/7CHands.dat")!
var dataRange = NSRange(location: 0, length: ?)
var ? = [Int32](count: ?, repeatedValue: ?)
data.getBytes(&?, range: dataRange)
}
static func writeToMemory(_ buffer, location, value) {
buffer[location] = value
}
static func readFromMemory(_ buffer, location) {
return buffer[location]
}
}
I have looked at a bunch of places but haven't found a standard reference.
https://github.com/nst/BinUtils/blob/master/Sources/BinUtils.swift
https://github.com/apple/swift/blob/master/stdlib/public/core/ArrayBuffer.swift
https://github.com/uraimo/Bitter/blob/master/Sources/Bitter/Bitter.swift
In Swift, how do I read an existing binary file into an array?
Swift - writing a byte stream to file
https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html
https://github.com/Cosmo/BinaryKit/blob/master/Sources/BinaryKit.swift
https://github.com/vapor-community/bits/blob/master/Sources/Bits/Data%2BBytesConvertible.swift
https://academy.realm.io/posts/nate-cook-tryswift-tokyo-unsafe-swift-and-pointer-types/
https://medium.com/@gorjanshukov/working-with-bytes-in-ios-swift-4-de316a389a0c
I would like for this to be as low-level as possible. So perhaps using UnsafeMutablePointer, UnsafePointer, or UnsafeMutableRawPointer.
Saw this as well:
let data = NSMutableData()
var goesIn: Int32 = 42
data.appendBytes(&goesIn, length: sizeof(Int32))
println(data) // <2a000000>
var comesOut: Int32 = 0
data.getBytes(&comesOut, range: NSMakeRange(0, sizeof(Int32)))
println(comesOut) // 42
I would basically like to allocate a chunk of memory and be able to read and write from it. Not sure how to do that. Perhaps using C is the best way, not sure.
Just saw this too:
let rawData = UnsafeMutablePointer<UInt8>.allocate(capacity: width * height * 4)
If you're looking for low-level code, you'll need to use UnsafeMutableRawPointer. This is a pointer to untyped data. Memory is accessed in bytes, so in chunks of 8 bits. I'll cover multiples of 8 bits first.
Reading a File
To read a file this way, you need to manage file handles and pointers yourself. Try the following code:
// Open the file in read mode
let file = fopen("/Users/joannisorlandos/Desktop/ownership", "r")
// Files need to be closed manually
defer { fclose(file) }
// Find the end
fseek(file, 0, SEEK_END)
// Count the bytes from the start to the end
let fileByteSize = ftell(file)
// Return to the start
fseek(file, 0, SEEK_SET)
// Buffer of 1 byte entities
let pointer = UnsafeMutableRawPointer.allocate(byteCount: fileByteSize, alignment: 1)
// Buffer needs to be cleaned up manually
defer { pointer.deallocate() }
// Size is 1 byte
let readBytes = fread(pointer, 1, fileByteSize, file)
let errorOccurred = readBytes != fileByteSize
First you need to open the file. This can be done using Swift strings since the compiler converts them into C strings itself.
Because cleanup is entirely up to us at this low level, a defer is put in place to close the file at the end.
Next, the file position is moved to the end of the file. Then the distance between the start of the file and the end is calculated. This is used later, so the value is kept.
Then the program returns to the start of the file, so that reading begins at the first byte.
To store the file, a buffer is allocated with as many bytes as the file has in the file system. Note: this can change in between the steps if you're extremely unlucky or the file is modified quite often. But I think for you, this is unlikely.
The byte count is set, and the buffer is aligned to one byte. (You can learn more about memory alignment on Wikipedia.)
Then another defer is added to make sure no memory leaks at the end of this code. The pointer needs to be deallocated manually.
The file's bytes are read and stored in the buffer. Do note that this entire process reads the file in a blocking manner. It can be preferable to read files asynchronously; if you plan on doing that, I recommend looking into a library like SwiftNIO instead.
errorOccurred can be used to throw an error or handle issues in another manner.
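Since the Swift code above calls straight into the C standard library, the same steps can be sketched in plain C with the error checks spelled out. The function name read_all is an assumption made for this example, and it takes an already opened FILE * rather than a path.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read the whole stream into a freshly allocated buffer.
   Returns NULL on failure; on success the caller must free() the result.
   One extra byte is allocated and set to 0 so that text content can be
   used as a C string. */
unsigned char *read_all(FILE *file, size_t *out_size)
{
    if (fseek(file, 0, SEEK_END) != 0)
        return NULL;
    long size = ftell(file); /* distance from start to end */
    if (size < 0)
        return NULL;
    if (fseek(file, 0, SEEK_SET) != 0)
        return NULL;

    unsigned char *buffer = malloc((size_t)size + 1);
    if (buffer == NULL)
        return NULL;

    size_t read_bytes = fread(buffer, 1, (size_t)size, file);
    if (read_bytes != (size_t)size) {
        free(buffer); /* short read: treat as an error */
        return NULL;
    }
    buffer[size] = 0;
    *out_size = (size_t)size;
    return buffer;
}
```

A caller would open the file with fopen(path, "rb"), pass the handle to read_all, and fclose it afterwards. As in the Swift version, the read is blocking and the size can be stale if the file changes between ftell and fread.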
From here, your buffer is ready for manipulation. You can print the file if it's text using the following code:
print(String(cString: pointer.bindMemory(to: Int8.self, capacity: fileByteSize)))
From here, it's time to learn how to read and manipulate the memory.
Manipulating Memory
The code below demonstrates reading bytes 20..<24 as an Int32.
let int32 = pointer.load(fromByteOffset: 20, as: Int32.self)
I'll leave the other integers up to you. Next, you can also put data at a position in memory.
pointer.storeBytes(of: 40, toByteOffset: 30, as: Int64.self)
This will replace bytes 30..<38 with the number 40. Note that big-endian systems, although uncommon, will store information in a different order from the usual little-endian systems. More about that here.
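For comparison, the same load and store at a byte offset can be sketched in C with memcpy. The helper names are invented for this example; they are not the Swift API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read 4 bytes starting at buffer + offset as a 32-bit integer. */
int32_t load_i32(const unsigned char *buffer, size_t offset)
{
    int32_t value;
    memcpy(&value, buffer + offset, sizeof value);
    return value;
}

/* Read 8 bytes starting at buffer + offset as a 64-bit integer. */
int64_t load_i64(const unsigned char *buffer, size_t offset)
{
    int64_t value;
    memcpy(&value, buffer + offset, sizeof value);
    return value;
}

/* Overwrite 8 bytes starting at buffer + offset with value. */
void store_i64(unsigned char *buffer, size_t offset, int64_t value)
{
    memcpy(buffer + offset, &value, sizeof value);
}
```

Going through memcpy avoids any alignment requirement on the offset, and the bytes land in the host's native byte order, so the caveat about endianness applies here as well. The caller is responsible for keeping offset plus the value size within the buffer.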
Modifying Bits
As you note, you're also interested in modifying five or ten bits at a time. To do so, you'll need to mix the previous information with the new information. UInt32 is used below because the mask used later does not fit in an Int32 literal.
let data32bits = pointer.load(fromByteOffset: 20, as: UInt32.self)
var newData = 0b11111000
In this case, you're interested in the first 5 bits (from the left) and want to write them over bits 2 through 6 of a byte, counting from the left and starting at 0. To do so, first shift the bits to a position that matches the new position.
newData = newData >> 2
This shifts the bits 2 places to the right. The two leftmost bits, which are now empty, are therefore 0. The 2 bits on the right that got shifted off no longer exist.
Next, you'll want to get the old data from the buffer and overwrite it with the new bits.
To do so, first move the new byte into a 32-bit value.
var newBits = numericCast(newData) as UInt32
The 8 bits will sit all the way to the right of the 32 bits. If you want to replace the second of the four bytes (counting from the left), run the following:
newBits = newBits << 16
This moves the byte 16 bit places to the left, or 2 bytes. So it's now at position 1, counting from 0 at the left.
Then, the two values need to be laid on top of each other. One common method is the following:
let oldBits = data32bits & 0b11111111_11000001_11111111_11111111
let result = oldBits | newBits
What happens here is that we remove from the old data the 5 bits that will receive new data. We do so with a bitwise AND of the old 32 bits and a bitmask.
The bitmask has all 1s except at the locations being replaced. Because those are 0 in the bitmask, the AND operator clears those bits, since one of the two inputs (old data vs. bitmask) is 0.
An AND yields 1 only if both sides of the operator are 1.
Finally, oldBits and newBits are merged with an OR operator. This takes each bit on both sides and sets the result to 1 if the bit at either position is 1.
The merge works because the mask cleared exactly the bit positions that newBits can set, so the two values never compete for the same bit.
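The mask-then-merge steps generalize to any field width, including the 5-bit and 10-bit cases from the question. Here is a small C sketch; replace_bits and extract_bits are names chosen for illustration, with bit positions counted from the least significant bit, and width is assumed to be between 1 and 31 with shift + width at most 32.

```c
#include <stdint.h>

/* Return old_word with the `width`-bit field starting at bit `shift`
   replaced by the low `width` bits of new_bits. */
uint32_t replace_bits(uint32_t old_word, uint32_t new_bits,
                      unsigned shift, unsigned width)
{
    uint32_t field = ((UINT32_C(1) << width) - 1) << shift; /* 1s over the field */
    uint32_t cleared = old_word & ~field;        /* AND with mask clears the field */
    uint32_t placed = (new_bits << shift) & field;
    return cleared | placed;                     /* OR merges the new bits in */
}

/* Read the `width`-bit field starting at bit `shift`. */
uint32_t extract_bits(uint32_t word, unsigned shift, unsigned width)
{
    return (word >> shift) & ((UINT32_C(1) << width) - 1);
}
```

For the example above, the five replaced bits are bits 17 through 21 from the least significant end, so replace_bits(word, value, 17, 5) uses the same mask, 0xFFC1FFFF, as the binary literal in the Swift code.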

How can I judge where should I put memory barrier in the code?

While reading ldd3, I came across the concept of memory barriers. It says that code execution can be reordered, for reasons like caching and compiler optimizations. I think code that has no dependencies can be reordered to get better performance, and accesses to I/O port registers must not be optimized, because they need to contain consistent data. But I cannot understand the code below. Are there any rules to follow about where I should insert functions like smp_mb(), mb(), barrier()?
For example, in the code of the short example from ldd3:
/*
* Atomicly increment an index into short_buffer
*/
static inline void short_incr_bp(volatile unsigned long *index, int delta)
{
unsigned long new = *index + delta;
barrier(); /* Don't optimize these two together */
*index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}
How could the line before the barrier and the line after the barrier be reordered? I think the latter depends on the former being executed first to get the new value.
This really makes me confused.
Memory barriers are used instead of locking to gain better performance. There are several standard patterns in which memory barriers provide the needed synchronization. You can read, e.g., Documentation/memory-barriers.txt from the kernel sources.
In the given example from ldd3, the barrier usage is trickier than usual. In terms of the current kernel (as opposed to the 2.6-era kernel described in ldd3), the same intention can be expressed using the ACCESS_ONCE() macro.
unsigned long new = *index + delta;
ACCESS_ONCE(*index) = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
Without barriers, the compiler may assign *index twice:
*index = *index + delta;
if (*index >= (short_buffer + PAGE_SIZE))
*index = short_buffer;
Because *index is used by multiple threads as an unprotected invariant (it shows which buffer region is available), writing the intermediate value *index + delta into it makes the invariant seen by other threads incorrect. This is prevented by the ACCESS_ONCE() macro, which forces the compiler to generate an access (write) to the variable only where it is explicitly requested.
Actually, ACCESS_ONCE (and barrier in your code) is redundant for a variable with the volatile modifier.
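The single-store pattern described above can be tried outside the kernel with a userspace stand-in. In this sketch the ACCESS_ONCE definition relies on the GCC/Clang __typeof__ extension, and advance_index together with its parameters is a hypothetical helper, not code from ldd3 or the kernel.

```c
/* Userspace stand-in: force exactly one access through a volatile lvalue.
   Relies on the GCC/Clang __typeof__ extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Advance *index by delta, wrapping back to `start` once it reaches
   start + size. The new value is computed in a local variable and
   written with one store, so the out-of-range intermediate value is
   never stored to *index. */
unsigned long advance_index(unsigned long *index, unsigned long delta,
                            unsigned long start, unsigned long size)
{
    unsigned long next = ACCESS_ONCE(*index) + delta;
    if (next >= start + size)
        next = start;
    ACCESS_ONCE(*index) = next;
    return next;
}
```

This only constrains what the compiler may emit. It does not make the read-modify-write atomic with respect to another writer, so it fits the single-writer situation discussed above rather than replacing a lock.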

NEON: loading uint8_t array into 128 bit register

I need to load values from a uint8 array into a 128-bit NEON register. There is a similar question, but it has no good answers.
My solution is:
uint8_t arr[4] = {1,2,3,4};
//load 4 of 8-bit vals into 64 bit reg
uint8x8_t _vld1_u8 = vld1_u8(arr);
//convert to 16-bit and move to 128-bit reg
uint16x8_t _vmovl_u8 = vmovl_u8(_vld1_u8);
//get low 64 bit and move them to 64-bit reg
uint16x4_t _vget_low_u16 = vget_low_u16(_vmovl_u8);
//convert to 32-bit and move to 128-bit reg
uint32x4_t ld32x4 = vmovl_u16(_vget_low_u16);
This works fine, but it seems to me that this approach is not the fastest. Maybe there is a better and faster way to load 8-bit data into a 128-bit register as 32-bit values?
Edit:
Thanks to @FrankH. I've come up with a second version using a hack:
uint8x16x2_t z = vzipq_u8(vld1q_u8(arr), q_zero);
uint8x16_t rr = *(uint8x16_t*)&z;
z = vzipq_u8(rr, q_zero);
ld32x4 = *(uint8x16_t*)&z;
It boils down to this assembly (when compiler optimisations are on):
vld1.8 {d16, d17}, [r5]
vzip.8 q8, q9
vorr q9, q4, q4
vzip.8 q8, q9
So there are no redundant stores and it's pretty fast. But it is still about 1.5x slower than the first solution.
You can do a "double zip" with zeroes:
uint16x4_t zero = 0;
uint32x4_t ld32x4 =
vreinterpretq_u32_u16(
vzipq_u8(
vzip_u8(
vld1_u8(arr),
vreinterpret_u8_u16(zero)
),
zero
)
);
Since the vreinterpretq_*() are no-ops, this boils down to three instructions. I don't have a cross-compiler around at the moment, so I can't validate that :(
Edit:
Don't get me wrong there ... while vreinterpretq_*() isn't resulting in a Neon instruction, it's not a no-op; that's because it stops the compiler from doing the type of funky things you'd see if you'd instead use widerVal.val[0]. All it tells the compiler is, like:
"you've got a uint8x16x2_t but I want to use only half of that as a uint8x16_t, give me half the registers."
Or:
"you have a uint8x16x2_t but I want to use those regs as a uint32x4_t instead."
I.e. it tells the compilers to alias sets of neon registers - preventing stores/loads to/from the stack as you'd get if you do the explicit sub-set access through the .val[...] syntax.
In a way, the .val[...] syntax "is a hack" but the better method, the use of vreinterpretq_*(), "looks like a hack". Not using it results in more instructions and slower/inferior code.
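To see why zipping with zeroes twice widens each byte to 32 bits, the idea can be modelled in portable scalar C, without any NEON intrinsics. All names below are made up for the model, and it assumes the little-endian lane layout that is usual on ARM; it illustrates the data movement only and says nothing about instruction counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleave n bytes of a with n bytes of b: a0 b0 a1 b1 ... */
void zip_bytes(const uint8_t *a, const uint8_t *b, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Interpret 4 consecutive bytes as one little-endian 32-bit lane. */
uint32_t lane_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Widen four bytes to four 32-bit values by zipping with zeroes twice. */
void widen_u8_to_u32(const uint8_t in[4], uint32_t out[4])
{
    const uint8_t zeros[8] = {0};
    uint8_t once[8];
    uint8_t twice[16];

    zip_bytes(in, zeros, 4, once);    /* b0 0 b1 0 b2 0 b3 0 */
    zip_bytes(once, zeros, 8, twice); /* b0 0 0 0 b1 0 0 0 ... */
    for (size_t i = 0; i < 4; i++)
        out[i] = lane_u32_le(twice + 4 * i);
}
```

After the second zip, every input byte is followed by three zero bytes, which on a little-endian layout is exactly the zero-extended 32-bit value that the vmovl-based version in the question produces.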