Cache-friendly way to add arrays of ints - Swift

I have become more mindful of optimizing my code for the cache. I am curious which of the following would be a more cache-friendly way to add two arrays. The code is in Swift.
struct A {
    var x, y, z: Int
}

func add1(a: inout [A]) {
    for i in 0 ..< a.count {
        a[i].z = a[i].x + a[i].y
    }
}

func add2(x: [Int], y: [Int], z: inout [Int]) {
    for i in 0 ..< x.count {
        z[i] = x[i] + y[i]
    }
}
My concern is that in add2 the benefits of locality may be diminished, since x, y and z need not be near one another in memory. For example, suppose x[0] is loaded into cache, then y[0] is loaded into cache. Could the data near y[0] evict from the cache the data near x[0], so that a new fetch from RAM is needed to load x[1]? And if so, would add1 resolve this issue?

An access pattern like the one in add2 is potentially a problem on a processor with a direct-mapped cache, and even then only if the addresses of the arrays line up in exactly the wrong way. With a typical 4- or 8-way set-associative cache there isn't really a problem here, even with maximally unlucky array addresses: if the blocks containing x[0], y[0] and z[0] all map to the same set, they still fit and do not evict each other. Direct-mapped caches do suffer the conflict misses that you are worried about, which is part of why they are rare now, though it is not the only reason.
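To put numbers on that, consider a toy cache configuration (the parameters here are typical values, chosen purely for illustration):

let lineSize = 64                             // bytes per cache line
let ways = 8                                  // blocks per set (8-way associative)
let cacheBytes = 32 * 1024                    // 32 KiB cache
let sets = cacheBytes / (lineSize * ways)     // = 64 sets
// A block at some address maps to set (address / lineSize) % sets.
// Even if the blocks holding x[0], y[0] and z[0] all map to the same set,
// that set holds up to 8 blocks, so the three arrays cannot evict each other.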
Actually, an access pattern like that of add2 is very nice, since depending on the operation being performed it can also be auto-vectorized. That doesn't happen with the default overflow-checked addition (checked addition is hard to vectorize), but with the wrapping addition &+ the compiler can use movdqu to load and store two Ints at a time, and paddq to add two Ints at a time.
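For illustration, a sketch of that wrapping-add variant (the function name is mine; it assumes all three arrays share the same count):

func add2Wrapping(x: [Int], y: [Int], z: inout [Int]) {
    for i in 0 ..< x.count {
        // &+ wraps on overflow instead of trapping, so the compiler can vectorize the loop
        z[i] = x[i] &+ y[i]
    }
}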

Related

Transferring arrays/classes/records between locales

In a typical N-body simulation, at the end of each epoch, each locale needs to share its own portion of the world (i.e. all bodies) with the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I encountered some strange behaviours that I couldn't make sense of, so I decided to make a test program, in which things got more complicated. Here's the code to replicate the experiment.
proc log(args...?n) {
  writeln("[locale = ", here.id, "] [", datetime.now(), "] => ", args);
}

const max: int = 50000;

record stuff {
  var x1: int;
  var x2: int;

  proc init() {
    this.x1 = here.id;
    this.x2 = here.id;
  }
}

class ctuff {
  var x1: int;
  var x2: int;

  proc init() {
    this.x1 = here.id;
    this.x2 = here.id;
  }
}

class wrapper {
  // The point is that the total size (in bytes) of the data in `r`, `c` and `a`
  // is the same here, because the record and the class hold two ints per index.
  var r: [{1..max / 2}] stuff;
  var c: [{1..max / 2}] owned ctuff?;
  var a: [{1..max}] int;

  proc init() {
    this.a = here.id;
  }
}

proc test() {
  var wrappers: [LocaleSpace] owned wrapper?;

  coforall loc in LocaleSpace {
    on Locales[loc] {
      wrappers[loc] = new owned wrapper();
    }
  }

  // rest of the experiment further down.
}
Two interesting behaviours happen here.
1. Moving data
Now, each instance of wrapper in the array wrappers should live in its own locale. Specifically, the references (wrappers) will live in locale 0, but the internal data (r, c, a) should live in the respective locale. So we try to move some of that data from locale 1 to locale 3, like so:
on Locales[3] {
  var timer: Timer;
  timer.start();
  var local_stuff = wrappers[1]!.r;
  timer.stop();
  log("get r from 1", timer.elapsed());
  log(local_stuff);
}

on Locales[3] {
  var timer: Timer;
  timer.start();
  var local_c = wrappers[1]!.c;
  timer.stop();
  log("get c from 1", timer.elapsed());
}

on Locales[3] {
  var timer: Timer;
  timer.start();
  var local_a = wrappers[1]!.a;
  timer.stop();
  log("get a from 1", timer.elapsed());
}
Surprisingly, my timings show that:
1. Regardless of the size (const max), the time to send the array and the record stays constant, which doesn't make sense to me. I even checked with chplvis, and the size of the GET actually increases, but the time stays the same.
2. The time to send the class field increases with the size, which makes sense, but it is quite slow and I don't know which case to trust here.
2. Querying the locales directly.
To demystify the problem, I also query the .locale.id of some variables directly. First, we query the data, which we expect to live in locale 2, from locale 2:
on Locales[2] {
  var wrappers_ref = wrappers[2]!; // This is always 1 GET from 0, okay.
  log("array",
      wrappers_ref.a.locale.id,
      wrappers_ref.a[1].locale.id);
  log("record",
      wrappers_ref.r.locale.id,
      wrappers_ref.r[1].locale.id,
      wrappers_ref.r[1].x1.locale.id);
  log("class",
      wrappers_ref.c.locale.id,
      wrappers_ref.c[1]!.locale.id,
      wrappers_ref.c[1]!.x1.locale.id);
}
And the result is:
[locale = 2] [2020-12-26T19:36:26.834472] => (array, 2, 2)
[locale = 2] [2020-12-26T19:36:26.894779] => (record, 2, 2, 2)
[locale = 2] [2020-12-26T19:36:27.023112] => (class, 2, 2, 2)
Which is expected. Yet, if we query the locale of the same data on locale 1, then we get:
[locale = 1] [2020-12-26T19:34:28.509624] => (array, 2, 2)
[locale = 1] [2020-12-26T19:34:28.574125] => (record, 2, 2, 1)
[locale = 1] [2020-12-26T19:34:28.700481] => (class, 2, 2, 2)
Implying that wrappers_ref.r[1].x1 lives in locale 1, even though it should clearly be on locale 2. My only guess is that by the time .locale.id is executed, the data (i.e. the .x1 of the record) has already been moved to the querying locale (1).
So, all in all, the second part of the experiment led to a secondary question, whilst not answering the first part.
NOTE: all experiments are run with -nl 4 in the chapel/chapel-gasnet docker image.
Good observations! Let me see if I can shed some light.
As an initial note, any timings taken with the gasnet Docker image should be taken with a grain of salt since that image simulates the execution across multiple nodes using your local system rather than running each locale on its own compute node as intended in Chapel. As a result, it is useful for developing distributed memory programs, but the performance characteristics are likely to be very different than running on an actual cluster or supercomputer. That said, it can still be useful for getting coarse timings (e.g., your "this is taking a much longer time" observation) or for counting communications using chplvis or the CommDiagnostics module.
With respect to your observations about timings, I also observe that the array-of-class case is much slower, and I believe I can explain some of the behaviors:
First, it's important to understand that any cross-node communications can be characterized using a formula like alpha + beta*length. Think of alpha as representing the basic cost of performing the communication, independent of length. This represents the cost of calling down through the software stack to get to the network, putting the data on the wire, receiving it on the other side, and getting it back up through the software stack to the application there. The precise value of alpha will depend on factors like the type of communication, choice of software stack, and physical hardware. Meanwhile, think of beta as representing the per-byte cost of the communication where, as you intuit, longer messages necessarily cost more because there's more data to put on the wire, or potentially to buffer or copy, depending on how the communication is implemented.
In my experience, the value of alpha typically dominates beta for most system configurations. That's not to say that it's free to do longer data transfers, but that the variance in execution time tends to be much smaller for longer vs. shorter transfers than it is for performing a single transfer versus many. As a result, when choosing between performing one transfer of n elements vs. n transfers of 1 element, you'll almost always want the former.
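To make that concrete, here is a toy back-of-the-envelope calculation (written in Swift like the other examples on this page; the constants are invented for illustration, not measured):

// Toy alpha-beta cost model; alpha and beta values are made up for illustration.
let alpha = 1e-6                  // fixed per-message overhead: 1 microsecond
let beta = 1e-9                   // per-byte cost: 1 nanosecond
let n = 50_000                    // elements to move
let bytesPerElement = 8.0

let oneBigTransfer = alpha + beta * Double(n) * bytesPerElement       // ~0.0004 s
let manySmallTransfers = Double(n) * (alpha + beta * bytesPerElement) // ~0.05 s, over 100x slower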
To investigate your timings, I bracketed your timed code portions with calls to the CommDiagnostics module as follows:
resetCommDiagnostics();
startCommDiagnostics();
...code to time here...
stopCommDiagnostics();
printCommDiagnosticsTable();
and found, as you did with chplvis, that the number of communications required to localize the array of records or array of ints was constant as I varied max, for example:
locale   get   execute_on
0        0     0
1        0     0
2        0     0
3        21    1
This is consistent with what I'd expect from the implementation: for an array of value types, we perform a fixed number of communications to access the array meta-data, and then communicate the array elements themselves in a single data transfer to amortize the overheads (avoiding multiple alpha costs).
In contrast, I found that the number of communications for localizing the array of classes was proportional to the size of the array. For example, for the default value of 50,000 for max, I saw:
locale   get     put     execute_on
0        0       0       0
1        0       0       0
2        0       0       0
3        25040   25000   1
I believe the reason for this distinction relates to the fact that c is an array of owned classes, in which only a single class variable can "own" a given ctuff object at a time. As a result, when copying the elements of array c from one locale to another, you're not just copying raw data, as with the record and integer cases, but also performing an ownership transfer per element. This essentially requires setting the remote value to nil after copying its value to the local class variable. In our current implementation, this seems to be done using a remote get to copy the remote class value to the local one, followed by a remote put to set the remote value to nil, hence, we have a get and put per array element, resulting in O(n) communications rather than O(1) as in the previous cases. With additional effort, we could potentially have the compiler optimize this case, though I believe it will always be more expensive than the others due to the need to perform the ownership transfer.
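As a loose analogy in Swift (not Chapel), transferring ownership out of a slot is a read of the value plus a write that clears the source: one "get" and one "put" per element. The helper below is hypothetical:

// Hypothetical sketch: taking ownership out of a slot requires both
// reading the value (the "get") and clearing the source (the "put").
func takeOwnership<T>(from source: inout T?) -> T? {
    let value = source   // read the current value (the "get")
    source = nil         // clear the source slot (the "put")
    return value
}

var remote: Optional = ["payload"]          // an Optional<[String]> slot
let local = takeOwnership(from: &remote)    // local now owns the value; remote is nil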
I tested the hypothesis that owned classes were resulting in the additional overhead by changing your ctuff objects from being owned to unmanaged, which removes any ownership semantics from the implementation. When I do this, I see a constant number of communications, as in the value cases:
locale   get   execute_on
0        0     0
1        0     0
2        0     0
3        21    1
I believe this represents the fact that once the language has no need to manage the ownership of the class variables, it can simply transfer their pointer values in a single transfer again.
Beyond these performance notes, it's important to understand a key semantic difference between classes and records when choosing between them. A class object is allocated on the heap, and a class variable is essentially a reference or pointer to that object. Thus, when a class variable is copied from one locale to another, only the pointer is copied, and the original object remains where it was (for better or worse). In contrast, a record variable represents the object itself, and can be thought of as being allocated "in place" (e.g., on the stack for a local variable). When a record variable is copied from one locale to another, it's the object itself (i.e., the record's field values) that is copied, resulting in a new copy of the object. See this SO question for further details.
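The same value-versus-reference distinction exists between Swift's structs and classes, which gives a compact way to see the copying behavior (a Swift sketch, not Chapel code):

struct RecordLike { var x = 0 }   // value type: assignment copies the object
class ClassLike { var x = 0 }     // reference type: assignment copies the pointer

var r1 = RecordLike()
var r2 = r1
r2.x = 7
print(r1.x)   // 0 -- r1 holds its own copy

let c1 = ClassLike()
let c2 = c1
c2.x = 7
print(c1.x)   // 7 -- c1 and c2 point at the same object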
Moving on to your second observation, I believe that your interpretation is correct, and that this may be a bug in the implementation (I need to stew on it a bit more to be confident). Specifically, I think you're correct that what's happening is that wrappers_ref.r[1].x1 is being evaluated, with the result being stored in a local variable, and that the .locale.id query is being applied to the local variable storing the result rather than the original field. I tested this theory by taking a ref to the field and then printing locale.id of that ref, as follows:
ref x1loc = wrappers_ref.r[1].x1;
...x1loc.locale.id...
and that seemed to give the right result. I also looked at the generated code, which seemed to indicate that our theories were correct. I don't believe that the implementation should behave this way, but I need to think about it a bit more before being confident. If you'd like to open a bug against this on Chapel's GitHub issues page for further discussion, we'd appreciate that.

Does this sorting algorithm exist? (implemented in Swift)

This might be a bad question but I am curious.
I was following some data structures and algorithms courses online, and I came across algorithms such as selection sort, insertion sort, bubble sort, merge sort, quick sort and heap sort. They almost never get close to O(n), for example when the array is reverse-sorted.
I was wondering one thing: why are we not using space in return for time?
When I organise something, I pick an item up and put it where it belongs. So I thought that if we have an array of items, we could just put each value at the index equal to that value.
Here is my implementation in Swift 4:
let simpleArray = [5, 8, 3, 2, 1, 9, 4, 7, 0]
let maxSpace = 20

func spaceSort(array: [Int]) -> [Int] {
    guard array.count > 1 else {
        return array
    }
    var realResult = [Int]()
    var result = Array<Int>(repeating: -1, count: maxSpace)

    // Mark each value at its own index.
    for i in 0..<array.count {
        if result[array[i]] != array[i] {
            result[array[i]] = array[i]
        }
    }
    // Collect the marked indices in order.
    for i in 0..<result.count {
        if result[i] != -1 {
            realResult.append(i)
        }
    }
    return realResult
}
var spaceSorted = [Int]()
var execTime = BenchTimer.measureBlock {
spaceSorted = spaceSort(array: simpleArray)
}
print("Average execution time for simple array: \(execTime)")
print(spaceSorted)
1. Does this sorting algorithm exist already?
2. Is this a bad idea because it only takes unique values and loses the duplicates? Or could there be uses for it?
3. And why can't I use Int.max for the maxSpace?
Edit:
I get the error below when I use let maxSpace = Int.max:
error: Execution was interrupted.
MyPlayground(6961,0x7000024af000) malloc: Heap corruption detected, free list is damaged at 0x600003b7ebc0
*** Incorrect guard value: 0
MyPlayground(6961,0x7000024af000) malloc: *** set a breakpoint in malloc_error_break to debug
Thanks for the answers
This is an extreme version of radix sort. Quoted from Wikipedia:
radix sort is a non-comparative sorting algorithm. It avoids comparison by creating and distributing elements into buckets according to their radix. For elements with more than one significant digit, this bucketing process is repeated for each digit, while preserving the ordering of the prior step, until all digits have been considered. For this reason, radix sort has also been called bucket sort and digital sort.
In this case you choose your radix as maxSpace, and so you don't have any "elements with more than one significant digit" (from quote above).
Now, if you were to use a hash set data structure instead of an array, you would not actually need to allocate space for the whole range. You would still keep all the loop iterations though (from 0 to maxSpace): you would check whether the hash set contains the value of i (the loop variable), and if so, output it.
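Here is a minimal Swift sketch of that hash-set variant (the function name is mine; it keeps the maxSpace parameter and, like the original, drops duplicates):

func spaceSortWithSet(array: [Int], maxSpace: Int) -> [Int] {
    let seen = Set(array)   // O(n) to build; no array for the whole range
    var result = [Int]()
    // Still iterates the full 0..<maxSpace range, as described above.
    for i in 0..<maxSpace where seen.contains(i) {
        result.append(i)
    }
    return result
}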
This can only be an efficient algorithm if maxSpace has the same order of magnitude as the number of elements in your input array. Other sorting algorithms run in O(n log n) time, so for cases where maxSpace is much greater than n log n, this algorithm is not that compelling.
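For completeness, the standard fix for the duplicates problem (your second question) is counting sort, which tallies occurrences instead of storing one marker per value. A sketch (the names are mine); note that it still needs O(maxValue) space, which is also why maxSpace = Int.max can never work: no machine can allocate an array that large.

func countingSort(_ array: [Int], maxValue: Int) -> [Int] {
    var counts = [Int](repeating: 0, count: maxValue + 1)
    for value in array {
        counts[value] += 1   // tally duplicates instead of losing them
    }
    var result = [Int]()
    result.reserveCapacity(array.count)
    for value in 0...maxValue where counts[value] > 0 {
        result.append(contentsOf: repeatElement(value, count: counts[value]))
    }
    return result
}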

Swift - should I create a local variable for a string's "count"?

Does it matter if I use a string's 'count' multiple times within a function? That is, does Swift cache the 'count' after it first computes it? Below are two examples; does it matter which one I use? I assume the second is definitely okay, but what about the first? I see example code like the first one all the time.
func Foo1(str: String) {
    ...
    // calling str.count twice
    if x < str.count && y < str.count {
        ...
    }
}

func Foo2(str: String) {
    ...
    // calling str.count once
    let c = str.count
    if x < c && y < c {
        ...
    }
}
.count is defined by the Collection protocol with the following complexity:
Complexity: O(1) if the collection conforms to RandomAccessCollection; otherwise, O(n), where n is the length of the collection.
String is not a RandomAccessCollection. It's a BidirectionalCollection, so it does not promise O(1). It only promises O(n).
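The reason it cannot promise O(1) is that a Swift Character is a grapheme cluster of variable width, so counting requires walking the whole string:

let s = "e\u{301}abc"    // "é" written as 'e' plus a combining accent, then "abc"
print(s.count)           // 4 -- counts grapheme clusters, so the string must be walked
print(s.utf16.count)     // 5 -- code-unit counts differ from the character count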
It definitely does not promise any caching (and you shouldn't expect any).
It happens to be true that in many (probably most) cases, String's count is cached. It's part of _StringObject, which is part of the low-level storage abstraction, and it's often inlined by the optimizer. But none of this is promised.
That said, unless you expect the String to be extremely large (10kB at a minimum, possibly more), it is difficult to imagine this being a major bottleneck when called twice outside of a tight loop. As with most things, you should write clearly, and then profile. I would likely create the extra variable just for clarity, but you shouldn't second-guess yourself too much here. Write clearly. Then profile.
Do you have particularly large strings that you're working with?

"Appending" to an ArraySlice?

Say ...
- you have about 20 Thing
- very often, you do a complex calculation running through a loop of, say, 1000 items. The end result is a varying number of Things, around 20 each time
- you don't know how many there will be until you run through the whole loop
- you then want to quickly (and of course elegantly!) access the result set in many places
- for performance reasons you don't want to just make a new array each time. Note that unfortunately the amount differs each time, so you can't trivially reuse the same array.
What about ...
var thingsBacking = [Thing](repeating: Thing(), count: 100) // hard limit!
var things: ArraySlice<Thing> = []

func fatCalculation() {
    var pin: Int = 0
    // happily, no need to clean out thingsBacking
    for c in .. some huge loop {
        ... only some of the items (roughly 20, say) become the result
        x = .. one of the result items
        thingsBacking[pin] = Thing(... x, y, z)
        pin += 1
    }
    // and then, magic of slices ...
    things = thingsBacking[0..<pin]
}

(Then, you can do this anywhere: for t in things { .. })
What I am wondering is: is there a way you can ask an ArraySlice<Thing> to do that in one step - to "append to" an ArraySlice and avoid having to set the length at the end?
So, something like this ..
things = ... set it to zero length
things.quasiAppend(x)
things.quasiAppend(x2)
things.quasiAppend(x3)
With no further effort, things now has a length of three and indeed the three items are already in the backing array.
I'm particularly interested in performance here (unusually!)
Another approach:
var thingsBacking = [Thing?](repeating: Thing(), count: 100) // hard limit!
and just set the first element after your data to nil as an end-marker. Again, you don't have to waste time zeroing. But the end-marker is a nuisance.
Is there a better way to solve this particular type of array-performance problem?
Based on MartinR's comments, it would seem that for this problem, where
- the data points are incoming, and
- you don't know how many there will be until the last one (always fewer than some limit), and
- you're having to redo the whole thing at high Hz,
it would seem best to just:
(1) set up the array
var ra = [Thing](repeating: Thing(), count: 100) // hard limit!
(2) at the start of each run, call
ra.removeAll(keepingCapacity: true)
(3) just go ahead and .append each one.
(4) you don't have to especially mark the end or set a length once finished.
It seems it will indeed then reuse the same array backing. And of course it "increases the length", as it were, each time you append, and you can iterate happily at any time.
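Put together, a minimal runnable sketch of that approach (the Thing type and the filtering condition are invented for illustration):

struct Thing {
    var x = 0.0
}

// (1) set up the array once; this establishes its capacity
var ra = [Thing](repeating: Thing(), count: 100) // hard limit!

func fatCalculation(input: [Double]) {
    // (2) empty it, keeping the allocated storage
    ra.removeAll(keepingCapacity: true)
    // (3) append each result; no end marker or final length-setting needed
    for v in input where v > 0.5 {   // stand-in for the real calculation
        ra.append(Thing(x: v))
    }
}

// (4) the results are iterable anywhere, at any time
fatCalculation(input: [0.9, 0.1, 0.7])
for t in ra { print(t.x) }   // prints 0.9, then 0.7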
Slices - get lost!

In Swift, which loop is faster, `for` or `for-in`? Why?

Which loop should I use when I have to be extremely aware of the time it takes to iterate over a large array?
Short answer
Don’t micro-optimize like this – any difference there is could be far outweighed by the speed of the operation you are performing inside the loop. If you truly think this loop is a performance bottleneck, perhaps you would be better served by something like the Accelerate framework – but only if profiling shows you that the effort is truly worth it.
And don’t fight the language. Use for…in unless what you want to achieve cannot be expressed with for…in. These cases are rare. The benefit of for…in is that it’s incredibly hard to get it wrong. That is much more important. Prioritize correctness over speed. Clarity is important. You might even want to skip a for loop entirely and use map or reduce.
Longer Answer
For arrays, if you try them without the fastest compiler optimization, they perform identically, because they essentially do the same thing.
Presumably your for ;; loop looks something like this:
var sum = 0
for var i = 0; i < a.count; ++i {
    sum += a[i]
}
and your for…in loop something like this:
for x in a {
    sum += x
}
Let’s rewrite the for…in to show what is really going on under the covers:
var g = a.generate()
while let x = g.next() {
    sum += x
}
And then let’s rewrite that in terms of what a.generate() returns, and roughly what the let is doing:
var g = IndexingGenerator<[Int]>(a)
var wrapped_x = g.next()
while wrapped_x != nil {
    let x = wrapped_x!
    sum += x
    wrapped_x = g.next()
}
Here is what the implementation of IndexingGenerator<[Int]> might look like:
struct IndexingGeneratorArrayOfInt {
    private let _seq: [Int]
    var _idx: Int = 0

    init(_ seq: [Int]) {
        _seq = seq
    }

    mutating func next() -> Int? {
        if _idx != _seq.endIndex {
            return _seq[_idx++]
        }
        else {
            return nil
        }
    }
}
Wow, that’s a lot of code. Surely it performs way slower than the regular for ;; loop?
Nope. Because while that might be what it is logically doing, the compiler has a lot of latitude to optimize. For example, note that IndexingGeneratorArrayOfInt is a struct, not a class. This means it has no overhead over declaring the two member variables directly. It also means the compiler can likely inline the code in next – there is no indirection going on here, no dynamic dispatch via vtables or objc_msgSend. Just some simple pointer arithmetic and dereferencing. If you strip away all the syntax for the structs and method calls, you’ll find that what the for…in code ends up doing is almost exactly the same as what the for ;; loop is doing.
for…in helps avoid performance errors
If, on the other hand, for the code given at the beginning, you switch compiler optimization to the faster setting, for…in appears to blow for ;; away. In some non-scientific tests I ran using XCTestCase.measureBlock, summing a large array of random numbers, it was an order of magnitude faster.
Why? Because of the use of count:
for var i = 0; i < a.count; ++i {
    //        ^-- calling a.count every time...
    sum += a[i]
}
Maybe the optimizer could have fixed this for you, but in this case it hasn’t. If you pull the invariant out, it goes back to being the same as for…in in terms of speed:
let count = a.count
for var i = 0; i < count; ++i {
    sum += a[i]
}
“Oh, I would definitely do that every time, so it doesn’t matter”. To which I say, really? Are you sure? Bet you forget sometimes.
But you want the even better news? Doing the same summation with reduce was (in my, again not very scientific, tests) even faster than the for loops:
let sum = a.reduce(0, +)
But it is also so much more expressive and readable (IMO), and allows you to use let to declare your result. Given that this should be your primary goal anyway, the speed is an added bonus. But hopefully the performance will give you an incentive to do it regardless.
This is just for arrays, but what about other collections? Of course it depends on the implementation, but there’s good reason to believe it would be faster for other collections too, such as dictionaries or custom user-defined collections.
My reason for this is that the author of a collection can implement an optimized version of generate, because they know exactly how the collection is laid out. Suppose subscript lookup involves some calculation (such as pointer arithmetic in the case of an array – you have to multiply the index by the element size and then add that to the base pointer). In the case of generate, you know what is being done is a sequential walk over the collection, and therefore you can optimize for this (for example, in the case of an array, hold a pointer to the next element which you increment each time next is called). The same goes for specialized member versions of reduce or map.
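As a rough sketch of that pointer-walking idea in modern Swift (the names are mine; the iterator assumes the buffer outlives it):

// Hypothetical iterator that walks an array buffer by pointer,
// advancing the pointer instead of re-computing base + index * stride.
struct PointerWalker: IteratorProtocol {
    private var ptr: UnsafePointer<Int>
    private let end: UnsafePointer<Int>

    init?(_ buffer: UnsafeBufferPointer<Int>) {
        guard let base = buffer.baseAddress else { return nil }   // empty buffer
        ptr = base
        end = base + buffer.count
    }

    mutating func next() -> Int? {
        guard ptr != end else { return nil }
        defer { ptr += 1 }   // step to the next element; no index math
        return ptr.pointee
    }
}

let a = [1, 2, 3]
a.withUnsafeBufferPointer { buf in
    guard var it = PointerWalker(buf) else { return }
    while let x = it.next() { print(x) }   // prints 1, 2, 3
}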
This might even be why reduce is performing so well on arrays – who knows (you could stick a breakpoint on the function passed in if you wanted to try and find out). But it’s just another justification for using the language construct you should probably be using regardless.
Famously stated: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth. It seems unlikely that you are in the 3%.
Focus on the bigger problem at hand. After it is working, if it needs a performance boost, then worry about for loops. But I guarantee you, in the end, bigger structural inefficiencies or poor algorithm choice will be the performance problem, not a for loop.
Worrying about for loops is oh so 1960s.
FWIW, a rudimentary playground test shows map() is about 10 times faster than for enumeration:
class SomeClass1 {
    let value: UInt32 = arc4random_uniform(100)
}

class SomeClass2 {
    let value: UInt32
    init(value: UInt32) {
        self.value = value
    }
}

var someClass1s = [SomeClass1]()
for _ in 0..<1000 {
    someClass1s.append(SomeClass1())
}

var someClass2s = [SomeClass2]()
let startTimeInterval1 = CFAbsoluteTimeGetCurrent()
someClass1s.map { someClass2s.append(SomeClass2(value: $0.value)) }
println("Time1: \(CFAbsoluteTimeGetCurrent() - startTimeInterval1)") // "Time1: 0.489435970783234"

var someMoreClass2s = [SomeClass2]()
let startTimeInterval2 = CFAbsoluteTimeGetCurrent()
for item in someClass1s { someMoreClass2s.append(SomeClass2(value: item.value)) }
println("Time2: \(CFAbsoluteTimeGetCurrent() - startTimeInterval2)") // "Time2 : 4.81457495689392"
The for loop (with a counter) just increments a counter, which is very fast. The for-in loop uses an iterator (an object that is called to produce the next element), which is slower. But in both cases you ultimately access the element, so in the end it makes little difference.