I'm implementing a dcache in a pipeline processor.
My dcache is a 2-way associative with 2 words per block and 8 indexes
This is how I initialized my cache structure.
typedef struct packed {
logic [25:0] tag;
logic valid, dirty;
word_t [1:0] data;
} block_t;
typedef struct packed {
block_t [1:0] way;
} dcache_t;
dcache_t [7:0] cache;
So to access a word: cache[i].way[j].data[k]
I can write to cache fine.
index, way and sel are variables that use combination logic to determine where to index.
So for example, this line is in my always_ff register for the cache.
cache[index].way[way].data[sel] = ccif.dload[CPUID];
After the above line of code, the following gets stored into cache
for index = 6, way = 0, sel = 0
cache[6].way[0].data[0] <== 0x01234567
and after the next clock cycle the following for index = 6, way = 0, sel = 1
cache[6].way[0].data[1] <== 0x89ABCDEF
Since I load two words at a time.
...but when I read from it using index = 6, way = 0, sel = 1
dcif.dmemload = cache[index].way[way].data[sel];
The following gets read from my cache
dcif.dmemload <== 0xCDEF0123
I get wrong value and don't know why, since the value in cache is still the same and hasn't changed.
This is the current state of a section of my cache at the time of reading
+-------+------------+------------+
| index | data[1] | data[0] |
+-------+------------+------------+
| 6 | 89ABCDEF | 01234567 |
+-------+------------+------------+
Any ideas? I'm confused because my indexing works fine when writing but something weird happens when reading
Edit: the value read isn't always offset by 2 bytes.
I'm not sure if I have too many nested arrays.
This is a bug in ModelSim/Questa that is fixed in the next release.
The solution is to not make the entire nested array all packed. You probably did not mean to have your cache packed anyways. You should not pack your arrays unless you need to access the whole array as a single integral value.
dcache_t cache[7:0];
Related
In a typical N-Body simulation, at the end of each epoch, each locale would need to share its own portion of the world (i.e. all bodies) to the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I encountered some strange behaviours that I couldn't make sense out of, so I decided to make a test program, in which things got more complicated. Here's the code to replicate the experiment.
proc log(args...?n) {
writeln("[locale = ", here.id, "] [", datetime.now(), "] => ", args);
}
const max: int = 50000;
record stuff {
var x1: int;
var x2: int;
proc init() {
this.x1 = here.id;
this.x2 = here.id;
}
}
class ctuff {
var x1: int;
var x2: int;
proc init() {
this.x1 = here.id;
this.x2 = here.id;
}
}
class wrapper {
// The point is that total size (in bytes) of data in `r`, `c` and `a` are the same here, because the record and the class hold two ints per index.
var r: [{1..max / 2}] stuff;
var c: [{1..max / 2}] owned ctuff?;
var a: [{1..max}] int;
proc init() {
this.a = here.id;
}
}
proc test() {
var wrappers: [LocaleSpace] owned wrapper?;
coforall loc in LocaleSpace {
on Locales[loc] {
wrappers[loc] = new owned wrapper();
}
}
// rest of the experiment further down.
}
Two interesting behaviours happen here.
1. Moving data
Now, each instance of wrapper in array wrappers should live in its locale. Specifically, the references (wrappers) will live in locale 0, but the internal data (r, c, a) should live in the respective locale. So we try to move some from locale 1 to locale 3, as such:
on Locales[3] {
var timer: Timer;
timer.start();
var local_stuff = wrappers[1]!.r;
timer.stop();
log("get r from 1", timer.elapsed());
log(local_stuff);
}
on Locales[3] {
var timer: Timer;
timer.start();
var local_c = wrappers[1]!.c;
timer.stop();
log("get c from 1", timer.elapsed());
}
on Locales[3] {
var timer: Timer;
timer.start();
var local_a = wrappers[1]!.a;
timer.stop();
log("get a from 1", timer.elapsed());
}
Surprisingly, my timings show that
Regardless of the size (const max), the time of sending the array and record strays constant, which doesn't make sense to me. I even checked with chplvis, and the size of GET actually increases, but the time stays the same.
The time to send the class field increases with time, which makes sense, but it is quite slow and I don't know which case to trust here.
2. Querying the locales directly.
To demystify the problem, I also query the .locale.id of some variables directly. First, we query the data, which we expect to live in locale 2, from locale 2:
on Locales[2] {
var wrappers_ref = wrappers[2]!; // This is always 1 GET from 0, okay.
log("array",
wrappers_ref.a.locale.id,
wrappers_ref.a[1].locale.id
);
log("record",
wrappers_ref.r.locale.id,
wrappers_ref.r[1].locale.id,
wrappers_ref.r[1].x1.locale.id,
);
log("class",
wrappers_ref.c.locale.id,
wrappers_ref.c[1]!.locale.id,
wrappers_ref.c[1]!.x1.locale.id
);
}
And the result is:
[locale = 2] [2020-12-26T19:36:26.834472] => (array, 2, 2)
[locale = 2] [2020-12-26T19:36:26.894779] => (record, 2, 2, 2)
[locale = 2] [2020-12-26T19:36:27.023112] => (class, 2, 2, 2)
Which is expected. Yet, if we query the locale of the same data on locale 1, then we get:
[locale = 1] [2020-12-26T19:34:28.509624] => (array, 2, 2)
[locale = 1] [2020-12-26T19:34:28.574125] => (record, 2, 2, 1)
[locale = 1] [2020-12-26T19:34:28.700481] => (class, 2, 2, 2)
Implying that wrappers_ref.r[1].x1.locale.id lives in locale 1, even though it should clearly be on locale 2. My only guess is that by the time .locale.id is executed, the data (i.e. the .x of the record) is already moved to the querying locale (1).
So all in all, the second part of the experiment lead to a secondary question, whilst not answering the first part.
NOTE: all experiment are run with -nl 4 in chapel/chapel-gasnet docker image.
Good observations, let me see if I can shed some light.
As an initial note, any timings taken with the gasnet Docker image should be taken with a grain of salt since that image simulates the execution across multiple nodes using your local system rather than running each locale on its own compute node as intended in Chapel. As a result, it is useful for developing distributed memory programs, but the performance characteristics are likely to be very different than running on an actual cluster or supercomputer. That said, it can still be useful for getting coarse timings (e.g., your "this is taking a much longer time" observation) or for counting communications using chplvis or the CommDiagnostics module.
With respect to your observations about timings, I also observe that the array-of-class case is much slower, and I believe I can explain some of the behaviors:
First, it's important to understand that any cross-node communications can be characterized using a formula like alpha + beta*length. Think of alpha as representing the basic cost of performing the communication, independent of length. This represents the cost of calling down through the software stack to get to the network, putting the data on the wire, receiving it on the other side, and getting it back up through the software stack to the application there. The precise value of alpha will depend on factors like the type of communication, choice of software stack, and physical hardware. Meanwhile, think of beta as representing the per-byte cost of the communication where, as you intuit, longer messages necessarily cost more because there's more data to put on the wire, or potentially to buffer or copy, depending on how the communication is implemented.
In my experience, the value of alpha typically dominates beta for most system configurations. That's not to say that it's free to do longer data transfers, but that the variance in execution time tends to be much smaller for longer vs. shorter transfers than it is for performing a single transfer versus many. As a result, when choosing between performing one transfer of n elements vs. n transfers of 1 element, you'll almost always want the former.
To investigate your timings, I bracketed your timed code portions with calls to the CommDiagnostics module as follows:
resetCommDiagnostics();
startCommDiagnostics();
...code to time here...
stopCommDiagnostics();
printCommDiagnosticsTable();
and found, as you did with chplvis, that the number of communications required to localize the array of records or array of ints was constant as I varied max, for example:
locale
get
execute_on
0
0
0
1
0
0
2
0
0
3
21
1
This is consistent with what I'd expect from the implementation: That for an array of value types, we perform a fixed number of communications to access array meta-data, and then communicate the array elements themselves in a single data transfer to amortize the overheads (avoid paying multiple alpha costs).
In contrast, I found that the number of communications for localizing the array of classes was proportional to the size of the array. For example, for the default value of 50,000 for max, I saw:
locale
get
put
execute_on
0
0
0
0
1
0
0
0
2
0
0
0
3
25040
25000
1
I believe the reason for this distinction relates to the fact that c is an array of owned classes, in which only a single class variable can "own" a given ctuff object at a time. As a result, when copying the elements of array c from one locale to another, you're not just copying raw data, as with the record and integer cases, but also performing an ownership transfer per element. This essentially requires setting the remote value to nil after copying its value to the local class variable. In our current implementation, this seems to be done using a remote get to copy the remote class value to the local one, followed by a remote put to set the remote value to nil, hence, we have a get and put per array element, resulting in O(n) communications rather than O(1) as in the previous cases. With additional effort, we could potentially have the compiler optimize this case, though I believe it will always be more expensive than the others due to the need to perform the ownership transfer.
I tested the hypothesis that owned classes were resulting in the additional overhead by changing your ctuff objects from being owned to unmanaged, which removes any ownership semantics from the implementation. When I do this, I see a constant number of communications, as in the value cases:
locale
get
execute_on
0
0
0
1
0
0
2
0
0
3
21
1
I believe this represents the fact that once the language has no need to manage the ownership of the class variables, it can simply transfer their pointer values in a single transfer again.
Beyond these performance notes, it's important to understand a key semantic difference between classes and records when choosing which to use. A class object is allocated on the heap, and a class variable is essentially a reference or pointer to that object. Thus, when a class variable is copied from one locale to another, only the pointer is copied, and the original object remains where it was (for better or worse). In contrast, a record variable represents the object itself, and can be thought of as being allocated "in place" (e.g., on the stack for a local variable). When a record variable is copied from one locale to the other, it's the object itself (i.e., the record's fields' values) which are copied, resulting in a new copy of the object itself. See this SO question for further details.
Moving on to your second observation, I believe that your interpretation is correct, and that this may be a bug in the implementation (I need to stew on it a bit more to be confident). Specifically, I think you're correct that what's happening is that wrappers_ref.r[1].x1 is being evaluated, with the result being stored in a local variable, and that the .locale.id query is being applied to the local variable storing the result rather than the original field. I tested this theory by taking a ref to the field and then printing locale.id of that ref, as follows:
ref x1loc = wrappers_ref.r[1].x1;
...wrappers_ref.c[1]!.x1.locale.id...
and that seemed to give the right result. I also looked at the generated code which seemed to indicate that our theories were correct. I don't believe that the implementation should behave this way, but need to think about it a bit more before being confident. If you'd like to open a bug against this on Chapel's GitHub issues page, for further discussion there, we'd appreciate that.
I would like to know how to read a binary file into memory (writing it to memory like an "Array Buffer" from JavaScript), and write to different parts of memory 8-bit, 16-bit, 32-bit etc. values, even 5 bit or 10 bit values.
extension Binary {
static func readFileToMemory(_ file) -> ArrayBuffer {
let data = NSData(contentsOfFile: "/path/to/file/7CHands.dat")!
var dataRange = NSRange(location: 0, length: ?)
var ? = [Int32](count: ?, repeatedValue: ?)
data.getBytes(&?, range: dataRange)
}
static func writeToMemory(_ buffer, location, value) {
buffer[location] = value
}
static func readFromMemory(_ buffer, location) {
return buffer[location]
}
}
I have looked at a bunch of places but haven't found a standard reference.
https://github.com/nst/BinUtils/blob/master/Sources/BinUtils.swift
https://github.com/apple/swift/blob/master/stdlib/public/core/ArrayBuffer.swift
https://github.com/uraimo/Bitter/blob/master/Sources/Bitter/Bitter.swift
In Swift, how do I read an existing binary file into an array?
Swift - writing a byte stream to file
https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html
https://github.com/Cosmo/BinaryKit/blob/master/Sources/BinaryKit.swift
https://github.com/vapor-community/bits/blob/master/Sources/Bits/Data%2BBytesConvertible.swift
https://academy.realm.io/posts/nate-cook-tryswift-tokyo-unsafe-swift-and-pointer-types/
https://medium.com/#gorjanshukov/working-with-bytes-in-ios-swift-4-de316a389a0c
I would like for this to be as low-level as possible. So perhaps using UnsafeMutablePointer, UnsafePointer, or UnsafeMutableRawPointer.
Saw this as well:
let data = NSMutableData()
var goesIn: Int32 = 42
data.appendBytes(&goesIn, length: sizeof(Int32))
println(data) // <2a000000]
var comesOut: Int32 = 0
data.getBytes(&comesOut, range: NSMakeRange(0, sizeof(Int32)))
println(comesOut) // 42
I would basically like to allocate a chunk of memory and be able to read and write from it. Not sure how to do that. Perhaps using C is the best way, not sure.
Just saw this too:
let rawData = UnsafeMutablePointer<UInt8>.allocate(capacity: width * height * 4)
If you're looking for low level code you'll need to use UnsafeMutableRawPointer. This is a pointer to a untyped data. Memory is accessed in bytes, so 8 chunks of at least 8 bits. I'll cover multiples of 8 bits first.
Reading a File
To read a file this way, you need to manage file handles and pointers yourself. Try the the following code:
// Open the file in read mode
let file = fopen("/Users/joannisorlandos/Desktop/ownership", "r")
// Files need to be closed manually
defer { fclose(file) }
// Find the end
fseek(file, 0, SEEK_END)
// Count the bytes from the start to the end
let fileByteSize = ftell(file)
// Return to the start
fseek(file, 0, SEEK_SET)
// Buffer of 1 byte entities
let pointer = UnsafeMutableRawPointer.allocate(byteCount: fileByteSize, alignment: 1)
// Buffer needs to be cleaned up manually
defer { pointer.deallocate() }
// Size is 1 byte
let readBytes = fread(pointer, 1, fileByteSize, file)
let errorOccurred = readBytes != fileByteSize
First you need to open the file. This can be done using Swift strings since the compiler makes them into a CString itself.
Because cleanup is all for us on this low level, a defer is put in place to close the file at the end.
Next, the file is set to seek the end of the file. Then the distance between the start of the file and the end is calculated. This is used later, so the value is kept.
Then the program is set to return to the start of the file, so the application starts reading from the start.
To store the file, a pointer is allocated with the amount of bytes that the file has in the file system. Note: This can change inbetween the steps if you're extremely unlucky or the file is accessed quite often. But I think for you, this is unlikely.
The amount of bytes is set, and aligned to one byte. (You can learn more about memory alignment on Wikipedia.
Then another defer is added to make sure no memory leaks at the end of this code. The pointer needs to be deallocated manually.
The file's bytes are read and stored in the pointer. Do note that this entire process reads the file in a blocking manner. It can be more preferred to read files asynchronously, if you plan on doing that I'll recommend looking into a library like SwiftNIO instead.
errorOccurred can be used to throw an error or handle issues in another manner.
From here, your buffer is ready for manipulation. You can print the file if it's text using the following code:
print(String(cString: pointer.bindMemory(to: Int8.self, capacity: fileByteSize)))
From here, it's time to learn how to read manipulate the memory.
Manipulating Memory
The below demonstrates reading byte 20..<24 as an Int32.
let int32 = pointer.load(fromByteOffset: 20, as: Int32.self)
I'll leave the other integers up to you. Next, you can alos put data at a position in memory.
pointer.storeBytes(of: 40, toByteOffset: 30, as: Int64.self)
This will replace byte 30..<38 with the number 40. Note that big endian systems, although uncommon, will store information in a different order from normal little endian systems. More about that here.
Modifying Bits
As you notes, you're also interested in modifying five or ten bits at a time. To do so, you'll need to mix the previous information with the new information.
var data32bits = pointer.load(fromByteOffset: 20, as: Int32.self)
var newData = 0b11111000
In this case, you'll be interested in the first 5 bits and want to write them over bit 2 through 7. To do so, first you'll need to shift the bits to a position that matches the new position.
newData = newData >> 2
This shifts the bits 2 places to the right. The two left bits that are now empty are therefore 0. The 2 bits on the right that got shoved off are not existing anymore.
Next, you'll want to get the old data from the buffer and overwrite the new bits.
To do so, first move the new byte into a 32-bits buffer.
var newBits = numericCast(newData) as Int32
The 32 bits will be aligned all the way to the right. If you want to replace the second of the four bytes, run the following:
newBits = newBits << 16
This moves the fourth pair 16 bit places left, or 2 bytes. So it's now on position 1 starting from 0.
Then, the two bytes need to be added on top of each other. One common method is the following:
let oldBits = data32bits & 0b11111111_11000001_11111111_11111111
let result = oldBits | newBits
What happens here is that we remove the 5 bits with new data from the old dataset. We do so by doing a bitwise and on the old 32 bits and a bitmap.
The bitmap has all 1's except for the new locations which are being replaced. Because those are empty in the bitmap, the and operator will exclude those bits since one of the two (old data vs. bitmap) is empty.
AND operators will only be 1 if both sides of the operator are 1.
Finally, the oldBits and the newBits are merged with an OR operator. This will take each bit on both sides and set the result to 1 if the bits at both positions are 1.
This will merge successfully since both buffers contain 1 bits that the other number doesn't set.
Say ...
you have about 20 Thing
very often, you do a complex calculation running through a loop of say 1000 items. The end result is a varying number around 20 each time
you don't know how many there will be until you run through the whole loop
you then want to quickly (and of course elegantly!) access the result set in many places
for performance reasons you don't want to just make a new array each time. note that unfortunately there's a differing amount so you can't just reuse the same array trivially.
What about ...
var thingsBacking = [Thing](repeating: Thing(), count: 100) // hard limit!
var things: ArraySlice<Thing> = []
func fatCalculation() {
var pin: Int = 0
// happily, no need to clean-out thingsBacking
for c in .. some huge loop {
... only some of the items (roughly 20 say) become the result
x = .. one of the result items
thingsBacking[pin] = Thing(... x, y, z )
pin += 1
}
// and then, magic of slices ...
things = thingsBacking[0..<pin]
(Then, you can do this anywhere... for t in things { .. } )
What I am wondering, is there a way you can call to an ArraySlice<Thing> to do that in one step - to "append to" an ArraySlice and avoid having to bother setting the length at the end?
So, something like this ..
things = ... set it to zero length
things.quasiAppend(x)
things.quasiAppend(x2)
things.quasiAppend(x3)
With no further effort, things now has a length of three and indeed the three items are already in the backing array.
I'm particularly interested in performance here (unusually!)
Another approach,
var thingsBacking = [Thing?](repeating: Thing(), count: 100) // hard limit!
and just set the first one after your data to nil as an end-marker. Again, you don't have to waste time zeroing. But the end marker is a nuisance.
Is there a more better way to solve this particular type of array-performance problem?
Based on MartinR's comments, it would seem that for the problem
the data points are incoming and
you don't know how many there will be until the last one (always less than a limit) and
you're having to redo the whole thing at high Hz
It would seem to be best to just:
(1) set up the array
var ra = [Thing](repeating: Thing(), count: 100) // hard limit!
(2) at the start of each run,
.removeAll(keepingCapacity: true)
(3) just go ahead and .append each one.
(4) you don't have to especially mark the end or set a length once finished.
It seems it will indeed then use the same array backing. And it of course "increases the length" as it were each time you append - and you can iterate happily at any time.
Slices - get lost!
I've just started learning about hash tables and I understand how to insert but not how to search. These are the algorithms I'll be basing this question off:
Hashing the key
int Hash (int key) {
return key % 10; //table has a max size of 10
}
Linear probing for collision resolution.
Suppose I call insert twice with the keys 1, 11, and 21. This would return slot 1 for all 3 keys. After collision resolution the table would have the values 1, 11, and 21 at slots 1, 2, and 3. This is what I assume would happen with my understanding of inserting.
After doing this, how would I get the slots 2 and 3 if I search for the keys 11 and 21? From what I've read searching a hash table should do literally the same thing as inserting except when you arrive at the desired slot, you return the value at that slot instead of inserting something into it.
If I take this literally and apply the same algorithm, if I search for the key 11 I would arrive at slot 4 because it would start at slot 1 and keep probing forward until it finds an empty slot. It wouldn't stop at slot 2 even though it's what I want because it's not empty.
I'm struggling with this even if I use separate chaining. All 3 keys would be stored at slot 1 but using the same algorithm to search would return slot 1, not which node in the linked list.
Each slot stores a key/value pair. As you're searching through each slot, check whether the key is equal to the key you're searching for. Stop searching and return the value when you find an equal key.
With separate chaining, you can do a linear search through the list, checking the key against each key in the list.
I usually prefer to make each entry in the table a struct so I can create a linked list to handle collisions. This reduces collisions significantly. Something like this.
struct hashtable
{
int key;
struct hashtable *pList;
};
struct hashtable ht[10];
void Insert(int key);
{
index = Hash(key);
if (!ht[index].key)
{
ht[index].key = key;
ht[idnex].pList = 0;
}
else
{
struct hashtable *pht;
pht = ht[index].pList;
while (pht->pList)
pht = pht->pList;
pht->pList = new struct hashtable;
pht->pList->key = key;
pht->pList->pList = 0;
}
return;
}
The lookup function would, of course, have to traverse the list if it doesn't find the first entry's key matches. If performance is critical, you could use other strategies for the linked lists such as sorting them and using a binary search.
I am profiling an iPhone app and I noticed a strange pattern. In a certain block of code that's called quite frequently...
[item setQuadrant:[NSNumber numberWithInt:a]];
[item setIndex:[NSNumber numberWithInt:b]];
[item setTimestamp:[NSNumber numberWithInt:c]];
[item setState:[NSNumber numberWithInt:d]];
[item setCompletionPercentage:[NSNumber numberWithInt:e]];
[item setId_:[NSNumber numberWithInt:f]];
...the first call to [NSNumber numberWithInt:] takes an inordinate amount of time, in the order of 10-15x that of the remaining calls. I've verified that the results are consistent if I shuffle the lines (the first line is always the slow one, by the same ratio). Is there something going on that I'm not aware of?
Perhaps this happens because this block is inside a try/catch?
If I had to guess, NSNumber performs some code in it's +load implementation, which slows down the initial call to the class. Also note that NSNumber caches it's return value, so future calls to +numberWithInt: with the same value are faster than before, that could possibly be part of the issue.
Maybe the first value is much larger? apart from CoreFoundation's CFNumber caching, the "new" runtime uses tagged pointers, allowing integers within the range of 24 bit to be encoded directly into the pointer - the runtime then figure out it's a tagged pointer by looking at the last bit (and that its a CFNumber by looking at the 3 bits before the last bit and the target number size - 8, 16, 32, 64 bit - using the next 4 bits before).
On a 32-bit system (current iPhones), that means that for ("small") negative 32 bit numbers or large positive numbers, CoreFoundation will allocate an object. For everything else, it uses the following expression that is way faster:
case kCFNumberSInt32Type: {
int32_t value = *(int32_t *)valuePtr; // this loads the actual numerical value passed to CFNumberCreate()
#if !__LP64__
// We don't bother allowing the min 24-bit integer -2^23 to also be fast-pathed;
// tell anybody that complains about that to go ... hang.
int32_t limit = (1L << 23);
if (value <= -limit || limit <= value) break;
#endif
uintptr_t ptr_val = ((uintptr_t)((intptr_t)value << 8) | (2 << 6) | kCFTaggedObjectID_Integer);
return (CFNumberRef)ptr_val;
}
(note that !__LP64__ is true for 32-bit systems)
Taken from: http://www.opensource.apple.com/source/CF/CF-744.12/CFNumber.c
Also, there is a caching mechanism that prevents a range of numbers from being re-created multiple times, just search for "__CFNumberCache" in the same source file.