Code like this:
__constant char a[1] = "x";
...
__local char b[1];
async_work_group_copy(b, a, 1, 0);
throws a compile error:
no instance of overloaded function "async_work_group_copy" matches the argument list
So it seems that this function cannot be used to copy from the __constant address space. Am I right? If so, what is the preferred way to copy __constant data to __local memory for faster access? Currently I use a simple for loop in which each work-item copies several elements.
async_work_group_copy() is defined to copy between local and global memory only (see here: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/).
As far as I know, there is no way to perform a bulk copy from constant to local memory. The reason may be that constant memory is cached on all GPUs that I know of, which essentially means that it works at the same speed as local memory.
The vloadn() family of functions can load whole vectors from all address spaces, including constant, so that may partially match what you need. However, it is not a bulk copy.
I was able to confirm from the documentation that bpf_map_update_elem is an atomic operation if done on HASH_MAPs. Source (https://man7.org/linux/man-pages/man2/bpf.2.html). [Cite: map_update_elem() replaces existing elements atomically]
My question is twofold.
If the element does not exist, is map_update_elem still atomic?
Is the bpf_map_delete_elem operation in an XDP program thread-safe with respect to a user-space program?
The map is a HASH_MAP.
Atomic ops, race conditions and thread safety are sort of complex in eBPF, so I will make a broad answer since it is hard to judge from your question what your goals are.
Yes, both the bpf_map_update_elem command via the syscall and the helper function update maps 'atomically', which in this case means that if we go from value 'A' to value 'B', the program always sees either 'A' or 'B', never some combination of the two (the first bytes of 'B' and the last bytes of 'A', for example). This is true for all map types, and it holds for all map-modifying syscall commands (including bpf_map_delete_elem).
This however doesn't make race conditions impossible since the value of the map may have changed between a map_lookup_elem and the moment you update it.
It is also good to keep in mind that the map_lookup_elem syscall command (user space) works differently from the helper function (kernel space). The syscall always returns a copy of the data, so changing that copy does not change the map. The helper function, however, returns a pointer to the location in kernel memory where the map value is stored, so you can update the map value directly without using the map_update_elem helper. That is why you often see hash maps used like:
value = bpf_map_lookup_elem(&hash_map, &key);
if (value) {
    __sync_fetch_and_add(&value->packets, 1);
    __sync_fetch_and_add(&value->bytes, skb->len);
} else {
    struct pair val = {1, skb->len};
    bpf_map_update_elem(&hash_map, &key, &val, BPF_ANY);
}
Note that in this example, __sync_fetch_and_add is used to update parts of the map value. We need to do this because updating it with value->packets++; or value->packets += 1 would result in a race condition. __sync_fetch_and_add emits an atomic CPU instruction which in this case fetches, adds and writes back all in one instruction.
Also, in this example the two struct fields are each updated atomically, but not together: it is still possible for packets to have been incremented while bytes has not yet been. If you want to avoid this you need to use a spin lock (using the bpf_spin_lock and bpf_spin_unlock helpers).
Another way to sidestep the issue entirely is to use the _PER_CPU variants of maps, where you trade higher memory use for less contention and better speed.
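As a rough user-space illustration of the counter update above (this is not eBPF code; the struct pair layout and the account_packet helper are assumptions made for this sketch), the same GCC/Clang builtin can be exercised in plain C. It shows that each field is updated atomically on its own, and that the builtin returns the value held before the addition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter pair, mirroring the packets/bytes map value above. */
struct pair {
    uint64_t packets;
    uint64_t bytes;
};

/*
 * Account for one packet of `len` bytes.
 * Each __sync_fetch_and_add is atomic by itself, but the two calls are not
 * atomic together: another thread could observe the new packet count with
 * the old byte count in between.
 * Returns the packet count as it was before this update.
 */
uint64_t account_packet(struct pair *value, uint64_t len)
{
    assert(value != NULL);
    uint64_t previous = __sync_fetch_and_add(&value->packets, 1);
    __sync_fetch_and_add(&value->bytes, len);
    return previous;
}
```

Calling account_packet twice on a zeroed struct pair, with lengths 100 and 50, leaves packets at 2 and bytes at 150. A reader that needs both fields to change together would still need a lock, as noted above.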
For example,
atomic_int test(void)
{
    atomic_int tmp = ATOMIC_VAR_INIT(14);
    tmp = 47; // Looks like atomic_store
    atomic_int mc; // Probably just uninitialised data
    memcpy(&mc,&tmp,sizeof(mc)); // Probably equivalent to a copy
    tmp = mc + 4; // Arithmetic
    return tmp; // A copy - perhaps load then store
}
Clang is happy with all this. I've read section 7.17 of the standard, and it says a lot about the memory model and the defined functions (init, store, load etc) but doesn't say anything about the usual operations (+, = etc).
Also of interest is the behaviour of passing struct wot { atomic_int value; } to functions.
I would like to believe that assignment behaves identically to an atomic load then store using memory_order_seq_cst.
Even more optimistically, I would like to believe that struct assignment, passing to function, returning from function and even memcpy also behaves identically to carefully copying the bit pattern across under memory_order_seq_cst.
I can't find any supporting evidence for either belief in the standard though. There's definitely a chance that assignment and memcpy of atomic primitives are undefined behaviour.
How should primitive operations on atomic primitives behave?
Thanks!
Operations on objects that are _Atomic qualified (and atomic_int is just another spelling of that) are guaranteed to have sequential consistency. You will find that mentioned at the end of the semantics section for each of the operators. (And maybe the mention for assignment is missing.)
Your code is not correct in two places: initialization must use the ATOMIC_VAR_INIT macro (7.17.2.1), and memcpy is undefined (the sizes might not agree), although it will probably work on most architectures.
Also the line
tmp = mc + 4; // Arithmetic
doesn't do what your comment claims. This is not arithmetic on an atomic object, but a load followed by an ordinary addition. More interesting would be
mc += 4; // Arithmetic
which is an atomic operation with sequential consistency.
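A minimal sketch of the distinction, assuming a C11 compiler that provides <stdatomic.h>; the function names are made up for this illustration. The first function performs a separate atomic load and atomic store with an ordinary addition in between, so a concurrent writer could slip in between the two. The second is a single atomic read-modify-write, which is what mc += 4 amounts to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>

/* Hypothetical helper: atomic load, plain addition, atomic store (two atomic ops). */
int add_via_load_store(atomic_int *obj, int delta)
{
    assert(obj != NULL);
    int snapshot = atomic_load(obj);      /* sequentially consistent load */
    atomic_store(obj, snapshot + delta);  /* overwrites any update made since the load */
    return snapshot + delta;
}

/* Hypothetical helper: one indivisible read-modify-write, like `*obj += delta`. */
int add_via_rmw(atomic_int *obj, int delta)
{
    assert(obj != NULL);
    /* atomic_fetch_add returns the value held before the addition. */
    return atomic_fetch_add(obj, delta) + delta;
}
```

In a single thread both helpers give the same result; they only differ when another thread modifies the object between the load and the store of the first one.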
I am writing a signal processing program using MATLAB. I know there are two types of floating-point variables, single and double. To limit memory usage, I want my code to work with only single variables when the system's memory is small, while it can also be adapted to work with double variables when necessary, without significant modification (a simple and light modification before running is OK, i.e., I don't need a runtime-check technique). I know this can be done with macros in C and with templates in C++. I have not found a practical technique for doing this in MATLAB. Do you have any experience with this?
I have a simple idea that I define a global string containing "single" or "double", then I pass this string to any memory allocation method called in my code to indicate what type I need. I think this can work, I just want to know which technique you guys use and is widely accepted.
I cannot see how a template would help here. The types of C++ templates are still determined at compile time (std::vector vec ...). Also note that MATLAB defines all variables as double by default unless something else is stated. You basically want runtime checks for your code. One solution I can think of is a function with a persistent variable. The variable is set once per run. You would then have to create every variable that should be a float through this function. This will slow down assignment though, since you have to call a function to assign variables.
This example is somewhat like an implementation of the singleton pattern (but not exactly). The persistent variable type is set at the first use and cannot change later in the program (assuming that you do not do anything stupid such as clearing the variable explicitly). If performance is an issue, I would recommend hardcoding single instead of having runtime checks, assignment functions, classes or whatever else you can come up with.
function c = assignFloat(a,b)
    % Convert a to the float type chosen on the first call ('single' by default).
    persistent type;
    if (isempty(type) && nargin==2)
        type = b;
    elseif (isempty(type))
        type = 'single';
    % elseif(nargin==2), error('Do not set twice!') % Optional code, imo unnecessary.
    end
    if (strcmp(type,'single'))
        c = single(a);
        return;
    end
    c = double(a);
end
I'm trying to determine the "Swift-y" way of creating my own contiguous memory containers (in my particular case, I'm building n-dimensional arrays). I want my containers to be as close to Swift's builtin Array as possible - in terms of functionality and usability.
I need to access the pointer to memory of my containers for stuff like Accelerate and BLAS operations.
I want to know whether an ArraySlice's pointer would point to the first element of the slice, or the first element of its base.
When I tried to test UnsafePointer<Int>(array) == UnsafePointer<Int>(array[1...2]) it looks like Swift doesn't allow pointer construction from ArraySlices (or I just did it incorrectly).
I'm looking for advice on which way would be the most "Swift-y"?
I understand that when slicing an array the following is true:
let array = [1, 2, 3]
array[1] == array[1...2][1]
and
array[1...2][0] != 2 // index out of bounds error
In other words, indexing is always performed relative to the base Array.
This suggests that we should return a pointer to the base's first element, because slices are indexed relative to their base.
However, iteration through a slice (obviously) only considers the elements of that slice:
for i in array[1...2] // i takes on 2 followed by 3
This suggests that we should return a pointer to the slice's first element, because slices have their own starting point.
If my user wanted to operate on a slice in a BLAS operation it would be intuitive to expect:
mmul(matrix1[1...2, 0...1].pointer, matrix2[4...5, 0...1].pointer)
to point to the first elements of slice, but I don't know if this is the way a Swift ArraySlice would do things.
My question: should a container slice object's pointer point to the first element of the slice, or to the first element of the base container?
This operation is unsafe:
UnsafePointer<Int>(array)
What you mean is:
array.withUnsafeBufferPointer { ... }
This applies to your types as well, and is the pattern you should employ to interoperate with BLAS and Accelerate. You should not try to use a pointer method IMO.
There is no promise that array will continue to exist by the time you actually access the pointer, even if that happens in the same line of code. ARC is free to destroy that memory shockingly quickly.
UnsafeBufferPointer is actually a very nice type in that it is already promised to be contiguous and it behaves as a Collection.
My suggestion here would be to manage your own memory internally, probably with a ManagedBuffer, but maybe just with an UnsafeMutablePointer that you alloc and destroy yourself. It's very important that you manage the layout of the data so that it's compatible with Accelerate. You don't want Array<Array<UInt8>>. That's going to add too much structure. You want a blob of bytes that you index into in the good-ol' C ways (row*width+column, etc). You probably don't want your slices to return pointers directly at all. Your mmul function is likely going to need special logic to understand how to pull the pieces it needs out of slices with minimal copying so that it works with vDSP_mmul. "Generic" and "Accelerate" seldom go together.
For example, considering this:
mmul(matrix1[1...2, 0...1].pointer, matrix2[4...5, 0...1].pointer)
(Obviously I assume your real matrices are dramatically larger; this kind of matrix doesn't make much sense to send to vDSP.)
You're going to have to write your own mmul here obviously since this memory isn't laid out correctly. So you might as well pass the slices. Then you'd do something like (totally untested, uncompiled, and I'm sure the syntax is wildly wrong):
func mmul(m1: MatrixSlice, m2: MatrixSlice) -> Matrix {
    let s1 = UnsafeMutablePointer<Float>.allocate(capacity: m1.rows * m1.columns)
    // use vDSP_vgathr to copy each sliced row out of m1 into s1
    let s2 = UnsafeMutablePointer<Float>.allocate(capacity: m2.rows * m2.columns)
    // use vDSP_vgathr to copy each sliced row out of m2 into s2
    let result = UnsafeMutablePointer<Float>.allocate(capacity: m1.rows * m2.columns)
    vDSP_mmul(s1, 1, s2, 1, result, 1, vDSP_Length(m1.rows), vDSP_Length(m2.columns), vDSP_Length(m1.columns))
    // the gathered copies are only needed for the multiplication
    s1.deallocate()
    s2.deallocate()
    // This will need to call result.move() or moveInitializeFrom or something
    return Matrix(result)
}
I'm just throwing out stuff here, but this is probably the kind of structure you'd want.
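The same gather-then-multiply structure can be sketched in plain C, which sidesteps the Swift syntax questions. Everything here is an assumption made for illustration: the matrix_slice struct, the gather_slice and mmul_slices names, and the naive triple loop that stands in for vDSP_mmul. Storage is one flat row-major buffer indexed as row * width + column:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical view into a flat row-major buffer; it owns no memory. */
struct matrix_slice {
    const float *base;    /* first element of the whole parent matrix */
    size_t base_columns;  /* width of the parent matrix */
    size_t row0, col0;    /* top-left corner of the slice inside the parent */
    size_t rows, columns; /* extent of the slice */
};

/* Copy the slice into a freshly allocated contiguous buffer (caller frees). */
float *gather_slice(const struct matrix_slice *s)
{
    float *out = malloc(s->rows * s->columns * sizeof *out);
    if (out == NULL)
        return NULL;
    for (size_t r = 0; r < s->rows; r++)
        for (size_t c = 0; c < s->columns; c++)
            out[r * s->columns + c] =
                s->base[(s->row0 + r) * s->base_columns + (s->col0 + c)];
    return out;
}

/* Multiply two slices; returns a new rows(a) x columns(b) buffer, or NULL on failure. */
float *mmul_slices(const struct matrix_slice *a, const struct matrix_slice *b)
{
    assert(a->columns == b->rows);
    float *s1 = gather_slice(a);
    float *s2 = gather_slice(b);
    float *result = calloc(a->rows * b->columns, sizeof *result);
    if (s1 != NULL && s2 != NULL && result != NULL) {
        /* Naive product standing in for vDSP_mmul on the gathered copies. */
        for (size_t i = 0; i < a->rows; i++)
            for (size_t k = 0; k < a->columns; k++)
                for (size_t j = 0; j < b->columns; j++)
                    result[i * b->columns + j] +=
                        s1[i * a->columns + k] * s2[k * b->columns + j];
    } else {
        free(result);
        result = NULL;
    }
    free(s1);
    free(s2);
    return result;
}
```

Here the slice's data pointer is never handed out directly. The slice records where it sits in the parent, and the multiplication routine decides when to copy, which matches the suggestion above to pass slices rather than pointers.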
To your underlying question about whether the pointer to the container or to the data is usually passed by Swift, the answer is unfortunately "magic" for Array and no one else. Passing an Array to something that wants a pointer will magically (by the compiler, not the stdlib) pass a pointer to the storage of the Array. No other type gets this magic. Not even ContiguousArray gets this magic. If you pass a ContiguousArray to something that wants a pointer, you'll pass the pointer to the container (and if it's mutable, corrupt the container; true story, hated that one...)
Thanks in part to @RobNapier, the answer to my question is: an ArraySlice's pointer should point to the slice's first element.
The way I verified this was simply:
var array = [5,4,3,325,67,7,3]
array.withUnsafeBufferPointer{ $0 } != array[3...6].withUnsafeBufferPointer{ $0 } // true
// the left side points to 5's address, the right side points to 325's address
This is a general algorithm question, but my primary environment is Matlab.
I have a function
out = f(arg1, arg2, ...)
which takes a long time to execute and is expensive to compute (i.e. cluster time). A given argument argn can be a string, an integer, a vector, or even a function handle.
For this reason, I want to avoid calling f(args) for the same argument values. Inside my program, this can occur in ways that are not necessarily controllable by the programmer.
So, I want to call f() once for each possible value of args, and save the results to disk. Then, whenever it is called the next time, check if there is currently a result for those argument values. If so, I would load it from disk.
My current idea is to create a cell variable, with one row for each function call. In the first column is out. In column 2:N are the values of argn, and check the equivalence of each individually.
Since the variable types of the arguments vary, how would I go about doing this?
Is there a better algorithm?
More generally, how do people deal with saving simulation results to disk and storing metadata? (other than cramming everything into a filename!)
You can implement a function that looks something like this:
function result = myfun(input)
    persistent cache   % survives between calls to myfun
    if isempty(cache)
        cachedInputs = [];
        cachedOutputs = [];
        cache = {cachedInputs, cachedOutputs};
    end
    % Look the input up among the inputs seen so far
    [isCached, idx] = ismember(input, cache{1});
    if isCached
        result = cache{2}(idx);
    else
        result = doHardThingOnCluster(input);
        cache{1}(end+1) = input;
        cache{2}(end+1) = result;
    end
This simple example assumes that your inputs and outputs are both scalar numbers that can be stored in an array. If you have to deal with strings, or anything more complicated, you could use a cell array for caching rather than an array. Or in fact, maybe a containers.Map might be even better. Alternatively, if you have to cache really massive results, you might be better off saving it to a file and caching the file name, then loading the file in if you find it's been cached.
Hope that helps!