Write from compute shader to persistently mapped SSBO fails - scala

I'm trying to write to an SSBO with a compute shader and read the data back on the CPU.
The compute shader is just a 1x1x1 toy example that writes 24 floats:
#version 450 core
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout (std430, binding = 0) buffer particles {
    float Particle[];
};
void main() {
    for (int i = 0; i < 24; ++i) {
        Particle[i] = i + 1;
    }
}
This is how I run the shader and read the data:
val bufferFlags = GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT
val bufferSize = 24 * 4
val bufferId = glCreateBuffers()
glNamedBufferStorage(bufferId, bufferSize, bufferFlags)
val mappedBuffer = glMapNamedBufferRange(bufferId, 0, bufferSize, bufferFlags)
mappedBuffer.rewind()
val mappedFloatBuffer = mappedBuffer.asFloatBuffer()
mappedFloatBuffer.rewind()
val ssboIndex = glGetProgramResourceIndex(progId, GL_SHADER_STORAGE_BLOCK, "particles")
val props = Array(GL_BUFFER_BINDING)
val params = Array(-1)
glGetProgramResourceiv(progId, GL_SHADER_STORAGE_BLOCK, ssboIndex, props, null, params)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, params(0), bufferId)
glUseProgram(progId)
val sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
glDispatchCompute(1, 1, 1)
glClientWaitSync(sync, 0, 1000000000) match {
  case GL_TIMEOUT_EXPIRED =>
    println("Timeout expired")
  case GL_WAIT_FAILED =>
    println("Wait failed. " + glGetError())
  case _ =>
    println("Result:")
    while (mappedFloatBuffer.hasRemaining) {
      println(mappedFloatBuffer.get())
    }
}
I expect it to print the numbers 1 to 24 but instead it prints 24 zeros. From the CPU I can read and write to the buffer just fine (if GL_MAP_WRITE_BIT is set). The same happens if I don't use DSA (glBindBuffer / glBufferStorage / glMapBufferRange instead). However, if the buffer isn't mapped while the shader runs and I only map it just before printing the contents, everything works correctly. Isn't this exactly what persistently mapped buffers are for, so I can keep the buffer mapped while the GPU is using it?
I checked for errors, both with glGetError and with the newer debug output, but I don't get any.
Here (pastebin) is a fully working example. You need LWJGL to run it.

There are a number of problems in your code.
First, you put the fence sync before the command you want to sync with. Syncing with a fence syncs with all commands executed before the fence, not after. If you want to sync with the compute shader execution, then you have to insert the fence after the dispatch call, not before.
Second, synchronization is not enough. Writes to an SSBO are incoherent, so you must follow the rules for incoherent memory accesses in order to make them visible to you. In this case, you need to insert an appropriate memory barrier, via glMemoryBarrier, between the compute operation and your read of the buffer. Since you're reading the data via a mapping, the proper barrier to use is GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT.
Your code appears to work when you use non-persistent mapping, but this is mere appearance. It's still undefined behavior due to the improper incoherent memory access (i.e., the lack of a memory barrier). It just so happens that the UB does what you want... in that case.
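Putting both fixes together, here is a minimal sketch of the corrected ordering (assuming the same LWJGL setup as in the question, with progId bound to the compute program and mappedFloatBuffer obtained from the persistent mapping):
glUseProgram(progId)
glDispatchCompute(1, 1, 1)
// Make the shader's SSBO writes visible to reads through the persistent mapping.
glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT)
// The fence goes *after* the dispatch, so waiting on it waits for the compute work.
val sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
// GL_SYNC_FLUSH_COMMANDS_BIT ensures the fence is actually submitted before we block on it.
glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000L) match {
  case GL_ALREADY_SIGNALED | GL_CONDITION_SATISFIED =>
    while (mappedFloatBuffer.hasRemaining) {
      println(mappedFloatBuffer.get())
    }
  case other =>
    println("Sync failed or timed out: " + other)
}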

Transferring arrays/classes/records between locales

In a typical N-Body simulation, at the end of each epoch, each locale needs to share its own portion of the world (i.e. all bodies) with the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I encountered some strange behaviours that I couldn't make sense of, so I decided to make a test program, in which things got more complicated. Here's the code to replicate the experiment.
proc log(args...?n) {
  writeln("[locale = ", here.id, "] [", datetime.now(), "] => ", args);
}

const max: int = 50000;

record stuff {
  var x1: int;
  var x2: int;

  proc init() {
    this.x1 = here.id;
    this.x2 = here.id;
  }
}

class ctuff {
  var x1: int;
  var x2: int;

  proc init() {
    this.x1 = here.id;
    this.x2 = here.id;
  }
}

class wrapper {
  // The point is that the total size (in bytes) of the data in `r`, `c` and `a` is the same here,
  // because the record and the class hold two ints per index.
  var r: [{1..max / 2}] stuff;
  var c: [{1..max / 2}] owned ctuff?;
  var a: [{1..max}] int;

  proc init() {
    this.a = here.id;
  }
}

proc test() {
  var wrappers: [LocaleSpace] owned wrapper?;

  coforall loc in LocaleSpace {
    on Locales[loc] {
      wrappers[loc] = new owned wrapper();
    }
  }

  // rest of the experiment further down.
}
Two interesting behaviours happen here.
1. Moving data
Now, each instance of wrapper in the array wrappers should live on its own locale. Specifically, the references (wrappers) will live on locale 0, but the internal data (r, c, a) should live on the respective locales. So we try to move some of it from locale 1 to locale 3, as such:
on Locales[3] {
  var timer: Timer;
  timer.start();
  var local_stuff = wrappers[1]!.r;
  timer.stop();
  log("get r from 1", timer.elapsed());
  log(local_stuff);
}
on Locales[3] {
  var timer: Timer;
  timer.start();
  var local_c = wrappers[1]!.c;
  timer.stop();
  log("get c from 1", timer.elapsed());
}
on Locales[3] {
  var timer: Timer;
  timer.start();
  var local_a = wrappers[1]!.a;
  timer.stop();
  log("get a from 1", timer.elapsed());
}
Surprisingly, my timings show that:
1. Regardless of the size (const max), the time to send the array and the record stays constant, which doesn't make sense to me. I even checked with chplvis, and the size of the GET actually increases, but the time stays the same.
2. The time to send the class field increases with the size, which makes sense, but it is quite slow and I don't know which case to trust here.
2. Querying the locales directly.
To demystify the problem, I also query the .locale.id of some variables directly. First, we query the data, which we expect to live in locale 2, from locale 2:
on Locales[2] {
  var wrappers_ref = wrappers[2]!; // This is always 1 GET from 0, okay.
  log("array",
      wrappers_ref.a.locale.id,
      wrappers_ref.a[1].locale.id
  );
  log("record",
      wrappers_ref.r.locale.id,
      wrappers_ref.r[1].locale.id,
      wrappers_ref.r[1].x1.locale.id,
  );
  log("class",
      wrappers_ref.c.locale.id,
      wrappers_ref.c[1]!.locale.id,
      wrappers_ref.c[1]!.x1.locale.id
  );
}
And the result is:
[locale = 2] [2020-12-26T19:36:26.834472] => (array, 2, 2)
[locale = 2] [2020-12-26T19:36:26.894779] => (record, 2, 2, 2)
[locale = 2] [2020-12-26T19:36:27.023112] => (class, 2, 2, 2)
Which is expected. Yet, if we query the locale of the same data on locale 1, then we get:
[locale = 1] [2020-12-26T19:34:28.509624] => (array, 2, 2)
[locale = 1] [2020-12-26T19:34:28.574125] => (record, 2, 2, 1)
[locale = 1] [2020-12-26T19:34:28.700481] => (class, 2, 2, 2)
Implying that wrappers_ref.r[1].x1 lives on locale 1, even though it clearly should be on locale 2. My only guess is that by the time .locale.id is executed, the data (i.e. the .x1 of the record) has already been moved to the querying locale (1).
So all in all, the second part of the experiment led to a secondary question, while not answering the first part.
NOTE: all experiments are run with -nl 4 in the chapel/chapel-gasnet Docker image.
Good observations, let me see if I can shed some light.
As an initial note, any timings taken with the gasnet Docker image should be taken with a grain of salt since that image simulates the execution across multiple nodes using your local system rather than running each locale on its own compute node as intended in Chapel. As a result, it is useful for developing distributed memory programs, but the performance characteristics are likely to be very different than running on an actual cluster or supercomputer. That said, it can still be useful for getting coarse timings (e.g., your "this is taking a much longer time" observation) or for counting communications using chplvis or the CommDiagnostics module.
With respect to your observations about timings, I also observe that the array-of-class case is much slower, and I believe I can explain some of the behaviors:
First, it's important to understand that any cross-node communications can be characterized using a formula like alpha + beta*length. Think of alpha as representing the basic cost of performing the communication, independent of length. This represents the cost of calling down through the software stack to get to the network, putting the data on the wire, receiving it on the other side, and getting it back up through the software stack to the application there. The precise value of alpha will depend on factors like the type of communication, choice of software stack, and physical hardware. Meanwhile, think of beta as representing the per-byte cost of the communication where, as you intuit, longer messages necessarily cost more because there's more data to put on the wire, or potentially to buffer or copy, depending on how the communication is implemented.
In my experience, the value of alpha typically dominates beta for most system configurations. That's not to say that it's free to do longer data transfers, but that the variance in execution time tends to be much smaller for longer vs. shorter transfers than it is for performing a single transfer versus many. As a result, when choosing between performing one transfer of n elements vs. n transfers of 1 element, you'll almost always want the former.
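To make the alpha-versus-beta tradeoff concrete, here is a tiny back-of-the-envelope sketch (in Scala, purely for the arithmetic); the alpha and beta values are invented for illustration and don't correspond to any particular network:
val alpha = 2.0e-6                              // hypothetical per-message cost: 2 microseconds
val beta  = 0.5e-9                              // hypothetical per-byte cost: 0.5 nanoseconds
val oneBigTransfer = alpha + beta * 400000      // one 400 KB message: roughly 0.2 ms
val manySmallOnes  = 50000 * (alpha + beta * 8) // 50,000 8-byte messages: roughly 0.1 s
Even with these made-up numbers, n transfers of one element cost hundreds of times more than one transfer of n elements, which is the effect described above.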
To investigate your timings, I bracketed your timed code portions with calls to the CommDiagnostics module as follows:
resetCommDiagnostics();
startCommDiagnostics();
...code to time here...
stopCommDiagnostics();
printCommDiagnosticsTable();
and found, as you did with chplvis, that the number of communications required to localize the array of records or array of ints was constant as I varied max, for example:
locale  get  execute_on
0       0    0
1       0    0
2       0    0
3       21   1
This is consistent with what I'd expect from the implementation: That for an array of value types, we perform a fixed number of communications to access array meta-data, and then communicate the array elements themselves in a single data transfer to amortize the overheads (avoid paying multiple alpha costs).
In contrast, I found that the number of communications for localizing the array of classes was proportional to the size of the array. For example, for the default value of 50,000 for max, I saw:
locale  get    put    execute_on
0       0      0      0
1       0      0      0
2       0      0      0
3       25040  25000  1
I believe the reason for this distinction relates to the fact that c is an array of owned classes, in which only a single class variable can "own" a given ctuff object at a time. As a result, when copying the elements of array c from one locale to another, you're not just copying raw data, as with the record and integer cases, but also performing an ownership transfer per element. This essentially requires setting the remote value to nil after copying its value to the local class variable. In our current implementation, this seems to be done using a remote get to copy the remote class value to the local one, followed by a remote put to set the remote value to nil, hence, we have a get and put per array element, resulting in O(n) communications rather than O(1) as in the previous cases. With additional effort, we could potentially have the compiler optimize this case, though I believe it will always be more expensive than the others due to the need to perform the ownership transfer.
I tested the hypothesis that owned classes were resulting in the additional overhead by changing your ctuff objects from being owned to unmanaged, which removes any ownership semantics from the implementation. When I do this, I see a constant number of communications, as in the value cases:
locale  get  execute_on
0       0    0
1       0    0
2       0    0
3       21   1
I believe this represents the fact that once the language has no need to manage the ownership of the class variables, it can simply transfer their pointer values in a single transfer again.
Beyond these performance notes, it's important to understand a key semantic difference between classes and records when choosing which to use. A class object is allocated on the heap, and a class variable is essentially a reference or pointer to that object. Thus, when a class variable is copied from one locale to another, only the pointer is copied, and the original object remains where it was (for better or worse). In contrast, a record variable represents the object itself, and can be thought of as being allocated "in place" (e.g., on the stack for a local variable). When a record variable is copied from one locale to another, it's the object itself (i.e., the record's fields' values) that is copied, resulting in a new copy of the object. See this SO question for further details.
Moving on to your second observation, I believe that your interpretation is correct, and that this may be a bug in the implementation (I need to stew on it a bit more to be confident). Specifically, I think you're correct that what's happening is that wrappers_ref.r[1].x1 is being evaluated, with the result being stored in a local variable, and that the .locale.id query is being applied to the local variable storing the result rather than the original field. I tested this theory by taking a ref to the field and then printing locale.id of that ref, as follows:
ref x1loc = wrappers_ref.r[1].x1;
...x1loc.locale.id...
and that seemed to give the right result. I also looked at the generated code, which seemed to indicate that our theories were correct. I don't believe the implementation should behave this way, but I need to think about it a bit more before being confident. If you'd like to open a bug against this on Chapel's GitHub issues page for further discussion, we'd appreciate that.

CPU memory allocation failed in Dynet

I am not sure why I am running out of memory. Take Goldberg's parser; all I do is change this line:
scores, exprs = self.__evaluate(conll_sentence, True)
and add a for loop around it to repeat it K times:
for k in xrange(K):
    scores, exprs = self.__evaluate(conll_sentence, True)
    # do something
Then in getExpr, I do the following:
samples_out = np.random.normal(0,0.001, (1, self.hidden_units))
samples_FOH = np.random.normal(0,0.001,(self.hidden_units, self.ldims * 2))
samples_FOM = np.random.normal(0,0.001,(self.hidden_units, self.ldims * 2))
samples_Bias = np.random.normal(0,0.001, (self.hidden_units))
XoutLayer = self.outLayer.expr()+inputTensor(samples_out)
XhidLayerFOH = self.hidLayerFOH.expr()+inputTensor(samples_FOH)
XhidLayerFOM = self.hidLayerFOM.expr()+inputTensor(samples_FOM)
XhidBias = self.hidBias.expr()+inputTensor(samples_Bias)
if sentence[i].headfov is None:
    sentence[i].headfov = XhidLayerFOH * concatenate([sentence[i].lstms[0], sentence[i].lstms[1]])
if sentence[j].modfov is None:
    sentence[j].modfov = XhidLayerFOM * concatenate([sentence[j].lstms[0], sentence[j].lstms[1]])
output = XoutLayer * self.activation(sentence[i].headfov + sentence[j].modfov + XhidBias)
return output
Essentially what happens in the above block is that normally distributed noise is first generated and then added to the trained values. But it seems that somewhere along the way all the generated values stay in memory and it just runs out of memory. Does anyone know why?
Dynet expressions stay in memory until the next call to renew_cg().
So the fix would be to call it after each iteration of your loop, provided that you have retrieved all information you needed from the computation graph.
Side note: when you do a simple addition, such as:
XoutLayer = self.outLayer.expr()+inputTensor(samples_out)
no addition is actually performed. You just create a new expression and specify how to evaluate it from other expressions. The actual computation is performed when there is a call to .forward() (or .value(), etc) on XoutLayer or on an expression whose computation depends on XoutLayer.
So, dynet needs to allocate memory for all expressions in the current computation graph.
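To illustrate that lazy-evaluation point with a toy sketch in Scala (this is only an illustration of the general idea, not Dynet's API): building an expression merely allocates a graph node, and the arithmetic happens when forward() is finally called.
sealed trait Expr { def forward(): Double }
case class Const(v: Double) extends Expr { def forward() = v }
case class Add(a: Expr, b: Expr) extends Expr { def forward() = a.forward() + b.forward() }

val x = Add(Const(1.0), Const(2.0)) // no addition yet, just a new node in the "graph"
val result = x.forward()            // the actual computation happens here
Every such node stays alive until the graph is discarded, which, per the answer above, is what the call to renew_cg() does in Dynet.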

Expensive flatMap() operation on streams originating from Stream.emits()

I just encountered an issue with degrading fs2 performance when using a stream of strings to be written to a file via text.utf8encode. I tried to change my source to use chunked strings to increase performance, but I observed performance degradation instead.
As far as I can see, it boils down to the following: invoking flatMap on a stream that originates from Stream.emits() can be very expensive. Time usage seems to grow exponentially with the size of the sequence passed to Stream.emits(). The code snippet below shows an example:
/*
Test done with scala 2.11.11 and fs2 version 0.10.0-M7.
*/
val rangeSize = 20000
val integers = (1 to rangeSize).toVector
// Note that the last flatMaps are just added to show extreme load for streamA.
val streamA = Stream.emits(integers).flatMap(Stream.emit(_))
val streamB = Stream.range(1, rangeSize + 1).flatMap(Stream.emit(_))
streamA.toVector // Uses approx. 25 seconds (!)
streamB.toVector // Uses approx. 15 milliseconds
Is this a bug, or should usage of Stream.emits() for large sequences be avoided?
TLDR: Allocations.
Longer answer:
Interesting question. I ran a JFR profile on both methods separately and looked at the results. The first thing that immediately caught my eye was the amount of allocation.
(JFR allocation profiles for Stream.emit and Stream.range; the screenshots are not reproduced here.)
We can see that Stream.emit allocates a significant number of Append instances, which are the concrete implementation of Catenable[A], the type used in Stream.emit to fold:
private[fs2] final case class Append[A](left: Catenable[A], right: Catenable[A]) extends Catenable[A]
This actually comes from the way the elements are folded into a Catenable[A]:
foldLeft(empty: Catenable[B])((acc, a) => acc :+ f(a))
Where :+ allocates a new Append object for each element. This means we're at least generating 20000 such Append objects.
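As a rough illustration of why that pattern is allocation-heavy, here is a small, self-contained Scala sketch of a Catenable-like structure built with :+ (a simplification for illustration, not fs2's actual code):
sealed trait Cat[+A]
case object Empty extends Cat[Nothing]
final case class Single[A](a: A) extends Cat[A]
final case class Append[A](left: Cat[A], right: Cat[A]) extends Cat[A]

// :+ wraps the existing structure in a fresh Append node for every element appended.
def snoc[A](c: Cat[A], a: A): Cat[A] = c match {
  case Empty => Single(a)
  case other => Append(other, Single(a))
}

// Folding 20000 elements this way allocates roughly 20000 Append nodes (plus the Singles).
val built = (1 to 20000).foldLeft(Empty: Cat[Int])((acc, a) => snoc(acc, a))
Compare this with the Stream.range implementation shown below, which just unfolds plain integers.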
There is also a hint in the documentation of Stream.range: it produces the range lazily rather than as a single chunk (which is what emits would give you), and producing a big range as one chunk may be bad:
/**
 * Lazily produce the range `[start, stopExclusive)`. If you want to produce
 * the sequence in one chunk, instead of lazily, use
 * `emits(start until stopExclusive)`.
 *
 * @example {{{
 * scala> Stream.range(10, 20, 2).toList
 * res0: List[Int] = List(10, 12, 14, 16, 18)
 * }}}
 */
def range(start: Int, stopExclusive: Int, by: Int = 1): Stream[Pure,Int] =
  unfold(start){i =>
    if ((by > 0 && i < stopExclusive && start < stopExclusive) ||
        (by < 0 && i > stopExclusive && start > stopExclusive))
      Some((i, i + by))
    else None
  }
You can see that there is no additional wrapping here, only the integers that get emitted as part of the range. On the other hand, Stream.emits creates an Append object for every element in the sequence, where left contains the elements accumulated so far and right contains the current value.
Is this a bug? I would say no, but I would definitely raise it as a performance issue with the fs2 library maintainers.

Efficient way to read binary files in Scala

I'm trying to read a binary file (16 MB) that contains only integers coded on 16 bits. For that, I read it in chunks of 1 MB, which gives me an array of bytes. For my own needs I convert this byte array to a short array with the following convert function, but reading the file with a buffer and converting it into a short array takes me 5 seconds. Is there a faster way than my solution?
def convert(in: Array[Byte]): Array[Short] = in.grouped(2).map {
  case Array(one) => (one << 8 | (0.toByte)).toShort
  case Array(hi, lo) => (hi << 8 | lo).toShort
}.toArray
val startTime = System.nanoTime()
val file = new RandomAccessFile("foo","r")
val defaultBlockSize = 1 * 1024 * 1024
val byteBuffer = new Array[Byte](defaultBlockSize)
val chunkNums = (file.length / defaultBlockSize).toInt
for (i <- 1 to chunkNums) {
  val seek = (i - 1) * defaultBlockSize
  file.seek(seek)
  file.read(byteBuffer)
  val s = convert(byteBuffer)
  println(byteBuffer.size)
}
val stopTime = System.nanoTime()
println("Perf of = " + ((stopTime - startTime) / 1000000000.0) + " for a duration of " + duration + " s")
16 MB easily fits in memory unless you're running this on a feature phone or something. No need to chunk it and make the logic harder.
Just gulp the whole file at once with java.nio.files.Files.readAllBytes:
val buffer = java.nio.files.Files.readAllBytes(myfile.toPath)
assuming you are not stuck with Java 1.6. (If you are stuck with Java 1.6, pre-allocate your buffer size using myfile.size, and use read on a FileInputStream to get it all in one go. It's not much harder, just don't forget to close it!)
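For what it's worth, a rough sketch of that pre-NIO fallback might look like the following (using the file name "foo" from the question and assuming, as above, that the whole file fits comfortably in memory):
import java.io.{File, FileInputStream}

val myfile = new File("foo")
val bytes = new Array[Byte](myfile.length.toInt)
val in = new FileInputStream(myfile)
try {
  // read() may return fewer bytes than requested, so loop until the array is full
  var off = 0
  while (off < bytes.length) {
    val n = in.read(bytes, off, bytes.length - off)
    if (n < 0) throw new java.io.EOFException("unexpected end of file")
    off += n
  }
} finally in.close()
Either way you end up with a plain byte array to hand to the ByteBuffer conversion below.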
Then if you don't want to convert it yourself, you can
val bb = java.nio.ByteBuffer.wrap(buffer)
bb.order(java.nio.ByteOrder.nativeOrder)
val shorts = new Array[Short](buffer.length/2)
bb.asShortBuffer.get(shorts)
And you're done.
Note that this is all Java stuff; there's nothing Scala-specific here save the syntax.
If you're wondering why this is so much faster than your code, it's because grouped(2) boxes the bytes and places them in an array. That's three allocations for every short you want! You can do it yourself by indexing the array directly, and that will be fast, but why would you want to when ByteBuffer and friends do exactly what you need already?
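If you did want to do the conversion by hand, a minimal sketch of that direct-indexing approach might look like the following (the helper name is just for illustration; it assumes big-endian byte pairs, as in the original convert, and ignores a trailing odd byte):
def convertByIndexing(in: Array[Byte]): Array[Short] = {
  val out = new Array[Short](in.length / 2)
  var i = 0
  while (i < out.length) {
    // mask both bytes so sign extension doesn't leak into the high bits
    out(i) = (((in(2 * i) & 0xFF) << 8) | (in(2 * i + 1) & 0xFF)).toShort
    i += 1
  }
  out
}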
If you really, really care about that last (odd) byte, then you can use (buffer.length + 1)/2 for the size of shorts, and tack on an if ((buffer.length & 1) == 1) shorts(shorts.length - 1) = ((buffer(buffer.length - 1) & 0xFF) << 8).toShort to grab the last byte.
A couple of issues pop out:
If byteBuffer is always going to be 1024*1024 size then the case Array(one) in convert will never actually be used and therefore pattern matching is unnecessary.
Also, you can avoid the for loop with a tail recursive function. After the val byteBuffer = ... line you can replace the chunkNums and for loop with:
@scala.annotation.tailrec
def readAndConvert(b: List[Array[Short]], file: RandomAccessFile): List[Array[Short]] = {
  if (file.read(byteBuffer) < 0)
    b
  else {
    // read() already advances the file position, so no extra seek/skip is needed here
    readAndConvert(b.+:(convert(byteBuffer)), file)
  }
}
val sValues = readAndConvert(List.empty[Array[Short]], file)
Note: because prepending to a list is much faster than appending, the above loop gives you the converted values in reverse order relative to the order they were read from the file.

Different result returned using Scala Collection par in a series of runs

I have tasks that I want to execute concurrently, and each task takes a substantial amount of memory, so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap { tuple =>
  tuple.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList // intentionally to make the JVM allocate a sizeable chunk of memory
    s.sum.toLong
  }
}
val result = runme()
println(result.size + " => " + result.sum)
The result I expected from the output was 120 => 84609924480, but the output was rather random. The size of the returned collection differed from execution to execution. Most of the time some results were missing, even though, judging from the console output, all the tasks were executed. I thought flatMap waits for the parallel executions inside map to complete before returning. What should I do to always get the right result using par? Thanks.
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It's fixed from 2.11.6, so if you use that (or higher) Scala version, you should not see the strange behavior.
And about the overflow, I still think that your expected value is wrong: the sum overflows because the list contains Ints (which are 32 bit) while the total sum exceeds the Int range. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList // your original code
val yourValue = s.sum.toLong // your original code
val correctValue = 1l * n * (n + 1) / 2 // use math formula
var bruteForceValue = 0l // in case you don't trust math :) It's Long because of 0l
for (i ← 1 to n) bruteForceValue += i // iterate through range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks @kaktusito.
It worked after I changed the grouped list to a Vector or Seq, i.e. (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...
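For reference, here is a sketch of runme with both adjustments applied: the Vector-based grouping from the follow-up above plus a Long accumulation to sidestep the Int overflow discussed in the answer. It keeps the pre-2.13 parallel-collections style used in the question:
def runme(n: Int = 120) = (1 to n).grouped(2).toVector.flatMap { pair =>
  pair.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList // still materialized to allocate a sizeable chunk of memory
    s.foldLeft(0L)(_ + _)        // accumulate in a Long so the per-task sum doesn't overflow
  }
}

val result = runme()
println(result.size + " => " + result.sum)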