What purpose does a queue serve in System Verilog? - system-verilog

They are not used for RTL but rather verification, correct? They would not be synthesizable.
Do they have better memory management features in turn optimizing program time? If I recall
correctly, System Verilog has an automatic garbage collector, so there is no need to deallocate memory.
The official IEEE documentation does a great job of explaining how they work. I am just wondering in what scenarios I would use one vs an array. One guess would be that they have associated methods that allow for easier data manipulation?
Thank you in advance for your knowledge and expertise.

A queue can be synthesisable if it has a bounded maximum size. Only a few synthesis tools support it, probably none of the FPGA synthesis tools.
The key advantage with a queue is in efficiency adding/removing one element from the array, especially when accessed at the head or tail of the queue. A dynamic array may require reallocation and copying the entire array when modifying its size. The penalty for a queue is the extra time it takes to access elements in the middle of the queue, and extra space compared with the same number of element of a dynamic array.
I hope that 2 answers this question.

Related

What happens when an AppendStructuredBuffer overflows in a compute shader?

I have a Unity project in which I'm writing to an AppendStructuredBuffer<Triangle> via Append(triangle) in a compute shader.
In this instance, I know the theoretical limit to the number of triangles that could exist, so the obvious correct approach is to size the buffer accordingly. As a hack, though, I'm experimenting with allocating drastically smaller buffers so that they can be more efficiently processed by other parts of the system (in particular, reading back to CPU). One could imagine other situations in which a specific limit may not be known, or could be wrongly assumed.
Clearly, this is potentially hazardous. I'm sure there are more robust approaches that could be used for my current system (or more generally) without sacrificing performance, but I'm not (particularly) asking for advice on that.
What I want to know is what the expected behaviour is when a program calls Append() beyond the capacity of such a buffer. I imagine that it is undefined, and potentially liable to corrupt other areas of VRAM, to an extent dependent on GPU drivers / DirectX version etc. It may be that it is more formally specified, but I haven't been able to find that out.
Of course, even if the behaviour is specified, it seems somewhat reckless to deliberately risk. Still, I'd like to know:
Whether it is possible to detect that such a buffer is full in the context of a kernel function (given the highly threaded nature this is likely impractical).
What the performance implications of that are if it is possible.
What the consequences of overflowing are (in this instance I'm specifically anticipating it, but bugs happen).
How all of the above might be expected to differ for different hardware vendors, APIs, etc.
Perhaps it is 'safe' to the extent that excess data will simply be lost to the void without cost. In any case the system can - for example - periodically check fullness of buffers and do any extra housework that may be necessary... leaving the question of how severe any mistakes in the tuning of such a system might be.
Under many circumstances, at least in DirectX, out of bounds access is defined as returning 0. I'm still not totally sure about writes, but think there is reason to believe they should be generally safe in current implementations.
I would still be very wary of relying on this, especially when using other APIs.
According to this specification,
5.3.10.2 Using Unordered Count and Append Buffers
...
The counter behind imm_atomic_alloc and imm_atomic_consume has no
overflow or underflow clamping, and there is no feedback given to the
shader as to whether overflow/underflow happened (wrapping of the
counter). The only thing the counter really accomplishes is a way of
generating unique addresses that is conveniently bundled with the UAV.
Further, https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#inst_IMM_ATOMIC_ALLOC
There is no clamping of the count, so it wraps on overflow.
I don't think I'm wrong in interpretting 'wrapping' as being to the length of the buffer in these instances.
So, the answer as I understand it is that on Append() the internal counter will wrap, and subsequent invocations will end up overwriting earlier data. As it happens, I am currently rendering my buffer without reference to such a counter (because I do another pass on the 'triangles' to turn them into vertices for rendering, which I currently do on a non-AppendBuffer). I should experiment with passing a buffer with a count to that draw call, which should allow me to verify whether most of my model suddenly disappears when I overflow.
In any case, it seems that the operation should be safe in terms of not corrupting other parts of the system, but that referring to the counter may be the wrong way to detect problems.

sharing data between master and slave threads in kdb

What is the ideal method to sharing read only data between the master and slave threads? From my understanding there are two options:
Set shared data as global variable in main so that the slave threads can read them.
Pass shared variables to slave threads as parameters.
From my experiments, there is hardly any difference in performance even with big data set. In fact, 1) has slightly worse performance over 2). I know that for 2), kdb will serialize and serialize parameters. Does it do the same for 1)? That would explain the degradation of performance considering the global variable is bigger in size than thread specific parameter. Are there alternative methods to do this?
Secondly, as slave threads cannot modify global variables. I reckon the only way to share results with main thread is returning them back. Please comment if this is not the case.
EDIT: Performance is measured in terms of runtime before and after call to peach.
Passing values into a function via peach like this
{}[v;] peach vector
sounds nice, and works well, unless v is very large. Each thread gets a copy (even if v is a global).
So the answer depends on your use case. Do you have enough RAM? Can you afford the memory copy given the number of threads? If the answer is yes, then you can afford to do this without too much ill affect (remember allocation will affect time).
I prefer to use globals for this reason.

Writing a functional and yet functional image processing library in Scala

We are developing a small image processing library for Scala (student project). The library is completely functional (i.e. no mutability). The raster of image is stored as Stream[Stream[Int]] to exploit the benefits of lazy evaluation with least efforts. However upon performing a few operations on an image the heap gets full and an OutOfMemoryError is thrown. (for example, up to 4 operations can be performed on a jpeg image sized 500 x 400, 35 kb before JVM heap runs out of space.)
The approaches we have thought of are:
Twiddling with JVM options and increase the heap size. (We don't know how to do this under IDEA - the IDE we are working with.)
Choosing a different data structure than Stream[Stream[Int]], the one which is more suited to the task of image processing. (Again we do not have much idea about the functional data structures beyond the simple List and Stream.)
The last option we have is giving up on immutability and making it a mutable library (like the popular image processing libraries), which we don't really want to do. Please suggest us some way to keep this library functional and still functional, if you know what I mean.
Thank you,
Siddharth Raina.
ADDENDUM:
For an image sized 1024 x 768, the JVM runs out of heap space even for a single mapping operation. Some example code from our test:
val image = Image from "E:/metallica.jpg"
val redded = image.map(_ & 0xff0000)
redded.display(title = "Redded")
And the output:
"C:\Program Files (x86)\Java\jdk1.6.0_02\bin\java" -Didea.launcher.port=7533 "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA Community Edition 10.0.2\bin" -Dfile.encoding=windows-1252 -classpath "C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\charsets.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\deploy.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\javaws.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\jce.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\jsse.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\management-agent.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\plugin.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\resources.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\rt.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\dnsns.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\localedata.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\sunjce_provider.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\sunmscapi.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\sunpkcs11.jar;C:\new Ph\Phoebe\out\production\Phoebe;E:\Inventory\Marvin.jar;C:\scala-2.8.1.final\lib\scala-library.jar;C:\scala-2.8.1.final\lib\scala-swing.jar;C:\scala-2.8.1.final\lib\scala-dbc.jar;C:\new Ph;C:\scala-2.8.1.final\lib\scala-compiler.jar;E:\Inventory\commons-math-2.2.jar;E:\Inventory\commons-math-2.2-sources.jar;E:\Inventory\commons-math-2.2-javadoc.jar;E:\Inventory\jmathplot.jar;E:\Inventory\jmathio.jar;E:\Inventory\jmatharray.jar;E:\Inventory\Javax Media.zip;E:\Inventory\jai-core-1.1.3-alpha.jar;C:\Program Files (x86)\JetBrains\IntelliJ IDEA Community Edition 10.0.2\lib\idea_rt.jar" com.intellij.rt.execution.application.AppMain phoebe.test.ImageTest
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at scala.collection.Iterator$class.toStream(Iterator.scala:1011)
at scala.collection.IndexedSeqLike$Elements.toStream(IndexedSeqLike.scala:52)
at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1011)
at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1011)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:565)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:557)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:168)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:168)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:565)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:557)
at scala.collection.immutable.Stream$$anonfun$flatten1$1$1.apply(Stream.scala:453)
at scala.collection.immutable.Stream$$anonfun$flatten1$1$1.apply(Stream.scala:453)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:565)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:557)
at scala.collection.immutable.Stream.length(Stream.scala:113)
at scala.collection.SeqLike$class.size(SeqLike.scala:221)
at scala.collection.immutable.Stream.size(Stream.scala:48)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:388)
at scala.collection.immutable.Stream.toArray(Stream.scala:48)
at phoebe.picasso.Image.force(Image.scala:85)
at phoebe.picasso.SimpleImageViewer.<init>(SimpleImageViewer.scala:10)
at phoebe.picasso.Image.display(Image.scala:91)
at phoebe.test.ImageTest$.main(ImageTest.scala:14)
at phoebe.test.ImageTest.main(ImageTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
Process finished with exit code 1
If I understood correctly, you store each individual pixel in one Stream element, and this can be inefficient. What you can do is create your custom LazyRaster class which contains lazy references to blocks of the image of some size (for instance, 20x20). The first time some block is written, its corresponding array is initialized, and from there on changing a pixel means writing to that array.
This is more work, but may result in better performance. Furthermore, if you wish to support stacking of image operations (e.g. do a map - take - map), and then evaluating the image in "one-go", the implementation could get tricky - stream implementation is the best evidence for this.
Another thing one can do is ensure that the old Streams are being properly garbage collected. I suspect image object in your example is a wrapper for your streams. If you wish to stack multiple image operations (like mapping) together and be able to gc the references you no longer need, you have to make sure that you don't hold any references to a stream - note that this is not ensured if:
you have a reference to your image on the stack (image in the example)
your Image wrapper contains such a reference.
Without knowing more about the exact use cases, its hard to say more.
Personally, I would avoid Streams altogether, and simply use some immutable array-based data structure which is both space-efficient and avoids boxing. The only place where I potentially see Streams being used is in iterative image transformations, like convolution or applying a stack of filters. You wouldn't have a Stream of pixels, but a Stream of images, instead. This could be a nice way to express a sequence of transformations - in this case, the comments about gc in the link given above apply.
If you process large streams, you need to avoid holding onto a reference to the head of the stream. This will prevent garbage collection.
It's possible that calling certain methods on Stream will internally hold onto the head. See the discussion here: Functional processing of Scala streams without OutOfMemory errors
Stream is very unlikely to be the optimum structure here. Given the nature of a JPEG it makes little sense to "stream" it into memory line-by-line.
Stream also has linear access time for reading elements. Again, probably not what you want unless you're streaming data.
I'd recommend using an IndexedSeq[IndexedSeq[Int]] in this scenario. Or (if performance is important) an Array[Array[Int]], which will allow you to avoid some boxing/unboxing costs.
Martin has written a good overview of the 2.8 collections API which should help you understand the inherent trade-offs in the various collection types available.
Even if using Arrays, there's still every reason to use them as immutable structures and maintain a functional programming style. Just because a structure is mutable doesn't mean you have to mutate it!
I recommend also looking at continuous rather than just discrete models for imagery. Continuous is generally more modular/composable than discrete--whether time or space.
As a first step you should take a memory dump and analyze it. It is very possible that you will see the problem immediately.
There is special command line option to force JVM to make dump on OOME: -XX:+HeapDumpOnOutOfMemoryError. And good tools, like jhat and VisualVM, which can help you in analysis.
Stream is more about lazy evaluation than immutability. And you're
forcing an insane amount of space and time overhead for each pixel by
doing so. Furthermore, Streams only make sense when you can defer the
determination (calculation or retrieval) of individual pixel values.
And, of course, random access is impossible. I'd have to deem the
Stream an entirely inappropriate data structure for image processing.
I'd strongly recommend that you manage your own raster memory (bonus
points for not fixing a single raster image organization into your
code) and allocate storage for whole channels or planes or bands
thereof (depending on the raster organization in play).
UPDATE: By the foregoing, I mean don't use nested Array or IndexedSeq, but allocate a block and compute which element using the row and column values.
Then take an "immutable after initialization" approach. Once a given
pixel or sample has been established in the raster, you never allow it
to be changed. This might require a one-bit raster plane to track the
established pixels. Alternatively, if you know how you'll be filling
the raster (the sequence in which pixels will be assigned) you can get
away with a much simpler and cheaper representation of how much of the
raster is established and how much remains to be filled.
Then as you perform processing on the raster images, do so in a pipeline
where no image is altered in place, but rather a new image is always
generated as various transforms are applied.
You might consider that for some image transformations (convolution,
e.g.) you must take this approach or you will not get the correct
results.
I strongly recommend Okasaki's Purely Functional Data Structures if you don't have any experience with functional data structures (as you seem to indicate).
To increase your heap size using intellij, you need to add the following to the VM Parameters section of the Run/Debug Configuration:
-Xms256m -Xmx256m
This will increase the maximum heap size to 256MB and also ensure this amount is requested by the VM at startup, which generally represents a performance increase.
Additionally, you're using a relatively old JDK. If possible, I recommend you update to the latest available version, as newer builds enable escape analysis, which can in some cases have a dramatic effect on performance.
Now, in terms of algorithms, I would suggest that you follow the advice above and divide the image into blocks of say, 9x9 (any size will do though). I'd then go and have a look at Huet's Zipper and think about how that might be applied to an image represented as a tree structure, and how that might enable you to model the image as a persistent data structure.
Increasing the heap size in idea can be done in the vmoptions file, which can be found in the bin directory in your idea installation directory (add -Xmx512m to set the heap size to 512 megabyte, for example).
Apart from that, it is hard to say what causes the out of memory without knowing what operations you exactly perform, but perhaps this question provides some useful tips.
One solution would be to put the image in an array, and make filters like "map" return a wrapper for that array. Basically, you have a trait named Image. That trait requires abstract pixel retrieving operations. When, for example, the "map" function is called, you return an implementation, which delegates the calls to the old Image, and executes the function on it. The only problem with that would be that the transformation could end up being executed multiple times, but since it is a functional library, that is not very important.

iPhone Objective-C, malloc or NSMutableData?

I'm need to use a volitile block of memory to constantly write and rewrite the data inside using multiple threads. The data will be rendered thread-safe using #synchronized if I utilize either malloc'd data or NSMutableData.
My question is what is more recommended for speed? Seeing I'm running recursivly calculated equations on the matrix of data I need to be able to allocate, retrieve, and set the data as quickly as possible.
I'm going to be doing my own research on the subject but I was wondering if anyone knew off-hand if the overhead of the Objective-C NSMutableData would introduce speed setbacks?
re: psychotik's suggestion: volatile is a keyword in C that basically tells the compiler to avoid optimizing usage of the symbol it's attached to. This is important for multithreaded code, or code that directly interfaces with hardware. However, it's not very useful for working with blocks of memory (from malloc() or NSData.) As psychotik said, it's for use with primitives such as an int or a pointer (i.e. the pointer itself, not the data it points to.) It's not going to make your data access any faster, and may in fact slow it down by defeating the compiler's optimization tricks.
For cross-thread synchronization, your fastest bet is, I think, an OSSpinLock if you don't need recursive access, or a pthread_mutex set up as recursive if you do. Keep in mind OSSpinLock is, as the name suggests, a spin lock, so certain usage patterns make it less efficient than a pthread_mutex, but it's also extremely close to the metal (it's based off the hardware's atomic get/set operations.)
If your data really is being accessed frequently enough that you're concerned with locking performance, you'll probably want to avoid NSData and just work with a block of memory from malloc()--but, without knowing more about what you're trying to accomplish or how frequently you're accessing the data, a solution does not readily present itself. Can you tell us more about your intent?

why does memcached not support "multi set"

Can anyone explain why memcached folks decided to support multi get but not multi set.
By multi I mean operation involving more than one key (see protocol at http://code.google.com/p/memcached/wiki/NewCommands).
So you can get multiple keys in one shot (basic advantage is the standard saving you get by doing less round trips) but why can not you get bulk sets?
My theory is that it was meant to do less number of sets and that too individually (e.g. on a cache read and miss). But I still do not see how multi-set really conflicts with the general philosophy of memcached.
I looked at the client features at http://code.google.com/p/memcached/wiki/NewCommonFeatures and it seems that some clients potentially do support "Multi-Set" (why only in binary protocol?). I am using Java spy memcached, btw.
It's not supported in the text protocol because it'd be very, very complicated to express, no clients would support it, and it would provide very little that you can't already do from the text protocol.
It's supported in the binary protocol because it's a trivial use case of binary operations.
spymemcached supports it implicitly -- just do a bunch of sets and magic happens:
http://dustin.github.com/2009/09/23/spymemcached-optimizations.html
I don't know a lot about memcache internals, but I assume writes have to be blocking, atomic operations. I assume that allowing multiple set operations to be batched, you could block all reads for a long time (or risk a get occurring while only half of a write had been applied). Forcing writes to be done individually allows them to be interleaved fairly with gets.
I would imagine that the restriction against using multi sets is to avoid collisions when writing cached values to the memcache.
As an object cache, I can't foresee an example of when you would need transactional type writes. This use case seems less suited for a caching layer, but better suited for the underlying database.
If sets come in interleaved from different clients, it is most likely the case that for one key, the last one wins, or is at least close enough, until the cache is invalidated and a newer value is written.
As Gian mentions, there don't seem to be any good reasons to block reads from the cache while several or many writes to the cache happen.