Scalable O(1)-memory threaded counters? - atomic

I have a bunch of threads with a bunch of counters. Threads decrement the counters, and interesting things happen if a counter hits zero. This is trivial to implement with atomic ops.
However, it gets harder if we require two properties to hold regardless of the number of threads or counters:
Scalability: Decrementing a counter is O(polylog).
Compactness: The memory per counter is O(1).
I know how to do either one of these in isolation: the trivial implementation is compact, and hierarchical counting networks [4] are scalable. Is it possible to do both?
Note: Since O(n) threads can't make O(n) different changes to O(1) memory in less than O(n) time, solving this requires sharing a data structure between the different counters.
[4]: J. Aspnes, M. Herlihy, and N. Shavit. Counting networks. Journal of the ACM, 41(5):1020-1048, Sept. 1994.
Update: Jed Brown pointed out the obvious fact that O(1) time is impossible. Changed to polylog.

Have you tried Dr. Cliff Click's ConcurrentAutoTable (Counter) from the high-scale-lib:
http://sourceforge.net/projects/high-scale-lib/files/high-scale-lib/high-scale-lib-v1.1.1/
http://www.youtube.com/watch?v=WYXgtXWejRM
http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
http://www.infoq.com/news/2008/05/click_non_blocking/
http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf

There was a paper on scalable counters. You basically have a tree where each thread has a node; a thread wishing to inc/dec posts that fact and then begins climbing the tree toward the counter at the top, accumulating pending inc/dec values on the way, and finally applies the total to the counter. (That's the gist of it - there's a lot of extra detail.)
It distributes the inc/dec traffic away from a single cache line; contention on a single cache line is, of course, exactly what prevents scalability.
Check out the white papers in the wiki at http://www.liblfds.org - you'll find it there.
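For illustration, here is a minimal C++ sketch of the distribution idea behind that scheme (the names are mine, not from the paper): each thread updates its own cache-line-padded slot, and the total is recovered by summing the slots. The paper arranges the accumulation as a climb up a tree rather than a flat scan on read, and note that this is the scalable-but-not-compact design the question already rules out; the sketch only shows how the single hot cache line is avoided.

    #include <atomic>
    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Spread updates across per-thread, cache-line-padded slots instead of one
    // hot atomic; the total is recovered by summing the slots on read.
    struct DistributedCounter {
        struct alignas(64) Slot { std::atomic<long> v{0}; };  // 64 = assumed cache-line size
        std::vector<Slot> slots;

        explicit DistributedCounter(std::size_t threads) : slots(threads) {}

        void add(std::size_t tid, long delta) {               // no shared cache line touched
            slots[tid].v.fetch_add(delta, std::memory_order_relaxed);
        }
        long read() const {                                    // O(threads); a tree walk amortizes this
            long total = 0;
            for (const auto& s : slots) total += s.v.load(std::memory_order_relaxed);
            return total;
        }
    };

    int main() {
        const std::size_t n = 4;
        DistributedCounter c(n);
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < n; ++t)
            workers.emplace_back([&c, t] { for (int i = 0; i < 100000; ++i) c.add(t, -1); });
        for (auto& w : workers) w.join();
        std::cout << c.read() << "\n";                         // prints -400000
    }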

Related

Is kdb fast solely due to processing in memory?

I've heard people talk quite a few times about KDB handling millions of rows in nearly no time. Why is it that fast? Is that solely because the data is all organized in memory?
Another thing: are there alternatives to this? Do any of the big database vendors provide in-memory databases?
A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
Source(s): http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
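To see the locality argument outside of q, here is a small C++ comparison (not kdb code; the struct layout and field names are invented) that sums one million doubles stored contiguously, the way a kdb column is, versus the same values embedded in wide rows. The column version streams about 8 MB; the row version drags roughly ten times as much memory through the cache.

    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    // Row-oriented layout: each record carries 10 doubles (~80 bytes per row).
    struct Row { double f; double other[9]; };

    int main() {
        const std::size_t n = 1'000'000;
        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        std::vector<double> column(n);                  // column store: 8 MB, contiguous
        std::vector<Row> rows(n);                       // row store: ~80 MB
        for (std::size_t i = 0; i < n; ++i) rows[i].f = column[i] = dist(rng);

        auto bench = [](const char* label, auto&& fn) {
            auto t0 = std::chrono::steady_clock::now();
            double s = fn();
            auto t1 = std::chrono::steady_clock::now();
            std::cout << label << ": sum = " << s << ", "
                      << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";
        };
        bench("column", [&] { double s = 0; for (double x : column) s += x; return s; });
        bench("rows  ", [&] { double s = 0; for (const Row& r : rows) s += r.f; return s; });
    }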
As for speed, the in-memory layout does play a big part, but there are several other things: fast reads from disk for the HDB, splayed tables, etc. From personal experience I can say you can get pretty good speeds from C++, provided you want to write that much code. With kdb you get all that and some more.
Another thing about speed is the speed of coding. It's a steep learning curve, but once you get it, complex problems can be coded in minutes.
As for alternatives, you can look at OneTick, or Google for in-memory databases.
kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.

Sieve of Eratosthenes (reducing space complexity)

I wanted to generate prime numbers between two given numbers ‘a’ and ‘b’ (b > a). What I did was store Boolean values in an array of size b-1 (that is for numbers 2 to b) and then I applied the sieve method.
Is there a better way, that reduces space complexity, if I don't need all prime numbers from 2 to b?
You need to store all primes which are smaller than or equal to the square root of b; then, for each number between a and b, check whether it is divisible by any of these primes without being equal to them. So in our case the magic number is sqrt(b).
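A rough C++ sketch of that approach (the function name and types are mine): sieve only up to sqrt(b), then test each candidate in [a, b] against that small prime list, so the working memory is O(sqrt(b)) plus the output.

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    // Primes in [a, b] using only the primes up to sqrt(b) as trial divisors.
    std::vector<long long> primes_in_range(long long a, long long b) {
        long long root = static_cast<long long>(std::sqrt(static_cast<double>(b)));
        std::vector<bool> sieve(root + 1, true);
        std::vector<long long> small;                   // primes <= sqrt(b)
        for (long long i = 2; i <= root; ++i)
            if (sieve[i]) {
                small.push_back(i);
                for (long long j = i * i; j <= root; j += i) sieve[j] = false;
            }
        std::vector<long long> result;
        for (long long n = std::max(a, 2LL); n <= b; ++n) {
            bool prime = true;
            for (long long p : small) {
                if (p * p > n) break;                   // no divisor found below sqrt(n)
                if (n % p == 0 && n != p) { prime = false; break; }
            }
            if (prime) result.push_back(n);
        }
        return result;
    }

    int main() {
        for (long long p : primes_in_range(90, 120)) std::cout << p << ' ';
        std::cout << '\n';                              // 97 101 103 107 109 113
    }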
You can use segmented sieve of Eratosthenes. The basic idea is pretty simple.
In a typical sieve, we start with a large array of Booleans, all set to the same value. These represent odd numbers, starting from 3. We look at the first and see that it's true, so we add it to the list of prime numbers. Then we mark off every multiple of that number as not prime.
Now, the problem with this is that it's not very cache friendly. As we mark off the multiples of each number, we go through the entire array. Then when we reach the end, we start over from the beginning (which is no longer in the cache) and walk through the entire array again. Each time through the array, we read the entire array from main memory again.
For a segmented sieve, we do things a bit differently. We start by finding only the primes up to the square root of the limit we care about. Then we use those to mark off composites in the main array. The difference here is the order in which we mark them off. Instead of marking off all the multiples of three, then all the multiples of five, and so on, we start by marking off the multiples of three for the data that will fit in the cache. Then, instead of continuing on to more data in the array, we go back and mark off the multiples of five for the data that fits in the cache. Then the multiples of seven, and so on.
Then, when we've marked off all the multiples in that cache-sized chunk of data, we move on to the next cache-sized chunk of data. We start over with marking off multiples of 3 in this chunk, then multiples of 5, and so on until we've marked off all the multiples in this chunk. We continue that pattern until we've marked off all the non-prime numbers in all the chunks, and we're done.
So, given N primes below the square root of the limit we care about, a naive sieve will read the entire array of Booleans N times. By contrast, a segmented sieve will only read each chunk of the data once. Once a chunk of data is read from main memory, all the processing on that chunk is done before any more data is read from main memory.
The exact speed-up this gives will depend on the ratio of the speed of cache to the speed of main memory, the size of the array you're using vs. the size of the cache, and so on. Nonetheless, it is generally pretty substantial--for example, on my particular machine, looking for the primes up to 100 million, the segmented sieve has a speed advantage of about 10:1.
One thing you must remember if you're using C++: std::vector<bool> has a well-known issue. Under C++98/03, vector<bool> was required to be a specialization that stored each Boolean as a single bit, with some proxy trickery to get bool-like behavior. That requirement has since been lifted, but many libraries still include it.
With a non-segmented sieve, it's generally a useful trade-off. Although it requires a little extra CPU time to compute masks and such to modify only a single bit at a time, it saves enough bandwidth to main memory to more than compensate.
With a segmented sieve, bandwidth to main memory isn't nearly as large a factor, so using a vector<char> generally seems to give better results (at least with the compilers and processors I have handy).
Getting optimal performance from a segmented sieve does require knowledge of the size of your processor's cache, but getting it precisely correct isn't usually critical--if you assume the size is smaller than it really is, you won't necessarily get optimal use of your cache, but you usually won't lose a lot either.
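Here is a minimal C++ sketch of the segmented sieve described above, using vector<char> rather than vector<bool> and an assumed 64 KiB segment size (the exact size would need tuning for a given cache).

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    std::vector<std::uint64_t> segmented_sieve(std::uint64_t limit) {
        const std::uint64_t seg = 1 << 16;                     // 64 KiB segments (assumed)
        std::uint64_t root = 2;
        while (root * root <= limit) ++root;

        // Small sieve for the primes up to (roughly) sqrt(limit).
        std::vector<char> is_prime(root + 1, 1);
        std::vector<std::uint64_t> small, primes;
        for (std::uint64_t i = 2; i <= root; ++i)
            if (is_prime[i]) {
                small.push_back(i);
                primes.push_back(i);
                for (std::uint64_t j = i * i; j <= root; j += i) is_prime[j] = 0;
            }

        // Sieve one cache-sized segment at a time, crossing off all small
        // primes within a segment before moving to the next one.
        std::vector<char> block(seg);
        for (std::uint64_t low = root + 1; low <= limit; low += seg) {
            std::uint64_t high = std::min(low + seg - 1, limit);
            std::fill(block.begin(), block.end(), 1);
            for (std::uint64_t p : small) {
                std::uint64_t start = ((low + p - 1) / p) * p; // first multiple of p in [low, high]
                if (start < p * p) start = p * p;
                for (std::uint64_t m = start; m <= high; m += p) block[m - low] = 0;
            }
            for (std::uint64_t n = low; n <= high; ++n)
                if (block[n - low]) primes.push_back(n);
        }
        return primes;
    }

    int main() {
        std::cout << segmented_sieve(100'000'000).size() << '\n';  // 5761455 primes up to 10^8
    }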

Estimating possible # of actors in Scala

How can I estimate the number of actors that a Scala program can handle?
For context, I'm contemplating what is essentially a neural net that will be creating and forgetting cells at a high rate. I'm considering making each cell an actor, but there will be millions of them. I'm trying to decide whether this design is worth pursuing, but I can't estimate the limits on the number of actors. My intent is that it should run entirely on one system, so distributed limits don't apply.
For that matter, I haven't definitely settled on Scala, if there's some better choice, but the cells do have state, as in, e.g., their connections to other cells, the weights of the connections, etc. Though this COULD be done as "Each cell is final. Changes mean replacing the current cell with a new one bearing the same id#."
P.S.: I don't know Scala. I'm considering picking it up to do this project. I'm also considering lots of other alternatives, including Java, Object Pascal and Ada. But actors seem a better map to what I'm after than thread pools (and Java can't handle enough threads to make a thread-per-cell design feasible).
P.P.S.: At all times, most of the actors will be quiescent, but there will need to be a way of cycling through the entire collection of them. If there isn't one built into the language, then this can be managed via first/next links within each cell. (Both links are needed to allow cells in the middle to be extracted for release.)
With a neural net simulation, the real question is how much of the computational effort will be spent communicating, and how much will be spent computing something within a cell? If most of the effort is in communication then actors are perhaps a good choice for correctness, but not a good choice at all for efficiency (even with Akka, which performs reasonably well; AsyncFP might do the trick, though). Millions of neurons sounds slow--efficiency is probably a significant concern. If the neurons have some pretty heavy-duty computations to do themselves, then the communications overhead is no big deal.
If communications is the bottleneck and you have lots of tiny messages, then you should design a custom data structure to hold the network, plus custom thread handling that will take advantage of all the processors you have and minimize the amount of locking that you must do. For example, if you have the space, each neuron could hold an array of input values from the neurons linked to it; when calculating its output it would just read that array directly with no locking, and the input neurons would likewise update those values with no locking as they went. You can then dump all your neurons into one big pool and have a master distribute them in chunks of, I don't know, maybe ten thousand at a time, each to its own thread. Scala will work fine for this sort of thing, but expect to do a lot of low-level work yourself, or wait a really long time for the simulation to finish.
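As a rough illustration of that structure (sketched in C++ for concreteness; all names are invented, and the same layout carries over to Scala with plain arrays and a thread pool): each cell owns a slot array that its upstream cells write into directly with relaxed atomic stores, accepting slightly stale reads instead of locking, and a master loop hands out large chunks of the pool to worker threads.

    #include <algorithm>
    #include <atomic>
    #include <cmath>
    #include <cstddef>
    #include <thread>
    #include <utility>
    #include <vector>

    struct Cell {
        std::vector<std::atomic<float>> inputs;              // written by upstream cells, no locks
        std::vector<std::pair<Cell*, std::size_t>> outputs;  // (target cell, slot index in its inputs)
        std::vector<float> weights;                          // one weight per input slot
        float value = 0.0f;

        void step() {
            float sum = 0.0f;
            for (std::size_t i = 0; i < inputs.size(); ++i)
                sum += weights[i] * inputs[i].load(std::memory_order_relaxed);
            value = std::tanh(sum);
            for (auto [target, slot] : outputs)
                target->inputs[slot].store(value, std::memory_order_relaxed);
        }
    };

    // Master loop: split the pool into chunks and give each chunk to a thread.
    void run_step(std::vector<Cell>& pool, std::size_t chunk = 10000) {
        std::vector<std::thread> workers;
        for (std::size_t begin = 0; begin < pool.size(); begin += chunk) {
            std::size_t end = std::min(begin + chunk, pool.size());
            workers.emplace_back([&pool, begin, end] {
                for (std::size_t i = begin; i < end; ++i) pool[i].step();
            });
        }
        for (auto& w : workers) w.join();
    }

    int main() {
        std::vector<Cell> pool(20000);
        // Wire each cell to feed the next one (a trivial ring, purely for the demo).
        for (std::size_t i = 0; i < pool.size(); ++i) {
            pool[i].inputs = std::vector<std::atomic<float>>(1);
            pool[i].inputs[0].store(0.5f);
            pool[i].weights = {1.0f};
            pool[i].outputs.push_back({&pool[(i + 1) % pool.size()], 0});
        }
        run_step(pool, 5000);                                // 4 worker threads, 5000 cells each
    }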

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute max speed possible. The input data is a destination ID and an XYZ float vector, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to the array elements (one lock per ID) and SSE for the summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. the Streaming SIMD Extensions intrinsics and InterlockedExchange in Visual C++.
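A minimal sketch of that idea in portable C++ with intrinsics rather than the Visual C++-specific calls (the type names and the padding of each destination to 16 bytes are assumptions): one spin lock per destination ID guards an aligned SSE add.

    #include <atomic>
    #include <cstddef>
    #include <vector>
    #include <xmmintrin.h>                                      // SSE intrinsics

    struct alignas(16) Vec3 { float x = 0, y = 0, z = 0, pad = 0; };  // 16-byte aligned destination

    struct Accumulator {
        std::vector<Vec3> memory;
        std::vector<std::atomic_flag> locks;                    // one spin lock per ID

        explicit Accumulator(std::size_t n) : memory(n), locks(n) {
            for (auto& l : locks) l.clear();                    // ensure flags start cleared
        }

        void add(std::size_t id, const Vec3& v) {
            while (locks[id].test_and_set(std::memory_order_acquire))
                ;                                               // spin; contention is expected to be rare
            __m128 dst = _mm_load_ps(&memory[id].x);            // aligned 16-byte load
            __m128 src = _mm_loadu_ps(&v.x);
            _mm_store_ps(&memory[id].x, _mm_add_ps(dst, src));  // memory[id] += v
            locks[id].clear(std::memory_order_release);
        }
    };

    int main() {
        Accumulator acc(1000);
        acc.add(198, {0.4f, 0.0f, 1.0f, 0.0f});
        acc.add(775, {0.25f, 0.8f, 0.0f, 0.0f});
        acc.add(12,  {0.5f, 0.5f, 0.02f, 0.0f});
    }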
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU, statically distribute the destination data between these threads, and provide each thread with the same input data. This allows better use of the NUMA architecture and avoids extra memory traffic for thread synchronization.
In case of single-CPU system, use only one thread accessing destination data.
Probably, the only practical use for more cores in CPUs is to load input data with additional threads.
One obvious optimization is to align destination data by 16 bytes (to avoid touching two cache lines while accessing single data element).
You can use SIMD to perform the addition, or allow the compiler to automatically vectorize your code, or just leave this operation completely unoptimized - it doesn't matter; it's nothing compared to the memory bandwidth problems.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve performance or degrade it, I don't know. It avoids cache thrashing, but leaves most of the cache unused.
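A small sketch of the PREFETCHNTA suggestion (the input layout and the prefetch distance of 8 are assumptions that would need tuning): while summing record i, prefetch the destination that record i+DIST will touch, using the non-temporal hint.

    #include <cstddef>
    #include <xmmintrin.h>                                     // SSE and prefetch intrinsics

    struct Input { int id; float v[4]; };                      // assumed record layout

    void accumulate(float* dest, const Input* in, std::size_t n) {
        const std::size_t DIST = 8;                            // prefetch distance: a guess
        for (std::size_t i = 0; i < n; ++i) {
            if (i + DIST < n)                                  // pull a future target in with minimal cache pollution
                _mm_prefetch(reinterpret_cast<const char*>(&dest[4 * in[i + DIST].id]),
                             _MM_HINT_NTA);
            __m128 d = _mm_load_ps(&dest[4 * in[i].id]);       // dest is assumed 16-byte aligned
            _mm_store_ps(&dest[4 * in[i].id], _mm_add_ps(d, _mm_loadu_ps(in[i].v)));
        }
    }

    int main() {
        alignas(16) static float dest[4 * 1024] = {};          // room for IDs up to 1023
        Input in[] = {{198, {0.4f, 0.0f, 1.0f, 0.0f}},
                      {775, {0.25f, 0.8f, 0.0f, 0.0f}},
                      {12,  {0.5f, 0.5f, 0.02f, 0.0f}}};
        accumulate(dest, in, 3);
    }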

Why doesn't LRU suffer from Belady's Anomaly?

I have a question about page replacement algorithms. FIFO suffers from Belady's Anomaly, but LRU doesn't. Does anyone know why LRU doesn't suffer from it? I've been searching for the reason on the internet, but no luck.
Because LRU is a stack algorithm: for LRU, the set of pages held in k frames is always a subset of the set held in k + n frames. Thus, any page fault that occurs with k + n frames will also occur with k frames, which in turn means that LRU doesn't suffer from Belady's anomaly.
Because FIFO assumes that the mere fact that a page has been occupying memory for a long time means it is the safest one to replace, when in reality that simply isn't the case. Rather, where FIFO fails is that, statistically, a page that has been referenced frequently is more likely to be referenced again than a page that has merely been referenced recently. In other words, frequency is a far better predictor of future page use than age.
Similar to Caspar's answer; however, I found the explanation from my textbook (slightly edited) to be a bit clearer.
[LRU belongs] to a class of page-replacement algorithms, called stack algorithms, [which] can never exhibit Belady’s anomaly.
A stack algorithm is an algorithm for which it can be shown that the set of pages in memory for N frames is always a subset of the set of pages that would be in memory with N + 1 frames. [Therefore an additional frame will never cause an additional page fault.]
For LRU replacement, the set of pages in memory would be the N most recently referenced pages. If the number of frames is increased, these N pages will still be the most recently referenced and so will still be in memory.
Silberschatz, A., Galvin, P. B., & Gagne, G. (2014). Operating System Concepts (9th ed.). Singapore: Wiley.
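To make the stack property concrete, here is a small C++ simulation (my own illustration, not from the textbook) that runs the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 through FIFO and LRU with 3 and then 4 frames: FIFO's fault count rises from 9 to 10 when given the extra frame, while LRU's falls from 10 to 8.

    #include <algorithm>
    #include <deque>
    #include <iostream>
    #include <vector>

    // Count page faults for a reference string with `frames` frames.
    // `lru` selects LRU replacement; otherwise FIFO.
    int faults(const std::vector<int>& refs, std::size_t frames, bool lru) {
        std::deque<int> mem;                       // front = next victim
        int count = 0;
        for (int page : refs) {
            auto it = std::find(mem.begin(), mem.end(), page);
            if (it != mem.end()) {
                if (lru) {                         // on a hit, LRU moves the page to the back
                    mem.erase(it);
                    mem.push_back(page);
                }
                continue;                          // FIFO ignores hits
            }
            ++count;                               // page fault
            if (mem.size() == frames) mem.pop_front();
            mem.push_back(page);
        }
        return count;
    }

    int main() {
        std::vector<int> refs = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        for (std::size_t f = 3; f <= 4; ++f)
            std::cout << "FIFO/" << f << " frames: " << faults(refs, f, false)
                      << "  LRU/" << f << " frames: " << faults(refs, f, true) << "\n";
        // FIFO: 9 faults with 3 frames, 10 with 4 (Belady's anomaly).
        // LRU: 10 faults with 3 frames, 8 with 4.
    }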