Scala : How to speed up parallel synchronous processing on a small list? - scala

I see that Parallel collections are mostly designed to speed up processing of large collections but it is not mentioned if this can be helpful for small lists or not.
Check this example:
List(1,2,3).map(loadHeavyFile(_))
List(1,2,3).par.map(loadHeavyFile(_)).toList
What I want to know here is that :
Is one faster than another?
Will Scala use multiple threads if the list only has 3 elements?
In a general way, is it possible to speed up the response time of this code?
I know I can use Future, then Future.sequence and wait for the result to come up, but this seems unnatural when loadHeavyFile is a synchronous call. I don't want to have to specify a timeout for example.
Note: I want to preserve list ordering.
Any guidance here is appreciated.

In parallel collections Measuring Performance - How big should a collection be to go parallel? note some parameters that are useful for estimating and deciding whether a parallel version of a collection may outperform the non-parallel one.
For such a small list, the overhead of creating the parallel version may be larger than the actual sequential mapping; however in this example the focus lies on I/O performance and possibly on memory management, should the involved files be large and be uploaded to memory.
To provide a more factual answer, consider benchmarking the sequential and parallel versions for different I/O loads in order to estimate roughly which heavy I/O load threshold makes the parallel version worth it. The I/O factor makes the estimation OS and hardware dependent and it may prove it hard to generalise.

Related

is kdb fast solely due to processing in memory

I've heard quite a couple times people talking about KDB deal with millions of rows in nearly no time. why is it that fast? is that solely because the data is all organized in memory?
another thing is that is there alternatives for this? any big database vendors provide in memory databases ?
A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
Sources(S): http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
as for speed, the memory thing does play a big part but there are several other things, fast read from disk for hdb, splaying etc. From personal experienoce I can say, you can get pretty good speeds from c++ provided you want to write that much code. With kdb you get all that and some more.
another thing about speed is also speed of coding. Steep learning curve but once you get it, complex problems can be coded in minutes.
alternatives you can look at onetick or google in memory databases
kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.

Why is my processor underused when I perform calculation intensive simulations?

I often have to run calculation intensive simulations using Matlab. These simulations often take a long time and I expect my computer to use all its ressources in order for these simulations to be completed in as little time as possible.
However, when I open the Activity Monitor on my computer, processor usage is never above 55% and there is often about 1GB of unused RAM.
My question is: why is the processor not used to its full potential, and is there a safe and easy way to change this? Indeed, it would be great if I could get my simulations to be completed in half the time they currently take!
Its probably because you have a processer with multiple cores and that the code you are executing isn't written to run in multiple threads/processes. Unless you specifically write your code to take advantage of multiple cores it will only be able to use a single core at a single time.
A relatively easy way to enable parallel computing is to use the Parallel Computing Toolbox.
Additionally you might consider reading this: http://www.mathworks.com/company/newsletters/articles/parallel-matlab-multiple-processors-and-multiple-cores.html

Estimating possible # of actors in Scala

How can I estimate the number of actors that a Scala program can handle?
For context, I'm contemplating what is essentially a neural net that will be creating and forgetting cells at a high rate. I'm contemplating making each cell an actor, but there will be millions of them. I'm trying to decide whether this design is worth pursuing, but can't estimate the limits of number of actors. My intent is that it should totally run on one system, so distributed limits don't apply.
For that matter, I haven't definitely settled on Scala, if there's some better choice, but the cells do have state, as in, e.g., their connections to other cells, the weights of the connections, etc. Though this COULD be done as "Each cell is final. Changes mean replacing the current cell with a new one bearing the same id#."
P.S.: I don't know Scala. I'm considering picking it up to do this project. I'm also considering lots of other alternatives, including Java, Object Pascal and Ada. But actors seem a better map to what I'm after than thread-pools (and Java can't handle enough threads to make a thread/cell design feasible.
P.S.: At all times, most of the actors will be quiescent, but there wil need to be a way of cycling through the entire collection of them. If there isn't one built into the language, then this can be managed via first/next links within each cell. (Both links are needed, to allow cells in the middle to be extracted for release.)
With a neural net simulation, the real question is how much of the computational effort will be spent communicating, and how much will be spent computing something within a cell? If most of the effort is in communication then actors are perhaps a good choice for correctness, but not a good choice at all for efficiency (even with Akka, which performs reasonably well; AsyncFP might do the trick, though). Millions of neurons sounds slow--efficiency is probably a significant concern. If the neurons have some pretty heavy-duty computations to do themselves, then the communications overhead is no big deal.
If communications is the bottleneck, and you have lots of tiny messages, then you should design a custom data structure to hold the network, and also custom thread-handling that will take advantage of all the processors you have and minimize the amount of locking that you must do. For example, if you have space, each neuron could hold an array of input values from those neurons linked to it, and it would when calculating its output just read that array directly with no locking and the input neurons would just update the values also with no locking as they went. You can then just dump all your neurons into one big pool and have a master distribute them in chunks of, I don't know, maybe ten thousand at a time, each to its own thread. Scala will work fine for this sort of thing, but expect to do a lot of low-level work yourself, or wait for a really long time for the simulation to finish.

How to efficiently do scattered summing with SSE/x86

I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute max speed possible. The input data is a destination ID and an XYZ float vectors, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
You can use spin locks for synchronized access to array elements (one per ID) and SSE for summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. Streaming SIMD Extensions and InterlockExchange in Visual C++.
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute destination data between these threads. And provide each thread with the same input data. This allows better use of NUMA architecture. And avoids extra memory traffic for thread synchronization.
In case of single-CPU system, use only one thread accessing destination data.
Probably, the only practical use for more cores in CPUs is to load input data with additional threads.
One obvious optimization is to align destination data by 16 bytes (to avoid touching two cache lines while accessing single data element).
You can use SIMD to perform the addition, or allow compiler to automatically vectorize your code, or just leave this operation completely unoptimized - it doesn't matter, it's nothing compared to the memory bandwidth problems.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Will it improve performance or degrade it, I don't know. It avoids cache trashing, but leaves most of the cache unused.

How to utilise parallel processing in Matlab

I am working on a time series based calculation. Each iteration of the calculation is independent. Could anyone share some tips / online primers on using utilising parallel processing in Matlab? How can this be specified inside the actual code?
Since you have access to the Parallel toolbox, I suggest that you first check whether you can do it the easy way.
Basically, instead of writing
for i=1:lots
out(:,i)=do(something);
end
You write
parfor i=1:lots
out(:,i)=do(something);
end
Then, you use matlabpool to create a number of workers (you can have a maximum of 8 on your local machine with the toolbox, and tons on a remote cluster if you also have a Distributed Computing Server license), and you run the code, and see nice speed gains when your iterations are run by 8 cores instead of one.
Even though the parfor route is the easiest, it may not work right out of the box, since you might do your indexing wrong, or you may be referencing an array in a problematic way etc. Look at the mlint warnings in the editor, read the documentation, and rely on good old trial and error, and you should figure it out reasonably fast. If you have nested loops, it's often best parallelize only the innermost one and ensure it does tons of iterations - this is not only good design, it also reduces the amount of code that could give you trouble.
Note that especially if you run the code on a local machine, you may run into memory issues (which might manifest in really slow execution in parallel mode because you're paging): Every worker gets a copy of the workspace, so if your calculation involves creating a 500MB array, 8 workers will need a total 4GB of RAM - and then you haven't even started counting the RAM of the parent process! In addition, it can be good to only use N-1 cores on your machine, so that there is still one core left for other processes that may run on the computer (such as a mandatory antivirus...).
Mathworks offers its own parallel computing toolbox. If you do not want to purchase that, there a few options
You could write your own mex file and use pthreads or OpenMP.
However make sure you do not call any Mex api in the parallel part of the code, because they arent thread safe
If you want coarser grained parallelism via MPI you can try pmatlab
Same with parmatlab
Edit: Adding link Parallel MATLAB with openmp mex files
I have only tried the first.
Don't forget that many Matlab functions are already multithreaded. By careful programming you may be able to take advantage of them -- check the documentation for your version as the Mathworks seem to be increasing the range and number of multithreaded functions with each new release. For example, it seems that 2010a has multithreaded ffts which may be useful for time series processing.
If the intrinsic multithreading is not what you need, then as #srean suggests, the Parallel Computing Toolbox is available. For my money (or rather, my employers' money) it's the way to go, allowing you to program in parallel in Matlab, rather than having to bolt things on. I have to admit, too, that I'm quite impressed by the toolbox and the facilities it offers.