Are parallel collections intended to be used for operations with side effects? If so, how can you avoid race conditions?
For example:
var sum = 0
(1 to 10000).foreach(n => sum += n); println(sum)
50005000
No problem with this.
But if I try to parallelize it, race conditions happen:
var sum = 0
(1 to 10000).par.foreach(n => sum += n); println(sum)
49980037
Quick answer: don't do that. Parallel code should be parallel, not concurrent.
Better answer:
val sum = (1 to 10000).par.reduce(_ + _) // depends on commutativity and associativity
See also aggregate.
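For comparison, a quick sketch using aggregate, which gives the same result but lets the per-element step differ from the merge step:
// seqop folds elements into a partial sum; combop merges partial sums from different chunks
val sum = (1 to 10000).par.aggregate(0)(_ + _, _ + _)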
The parallel case doesn't work because you aren't using volatile variables (so visibility of your writes isn't ensured) and because you have multiple threads each doing the following:
read sum into a register
add n to the value in the register
write the updated value back to memory
If two threads both do step 1 before either has done step 3, they each compute an update from the same starting value, and one of the updates gets overwritten.
Use the @volatile annotation to ensure visibility of sum when doing something like this.
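For reference, the annotation looks like this (note it applies to fields, not method-local vars):
@volatile var sum = 0 // writes become visible to other threads, but += is still not atomic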
Even with @volatile, due to the non-atomicity of the increment you will still be losing some increments. You should use an AtomicInteger and its incrementAndGet.
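A minimal sketch of that approach (using addAndGet since each step adds n rather than 1; correct, but see the performance caveat below):

import java.util.concurrent.atomic.AtomicInteger

val sum = new AtomicInteger(0)
// addAndGet performs the read-modify-write atomically, so no updates are lost
(1 to 10000).par.foreach(n => sum.addAndGet(n))
println(sum.get) // 50005000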
Although using atomic counters will ensure correctness, having shared variables here hinders performance greatly - your shared variable is now a performance bottleneck because every thread will try to atomically write to the same cache line. If you wrote to this variable infrequently, it wouldn't be a problem, but since you do it in every iteration, there will be no speedup here - in fact, due to cache-line ownership transfer between processors, it will probably be slower.
So, as Daniel suggested - use reduce for this.
I've read a lot of questions and answers here about unpersist() on dataframes. I so far haven't found an answer to this question:
In Spark, once I am done with a dataframe, is it a good idea to call .unpersist() to manually force that dataframe to be unpersisted from memory, as opposed to waiting for GC (which is an expensive task)? In my case I am loading many dataframes so that I can perform joins and other transformations.
So, for example, if I wish to load and join 3 dataframes A, B and C:
I load dataframes A and B, join these two to create X, and then call .unpersist() on B because I don't need it any more (but I will need A) and could use the memory to load C (which is big). So then I load C, join C to X, and call .unpersist() on C so I have more memory for the operations I will now perform on X and A.
I understand that GC will unpersist for me eventually, but I also understand that GC is an expensive task that should be avoided if possible. To re-phrase my question: is this an appropriate way of manually managing memory to optimise my Spark jobs?
My understanding (please correct if wrong):
I understand that .unpersist() is a very cheap operation.
I understand that GC calls .unpersist() on my dataframes eventually anyway.
I understand that Spark monitors the cache and drops entries based on a Least Recently Used policy. But in some cases I do not want the least recently used DF to be dropped, so instead I can call .unpersist() on the dataframes I know I will not need in future, so that I don't drop the DFs I will need and have to reload them later.
To re-phrase my question again if unclear: is this an appropriate use of .unpersist(), or should I just let Spark and GC do their job?
Thanks in advance :)
There seems to be some misconception here. While using unpersist is a valid approach to get better control over storage, it doesn't avoid garbage collection: all the on-heap objects associated with the cached data are still left to the garbage collector.
So while the operation itself is relatively cheap, the chain of events it triggers might not be. Luckily, an explicit unpersist is no worse than waiting for the automatic cleaner or a GC-triggered cleanup, so if you want to clean up specific objects, go ahead and do it.
To limit GC pressure on unpersist, it might be worth taking a look at the OFF_HEAP StorageLevel.
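As an illustrative sketch of the pattern from the question (the paths and column name are made up; an active SparkSession spark is assumed, and OFF_HEAP also requires spark.memory.offHeap.enabled and spark.memory.offHeap.size to be configured):

import org.apache.spark.storage.StorageLevel

val dfA = spark.read.parquet("/data/A").cache()
val dfB = spark.read.parquet("/data/B").persist(StorageLevel.OFF_HEAP)

val x = dfA.join(dfB, "id").cache()
x.count()       // materialize X before dropping B, since Spark is lazy
dfB.unpersist() // explicitly free B's cache instead of waiting for LRU eviction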
What is the ideal method of sharing read-only data between the master and slave threads? From my understanding there are two options:
Set the shared data as a global variable in the main thread so that the slave threads can read it.
Pass the shared variables to slave threads as parameters.
From my experiments, there is hardly any difference in performance, even with a big data set. In fact, 1) has slightly worse performance than 2). I know that for 2), kdb will serialize and deserialize the parameters. Does it do the same for 1)? That would explain the degradation in performance, given that the global variable is bigger than the thread-specific parameter. Are there alternative methods to do this?
Secondly, since slave threads cannot modify global variables, I reckon the only way to share results with the main thread is to return them. Please comment if this is not the case.
EDIT: Performance is measured in terms of runtime before and after the call to peach.
Passing values into a function via peach like this
{}[v;] peach vector
sounds nice, and works well, unless v is very large. Each thread gets a copy (even if v is a global).
So the answer depends on your use case. Do you have enough RAM? Can you afford the memory copy given the number of threads? If the answer is yes, then you can afford to do this without too much ill effect (remember that the allocation will affect timing).
I prefer to use globals for this reason.
So I have completed the Coursera course on Scala and have taken it upon myself to do a small POC showing off the multiprocessor capabilities of Scala.
I am looking at creating a very small example where an application can launch multiple tasks (each task will do some network-related queries, etc.) and I can show the usage of multiple cores as well.
Also, there will be a thread that listens on a specific port of a machine and spawns tasks based on the information it receives there.
Any suggestions on how to proceed with this kind of a problem?
I don't want to use Akka for now.
Parallel collections are perhaps the least-effort way to make use of multiple processors in Scala. They naturally lead into how best to organise one's code and data to take advantage of parallel operations, and, more importantly, into understanding what doesn't get faster.
As a more concrete problem, suppose you have read a CSV file (or XML document, or whatever) and want to parse the data. If the records have already been split into a collection such as a List[String], you can then do .par to create a parallel List, and then a subsequent .map will use all cores where possible. The resulting List[whatever] will retain the same ordering even though the operations were not executed sequentially. Consider summing the values on each line:
val in: List[String] = ...
val out = in.par.map { line =>
  val cols = line.split(',')
  cols.map(_.toInt).sum
}
So an in of List("1,2,3", "4,5,6") would result in an out of List(6, 15), just as it would without the .par, but it'll run across multiple cores. Whether it's faster is another matter, since there is overhead in using parallel collections that likely makes a trivial example such as this slower. You will want to experiment to see where parallel collections benefit your use cases.
There is a more extensive discussion and documentation of parallel collections at http://docs.scala-lang.org/overviews/parallel-collections/overview.html
What about the sleeping barber problem? You could implement it in a distributed manner over the network, with the barber(s)' spawning service listening on one port and the customers spawning and requesting the barber(s) services over the network.
I think that would be vast and interesting enough while not being impossible.
Then you can build on it and expand it as much as you want, such as adding specialized barbers for different things (hair cut or shaving) and so on from there. The sky (or, rather, the thread-count cap) is the limit!
In Scala, when you write a function that has no side effects and is referentially transparent, does that mean that the runtime environment automatically distributes its processing to multiple threads?
No, usually it does not, unless you explicitly specify somehow that you want parallel processing (e.g. ScalaCL, parallel collections). It would be hard to impossible to do it automatically. For example:
def foo() = {
  val x = 1 + 2
  val y = 2 + 3
  x + y
}
Although the calculations of x and y could be parallelized, in practice that would be even slower than the serial code (due to the penalty incurred by parallelization). So with automatic parallelization of everything you would end up with highly inefficient code for basic units of work.
You could say: why not parallelize code automatically and selectively (i.e. don't parallelize when it isn't worth it)? But such a system would have to rely on a huge number of factors, the most important being the specific architecture, the current OS load, the runtime profile, and many more (I suspect that in the end we would have to solve the halting problem). And this magic tracking system would incur its own penalty.
Finally, although there is research on effect typing, stock Scala has no way to distinguish side-effecting from non-side-effecting functions.
After all, it is not that hard to parallelize Scala code manually, as @fracca demonstrated.
A function is not automatically run in parallel, but it can trivially be made to do so.
For instance, if you have a large collection, you can easily parallelise an operation over it by calling .par:
(1 to 100000).par.map(_ * 2)
Futures, actors and other strategies are still valuable. Pick the best tool for the job.
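For instance, a minimal sketch of fanning independent tasks out over multiple cores with plain Futures, no Akka required (fetch is a made-up stand-in for the network queries from the question):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical network-bound task standing in for the real queries.
def fetch(id: Int): Int = { Thread.sleep(100); id * 2 }

val tasks   = (1 to 8).map(id => Future(fetch(id))) // each runs on the global thread pool
val results = Await.result(Future.sequence(tasks), 10.seconds)
println(results)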
why is an assign statement more efficient than not using assign?
co-workers say that:
assign
a=3
v=7
w=8.
is more efficient than:
a=3.
v=7.
w=8.
why?
You could always test it yourself and see... but, yes, it is slightly more efficient. Or it was the last time I tested it. The reason is that the compiler combines the statements and the resulting r-code is a bit smaller.
But efficiency is almost always a poor reason to do it. Saving a micro-second here and there pales next to avoiding disk IO or picking a more efficient algorithm. Good reasons:
Back in the dark ages there was a limit of 63k of r-code per program. Combining statements with ASSIGN was a way to reduce the size of r-code and stay under that limit (ok, that might not be a "good" reason). One additional way this helps is that you could also often avoid a DO ... END pair and further reduce r-code size.
When creating or updating a record the fields that are part of an index will be written back to the database as they are assigned (not at the end of the transaction) -- grouping all assignments into a single statement helps to avoid inconsistent dirty reads. Grouping the indexed fields into a single ASSIGN avoids writing the index entries multiple times. (This is probably the best reason to use ASSIGN.)
Readability -- you can argue that grouping consecutive assignments more clearly shows your intent and is thus more readable. (I like this reason but not everyone agrees.)
basically doing:
a=3.
v=7.
w=8.
is the same as:
assign a=3.
assign v=7.
assign w=8.
which is 3 separate statements, and therefore a little more overhead, making it less efficient.
Progress executes an assignment as one statement whether one or more variables are being assigned. If you do not say ASSIGN, it is implied, so the example above executes 3 statements instead of 1. There is a 20% - 40% reduction in r-code size and a 15% - 20% performance improvement when using one ASSIGN statement. Why this is so can only be speculated on, as I cannot find any source explaining it. For database fields, and especially key/index fields, it makes perfect sense. For variables I can only assume it has to do with how Progress manages its buffers and copies data to and from them.
ASSIGN will combine multiple statements into one. If a, v and w are fields in your db, that means it will do something like INSERT INTO (a,v,w)...
rather than
INSERT INTO (a)...
INSERT INTO (v)
etc.