In Scala, when you write a function that has no side effects and is referentially transparent, does that mean that the runtime environment automatically distributes its processing across multiple threads?
No, usually it does not, unless you explicitly ask for parallel processing somehow (e.g. ScalaCL, parallel collections). Doing it automatically would be hard to impossible. For example:
def foo() = {
  val x = 1 + 2
  val y = 2 + 3
  x + y
}
Although the calculations of x and y can be parallelized, in practice the result would be even slower than serial code (due to the overhead that parallelization incurs). So with automatic parallelization of everything, you would end up with highly inefficient code for basic units.
You could ask: why not parallelize code automatically but selectively (e.g. skip parallelization when it is not worth it)? Such a system would have to take a huge number of factors into account, the most important being the specific architecture, the current OS load, the runtime profile, and many more (I guess in the end we would have to solve the halting problem). And this magic tracking system would incur its own penalty.
Finally, although there is research on effect typing, stock Scala does not have any way to distinguish between side-effecting and non-side-effecting functions.
After all, it is not that hard to parallelize Scala code manually, as @fracca demonstrated.
A function is not automatically run in parallel, but it is trivial to make it do so.
For instance, if you have a large collection, you can easily parallelise an operation on it by calling .par:
(1 to 100000).par.map(_ * 2)
Futures, actors and other strategies are still valuable. Use the best tool for the job.
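As a complement to .par, here is a minimal sketch of explicit parallelization with Futures from the standard library. The names ManualParallel, slowSquare and sumOfSquares are made up for illustration, and the sketch assumes the global execution context and a 30-second timeout are acceptable.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ManualParallel {
  // Hypothetical pure function: no side effects, same result for the same input.
  def slowSquare(n: Long): Long = {
    var i = 0L
    var acc = 0L
    while (i < n) { acc += n; i += 1 } // deliberately slow way to compute n * n
    acc
  }

  // The two calls are independent, so they may run on different threads.
  def sumOfSquares(a: Long, b: Long): Long = {
    val fa = Future(slowSquare(a)) // scheduled right away on the global pool
    val fb = Future(slowSquare(b))
    val both = for (x <- fa; y <- fb) yield x + y
    Await.result(both, 30.seconds)
  }
}
```

Both futures are created before the for-comprehension, so they can overlap. Creating them inside the for-comprehension would run them one after the other.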
I've read a lot of questions and answers here about unpersist() on dataframes. I so far haven't found an answer to this question:
In Spark, once I am done with a dataframe, is it a good idea to call .unpersist() to manually force that dataframe to be unpersisted from memory, as opposed to waiting for GC (which is an expensive task)? In my case I am loading many dataframes so that I can perform joins and other transformations.
So, for example, if I wish to load and join 3 dataframes A, B and C:
I load dataframe A and B, join these two to create X, and then .unpersist() B because I don't need it any more (but I will need A), and could use the memory to load C (which is big). So then I load C, and join C to X, .unpersist() on C so I have more memory for the operations I will now perform on X and A.
I understand that GC will unpersist for me eventually, but I also understand that GC is an expensive task that should be avoided if possible. To rephrase my question: is this an appropriate method of manually managing memory to optimise my Spark jobs?
My understanding (please correct if wrong):
I understand that .unpersist() is a very cheap operation.
I understand that GC calls .unpersist() on my dataframes eventually anyway.
I understand that Spark monitors the cache and drops data based on a Least Recently Used (LRU) policy. But in some cases I do not want the least recently used DF to be dropped, so instead I can call .unpersist() on the dataframes I know I will not need in future, so that I don't drop the DFs I will need and have to reload them later.
To re-phrase my question again if unclear: is this an appropriate use of .unpersist(), or should I just let Spark and GC do their job?
Thanks in advance :)
There seems to be a misconception. While using unpersist is a valid way to get better control over storage, it doesn't avoid garbage collection. In fact, all the on-heap objects associated with the cached data will be left for the garbage collector.
So while the operation itself is relatively cheap, the chain of events it triggers might not be. Luckily, an explicit unpersist is no worse than waiting for the automatic cleaner or a GC-triggered cleaner, so if you want to clean up specific objects, go ahead and do it.
To limit GC on unpersist, it might be worth taking a look at the OFF_HEAP StorageLevel.
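A rough sketch of that pattern, under several assumptions not taken from the question: it uses the SparkSession API (Spark 2.x or later), runs in local mode, enables off-heap memory with an arbitrary 256 MiB size, and replaces the real dataframes A, B and C with tiny stand-ins. The object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Returns the row count of the final join so the result can be checked.
  def run(): Long = {
    val spark = SparkSession.builder()
      .appName("unpersist-sketch")                      // hypothetical app name
      .master("local[*]")                               // local mode, for illustration only
      .config("spark.memory.offHeap.enabled", "true")   // needed for OFF_HEAP storage
      .config("spark.memory.offHeap.size", 268435456L)  // 256 MiB, an arbitrary choice
      .getOrCreate()
    import spark.implicits._

    try {
      // Tiny stand-ins for the question's dataframes A and B.
      val a = Seq((1, "a1"), (2, "a2")).toDF("id", "a").persist(StorageLevel.OFF_HEAP)
      val b = Seq((1, "b1"), (2, "b2")).toDF("id", "b").persist(StorageLevel.OFF_HEAP)

      val x = a.join(b, "id").persist(StorageLevel.OFF_HEAP)
      x.count()     // an action, so x is actually computed and cached
      b.unpersist() // B is no longer needed, so release its cached blocks

      // Stand-in for C, joined to X once B's storage has been released.
      val c = Seq((1, "c1"), (2, "c2")).toDF("id", "c")
      x.join(c, "id").count()
    } finally {
      spark.stop()
    }
  }
}
```

Whether this helps in a real job depends on data sizes and the Spark version, so it is worth checking the Storage tab of the Spark UI before and after.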
I have built a scala application in Spark v.1.6.0 that actually combines various functionalities. I have code for scanning a dataframe for certain entries, I have code that performs certain computation on a dataframe, I have code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to make this more flexible by letting a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package. However, it does not support parallel execution if I understand correctly (in the example, the two ParquetReader steps could read and process the data at the same time).
There is apparently the Luigi project that might do exactly this. However, its page says that Luigi is for long-running workflows, whereas I just need short-running workflows, so Luigi might be overkill.
What would you suggest for building work/dataflows in Spark?
I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit the case well. One nice thing about it is that it allows Spark to optimize the flow for you, in a way that is probably smarter than what you could do by hand.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.
So I have completed the Coursera course on Scala and have taken it upon myself to do a small POC showing off the multiprocessor capabilities of Scala.
I am looking at creating a very small example where an application can launch multiple tasks (each task will do some network-related queries etc.) and I can show the usage of multiple cores as well.
Also, there will be a thread that listens on a specific port of a machine and spawns tasks based on the information it receives there.
Any suggestions on how to proceed with this kind of problem?
I don't want to use Akka for now.
Parallel collections are perhaps the least-effort way to make use of multiple processors in Scala. It naturally leads into how best to organise one's code and data to take advantage of the parallel operations, and more importantly what doesn't get faster.
As a more concrete problem, suppose you have read a CSV file (or XML document, or whatever) and want to parse the data. If the records have already been split into a collection such as a List[String], you can call .par to get a parallel collection (there is no parallel List as such, so the elements are copied into a parallel sequence), and a subsequent .map will use all cores where possible. The result keeps the same ordering even though the operations were not executed sequentially. Consider summing the values on each line:
val in: List[String] = ...
val out = in.par.map { line =>
  val cols = line split ','
  cols.map(_.toInt).sum
}
So an in of List("1,2,3", "4,5,6") would result in an out holding 6 and 15 in that order, just as it would without the .par (call .toList if you need a List back), but it'll run across multiple cores. Whether it's faster is another matter, since the overhead of parallel collections likely makes a trivial example such as this slower. You will want to experiment to see where parallel collections are a benefit for your use cases.
There is a more extensive discussion and documentation of parallel collections at http://docs.scala-lang.org/overviews/parallel-collections/overview.html
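Since Scala 2.13, parallel collections ship as a separate module (scala-parallel-collections), so .par may not be available out of the box. As an alternative sketch using only the standard library, the same per-line summing can be launched as independent tasks with Futures on a fixed-size thread pool. The names LineSums, sumLine and sumAll are invented for this example, and the one-minute timeout is arbitrary.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LineSums {
  // Sums the comma-separated integers on one line, e.g. "1,2,3" gives 6.
  def sumLine(line: String): Int = line.split(',').map(_.trim.toInt).sum

  // Runs one task per line on a pool with one thread per available core.
  def sumAll(lines: List[String]): List[Int] = {
    val pool = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Future.sequence keeps the results in the same order as the input lines.
      val all = Future.sequence(lines.map(line => Future(sumLine(line))))
      Await.result(all, 1.minute)
    } finally {
      pool.shutdown() // release the worker threads once the work is done
    }
  }
}
```

The same shape fits the original question: a listener thread could submit one Future per incoming request to such a pool. As with .par, one task per tiny line is likely slower than a plain map, so real tasks should be coarser.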
What about the sleeping barber problem? You could implement it in a distributed manner over the network, with the barber(s)' spawning service listening on one port and the customers spawning and requesting the barber(s) services over the network.
I think that would be vast and interesting enough while not being impossible.
Then you can build on it and expand it as much as you want, such as adding specialized barbers for different things (haircut or shaving) and so on from there. The sky (or rather, the cap on the number of threads) is the limit!
In some high-level languages like Matlab, you can use "logical indexing" to select a whole set of entries in an array for operating on.
I understand what logical indexing is and how to use it.
Instead, I am asking:
How does it work ("behind the scenes")?
Does it not boil down to just a for-loop?
If so, why is it so much faster than for-looping?
Interpreted languages can be thought of as a variation on assembler running on an emulated core. They have stacks and commands that work in ways like the assembler without actually being the assembler. They are a virtual machine.
A for loop can be thought of as telling the system, set a value, run a sequence of tasks, and when you are done then come back and check on that value. If it is not at a threshold, then change it in a prescribed way, and go repeat those tasks and come back. In assembler you are running screaming fast, but in the "VM" not so much. Consider the demonstration between 13:50 and 15:30 of this link: (link)
This means that what appears to be a for loop, isn't actually a for loop. It is operating system interrupts, and virtualized memory. It is virus-scans in the background and megasloth bloatware.
If you had a virtual system, could you make a shortcut for addressing memory that didn't use the virtualized for loop and was reasonably efficient? MATLAB focuses on data processing, so it has to have very efficient ways of storing, sorting, and selecting data within its virtual machine.
MathWorks is not going to make the details of this accessible to the public. If it is a great idea, they don't want it implemented in Python and R tomorrow. If it is a mediocre idea, they don't want to be beaten in execution by Python and R tomorrow. Either way, making the nuts and bolts of that particular approach public without an NDA is likely a losing proposition for them.
Bottom lines:
it's not a real "for", even for a for loop, because it's running virtually
they are opening up some of the internals of their data handling to improve usability
they aren't likely to disclose actual code because of negative business consequences
It is worth noting that vectorized code can outperform for loops while doing the same thing. This suggests they are applying more of those internals to the execution of the "sequence of tasks" to get a performance improvement.
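MATLAB's actual implementation is not public, so the following is only an illustration of the general idea, written in Scala with made-up names. A mask-based selection still boils down to a loop, but that loop runs once inside compiled library code instead of being interpreted statement by statement.

```scala
object LogicalIndexing {
  // Build a Boolean mask, the analogue of `mask = x > threshold`.
  def maskGreaterThan(xs: Array[Double], threshold: Double): Array[Boolean] =
    xs.map(_ > threshold)

  // Select the elements where the mask is true, the analogue of `x(mask)`.
  def select(xs: Array[Double], mask: Array[Boolean]): Array[Double] = {
    require(xs.length == mask.length, "mask must have the same length as the array")
    val out = Array.newBuilder[Double]
    var i = 0
    while (i < xs.length) { // the loop is still here, just hidden from the caller
      if (mask(i)) out += xs(i)
      i += 1
    }
    out.result()
  }
}
```

The caller writes a single expression such as select(xs, maskGreaterThan(xs, 2.5)), and the per-element work stays in one tight loop over contiguous memory.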
Are parallel collections intended to do operations with side effects? If so, how can you avoid race conditions?
For example:
var sum=0
(1 to 10000).foreach(n=>sum+=n); println(sum)
50005000
No problem with this.
But if I try to parallelize, race conditions happen:
var sum=0
(1 to 10000).par.foreach(n=>sum+=n);println(sum)
49980037
Quick answer: don't do that. Parallel code should be parallel, not concurrent.
Better answer:
val sum = (1 to 10000).par.reduce(_ + _) // requires an associative operator
See also aggregate.
The parallel case doesn't work because you don't use volatile variables, so the visibility of your writes is not ensured, and because you have multiple threads that do the following:
read sum into a register
add n to the register holding the sum value
write the updated value back to memory
If 2 threads both do step 1, one after the other, and then carry out the remaining steps in any order, one of them will end up overwriting the other's update.
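The interleaving described above can be replayed deterministically in a single thread. This is only an illustration with hypothetical names. No real threads are involved, and the local values reg1 and reg2 stand in for the two threads' registers.

```scala
object LostUpdate {
  // Replays one bad interleaving of `sum += n` for two threads, step by step.
  def replay(initial: Int, n1: Int, n2: Int): Int = {
    var sum = initial
    val reg1 = sum       // thread 1, step 1: read sum into a register
    val reg2 = sum       // thread 2, step 1: reads the same, soon stale, value
    val upd1 = reg1 + n1 // thread 1, step 2: add to the register
    val upd2 = reg2 + n2 // thread 2, step 2: add to the register
    sum = upd1           // thread 1, step 3: write back
    sum = upd2           // thread 2, step 3: overwrites thread 1's update
    sum
  }
}
```

Starting from 0 with n1 = 1 and n2 = 2, the replay ends with 2 rather than 3, because thread 1's addition is lost.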
Use the @volatile annotation to ensure visibility of sum when doing something like this. See here.
Even with @volatile, you will lose some updates because the addition is not atomic. You should use an AtomicInteger and its addAndGet (incrementAndGet only adds 1).
Although using atomic counters will ensure correctness, having shared variables here hinders performance greatly - your shared variable is now a performance bottleneck because every thread will try to atomically write to the same cache line. If you wrote to this variable infrequently, it wouldn't be a problem, but since you do it in every iteration, there will be no speedup here - in fact, due to cache-line ownership transfer between processors, it will probably be slower.
So, as Daniel suggested - use reduce for this.
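To make the contrast concrete without depending on parallel collections, here is a sketch using Futures. The names SafeSums, atomicSum and chunkedSum, the chunking scheme and the timeout are all assumptions made for this example. The first version is correct but funnels every addition through one shared counter. The second sums each chunk locally and combines the partial results, which is the same idea as reduce.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SafeSums {
  // Correct but contended: every addition hits the same shared counter.
  def atomicSum(upTo: Int, chunks: Int): Int = {
    val sum = new AtomicInteger(0)
    val work = (1 to upTo).grouped(math.max(1, upTo / chunks)).toList.map { chunk =>
      Future(chunk.foreach(n => sum.addAndGet(n)))
    }
    Await.result(Future.sequence(work), 1.minute) // wait for all chunks to finish
    sum.get()
  }

  // No shared mutable state: sum each chunk locally, then combine the partial sums.
  def chunkedSum(upTo: Int, chunks: Int): Int = {
    val partials = (1 to upTo).grouped(math.max(1, upTo / chunks)).toList.map { chunk =>
      Future(chunk.sum)
    }
    Await.result(Future.sequence(partials), 1.minute).sum
  }
}
```

Both return 50005000 for the range 1 to 10000, but only the second avoids the cache-line contention described above.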