How to do parallel pipeline? - scala

I have built a scala application in Spark v.1.6.0 that actually combines various functionalities. I have code for scanning a dataframe for certain entries, I have code that performs certain computation on a dataframe, I have code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to get this more flexible, having a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package; however, it does not support parallel execution if I understand correctly (in the example, the two ParquetReader steps could read and process the data at the same time).
There is apparently the Luigi project that might do exactly this. However, its page says that Luigi is for long-running workflows, whereas I just need short-running workflows, so Luigi might be overkill.
What would you suggest for building work/dataflows in Spark?

I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit the case well. One nice thing about it is that it allows Spark to optimize the flow for you, probably in a smarter way than you could by hand.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.
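The lazy behaviour described above can be sketched in plain Scala. This is only an analogy, assuming a standard-library collection view as a stand-in for a Spark flow: the view records the steps and does no work until a concrete result is requested. Unlike Spark, a view does not analyse or optimise the plan, and the names LazyFlowSketch, run and processed are made up for this illustration.

```scala
object LazyFlowSketch {
  // Returns (elements processed before forcing the flow, final result).
  def run(): (Int, List[Int]) = {
    var processed = 0 // how many elements the map step has touched so far

    // Building the flow does no work yet: a view only records the steps.
    val flow = List(1, 2, 3, 4, 5, 6).view
      .map { x => processed += 1; x * 10 }
      .filter(_ > 30)

    val processedBeforeForcing = processed // nothing has run at this point

    // Asking for a concrete result is what triggers the computation.
    val result = flow.toList
    (processedBeforeForcing, result)
  }

  def main(args: Array[String]): Unit = {
    val (before, result) = run()
    println(s"processed before forcing: $before, result: $result")
  }
}
```

In Spark the same split exists between transformations, which only describe the flow, and actions, which trigger execution of the final plan.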


What is the correct way to organize the ptransforms in a beam pipeline?

I'm developing one pipeline that reads data from Kafka.
The source Kafka topic is quite big in terms of traffic: 10k messages are inserted per second, and each message is around 200 kB.
I need to filter the data before applying the transformations I need, but what I'm not sure about is whether the order in which I apply the filter and window functions matters.
Would one ordering (filter then window, or window then filter) be more efficient than the other,
or would both options have the same performance?
I know that Beam is just a model that only specifies the what and not the how, and that the runner optimizes the pipeline, but I just want to be sure I got it right.
If there is substantial filtering, windowing after the filter will technically reduce the amount of work performed, though that saved work is cheap enough that I doubt it'd make a measurable difference. (Presumably the runner could notice that the filter does not observe the assigned window and lift it in that case, but as mentioned it's unclear if there are really savings to be gained here...)
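To make the point concrete, here is a minimal sketch in plain Scala rather than the Beam API. It assumes fixed 60-second windows and a made-up predicate, and every name in it (FilterWindowOrder, Event, assignWindow, keep) is hypothetical. Both orderings produce the same output; the only difference is how many elements get a window assigned.

```scala
object FilterWindowOrder {
  final case class Event(timestampSec: Long, payload: String)

  // Fixed 60-second windows (an assumption): tag an event with its window start.
  def assignWindow(e: Event): (Long, Event) = (e.timestampSec / 60 * 60, e)

  // Hypothetical filter predicate that does not look at the window.
  def keep(e: Event): Boolean = e.payload.startsWith("keep")

  // Each function returns the filtered, windowed events plus the number of
  // window assignments it had to perform.
  def filterThenWindow(events: List[Event]): (List[(Long, Event)], Int) = {
    val kept = events.filter(keep)
    (kept.map(assignWindow), kept.size)
  }

  def windowThenFilter(events: List[Event]): (List[(Long, Event)], Int) = {
    val windowed = events.map(assignWindow)
    (windowed.filter { case (_, e) => keep(e) }, windowed.size)
  }

  val sample: List[Event] = List(
    Event(5, "keep-a"), Event(65, "drop-b"), Event(70, "keep-c"), Event(130, "drop-d"))

  def main(args: Array[String]): Unit = {
    val (a, assignedA) = filterThenWindow(sample)
    val (b, assignedB) = windowThenFilter(sample)
    println(s"same output: ${a == b}")
    println(s"window assignments: filter first = $assignedA, window first = $assignedB")
  }
}
```

Assigning a window is a cheap per-element operation, which is why the saving is unlikely to be measurable in practice.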

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is attempting to use spark to generate data but then work with the data internally. I.e., I have a program that does a thing, and it generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes, and have them each contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, although I know this request is a little outside the range of functions Spark is meant to offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support for them anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
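The shape of that flatMap approach can be sketched with an ordinary Scala collection standing in for the distributed seed dataset. This is an assumption made to keep the example runnable without a cluster; on an RDD or Dataset the flatMap call would look the same. The names SeedFlatMapSketch, Row, generate and generateAll are hypothetical.

```scala
object SeedFlatMapSketch {
  final case class Row(seed: Int, index: Int, value: String)

  // Hypothetical generator: each seed yields `rowsPerSeed` new rows.
  def generate(seed: Int, rowsPerSeed: Int): Seq[Row] =
    (0 until rowsPerSeed).map(i => Row(seed, i, s"row-$seed-$i"))

  // `seeds` stands in for the distributed seed dataset; flatMap expands
  // every seed into as many rows as the generator produces.
  def generateAll(seeds: Seq[Int], rowsPerSeed: Int): Seq[Row] =
    seeds.flatMap(seed => generate(seed, rowsPerSeed))

  def main(args: Array[String]): Unit =
    generateAll(Seq(1, 2, 3), rowsPerSeed = 2).foreach(println)
}
```

With Spark, the number of seeds and their partitioning would determine how the generation work is spread across the workers.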

Convert a JavaPairRDD into list without collect() [duplicate]

We know that if we need to convert an RDD to a list, we should use collect(), but this function puts a lot of stress on the driver (as it brings all the data from the different executors to the driver), which causes performance degradation or worse (the whole application may fail).
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
Basically, in the current scenario where we deal with huge amounts of data in batch or stream processing, APIs like collect() and collectAsMap() have become completely useless in a real project with a real amount of data. We can use them in demo code, but that is all these APIs are good for. So why have an API which we cannot even use (or am I missing something)?
Can there be a better way to achieve the same result through some other method, or can we implement collect() and collectAsMap() in a more effective way other than just calling
List<String> myList = RDD.collect(); (which affects performance)
I searched Google but could not find anything effective. Please help if someone has a better approach.
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
No, and there can't be. And if there were such a way, collect would be implemented using it in the first place.
Well, technically you could implement List interface on top of RDD (or most of it?), but that would be a bad idea and quite pointless.
So why to have an API which we can not even use (Or am I missing something).
collect is intended to be used for cases where only large RDDs are inputs or intermediate results, and the output is small enough. If that's not your case, use foreach or other actions instead.
As you want to collect the data in a Java collection, the data has to be collected on a single JVM, because Java collections are not distributed. There is no way to get all the data into a collection without fetching the data. The interpretation of the problem space is wrong.
collect and similar methods are not meant to be used in normal Spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.
You need to keep your data inside the RDD, and use RDD transformations and actions without ever taking the data out. Methods like collect, which pull your data out of Spark and onto your driver, defeat the purpose and undo any advantage that Spark might be providing, since you are then processing all of your data on a single machine anyway.

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed when performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Apache Beam code that turns the reference dataset into a side input. DataFlow will then perform a Parallel Do (ParDo) function, where it takes each element from a PCollection, hands it to a worker node, and then applies the side-input data for the transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
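The chunking idea can be sketched in plain Scala. This is not Dataprep or Beam code, just an illustration under stated assumptions: the join is a simple inner join on an integer key, keys are unique within the reference set, and the names ChunkedJoinSketch, chunkedJoin and chunkSize are hypothetical. Only one chunk of the reference set is held in the in-memory lookup at a time, at the cost of one pass over the larger set per chunk.

```scala
object ChunkedJoinSketch {
  // Joins `large` against `reference` one chunk at a time, so that only
  // `chunkSize` reference rows sit in the in-memory lookup at any moment.
  // Assumes keys are unique within `reference`.
  def chunkedJoin(
      large: Seq[(Int, String)],
      reference: Seq[(Int, String)],
      chunkSize: Int): List[(Int, String, String)] =
    reference.grouped(chunkSize).flatMap { chunk =>
      val lookup = chunk.toMap // plays the role of a small side input
      large.collect { case (k, v) if lookup.contains(k) => (k, v, lookup(k)) }
    }.toList

  def main(args: Array[String]): Unit = {
    val large = Seq(1 -> "a", 2 -> "b", 3 -> "c")
    val reference = Seq(3 -> "z", 1 -> "x", 4 -> "w")
    chunkedJoin(large, reference, chunkSize = 2).foreach(println)
  }
}
```

The rows come out grouped by chunk rather than in the order of the larger set, so sort afterwards if ordering matters.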
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Suggestions on how to go about a simple example displaying scala's multiprocessor capabilities

So I have completed the Coursera course on Scala and have taken it upon myself to do a small POC showing off the multiprocessor capabilities of Scala.
I am looking at creating a very small example where an application can launch multiple tasks (each task will do some network-related queries etc.) and I can show the usage of multiple cores as well.
Also there will be a thread that will listen on a specific port of a machine and spawn tasks based on what information it receives there.
Any suggestions on how to proceed with this kind of a problem?
I don't want to use AKKA now.
Parallel collections are perhaps the least-effort way to make use of multiple processors in Scala. It naturally leads into how best to organise one's code and data to take advantage of the parallel operations, and more importantly what doesn't get faster.
As a more concrete problem, suppose you have read a CSV file (or XML document, or whatever) and want to parse the data. If the records have already been split into a collection such as a List[String], you can then do .par to create a parallel List, and then a subsequent .map will use all cores where possible. The resulting List[whatever] will retain the same ordering even though the operations were not executed sequentially. Consider summing the values on each line:
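val in: List[String] = ...
// .par gives a parallel view; on Scala 2.13+ it needs the scala-parallel-collections module
val out = in.par.map { line =>
  val cols = line split ','
  cols.map(_.toInt).sum // sum of the numbers on this line
}.toList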
So an in of List("1,2,3", "4,5,6") would result in an out of List(6, 15), as it would without the .par, but it'll run across multiple cores. Whether it's faster is another matter, since there is overhead in using parallel collections that likely makes a trivial example such as this slower. You will want to experiment to see where parallel collections are a benefit for your use cases.
There is a more extensive discussion and documentation of parallel collections at
What about the sleeping barber problem? You could implement it in a distributed manner over the network, with the barber(s)' spawning service listening on one port and the customers spawning and requesting the barber(s) services over the network.
I think that would be vast and interesting enough while not being impossible.
Then you can build on it and expand it as much as you want, such as adding specialized barbers for different things (haircut or shaving), and so on from there. The sky (or rather, the cap on the number of threads) is the limit!