I was working on a proof of concept that exposes Spark MLlib training and prediction serving to multiple tenants through some form of REST interface. I did get a POC up and running, but it seems wasteful because it has to create numerous Spark contexts (and therefore JVMs) to execute in. Is there a way around that, or a cleaner solution, given Spark's one-context-per-JVM restriction?
There are 2 parts to it:
Trigger training of a specified jar per tenant, with tenant-specific restrictions such as executor size. This is pretty much out of the box with Spark Job Server (sadly it doesn't yet seem to support OAuth, but there is a way to do it). For this part I don't think it's possible to share a context between tenants, because they should be able to train in parallel and, as far as I know, an MLlib context will run two training requests sequentially.
This one is trickier and I can't seem to find a good way to do it: once a model has been trained, we need to load it into some kind of REST service and expose it. That again means allocating a Spark context per tenant, hence a full JVM per tenant serving predictions, which is quite wasteful.
Any feedback on how this could be improved or re-architected so it's less resource hungry? Maybe there are Spark features I'm not aware of that would facilitate this. Thanks!
The fundamental problem is attempting to use Spark to generate data and then work with that data internally. That is, I have a program that does a thing and generates "rows" of data: can I leverage Spark to parallelize that work across the worker nodes, and have each of them contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this. Has anyone tried to use Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes, scaling it out to as many instances as necessary.
If you absolutely want to use Spark for this (and burn resources unnecessarily), it's not difficult: create a distributed "seed" dataset or stream and simply flatMap it. With flatMap you can generate as many new rows per seed row as you like (obviously limited by the available memory); a minimal sketch follows.
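A minimal sketch of that idea, written against the Spark Dataset API; the SparkSession setup, the rowsPerSeed knob, the generator logic and the output path are all illustrative assumptions rather than anything from the question:

import org.apache.spark.sql.SparkSession

object GenerateRows {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("generate-rows").getOrCreate()
    import spark.implicits._

    val rowsPerSeed = 1000                       // how many rows to emit per seed row (made up)
    val seeds = spark.range(0, 100)              // 100 seed rows, distributed across the cluster

    val generated = seeds.flatMap { seed =>
      val id = seed.toLong
      (0 until rowsPerSeed).map(i => (id, i, s"row-$id-$i"))   // replace with your actual generator
    }

    generated.write.parquet("/tmp/generated")    // or write to whatever store you actually use
    spark.stop()
  }
}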
I am an amateur user of Spark and Scala. Although I have done numerous searches, I could not find an answer.
Is it possible to assign different tasks to different executors at the same time from a single driver program?
For example, suppose we have 10 nodes. I want to write code that classifies a dataset with the Naive Bayes algorithm using five workers and, at the same time, assigns the other five workers the task of classifying the dataset with a decision tree algorithm. Afterward, I will combine the answers.
HamidReza,
What you want to achieve is running two actions in parallel from your driver. It's definitely possible, but it only makes sense if your actions are not each using the whole cluster (it's really a matter of better resource management).
You can use concurrency for this. There are many ways to implement a concurrent program, starting with Futures (I can't really recommend this approach, but it seems to be the most popular choice in Scala), up to more advanced types like Task (you can take a look at popular functional libraries like Monix, Cats or ZIO). A Future-based sketch follows.
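A rough Future-based sketch of launching two trainings concurrently from one driver; the trainingData DataFrame (assumed to have "label" and "features" columns) and the choice of spark.ml estimators are assumptions for illustration:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.spark.ml.classification.{DecisionTreeClassifier, NaiveBayes}
import org.apache.spark.sql.DataFrame

def trainBoth(trainingData: DataFrame) = {
  // Each Future submits its own Spark job; with the FAIR scheduler enabled,
  // the two jobs can share the executors instead of running back to back.
  val nbFuture = Future { new NaiveBayes().fit(trainingData) }
  val dtFuture = Future { new DecisionTreeClassifier().fit(trainingData) }

  val nbModel = Await.result(nbFuture, Duration.Inf)
  val dtModel = Await.result(dtFuture, Duration.Inf)
  (nbModel, dtModel)
}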
I have built a Scala application in Spark v1.6.0 that combines various functionalities: I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an output, etc.
At the moment the components are combined 'statically': in my code I call component X to do a computation, take the resulting data, and call a method of component Y that takes that data as input.
I would like to make this more flexible, letting a user simply specify a pipeline (possibly one with parallel branches). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package; however, it does not support parallel execution if I understand correctly (in the example, the two ParquetReaders could read and process the data at the same time)
there is apparently the Luigi project, which might do exactly this (however, its page says Luigi is for long-running workflows, whereas I just need short-running ones; Luigi might be overkill?)
What would you suggest for building work/dataflows in Spark?
I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit that case well. One nice thing about it is that it lets Spark optimize the flow for you, probably more cleverly than you could by hand.
You mention it can't read the two Parquet files in parallel, but it can read each file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is one-to-one. Basically, you don't have to worry about Spark underutilizing your resources (provided your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps you define them: when you tell it to compute y-c, it doesn't actually do so right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (for example, it may figure out that it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan. A small illustration follows.
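To make the laziness concrete, here is a small sketch (written against the newer SparkSession API for brevity; the paths and column names are invented). Nothing is read or computed until the final write:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()
import spark.implicits._

val a = spark.read.parquet("/data/input_a")   // lazy: only builds a plan
val b = spark.read.parquet("/data/input_b")   // lazy as well
val joined = a.join(b, Seq("id"))             // still just a plan
val result = joined.filter($"value" > 0)      // the optimiser may push this down into the reads

result.write.parquet("/data/output")          // the action: only now does Spark execute the plan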
In the past, for jobs that required a heavy processing load, I would use Scala and parallel collections.
I'm currently experimenting with Spark and find it interesting, but it has a steep learning curve. I find development slower, as I have to work with a reduced Scala API.
What do I need to determine before deciding whether or not to use Spark?
The current Spark job I'm trying to implement processes approximately 5 GB of data. This data is not huge, but I'm running a Cartesian product of it, and that generates data in excess of 50 GB. Maybe using Scala parallel collections would be just as fast; I know the development time to implement the job would be shorter from my point of view.
So what considerations should I take into account before deciding to use Spark ?
The main advantages Spark has over traditional high-performance computing frameworks (e.g. MPI) are fault tolerance, easy integration into the Hadoop stack, and a remarkably active mailing list (http://mail-archives.apache.org/mod_mbox/spark-user/). Getting distributed, fault-tolerant, in-memory computations to work efficiently isn't easy, and it's definitely not something I'd want to implement myself. There's a review of other approaches to the problem in the original paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
However, when my work is I/O bound, I still tend to rely primarily on Pig scripts, as Pig is more mature and I think the scripts are easier to write. Spark has been great when Pig scripts won't cut it (e.g. iterative algorithms, graphs, lots of joins).
Now, if you've only got 50 GB of data, you probably don't care about distributed fault-tolerant computation (if everything lives on a single node, then there's no framework in the world that can save you from a node failure :) ), so parallel collections will work just fine; a quick sketch of that approach follows.
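For comparison, a minimal single-node sketch with Scala parallel collections, assuming the data fits comfortably in memory on one machine; Record, loadRecords and combine are placeholders for your actual types and pairwise computation:

import scala.collection.parallel.CollectionConverters._   // needed on Scala 2.13+; .par is built in before that

case class Record(id: Long, payload: String)               // placeholder type

def loadRecords(): Vector[Record] = ???                     // hypothetical loader for the ~5 GB input
def combine(a: Record, b: Record): String = ???             // your pairwise computation

val items = loadRecords()

// Cartesian product evaluated in parallel across the local cores.
val results = items.par.flatMap { a =>
  items.map(b => combine(a, b))
}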
I'm coming from a Java background and have a CPU-bound problem that I'm trying to parallelize to improve performance. I have broken my code up into modular pieces so that it can (hopefully) be distributed and run in parallel.
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

// Runs each job in its own transaction and writes the results to MySQL.
@Transactional(readOnly = false, propagation = Propagation.REQUIRES_NEW)
public void runMyJob(List<String> someParams) {
    doComplexEnoughStuffAndWriteToMysqlDB();
}
Now, I have been thinking of the following options for parallelizing this problem and I'd like people's thoughts/experience in this area.
Options I am currently thinking of:
1) Use Java EE (e.g. JBoss) clustering and message-driven beans (MDBs). The MDBs run on the slave nodes in the cluster, and each MDB can pick up an event that kicks off a job like the one above. AFAIK the app server runs MDBs on multiple threads, so this should also be able to take advantage of multiple cores. Thus it should be both vertically and horizontally scalable.
2) I could look at using something like Hadoop and MapReduce. My concern here is that my job-processing logic is actually quite high level, so I'm not sure how well it translates to MapReduce. Also, I'm a total newbie to MR.
3) I could look at something like Scala, which I believe makes concurrent programming much simpler. However, while this is vertically scalable, it's not a cluster/horizontally scalable solution.
Anyway, hope all that makes sense and thank you very much for any help provided.
The solution you are looking for is Akka. Clustering is a feature under development and is expected to be included in Akka 2.1.
Excellent Scala and Java APIs, extremely complete
Purely message-oriented pattern, with no shared state
Fault tolerant and scalable
Extremely easy to distribute jobs
Please get rid of J2EE while you still can. You are very welcome to join the Akka mailing list to ask your questions.
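To give a flavour of the programming model, here is a rough classic-Akka sketch; the message type, actor and parameters are illustrative, not taken from the question:

import akka.actor.{Actor, ActorSystem, Props}

case class RunJob(params: List[String])

class JobWorker extends Actor {
  def receive = {
    case RunJob(params) =>
      // doComplexEnoughStuffAndWriteToMysqlDB(params)   // your existing logic would go here
      println(s"finished job with ${params.size} params")
  }
}

object Main extends App {
  val system = ActorSystem("jobs")
  // A single local worker here; with Akka routers/clustering the same messages
  // can be fanned out to many workers, locally or on other nodes.
  val worker = system.actorOf(Props[JobWorker](), "worker")
  worker ! RunJob(List("param1", "param2"))
}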
You should have a look at Spark.
It is a cluster computing framework written in Scala that aims to be a viable alternative to Hadoop.
It has a number of nice features:
In-Memory Computations: You can control the degree of caching
Hadoop Input/Output interoperability: Spark can read/write data from all the Hadoop input sources such as HDFS, EC2, etc.
The concept of "Resilient Distributed Datasets" (RDD) which allows you to directly execute most of MR style workloads in parallel on a cluster as you would do locally
Primary API = Scala, optional python and Java APIs
It makes use of Akka :)
If I understand your question correctly, Spark would combine your options 2) and 3); a rough sketch of what that could look like follows.
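A hedged sketch (Scala API) of how the job from the question might be spread over a cluster with Spark; loadAllJobParams and the re-declared doComplexEnoughStuffAndWriteToMysqlDB are placeholders, not real APIs:

import org.apache.spark.{SparkConf, SparkContext}

object DistributedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("my-job"))

    val allParams: Seq[List[String]] = loadAllJobParams()        // hypothetical: one entry per job
    sc.parallelize(allParams)
      .foreach(p => doComplexEnoughStuffAndWriteToMysqlDB(p))    // each job runs on some worker core

    sc.stop()
  }

  def loadAllJobParams(): Seq[List[String]] = ???                          // placeholder
  def doComplexEnoughStuffAndWriteToMysqlDB(p: List[String]): Unit = ???   // your existing job logic
}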