I am currently using spark to write my dimensional data model and we are currently uploading the jar to an AWS EMR cluster to test. However, this is tedious and time consuming for testing and building tables.
I would like to know what others are doing to speed up their development. The possibilities I came across in my research is running spark jobs directly from the IDE with Intellij Idea and I would like to know other development processes that are being used where it's faster to develop.
The ways I have had tried till now are:
Installing spark and hdfs on two or three commodity PCs and test the code before submitting it on the cluster.
Running the code on the single node to avoid dummy mistakes.
Submitting the jar file on the cluster.
The similar part in the first and third method is making the jar file which may takes a lot of time. The second one is not suitable to find and fix the bugs and problems and raise on distributed running environments.
Related
I have set up a Kubernetes cluster using Kubernetes Engine on GCP to work on some data preprocessing and modelling using Dask. I installed Dask using Helm following these instructions.
Right now, I see that there are two folders, work and examples
I was able to execute the contents of the notebooks in the example folder confirming that everything is working as expected.
My questions now are as follows
What are the suggested workflow to follow when working on a cluster? Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment? Would you just manually move them to a bucket every time you upgrade (which seems tedious)? or would you create a simple vm instance, prototype there, then move everything to the cluster when running on the full dataset?
I'm new to working with data in a distributed environment in the cloud so any suggestions are welcome.
What are the suggested workflow to follow when working on a cluster?
There are many workflows that work well for different groups. There is no single blessed workflow.
Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
Sure, that would be fine.
How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment?
You might save your data to some more permanent store, like cloud storage, or a git repository hosted elsewhere.
Would you just manually move them to a bucket every time you upgrade (which seems tedious)?
Yes, that would work (and yes, it is)
or would you create a simple vm instance, prototype there, then move everything to the cluster when running on the full dataset?
Yes, that would also work.
In Summary
The Helm chart includes a Jupyter notebook server for convenience and easy testing, but it is no substitute for a full fledged long-term persistent productivity suite. For that you might consider a project like JupyterHub (which handles the problems you list above) or one of the many enterprise-targeted variants on the market today. It would be easy to use Dask alongside any of those.
I actually want to know the underlying mechanism of how this happens that when I execute sbt run the spark application starts !
What is the difference between this and running spark on standalone mode and then deploying application on it using spark-submit.
If someone can explain how the jar is submitted and who makes the task and assigns it in both the cases, that would be great.
Please help me out with this or point to some read where i can make my doubts cleared !
First, read this.
Once you are familiar with the terminologies, different roles, and their responsibilities, read below paragraph to summarize.
There are different ways to run a spark application(a spark app is nothing but a bunch of class files with an entry point).
You can run the spark application as single java process(usually for development purposes). This is what happens when you run sbt run.
In this mode, all the services like driver, workers etc are run inside a single JVM.
But above way of running is only for development and testing purposes as it won't scale. That means you won't be able to process a huge amount of data. This is where other ways of running a spark app come into the picture(Standalone, mesos, yarn etc).
Now read this.
In these modes, there will be dedicated JVMs for different roles. Driver will be running as a separate JVM, there could be 10s to 1000s of executor JVMs running on different machines(Crazy right!).
The interesting part is, the same application that runs inside a single JVM will be distributed to run on 1000s of JVMs. This distribution of the application, life-cycle of these JVMs, making them fault-tolerance etc are taken care by Spark and the underlying cluster frameworks.
Production system : HDP-2.5.0.0 using Ambari 2.4.0.1
Aplenty demands coming in for executing a range of code(Java MR etc., Scala, Spark, R) atop the HDP but from a desktop Windows machine IDE.
For Spark and R, we have R-Studio set-up.
The challenge lies with Java, Scala and so on, also, people use a range of IDEs from Eclipse to IntelliJ Idea.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and also has aplenty bugs when working with latest versions of Hadoop, IntelliJ Idea I couldn't find reliable inputs from the official website.
I believe the Hive and HBase client API is a reliable way to connect from Eclipse etc. but I am skeptical about executing MR or other custom Java/Scala code.
I referred several threads like this and this, however, I still have the question that is any IDE like Eclipse/Intellij Idea having an official support for Hadoop ? Even the Spring Data for Hadoop seems to lost traction, it anyways didn't work as expected 2 years ago ;)
As a realistic alternative, which tool/plugin/library should be used to test the MR and other Java/Scala code 'locally' i.e on the desktop machine using a standalone version of the cluster ?
Note : I do not wish to work against/in the sandbox, its about connecting to the prod. cluster directly.
I don't think that there is a genereal solution which would work for all Hadoop services equally. Each solution has it's own development, testing and deployment scenarios as they are different standalone products. For MR case you can use MRUnit to simulate your work locally from IDE. Another option is LocalJobRunner. They both allow you to check your MR logic directly from IDE. For Storm you can use backtype.storm.Testing library to simalate topology's workflow. But they all are used from IDE without direct cluster communications like in case wuth Spark and RStudio integration.
As for the MR recommendation your job should ideally pass the following lifecycle - writing the job and testing it locally, using MRUnit, then you should run it on some development cluster with some test data (see MiniCluster as an option) and then running in on real cluster with some custom counters which would help you to locate your malformed data and to properly maintaine the job.
Is possible build a bigdata application on cloud with RED HAT'PaaS OpenShift? I'm looking how build on cloud an Scala Application with Hadoop (HDFS),Spark,an Apache Mahout but i can't find any thing about it.I've seen something with HortonWorks but nothing clear about how install it in an openshift environment an how add HDFS node in Cloud too.Is it possible with OpneShift?
It's possible in Amazon but my question is : IS possible in OpenShift ??
It really depends on what you're ultimately trying to achieve. I know you mention building a big data application on Openshift with Scala but what will the application ultimately be doing?
I've gotten Hadoop running in a gear before but if you want a better example check out this quickstart here to get an idea of how its done https://github.com/ryanj/flask-hbase-todos. I know its not scala but here's a good article that will show you how to put together a scala app https://www.openshift.com/blogs/building-distributed-and-event-driven-applications-in-java-or-scala-with-akka-on-openshift.
What will the application ultimately be doing?:
Forecasting for football match result for several football leagues,a web application (ruby) and
statistic computation and data mining ,calculations with Scala language
and apache frameworks(spark & mahout).
We get the info via CSV files, process and save it in nosql db (Cassandra).
And all of this on cloud(OpenShift),that's the idea.
I've seen the info https://github.com/ryanj/flask-hbase-todos.I'll try by this way but
with Scala.
I'd like to use zookeeper in one of my applications for distributed configuration management. The application is currently running in distributed environment and having to restart nodes for configuration files changes is a headache.
However, we want the zookeeper process to be started from within the application. The point is to reduced startup dependency and reduce operational cost. We've already have startup/shutdown scripts for the application and we need to reduce impact for operations team.
Has any one done something similar? Is this setup recommended or there are better solutions? Any tip or feedback is appreciated.
I have a blog post that describes how to embed Zookeeper in an application. The Zookeeper developers don't recommend it, though, and I would tend to agree now, though I had the same rationale for embedding it that you do - to reduce the number of moving parts.
You want to keep your ZK cluster stable but you will need to restart your app to do code updates, etc, impacting the ZK cluster stability.
Ultimately you will end up using your ZK cluster for multiple apps and those extra moving parts will be amortized over a number of projects.