I actually want to know the underlying mechanism of what happens when I execute sbt run and the Spark application starts!
What is the difference between this and running Spark in standalone mode and then deploying the application on it using spark-submit?
If someone can explain how the jar is submitted, and who creates the tasks and assigns them in both cases, that would be great.
Please help me out with this or point me to some reading where I can clear up my doubts!
First, read this.
Once you are familiar with the terminology, the different roles, and their responsibilities, read the paragraph below for a summary.
There are different ways to run a Spark application (a Spark app is nothing but a bunch of class files with an entry point).
You can run the Spark application as a single Java process (usually for development purposes). This is what happens when you run sbt run.
In this mode, all the services, like the driver and the executors, run inside a single JVM.
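To make that concrete, here is a minimal sketch (the object name, app name, and input path are made up for illustration) of the kind of entry point sbt run executes: a plain main method that builds its own SparkSession with a local[*] master, so the driver and the executor threads all live inside that one JVM.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point -- the main class that `sbt run` launches in a single JVM.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // "local[*]" tells Spark to run the driver and all executor threads
    // inside this JVM, using all available cores.
    val spark = SparkSession.builder()
      .appName("wordcount-local")
      .master("local[*]")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("data/input.txt")          // placeholder local path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```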
But the above way of running is only for development and testing purposes, as it won't scale. That means you won't be able to process a huge amount of data. This is where the other ways of running a Spark app come into the picture (Standalone, Mesos, YARN, etc.).
Now read this.
In these modes, there will be dedicated JVMs for the different roles. The driver will run as a separate JVM, and there could be tens to thousands of executor JVMs running on different machines (crazy, right!).
The interesting part is that the same application that runs inside a single JVM will be distributed to run on thousands of JVMs. The distribution of the application, the life-cycle of these JVMs, making them fault-tolerant, etc. are taken care of by Spark and the underlying cluster framework.
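As a hedged sketch of the contrast: in these modes the code typically does not hard-code a master at all, and spark-submit supplies it, so the same jar can be pointed at a standalone master, YARN, or Mesos. The class name, jar path, host, and HDFS paths below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object WordCountCluster {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: spark-submit decides where the driver and the
    // executors run (standalone, YARN, Mesos, ...).
    val spark = SparkSession.builder()
      .appName("wordcount-cluster")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile(args(0))                  // e.g. an HDFS input path passed as an argument
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
    spark.stop()
  }
}

// Built with `sbt package` (or `sbt assembly` for a fat jar) and then handed to the
// cluster with spark-submit, for example (paths and host are placeholders):
//
//   spark-submit \
//     --class WordCountCluster \
//     --master spark://master-host:7077 \
//     target/scala-2.12/myapp_2.12-0.1.jar hdfs:///input hdfs:///output
```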
I am currently using Spark to write my dimensional data model, and we are currently uploading the jar to an AWS EMR cluster to test. However, this is tedious and time-consuming for testing and building tables.
I would like to know what others are doing to speed up their development. One possibility I came across in my research is running Spark jobs directly from the IDE with IntelliJ IDEA, and I would like to know about other development processes that make it faster to develop.
The ways I have tried so far are:
Installing Spark and HDFS on two or three commodity PCs and testing the code there before submitting it to the cluster.
Running the code on a single node to catch trivial mistakes.
Submitting the jar file to the cluster.
The part the first and third methods have in common is building the jar file, which can take a lot of time. The second one is not suitable for finding and fixing the bugs and problems that only arise in distributed running environments.
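One thing that can cut the build-and-upload loop out of most iterations (just a sketch, assuming the transformation logic can be exercised against small in-memory data and that ScalaTest is on the test classpath) is a plain unit test against a local[*] SparkSession, run from the IDE or sbt test, keeping EMR only for full-scale runs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.scalatest.funsuite.AnyFunSuite

class DimensionBuildSpec extends AnyFunSuite {

  // A tiny local session: no cluster, no jar upload, runs in the IDE or via `sbt test`.
  private val spark = SparkSession.builder()
    .appName("dimension-build-test")
    .master("local[2]")
    .getOrCreate()

  test("keeps the latest version per customer id") {
    import spark.implicits._

    val input = Seq(
      ("c1", "Alice", 1),
      ("c1", "Alice B.", 2),
      ("c2", "Bob", 1)
    ).toDF("customer_id", "name", "version")

    // In a real project this logic would live in the job's own code;
    // it is inlined here only to keep the sketch self-contained.
    val latest = input
      .withColumn("rank", row_number().over(
        Window.partitionBy($"customer_id").orderBy($"version".desc)))
      .filter($"rank" === 1)
      .drop("rank")

    assert(latest.count() == 2)
  }
}
```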
Production system: HDP 2.5.0.0 using Ambari 2.4.0.1
Plenty of demands are coming in for executing a range of code (Java MR, Scala, Spark, R, etc.) atop HDP, but from an IDE on a desktop Windows machine.
For Spark and R, we have RStudio set up.
The challenge lies with Java, Scala, and so on; also, people use a range of IDEs from Eclipse to IntelliJ IDEA.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and has plenty of bugs when working with the latest versions of Hadoop; for IntelliJ IDEA, I couldn't find reliable information on the official website.
I believe the Hive and HBase client APIs are a reliable way to connect from Eclipse, etc., but I am skeptical about executing MR or other custom Java/Scala code.
I referred to several threads like this and this; however, I still have the question: does any IDE like Eclipse/IntelliJ IDEA have official support for Hadoop? Even Spring Data for Hadoop seems to have lost traction, and it didn't work as expected two years ago anyway ;)
As a realistic alternative, which tool/plugin/library should be used to test MR and other Java/Scala code 'locally', i.e. on the desktop machine, using a standalone version of the cluster?
Note: I do not wish to work against/in the sandbox; it's about connecting to the prod cluster directly.
I don't think there is a general solution that would work for all Hadoop services equally. Each has its own development, testing, and deployment scenarios, as they are different standalone products. For the MR case you can use MRUnit to simulate your work locally from the IDE. Another option is LocalJobRunner. They both allow you to check your MR logic directly from the IDE. For Storm you can use the backtype.storm.Testing library to simulate a topology's workflow. But they are all used from the IDE without direct cluster communication, unlike the Spark and RStudio integration.
As for the MR recommendation, your job should ideally pass through the following lifecycle: write the job and test it locally using MRUnit, then run it on a development cluster with some test data (see MiniCluster as an option), and then run it on the real cluster with some custom counters, which will help you locate malformed data and properly maintain the job.
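For illustration, a minimal sketch of the MRUnit approach, assuming MRUnit and JUnit 4 are on the test classpath; the mapper here is a throwaway word-tokenizing example, not code from the question:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import org.apache.mrunit.mapreduce.MapDriver
import org.junit.Test

// A trivial mapper, written here only so the sketch is self-contained.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").foreach(token => context.write(new Text(token), one))
  }
}

class TokenMapperTest {
  // MRUnit drives the mapper in-process: no cluster, no HDFS, just the IDE's test runner.
  @Test
  def emitsOneCountPerToken(): Unit = {
    MapDriver.newMapDriver[LongWritable, Text, Text, IntWritable](new TokenMapper())
      .withInput(new LongWritable(0), new Text("alpha beta"))
      .withOutput(new Text("alpha"), new IntWritable(1))
      .withOutput(new Text("beta"), new IntWritable(1))
      .runTest()
  }
}
```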
I'd like to use ZooKeeper in one of my applications for distributed configuration management. The application currently runs in a distributed environment, and having to restart nodes for configuration file changes is a headache.
However, we want the ZooKeeper process to be started from within the application. The point is to reduce startup dependencies and operational cost. We already have startup/shutdown scripts for the application, and we need to reduce the impact on the operations team.
Has anyone done something similar? Is this setup recommended, or are there better solutions? Any tips or feedback are appreciated.
I have a blog post that describes how to embed ZooKeeper in an application. The ZooKeeper developers don't recommend it, though, and I would tend to agree now, even though I had the same rationale for embedding it that you do - to reduce the number of moving parts.
You want to keep your ZK cluster stable, but you will need to restart your app to do code updates, etc., impacting the ZK cluster's stability.
Ultimately you will end up using your ZK cluster for multiple apps and those extra moving parts will be amortized over a number of projects.
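For context, embedding usually amounts to something like the sketch below: a standalone ZooKeeperServerMain booted on a background thread from properties the application already owns (the data dir and port here are placeholders). It also makes the drawback above concrete: the embedded ZK node lives and dies with the application's own restarts.

```scala
import java.util.Properties

import org.apache.zookeeper.server.{ServerConfig, ZooKeeperServerMain}
import org.apache.zookeeper.server.quorum.QuorumPeerConfig

object EmbeddedZooKeeper {
  // Starts a single standalone ZK server inside this JVM on a daemon thread.
  // dataDir and clientPort are placeholders; a real deployment would read them
  // from the application's own configuration.
  def start(dataDir: String = "/tmp/embedded-zk", clientPort: Int = 2181): Unit = {
    val props = new Properties()
    props.setProperty("dataDir", dataDir)
    props.setProperty("clientPort", clientPort.toString)

    val quorumConfig = new QuorumPeerConfig()
    quorumConfig.parseProperties(props)

    val serverConfig = new ServerConfig()
    serverConfig.readFrom(quorumConfig)

    val server = new ZooKeeperServerMain()
    val runner = new Thread(new Runnable {
      // runFromConfig blocks until the server shuts down, so it gets its own thread.
      override def run(): Unit = server.runFromConfig(serverConfig)
    }, "embedded-zookeeper")
    runner.setDaemon(true)
    runner.start()
  }
}
```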
I have "project" written in Python with multiples components: there are several distinct Pyramid and Twisted apps running.
We're looking at using Celery to offload some of the work from Pyramid and Twisted. Just to be clear, we're looking at one Celery instance / config, that handles the work for multiple Pyramid and Twisted apps.
All the info I found online covers multiple Celery for one or more apps; not one Celery for multiple apps. Celery will be doing 4-5 functions that are common to all these apps.
Are there any recommended strategies / common pitfalls for this sort of setup, or should we be generally fine with having a standalone celery_tasks package that all the different projects import ?
It is a distributed system. By definition, it doesn't matter where you call the tasks from, as long as they get executed by a worker and the caller is able to fetch the results.
You should be fine with both projects configured properly to send tasks and receive results. One shared module with common tasks is going to be just fine.
Shared workers should import only that module.
I have a Play! 2 application where I have defined some jobs. These jobs interact with external web services and with the database, hence they need a running application to work.
I would like to be able to launch these jobs as SBT tasks from the Play console. So I have followed the guide for defining my own tasks, and I am able to define simple tasks. What I cannot do is import from the application namespace. I guess this makes some sense: in the context of SBT we may not have a running application.
Is there some way to write an SBT task where an application is launched and one has access to the application namespace?
I don't know if you still need an answer, but have you tried using the Play console?
$ play console
It loads all your application classes so you can use them as you wish.
See http://www.playframework.com/documentation/2.1.0/PlayConsole for more information :)
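If the jobs need a started application (plugins, database connection pool, etc.), the trick commonly used in the Play 2.0/2.1 console is to boot one manually before calling into your classes. A hedged sketch, where jobs.NightlySync stands in for one of your own job classes:

```scala
// Inside `play console`, which already has the application classes on the classpath:

// Boot a running application from the current project directory so that
// plugins, the database pool, etc. are available to the job code.
new play.core.StaticApplication(new java.io.File("."))

// Now application code that requires a running app can be invoked directly.
// `jobs.NightlySync` is a placeholder for one of your own job classes.
jobs.NightlySync.run()
```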