Scala/Java library not installing on execution of Databricks Notebook - scala

At work I have a Scala Databricks notebook that uses many library imports, both from Maven and from some JAR files. The issue is that when I schedule jobs on this notebook, it sometimes fails (seemingly at random, roughly 1 run in 10) because it executes the cells before all the libraries are installed. The job then fails and I have to go launch it manually. This behaviour makes the product hard to use in production, since it fails intermittently.
I tried putting a Thread.sleep() of a minute or so before all my imports, but it does not change anything. For Python there is dbutils.library.installPyPI("library-name"), but there is no such thing for Scala in the dbutils documentation.
So has anyone had the same issue, and if so, how did you solve it?
Thank you!

Simply put: for scheduled production jobs, use a New Job Cluster and avoid an All-Purpose Cluster.
New Job Clusters are dedicated clusters that are created and started when you run a task and terminate immediately after the task completes. In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment.
In the UI, when setting up your notebook job, select a New Job Cluster and then add all the dependent libraries to the job.
The pricing is different for New Job Clusters; I would say it ends up cheaper.
Note: Use Databricks pools to reduce cluster start and auto-scaling times (if it's an issue to begin with).

Related

Connect from a Windows machine to Spark

I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought would be the easy task of connecting to a Linux machine that has Spark on it and running a simple piece of code.
When I write a simple piece of Scala code, build a jar from it, place it on the machine and run spark-submit, everything works and I get a result.
(like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html)
My question is:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from different code through my IDE?
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Any help will be appreciated!
Thanks!
There are two modes of running your code: submitting your job to a cluster, or running in local mode, which requires no Spark cluster to be set up. Most people use local mode for building and testing their application on small data sets, and then build and submit the tasks as jobs for production.
Running in Local Mode
val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
Setting the master to "local" runs Spark embedded within your application.
If you have already built your jar, you can reuse it: by specifying the Spark master's URL and adding the required jars, you can submit the job to a remote cluster.
val conf = new SparkConf()
.setMaster("spark://cyborg:7077")
.setAppName("SubmitJobToCluster Example")
.setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
Using this SparkConf you can initialize a SparkContext in your application and use it in either a local or a cluster setup.
val sc = new SparkContext(conf)
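Putting these pieces together, a minimal self-contained sketch that could be launched straight from the IDE (assuming the application jar has already been built, as described above) might look like the following; the master URL and jar path come from the snippet above, while the object name and input path are made-up placeholders:
import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobToClusterExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://cyborg:7077") // cluster master URL from above; use "local" while testing
      .setAppName("SubmitJobToCluster Example")
      .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar")) // jar with your code, shipped to the executors
    val sc = new SparkContext(conf)
    try {
      // A trivial word count so there is something to run; the input path is a placeholder
      val counts = sc.textFile("hdfs:///data/input.txt")
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.take(10).foreach(println)
    } finally {
      sc.stop()
    }
  }
}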
There is an old project, spark-examples, with sample programs that you can run directly from your IDE.
So, answering your questions:
Are all of these steps mandatory? Must I compile, build and copy the jar to the machine and then manually run it every time I change it?
NO
Assuming the jar is already on the machine, is there a way to run it (calling spark-submit) directly from different code through my IDE?
Yes you can. The above example does it.
Taking it a bit further, let's say I want to run different tasks: do I have to create different jars and place all of them on the machine? Are there any other approaches?
Yes. You just need one jar containing all your tasks and dependencies; you can specify the main class when submitting the job to Spark, and when doing it programmatically you have complete control over it.
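As an illustration of the one-jar approach, a single entry point can dispatch on a program argument; the object name, task names and spark-submit invocation below are hypothetical:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: one jar holding several tasks, selected by the first
// program argument, e.g. spark-submit --class JobRunner app.jar wordCount
// (the master is supplied by spark-submit, so it is not set in code here)
object JobRunner {
  def main(args: Array[String]): Unit = {
    val task = args.headOption.getOrElse(sys.error("No task name given"))
    val conf = new SparkConf().setAppName(s"JobRunner-$task")
    val sc = new SparkContext(conf)
    try {
      task match {
        case "wordCount" => runWordCount(sc)
        case "topUsers"  => runTopUsers(sc)
        case other       => sys.error(s"Unknown task: $other")
      }
    } finally {
      sc.stop()
    }
  }

  private def runWordCount(sc: SparkContext): Unit = {
    // task logic goes here
  }

  private def runTopUsers(sc: SparkContext): Unit = {
    // task logic goes here
  }
}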

In UrbanCode Deploy, how do I cause an application process to fail if not all component versions were specified?

Currently, when I run an application process that installs various components, if I don't specify a version for any of them, the deploy component process doesn't run, and it says "No Version Selected". However, the step doesn't fail, and the process continues. Is there a way to configure the process to fail if not all components have a version? Or is there a way for me to interrogate the manifest for the process in a step at the top to figure it out myself and fail accordingly? I currently can find no way to do either of these things. The version of UCD I am using is 6.1.1.3.
If your component process is configured with the Process Type set to "Operational (With Version)", then the job will fail if you don't select a version.

How to use Eclipse to debug Hadoop WordCount?

I want to use Eclipse to debug the WordCount example, because I want to see how the job runs in the JobTracker. But Hadoop uses a proxy, and I don't understand the concrete process by which the job runs in the JobTracker. How should I debug it?
You are better off debugging "locally" against a single-node cluster (e.g. one of the sandboxes supplied by Cloudera or Hortonworks): that way you can truly step through the code as there is only one mapper/reducer in play. That's been my approach at least: usually the problems I had to debug were to do with the contents of specific files; I just copied over the relevant file to my test system and debugged there.

How to build and run Scala Spark locally

I'm attempting to build Apache Spark locally. The reason for this is to debug Spark methods like reduce. In particular I'm interested in how Spark implements and distributes MapReduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best way of finding out what the issue is.
So I have cloned the latest from the Spark repo:
git clone https://github.com/apache/spark.git
Spark appears to be a Maven project, so when I create it in Eclipse here is the structure:
Some of the top-level folders also have pom files:
So should I just be building one of these sub-projects? Are these the correct steps for running Spark against a local code base?
Building Spark locally, the short answer:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Going into more detail on your question, what you're actually asking is 'How do I debug a Spark application in Eclipse?'.
To debug in Eclipse, you don't really need to build Spark in Eclipse. All you need is to create a project with the Spark library as a dependency and ask Maven to 'download sources'. That way you can use the Eclipse debugger to step into the code.
Then, when creating the SparkContext, use local[1] as the master, like:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("SparkDebugExample")
so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
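As a concrete illustration, a minimal debuggable application might look like the sketch below (the object name and data are made up); with the spark-core sources downloaded, a breakpoint on the reduce call lets you step from your own code into Spark's implementation:
import org.apache.spark.{SparkConf, SparkContext}

object SparkDebugExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]") // single thread, so everything is visible to the debugger
      .setAppName("SparkDebugExample")
    val sc = new SparkContext(conf)
    try {
      val sum = sc.parallelize(1 to 1000, 4)
        .reduce(_ + _) // set a breakpoint here and step into Spark's reduce
      println(s"sum = $sum")
    } finally {
      sc.stop()
    }
  }
}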
If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.

Deployments for multiple environments in Jenkins

I want to use Jenkins to deploy various WARs to multiple servers using our single deployment script.
Could you please suggest how to pass a server name to a job, so that our script can take it as an argument and start deploying on the selected server? The solution will be used to deploy the same code to 10-20 servers, using our customized Ant script to build these projects.
EDIT: We are using AIX servers. We want to use a drop-down menu from which the user can select the environment's IP and port. How should I approach this?
Maintaining txt files of environments
Using a choice parameter
On selecting an environment, we would use it as an environment variable in our shell script to deploy.
To have one job start another, just use the Parameterized Trigger plugin. In addition, I like to run the deployment jobs on the target machine. For this I defined a slave for every target server. To be able to run a job on a specific slave, and to choose the slave as a parameter, I use the NodeLabel Parameter Plugin.
If you want more specific tips, be more specific about which application servers you use. It would also be interesting to know whether you operate under Windows, Linux, or another environment. The more info you give, the better and more fitting the answers.