Spark on standalone cluster throws java.lang.illegalStateException - mongodb

I hava a app and read data from MongoDB.
If I use local pattern, it runs well, however, it throws java.lang.illegalStateExcetion when I use standalone cluster pattern
With local pattern, the SparkContext is val sc = new SparkContext("local","Scala Word Count")
With Standalone cluster pattern, the SparkContext is val sc = new SparkContext() and submit shell is ./spark-submit --class "xxxMain" /usr/local/jarfile/xxx.jar --master spark://master:7077
It trys 4 times then throw error when it runs to the first action
My code
configOriginal.set("mongo.input.uri","mongodb://172.16.xxx.xxx:20000/xxx.Original")
configOriginal.set("mongo.output.uri","mongodb://172.16.xxx.xxx:20000/xxx.sfeature")
mongoRDDOriginal =sc.newAPIHadoopRDD(configOriginal,classOf[com.mongodb.hadoop.MongoInputFormat],classOf[Object], classOf[BSONObject])
I learned from this example
mongo-spark
I searched and someone said it was because of mongo-hadoop-core-1.3.2, but either I up the version to mongo-hadoop-core-1.4.0 or down to 'mongo-hadoop-core-1.3.1', it didn't work.
Please help me!

Finally, I got the solution.
Because each of my workers have many cores and mongo-hadoop-core-1.3.2 doesn't support multiple threads, however it fixed in mongo-hadoop-core-1.4.0. But why my app still get error is because of "intellij idea" cache. You should add mongo-java-driver dependency, too.

Related

Use pyspark on yarn cluster without creating the context

I'll try to do my best to explain my self. I'm using JupyterHub to connect to the cluster of my university and write some code. Basically i'm using pyspark but, since i'va always used "yarn kernel" (i'm not sure of what i'm sying) i've never defined the spark context or the spark session. Now, for some reason, it doesn't work anymore and when i try to use spark this error appears:
Code ->
df = spark.read.csv('file:///%s/.....
Error ->
name 'spark' is not defined
It already happend to me but i just solved by installing another version of pyspark. Now i don't know what to do.

configuring scheduling pool in spark using zeppelin, scala and EMR

In pyspark I'm able to change to a fair scheduler within zeppelin (on AWS EMR) by doing the following:
conf = sc.getConf()
conf.set('spark.scheduler.allocation.file',
'/etc/spark/conf.dist/fairscheduler.xml.template')
sc.setLocalProperty("spark.scheduler.pool", 'production')
However if I try something similar in a scala cell it then things continue to run in the FIFO pool
val conf = sc.getConf()
conf.set("spark.scheduler.allocation.file",
"/etc/spark/conf.dist/fairscheduler.xml.template")
sc.setLocalProperty("spark.scheduler.pool", "FAIR")
I've tried so many combinations, but nothing has worked. Any advice is appreciated.
I ran into a similar issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option in my Spark config. It might be that your Scala Spark interpreter is set up with spark.scheduler.pool but your python isn't.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.

A master url must be set to your configuration (Spark scala on AWS)

This is what I wrote via intellij. I plan on eventually writing larger spark scala files.
Anyways, I uploaded it on an AWS cluster that I had made. The "master" line, line 11 was "master("local")". I ran into this error
The second picture is the error that was returned by AWS when it did not run successfully. i changed line 11 to "yarn" instead of local (see the first picture for its current state)
It still is returning the same error. I put in the following flags when I uploaded it manually
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost the exact same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. Looks like I need a little more knowledge on how spark works.
I am working with amazon EMR.
I think on the line 9 you are creating SparkContext with "old way" approach in spark 1.6.x and older version - you need to set master in default configuration file (usually location conf/spark-defaults.conf) or pass it to spark-submit (it is required in new SparkConf())...
On line 10 you are creating "spark" context with SparkSesion which is approach in spark 2.0.0. So in my opinion your problem is line num. 9 and I think you should remove it and work with SparkSesion or set reqiered configuration for SparkContext In case when you need sc.
You can access to sparkContext with sparkSession.sparkContext();
If you still want to use SparkConf you need to define master programatically:
val sparkConf = new SparkConf()
.setAppName("spark-application-name")
.setMaster("local[4]")
.set("spark.executor.memory","512m");
or with declarative approach in conf/spark-defaults.conf
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
Try using the below code:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
you need to provide the proper link to your aws cluster.

Spark : check your cluster UI to ensure that workers are registered

I have a simple program in Spark:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("spark://10.250.7.117:7077").setAppName("Simple Application").set("spark.cores.max","2")
val sc = new SparkContext(conf)
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
//first get the first 10 records
println("Getting the first 10 records: ")
ratingsFile.take(10)
//get the number of records in the movie ratings file
println("The number of records in the movie list are : ")
ratingsFile.count()
}
}
When I try to run this program from the spark-shell i.e. I log into the name node (Cloudera installation) and run the commands sequentially on the spark-shell:
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
println("Getting the first 10 records: ")
ratingsFile.take(10)
println("The number of records in the movie list are : ")
ratingsFile.count()
I get correct results, but if I try to run the program from eclipse, no resources are assigned to program and in the console log all I see is:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Also, in the Spark UI, I see this:
Job keeps Running - Spark
Also, it should be noted that this version of spark was installed with Cloudera (hence no worker nodes show up).
What should I do to make this work?
EDIT:
I checked the HistoryServer and these jobs don't show up there (even in incomplete applications)
I have done configuration and performance tuning for many spark clusters and this is a very common/normal message to see when you are first prepping/configuring a cluster to handle your workloads.
This is unequivocally due to insufficient resources to have the job launched. The job is requesting one of:
more memory per worker than allocated to it (1GB)
more CPU's than available on the cluster
Finally figured out what the answer is.
When deploying a spark program on a YARN cluster, the master URL is just yarn.
So in the program, the spark context should just looks like:
val conf = new SparkConf().setAppName("SimpleApp")
Then this eclipse project should be built using Maven and the generated jar should be deployed on the cluster by copying it to the cluster and then running the following command
spark-submit --master yarn --class "SimpleApp" Recommender_2-0.0.1-SNAPSHOT.jar
This means that running from eclipse directly would not work.
You can check your cluster's work node cores: your application can't exceed that. For example, you have two work node. And per work node you have 4 cores. Then you have 2 applications to run. So you can give every application 4 cores to run the job.
You can set like this in the code:
SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan")
.set("spark.cores.max", "4");
It works for me.
There are also some causes of this same error message other than those posted here.
For a spark-on-mesos cluster, make sure you have java8 or newer java version on mesos slaves.
For spark standalone, make sure you have java8 (or newer) on the workers.
You don't have any workers to execute the job. There are no available cores for the job to execute and that's the reason the job's state is still in 'Waiting'.
If you have no workers registered with Cloudera how will the jobs execute?

Deploying Play Framework (needing a SparkContext) on cluster

We are deploying to a Cluster a Play! Framework (version 2.3) app that runs jobs needing a SparkContext (e.g : search in Hbase, or WordCount). Our cluster is a Yarn cluster.
To start our app we execute the command line :
activator run
(after compiling and packaging)
We are experimenting some issues with the SparkContext configuration, we use this configuration :
new SparkConf(false)
.setMaster("yarn-cluster")
.setAppName("MyApp")
val sc = new SparkContext(sparkConf)
When we call a route that calls a job using the SparkContext we and up with this error :
[RuntimeException: java.lang.ExceptionInInitializerError]
or this one (when we reload):
[RuntimeException: java.lang.NoClassDefFoundError: Could not initialize class mypackage.MyClass$]
To test our code we changed the the parameter setMaster to
.setMaster("local[4]")
And it worked well. But of course our code is not distributed and we do not use the Spark capabilities of distributing our code (e.g : RDD).
Is running our App with the spark submit command a solution? If it is how can this be done ?
We would rather still use the activator command.