Programmatically setting the (remote) master address for launching Spark - Scala

Note that the following local setting does work:
sc = new SparkContext("local[8]", testName)
But setting the remote master programmatically does not work:
sc = new SparkContext(master, testName)
or (with the same end result):
val sconf = new SparkConf()
.setAppName(testName)
.setMaster(master)
sc = new SparkContext(sconf)
In both of the latter cases the result is:
[16:25:33,427][INFO ][AppClient$ClientActor] Connecting to master akka.tcp://sparkMaster@mellyrn:7077/user/Master...
[16:25:33,439][WARN ][ReliableDeliverySupervisor] Association with remote system [akka.tcp://sparkMaster@mellyrn:7077]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
The following command-line approach for setting the Spark master consistently works (verified on multiple projects):
$SPARK_HOME/bin/spark-submit --master spark://mellyrn.local:7077 \
  --class $1 $curdir/sparkclass.jar
Clearly there is some additional configuration happening related to the command line spark-submit. Anyone want to posit what that might be?
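A minimal sketch of a pattern that avoids hard-coding the master, assuming the job is launched with spark-submit (which populates the spark.master property); the local fallback value is an assumption for test runs, not something from the question:

import org.apache.spark.{SparkConf, SparkContext}

// spark-submit --master ... sets "spark.master", so only fall back when it is absent.
val conf = new SparkConf().setAppName(testName)
if (!conf.contains("spark.master")) {
  conf.setMaster("local[8]") // assumed fallback for local testing
}
val sc = new SparkContext(conf)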

In the UNIX shell script below:
SP_MAST_URL=$($CASSANDRA_HOME/dse client-tool spark master-address)
echo $SP_MAST_URL
This prints the master URL from your Spark cluster environment. You can try this command-line utility (part of DSE) and pass the resulting URL to the spark-submit command.
Note: CASSANDRA_HOME is the path where the Apache Cassandra installation lives. It can be any UNIX file path, depending on your environment.

Related

Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX

Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX (the class where the field validation exists). I get this exception when I am trying to do field validations on a Spark DataFrame. Here is my code.
All classes and objects used are serializable. It fails on an AWS EMR Spark job (but works fine on my local machine).
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Append an "errorList" column (an array of structs) to the existing schema.
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
  .add("fieldName", StringType)
  .add("value", StringType)
  .add("message", StringType)))

// validators is a sequence of validations on columns in a Row.
// Validator method signature:
//   def checkForErrors(row: Row): (String, String, String) = {
//     // logic to validate the field in a row
//   }
val validateRow: Row => Row = (row: Row) => {
  val errorList = validators.map(validator => validator.checkForErrors(row))
  Row.merge(row, Row(errorList))
}

val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions: Spark 2.4.7 and Scala 2.11.8
Any ideas on why this might happen, or has anyone had the same issue?
I faced a very similar problem with EMR release 6.8.0 - in particular, the spark.jars configuration was not respected for me on EMR (I pointed it at the location of a JAR in S3), even though it seems to be a normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the "java.lang.ClassNotFoundException" in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definition lives), I set up an EMR step to be executed immediately after cluster creation to rewrite spark.driver.extraClassPath and spark.executor.extraClassPath so that they also contain the location of my additional JAR (in my case the JAR physically comes in a Docker image, but you could also set up a bootstrap action to copy it onto the cluster from S3), as per the code in the article under "For Amazon EMR release version 6.0.0 and later". The reason you have to do this "rewriting" is that EMR already populates spark.*.extraClassPath with a bunch of its own JAR locations, e.g. for the JARs that contain the S3 drivers, so you effectively have to append your own JAR location rather than just setting spark.*.extraClassPath to your location. If you do the latter (I tried it), you will lose a lot of EMR functionality, such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

How to use a Mesos master URL in a self-contained Scala Spark program

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through Mesos.
I create the Spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to specify the MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But, this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?
Add the spark-mesos JAR to your classpath.
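A minimal build.sbt sketch of that suggestion, assuming Spark 2.x (where the Mesos integration lives in its own spark-mesos module); the version shown is a placeholder to match your cluster:

// build.sbt sketch: pull in the Mesos cluster-manager module next to spark-core.
val sparkVersion = "2.4.7" // placeholder; use your cluster's Spark version

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-mesos" % sparkVersion
)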

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException

I keep getting this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/filename.txt
I have set up a standalone Spark cluster and I am trying to run this code on my master node.
val conf = new SparkConf()
  .setAppName("Recommendation Engine1")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)
val rawUserArtistData = sc.textFile("hdfs:/user_artist_data.txt").sample(false, 0.05)
On my terminal I run:
spark-submit --class com.latentview.spark.Reco --master spark://<MASTER NODE IP>:<PORT> --deploy-mode client \
  /home/cloudera/workspace/new/Sparksample/target/Sparksample-0.0.1-SNAPSHOT-jar-with-dependencies.jar
These are the various things that I tried:
I replaced hdfs:/filename.txt with the fs.defaultFS path that is present in my core-site.xml file
Replaced hdfs:/filename.txt with hdfs:// (in case it makes any difference)
Replaced hdfs:/ with file:// and later with file:/// to access the files on my local drive
None of this seems to work. Is there anything else that could be going wrong?
If I do hadoop fs -ls, this is where my files are.
Generally the path is:
hdfs://name-nodeIP:8020/path/to/file
In your case it must be:
hdfs://localhost:8020/user_artist_data.txt
or
hdfs://machine-name:8020/user_artist_data.txt
The org.apache.hadoop.mapred.InvalidInputException error means that Spark cannot create the RDD because the path "hdfs:/user_artist_data.txt" contains no file.
Try connecting to hdfs://localhost:8020/user_artist_data.txt and see whether any files are there.
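A minimal sketch of that suggestion applied to the question's code (the NameNode host, port, and directory are assumptions; take the real values from fs.defaultFS in core-site.xml):

// Use the fully qualified HDFS URI instead of the bare "hdfs:/..." form.
val rawUserArtistData = sc
  .textFile("hdfs://localhost:8020/user/cloudera/user_artist_data.txt") // assumed path
  .sample(false, 0.05)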

Simple Spark program eats all resources

I have a server running a Spark master and a slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
  val sc = new SparkContext(conf)
  println(sc.parallelize(Array(1, 2, 3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log output when the program executes:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory to my app (e.g. 10 GB), the following logs appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason lies in the connection between master and slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, everything is fine:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also look at your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them according to your needs.
In the spark-defaults.conf file, set spark.cores.max=4.
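The same cap can also be set programmatically when the SparkConf is built; a minimal sketch (the values are examples, not tuned recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Limit how much of the standalone cluster this app may claim:
// spark.cores.max caps total cores across executors, spark.executor.memory caps memory per executor.
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.cores.max", "4")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)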
It was a driver issue. The driver (my Scala app) was run on my local computer, and the workers had no access to it. As a result, all resources were eaten by attempts to reconnect to the driver.
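If the driver must stay outside the cluster, a hedged sketch of making it reachable from the workers (the hostname and port are placeholders, and any firewall in between must allow the connection):

import org.apache.spark.SparkConf

// Advertise an address and port that the workers can route back to.
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.driver.host", "my-laptop.example.com") // placeholder hostname
  .set("spark.driver.port", "7078")                  // placeholder port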

StreamingContext couldn't bind to a port used by Java

I have started the Spark master and workers and can easily run a MapReduce job like wordcount on HDFS.
Now I want to run streaming on a text stream, and when I try to make a new StreamingContext
I get this error:
scala> val ssc = new StreamingContext("spark://master:7077","test", Seconds(2))
13/07/17 11:13:45 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.2.105:48594
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
....
I checked the port and it was being used by a Java process. I killed the process, which got me out of spark-shell.
Is there any way I can change the StreamingContext's port to a random free port?
Java is the underlying process for Spark (Scala runs on the JVM). It is possible that you have multiple copies of Spark / Spark Streaming running. Can you look into that?
Specifically: I get the same result if I have a spark-shell already running.
You can check for other Spark processes:
ps -ef | grep spark | grep -v grep
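If the conflict turns out to be a fixed driver port, a minimal sketch of letting Spark pick a free port instead (assuming a Spark version where spark.driver.port is honored; 0 means a random free port):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Bind the driver to a random free port rather than a fixed one.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("test")
  .set("spark.driver.port", "0") // 0 = let Spark choose a free port
val ssc = new StreamingContext(conf, Seconds(2))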