Running an Apache Spark Example Application in IntelliJ IDEA - Scala

I am trying to run the SparkPi.scala example program in NetBeans. Unfortunately, I am quite new to Spark and have not been able to execute it successfully.
My preference is to work in NetBeans only and execute from there. I know Spark also allows executing from the Spark console; however, I prefer not to take that approach.
This is my build.sbt file contents:
name := "SBTScalaSparkPi"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
This is my plugins.sbt file contents:
logLevel := Level.Warn
This is the program I am trying to execute:
import scala.math.random
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
JDK version: 1.8.
The error I get when trying to execute the code is given below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/25 07:50:25 INFO SparkContext: Running Spark version 1.6.1
16/03/25 07:50:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/25 07:50:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:401)
at SparkPi.main(SparkPi.scala)
16/03/25 07:50:26 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>
at SparkPi$.main(SparkPi.scala:28)
at SparkPi.main(SparkPi.scala)
Process finished with exit code 1
Thanks in advance for any help.

A master URL must be set in your configuration
You must set a spark.master in your SparkConf. There are only two mandatory parameters you must set - the master and the app name, which you have already set. For more details, see the Initializing Spark section in the docs.
Which master should you use? See the Master URLs section for all the options. The simplest option for testing is local, which runs the entire Spark system (driver, master, worker) on your local machine, with no extra configuration.
To set the master through the Scala API:
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)
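As a variation (my note, not part of the original answer), local[*] runs with as many worker threads as your machine has cores, while local[4] pins it to four:
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local[*]") // use all local cores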

The start of your program just lacks the URL that points to the Spark master endpoint. You can specify this as a command-line parameter in IntelliJ. The master URL is the URL and port where the Spark master of your cluster is running. An example command-line parameter looks like this:
-Dspark.master=spark://myhost:7077
See the answer to this question for details:
How to set Master address for Spark examples from command line
Perhaps for your first runs you want to just start a local Spark standalone environment. How to get that running is well documented here: http://spark.apache.org/docs/latest/spark-standalone.html
If you get this running, you can set up your Spark master config like this:
-Dspark.master=spark://localhost:7077
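A minimal sketch of how that fits into the code, assuming you add -Dspark.master=... as a VM option in the IntelliJ run configuration: SparkConf loads any spark.* Java system properties by default, so no setMaster call is needed.
// VM option in the run configuration: -Dspark.master=spark://localhost:7077
val conf = new SparkConf().setAppName("Spark Pi") // picks up spark.master from the system property
val spark = new SparkContext(conf)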

The master URL needs to be set. Using the setMaster("local") method solved the issue.
val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
val spark = new SparkContext(conf)

As a matter of fact, both @Matthias and @Tzach are right. You should choose your solution based on what is easier for you (for now, the first option is probably simpler). As soon as you start running your Spark job on a real cluster, it is far better not to hardcode the master parameter, so that you can run the job in multiple cluster modes (YARN, Mesos, Standalone with spark-submit) and still keep it running locally from NetBeans (-Dspark.master=local[*]).
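A minimal sketch of that pattern (my own illustration, not from either answer): only fall back to a local master when none has been supplied externally via spark-submit or -Dspark.master.
val conf = new SparkConf().setAppName("Spark Pi")
if (!conf.contains("spark.master")) {
  conf.setMaster("local[*]") // IDE / local run; an externally supplied master takes precedence
}
val sc = new SparkContext(conf)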

Related

How to save Spark Streaming data in Cassandra

build.sbt
Below are the contents of my build.sbt file:
val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
Command to initialize shell:
The command below is the shell initialization procedure I followed:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
Note:
Here I specified the jar explicitly because SBT couldn't fetch the spark-streaming-kafka libraries needed to create the kafkaStream in the later sections.
Import required libraries:
This section lists the libraries imported and used at various points in the REPL session.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Setting up Spark Streaming Configuration:
Here I am setting up the configuration required for the Spark Streaming context:
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
conf.set("spark.driver.allowMultipleContexts", "true"); // Required to set this to true because during // shell initialization or starting we a spark context is created with configurations of highlighted
conf.setMaster("local"); // then we are assigning those cofigurations locally
Creation of SparkStreamingContext using above configurations:
Using the configuration defined above, we create a Spark Streaming context as follows:
val ssc = new StreamingContext(conf, Seconds(1)) // the Seconds value is the batch interval
Creating a Kafka stream using the Spark Streaming context (SSC) created above:
Here ssc is the Spark Streaming context created above,
"localhost:2181" is the ZooKeeper quorum,
"spark-streaming-consumer-group" is the consumer group, and
Map("test3" -> 5) is Map(topic -> number of partitions).
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("test3" -> 5)).map(_._2)
Note:
The values fetched when the kafkaStream object is printed with kafkaStream.print() are shown below:
85052,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85053,19,167.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85054,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85055,19,167.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85056,19,960.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85057,19,167.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85058,19,960.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
17/09/02 18:25:25 INFO JobScheduler: Finished job streaming job 1504376716000 ms.0 from job set of time 1504376716000 ms
17/09/02 18:25:25 INFO JobScheduler: Total delay: 9.661 s for time 1504376716000 ms (execution: 0.021 s)
17/09/02 18:25:25 INFO JobScheduler: Starting job streaming job 1504376717000 ms.0 from job set of time 1504376717000 ms
Transforming the kafkaStream and saving in Cassandra:
kafkaStream.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    rdd.map(line => {
      val arr = line.split(",")
      (arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9), arr(10), arr(11))
    }).saveToCassandra("test", "sensorfeedVals", SomeColumns(
      "tableid", "ccid", "paramval", "batVal", "time", "gwid", "gwhName", "snid", "snhName", "snStatus", "sd", "MId"))
  } else {
    println("No records to save")
  }
})
Start ssc:
The streaming is started with ssc.start().
The issues I am facing here are:
1. The content of the stream is printed only after I enter exit or press Ctrl+C.
2. Whenever I use ssc.start, does it start streaming immediately in the REPL, without giving me time to enter ssc.awaitTermination?
3. The main issue is that when I try to save normally with the procedure below:
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
I am able to save to Cassandra, but whenever I try to save using the logic shown in "Transforming the kafkaStream and saving in Cassandra", I cannot extract each value from the string and save it into the respective columns of the Cassandra table!
java.lang.NoClassDefFoundError: Could not initialize class com.datastax.spark.connector.cql.CassandraConnector
This means the classpath has not been correctly set up for your application. Make sure you are using the --packages option when launching your application, as noted in the Spark Cassandra Connector (SCC) docs.
For your other issues
You don't need awaitTermination in the REPL because the REPL will not quit immediately after starting the streaming context. That call is there for an application that may have no further instructions, to prevent the main thread from exiting.
start will begin the streaming immediately.
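For reference, a minimal sketch of the usual pattern in a standalone application (as opposed to the REPL), where awaitTermination keeps the main thread alive:
ssc.start()            // begin consuming and processing batches
ssc.awaitTermination() // block the main thread until the context is stopped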
A line or two of code related to contexts was causing the issue here!
I found the solution when I walked through the topic of contexts.
I was running multiple contexts, but they were independent of each other.
I had initialized the shell with the command below:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
So when the shell starts, a Spark context is initialized with the properties of the DataStax connector.
Later I created some configurations and, using those, created a Spark Streaming context, and with that context I created the kafkaStream. The kafkaStream carried only the properties of the SSC, not of the original SC, which is what caused the problem when storing into Cassandra.
I resolved it as shown below, and it worked!
val sc = new SparkContext(new SparkConf().setAppName("Spark-Kafka-Streaming").setMaster("local[*]").set("spark.cassandra.connection.host", "127.0.0.1"))
val ssc = new StreamingContext(sc, Seconds(10))
Thanks to everyone who came forward to support!
Let me know if there are any better ways to achieve this!
A very simple approach is to use the foreachRDD API: convert each RDD in the stream to a DataFrame and save it to Cassandra using the Spark SQL Cassandra data source API. Below is a simple code snippet where I save Twitter tweets to a Cassandra table:
stream.foreachRDD(rdd => {
  if (rdd.count() > 0) {
    val data = rdd.filter(status => status.getLang.equals("en")).map(status => TweetsClass(status.getId,
      status.getCreatedAt.toGMTString(),
      status.getUser.getLocation,
      status.getText)).toDF()
    // Save the data to Cassandra
    data.write.
      format("org.apache.spark.sql.cassandra").
      options(Map(
        "table" -> "sentiment_tweets",
        "keyspace" -> "My Keyspace",
        "cluster" -> "My Cluster")).
      mode(SaveMode.Append).save()
  }
})
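One detail worth noting (my addition, assuming Spark 1.6-style code): toDF() needs the SQLContext implicits in scope, and TweetsClass is assumed to be a case class defined elsewhere, for example:
import org.apache.spark.sql.{SQLContext, SaveMode}
case class TweetsClass(id: Long, createdAt: String, location: String, text: String) // hypothetical definition
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // brings toDF() into scope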

How to use mesos master url in a self-contained Scala Spark program

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through Mesos.
I create the Spark context like this:
val conf = new SparkConf().setMaster("mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark").setAppName("foo")
val sc = new SparkContext(conf)
I found out from searching around that you have to set the MESOS_NATIVE_JAVA_LIBRARY env var to point to the libmesos library, so when running my Scala program I do this:
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib sbt run
But, this results in a SparkException:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Could not parse Master URL: 'mesos://zk://<mesos-url1>,<mesos-url2>/spark/mesos-rtspark'
At the same time, using spark-submit seems to work fine after exporting the MESOS_NATIVE_JAVA_LIBRARY env var.
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib spark-submit --class <MAIN CLASS> ./target/scala-2.10/<APP_JAR>.jar
Why?
How can I make the standalone program run like spark-submit?
Add the spark-mesos jar to your classpath.
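As a sketch, assuming a Spark 2.x build where Mesos support lives in a separate module (verify the artifact name and version against your Spark release), the build.sbt entry would look roughly like this:
val sparkVersion = "2.1.0" // hypothetical; match your cluster
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-mesos" % sparkVersion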

Running Sparkling-Water with external H2O backend

I was following the steps for running Sparkling Water with the external backend from here. I am using Spark 1.4.1 and sparkling-water-1.4.16; I've built the extended H2O jar and exported the H2O_ORIGINAL_JAR and H2O_EXTENDED_JAR environment variables. I start the H2O backend with
java -jar $H2O_EXTENDED_JAR -md5skip -name test
But when I start sparkling water via
./bin/sparkling-shell
and in it try to get the H2OConf with
import org.apache.spark.h2o._
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(sc, conf)
it fails on the second line with
<console>:24: error: trait H2OConf is abstract; cannot be instantiated
val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
^
I've tried adding the newly built extended H2O jar with the --jars parameter either to Sparkling Water or to standalone Spark, with no progress. Does anyone have any hints?
This is unsupported for versions of Spark earlier than 2.0.
Download the latest version of the Sparkling Water jar and add it when starting the sparkling-shell:
./bin/sparkling-shell --master yarn-client --jars "<path to the jar located>"
Then run the code, pointing to the extended H2O driver:
import org.apache.spark.h2o._
val conf = new H2OConf(spark).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("//home//xyz//sparkling-water-2.2.5/bin//h2odriver-sw2.2.5-hdp2.6-extended.jar").setNumOfExternalH2ONodes(2).setMapperXmx("6G")
val hc = H2OContext.getOrCreate(spark, conf)

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write-ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, with what looks like some data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)
    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are testing is a driver-failure scenario. For this to work, depending on the cluster you are running on, you have to follow the instructions below to monitor the driver process and relaunch it if it fails.
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to a non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - YARN supports a similar mechanism for automatically restarting an application. Please refer to the YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
You also need to configure write-ahead logs as below; there are special instructions for S3 which you need to follow.
While using S3 (or any file system that does not support flushing) for write-ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
See Spark Streaming Configuration for more details.
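For example, on Spark Standalone in cluster deploy mode the driver can be supervised like this (class name and jar path are placeholders for your application):
spark-submit --master spark://<master-host>:7077 --deploy-mode cluster --supervise --class ProtoDemo /path/to/app.jar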
The issue looks more like a Kryo serializer issue than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured.
Since it is not configured, a KryoException should not be possible.
When using write-ahead logs and restoring from a checkpoint directory, all Spark configuration is read from there.
In your example, the createContext method is not called when starting from the checkpoint.
I assume another application was tested earlier with the same checkpoint directory, with the Kryo serializer configured, and the current application fails to restore from that checkpoint.
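If that is the case, a hedged way to make the restore consistent is to configure the same serializer explicitly in createContext, so the checkpointed data and the restarted application agree (configuration keys from the Spark docs):
val conf = new SparkConf().setAppName("mything")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // only if the checkpoint was written with Kryo
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")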

Simple Spark program eats all resources

I have a server running a Spark master and a slave. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
  val sc = new SparkContext(conf)
  println(sc.parallelize(Array(1, 2, 3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log on program executing:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory for my app (e.g. 10 GB), the following logs appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason lies in the connection between the master and the slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, everything is fine:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also look at your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them according to your needs.
In the spark-defaults.conf file, set spark.cores.max=4.
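Equivalently (a sketch, not from the original answer), the same cap can be set in code on the SparkConf:
val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  .set("spark.cores.max", "4") // limit the total cores this application may claim on the standalone cluster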
It was a driver issue. The driver (my Scala app) was running on my local computer, and the workers had no access to it. As a result, all resources were consumed by attempts to reconnect to the driver.