More than one spark context error - scala

I have this spark code below:
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Hbase {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Spark-Hbase").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
...
val ssc = new StreamingContext(sparkConf, Seconds(3))
val kafkaBrokers = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, topics)
}
}
Now the error I am getting is:
Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
Is there anything wrong with my code above? I do not see where I am creating the context again...

These are the two SparkContext you're creating. This is not allowed.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sparkConf, Seconds(3))
You should create the streaming context from the original context.
val ssc = new StreamingContext(sc, Seconds(3))

you are initializing two spark context in the same JVM i.e. (sparkContext and streamingContext). That's why you are getting this exception. you can set spark.driver.allowMultipleContexts = true in config. Although, multiple Spark contexts is discouraged. You can get unexpected results.

Related

How to save a JSON file in ElasticSearch using Spark?

I am trying to save a JSON file in ElasticSearch but its not working.
This is my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._
import org.apache.spark.SparkConf
object HelloEs {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
conf.set("es.index.auto.create", "true")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
sen_p.registerTempTable("sensor_ptable")
sen_p.saveToEs("sensor/metrics")
}
}
I am getting also this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.elasticsearch.spark.sql.package$.sparkDataFrameFunctions(Lorg/apache/spark/sql/Dataset;)Lorg/elasticsearch/spark/sql/package$SparkDataFrameFunctions;
at learnscala.HelloEs$.main(HelloEs.scala:20)
at learnscala.HelloEs.main(HelloEs.scala)
There are multiple ways to save an RDD / Dataframe to Elastic Search.
Spark Dataframe can be written to Elastic Search using:
df.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource","<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
RDD can be written to ES using:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "<ES_RESOURCE_PATH>")
In your case, modify the code as below:
`
object HelloEs {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("WriteToES").setMaster("local")
conf.set("es.index.auto.create", "true")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sen_p = sqlContext.read.json("/home/Bureau/mydoc/Orange.json")
sen_p.write.format("org.elasticsearch.spark.sql").mode("append").option("es.resource","<ES_RESOURCE_PATH>").option("es.nodes", "http://<ES_HOST>:9200").save()
}
}
`

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted by computer, this should allow my code to work. However, I now get an incredibly confusing compile error,
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
object SpeedTester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
import spark.implicits._
val zookeeperConnect = "localhost:2181"
val sessionTimeoutMs = 10000
val connectionTimeoutMs = 10000
val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)
val topicName = "testTopic"
val numPartitions = 8
val replicationFactor = 1
val topicConfig = new Properties
val isSecureKafkaCluster = false
val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)
// Create producer for topic testTopic and actually push values to the topic
val props = new Properties()
props.put("bootstrap.servers", "localhost:9592")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val TOPIC = "testTopic"
for (i <- 1 to 50) {
val record = new ProducerRecord(TOPIC, "key", s"hello $i")
producer.send(record)
}
val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
producer.send(record)
producer.flush()
producer.close()
}
}
I know this is too late, but for others who will be looking for the same issue-
In the latest version of kafka, kafka.utils got deprecated. So please use kafka admin client apis

SQLContext.gerorCreate is not a value

I am getting error SQLContext.gerorCreate is not a value of object org.apache.spark.SQLContext. This is my code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.sql.functions
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types
import org.apache.spark.SparkContext
import java.io.Serializable
case class Sensor(id:String,date:String,temp:String,press:String)
object consum {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc=new SparkContext(sparkConf)
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("hello" -> 5))
def parseSensor(str:String): Sensor={
val p=str.split(",")
Sensor(p(0),p(1),p(2),p(3))
}
val data=lines.map(_._2).map(parseSensor)
val sqlcontext=new SQLContext(sc)
import sqlcontext.implicits._
data.foreachRDD { rdd=>
val sensedata=sqlcontext.getOrCreate(rdd.sparkContext)
}
I have tried with SQLContext.getOrCreate as well but same error.
There is no such getOrCreate function defined for neither SparkContext nor SQLContext.
getOrCreate function is defined for SparkSession instances from which SparkSession instances are created. And we get sparkContext instance or sqlContext instance from the SparkSession instance created using getOrCreate method call.
I hope the explanation is clear.
Updated
The explanation I did above is suitable for higher versions of spark. In the blog as the OP is referencing, the author is using spark 1.6 and the api doc of 1.6.3 clearly states
Get the singleton SQLContext if it exists or create a new one using the given SparkContext

Avoid multiple connections to mongoDB from spark streaming

We developed a spark streaming application that sources data from kafka and writes to mongoDB. We are noticing performance implications while creating connections inside foreachRDD on the input DStream. The spark streaming application does a few validations before inserting into mongoDB. We are exploring options to avoid connecting to mongoDB for each message that is processed, rather we desire to process all messages within one batch interval at once. Following is the simplified version of the spark streaming application. One of the things we did is append all the messages to a dataframe and try inserting the contents of that dataframe outside of the foreachRDD. But when we run this application, the code that writes dataframe to mongoDB does not get executed.
Please note that I commented out a part of the code inside foreachRDD which we used to insert each message into mongoDB. Existing approach is very slow as we are inserting one message at a time. Any suggestions on performance improvement is much appreciated.
Thank you
package com.testing
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats
object Sample_Streaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Sample_Streaming")
.setMaster("local[4]")
val sc = new SparkContext(sparkConf)
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val props = new HashMap[String, Object]()
val bootstrap_server_config = "127.0.0.100:9092"
val zkQuorum = "127.0.0.101:2181"
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val TopicMap = Map("sampleTopic" -> 1)
val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)
val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.input.uri", "connectionURI")
.option("spark.mongodb.input.collection", "schemaCollectionName")
.load()
val outSchema = schemaDf.schema
var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)
KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
{
val jsonInput: JValue = parse(x)
/*Do all the transformations using Json libraries*/
val json4s_transformed = "transformed json"
val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
val df = sqlContext.read.schema(outSchema).json(rdd)
//Earlier we were inserting each message into mongoDB, which we would like to avoid and process all at once
/* df.write.option("spark.mongodb.output.uri", "connectionURI")
.option("collection", "Collection")
.mode("append").format("com.mongodb.spark.sql").save()*/
outDf = outDf.union(df)
}
}
)
//Added this part of the code in expectation to access the unioned dataframe and insert all messages at once
//println(outDf.count())
if(outDf.count() > 0)
{
outDf.write
.option("spark.mongodb.output.uri", "connectionURI")
.option("collection", "Collection")
.mode("append").format("com.mongodb.spark.sql").save()
}
// Run the streaming job
ssc.start()
ssc.awaitTermination()
}
}
It sounds like you would want to reduce the number of connections to mongodb, for this purpose, you must use foreachPartition in code when you serve connection do mongodb see spec, the code will look like this:
rdd.repartition(1).foreachPartition {
//get instance of connection
//write/read with batch to mongo
//close connection
}

Spark Streaming StreamingContext error

Hi i am started spark streaming learning but i can't run an simple application
My code is here
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
And i am getting error like as the following
scala> val newscc = new StreamingContext(conf, Seconds(1))
15/10/21 13:41:18 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
Thanks
If you are using spark-shell, and it seems like you do, you should not instantiate StreamingContext using SparkConf object, you should pass shell-provided sc directly.
This means:
val conf = new SparkConf().setMaster("spark://beyhan:7077").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
becomes,
val ssc = new StreamingContext(sc, Seconds(1))
It looks like you work in the Spark Shell.
There is already a SparkContext defined for you there, so you don't need to create a new one. The SparkContext in the shell is available as sc
If you need a StreamingContext you can create one using the existing SparkContext:
val ssc = new StreamingContext(sc, Seconds(1))
You only need the SparkConf and SparkContext if you create an application.