Difference between running a Spark application standalone vs via spark-submit / SparkLauncher? - scala

I am exploring different options to package a Spark application and I am confused about which is the best mode and what the differences are between the following:
Submitting the Spark application's jar via spark-submit.
Building a fat jar from the Spark Gradle project and running the jar as a standalone Java application.
I have tried both ways, but my requirement is to package the Spark application inside a Docker container. Running a fat jar looks easiest to me, but as a newbie I have no idea about the restrictions I may face with the fat-jar approach (leaving aside the fact that the fat jar may grow in size).
Can you please share your inputs?
Also, is it possible to set up a Spark cluster, including driver and executors, programmatically?
import org.apache.kafka.common.serialization.{LongDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils

try {
  // Driver/executor topology configured in code rather than via spark-submit
  val conf = new SparkConf()
  conf.setMaster("local")
  conf.set("deploy-mode", "client")
  conf.set("spark.executor.instances", "2")
  conf.set("spark.driver.bindAddress", "127.0.0.1")
  conf.setAppName("local-spark-kafka-consumer")

  val sparkSession = SparkSession
    .builder()
    .master("local[*]")
    .config(conf)
    .appName("Spark SQL data sources example")
    .getOrCreate()

  val sc = sparkSession.sparkContext
  val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092,localhost:9093",
    "key.deserializer" -> classOf[LongDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "consumerGroup10",
    "auto.offset.reset" -> "earliest",
    "max.poll.records" -> "1",
    "enable.auto.commit" -> (false: java.lang.Boolean))

  val topics = Array("topic1")
  val stream = KafkaUtils.createDirectStream[String, String](...)

  ssc.start()
  ssc.awaitTermination()
} catch {
  case e: Exception => println(e)
}

Using fat jars to deploy Spark jobs is a long-established practice. You can do this, trust me :) Just be careful about what you package inside it.
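One practical difference worth keeping in mind, shown as a minimal sketch below (the app name is a placeholder): when the jar is handed to spark-submit, you normally leave the master out of the code so that --master and --deploy-mode on the command line decide where the driver and executors run; a fat jar started with plain java -jar has to set the master itself, as your local[*] example already does.

import org.apache.spark.sql.SparkSession

// Sketch for the spark-submit case: no setMaster in code, so the cluster manager,
// deploy mode and executor settings all come from the spark-submit command line.
val spark = SparkSession.builder()
  .appName("my-spark-app") // placeholder name
  .getOrCreate()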

Related

How to set a specific offset number while consuming messages from a Kafka topic through Spark Streaming (Scala)

I am using the Spark Streaming Scala code below for consuming real-time Kafka messages from a producer topic.
But the issue is that sometimes my job fails due to server connectivity or some other reason, and since the auto commit property is set to true in my code, some messages are lost and never stored in my database.
So I just want to know whether there is any way to pull old Kafka messages from a specific offset number.
I tried setting "auto.offset.reset" to earliest or latest, but it fetches only new messages that have not yet been committed.
Let's take an example: my current offset number is 1060 and the auto offset reset property is earliest, so when I restart my job it starts reading messages from 1061. But in some cases, if I want to read old Kafka messages from offset number 1020, is there any property we can use to start consuming messages from a specific offset number?
import org.apache.kafka.common.serialization.StringDeserializer
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val topic = "test123"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[KafkaAvroDeserializer],
  "schema.registry.url" -> "http://abc.test.com:8089",
  "group.id" -> "spark-streaming-notes",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, Object](
  ssc,
  PreferConsistent,
  Subscribe[String, Object](Array(topic), kafkaParams))

stream.print()

ssc.start()
ssc.awaitTermination()
From Spark Streaming, you can't. You'd need to use the kafka-consumer-groups CLI to commit offsets for your specific group id, or manually construct a KafkaConsumer instance and invoke commitSync before starting the Spark context.
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition

val c = new KafkaConsumer(...)
val toCommit: java.util.Map[TopicPartition, OffsetAndMetadata] = ...
c.commitSync(toCommit) // but don't do this on every run of your app
ssc.start()
Alternatively, Structured Streaming does offer a startingOffsets option.
auto.offset.reset only applies to group.ids that have no committed offsets yet.
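For completeness, here is a minimal sketch of the Structured Streaming route mentioned above, assuming a SparkSession named spark and reusing the topic from the question; the partition and offset values are only an illustration (-2 means earliest, -1 means latest):

// Start reading partition 0 of test123 at offset 1020 (illustrative values).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test123")
  .option("startingOffsets", """{"test123":{"0":1020}}""")
  .load()

Note that startingOffsets is only honoured when a query starts fresh; once a checkpoint exists, the query resumes from where it left off.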

Why does the Kafka consumer code freeze when I start the Spark stream?

I am new to Kafka and trying to implement Kafka consumer logic in Spark 2, and when I run all my code in the shell and start the streaming, it shows nothing.
I have viewed many posts on Stack Overflow but nothing helped me. I have even downloaded all the dependency jars from Maven and tried to run with them, but it still shows nothing.
Spark version: 2.2.0
Scala version: 2.11.8
The jars I downloaded are kafka-clients-2.2.0.jar and spark-streaming-kafka-0-10_2.11-2.2.0.jar, but I still face the same issue.
Please find the code snippet below:
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.kafka010.{KafkaUtils, ConsumerStrategies, LocationStrategies}

val brokers = "host1:port, host2:port"
val groupid = "default"
val topics = "kafka_sample"
val topicset = topics.split(",").toSet

val ssc = new StreamingContext(sc, Seconds(2))

val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
  ConsumerConfig.GROUP_ID_CONFIG -> groupid,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)

val msg = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](topicset, kafkaParams)
)

msg.foreachRDD { rdd =>
  rdd.collect().foreach(println)
}

ssc.start()
I am expecting Spark Streaming to start, but it doesn't do anything. What mistake have I made here? Or is this a known issue?
The driver will sit idle unless you call ssc.awaitTermination() at the end. Also, if you're using spark-shell, it's not a good tool for streaming jobs.
Please use interactive tools like Zeppelin or Spark Notebook for interacting with streaming, or build your app as a jar file and then deploy it.
Also, if you're trying out Spark streaming, Structured Streaming would be a better choice, as it is quite easy to work with.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
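To give a flavour of that, here is a minimal sketch of the equivalent Structured Streaming pipeline (broker and topic names are the placeholders from the question, and spark is assumed to be a SparkSession); note that it also needs a blocking awaitTermination at the end:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port,host2:port")
  .option("subscribe", "kafka_sample")
  .load()

val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination() // like ssc.awaitTermination(), this keeps the driver running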
After ssc.start(), add ssc.awaitTermination() to your code.
For testing, write your code in a main object and run it in an IDE like IntelliJ.
You can use the command shell to publish messages from a Kafka producer.
I have written all these steps in a simple example in a blog post with working code on GitHub. Please refer to: http://softwaredevelopercentral.blogspot.com/2018/10/spark-streaming-and-kafka-integration.html
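For reference, here is a minimal sketch that combines the question's snippet with both suggestions (a main object plus awaitTermination); the brokers and topic are the placeholders from the question:

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaSampleConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-sample")
    val ssc = new StreamingContext(conf, Seconds(2))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "host1:port,host2:port",
      ConsumerConfig.GROUP_ID_CONFIG -> "default",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    val msg = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("kafka_sample"), kafkaParams)
    )

    msg.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination() // keeps the driver alive; without it the JVM exits right away
  }
}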

Spark Cassandra tuning

How do I set the following Cassandra write parameters in Spark Scala code, for DataStax Spark Cassandra Connector 1.6.3 and Spark 1.6.2?
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
Thanks,
Chandra
In DataStax Spark Cassandra Connector 1.6.X, you can pass these parameters as part of your SparkConf.
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "192.168.123.10")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
.set("spark.cassandra.output.batch.size.rows", "100")
.set("spark.cassandra.output.concurrent.writes", "100")
.set("spark.cassandra.output.batch.size.bytes", "100")
.set("spark.cassandra.output.batch.grouping.key", "partition")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to the connector's README for more information.
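For completeness, here is a minimal sketch of a write that those settings would apply to; the keyspace ks, table kv and the KV case class are hypothetical:

import com.datastax.spark.connector._

case class KV(key: Int, value: String)

// Any saveToCassandra call made through this SparkContext picks up the
// spark.cassandra.output.* settings configured above.
val rdd = sc.parallelize(Seq(KV(1, "one"), KV(2, "two")))
rdd.saveToCassandra("ks", "kv", SomeColumns("key", "value"))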
The most flexible way is to put those properties in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read your configuration from spark.conf when creating the Spark context.
That way, you can modify the properties in spark.conf without needing to recompile your code each time.

AbstractMethodError creating Kafka stream

I'm trying to open a Kafka stream (tried Kafka versions 0.11.0.2 and 1.0.1) using the createDirectStream method and I'm getting this AbstractMethodError:
Exception in thread "main" java.lang.AbstractMethodError
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
at org.apache.spark.streaming.kafka010.KafkaUtils$.initializeLogIfNecessary(KafkaUtils.scala:39)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.streaming.kafka010.KafkaUtils$.log(KafkaUtils.scala:39)
at org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)
at org.apache.spark.streaming.kafka010.KafkaUtils$.logWarning(KafkaUtils.scala:39)
at org.apache.spark.streaming.kafka010.KafkaUtils$.fixKafkaParams(KafkaUtils.scala:201)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.<init>(DirectKafkaInputDStream.scala:63)
at org.apache.spark.streaming.kafka010.KafkaUtils$.createDirectStream(KafkaUtils.scala:147)
at org.apache.spark.streaming.kafka010.KafkaUtils$.createDirectStream(KafkaUtils.scala:124)
This is how I'm calling it:
import org.apache.kafka.common.serialization.{IntegerDeserializer, StringDeserializer}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, LocationStrategies}
import org.apache.spark.streaming.kafka010.KafkaUtils.createDirectStream

val preferredHosts = LocationStrategies.PreferConsistent
val kafkaParams = Map(
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[IntegerDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> groupId,
  "auto.offset.reset" -> "earliest"
)
val aCreatedStream = createDirectStream[String, String](ssc, preferredHosts,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
I have Kafka running on 9092 and I'm able to create producers and consumers and pass messages between them, so I'm not sure why it's not working from the Scala code. Any ideas appreciated.
Turns out I was using Spark 2.3 when I should have been using Spark 2.2. Apparently that method was made abstract in the later version, so I was getting that error.
I had the same exception; in my case I had built the application jar with a dependency on spark-streaming-kafka-0-10_2.11 version 2.1.0, while trying to deploy it to a Spark 2.3.0 cluster.
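In other words, the spark-streaming-kafka-0-10 artifact has to match the Spark version (and Scala binary version) of the cluster you deploy to. A minimal build.sbt sketch, assuming a Spark 2.3.0 / Scala 2.11 cluster:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                 % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"            % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
)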
I received the same error. I set my dependencies to the same version as my Spark interpreter:
%spark2.dep
z.reset()
z.addRepo("MavenCentral").url("https://mvnrepository.com/")
z.load("org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0")
z.load("org.apache.kafka:kafka-clients:2.3.0")

Why does a Spark Streaming application work fine using sbt run but not on Tomcat (as a web application)?

I have a Spark application in Scala which grabs records from Kafka every 10 seconds and saves them as files. This is an SBT project and I run my app with the sbt run command. Everything works fine until I deploy my app on Tomcat. I managed to generate a WAR file with this plugin, but it looks like my app does not do anything when deployed on Tomcat.
This is my code:
import scala.collection.mutable.ArrayBuffer

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object SparkConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group_id",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("mytopic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.map(record => (record.key, record.value)).print()

    val arr = new ArrayBuffer[String]()
    val lines = stream.map(record => (record.key, record.value))

    stream.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        val date = System.currentTimeMillis()
        rdd.saveAsTextFile("/tmp/sparkout/mytopic/" + date.toString)
        rdd.foreach { record => println("t=" + record.topic + " m=" + record.toString()) }
      }
      println("Stream had " + rdd.count() + " messages")

      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        println(o)
      }
    }

    stream.saveAsTextFiles("/tmp/output")

    ssc.start()
    ssc.awaitTermination()
  }
}
The strange thing is that the app works completely fine when run via the sbt run command. It reads the records from Kafka properly and saves them as files in the desired directory. I have no idea what is happening. I tried to enable logging with log4j, but it doesn't log anything when running on Tomcat. I've been looking for an answer but haven't found a solution.
To sum up: my Scala Spark app (an SBT project) should read records from Kafka and save them as files every 10 seconds. It works when run via sbt run, but it doesn't when deployed on Tomcat.
Additional info:
Scala 2.12
Tomcat 7
SBT 0.13.15
Ask if you need more details.
Q: What is the problem?
tl;dr The standalone application SparkConsumer behaves exactly as a standalone application should, and so does Tomcat as a servlet container; the mismatch is in what you expect them to do together.
I'm very surprised to have read the question, because your code is not something I'd ever expect to work on Tomcat. Sorry.
Tomcat is a servlet container and, as such, requires servlets in a web application.
Even though you managed to create a WAR and deploy it to Tomcat, nothing in that web application "triggers" the Spark Streaming application (the code inside the main method).
The Spark Streaming application works fine when executed using sbt run because that is exactly what sbt run is for: executing a standalone application in an sbt-managed project.
Given that you have only one standalone application in your sbt project, sbt run manages to find SparkConsumer and execute its main entry method. No surprise here.
It won't work on Tomcat, however. You'd have to expose the application behind a POST or GET endpoint and use an HTTP client (a browser, or a command-line tool like curl, wget or httpie) to trigger it.
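If you really do want Tomcat itself to start the job when the WAR is deployed, another option (a rough sketch of a different approach than the endpoint suggested above, assuming the Servlet 3.0 API that Tomcat 7 provides) is to hook it into the webapp lifecycle with a ServletContextListener:

import javax.servlet.{ServletContextEvent, ServletContextListener}
import javax.servlet.annotation.WebListener

// Rough sketch: start the streaming job in a background thread when the webapp
// is deployed, so that Tomcat actually triggers the main method.
@WebListener
class SparkConsumerBootstrap extends ServletContextListener {

  override def contextInitialized(sce: ServletContextEvent): Unit = {
    val worker = new Thread(new Runnable {
      override def run(): Unit = SparkConsumer.main(Array.empty) // blocks in awaitTermination
    })
    worker.setDaemon(true)
    worker.start()
  }

  override def contextDestroyed(sce: ServletContextEvent): Unit = {
    // In a real application you would keep a handle to the StreamingContext and stop it gracefully here.
  }
}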
Also, Spark does not support Scala 2.12, so... how did you manage to use that Scala version with Spark?! Impossible!