Spark Streaming, kafka: java.lang.StackOverflowError - scala

I am getting below error in spark-streaming application, i am using kafka for input stream. When i was doing with socket, it was working fine. But when i changed to kafka it's giving error. Anyone has idea why it's throwing error, do i need to change my batch time and check pointing time?
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.StackOverflowError
My program:
def main(args: Array[String]): Unit = {
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val conf = new SparkConf().setAppName("HBaseStream")
val sc = new SparkContext(conf)
// create a StreamingContext, the main entry point for all streaming functionality
val ssc = new StreamingContext(sc, Seconds(5))
val brokers = args(0)
val topics= args(1)
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
val inputStream = messages.map(_._2)
// val inputStream = ssc.socketTextStream(args(0), args(1).toInt)
ssc.checkpoint(checkpointDirectory)
inputStream.print(1)
val parsedStream = inputStream
.map(line => {
val splitLines = line.split(",")
(splitLines(1), splitLines.slice(2, splitLines.length).map((_.trim.toLong)))
})
import breeze.linalg.{DenseVector => BDV}
import scala.util.Try
val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
(current: Seq[Array[Long]], prev: Option[Array[Long]]) => {
prev.map(_ +: current).orElse(Some(current))
.flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
})
state.checkpoint(Duration(10000))
state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
}
}

Try to delete the checkpoint directory.
I'm not sure but it seems that your streaming context fails to restore from the checkpoints.
anyway, it worked for me.

Related

foreachRDD in Spark Stream will not write file

This is my Spark Streaming job:
object StreamingTest {
def main(args: Array[String]) = {
val sparkConf = new SparkConf().setAppName("StreamingTest")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("checkpoint")
val lines = KafkaUtils.createStream(
ssc,
"localhost:2181/kafka",
"streaming-group",
Map("streaming-topic" -> 1),
StorageLevel.MEMORY_AND_DISK
).map(_.2)
lines.flatMap(_.split(" ")).map(token => s"${token} => ${token.hashCode}")
.foreachRDD(rdd => {
rdd.saveAsTextFile(s"/results/raw-${System.currentTimeInMillis()}.txt")
})
lines.flatMap(_.split(" ")).map(token => s"${token} => ${token.hashCode}")
.saveAsTextFiles("/results/raw", "test")
ssc.start
ssc.awaitTermination
}
}
The last save operation works. It writes files to /results/raw. The one inside the foreach loop does not. Can someone explain why?

Save Scala Spark Streaming Data to MongoDB

Here's my simplified Apache Spark Streaming code which gets input via Kafka Streams, combine, print and save them to a file. But now i want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
.setAppName("StreamingDataToMongoDB")
.set("spark.streaming.concurrentJobs", "2")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet
val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)
val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)
val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried Stratio Spark-MongoDB Library and some other resources but still no success. Someone please help me proceed or redirect me to some useful working resource/tutorial. Cheers :)
If you want to write out to a format which isn't directly supported on DStreams you can use foreachRDD to write out each batch one-by-one using the RDD based API for Mongo.
lines1.foreachRDD ( rdd => {
rdd.foreach( data =>
if (data != null) {
// Save data here
} else {
println("Got no data in this window")
}
)
})
Do same for lines2.

Reactive Kafka not working

I am trying out an simple reactive-kafka program which reads and writes to Kafka. It starts up but does nothing even when I am publishing messages to the input topic.
implicit val system = ActorSystem("main")
implicit val materializer = ActorMaterializer()
val kafkaUrl: String = "localhost:9092"
val producerSettings = ProducerSettings(system, new ByteArraySerializer, new StringSerializer)
.withBootstrapServers(kafkaUrl)
val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer,
Set("inputTopic"))
.withBootstrapServers(kafkaUrl)
.withGroupId("group1")
.withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
val flow: RunnableGraph[Control] = Consumer.committableSource(consumerSettings.withClientId("client1"))
.map { msg =>
println("msg = " + msg)
Producer.Message(new ProducerRecord[Array[Byte], String]("test.topic2", msg.value), msg.committableOffset)
}
.to(Producer.commitableSink(producerSettings))
flow.run()
It just stays there forever. Any tips on debugging why this is not working?

Writing data to cassandra using spark

I have a spark job written in Scala, in which I am just trying to write one line separated by commas, coming from Kafka producer to Cassandra database. But I couldn't call saveToCassandra.
I saw few examples of wordcount where they are writing map structure to Cassandra table with two columns and it seems working fine. But I have many columns and I found that the data structure needs to parallelized.
Here's is the sample of my code:
object TestPushToCassandra extends SparkStreamingJob {
def validate(ssc: StreamingContext, config: Config): SparkJobValidation = SparkJobValid
def runJob(ssc: StreamingContext, config: Config): Any = {
val bp_conf=BpHooksUtils.getSparkConf()
val brokers=bp_conf.get("bp_kafka_brokers","unknown_default")
val input_topics = config.getString("topics.in").split(",").toSet
val output_topic = config.getString("topic.out")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, input_topics)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(","))
val li = words.par
li.saveToCassandra("testspark","table1", SomeColumns("col1","col2","col3"))
li.print()
words.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach{
case x:String=>{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val outMsg=x+" from spark"
val producer = new KafkaProducer[String,String](props)
val message=new ProducerRecord[String, String](output_topic,null,outMsg)
producer.send(message)
}
}
)
)
ssc.start()
ssc.awaitTermination()
}
}
I think it's the syntax of Scala that I am not getting correct.
Thanks in advance.
You need to change your words DStream into something that the Connector can handle.
Like a Tuple
val words = lines
.map(_.split(","))
.map( wordArr => (wordArr(0), wordArr(1), wordArr(2))
or a Case Class
case class YourRow(col1: String, col2: String, col3: String)
val words = lines
.map(_.split(","))
.map( wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))
or a CassandraRow
This is because if you place an Array there all by itself it could be an Array in C* you are trying to insert rather than 3 columns.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md

Receiving empty data from Kafka - Spark Streaming

Why am I getting empty data messages when I read a topic from kafka?
Is it a problem with the Decoder?
*There is no error or exception.
Code:
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Queue Status")
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint("/tmp/")
val kafkaConfig = Map("zookeeper.connect" -> "ip.internal:2181",
"group.id" -> "queue-status")
val kafkaTopics = Map("queue_status" -> 1)
val kafkaStream = KafkaUtils.createStream[String, QueueStatusMessage, StringDecoder, QueueStatusMessageKafkaDeserializer](
ssc,
kafkaConfig,
kafkaTopics,
StorageLevel.MEMORY_AND_DISK)
kafkaStream.window(Minutes(1),Seconds(10)).print()
ssc.start()
ssc.awaitTermination()
}
The Kafka decoder:
class QueueStatusMessageKafkaDeserializer(props: VerifiableProperties = null) extends Decoder[QueueStatusMessage] {
override def fromBytes(bytes: Array[Byte]): QueueStatusMessage = QueueStatusMessage.parseFrom(bytes)
}
The (empty) result:
-------------------------------------------
Time: 1440010266000 ms
-------------------------------------------
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
Solution:
Just strictly specified the types in the Kafka topic Map:
val kafkaTopics = Map[String, Int]("queue_status" -> 1)
Still don't know the reason for the problem, but the code is working fine now.