Issue storing Avro Kafka streams to the file system - Scala

I want to store my Avro Kafka stream to the file system in delimited format using the Spark Streaming API, with the following Scala code, but I'm facing some challenges in achieving this:
record.write.mode(SaveMode.Append).csv("/Users/Documents/kafka-poc/consumer-out/")
Since record (a GenericRecord) is not a DataFrame or an RDD, I am not sure how to proceed.
Code
val messages = SparkUtilsScala.createCustomDirectKafkaStreamAvro(ssc, kafkaParams, zookeeper_host, kafkaOffsetZookeeperNode, topicsSet)
val requestLines = messages.map(_._2)
requestLines.foreachRDD((rdd, time: Time) => {
  rdd.foreachPartition { partitionOfRecords =>
    val recordInjection = SparkUtilsJava.getRecordInjection(topicsSet.last)
    for (avroLine <- partitionOfRecords) {
      val record = recordInjection.invert(avroLine).get
      println("Consumer output...." + record)
      println("Consumer output schema...." + record.getSchema)
    }
  }
})
The following are the output and the schema:
{"username": "Str 1-0", "tweet": "Str 2-0", "timestamp": 0}
{"type":"record","name":"twitter_schema","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"int"}]}
Thanks in advance; I appreciate your help.

I found a solution for this:
val jsonStrings: RDD[String] = sc.parallelize(Seq(record.toString))
val result = sqlContext.read.json(jsonStrings).toDF()
result.write.mode("Append").csv("/Users/Documents/kafka-poc/consumer-out/")

Related

How to add Header to Avro Kafka Message

We are using Avro's DatumReader and DatumWriter to build Kafka messages in Scala.
Code:
import java.io.{ByteArrayOutputStream, File}
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.{BinaryEncoder, EncoderFactory}
import org.apache.avro.specific.SpecificDatumWriter
import scala.io.Source

def AvroKafkaMessage(schemaPath: String, dataPath: String): Array[Byte] = {
  // Parse the Avro schema and read the first record from the data file
  val schema = Source.fromFile(schemaPath).mkString
  val schemaObj = new Schema.Parser().parse(schema)
  val reader = new GenericDatumReader[GenericRecord](schemaObj)
  val dataFile = new File(dataPath)
  val dataFileReader = new DataFileReader[GenericRecord](dataFile, reader)
  val datum = dataFileReader.next()
  // Re-encode the record as raw Avro binary for the Kafka message value
  val writer = new SpecificDatumWriter[GenericRecord](schemaObj)
  val out = new ByteArrayOutputStream()
  val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
  writer.write(datum, encoder)
  encoder.flush()
  out.close()
  out.toByteArray()
}
Since there would be multiple events per Kafka topic, we need to add headers to the Avro messages for unit testing.
How do we add headers to the Avro data and produce Kafka messages?
Spark DataFrames need their own column for Kafka headers, and the headers must be in a specific format: Array[(String, Array[Byte])]. Avro doesn't particularly matter here; your function returns a byte array, so add that byte array to a row/column of the DataFrame you wish to write to Kafka.
If you have an existing Avro file you want to produce to Kafka, read it with Spark's Avro data source and re-serialize the payload with the built-in to_avro function.
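As a rough sketch only (assuming Spark 3.x with the spark-sql-kafka-0-10 sink, which picks up a "headers" column of type array<struct<key: string, value: binary>>; the broker, topic, and header names below are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, struct}

val spark = SparkSession.builder().appName("avro-with-headers").getOrCreate()
import spark.implicits._

// Avro payload built by the function above
val avroBytes: Array[Byte] = AvroKafkaMessage("/path/to/schema.avsc", "/path/to/data.avro")

val df = Seq((avroBytes, "event-type", "unit-test".getBytes("UTF-8")))
  .toDF("value", "hkey", "hvalue")
  // Kafka headers must be an array of (key, value) structs
  .withColumn("headers", array(struct(col("hkey").as("key"), col("hvalue").as("value"))))
  .select("value", "headers")

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "avro-topic")
  .save()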

Spark Streaming 1.6 + Kafka: Too many batches in "queued" status

I'm using Spark Streaming to consume messages from a Kafka topic, which has 10 partitions. I'm using the direct approach to consume from Kafka, and the code can be found below:
def createStreamingContext(conf: Conf): StreamingContext = {
  val dateFormat = conf.dateFormat.apply
  val hiveTable = conf.tableName.apply
  val sparkConf = new SparkConf()
  sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  sparkConf.set("spark.driver.allowMultipleContexts", "true")
  val sc = SparkContextBuilder.build(Some(sparkConf))
  val ssc = new StreamingContext(sc, Seconds(conf.batchInterval.apply))

  val kafkaParams = Map[String, String](
    "bootstrap.servers" -> conf.kafkaBrokers.apply,
    "key.deserializer" -> classOf[StringDeserializer].getName,
    "value.deserializer" -> classOf[StringDeserializer].getName,
    "auto.offset.reset" -> "smallest",
    "enable.auto.commit" -> "false"
  )

  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    kafkaParams,
    conf.topics.apply().split(",").toSet[String]
  )

  val windowedKafkaStream = directKafkaStream.window(Seconds(conf.windowDuration.apply))
  ssc.checkpoint(conf.sparkCheckpointDir.apply)

  val eirRDD: DStream[Row] = windowedKafkaStream.map { kv =>
    val fields: Array[String] = kv._2.split(",")
    createDomainObject(fields, dateFormat)
  }

  eirRDD.foreachRDD { rdd =>
    val schema = SchemaBuilder.build()
    val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
    val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)

    eirDF
      .select(schema.map(c => col(c.name)): _*)
      .write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day")
      .insertInto(hiveTable)
  }

  ssc
}
As can be seen from the code, I used window to achieve this (please correct me if I'm wrong): since there's an action that inserts into a Hive table, I want to avoid writing to HDFS too often, so what I want is to hold enough data in memory and only then write to the filesystem. I thought that using a window would be the right way to achieve that.
Now, in the image below, you can see that there are many batches being queued, and the batch being processed takes forever to complete.
I'm also providing the details of the single batch being processed:
Why are there so many tasks for the insert action, when there aren't many events in the batch? Sometimes having 0 events also generates thousands of tasks that take forever to complete.
Is the way I process microbatches with Spark wrong?
Thanks for your help!
Some extra details:
YARN containers have a max of 2 GB.
In this YARN queue, the maximum number of containers is 10.
When I look at the details of the queue where this Spark application is being executed, the number of pending containers is extremely large, around 15k.
Well, I finally figured it out. Apparently Spark Streaming does not get along with empty batches, so inside the foreachRDD portion of the code I added the following:
eirRDD.foreachRDD { rdd =>
  if (rdd.take(1).length != 0) {
    // do action
  }
}
That way we skip empty micro-batches (the isEmpty() method did not work for me).
Hope this helps somebody else! ;)
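For completeness, this is roughly how the guard slots into the foreachRDD block from the question (same names as above; just a sketch, not the full job):

eirRDD.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {
    // Only build the DataFrame and write to Hive when the micro-batch has data
    val schema = SchemaBuilder.build()
    val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
    val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)
    eirDF
      .select(schema.map(c => col(c.name)): _*)
      .write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day")
      .insertInto(hiveTable)
  }
}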

Write to Cassandra with writetime using dataframe in spark

I have the following code:
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
val collection = kafkaStream.map(_._2).map(parser)
collection.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    try {
      val dfs = rdd.toDF()
      dfs.write.format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "tablename", "keyspace" -> "dbname"))
        .mode(SaveMode.Append).save()
    } catch {
      case e: Exception => e.printStackTrace
    }
  } else {
    println("blank rdd")
  }
})
In the above example I'm saving a Spark stream to Cassandra using a DataFrame. Now, I want each row of the DataFrame to have its own specific write time, similar to this command:
insert into table (imei, date, gpsdt) VALUES ('1345', '2010-10-12', '2010-10-12 10:10:10') USING TIMESTAMP 1530313803922977;
So basically the write time of each row should be equal to the gpsdt column of that row. While searching I found this link, but it shows an example for an RDD, and I want a similar use case for a DataFrame: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md Any suggestions? Thanks.
As far as I'm aware, there is no such functionality in the DataFrame version (there is a corresponding JIRA: https://datastax-oss.atlassian.net/browse/SPARKC-416). But you have the RDD anyway, which you then convert into a DataFrame, so why not use saveToCassandra as described in the link that you cited?
P.S. You may also run into performance problems because of the way you check for emptiness (http://www.waitingforcode.com/apache-spark/isEmpty-trap-spark/read).
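A minimal sketch of the per-row write time with the RDD API from the linked docs (the GpsRow case class, the field names, and the gpsdt-to-microseconds conversion are illustrative assumptions; Cassandra expects the USING TIMESTAMP value in microseconds):

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TimestampOption, WriteConf}

// Illustrative shape of a parsed row; "writetime" carries the desired
// USING TIMESTAMP value in microseconds, derived here from gpsdt
case class GpsRow(imei: String, date: String, gpsdt: java.sql.Timestamp, writetime: Long)

collection.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd
      .map(e => GpsRow(e.imei, e.date, e.gpsdt, e.gpsdt.getTime * 1000L))
      .saveToCassandra("dbname", "tablename",
        writeConf = WriteConf(timestamp = TimestampOption.perRow("writetime")))
  }
}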

How to compare two RDDs in Spark, when data of second stream may not yet be available?

I am working on a Spark app that streams data from two different topics, topic_a and topic_b, of a Kafka server. I want to consume both streams and check whether the data coming from both topics is equal.
val streamingContext = new StreamingContext(sparkContext, Seconds(batchDuration))
val eventStream = KafkaUtils.createDirectStream[String, String](streamingContext, PreferConsistent, Subscribe[String, String](topics, consumerConfig))

def start(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = {
  val (rdd, time) = record
  // ...
  def cmp(rddA: RDD[ConsumerRecord[String, String]], rddB: RDD[ConsumerRecord[String, String]]): Unit = {
    // Do compare...
    // but rddA or rddB may be empty! :-(
  }

  val rddTopicA = rdd.filter(_.topic == "topic_a")
  val rddTopicB = rdd.filter(_.topic == "topic_b")
  cmp(rddTopicA, rddTopicB)
}

eventStream.foreachRDD((x, y) => start((x, y)))
streamingContext.start()
streamingContext.awaitTermination()
The problem is that, when comparing both RDDs in cmp, one of the RDDs may be empty, as the data stream may not yet be available in Kafka. Is it possible to somehow wait until both RDDs have the same number of rows and then start the comparison? Or to first convert the RDD that has data into a Dataset and store it temporarily for a later comparison?
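There is no accepted answer here, but as a rough sketch of the buffering idea the question itself proposes (all names are illustrative, and collecting to the driver is only viable for small per-batch volumes):

import scala.collection.mutable

val bufferA = mutable.ArrayBuffer.empty[(String, String)]
val bufferB = mutable.ArrayBuffer.empty[(String, String)]

eventStream.foreachRDD { rdd =>
  // ConsumerRecord is not serializable, so project to plain (key, value) pairs first
  val byTopic = rdd.map(r => (r.topic, (r.key, r.value))).cache()
  bufferA ++= byTopic.filter(_._1 == "topic_a").values.collect()
  bufferB ++= byTopic.filter(_._1 == "topic_b").values.collect()

  if (bufferA.nonEmpty && bufferA.size == bufferB.size) {
    // Both sides have caught up for this comparison window; compare and reset
    val equal = bufferA.sortBy(_._1) == bufferB.sortBy(_._1)
    println(s"Topics equal for this window: $equal")
    bufferA.clear()
    bufferB.clear()
  }
}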

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val slice = 30
val lines = messages.map(_._2)
val dStreamDst = lines.transform(rdd => {
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + (x.split(",")(2)), 1)).reduceByKey(_ + _)
})
dStreamDst.print()
With this I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Update:
Solved. Don't use transform or the print() method; using foreachRDD is the best solution.
You are encountering this because you are interacting with the DStream using the transform() API. When using that method, you are given the RDD that represents the snapshot of data in time, in your case the 10-second window. Your code is failing because at a particular time window there was no data, and the RDD you are operating on is empty, giving you the "empty collection" error when you invoke reduce().
Use rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation.
lines.transform(rdd => {
  if (rdd.isEmpty)
    rdd
  else {
    // rest of transformation
  }
})
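As a sketch of how this guard can wrap the exact transformation from the question (returning an empty RDD of the result type keeps both branches type-compatible; slice and lines are the values defined above):

val dStreamDst = lines.transform { rdd =>
  if (rdd.isEmpty()) {
    // Nothing arrived in this 10-second window; emit an empty result of the same type
    rdd.sparkContext.emptyRDD[(String, Int)]
  } else {
    val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
    rdd.map { x =>
      val cols = x.split(",")
      (((cols(0).toInt - y.toInt).toLong / slice).round * slice + " " + cols(2), 1)
    }.reduceByKey(_ + _)
  }
}
dStreamDst.print()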