Value split is not a member of (String, String) - scala

I am trying to read data from Kafka and store it into Cassandra tables through Spark RDDs.
I get this error while compiling the code:
/root/cassandra-count/src/main/scala/KafkaSparkCassandra.scala:69: value split is not a member of (String, String)
[error] val lines = messages.flatMap(line => line.split(',')).map(s => (s(0).toString, s(1).toDouble,s(2).toDouble,s(3).toDouble))
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
Below is the code. When I run it manually through the interactive spark-shell it works fine, but the error appears when I compile the code for spark-submit.
// Create direct kafka stream with brokers and topics
val topicsSet = Set[String] (kafka_topic)
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet)
// Create the processing logic
// Get the lines, split
val lines = messages.map(line => line.split(',')).map(s => (s(0).toString, s(1).toDouble,s(2).toDouble,s(3).toDouble))
lines.saveToCassandra("stream_poc", "US_city", SomeColumns("city_name", "jan_temp", "lat", "long"))

All messages in Kafka are keyed. The original Kafka stream, in this case messages, is a stream of (key, value) tuples.
And as the compile error points out, there's no split method on tuples.
What we want to do here is:
messages.map { case (key, value) => value.split(',') } ...

KafkaUtils.createDirectStream returns a tuple of key and value (since messages in Kafka are optionally keyed). In your case it's of type (String, String). If you want to split the value, you have to first take it out:
val lines =
  messages
    .map(line => line._2.split(','))
    .map(s => (s(0).toString, s(1).toDouble, s(2).toDouble, s(3).toDouble))
Or using partial function syntax:
val lines =
  messages
    .map { case (_, value) => value.split(',') }
    .map(s => (s(0).toString, s(1).toDouble, s(2).toDouble, s(3).toDouble))
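For completeness, here is how the full pipeline from the question might look once the value is extracted before splitting. This is a minimal sketch: it reuses the keyspace, table, and column names from the question's saveToCassandra call and assumes the spark-streaming-kafka (0.8 API) and spark-cassandra-connector dependencies are on the classpath.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val topicsSet = Set[String](kafka_topic)
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

// Each element is a (key, value) tuple, so split only the value
val lines = messages
  .map { case (_, value) => value.split(',') }
  .map(s => (s(0), s(1).toDouble, s(2).toDouble, s(3).toDouble))

lines.saveToCassandra("stream_poc", "US_city", SomeColumns("city_name", "jan_temp", "lat", "long"))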

Related

Stateful streaming Spark processing

I'm learning Spark and trying to build a simple streaming service.
For example, I have a Kafka queue and a Spark job like word count. That example uses a stateless mode. I'd like to accumulate word counts, so if "test" has been sent a few times in different messages I could get the total number of all its occurrences.
Using other examples like StatefulNetworkWordCount, I've tried to modify my Kafka streaming service:
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint("/tmp/data")
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Get the lines, split them into words, count the words and print
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
// Update the cumulative count using mapWithState
// This will give a DStream made of state (which is the cumulative count of the words)
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  val output = (word, sum)
  state.update(sum)
  output
}
val stateDstream = wordDstream.mapWithState(
  StateSpec.function(mappingFunc) /*.initialState(initialRDD)*/)
stateDstream.print()
stateDstream.map(s => (s._1, s._2.toString)).foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
// Start the computation
ssc.start()
ssc.awaitTermination()
I get a lot of errors like
17/03/26 21:33:57 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@2b680207)
- field (class: com.DirectKafkaWordCount$$anonfun$main$2, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.DirectKafkaWordCount$$anonfun$main$2, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
though the stateless version works fine without errors
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
// Get the lines, split them into words, count the words and print
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _).map(s => (s._1, s._2.toString))
wordCounts.print()
wordCounts.foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
// Start the computation
ssc.start()
ssc.awaitTermination()
The question is how to make the streaming stateful word count.
At this line:
ssc.checkpoint("/tmp/data")
you've enabled checkpointing, which means everything in your:
wordCounts.foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
has to be serializable, and sc itself is not, as you can see from the error message:
object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@2b680207)
Removing the checkpointing line will help with that.
Another way is to either continuously compute your DStream into RDDs or write the data directly to Redis, something like:
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition(partition => RedisContext.setZset("word_count", partition, ttl, redisConfig))
}
RedisContext is a serializable object that doesn't depend on SparkContext.
See also: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/redisFunctions.scala
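Alternatively, for the stateful version, the counts can be written out with a plain Redis client created inside foreachPartition, so that nothing from the driver (in particular sc) is captured in the checkpointed closure. A minimal sketch, assuming the Jedis client is available and using a hypothetical Redis host and port:

import redis.clients.jedis.Jedis

stateDstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // The connection is opened on the executor, so no driver-side object is serialized
    val jedis = new Jedis("localhost", 6379) // hypothetical Redis host/port
    partition.foreach { case (word, count) =>
      jedis.zadd("word_count", count.toDouble, word)
    }
    jedis.close()
  }
}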

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val slice=30
val lines = messages.map(_._2)
val dStreamDst = lines.transform(rdd => {
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong/slice).round*slice+" "+(x.split(",")(2)), 1)).reduceByKey(_ + _)
})
dStreamDst.print()
on which I get the following error :
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Update:
Solved. Don't use the transform or print() methods. Using foreachRDD is the best solution.
You are encountering this because you are interacting with the DStream using the transform() API. When using that method, you are given the RDD that represents that snapshot of data in time, in your case the 10-second window. Your code is failing because at a particular time window there was no data, and the RDD you are operating on is empty, giving you the "empty collection" error when you invoke reduce().
Use rdd.isEmpty to ensure that the RDD is not empty before invoking your operation.
lines.transform(rdd => {
  if (rdd.isEmpty)
    rdd
  else {
    // rest of transformation
  }
})
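Applied to the transformation from the question, the guard could look like the sketch below; the empty branch returns an empty RDD of the result type so that both branches type-check:

val dStreamDst = lines.transform { rdd =>
  if (rdd.isEmpty())
    // Nothing arrived in this batch: return an empty result of the same type
    rdd.sparkContext.emptyRDD[(String, Int)]
  else {
    val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
    rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1))
      .reduceByKey(_ + _)
  }
}
dStreamDst.print()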

Exception while accessing KafkaOffset from RDD

I have a Spark consumer which streams from Kafka.
I am trying to manage offsets for exactly-once semantics.
However, while accessing the offset it throws the following exception:
"java.lang.ClassCastException: org.apache.spark.rdd.MapPartitionsRDD
cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges"
The part of the code that does this is as below :
var offsetRanges = Array[OffsetRange]()
dataStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .foreachRDD(rdd => { })
Here dataStream is a direct stream (DStream[String]) created using the KafkaUtils API, something like:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema+"_"+t)).map(_._2)
Could somebody help me understand what I am doing wrong here?
transform is the first method in the chain of methods performed on the dataStream, as mentioned in the official documentation as well.
Thanks.
Your problem is:
.map(_._2)
which creates a mapped DStream instead of the DirectKafkaInputDStream created by KafkaUtils.createDirectStream.
You need to map after transform:
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema+"_"+t))
kafkaStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .map(_._2)
  .foreachRDD { rdd => /* stuff */ }
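For completeness, a small sketch of what could go in place of the /* stuff */ placeholder, reusing the offsetRanges variable captured in the transform step. Here the offsets are just printed; persisting them to your own offset store would happen in the same place.

kafkaStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .map(_._2)
  .foreachRDD { rdd =>
    // process rdd here, then record the offsets this batch covered
    offsetRanges.foreach { o =>
      println(s"${o.topic} partition ${o.partition}: offsets ${o.fromOffset} to ${o.untilOffset}")
    }
  }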

KafkaRDD scala minimal example

I'm trying to get a running example using KafkaRDD:
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val offsetRanges = Array(
  OffsetRange("topic", 0, 0, 2)
)
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges)
rdd.map(x => println(x)).collect()
res: Array[Unit] = Array((), ())
I have been careful to create "topic" with a single partition and to write 2 messages to it, "hello" and "world".
I can get what looks like a correct RDD, but how can I access its content? Am I missing something?
Thanks, E.
The problem is this line, I believe:
rdd.map(x => println(x)).collect()
The way an RDD works, rdd.map runs on the executors. When you call println, it prints to stdout on the executor. To print to stdout in the driver application, try this instead:
rdd.collect().map(x => println(x))
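Since createRDD with String decoders yields an RDD[(String, String)], a slightly more explicit sketch of inspecting its contents on the driver:

// Collect the (key, value) pairs to the driver, then print them there
rdd.collect().foreach { case (key, value) =>
  println(s"key=$key value=$value")
}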

Writing data to cassandra using spark

I have a Spark job written in Scala, in which I am just trying to write one comma-separated line, coming from a Kafka producer, to a Cassandra database. But I couldn't call saveToCassandra.
I saw a few word-count examples where they write a map structure to a Cassandra table with two columns, and that seems to work fine. But I have many columns, and I found that the data structure needs to be parallelized.
Here is a sample of my code:
object TestPushToCassandra extends SparkStreamingJob {
  def validate(ssc: StreamingContext, config: Config): SparkJobValidation = SparkJobValid
  def runJob(ssc: StreamingContext, config: Config): Any = {
    val bp_conf = BpHooksUtils.getSparkConf()
    val brokers = bp_conf.get("bp_kafka_brokers", "unknown_default")
    val input_topics = config.getString("topics.in").split(",").toSet
    val output_topic = config.getString("topic.out")
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, input_topics)
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(","))
    val li = words.par
    li.saveToCassandra("testspark", "table1", SomeColumns("col1", "col2", "col3"))
    li.print()
    words.foreachRDD(rdd =>
      rdd.foreachPartition(partition =>
        partition.foreach {
          case x: String => {
            val props = new HashMap[String, Object]()
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer")
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer")
            val outMsg = x + " from spark"
            val producer = new KafkaProducer[String, String](props)
            val message = new ProducerRecord[String, String](output_topic, null, outMsg)
            producer.send(message)
          }
        }
      )
    )
    ssc.start()
    ssc.awaitTermination()
  }
}
I think it's the Scala syntax that I am not getting right.
Thanks in advance.
You need to change your words DStream into something that the Connector can handle.
Like a Tuple
val words = lines
  .map(_.split(","))
  .map(wordArr => (wordArr(0), wordArr(1), wordArr(2)))
or a Case Class
case class YourRow(col1: String, col2: String, col3: String)
val words = lines
  .map(_.split(","))
  .map(wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))
or a CassandraRow
This is because if you pass the Array there all by itself, the connector may treat it as a single Array column in C* that you are trying to insert, rather than as 3 separate columns.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
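Putting it together with the table from the question, a minimal sketch using the case class approach. The keyspace, table, and column names are taken from the question's saveToCassandra call, and the case class field names are chosen to match the columns:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

case class YourRow(col1: String, col2: String, col3: String)

val rows = lines
  .map(_.split(","))
  .map(wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))

// DStream.saveToCassandra is provided by the connector's streaming implicits
rows.saveToCassandra("testspark", "table1", SomeColumns("col1", "col2", "col3"))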