spark .txt Kafka message to Data Frame - scala

I am getting error CDRS.toDF() error
case class CDR(phone:String, first_type:String,in_out:String,local:String,duration:String,date:String,time:String,roaming:String,amount:String,in_network:String,is_promo:String,toll_free:String,bytes:String,last_type:String)
// Create direct Kafka stream with brokers and topics
//val topicsSet = Set[String] (kafka_topic)
val topicsSet = Set[String] (kafka_topic)
val kafkaParams = Map[String, String]("metadata.broker.list" ->
kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](
ssc, kafkaParams, topicsSet).map(_._2)
//===============================================================================================
//Apply Schema Of Class CDR to Message Coming From Kafka
val CDRS = messages.map(_.split('|')).map(x=> CDR
(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13).repla ceAll("\n","")))

Related

Read json from Kafka and write json to other Kafka topic

I'm trying prepare application for Spark streaming (Spark 2.1, Kafka 0.10)
I need to read data from Kafka topic "input", find correct data and write result to topic "output"
I can read data from Kafka base on KafkaUtils.createDirectStream method.
I converted the RDD to json and prepare filters:
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val elementDstream = messages.map(v => v.value).foreachRDD { rdd =>
val PeopleDf=spark.read.schema(schema1).json(rdd)
import spark.implicits._
PeopleDf.show()
val PeopleDfFilter = PeopleDf.filter(($"value1".rlike("1"))||($"value2" === 2))
PeopleDfFilter.show()
}
I can load data from Kafka and write "as is" to Kafka use KafkaProducer:
messages.foreachRDD( rdd => {
rdd.foreachPartition( partition => {
val kafkaTopic = "output"
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
partition.foreach{ record: ConsumerRecord[String, String] => {
System.out.print("########################" + record.value())
val messageResult = new ProducerRecord[String, String](kafkaTopic, record.value())
producer.send(messageResult)
}}
producer.close()
})
})
However, I cannot integrate those two actions > find in json proper value and write findings to Kafka: write PeopleDfFilter in JSON format to "output" Kafka topic.
I have a lot of input messages in Kafka, this is the reason I want to use foreachPartition to create the Kafka producer.
The process is very simple so why not use structured streaming all the way?
import org.apache.spark.sql.functions.from_json
spark
// Read the data
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", inservers)
.option("subscribe", intopic)
.load()
// Transform / filter
.select(from_json($"value".cast("string"), schema).alias("value"))
.filter(...) // Add the condition
.select(to_json($"value").alias("value")
// Write back
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", outservers)
.option("subscribe", outtopic)
.start()
Try using Structured Streaming for that. Even if you used Spark 2.1, you may implement your own Kafka ForeachWriter as followed:
Kafka sink:
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter
class KafkaSink(topic:String, servers:String) extends ForeachWriter[(String, String)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer",
classOf[org.apache.kafka.common.serialization.StringSerializer].toString)
kafkaProperties.put("value.serializer",
classOf[org.apache.kafka.common.serialization.StringSerializer].toString)
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (String, String)): Unit = {
producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
Usage:
val topic = "<topic2>"
val brokers = "<server:ip>"
val writer = new KafkaSink(topic, brokers)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()

dstream[Double] to dstream scala

I am developing spark-consumer application which is consuming messages from kafka broker and i want to find average of that messages which are coming to spark consumer and finally i want to store that average into cassandra.
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("MyDirectKafkaWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(20))
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
val lines = messages.map(_._2)
val count = lines.count()
count.print()
val total = lines.map(JSON.parseFull(_)
.asInstanceOf[Option[List[Map[String,List[Map[String,Double]]]]]]
.map(_(1)("MeterData").map(_("mVolts1"))).getOrElse(List())).flatMap(list => list).reduce((x,y) => x+y)
total.print()
val avg = total.reduce((total,count) => total / count )
avg.print()
ssc.start()
ssc.awaitTermination()
In the above code,i am getting total and count exactly what i am expecting it to be, but i am not able to compute average as count is dstream[long] and total is dstream[double].
I think that this line is having some problem.
"val avg = total.reduce((total,count) => total / count )"
Any help is appreciated.
Output:
Count:
This is the count output i get in stream as dstream[Long]
Total:
This is the total output i get in same stream as dstream[Double]

Stateful streaming Spark processing

I'm learning Spark and trying to build a simple streaming service.
For e.g. I have a Kafka queue and a Spark job like words count. That example is using a stateless mode. I'd like to accumulate words counts so if test has been sent a few times in different messages I could get a total number of all its occurrences.
Using other examples like StatefulNetworkWordCount I've tried to modify my Kafka streaming service
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint("/tmp/data")
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Get the lines, split them into words, count the words and print
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
// Update the cumulative count using mapWithState
// This will give a DStream made of state (which is the cumulative count of the words)
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
output
}
val stateDstream = wordDstream.mapWithState(
StateSpec.function(mappingFunc) /*.initialState(initialRDD)*/)
stateDstream.print()
stateDstream.map(s => (s._1, s._2.toString)).foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
// Start the computation
ssc.start()
ssc.awaitTermination()
I get a lot of errors like
17/03/26 21:33:57 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#2b680207)
- field (class: com.DirectKafkaWordCount$$anonfun$main$2, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.DirectKafkaWordCount$$anonfun$main$2, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
though the stateless version works fine without errors
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
// Get the lines, split them into words, count the words and print
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _).map(s => (s._1, s._2.toString))
wordCounts.print()
wordCounts.foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
// Start the computation
ssc.start()
ssc.awaitTermination()
The question is how to make the streaming stateful word count.
At this line:
ssc.checkpoint("/tmp/data")
you've enabled checkpointing, which means everything in your:
wordCounts.foreachRDD(rdd => sc.toRedisZSET(rdd, "word_count", 0))
has to be serializable, and sc itself is not, as you can see from the error message:
object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#2b680207)
Removing checkpointing code line will help with that.
Another way is to either continuously compute your DStream into RDD or write data directly to redis, something like:
wordCounts.foreachRDD{rdd =>
rdd.foreachPartition(partition => RedisContext.setZset("word_count", partition, ttl, redisConfig)
}
RedisContext is a serializable object that doesn't depend on SparkContext
See also: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/redisFunctions.scala

Save Scala Spark Streaming Data to MongoDB

Here's my simplified Apache Spark Streaming code which gets input via Kafka Streams, combine, print and save them to a file. But now i want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
.setAppName("StreamingDataToMongoDB")
.set("spark.streaming.concurrentJobs", "2")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet
val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)
val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)
val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried Stratio Spark-MongoDB Library and some other resources but still no success. Someone please help me proceed or redirect me to some useful working resource/tutorial. Cheers :)
If you want to write out to a format which isn't directly supported on DStreams you can use foreachRDD to write out each batch one-by-one using the RDD based API for Mongo.
lines1.foreachRDD ( rdd => {
rdd.foreach( data =>
if (data != null) {
// Save data here
} else {
println("Got no data in this window")
}
)
})
Do same for lines2.

Writing data to cassandra using spark

I have a spark job written in Scala, in which I am just trying to write one line separated by commas, coming from Kafka producer to Cassandra database. But I couldn't call saveToCassandra.
I saw few examples of wordcount where they are writing map structure to Cassandra table with two columns and it seems working fine. But I have many columns and I found that the data structure needs to parallelized.
Here's is the sample of my code:
object TestPushToCassandra extends SparkStreamingJob {
def validate(ssc: StreamingContext, config: Config): SparkJobValidation = SparkJobValid
def runJob(ssc: StreamingContext, config: Config): Any = {
val bp_conf=BpHooksUtils.getSparkConf()
val brokers=bp_conf.get("bp_kafka_brokers","unknown_default")
val input_topics = config.getString("topics.in").split(",").toSet
val output_topic = config.getString("topic.out")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, input_topics)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(","))
val li = words.par
li.saveToCassandra("testspark","table1", SomeColumns("col1","col2","col3"))
li.print()
words.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach{
case x:String=>{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val outMsg=x+" from spark"
val producer = new KafkaProducer[String,String](props)
val message=new ProducerRecord[String, String](output_topic,null,outMsg)
producer.send(message)
}
}
)
)
ssc.start()
ssc.awaitTermination()
}
}
I think it's the syntax of Scala that I am not getting correct.
Thanks in advance.
You need to change your words DStream into something that the Connector can handle.
Like a Tuple
val words = lines
.map(_.split(","))
.map( wordArr => (wordArr(0), wordArr(1), wordArr(2))
or a Case Class
case class YourRow(col1: String, col2: String, col3: String)
val words = lines
.map(_.split(","))
.map( wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))
or a CassandraRow
This is because if you place an Array there all by itself it could be an Array in C* you are trying to insert rather than 3 columns.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md