My use case is to print the offset, partition, and topic for each record read from Kafka in a Spark Streaming application.
Currently my code to create the direct stream looks like this:
val stream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, params, topicPartition_map)
)
But to get metadata about each record, it seems I need to pass in a message handler.
So I was expecting to use something like this:
val messageHandler = { mmd: MessageAndMetadata[String, String] => (mmd.topic, mmd.key, mmd.message) }
val stream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, params, topicPartition_map),
messageHandler
)
or:
val messageHandler = { mmd: MessageAndMetadata[String, String] => (mmd.topic, mmd.key, mmd.message) }
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String, String)](
  ssc, params, topicPartition_map, messageHandler)
But there is no createDirectStream overload that accepts a message handler.
My sbt dependency is:
scalaVersion := "2.12.12"

val sparkVersion = "2.4.3"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
  exclude("org.slf4j", "slf4j-log4j12")
  exclude("com.fasterxml.jackson.module", "*")
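A minimal sketch of one way to get this with the 0-10 integration (which has no message-handler parameter): each ConsumerRecord already carries its topic, partition, and offset, so the metadata can be read directly from the records, for example inside foreachRDD.

// Sketch only: assumes `stream` is the InputDStream[ConsumerRecord[String, String]]
// created above with KafkaUtils.createDirectStream.
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // The 0-10 ConsumerRecord exposes the metadata directly
    println(s"topic=${record.topic()} partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
  }
}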
I am getting an error on CDRS.toDF().
case class CDR(phone:String, first_type:String,in_out:String,local:String,duration:String,date:String,time:String,roaming:String,amount:String,in_network:String,is_promo:String,toll_free:String,bytes:String,last_type:String)
// Create direct Kafka stream with brokers and topics
//val topicsSet = Set[String] (kafka_topic)
val topicsSet = Set[String] (kafka_topic)
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet).map(_._2)
//===============================================================================================
//Apply Schema Of Class CDR to Message Coming From Kafka
val CDRS = messages.map(_.split('|')).map(x => CDR(x(0), x(1), x(2), x(3), x(4), x(5), x(6),
  x(7), x(8), x(9), x(10), x(11), x(12), x(13).replaceAll("\n", "")))
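A minimal sketch of the usual fix, assuming the error is that toDF is not found on the stream: toDF() is only available on an RDD of a case class once the SQLContext implicits are in scope, and it has to be called per batch inside foreachRDD rather than on the DStream itself (with the CDR case class defined at the top level, outside the method).

// Sketch only: assumes an existing SQLContext named sqlContext and the CDRS stream above
CDRS.foreachRDD { rdd =>
  import sqlContext.implicits._ // brings toDF() into scope for RDDs of case classes
  val df = rdd.toDF()
  df.show()
}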
I want to consume messages from Kafka topic using Scala 2.10.6 and Spark 1.6.2. For Kafka I am using this dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.2</version>
</dependency>
This code compiles fine; however, I want to set auto.offset.reset, and that is where the problem arises:
val topicMap = topic.split(",").map((_, kafkaNumThreads.toInt)).toMap
val data = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap,
StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
When I add kafkaParams, it does not compile anymore:
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> group,
"zookeeper.connection.timeout.ms" -> "10000",
"auto.offset.reset" -> "smallest")
val data = KafkaUtils.createStream(ssc, kafkaParams, topicMap,
StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
Error message:
94: error: missing parameter type for expanded function ((x$3) => x$3._2)
[ERROR] StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)
I tried many different combinations of parameters for createStream, but everything fails. Can someone help, please?
You need to add type parameters to KafkaUtils.createStream for it to resolve the underlying types of the stream. For example, if your key and value are of type String:
val data: DStream[String] =
KafkaUtils
.createStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
topicMap,
StorageLevel.MEMORY_AND_DISK_SER_2
).map(_._2)
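For reference, a sketch of the imports this relies on, assuming the Spark 1.6 / Kafka 0.8 integration artifact (spark-streaming-kafka_2.10) is also on the classpath:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils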
Here is my simplified Apache Spark Streaming code, which gets its input from Kafka streams, combines them, prints them, and saves them to a file. Now I want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
.setAppName("StreamingDataToMongoDB")
.set("spark.streaming.concurrentJobs", "2")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet
val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)
val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)
val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried the Stratio Spark-MongoDB library and some other resources, but still no success. Can someone help me proceed or point me to a useful working resource/tutorial? Cheers :)
If you want to write out to a format which isn't directly supported on DStreams, you can use foreachRDD to write out each batch one by one using the RDD-based API for Mongo.
lines1.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    println("Got no data in this batch")
  } else {
    rdd.foreach { data =>
      // Save each record here using the RDD-based Mongo API
    }
  }
}
Do the same for lines2.
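As a more concrete sketch, assuming the official MongoDB Spark Connector (mongo-spark-connector) is on the classpath instead of the Stratio library, and that spark.mongodb.output.uri is set on the SparkConf (e.g. "mongodb://localhost/test.lines"), each batch could be wrapped in BSON documents and saved like this:

import com.mongodb.spark.MongoSpark
import org.bson.Document

allThelines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Wrap each incoming line in a Document before handing it to the connector
    val docs = rdd.map(line => new Document("line", line))
    MongoSpark.save(docs)
  }
}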
I have a Spark consumer which streams from Kafka.
I am trying to manage offsets for exactly-once semantics.
However, while accessing the offset it throws the following exception:
"java.lang.ClassCastException: org.apache.spark.rdd.MapPartitionsRDD
cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges"
The part of the code that does this is below:
var offsetRanges = Array[OffsetRange]()
dataStream
.transform {
rdd =>
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}
.foreachRDD(rdd => { })
Here dataStream is a direct stream (DStream[String]) created using the KafkaUtils API, something like:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema+"_"+t)).map(_._2)
Could somebody help me understand what I am doing wrong here?
transform is the first method in the chain of methods applied to dataStream, as recommended in the official documentation as well.
Thanks.
Your problem is:
.map(_._2)
which creates a MappedDStream instead of the DirectKafkaInputDStream created by KafkaUtils.createDirectStream.
You need to map after transform:
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set(source_schema + "_" + t))
kafkaStream
.transform {
rdd =>
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}
.map(_._2)
.foreachRDD { rdd => /* stuff */ }
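As a follow-up sketch, the final foreachRDD can then use the offset ranges captured in transform, e.g. to print the topic, partition, and offset bounds for each batch (this assumes the offsetRanges variable declared in the question):

.foreachRDD { rdd =>
  // offsetRanges was populated in the transform step above
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}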
I have a Spark job written in Scala in which I am just trying to write a comma-separated line, coming from a Kafka producer, to a Cassandra database. But I couldn't get saveToCassandra to work.
I saw a few word-count examples where a map structure is written to a Cassandra table with two columns, and that seems to work fine. But I have many columns, and I found that the data structure needs to be parallelized.
Here is a sample of my code:
object TestPushToCassandra extends SparkStreamingJob {
def validate(ssc: StreamingContext, config: Config): SparkJobValidation = SparkJobValid
def runJob(ssc: StreamingContext, config: Config): Any = {
val bp_conf=BpHooksUtils.getSparkConf()
val brokers=bp_conf.get("bp_kafka_brokers","unknown_default")
val input_topics = config.getString("topics.in").split(",").toSet
val output_topic = config.getString("topic.out")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, input_topics)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(","))
val li = words.par
li.saveToCassandra("testspark","table1", SomeColumns("col1","col2","col3"))
li.print()
words.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach{
case x:String=>{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val outMsg=x+" from spark"
val producer = new KafkaProducer[String,String](props)
val message=new ProducerRecord[String, String](output_topic,null,outMsg)
producer.send(message)
}
}
)
)
ssc.start()
ssc.awaitTermination()
}
}
I think it's the Scala syntax that I'm not getting right.
Thanks in advance.
You need to change your words DStream into something that the connector can handle, like a tuple:
val words = lines
  .map(_.split(","))
  .map(wordArr => (wordArr(0), wordArr(1), wordArr(2)))
or a case class:
case class YourRow(col1: String, col2: String, col3: String)
val words = lines
.map(_.split(","))
.map( wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))
or a CassandraRow
This is needed because if you pass the Array through by itself, the connector may treat it as a single Array column to insert into C* rather than three separate columns.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
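Once words has one of those shapes, a minimal sketch of the save (assuming the DataStax spark-cassandra-connector with its streaming support is on the classpath) looks like:

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

words.saveToCassandra("testspark", "table1", SomeColumns("col1", "col2", "col3"))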