spark streaming join kafka topics - scala

We have two InputDStreams from two Kafka topics and need to join their data. The problem is that each InputDStream is processed independently inside its own foreachRDD, and nothing can be returned from there to join afterwards.
var Message1ListBuffer = new ListBuffer[Message1]
var Message2ListBuffer = new ListBuffer[Message2]

inputDStream1.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.map({ msg =>
      val r = msg.value()
      val avro = AvroUtils.objectToAvro(r.getSchema, r)
      val messageValue = AvroInputStream.json[FMessage1](avro.getBytes("UTF-8")).singleEntity.get
      Message1ListBuffer = Message1FlatMapper.flatmap(messageValue)
      Message1ListBuffer
    })
    inputDStream1.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
})

inputDStream2.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.map({ msg =>
      val r = msg.value()
      val avro = AvroUtils.objectToAvro(r.getSchema, r)
      val messageValue = AvroInputStream.json[FMessage2](avro.getBytes("UTF-8")).singleEntity.get
      Message2ListBuffer = Message1FlatMapper.flatmap(messageValue)
      Message2ListBuffer
    })
    inputDStream2.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
})
I thought I could return Message1ListBuffer and Message2ListBuffer, turn them into DataFrames and join them, but that does not work, and I doubt it is the best choice anyway.
From there, what is the way to return the RDD of each foreachRDD in order to make a join?
inputDStream1.foreachRDD(rdd => {
})
inputDStream2.foreachRDD(rdd => {
})

Not sure about the Spark version you are using, but with Spark 2.3+ this can be achieved directly.
With Spark >= 2.3
Subscribe to the 2 topics you want to join:
// note: an "endingOffsets" option only applies to batch queries, so it is omitted for readStream
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic1")
  .option("startingOffsets", "earliest")
  .load()

val ds2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic2")
  .option("startingOffsets", "earliest")
  .load()
Format the subscribed messages in both streams
val stream1 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
val stream2 = ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
Join both the streams:
val resultStream = stream1.join(stream2)
// more join operations here
Warning:
Late records will not get a join match; the join buffering needs tuning (watermarks and event-time constraints). More information can be found in the stream-stream joins section of the Structured Streaming programming guide.
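If upgrading to 2.3 is not an option, the two DStreams can still be joined per micro-batch by turning each one into a keyed pair DStream (instead of collecting into ListBuffers inside foreachRDD) and calling join on them. Below is a rough sketch that reuses the parsing from the question; the joinKey field and Message2FlatMapper are assumptions, and note that DStream.join only matches records that arrive in the same micro-batch.
val keyed1 = inputDStream1.flatMap { msg =>
  val r = msg.value()
  val avro = AvroUtils.objectToAvro(r.getSchema, r)
  val m1 = AvroInputStream.json[FMessage1](avro.getBytes("UTF-8")).singleEntity.get
  Message1FlatMapper.flatmap(m1).map(m => (m.joinKey, m)) // joinKey: assumed shared identifier
}
val keyed2 = inputDStream2.flatMap { msg =>
  val r = msg.value()
  val avro = AvroUtils.objectToAvro(r.getSchema, r)
  val m2 = AvroInputStream.json[FMessage2](avro.getBytes("UTF-8")).singleEntity.get
  Message2FlatMapper.flatmap(m2).map(m => (m.joinKey, m)) // Message2FlatMapper: assumed to exist
}

// join is evaluated batch by batch on the paired DStreams
val joined = keyed1.join(keyed2)

joined.foreachRDD { rdd =>
  // work with the joined (Message1, Message2) pairs here
}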

Related

Kafka Unit testing

I have a function kafkaIngestion which creates a df from a Kafka topic in the following way:
def kafkaIngestion(spark: SparkSession): DataFrame = {
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .option("subscribe", topic)
    .option("group.id", grpid)
    .load()
    .selectExpr("cast(value as string) as data")
    .select(from_json($"data", schema = inputSchema).as("data"))
    .select("data.*")
  df
}
I am unable to mock the code to return my expected df. What's the correct way to mock the df?
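One common way to make this testable, sketched below with illustrative names: keep the Kafka read itself thin and move the parsing into a function that takes a DataFrame, so the parsing logic can be unit tested on an in-memory DataFrame without mocking the Kafka source at all.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// pure transformation: easy to test without Kafka
def parseKafkaValue(raw: DataFrame, inputSchema: StructType): DataFrame =
  raw.selectExpr("cast(value as string) as data")
    .select(from_json(col("data"), inputSchema).as("data"))
    .select("data.*")

// in a test, feed it a hand-built DataFrame instead of the Kafka source
val spark = SparkSession.builder().master("local[1]").appName("test").getOrCreate()
import spark.implicits._
val sample = Seq("""{"id": 1, "name": "a"}""").toDF("value") // hypothetical payload matching inputSchema
val parsed = parseKafkaValue(sample, inputSchema)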

Messages not loading into SilverTable from Topic

I am trying to load messages from a topic into a silver table inside a writeStream, but the messages never show up in the silver table. How can I get the messages into the silver table?
var df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "10.19.9.4:1111")
  .option("subscribe", "testTopic")
  .load()

// select the avro encoded value and the topic name from the topic
df = df.select($"value", $"topic")

df.writeStream
  .foreachBatch( (batchDF: DataFrame, batchId: Long) => {
    batchDF.persist()
    val topics = batchDF.select("topic").distinct().collect().map(
      (row) => row.getString(0))
    topics.foreach((topix) => {
      val silverTable = mappings(topix)
      // filter out messages for the current topic
      var writeDF = batchDF.where(s"topic = '${topix}'")
      // decode the avro records to a spark struct
      val schemaReg = schemaRegistryMappings(topix)
      writeDF = writeDF.withColumn("avroconverted",
        from_avro($"value", topix + "-value", schemaReg))
      // append to the silver table
      writeDF.write.format("delta").mode("append").saveAsTable("silverTable")
    })
  })
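One thing that stands out: the streaming query above is never started, and a Structured Streaming writer does nothing until .start() is called and the application waits on it. A minimal sketch of the missing tail of the pipeline (the checkpoint path is just a placeholder):
val query = df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // ... the per-topic delta writes from the question ...
  }
  .option("checkpointLocation", "/tmp/silver-checkpoint") // placeholder path
  .start()

query.awaitTermination()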

Writing sorted dataframe into kafka topic (sorted order rows) in spark structured streaming using scala

While displaying the sorted results on the console they appear in the expected order, but when I push those results to a Kafka topic the sort order is lost.
def main(args: Array[String]) = {
  // Spark config and kafka config
  // load method
  val Raw_df = readStream(sparkSession, inputtopic)

  // converting the read kafka messages into json format
  val df_messages = Raw_df.selectExpr("CAST(value AS STRING)")
    .withColumn("data", from_json($"value", my_schema))
    .select("data.*")

  val win = window($"date_column", "5 minutes")
  val modified_df = df_messages.withWatermark("date_column", "3 minutes")
    .groupBy(win, $"All_colums", $"date_column")
    .count()
    .orderBy(asc("date_column"), asc("column_5"))
  val finalcol = modified_df.drop("count").drop("window")

  // mapping all columns and converting them to json messages
  val finalcolonames = my_schema.fields.map(z => z.name)
  val dataset_Json = finalcol.withColumn("value", to_json(struct(finalcolonames.map(y => col(y)): _*)))
    .select($"value")

  //val query = writeToKafkaStremoutput(dataset_Json, outputtopic, checkpointlocation)
  val query = writeToConsole(dataset_Json)
  (query)
}
// below method writes data to the kafka topic
def writeToKafkaStremoutput(dataFrame: DataFrame, Config: Config, topic: String, checkpointlocation: String) = {
  dataFrame
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .trigger(Trigger.ProcessingTime("1 second"))
    .option("topic", topic)
    .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
    .option("checkpointLocation", checkpointlocation)
    .outputMode(OutputMode.Complete())
    .start()
}
// console output for testing
// below method writes data to the console
def writeToConsole(dataFrame: DataFrame) = {
  import org.apache.spark.sql.streaming.Trigger
  val query = dataFrame
    .writeStream
    .format("console")
    .option("numRows", 300)
    //.trigger(Trigger.ProcessingTime("20 second"))
    .outputMode(OutputMode.Complete())
    .option("truncate", false)
    .start()
  query
}
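There is no answer included above, but one relevant fact: Kafka only guarantees ordering within a single partition, so a globally sorted batch can only stay sorted if it is written by a single task into a single-partition topic. Below is a hedged sketch of that idea (writeToKafkaOrdered and the single-partition assumption are mine, not from the question):
// Coalesce the sorted result to one task before handing it to the Kafka sink.
// The output topic is assumed to have exactly one partition, otherwise Kafka
// interleaves records across partitions and the global order is lost anyway.
def writeToKafkaOrdered(dataFrame: DataFrame, servers: String, topic: String, checkpointLocation: String) = {
  dataFrame
    .coalesce(1)
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", servers)
    .option("topic", topic)
    .option("checkpointLocation", checkpointLocation)
    .outputMode(OutputMode.Complete())
    .start()
}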

Read json from Kafka and write json to other Kafka topic

I'm trying to prepare an application for Spark Streaming (Spark 2.1, Kafka 0.10).
I need to read data from the Kafka topic "input", find the correct data and write the result to the topic "output".
I can read data from Kafka based on the KafkaUtils.createDirectStream method.
I converted the RDD to JSON and prepared the filters:
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
val elementDstream = messages.map(v => v.value).foreachRDD { rdd =>
  val PeopleDf = spark.read.schema(schema1).json(rdd)
  import spark.implicits._
  PeopleDf.show()
  val PeopleDfFilter = PeopleDf.filter(($"value1".rlike("1")) || ($"value2" === 2))
  PeopleDfFilter.show()
}
I can load data from Kafka and write it "as is" to Kafka using KafkaProducer:
messages.foreachRDD( rdd => {
  rdd.foreachPartition( partition => {
    val kafkaTopic = "output"
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach { record: ConsumerRecord[String, String] =>
      System.out.print("########################" + record.value())
      val messageResult = new ProducerRecord[String, String](kafkaTopic, record.value())
      producer.send(messageResult)
    }
    producer.close()
  })
})
However, I cannot integrate those two actions: find the proper values in the JSON and write the findings to Kafka, i.e. write PeopleDfFilter in JSON format to the "output" Kafka topic.
I have a lot of input messages in Kafka, which is why I want to create the Kafka producer inside foreachPartition.
The process is very simple, so why not use Structured Streaming all the way?

import org.apache.spark.sql.functions.{from_json, to_json}

spark
  // Read the data
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", inservers)
  .option("subscribe", intopic)
  .load()
  // Transform / filter
  .select(from_json($"value".cast("string"), schema).alias("value"))
  .filter(...) // Add the condition
  .select(to_json($"value").alias("value"))
  // Write back
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", outservers)
  .option("topic", outtopic) // the Kafka sink takes a "topic" option, not "subscribe"
  // a "checkpointLocation" option is also required by the Kafka sink
  .start()
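For the specific condition in the question (value1 matching "1" or value2 equal to 2), the elided filter could look roughly like the line below, where parsed stands for the DataFrame produced by the from_json select above:
import org.apache.spark.sql.functions.col

val filtered = parsed.filter(col("value.value1").rlike("1") || col("value.value2") === 2)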
Try using Structured Streaming for that. Even on Spark 2.1, you can implement your own Kafka ForeachWriter as follows:
Kafka sink:
import java.util.Properties
// note: the "kafkashaded" prefix is Databricks-specific; elsewhere use org.apache.kafka.clients.producer._
import kafkashaded.org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter

class KafkaSink(topic: String, servers: String) extends ForeachWriter[(String, String)] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer",
    classOf[org.apache.kafka.common.serialization.StringSerializer].getName)
  kafkaProperties.put("value.serializer",
    classOf[org.apache.kafka.common.serialization.StringSerializer].getName)

  val results = new scala.collection.mutable.HashMap[String, String]
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: (String, String)): Unit = {
    producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
Usage:
val topic = "<topic2>"
val brokers = "<server:ip>"

val writer = new KafkaSink(topic, brokers)

val query =
  streamingSelectDF
    .writeStream
    .foreach(writer)
    .outputMode("update")
    .trigger(ProcessingTime("25 seconds"))
    .start()

How to join 2 spark sql streams

ENV:
Scala; Spark version: 2.1.1
These are my streams (read from Kafka):
val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("JoinStreams")
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._

val schema = StructType(
  List(
    StructField("t", DataTypes.StringType),
    StructField("dst", DataTypes.StringType),
    StructField("dstPort", DataTypes.IntegerType),
    StructField("src", DataTypes.StringType),
    StructField("srcPort", DataTypes.IntegerType),
    StructField("ts", DataTypes.LongType),
    StructField("len", DataTypes.IntegerType),
    StructField("cpu", DataTypes.DoubleType),
    StructField("l", DataTypes.StringType),
    StructField("headers", DataTypes.createArrayType(DataTypes.StringType))
  )
)

val baseDataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select($"data.*")

val requestsDataFrame = baseDataFrame
  .filter("t = 'REQUEST'")
  .repartition($"dst")
  .withColumn("rowId", monotonically_increasing_id())

val responseDataFrame = baseDataFrame
  .filter("t = 'RESPONSE'")
  .repartition($"src")
  .withColumn("rowId", monotonically_increasing_id())

responseDataFrame.createOrReplaceTempView("responses")
requestsDataFrame.createOrReplaceTempView("requests")

val dataFrame = spark.sql("select * from requests left join responses ON requests.rowId = responses.rowId")
I get this ERROR when starting the application:
org.apache.spark.sql.AnalysisException: Left outer/semi/anti joins with a streaming DataFrame/Dataset on the right is not supported;;
How can I join these two streams?
I also tried a direct join and got the same error.
Should I first save it to file and then read it again?
What is the best practice?
It seems you need Spark 2.3:
"In Spark 2.3, we have added support for stream-stream joins..."
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins
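For reference, once on Spark 2.3+ a stream-stream join over these two derived streams could look roughly like the sketch below. Note that monotonically_increasing_id is assigned independently on each side, so rowId will not line up across the two streams; the sketch instead assumes ts is an epoch timestamp in milliseconds and that a request and its response can be correlated on the dst/src address and port pairs. The ten-minute watermarks and one-minute bound are arbitrary examples.
import org.apache.spark.sql.functions.{col, expr}

// watermark both sides so Spark can bound the join state it keeps around
val req = requestsDataFrame
  .withColumn("reqTime", (col("ts") / 1000).cast("timestamp"))
  .withWatermark("reqTime", "10 minutes")
  .as("req")

val resp = responseDataFrame
  .withColumn("respTime", (col("ts") / 1000).cast("timestamp"))
  .withWatermark("respTime", "10 minutes")
  .as("resp")

// inner join on the correlation fields plus an event-time range condition
val joined = req.join(
  resp,
  expr("""
    req.dst = resp.src AND
    req.dstPort = resp.srcPort AND
    respTime >= reqTime AND
    respTime <= reqTime + interval 1 minute
  """)
)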