Spark Structured Streaming receiving duplicate messages - scala

I am using Kafka and Spark 2.4.5 Structured Streaming. I am doing an average operation, but I am facing issues due to duplicate records arriving from the Kafka topic in the current batch.
For example, these Kafka topic messages are received in the first batch (running in update mode):
car,Brand=Honda,speed=110,1588569015000000000
car,Brand=ford,speed=90,1588569015000000000
car,Brand=Honda,speed=80,1588569015000000000
Here the result is the average speed per brand per timestamp,
i.e. grouping by 1588569015000000000 and Brand=Honda, the result we got is
(110 + 90) / 2 = 100
Now the second batch receives late data, with a duplicate message having the same timestamp:
car,Brand=Honda,speed=50,1588569015000000000
car,Brand=Honda,speed=50,1588569015000000000
I am expecting the average to update to (110 + 90 + 50) / 3 = 83.33,
but the result updates to (110 + 90 + 50 + 50) / 4 = 75, which is wrong.
val rawDataStream: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", "topic1") // Both topics on same stream!
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as data")
Then I group by timestamp and Brand and write the result back to Kafka with a checkpoint, roughly as sketched below.
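A simplified sketch of those two steps (the parsing expressions, output topic and checkpoint path are placeholders):
import org.apache.spark.sql.functions._

// Parse the CSV-style value into Brand, speed and timestamp columns.
val parsed = rawDataStream
  .select(split(col("data"), ",").alias("parts"))
  .select(
    split(col("parts").getItem(1), "=").getItem(1).alias("Brand"),
    split(col("parts").getItem(2), "=").getItem(1).cast("double").alias("speed"),
    col("parts").getItem(3).alias("timestamp"))

// Average speed per brand per timestamp.
val averaged = parsed
  .groupBy(col("timestamp"), col("Brand"))
  .agg(avg(col("speed")).alias("avg_speed"))

// Write the running averages back to Kafka in update mode, with a checkpoint.
averaged
  .select(to_json(struct(col("timestamp"), col("Brand"), col("avg_speed"))).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", "topic1-averages")                  // placeholder output topic
  .option("checkpointLocation", "/tmp/avg-checkpoint") // placeholder path
  .outputMode("update")
  .start()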
How can I do this with Spark Structured Streaming, or is anything wrong in the code?

Spark Structured Streaming allows deduplication on a streaming DataFrame using dropDuplicates. You need to specify the fields that identify a duplicate record; across batches, Spark will keep only the first record per combination and discard subsequent records with the same values.
The snippet below will deduplicate your streaming DataFrame on the Brand, speed and timestamp combination.
rawDataStream.dropDuplicates("Brand", "speed", "timestamp")
Refer to the Spark documentation here.
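Since the raw stream above has only a single string column, a minimal sketch of how this could be applied - assuming the value has first been parsed into Brand, speed and timestamp columns and treating the timestamp as nanoseconds since the epoch (both assumptions):
import org.apache.spark.sql.functions._

val deduped = parsedStream
  // Convert the nanosecond epoch value into a proper event-time column (conversion is an assumption).
  .withColumn("eventTime", (col("timestamp") / 1000000000L).cast("timestamp"))
  // The watermark bounds how much deduplication state Spark keeps across batches.
  .withWatermark("eventTime", "10 minutes")
  // Including the event-time column in the keys lets old state be dropped once the watermark passes.
  .dropDuplicates("Brand", "speed", "eventTime")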

Related

Convert Stream Dataframe into Spark Dataframe using scala and spark

I have the following stream DataFrame:
+---------------------------------+
|value                            |
+---------------------------------+
|I am going to school 😀          |
|why are you crying 🙁 😞          |
|You are not very good my friend  |
+---------------------------------+
I have created the above DataFrame using the code below:
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("failOnDataLoss", false)
.option("subscribe", "myTopic.raw")
.load()
I want to store this stream DataFrame as a regular Spark DataFrame. Is it possible to convert it like that in Scala and Spark? In the end I want to convert the Spark DataFrame into a list of sentences. The issue with the stream DataFrame is that I am unable to convert it directly into a list that I can iterate over and run some data processing actions on.
You should be able to do many of the standard operations on the stream that you're getting from Kafka, but you need to take into account the differences in semantics between batch and streaming processing - refer to the Spark docs for that.
Also, when you're getting data from Kafka, the set of columns is fixed and the payload is binary, so you need to cast the value column to a string, something like this (see docs):
val df = readStream.select($"value".cast("string").alias("sentences"))
After that you'll get a DataFrame with the actual payload and can start processing. Depending on the complexity of the processing, you may need to resort to the foreachBatch functionality, but that may not be necessary - you need to provide more details on what kind of processing you need to do.
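If, as in the question, the goal is to end up with a plain list of sentences to iterate over, a minimal sketch of the foreachBatch route could look like this (the checkpoint path is an assumption, and collecting to the driver only makes sense for small batches):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val query = readStream
  .select(col("value").cast("string").alias("sentences"))
  .writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch arrives here as a regular, non-streaming DataFrame.
    val sentences: List[String] = batchDf.collect().map(_.getString(0)).toList
    // Placeholder: iterate and run whatever per-sentence processing is needed.
    sentences.foreach(println)
  }
  .option("checkpointLocation", "/tmp/sentences-checkpoint") // assumed path
  .start()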

How can it be possible? duplicate records in Kafka queue?

I'm using Apache NiFi, Spark and Kafka to send messages between them. First, I take data with NiFi and send it to Spark to process it. Then, I send data from Spark back to NiFi to insert it into a DB.
My problem is that each time I run Spark, I get the same 3.142 records. I have the first part of NiFi stopped and the second running, yet every Spark run produces the same 3.142 records, and I'm not able to make sense of this data.
Where does it come from?
I've tried to see if I have data on Kafka-Queue-I (from NiFi to Spark) or Kafka-Queue-II (from Spark to NiFi), and in both cases the answer is no. Only when I run Spark do the 3.142 records appear in Kafka-Queue-II, but this doesn't happen on Kafka-Queue-I...
In Nifi, PublishKafka_1_0 1.7.0:
In Spark, KafkaConsumer:
val raw_data = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafka_servers)
.option("group.id", "spark_kafka_consumer")
.option("startingOffsets", "latest")
.option("enable.auto.commit", true)
option("failOnDataLoss", "false")
.option("subscribe", input_topic)
.load()
It shows me a lot of false alarms otherwise...
Some process...
var parsed_data = raw_data
.select(from_json(col("value").cast("string"), schema).alias("data"))
.select("data.*")
. ...
Kafka sink in Spark:
var query = parsed_data
.select(to_json(schema_out).alias("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", kafka_servers)
.option("zookeeper.connect", zookeeper_servers)
.option("checkpointLocation", checkpoint_location)
.option("topic", output_topic)
.start()
query.awaitTermination()
And, KafkaConsumer 1.7.0 in NiFi
I suspect that you're using a new (auto-generated) consumer group every time you start Spark and you have an offset reset policy of earliest. The result of this would be that Spark starts from the beginning of the topic every time.
Kafka does not remove messages from the topic when they are consumed (unlike some other pub-sub systems). To avoid seeing old messages, you will need to set a consumer group and have Spark commit the offsets as it processes. These offsets are stored, and the next time a consumer starts with that group, it will pick up from the last stored offset for that group.
I would also note that Kafka, outside of some very specific usage patterns and technology choices, does not promise "exactly-once" messaging but instead "at-least-once"; my general advice there would be to try to be tolerant of duplicated records.
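In Structured Streaming specifically, the query's progress is tracked in its checkpoint directory rather than through Kafka's committed consumer-group offsets, so the practical fix is to keep a stable checkpointLocation across runs. A minimal sketch, reusing the names from the question:
val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("subscribe", input_topic)
  .option("startingOffsets", "latest") // only consulted on the very first run, before a checkpoint exists
  .option("failOnDataLoss", "false")
  .load()

raw_data
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("topic", output_topic)
  .option("checkpointLocation", checkpoint_location) // keep this path stable across runs
  .start()
  .awaitTermination()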

How to use writeStream to pass Spark stream to a kafka topic

I am using a Twitter stream function which gives a stream. I am required to use the Spark writeStream function (writeStream function link), like:
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
The 'df' needs to be a streaming Dataset/DataFrame. If df is a normal DataFrame, it gives an error saying 'writeStream' can be called only on streaming Dataset/DataFrame.
I have already done:
1. get the stream from Twitter
2. filter and map it to get a tag for each tweet (Positive, Negative, Neutral)
The last step is to group by tag, count each group, and pass the result to Kafka.
Do you guys have any idea how to transform a DStream into a streaming Dataset/DataFrame?
Edited: the foreachRDD function does convert the DStream into normal DataFrames. But 'writeStream' can be called only on a streaming Dataset/DataFrame (the writeStream link is provided above):
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;
How can I transform a DStream into a streaming Dataset/DataFrame?
DStream is an abstraction of a series of RDDs.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between streaming and batch Datasets is a property isStreaming of Dataset).
It is possible to convert a DStream to a streaming Dataset to keep the behaviour of the DStream.
I think you don't really want it though.
All you need is to take tweets using DStream and save them to a Kafka topic (and you think you need Structured Streaming). I think you simply need Spark SQL (the underlying engine of Structured Streaming).
Pseudo-code would then be as follows (sorry, it's been a long while since I used the old-fashioned Spark Streaming):
val spark: SparkSession = ...
val tweets = DStream...
tweets.foreachRDD { rdd =>
  import spark.implicits._
  rdd.toDF.write.format("kafka")...
}
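A slightly more complete sketch of that idea, assuming tweets is a DStream[String] and treating the broker address and topic name as placeholders (if the elements are (tag, count) pairs, build the value column accordingly):
tweets.foreachRDD { rdd =>
  import spark.implicits._

  rdd.toDF("value")       // the Kafka sink expects a string or binary "value" column
    .write                // batch (non-streaming) write, so no writeStream is needed
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("topic", "topic1")
    .save()
}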

Spark Structured Streaming with foreach

I am using spark structured streaming to read data from kafka.
val readStreamDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", config.getString("kafka.source.brokerList"))
.option("startingOffsets", config.getString("kafka.source.startingOffsets"))
.option("subscribe", config.getString("kafka.source.topic"))
.load()
Based on a uid in the message read from Kafka, I have to make an API call to an external service, fetch data, and write it back to another Kafka topic.
For this I am using a custom foreach writer and processing every message.
import spark.implicits._
val eventData = readStreamDF
.select(from_json(col("value").cast("string"), event).alias("message"), col("timestamp"))
.withColumn("uid", col("message.eventPayload.uid"))
.drop("message")
val q = eventData
.writeStream
.format("console")
.foreach(new CustomForEachWriter())
.start()
The CustomForEachWriter makes an API call and fetches results for the given uid from a service. The result is an array of ids. These ids are then written back to another Kafka topic via a Kafka producer.
There are 30 Kafka partitions and I have launched Spark with the following config:
num-executors = 30
executors-cores = 3
executor-memory = 10GB
But still the Spark job starts lagging and is not able to keep up with the incoming data rate.
The incoming data rate is around 10K messages per second, and the average time to process a single message is 100 ms.
I want to understand how Spark processes this in the case of Structured Streaming.
Is there one dedicated executor responsible for reading data from all partitions of Kafka?
Does that executor distribute tasks based on the number of partitions in Kafka?
The data in a batch seems to get processed sequentially. How can it be made to process in parallel so as to maximize the throughput?
I think the CustomForEachWriter will work on a single row/record of the dataset at a time. If you are using Spark version 2.4, you can experiment with foreachBatch, for example as sketched below. But note it is marked as Evolving.
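A sketch of what that could look like here, assuming a hypothetical callExternalApi(uid) helper that returns the ids for a uid; the output topic and checkpoint path are assumptions, and the broker list is reused from the question:
import org.apache.spark.sql.DataFrame

val q = eventData
  .writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    import spark.implicits._

    // The enrichment runs on the executors, one task per partition,
    // so API calls for different partitions happen in parallel.
    val enriched = batch
      .select("uid").as[String]
      .mapPartitions { uids =>
        uids.flatMap(uid => callExternalApi(uid)) // hypothetical helper returning the ids for a uid
      }
      .toDF("value")

    // Write the whole micro-batch to the output topic in one batch write.
    enriched.write
      .format("kafka")
      .option("kafka.bootstrap.servers", config.getString("kafka.source.brokerList")) // reuse the same brokers; adjust if the sink cluster differs
      .option("topic", "enriched-ids")            // assumed topic name
      .save()
  }
  .option("checkpointLocation", "/tmp/enrich-checkpoint") // assumed path
  .start()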

structured streaming read based on kafka partitions

I am using Spark Structured Streaming to read incoming messages from a Kafka topic and write to multiple parquet tables based on the incoming message.
So I created a single readStream, as the Kafka source is common, and for each parquet table I created a separate write stream in a loop. This works fine, but the readStream is creating a bottleneck, as each writeStream creates its own readStream and there is no way to cache the DataFrame which has already been read.
val kafkaDf=spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", conf.servers)
.option("subscribe", conf.topics)
// .option("earliestOffset","true")
.option("failOnDataLoss",false)
.load()
foreach table {
  // filter the data from source based on table name
  // write to parquet
  parquetDf.writeStream.format("parquet")
    .option("path", outputFolder + File.separator + tableName)
    .option("checkpointLocation", "checkpoint_" + tableName)
    .outputMode("append")
    .trigger(Trigger.Once())
    .start()
}
Now every write stream creates a new consumer group and reads the entire data from Kafka, then does the filter and writes to Parquet. This creates a huge overhead. To avoid this overhead, I could partition the Kafka topic to have as many partitions as the number of tables, and then the readStream should only read from a given partition. But I don't see a way to specify partition details as part of the Kafka read stream.
If the data volume is not that high, write your own sink and collect the data of each micro-batch; then you should be able to cache that DataFrame and write to different locations. It needs some tweaks, but it will work.
You can use the foreachBatch sink (Spark 2.4+) and cache the DataFrame, for example as sketched below. Hopefully it works.
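A sketch of that approach, assuming a tableNames collection and a placeholder per-table filter, and reusing outputFolder from the question:
import java.io.File
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

kafkaDf
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.persist()                                  // read from Kafka once, reuse for every table
    tableNames.foreach { tableName =>
      batch
        .filter(col("value").contains(tableName))    // placeholder for the real per-table filter
        .write
        .mode("append")
        .parquet(outputFolder + File.separator + tableName)
    }
    batch.unpersist()
  }
  .trigger(Trigger.Once())
  .option("checkpointLocation", "checkpoint_all_tables") // one checkpoint for the single query
  .start()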