In my Spark Streaming application, I'm trying to stream data from Azure EventHub and write it to a couple of directories in the HDFS blob, based on the data. I basically followed the link multiple writeStream with spark streaming.
Below is the code:
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String): StreamingQuery = {
  input
    .writeStream
    .format("com.databricks.spark.avro")
    .partitionBy("year", "month", "day")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
}
writeStreamer(dtcFinalDF, "/qmctdl/DTC_CheckPoint", "/qmctdl/DTC_DATA")
val query1 = writeStreamer(canFinalDF, "/qmctdl/CAN_CheckPoint", "/qmctdl/CAN_DATA")
query1.awaitTermination()
What I currently observe is that data is written successfully to the "/qmctdl/CAN_DATA" directory, but no data is getting written to "/qmctdl/DTC_DATA". Am I doing anything wrong here? Any help would be much appreciated.
Take a look at this answer:
Executing separate streaming queries in spark structured streaming
I don't know about Azure EventHub, but basically I think one stream is reading all the data and the other stream doesn't get served any data.
Can you try this:
spark.streams.awaitAnyTermination()
instead of
query1.awaitTermination()
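For example, keeping your writeStreamer helper exactly as it is, the driver code could look like this minimal sketch (same DataFrames, paths and checkpoint folders as in your question):

// start() is non-blocking, so both queries run concurrently
val dtcQuery = writeStreamer(dtcFinalDF, "/qmctdl/DTC_CheckPoint", "/qmctdl/DTC_DATA")
val canQuery = writeStreamer(canFinalDF, "/qmctdl/CAN_CheckPoint", "/qmctdl/CAN_DATA")

// block until any active streaming query terminates (or fails), instead of waiting on a single query
spark.streams.awaitAnyTermination()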
I have trouble understanding how checkpoints work with Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new Spark process that
reads the events from the Hive log table into a DataFrame
joins those events with the stream of confirmation events using Spark Structured Streaming
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; below is the pseudocode (I'm using Scala).
val tableA = spark.table("tableA")

val startingOffsets = "earliest"
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")

joinTableAWithStreamOfData
  .writeStream
  .foreach(writeDataToHBaseTable())
  .start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner join those new events with my log table, so as to write only new data to the final HBase table.
I created a directory in HDFS where to store the checkpoint file.
I provided that location to the spark-submit command I use to call the spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory \
  --all_the_other_settings_and_libraries
At the moment the code runs every 15 minutes without any error, but it basically doesn't do anything, since it is not writing the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some files should be written there?
And does the readStream function need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the Spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application it is best to have the checkpoint location set directly in your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When restarting, the application will check these checkpoint files to understand where to continue reading from Kafka, so that it does not skip or miss any messages. You do not need to set startingOffsets manually.
It is important to keep in mind that only specific changes to your application's code are allowed so that the checkpoint files can be used for safe restarts. A good overview can be found in the Structured Streaming Programming Guide under Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // option startingOffsets is only relevant for the very first run of this application; after that, the checkpoint files are used
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

// perform any kind of transformations on the streaming DataFrame
val processedStreamOfData = streamOfData.[...]

val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(writeDataToHBaseTable())
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()
I am using the Twitter stream function, which gives a stream. I am required to use the Spark writeStream function like this: writeStream function link
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .start()
The 'df' needs to be a streaming Dataset/DataFrame. If df is a normal DataFrame, it gives an error saying 'writeStream' can be called only on streaming Dataset/DataFrame.
I have already done:
1. get the stream from Twitter
2. filter and map it to get a tag for each tweet (Positive, Negative, Neutral)
The last step is to group by tag, count each group, and pass the result to Kafka.
Do you guys have any idea how to transform a DStream into a streaming Dataset/DataFrame?
Edit: the foreachRDD function does turn the DStream into normal DataFrames, but 'writeStream' can be called only on a streaming Dataset/DataFrame (the writeStream link is provided above):
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;
How to transform a DStream into a streaming Dataset/DataFrame?
DStream is an abstraction of a series of RDDs.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between a streaming and a batch Dataset is the isStreaming property of Dataset).
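For example, a quick check in spark-shell (using the built-in rate source just for illustration):

spark.range(1).isStreaming                           // false: a batch Dataset
spark.readStream.format("rate").load().isStreaming   // true: a streaming Dataset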
It is possible to convert a DStream to a streaming Dataset to keep the behaviour of the DStream.
I think you don't really want it though.
All you need is to take tweets using DStream and save them to a Kafka topic (and you think you need Structured Streaming). I think you simply need Spark SQL (the underlying engine of Structured Streaming).
The pseudo-code would then be as follows (sorry, it's been quite a while since I last used the old-fashioned Spark Streaming):
val spark: SparkSession = ...
val tweets = DStream...
tweets.foreachRDD { rdd =>
  import spark.implicits._
  rdd.toDF.write.format("kafka")...
}
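A slightly more complete sketch of that idea, assuming the DStream carries (tag, count) pairs after your filter/map/count step, and that the broker address and topic name are placeholders you would replace:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

val spark: SparkSession = SparkSession.builder().getOrCreate()
val tagCounts: DStream[(String, Long)] = ...   // result of your filter/map/count pipeline

tagCounts.foreachRDD { rdd =>
  import spark.implicits._
  rdd.toDF("key", "value")
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write                                               // a plain batch write per micro-batch, no writeStream needed
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")    // placeholder
    .option("topic", "topic1")                           // placeholder
    .save()
}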
I am using Spark Structured Streaming to read incoming messages from a Kafka topic and write them to multiple Parquet tables based on the incoming message.
So I created a single readStream, since the Kafka source is common, and a separate writeStream for each Parquet table in a loop. This works fine, but the readStream is becoming a bottleneck, since each writeStream creates its own readStream and there is no way to cache the DataFrame that has already been read.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.servers)
  .option("subscribe", conf.topics)
  // .option("earliestOffset", "true")
  .option("failOnDataLoss", false)
  .load()
foreach table {
  // filter the data from the source based on the table name
  // write to parquet
  parquetDf.writeStream.format("parquet")
    .option("path", outputFolder + File.separator + tableName)
    .option("checkpointLocation", "checkpoint_" + tableName)
    .outputMode("append")
    .trigger(Trigger.Once())
    .start()
}
Now every write stream creates a new consumer group and reads the entire data from Kafka, then does the filtering and writes to Parquet. This creates a huge overhead. To avoid it, I could partition the Kafka topic to have as many partitions as there are tables, and then each readStream should read only from a given partition. But I don't see a way to specify partition details as part of the Kafka read stream.
If the data volume is not that high, write your own sink and collect the data of each micro-batch; then you should be able to cache that DataFrame and write it to different locations. It needs some tweaks, but it will work.
You can use the foreachBatch sink and cache the DataFrame. Hopefully it works.
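A minimal sketch of that foreachBatch approach, assuming Spark 2.4+ (where foreachBatch is available) and that tableNames, outputFolder and the filter column are placeholders for your own code:

import java.io.File
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

kafkaDf.writeStream
  .trigger(Trigger.Once())
  .option("checkpointLocation", "checkpoint_all_tables")   // one checkpoint for the combined query
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()                                      // Kafka is read only once per micro-batch
    tableNames.foreach { tableName =>
      batchDf
        .filter(col("tableName") === tableName)            // your own per-table filter logic goes here
        .write
        .mode("append")
        .parquet(outputFolder + File.separator + tableName)
    }
    batchDf.unpersist()
  }
  .start()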
Some backstory: for a homework project at university we are tasked with implementing an algorithm of our choice in a scalable way. We chose to use Scala, Spark, MongoDB and Kafka, as these were recommended during the course. To read data from our MongoDB, we opted to use MongoSpark, as it allows for easy and scalable operations on data. We also use Kafka to simulate streaming from an outside source. We need to perform multiple operations on every entry that Kafka produces. The issue comes from saving the result of this data back to MongoDB.
We have the following code:
val streamDF = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "aTopic")
  .load
  .selectExpr("CAST(value AS STRING)")
From here on, we're at a loss. We cannot use a .map, since MongoSpark only operates on DataFrames, Datasets and RDDs and is not serializable, and MongoSpark.save does not work on streaming DataFrames like the one specified above. We also cannot use the default MongoDB Scala driver, as it conflicts with MongoSpark when we add the dependency. Note that the rest of the algorithm heavily relies on joins and groupBys.
How can we get the data from here to our MongoDB?
Edit:
For an easy to reproduce example, one could try the following:
val streamDF = sparkSession
  .readStream
  .format("rate")
  .load
Adding a .write to that, which is required for MongoSpark.save, will cause an exception because write cannot be called on a streaming DataFrame.
The save() method of the MongoDB Connector for Spark accepts an RDD (as of the current version, 2.2). When using DStreams with MongoSpark, you need to fetch the 'batches' of RDDs in the stream and write those.
// wordCounts is a DStream[(String, Int)]; WordCount is assumed to be a simple case class,
// and the .mongo() shortcut comes from the MongoDB Spark connector's implicits
wordCounts.foreachRDD({ rdd =>
  import spark.implicits._
  val wordCounts = rdd.map({ case (word: String, count: Int) =>
    WordCount(word, count)
  }).toDF()
  wordCounts.write.mode("append").mongo()
})
See also:
Design Patterns for using foreachRDD
MongoDB: Spark Streaming
I am developing a Spark streaming job (using Structured Streaming, not DStreams). I get a message from Kafka that contains many comma-separated fields, of which the first column is a filename. Based on that filename I have to read the file from HDFS, create a DataFrame, and operate further on it. This seems simple, but Spark does not allow me to run any actions before start is called. The Spark documentation also says the same:
In addition, there are some Dataset methods that will not work on
streaming Datasets. They are actions that will immediately run queries
and return results, which does not make sense on a streaming Dataset.
Below is what I have tried.
object StructuredStreamingExample {

  case class filenameonly(value: String)

  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("StructuredNetworkWordCount").master("local[*]").getOrCreate()
    spark.sqlContext.setConf("spark.sql.shuffle.partitions", "5")
    import spark.implicits._

    val lines = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "strtest")
      .load()
    val values = lines.selectExpr("CAST(value AS STRING)").as[String]
    val filename = values.map(x => x.split(",")(0)).toDF().select($"value")

    // Here, how do I convert the filename, which is a DataFrame, to a String and apply that to spark.readtextfile(filename)?
    datareadfromhdfs
      .writeStream
      .trigger(ProcessingTime("10 seconds"))
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
Now, in the above code, after I get the filename (which is a DataFrame), how do I convert it to a String so that I can do spark.readtextfile(filename) to read the file from HDFS?
I'm not sure it's the best use of Spark streaming, but in a case like this I would call filename.foreachRDD, read the HDFS files from in there, and do whatever you need afterwards.
(Keep in mind that when running inside a foreachRDD you cannot use the global SparkSession, but need to getOrCreate it from the builder like this: val sparkSession = SparkSession.builder.config(myCurrentForeachRDD.sparkContext.getConf).getOrCreate())
You seem to rely on a stream to tell you where to look for and load files. Have you tried simply using a file stream on that folder and letting Spark monitor and read new files automatically for you?
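For example, a minimal sketch of a file source, assuming the incoming files are plain text landing in a single HDFS folder (the path is a placeholder):

// Spark monitors the folder and automatically picks up files that are added to it
val filesDf = spark.readStream
  .format("text")                            // or "csv"/"json"/... with an explicit schema
  .load("hdfs:///path/to/incoming/files")    // placeholder path

filesDf.writeStream
  .outputMode("append")
  .format("console")
  .start()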
This is surely not the best use case for Spark Structured Streaming. If you understand Spark Structured Streaming correctly, all the data transformations/aggregations should happen in the query that generates the result table. However, you can still implement workarounds where you write the code that reads data from HDFS inside (flat)MapGroupsWithState. But, again, it is not advisable to do so.