Some backstory: for a university homework project we are tasked with implementing an algorithm of our choice in a scalable way. We chose Scala, Spark, MongoDB and Kafka, as these were recommended during the course. To read data from our MongoDB we opted for MongoSpark, as it allows easy and scalable operations on the data. We also use Kafka to simulate streaming from an outside source. We need to perform multiple operations on every entry that Kafka produces. The issue is saving the results of these operations back to MongoDB.
We have the following code:
val streamDF = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "aTopic")
.load
.selectExpr("CAST(value AS STRING)")
From here on, we're at a loss. We cannot use a .map as MongoSpark only operates on DataFrames, Datasets and RDDs and is not serializable, and using MongoSpark.save does not work on streaming DataFrames like the one specified. We also cannot use the default MongoDB Scala driver as this conflicts with MongoSpark upon adding the dependency. Note that the rest of the algorithm heavily relies on joins and groupbys.
How can we get the data from here to our MongoDB?
Edit:
For an easy to reproduce example, one could try the following:
val streamDF = sparkSession
.readStream
.format("rate")
.load
Adding a .write to that, which is required for MongoSpark.save, will cause an exception because write cannot be called on a streaming DataFrame.
The save() method for MongoDB Connector for Spark accepts RDD (as of current version 2.2). When utilising DStream with MongoSpark, you need to fetch the 'batches' of RDDs in the stream to write.
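import com.mongodb.spark.sql._ // adds .mongo() to DataFrameWriter

// WordCount is assumed to be: case class WordCount(word: String, count: Int)
wordCounts.foreachRDD({ rdd =>
  import spark.implicits._
  // Convert each micro-batch RDD to a DataFrame, then append it to MongoDB
  val wordCountsDF = rdd.map({ case (word: String, count: Int)
    => WordCount(word, count) }).toDF()
  wordCountsDF.write.mode("append").mongo()
})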
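For a Structured Streaming DataFrame like the one in the question, the same per-batch idea can be sketched with foreachBatch. This is only a minimal sketch under assumptions: it assumes Spark 2.4 or later (where DataStreamWriter.foreachBatch is available) and a 2.x MongoDB Spark connector configured through spark.mongodb.output.uri. The URI, the database and collection names, the object name and the checkpoint path are all placeholders.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamToMongo {
  // Called once per micro-batch. batchDF is an ordinary (non-streaming) DataFrame,
  // so the batch writer that MongoSpark.save expects is available on it.
  def saveBatch(batchDF: DataFrame, batchId: Long): Unit =
    MongoSpark.save(batchDF.write.mode("append"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamToMongo")
      .master("local[*]")
      // Placeholder URI: database "test", collection "results"
      .config("spark.mongodb.output.uri", "mongodb://localhost/test.results")
      .getOrCreate()

    // Same easy-to-reproduce source as in the question
    val streamDF = spark.readStream.format("rate").load()

    val query = streamDF.writeStream
      .foreachBatch(saveBatch _)
      .option("checkpointLocation", "/tmp/stream-to-mongo-checkpoint") // placeholder path
      .start()

    query.awaitTermination()
  }
}
```

Joins and groupBys can stay on the streaming side before writeStream; only the final write happens per batch.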
See also:
Design Patterns for using foreachRDD
MongoDB: Spark Streaming
I have the following stream dataframe.
+----------------------------------+
| value                            |
+----------------------------------+
| I am going to school 😀          |
| why are you crying 🙁 😞         |
| You are not very good my friend  |
+----------------------------------+
I created the above DataFrame using the code below:
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("failOnDataLoss", false)
.option("subscribe", "myTopic.raw")
.load()
I want to store the same streaming DataFrame in a normal Spark DataFrame. Is such a conversion possible in Scala and Spark? In the end I want to convert the Spark DataFrame into a list of sentences. The issue with the streaming DataFrame is that I am unable to convert it directly into a list that I can iterate over to do some data processing.
You should be able to do many of the standard operations on the stream that you're getting from Kafka, but you need to take into account the differences in semantics between batch and streaming processing; refer to the Spark docs for that.
Also, when you're getting data from Kafka, the set of columns is fixed and the payload arrives as binary, so you need to cast the value column to a string, for example like this (see docs):
val df = readStream.select($"value".cast("string").alias("sentences"))
After that you'll have a DataFrame with the actual payload and can start processing. Depending on the complexity of the processing, you may need to fall back to the foreachBatch functionality, but that may not be necessary; you need to provide more details on what kind of processing you want to do.
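If a plain list of sentences per micro-batch is really what is needed, the foreachBatch route could look roughly like the sketch below. It assumes Spark 2.4 or later; the broker address, the object name and the tokenize helper (a stand-in for "some data processing") are hypothetical. Note that collect() pulls the whole micro-batch onto the driver, so this only suits small batches.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SentencesPerBatch {
  // Hypothetical processing step: split a sentence into whitespace-separated tokens.
  def tokenize(sentence: String): List[String] =
    sentence.trim.split("\\s+").filter(_.nonEmpty).toList

  // Runs once per micro-batch on a normal DataFrame, so collect() is allowed here.
  def processBatch(batch: DataFrame, batchId: Long): Unit = {
    val sentences: List[String] =
      batch.collect().filter(row => !row.isNullAt(0)).map(_.getString(0)).toList
    sentences.foreach(s => println(s"batch $batchId: ${tokenize(s)}"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SentencesPerBatch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sentencesDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder address
      .option("subscribe", "myTopic.raw")
      .load()
      .select($"value".cast("string").alias("sentences"))

    sentencesDF.writeStream
      .foreachBatch(processBatch _)
      .start()
      .awaitTermination()
  }
}
```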
I have trouble understanding how checkpoints work in Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a kafka stream.
I created a new spark process that
reads the events from the Hive log table into a DataFrame
joins those events with the stream of confirmation events using Spark Structured Streaming
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; the pseudocode is below (I'm using Scala).
val tableA = spark.table("tableA")
val startingOffsets = "earliest"

val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")

joinTableAWithStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  ).start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read in the previous run, and inner join those new events with my log table, so that only new data is written to the final HBase table.
I created a directory in HDFS where to store the checkpoint file.
I provided that location to the spark-submit command I use to call the spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory \
  --all_the_other_settings_and_libraries
At the moment the code runs every 15 minutes without any error, but it basically does nothing, since it is not writing the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some file should have been written there?
And does the readStream call need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) // ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here."
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application, it is best to set the checkpoint location directly on your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When the application restarts, it checks these checkpoint files to work out where to continue reading from Kafka, so it neither skips nor re-reads any message. You do not need to set startingOffsets manually.
Keep in mind that only specific changes to your application's code are allowed if the checkpoint files are to remain usable for safe restarts. A good overview can be found in the Structured Streaming Programming Guide under Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // option startingOffsets is only relevant the very first time this application runs. After that, the checkpoint files are used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

// perform any kind of transformations on streaming DataFrames
val processedStreamOfData = streamOfData.[...]

val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable() // must return a ForeachWriter
  )
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()
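If the job is instead launched by an external scheduler every 15 minutes, as the question describes, one possible variant is a one-shot trigger: Trigger.Once() processes whatever is available since the last checkpointed offsets and then stops the query. The sketch below assumes Spark 2.2 or later; the broker address, the topic name, the object name and the console sink (standing in for the HBase writer) are placeholders, not part of the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object RunOncePerSchedule {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RunOncePerSchedule").getOrCreate()

    val streamOfData = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder address
      .option("subscribe", "confirmations")                // hypothetical topic
      .option("startingOffsets", "earliest")               // only used on the very first run
      .load()

    val query = streamOfData
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console") // stand-in for the HBase writer
      // Must be the same directory on every run so that offsets carry over
      .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
      .trigger(Trigger.Once())
      .start()

    // Returns after the single micro-batch has completed
    query.awaitTermination()
    spark.stop()
  }
}
```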
I am using the Twitter stream function, which gives a stream. I am required to use the Spark writeStream function, like this (writeStream function link):
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
The 'df' needs to be a streaming Dataset/DataFrame. If df is a normal DataFrame, it fails with an error saying that 'writeStream' can be called only on streaming Dataset/DataFrame.
I have already done:
1. get the stream from Twitter
2. filter and map it to get a tag for each tweet (Positive, Negative, Neutral)
The last step is to groupBy tag, count each group, and pass the result to Kafka.
Do you have any idea how to transform a DStream into a streaming Dataset/DataFrame?
Edited: the foreachRDD function does change a DStream into a normal DataFrame, but 'writeStream' can be called only on a streaming Dataset/DataFrame (the writeStream link is provided above):
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;
how to transform a Dstream into a streaming Dataset/DataFrame?
DStream is an abstraction of a series of RDDs.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between streaming and batch Datasets is a property isStreaming of Dataset).
It is possible to convert a DStream to a streaming Dataset to keep the behaviour of the DStream.
I think you don't really want it though.
All you need is to take tweets using DStream and save them to a Kafka topic (and you think you need Structured Streaming). I think you simply need Spark SQL (the underlying engine of Structured Streaming).
The pseudo-code would then be as follows (sorry, it has been a while since I used the old-fashioned Spark Streaming):
val spark: SparkSession = ...
val tweets = DStream...
tweets.foreachRDD { rdd =>
import spark.implicits._
rdd.toDF.write.format("kafka")...
}
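Filling in that pseudo-code for the "groupBy tag and count" step might look like the sketch below. It is a sketch under assumptions, not a tested implementation: it assumes Spark 2.2 or later with the spark-sql-kafka-0-10 package on the classpath (needed for the batch Kafka writer), and that the DStream already carries one tag string per tweet; the object and method names are hypothetical. The counts are per micro-batch, not running totals.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

object TagCountsToKafka {
  // taggedTweets is assumed to hold one tag per tweet, e.g. "Positive".
  def writeTagCounts(spark: SparkSession, taggedTweets: DStream[String]): Unit = {
    import spark.implicits._
    taggedTweets.foreachRDD { rdd =>
      rdd.toDF("tag")
        .groupBy("tag")
        .count()
        // The Kafka sink expects a string or binary "value" column (and optionally "key")
        .selectExpr("tag AS key", "CAST(`count` AS STRING) AS value")
        .write // batch writer, so no streaming DataFrame is required
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("topic", "topic1")
        .save()
    }
  }
}
```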
I am developing a Spark streaming job (using Structured Streaming, not DStreams). I get a message from Kafka that contains many comma-separated fields, the first of which is a filename. Based on that filename, I have to read the file from HDFS, create a DataFrame, and operate further on it. This seems simple, but Spark does not allow me to run any actions before start is called. The Spark documentation says the same:
In addition, there are some Dataset methods that will not work on
streaming Datasets. They are actions that will immediately run queries
and return results, which does not make sense on a streaming Dataset.
Below is what I have tried.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StructuredStreamingExample {
  case class filenameonly(value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredNetworkWordCount").master("local[*]").getOrCreate()
    spark.sqlContext.setConf("spark.sql.shuffle.partitions", "5")
    import spark.implicits._
    val lines = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "strtest")
      .load()
    val values = lines.selectExpr("CAST(value AS STRING)").as[String]
    val filename = values.map(x => x.split(",")(0)).toDF().select($"value")
    // Here, how do I convert filename (a DataFrame) to a String and pass it to spark.read.textFile(filename)?
    datareadfromhdfs
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
In the above code, once I have the filename as a DataFrame, how do I convert it to a String so that I can call spark.read.textFile(filename) to read the file from HDFS?
I'm not sure this is the best use of Spark Streaming, but in a case like this I would call filename.foreachRDD, read the HDFS files from in there, and do whatever you need afterwards.
(Keep in mind that when running inside a foreachRDD, you cannot use the global Spark session but need to getOrCreate it from the builder, like this: val sparkSession = SparkSession.builder.config(myCurrentForeachRDD.sparkContext.getConf).getOrCreate())
You seem to rely on a stream to tell you where to look and load files. Have you tried simply using a file stream on that folder and letting Spark monitor and read new files automatically for you?
This is certainly not the best use case for Spark Structured Streaming. If you understand Spark Structured Streaming correctly, all the data transformations/aggregations should happen in the query that generates the result table. However, you can still implement a workaround in which the code that reads data from HDFS lives in (flat)MapGroupsWithState. But, again, it is not advisable to do so.
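In Structured Streaming the per-batch hook is foreachBatch rather than foreachRDD, so the "read the file named in each message" idea could be sketched as follows. This is a sketch under assumptions: it assumes Spark 2.4 or later; the object name and the fileNameOf helper are hypothetical; and it collects the file names of each micro-batch on the driver, which only makes sense when a batch names few files.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object ReadFilesNamedInStream {
  // Hypothetical helper: the file name is the first comma-separated field of the message.
  def fileNameOf(message: String): String = message.takeWhile(_ != ',').trim

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadFilesNamedInStream")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val fileNames: Dataset[String] = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "strtest")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .where("value IS NOT NULL") // skip messages without a payload
      .as[String]
      .map(message => fileNameOf(message))

    // Each micro-batch is a normal Dataset, so actions such as collect() are allowed.
    val handleBatch: (Dataset[String], Long) => Unit = (batch, batchId) => {
      batch.collect().distinct.foreach { path =>
        val fileLines = spark.read.textFile(path) // ordinary batch read of the named file
        println(s"batch $batchId: $path has ${fileLines.count()} lines")
      }
    }

    fileNames.writeStream
      .foreachBatch(handleBatch)
      .start()
      .awaitTermination()
  }
}
```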
I continuously have data being written to cassandra from an outside source.
Now, I am using spark streaming to continuously read this data from cassandra with the following code:
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)
dstream.foreachRDD { rdd =>
println("\n"+rdd.count())
}
ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table data from Cassandra every time, not just the newest data saved into the table.
What I want to do is have spark streaming read only the latest data, ie, the data added after its previous read.
How can I achieve this? I tried to Google this but got very little documentation regarding this.
I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting spark streaming and cassandra and this question is about streaming only the latest data. BTW, streaming from cassandra IS possible by using the code I provided. However, it takes the entire table every time and not just the latest data.
There will be some low-level work on Cassandra that will allow notifying external systems (an indexer, a Spark stream etc.) of new mutations incoming to Cassandra, read this: https://issues.apache.org/jira/browse/CASSANDRA-8844