Unbounded table in Spark Structured Streaming - Scala

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming in Spark. Structured Streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder directory.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

val csvSchema = new StructType()
  .add("street", "string").add("city", "string")
  .add("zip", "string").add("state", "string").add("beds", "string")
  .add("baths", "string").add("sq__ft", "string").add("type", "string")
  .add("sale_date", "string").add("price", "string").add("latitude", "string")
  .add("longitude", "string")

val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")

val query = streamingDF.writeStream
  .format("console")
  .start()
What happens if I dump a 1 GB file into the folder? As per the docs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?

See the example in the Structured Streaming programming guide.
The key idea is to treat any data stream as an unbounded table: new records arriving on the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically equivalent, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
The guide also illustrates how the execution of such a query is performed in the Structured Streaming model.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM, because the underlying RDD (DataFrame/Dataset) is lazily evaluated and read partition by partition rather than loaded into memory as a whole. Of course, you may need to repartition before processing to get a suitable number of partitions and spread the data uniformly across the executors.
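If you want an explicit knob, the file source also lets you cap how many newly arrived files are picked up per trigger via the maxFilesPerTrigger option. A minimal sketch reusing the schema from the question (the partition count of 8 is purely illustrative):

// Bound how much each micro-batch ingests and spread the work across executors
val streamingDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 1)   // at most one new file considered per trigger
  .csv("./csvFolder/")
  .repartition(8)                    // illustrative partition count

val query = streamingDF.writeStream
  .format("console")
  .start()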

Related

Consuming exactly one file per micro batch in spark stream

I am new to Spark Streaming and am doing a POC to migrate one of our ETL applications from a commercial ETL tool to Spark. We get many small files throughout the day as input and perform the ETL operations on them.
As part of the POC, we are trying out Spark Structured Streaming. Currently, we are stuck on the logic to generate surrogate key (SK) values for each record. We identified ROW_NUMBER as the way to generate the SK, but it doesn't work for a streaming query, as it would have to be a windowed operation.
I am trying to use the record number column (the sequence number of each record inside a file) to generate the SK:
1. Consume exactly one file in each micro-batch.
2. Calculate SK = previous batch's max record number + record number.
3. In the write stream phase, calculate and store the max record number value using the "foreachBatch" option (a sketch of this flow follows the snippets below).
I have used the "maxFilesPerTrigger" option to fetch one file, but it doesn't guarantee exactly one file per micro-batch.
val smsfDF_stream = spark.readStream
  .option("sep", ",")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/edw_ab_data/spark_poc/archive")
  .csv("/edw_ab_data/spark_poc/input")
Also, when I check the max value for each batch DataFrame, I see '9999' as the value:
val fin_query = resinterdf2.writeStream
  .format("csv")
  .foreachBatch( (batchDF: DataFrame, batchId: Long) => {
    batchDF.createOrReplaceTempView("batchdf_tab")
    val file_name = "max_file_" + batchId + ".dat"
    val max_rec_id = batchDF.sparkSession.sql("""select max(rec_number) from batchdf_tab""")
    max_rec_id.write.format("csv").mode("overwrite").save("/edw_ab_data/spark_poc/max_val.dat")
  })
  .option("checkpointLocation", "/edw_ab_data/spark_poc")
  .outputMode("append")
  .option("path", "/edw_ab_data/spark_poc/output")
  .start()
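For reference, here is a minimal sketch of the bookkeeping described in the steps above: keep the running maximum in a driver-side variable and update it inside foreachBatch. The column name rec_number and the paths come from the snippets above; note that the variable is not restored from the checkpoint after a restart, so this is illustrative rather than production-ready.

import org.apache.spark.sql.{DataFrame, functions => F}

var previousMax: Long = 0L  // running max of rec_number; lives on the driver, not checkpointed

val skQuery = smsfDF_stream.writeStream
  .foreachBatch( (batchDF: DataFrame, batchId: Long) => {
    if (!batchDF.isEmpty) {
      // SK = previous batch's max record number + this record's number
      batchDF
        .withColumn("sk", F.col("rec_number").cast("long") + F.lit(previousMax))
        .write.mode("append").csv("/edw_ab_data/spark_poc/output")

      // Remember this batch's maximum for the next micro-batch
      val batchMax = batchDF.agg(F.max(F.col("rec_number").cast("long"))).head.getLong(0)
      previousMax = math.max(previousMax, batchMax)
    }
  })
  .option("checkpointLocation", "/edw_ab_data/spark_poc")
  .start()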

Data streamed from Kafka to Postgres and missing seconds later

I am trying to save data from a local Kafka instance to a local Postgres with Spark Streaming. I have configured all connections and parameters, and data actually gets to the database. However, it is there only for a couple of seconds. After that, the table simply becomes empty. If I stop the app as soon as there is some data in Postgres, the data persists, so I suppose I have missed some parameter for streaming in Spark or something in the Kafka configuration files. The code is in Java, not Scala, so there is a Dataset instead of a DataFrame.
I tried setting spark.driver.allowMultipleContexts to true, but this has nothing to do with contexts. When I run a count on the database while the complete dataset is streaming in the background, there are always about 1700 records, which suggests there might be some parameter for batch size.
censusRecordJavaDStream.map(e -> {
    Row row = RowFactory.create(e.getAllValues());
    return row;
}).foreachRDD(rdd -> {
    Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
    censusDataSet
        .write()
        .mode(SaveMode.Overwrite)
        .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
});
My goal is to stream data from Kafka and save it to Postgres. Each record has a unique ID, which is used as a key in Kafka, so there should be no conflicts regarding primary keys or double entries. For current testing purposes, I am using a small subset of about 100 records; the complete dataset is over 300 MB.
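One detail in the snippet above would explain the symptom, though this is an educated guess rather than a confirmed diagnosis: SaveMode.Overwrite drops and recreates the target table on every write, so each micro-batch wipes the rows written by the previous ones. A minimal sketch in Scala of the same per-batch write with append semantics (rowStream stands in for a Scala DStream[Row] equivalent of the mapped censusRecordJavaDStream; the table name, JDBC URL and connectionProperties are taken from the question):

import org.apache.spark.sql.SaveMode

rowStream.foreachRDD { rdd =>
  // Build a DataFrame for this micro-batch and append it instead of overwriting
  val censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType())
  censusDataSet.write
    .mode(SaveMode.Append)   // Overwrite would drop and recreate the table each batch
    .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties)
}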

How to count number of rows per window in streaming queries?

Scenario: working with Spark Structured Streaming. I have to implement an "info" dataset about how many rows I've processed in the last "window".
A little bit of code.
val invalidData: Dataset[String] =
  parsedData.filter(record => !record.isValid).map(record => record.rawInput)

val validData: Dataset[FlatOutput] = parsedData
  .filter(record => record.isValid)
I have two Datasets. But since I'm working on streaming data I cannot perform a .count (error raised: Queries with streaming sources must be executed with writeStream.start()). So I tried:
val infoDataset = validData
  .select(count("*") as "valid")
but a new error occurs: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark. I don't want to set outputMode to complete, since I don't want the total count from the beginning, just the last "windowed" batch.
Unfortunately I don't have any column that I could register as a watermark for these datasets.
Is there a way to know how many rows are processed in each iteration?
It seems like StreamingQueryStatus could be of some help.
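A minimal sketch of one way to read that number from the progress reporting API instead of from the query itself, using a StreamingQueryListener (the console sink is only there to have a running query):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Each completed trigger reports how many input rows it processed
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"batch=${event.progress.batchId} rows=${event.progress.numInputRows}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

val query = validData.writeStream.format("console").start()
// Alternatively, poll query.lastProgress.numInputRows after each trigger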

How to do df.rdd or df.collect().foreach on streaming dataset?

This is the exception I get whenever I try to convert it:
val df_col = df.select("ts.user.friends_count").collect.map(_.toSeq)
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
All I am trying to do is replicate the following sql.dataframe operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
collect is a big no-no even in Spark Core's RDD world, due to the size of the data you may transfer back to the driver's single JVM. It also marks the boundary of Spark's benefits, since after collect you are in a single JVM.
With that said, think about unbounded data, i.e. a data stream, that will never terminate. That's Spark Structured Streaming.
A streaming Dataset is one that is never complete and the data inside varies every time you ask for the content, i.e. the result of executing the structured query over a stream of data.
You simply cannot say "Hey, give me the data that is the content of a streaming Dataset". That does not even make sense.
That's why you cannot collect on a streaming dataset. It is not possible up to Spark 2.2.1 (the latest version at the time of this writing).
If you want to receive the data that is inside a streaming dataset for a period of time (aka batch interval in Spark Streaming or trigger in Spark Structured Streaming) you write the result to a streaming sink, e.g. console.
You can also write your own custom streaming sink that does collect.map(_.toSeq) inside addBatch, which is the main and only method of a streaming sink. As a matter of fact, the console sink does exactly that.
All I am trying to do is replicate the following sql.dataframe
operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
The very first solution that comes to my mind is to use foreach sink:
The foreach operation allows arbitrary operations to be computed on the output data.
That, of course, does not mean that this is the best solution. Just one that comes to my mind immediately.
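For completeness, a minimal sketch of what a foreach sink looks like in the Spark 2.x API (droolsCaseClass and the Row field positions are carried over from the question):

import org.apache.spark.sql.{ForeachWriter, Row}

val query = df.writeStream
  .foreach(new ForeachWriter[Row] {
    // Called once per partition and trigger; return true to process the partition
    override def open(partitionId: Long, version: Long): Boolean = true

    // Called for every row produced by the trigger
    override def process(row: Row): Unit =
      droolsCaseClass(row.getLong(0), row.getString(1))

    // Called when processing of the partition finishes, normally or with an error
    override def close(errorOrNull: Throwable): Unit = ()
  })
  .start()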

Spark streaming: Cache DStream results across batches

Using Spark Streaming (1.6) I have a file stream for reading lookup data with a 2-second batch size; however, files are copied to the directory only every hour.
Once there's a new file, its content is read by the stream. This is what I want to cache in memory and keep there until new files are read.
There's another stream to which I want to join this dataset, which is why I'd like to cache it.
This is a follow-up question to Batch lookup data for Spark streaming.
The answer does work fine with updateStateByKey; however, I don't know how to deal with cases where a KV pair is deleted from the lookup files, as the sequence of values in updateStateByKey keeps growing.
Also, any hint on how to do this with mapWithState would be great.
This is what I tried so far, but the data doesn't seem to be persisted:
val dictionaryStream = ssc.textFileStream("/my/dir")

dictionaryStream.foreachRDD { x =>
  if (!x.partitions.isEmpty) {
    x.unpersist(true)
    x.persist()
  }
}
DStreams can be persisted directly using the persist method, which persists every RDD in the stream:
dictionaryStream.persist
According to the official documentation this is applied automatically for
window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey
so there should be no need for explicit caching in your case. Also, there is no need for manual unpersisting. To quote the docs once again:
by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared
and a retention period is tuned automatically based on the transformations which are used in the pipeline.
Regarding mapWithState, you'll have to provide a StateSpec. A minimal example requires a function which takes the key, an Option of the current value, and the previous state. Let's say you have a DStream[(String, Double)] and you want to record the maximum value seen so far:
val state = StateSpec.function(
  (key: String, current: Option[Double], state: State[Double]) => {
    val max = Math.max(
      current.getOrElse(Double.MinValue),
      state.getOption.getOrElse(Double.MinValue)
    )
    state.update(max)
    (key, max)
  }
)
val inputStream: DStream[(String, Double)] = ???
inputStream.mapWithState(state).print()
It is also possible to provide an initial state and a timeout interval, and to capture the current batch time. The last two can be used to implement a removal strategy for keys which haven't been updated for some period of time.
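As a rough illustration of that last point (a sketch, not something tested against your pipeline; the 30-minute timeout and the seeded key are arbitrary), the same StateSpec can be given an initial state and a timeout, and the mapping function can detect keys that are timing out:

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Seed state for keys known up front (arbitrary example key)
val initialState = ssc.sparkContext.parallelize(Seq(("known-key", 0.0)))

val trackMax = (key: String, current: Option[Double], state: State[Double]) => {
  if (state.isTimingOut()) {
    // Key has not been updated within the timeout; its state is about to be dropped
    (key, state.get)
  } else {
    val max = Math.max(
      current.getOrElse(Double.MinValue),
      state.getOption.getOrElse(Double.MinValue)
    )
    state.update(max)
    (key, max)
  }
}

val spec = StateSpec
  .function(trackMax)
  .initialState(initialState)   // pre-populate state for known keys
  .timeout(Minutes(30))         // evict keys idle for 30 minutes

inputStream.mapWithState(spec).print()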