Consuming exactly one file per micro-batch in Spark Streaming - Scala

I am new to Spark Streaming and am doing a POC to migrate one of our ETL applications from a commercial ETL tool to Spark. We receive many small files as input throughout the day and run ETL operations on them.
As part of the POC, we are trying out Spark Structured Streaming. Currently, we are stuck on the logic to generate surrogate key (SK) values for each record. We identified ROW_NUMBER as the way to generate the SK, but it doesn't work on a streaming DataFrame, as it has to be a windowed operation.
I am trying to use the record number column (the sequence number of each record inside the files) to generate the SK:
1. Consume exactly one file per micro-batch.
2. Calculate SK = previous batch's max record number + record number.
3. In the writeStream phase, calculate and store the max record number using the "foreachBatch" option.
I have used the "maxFilesPerTrigger" option to fetch one file, but it doesn't guarantee exactly one file per micro-batch.
val smsfDF_stream = spark.readStream
  .option("sep", ",")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/edw_ab_data/spark_poc/archive")
  .csv("/edw_ab_data/spark_poc/input")
Also, when I check the max value for each batch DataFrame, I see '9999' as the value.
val fin_query = resinterdf2.writeStream
  .format("csv")
  .foreachBatch( (batchDF: DataFrame, batchId: Long) => {
    batchDF.createOrReplaceTempView("batchdf_tab")
    val file_name = "max_file_" + batchId + ".dat"
    val max_rec_id = batchDF.sparkSession.sql("""select max(rec_number) from batchdf_tab""")
    max_rec_id.write.format("csv").mode("overwrite").save("/edw_ab_data/spark_poc/max_val.dat")
  })
  .option("checkpointLocation", "/edw_ab_data/spark_poc")
  .outputMode("append")
  .option("path", "/edw_ab_data/spark_poc/output")
  .start()
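For reference, a minimal sketch of the SK logic described above, assuming the same spark session, schema and rec_number column as in the code above (paths, the output format and the partition/column names are placeholders). The running max is kept in a driver-side variable here; a production version would persist it (for example to the max_file written in foreachBatch) so it survives restarts:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, max}

// Running max kept on the driver; foreachBatch processes micro-batches
// sequentially for a single query, so updating it here is safe.
var previousMax: Long = 0L

val skQuery = spark.readStream
  .option("sep", ",")
  .schema(schema)                          // schema from the question
  .option("maxFilesPerTrigger", 1)         // at most one new file per micro-batch
  .csv("/edw_ab_data/spark_poc/input")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // SK = previous batch's max SK + this row's record number
    val withSk = batchDF.withColumn("sk", col("rec_number").cast("long") + lit(previousMax))
    withSk.write.mode("append").csv("/edw_ab_data/spark_poc/output")

    // Carry the highest SK forward to the next batch (guarding against empty batches).
    val maxRow = withSk.agg(max("sk")).head()
    if (!maxRow.isNullAt(0)) previousMax = math.max(previousMax, maxRow.getLong(0))
  }
  .option("checkpointLocation", "/edw_ab_data/spark_poc/checkpoint")
  .start()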

Related

Parallelism in Cassandra read using Scala

I am trying to invoke parallel reads from a Cassandra table using Spark, but I am not able to achieve parallelism, as only one read happens at any given time. What approach should be followed to achieve this?
I'd recommend you go with the approach below (source: Russell Spitzer's blog).
Manually dividing our partitions using a union of partial scans:
Pushing the task to the end user is also a possibility (and the current workaround). Most end users already understand why they have long partitions and know in general the domain their column values fall in. This makes it possible for them to manually divide up a request so that it chops up large partitions.
For example, assuming the user knows the clustering column c spans from 1 to 1000000, they could write code like:
val minRange = 0
val maxRange = 1000000
val numSplits = 10
val subSize = (maxRange - minRange) / numSplits

sc.union(
  (minRange to maxRange by subSize)
    .map(start =>
      sc.cassandraTable("ks", "tab")
        .where(s"c > $start and c < ${start + subSize}"))
)
Each RDD would contain a unique set of tasks drawing only portions of full partitions. The union operation joins all those disparate tasks into a single RDD. The maximum number of rows any single Spark partition would draw from a single Cassandra partition would be limited to maxRange / numSplits. This approach, while requiring user intervention, would preserve locality and would still minimize the jumps between disk sectors.
Also see the read-tuning-parameters documentation.

Data streamed from Kafka to Postgres goes missing seconds later

I am trying to save data from a local Kafka instance to a local Postgres with Spark Streaming. I have configured all connections and parameters, and the data actually gets to the database. However, it is there only for a couple of seconds. After that, the table simply becomes empty. If I stop the app as soon as there is some data in Postgres, the data persists, so I suppose I have missed some parameter for streaming in Spark or something in the Kafka configuration files. The code is in Java, not Scala, so there is a Dataset instead of a DataFrame.
I tried setting spark.driver.allowMultipleContexts to true, but this had nothing to do with it. When I run a count on the database while the complete data set is streaming in the background, there are always about 1700 records, which suggests there might be some parameter for batch size.
censusRecordJavaDStream.map(e -> {
    Row row = RowFactory.create(e.getAllValues());
    return row;
}).foreachRDD(rdd -> {
    Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
    censusDataSet
        .write()
        // SaveMode.Overwrite replaces the target table's contents on every micro-batch
        .mode(SaveMode.Overwrite)
        .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
});
My goal is to stream data from Kafka and save it to Postgres. Each record has a unique ID, which is used as the key in Kafka, so there should be no conflicts regarding the primary key or duplicate entries. For current testing purposes, I am using a small subset of about 100 records; the complete dataset is over 300 MB.

How to count number of rows per window in streaming queries?

Scenario: working on Spark Structured Streaming. I have to implement an "info" dataset about how many rows I've processed in the last "window".
A little bit of code:
val invalidData: Dataset[String] = parsedData
  .filter(record => !record.isValid)
  .map(record => record.rawInput)

val validData: Dataset[FlatOutput] = parsedData
  .filter(record => record.isValid)
I have two Datasets, but since I'm working on streaming I cannot perform a .count (the error raised is: "Queries with streaming sources must be executed with writeStream.start()"). I then tried:
val infoDataset = validData
  .select(count("*") as "valid")
but a new error occurs: "Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark". I don't want to set outputMode to complete, since I don't want the total count from the beginning but just the last "windowed" batch.
Unfortunately, I don't have any column that I could register as a watermark for these datasets.
Is there a way to know how many rows are processed in each iteration?
It seems like StreamingQueryStatus could be of some help.
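A minimal sketch of that idea, assuming the progress API (StreamingQueryListener / lastProgress) rather than anything from the question itself; in recent Spark versions numInputRows on the query progress reports how many rows were processed in each trigger:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Print how many input rows each micro-batch processed.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // numInputRows = rows processed in that trigger across all sources
    println(s"batch ${event.progress.batchId}: ${event.progress.numInputRows} rows")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

// Or poll a running query directly (lastProgress is null before the first trigger):
// val query = validData.writeStream.format("console").start()
// Option(query.lastProgress).foreach(p => println(p.numInputRows))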

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming in Spark. Structured Streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

val csvSchema = new StructType()
  .add("street", "string").add("city", "string")
  .add("zip", "string").add("state", "string").add("beds", "string")
  .add("baths", "string").add("sq__ft", "string").add("type", "string")
  .add("sale_date", "string").add("price", "string").add("latitude", "string")
  .add("longitude", "string")

val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")

val query = streamingDF.writeStream
  .format("console")
  .start()
What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In the Structured Streaming model, this is how the execution of this query is performed.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM, since the RDD (DF/DS) is lazily initialized. Of course, you need to repartition before processing to ensure an equal number of partitions and that the data is spread across executors uniformly (see the sketch below).
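As a rough illustration of both points, assuming the csvSchema and folder from the question (the maxFilesPerTrigger value and the partition count are arbitrary): the file source option caps how many new files are picked up per trigger, and a repartition spreads each micro-batch across executors:

// Cap how many new files the file source reads per trigger, then repartition
// so each micro-batch is spread evenly across executors.
val boundedStreamingDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 1)   // at most one new file per micro-batch
  .csv("./csvFolder/")

val boundedQuery = boundedStreamingDF
  .repartition(8)                    // illustrative partition count
  .writeStream
  .format("console")
  .start()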

Spark streaming: Cache DStream results across batches

Using Spark Streaming (1.6), I have a file stream for reading lookup data with a 2-second batch size; however, files are copied to the directory only every hour.
Once there's a new file, its content is read by the stream; this is what I want to cache into memory and keep there until new files are read.
There's another stream to which I want to join this dataset, therefore I'd like to cache it.
This is a follow-up question of Batch lookup data for Spark streaming.
The answer does work fine with updateStateByKey; however, I don't know how to deal with cases when a KV pair is deleted from the lookup files, as the sequence of values in updateStateByKey keeps growing.
Also, any hint on how to do this with mapWithState would be great.
This is what I have tried so far, but the data doesn't seem to be persisted:
val dictionaryStream = ssc.textFileStream("/my/dir")

dictionaryStream.foreachRDD { x =>
  if (!x.partitions.isEmpty) {
    x.unpersist(true)
    x.persist()
  }
}
DStreams can be persisted directly using the persist method, which persists every RDD in the stream:
dictionaryStream.persist
According to the official documentation, this is applied automatically for
window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey,
so there should be no need for explicit caching in your case. There is also no need for manual unpersisting. To quote the docs once again:
by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared
and a retention period is tuned automatically based on the transformations which are used in the pipeline.
Regarding mapWithState, you'll have to provide a StateSpec. A minimal example requires a function which takes a key, an Option of the current value, and the previous state. Let's say you have a DStream[(String, Double)] and you want to record the maximum value seen so far:
val state = StateSpec.function(
  (key: String, current: Option[Double], state: State[Double]) => {
    val max = Math.max(
      current.getOrElse(Double.MinValue),
      state.getOption.getOrElse(Double.MinValue)
    )
    state.update(max)
    (key, max)
  }
)

val inputStream: DStream[(String, Double)] = ???
inputStream.mapWithState(state).print()
It is also possible to provide an initial state, a timeout interval, and to capture the current batch time. The last two can be used to implement a removal strategy for keys which haven't been updated for some period of time, as sketched below.
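For example, a sketch building on the snippet above (the 30-minute timeout is arbitrary): the four-argument variant of StateSpec.function also receives the batch Time, and timeout marks keys that have been idle for longer than the given duration as timing out so their state is dropped:

import org.apache.spark.streaming.{Durations, State, StateSpec, Time}

val specWithTimeout = StateSpec.function(
  (batchTime: Time, key: String, current: Option[Double], state: State[Double]) => {
    if (state.isTimingOut()) {
      // Key was idle past the timeout; its state is being removed, so don't update it.
      None
    } else {
      val max = Math.max(
        current.getOrElse(Double.MinValue),
        state.getOption.getOrElse(Double.MinValue)
      )
      state.update(max)
      Some((key, max))
    }
  }
).timeout(Durations.minutes(30))

inputStream.mapWithState(specWithTimeout).print()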