spark read stream from parquet source - scala

job1 -> parquet files [inputPath] -> job2 -> delta table [outputPath]
job1 writes a stream to parquet files that are up to 3 days old.
I want to start job2 so that it begins from the latest streamed files and processes only the latest ones, i.e. if I start it now it should not try to process data older than the last hour.
Right now I do:
val outDF = spark
  .readStream
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "10000")
  .schema(Encoders.product[InputData].schema)
  .parquet(inputPath)
  .transform(Processor().process)

StreamDeltaWriter
  .write(outDF, outputPath, partitioningCols, checkpointLocation)
  .awaitTermination()
The issue I have is that it starts with the latest files and then processes backwards, so it never gets to newly arriving files because it tries to work through all the historical files first.
How can I solve this?
EDIT:
According to the documentation, if latestFirst is set together with maxFilesPerTrigger, then maxFileAge is ignored.
Would maxFileAge work if I set latestFirst alone, without maxFilesPerTrigger?
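A minimal sketch of what I am considering, building on the snippet above (the "1h" value is illustrative):
val outDF = spark
  .readStream
  .option("latestFirst", "true")
  .option("maxFileAge", "1h") // age is measured relative to the newest file in the directory
  .schema(Encoders.product[InputData].schema)
  .parquet(inputPath)
  .transform(Processor().process)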

Related

Spark streaming sourceArchiveDir does not move file to archive dir

How do I move source CSV files into an archive directory using "sourceArchiveDir" and "cleanSource=archive"? I am running the code below, but it does not move the source file; the stream processing itself works fine, i.e. it prints the source file content to the console.
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val inputPath =
  "/<here is an absolute path to my project dir>/data/input/spark_full_delta/2021-06-21"

spark
  .readStream
  .format("csv")
  .schema(jsonSchema)
  .option("pathGlobFilter", "customers_*2021-06-21.csv")
  .option(
    "sourceArchiveDir",
    "/<here is an absolute path to my project dir>/data/archive")
  .option("cleanSource", "archive")
  .option("latestFirst", "false")
  .option("spark.sql.streaming.fileSource.cleaner.numThreads", "2")
  .option("header", "true")
  .load(inputPath)
  .withColumn("date", lit("2021-06-21"))
  .writeStream
  .outputMode(OutputMode.Append)
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .format("console")
  .start()
StructSchema for reference:
scala> jsonSchema
res54: org.apache.spark.sql.types.StructType = StructType(
StructField(customerId,IntegerType,true),
StructField(name,StringType,true),
StructField(country,StringType,true),
StructField(date,DateType,false))
Documentation reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets. Scroll down to the table of sources and their options.
File source archiving is based on several more internal Spark options, which you can try to change (but do not have to) for debugging purposes, to speed up the archiving of source files:
spark
  .readStream
  .format("csv")
  .schema(jsonSchema)
  // Number of log files after which all the previous files
  // are compacted into the next log file.
  .option("spark.sql.streaming.fileSource.log.compactInterval", "1")
  // How long in milliseconds a file is guaranteed to
  // be visible for all readers.
  .option("spark.sql.streaming.fileSource.log.cleanupDelay", "1")
  .option(
    "sourceArchiveDir",
    "/<here is an absolute path to my project dir>/data/archive")
  .option("cleanSource", "archive")
  ...
Then try to add more files to the source path. Spark should move files it has already seen in previous micro-batches to "sourceArchiveDir".
Please note that both options (compactInterval, cleanupDelay) are Spark internal options, which may change in the future without any notice. Default values as of Spark 3.2.0-SNAPSHOT:
spark.sql.streaming.fileSource.log.compactInterval: 10
spark.sql.streaming.fileSource.log.cleanupDelay: 10 minutes
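If you prefer not to repeat these per reader, the same keys should also be settable as session-level SQL confs before starting the stream (a sketch, not verified on every Spark version):
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "1")
spark.conf.set("spark.sql.streaming.fileSource.log.cleanupDelay", "1")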

spark structured streaming file source read from a certain partition onwards

I have a folder on HDFS like below containing ORC files:
/path/to/my_folder
It contains partitions:
/path/to/my_folder/dt=20190101
/path/to/my_folder/dt=20190102
/path/to/my_folder/dt=20190103
...
Now I need to process the data here using streaming.
A spark.readStream.format("orc").load("/path/to/my_folder") nicely works.
However, I do not want to process the whole table, but rather start only from a certain partition onwards, similar to a certain Kafka offset.
How can this be implemented? I.e. how can I specify the initial state where to read from.
Spark Structured Streaming File Source Starting Offset claims that there is no such feature.
Their suggestion to use latestFirst is not desirable for my use case, as I do not aim to build an always-on streaming application, but rather to use Trigger.Once like a batch job, with the nice streaming semantics of duplicate reduction and handling of late-arriving data.
If this is not available, what would be a suitable workaround?
edit
Run a warm-up stream with option("latestFirst", true) and
option("maxFilesPerTrigger", "1") with a checkpoint, a dummy sink and a huge
processing time. This way, the warm-up stream will save the latest file
timestamp to the checkpoint.
Run the real stream with option("maxFileAge", "0") and the real sink, using the
same checkpoint location. In this case the stream will process only newly
available files.
https://stackoverflow.com/a/51399134/2587904
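A rough Scala sketch of that two-stream suggestion as I understand it (paths, sink choices and schema handling are illustrative; not verified end to end):
// Warm-up stream: latestFirst plus maxFilesPerTrigger=1 so the checkpoint
// records the latest file timestamp. Assumes spark.sql.streaming.schemaInference=true,
// as in the question; a dummy sink such as console is enough.
val warmUp = spark.readStream
  .format("orc")
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "1")
  .load("/path/to/my_folder")
  .writeStream
  .format("console")                                    // dummy sink
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
// ...after one micro-batch has committed:
warmUp.stop()

// Real stream: same checkpoint location; maxFileAge=0 so only newly available
// files are processed, as the quoted answer describes.
val real = spark.readStream
  .format("orc")
  .option("maxFileAge", "0")
  .load("/path/to/my_folder")
  .writeStream
  .format("console")                                    // replace with the real sink
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()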
Building from this idea, let's look at an example:
# in bash
rm -rf data
mkdir -p data/dt=20190101
echo "1,1,1" >> data/dt=20190101/1.csv
echo "1,1,2" >> data/dt=20190101/2.csv
mkdir data/dt=20190102
echo "1,2,1" >> data/dt=20190102/1.csv
echo "1,2,2" >> data/dt=20190102/2.csv
mkdir data/dt=20190103
echo "1,3,1" >> data/dt=20190103/1.csv
echo "1,3,2" >> data/dt=20190103/2.csv
mkdir data/dt=20190104
echo "1,4,1" >> data/dt=20190104/1.csv
echo "1,4,2" >> data/dt=20190104/2.csv
spark-shell --conf spark.sql.streaming.schemaInference=true
// from now on in scala
val df = spark.readStream.csv("data")
df.printSchema
val query = df.writeStream.format("console").start
query.stop
// cleanup the data and start from scratch.
// this time instead of outputting to the console, write to file
val query = df.writeStream.format("csv")
.option("path", "output")
.option("checkpointLocation", "checkpoint")
val started = query.start
// in bash
# generate new data
mkdir data/dt=20190105
echo "1,5,1" >> data/dt=20190105/1.csv
echo "1,5,2" >> data/dt=20190105/2.csv
echo "1,4,3" >> data/dt=20190104/3.csv
// in scala
started.stop
// cleanup the output, start later on with custom checkpoint
//bash: rm -rf output/*
val started = query.start
// bash
echo "1,4,3" >> data/dt=20190104/4.csv
started.stop
// *****************
//bash: rm -rf output/*
Everything works as intended. The operation picks up where the checkpoint marks the last processed file.
How can a checkpoint definition be generated by hand such that all files in dt=20190101 and dt=20190102 are considered processed, no late-arriving data is tolerated there anymore, and processing continues with all the files from dt=20190103 onwards?
I see that spark generates:
commits
metadata
offsets
sources
_spark-metadata
files and folders.
So far I only know that _spark-metadata can be ignored to set an initial state / checkpoint.
But I have not yet figured out (from the other files) which minimal values need to be present so that processing picks up from dt=20190103 onwards.
edit 2
By now I know that:
commits/0 needs to be present
metadata needs to be present
offsets needs to be present
but can be very generic:
v1
{"batchWatermarkMs":0,"batchTimestampMs":0,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
When I tried to remove one of the already processed and committed files from sources/0/0, the query still runs, but it processes not only data newer than the existing committed data, but any data - in particular, the files I had just removed from the log.
How can I change this behavior to only process data more current than the initial state?
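For reference, my understanding from reading FileStreamSourceLog is that sources/0/0 contains a version header plus one JSON entry per file the source has seen, roughly like this (values purely illustrative and possibly version-dependent):
v1
{"path":"file:///path/to/my_folder/dt=20190101/part-00000.orc","timestamp":1546300800000,"batchId":0}
{"path":"file:///path/to/my_folder/dt=20190102/part-00000.orc","timestamp":1546387200000,"batchId":1}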
edit 3
The docs (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html) and also the javadocs ;) describe getOffset:
The maximum offset (getOffset) is calculated by fetching all the files
in path excluding files that start with _ (underscore).
That sounds interesting, but so far I have not figured out how to use it to solve my problem.
Is there a simpler way to achieve the desired functionality besides creating a custom (copy) of the FileSource?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L237
maxFileAge also sounds interesting.
I have started to work on a custom file stream source, but I fail to properly instantiate it: https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1
When calling:
val df = spark.readStream.format("org.apache.spark.sql.execution.streaming.StatefulFileStreamSource")
.option("partitionState", "/path/to/data/dt=20190101")
.load("data")
The operation fails with:
InstantiationException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource
at java.lang.Class.newInstance(Class.java:427)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:196)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 53 elided
Caused by: java.lang.NoSuchMethodException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 59 more
Even though it is basically a copy of the original source, what is different? Why is the constructor not found from https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L196, while it works just fine for https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L42?
Even:
touch -t 201801181205.09 data/dt=20190101/1.csv
touch -t 201801181205.09 data/dt=20190101/2.csv
val df = spark.readStream
.option("maxFileAge", "2d")
.csv("data")
returns the whole dataset and fails to filter down to the most recent k days.

How to overwrite a partition in apache spark 2.3 while still writing to parquet with insertInto method

I saw this example code that overwrites a partition in Spark 2.3 really nicely:
dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)
My issue is that even after adding .format("parquet"), the output is not being written as Parquet but rather as .c000 files.
The compaction and the overwriting of the partition are working, but not the writing as Parquet.
Full code here:
val sparkSession = SparkSession.builder //.master("local[2]")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("parquet.compression", "snappy")
  .enableHiveSupport() //can just comment out hive support
  .getOrCreate
sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")

val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
//to compact yesterdays partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong

val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")

if (!dfPartition.take(1).isEmpty) {
  sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
  sparkSession.sql(s"msck repair table $tblName")
  Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
} else {
  "echo invalid partition"
}
Here is the question where I got the suggestion to use this code: Overwrite specific partitions in spark dataframe write method.
What I like about this method is not having to list the partition columns, which is really nice. I can easily use it in many cases.
Using Scala 2.11, CDH 5.12, Spark 2.3.
Any suggestions?
The extension .c000 relates to the executor that wrote the file, not to the actual file format. The file could be Parquet and end with .c000, or .snappy, or .zip... To know the actual file format, run this command:
hadoop dfs -cat /tmp/filename.c000 | head
where /tmp/filename.c000 is the HDFS path to your file. You will see some strange symbols, and you should see "parquet" in there somewhere if it is actually a Parquet file.
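Another quick check, staying inside Spark (a sketch; /tmp/filename.c000 is the same placeholder path as above):
spark.read.parquet("/tmp/filename.c000").printSchema()
If the file really is Parquet, this prints its schema; otherwise the reader fails with a format error.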

How to continuously monitor a directory by using Spark Structured Streaming

I want Spark to continuously monitor a directory and read the CSV files using spark.readStream as soon as a file appears in that directory.
Please don't include a Spark Streaming (DStream) solution; I am looking for a way to do it with Spark Structured Streaming.
Here is the complete solution for this use case:
If you are running in standalone mode, you can increase the driver memory:
bin/spark-shell --driver-memory 4G
There is no need to set the executor memory, as in standalone mode the executor runs within the driver.
Completing the solution of @T.Gaweda, find the solution below:
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")

csvDF.writeStream.format("console").option("truncate", "false").start()
Now Spark will continuously monitor the specified directory, and as soon as you add any CSV file to it, your DataFrame operations on csvDF will be executed against that file.
Note: if you want Spark to infer the schema, you first have to set the following configuration:
spark.sqlContext.setConf("spark.sql.streaming.schemaInference", "true")
where spark is your Spark session.
As written in the official documentation, you should use the "file" source:
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Code example taken from documentation:
// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
If you don't specify a trigger, Spark will read new files as soon as possible.
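If you do want to control how often the directory is polled, you can add a trigger on the writer, for example (a minimal sketch):
import org.apache.spark.sql.streaming.Trigger

csvDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")) // poll for new files every 10 seconds
  .start()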

What is correct directory path format on Windows for StreamingContext.textFileStream?

I am trying to execute a Spark Streaming application that processes a stream of text files to perform a word count.
The directory I am reading from is on Windows. As shown, I am using a local directory like "Users/Name/Desktop/Stream". It is not HDFS.
I created a folder named "Stream" on the desktop.
I started the Spark Streaming application and after that I added some text files into the 'Stream' folder, but my Spark application is not able to read them. It always gives empty results.
Here is my code.
//args(0) = local[2]
object WordCount {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(args(0), "word_count", Seconds(5))
    val lines = ssc.textFileStream("Users/name/Desktop/Stream")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Output: Getting empty data every 5 seconds
17/05/18 07:35:00 INFO Executor: Running task 0.0 in stage 71.0 (TID 35)
-------------------------------------------
Time: 1495107300000 ms
-------------------------------------------
I tried giving the path as C:/Users/name/Desktop/Stream as well - still the same issue; the application could not read the files.
Can anyone please tell me if I am giving an incorrect directory path?
Your code is fine, so the only issue is to use the proper path to the directory. Please use the file:// prefix to denote the local file system, which would give file://C:/Users/name/Desktop/Stream.
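Applied to your code, that would be (path form as suggested; I have not verified it on a Windows machine):
val lines = ssc.textFileStream("file://C:/Users/name/Desktop/Stream")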
Let's go one step at a time to confirm that our understanding is at the same level.
When you execute the Spark Streaming application, create the directory to be monitored in the same directory where you start the application, say Stream. Once you confirm that the application works fine with a local directory, we'll fix it globally to read from any directory on Windows (if that's still needed).
Please also make sure that you "move" your files into the monitored directory, as the operation that creates a file there has to be atomic (partial writes would mark the file as processed - see StreamingContext).
Files must be written to the monitored directory by "moving" them from another location within the same file system.
As you can see in the code, the directory path will eventually be "wrapped" using Hadoop's Path, so the issue is to convince it to accept your path:
if (_path == null) _path = new Path(directory)