How to process files using Spark Structured Streaming chunk by chunk? - scala

I am processing a large number of files and want to handle them chunk by chunk: during each batch, I want to process 50 files at a time.
How can I do this using Spark Structured Streaming?
I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that this is possible with Spark Structured Streaming, but I can't find any examples of it.
Thanks a lot.

If using File Source:
maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)
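spark
  .readStream
  .format("json")
  // Streaming file sources need an explicit schema by default; inputSchema is an assumed StructType.
  .schema(inputSchema)
  // Pick up at most 50 new files per micro-batch.
  .option("maxFilesPerTrigger", 50)
  .load("/path/to/files")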
If using a Kafka Source it would be similar but with the maxOffsetsPerTrigger option.
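For the Kafka case, a minimal configuration sketch could look like the following. The broker address, the topic name and the console sink are assumptions for illustration only, and the job needs the spark-sql-kafka-0-10 package on the classpath. Note that maxOffsetsPerTrigger caps the total number of offsets read per trigger across all topic partitions, so it limits records, not files.

```scala
import org.apache.spark.sql.SparkSession

object KafkaRateLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-rate-limit-sketch")
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                       // assumed topic name
      .option("maxOffsetsPerTrigger", 50L)                 // at most 50 offsets per micro-batch
      .load()

    // Kafka keys and values arrive as binary; cast them before use.
    val query = stream
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console") // console sink, for inspection only
      .start()

    query.awaitTermination()
  }
}
```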

Related

spark repartition issue for filesize

I need to merge small parquet files.
I have multiple small parquet files in HDFS.
I would like to combine them into files of roughly 128 MB each.
So I read all the files using spark.read, called repartition() on the result, and wrote it back to an HDFS location.
My issue is this:
I have approximately 7.9 GB of data, but after repartitioning and saving to HDFS it grows to nearly 22 GB.
I have tried repartition, range partitioning, and coalesce, but none of them solved the problem.
I think this may be connected with your repartition operation. You are using .repartition(10), so Spark uses round-robin partitioning to redistribute your data, and the ordering of the rows probably changes. The order of the data matters for compression; you can read more in this question.
To optimize file size, you may try adding a sort, or repartitioning your data by an expression instead of only by a number of partitions.
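As a rough illustration of the sizing part, the helper below estimates how many output files are needed for a given target size. It is a sketch only: the object and function names are hypothetical, and the estimate assumes the written size stays close to the input size, which holds only if compression is not degraded by reordering.

```scala
object FileSizing {
  /** Number of output files so that each is roughly `targetBytes` (rounded up, at least 1). */
  def targetFileCount(totalBytes: Long, targetBytes: Long): Int = {
    require(totalBytes >= 0, "totalBytes must not be negative")
    require(targetBytes > 0, "targetBytes must be positive")
    // Ceiling division, with a floor of one file for empty or tiny inputs.
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }
}
```

For about 7.9 GB and a 128 MB target this gives 64 files. The result could then be passed to something like df.repartition(n, col("someKey")).sortWithinPartitions("someKey") before writing, where someKey is a placeholder for a column whose ordering groups similar values together.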

Spark Structured Streaming checkpoint vs spark context CheckPointDir

Hello Stack Overflow community.
I'm running a Spark streaming app in a production environment, and we noticed that Spark checkpoints contribute greatly to under-replication in HDFS and thus affect HDFS stability. I'm trying to find a proper way to clean up Spark checkpoints regularly rather than deleting them manually from HDFS. I referred to a couple of posts:
Spark Structured Streaming Checkpoint Cleanup and Spark structured streaming checkpoint size huge
What I came up with is to point the Spark checkpoint directory and the Spark Structured Streaming checkpoint directory at the same path and set the cleaning configuration to true. This solution creates one Spark checkpoint per Spark context. I suspect this might conflict with the purpose of checkpointing, but I'm still trying to understand the internals of Spark and would appreciate any guidance here. Below is a snippet of my code:
spark.sparkContext.setCheckpointDir(checkPointLocation)
val options = Map("checkpointLocation" -> s"${spark.sparkContext.getCheckpointDir.get}")
val q = df.writeStream
  .options(options)
  .trigger(trigger)
  .queryName(queryName)

How to write a partitioned parquet file in apache beam java

I am new to Apache Beam and not sure how to accomplish this task.
I want to write a partitioned parquet file using Apache Beam in Java.
Data is read from Kafka, and I want the file to have a new partition every hour. A timestamp column is present in the data.
Try using FixedWindows for that. There is an example of a windowed WordCount that writes every window into a separate text file, so I believe it can be adapted to your case.

Configure Avro file size written to HDFS by Spark

I am writing a Spark dataframe in Avro format to HDFS. I would like to split large Avro files so that they fit into the Hadoop block size without being too small. Are there any dataframe or Hadoop options for that? How can I split the output into smaller files?
Here is how I write the data to HDFS:
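dataDF.write
  .format("avro")
  // parseAvroSchemaFromFile is the asker's own helper; the option expects the schema as a JSON string.
  .option("avroSchema", parseAvroSchemaFromFile("/avro-data-schema.json").toString)
  .save(dataDir)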
After a lot of research, I found that it is not possible to limit the file size directly, only the number of Avro records per file. So the only solution would be to add logic to the application that maps a number of records to a file size.
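The mapping from records to file size described above can be sketched as a small helper. The names here are hypothetical, and the average serialized record size is an assumption that has to be measured or estimated for the actual data (compression changes it).

```scala
object AvroSizing {
  /** Approximate record cap per file so that a file stays near `targetFileBytes`. */
  def recordsPerFile(targetFileBytes: Long, avgRecordBytes: Long): Long = {
    require(targetFileBytes > 0, "targetFileBytes must be positive")
    require(avgRecordBytes > 0, "avgRecordBytes must be positive")
    // Integer division rounds down; always allow at least one record per file.
    math.max(1L, targetFileBytes / avgRecordBytes)
  }
}
```

The result can be supplied to Spark's maxRecordsPerFile write option, for example dataDF.write.format("avro").option("maxRecordsPerFile", n).save(dataDir). That option only caps the upper size of each file; avoiding files that are too small still depends on how many partitions are written.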

Connecting Spark streaming to streamsets input

I was wondering whether it is possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not supported among the StreamSets connector destinations: https://streamsets.com/connectors/ .
I am exploring whether there are other ways to connect them for a sample POC.
The best way to process data coming from StreamSets Data Collector (SDC) in Apache Spark Streaming would be to write the data to a Kafka topic and read it from there. This decouples Spark Streaming from SDC, so each can proceed at its own rate of processing.
SDC microbatches are defined by record count, while Spark Streaming microbatches are dictated by time. This means that an SDC batch may not (and probably will not) correspond to a Spark Streaming batch; most likely a Spark Streaming batch will contain data from several SDC batches. SDC "commits" each batch once it is sent to the destination, so writing batches directly to Spark Streaming would require each SDC batch to correspond to a Spark Streaming batch in order to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches because of processing or node failures. SDC cannot re-process committed batches, so to recover from a situation like this you would really have to write to something like Kafka, which allows you to re-process the batches. A direct connector that writes from SDC to Spark Streaming would therefore be complex and would likely have data loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.