Process last n records in PySpark Kafka streaming

Every five minutes, I need to run some operations on the last 100,000 records (customer bills) of each of several stores. What is the best approach, or which steps should I follow, in PySpark Structured Streaming? The input source is Kafka.
I also have to gradually delete the old records beyond the most recent 100k per store, because I only need the latest 100k records for each store at any given time.
For example, I need to find the details of product 'p1' within the last 100k records of store 'S1', the last 100k records of store 'S2', and so on.

Structured Streaming doesn't have such an option. One workaround is to estimate the number of messages per second, then calculate how far back in time 100k records reach, and finally use startingTimestamp when creating the Kafka source. Please refer to https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-streaming-queries
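A minimal PySpark sketch of that workaround (the broker address, topic name, and the estimated message rate are placeholders, not values from the question):

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("last-100k-bills").getOrCreate()

# Placeholder estimate: if bills arrive at ~1,000 msg/s, then ~100k records
# span roughly the last 100 seconds.
approx_msgs_per_sec = 1000
lookback_ms = int(100_000 / approx_msgs_per_sec * 1000)
starting_ts = int(time.time() * 1000) - lookback_ms  # epoch milliseconds

bills = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
    .option("subscribe", "customer-bills")               # placeholder topic
    # Start reading roughly 100k records back; only applies on the first run
    # of the query (before a checkpoint exists) and needs a Spark/Kafka
    # combination that supports timestamp-based starting offsets.
    .option("startingTimestamp", str(starting_ts))
    .load()
    .selectExpr("CAST(value AS STRING) AS bill_json", "timestamp")
)
```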

Related

Processing upserts on a large number of partitions is not fast enough

The Problem
We have a Delta Lake setup on top of ADLS Gen2 with the following tables:
bronze.DeviceData: partitioned by arrival date (Partition_Date)
silver.DeviceData: partitioned by event date and hour (Partition_Date and Partition_Hour)
We ingest large amounts of data (>600M records per day) from an event hub into bronze.DeviceData (append-only). We then process the new files in a streaming fashion and upsert them into silver.DeviceData with the delta MERGE command (see below).
The data arriving in the bronze table can contain data from any partition in silver (e.g. a device may send historic data that it cached locally). However, >90% of the data arriving at any day is from partitions Partition_Date IN (CURRENT_DATE(), CURRENT_DATE() - INTERVAL 1 DAYS, CURRENT_DATE() + INTERVAL 1 DAYS). Therefore, to upsert the data, we have the following two spark jobs:
"Fast": processes the data from the three date partitions above. The latency is important here, so we prioritize this data
"Slow": processes the rest (anything but these three date partitions). The latency doesn't matter so much, but it should be within a "reasonable" amount of time (not more than a week I'd say)
Now we come to the problem: although the amount of data is orders of magnitude smaller in the "slow" job, it runs for days on a big cluster just to process a single day of slow bronze data. The reason is simple: it has to read and update many silver partitions (>1000 date partitions at times), and since the updates are small but the date partitions can be gigabytes, these merge commands are inefficient.
Furthermore, as time goes on, this slow job will become slower and slower, since the silver partitions it touches will grow.
Questions
Is our partitioning scheme and the fast/slow Spark job setup generally a good way to approach this problem?
What could be done to improve this setup? We would like to reduce the cost and latency of the slow job, and find a way for it to scale with the amount of data arriving in bronze on any given day rather than with the size of the silver table.
Additional Infos
we need the MERGE command, as certain upstream services can re-process historic data, which should then update the silver table as well
the schema of the silver table:
CREATE TABLE silver.DeviceData (
DeviceID LONG NOT NULL, -- the ID of the device that sent the data
DataType STRING NOT NULL, -- the type of data it sent
Timestamp TIMESTAMP NOT NULL, -- the timestamp of the data point
Value DOUBLE NOT NULL, -- the value that the device sent
UpdatedTimestamp TIMESTAMP NOT NULL, -- the timestamp when the value arrived in bronze
Partition_Date DATE NOT NULL, -- = TO_DATE(Timestamp)
Partition_Hour INT NOT NULL -- = HOUR(Timestamp)
)
USING DELTA
PARTITIONED BY (Partition_Date, Partition_Hour)
LOCATION '...'
our MERGE command:
val silverTable = DeltaTable.forPath(spark, silverDeltaLakeDirectory)
val batch = ... // the streaming update batch
// the dates and hours that we want to upsert, for partition pruning
// collected from the streaming update batch
val dates = "..."
val hours = "..."
val mergeCondition = s"""
silver.Partition_Date IN ($dates)
AND silver.Partition_Hour IN ($hours)
AND silver.Partition_Date = batch.Partition_Date
AND silver.Partition_Hour = batch.Partition_Hour
AND silver.DeviceID = batch.DeviceID
AND silver.Timestamp = batch.Timestamp
AND silver.DataType = batch.DataType
"""
silverTable.alias("silver")
.merge(batch.alias("batch"), mergeCondition)
// only merge if the event is newer
.whenMatched("batch.UpdatedTimestamp > silver.UpdatedTimestamp").updateAll
.whenNotMatched.insertAll
.execute
On Databricks, there are several ways to optimize the performance of the MERGE INTO operation:
Perform OPTIMIZE with ZORDER on the columns that are part of the join condition. This may depend on the specific DBR version: older versions (prior to 7.6, IIRC) used a real Z-order algorithm, which works well for a small number of columns, while DBR 7.6+ uses Hilbert space-filling curves by default.
Use smaller file sizes. By default, OPTIMIZE creates files of 1 GB that then need to be rewritten; you can use spark.databricks.delta.optimize.maxFileSize to set the file size to the 32-64 MB range so less data is rewritten.
Use conditions on the partitions of the table (you're already doing that).
Don't use auto-compaction, because it can't do ZORDER; run an explicit OPTIMIZE with ZORDER instead. See the documentation for details.
Tune column indexing so that statistics are collected only for the columns required by your condition and queries. This is only partially related to the merge, but it can slightly improve write speed because no statistics are collected for columns that aren't used for queries.
This Spark Summit presentation talks about optimizing MERGE INTO: what metrics to watch, etc.
I'm not 100% sure whether you need the condition silver.Partition_Date IN ($dates) AND silver.Partition_Hour IN ($hours); you may end up reading more data than required if the incoming data doesn't contain those specific partitions, but confirming that requires looking at the execution plan. This knowledge base article explains how to make sure that MERGE INTO uses partition pruning.
Update, December 2021: newer DBR versions (DBR 9+) have new functionality called Low Shuffle Merge that avoids shuffling unmodified data, so the merge happens much faster. It can be enabled by setting spark.databricks.delta.merge.enableLowShuffle to true.
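A short PySpark-flavored sketch of those knobs (the partition filter and Z-order columns below are illustrative examples tied to the schema above, not a prescription; the config names are the ones mentioned in this answer):

```python
# Smaller OPTIMIZE target files so the merge rewrites less data per touched
# partition (value below is 64 MB; pick something in the 32-64 MB range).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 64 * 1024 * 1024)

# Explicit OPTIMIZE with ZORDER on the join columns, restricted to the
# partitions the slow job actually touches (example date only).
spark.sql("""
    OPTIMIZE silver.DeviceData
    WHERE Partition_Date = '2021-12-01'
    ZORDER BY (DeviceID, Timestamp, DataType)
""")

# DBR 9+: enable Low Shuffle Merge (see the December 2021 update above).
spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")
```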

Postgres Partitioning Query Performance when Partitioned for Delete

We are on PostgreSQL 12 and are looking to partition a group of tables that are all related by Data Source Name. A source can have tens of millions of records, and the whole dataset takes up about 900 GB across the 2,000 data sources. We don't have a good way to update these records, so we are looking at a full dump and reload any time we need to update data for a source. This is why we are looking at partitioning: we can load the new data into a new partition, detach (and later drop) the partition that currently houses the data, and then attach the new partition with the latest data. Queries will be performed via a single ID field. My concern is that since we are partitioning by source name and querying by an ID that isn't used in the partition definition, we won't be able to utilize any partition pruning and our queries will suffer for it.
How concerned should we be about query performance for this use case? There will be an index on the ID field being queried, but based on the Postgres documentation, queries that have to look at many partitions can incur a lot of planning time and use a lot of memory.
Performance will suffer, but how much depends on the number of partitions. The more partitions you have, the slower both planning and execution will get, so keep the number low.
You can save on query planning time by defining a prepared statement and reusing it.
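For example (a sketch using psycopg2 as the client; the table name, column, and connection string are placeholders):

```python
import psycopg2  # placeholder driver; any client that keeps the session open works

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
cur = conn.cursor()

# Prepare the plan once per session...
cur.execute("PREPARE get_by_id (bigint) AS SELECT * FROM my_table WHERE id = $1")

# ...then reuse it. After a handful of executions Postgres can switch to a
# cached generic plan, so the planning cost over many partitions is paid
# far less often than once per query.
for record_id in (42, 43, 44):
    cur.execute("EXECUTE get_by_id(%s)", (record_id,))
    rows = cur.fetchall()

conn.close()
```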

Considering total max records from the user and processing it based on the batch size in apache beam

I am trying to read records from the source based on a user-supplied maximum number of records to process.
E.g.: total records in the source table: 1 million
Total max records to process: 100k
I need to process only those 100k records from the source.
I have gone through the JdbcIO library classes to check whether there is an option to implement this (similar to the option for setting the batch size), but I have found none.
PS: I want to implement it at the IO level, not by adding a limit to the query.
I was able to do it with setMaxRows by turning off auto-commit for JdbcIO.
You can use withQuery to specify a query with the number of records to read, e.g. .withQuery("select id,name from Person limit 1000"). You can also parameterize the number of records using JdbcIO.StatementPreparator. The example in the docs may help.
EDIT
Another option is to use withFetchSize

Loading 2 million records in memory for batch is okay?

I have to run a Spring Batch job that reads around 2 million documents from Mongo. The documents have 15 fixed fields containing strings, dates, and an _id.
My question is, what is the best way to process this? Should I do it in one step or spread it across many steps? What is the best practice? Isn't loading 2 million records into memory bad? I know that when loading records through Apache Spark, it streams the data, which is good. But I am not using Apache Spark.
The best way is to use a chunk-oriented step. See chunk-oriented processing section of the docs.
Loading 2 million records in memory is not a good idea (even if you manage to do it by adding more memory to your JVM), because you will have a single transaction handling those 2 million records. If your job crashes after, say, processing 1 million records, the work on that first half would be lost. The idea is to process documents in chunks and commit a transaction for every chunk. This type of processing is:
efficient: it does not load the whole input data set into memory at once
robust: a job crash would not require you to reprocess the already-processed documents
Hope this helps.

s3 parquet write - too many partitions, slow writing

I have a Scala Spark job that writes to S3 as Parquet. It's 6 billion records so far, and it will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make the query results faster I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so this creates 1,330,360 Parquet partitions during the write, and the writing step is very slow: it has been running for the past 9 hours and is still going.
output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")
Is there any way I can reduce the number of partitions and still keep the read queries fast? Or is there a better way to handle this scenario? Thanks.
EDIT: id is an integer column with random numbers.
You can partition by ranges of ids (you didn't say much about the ids, so I can't suggest anything specific) and/or use bucketing instead of partitioning: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
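For example, a minimal PySpark sketch of the range/bucket idea (the bucket width of 1024, the id_bucket column name, and the lookup_id variable are illustrative; output is the DataFrame from the question, and the same approach applies to the Scala writer):

```python
from pyspark.sql import functions as F

# Derive a coarse bucket from the id: ~1,024 partition directories instead of ~1.3M.
bucketed = output.withColumn("id_bucket", F.col("id") % 1024)

(bucketed.write
    .mode("append")
    .partitionBy("id_bucket")
    .parquet("s3a://datalake/db/"))

# At read time, filter on both columns: the id_bucket predicate prunes the scan
# down to a single partition directory, and the id predicate is pushed into it.
lookup_id = 123456789  # placeholder
df = (spark.read.parquet("s3a://datalake/db/")
      .where((F.col("id_bucket") == lookup_id % 1024) & (F.col("id") == lookup_id)))
```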