The Problem
We have a Delta Lake setup on top of ADLS Gen2 with the following tables:
bronze.DeviceData: partitioned by arrival date (Partition_Date)
silver.DeviceData: partitioned by event date and hour (Partition_Date and Partition_Hour)
We ingest large amounts of data (>600M records per day) from an event hub into bronze.DeviceData (append-only). We then process the new files in a streaming fashion and upsert them into silver.DeviceData with the delta MERGE command (see below).
The data arriving in the bronze table can contain data from any partition in silver (e.g. a device may send historic data that it cached locally). However, >90% of the data arriving at any day is from partitions Partition_Date IN (CURRENT_DATE(), CURRENT_DATE() - INTERVAL 1 DAYS, CURRENT_DATE() + INTERVAL 1 DAYS). Therefore, to upsert the data, we have the following two spark jobs:
"Fast": processes the data from the three date partitions above. The latency is important here, so we prioritize this data
"Slow": processes the rest (anything but these three date partitions). The latency doesn't matter so much, but it should be within a "reasonable" amount of time (not more than a week I'd say)
Now we come to the problem: although the amount of data is magnitudes less in the "slow" job, it runs for days just to process a single day of slow bronze data, with a big cluster. The reason is simple: it has to read and update many silver partitions (> 1000 date partitions at times), and since the updates are small but the date partitions can be gigabytes, these merge commands are inefficient.
Furthermore, as time goes on, this slow job will become slower and slower, since the silver partitions it touches will grow.
Questions
Is our partitioning scheme and the fast/slow Spark job setup generally a good way to approach this problem?
What could be done to improve this setup? We would like to reduce the costs and the latency of the slow job and find a way so that it grows with the amount of data arriving at any day in bronze rather than with the size of the silver table
Additional Infos
we need the MERGE command, as certain upstream services can re-process historic data, which should then update the silver table as well
the schema of the silver table:
CREATE TABLE silver.DeviceData (
DeviceID LONG NOT NULL, -- the ID of the device that sent the data
DataType STRING NOT NULL, -- the type of data it sent
Timestamp TIMESTAMP NOT NULL, -- the timestamp of the data point
Value DOUBLE NOT NULL, -- the value that the device sent
UpdatedTimestamp TIMESTAMP NOT NULL, -- the timestamp when the value arrived in bronze
Partition_Date DATE NOT NULL, -- = TO_DATE(Timestamp)
Partition_Hour INT NOT NULL -- = HOUR(Timestamp)
)
USING DELTA
PARTITIONED BY (Partition_Date, Partition_Hour)
LOCATION '...'
our MERGE command:
val silverTable = DeltaTable.forPath(spark, silverDeltaLakeDirectory)
val batch = ... // the streaming update batch
// the dates and hours that we want to upsert, for partition pruning
// collected from the streaming update batch
val dates = "..."
val hours = "..."
val mergeCondition = s"""
silver.Partition_Date IN ($dates)
AND silver.Partition_Hour IN ($hours)
AND silver.Partition_Date = batch.Partition_Date
AND silver.Partition_Hour = batch.Partition_Hour
AND silver.DeviceID = batch.DeviceID
AND silver.Timestamp = batch.Timestamp
AND silver.DataType = batch.DataType
"""
silverTable.alias("silver")
.merge(batch.alias("batch"), mergeCondition)
// only merge if the event is newer
.whenMatched("batch.UpdatedTimestamp > silver.UpdatedTimestamp").updateAll
.whenNotMatched.insertAll
.execute
On Databricks, there are several ways to optimize performance of the merge into operation:
Perform Optimize with ZOrder on the columns that are part of the join condition. This may depend on the specific DBR version, as older versions (prior to 7.6 IIRC) were using real ZOrder algorithm that is working well for smaller number of columns, while DBR 7.6+ uses by default Hilbert space-filling curves instead
Use smaller file sizes - by default, OPTIMIZE creates files of 1Gb, that need to be rewritten. You can use spark.databricks.delta.optimize.maxFileSize to set file size to 32Mb-64Mb range so it will rewrite less data
Use conditions on partitions of the table (you're already doing that)
Don't use auto-compaction because it can't do ZOrder, but instead run explicit optimize with ZOrder. See documentation on details
Tune indexing of the columns, so it will index only columns that are required for your condition and queries. It's partially related to the merging, but can slightly improve write speed because no statistics will be collected for columns that aren't used for queries.
This presentation from Spark Summit talks about optimization of the merge into - what metrics to watch, etc.
I'm not 100% sure that you need condition silver.Partition_Date IN ($dates) AND silver.Partition_Hour IN ($hours) because you may read more data than required if you don't have specific partitions in the incoming data, but it will require to look into the execution plan. This knowledge base article explains how to make sure that merge into uses the partition pruning.
Update, December 2021st: In newer DBR versions (DBR 9+) there is a new functionality called Low Shuffle Merge that prevents shuffling of not modified data, so the merge happens much faster. It could be enabled by setting spark.databricks.delta.merge.enableLowShuffle to true.
Related
We are on Postgresql 12 and looking to partition a group of tables that are all related by Data Source Name. A source can have tens of millions of records and the whole dataset makes up about 900GB of space across the 2000 data sources. We don't have a good way to update these records so we are looking at a full dump and reload any time we need to update data for a source. This is why we are looking at using partitioning so we can load the new data into a new partition, detach (and later drop) the partition that currently houses the data, and then attach the new partition with the latest data. Queries will be performed via a single ID field. My concern is that since we are partitioning by source name and querying by an ID that isn't used in the partition definition that we won't be able to utilize any partition pruning and our queries will suffer for it.
How concerned should we be with query performance for this use case? There will be an index defined on the ID that is being queried, but based on the Postgres documentation it can add a lot of planning time and use a lot of memory to service queries that look at many partitions.
Performance will suffer, but it will depend on the number of partitions how much. The more partitions you have, the slower both planning and execution time will get, so keep the number low.
You can save on query planning time by defining a prepared statement and reusing it.
We have on RDS a main Postgres server and a read replica.
We constantly write and update new data for the last couple of days.
Reading from the read-replica works fine when looking at older data but when trying to read from the last couple of days, where we keep updating the data on the main server, is painfully slow.
Queries that take 2-3 minutes on old data can timeout after 20 minutes when querying data from the last day or two.
Looking at the monitors like CPU I don't see any extra load on the read replica.
Is there a solution for this?
You are accessing over 65 buffers for ever 1 visible row found in the index scan (and over 500 buffers for each row which is returned by the index scan, since 90% are filtered out by the mmsi criterion).
One issue is that your index is not as well selective as it could be. If you had the index on (day, mmsi) rather than just (day) it should be about 10 times faster.
But it also looks like you have a massive amount of bloat.
You are probably not vacuuming the table often enough. With your described UPDATE pattern, all the vacuum needs are accumulating in the newest data, but the activity counters are evaluated based on the full table size, so autovacuum is not done often enough to suit the needs of the new data. You could lower the scale factor for this table:
alter table simplified_blips set (autovacuum_vacuum_scale_factor = 0.01)
Or if you partition the data based on "day", then the partitions for newer days will naturally get vacuumed more often because the occurrence of updates will be judged against the size of each partition, it won't get diluted out by the size of all the older inactive partitions. Also, each vacuum run will take less work, as it won't have to scan all of the indexes of the entire table, just the indexes of the active partitions.
As suggested, the problem was bloat.
When you update a record in an ACID database the database creates a new version of the record with the new updated record.
After the update you end with a "dead record" (AKA dead tuple)
Once in a while the database will do autovacuum and clean the table from the dead tuples.
Usually the autovacuum should be fine but if your table is really large and updated often you should consider changing the autovacuum analysis and size to be more aggressive.
I am new to spring batch and trying to design a new application which has to read 20 million records from database and process it.
I don’t think we can do this with one single JOB and Step(in sequential with one thread).
I was thinking we can do this in Partitioning where step is divided into master and multiple workers (each worker is a thread which does its own process can run parallel).
We have to read a table(existing table) which has 20 million records and process them but in this table we do not have any auto generated sequence number and it have primary key like employer number with 10 digits.
I checked few sample codes for Partitioning where we can pass the range to each worker and worker process given range like worker1 from 1 to 100 and worker2 101 to 200…but in my case which is not going work because we don’t have sequence number to pass as range to each worker.
In Partitioning can master read the data from database (like 1000 records) and pass it to each worker in place for sending range ? .
Or for the above scenario do you suggest any other better approach.
In principle any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key then this effect should be less noticeable as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.
In the following link, the creator of a tool I use (Airflow) suggests to create partitions for daily snapshots of dimension tables. I am wondering about the overhead of doing something like this in Postgres.
I am using the Postgres 10 built in partitioning for several tables, but mostly at a monthly or yearly level for facts. I never tried implementing a daily partition for dimensions before and it seems scary. It would simplify things though in several areas for me in case I need to rerun old tasks.
https://medium.com/#maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
Simple. With dimension snapshots where a new partition is appended at
each ETL schedule. The dimension table becomes a collection of
dimension snapshots where each partition contains the full dimension
as-of a point in time. “But only a small percentage of the data
changes every day, that’s a lot of data duplication!”. That’s right,
though typically dimension tables are negligible in size in proportion
to facts. It’s also an elegant way to solve SCD-type problematic by
its simplicity and reproducibility. Now that storage and compute are
dirt cheap compared to engineering time, snapshoting dimensions make
sense in most cases.
While the traditional type-2 slowly changing dimension approach is
conceptually sound and may be more computationally efficient overall,
it’s cumbersome to manage. The processes around this approach, like
managing surrogate keys on dimensions and performing surrogate key
lookup when loading facts, are error-prone, full of mutations and
hardly reproducible.
I have worked with systems with different levels of partitioning.
Generally any partitioning is OK as long as you have check constrains on partitions which allow query planner to find adequate partitions for query. Or you will have to query specific partition directly for some special cases. Otherwise you will see sequential scans over all partitions even for simple queries.
Daily partitions are completely OK do not worry. And I worked event with data collector based on PG which needed to have partitions for every 5 minutes of data because it collected several TBs per day.
Number of partitions can become a bigger problem only when you have several thousands or dozens of thousands of partitions - with this amount of partitions everything goes to different level of problems.
You will have to set proper max_locks_per_transaction for example to be able to work with them. Because even simple select over parent table places SharedAccessLock over all partitions which is not exactly nice but PG inheritance works this way.
Plus higher planing time for query - in our data warehouse we sometimes see planning times for queries like several minutes and queries taking only seconds - which is a bit craped... But it is hard to do anything with it because current PG planner works this way.
But PROs still overweight CONs so I highly recommend to use any partitioning granularity you need.
I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows in Cassandra:
test(
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra and make a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to 4, why is creating 13 Tasks? (Also made sure the number of partitions calling rdd.getNumPartitions() on my DataFrame)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed a external machine (2Gb RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of the df.select().count() was an expected greater latency and overall poorer performance in comparison with my previous test (took about 70 seconds to finish the Job).
Edit: I misunderstood his suggestion. #zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained in here
Also I wanted to point out that I am aware of the inherent anti-pattern of setting a list<double> instead a wide row for this type of data, but my concerns at this moment are more the time spent on retrieval of a large dataset rather than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish but it is not exactly unexpected. In general count is expressed as
SELECT 1 FROM table
followed by Spark side summation. So while it is optimized it still rather inefficient because you have fetch N long integers from the external source just to sum these locally.
As explained by the docs Cassandra backed RDD (not Datasets) provide optimized cassandraCount method which performs server side counting.
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property is used to determine number of partitions for shuffles (when data is aggregated by some set of keys) not for Dataset creation or global aggregations like count(*) (which always use 1 partition for final aggregation).
If you interested in controlling number of initial partitions you should take a look at spark.cassandra.input.split.size_in_mb which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see another factor here is spark.default.parallelism but it is not exactly a subtle configuration so depending on it in general is not an optimal choice.
I see that it is very old question but maybe someone needs it now.
When running Spark on local machine it is very important to set into SparkConf master "local[*]" that according to documentation allows to run Spark with as many worker threads as logical cores on your machine.
It helped me to increase performance of count() operation by 100% on local machine comparing to master "local".