Avoid shuffle while writing data into a partitioned table from another partitioned table - scala

I am trying to read from a partitioned delta table, perform some narrow transformations, and write it into a new delta table which is partitioned on the same fields.
Table A (partitioned on col1, col2) -> Table B (partitioned on col1, col2)
Since the partitioning strategy is the same and there are no wide transformations, my assumption is that a shuffle is not needed here.
Do I need to specify some special options while reading or writing to ensure that the shuffle operation is not triggered for this?
I tried to read the data normally and write it back using df_B.write.partitionBy("col1", "col2")... but the shuffle still seems to be the bottleneck

I found the issue. The shuffle was happening because of the optimized-write setting spark.databricks.delta.optimizeWrite.enabled. That optimization may not be needed here, since the partitioning strategy of the source and destination tables is the same.
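For anyone hitting the same thing, here is a minimal sketch of turning that setting off before the copy, expressed in Spark SQL rather than the DataFrame writer used above. It assumes a Databricks or Delta Lake runtime where these settings exist, and table_a / table_b are placeholder names for Table A and Table B:
-- Session level: stop Delta's optimized write from repartitioning (shuffling)
-- the data before it is written out.
SET spark.databricks.delta.optimizeWrite.enabled = false;
-- Straight partition-aligned copy; both tables are partitioned on (col1, col2).
INSERT INTO table_b SELECT * FROM table_a;
-- Or disable it only for the target table via its table property.
ALTER TABLE table_b SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'false');
The same session setting applies when writing with df_B.write.partitionBy(...) as in the question.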

Related

What is the best way to backfill a partitioned table using data from a non-partitioned table? (Postgres 12)

I'm converting a non-partitioned table to a partitioned table in Postgres 12. Assuming I have set up the new partitioned table and have created appropriate triggers for automatic creation of partitions, what is the best way to backfill the currently empty partitioned table?
Is a naive
insert into partitioned_table(a,d,b,c) select a, d, b, c from non_partitioned_table;
delete from non_partitioned_table;
appropriate? We have ~250M rows in the table, so I am a bit concerned about doubling the storage requirements.
Or perhaps
WITH moved_rows AS (
    DELETE FROM non_partitioned_table
    RETURNING *
)
INSERT INTO partitioned_table
SELECT * FROM moved_rows;  -- add DISTINCT if duplicate rows must be removed
or
COPY non_partitioned_table TO '/tmp/non_partitioned_table.csv' DELIMITER ',';
COPY partitioned_table FROM '/tmp/non_partitioned_table.csv' DELIMITER ',';
Either way seems like it could take a while to transfer the data. I'm also concerned that it will kill INSERT performance while data is being migrated so I assume we'd need to block inserts until it's done. Is there any way to estimate how long it will take to copy the data over?
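Not one of the options above, but a hedged sketch of a batched variant that keeps each transaction short, so concurrent INSERTs are blocked only briefly and freed space can be reused as you go. It assumes an integer id column to slice on; all names are placeholders:
-- Move one id range per transaction; repeat with the next range until the
-- old table is empty.
BEGIN;
WITH moved_rows AS (
    DELETE FROM non_partitioned_table
    WHERE id >= 0 AND id < 1000000      -- boundaries of the current batch
    RETURNING *
)
INSERT INTO partitioned_table
SELECT * FROM moved_rows;
COMMIT;
-- Run VACUUM non_partitioned_table between batches so the deleted rows'
-- space can be reused.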

distkey and sortkey on temporary tables - Redshift

I am starting to do some research on query tuning and have been experimenting with using distkey and sortkey. From what I've read, if I set the distkey to the joining column, the query planner will use a merge join instead of a hash join, which should be faster in Redshift. I was wondering if this also applies to temporary tables? Our production tables are actually views, so they do not have any keys set. I'm not sure why we don't use the actual warehouse tables.
Yes, keys can be set for temporary tables:
create temp table fred DISTKEY (1) as ...
This is easily done by column position - the first column in this example. You can also set the distribution style on temp tables if you so desire. Doing this can force data to stay "on node" for intermediate results in very large and complex queries. Redshift does a good job of making reasonable decisions on how to distribute intermediate results, but it isn't perfect and doesn't understand the nature of the data. I've done this with good results when large intermediate data sets are in play.
As to your second point about using views instead of tables: in Redshift, standard views are basically SQL macros that are flattened and optimized by the Redshift query compiler, so using views instead of tables is not bad in itself. Views, especially complex ones, can hide what is being done by a query, and this can add unneeded and unexpected complexity. The keys are set on the tables referenced by the views. (I'm assuming the views are not referencing external/Spectrum tables.)
Lastly, you state you are looking to achieve merge-join behavior to improve performance. While it is true that this is the fastest type of join, the time and work required to get merge joins to happen on temp tables will not be offset by the performance gain (in my experience). Redshift will only use a merge join when it is sure that the data being joined will "zipper" together without issue; if it isn't completely sure this is the case, it has to perform a hash join, which is a more general process. To get Redshift to do a merge join you will need to sort and analyze your temp tables, which will cost much more time than the savings you will get. It is far more important to have your joins be "DIST NONE" - no network distribution of the data - than to move from a hash join to a merge join.
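To make that last point concrete, here is a rough sketch (all table and column names are invented, and it assumes the permanent customers table is also distributed on the join column): create the temp table with the same distkey, then check EXPLAIN - the join step should show DS_DIST_NONE rather than a redistribution step such as DS_BCAST_INNER or DS_DIST_BOTH.
-- Temp table co-located with the (assumed) customers table on customer_id.
CREATE TEMP TABLE recent_orders DISTKEY (customer_id) SORTKEY (customer_id) AS
SELECT customer_id, order_ts, amount
FROM orders
WHERE order_ts > dateadd(day, -7, getdate());
-- Inspect the plan: look for "DS_DIST_NONE" on the join step.
EXPLAIN
SELECT c.customer_id, sum(o.amount)
FROM recent_orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_id;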
Yes, it can be done. Just put the distkey before the start of the table query
create temp table a distkey(column_name) as
(select query .....)

Redshift time-series table loading questions

Redshift documentation identifies time-series tables as a best practice:
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
However, it doesn't address any of the following issues:
how many tables within a union-all view is reasonable - hundreds? (unanswered)
any method of writing to the union-all view and having Redshift direct those inserts to the correct underlying tables? (Answer: no)
most effective method of loading the underlying tables? Perhaps using Firehose to insert into a staging table, then periodically inserting those rows into the appropriate table within the union-all view? (unanswered)
any way to enable Redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria? (Answer: no)
can Redshift support dropping old tables, adding new tables, and rebuilding the union-all view within a transaction? (unanswered)
My situation:
100 million rows added daily, which will grow to 500 million in 3 years
12 month retention desired
Estimated 99% of all queries will hit the most recent 1-7 days
Data is written to the existing table via Kinesis Firehose to S3, which then triggers a COPY into the Redshift table.
My proposed solution:
Create a year of daily tables with a union-all view, along with a distkey of sensor_id (100,000+ unique values) and a sortkey of (timestamp, sensor_id).
Have Firehose load into a staging table.
Create a separate process that, once an hour, queries the staging table to discover the dates of the data within it, then performs an INSERT INTO the appropriate daily table ... SELECT * FROM the staging table WHERE the timestamp matches that table's date.
This hourly writer can probably wrap a table rename, multiple insert-selects, and a table recreate in a transaction so it is invisible to Firehose.
Once a month drop old tables, create next month of tables, and rebuild view.
This union-all view maintenance can probably be wrapped in a transaction to avoid impacts to users.
Once a night, run VACUUM and ANALYZE.
EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.
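A rough SQL sketch of the proposed layout (every object name here is invented for illustration): one table per day sharing the same dist/sort keys, a UNION ALL view over them, and an hourly INSERT ... SELECT that routes staged Firehose rows into the matching daily table.
-- One table per day of the retention window (repeat for each day).
CREATE TABLE events_20230101 (
    sensor_id BIGINT,
    ts        TIMESTAMP,
    value     DOUBLE PRECISION
) DISTKEY (sensor_id) SORTKEY (ts, sensor_id);
CREATE TABLE events_20230102 (sensor_id BIGINT, ts TIMESTAMP, value DOUBLE PRECISION)
  DISTKEY (sensor_id) SORTKEY (ts, sensor_id);
-- The union-all view the applications query.
CREATE OR REPLACE VIEW events AS
SELECT * FROM events_20230101
UNION ALL
SELECT * FROM events_20230102;
-- ...and so on for each remaining daily table.
-- Hourly: move staged rows for a given day into that day's table.
BEGIN;
INSERT INTO events_20230101
SELECT sensor_id, ts, value
FROM events_staging
WHERE ts >= '2023-01-01' AND ts < '2023-01-02';
DELETE FROM events_staging
WHERE ts >= '2023-01-01' AND ts < '2023-01-02';
COMMIT;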
Your proposed process sounds quite good! While I can't answer all your questions, here is some information:
Any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables?
Views are read-only. It is not possible to write to a view, nor is it possible to insert data and expect Redshift to send it to an appropriate table (e.g. a specific table for the given day).
Any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria?
Redshift will not exclude specific tables from the query, but it will avoid reading particular disk blocks through the use of Zone Maps. Each block of data written to disk is associated with a specific table and column. The block has a Zone Map, which indicates the minimum and maximum values of that field stored within the block.
If a query includes a WHERE clause, Redshift can skip blocks that do not contain relevant data. This is particularly powerful when used on the SORTKEY column, since similar ranges of data are grouped together.
Given that you are using a date as the SORTKEY, Redshift will read very few disk blocks if the query includes a WHERE clause based on that column. This is very similar to the idea of skipping tables, but it actually skips reading disk blocks.
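In other words, a query like the one below (against the union-all view sketched earlier, with invented names) still references every underlying table, but only the disk blocks whose zone-map min/max on the sort key overlap the filter are actually read:
-- Only blocks whose (min, max) range of ts overlaps the last day are scanned.
SELECT sensor_id, count(*)
FROM events
WHERE ts >= dateadd(day, -1, getdate())
GROUP BY sensor_id;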

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar DB? Or does it depend on whether the column is nullable / has a default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table; you have to use temporary tables etc. if you want to insert a column somewhere in the middle.
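For reference, the operation in question is simply the following (table and column names are placeholders):
-- Fast even on very large tables: Redshift only creates new, mostly empty
-- column blocks on each node; existing columns are untouched.
ALTER TABLE big_fact_table ADD COLUMN source_system VARCHAR(32) DEFAULT 'unknown';
-- Dropping a column is similarly cheap.
ALTER TABLE big_fact_table DROP COLUMN source_system;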
Personally, I have seen that rebuilding the table works best.
I do it in the following way:
Create a new table N_OLD_TABLE
Define the datatypes/compression encodings in the new table
Insert the data: INSERT INTO N_OLD_TABLE (old_columns) SELECT old_columns FROM OLD_TABLE
Rename OLD_TABLE to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. It doesn't block any table, and you always have a backup of the old table in case anything goes wrong.
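A hedged sketch of that rebuild-and-rename flow (the table names follow the steps above; the columns and encodings are invented):
-- 1. New table with the desired datatypes, encodings, and the extra column.
CREATE TABLE n_old_table (
    id         BIGINT        ENCODE az64,
    created_at TIMESTAMP     ENCODE az64,
    payload    VARCHAR(256)  ENCODE lzo,
    new_col    INTEGER       ENCODE az64   -- column being added
) DISTKEY (id) SORTKEY (created_at);
-- 2. Copy the existing data across.
INSERT INTO n_old_table (id, created_at, payload)
SELECT id, created_at, payload FROM old_table;
-- 3. Swap the names, keeping the original as a backup.
ALTER TABLE old_table RENAME TO old_table_bkp;
ALTER TABLE n_old_table RENAME TO old_table;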

DB2 Partitioning

I know how partitioning in DB2 works, but I am not aware of where these partition values actually get stored. After writing a create-partition query, for example:
CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
STARTING '1/1/2006' ENDING '12/31/2006'
EVERY 3 MONTHS
)
After running the above query we know that partitions are created on orders for every 3 months, but when we run a SELECT query the query engine refers to these partitions. I am curious to know where this actually gets stored: in the same table, or does DB2 have a different table where the partition values for every table get stored?
Thanks,
Table partitions in DB2 are stored in tablespaces.
For regular tables (if table partitioning is not used), table data is stored in a single tablespace (not considering LOBs).
For partitioned tables, multiple tablespaces can be used for their partitions.
This is achieved by the IN clause of the CREATE TABLE statement.
CREATE TABLE parttab
...
IN TBSP1, TBSP2, TBSP3
In this example the first partition will be stored in TBSP1, the second in TBSP2, The third in TBSP3, the fourth in TBSP1 and so on.
Table partitions are named in DB2 - by default PART1 .. PARTn - and all these details can be looked up in the system catalog view SYSCAT.DATAPARTITIONS, including the specified partition ranges.
See also http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?cp=SSEPGG_10.5.0%2F2-12-8-27&lang=en
The column used as partitioning key can be looked up in syscat.datapartitionexpression.
There is also a long syntax for creating partitioned tables where partition names can be explicitly specified, as well as the tablespace where each partition will get stored.
For applications partitioned tables look like a single normal table.
Partitions can be detached from a partitioned table. In this case a partition is "disconnected" from the partitioned table and converted to a table without moving the data (or vice versa).
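A sketch of that long syntax, reusing the orders table from the question (the partition and tablespace names are made up):
-- Each range gets an explicit partition name and its own tablespace.
CREATE TABLE orders (id INT, shipdate DATE)
  PARTITION BY RANGE (shipdate)
  (PARTITION q1_2006 STARTING '2006-01-01' ENDING '2006-03-31' IN TBSP1,
   PARTITION q2_2006 STARTING '2006-04-01' ENDING '2006-06-30' IN TBSP2,
   PARTITION q3_2006 STARTING '2006-07-01' ENDING '2006-09-30' IN TBSP3,
   PARTITION q4_2006 STARTING '2006-10-01' ENDING '2006-12-31' IN TBSP1);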
best regards
Michael
After a bit of research I finally figured it out and want to share this information; I hope it will be useful to others.
How to see these key values? For LUW (Linux/Unix/Windows) you can see the keys in the Table Object Editor or the Object Viewer Script tab. For z/OS there is an Object Viewer tab called "Limit Keys". I've opened issue TDB-885 to create an Object Viewer tab for LUW tables.
A simple query to check these values:
SELECT * FROM SYSCAT.DATAPARTITIONS
WHERE TABSCHEMA = ? AND TABNAME = ?
ORDER BY SEQNO
reference: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?lang=en
DB2 will create a separate physical location for each partition, so each partition can have its own tablespace. When you SELECT from this partitioned table, your SQL may go directly to a single partition or it may span many, depending on the predicates in your SQL. This also allows your SQL to run in parallel, i.e. many tablespaces can be accessed concurrently to speed up the SELECT.
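For example, with the quarterly ranges from the question, a query like this (illustrative) lets DB2 eliminate all but the one partition holding Q2 2006:
-- The range predicate on the partitioning column (shipdate) allows
-- data partition elimination, so only the matching partition is scanned.
SELECT id, shipdate
FROM orders
WHERE shipdate BETWEEN '2006-04-01' AND '2006-06-30';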