How to optimise tables in Netezza to complement a join with date conditions

I have two tables that I need to join in Netezza, and one of them is very large.
I have a dimension table (a customer table) with two fields, customer ID and an observation date, i.e.
cust_id, obs_date
'a','2015-01-05'
'b','2016-02-03'
'c','2014-05-21'
'd','2016-01-31'
I have a fact table that is transactional and very high in volume. It has many transactions per customer per date, i.e.
cust_id, tran_date, transaction_amt
'a','2015-01-01',1
'a','2015-01-01',2
'a','2015-01-01',5
'a','2015-01-02',7
'a','2015-01-02',2
'b','2016-01-02',12
Both tables are distributed on the same key: cust_id.
However, when I join the tables, I need to apply a date condition. The query is very fast when I just join them on cust_id, but when I add the date condition it does not seem optimised. Does anyone have tips on how to set up the underlying tables or write the join?
I.e. sum transaction_amt for each customer across all their transactions for the 3 months up to their obs_date:
SELECT CUSTOMER_TABLE.cust_id, SUM(TRANSACTION_TABLE.transaction_amt) AS transaction_amt
FROM CUSTOMER_TABLE
INNER JOIN TRANSACTION_TABLE
ON CUSTOMER_TABLE.cust_id = TRANSACTION_TABLE.cust_id
AND TRANSACTION_TABLE.TRAN_DATE BETWEEN CUSTOMER_TABLE.OBS_DATE - 30 AND CUSTOMER_TABLE.OBS_DATE
GROUP BY CUSTOMER_TABLE.cust_id

If your transaction table is sufficiently large, it may benefit from using clustered base tables (CBTs).
If you can, create a copy of the table that uses TRAN_DATE to organize (I'm guessing at your DDL here):
create table transaction_table (
cust_id varchar(20)
,tran_date date
,transaction_amt numeric(10,0)
) distribute on (cust_id)
organize on (tran_date);
Join to that and see if performance is improved. You could also use a materialized view for just those columns, but I think a CBT would be more useful here.
As Scott mentions in the comments, you should either sort by the date on insert or groom the records afterwards to make sure that they stay organized appropriately; a rough sketch of both the groom and the join follows.
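As a rough sketch (the GROOM usage and the shape of the aggregate query are my additions, not something specified in the answer; table and column names follow the DDL above), the maintenance step and the join against the organized copy might look like this:
-- Reorganize the table so rows are physically clustered on tran_date;
-- run after bulk loads so the zone maps on tran_date stay tight.
GROOM TABLE transaction_table;

-- Join against the organized copy; the date restriction can now prune
-- extents using the tran_date zone maps.
SELECT c.cust_id, SUM(t.transaction_amt) AS transaction_amt
FROM customer_table c
INNER JOIN transaction_table t
    ON c.cust_id = t.cust_id
   AND t.tran_date BETWEEN c.obs_date - 30 AND c.obs_date
GROUP BY c.cust_id;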

Related

Selecting 10000 records takes too long in PostgreSQL

My table contains 1 billion records. It is also partitioned by month. Id and datetime form the primary key for the table. When I select:
select col1,col2,..col8
from mytable t
inner join cte on t.Id=cte.id and dtime>'2020-01-01' and dtime<'2020-10-01'
It uses an index scan, but takes more than 5 minutes to return.
Any suggestions?
Note: I have set work_mem to 1GB. The cte results come back within 3 seconds.
Well, it's the nature of a join; it is usually a time-consuming operation.
First of all, I recommend using IN rather than JOIN. They have different meanings, of course, but in some cases you can technically use them interchangeably. Check this question out.
Secondly, in terms of relational algebra, when you join, each row of mytable is combined with each row of the second table, the DBMS has to build a large intermediate result, and unsuitable rows are only discarded afterwards. All of those steps take time. Before joining, it's better to filter your tables (for example, mytable by date) to make them smaller, and then join; see the sketch below.
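A minimal sketch of that filter-first idea, reusing the column names and date bounds from the query above (cte is assumed to be defined exactly as in the original query):
-- Restrict mytable to the date range (and only the needed columns) first,
-- then join the smaller derived set to cte.
SELECT f.col1, f.col2, f.col8
FROM (
    SELECT id, col1, col2, col8
    FROM mytable
    WHERE dtime > '2020-01-01'
      AND dtime < '2020-10-01'
) f
INNER JOIN cte ON f.id = cte.id;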

Difference between partitioned and non-partitioned tables in terms of vertical join in q/kdb

I have two non-partitioned tables:
q)s:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.05); co:`a`b`f`b`c)
q)t:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.07); co:`a`b`e`b`d)
In above table when I run below query it works perfectly fine.
q)select distinct co from s,t where date within 2019.07.01 2019.07.02
co
--
a
b
f
e
I have tables with the same names that are partitioned by date; when I try to run the same query on the partitioned tables I get the error below:
ERROR: 'par
(trying to update a physically partitioned table)
Why do we get the above error with partitioned tables?
What is the optimal approach to get the same output as we got with the non-partitioned tables?
One solution for question 2, which feels like brute force, is:
select distinct co from((select distinct co from s where date within 2019.07.01 2019.07.02),select distinct co from t where date within 2019.07.01 2019.07.02)
I'm assuming you are only including the date column in the source tables to assist in queries. A date-partitioned table generates the virtual date column from the HDB structure; you shouldn't include it in the actual table being written to.
Why do we get the above error with partitioned tables?
There is no way to access the data of a partitioned table except through an initial select statement. In this case you are directly trying to perform a , (join) operation on the s and t tables.
What is the optimal approach to get the same output as we got with the non-partitioned tables?
In general, there is a trade-off between the table size and the nature and frequency of the operations; sometimes it may be worth bringing the table into memory for frequent joins, or creating a top-level flat table with the relevant subset of data.
If this is just a generalized test case for larger operations, then something along the following lines would be ideal:
distinct raze {select distinct co from x where date within 2019.07.01 2019.07.02} each `s`t
Its performance is not very different from your own query, however; it's just a bit more succinct.

Can redshift stored procs be used to make a date range UNION ALL query

Since Redshift does not natively support date partitioning, other than in Redshift Spectrum, all our tables are date partitioned:
my_table_name_YYYY_MM_DD
So every time we run queries, it usually looks like this:
select columns, i, want from
(select * from tbl1_date UNION ALL
select * from tbl2_date UNION ALL
select * from tbl3_date UNION ALL
select * from tbl4_date);
Where there's one UNION ALL per day.
Can stored procedures generate a date range so our business analysts stop losing their hair when I send them a Python or bash script to generate the date range?
Yes, you could create a stored procedure that generates dynamic SQL using only the needed tables. See my answer here for a template to start from: Issue with passing column name as a parameter to "PREPARE" in Redshift
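As a rough sketch of what such a procedure could look like (the procedure name, the temp-table name, and the loop structure are my assumptions; the table names follow the my_table_name_YYYY_MM_DD pattern from the question, and the linked PREPARE-based template is the more complete starting point):
-- Hypothetical sketch: build a UNION ALL over one daily table per day in the
-- range and materialize it into a temp table the analysts can query.
CREATE OR REPLACE PROCEDURE union_tables_for_range(start_date DATE, end_date DATE)
AS $$
DECLARE
    d    DATE := start_date;
    stmt VARCHAR(65535) := '';
BEGIN
    WHILE d <= end_date LOOP
        IF stmt <> '' THEN
            stmt := stmt || ' UNION ALL ';
        END IF;
        -- Table names follow the my_table_name_YYYY_MM_DD convention
        stmt := stmt || 'SELECT * FROM my_table_name_'
                     || TO_CHAR(d, 'YYYY') || '_' || TO_CHAR(d, 'MM') || '_' || TO_CHAR(d, 'DD');
        d := d + 1;
    END LOOP;
    EXECUTE 'DROP TABLE IF EXISTS unioned_days';
    EXECUTE 'CREATE TEMP TABLE unioned_days AS ' || stmt;
END;
$$ LANGUAGE plpgsql;

-- Example call for one week of daily tables:
CALL union_tables_for_range('2020-01-01'::DATE, '2020-01-07'::DATE);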
However, you should be aware that Redshift is able to achieve most of what you want automatically using a "Time Series Table" view. This is documented here:
Using Time Series Tables
Use Time-Series Tables
You define a view that is composed of a UNION ALL over a sequence of identical tables with a sort key defined on a commonly filtered date or timestamp column. When you query that view, Redshift is able to eliminate the scans on any UNION'ed tables that would not contain relevant data.
For example:
CREATE OR REPLACE VIEW store_sales_vw
AS SELECT * FROM store_sales_1998
UNION ALL SELECT * FROM store_sales_1999
UNION ALL SELECT * FROM store_sales_2000
UNION ALL SELECT * FROM store_sales_2001
UNION ALL SELECT * FROM store_sales_2002
UNION ALL SELECT * FROM store_sales_2003
;
SELECT cd.cd_education_status
,COUNT(*) sales_count
,AVG(ss_quantity) avg_quantity
FROM store_sales_vw vw
JOIN customer_demographics cd
ON vw.ss_cdemo_sk = cd.cd_demo_sk
WHERE ss_sold_ts BETWEEN '1999-09-01' AND '2000-08-31'
GROUP BY cd.cd_education_status
In this example Redshift will only scan the store_sales_1999 and store_sales_2000 tables, skipping the other tables in the view. Note that the table skipping is not based on the name of the table; Redshift knows the MIN and MAX values of the sort key timestamp in each table.
If you pursue this approach, please be sure to keep the total number of tables in the UNION fairly low. I recommend (at most) daily tables for the last week [7], weekly tables for the last month [5], quarterly tables for the last year [4], and then yearly tables for older data.
You can use ALTER TABLE … APPEND to merge the daily tables into weekly tables and so on.
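For example (the weekly table name my_table_name_2020_w01 is a hypothetical naming choice; the daily name follows the pattern from the question), rolling a daily table into its weekly table could look like this:
-- Move all rows from a daily table into the corresponding weekly table.
-- ALTER TABLE APPEND moves data blocks instead of copying rows, so it is much
-- faster than INSERT INTO ... SELECT and leaves the source table empty.
ALTER TABLE my_table_name_2020_w01 APPEND FROM my_table_name_2020_01_03;

-- The emptied daily table can then be dropped.
DROP TABLE my_table_name_2020_01_03;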

How to avoid skewing in redshift for Big Tables?

I want to load a table of more than 1 TB from S3 into Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, which then have performance issues.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So I tried a distribution key on workday, but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How do we avoid skewing of the table when row counts are uneven across the key values (let's say hour1 has 1 million rows, hour2 has 1.5 million rows, hour3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either a SORTKEY or a COMPOUND SORTKEY. The sort key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised to choose a column that causes the least amount of skew as the DISTKEY. A column with many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one works best for your typical queries; a sketch of that is below.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
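A minimal sketch of trying out two candidate layouts (the base table name my_table is hypothetical; the columns are the ones listed in the question, and these particular key choices are just examples to benchmark, not recommendations):
-- Candidate 1: even distribution to avoid skew, compound sort key on the
-- columns most queries filter on.
CREATE TABLE my_table_even
DISTSTYLE EVEN
COMPOUND SORTKEY (workday, workhour)
AS
SELECT id, name, another_id, workday, workhour, worktime_number
FROM my_table;

-- Candidate 2: distribute on a higher-cardinality column used in joins.
CREATE TABLE my_table_by_another_id
DISTKEY (another_id)
COMPOUND SORTKEY (workday, workhour)
AS
SELECT id, name, another_id, workday, workhour, worktime_number
FROM my_table;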
Here is the architecture that I recommend:
1) Load into a staging table with DISTSTYLE EVEN, sorted by something that is already sorted in your loaded S3 data - this means you will not have to vacuum the staging table.
2) Set up a production table with the sort/dist keys you need for your queries. After each COPY from S3, load that new data into the production table and vacuum.
3) You may wish to have two mirror production tables and flip-flop between them using a late-binding view.
It's a bit complex to do this and you may need some professional help; there may be specifics to your use case. A rough sketch of steps 1 and 2 follows.
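A minimal sketch of steps 1 and 2, assuming hypothetical table names (stg_my_table, prod_my_table), a placeholder S3 path and IAM role, the column list from the question, and example key choices:
-- 1) Staging table: even distribution, sorted the way the incoming S3 files
--    are already sorted, so the load itself does not need a vacuum.
CREATE TABLE stg_my_table (
    id              INTEGER,
    name            VARCHAR(10),
    another_id      INTEGER,
    workday         INTEGER,
    workhour        INTEGER,
    worktime_number INTEGER
)
DISTSTYLE EVEN
SORTKEY (workday, workhour);

COPY stg_my_table
FROM 's3://my-bucket/my-prefix/'                             -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy'   -- placeholder role
FORMAT AS CSV;

-- 2) Production table with the dist/sort keys your queries need; merge the
--    newly staged data in, then vacuum and clear the staging table.
CREATE TABLE IF NOT EXISTS prod_my_table (
    id              INTEGER,
    name            VARCHAR(10),
    another_id      INTEGER,
    workday         INTEGER,
    workhour        INTEGER,
    worktime_number INTEGER
)
DISTKEY (another_id)
COMPOUND SORTKEY (workday, workhour);

INSERT INTO prod_my_table SELECT * FROM stg_my_table;
VACUUM prod_my_table;
TRUNCATE stg_my_table;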
As of writing this (just after re:Invent 2018), Redshift has automatic distribution (DISTSTYLE AUTO) available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, try a few combinations by replicating the same table with different DIST keys if you don't like what automatic distribution is doing. After the tables are created, run the admin utility from the git repo above (preferably create a view on the SQL script in the Redshift DB).
Also, if you have good clarity on your query usage patterns, you can use the SQL below to check how well the sort keys are performing.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id, t."schema", t."schema" || '.' || t."table" AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
Of course, there is always room for improvement in the SQL above, depending on the specific stats that you may want to look at or drill down into.
Hope this helps.

Amazon Redshift how to get the last date a table inserted data

I am trying to get the last date an insert was performed on a table (in Amazon Redshift). Is there any way to do this using the metadata? The tables do not store any timestamp column, and even if they did, we would need to find out for 3k tables, which would be impractical, so a metadata approach is our strategy. Any tips?
All insert execution steps for queries are logged in STL_INSERT. This query should give you the information you're looking for:
SELECT sti."schema", sti."table", sq.endtime, sq.querytxt
FROM
(SELECT MAX(query) as query, tbl, MAX(i.endtime) as last_insert
FROM stl_insert i
GROUP BY tbl
ORDER BY tbl) inserts
JOIN stl_query sq ON sq.query = inserts.query
JOIN svv_table_info sti ON sti.table_id = inserts.tbl
ORDER BY inserts.last_insert DESC;
Note: The STL tables only retain approximately two to five days of log history.