Difference in partitioned and non-partitioned table in terms of vertical join in q kdb - kdb

I have two non-partitioned tables:
q)s:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.05); co:`a`b`f`b`c)
q)t:([] date:(2019.07.01;2019.07.01;2019.07.02;2019.07.01;2019.07.07); co:`a`b`e`b`d)
In above table when I run below query it works perfectly fine.
q)select distinct co from s,t where date within 2019.07.01 2019.07.02
co
--
a
b
f
e
I have tables with same name which are partitioned by date, when I try to run same query on partitioned tables I get below error:
ERROR: 'par
(trying to update a physically partitioned table)
Why do we get above error in partitioned tables?
What is the optimized approach to get similar output as we got in non-partitioned tables?
One solution to for 2 which I feel as brute-force is:
select distinct co from((select distinct co from s where date within 2019.07.01 2019.07.02),select distinct co from t where date within 2019.07.01 2019.07.02)

I'm assuming you are only including the date name in the source tables to assist in queries. A date partitioned table will generate the virtual date column from the hdb structure, you shouldn't include it in the actual table being written to.
Why do we get above error in partitioned tables?
There is no way to avoid having to access the data of a partitioned table except through an initial a select statement.. In this case you are directly trying to perform a , operation to the s and t tables
What is the optimized approach to get similar output as we got in non-partitioned tables?
In general, there may be a trade-off between the table size and the nature and frequency of the operations, sometimes it may be worth bringing the table into memory for frequent joins, or creating a top-level flat table with the relevant subset of data.
If this is just a generalized test case for larger operations then something along the following would be ideal
distinct raze {select distinct co from x where date within 2019.07.01 2019.07.02} each `s`t
This performance is not very different from your own query however, it's just a bit more succinct.

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
SELECT a.*, b.* FROM A_archive a
join B_archive b where a.transaction_id = b.transaction_id
UNION
SELECT x.*, y.* FROM X x
join Y y where x.transaction_id = y.transaction_id
)
INSERT INTO
Another_Table
(
columns
)
select * from data
On Conflict(transaction_id)
do udpate ...
The above whole thing is running in production environment and has nearly 140 million records.
Due to this production database is taking almost 10 hours to process the data and failing.
I am also having a distributed job scheduler in AWS to schedule this query inside a function and retrieve the latest records every 5 hours. The archive tables store closed invoice data. Pega UI will be using this table for retrieving data about closed invoices and showing to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication. It will save the s**tton of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
That UNION must be costly, it pretty much builds a temp-table in the background containing all the A-B + X-Y records, sorts it (over all fields) and then removes any doubles. If you say 100 million records are involved then that's a LOT of sorting going on that most likely will involve swapping out to disk.
Keep in mind that you only need to do this if there are expected duplicates
in the result from the JOIN between A and B
in the result from the JOIN between X and Y
in the combined result from the two above
IF neither of those are expected, just use UNION ALL
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.

select 10000 records take too long time in PostgreSQL

my table contains 1 billion records. It is also partitioned by month.Id and datetime is the primary key for the table. When I select
select col1,col2,..col8
from mytable t
inner join cte on t.Id=cte.id and dtime>'2020-01-01' and dtime<'2020-10-01'
It uses index scan, but takes more than 5 minutes to select.
Please suggest me.
Note: I have set work_mem to 1GB. cte table results comes with in 3 seconds.
Well it's the nature of join and it is usually known as a time consuming operation.
First of all, I recommend to use in rather than join. Of course they have got different meanings, but in some cases technically you can use them interchangeably. Check this question out.
Secondly, according to the relation algebra whenever you use join each rows of mytable table is combined with each rows from the second table, and DBMS needs to make a huge temporary table, and finally igonre unsuitable rows. Undoubtedly all the steps and the result would take much time. Before using the Join opeation, it's better to filter your tables (for example mytable based date) and make them smaller, and then use the join operations.

How to avoid skewing in redshift for Big Tables?

I wanted to load the table which is having a table size of more than 1 TB size from S3 to Redshift.
I cannot use DISTSTYLE as ALL because it is a big table.
I cannot use DISTSTYLE as EVEN because I want to use this table in joins which are making performance issue.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So, I tried distribution key on a workday but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How we avoid skewing of the table in case of an uneven number of row counts for the unique key (let's say hour1 have 1million rows, hour2 have 1.5million rows, hour3 have 2million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types, or days of week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table
2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.
3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.
its a bit complex to do this you need may need some professional help. There may be specifics to your use case.
As of writing this(Just after Re-invent 2018), Redshift has Automatic Distribution available, which is a good starter.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in Answers POSTED earlier try a few combinations by replicating the same table with different DIST keys ,if you don't like what Automatic DIST is doing. After the tables are created run the admin utility from the git repos (preferably create a view on the SQL script in the Redshift DB).
Also, if you have good clarity on query usage pattern then you can use the following queries to check how well the sort key are performing using the below SQLs.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
of course , there is always room for improvement in the SQLs above, depending on specific stats that you may want to look at or drill down to.
Hope this helps.

LEFT JOIN returns incorrect result in PostgreSQL

I have two tables: A (525,968 records) and B (517,831 records). I want to generate a table with all the rows from A and the matched records from B. Both tables has column "id" and column "year". The combination of id and year in table A is unique, but not in table B. So I wrote the following query:
SELECT
A.id,
A.year,
A.v1,
B.x1,
B.e1
FROM
A
LEFT JOIN B ON (A.id = B.id AND A.year = B.year);
I thought the result should contain the same total number of records in A, but it only returns about 517,950 records. I'm wondering what the possible cause may be.
Thanks!
First of all, I understand that this is an example, but postgres may hava an issues with capital letters in the table names.
Secondly, it may be a good idea to check how exactly you calculated 525,968 records. The thing is - if you use sime kind of client of database administration / queries - it may show you different / technical information about tables (there may be internal row counters in postgres that may actually differ from the number of records).
And finally to check yourself do something like
SELECT
count("A".id)
FROM
"A"

How to optimise tables in Netezza to compliment a join with date conditions

I have two tables that I need to join in Netezza and one of them is very large
I have a dimension table that is a customer table which has two fields, customer id and an observation date i.e.
cust_id, obs_date
'a','2015-01-05'
'b','2016-02-03'
'c','2014-05-21'
'd','2016-01-31'
I have a fact table that is transactional and very high in volume. It has a lot of transactions per customer per date i.e.
cust_id, tran_date, transaction_amt
'a','2015-01-01',1
'a','2015-01-01',2
'a','2015-01-01',5
'a','2015-01-02',7
'a','2015-01-02',2
'b','2016-01-02',12
Both tables are distributed by the same key - cust_id
However When I join the tables, i need to join given the date condition. The query is very fast when i just join them together, but when I add the date condition it does not seem optimised. Does anyone have tips on how to set up the underlying tables or write the join?
I.e. sum transaction_amt for each customer for all their transactions for the 3 months up to their obs_date
FROM CUSTOMER_TABLE
INNER JOIN TRANSACTION_TABLE
ON CUSTOMER_TABLE.cust_id = TRANSACTION_TABLE.cust_id
AND TRANSACTION_TABLE.TRAN_DATE BETWEEN CUSTOMER_TABLE.OBS_DATE - 30 AND CUSTOMER_TABLE.OBS_DATE
If your transaction table is sufficiently large, it may benefit from using CBTs.
If you can, create a copy of the table that uses TRAN_DATE to organize (I'm guessing at your ddl here):
create table transaction_table (
cust_id varchar(20)
,tran_date date
,transaction_amt numeric(10,0)
) distribute on (cust_id)
organize on (tran_date);
Join to that and see if performance is improved. You could also use a materialized view for just those columns, but I think a CBT would be more useful here.
As Scott mentions in the comments below, you should either sort by the date on insert or groom the records after to make sure that they are sorted appropriately.