Use of Redshift keys to make queries efficient - amazon-redshift

I have a redshift table with hundreds of millions of rows. My typical query looks like this...
select * from table where senddate > '2015-01-01 00:00:00' and senddate < '2015-08-01 00:00:00' and username = 'xyz'
I am not sure how sort and distribution keys work. I would like to know the best option to make this query efficient.
I have around 3,000 unique usernames, and senddate is a date within the last 5 years.
I have one more question:
I am not using any compression for this table. Does that make the query slow?

Never use select * in a columnar DB; only pull the columns you need.
If this is the only query you want to run, distribution keys don't matter. You could use DISTSTYLE ALL, but it will take n times the storage, where n is the number of nodes. That said, if you are going to join tables, distribute them on the joining keys.
You can have a sortkey on (senddate, username) to avoid reading all the records (the equivalent of a full table scan in row stores).
Read through the following for a basic understanding of these points:
http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
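For illustration, a minimal DDL sketch of that advice; the table name and column types are assumptions, not something from the question:
-- hypothetical table name and types; adjust to your real schema
create table send_log (
    username varchar(64),
    senddate timestamp
    -- plus the other columns you actually need
)
diststyle even                          -- fine when this is the only query pattern
compound sortkey (senddate, username);  -- lets the range filter skip blocks

-- query only the needed columns instead of select *
select username, senddate
from send_log
where senddate > '2015-01-01 00:00:00'
  and senddate < '2015-08-01 00:00:00'
  and username = 'xyz';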

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Tables A and B are archive tables without any indexes (millions of records).
Tables X and Y are replicas of A and B with good indexes (some thousands of records).
Below is the code for my project.
with data as
(
    SELECT a.*, b.*
    FROM A_archive a
    JOIN B_archive b ON a.transaction_id = b.transaction_id
    UNION
    SELECT x.*, y.*
    FROM X x
    JOIN Y y ON x.transaction_id = y.transaction_id
)
INSERT INTO Another_Table
(
    columns
)
select * from data
ON CONFLICT (transaction_id)
DO UPDATE ...
The whole thing above runs in the production environment, which has nearly 140 million records.
Because of this, the production database takes almost 10 hours to process the data, and the job fails.
I also have a distributed job scheduler in AWS that schedules this query inside a function and retrieves the latest records every 5 hours. The archive tables store closed invoice data. The Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication; it will save the enormous amount of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the up-front work of creating the derived table; more precisely, it defers that work to the SELECT operations that use the derived structure. If your SELECT operations have highly selective WHERE clauses, the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
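As a hedged sketch of the indexing and VIEW advice above (the index names and the non-key columns are illustrative, not taken from the real schema):
-- index the join keys on the archive tables
CREATE INDEX IF NOT EXISTS a_archive_txn_idx ON A_archive (transaction_id);
CREATE INDEX IF NOT EXISTS b_archive_txn_idx ON B_archive (transaction_id);

-- a view defers the combination work to SELECT time and can use the indexes above
CREATE VIEW closed_invoices AS
SELECT a.transaction_id, a.some_col, b.other_col   -- enumerate only the columns you need
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION ALL
SELECT x.transaction_id, x.some_col, y.other_col
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id;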
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields) and then removes the duplicates. If 100 million records are involved, that's a LOT of sorting, which will most likely spill to disk.
Keep in mind that you only need to do this if duplicates are expected:
in the result from the JOIN between A and B
in the result from the JOIN between X and Y
in the combined result from the two above
If none of those is expected, just use UNION ALL.
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in one transaction IS going to take some time; it's just a lot of data being moved around. But 10 hours does sound excessive indeed.
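A sketch of the two-INSERT split suggested above, with hypothetical column names; running the X-Y insert second lets it overrule the A-B data on conflicts:
-- 1) closed invoices from the archive pair
INSERT INTO Another_Table (transaction_id, col1, col2)
SELECT a.transaction_id, a.col1, b.col2
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
    SET col1 = EXCLUDED.col1, col2 = EXCLUDED.col2;

-- 2) the indexed replicas, which overwrite any rows already inserted above
INSERT INTO Another_Table (transaction_id, col1, col2)
SELECT x.transaction_id, x.col1, y.col2
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
    SET col1 = EXCLUDED.col1, col2 = EXCLUDED.col2;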

Selecting 10,000 records takes too long in PostgreSQL

My table contains 1 billion records. It is also partitioned by month. Id and datetime form the primary key for the table. When I select:
select col1,col2,..col8
from mytable t
inner join cte on t.Id=cte.id and dtime>'2020-01-01' and dtime<'2020-10-01'
It uses index scan, but takes more than 5 minutes to select.
Please advise.
Note: I have set work_mem to 1GB. The cte results come back within 3 seconds.
Well, that is the nature of a join; it is usually known as a time-consuming operation.
First of all, I recommend using IN rather than JOIN. Of course they have different meanings, but in some cases you can technically use them interchangeably. Check this question out.
Secondly, according to relational algebra, whenever you use a join each row of mytable is (conceptually) combined with each row from the second table, and the DBMS needs to build a huge intermediate result and finally discard the unsuitable rows. All of those steps take a lot of time. Before using the join operation, it's better to filter your tables (for example, mytable by date) to make them smaller, and then join.
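A hedged sketch of that filter-first / IN approach, keeping the question's date range; the body of cte is not shown in the question, so the driver table and predicate here are placeholders:
WITH cte AS (
    SELECT id
    FROM driver_table          -- placeholder for whatever the original cte selects from
    WHERE some_condition       -- placeholder predicate
)
SELECT t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8
FROM mytable t
WHERE t.dtime > '2020-01-01'
  AND t.dtime < '2020-10-01'   -- filter the big partitioned table first
  AND t.id IN (SELECT id FROM cte);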

Table specifically built for a dashboard has several filters.... best way to index?

I have created a materialized view for the purposes of feeding into a dashboard.
My goal is to make this table selectable in the fastest way possible and I'm not sure how to approach it. I was hoping that if I describe the table and how it will be used, someone could offer some direction.
The context is a website with funnel steps. Each row is an instance of a user triggering a funnel step such as add to cart, checkout, payment details and then finally transaction.
Since the table is for the purposes of analytics, it will be refreshed automatically with cron once a day only, in the morning, so I'm not worried about real time update speed, only select speed with various where clauses.
Suppose I have the fields described below:
(N = ~13M rows, expected to be ~20M by January, growing by several million per month)
Table is unique with the combination of session id, user id and funnel step.
- Session Id (Id, so some duplication but generally very very granular - Varchar)
- User Id (Id, so some duplication but generally very very granular - Varchar)
- Date (Date)
- Funnel Step (10 distinct value - Varchar)
- Device Category (3 distinct values - Varchar)
- Country (~ 100 distinct values - varchar)
- City (~1000+ distinct values - varchar)
- Source (several thousand distinct values, nevertheless, stakeholder would like a filter - varchar)
Would I index each field individually? Or should I index all fields in one combined index? Per the documentation, I think I can index up to 32 fields at once. But would that be advisable here, given my primary goal of select query speed over everything else?
The table will feed into dashboard that reads the table and dynamically translates filter inputs into where clauses. Each time the user adjusts a filter, the table will be read and grouped and aggregated based on the filter / where clause inputs.
Example query:
select
event_action,
count(distinct user_id) as users
from website_data.ecom_funnel
where date >= $input_start_date
and date <= $input_end_date
and device_category in ($mobile, $desktop, $tablet)
and country in ($list of all countries minus any not selected)
and source in ($list of all sources minus any not selected)
group by 1 order by users desc
This will result in a funnel shaped table of data.
I cannot aggregate before hand because the primary metric of concern is users, not sessions. These must be de-duped from the underlying table. Classic example... Suppose a person visits a website once a day for a week. Then the sum of unique visitors for that week is 1, however if I summed visitors by day I would get 7. Similar with my table, some users take multiple sessions to complete the funnel. So, this is why I cannot pre aggregate the table, since I need to apply filters to the underlying data and then count(distinct user id).
Here's explain on a subset of fields if it is useful:
QUERY PLAN
Sort (cost=862194.66..862194.68 rows=9 width=24)
Sort Key: (count(DISTINCT client_id)) DESC
-> GroupAggregate (cost=847955.01..862194.51 rows=9 width=24)
Group Key: event_action
-> Sort (cost=847955.01..852701.48 rows=1898589 width=37)
Sort Key: event_action
-> Seq Scan on ecom_funnel (cost=0.00..589150.14 rows=1898589 width=37)
Filter: ((device_category = ANY ('{mobile,desktop}'::text[])) AND (source = 'google'::text))
My overarching, specific question is, given my use case, should I index each field individually or should I create one single index? Does it matter?
On top of that, any tips for optimising this materialized view to run a select query faster would be appreciated.
Looking at your filter conditions, you should check the cardinality of the device_category field by running
select device_category, count(*) from website_data.ecom_funnel group by device_category
and looking at the values to determine whether an index should include this column first. A possible index here (without knowing the cardinality) would be a multicolumn index on:
(device_category, date)
That said, there's no benefit in creating a separate index on each column, as your query wouldn't use them all, so yes, it does matter. You would also slow down the other CRUD operations that aren't reads.
Creating one index across all columns probably won't speed things up much either; that depends on the data lying under the hood (in the table) and on how selective your filters are (the cardinality of values in the filtered columns). Such an index would most likely create a huge overhead of walking the index tree and then fetching row ids to return the data you need.
Summing up, I would try to narrow the index down to the columns that matter most in your filtering, i.e. the ones that cut out most of the data being retrieved. If your query is meant to return the majority of rows from the table, then unfortunately an index won't speed things up and you will need to aggregate.
Hope it helps.
Edit: I've just read that you already posted the counts of distinct values in your table. I'm not sure which column Funnel Step maps to in your table, but assuming it's a column named event_action, it might be beneficial to instead create an index that would also help with the grouping:
(date, event_action)
It seems like you have omitted an explicit GROUP BY clause; it should be included and should group by event_action, since that's what your select part is doing.
If you narrow the date down to several days or months every time you perform a select query, it might be a huge benefit to create the index with date as the first column.
Remember that the position of a column in an index matters.
If you look for values spanning, say, several months, you should pre-aggregate and store the pre-calculated values for each month in another table, and then UNION ALL that data with the current query, which would only select data from the current (still being updated) period.
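For illustration, the two candidate indexes discussed above as DDL, using the table and column names from the example query; the index names are made up:
-- candidate 1: filter on device_category, then the date range
CREATE INDEX ecom_funnel_devcat_date_idx
    ON website_data.ecom_funnel (device_category, date);

-- candidate 2: date range first, then the grouping column
CREATE INDEX ecom_funnel_date_action_idx
    ON website_data.ecom_funnel (date, event_action);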

How to avoid skewing in redshift for Big Tables?

I want to load a table of more than 1 TB from S3 into Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, which then cause performance issues.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So I tried a distribution key on workday, but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How do we avoid skewing the table when the row counts for the unique key values are uneven (let's say hour1 has 1 million rows, hour2 has 1.5 million rows, hour3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either a SORTKEY or a COMPOUND SORTKEY. A sort key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised to choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
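For illustration, a DDL sketch of that suggestion using the columns from the question; the table name and the particular sort-key choice are assumptions:
CREATE TABLE work_facts (
    id              INTEGER,
    name            VARCHAR(10),
    another_id      INTEGER,
    workday         INTEGER,
    workhour        INTEGER,
    worktime_number INTEGER
)
DISTSTYLE EVEN                          -- spreads the 1 TB evenly across the 20 nodes
COMPOUND SORTKEY (workday, workhour);   -- assumed filter columns; pick what your queries use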
Here is the architecture that I recommend:
1) Load into a staging table with DISTSTYLE EVEN, sorted by something that is already sorted in your loaded S3 data; this means you will not have to vacuum the staging table.
2) Set up a production table with the sort/dist you need for your queries. After each copy from S3, load that new data into the production table and vacuum (see the sketch below).
3) You may wish to have two mirror production tables and flip-flop between them using a late-binding view.
It's a bit complex to do this and you may need some professional help; there may be specifics to your use case.
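A rough sketch of steps 1 and 2, with assumed table names, S3 path and IAM role (replace them with your own):
-- 1) staging table: DISTSTYLE EVEN, sorted the same way the incoming files are sorted
CREATE TABLE stage_work_facts (
    id INTEGER, name VARCHAR(10), another_id INTEGER,
    workday INTEGER, workhour INTEGER, worktime_number INTEGER
)
DISTSTYLE EVEN
SORTKEY (id);                           -- assumption: the S3 files arrive ordered by id

COPY stage_work_facts
FROM 's3://your-bucket/your-prefix/'    -- placeholder S3 location
IAM_ROLE 'arn:aws:iam::<account>:role/<redshift-copy-role>'  -- placeholder role
FORMAT AS CSV;

-- 2) move the new rows into the production table, then clean up
INSERT INTO work_facts SELECT * FROM stage_work_facts;
VACUUM work_facts;
ANALYZE work_facts;
TRUNCATE stage_work_facts;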
As of writing this (just after re:Invent 2018), Redshift has automatic distribution available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, try a few combinations by replicating the same table with different DIST keys if you don't like what automatic distribution is doing. After the tables are created, run the admin utility from the Git repo above (preferably creating a view over the SQL script in the Redshift DB).
Also, if you have good clarity on the query usage pattern, you can use the following SQL to check how well the sort keys are performing.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
Of course, there is always room for improvement in the SQL above, depending on the specific stats that you may want to look at or drill down into.
Hope this helps.

Unable to optimise Redshift query

I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
Creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY customer_id
Everything I have read suggests this would be the optimal way to structure my tables/queries, but the performance is absolutely awful. Building the sub-tables takes over a minute even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?
If you do not have a better key you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
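For example, a hedged sketch of the main table with DISTSTYLE EVEN while keeping the same sort key; the table name and column types are assumptions:
CREATE TABLE firehose_events (
    customer_id VARCHAR(64),
    time        TIMESTAMP
    -- plus the other event columns
)
DISTSTYLE EVEN                            -- spreads each customer's rows across all slices
COMPOUND SORTKEY (customer_id, time);     -- still prunes blocks for the per-customer extract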
To see this skew in action, look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of your query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
For a very detailed guide on designing the best table structure have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook"