Unable to optimise Redshift query - amazon-redshift

I have build a system where data is loaded from s3 into redshift every few minutes (from a kinesis firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id} and time between {$start} and {$end}'
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY customer_id
Everything I have read suggests this would be the optimal way to structure my tables/queries but the performance is absolutely awful. building the sub tables takes over a minute even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?

If you do not have a better key you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
To see this in action look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of you query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
For a very detailed guide on designing the best table structure have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook"

Related

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all have exactly 20 bytes of data, C sometimes contains NULLs
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while B and C are taking over 10 hours after which I aborted them. I ran every CREATE INDEX on their own, so no indices were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
from pg_stats
where tablename = '<Your table name here>'
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (= are values in the field physically stored in order) is close to 0.
I suspect you will see field A is different in terms of distinct values and/or a higher correlation than the other 2 fields.
Edit: Basically, creating an index = FULL SCAN of the table and create entries in the index as you progress. With the stats you have shared below that means:
Column A: it was detected as unique
A single scan is enough as the DB knows 1 record = 1 index entry.
Columns B & C : it was detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid but in fact and as explained here, a small correlation means the indexes will probably not be used (an index is useful only when entries are not scattered in all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: with insert/update/delete records, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution3:
Create partitions and enjoy partition pruning.
There are quite a lot of efforts being made for partitioning recently in postgresql. It could be worth having a look into it.

How to avoid skewing in redshift for Big Tables?

I wanted to load the table which is having a table size of more than 1 TB size from S3 to Redshift.
I cannot use DISTSTYLE as ALL because it is a big table.
I cannot use DISTSTYLE as EVEN because I want to use this table in joins which are making performance issue.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our redshift cluster has 20 nodes.
So, I tried distribution key on a workday but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How to avoid the skew in such cases?
How we avoid skewing of the table in case of an uneven number of row counts for the unique key (let's say hour1 have 1million rows, hour2 have 1.5million rows, hour3 have 2million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types, or days of week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table
2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.
3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.
its a bit complex to do this you need may need some professional help. There may be specifics to your use case.
As of writing this(Just after Re-invent 2018), Redshift has Automatic Distribution available, which is a good starter.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in Answers POSTED earlier try a few combinations by replicating the same table with different DIST keys ,if you don't like what Automatic DIST is doing. After the tables are created run the admin utility from the git repos (preferably create a view on the SQL script in the Redshift DB).
Also, if you have good clarity on query usage pattern then you can use the following queries to check how well the sort key are performing using the below SQLs.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
of course , there is always room for improvement in the SQLs above, depending on specific stats that you may want to look at or drill down to.
Hope this helps.

PostgreSQL performance tuning with table partitions

I am solving an performance issue on PostgreSQL 9.6 dbo based system. Intro:
12yo system, similar to banking system, with most queried primary table called transactions.
CREATE TABLE jrn.transactions (
ID BIGSERIAL,
type_id VARCHAR(200),
account_id INT NOT NULL,
date_issued DATE,
date_accounted DATE,
amount NUMERIC,
..
)
In the table transactions we store all transactions within a bank account. Field type_id determines the type of a transaction. Servers also as C# EntityFramework Discriminator column. Values are like:
card_payment, cash_withdrawl, cash_in, ...
14 types of transaction are known.
In generally, there are 4 types of queries (no. 3 and .4 are by far most frequent):
select single transaction like: SELECT * FROM jrn.transactions WHERE id = 3748734
select single transaction with JOIN to other transaction like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734
select 0-100, 100-200, .. transactions of given type like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100
several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'
In last few month we had unexpected row count growth, now 120M.
We are thinking of table partitioning, following to PostgreSQL doc: https://www.postgresql.org/docs/10/static/ddl-partitioning.html
Options:
partition table by type_id into 14 partitions
add column year and partition table by year (or year_month) into 12 (or 144) partitions.
I am now restoring data into out test environment, I am going to test both options.
What do you consider the most appropriate partitioning rule for such situation? Any other options?
Thanks for any feedback / advice etc.
Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.
The only good reason I see for partitioning would be if you want to delete old rows efficiently; then partitioning by date would be best.
Based on your queries, you should have these indexes (apart from the primary key index):
CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);
The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):
CREATE INDEX ON jrn.transactions (account_id, type_id);
What you have here is almost a perfect case for column-based storage as you may get it using a SAP HANA Database. However, as you explicitly have asked for a Postgres answer and I doubt that a HANA database will be within the budget limit, we will have to stick with Postgres.
Your two queries no. 3 and 4 go quite into different directions, so there won't be "the single answer" to your problem - you will always have to balance somehow between these two use cases. Yet, I would try to use two different techniques to approach each of them individually.
From my perspective, the biggest problem is the query no. 4, which creates quite a high load on your postgres server just because it is summing up values. Moreover, you are just summing up values over and over again, which most likely won't change often (or even at all), as you have said that UPDATEs nearly do not happen at all. I furthermore assume two more things:
transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
The values of column date_issued when INSERTing typically are somewhere "close to today" - so you usually won't INSERT stuff way in the past.
Out of this, to prevent aggregating values over and over again unnecessarily, I would introduce yet another table: let's call it transactions_aggr, which is built up like this:
create table transactions_aggr (
account_id INT NOT NULL,
date_issued DATE,
sumamount NUMERIC,
primary key (account_id, date_issued)
)
which will give you a table of per-day preaggregated values.
To determine which values are already preaggregated, I would add another boolean-typed column to transactions, which indicates to me, which of the rows are contained in transactions_aggr and which are not (yet). The query no. 4 then would have to be changed in such a way that it reads only non-preaggregated rows from transactions, whilst the rest could come from transactions_aggr. To facilitate that you could define a view like this:
select account_id, date_issued, sum(amount) as sumamount from
(
select account_id, date_issued, sumamount as amount from transactions_aggr as aggr
union all
select account_id, date_issued, sum(amount) as amount from transactions as t where t.aggregated = false
)
group by account_id, date_issued
Needless to say that putting an index on transactions.aggregated (perhaps in conjunction with the account_id) could greatly help to improve the performance here.
Updating transactions_aggr can be done using multiple approaches:
You could use this as a one-time activity and only pre-aggregate the current set of ~120m rows once. This would at least reduce the load on your machine doing aggregations significantly. However, over time you will run into the same problem again. Then you may just re-execute the entire procedure, simply dropping transactions_aggr as a whole and re-create it from scratch (all the original data still is there in transactions).
You have a nice period somewhere during the week/month/in the night, where you have little or no queries are coming in. Then you can open a transaction, read all transactions WHERE aggregated = false and add them with UPDATEs to transactions_aggr. Keep in mind to then toggle aggregated to true (should be done in the same transaction). The tricky part of this, however, is that you must pay attention to what reading queries will "see" of this transaction: Depending on your requirements of accuracy during that timeframe of this "update job", you may have to consider switching the transaction isolation level to "READ_COMMITED" to prevent ghost reads.
On the matter of your query no. 3 you then could try to really go for the approach of partitioning based on type_id. However, I perceive your query as a little strange, as you are performing a LIMIT/OFFSET without ordering (e.g. there is no ORDER BY statement in place) having specified (NB: You are not saying that you would be using database cursors). This may lead to the effect that the implicit order, which is currently used, is changed, if you enable partitioning on the table. So be careful on side-effects which this may cause on your program.
And one more thing: Before really doing the partition split, I would first check on the data distribution concerning type_id by issuing
select type_id, count(*) from transactions group by type_id
Not that it turns out that, for example, 90% of your data is with card_payment - so that you will have a heavily uneven distribution amongst your partitions and the biggest performance hogging queries are those which would still go into this single "large partition".
Hope this helps a little - and good luck!

use of redshift keys to make query efficient

I have a redshift table with hundreds of millions of rows. My typical query looks like this...
select * from table where senddate > '2015-01-01 00:00:00' and senddate < '2015-08-01 00:00:00' and username = 'xyz'
I am not sure how sort and distribution keys work. I will like to know what should be the best option to make the query efficient.
I have around 3,000 unique usernames and senddate is a date within last 5 years.
I have one more question:
I am not using any compression for this table. Does that make the query slow?
Never use select * in a columnar DB, only pull the columns which are needed.
If this is the only query you want to run, distribution keys dont matter. You can do a diststyle ALL but it will take n times the storage where n is the number of nodes. That said, if you are going to join tables, distribute them on the joining keys
You can have a sortkey on senddate, username to avoid reading all the records (similar to a table scan in row-stores)
Read through to have a basic understanding of these points
http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html

Cassandra efficient table walk

I'm currently working on a benchmark (which is part of my bachelor thesis) that compares SQL and NoSQL Databases based on an abstract data model an abstract queries to achieve fair implementation on all systems.
I'm currently working on the implementation of a query that is specified as follows:
I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy, just select sum (amount) and group it by the fields you want.
However, since Cassandra does not support these operations (which is perfectly fine) I need to achieve this on another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination.
The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That works partially on machines with lots on RAM for both, cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to achieve this on a faster way?
I could imagine using the partition_key, which serves also as the cassandra partition key and do this for every partition (I have 5 of them).
Also I though of doing this multithreaded by assigning every partition and report to a seperate thread and running it parallel. But I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this?
Maybe I still think too much in a SQL-like way.
Thank you for you support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
select whatever_columns from allocated limit 1000;
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;
The above query will look for records for the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When get anything less, you ignore the tk, and do greater on sk. Here is the query (Q3):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
2) I am assuming that 'report_name, view_name, col_name and row_name' are not unique and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in cassandra where key is a combination of these four values (maybe delimited). If there were three, you could have simply used a composite key for those three. Now, you also need a column called amounts which is a list. As you are scanning the allocate table (using the approach above), for each row, you do the following:
update amounts_table set amounts = amounts + whatever_amount where my_primary_key = four_col_values_delimited;
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
Sorry if my description is confusing. Please let me know if this helps.