I am new to Redshift and am migrating from Oracle to Redshift.
One of the Oracle tables has around 60 indexes. AWS recommends, as a good practice, having around 6 compound sort keys.
How would these 60 Oracle indexes translate to Redshift sort keys? I understand there is no automated conversion and that I can't have all 60 of them as compound sort keys. How is this conversion usually approached?
In Oracle we can keep adding indexes to the same table, and the queries and reports can use them. But in Redshift, changing sort keys means recreating the table. How do we get the best performance for all queries that use different filter columns and join columns on the same table?
Thanks
Redshift is a columnar database, and it does not have indexes in the Oracle sense at all.
You can think of Redshift's compound sort key (not interleaved) as an IOT (index-organized table) in Oracle, with all the data sorted physically by this compound key.
If you create an interleaved sort key on x columns, it will act, in some manner, like a separate index on each of the x columns.
In any case, being a columnar database, Redshift can outperform Oracle in many aggregation queries thanks to its compression and data structure. The main factors that affect performance in Redshift are distribution style and key, sort key, and column encoding.
If you can't fit all your queries with one table structure, you can duplicate the table with a different structure but the same data. This approach is widely used in big data columnar databases (for example, projections in Vertica) and helps to achieve performance at the cost of storage.
Please review this page with several useful tips about Redshift performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
First a few key points
Redshift <> Oracle
Redshift does not have indexes, Redshift Sort Keys <> Oracle Indexes.
Hopefully, you are not expecting Redshift to replace Oracle for your OLTP workload. Most of those 60 indexes are likely there to optimise an OLTP-type workload.
Max Redshift sort keys per table = 1
You cannot sort your Redshift data in more than one way! The sort key orders your table data. It is not an index.
You can specify an interleaved or a compound sort key.
Query Tuning
Hopefully, you will be using Redshift for analytical queries. You should define sort and distribution keys based on your expected queries. You should follow the best practice here and the tutorial here.
Tuning Redshift is partly an art; you will need to use trial and error!
If you want specific guidance on this, please can you edit your question to be specific on what you are doing?
Related
I'm using TimescaleDB with PostgreSQL to manage time-series data.
When optimizing the table, is it recommended to rely purely on the TimescaleDB hypertable, or should I also add indexes independently, the same way I would when not using a hypertable?
What is critical in that scenario is the performance of retrieving the data.
TimescaleDB creates an index on the time dimension by default. If your queries often select data on values from other columns, it can be helpful to create indexes on such columns as you would with normal tables. However, in the case of TimescaleDB all indexes should be compound and include the time dimension column. You might drop the automatically created index on the time dimension after you have created the new indexes.
As usual, when creating new indexes you should take into consideration that indexes occupy additional space and require more processing resources to maintain.
Timescale has a blog post with advice on adding indexes.
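To make the idea of a compound index that includes the time column concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not TimescaleDB, and the table name conditions, the column names, and the index name idx_device_time are all made up for illustration. The CREATE INDEX statement has the same shape you would use on a hypertable.

```python
import sqlite3

# SQLite is only a stand-in here; table, column, and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conditions (time INTEGER, device_id INTEGER, temperature REAL)"
)

# Compound index: the frequently filtered column first, then the time column.
conn.execute("CREATE INDEX idx_device_time ON conditions (device_id, time)")

# Ask the planner how it would run a query filtering on both columns.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM conditions WHERE device_id = ? AND time > ?",
    (7, 1000),
).fetchall()

# The human-readable detail is the last column of each plan row.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The printed plan should mention idx_device_time, showing that one index serves both the equality filter on device_id and the range filter on time. On a real TimescaleDB instance you would check the same thing with EXPLAIN.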
I've read various posts and am still unclear. With a star schema, I would think that if I drive a query off a dimension table, say d_article, I end up with a set of SKs (sk_article) that are used to query/probe the main fact table. So, it makes sense to set sort keys on the fields commonly used in the Where clause on that dim table.
Next...and here's what I can't find an example or answer...should I include sk_article in a sort key in the fact table? More specifically, should I create an interleaved sort key with all the various SKs since we don't always use the same ones to join to the fact table?
I have seen no reference to using sort keys for joins, only for filtering.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Amazon Redshift Foreign Keys - Sort or Interleaved Keys
Redshift Sort Key
Sort keys are for sorting only, not for joining. Multiple columns can be defined as sort keys, and the data stored in the table is sorted by these columns. The query optimizer uses this sort order when determining optimal query plans.
Also, as Tony commented,
Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins.
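The zone-map effect described in that comment can be illustrated with a toy model. This is only a sketch: the block size, the value range, and the function names are assumptions for illustration, not Redshift's actual storage layout. Each block stores the minimum and maximum of its values, and a range filter only has to read blocks whose range overlaps the predicate.

```python
import random

BLOCK_SIZE = 1000  # rows per block; illustrative, not Redshift's real block size


def build_zone_map(values, block_size=BLOCK_SIZE):
    """Return one (min, max) pair per fixed-size block of values."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the predicate [low, high]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= low and bmin <= high)


random.seed(0)
unsorted_col = [random.randrange(1_000_000) for _ in range(100_000)]
sorted_col = sorted(unsorted_col)  # same data, stored in sort-key order

unsorted_hits = blocks_to_scan(build_zone_map(unsorted_col), 500_000, 510_000)
sorted_hits = blocks_to_scan(build_zone_map(sorted_col), 500_000, 510_000)
print(unsorted_hits, sorted_hits)
```

With unsorted data nearly every block spans the whole value range, so almost all of them must be read. With sorted data only a handful of adjacent blocks overlap the filter, which is the range-restricted scan the comment refers to.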
Each type of key has a specific purpose. This may be a good read for you.
For joining fact and dimension tables, you should be using a distribution key.
Redshift Distribution Keys (DIST Keys)
It determines where data is stored in Redshift. Clusters store data across the compute nodes. Query performance suffers when a large amount of data is stored on a single node. Here is a good read for you.
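As a rough illustration of how a distribution key decides where rows land, here is a small sketch. The slice count is hypothetical, and crc32 is only a stand-in for whatever hash Redshift really uses. Rows with the same key value always map to the same slice, which is why distributing a fact table and a dimension table on the join column keeps matching rows together. A key with few distinct values piles the data onto a few slices.

```python
import zlib
from collections import Counter

NUM_SLICES = 4  # hypothetical number of slices in the cluster


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice using a stable hash.

    crc32 is an assumption for illustration, not Redshift's hash function.
    """
    return zlib.crc32(str(key).encode()) % num_slices


def rows_per_slice(keys):
    """Return the number of rows each slice would hold for the given keys."""
    counts = Counter(slice_for(k) for k in keys)
    return [counts.get(s, 0) for s in range(NUM_SLICES)]


# High-cardinality key (such as a surrogate key): rows spread over all slices.
spread = rows_per_slice(range(100_000))

# Low-cardinality key (only two distinct values): at most two slices get data.
skewed = rows_per_slice(i % 2 for i in range(100_000))

print(spread)
print(skewed)
```

The second list shows the skew problem: some slices hold half or more of the rows while others sit empty, so a few nodes do most of the work.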
I hope this answers your question.
A good video session is here, that may be really helpful in understanding SORT VS DIST Key.
I'm a newbie on database design hoping to learn by experimenting and implementing, and I'm sure some version of this question has been asked within database design in general, but this is specific to Tableau.
I have a few dashboards that draw from a PostgreSQL database table containing several million rows of data. Rerendering views is quite slow (i.e., if I select a different parameter, Tableau's "Executing SQL query" popup appears and often takes several minutes to complete).
I debugged this by using the performance recording option within Tableau, exporting the SQL queries that Tableau is using to a text file, and then using EXPLAIN ANALYZE to figure out what exactly the bottlenecks were. Unfortunately, I can't post the SQL queries themselves, but I've created a contrived case below to be as helpful as possible. Here is what my table looks like currently. The actual values being rendered by Tableau are in green, and the columns that I have foreign key references on are in yellow:
I see within the query plan that there are a lot of costly bitmap heap scans implementing the filters I am using in Tableau on the frontend on neighborhood_id, view, animal, date_updated, animal_name.
I've attempted to place a multicolumn index on these fields, but upon rerunning the queries, it does not look like the PG query planner is electing to use these indices.
So my proposed solution is to create foreign key references for each of these fields (neighborhood_id, view, animal, date_updated, animal_name)- again, yellow represents a FK reference:
My hope is that these FK references will force the query planner to use an index scan instead of a sequential scan or bitmap heap scan. However, my questions are:
1. Before, all the data was more or less stored in this one table, with two joins to shelter and age_of_animal tables. Now, this table will be joined to 8 smaller subtables. Will these joins drastically reduce performance? The subtables are quite small (i.e., the animal table will have only 40 entries).
2. I know the question is difficult to answer without seeing the actual query and query plan, but what are some high-level reasons why the query planner would elect not to use an index? I've read through some articles like "Why Postgres Won't Always Use An Index", but mostly they refer to cases where it's a small table and a simple query where the cost of the index lookup is greater than simply traversing the rows. I don't think that applies to my case though: I have millions of rows and a complex filter on 5+ columns.
3. Is the PG query planner any more likely to use multicolumn indices on a collection of foreign key columns versus regular columns? I know that PG does not automatically add indices on foreign keys, so I imagine I'll still need to add indices after creating the foreign key references.
Of course, the answers to my questions could be "Why don't you just try it and see?", but in this case refactoring such a large table is quite costly and I'd like some intuition on whether it's worth the cost prior to undertaking it.
Many posts like this stackoverflow link claim that there is no concept of a clustered index in PostgreSQL. However, the PostgreSQL documentation contains something similar. A few people claim it is similar to a clustered index in SQL Server.
Do you know what the exact difference between these two is, if there is any?
A clustered index or index organized table is a data structure where all the table data are organized in index order, typically by organizing the table in a B-tree structure.
Once a table is organized like this, the order is automatically maintained by all future data modifications.
PostgreSQL does not have such clustering indexes. What the CLUSTER command does is rewrite the table in the order of the index, but the table remains a fundamentally unordered heap of data, so future data modifications will not maintain that index order.
You have to CLUSTER a PostgreSQL table regularly if you want to maintain an approximate index order in the face of data modifications to the table.
Clustering in PostgreSQL can improve performance, because tuples found during an index scan will be close together in the heap table, which can turn random access to the heap into faster sequential access.
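The difference between a one-time CLUSTER and a self-maintaining clustered index can be sketched with a toy model. The HeapTable class below is hypothetical and only mimics physical row order; it is not how PostgreSQL stores tuples.

```python
class HeapTable:
    """Toy model of a heap table: rows sit in physical insertion order."""

    def __init__(self):
        self.rows = []

    def insert(self, key):
        # New rows go wherever there is free space; here, simply at the end.
        self.rows.append(key)

    def cluster(self):
        # One-time physical rewrite in index order, like the CLUSTER command.
        self.rows.sort()

    def is_physically_ordered(self):
        return all(a <= b for a, b in zip(self.rows, self.rows[1:]))


table = HeapTable()
for key in [5, 1, 4, 2]:
    table.insert(key)

table.cluster()
ordered_after_cluster = table.is_physically_ordered()  # True

table.insert(3)  # a later modification is not placed in index order
ordered_after_insert = table.is_physically_ordered()   # False

print(ordered_after_cluster, ordered_after_insert)  # prints "True False"
```

In a true clustered index or index-organized table, the insert of key 3 would have gone between 2 and 4. In this model, as in PostgreSQL, the order is only restored by running cluster() again.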
I'm building several very large data tables on Amazon Redshift, that should hold data covering several frequently-queried properties with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It definitely should be compound, since every query will first filter on date and network, but after that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sort key that includes all the key fields in the table? What are the downsides of having a long sort key?
VACUUM will also take longer if the sort key includes several columns.
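Beyond VACUUM cost, it is worth checking how much a trailing column in a compound sort key actually helps. The toy model below is a sketch under assumptions: the block size, the column names date and factor, and the data are all made up, and real Redshift storage is more involved. Rows are sorted by (date, factor), and a block is read only if its min/max range for the filtered column could contain the value.

```python
BLOCK_SIZE = 1000  # illustrative block size, not Redshift's real one

# 100 dates x 100 factor values x 10 rows each, stored sorted by (date, factor).
rows = sorted(
    (date, factor)
    for date in range(100)
    for factor in range(100)
    for _ in range(10)
)


def blocks_scanned(rows, column, value, block_size=BLOCK_SIZE):
    """Count blocks whose min/max range for `column` could contain `value`."""
    scanned = 0
    for start in range(0, len(rows), block_size):
        block = [row[column] for row in rows[start:start + block_size]]
        if min(block) <= value <= max(block):
            scanned += 1
    return scanned


leading_only = blocks_scanned(rows, column=0, value=50)   # filter on date only
trailing_only = blocks_scanned(rows, column=1, value=50)  # filter on factor only
print(leading_only, trailing_only)  # prints "1 100"
```

In this model a filter on the leading column reads one block, while a filter on the trailing column alone reads every block, because each block covers the full range of factor values. Columns far down a long compound key tend to behave like the trailing column here unless the query also constrains the columns before them.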