I've read various posts and am still unclear. With a star schema, I would think that if I drive a query off a dimension table, say d_article, I end up with a set of SKs (sk_article) that are used to query/probe the main fact table. So, it makes sense to set sort keys on the fields commonly used in the Where clause on that dim table.
Next, and here is what I can't find an example of or an answer to: should I include sk_article in a sort key on the fact table? More specifically, should I create an interleaved sort key with all the various SKs, since we don't always use the same ones to join to the fact table?
I have seen no reference to including sort keys for use in Joins, only.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Amazon Redshift Foreign Keys - Sort or Interleaved Keys
Redshift Sort Key
Sort keys are for sorting, not for joining. A sort key can be defined on multiple columns, and the data in the table is stored sorted by those columns. The query optimizer takes this sort order into account when determining optimal query plans.
Also, as Tony commented,
Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins.
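To make the zone map point concrete, here is a minimal Python sketch rather than anything Redshift-specific. It assumes a zone map is just a min/max pair per block; the block size, the column values and the function names are made up for illustration. It counts how many blocks a range predicate has to touch when the column is sorted and when it is not.

```python
import random

BLOCK_SIZE = 1000  # rows per block; an arbitrary illustrative figure, not Redshift's real block size


def build_zone_map(values, block_size=BLOCK_SIZE):
    """Return one (min, max) pair per fixed-size block of a column."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the predicate low <= x <= high."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)


random.seed(0)
column = list(range(100_000))                  # hypothetical fact-table column values
shuffled = random.sample(column, len(column))  # the same values in arbitrary load order

sorted_map = build_zone_map(sorted(shuffled))
unsorted_map = build_zone_map(shuffled)

# Same range predicate against both layouts.
sorted_blocks = blocks_to_scan(sorted_map, 20_000, 20_999)
unsorted_blocks = blocks_to_scan(unsorted_map, 20_000, 20_999)
print(sorted_blocks, "of", len(sorted_map), "blocks when sorted")
print(unsorted_blocks, "of", len(unsorted_map), "blocks when unsorted")
```

With sorted data the predicate maps to a narrow run of blocks, which is the range-restricted scan described above. With unsorted data nearly every block's min/max range overlaps the predicate, so the zone map prunes very little.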
Each type of key has a specific purpose. This may be a good read for you.
For joining fact and dimension tables, you should be using a distribution key.
Redshift Distribution Keys (DIST Keys)
It determines where data is stored in Redshift. Clusters store data across the compute nodes, and query performance suffers when a large amount of data is stored on a single node. Here is a good read for you.
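As a rough illustration of why a shared distribution key helps joins, the Python sketch below hashes rows onto a handful of slices. The slice count, the f_sales table and the use of CRC32 are assumptions made for the example; Redshift's internal hashing is not documented here and will differ.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical slice count


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice with a stable hash (a stand-in, not Redshift's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute(rows, key_index, num_slices=NUM_SLICES):
    """Group rows by the slice that their distribution-key column hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key_index], num_slices)].append(row)
    return slices


# Hypothetical rows: (sk_article, article_name) and (sk_article, quantity).
d_article = [(sk, f"article-{sk}") for sk in range(1, 9)]
f_sales = [(sk, qty) for sk in range(1, 9) for qty in (1, 2, 3)]

# Both tables are distributed on the same column, sk_article.
dim_slices = distribute(d_article, key_index=0)
fact_slices = distribute(f_sales, key_index=0)

# Check that every fact row has its matching dimension row on the same slice.
colocated = all(
    any(dim_row[0] == fact_row[0] for dim_row in dim_slices[slice_id])
    for slice_id, rows in fact_slices.items()
    for fact_row in rows
)
print("join is co-located:", colocated)
```

Because both tables hash the same column, matching rows always land on the same slice and the join can run locally on each one. Distributing the two tables on different columns would break that property and force rows to move between nodes at query time.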
I hope this answers your question.
A good video session is here; it may be really helpful in understanding SORT vs. DIST keys.
Related
I'm a newbie on database design hoping to learn by experimenting and implementing, and I'm sure some version of this question has been asked within database design in general, but this is specific to Tableau.
I have a few dashboards that draw from a PostgreSQL database table containing several million rows of data. Rerendering views is quite slow (i.e., if I select a different parameter, Tableau's Executing SQL query popup appears and often takes several minutes to complete).
I debugged this by using the performance recording option within Tableau, exporting the SQL queries that Tableau is using to a text file, and then using EXPLAIN ANALYZE to figure out what exactly the bottlenecks were. Unfortunately, I can't post the SQL queries themselves, but I've created a contrived case below to be as helpful as possible. Here is what my table looks like currently. The actual values being rendered by Tableau are in green, and the columns that I have foreign key references on are in yellow:
I see within the query plan that there are a lot of costly bitmap heap scans implementing the filters I am using in Tableau on the frontend on neighborhood_id, view, animal, date_updated, animal_name.
I've attempted to place a multicolumn index on these fields, but upon rerunning the queries, it does not look like the PG query planner is electing to use these indices.
So my proposed solution is to create foreign key references for each of these fields (neighborhood_id, view, animal, date_updated, animal_name). Again, yellow represents a FK reference:
My hope is that these FK references will force the query planner to use an index scan instead of a sequential scan or bitmap heap scan. However, my questions are:
Before, all the data was more or less stored in this one table, with two joins to shelter and age_of_animal tables. Now, this table will be joined to 8 smaller subtables. Will these joins drastically reduce performance? The subtables are quite small (i.e., the animal table will have only 40 entries).
I know the question is difficult to answer without seeing the actual query and query plan, but what are some high-level reasons why the query planner would elect not to use an index? I've read through some articles like "Why Postgres Won't Always Use An Index", but mostly they refer to cases where it's a small table and a simple query where the cost of the index lookup is greater than simply traversing the rows. I don't think that applies to my case though: I have millions of rows and a complex filter on 5+ columns.
Is the PG Query Planner any more likely to use multiple column indices on a collection of foreign key columns versus regular columns? I know that PG does not automatically add indices on foreign keys, so I imagine I'll still need to add indices after creating the foreign key references.
Of course, the answers to my questions could be "Why don't you just try it and see?", but in this case refactoring such a large table is quite costly and I'd like some intuition on whether it's worth the cost prior to undertaking it.
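For intuition on the second question, here is a deliberately crude Python cost sketch. It is not the planner's actual formula: it only uses PostgreSQL's default seq_page_cost (1.0) and random_page_cost (4.0), and it assumes the worst case where every matching row costs one random page fetch. The table size and selectivities are hypothetical.

```python
SEQ_PAGE_COST = 1.0      # PostgreSQL default for seq_page_cost
RANDOM_PAGE_COST = 4.0   # PostgreSQL default for random_page_cost


def seq_scan_cost(table_pages):
    """Cost of reading every page of the table once, in order."""
    return table_pages * SEQ_PAGE_COST


def index_scan_cost(table_pages, rows_per_page, selectivity):
    """Pessimistic index cost: one random page fetch per matching row."""
    matching_rows = table_pages * rows_per_page * selectivity
    return matching_rows * RANDOM_PAGE_COST


def cheaper_plan(table_pages, rows_per_page, selectivity):
    """Name the plan with the lower cost under this toy model."""
    seq = seq_scan_cost(table_pages)
    idx = index_scan_cost(table_pages, rows_per_page, selectivity)
    return "index scan" if idx < seq else "sequential scan"


# Hypothetical table: 50,000 pages of 100 rows each (5 million rows).
for selectivity in (0.0001, 0.001, 0.01, 0.1):
    print(f"{selectivity:>7.2%} of rows -> {cheaper_plan(50_000, 100, selectivity)}")
```

Under these assumptions the break-even point does not depend on table size, only on the fraction of rows the filter keeps, so even a table with millions of rows can favor a full scan when the combined filter is not very selective. The real planner also weighs CPU costs, caching and column statistics, so treat this only as a way to think about it.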
I am new to Redshift and am migrating from Oracle to Redshift.
One of the Oracle tables has around 60 indexes. AWS recommends, as good practice, having around 6 compound sort keys.
How would these 60 Oracle indexes translate to Redshift sort keys? I understand that there is no automated conversion and that I can't have all 60 of them as compound sort keys. How is this conversion usually approached?
In Oracle we can keep adding indexes to the same table, and the queries and reports can use them. But in Redshift, changing sort keys means recreating the table. How do we get the best performance for all queries that use different filter columns and join columns on the same table?
Thanks
Redshift is a columnar database, and it doesn't have indexes in the Oracle sense at all.
You can think of Redshift's compound sort key (not interleaved) as an index-organized table (IOT) in Oracle, with all the data physically sorted by this compound key.
If you create an interleaved sort key on x columns, it will act, to some extent, like a separate index on each of those x columns.
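A small Python sketch of what "physically sorted by the compound key" implies for filters. The column names and values are hypothetical, and an in-memory sorted list only stands in for the on-disk order.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows of (event_date, network, clicks), ordered by the compound
# key (event_date, network) the way a compound sort key would order them.
rows = sorted(
    (date, network, 10 * day + n)
    for day, date in enumerate(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"])
    for n, network in enumerate(["alpha", "beta", "gamma"])
)

dates = [r[0] for r in rows]

# Filter on the leading column: matching rows form one contiguous run found by binary search.
start = bisect_left(dates, "2024-01-02")
end = bisect_right(dates, "2024-01-03")
leading_positions = list(range(start, end))

# Filter on the second column only: matching rows are scattered through the table.
second_positions = [i for i, r in enumerate(rows) if r[1] == "beta"]


def is_contiguous(positions):
    """True when the positions form a single unbroken run."""
    return positions == list(range(positions[0], positions[-1] + 1))


print("leading column:", leading_positions, is_contiguous(leading_positions))
print("second column:", second_positions, is_contiguous(second_positions))
```

A filter on the leading column touches one contiguous slice of the table, while a filter on a later column alone touches rows spread across the whole table. That is why the order of columns in a compound sort key matters so much.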
In any case, being a columnar database, Redshift can outperform Oracle in many aggregation queries due to its compression and data structure. The main factors that affect performance in Redshift are distribution style and key, sort key, and column encoding.
If you can't fit all your queries with one table structure, you can duplicate the table with a different structure but the same data. This approach is widely used in big data columnar databases (for example, projections in Vertica) and trades extra storage for better performance.
Please review this page with several useful tips about Redshift performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
First a few key points
Redshift <> Oracle
Redshift does not have indexes, Redshift Sort Keys <> Oracle Indexes.
Hopefully, you are not expecting Redshift to replace Oracle for your OLTP workload. Most of those 60 indexes are likely to be for optimising OLTP-type workloads.
Max Redshift sortkeys per table = 1
You cannot sort your Redshift data in more than 1 way! The sort key orders your table data. It is not an index.
You can specify an interleaved or a compound sort key.
Query Tuning
Hopefully, you will be using Redshift for analytical-type queries. You should define sort and distribution keys based upon your expected queries. You should follow the best practice here and the tutorial here.
Tuning Redshift is partly an art; you will need to use trial and error!
If you want specific guidance on this, please can you edit your question to be specific on what you are doing?
I'm building several very large data tables on Amazon Redshift, that should hold data covering several frequently-queried properties with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It definitely should be compound, since every query will first filter on date and network, but after that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sortkey that includes all the key fields in the table? What are the downsides of having a long sortkey?
VACUUM will also take longer if you have several sort keys.
I'm trying to retrieve all of the keys from a DynamoDB table in an optimized way. There are millions of keys.
In Cassandra I would probably create a single row with a column for every key, which would eliminate the need to do a full table scan. DynamoDB's 64k limit per item would seemingly preclude this option though.
Is there a quick way for me to get back all of the keys?
Thanks.
I believe the DynamoDB analogue would be to use composite keys: have a primary key of "allmykeys" and a range attribute of the originals being tracked: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/DataModel.html#DataModelPrimaryKey
I suspect this will scale poorly to billions of entries, but should work adequately for a few million.
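Reading the keys back under that layout is a paginated query on the single hash key. The Python sketch below only shows the pagination loop. `query_page` is a hypothetical callable standing in for a real Query call, the attribute names `pk` and `original_key` are made up, and the only thing taken from DynamoDB is the response shape (an `Items` list plus a `LastEvaluatedKey` while more pages remain).

```python
def collect_range_keys(query_page, range_attr="original_key"):
    """Gather every range-key value under one hash key, following pagination.

    `query_page` takes an exclusive start key (or None) and returns a dict
    shaped like a DynamoDB Query response: an "Items" list, plus a
    "LastEvaluatedKey" entry whenever more pages remain.
    """
    keys = []
    start_key = None
    while True:
        page = query_page(start_key)
        keys.extend(item[range_attr] for item in page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            return keys


def make_fake_query(all_keys, page_size, hash_value="allmykeys", range_attr="original_key"):
    """In-memory stand-in for a real Query call, used only to exercise the loop."""
    ordered = sorted(all_keys)  # items under one hash key come back ordered by range key

    def query_page(start_key):
        begin = 0 if start_key is None else ordered.index(start_key[range_attr]) + 1
        chunk = ordered[begin:begin + page_size]
        response = {"Items": [{"pk": hash_value, range_attr: k} for k in chunk]}
        if begin + page_size < len(ordered):
            response["LastEvaluatedKey"] = {"pk": hash_value, range_attr: chunk[-1]}
        return response

    return query_page


fake = make_fake_query([f"user-{i:03d}" for i in range(25)], page_size=10)
all_keys = collect_range_keys(fake)
print(len(all_keys), all_keys[0], all_keys[-1])
```

Against a real table, the callable would wrap the SDK's query operation and pass the start key through as the exclusive start key. Every page still costs read capacity, and all of the traffic lands on one hash key, which fits the scaling concern above.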
Finally, again as with Cassandra, the most straightforward solution is to use map/reduce to get the keys: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
My application needs configurable columns, and the titles of these columns are configured at the beginning. In a relational database I would have created generic columns in the table, such as CodeA, CodeB, etc., because that helps with querying on these columns (CodeA = 11) and also with displaying the values (if the column stores a code and a value). But now I am using a non-relational database, Datastore, and I am new to it. Should I follow the same old approach, or should I use a collection (key-value pair) type of structure?
There will be a lot of filters on these columns. Please suggest.
What you've just described is one of the classic scenarios for a Key-Value database. The limitation here is that you will not have many of the set-based tools you're used to.
Most of the K-V databases are really good at loading one "record" or small set thereof. However, they don't tend to be any good at loading anything that may require a join. Given that you're using AppEngine, you probably appreciate this limitation. But it's worth stating.
As an important note, not all K-V databases will allow you to "select by any column". Many K-V stores only allow selection by a primary key. If you take a look at MongoDB, you'll find that you can query any column, which sounds like a necessary feature here.
I would suggest using key/value pairs where keys will act as your column names and value will be their data.
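As a minimal sketch of that idea in plain Python (not the Datastore API): each record carries its configurable columns as key/value pairs, and a separate mapping holds the configured titles. All names and values here are hypothetical.

```python
# Titles configured once at start-up; the names are hypothetical.
column_titles = {"code_a": "Region", "code_b": "Priority"}

# Each record keeps its configurable columns as key/value pairs
# instead of fixed CodeA/CodeB columns.
records = [
    {"id": 1, "attributes": {"code_a": 11, "code_b": 3}},
    {"id": 2, "attributes": {"code_a": 12, "code_b": 3}},
    {"id": 3, "attributes": {"code_a": 11}},
]


def filter_records(records, **criteria):
    """Return records whose attributes match every key/value pair in criteria."""
    return [
        r for r in records
        if all(r["attributes"].get(k) == v for k, v in criteria.items())
    ]


def display_row(record):
    """Render a record using the configured titles rather than the internal keys."""
    return {column_titles[k]: v for k, v in record["attributes"].items()}


matches = filter_records(records, code_a=11)
print([r["id"] for r in matches])
print(display_row(records[0]))
```

The filter here scans every record in memory. In a real store, heavy filtering on these keys only performs well if the store can index them, so check that before committing to the layout.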