Redshift Composite Sortkey - how many columns should we use?

I'm building several very large data tables on Amazon Redshift that should hold data covering several frequently queried properties along with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It should definitely be compound - every query will first filter on date and network - but beyond that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sortkey that includes all the key fields in the table? What are the downsides of having a long sortkey?

VACUUM will also take longer if you have several sort key columns.
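For illustration, a minimal DDL sketch of the kind of table described in the question (table and column names are hypothetical), with DISTSTYLE EVEN and a compound sort key whose leading columns are the date and network filters that every query uses:

    -- Hypothetical names; the always-filtered columns lead the compound sort key,
    -- followed by a few of the other frequently queried properties.
    CREATE TABLE network_metrics (
        event_date  DATE         NOT NULL,
        network     VARCHAR(64)  NOT NULL,
        campaign    VARCHAR(64),
        country     VARCHAR(2),
        device_type VARCHAR(32),
        impressions BIGINT,
        clicks      BIGINT
    )
    DISTSTYLE EVEN
    COMPOUND SORTKEY (event_date, network, campaign, country, device_type);

Columns further down a compound sort key only help block pruning when the columns before them are also restricted, and each extra sort key column adds to load and VACUUM cost, which is one reason most examples stop at a handful of columns.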

Related

Redshift Sort Keys For Joins

I've read various posts and am still unclear. With a star schema, I would think that if I drive a query off a dimension table, say d_article, I end up with a set of SKs (sk_article) that are used to query/probe the main fact table. So it makes sense to set sort keys on the fields commonly used in the WHERE clause on that dim table.
Next...and here's what I can't find an example or answer...should I include sk_article in a sort key in the fact table? More specifically, should I create an interleaved sort key with all the various SKs since we don't always use the same ones to join to the fact table?
I have seen no reference to including sort keys solely for use in joins.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Amazon Redshift Foreign Keys - Sort or Interleaved Keys
Redshift Sort Key
Sort keys are just for sorting purposes, not for joining. There can be multiple columns defined as sort keys. Data stored in the table is sorted using these columns, and the query optimizer uses this sort order when determining optimal query plans.
Also, as Tony commented,
Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins.
Each of these key types has a specific purpose. This may be a good read for you.
For joining fact and dimension tables, you should be using a distribution key.
Redshift Distribution Keys (DIST Keys)
It determines where data is stored in Redshift. Clusters store data across the compute nodes, and query performance suffers when a large amount of data is stored on a single node. Here is a good read for you.
I hope this answers your question.
A good video session is here that may be really helpful in understanding SORT vs DIST keys.
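To make the sort key vs. distribution key split concrete, here is a minimal sketch using the d_article / sk_article names from the question (the fact table name and remaining columns are hypothetical): both tables are distributed on the join column so matching rows are co-located, while the fact table's sort key is kept for the column filtered in the WHERE clause.

    -- Dimension and fact distributed on the surrogate key used for the join.
    CREATE TABLE d_article (
        sk_article   BIGINT NOT NULL,
        article_name VARCHAR(256)
    )
    DISTSTYLE KEY DISTKEY (sk_article);

    CREATE TABLE f_article_sales (
        sk_article BIGINT NOT NULL,
        sale_date  DATE   NOT NULL,
        quantity   BIGINT
    )
    DISTSTYLE KEY DISTKEY (sk_article)
    COMPOUND SORTKEY (sale_date);  -- sort key reserved for the WHERE-clause filter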

Creating foreign key references to optimize Tableau SQL queries

I'm a newbie on database design hoping to learn by experimenting and implementing, and I'm sure some version of this question has been asked within database design in general, but this is specific to Tableau.
I have a few dashboards that are drawing from a PostgreSQL database table that contains several million rows of data. Performance when re-rendering views is quite slow (i.e., if I select a different parameter, Tableau's "Executing SQL query" popup appears, and it often takes several minutes to complete).
I debugged this by using the performance recording option within Tableau, exporting the SQL queries that Tableau is using to a text file, and then using EXPLAIN ANALYZE to figure out what exactly the bottlenecks were. Unfortunately, I can't post the SQL queries themselves, but I've created a contrived case below to be as helpful as possible. Here is what my table looks like currently. The actual values being rendered by Tableau are in green, and the columns that I have foreign key references on are in yellow:
I see within the query plan that there are a lot of costly bitmap heap scans implementing the filters I am using in Tableau on the frontend on neighborhood_id, view, animal, date_updated, and animal_name.
I've attempted to place a multicolumn index on these fields, but upon rerunning the queries, it does not look like the PG query planner is electing to use these indices.
So my proposed solution is to create foreign key references for each of these fields (neighborhood_id, view, animal, date_updated, animal_name) - again, yellow represents a FK reference:
My hope is that these FK references will force the query planner to use an index scan instead of a sequential scan or bitmap heap scan. However, my questions are:
1. Before, all the data was more or less stored in this one table, with two joins to the shelter and age_of_animal tables. Now, this table will be joined to 8 smaller subtables - will these joins drastically reduce performance? The subtables are quite small (e.g. the animal table will have only 40 entries).
2. I know the question is difficult to answer without seeing the actual query and query plan, but what are some high-level reasons why the query planner would elect not to use an index? I've read through some articles like "Why Postgres Won't Always Use An Index", but mostly they refer to cases where it's a small table and a simple query, where the cost of the index lookup is greater than simply traversing the rows. I don't think that applies to my case, though - I have millions of rows and a complex filter on 5+ columns.
3. Is the PG query planner any more likely to use multicolumn indices on a collection of foreign key columns versus regular columns? I know that PG does not automatically add indices on foreign keys, so I imagine I'll still need to add indices after creating the foreign key references.
Of course, the answers to my questions could be "Why don't you just try it and see?", but in this case refactoring such a large table is quite costly and I'd like some intuition on whether it's worth the cost prior to undertaking it.
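For reference, a minimal sketch of the multicolumn index and planner check described above; the column names come from the question, while the table name (animal_views) and the literal filter values are hypothetical:

    -- Multicolumn index over the filtered columns.
    CREATE INDEX idx_animal_views_filters
        ON animal_views (neighborhood_id, view, animal, date_updated, animal_name);

    -- Refresh statistics, then check whether the planner picks the index
    -- for a representative Tableau-style filter.
    ANALYZE animal_views;
    EXPLAIN ANALYZE
    SELECT *
    FROM animal_views
    WHERE neighborhood_id = 42
      AND animal = 'dog'
      AND date_updated >= DATE '2020-01-01';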

Why does Amazon Redshift only allow one sort key per table?

In Redshift, only one column can be designated as a sort key. I was wondering why a column-oriented DBMS would have a restriction like this.
ex. Let's say I have a table like this:
rowid name age
1 Kevin 20
2 Jill 35
3 Billy Bob 19
Internally the DB would store each column separately, perhaps like this:
Kevin:1,Jill:2,Billy Bob:3
20:1,35:2,19:3
I would think it would be possible to sort these separately and with their own ordering etc.
Redshift is designed to work on massive numbers of records and to calculate analytics on them quickly. Many of the design patterns of smaller DBs that are tuned for transactional workloads are not going to work at that scale. For example, sort keys in OLTP databases are implemented with indexes that duplicate the data. At a small scale (GBs) that is not a big issue, but with large amounts of data (TBs and PBs) it is.
The main usage of sort keys in Redshift is to allow the DB to minimize the number of disk IO reads, which are very slow. This is another example of a difference between small-scale DBs and large ones. If an operation takes 100ms for 1M records, it will take 100 seconds for 1B records or an hour for 36B records. Redshift allows queries over many billions of records by managing a mapping of the minimum and maximum value of each column for each 1MB compressed data block. If the data in those blocks is sorted, most of the blocks can be skipped based on your WHERE clause filters.
This is why you want to define your sort key columns (note that you can have multiple columns) to match the columns that you use in your WHERE clauses (for example, a date).
Both compound and interleaved sort keys support multiple columns, but with a compound key you define the order of the sorting, while with an interleaved key the columns are given equal weight with no order between them.
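A short DDL sketch of the two variants (hypothetical table and column names):

    -- Compound: rows are ordered by saledate first, then by region within each date,
    -- so filters on the leading column(s) prune the most 1MB blocks.
    CREATE TABLE sales_compound (
        saledate DATE,
        region   VARCHAR(32),
        amount   DECIMAL(12,2)
    )
    COMPOUND SORTKEY (saledate, region);

    -- Interleaved: each column gets equal weight, so a filter on region alone can
    -- also prune blocks, at the cost of a more expensive VACUUM REINDEX.
    CREATE TABLE sales_interleaved (
        saledate DATE,
        region   VARCHAR(32),
        amount   DECIMAL(12,2)
    )
    INTERLEAVED SORTKEY (saledate, region);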

Sorting Cassandra using individual components of Composite Keys

I want to store a list of users in a Cassandra Column Family (wide rows).
The columns in the CF will have Composite Keys of pattern id:updated_time:name:score
After inserting all the users, I need to query users in a different sorted order each time.
For example, if I specify updated_time, I should be able to fetch the 10 most recently updated users.
And if I specify score, I should be able to fetch the top 10 users based on score.
Does Cassandra support this?
Kindly help me in this regard...
I need to query users in a different sorted order each time...
Does Cassandra support this
It does not. Unlike an RDBMS, you cannot make arbitrary queries and expect reasonable performance. Instead, you must design your data model so that the queries you anticipate will be efficient:
The best way to approach data modeling for Cassandra is to start with your queries and work backwards from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns.
So rather than having one column family (table) for your data, you might want several with cross references between them. That is, you might have to denormalise your data.
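A minimal CQL sketch of that query-first approach (table names are hypothetical, and a single 'all' bucket is used as the partition key to keep the example small): one table is clustered by updated_time, the other by score, and the application writes each user to both.

    CREATE TABLE users_by_updated_time (
        bucket       text,
        updated_time timestamp,
        id           uuid,
        name         text,
        score        int,
        PRIMARY KEY (bucket, updated_time, id)
    ) WITH CLUSTERING ORDER BY (updated_time DESC, id ASC);

    CREATE TABLE users_by_score (
        bucket text,
        score  int,
        id     uuid,
        name   text,
        PRIMARY KEY (bucket, score, id)
    ) WITH CLUSTERING ORDER BY (score DESC, id ASC);

    -- 10 most recently updated users:
    -- SELECT * FROM users_by_updated_time WHERE bucket = 'all' LIMIT 10;
    -- Top 10 users by score:
    -- SELECT * FROM users_by_score WHERE bucket = 'all' LIMIT 10;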

Cassandra DB Design

I come from an RDBMS background and am designing an app with Cassandra as the backend, and I am unsure of the validity and scalability of my design.
I am working on some sort of rating/feedback app of books/movies/etc. Since Cassandra has the concept of flexible column families (sparse structure), I thought of using the following schema:
user-id (row key): book-id/movie-id (dynamic column name) - rating (column value)
If I do it this way, I would end up having millions of columns (which would have been rows in an RDBMS), though not all of them associated with every row key, for instance:
user1: {book1:Rating-Ok; book1023:good; book982821:good}
user2: {book75:Ok;book1023:good;book44511:Awesome}
Since all column families are stored in a single file, I am not sure if this is a scalable design (or a design at all!). Furthermore, there might be queries like "pick all 'good' reviews of 'book125'".
What approach should I use?
This design is perfectly scalable. Cassandra stores data in sparse form, so empty cells don't consume disk space.
The drawback is that Cassandra isn't very good at indexing by value. There are secondary indexes, but they should be used only to index a column or two, not each of millions of columns.
There are two options to address this issue:
Materialised views (described, for example, here: http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/). These allow you to build a set of predefined queries, possibly quite complex ones.
Ad-hoc querying is possible with some sort of map/reduce job that effectively iterates over the whole dataset. This might sound scary, but it's still pretty fast: Cassandra stores all data in SSTables, and this iteration can be implemented as a sequential scan of the data files.
Start from a desired set of queries and structure your column families to support those views. Especially with so few fields involved, each CF can act cheaply as its own indexed view of your data. During a fetch, the key will partition the data ultimately to one specific Cassandra node that can rapidly stream a set of wide rows to your app server in a pre-determined order. This plays to one of Cassandra's strengths, as the fragmentation of that read on physical media (when not cached) is extremely low compared to bouncing around the various tracks and sectors on an indexed search of an RDBMS table.
One useful approach when available is to select your key to segment the data such that a full scan of all columns in that segment is a reasonable proposition, and a good rough fit for your query. Then, you filter what you don't need, even if that filtering is performed in your client (app server). All reviews for a movie is a good example. Even if you filter the positive reviews or provide only recent reviews or a summary, you might still reasonably fetch all rows for that key and then toss what you don't need.
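A minimal CQL sketch of this denormalised, one-table-per-query approach (table names are hypothetical); the application writes every rating to both tables, and the second table answers "pick all 'good' reviews of 'book125'" directly.

    CREATE TABLE ratings_by_user (
        user_id text,
        item_id text,
        rating  text,
        PRIMARY KEY (user_id, item_id)
    );

    CREATE TABLE ratings_by_item (
        item_id text,
        rating  text,
        user_id text,
        PRIMARY KEY (item_id, rating, user_id)
    );

    -- All 'good' reviews of book125:
    -- SELECT user_id FROM ratings_by_item WHERE item_id = 'book125' AND rating = 'good';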
Another option: if you can figure out how to partition your data (by time, by category), playOrm offers a solution for running S-SQL against a partition, which is very fast. It is very much like an RDBMS EXCEPT that you partition the data to stay scalable and can have as many partitions as you want. Partitions can contain millions of rows (I would not exceed 10 million rows in a partition, though).
later,
Dean