distkey and sortkey on temporary tables - Redshift

I am starting to do some research on query tuning, and have been experimenting with using distkey and sortkey. From what I've read, if I set the distkey to the joining column, the query planner will use a merge join instead of a hash join, which should be faster in Redshift. I was wondering whether this also applies to temporary tables? Our production tables are actually views, so they do not have any keys set. I'm not sure why we don't use the actual warehouse tables.

Yes, keys can be set for temporary tables:
create temp table fred DISTKEY (1) as ...
This is easily done with column position - the first column in this example. You can also set the distribution style on temp tables if you so desire. Doing this can force data to stay "on node" for intermediate results in very large and complex queries. Redshift does a good job of making reasonable decisions about how to distribute intermediate results, but it isn't perfect and doesn't understand the nature of the data. I've done this with good results when large volumes of data are in play.
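For reference, keys and style can also be set by column name rather than position in a CTAS. A minimal sketch (table and column names are hypothetical):
create temp table order_stage
diststyle key
distkey(customer_id)
sortkey(customer_id)
as
select customer_id, order_id, amount
from orders;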
As to your second point about using views instead of tables: in Redshift, standard views are basically SQL macros that are flattened and optimized by the Redshift query compiler, so use of views instead of tables is not bad in itself. Views, especially complex ones, can hide what is being done by the query, though, and this can add unneeded and unexpected complexity. The keys are set on the tables referenced by the views. (I'm assuming that the views are not referencing external/Spectrum tables.)
Lastly, you state you are looking to achieve merge join behavior to improve performance. While it is true that this is the fastest type of join, the time and work required to get merge joins to happen on temp tables will not be offset by the performance gain (in my experience). Redshift will only use a merge join when it is sure that the data being joined will "zipper" together without issue; if it isn't completely sure this is the case, it has to perform a hash join, which is a more general process. To get Redshift to do the merge join you will need to sort and analyze your temp tables, which will cost much more time than the savings you will get. It is far more important to have your joins be "DS_DIST_NONE" - no network distribution of the data - than to move from a hash join to a merge join.
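You can check the join distribution in the query plan with EXPLAIN. A sketch (table and column names are hypothetical):
explain
select f.order_id, d.name
from fact_orders f
join dim_customer d on d.customer_id = f.customer_id;
-- In the output, "DS_DIST_NONE" means no network redistribution (good);
-- "DS_BCAST_INNER" or "DS_DIST_BOTH" indicate data movement worth fixing first.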

Yes, it can be done. Just put the distkey clause before the start of the table query:
create temp table a distkey(column_name) as
(select query .....)

Related

Find unused columns in Redshift

I'm trying to find columns that are not queried in Redshift tables, in order to drop them.
I tried doing this by analyzing the stl_query table, but that doesn't provide very accurate results (because of views).
Is there some kind of utility that can help with that?
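Not a full utility, but one rough approach is to search the logged query text for each candidate column. A sketch (the column name is a hypothetical placeholder, and note that STL tables only retain a few days of history):
select q.query, q.starttime, t.text
from stl_query q
join stl_querytext t on t.query = q.query
where t.text ilike '%my_candidate_column%'
order by q.starttime desc
limit 20;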

Redshift not performing merge joins with interleaved sort keys

I'm looking at the performance of some queries that I'm doing in Redshift and noticed something that I can't quite find in the documentation.
I created two tables that have a join key between them (about 10K rows in the child table).
For the parent table, let's call it A, I have a primary key that I've declared to be the distkey and sort key for the table. Let's call this id.
For the child table B, I've made a foreign key field, parent_id, that references A.id. parent_id has been declared as the distkey for table B. Table B also has a primary key, id, that I've defined. I've created an interleaved sort key on table B for (parent_id, id).
When I try to do an explain joining the two tables, I will always get a Hash Join. If I recreate table B with a normal compound sort key, I will always get a Merge Join.
When I look at the stats of the tables, I don't see any skews that are out of line.
My question is, will Redshift always use Hash Joins with interleaved sort keys or is there something I'm doing wrong?
EDIT - The order of the interleaved sort keys in Table B is actually (parent_id, id). I wrote it above incorrectly. I've updated the above to be clear now.
From my understanding:
A merge join can be used when both tables are sorted on the join column, which is very efficient -- a bit like closing a zipper, where both sides "fit into" each other.
A hash join is less efficient because it needs to do lookups via hashes to find matching values.
As you pointed out, if the tables are sorted using a normal compound key, then both tables are sorted by the join column.
With an interleaved sort key, however, values are not guaranteed to be sorted within each column.
The documentation for Interleaved Keys says:
An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.
However, it does not mean that all columns are sorted (as they are with a compound sort). Rather, it gives a generally good mix of sorting, so that filters on any column work reasonably well. Therefore, each column is not necessarily fully sorted, hence the need for a hash join.
The blog post Quickly Filter Data in Amazon Redshift Using Interleaved Sorting tries to explain how the data is stored when using interleaved sorting.
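For illustration, the two variants of table B might look like this (a sketch; the column types are assumptions):
create table b_compound (
    id        bigint,
    parent_id bigint
)
distkey(parent_id)
compound sortkey(parent_id, id);    -- eligible for merge joins when fully sorted

create table b_interleaved (
    id        bigint,
    parent_id bigint
)
distkey(parent_id)
interleaved sortkey(parent_id, id); -- good for varied filters, but joins fall back to hash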

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
... and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
Redshift does not support that, but you most likely won't need it.
Try the following (expanding on the answer from @ketan):
Create your main table with an appropriate (for joins) DIST key, a COMPOUND or simple SORT KEY on the timestamp column, and proper compression on columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this will preserve DIST/SORT keys), load it with the daily data, and VACUUM SORT it.
Copy the sorted temp table into the main table using ALTER TABLE APPEND - this will copy the data sorted, and will reduce VACUUM work on the main table. You may still need VACUUM SORT after that.
After that, query your main table normally, probably giving it a range on the timestamp. Redshift is optimised for these scenarios, and 99% of the time you don't need to optimise table scans yourself - even on tables with billions of rows, scans take milliseconds to a few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight into the performance of scans, use the STL_QUERY system table to find your query ID, and then use the STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
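For example, a sketch (the query ID in the second statement is a placeholder):
-- find the query ID of a recent statement
select query, starttime, substring(querytxt, 1, 80)
from stl_query
order by starttime desc
limit 10;

-- then inspect its steps (12345 is a placeholder ID)
select *
from svl_query_summary
where query = 12345
order by step;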
Your example is actually the main use case for ALTER TABLE APPEND.
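Putting the steps together, a minimal sketch (table names are hypothetical; note that ALTER TABLE APPEND works on permanent tables, so the staging table here is a regular table rather than a declared TEMP one):
create table event_stage (like event_table);  -- preserves dist/sort keys and encodings
-- COPY the day's data into event_stage here, then:
vacuum sort only event_stage;
analyze event_stage;
alter table event_table append from event_stage;
drop table event_stage;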
I am assuming that you are creating a new table every day.
What you can do is:
Create a view on top of the event_table_* tables. Query your data using this view (a sketch follows below).
Whenever you create or drop a table, update the view.
If you want, you can avoid step 2: instead of creating a new table every day, create empty tables for the next 1-2 years. Then there's no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
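A sketch of such a view, using the table names from the question:
create or replace view all_events as
select * from event_table_2016_06_13
union all
select * from event_table_2016_06_14;
-- re-run CREATE OR REPLACE VIEW with the updated table list after each daily create/drop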
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with the date as the sort key. Then, whenever your table is queried with some date, all disk blocks that don't contain that date will be skipped. That'll be as efficient as having time-series tables.
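A sketch of that single-table design (the columns are hypothetical):
create table events (
    event_date date,
    event_id   bigint,
    payload    varchar(1000)
)
sortkey(event_date);

-- a date predicate lets Redshift skip disk blocks outside the range
select count(*)
from events
where event_date = trunc(getdate());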

SQLite ANALYZE breaks indexes

I have a table that contains about 500K rows. The table has an index on the 'status' column. So I run the following explain command:
EXPLAIN QUERY PLAN SELECT * FROM my_table WHERE status = 'ACTIVE'
Results in a predictable 'explanation'...
SEARCH TABLE my_table USING INDEX IDX_my_table_status (status=?) (~10 rows)
After many additional rows are added to the table, I call ANALYZE. Afterwards, queries seemed much slower, so I re-ran my explain and now see the following:
SCAN TABLE my_table (~6033 rows)
The first thing I notice is that BOTH row estimates are WAY off. The bigger concern is that the index seems to be skipped once ANALYZE is run. I tried REINDEX - to no avail. The only way I can get the indexes back is to drop them, then re-create them. Has anyone seen this? Is this a bug? Any ideas what I am doing wrong? I have tried this on multiple databases and I see the same results. This is on my PC, on Mac, and on the iPhone/iPad - all with the same results.
When SQLite fetches rows from a table using an index, it has to read the index pages first, and then read all the table's pages that contain one or more matching records.
If there are many matching records, almost all the table's pages are likely to contain one, so going through the index would require reading more pages.
However, SQLite's query planner does not have information about the record sizes in the index or the table, so it's possible that its estimates are off.
The information collected by ANALYZE is stored in the sqlite_stat1 and maybe sqlite_stat3 tables.
Please show what the information about your table is.
If that information does not reflect the true distribution of your data, you can try to run ANALYZE again, or just delete that information from the sqlite_stat* tables.
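For example (a sketch, using 'my_table' from the question):
-- see what ANALYZE recorded for the table and its indexes
SELECT * FROM sqlite_stat1 WHERE tbl = 'my_table';

-- discard the stats for that table so the planner falls back to its defaults
DELETE FROM sqlite_stat1 WHERE tbl = 'my_table';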
You can force going through an index if you use ORDER BY on the indexed field.
(INDEXED BY is, as its documentation says, not intended for use in tuning the performance of a query.)
If you do not need to select all fields of the table, you can speed up specific queries by creating an index on those queries' fields so that you have a covering index.
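Two sketches of those ideas (the second column name is hypothetical):
-- nudge the planner toward the index by ordering on the indexed column
SELECT * FROM my_table WHERE status = 'ACTIVE' ORDER BY status;

-- a covering index for a query that only needs status and name
CREATE INDEX idx_my_table_status_name ON my_table(status, name);
SELECT status, name FROM my_table WHERE status = 'ACTIVE';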
It's not uncommon for a query execution plan to avoid using an existing index on a low-cardinality column like "status", which probably only has a few distinct values. It's often faster for the lookups to be performed by scanning the db table. (Some DBAs recommend never indexing low-cardinality columns.)
However, based on the wildly varying row counts in the explain plan, I'm guessing that SQLite's 'analyze' performs similarly to MySQL's 'analyze' when using the InnoDB storage engine. MySQL's 'analyze' does a random set of dives into the table data to determine row count, index cardinality, etc. Because of the random dives, the statistics may vary after each 'analyze' is run, and result in differing query execution plans. Low-cardinality columns are even more susceptible to incorrect stats, as, for example, the random dives may indicate that the majority of the rows in your table have an "active" status, making it more efficient to table scan rather than use the index. (I'm no SQLite expert, so someone please chime in if my hunch about the 'analyze' behavior is incorrect.)
You can try testing the use of the index in the query using "indexed by" (see http://www.sqlite.org/lang_indexedby.html), although forcing the use of indexes is usually a last resort. Different RDBMSs have different solutions to the low-cardinality problem, such as partitioning, using bitmap indexes, etc. I would recommend researching SQLite-specific solutions to querying/indexing on low-cardinality columns.
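If you do test forcing the index, the syntax looks like this (using the index name from the question); note that SQLite raises an error if the named index cannot be used:
SELECT * FROM my_table INDEXED BY IDX_my_table_status WHERE status = 'ACTIVE';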

Best use of indices on temporary tables in T-SQL

If you're creating a temporary table within a stored procedure and want to add an index or two on it, to improve the performance of any additional statements made against it, what is the best approach? Sybase says this:
"the table must contain data when the index is created. If you create the temporary table and create the index on an empty table, Adaptive Server does not create column statistics such as histograms and densities. If you insert data rows after creating the index, the optimizer has incomplete statistics."
but recently a colleague mentioned that if I create the temp table and indices in a different stored procedure to the one which actually uses the temporary table, then the Adaptive Server optimiser will be able to make use of them.
On the whole, I'm not a big fan of wrapper procedures that add little value, so I've not actually got around to testing this, but I thought I'd put the question out there, to see if anyone had any other approaches or advice?
A few thoughts:
If your temporary table is so big that you have to index it, then is there a better way to solve the problem?
You can force it to use the index (if you are sure that the index is the correct way to access the table) by giving an optimiser hint, of the form:
SELECT *
FROM #table (index idIndex)
WHERE id = @id
If you are interested in performance tips in general, I've answered a couple of other questions about that at some length here:
Favourite performance tuning tricks
How do you optimize tables for specific queries?
What's the problem with adding the indexes after you put data into the temp table?
One thing you need to be mindful of is the visibility of the index to other instances of the procedure that might be running at the same time.
I like to add a guid to these kinds of temp tables (and to the indexes), to make sure there is never a conflict. The other benefit of this approach is that you could simply make the temp table a real table.
Also, make sure that you will need to query the data in these temp tables more than once during the running of the stored procedure, otherwise the cost of index creation will outweigh the benefit to the select.
In Sybase, if you create a temp table and then use it in one proc, the plan for the select is built using an estimate of 100 rows in the table. (The plan is built when the procedure starts, before the tables are populated.) This can result in the temp table being table-scanned since it is only "100 rows". Calling another proc causes Sybase to build the plan for the select with the actual number of rows; this allows the optimizer to pick a better index to use. I have seen significant improvements using this approach, but test on your database, as sometimes there is no difference.
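A sketch of that pattern in Sybase T-SQL (procedure, table, and column names are hypothetical; note that the inner procedure must be created while a copy of #work exists in the session, since Adaptive Server resolves the temp table name at procedure creation time):
create procedure load_and_query
as
begin
    create table #work (id int, status varchar(20))

    insert into #work
    select id, status from source_table

    create index idx_work_id on #work(id)

    -- the inner proc's plan is compiled at execution time,
    -- when #work is populated and has real statistics
    exec query_work
end
go

create procedure query_work
as
begin
    select id, status
    from #work
    where id > 100
end
go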