Find unused columns in Redshift

Find unused columns in Redshift - amazon-redshift

I'm trying to find columns that are not queried in Redshift tables, in order to drop them.
I tried doing this by analyzing stl_query table, but that doesn't provide very accurate results (because of views).
Is there some kind of utility that can help with that?
Alex

Related

distkey and sortkey on temporary tables - Redshift

I am starting to do some research on query tuning, and have been experimenting with using distkey and sortkey. From what I've read if I set the distkey to the joining column, the query planner will use a merge join instead of a hash join, which should be faster in Redshift. I was wondering if this also applies to temporary tables? Our production tables are actually views, so they do not have any keys already set. I'm not sure why we don't use the actual warehouse tables.

Yes, keys can be set for temporary tables:
create temp table fred DISTKEY (1) as ...
this is easily done with column position - first column in this example. You can also set the distribution style on temp tables is you so desire. Doing this can force data to stay "on node" for intermediate results in very large and complex queries. Redshift does a good job make reasonable decisions on how to distribute intermediate results but isn't perfect and doesn't understand the nature of the data. I've done this with good results when large data images are in play.
As to you second point about using views instead of tables - In Redshift standard views are basically SQL macros that are flattened / optimized through by the Redshift query compiler. So use of views instead of tables is not bad in itself. Use of view, especially complex ones, can hide what is being done by the query and this can add unneeded and unexpected complexity to the query. The keys are set in the tables referenced by the views. (I'm assuming that the views are not referencing external/spectrum tables)
Lastly, you state you are looking to achieve Merge Join behavior to improve performance. While it is true that this is the fastest type of join, the time and work required to get merge joins to happen on temp tables will not be offset by this performance gain (experience). Redshift will only use a Merge join when it is sure that the data being joined will "zipper" together without issue. If it isn't completely sure this is the case it has to perform a Hash join which is a more general process. To get Redshift to do the Merge join you will need to sort and analyze your temp tables which will cost much more time than the savings you will get. It is far more important to have your joins be "DIST NONE" - no network distribution of the data - than moving from a hash join to a merge join.

Yes, it can be done. Just put the distkey before the start of the table query
create temp table a distkey(column_name) as
(select query .....)

Best performance method for getting records by large collection of IDs

I am writing a query with code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this was using ARRAY functionality in PostgresQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <# ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In application the array could contain 10's of thousands of IDs so want to make sure I am doing right with best performance method.
Is this the right way in PostgreSQL to retrieve large collection of records by pre-defined column value?
Thanks

Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10)
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.

If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized viev.

SQLite ANALYZE breaks indexes

I have a table that contains about 500K rows. The table has an index on the 'status' column. So I run the following explain command:
EXPLAIN QUERY PLAN SELECT * FROM my_table WHERE status = 'ACTIVE'
Results in a predictable 'explanation'...
SEARCH TABLE my_table USING INDEX IDX_my_table_status (status=?) (~10 rows)
After many additional rows are added to the table, I call 'ANALYZE'. Afterwards, queries seemed much slower so I re-ran my explain and now see the following:
SCAN TABLE my_table (~6033 rows)
First thing I notice is that BOTH the row estimates are WAY off. The biggest concern is the fact that the index seems to be skipped once ANALYZE is ran. I tried REINDEX - to no avail. The only way I can get the indexes back is to drop them, then re-create them. Has anyone seen this? Is this a bug? Any ideas what I am doing wrong? I have tried this on multiple datbases and I see the same results. This is on my PC, and on MAC and on the iPhone/iPad - all the same results.

When SQLite fetches rows from a table using an index, it has to read the index pages first, and then read all the table's pages that contain one or more matching records.
If there are many matching records, almost all the table's pages are likely to contain one, so going through the index would require reading more pages.
However, SQLite's query planner does not have information about the record sizes in the index or the table, so it's possible that its estimates are off.
The information collected by ANALYZE is stored in the sqlite_stat1 and maybe sqlite_stat3 tables.
Please show what the information about your table is.
If that information that not reflect the true distribution of your data, you can try to run ANALYZE again, or just delete that information from the sqlite_stat* tables.
You can force going through an index if you use ORDER BY on the indexed field.
(INDEXED BY is, as its documentation says, not intended for use in tuning the performance of a query.)
If you do not need to select all fields of the table, you can speed up specific queries by creating an index on those queries' fields so that you have a covering index.

It's not uncommon for a query execution plan to avoid using an existing index on a low-cardinality column like "status", which probably only has a few distinct values. It's often faster for the lookups to be performed by scanning the db table. (Some DBAs recommend never indexing low-cardinality columns.)
However, based on the wildly varying row counts in the explain plan, I'm guessing that SQLite's 'analyze' performs similarly to MySQL's 'analyze' when using the InnoDB storage engine. MySQL's 'analyze' does a random set of dives into the table data to determine row count, index cardinality, etc. Because of the random dives, the statistics may vary after each 'analyze' is run, and result in differing query execution plans. Low-cardinality columns are even more susceptible to incorrect stats, as, for example, the random dives may indicate that the majority of the rows in your table have an "active" status, making it more efficient to table scan rather than use the index. (I'm no SQLite expert, so someone please chime in if my hunch about the 'analyze' behavior is incorrect.)
You can try testing the use of the index in the query using "indexed by" (see http://www.sqlite.org/lang_indexedby.html), although forcing the use of indexes is usually a last resort. Different RDBMSs have different solutions to the low-cardinality problem, such as partitioning, using bitmap indexes, etc. I would recommend researching SQLite-specific solutions to querying/indexing on low-cardinality columns).

T-SQL: SELECT INTO sparse table?

I am migrating a large quantity of mostly empty tables into SQL Server 2008.
The tables are vertical partitions of one big logical table.
Problem is this logical table has more than 1024 columns.
Given that most of the fields are null, I plan to use a sparse table.
For all of my tables so far I have been using SELECT...INTO, which has been working really well.
However, now I have "CREATE TABLE failed because column 'xyz' in table 'MyBigTable' exceeds the maximum of 1024 columns."
Is there any way I can do SELECT...INTO so that it creates the new table with sparse support?

What you probably want to do is create the table manually and populate it with an INSERT ... SELECT statement.
To create the table, I would recommend scripting the different component tables and merging their definitions, making them all SPARSE as necessary. Then just run your single CREATE TABLE statement.

You cannot (and probably don't want to anyway). See INTO Clause (TSQL) for the MSDN documentation.
The problem is that sparse tables are a physical storage characteristic and not a logical characteristic, so there is no way the DBMS engine would know to copy over that characteristic. Moreover, it is a table-wide property and the SELECT can have multiple underlying source tables. See the Remarks section of the page I linked where it discusses how you can only use default organization details.

Best use of indices on temporary tables in T-SQL

If you're creating a temporary table within a stored procedure and want to add an index or two on it, to improve the performance of any additional statements made against it, what is the best approach? Sybase says this:
"the table must contain data when the index is created. If you create the temporary table and create the index on an empty table, Adaptive Server does not create column statistics such as histograms and densities. If you insert data rows after creating the index, the optimizer has incomplete statistics."
but recently a colleague mentioned that if I create the temp table and indices in a different stored procedure to the one which actually uses the temporary table, then Adaptive Server optimiser will be able to make use of them.
On the whole, I'm not a big fan of wrapper procedures that add little value, so I've not actually got around to testing this, but I thought I'd put the question out there, to see if anyone had any other approaches or advice?

A few thoughts:
If your temporary table is so big that you have to index it, then is there a better way to solve the problem?
You can force it to use the index (if you are sure that the index is the correct way to access the table) by giving an optimiser hint, of the form:
SELECT *
FROM #table (index idIndex)
WHERE id = #id
If you are interested in performance tips in general, I've answered a couple of other questions about that at some length here:
Favourite performance tuning tricks
How do you optimize tables for specific queries?

What's the problem with adding the indexes after you put data into the temp table?
One thing you need to be mindful of is the visibility of the index to other instances of the procedure that might be running at the same time.
I like to add a guid to these kinds of temp tables (and to the indexes), to make sure there is never a conflict. The other benefit of this approach is that you could simply make the temp table a real table.
Also, make sure that you will need to query the data in these temp tables more than once during the running of the stored procedure, otherwise the cost of index creation will outweigh the benefit to the select.

In Sybase if you create a temp table and then use it in one proc the plan for the select is built using an estimate of 100 rows in the table. (The plan is built when the procedure starts before the tables are populated.) This can result in the temp table being table scanned since it is only "100 rows". Calling a another proc causes Sybase to build the plan for the select with the actual number of rows, this allows the optimizer to pick a better index to use. I have seen significant improvedments using this approach but test on your database as sometimes there is no difference.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse