SQLite ANALYZE breaks indexes - iphone

I have a table that contains about 500K rows. The table has an index on the 'status' column. So I run the following explain command:
EXPLAIN QUERY PLAN SELECT * FROM my_table WHERE status = 'ACTIVE'
Results in a predictable 'explanation'...
SEARCH TABLE my_table USING INDEX IDX_my_table_status (status=?) (~10 rows)
After many additional rows are added to the table, I call 'ANALYZE'. Afterwards, queries seemed much slower so I re-ran my explain and now see the following:
SCAN TABLE my_table (~6033 rows)
The first thing I notice is that BOTH row estimates are WAY off. My biggest concern is that the index seems to be skipped once ANALYZE has been run. I tried REINDEX, to no avail. The only way I can get the indexes back is to drop them and then re-create them. Has anyone seen this? Is this a bug? Any ideas what I am doing wrong? I have tried this on multiple databases and I see the same results. This is on my PC, on a Mac, and on the iPhone/iPad: all the same results.

When SQLite fetches rows from a table using an index, it has to read the index pages first, and then read all the table's pages that contain one or more matching records.
If there are many matching records, almost all the table's pages are likely to contain one, so going through the index would require reading more pages.
However, SQLite's query planner does not have information about the record sizes in the index or the table, so it's possible that its estimates are off.
The information collected by ANALYZE is stored in the sqlite_stat1 and maybe sqlite_stat3 tables.
Please show what the information about your table is.
If that information does not reflect the true distribution of your data, you can try to run ANALYZE again, or just delete that information from the sqlite_stat* tables.
You can force going through an index if you use ORDER BY on the indexed field.
(INDEXED BY is, as its documentation says, not intended for use in tuning the performance of a query.)
If you do not need to select all fields of the table, you can speed up specific queries by creating an index on those queries' fields so that you have a covering index.
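To see what ANALYZE actually recorded, the statistics can be read straight from sqlite_stat1. The following is a minimal sketch using Python's built-in sqlite3 module; the table layout, the payload column, and the 10-in-1000 split of 'ACTIVE' rows are assumptions made for illustration, not the asker's real data. The first number in the stat string is the number of rows in the index, and the second is the average number of rows per distinct value.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # detail is the last column


def build_demo_db(total_rows=1000, active_rows=10):
    """Create a small in-memory table with a skewed 'status' column (assumed data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (id INTEGER PRIMARY KEY, status TEXT, payload TEXT)"
    )
    conn.execute("CREATE INDEX IDX_my_table_status ON my_table (status)")
    rows = [
        ("ACTIVE" if i < active_rows else "INACTIVE", "x" * 20)
        for i in range(total_rows)
    ]
    conn.executemany("INSERT INTO my_table (status, payload) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def read_stats(conn):
    """Return {index_name: stat_string}; sqlite_stat1 exists only after ANALYZE."""
    cur = conn.execute("SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'my_table'")
    return dict(cur.fetchall())


conn = build_demo_db()
conn.execute("ANALYZE")
stats = read_stats(conn)
plan_after_analyze = query_plan(
    conn, "SELECT * FROM my_table WHERE status = ?", ("ACTIVE",)
)
print(stats)
print(plan_after_analyze)
```

Whether the printed plan is a SEARCH or a SCAN depends on the SQLite version and on the statistics, so it is worth running this against a copy of the real database rather than relying on the toy data.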

It's not uncommon for a query execution plan to avoid using an existing index on a low-cardinality column like "status", which probably only has a few distinct values. It's often faster for the lookups to be performed by scanning the db table. (Some DBAs recommend never indexing low-cardinality columns.)
However, based on the wildly varying row counts in the explain plan, I'm guessing that SQLite's 'analyze' performs similarly to MySQL's 'analyze' when using the InnoDB storage engine. MySQL's 'analyze' does a random set of dives into the table data to determine row count, index cardinality, etc. Because of the random dives, the statistics may vary after each 'analyze' is run, and result in differing query execution plans. Low-cardinality columns are even more susceptible to incorrect stats, as, for example, the random dives may indicate that the majority of the rows in your table have an "active" status, making it more efficient to table scan rather than use the index. (I'm no SQLite expert, so someone please chime in if my hunch about the 'analyze' behavior is incorrect.)
You can try testing the use of the index in the query using "indexed by" (see http://www.sqlite.org/lang_indexedby.html), although forcing the use of indexes is usually a last resort. Different RDBMSs have different solutions to the low-cardinality problem, such as partitioning, using bitmap indexes, etc. I would recommend researching SQLite-specific solutions to querying/indexing on low-cardinality columns.
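As a rough way to run that test, the sketch below compares the planner's default choice with the plan produced when INDEXED BY is added. It uses Python's sqlite3 module with made-up data (5000 rows, 1% of them 'ACTIVE'), so the default plan it prints is only indicative; the forced plan, however, has to name the index, because SQLite raises an error if it cannot use the index given in INDEXED BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX IDX_my_table_status ON my_table (status)")
# Assumed distribution: one row in a hundred is ACTIVE.
conn.executemany(
    "INSERT INTO my_table (status) VALUES (?)",
    [("ACTIVE" if i % 100 == 0 else "INACTIVE",) for i in range(5000)],
)
conn.execute("ANALYZE")


def plan_details(conn, sql):
    """Return the human-readable plan lines for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


default_plan = plan_details(
    conn, "SELECT * FROM my_table WHERE status = 'ACTIVE'"
)
forced_plan = plan_details(
    conn,
    "SELECT * FROM my_table INDEXED BY IDX_my_table_status "
    "WHERE status = 'ACTIVE'",
)
print("default:", default_plan)
print("forced: ", forced_plan)
```

Timing both variants on the real data shows whether the planner's decision to skip the index was actually the cheaper one.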

Related

Best performance method for getting records by large collection of IDs

I am writing a query in code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this was using the ARRAY functionality in PostgreSQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <@ ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In the application the array could contain tens of thousands of IDs, so I want to make sure I am doing this the right way, with the best-performing method.
Is this the right way in PostgreSQL to retrieve large collection of records by pre-defined column value?
Thanks
Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10))
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
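The batching idea can be sketched as follows. This is an illustration only: it runs against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained, the batch size of 500 is an arbitrary assumption, and the helper names chunked and fetch_by_customer_ids are made up. With a PostgreSQL driver the same loop applies, but the placeholder syntax differs (psycopg, for example, uses %s rather than ?).

```python
import sqlite3
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fetch_by_customer_ids(conn, customer_ids, batch_size=500):
    """Run one parameterized IN (...) query per batch and collect the rows."""
    rows = []
    for batch in chunked(customer_ids, batch_size):
        placeholders = ", ".join("?" for _ in batch)
        sql = (
            "SELECT id, customer_id FROM price_mapping "
            f"WHERE customer_id IN ({placeholders})"
        )
        rows.extend(conn.execute(sql, batch).fetchall())
    return rows


# Stand-in data: 10,000 rows, of which every fourth customer_id is requested.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_mapping (id INTEGER PRIMARY KEY, customer_id INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_price_mapping_customer ON price_mapping (customer_id)"
)
conn.executemany(
    "INSERT INTO price_mapping (customer_id) VALUES (?)",
    [(i,) for i in range(10000)],
)
wanted = list(range(0, 10000, 4))
result = fetch_by_customer_ids(conn, wanted)
print(len(result), "rows fetched in batches")
```

Passing the IDs as bound parameters instead of pasting the CSV text into the SQL string also avoids quoting and injection problems.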
If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized view.

Understanding indexes and performance as they relate to indexed column and non-indexed column data in the same row

I have some tables that are around 100 columns wide. I haven't normalized them because to put it back together would require almost 3 dozen joins and am not sure it would perform any better... haven't tested it yet (I will) so can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them, the locations of the values in that column are kept in memory so that they can be found quickly in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive and really adding any benefit? Is it sufficient to just index itemnumber and the index will know that expdate, or anything else that is queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it sometimes is useful to create another index on any but the first column).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
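The first pointer, that a primary key already comes with an index over all of its columns, is easy to check. The sketch below does so in SQLite through Python's sqlite3 module, purely because it runs without a server; it is a stand-in, not PostgreSQL, where the equivalent information is in the pg_indexes view. The items table and its columns are borrowed from the question, and the helper index_columns is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE items (
        pk_col1 INTEGER,
        pk_col2 INTEGER,
        pk_col3 INTEGER,
        expdate TEXT,
        PRIMARY KEY (pk_col1, pk_col2, pk_col3)
    )
    """
)


def index_columns(conn, table):
    """Map each index on `table` to the ordered list of columns it covers."""
    result = {}
    for row in conn.execute(f"PRAGMA index_list({table})").fetchall():
        index_name = row[1]
        info = conn.execute(f"PRAGMA index_info({index_name})").fetchall()
        result[index_name] = [col[2] for col in info]
    return result


indexes = index_columns(conn, "items")
print(indexes)
```

No CREATE INDEX statement was issued, yet one multicolumn index exists, so an explicit index on (pk_col1, pk_col2, pk_col3) would only duplicate it.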

Is the speed of a PostgreSQL SELECT adversely affected by too many indexes on the table?

I have read that having a lot of indexes on a database can seriously hurt performance, but I can't find anything about it in the PostgreSQL docs.
I have a very big table with something like 100 columns and a billion rows and often I have to do a lot of searches in a lot of different fields.
Will the performance of the PostgreSQL table drop if I add a lot of indexes (maybe 10 unique column indexes and 5 or 7 three-column indexes)?
EDIT: With performance drop I mean the performance in fetching rows (select), the database will be updated once a month so the update and insert time are not an issue.
The indexes are maintained when the content of the table has been modified (i.e. INSERT, UPDATE, DELETE)
The query planner of PostgreSQL can decide when to use an index and when it's not needed and a sequential scan is more optimal.
So having too many indexes will hurt the modifying performance, not the fetching.
The indexes will have to be updated at each insert and update involving those columns.
I have some charts about that on my site: http://use-the-index-luke.com/sql/dml
An index is pure redundancy. It contains only data that is also stored
in the table. During write operations, the database must keep those
redundancies consistent. Specifically, it means that insert, delete
and update not only affect the table but also the indexes that hold a
copy of the affected data.
The chapter titles suggest the impact that indexes can have:
Insert — cannot take direct benefit from indexes
Delete — uses indexes for the where clause
Update — does not affect all indexes of the table
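The "pure redundancy" point can be made visible by measuring how much a database grows when indexes are added. The sketch below uses SQLite via Python's sqlite3 module as a stand-in (it is not PostgreSQL, and absolute numbers will differ); the table wide_table, its columns, and the 20,000 generated rows are assumptions for illustration.

```python
import sqlite3


def page_count(conn):
    """Number of pages currently used by the main database."""
    return conn.execute("PRAGMA page_count").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wide_table (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c TEXT)"
)
conn.executemany(
    "INSERT INTO wide_table (a, b, c) VALUES (?, ?, ?)",
    [(i % 97, i % 1009, f"value-{i}") for i in range(20000)],
)
conn.commit()
pages_table_only = page_count(conn)

# Each index stores a second copy of the indexed columns plus a row reference.
conn.execute("CREATE INDEX idx_a ON wide_table (a)")
conn.execute("CREATE INDEX idx_b ON wide_table (b)")
conn.execute("CREATE INDEX idx_abc ON wide_table (a, b, c)")
conn.commit()
pages_with_indexes = page_count(conn)

print("pages without indexes:", pages_table_only)
print("pages with indexes:   ", pages_with_indexes)
```

The extra pages are what every insert, update, and delete has to keep consistent; reads of the table itself are not slowed down by them.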

Delete Takes a Long Time

I've got a table which has about 5.5 million records. I need to delete some records from it based on date. My query looks like this:
DELETE FROM Table WHERE [Date] between '2011-10-31 04:30:23' and '2011-11-01 04:30:42'
It's about 9000 rows, but the operation takes a very long time. How can I speed it up? Date is of type datetime2, and the table has a clustered int primary key. Update and delete triggers are disabled.
It's very possible that [Date] is being cast to a string on every row resulting in a sequential scan of the entire table.
You should try casting your parameters to a date instead:
DELETE FROM Table WHERE [Date] between convert(datetime, '2011-10-31 04:30:23') and convert(datetime, '2011-11-01 04:30:42')
Also, make sure there's an index on [Date]
Firstly make sure you have an index on date.
If there is an index check the execution plan and make sure it is using it. Notice that it doesn't always follow that using an index is the most efficient method of processing a delete because if you are deleting a large proportion of records (rule of thumb is in excess of 10%) the additional overhead of the index look-up can be greater than a full scan.
With a large table it's also well worth making sure that the statistics are up to date (run sp_updatestats) because if the database has an incorrect understanding of the number of rows in the table it will make inappropriate choices in its execution plan. For example if the statistics are incorrect the database may decide to ignore your index even if it exists because it thinks there are far fewer records in the table than there are. Odd distributions of dates might have similar effects.
I'd probably try dropping the index on date and then recreating it. Indexes are B-trees; they stay balanced, but if your data has accumulated over time the index may well be fragmented, and queries might take a long time to find the appropriate data. Both this and the statistics issue should be handled automatically by your database maintenance job, but it's often overlooked.
Finally, you don't say whether there are many other indexes on the table. If there are, then you might be running into issues with the database having to update and reorganize those indexes as it processes the delete. It's a bit drastic, but one option is to drop all other indexes on the table before running the delete, then create them again afterwards.
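The "check the execution plan" step can be sketched like this. The example runs on SQLite through Python's sqlite3 module only so that it is self-contained; on SQL Server the plan would be inspected with its own tooling instead. The events table, the event_date column, and the one-row-per-minute test data are all assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, event_date TEXT)")

# Assumed data: one row per minute starting 2011-10-25 00:00:00.
start = datetime(2011, 10, 25)
conn.executemany(
    "INSERT INTO events (event_date) VALUES (?)",
    [
        ((start + timedelta(minutes=m)).strftime("%Y-%m-%d %H:%M:%S"),)
        for m in range(20000)
    ],
)
conn.execute("CREATE INDEX idx_events_date ON events (event_date)")
conn.commit()

delete_sql = "DELETE FROM events WHERE event_date BETWEEN ? AND ?"
bounds = ("2011-10-31 04:30:23", "2011-11-01 04:30:42")

# Look at the plan first, then run the delete with the same bound parameters.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + delete_sql, bounds)]
deleted = conn.execute(delete_sql, bounds).rowcount
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

print(plan)
print("deleted:", deleted, "remaining:", remaining)
```

If the printed plan shows a full scan rather than a search on the date index, that is the cue to look at the statistics and the index itself, as described above.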

Best use of indices on temporary tables in T-SQL

If you're creating a temporary table within a stored procedure and want to add an index or two on it, to improve the performance of any additional statements made against it, what is the best approach? Sybase says this:
"the table must contain data when the index is created. If you create the temporary table and create the index on an empty table, Adaptive Server does not create column statistics such as histograms and densities. If you insert data rows after creating the index, the optimizer has incomplete statistics."
but recently a colleague mentioned that if I create the temp table and indices in a different stored procedure to the one which actually uses the temporary table, then Adaptive Server optimiser will be able to make use of them.
On the whole, I'm not a big fan of wrapper procedures that add little value, so I've not actually got around to testing this, but I thought I'd put the question out there, to see if anyone had any other approaches or advice?
A few thoughts:
If your temporary table is so big that you have to index it, then is there a better way to solve the problem?
You can force it to use the index (if you are sure that the index is the correct way to access the table) by giving an optimiser hint, of the form:
SELECT *
FROM #table (index idIndex)
WHERE id = @id
If you are interested in performance tips in general, I've answered a couple of other questions about that at some length here:
Favourite performance tuning tricks
How do you optimize tables for specific queries?
What's the problem with adding the indexes after you put data into the temp table?
One thing you need to be mindful of is the visibility of the index to other instances of the procedure that might be running at the same time.
I like to add a guid to these kinds of temp tables (and to the indexes), to make sure there is never a conflict. The other benefit of this approach is that you could simply make the temp table a real table.
Also, make sure that you will need to query the data in these temp tables more than once during the running of the stored procedure, otherwise the cost of index creation will outweigh the benefit to the select.
In Sybase, if you create a temp table and then use it in the same proc, the plan for the select is built using an estimate of 100 rows in the table. (The plan is built when the procedure starts, before the tables are populated.) This can result in the temp table being table scanned, since it is only "100 rows". Calling another proc causes Sybase to build the plan for the select with the actual number of rows, which allows the optimizer to pick a better index to use. I have seen significant improvements using this approach, but test on your database, as sometimes there is no difference.