Optimize getting counts of rows grouped by first letter in SQLite? - iphone

My current query looks something like this:
SELECT SUBSTR(name,1,1), COUNT(*) FROM files GROUP BY SUBSTR(name,1,1)
But it's taking a pretty long time just to do counts on a table that's already indexed by the name column. I saw from this question that some engines might not use indexes correctly for the SUBSTR function, and in fact, sqlite will not use indexes for SUBSTR(string,1,1).
Is there any other approach that would utilize the index and net me some faster queries?

One strategy that is consistent with your access pattern is to add a new indexed column "first_letter" to your table. Use a trigger on to set the value on insert and update. Then your query is a simple group by first_letter.

Another strategy is to create a shadow table which contains an aggregation of the mother table. This isn't easy because it is your job as developer to keep the shadow table consistent with the mother table. Every delete, update or insert in table files needs to be accompanied by a change in the shadow table.
Databases like Oracle have support for materialized views to achieve this automatically but sqlite doesn't.

Related

Querying from parent table's data (postgresql)

I have a parent table (parent_table) and few children tables inherit from it (child_one_table and child_two_table).
I want to query data using columns belong to the parent table only (the data itself is inserted to the children tables), when I run an explain on my query I see that there are sequence scans running on all the tables (parent_table, child_one_table and child_two_table).
Is there a more efficient way to do this? When I try to query using ONLY on parent_table I get back 0 result since the data was inserted to children table.
Thanks in advance!
Thanks for clarifying your question!
When I query the data for SELECT * FROM parent_table WHERE name = 'joe'; I see that it actually go over all 3 tables and query for that.
This seems to be standard behaviour from PostgreSQL. According to the 5.10.1. Caveats section of https://www.postgresql.org/docs/current/ddl-inherit.html
Note that not all SQL commands are able to work on inheritance hierarchies. Commands that are used for data querying, data modification, or schema modification (e.g., SELECT, UPDATE, DELETE, most variants of ALTER TABLE, but not INSERT or ALTER TABLE ... RENAME) typically default to including child tables and support the ONLY notation to exclude them.
So really what you want to do is explore the usage of ONLY to specify the scope of your query.
I hope this can help!

Best performance method for getting records by large collection of IDs

I am writing a query with code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this was using ARRAY functionality in PostgresQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <# ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In application the array could contain 10's of thousands of IDs so want to make sure I am doing right with best performance method.
Is this the right way in PostgreSQL to retrieve large collection of records by pre-defined column value?
Thanks
Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10)
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized viev.

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
.. and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the
table name, but this does not seem to work correctly (invalid
operation):
Redshift does not support that. But you most likely won't need it.
Try the following (expanding on the answer from #ketan):
Create your main table with appropriate (for joins) DIST key, and COMPOUND or simple SORT KEY on timestamp column, and proper compression on columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this will preserve DIST/SORT keys), load it with daily data, VACUUM SORT.
Copy sorted temp table into main table using ALTER TABLE APPEND - this will copy the data sorted, and will reduce VACUUM on the main table. You may still need VACUUM SORT after that.
After that query your main table normally, probably giving it a range on timestamp. Redshift is optimised for these scenarios, and 99% of times you don't need to optimise table scans yourself - even on tables with billion of rows scans take milliseconds to few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight in the performance of scans, use STL_QUERY system table to find your query ID, and then use STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
Your example is actually the main use case for ALTER TABLE APPEND.
I am assuming that you are creating a new table everyday.
What you can do is:
Create a view on top of event_table_* tables. Query your data using this view.
Whenever you create or drop a table, update the view.
If you want, you can avoid #2: Instead of creating a new table everyday, create empty tables for next 1-2 years. So, no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with date as sort-key. So, whenever your table is queried with some date, all disk blocks that don't have that date will be skipped. That'll be as efficient as having time-series tables.

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or it depends if column is nullable / has default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of the it being a columnar database so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table, you have to use temporary tables etc if you want to insert a column somewhere in the middle.
Personally I have seen rebuilding the table works best.
I do it in following ways
Create a new table N_OLD_TABLE table
Define the datatype/compression encoding in the new table
Insert data into N_OLD(old_columns) select(old_columns) from old_table Rename OLD_Table to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. Doesn't block any table and you always have a backup of old table incase anything goes wrong

Best use of indices on temporary tables in T-SQL

If you're creating a temporary table within a stored procedure and want to add an index or two on it, to improve the performance of any additional statements made against it, what is the best approach? Sybase says this:
"the table must contain data when the index is created. If you create the temporary table and create the index on an empty table, Adaptive Server does not create column statistics such as histograms and densities. If you insert data rows after creating the index, the optimizer has incomplete statistics."
but recently a colleague mentioned that if I create the temp table and indices in a different stored procedure to the one which actually uses the temporary table, then Adaptive Server optimiser will be able to make use of them.
On the whole, I'm not a big fan of wrapper procedures that add little value, so I've not actually got around to testing this, but I thought I'd put the question out there, to see if anyone had any other approaches or advice?
A few thoughts:
If your temporary table is so big that you have to index it, then is there a better way to solve the problem?
You can force it to use the index (if you are sure that the index is the correct way to access the table) by giving an optimiser hint, of the form:
SELECT *
FROM #table (index idIndex)
WHERE id = #id
If you are interested in performance tips in general, I've answered a couple of other questions about that at some length here:
Favourite performance tuning tricks
How do you optimize tables for specific queries?
What's the problem with adding the indexes after you put data into the temp table?
One thing you need to be mindful of is the visibility of the index to other instances of the procedure that might be running at the same time.
I like to add a guid to these kinds of temp tables (and to the indexes), to make sure there is never a conflict. The other benefit of this approach is that you could simply make the temp table a real table.
Also, make sure that you will need to query the data in these temp tables more than once during the running of the stored procedure, otherwise the cost of index creation will outweigh the benefit to the select.
In Sybase if you create a temp table and then use it in one proc the plan for the select is built using an estimate of 100 rows in the table. (The plan is built when the procedure starts before the tables are populated.) This can result in the temp table being table scanned since it is only "100 rows". Calling a another proc causes Sybase to build the plan for the select with the actual number of rows, this allows the optimizer to pick a better index to use. I have seen significant improvedments using this approach but test on your database as sometimes there is no difference.