I'm working on an API that needs to return a list of financial transactions. These records are held in 6 different tables, but all have 3 common fields:
transaction_id int NOT NULL,
account_id bigint NOT NULL,
created timestamptz NOT NULL
Note: this might actually have been a good use case for table inheritance in PostgreSQL, but it wasn't done that way.
The business requirement is to return all transactions for a given account_id in one list sorted by created in descending order (similar to an online banking page where your latest transaction is at the top). Originally they wanted to paginate in groups of 50 records, but I've got them to agree to paginating on date ranges instead (believing that I can do this more efficiently in the database than with OFFSET and LIMIT).
My intent is to create an index on each of these tables like this:
CREATE INDEX idx_table_1_account_created ON table_1(account_id, created desc);
ALTER TABLE table_1 CLUSTER ON idx_table_1_account_created;
Then finally I'll create a view to UNION ALL of the records from the 6 tables into one list, and then obviously the records from the 6 tables will need to be re-sorted to come up with a unified list (in the correct order). The call will look like:
SELECT * FROM vw_all_transactions
WHERE account_id = 12345678901234
AND created >= '2014-01-01' AND created < '2014-02-01'
ORDER BY created desc;
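For reference, the view itself would be roughly the following (just a sketch; I'm assuming the other tables follow the table_1 naming, and only two of the six branches are shown):
CREATE VIEW vw_all_transactions AS
SELECT transaction_id, account_id, created FROM table_1
UNION ALL
SELECT transaction_id, account_id, created FROM table_2
-- ... UNION ALL the remaining four tables ...
;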
My question is related to creating the indexing and clustering scheme. Since the records are going to have to be re-sorted by the view anyway, is there any reason to specify the individual indexes as created DESC? And does sorting this way have any penalties when periodically calling CLUSTER?
I've done some googling and reading but can't really seem to find any information that explains how this clustering is going to work.
Using PostgreSQL 9.2 on Heroku.
I am writing a query in code to select all records from a table where a column value is contained in a CSV list. I found a suggestion that the best way to do this was using ARRAY functionality in PostgreSQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from the CSV.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <@ ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a CSV list inserted by my app)
But I am not sure that is right. In the application the array could contain tens of thousands of IDs, so I want to make sure I am doing this with the best-performing method.
Is this the right way in PostgreSQL to retrieve a large collection of records by a pre-defined set of column values?
Thanks
Generally this is done with the SQL-standard IN operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10))
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
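Alternatively, if your database driver supports binding array parameters, you can send the whole list as a single parameter instead of interpolating thousands of literals. A sketch (the literal array below just stands in for the bound parameter):
select *
from price_mapping
where customer_id = any ('{5,7,10}'::bigint[])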
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
If your data is big enough, try this:
Read your CSV using an FDW (foreign data wrapper).
If you need this data often, you might build a materialized view from it, holding only the needed columns. Refresh it whenever a new CSV is created.
Join your table against this foreign table or materialized view (a sketch of these steps follows below).
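A rough sketch using the built-in file_fdw (the file path and column names are just examples):
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE customer_ids_csv (customer_id bigint)
    SERVER csv_files
    OPTIONS (filename '/path/to/customer_ids.csv', format 'csv', header 'true');

CREATE MATERIALIZED VIEW customer_ids AS
    SELECT customer_id FROM customer_ids_csv;

-- when a new CSV replaces the file:
REFRESH MATERIALIZED VIEW customer_ids;

-- join your table against it
SELECT pm.*
FROM price_mapping pm
JOIN customer_ids ci ON ci.customer_id = pm.customer_id;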
Following Rob Conery's blog, I have a set of unique IDs across the tables of my Postgres DB.
Now, using these unique IDs, is there a way to query a row in the DB without knowing which table it is in? Or can those tables be indexed such that if the row is not in the current table, I just increment the index and query the next table?
In short: if you did not prepare for that, then no. You can prepare for it by generating your own UUIDs. Please look here. For instance, PG has UUID variants that preserve order. Also, UUID v5 has something like namespaces, so you can build a hierarchy. However, that is done by hashing the namespace, and I don't know of a tool to do the opposite inside PG.
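A rough sketch of the v5/namespace idea using the uuid-ossp extension (the names 'myapp/comments' and the row key '42' are purely illustrative):
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- a deterministic namespace UUID per table...
SELECT uuid_generate_v5(uuid_ns_url(), 'myapp/comments');
-- ...and a deterministic row UUID within that namespace
SELECT uuid_generate_v5(uuid_generate_v5(uuid_ns_url(), 'myapp/comments'), '42');
Note that, as said above, you cannot go from such a UUID back to the namespace; it's a one-way hash.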
If you know all the possible tables in advance, you could prepare a query that simply UNIONs a search with a tagged type over all the tables. In the case of two tables named comments and news, you could do something like:
PREPARE type_of_id(uuid) AS
SELECT id, 'comments' AS type
FROM comments
WHERE id = $1
UNION
SELECT id, 'news' AS type
FROM news
WHERE id = $1;
EXECUTE type_of_id('8ecf6bb1-02d1-4c04-8875-f1da62b7f720');
Automatically generating this could probably be done by querying pg_catalog.pg_tables and generating the relevant query on the fly.
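A rough sketch of that generation step (assuming every table of interest lives in the public schema and has a uuid column named id):
SELECT string_agg(
           format('SELECT id, %L AS type FROM %I WHERE id = $1',
                  tablename, tablename),
           ' UNION ALL ')
FROM pg_catalog.pg_tables
WHERE schemaname = 'public';
The resulting text can then be fed into PREPARE as above.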
In my app I have a concept of "seasons", which change discretely over time. All the entities are related to some season, and all entities have season-based indexes as well as some indexes on other fields.

When a season change occurs, PostgreSQL decides to use a filtered scan plan based on the season index rather than the more specific field indexes. At the beginning of the season the cost of that plan is very small, so it's OK, but the problem is that a season change brings MANY users in at the very beginning of the season, so the scan-based query plan becomes bad very fast: it simply scans all the entities in the new season and filters out the target rows.

After the first auto-analyze, Postgres decides to use a good plan, BUT auto-analyze runs VERY SLOWLY due to contention, and I suppose it's like a snowball: the more requests come in, the more contention there is due to the bad plan, and thus auto-analyze runs slower and slower. The longest auto-analyze took about an hour last week, and it is becoming a real problem. I know the PostgreSQL architects decided not to allow choosing the index used by a query, so what is the best way to overcome my problem?
Just to clarify, here is the DDL, one of the "slow" queries, and the explain results before and after auto-analyze.
DDL
CREATE TABLE race_results (
id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('race_results_id_seq'::regclass),
user_id INTEGER NOT NULL,
opponent_id INTEGER,
season_id INTEGER NOT NULL,
type RACE_TYPE NOT NULL DEFAULT 'battle'::race_type,
elo_delta INTEGER NOT NULL,
opponent_elo_delta INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX race_results_type_user_id_index ON race_results USING BTREE (season_id, type, user_id);
CREATE INDEX race_results_type_opponent_id_index ON race_results USING BTREE (season_id, type, opponent_id);
CREATE INDEX race_results_opponent_id_index ON race_results USING BTREE (opponent_id);
CREATE INDEX race_results_user_id_index ON race_results USING BTREE (user_id);
Query
SELECT 1000 + COALESCE(SUM(CASE WHEN user_id = 6446 THEN elo_delta ELSE opponent_elo_delta END), 0)
FROM race_results
WHERE type = 'battle' :: race_type AND (user_id = 6446 OR opponent_id = 6446) AND
season_id = current_season_id()
Results of explain before auto-analyze (as you can see, more than a thousand rows are already being removed by the filter, and it soon becomes hundreds of thousands for each request):
Results of explain analyze after auto-analyze (now Postgres decides to use the right index and no filtering is needed anymore, but the problem is that auto-analyze takes too long, partly due to the contention caused by the ineffective index selection above):
PS: Right now I'm working around the problem by turning off the application server about 10 seconds after the season changes, so that Postgres gets the new data and starts auto-analyze, and then turning it back on when auto-analyze finishes. But that solution involves downtime, which is not desirable, and overall it looks weird.
Finally I found a solution. It's not perfect and I will not mark it as the best one, but it works and could help someone.
Instead of indexes on (season, type, user/opponent id), I now have these indexes:
CREATE INDEX race_results_type_user_id_index ON race_results USING BTREE (user_id, season_id, type);
CREATE INDEX race_results_type_opponent_id_index ON race_results USING BTREE (opponent_id, season_id, type);
One problem appeared: I needed an index on season anyway for other queries, but when I add the index
CREATE INDEX race_results_season_index ON race_results USING BTREE (season_id);
the planner tries to use it again instead of the correct indexes, and the whole situation repeats itself. What I've done is simply add one more field, season_id_clone, which contains the same data as season_id, and I created an index on it. Now, whenever I need to filter by season (not counting the queries from the first post), I use season_id_clone in the query. I know it's weird, but I haven't found anything better.
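For completeness, the workaround boils down to something like this (a sketch of what is described above; the final query is just an example of a season-only filter):
ALTER TABLE race_results ADD COLUMN season_id_clone INTEGER;
UPDATE race_results SET season_id_clone = season_id;
CREATE INDEX race_results_season_clone_index ON race_results USING BTREE (season_id_clone);

-- season-only filters now go through the clone column
SELECT count(*) FROM race_results WHERE season_id_clone = current_season_id();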
Does anyone know why the order of the rows changed after I made an update to a table? Is there any way to make the order go back, or to change it to another order, e.g. ordered alphabetically?
This is the update I performed:
update t set amount = amount + 1 where account = accountNumber
After this update, when I go and look at the table, the order has changed.
A table doesn't have a natural row order; some database systems will actually refuse your query if you don't add an ORDER BY clause at the end of your SELECT.
Why did the order change?
Because the database engine fetches your rows in the physical order they come from storage. Some engines, like SQL Server, can have a CLUSTERED INDEX which forces a physical order, but even then it is never really guaranteed that you get your results in that precise order.
The clustered index exists mostly as an optimization. PostgreSQL has a similar CLUSTER command to change the physical order, but it's a heavy process which locks the table: http://www.postgresql.org/docs/9.1/static/sql-cluster.html
How to force an alphabetical order of the rows?
Add an ORDER BY clause in your query.
SELECT * FROM table ORDER BY column
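In your case, assuming the column you want to sort on is the account column from your update statement, that would be something like:
SELECT * FROM t ORDER BY account;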
I have the following scenario while using PostgreSQL:
Number of tables: 100
Number of rows per table: ~10 million
All the tables have the same schema, e.g. each table contains the daily call records of a company, so the 100 tables contain the call records of 100 days.
I want to make the following type of query on these tables:
For each column of each table, get the count of records having a null value in that column.
So, considering the above scenario, what are the major optimizations I can make to the table structure? How should I prepare my query, and is there an efficient way to query for such cases?
If you're using Postgres table inheritance, a simple select count(*) from calls where foo is null will work fine. It will use an index on foo provided null foo rows aren't too common.
Internally, that will do what you'd do manually without table inheritance, i.e. union the results from each individual child table.
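A minimal sketch of that inheritance layout (the parent/child names and the foo column are just examples):
CREATE TABLE calls (
    call_day date NOT NULL,
    foo      text
    -- ... the other shared columns ...
);

-- one child table per day; indexes are not inherited, so create them per child
CREATE TABLE calls_day_001 () INHERITS (calls);
CREATE INDEX calls_day_001_foo_idx ON calls_day_001 (foo);

-- a query against the parent scans the parent plus all of its children
SELECT count(*) FROM calls WHERE foo IS NULL;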
If you need to run this repeatedly, maintain the count in memcached or in another table.