Make postgresql request faster - postgresql

SELECT
referrer_id, count(id) as referrals_count
FROM users
WHERE referrer_id != 0
GROUP BY referrer_id
order by referrals_count desc
limit 10;
Now i have this request, i checked execution time and it was ~26ms on table with 150k+ rows. How can i make this request faster?
Explain analyze
I tryed to create index on referrer_id field and index like: id DESC, but it isn't worked for me.

There is no way to make this query particularly efficient, but given that your WHERE clause eliminates 2/3 of the table, a filtered index can probably help.
create index on users (referrer_id, id) where referrer_id<>0;
For me, this index makes it about 5 times faster than the seq scan. By including "id" in the index as well, you can get an index-only scan (as long as the table is well vacuumed) and by fetching the rows already sorted by referrer_id, you can avoid the work of the hashagg. And that combination makes it quite a bit faster.
If the "id" field is never null, then you can change count(id) to count(*) and will get the same answer. This change means you no longer need to include "id" in the index in order to get the index-only scan.
And if referrer_id is always >= 0 (which is likely the case for an id column) you can change the where clause to referrer_id > 0, which will let get rid of the need for the rather esoteric WHERE clause on the index.
Both of those combined would leave you with the need for just an index on referrer_id, which I would guess you already have anyway.
But if sending the answer is 10 times slower than running the query in the first place, I would say you are working on the wrong part of the problem.

Related

Postgres: Optimization of query with simple where clause

I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different status possible)
other not relevant columns
I try to find all the lines with customer_id = and status = two different status.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 mio lines. I added two indexes on customer_id and status. The query still needs about 1 second to run.
The explain plan is:
1. Gather
2. Seq Scan on my_table
Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran the 'analyze my_table' after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id. status probably has only a few distinct values. So customer_id should go first in the index. Try this.
ALTER TABLE my_table ADD INDEX customer_id_status (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book. It's sorted in order. You look up the first matching entry in the index, then scan it sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
Pro tip Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art. SELECT * makes it less true.

Performance of like 'query%' on multimillion rows, postgresql

We have a table with 10 million rows. We need to find first few rows with like 'user%' .
This query is fast if it matches at least 2 rows (It returns results in 0.5 sec). If it doesn't find any 2 rows matching with that criteria, it is taking at least 10 sec. 10 secs is huge for us (since we are using this auto suggestions, users will not wait for so long to see the suggestions.)
Query: select distinct(name) from user_sessions where name like 'user%' limit 2;
In the above query, the name column is of type citext and it is indexed.
Whenever you're working on performance, start by explaining your query. That'll show the the query optimizer's plan, and you can get a sense of how long it's spending doing various pieces. In particular, check for any full table scans, which mean the database is examining every row in the table.
Since the query is fast when it finds something and slow when it doesn't, it sounds like you are indeed hitting a full table scan. I believe you that it's indexed, but since you're doing a like, the standard string index can't be used efficiently. You'll want to check out varchar_pattern_ops (or text_pattern_ops, depending on the column type of name). You create that this way:
CREATE INDEX ON pattern_index_on_users_name ON users (name varchar_pattern_ops)
After creating an index, check EXPLAIN query to make sure it's being used. text_pattern_ops doesn't work with the citext extension, so in this case you'll have to index and search for lower(name) to get good case-insensitive performance:
CREATE INDEX ON pattern_index_on_users_name ON users (lower(name) text_pattern_ops)
SELECT * FROM users WHERE lower(name) like 'user%' LIMIT 2

Select query with offset limit is too much slow

I have read from internet resources that a query will be slow when the offset increases. But in my case I think its too much slow. I am using postgres 9.3
Here is the query (id is primary key):
select * from test_table offset 3900000 limit 100;
It returns me data in around 10 seconds. And I think its too much slow. I have around 4 million records in table. Overall size of the database is 23GB.
Machine configuration:
RAM: 12 GB
CPU: 2.30 GHz
Core: 10
Few values from postgresql.conf file which I have changed are as below. Others are default.
shared_buffers = 2048MB
temp_buffers = 512MB
work_mem = 1024MB
maintenance_work_mem = 256MB
dynamic_shared_memory_type = posix
default_statistics_target = 10000
autovacuum = on
enable_seqscan = off ## its not making any effect as I can see from Analyze doing seq-scan
Apart from these I have also tried by changing the values of random_page_cost = 2.0 and cpu_index_tuple_cost = 0.0005 and result is same.
Explain (analyze, buffers) result over the query is as below:
"Limit (cost=10000443876.02..10000443887.40 rows=100 width=1034) (actual time=12793.975..12794.292 rows=100 loops=1)"
" Buffers: shared hit=26820 read=378984"
" -> Seq Scan on test_table (cost=10000000000.00..10000467477.70 rows=4107370 width=1034) (actual time=0.008..9036.776 rows=3900100 loops=1)"
" Buffers: shared hit=26820 read=378984"
"Planning time: 0.136 ms"
"Execution time: 12794.461 ms"
How people around the world negotiates with this problem in postgres? Any alternate solution will be helpful for me as well.
UPDATE:: Adding order by id (tried with other indexed column as well) and here is the explain:
"Limit (cost=506165.06..506178.04 rows=100 width=1034) (actual time=15691.132..15691.494 rows=100 loops=1)"
" Buffers: shared hit=110813 read=415344"
" -> Index Scan using test_table_pkey on test_table (cost=0.43..533078.74 rows=4107370 width=1034) (actual time=38.264..11535.005 rows=3900100 loops=1)"
" Buffers: shared hit=110813 read=415344"
"Planning time: 0.219 ms"
"Execution time: 15691.660 ms"
It's slow because it needs to locate the top offset rows and scan the next 100. No amounts of optimization will change that when you're dealing with huge offsets.
This is because your query literally instruct the DB engine to visit lots of rows by using offset 3900000 -- that's 3.9M rows. Options to speed this up somewhat aren't many.
Super-fast RAM, SSDs, etc. will help. But you'll only gain by a constant factor in doing so, meaning it's merely kicking the can down the road until you reach a larger enough offset.
Ensuring the table fits in memory, with plenty more to spare will likewise help by a larger constant factor -- except the first time. But this may not be possible with a large enough table or index.
Ensuring you're doing index-only scans will work to an extent. (See velis' answer; it has a lot of merit.) The problem here is that, for all practical purposes, you can think of an index as a table storing a disk location and the indexed fields. (It's more optimized than that, but it's a reasonable first approximation.) With enough rows, you'll still be running into problems with a larger enough offset.
Trying to store and maintain the precise position of the rows is bound to be an expensive approach too.(This is suggested by e.g. benjist.) While technically feasible, it suffers from limitations similar to those that stem from using MPTT with a tree structure: you'll gain significantly on reads but will end up with excessive write times when a node is inserted, updated or removed in such a way that large chunks of the data needs to be updated alongside.
As is hopefully more clear, there isn't any real magic bullet when you're dealing with offsets this large. It's often better to look at alternative approaches.
If you're paginating based on the ID (or a date field, or any other indexable set of fields), a potential trick (used by blogspot, for instance) would be to make your query start at an arbitrary point in the index.
Put another way, instead of:
example.com?page_number=[huge]
Do something like:
example.com?page_following=[huge]
That way, you keep a trace of where you are in your index, and the query becomes very fast because it can head straight to the correct starting point without plowing through a gazillion rows:
select * from foo where ID > [huge] order by ID limit 100
Naturally, you lose the ability to jump to e.g. page 3000. But give this some honest thought: when was the last time you jumped to a huge page number on a site instead of going straight for its monthly archives or using its search box?
If you're paginating but want to keep the page offset by any means, yet another approach is to forbid the use of larger page number. It's not silly: it's what Google is doing with search results. When running a search query, Google gives you an estimate number of results (you can get a reasonable number using explain), and then will allow you to brows the top few thousand results -- nothing more. Among other things, they do so for performance reasons -- precisely the one you're running into.
I have upvoted Denis's answer, but will add a suggestion myself, perhaps it can be of some performance benefit for your specific use-case:
Assuming your actual table is not test_table, but some huge compound query, possibly with multiple joins. You could first determine the required starting id:
select id from test_table order by id offset 3900000 limit 1
This should be much faster than original query as it only requires to scan the index vs the entire table. Getting this id then opens up a fast index-search option for full fetch:
select * from test_table where id >= (what I got from previous query) order by id limit 100
You didn't say if your data is mainly read-only or updated often. If you can manage to create your table at one time, and only update it every now and then (say every few minutes) your problem will be easy to solve:
Add a new column "offset_id"
For your complete data set ordered by ID, create an offset_id simply by incrementing numbers: 1,2,3,4...
Instead of "offset ... limit 100" use "where offset_id >= 3900000 limit 100"
you can optimise in two steps
First get maximum id out of 3900000 records
select max(id) (select id from test_table order by id limit 3900000);
Then use this maximum id to get the next 100 records.
select * from test_table id > {max id from previous step) order by id limit 100 ;
It will be faster as both queries will do index scan by id.
This way you get the rows in semi-random order. You are not ordering the results in a query, so as a result, you get the data as it is stored in the files. The problem is that when you update the rows, the order of them can change.
To fix that you should add order by to the query. This way the query will return the rows in the same order. What's more then it will be able to use an index to speed the query up.
So two things: add an index, add order by to the query. Both to the same column. If you want to use the id column, then don't add index, just change the query to something like:
select * from test_table order by id offset 3900000 limit 100;
First, you have to define limit and offset with order by clause or you will get inconsistent result.
To speed up the query, you can have a computed index, but only for these condition :
Newly inserted data is strictly in id order
No delete nor update on column id
Here's how You can do it :
Create a row position function
create or replace function id_pos (id) returns bigint
as 'select count(id) from test_table where id <= $1;'
language sql immutable;
Create a computed index on id_pos function
create index table_by_pos on test_table using btree(id_pos(id));
Here's how You call it (offset 3900000 limit 100):
select * from test_table where id_pos(id) >= 3900000 and sales_pos(day) < 3900100;
This way, the query will not compute the 3900000 offset data, but only will compute the 100 data, making it much faster.
Please note the 2 conditions where this approach can take place, or the position will change.
I don't know all of the details of your data, but 4 million rows can be a little hefty. If there's a reasonable way to shard the table and essentially break it up into smaller tables it could be beneficial.
To explain this, let me use an example. let's say that I have a database where I have a table called survey_answer, and it's getting very large and very slow. Now let's say that these survey answers all come from a distinct group of clients (and I also have a client table keeping track of these clients). Then something I could do is I could make it so that I have a table called survey_answer that doesn't have any data in it, but is a parent table, and it has a bunch of child tables that actually contain the data the follow the naming format survey_answer_<clientid>, meaning that I'd have child tables survey_answer_1, survey_answer_2, etc., one for each client. Then when I needed to select data for that client, I'd use that table. If I needed to select data across all clients, I can select from the parent survey_answer table, but it will be slow. But for getting data for an individual client, which is what I mostly do, then it would be fast.
This is one example of how to break up data, and there are many others. Another example would be if my survey_answer table didn't break up easily by client, but instead I know that I'm typically only accessing data over a year period of time at once, then I could potentially make child tables based off of year, such as survey_answer_2014, survey_answer_2013, etc. Then if I know that I won't access more than a year at a time, I only really need to access maybe two of my child tables to get all the data I need.
In your case, all I've been given is perhaps the id. We can break it up by that as well (though perhaps not as ideal). Let's say that we break it up so that there's only about 1000000 rows per table. So our child tables would be test_table_0000001_1000000, test_table_1000001_2000000, test_table_2000001_3000000, test_table_3000001_4000000, etc. So instead of passing in an offset of 3900000, you'd do a little math first and determine that the table that you want is table test_table_3000001_4000000 with an offset of 900000 instead. So something like:
SELECT * FROM test_table_3000001_4000000 ORDER BY id OFFSET 900000 LIMIT 100;
Now if sharding the table is out of the question, you might be able to use partial indexes to do something similar, but again, I'd recommend sharding first. Learn more about partial indexes here.
I hope that helps. (Also, I agree with Szymon Guz that you want an ORDER BY).
Edit: Note that if you need to delete rows or selectively exclude rows before getting your result of 100, then sharding by id will become very hard to deal with (as pointed out by Denis; and sharding by id is not great to begin with). But if your 'just' paginating the data, and you only insert or edit (not a common thing, but it does happen; logs come to mind), then sharding by id can be done reasonably (though I'd still choose something else to shard on).
How about if paginate based on IDs instead of offset/limit?
The following query will give IDs which split all the records into chunks of size per_page. It doesn't depend on were records deleted or not
SELECT id AS from_id FROM (
SELECT id, (ROW_NUMBER() OVER(ORDER BY id DESC)) AS num FROM test_table
) AS rn
WHERE num % (per_page + 1) = 0;
With these from_IDs you can add links to the page. Iterate over :from_ids with index and add the following link to the page:
:from_id_index
When user visits the page retrieve records with ID which is greater than requested :from_id:
SELECT * FROM test_table WHERE ID >= :from_id ORDER BY id DESC LIMIT :per_page
For the first page link with from_id=0 will work
1
To avoid slow pagination with big tables always use auto-increment primary key then use the query below:
SELECT * FROM test_table WHERE id > (SELECT min(id) FROM test_table WHERE id > ((1 * 10) - 10)) ORDER BY id DESC LIMIT 10
1: is the page number
10: is the records per page
Tested and work well with 50 millions records.
There are two simple approaches to solve such a problem
Splitting the query into two subqueries that the first one do all the heavy job on index-only scan as described here
Create calculated index that holds the offset as described here, this can be enhanced using window functions.

Is there a logically equivalent and efficient version of this query without using a CTE?

I have a query on a postgresql 9.2 system that takes about 20s in it's normal form but only takes ~120ms when using a CTE.
I simplified both queries for brevity.
Here is the normal form (takes about 20s):
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
ORDER BY modified_at DESC
LIMIT 25;
Here is the explain for this query: http://explain.depesz.com/s/2v8
The CTE form (about 120ms):
WITH raw AS (
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
)
SELECT *
FROM raw
ORDER BY modified_at DESC
LIMIT 25;
Here is the explain for the CTE: http://explain.depesz.com/s/uxy
Simply by moving the ORDER BY to the outer part of the query reduces the cost by 99%.
I have two questions: 1) is there a way to construct the first query without using a CTE in such a way that it is logically equivalent more performant and 2) what does this difference in performance say about how the planner is determining how to fetch the data?
Regarding the questions above, are there additional statistics or other planner hints that would help improve the performance of the first query?
Edit: Taking away the limit also causes the query to use a heap scan as opposed to an index scan backwards. Without the LIMIT the query completes in 40ms.
After seeing the effect of the LIMIT I tried with LIMIT 1, LIMIT 2, etc. The query performs in under 100ms when using LIMIT 1 and 10s+ with LIMIT > 1.
After thinking about this some more, question 2 boils down to why does the planner use an index scan backwards in one case and a bitmap heap scan + sort in another logically equivalent case? And how can I "help" the planner use the efficient plan in both cases?
Update:
I accepted Craig's answer because it was the most comprehensive and helpful. The way I ended up solving the problem was by using a query that was practically equivalent though not logically equivalent. At the root of the issue was an index scan backwards of the index on modified_at. In order to inform the planner that this was not a good idea I add a predicate of the form WHERE modified_at >= NOW() - INTERVAL '1 year'. This included enough data for the application but prevented the planner from going down the backwards index scan path.
This was a much lower impact solution that prevented the need to rewrite the queries using either a sub query or a CTE. YMMV.
Here's why this is happening, with the following explanation current until at least 9.3 (if you're reading this and on a newer version, check to make sure it hasn't changed):
PostgreSQL doesn't optimize across CTE boundaries. Each CTE clause is run in isolation and its results are consumed by other parts of the query. So a query like:
WITH blah AS (
SELECT * FROM some_table
)
SELECT *
FROM blah
WHERE id = 4;
will cause the full inner query to get executed. PostgreSQL won't "push down" the id = 4 qualification into the inner query. CTEs are "optimization fences" in that regard, which can be both good or bad; it lets you override the planner when you want to, but prevents you from using CTEs as simple syntactic cleanup for a deeply nested FROM subquery chain if you do need push-down.
If you rephrase the above as:
SELECT *
FROM (SELECT * FROM some_table) AS blah
WHERE id = 4;
using a sub-query in FROM instead of a CTE, Pg will push the qual down into the subquery and it'll all run nice and quickly.
As you have discovered, this can also work to your benefit when the query planner makes a poor decision. It appears that in your case a backward index scan of the table is immensely more expensive a bitmap or index scan of two smaller indexes followed by a filter and sort, but the planner doesn't think it will be so it plans the query to scan the index.
When you use the CTE, it can't push the ORDER BY into the inner query, so you're overriding its plan and forcing it to use what it thinks is an inferior execution plan - but one that turns out to be much better.
There's a nasty workaround that can be used for these situations called the OFFSET 0 hack, but you should only use it if you can't figure out a way to make the planner do the right thing - and if you have to use it, please boil this down to a self-contained test case and report it to the PostgreSQL mailing list as a possible query planner bug.
Instead, I recommend first looking at why the planner is making the wrong decision.
The first candidate is stats / estimates problems, and sure enough when we look at your problematic query plan there's a factor of 3500 mis-estimation of the expected result rows. That's big, but not impossibly big, though it's more interesting that you actually only get one row where the planner is expecting a non-trivial row set. That doesn't help us much, though; if the row count is lower than expected that means that choosing to use the index was a better choice than expected.
The main issue looks like it's not using the smaller, more selective indexes sierra_kilo and papa_lima because it sees the ORDER BY and thinks that it'll save more time doing a backward index scan and avoiding the sort than it really does. That makes sense given that there's only one matching row to sort! If it got the expected 3500 rows then it might've made more sense to avoid the sort, though that's still a fairly small rowset to just sort in memory.
Do you set any parameters like enable_seqscan, etc? If you do, unset them; they're for testing only and totally inappropriate for production use. If you aren't using the enable_ params I think it's worth raising this on the PostgreSQL mailing list pgsql-perform. The anonymized plans make this a bit difficult, though, especially since there's no gurantee that identifiers from one plan refer to the same objects in the other plan, and they don't match what you wrote in the query on the question. You'll want to produce a properly hand-done version where everything matches up before asking on the mailing list.
There's a fairly good chance that you'll need to provide the real values for anyone to help. If you don't want to do that on a public mailing list, there's another option available. (I should note that I work for one of them, per my profile).
Just a shot in the dark, but what happens if you run this
SELECT *
FROM (
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
) AS x
ORDER BY modified_at DESC
LIMIT 25;

Optimization of count query for PostgreSQL

I have a table in postgresql that contains an array which is updated constantly.
In my application i need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <# table.array_column)
But when increasing the amount of rows and the amount of executions of that query (several times per second, possibly hundreds or thousands) the performance decreses a lot, it seems to me that the counting in postgresql might have a linear order of execution (I’m not completely sure of this).
Basically my question is:
Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT ARRAY[...] <# indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable(1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <# arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <# arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires it also needs an GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice times as long as your original query and at best half as long on the same data set. It's worst for cases where the index is not very selective like ARRAY[1] - 4s vs 2s for the original query. Where the index is highly selective (ie: not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as #debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as #maniek suggests.
Is there an existing pattern I’m not aware of that applies to this
situation? what would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with Your current data model You are out of luck. Try to think of an algorithm that the database has to execute for Your query. There is no way it could work without sequential scanning of data.
Can You arrange the column so that it stores the inverse of data (so that the the query would be select count(id) from table where ARRAY[‘parameter value’] <# table.array_column) ? This query would use a gin/gist index.