I am using PostgreSQL 12. My table contains 1 billion records and is partitioned by month on a date range; every day adds more than 3 million records. When I select some set of IDs, it takes a long time.
I wanted to filter 20 days, but that takes a very long time, so I select records for one day only. Even that takes more than 10 seconds...
Below is my query:
SELECT id, col1, col2, col3, col4, col5, col6, LogTime
FROM myLog R
WHERE R.id IN (181815459, …… 500 IDs ……, 180556591)
  AND R.LogTime::date = '2019-07-29'
  AND R.LogTime >= '2019-07-29 00:00:00' AND R.LogTime < '2019-07-30 00:00:00'
ORDER BY R.LogTime DESC;
Below is my query plan:
"Sort (cost=2556.35.2556.35 rows=1 width=298) (actual time 10088.084.10088.091 rows=557 loops-1)"
" Sort Key: R.LogTime DESC”
" Sort Method: quicksort Memory: 172 kB
-> Index Scan using p_my_log201907_LogTime_id_idx on p_mylog201907 r (cost=
0.56..2556.34 rows=1 width-298) (actual time=69.056..10085.712 rows=557 loops=1)"
Index Cond: (Logtime):: date = "2019-07-29’::date)
AND (id = ANY (‘{1818154 59,…………500 IDS………..,180556591}’::bigint[])
Filter: ( Logtime>= ‘2019-07-29 00:00:00’:: timestamp without time
AND (Logtime < ‘2019-07-30 00:00:00’:: timestamp without time zone)}"
"Planning Time: 0.664 ms
"Execution Time: 10088.189 ms
Below is my index:
CREATE INDEX Idx_LogTime ON MyLog( (LogTime::date) DESC,id desc);
At the time of query execution I set work_mem to '1GB'. Please advise: how can I optimize and speed up my query?
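Presumably that was a session-level setting, along these lines (a sketch; the exact statement isn't shown above):
SET work_mem = '1GB';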
I have a table with more than 3 million rows; one column, named creationdate, is a timestamp without time zone.
I created a couple of indexes on it, like:
"idx_routingtasks_creationdate" btree (creationdate)
"idx_routingtasks_creationdate2" btree ((creationdate::date))
When filtering by creationdate::date (casting to date), the index idx_routingtasks_creationdate is not used:
explain analyze select * from routingtasks where creationdate::date < (now() - interval '1 day')::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1315631.93 rows=2811638 width=715) (actual time=186642.012..413763.643 rows=2800659 loops=1)
Filter: ((creationdate)::date < ((now() - '1 day'::interval))::date)
Rows Removed by Filter: 212248
Planning time: 0.195 ms
Execution time: 413875.829 ms
(5 rows)
The same happens when not casting to date:
explain analyze select * from routingtasks where creationdate < now() - interval '1 day';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on routingtasks (cost=0.00..1300588.39 rows=2918447 width=715) (actual time=72089.312..327288.333 rows=2876756 loops=1)
Filter: (creationdate < (now() - '1 day'::interval))
Rows Removed by Filter: 141052
Planning time: 0.104 ms
Execution time: 327401.745 ms
(5 rows)
How can I create an index on the creationdate column that allows this filter to use it?
The answer lies in this part of the execution plan:
Seq Scan ... (actual ... rows=2876756 ...)
...
Rows Removed by Filter: 141052
Since almost all rows are returned anyway, using a sequential scan and discarding the few rows that are filtered out is the most efficient way to process the query.
If you want to verify that, temporarily
SET enable_seqscan = off;
to make PostgreSQL avoid a sequential scan if possible. Then you can test if query execution gets faster or not.
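For example (a sketch; the setting is per session, so it only affects the current connection):
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT * FROM routingtasks
WHERE creationdate::date < (now() - interval '1 day')::date;
RESET enable_seqscan;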
I'm making two queries against a contacts table (1854453 total records) and a notes table (956467 total records). Although their query plans are very similar, the notes query takes considerably longer to process while the contacts query is really fast. Below are the queries with their query plans:
Contacts query (0.9 ms):
Contact Load (0.9ms) SELECT "contacts".* FROM "contacts" WHERE "contacts"."discarded_at" IS NULL AND "contacts"."firm_id" = $1 ORDER BY id DESC LIMIT $2 [["firm_id", 1], ["LIMIT", 2]]
=> EXPLAIN (ANALYZE,BUFFERS) SELECT "contacts".* FROM "contacts" WHERE "contacts"."discarded_at" IS NULL AND "contacts"."firm_id" = 1 ORDER BY id DESC LIMIT 2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..11.27 rows=2 width=991) (actual time=5.407..5.412 rows=2 loops=1)
Buffers: shared hit=7 read=70
-> Index Scan Backward using contacts_pkey on contacts (cost=0.43..484798.76 rows=89438 width=991) (actual time=5.406..5.410 rows=2 loops=1)
Filter: ((discarded_at IS NULL) AND (firm_id = 1))
Rows Removed by Filter: 86
Buffers: shared hit=7 read=70
Planning Time: 0.271 ms
Execution Time: 5.440 ms
Notes query (294.5ms):
Note Load (294.5ms) SELECT "notes".* FROM "notes" WHERE "notes"."firm_id" = $1 ORDER BY id DESC LIMIT $2 [["firm_id", 1], ["LIMIT", 2]]
=> EXPLAIN (ANALYZE,BUFFERS) SELECT "notes".* FROM "notes" WHERE "notes"."firm_id" = 1 ORDER BY id DESC LIMIT 2
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.88 rows=2 width=390) (actual time=387.278..387.280 rows=2 loops=1)
Buffers: shared hit=29871 read=36815
-> Index Scan Backward using notes_pkey on notes (cost=0.42..115349.39 rows=502862 width=390) (actual time=387.277..387.278 rows=2 loops=1)
Filter: (firm_id = 1)
Rows Removed by Filter: 271557
Buffers: shared hit=29871 read=36815
Planning Time: 5.389 ms
Execution Time: 387.322 ms
Both tables have an index on firm_id, and contacts also has an index on the discarded_at column.
Is the difference in query time because of the number of rows that Postgres has to check? If not, what could account for that difference? Let me know if any other information is necessary.
In both cases PostgreSQL reads the rows in index order to avoid an explicit sort, and keeps discarding rows that don't meet the filter condition until it has found two rows that match.
The difference is that in the first case the goal is reached after discarding only 86 rows, while in the second case almost 300000 rows have to be scanned.
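To illustrate that mechanism (this goes beyond the answer above and is only a sketch): an index matching both the filter and the sort order would let the scan stop after the first two matching rows instead of discarding hundreds of thousands, for example:
CREATE INDEX notes_firm_id_id_idx ON notes (firm_id, id DESC);  -- hypothetical index name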
The following search takes 15 seconds despite an index on the last_time column, which is a timestamp:
select * from my_table where last_time>now() - interval '1 hour'
The table has around 15 million rows and the query returns around 100 rows.
I have recreated the index and vacuumed the table, but the search is still slow.
Similar searches on similar tables of comparable size return around 1000 rows in < 1 sec.
Seq Scan on my_table (cost=0.00..625486.61 rows=3817878 width=237) (actual time=16397.053..16397.054 rows=0 loops=1)
Filter: (last_time > (now() - '01:00:00'::interval))
Rows Removed by Filter: 11453235
Buffers: shared hit=73 read=424975
Planning Time: 0.290 ms
Execution Time: 16397.097 ms
Any suggestions?
Approximately every 10 min I insert ~50 records with the same timestamp.
That is roughly 300 records per hour, 7,200 records per day, or about 2.6 million records per year.
The user wants to retrieve all records whose timestamp is closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
    ('2019-01-02 10:00', 5),
    ('2019-01-02 10:00', 12),
    ('2019-01-02 10:00', 7),
    ....;
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition table A by timestamp to have one partition per year, but the approximate match above will still be slow.
Design #2 - create 2 tables:
1st table: keeps the unique timestamps and an auto-incremented PK,
2nd table: keeps the data and a foreign key to the 1st table's PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of just one (see the sketch after this list)
- I lose the ability to partition the DATA table by timestamp
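For what it's worth, the double insert could be done in a single statement with a data-modifying CTE (a sketch against the schema above, not part of the original design):
WITH ts AS (
    INSERT INTO UNIQ_TIMESTAMP (t) VALUES ('2019-01-02 10:00') RETURNING id
)
INSERT INTO DATA (id, value)
SELECT ts.id, v.value
FROM ts CROSS JOIN (VALUES (5), (12), (7)) AS v(value);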
What would you recommend?
I'd go with the single-table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
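Putting it together, the query from the question with that additional condition might look like this (a sketch: asked_time is the placeholder from the question, and I'm assuming the condition also belongs inside the subselect, since that is where the scan happens):
SELECT *
FROM a
WHERE t = (SELECT t
           FROM a
           WHERE date_trunc('hour', t + INTERVAL '30 minutes')
               = date_trunc('hour', asked_time + INTERVAL '30 minutes')
           ORDER BY greatest(t - asked_time, asked_time - t)
           LIMIT 1)
  AND date_trunc('hour', t + INTERVAL '30 minutes')
    = date_trunc('hour', asked_time + INTERVAL '30 minutes');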
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see, it requires only very few I/O operations, and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to that requested timestamp, you could remove the IN query completely and simply use a limit 100 in both parts of the union. That makes the query a bit more efficient as there is no second step for evaluating the IN condition, but might return more rows than you want.
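A sketch of that simplified variant (assuming 100 rows per side is enough):
(select *
 from a
 where t >= timestamp '2019-03-01 17:00:00'
 order by t
 limit 100)
union all
(select *
 from a
 where t <= timestamp '2019-03-01 17:00:00'
 order by t desc
 limit 100);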
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
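For reference, yearly partitioning could be set up with declarative range partitioning, roughly like this (a sketch; table and partition names are assumed, and creating the index on the parent needs PostgreSQL 11 or later):
create table a (t timestamp, value int) partition by range (t);
create table a_2019 partition of a
    for values from ('2019-01-01') to ('2020-01-01');
create index on a (t);  -- cascades to all partitions on PostgreSQL 11+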
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
The query then gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension, which provides a "distance" operator <->.
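The extension has to be installed once per database (assuming it isn't already there):
create extension if not exists btree_gist;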
Then you can create a GiST index on the timestamp:
create index on a using gist (t) ;
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);
I have a very simple SQL:
select * from email.email_task where acquire_time < now() and state IN ('CREATED', 'RELEASED') order by creation_time asc limit 1;
I have 2 indexes created (their assumed definitions are sketched below):
index1: on state
index2: on (state, acquire_time, creation_time)
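The question doesn't show the exact DDL, so these hypothetical definitions are only for illustration:
create index idx_email_task_state on email.email_task (state);  -- index1, name assumed
create index idx_email_task_state_acq_cre on email.email_task (state, acquire_time, creation_time);  -- index2, name assumed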
Ideally I think Postgres should pick the 2nd one, since it matches every column required by this SQL.
However, the execution plan shows differently; it uses neither of the indexes:
Limit (cost=187404.36..187404.36 rows=1 width=743)
-> Sort (cost=187404.36..190753.58 rows=1339690 width=743)
Sort Key: creation_time
-> Seq Scan on email_task (cost=0.00..180705.91 rows=1339690 width=743)
Filter: (((state)::text = 'CREATED'::text) AND (acquire_time < now()))
I understand that if the number of rows returned approaches something like 10% of the total, the planner picks a Seq Scan over an Index Scan (as explained in Why does PostgreSQL perform sequential scan on indexed column?). So that's why index1 is not picked.
What I don't understand is why index2 is not picked, since it matches all the columns.
Then I tried a 3rd index:
index3: on (creation_time, acquire_time, state)
And this time it uses index3 (I added the index, named perf_1, in another, smaller database, because the original one has 2 million rows and creating the index there takes too much time):
Limit (cost=0.29..0.36 rows=1 width=75) (actual time=0.043..0.043 rows=1 loops=1)
-> Index Scan using perf_1 on email_task (cost=0.29..763.76 rows=9998 width=75) (actual time=0.042..0.042 rows=1 loops=1)
Index Cond: (acquire_time < now())
Filter: ((state)::text = ANY ('{CREATED,RELEASED}'::text[]))
It seems that the Postgres planner considers the ORDER BY clause first and then the WHERE clause, which is a bit counter-intuitive.
Is my understanding correct, or are there other factors that impact the Postgres planner?
Thanks in advance.