At work, we are using a program that writes all its values into a single table.
Some of this data is, for instance, machine settings, and it can happen that these values do not change for a whole month. The program only writes when the data changes.
So getting the last data point has always been done with something like:
SELECT value
FROM table
WHERE tag_id = id
ORDER BY timestamp DESC
LIMIT 1;
If I understand correctly, in this case it will scan the whole table, which means longer execution time and wasted resources.
I will have multiple of these functions running and hammering the server, which is not acceptable.
I cannot use the last value of the id sequence, because multiple kinds of values are written to the same table and share the same id sequence. So we are talking about the last row for a certain tag_id (those differ per kind of data), not the last row of the table.
I was looking at FETCH FIRST, but from my understanding it will take the same time as LIMIT.
I was thinking along the lines of a FOR loop going back through sections of time until a value is found, and maybe later optimizing it with a temporary table that stores the timestamp of the last value, then searching from that timestamp onward and, if nothing is found, using the last value.
But I was wondering if there is a better/faster, more optimized way in Postgres.
If you want to speed up the query, you should create an index. Switching to a loop is almost always a recipe for making things slower.
To support your query, create the following index:
create index on the_table (tag_id, "timestamp");
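For reference, a comparable test setup can be created along these lines; a minimal sketch with hypothetical column names and types, the data generated via generate_series:

create table the_table (
    tag_id      integer,
    "timestamp" timestamptz,
    value       double precision
);

insert into the_table (tag_id, "timestamp", value)
select n % 1000,                        -- 1000 distinct tag_id values
       now() - n * interval '1 second', -- spread the timestamps out
       random()                         -- dummy payload
from generate_series(1, 1000000) as g(n);

-- plus the index shown above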
On a test table with a million rows and 1000 different tag_id values, the execution plan looks like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.42..1.39 rows=1 width=53) (actual time=0.066..0.067 rows=1 loops=1)
  Buffers: shared hit=1 read=3
  I/O Timings: read=0.051
  ->  Index Scan Backward using test_table_tag_id_timestamp_idx on test_table  (cost=0.42..924.66 rows=953 width=53) (actual time=0.066..0.066 rows=1 loops=1)
        Index Cond: (tag_id = 41)
        Buffers: shared hit=1 read=3
        I/O Timings: read=0.051
Planning:
  Buffers: shared hit=57 read=1
  I/O Timings: read=0.011
Planning Time: 5.717 ms
Execution Time: 0.089 ms
You can see that Postgres only needed to read 4 blocks (shared hit=1 read=3) to get the row in question. This will be pretty much constant even if the table grows.
Note that the high planning time is caused by the buffer reads needed to fetch the table's metadata, because I ran the query right after creating the table and inserting the rows. If the query is run a second time, this pretty much vanishes, as the metadata will be cached and the planning time will go down to substantially less than one millisecond.
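As a side note (not part of the original answer): if you ever need the latest row for every tag in a single statement rather than one lookup per tag, DISTINCT ON can make use of the same (tag_id, "timestamp") index; a sketch, assuming the table above:

select distinct on (tag_id) tag_id, "timestamp", value
from the_table
order by tag_id, "timestamp" desc;

Whether the planner actually uses the index for this depends on the data distribution, so check the plan with EXPLAIN.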
Related
I have a table with 4707838 rows. When I run the following query on this table it takes around 9 seconds to execute.
SELECT json_agg(
         json_build_object(
           'accessorId', p."accessorId",
           'mobile', json_build_object(
             'enabled', p.mobile,
             'settings', json_build_object(
               'proximityAccess', p."proximity",
               'tapToAccess', p."tapToAccess",
               'clickToAccessRange', p."clickToAccessRange",
               'remoteAccess', p."remote")),
           'card', json_build_object('enabled', p."card"),
           'fingerprint', json_build_object('enabled', p."fingerprint"))
       ) AS permissions
FROM permissions AS p
WHERE p."accessPointId"=99
The output of explain analyze is as follows:
Aggregate  (cost=49860.12..49860.13 rows=1 width=32) (actual time=9011.711..9011.712 rows=1 loops=1)
  Buffers: shared read=29720
  I/O Timings: read=8192.273
  ->  Bitmap Heap Scan on permissions p  (cost=775.86..49350.25 rows=33991 width=14) (actual time=48.886..8704.470 rows=36556 loops=1)
        Recheck Cond: ("accessPointId" = 99)
        Heap Blocks: exact=29331
        Buffers: shared read=29720
        I/O Timings: read=8192.273
        ->  Bitmap Index Scan on composite_key_accessor_access_point  (cost=0.00..767.37 rows=33991 width=0) (actual time=38.767..38.768 rows=37032 loops=1)
              Index Cond: ("accessPointId" = 99)
              Buffers: shared read=105
              I/O Timings: read=32.592
Planning Time: 0.142 ms
Execution Time: 9012.719 ms
This table has a btree index on the accessorId column and a composite index on (accessorId, accessPointId).
Can anyone tell me what could be the reason for this query to be slow even though it uses an index?
Over 90% of the time is spent waiting for data from disk. At about 0.28 ms per read (8192 ms for 29720 reads), that is pretty fast for a hard drive (suggesting that much of the data was already in the filesystem cache, or that some of the reads pulled in neighboring data that was also eventually required, i.e. sequential rather than purely random reads), but slow for an SSD.
If you set enable_bitmapscan=off and clear the cache (or pick a not recently used "accessPointId" value) what performance do you get?
How big is the table? If you are reading a substantial fraction of the table and think you are not getting as much benefit from sequential reads as you should be, you can try making your OS's readahead settings more aggressive. On Linux that is something like sudo blockdev --setra ...
You could put all columns referred to by the query into the index, to enable index-only scans. But given the number of columns you are using, that might be impractical. You would want "accessPointId" to be the first column in the index. By the way, is the index you list really on (accessorId, accessPointId)? It looks to me like "accessPointId" is actually the first column in that index, not the second one.
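A sketch of such an index, with "accessPointId" as the leading key and the remaining columns carried along for index-only scans (the index name is made up; the INCLUDE clause requires PostgreSQL 11 or later):

create index permissions_access_point_covering_idx
    on permissions ("accessPointId")
    include ("accessorId", mobile, "proximity", "tapToAccess",
             "clickToAccessRange", "remote", "card", "fingerprint");

Index-only scans also rely on the visibility map, so the table needs to be vacuumed reasonably often for this to pay off.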
You could cluster the table by an index which has "accessPointId" as the first column. That would group the related records together for faster access. But note it is a slow operation and takes a strong lock on the table while it is running, and future data going into the table won't be clustered, only the current data.
You could try to increase effective_io_concurrency so that you can have multiple io requests outstanding at a time. How effective this is will depend on your hardware.
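Rough sketches of the last two suggestions (the index name is hypothetical; note that CLUSTER rewrites the table under an exclusive lock):

-- physically reorder the table by "accessPointId"
create index permissions_access_point_idx on permissions ("accessPointId");
cluster permissions using permissions_access_point_idx;
analyze permissions;

-- allow several outstanding I/O requests during bitmap heap scans
-- (a suitable value depends on your hardware)
set effective_io_concurrency = 32;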
I am interested in understanding how Postgres reads pages from disk/cache when using an index.
Consider querying an indexed single-column table of integers:
select i
into numbers
from generate_series(1, 200000) s(i);
create index idx on numbers(i);
explain (buffers, analyse)
select * from numbers where i = 456789; -- random row
This single index-only seek requires 3 page reads on this 200k row table (Buffers: shared hit=3):
Index Only Scan using idx on numbers  (cost=0.42..8.44 rows=1 width=4) (actual time=0.010..0.010 rows=0 loops=1)
  Index Cond: (i = 456789)
  Heap Fetches: 0
  Buffers: shared hit=3
Planning Time: 0.043 ms
Execution Time: 0.022 ms
Is this expected? What do the 3 pages relate to? Is this simply the number of index pages which had to be read in order to traverse the B-Tree?
Background: I am trying to tune a recursive CTE which walks a linked-list structure stored in a single table as a parent-child relationship / adjacency-tree. The recursive section of the query is a very simple index seek similar to above. Each 'loop' of the recursive CTE results in 3 page reads (as per above), which is where the majority of the cost of the query lies. It may be impossible to make this any more efficient, but I was wondering if this could be improved somehow (currently ~30000 page reads for a 10k node chain, ~25ms cached).
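For context, the recursive CTE described is roughly of the following shape (a sketch only; the table and column names are hypothetical, and an index on nodes(parent_id) is assumed):

-- walk a parent/child chain starting at one node;
-- each iteration of the recursive term is a single index seek
with recursive chain as (
    select id, parent_id
    from nodes
    where id = 1                        -- the starting node
  union all
    select n.id, n.parent_id
    from nodes n
    join chain c on n.parent_id = c.id  -- follow the link to the next node
)
select * from chain;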
You were lucky that autovacuum finished before you ran your query, otherwise it would have been 4 blocks.
Your query accessed the root node, an intermediate node and a leaf node of the index. There was no need to access a table block (Heap Fetches: 0), because
1) all the information is available in the index, and
2) the table's visibility map indicated that the table block was all-visible, so there was no need to consult the table block for the visibility information.
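If you want to verify both points yourself, something along these lines works (bt_metap comes from the pageinspect contrib extension; leaf pages are at level 0, so a root level of 2 means three index pages per descent):

-- keep the visibility map up to date so index-only scans can skip heap fetches
vacuum numbers;

-- inspect the depth of the B-tree index
create extension if not exists pageinspect;
select level from bt_metap('idx');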
I'm trying to get distinct values from a nested field of a JSONB column, but it takes about 2 minutes on a 400K-row table.
The original query used DISTINCT, but then I read that GROUP BY works better, so I tried that too. No luck, it is still extremely slow.
Adding an index did not help either:
create index "orders_financial_status_index" on orders ((data ->'data'->> 'financial_status'));
EXPLAIN ANALYZE gave this result:
HashAggregate  (cost=13431.16..13431.22 rows=4 width=32) (actual time=123074.941..123074.943 rows=4 loops=1)
  Group Key: ((data -> 'data'::text) ->> 'financial_status'::text)
  ->  Seq Scan on orders  (cost=0.00..12354.14 rows=430809 width=32) (actual time=2.993..122780.325 rows=434080 loops=1)
Planning time: 0.119 ms
Execution time: 123074.979 ms
It's worth mentioning that there are no null values on this column, and currently there are 4 unique values.
What should I do in order to query the distinct values faster?
No index will make this faster, because the query has to scan the whole table.
As you can see, the sequential scan uses almost all the time; the hash aggregate is fast.
Still I would not drop the index, because it allows PostgreSQL to estimate the number of groups accurately and decide on the more efficient hash aggregate rather than sorting the rows. You can try without the index to be sure.
However, two minutes for half a million rows is very slow. Do you have slow storage? Is the table bloated? If the latter, VACUUM (FULL) should improve things.
You can speed up the query by reducing I/O. Load the table into RAM with pg_prewarm, then processing should be considerably faster.
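A sketch of the pg_prewarm approach (the extension ships with PostgreSQL; the function returns the number of blocks it loaded):

create extension if not exists pg_prewarm;
select pg_prewarm('orders');  -- load the whole table into shared buffers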
I have a postgres DB with a single table (reddit_comments) that contains all the reddit comments since 2007. There are only 10 columns in the table but I am only trying to query off of subreddit which is a text field. I have built a btree index for the subreddit.
Notes about the table:
1) About 1.5-2 billion rows.
2) There are no more insertions or deletions from the table. It is static.
3) There are 2 more indexes (author and month)
About hardware:
1) Intel 8 core processor
2) 128 GB of ram
3) Stored on a 7200 SATA drive
When I run the following query:
EXPLAIN (ANALYZE, BUFFERS) select * from reddit_comments WHERE
subreddit = 'boston' LIMIT 20000;
The query takes a significant amount of time and I get the following output:
Limit  (cost=0.70..80375.57 rows=20000 width=320) (actual time=32.421..52218.645 rows=20000 loops=1)
  Buffers: shared hit=344 read=19532
  I/O Timings: read=52051.619
  ->  Index Scan using subr_idx on reddit_comments  (cost=0.70..1487554.68 rows=370154 width=320) (actual time=32.419..52202.785 rows=20000 loops=1)
        Index Cond: (subreddit = 'boston'::text)
        Buffers: shared hit=344 read=19532
        I/O Timings: read=52051.619
Planning time: 0.184 ms
Execution time: 52228.975 ms
If I do not set LIMIT 20000, it takes hours to run (for about 600,000 results).
I have tried to implement many of the suggestions from here:
https://wiki.postgresql.org/wiki/SlowQueryQuestions
but nothing seems to be speeding up the process. Is there something I am missing that would increase performance or is it just going to be slow to query this database whenever I need to get more data?
The data you want is spread all over the disk so it takes a lot of time to read it. If you're gonna operate mainly on subreddits you can execute:
CLUSTER reddit_comments USING subr_idx
This will reorder the data in the table so that your query has to read far fewer pages. Queries based on other filter terms may become slower, though, and CLUSTER locks the table exclusively and takes a long time to run (ref).
I have a table with an index:
Table:
Participates (player_id integer, other...)
Indexes:
"index_participates_on_player_id" btree (player_id)
The table contains 400 million rows.
I execute the same query two times:
Query: explain analyze select * from participates where player_id=149294217;
First time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=261.061..2025.559 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 2025.644 ms
(3 rows)
Second time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=0.030..0.479 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 0.527 ms
(3 rows)
So the first execution has a large actual time. How can I speed up the first execution?
UPDATE
Sorry, I mean: how can I accelerate the first query?
Why is the index scan so slow the first time?
The difference in execution time is most likely because, the second time through, the table/index data from the first run of the query is already in the shared buffers cache, so the subsequent run doesn't have to go all the way to disk for that information.
Edit:
Regarding the slowness of the original query, does the table have a lot of dead tuples? Those can slow things down considerably. If so, VACUUM ANALYZE the table.
Another factor can be if there are long-ish idle transactions on the server (i.e. several minutes or more). Due to the nature of MVCC this can also slow even index-based queries down quite a bit.
Also, the planner's row estimate is quite different from the actual result (6304 estimated vs. 332 actual), so you may want to run ANALYZE on the table beforehand to update the statistics.
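To check for dead tuples and refresh the statistics mentioned above, something like this can help (standard statistics views, nothing specific to this schema):

-- how much bloat is there, and when did vacuum/analyze last run?
select n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
from pg_stat_user_tables
where relname = 'participates';

-- remove dead tuples and refresh the planner statistics
vacuum analyze participates;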
1.) Take a look at http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html and check the options for tuning the server to use more memory. This can speed up your search, but comes with no guarantee (it depends on the answer above)!
2.) Move some of your tables/indexes to a faster tablespace, for example a tablespace backed by SSDs (a sketch follows below).
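Sketches of both suggestions; the paths, names and values are examples only and need to be adapted to your system:

-- 1) memory-related settings (values are examples; shared_buffers needs a restart)
alter system set shared_buffers = '8GB';
alter system set effective_cache_size = '64GB';   -- planner hint: how much the OS caches
select pg_reload_conf();                          -- applies the reloadable settings

-- 2) an SSD-backed tablespace for the hot index
-- (the directory must already exist and be owned by the postgres OS user)
create tablespace fast_ssd location '/mnt/ssd/pg_tblspc';
alter index index_participates_on_player_id set tablespace fast_ssd;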