Simple postgres queries very slow - postgresql

I have a postgres DB with a single table (reddit_comments) that contains all the reddit comments since 2007. There are only 10 columns in the table but I am only trying to query off of subreddit which is a text field. I have built a btree index for the subreddit.
Notes about the table:
1) About 1.5-2 billion rows.
2) There are no more insertions or deletions from the table. It is static.
3) There are 2 more indexes (author and month)
About hardware:
1) Intel 8 core processor
2) 128 GB of ram
3) Stored on a 7200 SATA drive
When I run the following query:
EXPLAIN (ANALYZE, BUFFERS) select * from reddit_comments WHERE
subreddit = 'boston' LIMIT 20000;
The query takes a significant amount of time and I get the following output:
Limit  (cost=0.70..80375.57 rows=20000 width=320) (actual time=32.421..52218.645 rows=20000 loops=1)
  Buffers: shared hit=344 read=19532
  I/O Timings: read=52051.619
  ->  Index Scan using subr_idx on reddit_comments  (cost=0.70..1487554.68 rows=370154 width=320) (actual time=32.419..52202.785 rows=20000 loops=1)
        Index Cond: (subreddit = 'boston'::text)
        Buffers: shared hit=344 read=19532
        I/O Timings: read=52051.619
Planning time: 0.184 ms
Execution time: 52228.975 ms
If I don't set LIMIT 20000, it takes hours to run (for about 600,000 results).
I have tried to implement many of the suggestions from here:
https://wiki.postgresql.org/wiki/SlowQueryQuestions
but nothing seems to be speeding up the process. Is there something I am missing that would increase performance or is it just going to be slow to query this database whenever I need to get more data?

The data you want is spread all over the disk, so it takes a lot of time to read it. If you are going to filter mainly on subreddit, you can execute:
CLUSTER reddit_comments USING subr_idx
This will reorder the data in the table so that the query in your question has to read far fewer pages. Be aware that queries filtering on other columns may become slower, and that CLUSTER locks the table exclusively and takes a long time to run (ref).
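Before committing to that, a quick way to gauge how scattered the matching rows currently are is the planner's physical-order correlation statistic for the column (a minimal sketch, using the table, column and index names from the question):

-- correlation near +1/-1: rows are roughly in index order on disk;
-- near 0: the rows for one subreddit are scattered over many pages.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'reddit_comments'
  AND attname = 'subreddit';

-- Rewrites the whole table in index order; takes an exclusive lock.
CLUSTER reddit_comments USING subr_idx;
ANALYZE reddit_comments;

With 1.5-2 billion rows on a single 7200 RPM drive, expect the rewrite itself to take many hours, so plan it as a one-off maintenance window.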

Related

Slow postgres query even though it does bitmap index scan

I have a table with 4707838 rows. When I run the following query on this table it takes around 9 seconds to execute.
SELECT json_agg(
         json_build_object(
           'accessorId', p."accessorId",
           'mobile', json_build_object(
             'enabled', p.mobile,
             'settings', json_build_object(
               'proximityAccess', p."proximity",
               'tapToAccess', p."tapToAccess",
               'clickToAccessRange', p."clickToAccessRange",
               'remoteAccess', p."remote"
             )
           ),
           'card', json_build_object('enabled', p."card"),
           'fingerprint', json_build_object('enabled', p."fingerprint")
         )
       ) AS permissions
FROM permissions AS p
WHERE p."accessPointId" = 99
The output of explain analyze is as follows:
Aggregate (cost=49860.12..49860.13 rows=1 width=32) (actual time=9011.711..9011.712 rows=1 loops=1)
Buffers: shared read=29720
I/O Timings: read=8192.273
-> Bitmap Heap Scan on permissions p (cost=775.86..49350.25 rows=33991 width=14) (actual time=48.886..8704.470 rows=36556 loops=1)
Recheck Cond: ("accessPointId" = 99)
Heap Blocks: exact=29331
Buffers: shared read=29720
I/O Timings: read=8192.273
-> Bitmap Index Scan on composite_key_accessor_access_point (cost=0.00..767.37 rows=33991 width=0) (actual time=38.767..38.768 rows=37032 loops=1)
Index Cond: ("accessPointId" = 99)
Buffers: shared read=105
I/O Timings: read=32.592
Planning Time: 0.142 ms
Execution Time: 9012.719 ms
This table has a btree index on the accessorId column and a composite index on (accessorId, accessPointId).
Can anyone tell me what could be the reason for this query to be slow even though it uses an index?
Over 90% of the time is spent waiting to get data from disk. At roughly 0.28 ms per read (8192 ms of read time spread over 29720 buffer reads), that is pretty fast for a hard drive (suggesting that much of the data was already in the filesystem cache, or that some reads brought in neighboring data that was also eventually needed, i.e. sequential rather than purely random reads) but slow for an SSD.
If you set enable_bitmapscan=off and clear the cache (or pick an "accessPointId" value that has not been queried recently), what performance do you get?
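For example, in a psql session (a sketch; the plain SELECT * stands in for the json_agg query from the question, and the setting only affects the current session):

-- Disable bitmap scans for this session only, then compare the plans.
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM permissions AS p WHERE p."accessPointId" = 99;
RESET enable_bitmapscan;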
How big is the table? If you are reading a substantial fraction of the table and think you are not getting as much benefit from sequential reads as you should be, you can try making your OS's read-ahead settings more aggressive. On Linux that is something like sudo blockdev --setra ...
You could put all the columns referred to by the query into the index to enable index-only scans, but given the number of columns you are using that might be impractical. You would want "accessPointId" to be the first column in such an index. By the way, is the index currently being used really on (accessorId, accessPointId)? It looks to me like "accessPointId" is actually the first column in that index, not the second.
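If you do want to try it, a sketch of such a covering index (the index name is made up; INCLUDE requires PostgreSQL 11 or later, and the column list is taken from the query above):

-- "accessPointId" leads the index so the WHERE clause can use it; the
-- INCLUDE columns allow an index-only scan as long as the visibility
-- map is reasonably up to date (run VACUUM if it is not).
CREATE INDEX CONCURRENTLY permissions_access_point_covering_idx
  ON permissions ("accessPointId")
  INCLUDE ("accessorId", mobile, "proximity", "tapToAccess",
           "clickToAccessRange", "remote", "card", "fingerprint");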
You could cluster the table by an index which has "accessPointId" as the first column. That would group the related records together for faster access. But note it is a slow operation and takes a strong lock on the table while it is running, and future data going into the table won't be clustered, only the current data.
You could try to increase effective_io_concurrency so that you can have multiple I/O requests outstanding at a time. How effective this is will depend on your hardware.
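A sketch (the value is only an illustration; the useful range depends on the storage):

-- Mainly benefits bitmap heap scans by letting Postgres issue several
-- prefetch requests at once. Typical values: 1-2 for a single spinning
-- disk, 100-200 or more for SSDs. Shown here as a session setting.
SET effective_io_concurrency = 200;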

Postgresql getting last row without use of sequence or timestamp

At job, we are using a program which is writing all its values in a single table.
Some of this data is, for instance, machine settings, and it can happen that the data does not change for a whole month. The program only writes when a value changes.
So getting the last data point has always been done with something like:
SELECT value
FROM table
WHERE tag_id = id
ORDER BY "timestamp" DESC
LIMIT 1;
If I understand correctly, in this case it will scan the whole table, which means a longer execution time and wasted resources.
I will have many of these queries running at the same time, hammering the server, which is not acceptable.
I cannot use the last value of the id sequence, because multiple kinds of values are written to the same table and share the same id sequence. So what I need is the last row for a certain tag_id (those differ between kinds of data), not the last row of the table.
I was looking at FETCH, but from my understanding it will take the same time as LIMIT.
I was thinking along the lines of a FOR loop that walks back one time slice at a time until a value is found, and maybe later optimizing with a temporary table that stores the timestamp of the last value, searching from that timestamp onward and falling back to the stored value if nothing newer is found.
But I was wondering if there is a better, faster, more optimized way in Postgres.
If you want to speed up the query, you should create an index. Switching to a LOOP is almost always a recipe to make things slower.
To support your query, create the following index:
create index on the_table (tag_id, "timestamp");
On a test table with a million rows and 1000 different tag_id values, the execution plan looks like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..1.39 rows=1 width=53) (actual time=0.066..0.067 rows=1 loops=1)
Buffers: shared hit=1 read=3
I/O Timings: read=0.051
-> Index Scan Backward using test_table_tag_id_timestamp_idx on test_table (cost=0.42..924.66 rows=953 width=53) (actual time=0.066..0.066 rows=1 loops=1)
Index Cond: (tag_id = 41)
Buffers: shared hit=1 read=3
I/O Timings: read=0.051
Planning:
Buffers: shared hit=57 read=1
I/O Timings: read=0.011
Planning Time: 5.717 ms
Execution Time: 0.089 ms
You can see that Postgres only needed to read 4 blocks (shared hit=1 read=3) to get the row in question. This will be pretty much constant even if the table grows.
Note that the high planning time is caused by the buffer reads needed to fetch the table's metadata, because I ran the query right after creating the table and inserting the rows. If the query is run a second time, this pretty much vanishes as the metadata will be cached, and the planning time will go down to substantially less than one millisecond.
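As a side note, if you ever need the latest row for every tag_id in one statement, the same index supports a DISTINCT ON query (a sketch, reusing the table and column names from above):

-- Returns one row per tag_id: the one with the highest "timestamp".
SELECT DISTINCT ON (tag_id) tag_id, value, "timestamp"
FROM the_table
ORDER BY tag_id, "timestamp" DESC;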

Postgresql 12 - Simple select statement without filter condition is giving low throughput of 5K rows/sec. What can we optimize here?

I am trying to read a table of about 4.5 to 5 million records without any filter conditions.
I only need two to three columns (varchar) from the table, which is on Postgres 12.
The table contains just 20 columns (most of them varchar).
So my query goes like this:
SELECT
id as INDIV_ID,
loc
FROM
table
Explain plan output:
pgres=> explain (analyze, buffers, timing, format text) SELECT id as INDIV_ID, org_ext_loc FROM individuals;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Seq Scan on individuals (cost=0.00..353469.48 rows=4869048 width=54) (actual time=0.017..2659.760 rows=4869591 loops=1)
Buffers: shared hit=2133 read=302646
Planning Time: 0.814 ms
Execution Time: 3092.984 ms
(4 rows)
explain plan output with track_io_timing = ON
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Seq Scan on individuals (cost=0.00..353469.48 rows=4869048 width=54) (actual time=0.019..2607.686 rows=4869591 loops=1)
Buffers: shared read=304779
Planning Time: 2.975 ms
Execution Time: 3034.370 ms
(4 rows)
Our server information:
OS : Oracle Linux 7.3
RAM : 65707 MB
HDD Capacity : 2 Terabytes
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 16
CPU MHz: 2294.614
I tried various approaches:
1) table partitioning with range on (another sequence column)
2) parallel hints
3) SET max_parallel_workers_per_gather TO 8;
I am quite vexed after an exhaustive search without proper results, and throughput is really down to 5K rows/sec.
I am using the Pentaho (Kettle) ETL tool to run this query through JDBC connectivity on the server.
My Postgres 12 server is on the same machine as Pentaho.
I tried creating the table in two ways:
1) normally, without any partitions
2) using range partitioning
But still the retrieval times are very high.
What can I do to get throughput of about 15K rows/sec?
The execution plan says that the query returns the 5 million rows in 3 seconds.
If you see worse performance on the client end, it must either be the network or the client software that's limiting you.
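One way to separate the two is to run the same query with psql directly on the database server and discard the output, which takes Pentaho and JDBC out of the picture (a sketch; column names as in the EXPLAIN above):

-- In psql on the server: \timing reports elapsed time, and writing to
-- /dev/null avoids measuring terminal or disk output.
\timing on
\copy (SELECT id AS indiv_id, org_ext_loc FROM individuals) TO '/dev/null'

If that finishes in roughly the 3 seconds the plan shows, the server side is fine and the JDBC fetch size or the Pentaho step configuration is the place to look.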

Need suggestion on how to handle large table in PostgreSQL

I have a table of size 32Gb and the index size is around 38Gb in Postgres.
I have a column x which is not indexed.
The table size is growing at 1GB per week.
There are a lot of queries run on column x.
Each query on this table for column x consumes 17% of my CPU and takes approximately 5-6 seconds to return the data under heavy load on the database.
What is the best way to handle this? what is the industry standard?
I indexed column x, which increased the total index size by 2 GB, and the query time dropped to ~100 ms.
I'm looking into DynamoDB to replicate the data of the table, but I am not sure if this is the correct way to proceed, hence this question.
I want data access to be faster, while keeping in mind that this should not become a bottleneck in the future.
As requested here is the query that runs:
database_backup1=> EXPLAIN ANALYZE SELECT * FROM "table_name" WHERE "table_name"."x" IN ('ID001', 'ID002', 'ID003', 'ID004', 'ID005') LIMIT 1;
------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..56442.83 rows=100 width=1992) (actual time=0.010..155288.649 rows=7 loops=1)
-> Seq Scan on "table_name" (cost=0.00..691424.62 rows=1225 width=1992) (actual time=0.009..155288.643 rows=7 loops=1)
Filter: ((x)::text = ANY ('{ID001,ID002,ID003,ID004,ID005}'::text[]))
Rows Removed by Filter: 9050574
Planning time: 0.196 ms
Execution time: 155288.691 ms
(6 rows)
The execution plan indicates that your index is clearly the way to go.
If you run the query often, it is worth paying the price in storage space and data modification performance such an index incurs.
Of course I cannot say that with authority, but I don't believe that other database systems have a magic bullet that will make everything faster. If your data are suited for a relational model, PostgreSQL will be a good choice.
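If you do keep it in PostgreSQL, a minimal sketch of creating that index (table and column names as in the question; the index name is made up, and CONCURRENTLY avoids blocking writes while it is built):

CREATE INDEX CONCURRENTLY table_name_x_idx ON "table_name" (x);
-- Refresh planner statistics so the new index is costed correctly.
ANALYZE "table_name";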

Slow index scan

I have a table with an index:
Table:
Participates (player_id integer, other...)
Indexes:
"index_participates_on_player_id" btree (player_id)
The table contains 400kk (400 million) rows.
I execute the same query two times:
Query: explain analyze select * from participates where player_id=149294217;
First time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=261.061..2025.559 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 2025.644 ms
(3 rows)
Second time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=0.030..0.479 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 0.527 ms
(3 rows)
So the first execution has a very high actual time. How can I speed up the first execution?
UPDATE
Sorry, I meant: how can I accelerate the first query? Why is the index scan so slow the first time?
The difference in execution time is probably because the second time through, the table/index data from the first run of the query is in the shared buffers cache, so the subsequent run takes less time, since it doesn't have to go all the way to disk for that information.
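If the goal is simply to avoid that first cold-cache hit, one option (not part of the original answer) is the pg_prewarm extension, which loads a relation into shared buffers ahead of time; a sketch, assuming the contrib extension is available:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Load the index and, if it fits in shared_buffers, the table itself.
SELECT pg_prewarm('index_participates_on_player_id');
SELECT pg_prewarm('participates');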
Edit:
Regarding the slowness of the original query, does the table have a lot of dead tuples? Those can slow things down considerably. If so, VACUUM ANALYZE the table.
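A quick way to check that (a sketch, using the table name from the question):

-- n_dead_tup vs. n_live_tup gives a rough idea of table bloat.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'participates';

-- If there are many dead tuples:
VACUUM ANALYZE participates;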
Another factor can be long-running idle transactions on the server (i.e. several minutes or more). Due to the nature of MVCC, these can slow even index-based queries down quite a bit.
Also, the row count the query planner expects differs quite a bit from the actual result (6304 estimated vs. 332 returned), so you may want to ANALYZE the table beforehand to update the statistics.
1.) Take a look at http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html and check out the options for tuning memory usage. This can speed up your search, but there is no guarantee (it depends on the answer above)!
2.) Move some of your tables/indexes to a more powerful tablespace, for example one based on SSDs.
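For example (a sketch; the directory is a placeholder for an SSD-backed path that must already exist, be empty and be owned by the postgres user, and moving an index locks it while its files are rewritten):

CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';
ALTER INDEX index_participates_on_player_id SET TABLESPACE fast_ssd;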