Need suggestions on how to handle a large table in PostgreSQL

I have a table of size 32 GB and the index size is around 38 GB in Postgres.
I have a column x which is not indexed.
The table size is growing at 1 GB per week.
There are a lot of queries run on column x.
Each query on this table against column x consumes about 17% of my CPU and takes approx. 5-6 seconds to return the data, putting a heavy load on the database.
What is the best way to handle this? What is the industry standard?
I indexed column x; the index size increased by 2 GB, and query time dropped to ~100 ms.
I'm looking into DynamoDB to replicate the data of the table, but I am not sure if this is the correct way to proceed, hence this question.
I want data access to be faster, while also keeping in mind that it should not become a bottleneck in the future.
As requested, here is the query that runs:
database_backup1=> EXPLAIN ANALYZE SELECT * FROM "table_name" WHERE "table_name"."x" IN ('ID001', 'ID002', 'ID003', 'ID004', 'ID005') LIMIT 1;

 Limit  (cost=0.00..56442.83 rows=100 width=1992) (actual time=0.010..155288.649 rows=7 loops=1)
   ->  Seq Scan on "table_name"  (cost=0.00..691424.62 rows=1225 width=1992) (actual time=0.009..155288.643 rows=7 loops=1)
         Filter: ((x)::text = ANY ('{ID001,ID002,ID003,ID004,ID005}'::text[]))
         Rows Removed by Filter: 9050574
 Planning time: 0.196 ms
 Execution time: 155288.691 ms
(6 rows)

The execution plan indicates that your index is clearly the way to go.
If you run the query often, it is worth paying the price in storage space and data-modification overhead that such an index incurs.
Of course I cannot say that with authority, but I don't believe that other database systems have a magic bullet that will make everything faster. If your data are suited for a relational model, PostgreSQL will be a good choice.
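For reference, a minimal sketch of what such an index could look like, assuming the table and column names shown in the plan above (the index name is made up); CREATE INDEX CONCURRENTLY avoids blocking writes while the index is built:

-- Hypothetical index name; CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY table_name_x_idx ON "table_name" (x);
-- Refresh planner statistics so the new index is considered right away.
ANALYZE "table_name";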

Related

Slow distinct PostgreSQL query on nested jsonb field won't use index

I'm trying to get distinct values from a nested field in a JSONB column, but it takes about 2 minutes on a 400K-row table.
The original query used DISTINCT, but then I read that GROUP BY works better, so I tried that too - no luck, still extremely slow.
Adding an index did not help either:
create index "orders_financial_status_index" on orders ((data ->'data'->> 'financial_status'));
EXPLAIN ANALYZE gave this result:
 HashAggregate  (cost=13431.16..13431.22 rows=4 width=32) (actual time=123074.941..123074.943 rows=4 loops=1)
   Group Key: ((data -> 'data'::text) ->> 'financial_status'::text)
   ->  Seq Scan on orders  (cost=0.00..12354.14 rows=430809 width=32) (actual time=2.993..122780.325 rows=434080 loops=1)
 Planning time: 0.119 ms
 Execution time: 123074.979 ms
It's worth mentioning that there are no null values on this column, and currently there are 4 unique values.
What should I do in order to query the distinct values faster?
No index will make this faster, because the query has to scan the whole table.
As you can see, the sequential scan uses almost all the time; the hash aggregate is fast.
Still I would not drop the index, because it allows PostgreSQL to estimate the number of groups accurately and decide on the more efficient hash aggregate rather than sorting the rows. You can try without the index to be sure.
However, two minutes for half a million rows is not very fast. Do you have slow storage? Is the table bloated? If the latter, VACUUM (FULL) should improve things.
You can speed up the query by reducing I/O. Load the table into RAM with pg_prewarm, then processing should be considerably faster.
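As a sketch of that approach, assuming the orders table and index from the question (pg_prewarm ships as a contrib extension in PostgreSQL 9.4 and later):

-- One-time setup: make pg_prewarm available in this database.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Load the table and the expression index into shared buffers.
SELECT pg_prewarm('orders');
SELECT pg_prewarm('orders_financial_status_index');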

Simple postgres queries very slow

I have a Postgres DB with a single table (reddit_comments) that contains all the Reddit comments since 2007. There are only 10 columns in the table, but I am only trying to query on subreddit, which is a text field. I have built a B-tree index on subreddit.
Notes about the table:
1) About 1.5-2 billion rows.
2) There are no more insertions or deletions from the table. It is static.
3) There are 2 more indexes (author and month)
About hardware:
1) Intel 8 core processor
2) 128 GB of ram
3) Stored on a 7200 SATA drive
When I run the following query:
EXPLAIN (ANALYZE, BUFFERS) select * from reddit_comments WHERE subreddit = 'boston' LIMIT 20000;
The query takes a significant amount of time and I get the following output:
 Limit  (cost=0.70..80375.57 rows=20000 width=320) (actual time=32.421..52218.645 rows=20000 loops=1)
   Buffers: shared hit=344 read=19532
   I/O Timings: read=52051.619
   ->  Index Scan using subr_idx on reddit_comments  (cost=0.70..1487554.68 rows=370154 width=320) (actual time=32.419..52202.785 rows=20000 loops=1)
         Index Cond: (subreddit = 'boston'::text)
         Buffers: shared hit=344 read=19532
         I/O Timings: read=52051.619
 Planning time: 0.184 ms
 Execution time: 52228.975 ms
If I don't set LIMIT 20000, it takes hours to run (for about 600,000 results).
I have tried to implement many of the suggestions from here:
https://wiki.postgresql.org/wiki/SlowQueryQuestions
but nothing seems to be speeding up the process. Is there something I am missing that would increase performance or is it just going to be slow to query this database whenever I need to get more data?
The data you want is spread all over the disk, so it takes a lot of time to read it. If you're going to operate mainly on subreddits, you can execute:
CLUSTER reddit_comments USING subr_idx
This will reorder the data in the table so that the query in your question has to read far fewer pages. Queries based on other filter terms may become slower, though, and the operation will lock the table exclusively and take a lot of time (ref).
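A sketch of the full sequence, using the table and index names from the question; since CLUSTER rewrites the table, it is worth running ANALYZE afterwards, and note that CLUSTER does not keep the table ordered for future inserts (not an issue here, since the table is static):

-- Rewrites the table in subr_idx order; takes an ACCESS EXCLUSIVE lock for the duration.
CLUSTER reddit_comments USING subr_idx;
-- Refresh planner statistics after the rewrite.
ANALYZE reddit_comments;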

Slow index scan

I have a table with an index:
Table:
Participates (player_id integer, other...)
Indexes:
"index_participates_on_player_id" btree (player_id)
The table contains 400kk (400 million) rows.
I execute the same query two times:
Query: explain analyze select * from participates where player_id=149294217;
First time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=261.061..2025.559 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 2025.644 ms
(3 rows)
Second time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=0.030..0.479 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 0.527 ms
(3 rows)
So the first query has a very large actual time. How can I increase the speed of the first execution?
UPDATE
Sorry, to rephrase: how can I accelerate the first query?
Why is the index scan so slow the first time?
The difference in execution time is probably because, the second time through, the table/index data from the first run of the query is in the shared buffers cache, so the subsequent run takes less time, since it doesn't have to go all the way to disk for that information.
Edit:
Regarding the slowness of the original query, does the table have a lot of dead tuples? Those can slow things down considerably. If so, VACUUM ANALYZE the table.
Another factor can be if there are long-ish idle transactions on the server (i.e. several minutes or more). Due to the nature of MVCC this can also slow even index-based queries down quite a bit.
Also, what the query planner expects for the results vs. the actual numbers is quite different, so you may want to run ANALYZE on the table beforehand to update the statistics.
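As a sketch, assuming the participates table from the question, you can check pg_stat_user_tables for dead tuples and stale statistics before deciding whether a manual VACUUM ANALYZE is worthwhile:

-- Check bloat indicators and when the table was last vacuumed/analyzed.
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'participates';

-- Reclaim dead tuples and refresh planner statistics.
VACUUM ANALYZE participates;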
1.) Take a look at http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html and consider tuning the server to use more memory. This can speed up your search, but there is no guarantee (it depends on the answer above)!
2.) Move part of your tables/indexes to faster storage via a tablespace, for example one backed by SSDs.
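A sketch of point 2, assuming a hypothetical SSD mount point; the directory must already exist and be owned by the postgres OS user, and moving the index locks it while it is rewritten:

-- Hypothetical tablespace on faster storage.
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';
-- Rewrite the hot index on the faster storage.
ALTER INDEX index_participates_on_player_id SET TABLESPACE fast_ssd;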

PostgreSQL query doesn't use index

I have a very simple DB schema, which has a multi-column B-tree index on the following columns:
PersonId, Amount, Commission
Now, if I try to select from the table with the following query:
explain select * from "Order" where "PersonId" = 2 AND "Commission" > 3
Pg is scanning the index and the query is very fast, but if I try the following query:
explain select * from "Order" where "PersonId" > 2 AND "Commission" > 3
It does a sequential scan, even when the index is present. Even this query
explain select * from "Order" where "Commission" > 3
does a sequential scan.
Anyone care to explain why? :-)
Thank you very much.
UPDATE
The table contains 100 million rows. I created it just to test PostgreSQL performance against MS SQL. The table has already been VACUUMed. I'm running a Core i5 2500K quad-core CPU and 8 GB of RAM.
Here's the result of explain analyze for this query:
explain ANALYZE select * from "Order" where "Commission" BETWEEN 3000000 AND 3000010 LIMIT 20
 Limit  (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.249..28043.249 rows=0 loops=1)
   ->  Seq Scan on "Order"  (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.247..28043.247 rows=0 loops=1)
         Filter: (("Commission" >= 3000000::numeric) AND ("Commission" <= 3000010::numeric))
 Total runtime: 28043.278 ms
The short answer is that when comparing the various available plans, the sequential scan is expected to be the fastest, based on the costing factors you have configured and the latest statistics available. From what little information you've provided, it seems quite likely that the planner has made the right choice. If you had three single-column indexes, it might be able to use bitmap index scans, particularly if the rows to be selected are less than about 10% of the rows in the table.
Note that with the index you describe, the entire index would need to be scanned for all rows where "PersonId" > 2; unless you have a lot of negative values for "PersonId", that is very likely to be most of the table.
Also note that if you have a tiny table -- say a few thousand rows or less, accessing the rows through an index will rarely be faster than just scanning those few rows. Plans are sensitive to data volume, and the plan you get with a small number of rows is very unlikely to be the same plan you get with a lot of rows.
If it is, in fact, not picking the fastest plan, the odds are good that you need to adjust your cost factors to better model the costs on your machine. Another possibility is that you need to be more aggressive in your autovacuum settings, to make sure up-to-date statistics are available, or you may need to configure collection of finer-grained statistics.
People will be able to provide more specific advice if you show the table descriptions (including indexes), the EXPLAIN ANALYZE output for the query, and a description of the hardware.
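As a sketch of the three-single-column-index approach mentioned above (the index names are made up); with these in place, and predicates that are selective enough, the planner can combine bitmap index scans instead of falling back to a sequential scan:

-- Separate single-column indexes that the planner can combine via bitmap AND.
CREATE INDEX order_personid_idx ON "Order" ("PersonId");
CREATE INDEX order_amount_idx ON "Order" ("Amount");
CREATE INDEX order_commission_idx ON "Order" ("Commission");
-- Refresh statistics so selectivity estimates are realistic.
ANALYZE "Order";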

How to influence the Postgres Query Analyzer when dealing with text search and geospatial data

I have a quite serious performance issue with the following statement that I can't fix myself.
Given Situation
I have a PostgreSQL 8.4 database with PostGIS 1.4 installed
I have a geospatial table with ~9 million entries. This table has a (PostGIS) geometry column and a tsvector column
I have a GiST index on the geometry and a VNAME index on the vname column
Table is ANALYZE'd
I want to execute a to_tsquery text search within a subset of these geometries that should give me all affected ids back.
The area to search in limits the 9 million rows to approximately 100,000, and the ts_query inside this area will most likely return 0..1000 entries.
Problem
The query analyzer decides that it wants to do a Bitmap Index Scan on the vname first, and then aggregates and applies a filter on the geometry (and the other conditions I have in this statement).
Query Analyzer output:
 Aggregate  (cost=12.35..12.62 rows=1 width=510) (actual time=5.616..5.616 rows=1 loops=1)
   ->  Bitmap Heap Scan on mxgeom g  (cost=8.33..12.35 rows=1 width=510) (actual time=5.567..5.567 rows=0 loops=1)
         Recheck Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
         Filter: (active AND (geom && '0107000020E6100000010000000103000000010000000B0000002AFFFF5FD15B1E404AE254774BA8494096FBFF3F4CC11E40F37563BAA9A74940490200206BEC1E40466F209648A949404DF6FF1F53311F400C9623C206B2494024EBFF1F4F711F404C87835954BD4940C00000B0E7CA1E4071551679E0BD4940AD02004038991E40D35CC68418BE49408EF9FF5F297C1E404F8CFFCB5BBB4940A600006015541E40FAE6468054B8494015040060A33E1E4032E568902DAE49402AFFFF5FD15B1E404AE254774BA84940'::geometry) AND (mandator_id = ANY ('{257,1}'::bigint[])))
         ->  Bitmap Index Scan on gis_vname_idx  (cost=0.00..8.33 rows=1 width=0) (actual time=5.566..5.566 rows=0 loops=1)
               Index Cond: (vname @@ '''hemer'' & ''hauptstrasse'':*'::tsquery)
which causes a LOT of I/O. AFAIK it would be smarter to limit by the geometry first and do the vname search afterwards.
Attempted Solutions
To achieve the desired behaviour I tried the following:
I put the geom ## AREA condition into a subselect -> did not change the execution plan
I created a temporary view with the desired area subset -> Did not change the execution plan
I created a temporary table of the desired area -> Takes 4~6 seconds to create, so that made it even worse.
Btw, sorry for not posting the actual query: I think my boss would really be mad at me if I did, also I'm looking more for theoretical pointers for someone to fix my actual query. Please ask if you need further clarification
EDIT
Richard had a very good point: you can achieve the desired behaviour of the query planner with the WITH statement. The bad thing is that this temporary table (or CTE) messes up the vname index, thus making the query return nothing in some cases.
I was able to fix this by creating a new vname on the fly with to_tsvector(), but this is (too) costly: around 300-500 ms per query.
My Solution
I ditched the vname search and went with a simple LIKE('%query_string%') (10-20 ms/query), but this is only fast in my given environment. YMMV.
There have been some improvements in statistics handling for tsvector (and I think PostGIS too, but I don't use it). If you've got the time, it might be worth trying again with a 9.1 release and see what that does for you.
However, for this single query you might want to look at the WITH construct.
http://www.postgresql.org/docs/8.4/static/queries-with.html
If you put the geometry part in the WITH clause, it will be evaluated first (guaranteed), and then that result set will be filtered by the following SELECT. It might end up slower though; you won't know until you try.
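As a rough sketch, using the table and column names visible in the plan above (mxgeom, geom, vname, mandator_id); the id column and the search polygon are placeholders. On 8.4 a CTE is always materialized, so the geometry restriction is guaranteed to run before the text-search filter:

-- The CTE is evaluated first, restricting ~9M rows to the search area.
WITH area_subset AS (
    SELECT id, vname
    FROM mxgeom
    WHERE active
      AND geom && 'SRID=4326;POLYGON((7.7 51.3, 7.8 51.3, 7.8 51.4, 7.7 51.4, 7.7 51.3))'::geometry  -- placeholder area
      AND mandator_id = ANY ('{257,1}'::bigint[])
)
SELECT id
FROM area_subset
WHERE vname @@ to_tsquery('hemer & hauptstrasse:*');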
It might be that an adjustment to work_mem would help too - you can do this per session ("SET work_mem = ...") but be careful with setting it too high; concurrent queries can quickly burn through all your RAM.
http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY