I'm running PostgreSQL 9.1.6 on RHEL 5.8. One of my queries uses a sequential scan even though the column in the WHERE clause is indexed.
Table "public.table"
Column | Type | Modifiers
----------+-----------------------+-----------------------------------------
col1 | character(3) | not null
col2 | character varying(20) | not null
col3 | character varying(20) |
col4 | character(1) | default 0
Indexes:
"table_pkey" PRIMARY KEY, btree (col1, col2)
postgres=# explain analyze select * from table where col1=right('10000081',3);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Seq Scan on table (cost=0.00..31053.24 rows=5650 width=286) (actual time=3.221..429.950 rows=110008 loops=1)
Filter: ((col1)::text = '081'::text)
Total runtime: 435.904 ms
(3 rows)
postgres=# explain analyze select * from table where col1=right('10000081',3)::char(3);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table (cost=3097.81..18602.98 rows=112173 width=286) (actual time=18.125..32.707 rows=110008 loops=1)
Recheck Cond: (col1 = '081'::character(3))
-> Bitmap Index Scan on table_pkey (cost=0.00..3069.77 rows=112173 width=0) (actual time=17.846..17.846 rows=110008 loops=1)
Index Cond: (col1 = '081'::character(3))
Total runtime: 38.640 ms
(5 rows)
I found that changing the column type with ALTER COLUMN is one solution:
postgres=# alter table table alter column col1 type varchar(3);
ALTER TABLE
postgres=# explain analyze select * from table where col1=right('10000081',3);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table (cost=160.26..10902.32 rows=5650 width=295) (actual time=20.249..41.658 rows=110008 loops=1)
Recheck Cond: ((col1)::text = '081'::text)
-> Bitmap Index Scan on table_pkey (cost=0.00..158.85 rows=5650 width=0) (actual time=20.007..20.007 rows=110008 loops=1)
Index Cond: ((col1)::text = '081'::text)
Total runtime: 47.408 ms
(5 rows)
What I'd like to know is why the first query cannot use the index.
It looks from the first plan that there is an implicit type cast performed on col1 to match the return type of right().
Filter: ((col1)::text = '081'::text)
Evidently the expression right('10000081',3) returns text.
So I would say that yes, you do have to type cast the expression, although an alternative would be to index on (col1)::text -- not my favourite solution though.
Avoid the char datatype. It's awful for many reasons, and this is only one of them.
If you stick to text or varchar you'll have fewer issues with implicit casts and confusing behavior.
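The type mismatch described above can be checked directly by asking PostgreSQL for the expression types. This is only a minimal sketch: the table t_demo and the index name are hypothetical stand-ins for the table in the question, and the expression index is the alternative mentioned above, not a recommendation.

```sql
-- right() returns text; casting its result makes the comparison char(3) = char(3)
SELECT pg_typeof(right('10000081', 3))          AS uncast_type,  -- text
       pg_typeof(right('10000081', 3)::char(3)) AS cast_type;    -- character

-- Hypothetical stand-in for the table in the question
CREATE TABLE t_demo (
    col1 character(3)          NOT NULL,
    col2 character varying(20) NOT NULL,
    PRIMARY KEY (col1, col2)
);

-- Alternative to casting in every query: index the text form of the column,
-- which matches the (col1)::text expression shown in the first plan
CREATE INDEX t_demo_col1_text_idx ON t_demo ((col1::text));
```

With the cast in the query, the primary key index is enough and the extra index is unnecessary.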
I have a query which is taking 2.5 seconds to run. On checking the query plan, I found that Postgres heavily underestimates the number of rows, which leads it to choose a nested loop.
Following is the query
explain analyze
SELECT
reprocessed_videos.video_id AS reprocessed_videos_video_id
FROM
reprocessed_videos
JOIN commit_info ON commit_info.id = reprocessed_videos.commit_id
WHERE
commit_info.tag = 'stop_sign_tbc_inertial_fix'
AND reprocessed_videos.reprocess_type_id = 28
AND reprocessed_videos.classification_crop_type_id = 0
AND reprocessed_videos.reprocess_status = 'success';
Following is the explain analyze output.
Nested Loop (cost=0.84..22941.18 rows=1120 width=4) (actual time=31.169..2650.181 rows=179524 loops=1)
-> Index Scan using commit_info_tag_key on commit_info (cost=0.28..8.29 rows=1 width=4) (actual time=0.395..0.397 rows=1 loops=1)
Index Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
-> Index Scan using ix_reprocessed_videos_commit_id on reprocessed_videos (cost=0.56..22919.99 rows=1289 width=8) (actual time=30.770..2634.546 rows=179524 loops=1)
Index Cond: (commit_id = commit_info.id)
Filter: ((reprocess_type_id = 28) AND (classification_crop_type_id = 0) AND ((reprocess_status)::text = 'success'::text))
Rows Removed by Filter: 1190
Planning Time: 0.326 ms
Execution Time: 2657.724 ms
As we can see, the index scan using ix_reprocessed_videos_commit_id was expected to return 1289 rows, whereas it actually returned 179524. I have been trying to find the reason for this, but nothing I tried has worked.
Following are the things I tried.
Vacuuming and analyzing all the involved tables (helped a little, but not much, maybe because the tables had already been vacuumed and analyzed automatically).
Increasing the statistics target for the commit_id column: alter table reprocessed_videos alter column commit_id set statistics 1000; (helped a little).
I read about extended statistics, but I'm not sure whether they are of any use here.
Following are the number of tuples in each of these tables
kpis=> SELECT relname, reltuples FROM pg_class where relname in ('reprocessed_videos', 'video_catalog', 'commit_info');
relname | reltuples
--------------------+---------------
commit_info | 1439
reprocessed_videos | 3.1563756e+07
Following is some information related to table schemas
Table "public.reprocessed_videos"
Column | Type | Collation | Nullable | Default
-----------------------------+-----------------------------+-----------+----------+------------------------------------------------
id | integer | | not null | nextval('reprocessed_videos_id_seq'::regclass)
video_id | integer | | |
reprocess_status | character varying | | |
commit_id | integer | | |
reprocess_type_id | integer | | |
classification_crop_type_id | integer | | |
Indexes:
"reprocessed_videos_pkey" PRIMARY KEY, btree (id)
"ix_reprocessed_videos_commit_id" btree (commit_id)
"ix_reprocessed_videos_video_id" btree (video_id)
"reprocessed_videos_video_commit_reprocess_crop_key" UNIQUE CONSTRAINT, btree (video_id, commit_id, reprocess_type_id, classification_crop_type_id)
Foreign-key constraints:
"reprocessed_videos_commit_id_fkey" FOREIGN KEY (commit_id) REFERENCES commit_info(id)
Table "public.commit_info"
Column | Type | Collation | Nullable | Default
------------------------+-------------------+-----------+----------+-----------------------------------------
id | integer | | not null | nextval('commit_info_id_seq'::regclass)
tag | character varying | | |
commit | character varying | | |
Indexes:
"commit_info_pkey" PRIMARY KEY, btree (id)
"commit_info_tag_key" UNIQUE CONSTRAINT, btree (tag)
I am sure that postgres should not use nested loops in this case, but is using them because of bad row estimates. Any help is highly appreciated.
Following are the experiments I tried.
Disabling index scan
Nested Loop (cost=734.59..84368.70 rows=1120 width=4) (actual time=274.694..934.965 rows=179524 loops=1)
-> Bitmap Heap Scan on commit_info (cost=4.29..8.30 rows=1 width=4) (actual time=0.441..0.444 rows=1 loops=1)
Recheck Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on commit_info_tag_key (cost=0.00..4.29 rows=1 width=0) (actual time=0.437..0.439 rows=1 loops=1)
Index Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
-> Bitmap Heap Scan on reprocessed_videos (cost=730.30..84347.51 rows=1289 width=8) (actual time=274.250..920.137 rows=179524 loops=1)
Recheck Cond: (commit_id = commit_info.id)
Filter: ((reprocess_type_id = 28) AND (classification_crop_type_id = 0) AND ((reprocess_status)::text = 'success'::text))
Rows Removed by Filter: 1190
Heap Blocks: exact=5881
-> Bitmap Index Scan on ix_reprocessed_videos_commit_id (cost=0.00..729.98 rows=25256 width=0) (actual time=273.534..273.534 rows=180714 loops=1)
Index Cond: (commit_id = commit_info.id)
Planning Time: 0.413 ms
Execution Time: 941.874 ms
I also updated the statistics for the commit_id column. I observe an approximately 3x speed increase.
When I also disable bitmap scans, the query does a sequential scan and takes 19 seconds to run.
The nested loop is the perfect join strategy, because there is only one row from commit_info. Any other join strategy would lose.
The question is if the index scan on reprocessed_videos is really too slow. To experiment, try again after SET enable_indexscan = off; to get a bitmap index scan and see if that is better. Then also SET enable_bitmapscan = off; to get a sequential scan. I suspect that your current plan will win, but the bitmap index scan has a good chance.
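The experiment described above can be sketched as a single transaction, so the planner settings never leak into the rest of the session. The query is the one from the question; the use of SET LOCAL and the BUFFERS option are my own choices here, not something the question used.

```sql
BEGIN;

-- SET LOCAL reverts automatically when the transaction ends
SET LOCAL enable_indexscan = off;   -- planner should fall back to a bitmap scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT reprocessed_videos.video_id
FROM reprocessed_videos
JOIN commit_info ON commit_info.id = reprocessed_videos.commit_id
WHERE commit_info.tag = 'stop_sign_tbc_inertial_fix'
  AND reprocessed_videos.reprocess_type_id = 28
  AND reprocessed_videos.classification_crop_type_id = 0
  AND reprocessed_videos.reprocess_status = 'success';

-- Additionally rule out bitmap scans to see the sequential-scan plan
SET LOCAL enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT reprocessed_videos.video_id
FROM reprocessed_videos
JOIN commit_info ON commit_info.id = reprocessed_videos.commit_id
WHERE commit_info.tag = 'stop_sign_tbc_inertial_fix'
  AND reprocessed_videos.reprocess_type_id = 28
  AND reprocessed_videos.classification_crop_type_id = 0
  AND reprocessed_videos.reprocess_status = 'success';

ROLLBACK;
```

Run each variant more than once, because the first run may be slower simply because the data is not cached yet.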
If the bitmap index scan is better, you should indeed try to improve the estimate:
ALTER TABLE reprocessed_videos ALTER commit_id SET STATISTICS 1000;
ANALYZE reprocessed_videos;
You can try with other values; pick the lowest that gives you a good enough estimate.
Another thing to try is extended statistics:
CREATE STATISTICS corr (dependencies)
ON reprocess_type_id, classification_crop_type_id, reprocess_status
FROM reprocessed_videos;
ANALYZE reprocessed_videos;
Perhaps you don't need even all three columns in there; play with it.
If the bitmap index scan does not offer enough benefit, there is one way to speed up the current index scan:
CLUSTER reprocessed_videos USING ix_reprocessed_videos_commit_id;
That rewrites the table in index order (and blocks concurrent access while it is running, so be careful!). After that, the index scan could be considerably faster. However, the order is not maintained, so you'll have to repeat the CLUSTER occasionally if enough of the table has been modified.
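Before paying for a CLUSTER, it may help to check how well the physical row order already follows commit_id. A rough sketch, assuming the table lives in the public schema as shown in the question:

```sql
-- correlation close to 1 or -1: rows are already stored roughly in commit_id order,
-- so CLUSTER is unlikely to change much; close to 0: rows for one commit_id are
-- scattered across the table, which is the case CLUSTER addresses
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'reprocessed_videos'
  AND attname    = 'commit_id';

-- After a CLUSTER, refresh the statistics so the planner sees the new ordering
ANALYZE reprocessed_videos;
```

The same query can be repeated later to judge whether enough of the table has changed to make another CLUSTER worthwhile.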
Create a covering index: one that has all the condition columns (first, in descending order of cardinality) and the value columns (last) needed for your query, which means the index alone can be used, avoiding access to the table:
create index covering_index on reprocessed_videos(
reprocess_type_id,
classification_crop_type_id,
reprocess_status,
commit_id,
video_id
);
You do not need an extra index on commit_info(id): PostgreSQL automatically creates a unique index for every primary key, and the schema above already lists commit_info_pkey on (id). Foreign-key columns, by contrast, are not indexed automatically.
To get more accurate query plans, you can manually set the cardinality of condition columns, for example:
select count(distinct reprocess_type_id) from reprocessed_videos;
Then set that value on the column (it takes effect at the next ANALYZE):
alter table reprocessed_videos alter column reprocess_type_id set (n_distinct = number_from_above_query);
This seems like a straightforward question, but I can't find the answer online.
I'm using Postgres 9.4 and have this table:
Table "public.title"
Column | Type | Collation | Nullable | Default
---------------------------------+-------------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('title_id_seq'::regclass)
name1 | character varying(1000) | | |
name2 | character varying(1000) | | |
name3 | character varying(1000) | | |
name4 | character varying(1000) | | |
And I have a multicolumn index:
"idx_title_names" btree (name1, name2, name3, name4)
But for OR queries, the index isn't being used:
EXPLAIN ANALYZE SELECT * FROM "title" WHERE ("title"."name1" = 'foo'
OR "title"."name2" = 'foo' OR "title"."name3" = 'foo' OR "title"."name4" = 'foo');
Gather (cost=1000.00..436451.46 rows=659 width=4500) (actual time=561.418..1297.877 rows=3222 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on title (cost=0.00..435385.56 rows=275 width=4500) (actual time=551.627..1286.724 rows=1074 loops=3)
Filter: (((name1)::text = 'foo'::text) OR ((name2)::text = 'foo'::text) OR ((name3)::text = 'foo'::text) OR ((name4)::text = 'foo'::text))
Rows Removed by Filter: 1231911
Planning Time: 0.102 ms
Execution Time: 1298.148 ms
Is this because these indexes don't work with OR queries?
And: if so, is my best bet just to create 4 separate standard indexes?
One option is to create a GIN index on the array of the columns, then use an array operator:
create index on title using gin ((array[name1,name2,name3,name4]));
Then use
SELECT *
FROM title
WHERE array[name1,name2,name3,name4] @> array['foo']::varchar[];
Note that a GIN index is a bit more expensive to maintain than a BTree index.
OR is often a performance problem in SQL.
This index cannot be used for a condition like that.
Your best bet is to create four single-column indexes and hope for a Bitmap Or:
CREATE INDEX ON public.title (name1);
CREATE INDEX ON public.title (name2);
CREATE INDEX ON public.title (name3);
CREATE INDEX ON public.title (name4);
An index on (col1, col2, col3, ...) can be used for conditions or ordering on col1, on col1 and col2, on col1, col2 and col3, and so on. It will generally not be used for conditions or ordering on col3 alone, for example.
Look at this:
# create table t as select random() as a, random() as b from generate_series(1,1000000);
# create index i on t(a,b);
# analyze t;
# explain analyze select * from t where a > 0.9;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=2246.83..8863.15 rows=96826 width=16) (actual time=10.973..28.023 rows=99311 loops=1)
Recheck Cond: (a > '0.9'::double precision)
Heap Blocks: exact=5406
-> Bitmap Index Scan on i (cost=0.00..2222.62 rows=96826 width=0) (actual time=10.251..10.252 rows=99311 loops=1)
Index Cond: (a > '0.9'::double precision)
Planning Time: 0.348 ms
Execution Time: 31.054 ms
# explain analyze select * from t where b > 0.9;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..17906.00 rows=99117 width=16) (actual time=0.015..70.505 rows=100137 loops=1)
Filter: (b > '0.9'::double precision)
Rows Removed by Filter: 899863
Planning Time: 0.090 ms
Execution Time: 73.656 ms
However, when you use an OR condition, the DBMS effectively has to run several queries. In our example, select * from t where a > 0.9 or b > 0.9 is equivalent to combining select * from t where a > 0.9 (the index can be used) with select * from t where b > 0.9 (the index cannot be used). So instead of two actions (scan the index, then scan the whole table), the DBMS performs only one (scan the whole table).
Hope it explains why your index is not used for your query.
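To see the other side of this, the example can be continued with a second index that covers b on its own. This is a sketch only: the index name i_b is made up, and whether the planner actually combines the two indexes depends on selectivity and cost settings, so no plan output is shown.

```sql
-- Continues the example table t(a, b), which already has index i on (a, b)
CREATE INDEX i_b ON t (b);
ANALYZE t;

-- Each branch of the OR now has an index whose leading column matches it,
-- so the planner is free to combine two bitmap index scans (BitmapOr).
-- A more selective threshold than 0.9 makes that choice more likely.
EXPLAIN SELECT * FROM t WHERE a > 0.99 OR b > 0.99;
```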
I'm running into an issue in PostgreSQL (version 9.6.10) with indexes not working to speed up a MAX query with a simple equality filter on another column. Logically it seems that a simple multicolumn index on (A, B DESC) should make the query super fast.
I can't for the life of me figure out why I can't get a query to be performant regardless of what indexes are defined.
The table definition has the following:
- A primary key foo VARCHAR PRIMARY KEY (not used in the query)
- A UUID field that is NOT NULL called bar UUID
- A sequential_id column that was created as a BIGSERIAL UNIQUE type
Here's what the relevant columns look like exactly (with names modified for privacy):
Table "public.foo"
Column | Type | Modifiers
----------------------+--------------------------+--------------------------------------------------------------------------------
foo_uid | character varying | not null
bar_uid | uuid | not null
sequential_id | bigint | not null default nextval('foo_sequential_id_seq'::regclass)
Indexes:
"foo_pkey" PRIMARY KEY, btree (foo_uid)
"foo_bar_uid_sequential_id_idx", btree (bar_uid, sequential_id DESC)
"foo_sequential_id_key" UNIQUE CONSTRAINT, btree (sequential_id)
Despite having the index listed above on (bar_uid, sequential_id DESC), the following query requires an index scan and takes 100-300ms with a few million rows in the database.
The Query (get the max sequential_id for a given bar_uid):
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f';
The EXPLAIN ANALYZE result doesn't use the proper index. Also, for some reason it checks if sequential_id IS NOT NULL even though it's declared as not null.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.75..0.76 rows=1 width=8) (actual time=321.110..321.110 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.75 rows=1 width=8) (actual time=321.106..321.106 rows=1 loops=1)
-> Index Scan Backward using foo_sequential_id_key on foo (cost=0.43..98936.43 rows=308401 width=8) (actual time=321.106..321.106 rows=1 loops=1)
Index Cond: (sequential_id IS NOT NULL)
Filter: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Rows Removed by Filter: 920761
Planning time: 0.196 ms
Execution time: 321.127 ms
(9 rows)
I can add a seemingly unnecessary GROUP BY to this query, and that speeds it up a bit, but it's still really slow for a query that should be near instantaneous with indexes defined:
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
GROUP BY bar_uid;
The EXPLAIN (ANALYZE, BUFFERS) result:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=8510.54..65953.61 rows=6 width=24) (actual time=234.529..234.530 rows=1 loops=1)
Group Key: bar_uid
Buffers: shared hit=1 read=11909
-> Bitmap Heap Scan on foo (cost=8510.54..64411.55 rows=308401 width=24) (actual time=65.259..201.969 rows=309023 loops=1)
Recheck Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Heap Blocks: exact=10385
Buffers: shared hit=1 read=11909
-> Bitmap Index Scan on foo_bar_uid_sequential_id_idx (cost=0.00..8433.43 rows=308401 width=0) (actual time=63.549..63.549 rows=309023 loops=1)
Index Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Buffers: shared read=1525
Planning time: 3.067 ms
Execution time: 234.589 ms
(12 rows)
Does anyone have any idea what's blocking this query from being on the order of 10 milliseconds? This should logically be instantaneous with the right index defined. It should only require the time to follow links to the leaf value in the B-Tree.
Someone asked:
What do you get for SELECT * FROM pg_stats WHERE tablename = 'foo' and attname = 'bar_uid';?
schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation | most_common_elems | most_common_elem_freqs | elem_count_histogram
------------+------------------------+-------------+-----------+-----------+-----------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------+------------------+-------------+-------------------+------------------------+----------------------
public | foo | bar_uid | f | 0 | 16 | 6 | {fa61424d-389f-4e75-ba2d-b77e6bb8491f,5c5dcae9-1b7e-4413-99a1-62fde2b89c32,50b1e842-fc32-4c2c-b00f-4a17c3c1c5fa,7ff1999c-c0ea-b700-343f-9a737f6ad659,f667b353-e199-4890-9ffd-4940ea11fe2c,b24ce968-29fd-4587-ba1f-227036ee3135} | {0.203733,0.203167,0.201567,0.195867,0.1952,0.000466667} | | -0.158093 | | |
(1 row)
The trigram index on my table is not used when the query has a mixed-case pattern or uses ILIKE.
I'm not sure what I have missed. Any ideas?
(I'm using PostgreSQL 9.6.2)
CREATE TABLE public.tbltest (
"tbltestId" int NOT null ,
"mystring1" text,
"mystring2" character varying,
CONSTRAINT "tbltest_pkey" PRIMARY KEY ("tbltestId")
);
insert into tbltest ("tbltestId","mystring1", "mystring2")
select x.id, x.id || ' Test', x.id || ' Test' from generate_series(1,100000) AS x(id);
CREATE EXTENSION pg_trgm;
CREATE INDEX tbltest_idx1 ON tbltest using gin ("mystring1" gin_trgm_ops);
CREATE INDEX tbltest_idx2 ON tbltest using gin ("mystring2" gin_trgm_ops);
Using lower case text in the query works, and uses the index
explain analyse
select * from tbltest
where "mystring2" Like '%test%';
QUERY PLAN |
-----------------------------------------------------------------------------------------------------------------------------|
Bitmap Heap Scan on tbltest (cost=20.08..56.68 rows=10 width=24) (actual time=29.846..29.846 rows=0 loops=1) |
Recheck Cond: ((mystring2)::text ~~ '%test%'::text) |
Rows Removed by Index Recheck: 100000 |
Heap Blocks: exact=726 |
-> Bitmap Index Scan on tbltest_idx2 (cost=0.00..20.07 rows=10 width=0) (actual time=12.709..12.709 rows=100000 loops=1) |
Index Cond: ((mystring2)::text ~~ '%test%'::text) |
Planning time: 0.086 ms |
Execution time: 29.875 ms |
LIKE does not use the index if the search pattern is mixed case:
explain analyse
select * from tbltest
where "mystring2" Like '%Test%';
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------|
Seq Scan on tbltest (cost=0.00..1976.00 rows=99990 width=24) (actual time=0.011..33.376 rows=100000 loops=1) |
Filter: ((mystring2)::text ~~ '%Test%'::text) |
Planning time: 0.083 ms |
Execution time: 51.259 ms |
ILIKE does not use the index either:
explain analyse
select * from tbltest
where "mystring2" ILike '%Test%';
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------|
Seq Scan on tbltest (cost=0.00..1976.00 rows=99990 width=24) (actual time=0.012..87.038 rows=100000 loops=1) |
Filter: ((mystring2)::text ~~* '%Test%'::text) |
Planning time: 0.134 ms |
Execution time: 105.757 ms |
PostgreSQL does not use the index in the last two queries because a sequential scan is the cheapest way to process them, not because it cannot use the index.
In your EXPLAIN output you can see that the first query returns zero rows (actual ... rows=0), while the other two queries return every single row in the table (actual ... rows=100000).
The PostgreSQL optimizer's estimates reflect that situation accurately.
Since it has to access most of the rows of the table anyway, PostgreSQL knows that it will be able to get the result much cheaper if it scans the table sequentially than by using the more complicated index access method.
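This can be checked against the test table from the question by comparing a pattern that matches everything with one that matches a single row. A minimal sketch; the second pattern is my own choice, and the plan you get may differ depending on version and settings.

```sql
-- Every row is '<id> Test', so this pattern matches the whole table
SELECT count(*) FROM tbltest WHERE "mystring2" ILIKE '%Test%';   -- 100000

-- Only the row with tbltestId = 12345 contains this string; with so few
-- matching rows the trigram index has a chance to be the cheaper plan
EXPLAIN ANALYZE
SELECT * FROM tbltest WHERE "mystring2" ILIKE '%12345 Test%';
```

If the second query shows a bitmap scan on tbltest_idx2, the index works with mixed case and ILIKE, and the earlier sequential scans were purely a cost decision.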
I'm running Geodjango/Postgres 9.1/PostGIS and I'm trying to get the following query (and others like it) to run faster.
[query snipped for brevity]
SELECT "crowdbreaks_incomingkeyword"."keyword_id"
, COUNT("crowdbreaks_incomingkeyword"."keyword_id") AS "cnt"
FROM "crowdbreaks_incomingkeyword"
INNER JOIN "crowdbreaks_tweet"
ON ("crowdbreaks_incomingkeyword"."tweet_id"
= "crowdbreaks_tweet"."tweet_id")
LEFT OUTER JOIN "crowdbreaks_place"
ON ("crowdbreaks_tweet"."place_id"
= "crowdbreaks_place"."place_id")
WHERE (("crowdbreaks_tweet"."coordinates"
@ ST_GeomFromEWKB(E'\\001 ... \\000\\000\\000\\0008@'::bytea)
OR ST_Overlaps("crowdbreaks_place"."bounding_box"
, ST_GeomFromEWKB(E'\\001...00\\000\\0008@'::bytea)
))
AND "crowdbreaks_tweet"."created_at" > E'2012-04-17 15:46:12.109893'
AND "crowdbreaks_tweet"."created_at" < E'2012-04-18 15:46:12.109899' )
GROUP BY "crowdbreaks_incomingkeyword"."keyword_id"
, "crowdbreaks_incomingkeyword"."keyword_id"
;
Here is what the crowdbreaks_tweet table looks like:
\d+ crowdbreaks_tweet;
Table "public.crowdbreaks_tweet"
Column | Type | Modifiers | Storage | Description
---------------+--------------------------+-----------+----------+-------------
tweet_id | bigint | not null | plain |
tweeter | bigint | not null | plain |
text | text | not null | extended |
created_at | timestamp with time zone | not null | plain |
country_code | character varying(3) | | extended |
place_id | character varying(32) | | extended |
coordinates | geometry | | main |
Indexes:
"crowdbreaks_tweet_pkey" PRIMARY KEY, btree (tweet_id)
"crowdbreaks_tweet_coordinates_id" gist (coordinates)
"crowdbreaks_tweet_created_at" btree (created_at)
"crowdbreaks_tweet_place_id" btree (place_id)
"crowdbreaks_tweet_place_id_like" btree (place_id varchar_pattern_ops)
Check constraints:
"enforce_dims_coordinates" CHECK (st_ndims(coordinates) = 2)
"enforce_geotype_coordinates" CHECK (geometrytype(coordinates) = 'POINT'::text OR coordinates IS NULL)
"enforce_srid_coordinates" CHECK (st_srid(coordinates) = 4326)
Foreign-key constraints:
"crowdbreaks_tweet_place_id_fkey" FOREIGN KEY (place_id) REFERENCES crowdbreaks_place(place_id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
TABLE "crowdbreaks_incomingkeyword" CONSTRAINT "crowdbreaks_incomingkeyword_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
TABLE "crowdbreaks_tweetanswer" CONSTRAINT "crowdbreaks_tweetanswer_tweet_id_id_fkey" FOREIGN KEY (tweet_id_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no
And here is the explain analyze for the query:
HashAggregate (cost=184022.03..184023.18 rows=115 width=4) (actual time=6381.707..6381.769 rows=62 loops=1)
-> Hash Join (cost=103857.48..183600.24 rows=84357 width=4) (actual time=1745.449..6377.505 rows=3453 loops=1)
Hash Cond: (crowdbreaks_incomingkeyword.tweet_id = crowdbreaks_tweet.tweet_id)
-> Seq Scan on crowdbreaks_incomingkeyword (cost=0.00..36873.97 rows=2252597 width=12) (actual time=0.008..2136.839 rows=2252597 loops=1)
-> Hash (cost=102535.68..102535.68 rows=80544 width=8) (actual time=1744.815..1744.815 rows=3091 loops=1)
Buckets: 4096 Batches: 4 Memory Usage: 32kB
-> Hash Left Join (cost=16574.93..102535.68 rows=80544 width=8) (actual time=112.551..1740.651 rows=3091 loops=1)
Hash Cond: ((crowdbreaks_tweet.place_id)::text = (crowdbreaks_place.place_id)::text)
Filter: ((crowdbreaks_tweet.coordinates @ '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) OR ((crowdbreaks_place.bounding_box && '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) AND _st_overlaps(crowdbreaks_place.bounding_box, '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry)))
-> Bitmap Heap Scan on crowdbreaks_tweet (cost=15874.18..67060.28 rows=747873 width=125) (actual time=96.012..940.462 rows=736784 loops=1)
Recheck Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
-> Bitmap Index Scan on crowdbreaks_tweet_created_at (cost=0.00..15687.22 rows=747873 width=0) (actual time=94.259..94.259 rows=736784 loops=1)
Index Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
-> Hash (cost=217.11..217.11 rows=6611 width=469) (actual time=15.926..15.926 rows=6611 loops=1)
Buckets: 1024 Batches: 4 Memory Usage: 259kB
-> Seq Scan on crowdbreaks_place (cost=0.00..217.11 rows=6611 width=469) (actual time=0.005..6.908 rows=6611 loops=1)
Total runtime: 6381.903 ms
(17 rows)
That's a pretty bad runtime for the query. Ideally, I'd like to get results back in a second or two.
I've increased shared_buffers on Postgres to 2GB (I have 8GB of RAM) but other than that I'm not quite sure what to do. What are my options? Should I do fewer joins? Are there any other indexes I can throw on there? The sequential scan on crowdbreaks_incomingkeyword doesn't make sense to me. It's a table of foreign keys to other tables, and thus has indexes on it.
Judging from your comment I would try two things:
Raise statistics target for involved columns (and run ANALYZE).
ALTER TABLE tbl ALTER COLUMN col SET STATISTICS 1000;
The data distribution may be uneven. A bigger sample may provide the query planner with more accurate estimates.
Play with the cost settings in postgresql.conf. Your sequential scans might need to be more expensive compared to your index scans to give good estimates.
Try lowering cpu_index_tuple_cost, and set effective_cache_size to something as high as three quarters of your total RAM on a dedicated DB server.
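Both settings can be tried per session before touching postgresql.conf. A sketch under assumptions: the value 0.001 is an arbitrary trial value, not a recommendation, and 6GB is simply three quarters of the 8GB mentioned in the question.

```sql
-- Session-level experiment; nothing here changes postgresql.conf
SHOW cpu_index_tuple_cost;            -- the built-in default is 0.005
SHOW effective_cache_size;

SET cpu_index_tuple_cost = 0.001;     -- assumed trial value
SET effective_cache_size = '6GB';     -- three quarters of 8GB of RAM

-- Re-run EXPLAIN ANALYZE on the query from the question here and compare plans

-- Return to the configured values when done
RESET cpu_index_tuple_cost;
RESET effective_cache_size;
```

Only values that clearly improve the plans you care about are worth copying into postgresql.conf.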