I am trying to reduce the execution time of the query below. It joins three very large Postgres tables; I have added all the indexes I could think of on the relevant tables, but the query is still too slow. The total database size is around 2 TB.
Query:
explain (ANALYZE, COSTS, VERBOSE, BUFFERS)
with au as (
select tbl2.client, tbl2.uid
from tbl2 where tbl2.client = '123kkjk444kjkhj3ddd'
and (tbl2.property->>'num') IN ('1', '2', '3', '31', '12a', '45', '78', '99')
)
SELECT tbl1.id,
CASE WHEN tbl3.displayname IS NOT NULL THEN tbl3.displayname ELSE tbl1.name END AS name,
tbl1.tbl3number, tbl3.originalname as orgtbl3
FROM table_1 tbl1
inner JOIN au tbl2 ON tbl2.client = '123kkjk444kjkhj3ddd' AND tbl2.uid = tbl1.uid
LEFT JOIN tbl3 ON tbl3.client = '123kkjk444kjkhj3ddd' AND tbl3.originalname = tbl1.name
WHERE tbl1.client = '123kkjk444kjkhj3ddd'
AND tbl1.date_col BETWEEN '2021-08-01T05:32:40Z' AND '2021-08-29T05:32:40Z'
ORDER BY tbl1.date_col DESC, tbl1.sid, tbl1.tbl3number
LIMIT 50000;
The query runs, but execution is very slow even though it uses index scans. I have attached the query plan.
Query Plan:
-> Limit (cost=7272.83..7272.86 rows=14 width=158) (actual time=40004.140..40055.737 rows=871 loops=1)
Output: tbl1.id, (CASE WHEN (tbl3.displayname IS NOT NULL) THEN tbl3.displayname ELSE tbl1.name END), tbl1.tbl3number, tbl3.originalname, tbl1.date_col
Buffers: shared hit=249656881 dirtied=32
-> Sort (cost=7272.83..7272.86 rows=14 width=158) (actual time=40004.139..40055.671 rows=871 loops=1)
Output: tbl1.id, (CASE WHEN (tbl3.displayname IS NOT NULL) THEN tbl3.displayname ELSE tbl1.name END), tbl1.tbl3number, tbl3.originalname, tbl1.date_col
Sort Key: tbl1.date_col DESC, tbl1.id, tbl1.tbl3number
Sort Method: quicksort Memory: 142kB
Buffers: shared hit=249656881 dirtied=32
-> Gather (cost=1001.39..7272.56 rows=14 width=158) (actual time=9147.574..40055.005 rows=871 loops=1)
Output: tbl1.id, (CASE WHEN (tbl3.displayname IS NOT NULL) THEN tbl3.displayname ELSE tbl1.name END), tbl1.tbl3number, tbl3.originalname, tbl1.date_col
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=249656881 dirtied=32
-> Nested Loop Left Join (cost=1.39..6271.16 rows=4 width=158) (actual time=3890.074..39998.436 rows=174 loops=5)
Output: tbl1.id, CASE WHEN (tbl3.displayname IS NOT NULL) THEN tbl3.displayname ELSE tbl1.name END, tbl1.tbl3number, tbl3.originalname, tbl1.date_col
Inner Unique: true
Buffers: shared hit=249656881 dirtied=32
Worker 0: actual time=1844.246..39996.744 rows=182 loops=1
Buffers: shared hit=49568277 dirtied=5
Worker 1: actual time=3569.032..39997.124 rows=210 loops=1
Buffers: shared hit=49968461 dirtied=10
Worker 2: actual time=2444.911..39997.561 rows=197 loops=1
Buffers: shared hit=49991521 dirtied=2
Worker 3: actual time=2445.013..39998.065 rows=110 loops=1
Buffers: shared hit=49670445 dirtied=10
-> Nested Loop (cost=1.12..6269.94 rows=4 width=610) (actual time=3890.035..39997.924 rows=174 loops=5)
Output: tbl1.id, tbl1.name, tbl1.tbl3number, tbl1.date_col
Inner Unique: true
Buffers: shared hit=249655135 dirtied=32
Worker 0: actual time=1844.200..39996.206 rows=182 loops=1
Buffers: shared hit=49567912 dirtied=5
Worker 1: actual time=3568.980..39996.522 rows=210 loops=1
Buffers: shared hit=49968040 dirtied=10
Worker 2: actual time=2444.872..39996.987 rows=197 loops=1
Buffers: shared hit=49991126 dirtied=2
Worker 3: actual time=2444.965..39997.712 rows=110 loops=1
Buffers: shared hit=49670224 dirtied=10
-> Parallel Index Only Scan using idx_sv_cuf8_110523 on public.table_1_110523 tbl1 (cost=0.69..5692.16 rows=220 width=692) (actual time=0.059..1458.129 rows=2922506 loops=5)
Output: tbl1.client, tbl1.id, tbl1.tbl3number, tbl1.date_col, tbl1.id, tbl1.name
Index Cond: ((tbl1.client = '123kkjk444kjkhj3ddd'::text) AND (tbl1.date_col >= '2021-08-01 05:32:40+00'::timestamp with time zone) AND (tbl1.date_col <= '2021-08-29 05:32:40+00'::timestamp with time zone))
Heap Fetches: 0
Buffers: shared hit=538663
Worker 0: actual time=0.059..1479.907 rows=2912875 loops=1
Buffers: shared hit=107477
Worker 1: actual time=0.100..1475.863 rows=2930306 loops=1
Buffers: shared hit=107817
Worker 2: actual time=0.054..1481.032 rows=2925849 loops=1
Buffers: shared hit=107812
Worker 3: actual time=0.058..1477.443 rows=2897544 loops=1
Buffers: shared hit=107047
-> Index Scan using tbl2_pkey_102328 on public.tbl2_102328 tbl2_1 (cost=0.43..2.63 rows=1 width=25) (actual time=0.013..0.013 rows=0 loops=14612531)
Output: tbl2_1.id
Index Cond: (((tbl2_1.id)::text = (tbl1.id)::text) AND ((tbl2_1.client)::text = '123kkjk444kjkhj3ddd'::text))
Filter: ((tbl2_1.property ->> 'num'::text) = ANY ('{"1","2","3","31","12a","45","78","99"}'::text[]))
Rows Removed by Filter: 1
Buffers: shared hit=249116472 dirtied=32
Worker 0: actual time=0.013..0.013 rows=0 loops=2912875
Buffers: shared hit=49460435 dirtied=5
Worker 1: actual time=0.013..0.013 rows=0 loops=2930306
Buffers: shared hit=49860223 dirtied=10
Worker 2: actual time=0.013..0.013 rows=0 loops=2925849
Buffers: shared hit=49883314 dirtied=2
Worker 3: actual time=0.013..0.013 rows=0 loops=2897544
Buffers: shared hit=49563177 dirtied=10
-> Index Scan using tbl3_unikey_104219 on public.tbl3_104219 tbl3 (cost=0.27..0.30 rows=1 width=52) (actual time=0.002..0.002 rows=0 loops=871)
Output: tbl3.client, tbl3.originalname, tbl3.displayname
Index Cond: (((tbl3.client)::text = '123kkjk444kjkhj3ddd'::text) AND ((tbl3.originalname)::text = (tbl1.name)::text))
Buffers: shared hit=1746
Worker 0: actual time=0.002..0.002 rows=0 loops=182
Buffers: shared hit=365
Worker 1: actual time=0.002..0.002 rows=0 loops=210
Buffers: shared hit=421
Worker 2: actual time=0.002..0.002 rows=0 loops=197
Buffers: shared hit=395
Worker 3: actual time=0.002..0.002 rows=0 loops=110
Buffers: shared hit=221
Planning Time: 0.361 ms
Execution Time: 40056.008 ms
Planning Time: 0.589 ms
Execution Time: 40071.485 ms
(89 rows)
Time: 40072.986 ms (00:40.073)
Can this query be further optimized to reduce the query execution time? Thank you in advance for your input.
The table definitions are as follows:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
client | character varying(32) | | not null | | extended | |
sid | character varying(32) | | not null | | extended | |
uid | character varying(32) | | | | extended | |
id | character varying(32) | | | | extended | |
tbl3number | integer | | not null | | plain | |
name | character varying(255) | | | | extended | |
date_col | timestamp without time zone | | | | plain | |
Indexes:
"idx_sv_cuf8_110523" btree (client, date_col DESC, sid, tbl3number)
Table "public.tbl2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------------------+-----------------------------+-----------+----------+-------------------------+----------+--------------+-------------
id | character varying(32) | | not null | | extended | |
uid | character varying(255) | | | NULL::character varying | extended | |
client | character varying(32) | | not null | | extended | |
property | jsonb | | | | extended | |
Indexes:
"tbl2_pkey" PRIMARY KEY, btree (uid, client)
--
Table "public.tbl3"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------------+------------------------+-----------+----------+---------+----------+--------------+-------------
client | character varying(500) | | not null | | extended | |
originalname | character varying(500) | | | | extended | |
displayname | character varying(500) | | | | extended | |
Indexes:
"tbl3_unikey" UNIQUE CONSTRAINT, btree (client, originalname)
tl;dr: Multicolumn covering indexes.
Query clarity
I have a preference for using a rigid format for queries, so it's easier to see the columns and tables being processed. I removed your CTE and moved its conditions to the main query for the same reason. I also removed the multiple identical client id constants. Here is my rewrite.
SELECT tbl1.id,
COALESCE(tbl3.displayname, tbl1.name) AS name,
tbl1.tbl3number,
tbl3.originalname as orgtbl3
FROM table_1 tbl1
INNER JOIN tbl2
ON tbl2.client = tbl1.client
AND tbl2.uid = tbl1.uid
AND (tbl2.property->>'num') IN ('1', '2', '3', '31', '12a', '45', '78', '99')
LEFT JOIN tbl3
ON tbl3.client = tbl1.client
AND tbl3.originalname = tbl1.name
WHERE tbl1.client = '123kkjk444kjkhj3ddd'
AND tbl1.date_col BETWEEN '2021-08-01T05:32:40Z' AND '2021-08-29T05:32:40Z'
ORDER BY tbl1.date_col DESC, tbl1.sid, tbl1.tbl3number
LIMIT 50000;
ORDER BY ... LIMIT ...
When you ORDER BY then LIMIT, you sometimes force the server to do a lot of data shuffling: sorting your result set and then discarding part of it. Can you avoid either the ORDER BY or the LIMIT, or both?
It may also help to use the DESC keyword in the index for the column that's ordered by DESC.
Covering indexes
It's a big query. But I believe judicious choices of multicolumn covering indexes will help speed it up.
You filter tbl1 with a constant comparison on client and a range scan on date_col. You then join on uid and output id, name, and tbl3number. Therefore, this BTREE index will allow an index-only range scan, which is generally fast. (Notice the DESC keyword on date_col. It's an attempt to help your ORDER BY clause.)
CREATE INDEX CONCURRENTLY tbl1_name_num_lookup
ON tbl1 (client, date_col DESC)
INCLUDE (uid, id, name, tbl3number);
From tbl2, you access client and uid, and then use the jsonb column property. So this index will likely help you.
CREATE INDEX CONCURRENTLY tbl2_name_num_lookup
ON tbl2 (client, uid)
INCLUDE (property);
From tbl3, you access it by client and originalname. You output displayname. So this index should help.
CREATE INDEX CONCURRENTLY tbl3_name_num_lookup
ON tbl3 (client, originalname)
INCLUDE (displayname);
Join column type mismatch
You join ON tbl2.uid = tbl1.uid, but those two columns have different data types: character varying(32) in tbl1 and character varying(255) in tbl2. JOIN operations are faster when the ON columns have the same data type, so consider changing one or the other.
The same goes for ON tbl3.originalname = tbl1.name.
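Widening the narrower column is the cheaper direction, since increasing a varchar length limit is a metadata-only change in modern PostgreSQL. A sketch of that approach (verify on a test copy first, given the 2 TB database):

```sql
-- Metadata-only in PostgreSQL 9.2+: widening a varchar needs no table rewrite.
-- Narrowing the other column instead would force a full check/rewrite.
ALTER TABLE tbl1 ALTER COLUMN uid  TYPE character varying(255);
ALTER TABLE tbl1 ALTER COLUMN name TYPE character varying(500);
```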
That you have two different things aliased to tbl2 certainly does not enhance the readability of your plan. That the plan is so deeply indented that we have to keep scrolling left and right to read it doesn't help either.
Why does your plan show (tbl2_1.id)::text = (tbl1.id)::text while your query shows tbl2.uid = tbl1.uid? Is that a bit of mis-anonymization?
Essentially all the time goes to the join between tbl1 and tbl2, so that is what you need to optimize. If you eliminate the join to tbl3, that would simplify the EXPLAIN and make it easier to understand.
You are hitting tbl2 14 million times but only getting 174 rows back. We can't tell whether the index finds one row for each of those 14 million probes that then gets filtered out, or whether it finds 0 rows on average. Maybe it would be more efficient to reverse the order of that join, which you might be able to do by creating an index on tbl2 (client, (property->>'num'), uid). Or maybe on "id" rather than "uid"; I don't really know what your true query is.
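That suggested index as DDL (a sketch; the index name is illustrative, and the expression column must be parenthesized):

```sql
-- Expression index so the planner can start from the (client, num) filter
-- on tbl2 and drive the join into tbl1 instead of the other way around.
CREATE INDEX CONCURRENTLY tbl2_client_num_uid_idx
    ON tbl2 (client, (property->>'num'), uid);
```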
Your query applies a filter (restriction) inside a JSON document to find data:
tbl2.property->>'num'
As written, this part of the WHERE predicate is not sargable for the indexes you have: for every candidate row, the server has to fetch the row and then parse into the jsonb value to extract 'num'. When that happens on the inner side of a nested loop, the cost multiplies: it is roughly the outer row count times the per-row cost of probing the JSON, which is what your plan shows.
The fix is to make the predicate indexable. PostgreSQL can index inside jsonb: either with a btree expression index on (property->>'num'), or with a GIN index on the jsonb column itself.
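For example, a GIN index plus a containment rewrite of the filter (a sketch; jsonb_path_ops builds a smaller index that supports only the @> operator, and the IN list has to be expanded into containment tests):

```sql
CREATE INDEX CONCURRENTLY tbl2_property_gin
    ON tbl2 USING gin (property jsonb_path_ops);

-- Rewritten so the GIN index can be used: each IN value becomes a containment test.
SELECT uid
FROM tbl2
WHERE client = '123kkjk444kjkhj3ddd'
  AND (property @> '{"num": "1"}'
    OR property @> '{"num": "2"}'
    OR property @> '{"num": "3"}');
```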
Related
I am fairly new to Postgres and I am trying to debug an issue I found recently in which adding an index has slowed down my query. To begin with, here are the bare-bones tables. There are more indexes and columns, but I am only showing the ones relevant to the issue:
> \d+ classes
Table "public.classes"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------+-----------------------------+-----------+----------+--------------------------+----------+--------------+-------------
classid | bigint | | not null | gen_random_js_safe_int() | plain | |
schoolid | bigint | | | | plain | |
classname | character varying(255) | | not null | | extended | |
Indexes:
"classes_classname_schoolid_idx" UNIQUE, btree (classname, schoolid) WHERE schoolid IS NOT NULL
Foreign-key constraints:
"classes_schoolid_fkey" FOREIGN KEY (schoolid) REFERENCES schools(schoolid)
Referenced by:
TABLE "classtoschool" CONSTRAINT "classtoschool_classid_fkey" FOREIGN KEY (classid) REFERENCES classes(classid) ON DELETE CASCADE
> \d+ classtoschool
Table "public.classtoschool"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------+-----------------------------+-----------+----------+-----------------------------+----------+--------------+-------------
classid | bigint | | | | plain | |
schoolid | bigint | | not null | | plain | |
Foreign-key constraints:
"classtoschool_classid_fkey" FOREIGN KEY (classid) REFERENCES classes(classid) ON DELETE CASCADE
"classtoschool_schoolid_fkey" FOREIGN KEY (schoolid) REFERENCES schools(schoolid) ON DELETE CASCADE
> \d+ schools
Table "public.schools"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------------------------+-----------------------------+-----------+----------+--------------------------+----------+--------------+-------------
schoolid | bigint | | not null | gen_random_js_safe_int() | plain | |
schoolStatus | character varying(100) | | not null | | extended | |
Referenced by:
TABLE "classes" CONSTRAINT "classes_schoolid_fkey" FOREIGN KEY (schoolid) REFERENCES schools(schoolid)
TABLE "classtoschool" CONSTRAINT "classtoschool_schoolid_fkey" FOREIGN KEY (schoolid) REFERENCES schools(schoolid) ON DELETE CASCADE
The index in question is:
"classes_classname_schoolid_idx" UNIQUE, btree (classname, schoolid) WHERE schoolid IS NOT NULL
Before adding this index, query execution times were in single-digit milliseconds; after adding it, they are in triple-digit milliseconds, sometimes even seconds under heavy load.
The strangest part is that when I run EXPLAIN, the plan doesn't even include the index, yet adding it still affects the execution time. Another odd thing is that I have different databases for different regions, and this increase in latency is observed only in this one database. All Postgres versions and settings are identical.
The query that I run is:
EXPLAIN ANALYZE VERBOSE SELECT
schools.omittedColumnA,
classes.classId,
classes.classname,
classes.omittedColumnB,
classes.omittedColumnC,
classes.omittedColumnD,
classes.omittedColumnE
FROM
classes
LEFT JOIN classtoschool ON (classes.classId = classtoschool.classid)
LEFT JOIN schools ON (classtoschool.schoolid = schools.schoolid)
WHERE
(classes.classname = 'maths') AND
(schools.schoolid = '12345678') AND
((schools.banned = false OR schools.banned IS NULL) AND schools.schoolStatus = 'RUNNING') AND
(schools.schoolStatus != 'BREAK' OR schools.schoolStatus IS NULL);
Explain Results:
Nested Loop (cost=1834.56..7368.73 rows=2704 width=104) (actual time=3165.232..3191.356 rows=1 loops=1)
Output: classes.classid, classes.classname, classes.omittedColumnB, classes.omittedColumnC, classes.omittedColumnD, classes.omittedColumnE
-> Index Scan using schools_pkey on public.schools (cost=0.42..2.64 rows=1 width=16) (actual time=30.811..30.815 rows=1 loops=1)
Output: schools.orgid, schools.schoolid
Index Cond: (schools.schoolid = '12345678'::bigint)
Filter: (((NOT schools.banned) OR (schools.banned IS NULL)) AND (((schools.schoolstatus)::text <> 'BREAK'::text) OR (schools.schoolstatus IS NULL)) AND ((schools.schoolstatus)::text = 'RUNNING'::text))
-> Gather (cost=1834.15..7339.05 rows=2704 width=104) (actual time=3134.409..3160.528 rows=1 loops=1)
Output: classes.classid, classes.classname, classes.omittedColumnB, classes.omittedColumnC, classes.omittedColumnD, classes.omittedColumnE, classtoschool.schoolid
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=834.15..6068.65 rows=1127 width=104) (actual time=2890.802..3034.482 rows=0 loops=3)
Output: classes.classid, classes.classname, classes.omittedColumnB, classes.omittedColumnC, classes.omittedColumnD, classes.omittedColumnE, classtoschool.schoolid
Inner Unique: true
Hash Cond: (classes.classid = classtoschool.classid)
Worker 0: actual time=2555.322..2986.361 rows=1 loops=1
Worker 1: actual time=2986.457..2986.459 rows=0 loops=1
-> Parallel Seq Scan on public.classes (cost=0.00..5108.60 rows=47960 width=96) (actual time=3.194..1527.764 rows=38171 loops=3)
Output: classes.classid, classes.schoolid, classes.omittedColumnC, classes.omittedColumnD, classes.classname, classes.omittedColumnB, classes.omittedColumnE
Filter: ((classes.classname)::text = 'maths'::text)
Rows Removed by Filter: 60282
Worker 0: actual time=0.229..1524.734 rows=39144 loops=1
Worker 1: actual time=0.255..1524.984 rows=39508 loops=1
-> Hash (cost=747.56..747.56 rows=6927 width=16) (actual time=1502.768..1502.769 rows=2134 loops=3)
Output: classtoschool.classid, classtoschool.schoolid
Buckets: 8192 Batches: 1 Memory Usage: 165kB
Worker 0: actual time=1457.649..1457.649 rows=2134 loops=1
Worker 1: actual time=1457.576..1457.576 rows=2134 loops=1
-> Index Only Scan using classtoschool_schoolid_classid_idx on public.classtoschool (cost=0.42..747.56 rows=6927 width=16) (actual time=12.409..1501.892 rows=2134 loops=3)
Output: classtoschool.classid, classtoschool.schoolid
Index Cond: (classtoschool.schoolid = '12345678'::bigint)
Heap Fetches: 1512
Worker 0: actual time=0.134..1456.763 rows=2134 loops=1
Worker 1: actual time=0.135..1456.757 rows=2134 loops=1
Planning Time: 307.276 ms
Execution Time: 3191.475 ms
As you can see, the index classes_classname_schoolid_idx is not used anywhere.
Below is the EXPLAIN with the index removed:
Nested Loop (cost=1834.56..7368.73 rows=2704 width=104) (actual time=41.516..190.198 rows=1 loops=1)
Output: classes.classid, classes.classname, classes.omittedColumnB, classes.omittedColumnC, classes.omittedColumnD, classes.omittedColumnE
-> Index Scan using schools_pkey on public.schools (cost=0.42..2.64 rows=1 width=16) (actual time=0.034..0.042 rows=1 loops=1)
Output: schools.schoolid
Index Cond: (schools.schoolid = '12345678'::bigint)
Filter: (((NOT schools.banned) OR (schools.banned IS NULL)) AND (((schools.schoolstatus)::text <> 'BREAK'::text) OR (schools.schoolstatus IS NULL)) AND ((schools.schoolstatus)::text = 'RUNNING'::text))
-> Gather (cost=1834.15..7339.05 rows=2704 width=104) (actual time=41.477..190.150 rows=1 loops=1)
Output: classes.classid, classes.classname, classes.omittedColumnB, classes.omittedColumnC, classes.omittedColumnD, classes.omittedColumnE, classtoschool.schoolid
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=834.15..6068.65 rows=1127 width=104) (actual time=12.321..17.334 rows=0 loops=3)
Output: classes.classid, classes.classname, classes.omittedColumnB, classes.omittedColumnC, classes.omittedColumnD, classes.omittedColumnE, classtoschool.schoolid
Inner Unique: true
Hash Cond: (classes.classid = classtoschool.classid)
Worker 0: actual time=0.006..0.007 rows=0 loops=1
Worker 1: actual time=0.006..0.006 rows=0 loops=1
-> Parallel Seq Scan on public.classes (cost=0.00..5108.60 rows=47960 width=96) (actual time=0.006..13.902 rows=38171 loops=3)
Output: classes.classid, classes.schoolid, classes.omittedColumnC, classes.omittedColumnD, classes.classname, classes.omittedColumnB, classes.omittedColumnE
Filter: ((classes.classname)::text = 'maths'::text)
Rows Removed by Filter: 60282
Worker 0: actual time=0.004..0.004 rows=0 loops=1
Worker 1: actual time=0.004..0.004 rows=0 loops=1
-> Hash (cost=747.56..747.56 rows=6927 width=16) (actual time=1.358..1.359 rows=2134 loops=1)
Output: classtoschool.classid, classtoschool.schoolid
Buckets: 8192 Batches: 1 Memory Usage: 165kB
-> Index Only Scan using classtoschool_schoolid_classid_idx on public.classtoschool (cost=0.42..747.56 rows=6927 width=16) (actual time=0.246..1.060 rows=2134 loops=1)
Output: classtoschool.classid, classtoschool.schoolid
Index Cond: (classtoschool.schoolid = '12345678'::bigint)
Heap Fetches: 504
Planning Time: 5.019 ms
Execution Time: 190.426 ms
Someone suggested I look at the pg_stat_user_tables where I found that the value of column idx_tup_fetch is vastly different in the database with the index (3 billion) vs without the index (300K).
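For reference, that counter can be read per table from the standard statistics view like this (table name taken from the example above):

```sql
-- idx_scan: number of index scans; idx_tup_fetch: live rows fetched via those scans.
SELECT relname, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE relname = 'classes';
```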
I have 2 questions:
Why & how does the index impact the query execution even though EXPLAIN does not show it?
Why does the index not impact the query execution in other databases with more rows & data, but only impact this one database?
I want to query my table, which has the following structure:
Table "public.company_geo_table"
Column | Type | Collation | Nullable | Default
--------------------+--------+-----------+----------+---------
geoname_id | bigint | | |
date | text | | |
cik | text | | |
count | bigint | | |
country_iso_code | text | | |
subdivision_1_name | text | | |
city_name | text | | |
Indexes:
"cik_country_index" btree (cik, country_iso_code)
"cik_geoname_index" btree (cik, geoname_id)
"cik_index" btree (cik)
"date_index" brin (date)
I tried the following SQL query, which needs to fetch a specific cik number over a time period and group by cik and geoname_id (different areas).
select cik, geoname_id, sum(count) as total
from company_geo_table
where cik = '1111111'
and date between '2016-01-01' and '2016-01-10'
group by cik, geoname_id
The EXPLAIN output showed that only the cik index and the date index were used, not the cik_geoname index. Why? Is there any way to optimize this? Any new indexes? Thank you in advance.
HashAggregate (cost=117182.79..117521.42 rows=27091 width=47) (actual time=560132.903..560134.229 rows=3552 loops=1)
Group Key: cik, geoname_id
-> Bitmap Heap Scan on company_geo_table (cost=16467.77..116979.48 rows=27108 width=23) (actual time=6486.232..560114.828 rows=8175 loops=1)
Recheck Cond: ((date >= '2016-01-01'::text) AND (date <= '2016-01-10'::text) AND (cik = '1288776'::text))
Rows Removed by Index Recheck: 16621155
Heap Blocks: lossy=193098
-> BitmapAnd (cost=16467.77..16467.77 rows=27428 width=0) (actual time=6469.640..6469.641 rows=0 loops=1)
-> Bitmap Index Scan on date_index (cost=0.00..244.81 rows=7155101 width=0) (actual time=53.034..53.035 rows=8261120 loops=1)
Index Cond: ((date >= '2016-01-01'::text) AND (date <= '2016-01-10'::text))
-> Bitmap Index Scan on cik_index (cost=0.00..16209.15 rows=739278 width=0) (actual time=6370.930..6370.930 rows=676231 loops=1)
Index Cond: (cik = '1111111'::text)
Planning time: 12.909 ms
Execution time: 560135.432 ms
The row estimate is bad (and the value '1111111' is probably very frequent; I am not sure about the impact). The cik column also has the wrong data type (text), which can be a reason, or at least a partial reason, for the bad estimate.
Bitmap Heap Scan on company_geo_table (cost=16467.77..116979.48 rows=27108 width=23) (actual time=6486.232..560114.828 rows=8175 loops=1)
It looks like a composite index on (date, cik) could help.
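As DDL, that suggestion would be (a sketch; the index name is illustrative, and the other answer here argues for the reversed column order):

```sql
-- Composite btree covering both the range filter on date and the equality on cik.
CREATE INDEX company_geo_date_cik_idx
    ON company_geo_table (date, cik);
```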
Your problem seems to be here:
Rows Removed by Index Recheck: 16621155
Heap Blocks: lossy=193098
Your work_mem setting is too low, so PostgreSQL cannot fit a bitmap that contains one bit per table row, so it degrades to one bit per 8K block. This means that many false positive hits have to be removed during that bitmap heap scan.
Try with higher work_mem and see if that improves query performance.
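For example, at the session level first so nothing global changes (256MB is an illustrative value; tune to your memory budget):

```sql
-- work_mem is the per-sort/per-hash memory budget; SET scopes it to this session.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT cik, geoname_id, sum(count) AS total
FROM company_geo_table
WHERE cik = '1111111'
  AND date BETWEEN '2016-01-01' AND '2016-01-10'
GROUP BY cik, geoname_id;
```

If the bitmap heap scan now reports `Heap Blocks: exact=...` with no `lossy` blocks, the bitmap fit in memory and the recheck overhead should disappear.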
The ideal index would be
CREATE INDEX ON company_geo_table (cik, date);
I'm running into an issue in PostgreSQL (version 9.6.10) with indexes not working to speed up a MAX query with a simple equality filter on another column. Logically it seems that a simple multicolumn index on (A, B DESC) should make the query super fast.
I can't for the life of me figure out why I can't get a query to be performant regardless of what indexes are defined.
The table definition has the following:
- A primary key foo VARCHAR PRIMARY KEY (not used in the query)
- A UUID field that is NOT NULL called bar UUID
- A sequential_id column that was created as a BIGSERIAL UNIQUE type
Here's what the relevant columns look like exactly (with names modified for privacy):
Table "public.foo"
Column | Type | Modifiers
----------------------+--------------------------+--------------------------------------------------------------------------------
foo_uid | character varying | not null
bar_uid | uuid | not null
sequential_id | bigint | not null default nextval('foo_sequential_id_seq'::regclass)
Indexes:
"foo_pkey" PRIMARY KEY, btree (foo_uid)
"foo_bar_uid_sequential_id_idx", btree (bar_uid, sequential_id DESC)
"foo_sequential_id_key" UNIQUE CONSTRAINT, btree (sequential_id)
Despite the index listed above on (bar_uid, sequential_id DESC), the following query scans a different index and takes 100-300 ms with a few million rows in the database.
The Query (get the max sequential_id for a given bar_uid):
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f';
The EXPLAIN ANALYZE result doesn't use the proper index. Also, for some reason it checks if sequential_id IS NOT NULL even though it's declared as not null.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.75..0.76 rows=1 width=8) (actual time=321.110..321.110 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.75 rows=1 width=8) (actual time=321.106..321.106 rows=1 loops=1)
-> Index Scan Backward using foo_sequential_id_key on foo (cost=0.43..98936.43 rows=308401 width=8) (actual time=321.106..321.106 rows=1 loops=1)
Index Cond: (sequential_id IS NOT NULL)
Filter: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Rows Removed by Filter: 920761
Planning time: 0.196 ms
Execution time: 321.127 ms
(9 rows)
I can add a seemingly unnecessary GROUP BY to this query, and that speeds it up a bit, but it's still really slow for a query that should be near-instantaneous with the indexes defined:
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
GROUP BY bar_uid;
The EXPLAIN (ANALYZE, BUFFERS) result:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=8510.54..65953.61 rows=6 width=24) (actual time=234.529..234.530 rows=1 loops=1)
Group Key: bar_uid
Buffers: shared hit=1 read=11909
-> Bitmap Heap Scan on foo (cost=8510.54..64411.55 rows=308401 width=24) (actual time=65.259..201.969 rows=309023 loops=1)
Recheck Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Heap Blocks: exact=10385
Buffers: shared hit=1 read=11909
-> Bitmap Index Scan on foo_bar_uid_sequential_id_idx (cost=0.00..8433.43 rows=308401 width=0) (actual time=63.549..63.549 rows=309023 loops=1)
Index Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Buffers: shared read=1525
Planning time: 3.067 ms
Execution time: 234.589 ms
(12 rows)
Does anyone have any idea what's keeping this query from running on the order of 10 milliseconds? With the right index it should logically be almost instantaneous: just follow the B-tree down to the last leaf entry for that bar_uid.
Someone asked:
What do you get for SELECT * FROM pg_stats WHERE tablename = 'foo' and attname = 'bar_uid';?
schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation | most_common_elems | most_common_elem_freqs | elem_count_histogram
------------+------------------------+-------------+-----------+-----------+-----------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------+------------------+-------------+-------------------+------------------------+----------------------
public | foo | bar_uid | f | 0 | 16 | 6 | {fa61424d-389f-4e75-ba2d-b77e6bb8491f,5c5dcae9-1b7e-4413-99a1-62fde2b89c32,50b1e842-fc32-4c2c-b00f-4a17c3c1c5fa,7ff1999c-c0ea-b700-343f-9a737f6ad659,f667b353-e199-4890-9ffd-4940ea11fe2c,b24ce968-29fd-4587-ba1f-227036ee3135} | {0.203733,0.203167,0.201567,0.195867,0.1952,0.000466667} | | -0.158093 | | |
(1 row)
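For reference, the classic workaround when MAX plus an equality filter won't use a composite index is to spell the aggregate as ORDER BY ... LIMIT 1, which maps directly onto the (bar_uid, sequential_id DESC) index (a sketch; unlike MAX, it returns zero rows instead of NULL when nothing matches):

```sql
-- A single descent of foo_bar_uid_sequential_id_idx can satisfy this.
SELECT sequential_id
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
ORDER BY sequential_id DESC
LIMIT 1;
```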
I'm trying to figure out if my GIST index on my cube column for my table is working for my nearest neighbors query (metric = Euclidean). My cube values are 75 dimensional vectors.
Table:
\d+ reduced_features
Table "public.reduced_features"
Column | Type | Modifiers | Storage | Stats target | Description
----------+--------+-----------+---------+--------------+-------------
id | bigint | not null | plain | |
features | cube | not null | plain | |
Indexes:
"reduced_features_id_idx" UNIQUE, btree (id)
"reduced_features_features_idx" gist (features)
Here is my query:
explain analyze select id from reduced_features order by features <-> (select features from reduced_features where id = 198990) limit 10;
Results:
QUERY PLAN
---------------
Limit (cost=8.58..18.53 rows=10 width=16) (actual time=0.821..35.987 rows=10 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using reduced_features_id_idx on reduced_features reduced_features_1 (cost=0.29..8.31 rows=1 width=608) (actual time=0.014..0.015 rows=1 loops=1)
Index Cond: (id = 198990)
-> Index Scan using reduced_features_features_idx on reduced_features (cost=0.28..36482.06 rows=36689 width=16) (actual time=0.819..35.984 rows=10 loops=1)
Order By: (features <-> $0)
Planning time: 0.117 ms
Execution time: 36.232 ms
I have 36689 total records in my table. Is my index working?
I'm running Geodjango/Postgres 9.1/PostGIS and I'm trying to get the following query (and others like it) to run faster.
[query snipped for brevity]
SELECT "crowdbreaks_incomingkeyword"."keyword_id"
, COUNT("crowdbreaks_incomingkeyword"."keyword_id") AS "cnt"
FROM "crowdbreaks_incomingkeyword"
INNER JOIN "crowdbreaks_tweet"
ON ("crowdbreaks_incomingkeyword"."tweet_id"
= "crowdbreaks_tweet"."tweet_id")
LEFT OUTER JOIN "crowdbreaks_place"
ON ("crowdbreaks_tweet"."place_id"
= "crowdbreaks_place"."place_id")
WHERE (("crowdbreaks_tweet"."coordinates"
# ST_GeomFromEWKB(E'\\001 ... \\000\\000\\000\\0008#'::bytea)
OR ST_Overlaps("crowdbreaks_place"."bounding_box"
, ST_GeomFromEWKB(E'\\001...00\\000\\0008#'::bytea)
))
AND "crowdbreaks_tweet"."created_at" > E'2012-04-17 15:46:12.109893'
AND "crowdbreaks_tweet"."created_at" < E'2012-04-18 15:46:12.109899' )
GROUP BY "crowdbreaks_incomingkeyword"."keyword_id"
, "crowdbreaks_incomingkeyword"."keyword_id"
;
Here is what the crowdbreaks_tweet table looks like:
\d+ crowdbreaks_tweet;
Table "public.crowdbreaks_tweet"
Column | Type | Modifiers | Storage | Description
---------------+--------------------------+-----------+----------+-------------
tweet_id | bigint | not null | plain |
tweeter | bigint | not null | plain |
text | text | not null | extended |
created_at | timestamp with time zone | not null | plain |
country_code | character varying(3) | | extended |
place_id | character varying(32) | | extended |
coordinates | geometry | | main |
Indexes:
"crowdbreaks_tweet_pkey" PRIMARY KEY, btree (tweet_id)
"crowdbreaks_tweet_coordinates_id" gist (coordinates)
"crowdbreaks_tweet_created_at" btree (created_at)
"crowdbreaks_tweet_place_id" btree (place_id)
"crowdbreaks_tweet_place_id_like" btree (place_id varchar_pattern_ops)
Check constraints:
"enforce_dims_coordinates" CHECK (st_ndims(coordinates) = 2)
"enforce_geotype_coordinates" CHECK (geometrytype(coordinates) = 'POINT'::text OR coordinates IS NULL)
"enforce_srid_coordinates" CHECK (st_srid(coordinates) = 4326)
Foreign-key constraints:
"crowdbreaks_tweet_place_id_fkey" FOREIGN KEY (place_id) REFERENCES crowdbreaks_place(place_id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
TABLE "crowdbreaks_incomingkeyword" CONSTRAINT "crowdbreaks_incomingkeyword_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
TABLE "crowdbreaks_tweetanswer" CONSTRAINT "crowdbreaks_tweetanswer_tweet_id_id_fkey" FOREIGN KEY (tweet_id_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no
And here is the explain analyze for the query:
HashAggregate (cost=184022.03..184023.18 rows=115 width=4) (actual time=6381.707..6381.769 rows=62 loops=1)
-> Hash Join (cost=103857.48..183600.24 rows=84357 width=4) (actual time=1745.449..6377.505 rows=3453 loops=1)
Hash Cond: (crowdbreaks_incomingkeyword.tweet_id = crowdbreaks_tweet.tweet_id)
-> Seq Scan on crowdbreaks_incomingkeyword (cost=0.00..36873.97 rows=2252597 width=12) (actual time=0.008..2136.839 rows=2252597 loops=1)
-> Hash (cost=102535.68..102535.68 rows=80544 width=8) (actual time=1744.815..1744.815 rows=3091 loops=1)
Buckets: 4096 Batches: 4 Memory Usage: 32kB
-> Hash Left Join (cost=16574.93..102535.68 rows=80544 width=8) (actual time=112.551..1740.651 rows=3091 loops=1)
Hash Cond: ((crowdbreaks_tweet.place_id)::text = (crowdbreaks_place.place_id)::text)
Filter: ((crowdbreaks_tweet.coordinates # '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) OR ((crowdbreaks_place.bounding_box && '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) AND _st_overlaps(crowdbreaks_place.bounding_box, '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry)))
-> Bitmap Heap Scan on crowdbreaks_tweet (cost=15874.18..67060.28 rows=747873 width=125) (actual time=96.012..940.462 rows=736784 loops=1)
Recheck Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
-> Bitmap Index Scan on crowdbreaks_tweet_created_at (cost=0.00..15687.22 rows=747873 width=0) (actual time=94.259..94.259 rows=736784 loops=1)
Index Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
-> Hash (cost=217.11..217.11 rows=6611 width=469) (actual time=15.926..15.926 rows=6611 loops=1)
Buckets: 1024 Batches: 4 Memory Usage: 259kB
-> Seq Scan on crowdbreaks_place (cost=0.00..217.11 rows=6611 width=469) (actual time=0.005..6.908 rows=6611 loops=1)
Total runtime: 6381.903 ms
(17 rows)
That's a pretty bad runtime for the query. Ideally, I'd like to get results back in a second or two.
I've increased shared_buffers on Postgres to 2GB (I have 8GB of RAM) but other than that I'm not quite sure what to do. What are my options? Should I do fewer joins? Are there any other indexes I can throw on there? The sequential scan on crowdbreaks_incomingkeyword doesn't make sense to me. It's a table of foreign keys to other tables, and thus has indexes on it.
Judging from your comment I would try two things:
Raise the statistics target for the involved columns (and run ANALYZE).
ALTER TABLE tbl ALTER COLUMN column SET STATISTICS 1000;
The data distribution may be uneven. A bigger sample may provide the query planner with more accurate estimates.
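For your case that could look roughly like the following — a sketch only; I'm assuming created_at and place_id are the columns with the worst estimates, so pick whichever columns the plan misestimates most:
```sql
-- Larger sample for the columns used in the WHERE clause and the join;
-- 1000 is the maximum statistics target here.
ALTER TABLE crowdbreaks_tweet ALTER COLUMN created_at SET STATISTICS 1000;
ALTER TABLE crowdbreaks_tweet ALTER COLUMN place_id   SET STATISTICS 1000;
ANALYZE crowdbreaks_tweet;  -- rebuild the statistics with the new target
```
Then re-run EXPLAIN ANALYZE and compare estimated vs. actual row counts.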
Play with the cost settings in postgresql.conf. Your sequential scans might need to become more expensive relative to your index scans for the planner to make good choices.
Try lowering cpu_index_tuple_cost, and set effective_cache_size to something as high as three quarters of your total RAM for a dedicated DB server.
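You can try such settings per session before touching postgresql.conf; the numbers below are illustrative starting points, not a recommendation:
```sql
-- Session-only experiment; nothing is persisted.
SET cpu_index_tuple_cost = 0.001;   -- default is 0.005
SET effective_cache_size = '6GB';   -- ~3/4 of your 8GB RAM
-- Now re-run the problem query and see if the plan changes:
EXPLAIN ANALYZE SELECT ...;
```
If a setting helps, make it permanent in postgresql.conf and reload.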