Slow regex query on 80M record in PostgreSQL - postgresql

I have a read-only table with 80 million rows :
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------------------------+-----------+----------+--------------+-------------
id | character(11) | not null | extended | |
gender | character(1) | | extended | |
postal_code | character varying(10) | | extended | |
operator | character varying(5) | | extended | |
Indexes:
"categorised_phones_pkey" PRIMARY KEY, btree (id)
"operator_idx" btree (operator)
"postal_code_trgm_idx" gin (postal_code gin_trgm_ops)
id is the primary key and contains unique mobile numbers. Table rows look like this:
id | gender | postal_code | operator
----------------+--------------+----------------+------------
09567849087 | m | 7414776788 | mtn
09565649846 | f | 1268398732 | mci
09568831245 | f | 7412556443 | mtn
09469774390 | m | 5488312790 | mci
This query takes about 65 seconds on the first run and about 8 seconds on subsequent runs:
select operator,count(*) from categorised_phones where postal_code like '1%' group by operator;
And the output looks like this:
operator | count
----------+---------
mci | 4050314
mtn | 6235778
And the output of explain analyze:
HashAggregate (cost=1364980.61..1364980.63 rows=2 width=10) (actual time=8257.026..8257.026 rows=2 loops=1)
Group Key: operator
-> Bitmap Heap Scan on categorised_phones (cost=95969.17..1312915.34 rows=10413054 width=2) (actual time=1140.803..6332.534 rows=10286092 loops=1)
Recheck Cond: ((postal_code)::text ~~ '1%'::text)
Rows Removed by Index Recheck: 25105697
Heap Blocks: exact=50449 lossy=237243
-> Bitmap Index Scan on postal_code_trgm_idx (cost=0.00..93365.90 rows=10413054 width=0) (actual time=1129.270..1129.270 rows=10287127 loops=1)
Index Cond: ((postal_code)::text ~~ '1%'::text)
Planning time: 0.540 ms
Execution time: 8257.392 ms
How can I make this query faster?
Any ideas would be greatly appreciated.
P.S:
I'm using PostgreSQL 9.6.1
UPDATE
I just updated the question. I disabled Parallel Query and results changed.

For queries that involve comparisons of the form LIKE 'start%', and following PostgreSQL's own advice, you can use the following index:
CREATE INDEX postal_code_idx ON categorised_phones (postal_code varchar_pattern_ops) ;
With that index in place, and some simulated data, your execution plan could very likely look like:
| QUERY PLAN |
| :------------------------------------------------------------------------------------------------------------------------------------- |
| HashAggregate (cost=2368.65..2368.67 rows=2 width=12) (actual time=18.093..18.094 rows=2 loops=1) |
| Group Key: operator |
| -> Bitmap Heap Scan on categorised_phones (cost=536.79..2265.83 rows=20564 width=4) (actual time=2.564..12.061 rows=22171 loops=1) |
| Filter: ((postal_code)::text ~~ '1%'::text) |
| Heap Blocks: exact=1455 |
| -> Bitmap Index Scan on postal_code_idx (cost=0.00..531.65 rows=21923 width=0) (actual time=2.386..2.386 rows=22171 loops=1) |
| Index Cond: (((postal_code)::text ~>=~ '1'::text) AND ((postal_code)::text ~<~ '2'::text)) |
| Planning time: 0.119 ms |
| Execution time: 18.122 ms |
You can check it at dbfiddle here
If you have both queries with LIKE 'start%' and LIKE '%middle%', you should add this index, but keep the one already in place. Trigram indexes might prove useful with this second kind of match.
Why?
From PostgreSQL documentation on operator classes:
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.
From PostgreSQL documentation on Index Types
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see Section 11.9 below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.
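Whether the special operator class is needed depends on the database collation. A minimal sketch for checking it, followed by the two anchored forms of the question's query that a B-tree index with varchar_pattern_ops could serve. The regex form is an illustration and does not appear in the original question:

```sql
-- Show the collation of the current database. Anything other than "C"
-- (or "POSIX") means a default B-tree index cannot serve LIKE 'foo%'.
SELECT datname, datcollate
FROM pg_database
WHERE datname = current_database();

-- Both predicates are constant patterns anchored at the start of the string.
SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code LIKE '1%'
GROUP BY operator;

SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code ~ '^1'
GROUP BY operator;
```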
UPDATE
If the queries always involve a fixed (and relatively small) number of LIKE 'x%' expressions, consider using partial indexes.
For instance, for LIKE '1%', you'd have the following index, and the following query plan (it shows about a 3x improvement):
CREATE INDEX idx_1 ON categorised_phones (operator) WHERE postal_code LIKE '1%';
VACUUM categorised_phones ;
| QUERY PLAN |
| :-------------------------------------------------------------------------------------------------------------------------------------------- |
| GroupAggregate (cost=0.29..658.74 rows=3 width=12) (actual time=3.235..6.493 rows=2 loops=1) |
| Group Key: operator |
| -> Index Only Scan using idx_1 on categorised_phones (cost=0.29..554.10 rows=20921 width=4) (actual time=0.028..3.266 rows=22290 loops=1) |
| Heap Fetches: 0 |
| Planning time: 0.293 ms |
| Execution time: 6.517 ms |

How can I optimize the database to reduce query time?

There are two large tables, num and pre, in the database:
\d num
Table "public.num"
Column | Type | Collation | Nullable | Default
----------+-------------------------+-----------+----------+---------------------------------
id | integer | | not null | nextval('num_id_seq'::regclass)
adsh | character varying(20) | | |
tag | character varying(256) | | |
version | character varying(20) | | |
coreg | character varying(256) | | |
ddate | date | | |
qtrs | numeric(8,0) | | |
uom | character varying(20) | | |
value | numeric(28,4) | | |
footnote | character varying(1024) | | |
\d pre
Table "public.pre"
Column | Type | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------------------------------
id | integer | | not null | nextval('pre_id_seq'::regclass)
adsh | character varying(20) | | |
report | numeric(6,0) | | |
line | numeric(6,0) | | |
stmt | character varying(2) | | |
inpth | boolean | | |
rfile | character(1) | | |
tag | character varying(256) | | |
version | character varying(20) | | |
plabel | character varying(512) | | |
negating | boolean | | |
Check how many records are in the tables:
select count(*) from num;
count
----------
83862587
(1 row)
Time: 204945.436 ms (03:24.945)
select count(*) from pre;
count
----------
36738034
(1 row)
Time: 100604.085 ms (01:40.604)
Execute a long query :
explain analyze select tag,uom,qtrs,value,ddate from num
where adsh='0000320193-22-000108' and tag in
(select tag from pre where stmt='IS' and
adsh='0000320193-22-000108') and ddate='2022-09-30';
It takes almost 7 minutes 30 seconds.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=2000.00..2909871.29 rows=2 width=59) (actual time=357717.922..450523.035 rows=45 loops=1)
Join Filter: ((num.tag)::text = (pre.tag)::text)
Rows Removed by Join Filter: 61320
-> Gather (cost=1000.00..1984125.01 rows=32 width=59) (actual time=190.355..92987.731 rows=678 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on num (cost=0.00..1983121.81 rows=13 width=59) (actual time=348.753..331304.671 rows=226 loops=3)
Filter: (((adsh)::text = '0000320193-22-000108'::text) AND (ddate = '2022-09-30'::date))
Rows Removed by Filter: 27953970
-> Materialize (cost=1000.00..925725.74 rows=43 width=33) (actual time=0.097..527.331 rows=91 loops=678)
-> Gather (cost=1000.00..925725.53 rows=43 width=33) (actual time=65.880..357527.133 rows=96 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on pre (cost=0.00..924721.22 rows=18 width=33) (actual time=201.713..357490.037 rows=32 loops=3)
Filter: (((adsh)::text = '0000320193-22-000108'::text) AND ((stmt)::text = 'IS'::text))
Rows Removed by Filter: 12245979
Planning Time: 0.632 ms
JIT:
Functions: 27
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.870 ms, Inlining 272.489 ms, Optimization 367.828 ms, Emission 213.288 ms, Total 859.474 ms
Execution Time: 450524.974 ms
(22 rows)
Time: 450526.084 ms (07:30.526)
How can I optimize the database to reduce the query time? Should I add an index, or close something (the database runs on my local PC without any other users)?
It's performing a full table scan on both tables, as you can see from the plan, which shows a "Parallel Seq Scan" (i.e. a sequential scan) on num and pre.
This means it is looking at every row to check whether it should be included.
To speed it up massively, you need to create indexes on the columns used in the where clauses (pre.stmt, pre.adsh, num.adsh and num.ddate). The query can then use the indexes to decide which rows to include, and because indexes are organised specifically for this task, performance will improve.
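A minimal sketch of such indexes. The index names are hypothetical, and combining the columns into one composite index per table is an assumption based on the WHERE clauses of the query shown above:

```sql
-- Hypothetical index names; the columns follow the filters in the query.
CREATE INDEX num_adsh_ddate_idx ON num (adsh, ddate);
CREATE INDEX pre_adsh_stmt_idx ON pre (adsh, stmt);

-- Refresh planner statistics after building the indexes.
ANALYZE num;
ANALYZE pre;
```

Re-running explain analyze afterwards shows whether the sequential scans have been replaced by index or bitmap scans.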
Adding a CTE could also improve your query:
WITH t AS (
SELECT tag FROM pre WHERE stmt = 'yy' AND adsh = 'xxxx'
)
SELECT n.tag, n.uom, n.qtrs, n.value, n.ddate
FROM num n
JOIN t ON t.tag = n.tag
WHERE n.adsh = 'xxxx' AND n.ddate = '2022-09-30';
You can also create a partial index on one of the columns if the most frequently used queries are well known:
CREATE INDEX idx_num_adshxxxx ON num(adsh) WHERE adsh='xxxx';
This creates a very fast index covering just a small portion of the table.
It is important to note that indexes are of limited use for ad hoc selects that were not planned for. On large tables it is often better to select a narrow set of rows with an index scan and re-query that result from a CTE than to load the entire table. For example, this is a very common predicate in everyday database usage, and a common source of problems:
WHERE LOWER(adsh) = 'xxxx' ;
Notice how important it is for your queries to match the indexed expression.
If you change the expression applied to the column being searched, the index must match it or it doesn't get used. The same applies to integercolumn::text = 'x' or date::text = '2019-10-01'.
After adding the proper index, this is solved.
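A minimal sketch of an expression index for the LOWER(adsh) predicate. The index name is hypothetical and the selected columns are only for illustration:

```sql
-- Hypothetical index name. The indexed expression must be written the
-- same way as in the query predicate for the planner to consider it.
CREATE INDEX num_adsh_lower_idx ON num (lower(adsh));

-- Can use num_adsh_lower_idx:
SELECT tag, value FROM num WHERE lower(adsh) = 'xxxx';

-- Cannot use num_adsh_lower_idx (this needs a plain index on adsh):
SELECT tag, value FROM num WHERE adsh = 'xxxx';
```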
As tables get bigger, fewer and fewer ad hoc filters can be allowed. A table scan first fills the OS memory cache, then the same data is duplicated in the PostgreSQL cache, and only later does it stabilize.
If the server's cache memory is already at its limit, each new ad hoc query that gets cached slows down the queries that were previously cached.
The subselect in the where clause needs to be rewritten into a table join, which will be much faster. The DISTINCT may not be necessary, but I don't know the cardinality of your data, so I included it.
select DISTINCT
num.tag
, num.uom
, num.qtrs
, num.value
, num.ddate
from num
inner join pre
on num.adsh = pre.adsh
and num.tag = pre.tag
and num.adsh = 'xxxx'
and pre.stmt = 'yy'
and num.ddate = '2022-09-30'
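An alternative sketch, not from the original answers, that states the semi-join explicitly with EXISTS. It returns each num row at most once without needing DISTINCT. The planner may well produce the same plan as for the IN form, so treat it as a readability option, not a guaranteed speed-up:

```sql
SELECT n.tag, n.uom, n.qtrs, n.value, n.ddate
FROM num n
WHERE n.adsh = '0000320193-22-000108'
  AND n.ddate = '2022-09-30'
  AND EXISTS (
      -- A num row qualifies if at least one matching pre row exists.
      SELECT 1
      FROM pre p
      WHERE p.adsh = n.adsh
        AND p.tag = n.tag
        AND p.stmt = 'IS'
  );
```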

What's the best way to Select the first N digits in an Integer (PostgreSQL)

I have a bigint column which is a 19 digit CAS, the first 13 digits of which are a unix timestamp. I need to extract the first 13 digits.
What is the best way to extract the first n digits from the bigint?
Both of these work:
select
cast(left(cast(1641394918760185856 as varchar), 13) as bigint) as withcasting,
1641394918760185856/1000000 as integerdiv;
OUTPUT:1641394918760, 1641394918760
Is there an obviously better way?
Which one of these is better as far as performance?
Which one, if any, is the canonical way to do it?
Intuitively, I'd think integer division is more performant because it's one operation.
But I like the cast because it's very simple (doesn't require figuring out the divisor) and expresses most clearly what I'm looking to do (extract the leading 13 digits) and also lends itself to a generalized UDF abstraction (LEFT_DIGITS(number,n)).
So I guess, assuming that it is indeed less performant, the question is really how much less performant?
Actually this way is simpler: you don't really need the left function, just cast to the number of characters that you need:
select cast(1641394918760185856 as varchar(3))
Well, my initial assumption was not right, since you actually have to cast it back to bigint. Looking at the execution plans, you can see that the arithmetic operation is handled more cheaply (with less memory usage):
select 1641394918760185856/1000000 as ii
| QUERY PLAN |
| :--------------------------------------------------------------------------------- |
| Result (cost=0.00..0.01 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1) |
| Output: '1641394918760'::bigint |
| Planning Time: 0.021 ms |
| Execution Time: 0.007 ms |
select cast(cast(1641394918760185856 as varchar(13)) as bigint) as ii
| QUERY PLAN |
| :--------------------------------------------------------------------------------- |
| Result (cost=0.00..0.01 rows=1 width=8) (actual time=0.000..0.001 rows=1 loops=1) |
| Output: '1641394918760'::bigint |
| Planning: |
| Buffers: shared hit=12 read=3 |
| Planning Time: 0.097 ms |
| Execution Time: 0.005 ms |
db<>fiddle here
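The LEFT_DIGITS(number, n) abstraction mentioned in the question could be sketched as a SQL function built on the cast approach. The function name and signature are hypothetical, and the sketch assumes non-negative input and n of at least 1:

```sql
-- Hypothetical helper; assumes val is non-negative, since a leading
-- minus sign would otherwise count as one of the n characters.
CREATE FUNCTION left_digits(val bigint, n integer)
RETURNS bigint
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT left(val::text, n)::bigint
$$;

-- Same value as in the question; expected to match its output, 1641394918760.
SELECT left_digits(1641394918760185856, 13);
```

If the number of digits is always the same, the integer division shown above remains the cheaper option.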

Why is a select query very slow in Postgres?

I have a simple Postgres table. A simple query to count total records takes ages. I have 7.5 million records in the table, and I'm using a machine with 8 vCPUs and 32 GB of memory. The database is on the same machine.
Edit: add query.
Following query is very slow:
SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000
Output of explain
$ explain SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000
---------------------------------------------------------------------------------------------------------
Limit (cost=5.42..49915.17 rows=10000 width=1985)
-> Index Scan using import_csv_id_idx on import_csv (cost=0.43..19144730.02 rows=3835870 width=1985)
Filter: (NOT processed)
(3 rows)
My table is as below:
Column | Type | Collation | Nullable | Default
-------------------+----------------+-----------+----------+---------
id | integer | | |
name | character(500) | | |
domain | character(500) | | |
year_founded | real | | |
industry | character(500) | | |
size_range | character(500) | | |
locality | character(500) | | |
country | character(500) | | |
linkedinurl | character(500) | | |
employees | integer | | |
processed | boolean | | not null | false
employee_estimate | integer | | |
Indexes:
"import_csv_id_idx" btree (id)
"processed_idx" btree (processed)
Thank you
Edit 3:
# explain analyze SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5.42..49915.33 rows=10000 width=1985) (actual time=8331.070..8355.556 rows=10000 loops=1)
-> Index Scan using import_csv_id_idx on import_csv (cost=0.43..19144790.06 rows=3835870 width=1985) (actual time=8331.067..8354.874 rows=10001 loops=1)
Filter: (NOT processed)
Rows Removed by Filter: 3482252
Planning time: 0.081 ms
Execution time: 8355.925 ms
(6 rows)
explain (analyze, buffers)
# explain (analyze, buffers) SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5.42..49915.33 rows=10000 width=1985) (actual time=8236.899..8260.941 rows=10000 loops=1)
Buffers: shared hit=724036 read=2187905 dirtied=17 written=35
-> Index Scan using import_csv_id_idx on import_csv (cost=0.43..19144790.06 rows=3835870 width=1985) (actual time=8236.896..8260.104 rows=10001 loops=1)
Filter: (NOT processed)
Rows Removed by Filter: 3482252
Buffers: shared hit=724036 read=2187905 dirtied=17 written=35
Planning time: 0.386 ms
Execution time: 8261.406 ms
(8 rows)
It is slow because it has to dig through 3482252 rows which fail the processed = False criterion before finding the 10001st one which passes, and apparently all those failing rows are scattered randomly about the table, leading to a lot of slow IO.
You either need an index on (processed, id), or on (id) where processed = false.
If you do the first of these, you can drop the index on processed alone, as it would no longer be independently useful (if it ever were to start with).
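The two options could be written as follows. The index names are hypothetical, and only one of the two indexes is needed:

```sql
-- Option 1: composite index, so rows with processed = false
-- can be read directly in id order.
CREATE INDEX import_csv_processed_id_idx ON import_csv (processed, id);

-- Option 2: partial index containing only the unprocessed rows.
CREATE INDEX import_csv_unprocessed_id_idx ON import_csv (id)
WHERE processed = false;

-- Optional, and only if option 1 is used: the single-column index
-- on processed is then redundant.
DROP INDEX processed_idx;
```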

Casting in join filter - does it preclude an index scan?

So I'm joining over a good few tables, and the results are horrendous.
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Nested Loop Left Join (cost=15291.63..367830181335340285952.00 rows=8062002970089247211520 width=676) |
| Join Filter: CASE WHEN (x415."SOURCE_URI" IS NOT NULL) THEN ((x415."SOURCE_URI")::text = (x47."SOURCE_URI")::text) ELSE NULL::boolean END |
| -> Nested Loop Left Join (cost=15291.63..53031043002887.09 rows=1767529209075750 width=900) |
| Join Filter: CASE WHEN (CASE WHEN (x414."CAT" IS NULL) THEN NULL::integer ELSE 1 END IS NOT NULL) THEN ((x414."CAT")::text = (x415."CAT")::text) ELSE NULL::boolean END |
| -> Nested Loop Left Join (cost=15291.63..5166730086.91 rows=172147962900 width=846) |
| Join Filter: CASE WHEN (CASE WHEN (x297."ID" IS NULL) THEN NULL::integer ELSE 1 END IS NOT NULL) THEN (x297."ID" = x258."TRACK") ELSE NULL::boolean END |
| -> Nested Loop Left Join (cost=15291.63..2291101.66 rows=71430690 width=816) |
| Join Filter: CASE WHEN (CASE WHEN (x297."ID" IS NULL) THEN NULL::integer ELSE 1 END IS NOT NULL) THEN (x297."ID" = x213."TRACK") ELSE NULL::boolean END |
| -> Nested Loop Left Join (cost=15291.63..148094.28 rows=33270 width=786) |
| Join Filter: CASE WHEN (CASE WHEN (x330."ID" IS NULL) THEN NULL::integer ELSE 1 END IS NOT NULL) THEN (x330."ID" = x297."MEDIUM") ELSE NULL::boolean END |
| -> Nested Loop Left Join (cost=15291.63..146700.24 rows=4 width=724) |
| -> Nested Loop Left Join (cost=15291.22..146681.05 rows=4 width=654) |
| -> Nested Loop Left Join (cost=15290.80..146667.32 rows=4 width=542) |
| -> Nested Loop Left Join (cost=15290.25..146643.82 rows=4 width=496) |
| -> Nested Loop Left Join (cost=15289.69..146621.36 rows=4 width=448) |
| -> Hash Right Join (cost=15289.27..146602.54 rows=4 width=438) |
| Hash Cond: ((x355."SOURCE_URI")::text = (x410."SOURCE_URI")::text) |
| -> Seq Scan on "RELEASE_IMAGE" x355 (cost=0.00..111558.45 rows=5267945 width=120) |
| -> Hash (cost=15289.22..15289.22 rows=4 width=318) |
| -> Hash Right Join (cost=13996.67..15289.22 rows=4 width=318) |
| Hash Cond: ((x376."SOURCE_URI")::text = (x410."SOURCE_URI")::text) |
| -> Seq Scan on "RELEASE_VOTED_TAG" x376 (cost=0.00..1118.94 rows=46294 width=82) |
| -> Hash (cost=13996.62..13996.62 rows=4 width=236) |
| -> Hash Right Join (cost=13934.61..13996.62 rows=4 width=236) |
| Hash Cond: ((x330."SOURCE_URI")::text = (x410."SOURCE_URI")::text) |
| -> Seq Scan on "MEDIUM" x330 (cost=0.00..53.00 rows=2400 width=83) |
| -> Hash (cost=13934.56..13934.56 rows=4 width=153) |
| -> Nested Loop (cost=0.56..13934.56 rows=4 width=153) |
| -> Seq Scan on "RELEASE_BARCODE" (cost=0.00..13900.21 rows=4 width=40) |
| Filter: (("BARCODE")::text = ANY ('{75992731324,075992731324,0075992731324}'::text[])) |
| -> Index Scan using "RELEASE_pkey" on "RELEASE" x410 (cost=0.56..8.58 rows=1 width=153) |
| Index Cond: (("SOURCE_URI")::text = ("RELEASE_BARCODE"."SOURCE_URI")::text) |
| -> Index Only Scan using "RELEASE_CAT_PK" on "RELEASE_CAT_NO" x414 (cost=0.41..4.70 rows=1 width=74) |
| Index Cond: ("SOURCE_URI" = (x410."SOURCE_URI")::text) |
| -> Index Only Scan using "RELEASE_GENRE_PK" on "RELEASE_GENRE" x409 (cost=0.56..5.61 rows=1 width=48) |
| Index Cond: ("SOURCE_URI" = (x410."SOURCE_URI")::text) |
| -> Index Only Scan using "RELEASE_TYPE_PK" on "RELEASE_TYPE" x394 (cost=0.56..5.83 rows=4 width=46) |
| Index Cond: ("SOURCE_URI" = (x410."SOURCE_URI")::text) |
| -> Index Only Scan using "RELEASE_URL_PK" on "RELEASE_URL" x165 (cost=0.41..3.41 rows=2 width=112) |
| Index Cond: ("SOURCE_URI" = (x410."SOURCE_URI")::text) |
| -> Index Scan using release_label_source_uri on "RELEASE_LABEL" x111 (cost=0.41..4.79 rows=1 width=134) |
| Index Cond: ((x410."SOURCE_URI")::text = ("SOURCE_URI")::text) |
| -> Materialize (cost=0.00..437.53 rows=16635 width=62) |
| -> Seq Scan on "TRACK" x297 (cost=0.00..354.35 rows=16635 width=62) |
| -> Materialize (cost=0.00..97.41 rows=4294 width=30) |
| -> Seq Scan on "TRACK_COMPOSER" x213 (cost=0.00..75.94 rows=4294 width=30) |
| -> Materialize (cost=0.00..110.30 rows=4820 width=30) |
| -> Seq Scan on "TRACK_ARTIST" x258 (cost=0.00..86.20 rows=4820 width=30) |
| -> Materialize (cost=0.00..579.02 rows=20535 width=74) |
| -> Seq Scan on "RELEASE_CAT_NO" x415 (cost=0.00..476.35 rows=20535 width=74) |
| -> Materialize (cost=0.00..366235.13 rows=9122342 width=40) |
| -> Seq Scan on "RELEASE" x47 (cost=0.00..249354.42 rows=9122342 width=40) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
https://explain.depesz.com/s/3DSD
My first reaction was to add some indexes. So I added the following:
CREATE INDEX RELEASE_CAT_CAT_NO on "RELEASE_CAT_NO" ("CAT");
CREATE INDEX "track_medium" on "TRACK" ("MEDIUM");
CREATE INDEX "track_composer_track" on "TRACK_COMPOSER" ("TRACK");
CREATE INDEX "track_artist_track" on "TRACK_ARTIST" ("TRACK");
But this makes no difference. When I perform simpler queries I can see the indexes being used, but still not for this query.
That said, adding indexes did help for:
CREATE INDEX "release_label_source_uri" on "RELEASE_LABEL" ("SOURCE_URI");
I'm wondering whether the join filters, which potentially cast values to different types, are responsible:
| Join Filter: CASE WHEN (CASE WHEN (x414."CAT" IS NULL) THEN NULL::integer ELSE 1 END IS NOT NULL) THEN ((x414."CAT")::text = (x415."CAT")::text) ELSE NULL::boolean END |
CAT is a varchar and I created an index for it as above. The code above is generated when a subquery performs a select, returning 1 or 0 depending on whether CAT is null.
I assume this only occurs on results, and doesn't impact the type of scan? But the reason I am wondering is because it's appearing in the "join filter" output.
This is a query generated by the Slick framework btw. PostgreSQL 9.6.3.
Some ideas:
You have exclusively outer joins. That limits the possible execution paths considerably.
Check if you really need outer joins everywhere or if you could use inner joins in some places.
Many of your join conditions are really complicated and allow only nested loop joins, which will hurt performance a lot if many rows are involved.
Try to simplify them!
For example, consider this:
... LEFT JOIN ...
ON CASE
WHEN (x415."SOURCE_URI" IS NOT NULL)
THEN ((x415."SOURCE_URI")::text = (x47."SOURCE_URI")::text)
ELSE NULL::boolean
END
This brain damaged piece of SQL could be written as
... LEFT JOIN ...
ON x415."SOURCE_URI" = x47."SOURCE_URI"
Then PostgreSQL could use a hash join, and if you have a lot of rows, that will speed up your query considerably.
One more index that could help with your execution plan, depending on how big the table is:
CREATE INDEX ON "RELEASE_BARCODE"("BARCODE");
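The nested CASE filters in the plan reduce the same way: the inner CASE is non-null exactly when the column is non-null, so the whole expression behaves like a plain equality. A small self-contained check of that claim on made-up values (the column names a and b are placeholders for the two "CAT" columns):

```sql
-- Compare the generated filter with the plain equality for every
-- combination of NULL and non-NULL inputs; the last two columns should
-- agree in each row (true, false, or NULL).
SELECT a, b,
       CASE WHEN (CASE WHEN a IS NULL THEN NULL::integer ELSE 1 END IS NOT NULL)
            THEN a = b
            ELSE NULL::boolean
       END AS generated_form,
       a = b AS simplified_form
FROM (VALUES ('x', 'x'),
             ('x', 'y'),
             (NULL, 'x'),
             ('x', NULL),
             (NULL, NULL)) AS t(a, b);
```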

Slow Select Query with DATE Between

I searched for my problem a bit, but couldn't find a solution.
I run a PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit and my Query is pretty simple.
EXPLAIN (ANALYZE) SELECT CUSTOMER, PRICE, BUYDATE FROM dbo.Invoice WHERE CUSTOMER = 11111111 AND BUYDATE BETWEEN '2012-11-01' AND '2013-10-31';
Output:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on Invoice (cost=88193.54..152981.03 rows=20699 width=14) (actual time=987.205..987.242 rows=36 loops=1)
Recheck Cond: ((CUSTOMER = 11111111) AND (BUYDATE >= '2012-11-01'::date) AND (BUYDATE <= '2013-10-31'::date))
-> BitmapAnd (cost=88193.54..88193.54 rows=20699 width=0) (actual time=987.189..987.189 rows=0 loops=1)
-> Bitmap Index Scan on ix_Invoice (cost=0.00..1375.69 rows=74375 width=0) (actual time=0.043..0.043 rows=40 loops=1)
Index Cond: (CUSTOMER = 11111111)
-> Bitmap Index Scan on ix_Invoice3 (cost=0.00..86807.24 rows=4139736 width=0) (actual time=986.562..986.562 rows=4153999 loops=1)
Index Cond: ((BUYDATE >= '2012-11-01'::date) AND (BUYDATE <= '2013-10-31'::date))
Total runtime: 987.294 ms
(8 rows)
The Table Structure:
Column | Type | Modifiers | Storage | Stats target | Description
-----------------+---------------------------+-----------+----------+--------------+-------------
profitcenter | character varying(5) | not null | extended | |
invoicenumber | character varying(10) | not null | extended | |
invoiceposition | character varying(6) | not null | extended | |
buydate | date | not null | plain | |
customer | integer | | plain | |
nettowert | numeric(18,2) | | main | |
Indexes:
"filialbons_key" PRIMARY KEY, btree (profitcenter, invoicenumber, invoiceposition, buydate)
"ix_Invoice" btree (customer)
"ix_Invoice2" btree (invoicenumber)
"ix_Invoice3" btree (buydate)
"ix_Invoice4" btree (articlenumber)
Has OIDs: no
Example Output from the Query:
customer | price | buydate
--------------+-----------+----
11111111 | 8.32 | 2013-02-06
11111111 | 5.82 | 2013-02-06
11111111 | 16.64 | 2013-02-06
I ran the same Query on a MSSQL 2010? Express with the Date Column as varchar() and it was much faster.
Thanks for your help
With an index on (customer, buydate), the query should run much faster.
You may also try to help the planner choose a better plan by collecting more statistics:
ALTER TABLE Invoice ALTER COLUMN customer SET STATISTICS 1000;
ALTER TABLE Invoice ALTER COLUMN buydate SET STATISTICS 1000;
ANALYZE Invoice;
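The composite index suggested above could be created like this. The index name is hypothetical, and the schema-qualified table name is taken from the query in the question:

```sql
-- Hypothetical index name. Putting customer first lets the equality
-- filter narrow the rows before the buydate range is applied.
CREATE INDEX ix_invoice_customer_buydate ON dbo.Invoice (customer, buydate);

-- Refresh planner statistics so the new index is costed correctly.
ANALYZE dbo.Invoice;
```

With this index the planner no longer has to build a bitmap of roughly 4 million buydate matches just to intersect it with the 40 rows for the customer.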