Postgres JSONB Retrieval is very slow - postgresql

A bit of a loss here. First and foremost, I'm not a DBA, nor do I really have any experience with Postgres beyond what I'm doing now.
Postgres seems to choke when you want to return jsonb documents in general for anything more than a couple of hundred rows. When you try to return thousands, query performance becomes terrible. If you go even further and attempt to return multiple jsonb documents following various table joins, forget it.
Here is my scenario with some code:
I have 3 tables - all of them have jsonb documents with complex models, 2 of which are sizeable (8 to 12 KB uncompressed). In this particular operation I need to unnest a jsonb array of elements to then work through - this gives me roughly 12k records.
Each record then contains an ID that I use to join another important table - I need to retrieve the jsonb doc from this table. From there, I need to join that table on to another (much smaller) table and also pull the doc from there based on another key.
The output is therefore several columns + 3 jsonb documents ranging from <1kb in size to around 12kb uncompressed in size.
Query data retrieval is effectively pointless - I've yet to see the query return data. As soon as I strip away the jsonb doc columns, the query naturally speeds up to seconds or less. Returning 1 jsonb document bumps the retrieval to 40 seconds in my case, adding a second takes us to 2 minutes, and adding the third takes much longer.
What am I doing wrong? Is there any way to retrieve the jsonb documents in a performant way?
SELECT x.id,
       a.doc1,
       b.doc2,
       c.doc3
FROM (SELECT id,
             (elements.elem ->> 'a'::text)::integer AS a,
             (elements.elem ->> 'b'::text)::integer AS b,
             (elements.elem ->> 'c'::text)::integer AS c,
             (elements.elem ->> 'd'::text)::integer AS d,
             (elements.elem ->> 'e'::text)::integer AS e
      FROM tab
      CROSS JOIN LATERAL jsonb_array_elements(tab.doc -> 'arrayList'::text) WITH ORDINALITY elements(elem, ordinality)) x
LEFT JOIN table2 a ON x.id = a.id
LEFT JOIN table3 b ON a.other_id = b.id
LEFT JOIN table4 c ON b.other_id = c.id;
The tables themselves are fairly standard:
CREATE TABLE a (
    id integer PRIMARY KEY,
    other_id integer,  -- foreign key
    doc jsonb
);
Nothing special about these tables, they are ids and jsonb documents
A note - we are using Postgres for a few reasons: we do need the relational aspects of PG, but at the same time we need the document storage and retrieval ability for later in our workflow.
Apologies if I've not provided enough data here, I can try to add some more based on any comments
EDIT: added explain:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=465.79..96225.93 rows=11300 width=1843)
Hash Cond: (pr.table_3_id = br.id)
-> Hash Left Join (cost=451.25..95756.86 rows=11300 width=1149)
Hash Cond: (((p.doc ->> 'secondary_id'::text))::integer = pr.id)
-> Nested Loop Left Join (cost=0.44..95272.14 rows=11300 width=1029)
-> Nested Loop (cost=0.01..239.13 rows=11300 width=40)
-> Seq Scan on table_3 (cost=0.00..13.13 rows=113 width=710)
-> Function Scan on jsonb_array_elements elements (cost=0.01..1.00 rows=100 width=32)
-> Index Scan using table_1_pkey on table_1 p (cost=0.43..8.41 rows=1 width=993)
Index Cond: (((elements.elem ->> 'field_id'::text))::integer = id)
-> Hash (cost=325.36..325.36 rows=10036 width=124)
-> Seq Scan on table_2 pr (cost=0.00..325.36 rows=10036 width=124)
-> Hash (cost=13.13..13.13 rows=113 width=710)
-> Seq Scan on table_3 br (cost=0.00..13.13 rows=113 width=710)
(14 rows)
EDIT2: Sorry, been mega busy - I will try to go into more detail. Firstly, the full explain plan (I didn't know about the additional parameters) - I'll leave in the actual table names (I wasn't sure if I was allowed to):
Hash Left Join (cost=465.79..96225.93 rows=11300 width=1726) (actual time=4.669..278.781 rows=12522 loops=1)
Hash Cond: (pr.brand_id = br.id)
Buffers: shared hit=64813
-> Hash Left Join (cost=451.25..95756.86 rows=11300 width=1032) (actual time=4.537..265.749 rows=12522 loops=1)
Hash Cond: (((p.doc ->> 'productId'::text))::integer = pr.id)
Buffers: shared hit=64801
-> Nested Loop Left Join (cost=0.44..95272.14 rows=11300 width=912) (actual time=0.240..39.480 rows=12522 loops=1)
Buffers: shared hit=49964
-> Nested Loop (cost=0.01..239.13 rows=11300 width=40) (actual time=0.230..8.177 rows=12522 loops=1)
Buffers: shared hit=163
-> Seq Scan on brand (cost=0.00..13.13 rows=113 width=710) (actual time=0.003..0.038 rows=113 loops=1)
Buffers: shared hit=12
-> Function Scan on jsonb_array_elements elements (cost=0.01..1.00 rows=100 width=32) (actual time=0.045..0.057 rows=111 loops=113)
Buffers: shared hit=151
-> Index Scan using product_variant_pkey on product_variant p (cost=0.43..8.41 rows=1 width=876) (actual time=0.002..0.002 rows=1 loops=12522)
Index Cond: (((elements.elem ->> 'productVariantId'::text))::integer = id)
Buffers: shared hit=49801
-> Hash (cost=325.36..325.36 rows=10036 width=124) (actual time=4.174..4.174 rows=10036 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1684kB
Buffers: shared hit=225
-> Seq Scan on product pr (cost=0.00..325.36 rows=10036 width=124) (actual time=0.003..1.836 rows=10036 loops=1)
Buffers: shared hit=225
-> Hash (cost=13.13..13.13 rows=113 width=710) (actual time=0.114..0.114 rows=113 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 90kB
Buffers: shared hit=12
-> Seq Scan on brand br (cost=0.00..13.13 rows=113 width=710) (actual time=0.003..0.043 rows=113 loops=1)
Buffers: shared hit=12
Planning Time: 0.731 ms
Execution Time: 279.952 ms
(29 rows)

Your query is hard to follow for a few reasons:
Your tables are named tab, table2, table3, table4.
Your subquery parses JSON for every single row in the table, projects out some values, and then the outer query never uses those values. The only one that appears to be relevant is id.
Outer joins must be executed in order while inner joins can be freely re-arranged for performance. Without knowing the purpose of this query, it's impossible for me to determine if an outer join is appropriate or not.
The table names and column names in the execution plan do not match the query, so I'm not convinced this plan is accurate.
You did not supply a schema.
That said, I'll do my best.
Things that stand out performance-wise
No where clause
Since there is no where clause, your query will run jsonb_array_elements against every single row of tab, which is what is happening. Aside from extracting the data out of JSON and storing it into a separate column, I can't imagine much that could be done to optimize it.
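As a hedged illustration of that idea (the tab_elements name and the indexed column are hypothetical; everything else comes from the query in the question): unnest the array once into a plain table, so later queries can join on ordinary integer columns instead of re-parsing the JSON every time.
-- Hedged sketch: persist the unnested array elements as real integer columns.
CREATE TABLE tab_elements AS
SELECT tab.id,
       (e.elem ->> 'a')::integer AS a,
       (e.elem ->> 'b')::integer AS b,
       (e.elem ->> 'c')::integer AS c,
       (e.elem ->> 'd')::integer AS d,
       (e.elem ->> 'e')::integer AS e
FROM tab
CROSS JOIN LATERAL jsonb_array_elements(tab.doc -> 'arrayList') AS e(elem);

-- Index whichever extracted column drives the join, then refresh statistics.
CREATE INDEX ON tab_elements (a);
ANALYZE tab_elements;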
Insufficient indexes to power the joins
The query plan suggests there might be a meaningful cost to the joins. Except for table1, each join is driven by a sequential scan, which means reading every row of both tables. I suspect adding indexes on each table will help. It appears you are joining on id columns, so a simple primary key constraint will improve both your data integrity and query performance.
alter table tab add primary key (id);
alter table table2 add primary key (id);
alter table table3 add primary key (id);
alter table table4 add primary key (id);
Type conversions
This part of the execution plan shows a double type conversion in your first join:
Index Cond: (((elements.elem ->> 'field_id'::text))::integer = id)
This predicate means the value is extracted from the JSON element as text, then that text is converted to an integer so it can be matched against the id column. These conversions can be expensive in compute time and, in some cases, can prevent index usage. It's hard to advise on what to do because I don't know what the actual types are.
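If the underlying values really are integers, two hedged options (sketches only, not a guaranteed fix) are to store the extracted value in a real integer column, as mentioned above, or to index the cast expression so the planner can match it directly. For example, for the join on doc ->> 'secondary_id' visible in the plan:
-- Hedged sketch: assumes doc ->> 'secondary_id' always parses as an integer.
CREATE INDEX ON table_1 (((doc ->> 'secondary_id')::integer));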

Related

Improve Postgres performance

I am new to Postgres and am sure I'm doing something wrong.
So I just wondered if anybody had experienced something similar to my experiences below or could point me in the right direction to improve Postgres performance.
My initial goal was to speed up the analytical processing of my Datamarts in various Dashboards by moving from MS SQL Server to Postgres.
To get a sample query to compare speeds I ran query profiler on MS SQL Server whilst referencing a BI dashboard, which produced something similar to this (I know there are redundant columns in the sub query):
SELECT COUNT(*)
FROM (
SELECT
BM.Key_Date, BM.[Actual Date], BM.[Month]
,BM.[Month Number], BM.[Month Year], BM.[No of Working Days]
,SDI.Key_Delivery, SDI.[Order Number], SDI.[Quantity SKU]
,SDI.[Quantity Sales Unit], SDI.[FactSales - GBP], SDI.[NNSA Capsules]
,SFI.[Ship-to], SFI.[Sold-to], SFI.[Sales Force Type], SFI.Region
,SFI.[Top Level Account], SFI.[Customer Organisation]
,EX.Rate
,PDI.[Product Description], PDI.[Product Type Group], PDI.[Product Type],
PDI.[Main Product Categories], PDI.Section, PDI.Family
FROM Fact.SalesDataInvoiced AS SDI
JOIN Dimension.SalesforceInvoiced AS SFI
ON SDI.[Key_Ship-to]=SFI.[Key_Ship-to]
JOIN Dimension.BillingMonth AS BM
ON SDI.[Key_Billing Month]=BM.Key_Date
JOIN Dimension.ProductDataInvoiced AS PDI
ON SDI.[Key_Product Code]=PDI.[Key_Product Code]
CROSS JOIN Dimension.Exchange AS EX
WHERE BM.[Actual Date] BETWEEN '20160101' AND '20211001'
) AS a
GROUP BY [Product Type], [Product Type Group],[Main Product Categories]
I then installed Postgres 14 (on Centos 8) and MS SQL Server Developer 2017 (on windows 10) on separate identical laptops and created a Database and tables from the same csv data files to enable the replication of the above query.
Running the query on Postgres with indexes performs massively slower than on MS SQL Server without indexes.
Adding indexes to MS SQL produces results almost instantly.
Because of the difference in processing time I even installed Citus with Postgres14 and created Fact.SalesDataInvoiced as a columnar table (This made the processing time worse).
I have played about with memory settings in postgresql.conf but nothing seems to enable speeds comparable to MSSQL.
Explain Analyze shows that despite the indexes it always runs a sequential scan of all tables. Forcing indexed scans doesn't make any difference to processing time.
Would I be right in thinking Postgres would perform significantly better using a cluster and partitioning? Even if this is the case, surely a simple query like the one I'm trying to run on a standalone machine should be faster?
TABLE DETAILS
Dimension.BillingMonth
Records 120,
Primary Key is KeyDate,
Clustered Unique Index on KeyDate
Dimension.Exchange
Records 1
Dimension.ProductDataInvoiced
Records 275563,
Primary Key is KeyProduct,
Clustered Unique Index on KeyProduct
Dimension.SalesforceInvoiced
Records 377414,
Primary Key is KeyShipTo,
Clustered Unique Index on KeyShipTo
Fact.SalesDataInvoiced
Records 43807943,
Non-Clustered Unique Index on KeyShipTo, KeyProduct, KeyBillingMonth
Any help would be appreciated; as previously mentioned, I'm sure I must be missing something obvious.
Many thanks in advance.
David
Thank you for the responses. I have placed additional info below.
Forgot to add: my Postgres performance woes were after I'd carried out a VACUUM FULL and REINDEX. I performed these maintenance tasks after I had imported the data and created my indexes.
Output after querying pg_indexes
      tablename       |         indexname         |                                            indexdef
----------------------+---------------------------+---------------------------------------------------------------------------------------------------------------
 BillingMonth         | BillingMonth_pkey         | CREATE UNIQUE INDEX BillingMonth_pkey ON public.BillingMonth USING btree (KeyDate)
 ProductDataInvoiced  | ProductDataInvoiced_pkey  | CREATE UNIQUE INDEX ProductDataInvoiced_pkey ON public.ProductDataInvoiced USING btree (KeyProductCode)
 SalesforceInvoiced   | SalesforceInvoiced_pkey   | CREATE UNIQUE INDEX SalesforceInvoiced_pkey ON public.SalesforceInvoiced USING btree (KeyShipTo)
 SalesDataInvoiced    | CI_SalesData              | CREATE INDEX CI_SalesData ON public.SalesDataInvoiced USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Output After running EXPLAIN (ANALYZE, BUFFERS)
Finalize GroupAggregate (cost=1435439.30..1435565.71 rows=480 width=53) (actual time=25960.468..25973.326 rows=31 loops=1)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Buffers: shared hit=71246 read=859119
-> Gather Merge (cost=1435439.30..1435551.31 rows=960 width=53) (actual time=25960.458..25973.282 rows=89 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=71246 read=859119
-> Sort (cost=1434439.28..1434440.48 rows=480 width=53) (actual time=25956.982..25956.989 rows=30 loops=3)
Sort Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=71246 read=859119
Worker 0: Sort Method: quicksort Memory: 29kB
Worker 1: Sort Method: quicksort Memory: 29kB
-> Partial HashAggregate (cost=1434413.10..1434417.90 rows=480 width=53) (actual time=25956.878..25956.895 rows=30 loops=3)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Batches: 1 Memory Usage: 49kB
Buffers: shared hit=71230 read=859119
Worker 0: Batches: 1 Memory Usage: 49kB
Worker 1: Batches: 1 Memory Usage: 49kB
-> Parallel Hash Join (cost=62124.74..1327935.46 rows=10647764 width=45) (actual time=285.864..19240.004 rows=14602648 loops=3)
Hash Cond: (sdi."KeyShipTo" = sfi."KeyShipTo")
Buffers: shared hit=71230 read=859119
-> Hash Join (cost=19648.48..1257508.51 rows=10647764 width=49) (actual time=204.794..12862.063 rows=14602648 loops=3)
Hash Cond: (sdi."KeyProductCode" = pdi."KeyProductCode")
Buffers: shared hit=32264 read=859119
-> Hash Join (cost=3.67..1091456.95 rows=10647764 width=8) (actual time=0.143..7076.104 rows=14602648 loops=3)
Hash Cond: (sdi."KeyBillingMonth" = bm."KeyDate")
Buffers: shared hit=197 read=859119
-> Parallel Seq Scan on "SalesData_Invoiced" sdi (cost=0.00..1041846.10 rows=18253310 width=12) (actual
time=0.071..2585.596 rows=14602648 loops=3)
Buffers: shared hit=194 read=859119
-> Hash (cost=2.80..2.80 rows=70 width=4) (actual time=0.049..0.050 rows=70 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
Buffers: shared hit=3
-> Seq Scan on "BillingMonth" bm (cost=0.00..2.80 rows=70 width=4) (actual time=0.012..0.028
rows=70 loops=3)
Filter: (("ActualDate" >= '2016-01-01'::date) AND ("ActualDate" <= '2021-10-01'::date))
Rows Removed by Filter: 50
Buffers: shared hit=3
-> Hash (cost=16200.27..16200.27 rows=275563 width=49) (actual time=203.237..203.238 rows=275563 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 26832kB
Buffers: shared hit=32067
-> Nested Loop (cost=0.00..16200.27 rows=275563 width=49) (actual time=0.034..104.143 rows=275563 loops=3)
Buffers: shared hit=32067
-> Seq Scan on "Exchange" ex (cost=0.00..1.01 rows=1 width=0) (actual time=0.024..0.024 rows=
1 loops=3)
Buffers: shared hit=3
-> Seq Scan on "ProductData_Invoiced" pdi (cost=0.00..13443.63 rows=275563 width=49) (actual
time=0.007..48.176 rows=275563 loops=3)
Buffers: shared hit=32064
-> Parallel Hash (cost=40510.56..40510.56 rows=157256 width=4) (actual time=79.536..79.536 rows=125805 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 18912kB
Buffers: shared hit=38938
-> Parallel Seq Scan on "Salesforce_Invoiced" sfi (cost=0.00..40510.56 rows=157256 width=4) (actual time=
0.011..42.968 rows=125805 loops=3)
Buffers: shared hit=38938
Planning:
Buffers: shared hit=426
Planning Time: 1.936 ms
Execution Time: 25973.709 ms
(55 rows)
Firstly, remember to run VACUUM ANALYZE after rebuilding indexes, or sometimes after importing large amounts of data. (VACUUM FULL is mainly useful for the OS to reclaim disk space, and you'd still need to analyse afterwards, especially after rebuilding indexes.)
It seems from your query that your main table is SalesDataInvoiced (SDI) and that you'd want to use an index on KeyBillingMonth if possible (since it's the main restriction you're placing). In general, you'd also want indexes, at least on the other tables on the columns that are used for the joins.
As the documentation for multi-column indexes in PostgreSQL says:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a — but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.
In your example, the main column you'd want to use a constraint on (KeyBillingMonth) is in third position, so it's unlikely to be used.
CREATE INDEX CI_SalesData ON public.SalesDataInvoiced
USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Creating this should make it more likely to be used:
CREATE INDEX ON SalesDataInvoiced(KeyBillingMonth);
Then, run VACUUM ANALYZE and try your query again.
You may also want an index on BillingMonth(ActualDate), but that's not necessarily useful since there seem to be few rows (and most of them are returned in your query).
It's not clear what the BillingMonth table is for. If it's basically about truncating the ActualDate to have the first day of the month, you could for example get rid of the join on BillingMonth and use the constraint on SalesDataInvoiced.KeyBillingMonth directly. For example ... WHERE SDI.KeyBillingMonth BETWEEN '2016-01-01' AND '2021-10-01' ....
As a side note, as far as I know, BETWEEN is inclusive of its upper bound. I'd imagine a query like this is meant to produce monthly statistics, so it should probably not include just 2021-10-01 while excluding the rest of that month.
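Putting those two suggestions together, a hedged sketch of the rewrite (it assumes KeyBillingMonth is a date holding the first day of the month, and it uses the quoted table and column names visible in your execution plan; the half-open range sidesteps the BETWEEN upper-bound issue and includes all of October 2021, which may or may not be what you want):
SELECT pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories", count(*)
FROM "SalesData_Invoiced" sdi
JOIN "Salesforce_Invoiced" sfi ON sdi."KeyShipTo" = sfi."KeyShipTo"
JOIN "ProductData_Invoiced" pdi ON sdi."KeyProductCode" = pdi."KeyProductCode"
CROSS JOIN "Exchange" ex
WHERE sdi."KeyBillingMonth" >= DATE '2016-01-01'
  AND sdi."KeyBillingMonth" <  DATE '2021-11-01'
GROUP BY pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories";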

Postgresql. Optimize retrieving distinct values from large table

I have one de-normalized table with 40+ columns (~ 1.5 million rows, 1 Gb).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up query
SELECT DISTINCT division_name, division_id
FROM table
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the size of the table.
I have tried to create index:
create index idx1 on tbl1 (division_name, division_id)
But current execution plan:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
Sort Key: division_name
Sort Method: quicksort Memory: 74kB
-> HashAggregate (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
Group Key: division_name, division_id
-> Seq Scan on tbl1 (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestion why the index does not get used, or how I can speed up the query in some other way?
Server Postgresql 9.6.
p.s. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of that decision.
Update1
#a_horse_with_no_name suggested using vacuum analyze instead of analyze to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
-> Index Only Scan using idx1 on tbl1 (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
Heap Fetches: 0
Much better!
The index will probably only help if PostgreSQL chooses an “index only scan”, that means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
Other than that, there is no way to speed up such a query.
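A minimal illustration of that suggestion (table and index names are the ones from the question): vacuum the table so the visibility map is populated, then check whether the planner now picks an index-only scan.
VACUUM ANALYZE tbl1;

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;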

Indexes on join tables

When searching on Google for join table indexes, I got this question.
Now, I believe that it is giving some false information in the accepted answer, or I do not understand how everything works.
Given the following tables (running on PostGreSQL 9.4):
CREATE TABLE "albums" ("album_id" serial PRIMARY KEY, "album_name" text)
CREATE TABLE "artists" ("artist_id" serial PRIMARY KEY, "artist_name" text)
CREATE TABLE "albums_artists" ("album_id" integer REFERENCES "albums", "artist_id" integer REFERENCES "artists")
I was trying to replicate the scenario from the question mentioned above, by creating first an index on both of the columns of the albums_artists table and then one index for each column (without keeping the index on both columns).
I would have been expecting very different results when using the EXPLAIN command for a normal, traditional select like the following one:
SELECT "artists".* FROM "test"."artists"
INNER JOIN "test"."albums_artists" ON ("albums_artists"."artist_id" = "artists"."artist_id")
WHERE ("albums_artists"."album_id" = 1)
However, when actually running explain on it, I get exactly the same result for each of the cases (with one index on each column vs. one index on both columns).
I've been reading the PostgreSQL documentation about indexing, and it doesn't explain the results that I am getting:
Hash Join (cost=15.05..42.07 rows=11 width=36) (actual time=0.024..0.025 rows=1 loops=1)
Hash Cond: (artists.artist_id = albums_artists.artist_id)
-> Seq Scan on artists (cost=0.00..22.30 rows=1230 width=36) (actual time=0.006..0.006 rows=1 loops=1)
-> Hash (cost=14.91..14.91 rows=11 width=4) (actual time=0.009..0.009 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Bitmap Heap Scan on albums_artists (cost=4.24..14.91 rows=11 width=4) (actual time=0.008..0.009 rows=1 loops=1)
Recheck Cond: (album_id = 1)
Heap Blocks: exact=1
-> Bitmap Index Scan on albums_artists_album_id_index (cost=0.00..4.24 rows=11 width=0) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (album_id = 1)
I would expect not to get an Index Scan at the last step when using an index composed of 2 different columns (since I am only using one of them in the WHERE clause).
I was about to open a bug in an ORM library that adds one index for both columns for join tables, but now I am not so sure. Can anyone help me understand why is the behavior similar in the two cases and what would actually be the difference, if there is any?
add a NOT NULL constraint on the key columns (a tuple with NULLs would make no sense here)
add a PRIMARY KEY (forcing a UNIQUE index on the two keyfields)
As a support for FK lookups: add a compound index for the PK fields in reversed order
after creating/adding PKs and indexes, you may want to ANALYZE the table (only key columns have statistics)
CREATE TABLE albums_artists
( album_id integer NOT NULL REFERENCES albums (album_id)
, artist_id integer NOT NULL REFERENCES artists (artist_id)
, PRIMARY KEY (album_id, artist_id)
);
CREATE UNIQUE INDEX ON albums_artists (artist_id, album_id);
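And, per the ANALYZE note in the list above, a minimal follow-up once the key and index are in place (a sketch, using the table from the question):
-- Refresh planner statistics for the new key columns.
ANALYZE albums_artists;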
The reason behind the observed behaviour is the fact that the planner/optimiser is information-based, driven by heuristics. Without any information about the fraction of rows that will actually be needed given the conditions, or the fraction of rows that actually matches (in the case of a JOIN), the planner makes a guess (for example: 10% for a range query). For a small query, a hash join will always be a winning scenario: it does imply fetching all tuples from both tables, but the join itself is very efficient.
For columns that are part of a key or index, statistics will be collected, so the planner can make more realistic estimates about the number of rows involved. And that will often result in an indexed plan, since that could need fewer pages to be fetched.
Foreign keys are a very special case, since the planner knows that all the values from the referring table will be present in the referred table (that is 100%, assuming NOT NULL).

join tables on like with index

I have a table of URLs (domains and pages)
URLs
-----
url_id
url
I have a list of domain names, and I want to see which of them are contained in the URLs table.
so if I have a domain in my list:
http://stackoverflow.com
I want it to match the URLs.url record of:
https://stackoverflow.com/question/230479
https://stackoverflow.com/question/395872364
etc
The URL table is quite large, 10 million+ rows, and will grow.
The list of domain names I want to test will vary between 1-10k
Currently I am creating a temp table of the list of domains, then joining to the URLs table to find all URLs that match
SELECT * from URLs
JOIN tmp_table_domains ON URLs.url LIKE tmp_table_domains.domain || '%'
I have indexed URLs.url and tmp_table_domains.domain, with the thinking that indexing will work as the wildcard is on the right.
However, EXPLAIN ANALYSE doesn't show any index being used. An old post mentioned that Postgres 8.x cannot use an index for a LIKE join, but I could find nothing else to back this up, nor any alternatives, nor whether it applies to newer versions.
If it helps, my Postgres is 9.1. If upgrading will fix this, that is fine; the only reason I haven't upgraded is that there has been no reason to, as far as I am aware.
Edit_1
this is the first database project I have worked on and I am learning it all as I go along
I don't mind ripping out all of the above and using whatever works better, whether that is a temp table / array / better query
edit_2
GroupAggregate (cost=1429152.90..1435118.48 rows=340890 width=44) (actual time=157905.450..157905.609 rows=27 loops=1)
-> Sort (cost=1429152.90..1430005.13 rows=340890 width=44) (actual time=157905.425..157905.451 rows=29 loops=1)
Sort Key: task_items.task_item
Sort Method: quicksort Memory: 29kB
-> Nested Loop (cost=14210.95..1387337.41 rows=340890 width=44) (actual time=18216.187..157905.055 rows=29 loops=1)
Join Filter: ((task_items.task_item)::text ~~ ((tmp_domains.domain)::text || '%'::text))
-> Hash Join (cost=14210.95..194126.53 rows=14066 width=44) (actual time=452.262..7953.639 rows=13737 loops=1)
Hash Cond: (task_items.task_id = tasks.task_id)
-> Seq Scan on task_items (cost=0.00..170062.71 rows=2589924 width=48) (actual time=0.019..4480.360 rows=2575206 loops=1)
Filter: (task_item_status_id = 2)
-> Hash (cost=14205.68..14205.68 rows=421 width=4) (actual time=440.409..440.409 rows=171 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 7kB
-> Seq Scan on tasks (cost=0.00..14205.68 rows=421 width=4) (actual time=101.491..439.821 rows=171 loops=1)
Filter: ((account_detail_id = 695) AND (base_action_type_id <> ALL ('{1,3,4}'::integer[])))
-> Materialize (cost=0.00..109.70 rows=4847 width=32) (actual time=0.002..4.924 rows=4536 loops=13737)
-> Seq Scan on tmp_domains (cost=0.00..85.47 rows=4847 width=32) (actual time=0.010..5.851 rows=4536 loops=1)
Total runtime: 157907.403 ms
The actual query is a bit different to the simplified explanation above.
task_items has just under 7million rows
and the tmp_domains has 4,500
tl;dr
so to summarise: what is the best way to partially match a list of strings against a column?
A few months back Peter Eisentraut published the pguri extension, which can greatly simplify your work. It is currently only available as source code, so you'd have to build the library (which is very easy on any Linux box), place the files in the PG installation directory, and finally CREATE EXTENSION in your database. After that you can do simple queries like:
SELECT *
FROM urls
JOIN tmp_table_domains d ON uri_host(d.domain::uri) = uri_host(urls.url::uri);
Note that this will also match between different schemes, so an http:// domain will match the corresponding https:// url. If you do not want that, then also join on uri_scheme() for both domain and url.
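If you do want scheme-exact matches, a hedged variant of the join above using the uri_scheme() function mentioned here would be:
SELECT *
FROM urls
JOIN tmp_table_domains d
  ON uri_host(d.domain::uri) = uri_host(urls.url::uri)
 AND uri_scheme(d.domain::uri) = uri_scheme(urls.url::uri);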
Indexes will work on the text data type that the functions of the extension return. If your database uses UTF-8 encoding, you should create your index somewhat like this:
CREATE INDEX url_index ON urls (uri_host(url::uri) text_pattern_ops);
And then also for your domain names table.
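For example, a hedged sketch of the matching index on the domains side (the index name is illustrative, assuming tmp_table_domains.domain holds full URIs such as 'http://stackoverflow.com'):
CREATE INDEX domain_host_index ON tmp_table_domains (uri_host(domain::uri) text_pattern_ops);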
You can ALTER TABLE urls ALTER COLUMN url SET DATA TYPE uri so you can forego the casts.

Postgres 9.4: How to fix Query Planner's Choice of Hash Join in ANY ARRAY lookup which runs 10x slower

I realize of course that figuring out these issues can be complex and require lots of info but I'm hoping there is a known issue or workaround for this particular case. I've narrowed down the change in the query that causes the sub-optimal query plan (this is running Postgres 9.4).
The following query runs in about 50ms. The tag_device table is a junction table with ~2 million entries, the devices table has about 1.5 million entries and the tags table has about 500,000 entries (Note: the actual IP values are just made up).
WITH inner_query AS (
SELECT * FROM tag_device
INNER JOIN tags ON tag_device.tag_id = tags.id
INNER JOIN devices ON tag_device.device_id = devices.id
WHERE devices.device_ip <<= ANY(ARRAY[
'10.0.0.1', '10.0.0.2', '10.0.0.5', '11.1.1.1', '12.2.2.35','13.0.0.1', '15.0.0.8', '1.160.0.1', '17.1.1.24', '18.2.2.1',
'10.0.0.6', '10.0.0.21', '10.0.0.52', '11.1.1.2', '12.2.2.34','13.0.0.2', '15.0.0.7', '1.160.0.2', '17.1.1.23', '18.2.2.2',
'10.0.0.7', '10.0.0.22', '10.0.0.53', '11.1.1.3', '12.2.2.33','13.0.0.3', '15.0.0.6', '1.160.0.3', '17.1.1.22', '18.2.2.3'
]::iprange[])
)
SELECT * FROM inner_query LIMIT 100 OFFSET 0;
A few things to note. device_ip is using the ip4r module (https://github.com/RhodiumToad/ip4r) to provide ip range lookups and this column has a gist index on it. The above query runs in about 50ms using the following query plan:
Limit (cost=140367.19..140369.19 rows=100 width=239)
CTE inner_query
-> Nested Loop (cost=40147.63..140367.19 rows=56193 width=431)
-> Merge Join (cost=40147.20..113345.15 rows=56193 width=261)
Merge Cond: (tag_device.device_id = devices.id)
-> Index Scan using tag_device_device_id_idx on tag_device (cost=0.43..67481.36 rows=1900408 width=51)
-> Materialize (cost=40136.82..40402.96 rows=53228 width=210)
-> Sort (cost=40136.82..40269.89 rows=53228 width=210)
Sort Key: devices.id
-> Bitmap Heap Scan on devices (cost=1489.12..30498.45 rows=53228 width=210)
Recheck Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2.2.2,10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2.2.2 (...)
-> Bitmap Index Scan on devices_iprange_idx (cost=0.00..1475.81 rows=53228 width=0)
Index Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2.2.2,10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.2,13.0.0.1,15.0.0.2,1.160.0.5,17.1.1.1,18.2 (...)
-> Index Scan using tags_id_pkey on tags (cost=0.42..0.47 rows=1 width=170)
Index Cond: (id = tag_device.tag_id)
-> CTE Scan on inner_query (cost=0.00..1123.86 rows=56193 width=239)
If I increase the number of IP addresses in the ARRAY being looked up, then the query plan changes and becomes drastically slower. So in the fast version of the query there are 30 items in the array. If I increase this to 80 items in the array, then the query plan changes and becomes significantly slower (over 10x). The query remains the same in all other ways. The new query plan makes use of hash joins instead of merge joins and nested loops. Here is the new (much slower) query plan for when the array has 80 items in it instead of 30.
Limit (cost=204482.39..204484.39 rows=100 width=239)
CTE inner_query
-> Hash Join (cost=85839.13..204482.39 rows=146180 width=431)
Hash Cond: (tag_device.tag_id = tags.id)
-> Hash Join (cost=51368.40..145023.34 rows=146180 width=261)
Hash Cond: (tag_device.device_id = devices.id)
-> Seq Scan on tag_device (cost=0.00..36765.08 rows=1900408 width=51)
-> Hash (cost=45580.57..45580.57 rows=138466 width=210)
-> Bitmap Heap Scan on devices (cost=3868.31..45580.57 rows=138466 width=210)
Recheck Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.35,13.0.0.1,15.0.0.8,1.160.0.1,17.1.1.24,18.2.2.1,10.0.0.6,10.0.0.21,10.0.0.52,11.1.1.2,12.2.2.34,13.0.0.2,15.0.0.7,1.160.0.2,17.1.1.23,18.2.2.2 (...)
-> Bitmap Index Scan on devices_iprange_idx (cost=0.00..3833.70 rows=138466 width=0)
Index Cond: (device_ip <<= ANY ('{10.0.0.1,10.0.0.2,10.0.0.5,11.1.1.1,12.2.2.35,13.0.0.1,15.0.0.8,1.160.0.1,17.1.1.24,18.2.2.1,10.0.0.6,10.0.0.21,10.0.0.52,11.1.1.2,12.2.2.34,13.0.0.2,15.0.0.7,1.160.0.2,17.1.1.23,18.2 (...)
-> Hash (cost=16928.88..16928.88 rows=475188 width=170)
-> Seq Scan on tags (cost=0.00..16928.88 rows=475188 width=170)
-> CTE Scan on inner_query (cost=0.00..2923.60 rows=146180 width=239)
The above query with its default query plan runs in about 500ms (over 10 times slower). If I turn off hash joins using SET enable_hashjoin = OFF; then the query plan goes back to using merge joins and runs in ~50ms again with 80 items in the array.
Again the only change here is the number of items in the ARRAY that are being looked up.
Does anyone have any thoughts on why the planner is making the poor choice that results in the massive slow down?
The database fits into memory completely and is on SSDs.
I also want to point out that I'm using a CTE because I ran into an issue where the planner would not use the index on the tag_device table when I added in the limit to the query. Basically the issue described here: http://thebuild.com/blog/2014/11/18/when-limit-attacks/.
Thanks!
I see that there is a sort as part of the merge join. Once you get past a certain threshold, the sort operation needed to do the merge join is deemed to be too expensive, and a hash join is estimated to be cheaper. It may be more expensive (time-wise) but cheaper in terms of CPU consumption to run the query this way.
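One hedged way to see that trade-off for yourself is to compare the two plans in a throwaway transaction, as you already did with enable_hashjoin; SET LOCAL keeps the change scoped to the transaction (the IP array is abbreviated here, substitute your full list):
BEGIN;
SET LOCAL enable_hashjoin = off;   -- planner falls back to merge/nested-loop joins
EXPLAIN (ANALYZE, BUFFERS)
WITH inner_query AS (
    SELECT *
    FROM tag_device
    INNER JOIN tags ON tag_device.tag_id = tags.id
    INNER JOIN devices ON tag_device.device_id = devices.id
    WHERE devices.device_ip <<= ANY (ARRAY['10.0.0.1', '10.0.0.2']::iprange[])
)
SELECT * FROM inner_query LIMIT 100 OFFSET 0;
ROLLBACK;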