join tables on like with index - postgresql

I have a table of URLs (domains and pages)
URLs
-----
url_id
url
I have a list of domain names, and I want to see which of them appear in the URLs table.
So if I have a domain in my list:
http://stackoverflow.com
I want it to match the URLs.url record of:
https://stackoverflow.com/question/230479
https://stackoverflow.com/question/395872364
etc
The URLs table is quite large, 10 million+ rows, and will grow.
The list of domain names I want to test will vary between 1k and 10k.
Currently I am creating a temp table of the list of domains, then joining to the URLs table to find all URLs that match
SELECT * from URLs
JOIN tmp_table_domains on URLs.url like tmp_table_domains.domain || '%'
I have indexed URLs.url and tmp_table_domains.domain, thinking that the index would be usable since the wildcard is on the right.
However, EXPLAIN ANALYSE doesn't show any index being used. An old post mentioned that Postgres 8.x cannot use an index for a LIKE join, but I could find nothing else to back this up, any alternatives, or whether it still applies to newer versions.
If it helps, my Postgres is 9.1. If upgrading will fix this, that is fine; the only reason I haven't upgraded is that there has been no reason to, as far as I am aware.
Edit_1
This is the first database project I have worked on and I am learning it all as I go along.
I don't mind ripping out all of the above and using whatever works better, whether that is a temp table / array / better query
edit_2
GroupAggregate (cost=1429152.90..1435118.48 rows=340890 width=44) (actual time=157905.450..157905.609 rows=27 loops=1)
-> Sort (cost=1429152.90..1430005.13 rows=340890 width=44) (actual time=157905.425..157905.451 rows=29 loops=1)
Sort Key: task_items.task_item
Sort Method: quicksort Memory: 29kB
-> Nested Loop (cost=14210.95..1387337.41 rows=340890 width=44) (actual time=18216.187..157905.055 rows=29 loops=1)
Join Filter: ((task_items.task_item)::text ~~ ((tmp_domains.domain)::text || '%'::text))
-> Hash Join (cost=14210.95..194126.53 rows=14066 width=44) (actual time=452.262..7953.639 rows=13737 loops=1)
Hash Cond: (task_items.task_id = tasks.task_id)
-> Seq Scan on task_items (cost=0.00..170062.71 rows=2589924 width=48) (actual time=0.019..4480.360 rows=2575206 loops=1)
Filter: (task_item_status_id = 2)
-> Hash (cost=14205.68..14205.68 rows=421 width=4) (actual time=440.409..440.409 rows=171 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 7kB
-> Seq Scan on tasks (cost=0.00..14205.68 rows=421 width=4) (actual time=101.491..439.821 rows=171 loops=1)
Filter: ((account_detail_id = 695) AND (base_action_type_id <> ALL ('{1,3,4}'::integer[])))
-> Materialize (cost=0.00..109.70 rows=4847 width=32) (actual time=0.002..4.924 rows=4536 loops=13737)
-> Seq Scan on tmp_domains (cost=0.00..85.47 rows=4847 width=32) (actual time=0.010..5.851 rows=4536 loops=1)
Total runtime: 157907.403 ms
The actual query is a bit different to the simplified explanation above.
task_items has just under 7million rows
and the tmp_domains has 4,500
tl;dr
So, to summarise: what is the best way to partially match a list of strings against a column?

A few months back Peter Eisentraut published the pguri extension, which can greatly simplify your work. It is currently distributed only as source code, so you'd have to build the library (which is very easy on any Linux box), place the files in the PG installation directory, and finally CREATE EXTENSION in your database. After that you can do simple queries like:
SELECT *
FROM urls
JOIN tmp_table_domains d ON uri_host(d.domain::uri) = uri_host(urls.url::uri);
Note that this will also match between different schemes, so an http:// domain will match the corresponding https:// url. If you do not want that, then also join on uri_scheme() for both domain and url.
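For example, a minimal sketch of that scheme-aware variant, still assuming the pguri functions used above:
SELECT *
FROM urls
JOIN tmp_table_domains d
  ON uri_host(d.domain::uri) = uri_host(urls.url::uri)
 AND uri_scheme(d.domain::uri) = uri_scheme(urls.url::uri);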
Indexes will work on the text data type that the extension's functions return. If your database uses UTF-8 encoding, you should create your index somewhat like this:
CREATE INDEX url_index ON urls (uri_host(url::uri) text_pattern_ops);
And then also for your domain names table.
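For instance, a sketch of the matching index on the temp table from the question (the index name is made up):
CREATE INDEX domain_host_index ON tmp_table_domains (uri_host(domain::uri) text_pattern_ops);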
You can ALTER TABLE urls ALTER COLUMN url SET DATA TYPE uri so you can forgo the casts.
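A sketch of that conversion - the USING clause is an assumption, for the case where there is no implicit cast from your current column type to uri:
ALTER TABLE urls ALTER COLUMN url SET DATA TYPE uri USING url::uri;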

Related

How to use ts_headline() in PostgreSQL while doing efficient full-text search? Comparing two query plans

I am experimenting with a full-text search system over my PostgreSQL database, where I am using tsvectors with ts_rank() to pull out items relevant to a user search query. In general this works fantastically well as a simple solution (i.e. no major infrastructure overhead). However, I am finding that the ts_headline() component (which gives users context for the relevance of the search results) is slowing down my queries significantly, by a factor of about 10x. I wanted to ask what the best way is to use ts_headline() without incurring this computational expense.
To give an example, here is a very fast tsvector search that does not use ts_headline(). For context, my table has two relevant fields, search_text which has the natural-language text which is being searched against, and search_text_tsv which is a tsvector that is directly queried against (and also used to rank the item). When I use ts_headline(), it references the main search_text field in order to produce a user-readable headline. Further, the column search_text_tsv is indexed using GIN, which provides very fast lookups for @@ websearch_to_tsquery('my query here').
Again, here is query #1:
SELECT
item_id,
title,
author,
search_text,
ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20
This gives me 20 top results very fast, running on my laptop about 50ms.
Now, query #2 uses ts_headline() to produce a user-readable headline. I found that this was very slow when it ran against all possible search results, so I used a sub-query to produce the top 20 results and then calculated the ts_headline() only for those top results (as opposed to, say, 1000 possible results).
SELECT *,
ts_headline(search_text,websearch_to_tsquery(unaccent('my query text here')),'StartSel=<b>,StopSel=</b>,MaxFragments=2,' || 'FragmentDelimiter=...,MaxWords=10,MinWords=1') AS headline
FROM (
SELECT
item_id,
title,
author,
search_text,
ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) as rank
FROM search_index
WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
ORDER BY rank DESC
LIMIT 20 OFFSET 20) as foo
Basically, this limits the number of results (as in the first query), then uses that as a sub-query, returning all of the columns of the subquery (i.e. *) plus the ts_headline() calculation. However, this is very slow, about an order of magnitude slower, coming in at around 800ms on my laptop.
Is there anything I can do to speed up ts_headline()? It seems pretty clear that this is what is slowing down the second query.
For reference, here are the query plans being produced by Postgresql (from EXPLAIN ANALYZE):
Query plan 1: (straight full-text search)
Limit (cost=56.79..56.79 rows=1 width=270) (actual time=66.118..66.125 rows=20 loops=1)
-> Sort (cost=56.78..56.79 rows=1 width=270) (actual time=66.113..66.120 rows=40 loops=1)
Sort Key: (ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
Sort Method: top-N heapsort Memory: 34kB
-> Bitmap Heap Scan on search_index (cost=52.25..56.77 rows=1 width=270) (actual time=1.070..65.641 rows=462 loops=1)
Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Heap Blocks: exact=424
-> Bitmap Index Scan on idx_fts_search (cost=0.00..52.25 rows=1 width=0) (actual time=0.966..0.966 rows=462 loops=1)
Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Planning Time: 0.182 ms
Execution Time: 66.154 ms
Query plan 2: (full text search w/ subquery & ts_headline())
Subquery Scan on foo (cost=56.79..57.31 rows=1 width=302) (actual time=116.424..881.617 rows=20 loops=1)
-> Limit (cost=56.79..56.79 rows=1 width=270) (actual time=62.470..62.497 rows=20 loops=1)
-> Sort (cost=56.78..56.79 rows=1 width=270) (actual time=62.466..62.484 rows=40 loops=1)
Sort Key: (ts_rank(search_index.search_text_tsv, websearch_to_tsquery(unaccent('my search query here'::text)), 1)) DESC
Sort Method: top-N heapsort Memory: 34kB
-> Bitmap Heap Scan on search_index (cost=52.25..56.77 rows=1 width=270) (actual time=2.378..62.151 rows=462 loops=1)
Recheck Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Heap Blocks: exact=424
-> Bitmap Index Scan on idx_fts_search (cost=0.00..52.25 rows=1 width=0) (actual time=2.154..2.154 rows=462 loops=1)
Index Cond: (search_text_tsv @@ websearch_to_tsquery(unaccent('my search query here'::text)))
Planning Time: 0.350 ms
Execution Time: 881.702 ms
Just encountered exactly the same issue. When collecting a list of search results (20-30 documents) and also getting their ts_headline highlight in the same query the execution time was x10 at least.
To be fair, the Postgres documentation warns about that [1]:
ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.
I ended up getting the list of documents first and then loading the highlights with ts_headline asynchronously, one by one. Individual queries are still slow (>150ms), but it is a better user experience than waiting multiple seconds for the initial load.
[1] https://www.postgresql.org/docs/15/textsearch-controls.html#TEXTSEARCH-HEADLINE
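For reference, the per-document headline query can be as simple as the following sketch, reusing the column names and headline options from the question (123 stands in for one item_id from the already-fetched result list):
SELECT ts_headline(search_text,
                   websearch_to_tsquery(unaccent('my query text here')),
                   'StartSel=<b>,StopSel=</b>,MaxFragments=2,FragmentDelimiter=...,MaxWords=10,MinWords=1') AS headline
FROM search_index
WHERE item_id = 123;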
I think I can buy you a few more milliseconds. In your query, you're returning "SELECT *, ts_headline" which includes the full original document search_text in the return. When I limited my SELECT to everything but the "search_text" from the subquery (+ ts_headline as headline), my queries dropped from 500-800ms to 100-400ms. I'm also using AWS RDS so that might play a role on my end.
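In other words, something along these lines (a sketch: only the outer SELECT list changes compared to query #2; the subquery still selects search_text so ts_headline() can read it, but it is no longer returned to the client):
SELECT item_id, title, author, rank,
       ts_headline(search_text, websearch_to_tsquery(unaccent('my query text here')),
                   'StartSel=<b>,StopSel=</b>,MaxFragments=2,FragmentDelimiter=...,MaxWords=10,MinWords=1') AS headline
FROM (
  SELECT item_id, title, author, search_text,
         ts_rank(search_text_tsv, websearch_to_tsquery(unaccent('my query text here')), 1) AS rank
  FROM search_index
  WHERE search_text_tsv @@ websearch_to_tsquery(unaccent('my query text here'))
  ORDER BY rank DESC
  LIMIT 20 OFFSET 20) AS foo;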

Improve Postgres performance

I am new to Postgres and I'm sure I'm doing something wrong.
So I just wondered if anybody had experienced something similar to my experiences below or could point me in the right direction to improve Postgres performance.
My initial goal was to speed up the analytical processing of my Datamarts in various Dashboards by moving from MS SQL Server to Postgres.
To get a sample query to compare speeds I ran query profiler on MS SQL Server whilst referencing a BI dashboard, which produced something similar to this (I know there are redundant columns in the sub query):
SELECT COUNT(*)
FROM (
SELECT
BM.Key_Date, BM.[Actual Date], BM.[Month]
,BM.[Month Number], BM.[Month Year], BM.[No of Working Days]
,SDI.Key_Delivery, SDI.[Order Number], SDI.[Quantity SKU]
,SDI.[Quantity Sales Unit], SDI.[FactSales - GBP], SDI.[NNSA Capsules]
,SFI.[Ship-to], SFI.[Sold-to], SFI.[Sales Force Type], SFI.Region
,SFI.[Top Level Account], SFI.[Customer Organisation]
,EX.Rate
,PDI.[Product Description], PDI.[Product Type Group], PDI.[Product Type],
PDI.[Main Product Categories], PDI.Section, PDI.Family
FROM Fact.SalesDataInvoiced AS SDI
JOIN Dimension.SalesforceInvoiced AS SFI
ON SDI.[Key_Ship-to]=SFI.[Key_Ship-to]
JOIN Dimension.BillingMonth AS BM
ON SDI.[Key_Billing Month]=BM.Key_Date
JOIN Dimension.ProductDataInvoiced AS PDI
ON SDI.[Key_Product Code]=PDI.[Key_Product Code]
CROSS JOIN Dimension.Exchange AS EX
WHERE BM.[Actual Date] BETWEEN '20160101' AND '20211001'
) AS a
GROUP BY [Product Type], [Product Type Group],[Main Product Categories]
I then installed Postgres 14 (on Centos 8) and MS SQL Server Developer 2017 (on windows 10) on separate identical laptops and created a Database and tables from the same csv data files to enable the replication of the above query.
Running the query on Postgres with indexes is massively slower than on MS SQL Server without indexes.
Adding indexes to MS SQL produces results almost instantly.
Because of the difference in processing time I even installed Citus with Postgres 14 and created Fact.SalesDataInvoiced as a columnar table (this made the processing time worse).
I have played about with memory settings in postgresql.conf but nothing seems to enable speeds comparable to MSSQL.
Explain Analyze shows that despite the indexes it always runs a sequential scan of all tables. Forcing indexed scans doesn't make any difference to processing time.
Would I be right in thinking Postgres would perform significantly better using a cluster and partitioning? Even if this is the case, surely a simple query like the one I'm trying to run should be faster on a standalone machine?
TABLE DETAILS
Dimension.BillingMonth
Records 120,
Primary Key is KeyDate,
Clustered Unique Index on KeyDate
Dimension.Exchange
Records 1
Dimension.ProductDataInvoiced
Records 275563,
Primary Key is KeyProduct,
Clustered Unique Index on KeyProduct
Dimension.SalesforceInvoiced
Records 377414,
Primary Key is KeyShipTo,
Clustered Unique Index on KeyShipTo
Fact.SalesDataInvoiced
Records 43807943,
Non-Clustered Unique Index on KeyShipTo, KeyProduct, KeyBillingMonth
Any help would be appreciated; as previously mentioned, I'm sure I must be missing something obvious.
Many thanks in advance.
David
Thank you for the responses. I have placed additional info below.
I forgot to add that my Postgres performance woes occurred after I'd carried out a VACUUM FULL and REINDEX. I performed these maintenance tasks after I had imported the data and created my indexes.
Output after querying pg_indexes:
 tablename           | indexname                | indexdef
---------------------+--------------------------+----------------------------------------------------------------------------------------------------------------
 BillingMonth        | BillingMonth_pkey        | CREATE UNIQUE INDEX BillingMonth_pkey ON public.BillingMonth USING btree (KeyDate)
 ProductDataInvoiced | ProductDataInvoiced_pkey | CREATE UNIQUE INDEX ProductDataInvoiced_pkey ON public.ProductDataInvoiced USING btree (KeyProductCode)
 SalesforceInvoiced  | SalesforceInvoiced_pkey  | CREATE UNIQUE INDEX SalesforceInvoiced_pkey ON public.SalesforceInvoiced USING btree (KeyShipTo)
 SalesDataInvoiced   | CI_SalesData             | CREATE INDEX CI_SalesData ON public.SalesDataInvoiced USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Output After running EXPLAIN (ANALYZE, BUFFERS)
Finalize GroupAggregate (cost=1435439.30..1435565.71 rows=480 width=53) (actual time=25960.468..25973.326 rows=31 loops=1)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Buffers: shared hit=71246 read=859119
-> Gather Merge (cost=1435439.30..1435551.31 rows=960 width=53) (actual time=25960.458..25973.282 rows=89 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=71246 read=859119
-> Sort (cost=1434439.28..1434440.48 rows=480 width=53) (actual time=25956.982..25956.989 rows=30 loops=3)
Sort Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=71246 read=859119
Worker 0: Sort Method: quicksort Memory: 29kB
Worker 1: Sort Method: quicksort Memory: 29kB
-> Partial HashAggregate (cost=1434413.10..1434417.90 rows=480 width=53) (actual time=25956.878..25956.895 rows=30 loops=3)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Batches: 1 Memory Usage: 49kB
Buffers: shared hit=71230 read=859119
Worker 0: Batches: 1 Memory Usage: 49kB
Worker 1: Batches: 1 Memory Usage: 49kB
-> Parallel Hash Join (cost=62124.74..1327935.46 rows=10647764 width=45) (actual time=285.864..19240.004 rows=14602648 loops=3)
Hash Cond: (sdi."KeyShipTo" = sfi."KeyShipTo")
Buffers: shared hit=71230 read=859119
-> Hash Join (cost=19648.48..1257508.51 rows=10647764 width=49) (actual time=204.794..12862.063 rows=14602648 loops=3)
Hash Cond: (sdi."KeyProductCode" = pdi."KeyProductCode")
Buffers: shared hit=32264 read=859119
-> Hash Join (cost=3.67..1091456.95 rows=10647764 width=8) (actual time=0.143..7076.104 rows=14602648 loops=3)
Hash Cond: (sdi."KeyBillingMonth" = bm."KeyDate")
Buffers: shared hit=197 read=859119
-> Parallel Seq Scan on "SalesData_Invoiced" sdi (cost=0.00..1041846.10 rows=18253310 width=12) (actual
time=0.071..2585.596 rows=14602648 loops=3)
Buffers: shared hit=194 read=859119
-> Hash (cost=2.80..2.80 rows=70 width=4) (actual time=0.049..0.050 rows=70 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
Buffers: shared hit=3
-> Seq Scan on "BillingMonth" bm (cost=0.00..2.80 rows=70 width=4) (actual time=0.012..0.028
rows=70 loops=3)
Filter: (("ActualDate" >= '2016-01-01'::date) AND ("ActualDate" <= '2021-10-01'::date))
Rows Removed by Filter: 50
Buffers: shared hit=3
-> Hash (cost=16200.27..16200.27 rows=275563 width=49) (actual time=203.237..203.238 rows=275563 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 26832kB
Buffers: shared hit=32067
-> Nested Loop (cost=0.00..16200.27 rows=275563 width=49) (actual time=0.034..104.143 rows=275563 loops=3)
Buffers: shared hit=32067
-> Seq Scan on "Exchange" ex (cost=0.00..1.01 rows=1 width=0) (actual time=0.024..0.024 rows=
1 loops=3)
Buffers: shared hit=3
-> Seq Scan on "ProductData_Invoiced" pdi (cost=0.00..13443.63 rows=275563 width=49) (actual
time=0.007..48.176 rows=275563 loops=3)
Buffers: shared hit=32064
-> Parallel Hash (cost=40510.56..40510.56 rows=157256 width=4) (actual time=79.536..79.536 rows=125805 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 18912kB
Buffers: shared hit=38938
-> Parallel Seq Scan on "Salesforce_Invoiced" sfi (cost=0.00..40510.56 rows=157256 width=4) (actual time=
0.011..42.968 rows=125805 loops=3)
Buffers: shared hit=38938
Planning:
Buffers: shared hit=426
Planning Time: 1.936 ms
Execution Time: 25973.709 ms
(55 rows)
Firstly, remember to run VACUUM ANALYZE after rebuilding indexes, or sometimes after importing a large amount of data. (VACUUM FULL is mainly useful for letting the OS reclaim disk space, and you'd still need to analyse afterwards, especially after rebuilding indexes.)
It seems from your query that your main table is SalesDataInvoiced (SDI) and that you'd want to use an index on KeyBillingMonth if possible (since it's the main restriction you're placing). In general, you'd also want indexes on the other tables, at least on the columns that are used for the joins.
As the documentation for multi-column indexes in PostgreSQL says:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a — but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.
In your example, the main column you'd want to use a constraint on (KeyBillingMonth) is in third position, so it's unlikely to be used.
CREATE INDEX CI_SalesData ON public.SalesDataInvoiced
USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Creating this should make it more likely to be used:
CREATE INDEX ON SalesDataInvoiced(KeyBillingMonth);
Then, run VACUUM ANALYZE and try your query again.
You may also want an index on BillingMonth(ActualDate), but that's not necessarily useful since there seems to be few rows (and most of them are returned in your query).
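If you do want to try it, that index would simply be (identifiers quoted to match the mixed-case names shown in your plan):
CREATE INDEX ON "BillingMonth" ("ActualDate");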
It's not clear what the BillingMonth table is for. If it's basically about truncating the ActualDate to have the first day of the month, you could for example get rid of the join on BillingMonth and use the constraint on SalesDataInvoiced.KeyBillingMonth directly. For example ... WHERE SDI.KeyBillingMonth BETWEEN '2016-01-01' AND '2021-10-01' ....
As a side-note, as far as I know, BETWEEN is inclusive for its upper bound. I'd imagine a query like this is meant to represent some monthly statistics, so it should probably not include what's on 2021-10-01 by itself while excluding the rest of that month.
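For example, with a half-open range that covers January 2016 through September 2021 and leaves out the incomplete month entirely (assuming KeyBillingMonth holds the first day of each month, as discussed above): ... WHERE SDI.KeyBillingMonth >= '2016-01-01' AND SDI.KeyBillingMonth < '2021-10-01' ...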

Very slow query planning time with many indexes

I have a table "nodes" with a JSONB-column "data", in which I store various types of information.
The JSON includes pieces of text in different languages, that need to be frequently searched on by end-users. Per language, I therefore create about 4 indices similar to the following (usually with a separate search dictionary for that language)
CREATE INDEX nodes_label_sv_idx
ON nodes
USING GIN (to_tsvector('swedish_text', data #>> '{label,sv}'));
This works fine when only 2 or 3 languages are present, but when adding 20 more languages (each with 4 indices for that language's path into the JSON), the query planner becomes extremely slow for some queries (180 ms), even though those queries are still executing very fast (less than 1ms). The table currently contains about 50K records.
Weird thing is, those queries are simple joins on other columns of the table (unrelated to the "data" column), so the language-related indices are completely irrelevant. Also, the more language-related indices I drop, the faster the planner becomes again.
I completely understand that the planner needs to check all (150+) indices for potential relevance, but 180ms is extreme. Anybody have a suggestion? By the way, the problem only seems to occur when using a view (not when directly using the query underlying the view).
I am using PostgreSQL 13 on Mac & Linux.
Edit:
query:
EXPLAIN (ANALYZE, BUFFERS)
select 1
from ca_nodes can
where (can.owner_id = 168 or can.aco_id = 0)
limit 1;
underlying view:
CREATE VIEW ca_nodes AS
SELECT n.nid, n.owner_id, c.aco_id
FROM nodes n inner join acos c on n.nid = c.nid;
explain (analyze, buffers) output:
Limit (cost=0.58..32.45 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
Buffers: shared hit=6
-> Merge Join (cost=0.58..15136.78 rows=475 width=4) (actual time=0.037..0.037 rows=1 loops=1)
Merge Cond: (n.nid = c.nid)
Join Filter: ((n.owner_id = 168) OR (c.aco_id = 0))
Buffers: shared hit=6
-> Index Scan using nodes_pkey on nodes n (cost=0.29..12094.35 rows=47604 width=8) (actual time=0.017..0.017 rows=1 loops=1)
Buffers: shared hit=3
-> Index Scan using acos_nid_idx on precalculated_acos c (cost=0.29..2090.35 rows=47604 width=8) (actual time=0.014..0.014 rows=1 loops=1)
Buffers: shared hit=3
Planning:
Buffers: shared hit=83
Planning Time: 180.392 ms
Execution Time: 0.079 ms

Postgres JSONB Retrieval is very slow

I'm at a bit of a loss here. First and foremost, I'm not a DBA, nor do I really have any experience with Postgres beyond what I'm doing now.
Postgres seems to choke when you want to return jsonb documents in general for anything more than a couple of hundred rows. When you try to return thousands, query performance becomes terrible. If you go even further and attempt to return multiple jsonb documents following various table joins, forget it.
Here is my scenario with some code:
I have 3 tables - all of them have jsonb documents with complex models, and 2 of them are sizeable (8 to 12kb uncompressed). In this particular operation I need to unnest a jsonb array of elements to then work through - this gives me roughly 12k records.
Each record then contains an ID that I use to join another important table - I need to retrieve the jsonb doc from this table. From there, I need to join that table on to another (much smaller) table and also pull the doc from there based on another key.
The output is therefore several columns + 3 jsonb documents ranging from <1kb to around 12kb uncompressed.
Query data retrieval is effectively pointless - I've yet to see the query return data. As soon as I strip away the json doc columns, naturally the query speeds up to seconds or less. Returning 1 jsonb document bumps the retrieval to 40 seconds in my case, adding a second one takes us to 2 minutes, and adding the third takes much longer.
What am I doing wrong? Is there any way to retrieve the jsonb documents in a performant way?
SELECT x.id,
a.doc1,
b.doc2,
c.doc3
FROM ( SELECT id,
(elements.elem ->> 'a'::text)::integer AS a,
(elements.elem ->> 'b'::text)::integer AS b,
(elements.elem ->> 'c'::text)::integer AS c,
(elements.elem ->> 'd'::text)::integer AS d,
(elements.elem ->> 'e'::text)::integer AS e
FROM tab
CROSS JOIN LATERAL jsonb_array_elements(tab.doc -> 'arrayList'::text) WITH ORDINALITY elements(elem, ordinality)) x
LEFT JOIN table2 a ON x.id = a.id
LEFT JOIN table3 b ON a.other_id = b.id
LEFT JOIN table4 c ON b.other_id = c.id;
The tables themselves are fairly standard:
CREATE TABLE a (
    id       integer PRIMARY KEY,
    other_id integer,   -- foreign key into the related table; types illustrative
    doc      jsonb
);
Nothing special about these tables, they are ids and jsonb documents
A note - we are using Postgres for a few reasons: we do need the relational aspects of PG, but at the same time we need document storage and retrieval for later in our workflow.
Apologies if I've not provided enough data here, I can try to add some more based on any comments
EDIT: added explain:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=465.79..96225.93 rows=11300 width=1843)
Hash Cond: (pr.table_3_id = br.id)
-> Hash Left Join (cost=451.25..95756.86 rows=11300 width=1149)
Hash Cond: (((p.doc ->> 'secondary_id'::text))::integer = pr.id)
-> Nested Loop Left Join (cost=0.44..95272.14 rows=11300 width=1029)
-> Nested Loop (cost=0.01..239.13 rows=11300 width=40)
-> Seq Scan on table_3 (cost=0.00..13.13 rows=113 width=710)
-> Function Scan on jsonb_array_elements elements (cost=0.01..1.00 rows=100 width=32)
-> Index Scan using table_1_pkey on table_1 p (cost=0.43..8.41 rows=1 width=993)
Index Cond: (((elements.elem ->> 'field_id'::text))::integer = id)
-> Hash (cost=325.36..325.36 rows=10036 width=124)
-> Seq Scan on table_2 pr (cost=0.00..325.36 rows=10036 width=124)
-> Hash (cost=13.13..13.13 rows=113 width=710)
-> Seq Scan on table_3 br (cost=0.00..13.13 rows=113 width=710)
(14 rows)
EDIT 2: Sorry, I've been mega busy - I will try to go into more detail. Firstly, the full explain plan (I didn't know about the additional parameters) - I'll leave in the actual table names (I wasn't sure if I was allowed to):
Hash Left Join (cost=465.79..96225.93 rows=11300 width=1726) (actual time=4.669..278.781 rows=12522 loops=1)
Hash Cond: (pr.brand_id = br.id)
Buffers: shared hit=64813
-> Hash Left Join (cost=451.25..95756.86 rows=11300 width=1032) (actual time=4.537..265.749 rows=12522 loops=1)
Hash Cond: (((p.doc ->> 'productId'::text))::integer = pr.id)
Buffers: shared hit=64801
-> Nested Loop Left Join (cost=0.44..95272.14 rows=11300 width=912) (actual time=0.240..39.480 rows=12522 loops=1)
Buffers: shared hit=49964
-> Nested Loop (cost=0.01..239.13 rows=11300 width=40) (actual time=0.230..8.177 rows=12522 loops=1)
Buffers: shared hit=163
-> Seq Scan on brand (cost=0.00..13.13 rows=113 width=710) (actual time=0.003..0.038 rows=113 loops=1)
Buffers: shared hit=12
-> Function Scan on jsonb_array_elements elements (cost=0.01..1.00 rows=100 width=32) (actual time=0.045..0.057 rows=111 loops=113)
Buffers: shared hit=151
-> Index Scan using product_variant_pkey on product_variant p (cost=0.43..8.41 rows=1 width=876) (actual time=0.002..0.002 rows=1 loops=12522)
Index Cond: (((elements.elem ->> 'productVariantId'::text))::integer = id)
Buffers: shared hit=49801
-> Hash (cost=325.36..325.36 rows=10036 width=124) (actual time=4.174..4.174 rows=10036 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1684kB
Buffers: shared hit=225
-> Seq Scan on product pr (cost=0.00..325.36 rows=10036 width=124) (actual time=0.003..1.836 rows=10036 loops=1)
Buffers: shared hit=225
-> Hash (cost=13.13..13.13 rows=113 width=710) (actual time=0.114..0.114 rows=113 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 90kB
Buffers: shared hit=12
-> Seq Scan on brand br (cost=0.00..13.13 rows=113 width=710) (actual time=0.003..0.043 rows=113 loops=1)
Buffers: shared hit=12
Planning Time: 0.731 ms
Execution Time: 279.952 ms
(29 rows)
Your query is hard to follow for a few reasons:
Your tables are named tab, table2, table3, table4.
Your subquery parses JSON for every single row in the table, projects out some values, and then the outer query never uses those values. The only one that appears to be relevant is id.
Outer joins must be executed in order while inner joins can be freely re-arranged for performance. Without knowing the purpose of this query, it's impossible for me to determine if an outer join is appropriate or not.
The table names and column names in the execution plan do not match the query, so I'm not convinced this plan is accurate.
You did not supply a schema.
That said, I'll do my best.
Things that stand out performance-wise
No where clause
Since there is no where clause, your query will run jsonb_array_elements against every single row of tab, which is what is happening. Aside from extracting the data out of JSON and storing it into a separate column, I can't imagine much that could be done to optimize it.
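As a rough illustration of that last point, the unnesting could be done once up front and the result stored, e.g. in a separate table (a sketch only; tab_elements is a hypothetical name and the keys are copied from the query in the question):
CREATE TABLE tab_elements AS
SELECT t.id,
       (e.elem ->> 'a')::integer AS a,
       (e.elem ->> 'b')::integer AS b,
       (e.elem ->> 'c')::integer AS c,
       (e.elem ->> 'd')::integer AS d,
       (e.elem ->> 'e')::integer AS e
FROM tab AS t
CROSS JOIN LATERAL jsonb_array_elements(t.doc -> 'arrayList') AS e(elem);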
Insufficient indexes to power the joins
The query plan suggests there might be a meaningful cost to the joins. Except for table1, each join is driven by a sequential scan of the tables, which means reading every row of both tables. I suspect adding indexes on each table will help. It appears you are joining on id columns, so a simple primary key constraint will improve both your data integrity and query performance.
alter table tab add primary key (id);
alter table table2 add primary key (id);
alter table table3 add primary key (id);
alter table table4 add primary key (id);
Type conversions
This part of the execution plan shows a double type conversion in your first join:
Index Cond: (((elements.elem ->> 'field_id'::text))::integer = id)
This predicate means that the field_id value from tab's JSON is extracted as text, then the text is converted to an integer so it can match against table2.id. These conversions can be expensive in compute time and, in some cases, can prevent index usage. It's hard to advise on what to do because I don't know what the actual types are.

PostgreSQL chooses not to use index despite improved performance

I had a DB in MySQL and am in the process of moving to PostgreSQL with a Django front-end.
I have a table of 650k-750k rows on which I perform the following query:
SELECT "MMG", "Gene", COUNT(*) FROM at_summary_typing WHERE "MMG" != '' GROUP BY "MMG", "Gene" ORDER BY COUNT(*);
In MySQL this returns in ~0.5s. However, when I switched to PostgreSQL the same query takes ~3s. I have put an index on MMG and Gene together to try and speed it up, but when using EXPLAIN (analyse, buffers, verbose) I see the output shows the index is not used:
Sort (cost=59013.54..59053.36 rows=15927 width=14) (actual time=2880.222..2885.475 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=16093 read=11482, temp read=2230 written=2230
-> GroupAggregate (cost=55915.50..57901.90 rows=15927 width=14) (actual time=2179.809..2861.679 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Sort (cost=55915.50..56372.29 rows=182713 width=14) (actual time=2179.782..2830.232 rows=180657 loops=1)
Output: "MMG", "Gene"
Sort Key: at_summary_typing."MMG", at_summary_typing."Gene"
Sort Method: external merge Disk: 8168kB
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Seq Scan on public.at_summary_typing (cost=0.00..36821.60 rows=182713 width=14) (actual time=0.010..224.658 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Buffers: shared hit=16093 read=11482
Total runtime: 2888.804 ms
After some searching I found that I could force the use of the index with SET enable_seqscan = OFF; and the EXPLAIN now shows the following:
Sort (cost=1181591.18..1181631.00 rows=15927 width=14) (actual time=555.546..560.839 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=173219 read=87094 written=7, temp read=411 written=411
-> GroupAggregate (cost=0.42..1180479.54 rows=15927 width=14) (actual time=247.546..533.202 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=173219 read=87094 written=7
-> Index Only Scan using mm_gene_idx on public.at_summary_typing (cost=0.42..1178949.93 rows=182713 width=14) (actual time=247.533..497.771 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Heap Fetches: 739728
Buffers: shared hit=173219 read=87094 written=7
Total runtime: 562.735 ms
Performance is now comparable with MySQL.
The problem is that I understand that setting this is bad practice and that I should try and find a way to improve my query/encourage the use of the index automatically. However I'm very inexperienced with PostgreSQL and cannot work out how or why it is choosing to use a Seq Scan over an Index Scan in the first place.
why it is choosing to use a Seq Scan over an Index Scan in the first place
Because the seq scan is actually twice as fast as the index scan (224ms vs. 497ms) despite the fact that the index was nearly completely in the cache, but the table was not.
So choosing the seq scan was the right thing to do.
The bottleneck in the first plan is the sorting and grouping that needs to be done on disk.
The better strategy would be to increase work_mem to something more realistic than the really small default of 4MB. You might want to start with something like 16MB, by running
set work_mem = '16MB';
before running your query. If that doesn't remove the "Sort Method: external merge Disk" steps, increase work_mem further.
With a larger work_mem it is also possible that Postgres switches to a hash aggregate instead of the sort it currently does, which will probably be faster anyway (but isn't feasible if not enough memory is available).
Once you find a good value, you might want to make that permanent by putting the new value into postgresql.conf
Don't set this too high: that memory may be requested multiple times for each query.
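For example, a sketch of making it permanent (the value is illustrative; pick whatever worked in your testing):
# in postgresql.conf
work_mem = 16MB
followed by a configuration reload (e.g. pg_ctl reload, or SELECT pg_reload_conf(); from psql) - work_mem does not require a server restart.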
If your where condition is static, you could also create a partial index matching that criteria:
create index on at_summary_typing ("MMG", "Gene")
where "MMG" <> '';
Don't forget to analyze the table to update the statistics.
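For example, using the table from the question:
analyze at_summary_typing;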