Phrase frequency counter with full text search in PostgreSQL 9.6

I need to calculate the number of times a phrase appears, using a tsquery against an indexed text field (tsvector data type). It works, but it is very slow because the table is huge. For single words I pre-calculated all the frequencies, but I have no idea how to speed up a phrase search.
Edit: Thank you for your reply, @jjanes.
This is my query:
SELECT substring(date_input::text, 0, 5) AS myear,
       ts_headline('simple', text_input, q,
                   'StartSel=<b>, StopSel=</b>, MaxWords=2, MinWords=1, ShortWord=1, HighlightAll=FALSE, MaxFragments=9999, FragmentDelimiter=" ... "') AS headline
FROM db_test,
     to_tsquery('simple', 'united<->kingdom') AS q
WHERE date_input BETWEEN '2000-01-01'::DATE AND '2019-12-31'::DATE
  AND idxfti_simple @@ q
And this is the EXPLAIN (ANALYZE, BUFFERS) output:
Nested Loop (cost=25408.33..47901.67 rows=5509 width=64) (actual time=286.536..17133.343 rows=38127 loops=1)
Buffers: shared hit=224723
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.007 rows=1 loops=1)
-> Append (cost=25408.33..46428.00 rows=5510 width=625) (actual time=285.372..864.868 rows=38127 loops=1)
Buffers: shared hit=165713
-> Bitmap Heap Scan on db_test (cost=25408.33..46309.01 rows=5509 width=625) (actual time=285.368..791.111 rows=38127 loops=1)
Recheck Cond: ((idxfti_simple @@ q.q) AND (date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
Rows Removed by Index Recheck: 136
Heap Blocks: exact=29643
Buffers: shared hit=165607
-> BitmapAnd (cost=25408.33..25408.33 rows=5509 width=0) (actual time=278.370..278.371 rows=0 loops=1)
Buffers: shared hit=3838
-> Bitmap Index Scan on idxftisimple_idx (cost=0.00..1989.01 rows=35869 width=0) (actual time=67.280..67.281 rows=176654 loops=1)
Index Cond: (idxfti_simple @@ q.q)
Buffers: shared hit=611
-> Bitmap Index Scan on db_test_date_input_idx (cost=0.00..23142.24 rows=1101781 width=0) (actual time=174.711..174.712 rows=1149456 loops=1)
Index Cond: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
Buffers: shared hit=3227
-> Seq Scan on test (cost=0.00..118.98 rows=1 width=451) (actual time=0.280..0.280 rows=0 loops=1)
Filter: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date) AND (idxfti_simple @@ q.q))
Rows Removed by Filter: 742
Buffers: shared hit=106
Planning time: 0.332 ms
Execution time: 17176.805 ms
Sorry, I can't turn track_io_timing on. I do know that ts_headline is not recommended, but I need it to calculate the number of times a phrase appears in the same field.
Thank you in advance for your help.

Note that fetching the rows in Bitmap Heap Scan is quite fast, <0.8 seconds, and almost all the time is spent in the top-level node. That time is likely to be spent in ts_headline, reparsing the text_input document. As long as you keep using ts_headline, there isn't much you can do about this.
ts_headline doesn't directly give you what you want (frequency), so you must be doing some kind of post-processing of it. Maybe you could move to postprocessing the tsvector directly, so the document doesn't need to be reparsed.
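For example, here is a minimal sketch of what postprocessing the stored tsvector could look like for 'united<->kingdom' (table and column names are taken from the question; note that tsvector positions are capped at 16383, so counts can be off for very long documents):

SELECT substring(date_input::text, 0, 5) AS myear,
       sum(c.phrase_hits) AS occurrences
FROM   db_test d
CROSS  JOIN LATERAL (
         SELECT count(*) AS phrase_hits
         FROM   unnest(d.idxfti_simple) u,        -- (lexeme, positions, weights)
                unnest(u.positions)     up(pos),
                unnest(d.idxfti_simple) k,
                unnest(k.positions)     kp(pos)
         WHERE  u.lexeme = 'united'
           AND  k.lexeme = 'kingdom'
           AND  kp.pos   = up.pos + 1             -- 'kingdom' directly follows 'united'
       ) c
WHERE  d.date_input BETWEEN '2000-01-01' AND '2019-12-31'
  AND  d.idxfti_simple @@ to_tsquery('simple', 'united<->kingdom')
GROUP  BY 1;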
Another option is to upgrade further, which could allow the work of ts_headline to be spread over multiple CPUs. PostgreSQL 9.6 was the first version which supported parallel query, and it was not mature enough in that version to be able to parallelize this type of thing. v10 is probably enough to get parallel query for this, but you might as well jump all the way to v12.

Version 9.2 is old and out of support. It didn't have native support for phrase searching in the first place (introduced in 9.6).
Please upgrade.
And if it is still slow, show us the query, and the EXPLAIN (ANALYZE, BUFFERS) for it, preferably with track_io_timing turned on.

Related

Postgres Optimizing Free Text Search when many results are returned

We are building a lightweight text search on top of our data in Postgres with GIN indexes. When the matched data is small, it works really fast. However, when we search for common terms, performance degrades significantly because of the many matches.
Consider the following query:
EXPLAIN ANALYZE
SELECT count(id)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'mustafa');
The result is as follows:
Finalize Aggregate (cost=47207.99..47208.00 rows=1 width=8) (actual time=15.461..17.129 rows=1 loops=1)
-> Gather (cost=47207.78..47207.99 rows=2 width=8) (actual time=9.734..17.119 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=46207.78..46207.79 rows=1 width=8) (actual time=3.773..3.774 rows=1 loops=3)
-> Parallel Bitmap Heap Scan on data_change_records d (cost=1759.41..46194.95 rows=5130 width=37) (actual time=1.765..3.673 rows=1143 loops=3)
Recheck Cond: (to_tsvector('english'::regconfig, content) @@ '''mustafa'''::tsquery)
Heap Blocks: exact=2300
-> Bitmap Index Scan on data_change_records_content_to_tsvector_idx (cost=0.00..1756.33 rows=12311 width=0) (actual time=4.219..4.219 rows=3738 loops=1)
Index Cond: (to_tsvector('english'::regconfig, content) @@ '''mustafa'''::tsquery)
Planning Time: 0.141 ms
Execution Time: 17.163 ms
If mustafa is replaced with a common term such as aws (which the stemmer reduces to aw), the analysis is as follows:
Finalize Aggregate (cost=723889.39..723889.40 rows=1 width=8) (actual time=1073.513..1086.414 rows=1 loops=1)
-> Gather (cost=723889.17..723889.38 rows=2 width=8) (actual time=1069.439..1086.401 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=722889.17..722889.18 rows=1 width=8) (actual time=1063.847..1063.848 rows=1 loops=3)
-> Parallel Bitmap Heap Scan on data_change_records d (cost=17128.34..721138.59 rows=700233 width=37) (actual time=389.347..1014.440 rows=542724 loops=3)
Recheck Cond: (to_tsvector('english'::regconfig, content) @@ '''aw'''::tsquery)
Heap Blocks: exact=167605
-> Bitmap Index Scan on data_change_records_content_to_tsvector_idx (cost=0.00..16708.20 rows=1680560 width=0) (actual time=282.517..282.518 rows=1647916 loops=1)
Index Cond: (to_tsvector('english'::regconfig, content) @@ '''aw'''::tsquery)
Planning Time: 0.150 ms
Execution Time: 1086.455 ms
At this point we are not sure how to proceed. Options include changing the tokenization to disallow two-letter tokens; we have lots of aws indexed, and that is the cause. For instance, if we search for ok, which is also two letters but not that common, the query returns in 61.378 ms.
Searching for frequent words can never be as fast as searching for rare words.
One thing that strikes me is that you are using English stemming to search for names. If that is really your use case, you should use the simple dictionary that wouldn't stem aws to aw.
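A quick check of the difference (per the question, the english stemmer reduces aws to aw, while the simple dictionary keeps it intact):

SELECT to_tsvector('english', 'aws');   -- 'aw':1
SELECT to_tsvector('simple', 'aws');    -- 'aws':1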
Alternatively, you could introduce an additional synonym dictionary to a custom text search configuration that contains aws and prevents stemming.
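A sketch of that approach (dictionary, configuration, and file names are made up for illustration; $SHAREDIR/tsearch_data/aws_syn.syn would contain a line such as "aws aws", so the synonym dictionary accepts the token before the stemmer can touch it):

CREATE TEXT SEARCH DICTIONARY aws_syn (
    TEMPLATE = synonym,
    SYNONYMS = aws_syn                  -- refers to aws_syn.syn in tsearch_data
);
CREATE TEXT SEARCH CONFIGURATION english_syn (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_syn
    ALTER MAPPING FOR asciiword
    WITH aws_syn, english_stem;         -- synonym dictionary first, stemmer as fallback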
But, as I said, searching for frequent words cannot be fast if you want all result rows. A trick you could use is to set gin_fuzzy_search_limit to the limit of hits you want to find, then the index scan will stop early and may be faster (but you won't get all results).
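A sketch of that (once the limit is reached the index scan stops, so the result is a partial, roughly random sample; 0, the default, means no limit):

SET gin_fuzzy_search_limit = 10000;
SELECT count(id)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'aws');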
If you have a new-enough version of PostgreSQL and your table is well-vacuumed, you can get a bitmap-only scan which doesn't need to visit the table, just the index. But you would need to use count(*), not count(id), to get that. If "id" is never NULL, then these should give identical answers.
The query plan does not make it easy to tell when the bitmap-only optimization kicks in or how effective it is. If you use EXPLAIN (ANALYZE, BUFFERS) you should get at least some clue based on the buffer counts.
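For example (a sketch): with count(*) the bitmap heap scan can skip heap fetches for all-visible pages, and EXPLAIN (ANALYZE, BUFFERS) will then show noticeably fewer shared buffers touched:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM data_change_records d
WHERE to_tsvector('english', d.content) @@ websearch_to_tsquery('english', 'aws');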

Can I have a feedback about my Postgres performance?

This is the query I performed in pgAdmin 4:
update point
set grid_id_new=g.grid_id
from grid as g
where (point.region = 'EMILIA-ROMAGNA' and st_within(point.geom, g.geom))
Point is a 34-million-row table describing a point geometry (16 GB, 20 columns).
Grid is a 10-million-row table describing a multipolygon geometry (the grid) (4 GB).
I want each point in the point table to be associated with the ID of the grid cell it lies in. The query updated 2.5 million records (since I filter by region) in 24 minutes.
I feel like that took too much time.
These are my computer specifics:
Windows 10 PRO/Intel(R) Core(TM) i9-10920X CPU @ 3.50 GHz/RAM 128 GB/953GB SSD(C)+3.4TB HDD(F)
I have installed Postgres 13 and the data folder is on F (I know this may be wrong, so I am planning to move it).
I have also tried to tune the postgresql.conf file, but I got poor results.
Can someone please explain whether my Postgres performance is as poor as I think? And, if so, how can I make it better? Also, what could be a good configuration for postgresql.conf given my hardware?
Update
@jjanes Hi there! It took 8 minutes to run the query you wrote, and this is the result:
QUERY PLAN
Gather (cost=1363.89..273178616690.49 rows=23057026760 width=28) (actual time=76.935..503830.684 rows=2335279 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=18634521 read=2426823
-> Nested Loop (cost=363.89..270872913014.49 rows=9607094483 width=28) (actual time=157.628..503021.991 rows=778426 loops=3)
Buffers: shared hit=18634521 read=2426823
-> Parallel Seq Scan on egon_geom_new (cost=0.00..2657488.69 rows=1064319 width=59) (actual time=1.575..8642.488 rows=855390 loops=3)
Filter: (dsxreg = 'EMILIA-ROMAGNA'::text)
Rows Removed by Filter: 10581246
Buffers: shared hit=259223 read=2225262
-> Bitmap Heap Scan on "6_emilia_grid" (cost=363.89..254491.98 rows=903 width=148) (actual time=0.573..0.573 rows=1 loops=2566171)
Filter: st_within((egon_geom_new.geom_new)::geometry, geom)
Heap Blocks: exact=784879
Buffers: shared hit=18375298 read=201561
-> Bitmap Index Scan on emilia_idx (cost=0.00..363.66 rows=9027 width=0) (actual time=0.283..0.283 rows=1 loops=2566171)
Index Cond: (geom ~ (egon_geom_new.geom_new)::geometry)
Buffers: shared hit=16167046 read=74534
Planning:
Buffers: shared hit=130 read=3 dirtied=2
Planning Time: 22.756 ms
Execution Time: 504042.609 ms
Thanks!
You can create a GiST index on one of the geometry columns; that will speed up the nested loop join. But you cannot use any other join strategy, because the join condition does not use the equality operator (=), so joining two big tables will always be slow.
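For example, a minimal sketch (the index name is illustrative; build it on the geometry column of whichever table ends up on the inner side of the nested loop, here grid):

CREATE INDEX grid_geom_gist ON grid USING gist (geom);
ANALYZE grid;   -- refresh statistics so the planner considers the new index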

What are the things you look for when using EXPLAIN ANALYZE to determine if there are improvements you can make or not

I've been reading a bunch of the Postgres docs, but when looking at the output it's still not super clear to me whether the query is optimized or not. I've tried adding some indexes, which reduced the number of lines in the output. If you were to look at something like this:
Limit (cost=26.16..26.18 rows=10 width=322) (actual time=0.077..0.079 rows=10 loops=1)
-> Sort (cost=26.16..26.19 rows=12 width=322) (actual time=0.076..0.077 rows=10 loops=1)
Sort Key: like_count DESC, inserted_at DESC
Sort Method: top-N heapsort Memory: 28kB
-> Bitmap Heap Scan on comments c0 (cost=4.40..25.94 rows=12 width=322) (actual time=0.036..0.049 rows=38 loops=1)
Recheck Cond: ((post_id = 'dc1ab68f-db3f-4b45-aa48-b5c30298e261'::uuid) AND (parent_id IS NULL))
Heap Blocks: exact=9
-> Bitmap Index Scan on comments_post_id_parent_id_index (cost=0.00..4.40 rows=12 width=0) (actual time=0.013..0.013 rows=38 loops=1)
Index Cond: ((post_id = 'dc1ab68f-db3f-4b45-aa48-b5c30298e261'::uuid) AND (parent_id IS NULL))
Planning Time: 0.099 ms
Execution Time: 0.099 ms
(11 rows)
Are there any key things you look at to say "This query is pretty optimized", or "Wow, there's an index I can add to reduce all that work"?
The first thing I'd notice is that it took less than 1/10,000 of a second to run, and so is unlikely to need manual optimization work. And then I'd wonder, why did I get started looking at such a fast query in the first place? Surely I should be examining the slow queries, not the fast ones.
I first look for sequential table scans, which indicate that the query planner could not use an index, either because there isn't one or because it failed to use it for some reason.

How do I improve the performance of my delete query if the bottleneck appears to be I/O?

I have a (badly designed) database table (Postgres) that I'm trying to clean up. The table is about 270 GB in size and has 38K rows (roughly 70 MB per row; the columns contain file contents).
In parallel with changing the design to no longer store files, I want to remove 80% of the data to reduce disk usage. Hence, I've tried the following query:
DELETE FROM table_name.dynamic_data
WHERE table_name.env = 'AE'
This should cover roughly 25% of the data. The query times out without any warning, or sometimes reports that the log files are full: PANIC: could not write to file "pg_xlog/xlogtemp.32455": No space left on device
I've then tried:
DELETE FROM table_name
WHERE ctid IN (
SELECT ctid
FROM table_name
WHERE table_name.env = 'AE'
LIMIT 1000)
This works, but it is incredibly slow (200-250 ms per row deleted) and times out if I raise the LIMIT much above 1000.
To find the bottleneck, I ran the query above with EXPLAIN (ANALYZE, BUFFERS, TIMING) on a smaller version (LIMIT 1 instead of LIMIT 1000), resulting in the following plan:
QUERY PLAN
Delete on dynamic_data (cost=0.38..4410.47 rows=1 width=36) (actual time=338.913..338.913 rows=0 loops=1)
Buffers: shared hit=7972 read=988 dirtied=975
I/O Timings: read=312.160
-> Nested Loop (cost=0.38..4410.47 rows=1 width=36) (actual time=3.919..13.700 rows=1 loops=1)
Join Filter: (dynamic_data.ctid = "ANY_subquery".ctid)
Rows Removed by Join Filter: 35938
Buffers: shared hit=4013
-> Unique (cost=0.38..0.39 rows=1 width=36) (actual time=2.786..2.788 rows=1 loops=1)
Buffers: shared hit=479
-> Sort (cost=0.38..0.39 rows=1 width=36) (actual time=2.786..2.787 rows=1 loops=1)
Sort Key: "ANY_subquery".ctid
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=479
-> Subquery Scan on "ANY_subquery" (cost=0.00..0.37 rows=1 width=36) (actual time=2.753..2.753 rows=1 loops=1)
Buffers: shared hit=474
-> Limit (cost=0.00..0.36 rows=1 width=6) (actual time=2.735..2.735 rows=1 loops=1)
Buffers: shared hit=474
-> Seq Scan on dynamic_data dynamic_data_1 (cost=0.00..4020.71 rows=11093 width=6) (actual time=2.735..2.735 rows=1 loops=1)
Filter: (env = 'AE'::text)
Rows Removed by Filter: 5614
Buffers: shared hit=474
-> Seq Scan on dynamic_data (cost=0.00..3923.37 rows=38937 width=6) (actual time=0.005..8.130 rows=35939 loops=1)
Buffers: shared hit=3534
Planning time: 0.354 ms
Execution time: 338.969 ms
My main take-away from the query plan is that I/O timings take 312/338 = 92% of the time:
actual time=338.913..338.913 rows=0 loops=1)
Buffers: shared hit=7972 read=988 dirtied=975
I/O Timings: read=312.160
I can't find anything on how to improve the I/O performance of this query without changing the database configuration. Is this simply an unfortunate effect of the large table rows / database? Or am I missing something important? How do I speed up this delete operation?
Without any resolution I'm defaulting to having a script delete 1 row at a time with separate queries. Far from ideal, so I hope you have suggestions.
Note that I am not administering the database and not authorized to make any changes to its configuration, nor is it likely that the DBA will change the config to cope with my ill-designed setup.

PostgreSQL chooses not to use index despite improved performance

I had a DB in MySQL and am in the process of moving to PostgreSQL with a Django front-end.
I have a table of 650k-750k rows on which I perform the following query:
SELECT "MMG", "Gene", COUNT(*) FROM at_summary_typing WHERE "MMG" != '' GROUP BY "MMG", "Gene" ORDER BY COUNT(*);
In MySQL this returns in ~0.5 s. However, when I switched to PostgreSQL the same query takes ~3 s. I have put an index on MMG and Gene together to try to speed it up, but the EXPLAIN (analyse, buffers, verbose) output shows the index is not used:
Sort (cost=59013.54..59053.36 rows=15927 width=14) (actual time=2880.222..2885.475 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=16093 read=11482, temp read=2230 written=2230
-> GroupAggregate (cost=55915.50..57901.90 rows=15927 width=14) (actual time=2179.809..2861.679 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Sort (cost=55915.50..56372.29 rows=182713 width=14) (actual time=2179.782..2830.232 rows=180657 loops=1)
Output: "MMG", "Gene"
Sort Key: at_summary_typing."MMG", at_summary_typing."Gene"
Sort Method: external merge Disk: 8168kB
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Seq Scan on public.at_summary_typing (cost=0.00..36821.60 rows=182713 width=14) (actual time=0.010..224.658 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Buffers: shared hit=16093 read=11482
Total runtime: 2888.804 ms
After some searching I found that I could force the use of the index with SET enable_seqscan = OFF; the EXPLAIN now shows the following:
Sort (cost=1181591.18..1181631.00 rows=15927 width=14) (actual time=555.546..560.839 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=173219 read=87094 written=7, temp read=411 written=411
-> GroupAggregate (cost=0.42..1180479.54 rows=15927 width=14) (actual time=247.546..533.202 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=173219 read=87094 written=7
-> Index Only Scan using mm_gene_idx on public.at_summary_typing (cost=0.42..1178949.93 rows=182713 width=14) (actual time=247.533..497.771 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Heap Fetches: 739728
Buffers: shared hit=173219 read=87094 written=7
Total runtime: 562.735 ms
Performance now comparable with the MySQL.
The problem is that I understand that setting this is bad practice and that I should try and find a way to improve my query/encourage the use of the index automatically. However I'm very inexperienced with PostgreSQL and cannot work out how or why it is choosing to use a Seq Scan over an Index Scan in the first place.
why it is choosing to use a Seq Scan over an Index Scan in the first place
Because the seq scan is actually twice as fast as the index scan (224ms vs. 497ms) despite the fact that the index was nearly completely in the cache, but the table was not.
So choosing the seq scan was the right thing to do.
The bottleneck in the first plan is the sorting and grouping that needs to be done on disk.
The better strategy would be to increase work_mem to something more realistic than the really small default of 4MB. You might want to start with something like 16MB, by running
set work_mem = '16MB';
before running your query. If that doesn't remove the "Sort Method: external merge Disk" steps, increase work_mem further.
With a larger work_mem it is also possible that Postgres switches to a hash aggregate instead of the sorting it currently does, which will probably be faster anyway (but isn't feasible if not enough memory is available).
Once you find a good value, you might want to make it permanent by putting the new value into postgresql.conf.
Don't set this too high: that memory may be requested multiple times for each query.
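On PostgreSQL 9.4 or later, a sketch of persisting it without editing postgresql.conf by hand (use a value you have verified first):

ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();   -- new sessions then pick up the setting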
If your where condition is static, you could also create a partial index matching that criteria:
create index on at_summary_typing ("MMG", "Gene")
where "MMG" <> '';
Don't forget to analyze the table to update the statistics.
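For example:

ANALYZE at_summary_typing;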