Speed up fulltext search - pgsql - postgresql

I have seen countless threads about speeding up PostgreSQL queries that use full-text search. I have tried everything I could find, but I am out of ideas.
I have a fairly big table (20,612,971 records at the moment) and search it with PostgreSQL's full-text search, then order the results by ts_rank_cd. I have got the query to around 3500-4000 ms. Any ideas on how to make it faster? If possible, I don't want to use external software like Sphinx or Solr, so native PostgreSQL solutions are preferred :) Below are the description of my table and an example EXPLAIN ANALYZE of the SELECT.
# \d artifacts.item
Table "artifacts.item"
Column | Type | Modifiers
-------------------------+-----------------------------+-------------------------------------------------------------
add_timestamp | timestamp without time zone |
author_account_id | integer |
description | text |
id | integer | not null default nextval('artifacts.item_id_seq'::regclass)
removed_since_timestamp | timestamp without time zone |
slug | character varying(2044) | not null
thumb_height | integer |
thumb_path | character varying(2044) | default NULL::character varying
thumb_width | integer |
title | character varying(2044) | not null
search_data | tsvector |
tags | integer[] |
is_age_restricted | boolean | not null default false
is_on_homepage | boolean | not null default false
is_public | boolean | not null default false
thumb_filename | character varying(2044) |
is_removed | boolean | not null default false
Indexes:
"artifacts_item_add_timestamp_idx" btree (add_timestamp DESC NULLS LAST)
"artifacts_item_id_idx" btree (id)
"artifacts_item_is_on_homepage_add_timestamp" btree (is_on_homepage DESC, add_timestamp DESC NULLS LAST)
"artifacts_item_is_on_homepage_idx" btree (is_on_homepage)
"artifacts_item_search_results" gin (search_data) WHERE is_public IS TRUE AND is_removed IS FALSE
"artifacts_item_tags_gin_idx" gin (tags)
"artifacts_item_thumbs_list" btree (is_public, is_removed, id DESC)
"index1" btree (add_timestamp)
"itemIdx" btree (is_public, is_removed, is_age_restricted)
"item_author_account_id_idx" btree (author_account_id)
analyze:
# explain analyze SELECT i.id,
# i.title,
# i.description,
# i.slug,
# i.thumb_path,
# i.thumb_filename,
# CONCAT(
# i.thumb_path,
# '/',
# i.thumb_filename
# ) AS thumb_url,
# (CASE WHEN i.thumb_width = 0 THEN 280 ELSE i.thumb_width END) as thumb_width,
# (CASE WHEN i.thumb_height = 0 THEN 280 ELSE i.thumb_height END) as thumb_height,
# (i.thumb_height > i.thumb_width) AS is_vertical,
# i.add_timestamp
# FROM artifacts.item AS i
# WHERE i.is_public IS true
# AND i.is_removed IS false
# AND (i.search_data @@ to_tsquery('public.polish', $$'lego'$$))
# ORDER BY ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# i.add_timestamp DESC NULLS LAST
# LIMIT 60
# OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=358061.78..358061.93 rows=60 width=315) (actual time=335.870..335.876 rows=60 loops=1)
-> Sort (cost=358061.78..358357.25 rows=118189 width=315) (actual time=335.868..335.868 rows=60 loops=1)
Sort Key: (ts_rank_cd(search_data, '''lego'' | ''lega'''::tsquery)), add_timestamp
Sort Method: top-N heapsort Memory: 55kB
-> Bitmap Heap Scan on item i (cost=2535.96..353980.19 rows=118189 width=315) (actual time=33.163..308.371 rows=62025 loops=1)
Recheck Cond: ((search_data @@ '''lego'' | ''lega'''::tsquery) AND (is_public IS TRUE) AND (is_removed IS FALSE))
-> Bitmap Index Scan on artifacts_item_search_results (cost=0.00..2506.42 rows=118189 width=0) (actual time=23.066..23.066 rows=62085 loops=1)
Index Cond: (search_data @@ '''lego'' | ''lega'''::tsquery)
Total runtime: 335.967 ms
(9 rows)
Time: 3444.731 ms

There are 62025 rows that match the condition, and they have to be ordered…
That will take a while. Is there a chance that you can have the whole database or at least the index in RAM? That would help.
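To judge whether the index (and ideally the table) could fit in memory, and to load the index into the buffer cache, something like the following could be tried. This is only a sketch: it assumes the pg_prewarm contrib extension is available (it ships with PostgreSQL 9.4 and later, which may be newer than the server in the question), and whether anything fits depends on shared_buffers and the machine's RAM.

```sql
-- How big are the table (including TOAST) and the partial GIN index used by the search?
SELECT pg_size_pretty(pg_table_size('artifacts.item')) AS table_size,
       pg_size_pretty(pg_relation_size('artifacts.artifacts_item_search_results')) AS index_size;

-- How much memory does PostgreSQL reserve for its own buffer cache?
SHOW shared_buffers;

-- Assumption: the pg_prewarm contrib module is installed (PostgreSQL 9.4+).
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Read the index into shared buffers; the result is the number of blocks loaded.
SELECT pg_prewarm('artifacts.artifacts_item_search_results');
```

In the plan above the Bitmap Index Scan finishes after about 23 ms while the Bitmap Heap Scan runs until about 308 ms, so caching the table pages is likely to matter at least as much as caching the index.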

Related

How can I optimize a postgresql query where one dependent column is a timestamp

I have a table with a foreign key and a timestamp for when the row was most recently updated. Rows with the same foreign key value are updated at roughly the same time, plus or minus an hour. I have an index on (foreign_key, timestamp). This is on PostgreSQL 11.
When I make a query like:
select * from table where foreign_key = $1 and timestamp > $2 order by primary_key;
It uses my index when the timestamp condition is selective across the entire table. But if the timestamp is far enough in the past that the majority of rows match, it scans the primary key index instead, assuming that will be faster. This problem goes away if I remove the ORDER BY.
I've looked at PostgreSQL's CREATE STATISTICS, but it doesn't seem to help in cases where the correlation is over a range of values, like a timestamp plus or minus five minutes, rather than a specific value.
What are the best ways to work around this? I can remove the order by, but that complicates the business logic. I can partition the table on the foreign key id, but that is also a pretty expensive change.
Specifics:
Table "public.property_home_attributes"
Column | Type | Collation | Nullable | Default
----------------------+-----------------------------+-----------+----------+------------------------------------------------------
id | integer | | not null | nextval('property_home_attributes_id_seq'::regclass)
mls_id | integer | | not null |
property_id | integer | | not null |
formatted_attributes | jsonb | | not null |
created_at | timestamp without time zone | | |
updated_at | timestamp without time zone | | |
Indexes:
"property_home_attributes_pkey" PRIMARY KEY, btree (id)
"index_property_home_attributes_on_property_id" UNIQUE, btree (property_id)
"index_property_home_attributes_on_updated_at" btree (updated_at)
"property_home_attributes_mls_id_updated_at_idx" btree (mls_id, updated_at)
The table has about 16 million rows.
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') ORDER BY id ASC LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..10147.83 rows=1000 width=880) (actual time=1519.718..22310.674 rows=1000 loops=1)
-> Index Scan using property_home_attributes_pkey on property_home_attributes (cost=0.56..6094202.57 rows=600576 width=880) (actual time=1519.716..22310.398 rows=1000 loops=1)
Filter: ((updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone) AND (mls_id = 46))
Rows Removed by Filter: 358834
Planning Time: 0.110 ms
Execution Time: 22310.842 ms
(6 rows)
and then without the order by:
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..1049.38 rows=1000 width=880) (actual time=0.053..162.081 rows=1000 loops=1)
-> Index Scan using foo on property_home_attributes (cost=0.56..629893.60 rows=600576 width=880) (actual time=0.053..161.992 rows=1000 loops=1)
Index Cond: ((mls_id = 46) AND (updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone))
Planning Time: 0.100 ms
Execution Time: 162.140 ms
(5 rows)
If you want to keep PostgreSQL from using an index scan on property_home_attributes_pkey to support the ORDER BY, you can simply use
ORDER BY primary_key + 0
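Applied to the query from the question, where the primary key column is id, that would look roughly like the sketch below. The + 0 turns the sort key into an expression rather than the bare column, so the planner can no longer read rows in sorted order from property_home_attributes_pkey. It then has to find the matching rows some other way, most likely through the (mls_id, updated_at) index, and sort them afterwards.

```sql
-- Same query as in the question, but the sort key is an expression,
-- so the primary key index cannot be used to deliver the ordering.
EXPLAIN ANALYZE
SELECT *
FROM property_home_attributes
WHERE mls_id = 46
  AND updated_at < '2019-10-30 16:52:06.326774'
ORDER BY id + 0 ASC
LIMIT 1000;
```

The result order is unchanged, because id + 0 sorts exactly like id. The trade-off is that all matching rows (about 600,000 by the planner's estimate) are fetched and sorted before the LIMIT applies.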

Why is my count query on index field slow?

I have the following schema:
leadgenie-django=> \d main_lead;
Table "public.main_lead"
Column | Type | Modifiers
-----------------+--------------------------+-----------
id | uuid | not null
body | text | not null
username | character varying(255) | not null
link | character varying(255) | not null
source | character varying(10) | not null
keyword_matches | character varying(255)[] | not null
json | jsonb | not null
created_at | timestamp with time zone | not null
updated_at | timestamp with time zone | not null
campaign_id | uuid | not null
is_accepted | boolean |
is_closed | integer |
raw_body | text |
accepted_at | timestamp with time zone |
closed_at | timestamp with time zone |
score | double precision |
Indexes:
"main_lead_pkey" PRIMARY KEY, btree (id)
"main_lead_campaign_id_75034b1f" btree (campaign_id)
Foreign-key constraints:
"main_lead_campaign_id_75034b1f_fk_main_campaign_id" FOREIGN KEY (campaign_id) REFERENCES main_campaign(id) DEFERRABLE INITIALLY DEFERRED
As you can see, campaign_id is indexed.
When I do a simple WHERE with a campaign_id, the query still takes 16 seconds.
leadgenie-django=> EXPLAIN ANALYZE select count(*) from main_lead where campaign_id = '9a183263-7a60-4ec0-a354-2175f8a2e5c9';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=202866.79..202866.80 rows=1 width=8) (actual time=16715.762..16715.763 rows=1 loops=1)
-> Seq Scan on main_lead (cost=0.00..202189.94 rows=270739 width=0) (actual time=1143.886..16516.490 rows=279405 loops=1)
Filter: (campaign_id = '9a183263-7a60-4ec0-a354-2175f8a2e5c9'::uuid)
Rows Removed by Filter: 857300
Planning time: 0.080 ms
Execution time: 16715.807 ms
I would have expected this query to be fast (under 1s), since this field is indexed. Is there a reason my expectation is wrong? Anything I could do to speed it up?
The query fetches about 25% of your table, so PostgreSQL thinks that this is most cheaply done with a sequential scan of the whole table. That is probably correct.
Try running
VACUUM main_lead;
That will update the visibility map, and if there are no long-running concurrent transactions, that should mark most of the table blocks as all-visible, so that you can get a faster index only scan for the query.
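A minimal way to try this and see whether it helped might look as follows. The pg_class check is only a rough indicator, and whether the planner actually switches to an index-only scan depends on its cost estimates. In the plan above the filter keeps 279405 rows and removes 857300, which is where the figure of about 25% comes from.

```sql
-- Update the visibility map (and the planner statistics) for the table.
VACUUM (VERBOSE, ANALYZE) main_lead;

-- Rough check of how much of the table is marked all-visible:
-- the closer relallvisible is to relpages, the fewer heap fetches
-- an index-only scan has to do.
SELECT relpages, relallvisible
FROM pg_class
WHERE oid = 'public.main_lead'::regclass;

-- Re-run the count and look for "Index Only Scan" and "Heap Fetches" in the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM main_lead
WHERE campaign_id = '9a183263-7a60-4ec0-a354-2175f8a2e5c9';
```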

Group by count query takes a long time

I have a table like this:
Column | Type | Modifiers
-------------+-----------------------------+-------------------------------------------------------
id | integer | not null default nextval('oks_id_seq'::regclass)
uname | text | not null
ess | text |
quest | text |
details | text |
status | character(1) | not null default 'q'::bpchar
last_parsed | timestamp without time zone |
qstatus | character(1) | not null default 'q'::bpchar
media_wc | integer | not null default 0
Indexes:
"oks_pkey" PRIMARY KEY, btree (id)
"oks_uname_key" UNIQUE CONSTRAINT, btree (uname)
"last_parsed_idx" btree (last_parsed)
"qstatus_idx" btree (qstatus)
"status_idx" btree (status)
And I have a query like this:
SELECT COUNT(status), status FROM oks GROUP BY status ORDER BY status;
Which results in:
count | status
---------+--------
1478472 | d
23599 | p
10178 | q
6278206 | s
(4 rows)
Which is great, but this takes forever, and for some reason Postgres keeps the whole index on disk, because disk activity is really high during the query.
Sort (cost=1117385.91..1117385.92 rows=4 width=2) (actual time=54122.991..54122.993 rows=4 loops=1)
Sort Key: status
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=1117385.82..1117385.86 rows=4 width=2) (actual time=54122.280..54122.283 rows=4 loops=1)
-> Seq Scan on oks (cost=0.00..1078433.55 rows=7790455 width=2) (actual time=0.009..47978.616 rows=7790455 loops=1)
Total runtime: 54123.487 ms
(6 rows)
In my config I have the memory usage set at
work_mem = 128MB
Any ideas about how I can optimize such queries that use group by on the whole table? This seems unrealistically slow as it would have been much faster with flat files storage.
Edit:
I was able to get the query to run in a fraction of a second by modifying the postgres config files. Specifically, setting
fsync = off
synchronous_commit = off
full_page_writes = off
commit_delay = 2000
effective_cache_size = 4GB
work_mem = 512MB
maintenance_work_mem = 512MB
Not sure if these are optimal, but these options work in my case.
fsync = off helped the most I think.
Try cstore. It is a column-store "table", implemented as a foreign data wrapper.
Info:
https://github.com/citusdata/cstore_fdw
How to use cstore:
https://stackoverflow.com/questions/29970937/psql-using-cstore-table-for-aggregation-big-data
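As a rough sketch of what that could look like for the table in the question: the statements below assume cstore_fdw is installed and listed in shared_preload_libraries, and the names cstore_server and oks_cstore are made up for illustration. Check the project README linked above for the options supported by your version.

```sql
-- Assumption: cstore_fdw is installed and preloaded on this server.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Hypothetical columnar copy holding only the columns the report needs.
CREATE FOREIGN TABLE oks_cstore (
    id     integer NOT NULL,
    status character(1) NOT NULL
)
SERVER cstore_server
OPTIONS (compression 'pglz');

-- Load the data from the regular table.
INSERT INTO oks_cstore SELECT id, status FROM oks;

-- The aggregate now only has to read the compressed status column.
SELECT COUNT(status), status FROM oks_cstore GROUP BY status ORDER BY status;
```

Keep in mind that this is a separate copy of the data: it has to be reloaded when oks changes, since cstore tables are meant for append-style loading rather than UPDATE or DELETE.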

How can this simple query take so long?

=> SELECT * FROM "tags" WHERE ("kind" = 'View') ORDER BY "name";
Time: 278.318 ms
The tags table contains 358 rows. All of them are views at the moment.
Column | Type | Modifiers
-------------+--------------------------+-------------------------------------
id | uuid | not null default uuid_generate_v4()
name | text | not null
slug | text | not null
kind | text | not null
external_id | text |
created_at | timestamp with time zone | not null default now()
updated_at | timestamp with time zone |
filter | json |
Indexes:
"tags_pkey" PRIMARY KEY, btree (id)
"tags_kind_index" btree (kind)
"tags_name_index" btree (name)
Analyze says:
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Sort (cost=9.29..9.47 rows=358 width=124) (actual time=0.654..0.696 rows=358 loops=1)
Sort Key: name
Sort Method: quicksort Memory: 75kB
-> Seq Scan on tags (cost=0.00..6.25 rows=358 width=124) (actual time=0.006..0.108 rows=358 loops=1)
Filter: (kind = 'View'::text)
Total runtime: 0.756 ms
(6 rows)
Did you run ANALYZE tags? It will update the table's statistics.
First, if all the kind values are 'View', then an index on that column is useless. The index will only be used if the column's cardinality is high; otherwise it is cheaper to do a sequential scan on the table.
Second, with only 358 rows it is cheaper to do a sequential scan anyway.
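To see what the planner knows about the kind column, the statistics can be refreshed and inspected along these lines (a small sketch; pg_stats is the standard statistics view, and the values it shows depend on your data):

```sql
-- Refresh the planner statistics for the table.
ANALYZE tags;

-- With every row being 'View', n_distinct should come out as 1,
-- which tells the planner that kind = 'View' matches the whole table.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'tags'
  AND attname = 'kind';
```

Note also that the plan in the question reports a total runtime of 0.756 ms, so the sequential scan and sort are probably not where the 278 ms measured by the client is being spent.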

SQL optimization problem

In our production database, there is a slow SQL statement that runs very frequently. As the following
CSV log shows, it usually takes more than 1 second. Strangely, when I
execute the SQL on the database server, it usually takes only about 0.3 s. From the plan, we can see
that it uses the right index. Can anybody explain this and offer some optimization advice? Thanks!
--sql
SELECT id, content, the_geo, lon, lat, skyid, addtime
FROM skytf.tbl_map
where skytf.ST_Distance_sphere(the_geo, skytf.geometryFromText('POINT(0 -4.0E-5)')) <= 1000
and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext('POINT(0 -4.0E-5)'),0.005)
order by addtime desc limit 30
--csv log
2011-08-30 03:02:06.029 CST,"lbs","skytf",28656,"192.168.168.46:53430",4e5b9d45.6ff0,356,"SELECT",2011-08-29 22:08:05 CST,106/3030952,0,LOG,00000,"duration: 1782.945 ms execute <unnamed>: SELECT id,content,the_geo,lon,lat,skyid,addtime FROM skytf.tbl_map where skytf.ST_Distance_sphere(the_geo,skytf.geometryFromText($1))<=1000 and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext($2),0.005) order by addtime desc limit 30 ","parameters: $1 = 'POINT(0 -4.0E-5)', $2 = 'POINT(0 -4.0E-5)'",,,,,,,,""
There are plenty of these duration log entries; I have pasted just one line, which shows "duration: 1782.945 ms".
--explain analyze
skytf=> explain analyze SELECT id,content,the_geo,lon,lat,skyid,addtime FROM skytf.tbl_map
skytf-> where skytf.ST_Distance_sphere(the_geo,skytf.geometryFromText('POINT(0 -4.0E-5)'))<=1000
skytf-> and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext('POINT(0 -4.0E-5)'),0.005)
skytf-> order by addtime desc limit 30 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11118.56..11118.64 rows=30 width=128) (actual time=338.031..338.037 rows=30 loops=1)
-> Sort (cost=11118.56..11124.60 rows=2416 width=128) (actual time=338.030..338.032 rows=30 loops=1)
Sort Key: addtime
Sort Method: top-N heapsort Memory: 29kB
-> Bitmap Heap Scan on tbl_map (cost=201.51..11047.21 rows=2416 width=128) (actual time=53.455..309.962 rows=78121 loops=1)
Recheck Cond: ((the_geo)::box && '(0.005,0.00496),(-0.005,-0.00504)'::box)
Filter: (skytf.st_distance_sphere(the_geo, '01010000000000000000000000F168E388B5F804BF'::skytf.geometry) <=1000::double precision)
-> Bitmap Index Scan on tbl_map_idx_gin (cost=0.00..200.91 rows=7249 width=0) (actual time=49.392..49.392 rows=78559 loops=1)
Index Cond: ((the_geo)::box && '(0.005,0.00496),(-0.005,-0.00504)'::box)
Total runtime: 338.125 ms
(10 rows)
--table information
skytf=> \dt+ skytf.tbl_map
List of relations
Schema | Name | Type | Owner | Size | Description
------------+---------------+-------+-------+--------+-------------
skytf | tbl_map | table | lbs | 158 MB |
(1 row)
skytf=> \d skytf.tbl_map
Table "skytf.tbl_map"
Column | Type | Modifiers
--------------+--------------------------------+-----------------------------------------------------------------------
id | integer | not null default nextval('skytf.tbl_map_id_seq'::regclass)
content | character varying(100) |
lon | double precision |
lat | double precision |
skyid | integer |
addtime | timestamp(1) without time zone | default now()
the_geo | skytf.geometry |
viewcount | integer | default 0
lastreadtime | timestamp without time zone |
ischeck | boolean |
Indexes:
"tbl_map_pkey" PRIMARY KEY, btree (id)
"idx_map_book_skyid" btree (skyid, addtime)
"tbl_map_idx_gin" gist ((the_geo::box))
Queries can run slower while a checkpoint is being executed.
Try logging checkpoints, and if that turns out to be the cause, tune checkpoints for latency. In general this may reduce the bytes written per second, but it will improve the slow queries.
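A first look at the checkpoint settings and activity could be taken like this. It is a sketch, not a tuning recipe: the settings themselves are changed in postgresql.conf followed by a reload, and the right values depend on the workload.

```sql
-- Is checkpoint logging enabled? If not, set log_checkpoints = on in
-- postgresql.conf and reload, then compare checkpoint times in the log
-- with the timestamps of the slow "duration:" entries.
SHOW log_checkpoints;

-- Spreading checkpoint writes over more of the checkpoint interval
-- (a higher checkpoint_completion_target) reduces I/O spikes.
SHOW checkpoint_completion_target;

-- Timed vs. requested checkpoints since the statistics were last reset.
-- Many requested checkpoints suggest they are being triggered by WAL volume.
-- Note: in recent releases (PostgreSQL 17+) these counters live in
-- pg_stat_checkpointer instead.
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;
```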