SQL optimization problem - postgresql

In our production database there is a slow SQL statement that runs very frequently. As the following
CSV log shows, it usually takes more than 1 second. Strangely, when I
execute the same SQL on the database server, it usually takes only about 0.3 s. From the plan we can see
that it uses the right index. Can anybody explain this and offer some optimization advice? Thanks!
--sql
SELECT id, content, the_geo, lon, lat, skyid, addtime
FROM skytf.tbl_map
where skytf.ST_Distance_sphere(the_geo, skytf.geometryFromText('POINT(0 -4.0E-5)')) <= 1000
and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext('POINT(0 -4.0E-5)'),0.005)
order by addtime desc limit 30
--csv log
2011-08-30 03:02:06.029 CST,"lbs","skytf",28656,"192.168.168.46:53430",4e5b9d45.6ff0,356,"SELECT",2011-08-29 22:08:05 CST,106/3030952,0,LOG,00000,"duration: 1782.945 ms execute <unnamed>: SELECT id,content,the_geo,lon,lat,skyid,addtime FROM skytf.tbl_map where skytf.ST_Distance_sphere(the_geo,skytf.geometryFromText($1))<=1000 and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext($2),0.005) order by addtime desc limit 30 ","parameters: $1 = 'POINT(0 -4.0E-5)', $2 = 'POINT(0 -4.0E-5)'",,,,,,,,""
There are plenty of duration log entries like this; I have pasted just one line, which shows "duration: 1782.945 ms".
--explain analyze
skytf=> explain analyze SELECT id,content,the_geo,lon,lat,skyid,addtime FROM skytf.tbl_map
skytf-> where skytf.ST_Distance_sphere(the_geo,skytf.geometryFromText('POINT(0 -4.0E-5)'))<=1000
skytf-> and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext('POINT(0 -4.0E-5)'),0.005)
skytf-> order by addtime desc limit 30 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11118.56..11118.64 rows=30 width=128) (actual time=338.031..338.037 rows=30 loops=1)
-> Sort (cost=11118.56..11124.60 rows=2416 width=128) (actual time=338.030..338.032 rows=30 loops=1)
Sort Key: addtime
Sort Method: top-N heapsort Memory: 29kB
-> Bitmap Heap Scan on tbl_map (cost=201.51..11047.21 rows=2416 width=128) (actual time=53.455..309.962 rows=78121 loops=1)
Recheck Cond: ((the_geo)::box && '(0.005,0.00496),(-0.005,-0.00504)'::box)
Filter: (skytf.st_distance_sphere(the_geo, '01010000000000000000000000F168E388B5F804BF'::skytf.geometry) <=1000::double precision)
-> Bitmap Index Scan on tbl_map_idx_gin (cost=0.00..200.91 rows=7249 width=0) (actual time=49.392..49.392 rows=78559 loops=1)
Index Cond: ((the_geo)::box && '(0.005,0.00496),(-0.005,-0.00504)'::box)
Total runtime: 338.125 ms
(10 rows)
--table information
skytf=> \dt+ skytf.tbl_map
List of relations
Schema | Name | Type | Owner | Size | Description
------------+---------------+-------+-------+--------+-------------
skytf | tbl_map | table | lbs | 158 MB |
(1 row)
skytf=> \d skytf.tbl_map
Table "skytf.tbl_map"
Column | Type | Modifiers
--------------+--------------------------------+-----------------------------------------------------------------------
id | integer | not null default nextval('skytf.tbl_map_id_seq'::regclass)
content | character varying(100) |
lon | double precision |
lat | double precision |
skyid | integer |
addtime | timestamp(1) without time zone | default now()
the_geo | skytf.geometry |
viewcount | integer | default 0
lastreadtime | timestamp without time zone |
ischeck | boolean |
Indexes:
"tbl_map_pkey" PRIMARY KEY, btree (id)
"idx_map_book_skyid" btree (skyid, addtime)
"tbl_map_idx_gin" gist ((the_geo::box))

Queries can run slower while a checkpoint is being executed.
Try logging checkpoints, and if the slow queries coincide with them, tune checkpoints for latency. In general this may reduce the bytes written per second, but it will improve the slow queries.
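A minimal sketch of how checkpoint logging could be switched on to test this theory. ALTER SYSTEM is only available in PostgreSQL 9.4 and later; on older servers the same two settings would go into postgresql.conf, followed by a reload. The value 0.9 is an illustrative assumption, not a figure from the answer.

```sql
-- Log every checkpoint so its timing can be compared with the slow-query log entries.
ALTER SYSTEM SET log_checkpoints = on;
-- Spread checkpoint writes over more of the checkpoint interval (assumed value).
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
-- Both settings take effect on a configuration reload; no restart is needed.
SELECT pg_reload_conf();
```

If the "duration: ..." entries cluster around the checkpoint lines in the server log, checkpoints are a plausible cause of the slow runs.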

How can I optimize a postgresql query where one dependent column is a timestamp

I have a table with a foreign key and a timestamp for when the row was most recently updated. Rows with the same foreign key value are updated at roughly the same time, plus or minus an hour. I have an index on (foreign_key, timestamp). This is on PostgreSQL 11.
When I make a query like:
select * from table where foreign_key = $1 and timestamp > $2 order by primary_key;
It will use my index in cases where the timestamp condition is selective across the entire table. But if the timestamp is far enough in the past that the majority of rows match, it will scan the primary_key index instead, assuming that will be faster. This problem goes away if I remove the order by.
I've looked at PostgreSQL's CREATE STATISTICS, but it doesn't seem to help in cases where the correlation is over a range of values, like a timestamp plus or minus five minutes, rather than a specific value.
What are the best ways to work around this? I can remove the order by, but that complicates the business logic. I can partition the table on the foreign key id, but that is also a pretty expensive change.
Specifics:
Table "public.property_home_attributes"
Column | Type | Collation | Nullable | Default
----------------------+-----------------------------+-----------+----------+------------------------------------------------------
id | integer | | not null | nextval('property_home_attributes_id_seq'::regclass)
mls_id | integer | | not null |
property_id | integer | | not null |
formatted_attributes | jsonb | | not null |
created_at | timestamp without time zone | | |
updated_at | timestamp without time zone | | |
Indexes:
"property_home_attributes_pkey" PRIMARY KEY, btree (id)
"index_property_home_attributes_on_property_id" UNIQUE, btree (property_id)
"index_property_home_attributes_on_updated_at" btree (updated_at)
"property_home_attributes_mls_id_updated_at_idx" btree (mls_id, updated_at)
The table has about 16 million rows.
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') ORDER BY id ASC LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..10147.83 rows=1000 width=880) (actual time=1519.718..22310.674 rows=1000 loops=1)
-> Index Scan using property_home_attributes_pkey on property_home_attributes (cost=0.56..6094202.57 rows=600576 width=880) (actual time=1519.716..22310.398 rows=1000 loops=1)
Filter: ((updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone) AND (mls_id = 46))
Rows Removed by Filter: 358834
Planning Time: 0.110 ms
Execution Time: 22310.842 ms
(6 rows)
and then without the order by:
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..1049.38 rows=1000 width=880) (actual time=0.053..162.081 rows=1000 loops=1)
-> Index Scan using foo on property_home_attributes (cost=0.56..629893.60 rows=600576 width=880) (actual time=0.053..161.992 rows=1000 loops=1)
Index Cond: ((mls_id = 46) AND (updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone))
Planning Time: 0.100 ms
Execution Time: 162.140 ms
(5 rows)
If you want to keep PostgreSQL from using an index scan on property_home_attributes_pkey to support the ORDER BY, you can simply use
ORDER BY primary_key + 0
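Applied to the table above, where the primary key column is id, the suggestion might look like the sketch below. Because id + 0 is an expression rather than the bare column, the planner can no longer match the ORDER BY to property_home_attributes_pkey, so it has to find the rows some other way (presumably through the index on (mls_id, updated_at)) and sort them explicitly. Which plan it actually picks should be confirmed with EXPLAIN ANALYZE.

```sql
EXPLAIN ANALYZE
SELECT *
FROM property_home_attributes
WHERE mls_id = 46
  AND updated_at < '2019-10-30 16:52:06.326774'
ORDER BY id + 0 ASC   -- same ordering as ORDER BY id, but not matchable to the pkey index
LIMIT 1000;
```

Note that all matching rows now have to be fetched and sorted before the LIMIT is applied, so this trades the slow filtered primary-key scan for an explicit sort.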

Speed up fulltext search - pgsql

I have seen millions of threads about speeding up PostgreSQL queries that use full-text search. I have tried everything I could find, but I am out of ideas.
I have a pretty big table (20 612 971 records at the moment) and search it with PostgreSQL's full-text search, then order the results by ts_rank_cd. The query takes around 3500-4000 ms to execute. Any ideas how to make it faster? If possible I don't want to use external software like Sphinx or Solr, so native PostgreSQL solutions are preferred :) Below are the description of my table and an example EXPLAIN ANALYZE of the SELECT.
# \d artifacts.item
Table "artifacts.item"
Column | Type | Modifiers
-------------------------+-----------------------------+-------------------------------------------------------------
add_timestamp | timestamp without time zone |
author_account_id | integer |
description | text |
id | integer | not null default nextval('artifacts.item_id_seq'::regclass)
removed_since_timestamp | timestamp without time zone |
slug | character varying(2044) | not null
thumb_height | integer |
thumb_path | character varying(2044) | default NULL::character varying
thumb_width | integer |
title | character varying(2044) | not null
search_data | tsvector |
tags | integer[] |
is_age_restricted | boolean | not null default false
is_on_homepage | boolean | not null default false
is_public | boolean | not null default false
thumb_filename | character varying(2044) |
is_removed | boolean | not null default false
Indexes:
"artifacts_item_add_timestamp_idx" btree (add_timestamp DESC NULLS LAST)
"artifacts_item_id_idx" btree (id)
"artifacts_item_is_on_homepage_add_timestamp" btree (is_on_homepage DESC, add_timestamp DESC NULLS LAST)
"artifacts_item_is_on_homepage_idx" btree (is_on_homepage)
"artifacts_item_search_results" gin (search_data) WHERE is_public IS TRUE AND is_removed IS FALSE
"artifacts_item_tags_gin_idx" gin (tags)
"artifacts_item_thumbs_list" btree (is_public, is_removed, id DESC)
"index1" btree (add_timestamp)
"itemIdx" btree (is_public, is_removed, is_age_restricted)
"item_author_account_id_idx" btree (author_account_id)
analyze:
# explain analyze SELECT i.id,
# i.title,
# i.description,
# i.slug,
# i.thumb_path,
# i.thumb_filename,
# CONCAT(
# i.thumb_path,
# '/',
# i.thumb_filename
# ) AS thumb_url,
# (CASE WHEN i.thumb_width = 0 THEN 280 ELSE i.thumb_width END) as thumb_width,
# (CASE WHEN i.thumb_height = 0 THEN 280 ELSE i.thumb_height END) as thumb_height,
# (i.thumb_height > i.thumb_width) AS is_vertical,
# i.add_timestamp
# FROM artifacts.item AS i
# WHERE i.is_public IS true
# AND i.is_removed IS false
# AND (i.search_data @@ to_tsquery('public.polish', $$'lego'$$))
# ORDER BY ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# i.add_timestamp DESC NULLS LAST
# LIMIT 60
# OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=358061.78..358061.93 rows=60 width=315) (actual time=335.870..335.876 rows=60 loops=1)
-> Sort (cost=358061.78..358357.25 rows=118189 width=315) (actual time=335.868..335.868 rows=60 loops=1)
Sort Key: (ts_rank_cd(search_data, '''lego'' | ''lega'''::tsquery)), add_timestamp
Sort Method: top-N heapsort Memory: 55kB
-> Bitmap Heap Scan on item i (cost=2535.96..353980.19 rows=118189 width=315) (actual time=33.163..308.371 rows=62025 loops=1)
Recheck Cond: ((search_data @@ '''lego'' | ''lega'''::tsquery) AND (is_public IS TRUE) AND (is_removed IS FALSE))
-> Bitmap Index Scan on artifacts_item_search_results (cost=0.00..2506.42 rows=118189 width=0) (actual time=23.066..23.066 rows=62085 loops=1)
Index Cond: (search_data @@ '''lego'' | ''lega'''::tsquery)
Total runtime: 335.967 ms
(9 rows)
Time: 3444.731 ms
There are 62025 rows that match the condition, and they have to be ordered…
That will take a while. Is there a chance that you can have the whole database or at least the index in RAM? That would help.
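One way to check how much of the work is served from memory, and to pull the index and table into PostgreSQL's cache, is sketched below. It assumes the pg_prewarm contrib extension is available (PostgreSQL 9.4 or later) and that shared_buffers is large enough to hold the relations; neither is stated in the question. The query is a trimmed version of the one above.

```sql
-- pg_prewarm is a contrib extension; installing it is an assumption here.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Load the partial GIN index and the table into shared buffers.
SELECT pg_prewarm('artifacts.artifacts_item_search_results');
SELECT pg_prewarm('artifacts.item');

-- With BUFFERS, "read" counts are blocks that were not found in shared buffers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT i.id
FROM artifacts.item AS i
WHERE i.is_public IS true
  AND i.is_removed IS false
  AND i.search_data @@ to_tsquery('public.polish', $$'lego'$$)
ORDER BY ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) DESC
LIMIT 60;
```

If the plan still shows a large number of blocks read rather than hit after prewarming, the relations probably do not fit in shared buffers.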

Group by count query takes a long time

I have a table like this:
Column | Type | Modifiers
-------------+-----------------------------+-------------------------------------------------------
id | integer | not null default nextval('oks_id_seq'::regclass)
uname | text | not null
ess | text |
quest | text |
details | text |
status | character(1) | not null default 'q'::bpchar
last_parsed | timestamp without time zone |
qstatus | character(1) | not null default 'q'::bpchar
media_wc | integer | not null default 0
Indexes:
"oks_pkey" PRIMARY KEY, btree (id)
"oks_uname_key" UNIQUE CONSTRAINT, btree (uname)
"last_parsed_idx" btree (last_parsed)
"qstatus_idx" btree (qstatus)
"status_idx" btree (status)
And I have a query like this:
SELECT COUNT(status), status FROM oks GROUP BY status ORDER BY status;
Which results in:
count | status
---------+--------
1478472 | d
23599 | p
10178 | q
6278206 | s
(4 rows)
Which is great, but this takes forever, and for some reason Postgres keeps the whole index on disk, because disk activity is really high during the query.
Sort (cost=1117385.91..1117385.92 rows=4 width=2) (actual time=54122.991..54122.993 rows=4 loops=1)
Sort Key: status
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=1117385.82..1117385.86 rows=4 width=2) (actual time=54122.280..54122.283 rows=4 loops=1)
-> Seq Scan on oks (cost=0.00..1078433.55 rows=7790455 width=2) (actual time=0.009..47978.616 rows=7790455 loops=1)
Total runtime: 54123.487 ms
(6 rows)
In my config I have the memory usage set at
work_mem = 128MB
Any ideas about how I can optimize such queries that use group by on the whole table? This seems unrealistically slow as it would have been much faster with flat files storage.
Edit:
I was able to get the query to run in a fraction of a second by modifying the postgres config files. Specifically, setting
fsync = off
synchronous_commit = off
full_page_writes = off
commit_delay = 2000
effective_cache_size = 4GB
work_mem = 512MB
maintenance_work_mem = 512MB
Not sure if these are optimal, but these options work in my case.
fsync = off helped the most I think.
Try using cstore. It provides a column-store "table".
info:
https://github.com/citusdata/cstore_fdw
how to use cstore:
https://stackoverflow.com/questions/29970937/psql-using-cstore-table-for-aggregation-big-data
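A rough sketch of what a cstore_fdw setup for this query could look like, following the pattern in the project's README. It assumes the extension is installed and listed in shared_preload_libraries; the names cstore_server and oks_status_cstore are hypothetical, and the copy holds only the columns the report needs.

```sql
-- Assumes cstore_fdw is installed and preloaded by the server.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Hypothetical columnar copy of the two relevant columns.
CREATE FOREIGN TABLE oks_status_cstore (
    id     integer,
    status character(1)
)
SERVER cstore_server
OPTIONS (compression 'pglz');

-- Load the data from the regular table.
INSERT INTO oks_status_cstore SELECT id, status FROM oks;

-- The same aggregate, now reading only the status column.
SELECT COUNT(status), status
FROM oks_status_cstore
GROUP BY status
ORDER BY status;
```

Since the columnar copy is separate from oks, it would have to be reloaded whenever up-to-date counts are needed.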

How can this simple query take so long?

=> SELECT * FROM "tags" WHERE ("kind" = 'View') ORDER BY "name";
Time: 278.318 ms
The tags table contains 358 rows. All of them are views at the moment.
Column | Type | Modifiers
-------------+--------------------------+-------------------------------------
id | uuid | not null default uuid_generate_v4()
name | text | not null
slug | text | not null
kind | text | not null
external_id | text |
created_at | timestamp with time zone | not null default now()
updated_at | timestamp with time zone |
filter | json |
Indexes:
"tags_pkey" PRIMARY KEY, btree (id)
"tags_kind_index" btree (kind)
"tags_name_index" btree (name)
Analyze says:
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Sort (cost=9.29..9.47 rows=358 width=124) (actual time=0.654..0.696 rows=358 loops=1)
Sort Key: name
Sort Method: quicksort Memory: 75kB
-> Seq Scan on tags (cost=0.00..6.25 rows=358 width=124) (actual time=0.006..0.108 rows=358 loops=1)
Filter: (kind = 'View'::text)
Total runtime: 0.756 ms
(6 rows)
Did you run ANALYZE tags? It will update the table's statistics.
First, if all the kind values are 'View', then an index on that column is useless. The index will only be used if the condition is selective, that is, if the column's cardinality is high; otherwise it is cheaper to do a sequential scan on the table.
Second, with only 358 rows it is cheaper to do a sequential scan anyway.
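As a small illustration, the statistics can be refreshed and the planner's view of the kind column inspected through the standard pg_stats view:

```sql
-- Refresh the planner statistics for the table.
ANALYZE tags;

-- n_distinct close to 1 means the column has a single value,
-- so a condition on it filters out nothing.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'tags'
  AND attname = 'kind';
```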

PostgreSQL: running a query for each row and saving the result in it

I store weekly game score in a table called pref_money:
# select * from pref_money limit 5;
id | money | yw
----------------+-------+---------
OK32378280203 | -27 | 2011-44
OK274037315447 | -56 | 2011-44
OK19644992852 | 8 | 2011-44
OK21807961329 | 114 | 2011-44
FB1845091917 | 774 | 2011-44
(5 rows)
And for the winners of each week I display medal(s):
I find the number of medals for a user by running:
# select count(id) from (
select id,
row_number() over(partition by yw order by money desc) as ranking
from pref_money
) x
where x.ranking = 1 and id='OK260246921082';
count
-------
3
(1 row)
And that query is quite costly:
# explain analyze select count(id) from (
select id,
row_number() over(partition by yw order by money desc) as ranking
from pref_money
) x
where x.ranking = 1 and id='OK260246921082';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=18946.46..18946.47 rows=1 width=82) (actual time=2423.145..2423.145 rows=1 loops=1)
-> Subquery Scan x (cost=14829.44..18946.45 rows=3 width=82) (actual time=2400.004..2423.138 rows=3 loops=1)
Filter: ((x.ranking = 1) AND ((x.id)::text = 'OK260246921082'::text))
-> WindowAgg (cost=14829.44..17182.02 rows=117629 width=26) (actual time=2289.079..2403.685 rows=116825 loops=1)
-> Sort (cost=14829.44..15123.51 rows=117629 width=26) (actual time=2289.069..2319.575 rows=116825 loops=1)
Sort Key: pref_money.yw, pref_money.money
Sort Method: external sort Disk: 4320kB
-> Seq Scan on pref_money (cost=0.00..2105.29 rows=117629 width=26) (actual time=0.006..22.566 rows=116825 loops=1)
Total runtime: 2425.001 ms
(9 rows)
That is why (and because my web site is struggling during peak times, with 50 queries/s displayed in pgbouncer log) I'd like to cache that value and have added a column medals to another table - the pref_users:
pref=> \d pref_users;
Table "public.pref_users"
Column | Type | Modifiers
------------+-----------------------------+---------------
id | character varying(32) | not null
first_name | character varying(32) |
last_name | character varying(32) |
female | boolean |
avatar | character varying(128) |
city | character varying(32) |
lat | real |
lng | real |
login | timestamp without time zone | default now()
last_ip | inet |
medals | smallint | default 0
logout | timestamp without time zone |
Indexes:
"pref_users_pkey" PRIMARY KEY, btree (id)
Check constraints:
"pref_users_lat_check" CHECK ((-90)::double precision <= lat AND lat <= 90::double precision)
"pref_users_lng_check" CHECK ((-90)::double precision <= lng AND lng <= 90::double precision)
"pref_users_medals_check" CHECK (medals >= 0)
I would like to create a cronjob to be run every 15 minutes to update that column for all users in the pref_users table:
*/15 * * * * psql -a -f $HOME/bin/medals.sql
As you see, I've got almost everything in-place. My problem is that I haven't come up with the SQL statement yet for updating the medals column.
Any help please?
I'm using PostgreSQL 8.4.8 with CentOS Linux 5.6 / 64 bit.
Thank you!
Alex
Well, won't this produce a result of user IDs and medal counts?
create view user_medal_count as
select id, count(*) as medals from (
select id,
row_number() over(partition by yw order by money desc) as ranking
from pref_money
) x
where x.ranking = 1
group by id
So you need to use that as a source to update your users:
update pref_users
set medals = user_medal_count.medals
from user_medal_count
where pref_users.id = user_medal_count.id
and (pref_users.medals is null
or pref_users.medals <> user_medal_count.medals)
I hope that gets you started.
There are issues left to consider. You probably want to define at which point a user is awarded a medal: the medal for the "current week" is presumably subject to change, so you may want to define the medal count as the stable count of previous weeks' medals, calculate the current week's medal on the fly (which should require looking at much less data), or simply exclude it. (If you don't do anything, you may find that a user gets a medals value of 1 while they temporarily hold the current week's medal, and that it is never reset to 0 if the medal later goes to someone else, because the update only touches users who appear in the view.)
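One possible shape for an update that handles both points, offered as a sketch rather than a tested solution: it leaves out the current week and also resets users who no longer hold any medal. It assumes yw is stored as ISO year and week in the form 'IYYY-IW' (the question only shows values like 2011-44, so the exact format is a guess), and the aliases u, all_users and w are made up for the example.

```sql
update pref_users u
set medals = coalesce(w.medals, 0)
from pref_users all_users
left join (
    -- Medal count per user over completed weeks only.
    select id, count(*) as medals
    from (
        select id,
               row_number() over (partition by yw order by money desc) as ranking
        from pref_money
        -- Assumed yw format; adjust to however the weeks are actually written.
        where yw <> to_char(current_timestamp, 'IYYY-IW')
    ) x
    where x.ranking = 1
    group by id
) w on w.id = all_users.id
where u.id = all_users.id
  -- Only touch rows whose stored value actually changes.
  and u.medals is distinct from coalesce(w.medals, 0);
```

The left join keeps users with no medals in the source, so their count falls back to 0 instead of being skipped.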