PostgreSQL: slow update query

I'm working with a table that has about 19 million rows and 60 columns (bigtable). Of the 19 million records, about 17 million have x and y coordinates (about 1.8 million distinct combinations of x and y). I needed to add some additional geocoded information to the table from another file (census_geocode). I've created a lookup table (distinct_xy) that has a list of all the distinct x and y coordinate pairs and an ID. I have indexes on bigtable (x_coord, y_coord), census_geocode (x_coord, y_coord), and distinct_xy (x_coord, y_coord), and a primary key in distinct_xy (xy_id) and census_geocode (xy_id). So here's the query:
Update bigtable
set block_grp = cg.blkgrp,
block = cg.block,
tract = cg.tractce10
from census_geocode cg, distinct_xy xy
where bigtable.x_coord = xy.x_coord and
bigtable.y_coord=xy.y_coord and cg.xy_id=xy.xy_id;
This is running extremely slowly. Here is the query plan:
"Update on bigtable (cost=17675751.51..17827040.74 rows=22 width=327)"
" -> Nested Loop (cost=17675751.51..17827040.74 rows=22 width=327)"
" -> Merge Join (cost=17675751.51..17826856.26 rows=22 width=312)"
" Merge Cond: ((bigtable.x_coord = xy.x_coord) AND (bigtable.y_coord = xy.y_coord))"
" -> Sort (cost=17318145.58..17366400.81 rows=19302092 width=302)"
" Sort Key: bigtable.x_coord, bigtable.y_coord"
" -> Seq Scan on bigtable (cost=0.00..1457709.92 rows=19302092 width=302)"
" -> Materialize (cost=357588.42..366887.02 rows=1859720 width=26)"
" -> Sort (cost=357588.42..362237.72 rows=1859720 width=26)"
" Sort Key: xy.x_coord, xy.y_coord"
" -> Seq Scan on distinct_xy xy (cost=0.00..30443.20 rows=1859720 width=26)"
" -> Index Scan using census_geocode_pkey on census_geocode cg (cost=0.00..8.37 rows=1 width=23)"
" Index Cond: (xy_id = xy.xy_id)"
I've also tried splitting this apart and inserting the lookup key back into the big table to avoid the multi-table join.
Update bigtable
set xy_id = xy.xy_id
from distinct_xy xy
where bigtable.x_coord = xy.x_coord and bigtable.y_coord=xy.y_coord;
This also runs for hours without completing:
"Update on bigtable (cost=0.00..20577101.71 rows=22 width=404)"
" -> Nested Loop (cost=0.00..20577101.71 rows=22 width=404)"
" -> Seq Scan on distinct_xy xy (cost=0.00..30443.20 rows=1859720 width=26)"
" -> Index Scan using rae_xy_idx on bigtable (cost=0.00..11.03 rows=1 width=394)"
" Index Cond: ((x_coord = xy.x_coord) AND (y_coord = xy.y_coord))"
Can someone please help me improve this query's performance?
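One possible approach, sketched under assumptions the question does not confirm: refresh the planner statistics first (an estimate of 22 result rows for a join that should touch roughly 17 million rows looks off), then run the update in slices so that each statement touches a bounded number of rows. The slice bounds are hypothetical, and the sketch assumes xy_id is an integer key.

```sql
-- Refresh planner statistics on the three tables involved.
ANALYZE bigtable;
ANALYZE distinct_xy;
ANALYZE census_geocode;

-- Update one slice of lookup keys at a time.
-- The range 1..100000 is a hypothetical slice; repeat with the next range.
UPDATE bigtable AS b
SET block_grp = cg.blkgrp,
    block     = cg.block,
    tract     = cg.tractce10
FROM distinct_xy AS xy
JOIN census_geocode AS cg ON cg.xy_id = xy.xy_id
WHERE b.x_coord = xy.x_coord
  AND b.y_coord = xy.y_coord
  AND xy.xy_id BETWEEN 1 AND 100000;
```

Run outside an explicit transaction, each slice commits on its own, so progress survives an interruption.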

Related

Avoid Materialize in Explain Plan while running Postgres Query

I am trying to understand the explain plan and optimize my query. Here is the query that I am using. When I join with the pd_ontology table, the cost increases heavily.
explain create table l1.test as
select
null as empi,
coalesce(nullif(a.pid_2_1,''),nullif(a.pid_3_1,''),nullif(a.pid_4_1,'')) as id,
coalesce(nullif(pid_3_5,''),'Patient ID') as idt,
upper(trim(pid_5_2)) as fn,
upper(trim(pid_5_3)) as mn,
upper(trim(pid_5_1)) as ln,
nullif(pid_7_1,'')::date as dob,
upper(trim(pid_8_1)) as gn,
nullif(pid_29_1,'')::date as dod,
upper(trim(pid_30_1)) as df,
upper(trim(pid_11_1)) as psa1,
upper(trim(pid_11_2)) as psa2,
upper(trim(pid_11_3)) as pci,
upper(trim(pid_11_4)) as pst,
upper(trim(pid_11_5)) as pz,
upper(trim(pid_11_6)) as pcy,
coalesce(nullif(a.pid_13_1,''),nullif(a.pid_13_2,''),nullif(a.pid_13_3,''),nullif(a.pid_14_1,''),nullif(a.pid_14_2,''),nullif(a.pid_14_3,'')) as tel1,
coalesce(nullif(a.pid_13_1,''),nullif(a.pid_13_2,''),nullif(a.pid_13_3,''),nullif(a.pid_14_1,''),nullif(a.pid_14_2,''),nullif(a.pid_14_3,'')) as cell1,
lower(trim(pid_13_4)) as eml1,
upper(trim(pid_10_1)) as race,
upper(trim(pid_10_2)) as racen,
upper(trim(pid_22_1)) as ethn,
upper(trim(pid_22_2)) as ethnm,
upper(trim(pid_24_1)) as mbi,
upper(trim(pid_16_1)) as ms,
upper(trim(pid_16_2)) as msn,
coalesce(nullif(a.pid_11_9,''),nullif(a.pid_12_1,'')) as pct,
upper(trim(pid_15_1)) as pl,
upper(trim(pid_17_1)) as rel,
upper(trim(pid_19_1)) as ssn,
trim(obx_3_1) as rc,
--trim(o.cdscs) as rn,
null as rn,
trim(obx_3_3) as rcs,
trim(obx_5_1) as rv,
obx_6_1 as uru,
obx_8_1 as oac,
obr_25_1 as rst,
rtrim(trim(replace(replace(regexp_replace(replace(obx_7_1,'x10E3','*10^3'),'[a-zA-Z%]','','g'),'^','E'),'*','x')),'/') as onrr,
trim(split_part(rtrim(trim(replace(replace(regexp_replace(replace(obx_7_1,'x10E3','*10^3'),'[a-zA-Z%]','','g'),'^','E'),'*','x')),'/'),'-',1)) as rrl,
trim(split_part(rtrim(trim(replace(replace(regexp_replace(replace(obx_7_1,'x10E3','*10^3'),'[a-zA-Z%]','','g'),'^','E'),'*','x')),'/'),'-',2)) as rrh,
obx_10_1 as natc,
orc_2_1 as "pon",
left(nullif(obx_14_1,''),8)::date as rdt,
case when to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') not between '1800-01-01' and current_date then null else to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') end as efdt,
case when to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') not between '1800-01-01' and current_date then null else to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') end as eldt,
coalesce(obr_16_1,'') as opid,
nullif(obr_16_13,'null') as opidt,
trim(orc_12_1) as opnpi,
--trim(upper(n.name)) as opn,
null as opn,
trim(nullif(obr_4_1,'null')) as oc,
trim(nullif(obr_4_3,'null')) as ocs,
trim(nullif(obr_4_2,'null')) as on,
to_date(nullif(obr_7_1,''),'yyyyMMddHH24miss') as ofdt,
trim(orc_5_1) as os,
--left(e.nte_3_1,512) as cmd,
split_part(b.filename,'/',5) as sfn,
'Clinical' as st,
now() AS ingdt,
'4' as acoid ,
'Test' as acon,
'result' as cltp,
'Test' as sstp,
'202' as sid
from l1.vipn_pal_historical_all_oru_pid a
join l1.vipn_pal_historical_all_oru_obx b
on a.control_id = b.control_id
and b.cross_join_tuple_count = '0'
left join l1.vipn_pal_historical_all_oru_obr c
on a.control_id = c.control_id
and b.order_observation_order = c.order_observation_order
and a.cross_join_tuple_count = '1'
left join l1.vipn_pal_historical_all_oru_orc d
on a.control_id = d.control_id
and d.order_observation_order = b.order_observation_order
and a.cross_join_tuple_count = '1'
left join (select control_id ,order_observation_order ,observation_order,replace(string_agg(nte_3_1 ,' '),'\.br\',chr(13)||chr(10)) as nte_3_2
from l1.vipn_pal_historical_all_oru_nte
group by control_id ,order_observation_order ,observation_order ) e
on a.control_id = e.control_id and e.observation_order = b.observation_order
and e.order_observation_order = b.order_observation_order
--and e.cross_join_tuple_count = '1'
left join (select * from l2.pd_ontology where dtp = 'result') o
on (b.obx_3_1 = o.cval or b.obx_3_1 = cvald)
left join l2.pd_npi n
on d.orc_12_1 = n.npi;
Here is the generated explain plan, where you can see that the Materialize node accounts for much of the cost.
Merge Left Join (cost=106313.03..7599360149686.98 rows=329075452 width=1641)
Merge Cond: ((a.control_id)::text = (c.control_id)::text)
Join Filter: (((a.cross_join_tuple_count)::text = '1'::text) AND ((b.order_observation_order)::text = (c.order_observation_order)::text))
-> Merge Left Join (cost=106311.69..7599175158271.60 rows=329075452 width=244)
Merge Cond: ((a.control_id)::text = (d.control_id)::text)
Join Filter: (((a.cross_join_tuple_count)::text = '1'::text) AND ((d.order_observation_order)::text = (b.order_observation_order)::text))
-> Merge Join (cost=106310.57..7599144659758.97 rows=329075452 width=236)
Merge Cond: ((a.control_id)::text = (b.control_id)::text)
-> Index Scan using vipn_pal_historical_all_oru_pid_control_id_idx on vipn_pal_historical_all_oru_pid a (cost=0.56..800918.31 rows=9353452 width=96)
-> Materialize (cost=106309.92..7599139604853.41 rows=282211264 width=161)
-> Nested Loop Left Join (cost=106309.92..7599138899325.25 rows=282211264 width=161)
Join Filter: (((b.obx_3_1)::text = (pd_ontology.cval)::text) OR ((b.obx_3_1)::text = (pd_ontology.cvald)::text))
-> Index Scan using vipn_pal_historical_all_oru_obx_control_id_idx on vipn_pal_historical_all_oru_obx b (cost=0.57..53285968.32 rows=282211264 width=161)
Filter: ((cross_join_tuple_count)::text = '0'::text)
-> **Materialize (cost=106309.35..1255207.79 rows=1538682 width=19)**
-> Bitmap Heap Scan on pd_ontology (cost=106309.35..1247514.38 rows=1538682 width=19)
Recheck Cond: ((dtp)::text = 'result'::text)
-> Bitmap Index Scan on pd_ont_idx_dtp (cost=0.00..105924.68 rows=1538682 width=0)
Index Cond: ((dtp)::text = 'result'::text)
-> Materialize (cost=1.12..14373643.76 rows=18706904 width=29)
-> Nested Loop Left Join (cost=1.12..14326876.50 rows=18706904 width=29)
-> Index Scan using vipn_pal_historical_all_oru_orc_control_id_idx on vipn_pal_historical_all_oru_orc d (cost=0.56..2587122.40 rows=18706904 width=29)
-> Index Only Scan using idx_pd_npi_npi on pd_npi n (cost=0.56..0.62 rows=1 width=11)
Index Cond: (npi = (d.orc_12_1)::text)
-> Materialize (cost=0.57..12676277.17 rows=80915472 width=60)
-> Index Scan using vipn_pal_historical_all_oru_obr_control_id_idx on vipn_pal_historical_all_oru_obr c (cost=0.57..12473988.49 rows=80915472 width=60)
Is there a way to avoid the Materialize node in this query and optimize it?
I removed the index to solve this issue. The Materialize node was being re-scanned, and to avoid that I dropped the index. Without the index, the planner no longer puts an index scan under Materialize, so nothing needs to be re-scanned, which reduces the cost.
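Before dropping an index, the effect of materialization can be checked for a single transaction. This is a sketch, not the author's method: it uses the planner setting enable_material and a cut-down version of the pd_ontology join, and it assumes the unqualified cvald column belongs to pd_ontology.

```sql
BEGIN;
-- Discourage Materialize nodes for this transaction only.
SET LOCAL enable_material = off;

-- Cut-down version of the join; only the plan is shown, nothing is executed.
EXPLAIN
SELECT b.obx_3_1, o.cval
FROM l1.vipn_pal_historical_all_oru_obx AS b
LEFT JOIN l2.pd_ontology AS o
       ON o.dtp = 'result'
      AND (b.obx_3_1 = o.cval OR b.obx_3_1 = o.cvald);

ROLLBACK;
```

Comparing the estimated cost with and without the setting gives a hint whether Materialize itself is the expensive part or whether the cost comes from the nested loop around it.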

Simple batch DELETE then INSERT procedure some 1000 times slower than executing the statements one after the other

A rather simple table with a composite primary key (see DDL) holds about 40k records.
create table la_ezg
(
be_id integer not null,
usage text not null,
area numeric(18, 6),
sk_area numeric(18, 6),
count_ezg numeric(18, 6),
...
...
constraint la_ezg_pkey
primary key (be_id, usage)
);
There is also a simple procedure whose purpose is to delete the rows with a certain be_id and persist the rows from another view where they are "generated":
CREATE OR REPLACE function pr_create_la_ezg(pBE_ID numeric) returns void as
$$
begin
delete from la_ezg where be_id = pBE_ID;
insert into la_ezg(BE_ID, USAGE, ...)
select be_id, usage, ...
from vw_la_ezg_with_usage
where be_id = pBE_ID;
END;
$$ language plpgsql;
The procedure needs about 7 minutes to execute.
Both statements (DELETE and INSERT) execute in less than 100 ms on the very same be_id when run on their own.
Many different locks show up in pg_locks during those 7 minutes, but I wasn't able to figure out what exactly is going on inside this transaction or whether some kind of deadlock is involved. In the end the procedure returns successfully, but it takes far too long to do so.
EDIT (activated 'auto_explain' and ran all three queries again):
duration: 1.420 ms plan:
Query Text: delete from la_ezg where be_id=790696
Delete on la_ezg (cost=4.33..22.89 rows=5 width=6)
-> Bitmap Heap Scan on la_ezg (cost=4.33..22.89 rows=5 width=6)
Output: ctid
Recheck Cond: (la_ezg.be_id = 790696)
-> Bitmap Index Scan on sys_c0073325 (cost=0.00..4.33 rows=5 width=0)
Index Cond: (la_ezg.be_id = 790696)
1 row affected in 107 ms
duration: 71.645 ms plan:
Query Text: insert into la_ezg(BE_ID,USAGE,...)
select be_id,USAGE,... from vw_la_ezg_with_usage where be_id=790696
Insert on la_ezg (cost=1343.71..2678.87 rows=1 width=228)
-> Nested Loop (cost=1343.71..2678.87 rows=1 width=228)
Output: la_ezg_geo.be_id, usage.nutzungsart, COALESCE(round(((COALESCE(st_area(la_ezg_geo.geometry), '3'::double precision) / '10000'::double precision))::numeric, 2), '0'::numeric), NULL::numeric, COALESCE((count(usage.nutzungsart)), '0'::bigint), COALESCE(round((((sum(st_area(st_intersection(ezg.geometry, usage.geom)))) / '10000'::double precision))::numeric, 2), '0'::numeric), COALESCE(round(((((sum(st_area(st_intersection(ezg.geometry, usage.geom)))) * '100'::double precision) / COALESCE(st_area(la_ezg_geo.geometry), '3'::double precision)))::numeric, 2), '0'::numeric), NULL::character varying, NULL::timestamp without time zone, NULL::character varying, NULL::timestamp without time zone
-> GroupAggregate (cost=1343.71..1343.76 rows=1 width=41)
Output: ezg.be_id, usage.nutzungsart, sum(st_area(st_intersection(ezg.geometry, usage.geom))), count(usage.nutzungsart)
Group Key: ezg.be_id, usage.nutzungsart
-> Sort (cost=1343.71..1343.71 rows=1 width=1834)
Output: ezg.be_id, usage.nutzungsart, ezg.geometry, usage.geom
Sort Key: usage.nutzungsart
-> Nested Loop (cost=0.42..1343.70 rows=1 width=1834)
Output: ezg.be_id, usage.nutzungsart, ezg.geometry, usage.geom
-> Seq Scan on la_ezg_geo ezg (cost=0.00..1335.00 rows=1 width=1516)
Output: ezg.objectid, ezg.be_id, ezg.name, ezg.se_anno_cad_data, ezg.benutzer_geaendert, ezg.datum_geaendert, ezg.status, ezg.benutzer_erstellt, ezg.datum_erstellt, ezg.len, ezg.geometry, ezg.temp_char, ezg.vulgo, ezg.flaeche, ezg.hauptgemeinde, ezg.prozessart, ezg.verbauungsgrad, ezg.verordnung_txt, ezg.gemeinden_txt, ezg.hinderungsgrund, ezg.kompetenz, ezg.seehoehe_min, ezg.seehoehe_max, ezg.neigung_min, ezg.neigung_max, ezg.exposition
Filter: (ezg.be_id = 790696)
-> Index Scan using dkm_nutz_fl_geom_1551355663100174000 on dkm.dkm_nutz_fl nutzung (cost=0.42..8.69 rows=1 width=318)
Output: usage.gdo_gid, usage.gst, usage.nutzungsart, usage.nutzungsabschnitt, usage.statistik, usage.flaeche, usage.kennung, usage.von_datum, usage.bis_datum, usage.von_az, usage.bis_az, usage.projekt, usage.fme_basename, usage.fme_dataset, usage.fme_feature_type, usage.fme_type, usage.oracle_srid, usage.geom
Index Cond: ((usage.geom && ezg.geometry) AND (usage.geom && ezg.geometry))
Filter: _st_intersects(usage.geom, ezg.geometry)
-> Seq Scan on la_ezg_geo (cost=0.00..1335.00 rows=1 width=1516)
Output: la_ezg_geo.objectid, la_ezg_geo.be_id, la_ezg_geo.name, la_ezg_geo.se_anno_cad_data, la_ezg_geo.benutzer_geaendert, la_ezg_geo.datum_geaendert, la_ezg_geo.status, la_ezg_geo.benutzer_erstellt, la_ezg_geo.datum_erstellt, la_ezg_geo.len, la_ezg_geo.geometry, la_ezg_geo.temp_char, la_ezg_geo.vulgo, la_ezg_geo.flaeche, la_ezg_geo.hauptgemeinde, la_ezg_geo.prozessart, la_ezg_geo.verbauungsgrad, la_ezg_geo.verordnung_txt, la_ezg_geo.gemeinden_txt, la_ezg_geo.hinderungsgrund, la_ezg_geo.kompetenz, la_ezg_geo.seehoehe_min, la_ezg_geo.seehoehe_max, la_ezg_geo.neigung_min, la_ezg_geo.neigung_max, la_ezg_geo.exposition
Filter: (la_ezg_geo.be_id = 790696)
1 row affected in 149 ms
duration: 421851.819 ms plan:
Query Text: select pr_create_la_ezg(790696)
Result (cost=0.00..0.26 rows=1 width=4)
Output: pr_create_la_ezg('790696'::numeric)
1 row retrieved starting from 1 in 7 m 1 s 955 ms (execution: 7 m 1 s 929 ms, fetching: 26 ms)
P.S. I shortened some of the queries and names for the sake of readability
P.P.S. This database is a legacy migration project. As in this case, there are often views that depend on other views across multiple layers. I'd like to streamline all of this, but I desperately need a way to debug what is going on inside such a transaction; otherwise I would have to rebuild nearly everything, with the risk of breaking things.
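One way to see the plans of the statements executed inside the function, sketched here as a suggestion rather than a confirmed fix, is the auto_explain setting log_nested_statements. Without it, auto_explain logs only the top-level call, which matches the single "Result" plan shown above.

```sql
-- Load auto_explain for this session (typically requires superuser).
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log every statement
SET auto_explain.log_analyze = on;            -- include actual timings
SET auto_explain.log_nested_statements = on;  -- include statements run inside functions

-- The plans for the DELETE and INSERT inside the function now appear in the server log.
SELECT pr_create_la_ezg(790696);
```

If the nested plans differ from the fast stand-alone plans, one thing worth checking is that pBE_ID is declared numeric while la_ezg.be_id is integer; whether that is the cause here is a guess, not something the question confirms.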

Postgres - Update Performance degraded

Can someone please help identify why the statement below, which used to take 2 hours, is now taking 6 hours, without a volume increase being a factor?
with P as
(SELECT DISTINCT CD.CASE_DETAIL_ID, SVL.SERVICE_LEVEL_ID
FROM report_fct CD LEFT JOIN SERVICE_LEVEL SVL ON SVL.ORDER_TYPE_CD = CD.ORDER_TYPE_CD
AND SVL.SOURCE_ID = CD.SOURCE_ID
AND SVL.AREA_ID = CD.HQ_AREA_ID
AND SVL.CATEGORY_ID = CD.CATEGORY_ID
AND SVL.STATE_CD = CD.CUST_STATE
WHERE CD.LINE_OF_BIZ = 'CLOTH'
AND CD.HQ_AREA_ID is NOT NULL
AND CD.SOURCE_ID is NOT NULL
AND CD.CATEGORY_ID is NOT NULL
AND CD.CUST_STATE is NOT NULL)
update report_fct rpt
set service_level_id = P.service_level_id
from P
where rpt.case_detail_id = P.case_detail_id;
CREATE TABLE report_fct
...
..
case_detail_id bigint NOT NULL,
...
CREATE INDEX report_fct_ix1
ON report_fct USING btree
(case_detail_id)
TABLESPACE pg_default;
CREATE INDEX report_fct_ix2
ON report_fct USING btree
(insert_dt)
TABLESPACE pg_default;
One suspicion I have is that skewed statistics on this table are causing the degradation.
relname     inserts    updates     deletes  live_tuples  dead_tuples  last_autovacuum  last_autoanalyze
report_fct  262746347  5387849450  0        2473523      3573914      5/19/20 3:38     5/19/20 1:13
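Figures like these can be read from the pg_stat_user_tables view. Because the dead tuples outnumber the live ones here, a manual VACUUM with ANALYZE is a plausible thing to try; whether it explains the slowdown is an assumption, not something the figures prove.

```sql
-- Current tuple counters and the last automatic maintenance runs for the table.
SELECT relname, n_tup_ins, n_tup_upd, n_tup_del,
       n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'report_fct';

-- Reclaim dead tuples and refresh planner statistics in one pass.
VACUUM (VERBOSE, ANALYZE) report_fct;
```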
EXPLAIN:
"Update on report_fct rpt (cost=24847.47..27881.35 rows=415 width=3772)"
" CTE p"
" -> Unique (cost=24844.02..24847.05 rows=405 width=16)"
" -> Sort (cost=24844.02..24845.03 rows=405 width=16)"
" Sort Key: cd.case_detail_id, svl.service_level_id"
" -> Nested Loop Left Join (cost=0.41..24826.48 rows=405 width=16)"
" -> Seq Scan on report_fct cd (cost=0.00..21915.21 rows=405 width=44)"
" Filter: ((hq_area_id IS NOT NULL) AND (source_id IS NOT NULL) AND (category_id IS NOT NULL) AND (cust_state IS NOT NULL) AND ((line_of_biz)::text = 'CLOTH'::text))"
" -> Index Scan using service_level_unq on service_level svl (cost=0.41..7.18 rows=1 width=45)"
" Index Cond: ((area_id = cd.hq_area_id) AND ((order_type_cd)::text = (cd.order_type_cd)::text) AND (source_id = cd.source_id) AND (state_cd = (cd.cust_state)::bpchar) AND (category_id = cd.category_id))"
" -> Nested Loop (cost=0.41..3034.30 rows=415 width=3772)"
" -> CTE Scan on p (cost=0.00..8.10 rows=405 width=56)"
" -> Index Scan using report_fct_ix1 on report_fct rpt (cost=0.41..7.46 rows=1 width=3724)"
" Index Cond: (case_detail_id = p.case_detail_id)"

Postgres cube type distance vector index slower than seq scan

With a 128 dimension column and a distance query as below:
CREATE TABLE testes (id serial, name text, face cube);
CREATE INDEX testes_face_idx ON testes USING gist(face gist_cube_ops);
explain analyse select name from testes order by face <-> cube(array[-0.12341737002134323, 0.013954268768429756, 0.041934967041015625, -0.027295179665088654, -0.1557110995054245, -0.03121102601289749, 0.017772752791643143, -0.17166048288345337, 0.09068921208381653, -0.13417541980743408, 0.17567767202854156, -0.06697715818881989, -0.1830156147480011, -0.08423275500535965, -0.0623091384768486, 0.13855493068695068, 0.01960853487253189, -0.12219744175672531, -0.1498851776123047, -0.1448814421892166, -0.04667501151561737, 0.10095512866973877, -0.010014703497290611, 0.028698112815618515, -0.12299459427595139, -0.2449578195810318, -0.04310397803783417, -0.0786057710647583, -0.0006230985745787621, -0.012474060989916325, -0.0008601928129792213, 0.13489803671836853, -0.17316003143787384, -0.056241780519485474, 0.04442238435149193, 0.14999067783355713, -0.04893124848604202, -0.03364997357130051, 0.17365986108779907, 0.014477224089205265, -0.14650581777095795, 0.06581126153469086, 0.05907478928565979, 0.24371813237667084, 0.199946790933609, -0.07071209698915482, 0.030652550980448723, -0.06517398357391357, 0.19778677821159363, -0.3098893463611603, 0.04202471300959587, 0.06682528555393219, 0.11922725290060043, 0.04458840191364288, 0.07993366569280624, -0.09807920455932617, -0.02106720767915249, 0.17947503924369812, -0.15518437325954437, 0.11362187564373016, 0.05837336927652359, -0.11214996874332428, -0.13685055077075958, -0.10379699617624283, 0.13636618852615356, 0.1293313056230545, -0.11564487218856812, -0.10860224068164825, 0.2200884073972702, -0.16025489568710327, -0.05225272849202156, 0.10024034976959229, -0.10087429732084274, -0.1339828222990036, -0.27345386147499084, 0.1377202421426773, 0.437569797039032, 0.17741253972053528, -0.18133604526519775, -0.052022092044353485, -0.03961575776338577, 0.07023612409830093, 0.013044891878962517, 0.007585287094116211, -0.015369717963039875, -0.13501259684562683, -0.07265347242355347, 0.011824256740510464, 0.21609637141227722, 
-0.012745738960802555, -0.04935416579246521, 0.23810920119285583, 0.031168460845947266, 0.034897398203611374, -0.014598412439227104, 0.0809953436255455, -0.11255790293216705, -0.06797720491886139, -0.09544365853071213, -0.008347772061824799, 0.0790143683552742, -0.11389575153589249, 0.046258144080638885, 0.12429731339216232, -0.15094317495822906, 0.24766354262828827, -0.10882335901260376, 0.022879034280776978, 0.03814130276441574, -0.013778979890048504, -0.01565537415444851, 0.07461182028055191, 0.14960512518882751, -0.15471796691417694, 0.18988533318042755, 0.10148166120052338, 0.0060581183061003685, 0.1403576135635376, 0.06793759763240814, 0.04792795702815056, 0.00046137627214193344, -0.007764225825667381, -0.15212640166282654, -0.18374276161193848, 0.03233196958899498, -0.05509287118911743, -0.0091116763651371, 0.06819846481084824]) limit 3;
Limit (cost=0.41..5.44 rows=3 width=18) (actual time=1557.082..1697.857 rows=3 loops=1)
-> Index Scan using testes_face_idx on testes (cost=0.41..1859319.05 rows=1109532 width=18) (actual time=1557.081..1697.855 rows=3 loops=1)
Order By: (face <-> '(-0.123417370021343, 0.0139542687684298, 0.0419349670410156, -0.0272951796650887, -0.155711099505424, -0.0312110260128975, 0.0177727527916431, -0.171660482883453, 0.0906892120838165, -0.134175419807434, 0.175677672028542, -0.0669771581888199, -0.183015614748001, -0.0842327550053596, -0.0623091384768486, 0.138554930686951, 0.0196085348725319, -0.122197441756725, -0.149885177612305, -0.144881442189217, -0.0466750115156174, 0.100955128669739, -0.0100147034972906, 0.0286981128156185, -0.122994594275951, -0.244957819581032, -0.0431039780378342, -0.0786057710647583, -0.000623098574578762, -0.0124740609899163, -0.000860192812979221, 0.134898036718369, -0.173160031437874, -0.0562417805194855, 0.0444223843514919, 0.149990677833557, -0.048931248486042, -0.0336499735713005, 0.173659861087799, 0.0144772240892053, -0.146505817770958, 0.0658112615346909, 0.0590747892856598, 0.243718132376671, 0.199946790933609, -0.0707120969891548, 0.0306525509804487, -0.0651739835739136, 0.197786778211594, -0.30988934636116, 0.0420247130095959, 0.0668252855539322, 0.1192272529006, 0.0445884019136429, 0.0799336656928062, -0.0980792045593262, -0.0210672076791525, 0.179475039243698, -0.155184373259544, 0.11362187564373, 0.0583733692765236, -0.112149968743324, -0.13685055077076, -0.103796996176243, 0.136366188526154, 0.129331305623055, -0.115644872188568, -0.108602240681648, 0.22008840739727, -0.160254895687103, -0.0522527284920216, 0.100240349769592, -0.100874297320843, -0.133982822299004, -0.273453861474991, 0.137720242142677, 0.437569797039032, 0.177412539720535, -0.181336045265198, -0.0520220920443535, -0.0396157577633858, 0.0702361240983009, 0.0130448918789625, 0.00758528709411621, -0.0153697179630399, -0.135012596845627, -0.0726534724235535, 0.0118242567405105, 0.216096371412277, -0.0127457389608026, -0.0493541657924652, 0.238109201192856, 0.0311684608459473, 0.0348973982036114, -0.0145984124392271, 0.0809953436255455, -0.112557902932167, -0.0679772049188614, 
-0.0954436585307121, -0.0083477720618248, 0.0790143683552742, -0.113895751535892, 0.0462581440806389, 0.124297313392162, -0.150943174958229, 0.247663542628288, -0.108823359012604, 0.022879034280777, 0.0381413027644157, -0.0137789798900485, -0.0156553741544485, 0.0746118202805519, 0.149605125188828, -0.154717966914177, 0.189885333180428, 0.101481661200523, 0.00605811830610037, 0.140357613563538, 0.0679375976324081, 0.0479279570281506, 0.000461376272141933, -0.00776422582566738, -0.152126401662827, -0.183742761611938, 0.032331969588995, -0.0550928711891174, -0.0091116763651371, 0.0681984648108482)'::cube)
Planning time: 0.101 ms
Execution time: 1698.691 ms
Then I dropped the index and now it is faster:
Limit (cost=186715.64..186715.65 rows=3 width=18) (actual time=1362.653..1362.667 rows=3 loops=1)
-> Sort (cost=186715.64..189489.47 rows=1109532 width=18) (actual time=1362.652..1362.652 rows=3 loops=1)
Sort Key: ((face <-> '(-0.123417370021343, 0.0139542687684298, 0.0419349670410156, -0.0272951796650887, -0.155711099505424, -0.0312110260128975, 0.0177727527916431, -0.171660482883453, 0.0906892120838165, -0.134175419807434, 0.175677672028542, -0.0669771581888199, -0.183015614748001, -0.0842327550053596, -0.0623091384768486, 0.138554930686951, 0.0196085348725319, -0.122197441756725, -0.149885177612305, -0.144881442189217, -0.0466750115156174, 0.100955128669739, -0.0100147034972906, 0.0286981128156185, -0.122994594275951, -0.244957819581032, -0.0431039780378342, -0.0786057710647583, -0.000623098574578762, -0.0124740609899163, -0.000860192812979221, 0.134898036718369, -0.173160031437874, -0.0562417805194855, 0.0444223843514919, 0.149990677833557, -0.048931248486042, -0.0336499735713005, 0.173659861087799, 0.0144772240892053, -0.146505817770958, 0.0658112615346909, 0.0590747892856598, 0.243718132376671, 0.199946790933609, -0.0707120969891548, 0.0306525509804487, -0.0651739835739136, 0.197786778211594, -0.30988934636116, 0.0420247130095959, 0.0668252855539322, 0.1192272529006, 0.0445884019136429, 0.0799336656928062, -0.0980792045593262, -0.0210672076791525, 0.179475039243698, -0.155184373259544, 0.11362187564373, 0.0583733692765236, -0.112149968743324, -0.13685055077076, -0.103796996176243, 0.136366188526154, 0.129331305623055, -0.115644872188568, -0.108602240681648, 0.22008840739727, -0.160254895687103, -0.0522527284920216, 0.100240349769592, -0.100874297320843, -0.133982822299004, -0.273453861474991, 0.137720242142677, 0.437569797039032, 0.177412539720535, -0.181336045265198, -0.0520220920443535, -0.0396157577633858, 0.0702361240983009, 0.0130448918789625, 0.00758528709411621, -0.0153697179630399, -0.135012596845627, -0.0726534724235535, 0.0118242567405105, 0.216096371412277, -0.0127457389608026, -0.0493541657924652, 0.238109201192856, 0.0311684608459473, 0.0348973982036114, -0.0145984124392271, 0.0809953436255455, -0.112557902932167, -0.0679772049188614, 
-0.0954436585307121, -0.0083477720618248, 0.0790143683552742, -0.113895751535892, 0.0462581440806389, 0.124297313392162, -0.150943174958229, 0.247663542628288, -0.108823359012604, 0.022879034280777, 0.0381413027644157, -0.0137789798900485, -0.0156553741544485, 0.0746118202805519, 0.149605125188828, -0.154717966914177, 0.189885333180428, 0.101481661200523, 0.00605811830610037, 0.140357613563538, 0.0679375976324081, 0.0479279570281506, 0.000461376272141933, -0.00776422582566738, -0.152126401662827, -0.183742761611938, 0.032331969588995, -0.0550928711891174, -0.0091116763651371, 0.0681984648108482)'::cube))
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on testes (cost=0.00..172375.15 rows=1109532 width=18) (actual time=0.006..1239.698 rows=1109532 loops=1)
Planning time: 0.112 ms
Execution time: 1362.681 ms
Any ideas where to start debugging this? The time difference holds from 100k rows to 1.1M rows.
Is it possible that it relates to the "high" 128 dimensions?

plpgsql function calling trigram similarity function inside does not utilize GIN or GIST indexes

I wanted to combine PostgreSQL Levenshtein and trigram similarity functions.
The main advantage of the trigram similarity function is that it can utilize GIN or GIST indexes and thus can return fuzzy match results quickly. However, if it is called inside another function, it does not use indexes.
For sake of this problem illustration, here is a plpgsql function "trigram_similarity" that calls original trigram's "similarity" function.
CREATE OR REPLACE FUNCTION public.trigram_similarity(
left_string text,
right_string text)
RETURNS real AS
$BODY$
BEGIN
RETURN similarity(left_string, right_string);
END;$BODY$
LANGUAGE plpgsql IMMUTABLE STRICT
COST 100;
ALTER FUNCTION public.trigram_similarity(text, text)
OWNER TO postgres;
Although the function just calls the trigram similarity function, it behaves completely differently when it comes to GIN index utilization. The original similarity function inside the WHERE clause of a query does utilize GIN indexes, so the query returns its result quickly and without much RAM consumption; with trigram_similarity it does not. On large datasets the fuzzy match analysis uses up all the RAM and the application freezes.
For sake of illustration, here is an example query:
SELECT DISTINCT
trigram_similarity(l.l_composite_18, r.r_composite_18)
::numeric(5,4) AS trigram_similarity_composite_score
, (trigram_similarity(l."Name", r."Name")*(0.166666666666667)
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333)
+ trigram_similarity(l."County", r."District")*(0.0416666666666667)
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667)
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667)
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667)
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667)
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667)
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667)
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667)
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667)
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667)
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333)
)
::numeric(5,4) AS trigram_similarity_weighted_score
, l."ClanID" AS "l_ClanID_1"
, l."Name" AS "l_Name_2"
, l."LastName" AS "l_LastName_3"
, l."County" AS "l_County_4"
, l."Town" AS "l_Town_5"
, l."PostalCode" AS "l_PostalCode_6"
, l."PostOffice" AS "l_PostOffice_7"
, l."Street" AS "l_Street_8"
, l."Number" AS "l_Number_9"
, l."Telephone1" AS "l_Telephone1_10"
, l."Telephone2" AS "l_Telephone2_11"
, l."EMail" AS "l_EMail_12"
, l."BirthDate" AS "l_BirthDate_13"
, l."Gender" AS "l_Gender_14"
, l."Aktivan" AS "l_Aktivan_15"
, l."ProgramCode" AS "l_ProgramCode_16"
, l."Card" AS "l_Card_17"
, l."DateOfCreation" AS "l_DateOfCreation_18"
, l."Assigned" AS "l_Assigned_19"
, l."Reserved" AS "l_Reserved_20"
, l."Sent" AS "l_Sent_21"
, l."MemberOfBothPrograms" AS "l_MemberOfBothPrograms_22"
, r."ClanID" AS "r_ClanID_23"
, r."Name" AS "r_Name_24"
, r."Surname" AS "r_Surname_25"
, r."District" AS "r_District_26"
, r."Location" AS "r_Location_27"
, r."ZipCode" AS "r_ZipCode_28"
, r."PostOffice" AS "r_PostOffice_29"
, r."Road" AS "r_Road_30"
, r."HomeNumber" AS "r_HomeNumber_31"
, r."Phone1" AS "r_Phone1_32"
, r."Phone2" AS "r_Phone2_33"
, r."EMail" AS "r_EMail_34"
, r."DateOfBirth" AS "r_DateOfBirth_35"
, r."Sex" AS "r_Sex_36"
, r."Active" AS "r_Active_37"
, r."ProgramCode" AS "r_ProgramCode_38"
, r."CardNo" AS "r_CardNo_39"
, r."CreationDate" AS "r_CreationDate_40"
, r."Assigned" AS "r_Assigned_41"
, r."Reserved" AS "r_Reserved_42"
, r."Sent" AS "r_Sent_43"
, r."BothPrograms" AS "r_BothPrograms_44"
FROM "l_leftdatasetexample3214274191" AS l, "r_rightdatasetexample3214274191" AS r
WHERE
((l."Gender"=r."Sex") AND (l."Card"<>r."CardNo") AND (l."Name"=r."Name"))
AND
((l."ProgramCode"= '1') AND (r."ProgramCode"= '1'))
AND
(((l.l_composite_18 % r.r_composite_18)
)
OR (((l."Name" % r."Name")
OR (l."LastName" % r."Surname")
OR (l."County" % r."District")
OR (l."Town" % r."Location")
OR (l."PostalCode" % r."ZipCode")
OR (l."PostOffice" % r."PostOffice")
OR (l."Street" % r."Road")
OR (l."Number" % r."HomeNumber")
OR (l."Telephone1" % r."Phone1")
OR (l."Telephone2" % r."Phone2")
OR (l."EMail" % r."EMail")
OR (l."BirthDate" % r."DateOfBirth")
OR (l."Gender" % r."Sex"))
)
) AND ((trigram_similarity(l.l_composite_18, r.r_composite_18)
>= 0.7)
OR ((trigram_similarity(l."Name", r."Name")*(0.166666666666667)
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333)
+ trigram_similarity(l."County", r."District")*(0.0416666666666667)
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667)
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667)
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667)
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667)
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667)
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667)
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667)
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667)
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667)
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333)
)
>= 0.7)
) ORDER BY trigram_similarity_composite_score DESC;
This query exhausts the RAM and the application freezes. When "trigram_similarity" is replaced with "similarity", the query executes fast and without excessive RAM consumption.
Why do "trigram_similarity" and "similarity" behave differently?
Is there a way to force GIN or GiST index utilization for this "trigram_similarity" function, or for any other function that calls the trigram similarity function inside?
Explain analyze when "similarity" is used:
"Unique (cost=170717.94..177633.17 rows=58853 width=383) (actual time=260362.193..260362.279 rows=99 loops=1)"
" Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text) * '0 (...)"
" Buffers: shared hit=2513871 read=4158"
" -> Sort (cost=170717.94..170865.07 rows=58853 width=383) (actual time=260362.192..260362.198 rows=99 loops=1)"
" Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text (...)"
" Sort Key: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)) DESC, (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname" (...)"
" Sort Method: quicksort Memory: 76kB"
" Buffers: shared hit=2513871 read=4158"
" -> Nested Loop (cost=0.29..155793.36 rows=58853 width=383) (actual time=1851.503..260361.609 rows=99 loops=1)"
" Output: (similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4), ((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::t (...)"
" Buffers: shared hit=2513871 read=4158"
" -> Seq Scan on public.r_rightdatasetexample3214274191 r (cost=0.00..11228.86 rows=101669 width=188) (actual time=9.149..67.134 rows=50837 loops=1)"
" Output: r."ClanID", r."Name", r."Surname", r."District", r."Location", r."ZipCode", r."PostOffice", r."Road", r."HomeNumber", r."Phone1", r."Phone2", r."EMail", r."DateOfBirth", r."Sex", r."Active", r."ProgramCode", r."CardNo", r."Creat (...)"
" Filter: ((r."ProgramCode")::text = '1'::text)"
" Buffers: shared hit=5800 read=4158"
" -> Index Scan using "idxbNameA8D72F00099E4B70885B2E0BB1DFB684l_leftdatasetexample321" on public.l_leftdatasetexample3214274191 l (cost=0.29..1.35 rows=1 width=195) (actual time=5.111..5.119 rows=0 loops=50837)"
" Output: l."ClanID", l."Name", l."LastName", l."County", l."Town", l."PostalCode", l."PostOffice", l."Street", l."Number", l."Telephone1", l."Telephone2", l."EMail", l."BirthDate", l."Gender", l."Aktivan", l."ProgramCode", l."Card", l."D (...)"
" Index Cond: ((l."Name")::text = (r."Name")::text)"
" Filter: (((l."ProgramCode")::text = '1'::text) AND ((l."Card")::text <> (r."CardNo")::text) AND ((r."Sex")::text = (l."Gender")::text) AND (((l.l_composite_18)::text % (r.r_composite_18)::text) OR ((l."Name")::text % (r."Name")::text) O (...)"
" Rows Removed by Filter: 50"
" Buffers: shared hit=2508071"
"Planning time: 13.885 ms"
"Execution time: 260362.730 ms"
The indexes are created on table columns. You need to modify your plpgsql function to query a GIN- or GiST-indexed table column rather than comparing two string literals. If you compare two string literals, the extension has no index to use and must decompose both strings into their trigrams before comparing them, which is your problem.
http://www.postgresql.org/docs/9.1/static/pgtrgm.html
It is possible to create a GIN trgm_ops expression index on the compound (concatenated) expression. This index can facilitate the % similarity operator, but not the similarity function.
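A minimal sketch of such an expression index, assuming the pg_trgm extension is available. The index name, the choice of the two concatenated columns, and the search string are hypothetical; the operator class used is gin_trgm_ops.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical expression index on a concatenation of two columns.
CREATE INDEX l_name_lastname_trgm_idx
    ON "l_leftdatasetexample3214274191"
 USING gin ((coalesce("Name", '') || ' ' || coalesce("LastName", '')) gin_trgm_ops);

-- The % operator can use the index when the query repeats the indexed expression exactly.
SELECT l."ClanID"
FROM "l_leftdatasetexample3214274191" AS l
WHERE (coalesce(l."Name", '') || ' ' || coalesce(l."LastName", '')) % 'john smith';
```

The similarity() function can still be applied to the rows that pass the % filter to compute a score; only the operator is eligible for the index.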