I am trying to understand the EXPLAIN plan and optimize my query. Here is the query I am using. When I join to the pd_ontology table, I see the estimated cost increase heavily.
explain create table l1.test as
select
null as empi,
coalesce(nullif(a.pid_2_1,''),nullif(a.pid_3_1,''),nullif(a.pid_4_1,'')) as id,
coalesce(nullif(pid_3_5,''),'Patient ID') as idt,
upper(trim(pid_5_2)) as fn,
upper(trim(pid_5_3)) as mn,
upper(trim(pid_5_1)) as ln,
nullif(pid_7_1,'')::date as dob,
upper(trim(pid_8_1)) as gn,
nullif(pid_29_1,'')::date as dod,
upper(trim(pid_30_1)) as df,
upper(trim(pid_11_1)) as psa1,
upper(trim(pid_11_2)) as psa2,
upper(trim(pid_11_3)) as pci,
upper(trim(pid_11_4)) as pst,
upper(trim(pid_11_5)) as pz,
upper(trim(pid_11_6)) as pcy,
coalesce(nullif(a.pid_13_1,''),nullif(a.pid_13_2,''),nullif(a.pid_13_3,''),nullif(a.pid_14_1,''),nullif(a.pid_14_2,''),nullif(a.pid_14_3,'')) as tel1,
coalesce(nullif(a.pid_13_1,''),nullif(a.pid_13_2,''),nullif(a.pid_13_3,''),nullif(a.pid_14_1,''),nullif(a.pid_14_2,''),nullif(a.pid_14_3,'')) as cell1,
lower(trim(pid_13_4)) as eml1,
upper(trim(pid_10_1)) as race,
upper(trim(pid_10_2)) as racen,
upper(trim(pid_22_1)) as ethn,
upper(trim(pid_22_2)) as ethnm,
upper(trim(pid_24_1)) as mbi,
upper(trim(pid_16_1)) as ms,
upper(trim(pid_16_2)) as msn,
coalesce(nullif(a.pid_11_9,''),nullif(a.pid_12_1,'')) as pct,
upper(trim(pid_15_1)) as pl,
upper(trim(pid_17_1)) as rel,
upper(trim(pid_19_1)) as ssn,
trim(obx_3_1) as rc,
--trim(o.cdscs) as rn,
null as rn,
trim(obx_3_3) as rcs,
trim(obx_5_1) as rv,
obx_6_1 as uru,
obx_8_1 as oac,
obr_25_1 as rst,
rtrim(trim(replace(replace(regexp_replace(replace(obx_7_1,'x10E3','*10^3'),'[a-zA-Z%]','','g'),'^','E'),'*','x')),'/') as onrr,
trim(split_part(rtrim(trim(replace(replace(regexp_replace(replace(obx_7_1,'x10E3','*10^3'),'[a-zA-Z%]','','g'),'^','E'),'*','x')),'/'),'-',1)) as rrl,
trim(split_part(rtrim(trim(replace(replace(regexp_replace(replace(obx_7_1,'x10E3','*10^3'),'[a-zA-Z%]','','g'),'^','E'),'*','x')),'/'),'-',2)) as rrh,
obx_10_1 as natc,
orc_2_1 as "pon",
left(nullif(obx_14_1,''),8)::date as rdt,
case when to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') not between '1800-01-01' and current_date then null else to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') end as efdt,
case when to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') not between '1800-01-01' and current_date then null else to_date(nullif(obx_14_1,''),'yyyyMMddHH24miss') end as eldt,
coalesce(obr_16_1,'') as opid,
nullif(obr_16_13,'null') as opidt,
trim(orc_12_1) as opnpi,
--trim(upper(n.name)) as opn,
null as opn,
trim(nullif(obr_4_1,'null')) as oc,
trim(nullif(obr_4_3,'null')) as ocs,
trim(nullif(obr_4_2,'null')) as on,
to_date(nullif(obr_7_1,''),'yyyyMMddHH24miss') as ofdt,
trim(orc_5_1) as os,
--left(e.nte_3_1,512) as cmd,
split_part(b.filename,'/',5) as sfn,
'Clinical' as st,
now() AS ingdt,
'4' as acoid ,
'Test' as acon,
'result' as cltp,
'Test' as sstp,
'202' as sid
from l1.vipn_pal_historical_all_oru_pid a
join l1.vipn_pal_historical_all_oru_obx b
on a.control_id = b.control_id
and b.cross_join_tuple_count = '0'
left join l1.vipn_pal_historical_all_oru_obr c
on a.control_id = c.control_id
and b.order_observation_order = c.order_observation_order
and a.cross_join_tuple_count = '1'
left join l1.vipn_pal_historical_all_oru_orc d
on a.control_id = d.control_id
and d.order_observation_order = b.order_observation_order
and a.cross_join_tuple_count = '1'
left join (select control_id ,order_observation_order ,observation_order,replace(string_agg(nte_3_1 ,' '),'\.br\',chr(13)||chr(10)) as nte_3_2
from l1.vipn_pal_historical_all_oru_nte
group by control_id ,order_observation_order ,observation_order ) e
on a.control_id = e.control_id and e.observation_order = b.observation_order
and e.order_observation_order = b.order_observation_order
--and e.cross_join_tuple_count = '1'
left join (select * from l2.pd_ontology where dtp = 'result') o
on (b.obx_3_1 = o.cval or b.obx_3_1 = cvald)
left join l2.pd_npi n
on d.orc_12_1 = n.npi;
Here is the generated explain plan, where you can see that the Materialize step is taking the load.
Merge Left Join (cost=106313.03..7599360149686.98 rows=329075452 width=1641)
Merge Cond: ((a.control_id)::text = (c.control_id)::text)
Join Filter: (((a.cross_join_tuple_count)::text = '1'::text) AND ((b.order_observation_order)::text = (c.order_observation_order)::text))
-> Merge Left Join (cost=106311.69..7599175158271.60 rows=329075452 width=244)
Merge Cond: ((a.control_id)::text = (d.control_id)::text)
Join Filter: (((a.cross_join_tuple_count)::text = '1'::text) AND ((d.order_observation_order)::text = (b.order_observation_order)::text))
-> Merge Join (cost=106310.57..7599144659758.97 rows=329075452 width=236)
Merge Cond: ((a.control_id)::text = (b.control_id)::text)
-> Index Scan using vipn_pal_historical_all_oru_pid_control_id_idx on vipn_pal_historical_all_oru_pid a (cost=0.56..800918.31 rows=9353452 width=96)
-> Materialize (cost=106309.92..7599139604853.41 rows=282211264 width=161)
-> Nested Loop Left Join (cost=106309.92..7599138899325.25 rows=282211264 width=161)
Join Filter: (((b.obx_3_1)::text = (pd_ontology.cval)::text) OR ((b.obx_3_1)::text = (pd_ontology.cvald)::text))
-> Index Scan using vipn_pal_historical_all_oru_obx_control_id_idx on vipn_pal_historical_all_oru_obx b (cost=0.57..53285968.32 rows=282211264 width=161)
Filter: ((cross_join_tuple_count)::text = '0'::text)
-> **Materialize (cost=106309.35..1255207.79 rows=1538682 width=19)**
-> Bitmap Heap Scan on pd_ontology (cost=106309.35..1247514.38 rows=1538682 width=19)
Recheck Cond: ((dtp)::text = 'result'::text)
-> Bitmap Index Scan on pd_ont_idx_dtp (cost=0.00..105924.68 rows=1538682 width=0)
Index Cond: ((dtp)::text = 'result'::text)
-> Materialize (cost=1.12..14373643.76 rows=18706904 width=29)
-> Nested Loop Left Join (cost=1.12..14326876.50 rows=18706904 width=29)
-> Index Scan using vipn_pal_historical_all_oru_orc_control_id_idx on vipn_pal_historical_all_oru_orc d (cost=0.56..2587122.40 rows=18706904 width=29)
-> Index Only Scan using idx_pd_npi_npi on pd_npi n (cost=0.56..0.62 rows=1 width=11)
Index Cond: (npi = (d.orc_12_1)::text)
-> Materialize (cost=0.57..12676277.17 rows=80915472 width=60)
-> Index Scan using vipn_pal_historical_all_oru_obr_control_id_idx on vipn_pal_historical_all_oru_obr c (cost=0.57..12473988.49 rows=80915472 width=60)
Is there a way to avoid the Materialize step in this query and optimize it?
I removed the index to solve this issue. The Materialize step was re-scanning its input, and to avoid that I dropped the index. Now Materialize no longer sits on top of an index scan, so it does not need to re-scan. Saving cost!
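A less destructive way to test the same idea is to switch materialization off for the session only and compare plans; a minimal sketch using the standard planner setting, assuming you have the privileges to change it:

SET enable_material = off;  -- session-local planner toggle, not a permanent change
-- re-run EXPLAIN on the statement above to compare the plan shape
RESET enable_material;      -- restore the default afterwards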
In a rather simple table with a composite primary key (see DDL) there are about 40k records.
create table la_ezg
(
be_id integer not null,
usage text not null,
area numeric(18, 6),
sk_area numeric(18, 6),
count_ezg numeric(18, 6),
...
...
constraint la_ezg_pkey
primary key (be_id, usage)
);
There is also a simple procedure whose purpose is to delete rows with a certain be_id and persist the rows from another view, where they are "generated":
CREATE OR REPLACE function pr_create_la_ezg(pBE_ID numeric) returns void as
$$
begin
delete from la_ezg where be_id = pBE_ID;
insert into la_ezg(BE_ID, USAGE, ...)
select be_id, usage, ...
from vw_la_ezg_with_usage
where be_id = pBE_ID;
END;
$$ language plpgsql;
The procedure needs about 7 minutes to execute...
Both statements (DELETE and INSERT) execute in less than 100 ms on the very same be_id.
There are a lot of different locks showing up in pg_locks during those 7 minutes, but I wasn't able to figure out what exactly is going on inside this transaction, or whether there is some kind of deadlocking. After all, the procedure returns successfully, but it takes way too much time doing so.
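For reference, the per-statement plans inside a plpgsql function can be captured with auto_explain and nested-statement logging; a minimal session-level sketch (these are the standard auto_explain settings, and loading the module typically requires superuser):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log the plan of every statement
SET auto_explain.log_nested_statements = on;  -- include statements executed inside functions
SET auto_explain.log_analyze = on;            -- include actual timings, not just estimates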
EDIT (activated 'auto_explain' and ran all three queries again):
duration: 1.420 ms plan:
Query Text: delete from la_ezg where be_id=790696
Delete on la_ezg (cost=4.33..22.89 rows=5 width=6)
-> Bitmap Heap Scan on la_ezg (cost=4.33..22.89 rows=5 width=6)
Output: ctid
Recheck Cond: (la_ezg.be_id = 790696)
-> Bitmap Index Scan on sys_c0073325 (cost=0.00..4.33 rows=5 width=0)
Index Cond: (la_ezg.be_id = 790696)
1 row affected in 107 ms
duration: 71.645 ms plan:
Query Text: insert into la_ezg(BE_ID,USAGE,...)
select be_id,USAGE,... from vw_la_ezg_with_usage where be_id=790696
Insert on la_ezg (cost=1343.71..2678.87 rows=1 width=228)
-> Nested Loop (cost=1343.71..2678.87 rows=1 width=228)
Output: la_ezg_geo.be_id, usage.nutzungsart, COALESCE(round(((COALESCE(st_area(la_ezg_geo.geometry), '3'::double precision) / '10000'::double precision))::numeric, 2), '0'::numeric), NULL::numeric, COALESCE((count(usage.nutzungsart)), '0'::bigint), COALESCE(round((((sum(st_area(st_intersection(ezg.geometry, usage.geom)))) / '10000'::double precision))::numeric, 2), '0'::numeric), COALESCE(round(((((sum(st_area(st_intersection(ezg.geometry, usage.geom)))) * '100'::double precision) / COALESCE(st_area(la_ezg_geo.geometry), '3'::double precision)))::numeric, 2), '0'::numeric), NULL::character varying, NULL::timestamp without time zone, NULL::character varying, NULL::timestamp without time zone
-> GroupAggregate (cost=1343.71..1343.76 rows=1 width=41)
Output: ezg.be_id, usage.nutzungsart, sum(st_area(st_intersection(ezg.geometry, usage.geom))), count(usage.nutzungsart)
Group Key: ezg.be_id, usage.nutzungsart
-> Sort (cost=1343.71..1343.71 rows=1 width=1834)
Output: ezg.be_id, usage.nutzungsart, ezg.geometry, usage.geom
Sort Key: usage.nutzungsart
-> Nested Loop (cost=0.42..1343.70 rows=1 width=1834)
Output: ezg.be_id, usage.nutzungsart, ezg.geometry, usage.geom
-> Seq Scan on la_ezg_geo ezg (cost=0.00..1335.00 rows=1 width=1516)
Output: ezg.objectid, ezg.be_id, ezg.name, ezg.se_anno_cad_data, ezg.benutzer_geaendert, ezg.datum_geaendert, ezg.status, ezg.benutzer_erstellt, ezg.datum_erstellt, ezg.len, ezg.geometry, ezg.temp_char, ezg.vulgo, ezg.flaeche, ezg.hauptgemeinde, ezg.prozessart, ezg.verbauungsgrad, ezg.verordnung_txt, ezg.gemeinden_txt, ezg.hinderungsgrund, ezg.kompetenz, ezg.seehoehe_min, ezg.seehoehe_max, ezg.neigung_min, ezg.neigung_max, ezg.exposition
Filter: (ezg.be_id = 790696)
-> Index Scan using dkm_nutz_fl_geom_1551355663100174000 on dkm.dkm_nutz_fl usage (cost=0.42..8.69 rows=1 width=318)
Output: usage.gdo_gid, usage.gst, usage.nutzungsart, usage.nutzungsabschnitt, usage.statistik, usage.flaeche, usage.kennung, usage.von_datum, usage.bis_datum, usage.von_az, usage.bis_az, usage.projekt, usage.fme_basename, usage.fme_dataset, usage.fme_feature_type, usage.fme_type, usage.oracle_srid, usage.geom
Index Cond: ((usage.geom && ezg.geometry) AND (usage.geom && ezg.geometry))
Filter: _st_intersects(usage.geom, ezg.geometry)
-> Seq Scan on la_ezg_geo (cost=0.00..1335.00 rows=1 width=1516)
Output: la_ezg_geo.objectid, la_ezg_geo.be_id, la_ezg_geo.name, la_ezg_geo.se_anno_cad_data, la_ezg_geo.benutzer_geaendert, la_ezg_geo.datum_geaendert, la_ezg_geo.status, la_ezg_geo.benutzer_erstellt, la_ezg_geo.datum_erstellt, la_ezg_geo.len, la_ezg_geo.geometry, la_ezg_geo.temp_char, la_ezg_geo.vulgo, la_ezg_geo.flaeche, la_ezg_geo.hauptgemeinde, la_ezg_geo.prozessart, la_ezg_geo.verbauungsgrad, la_ezg_geo.verordnung_txt, la_ezg_geo.gemeinden_txt, la_ezg_geo.hinderungsgrund, la_ezg_geo.kompetenz, la_ezg_geo.seehoehe_min, la_ezg_geo.seehoehe_max, la_ezg_geo.neigung_min, la_ezg_geo.neigung_max, la_ezg_geo.exposition
Filter: (la_ezg_geo.be_id = 790696)
1 row affected in 149 ms
duration: 421851.819 ms plan:
Query Text: select pr_create_la_ezg(790696)
Result (cost=0.00..0.26 rows=1 width=4)
Output: pr_create_la_ezg('790696'::numeric)
1 row retrieved starting from 1 in 7 m 1 s 955 ms (execution: 7 m 1 s 929 ms, fetching: 26 ms)
P.S. I shortened some of the queries and names for the sake of readability.
P.P.S. This database is a legacy migration project. As in this case, there are often views depending on views in multiple layers. I'd like to streamline all of this, but I am in desperate need of a way to debug what's going on inside such a transaction; otherwise I would have to rebuild nearly everything, with the risk of breaking things.
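As a starting point for inspecting those locks, a second session can poll the standard catalog views for ungranted lock requests while the function runs; a minimal sketch:

SELECT a.pid, a.query, l.locktype, l.mode, l.granted
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;  -- rows here are sessions currently waiting on a lock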
There are two schemas in the same database, oatarchival and oat.
The schemas are structured identically.
Here is the query that I am running, which is taking a lot of time:
DELETE FROM oat.oat_user_tag_verification
using oatarchival.oat_user_tag_verification outv, oat.fp_archived f
WHERE outv.tag_id = f.tag_id and f.is_archived=false
and oat_user_tag_verification.user_id = outv.user_id and
oat_user_tag_verification.tag_id = outv.tag_id and
oat_user_tag_verification.verification_status = outv.verification_status
and oat_user_tag_verification.created_at=outv.created_at
and oat_user_tag_verification.updated_at=outv.updated_at
Here is the EXPLAIN VERBOSE output of this query:
"Delete on oat.oat_user_tag_verification (cost=14989031.30..16227081.67 rows=1 width=18)"
" -> Nested Loop (cost=14989031.30..16227081.67 rows=1 width=18)"
" Output: oat_user_tag_verification.ctid, outv.ctid, f.ctid"
" Join Filter: (outv.tag_id = f.tag_id)"
" -> Merge Join (cost=14989031.30..16021422.32 rows=1 width=28)"
" Output: oat_user_tag_verification.ctid, oat_user_tag_verification.tag_id, outv.ctid, outv.tag_id"
" Merge Cond: ((oat_user_tag_verification.tag_id = outv.tag_id) AND (oat_user_tag_verification.user_id = outv.user_id) AND (oat_user_tag_verification.verification_status = outv.verification_status) AND (oat_user_tag_verification.created_at = ou (...)"
" -> Sort (cost=13223314.06..13368102.38 rows=57915328 width=38)"
" Output: oat_user_tag_verification.ctid, oat_user_tag_verification.user_id, oat_user_tag_verification.tag_id, oat_user_tag_verification.verification_status, oat_user_tag_verification.created_at, oat_user_tag_verification.updated_at"
" Sort Key: oat_user_tag_verification.tag_id, oat_user_tag_verification.user_id, oat_user_tag_verification.verification_status, oat_user_tag_verification.created_at, oat_user_tag_verification.updated_at"
" -> Seq Scan on oat.oat_user_tag_verification (cost=0.00..1005001.28 rows=57915328 width=38)"
" Output: oat_user_tag_verification.ctid, oat_user_tag_verification.user_id, oat_user_tag_verification.tag_id, oat_user_tag_verification.verification_status, oat_user_tag_verification.created_at, oat_user_tag_verification.updated_at"
" -> Materialize (cost=1765717.25..1812477.56 rows=9352062 width=38)"
" Output: outv.ctid, outv.tag_id, outv.user_id, outv.verification_status, outv.created_at, outv.updated_at"
" -> Sort (cost=1765717.25..1789097.40 rows=9352062 width=38)"
" Output: outv.ctid, outv.tag_id, outv.user_id, outv.verification_status, outv.created_at, outv.updated_at"
" Sort Key: outv.tag_id, outv.user_id, outv.verification_status, outv.created_at, outv.updated_at"
" -> Seq Scan on oatarchival.oat_user_tag_verification outv (cost=0.00..171454.62 rows=9352062 width=38)"
" Output: outv.ctid, outv.tag_id, outv.user_id, outv.verification_status, outv.created_at, outv.updated_at"
" -> Seq Scan on oat.fp_archived f (cost=0.00..191863.83 rows=1103642 width=14)"
" Output: f.ctid, f.tag_id"
" Filter: (NOT f.is_archived)"
Here are the CREATE TABLE statements for all the tables involved:
Table fp_archived:
CREATE TABLE fp_archived
(
tag_id bigint NOT NULL,
detection_url text,
image_id bigint NOT NULL,
pixel_x smallint NOT NULL,
camera_num smallint NOT NULL,
pixel_y smallint NOT NULL,
width smallint NOT NULL,
height smallint NOT NULL,
is_archived boolean DEFAULT false,
id bigint NOT NULL DEFAULT nextval('fp_archived_seq'::regclass),
drive_id character varying(255),
CONSTRAINT fp_archived_pkey PRIMARY KEY (id)
)
Table oat_user_tag_verification:
CREATE TABLE oatarchival.oat_user_tag_verification
(
user_id integer NOT NULL,
tag_id bigint NOT NULL,
verification_status integer NOT NULL,
created_at timestamp without time zone NOT NULL DEFAULT now(),
updated_at timestamp without time zone DEFAULT now(),
CONSTRAINT oat_user_tag_verification_pkey PRIMARY KEY (user_id, tag_id, verification_status, created_at),
CONSTRAINT oat_user_tag_verification_tag_id_fkey FOREIGN KEY (tag_id)
REFERENCES oatarchival.oat_tags (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT oat_user_tag_verification_user_id_fkey FOREIGN KEY (user_id)
REFERENCES oatarchival.oat_users (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT oat_user_tag_verification_verification_status_fkey FOREIGN KEY (verification_status)
REFERENCES oatarchival.oat_tag_verification_status (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
The delete query runs for hours and hours. How can I optimize it?
What indexes should be created for this query to become faster?
Based on your EXPLAIN output (unfortunately you didn't run EXPLAIN (ANALYZE)) I'd suggest the following indexes:
-- ctid is a system column and cannot be indexed, so it is omitted here
CREATE INDEX ON oatarchival.oat_user_tag_verification(
tag_id,
user_id,
verification_status,
created_at,
updated_at
);
CREATE INDEX ON oat.oat_user_tag_verification(
tag_id,
user_id,
verification_status,
created_at,
updated_at
);
These can help with the merge join.
Then I'd create the following index:
CREATE INDEX ON oat.fp_archived(tag_id);
This will speed up the nested loop join.
Not sure if that is the best way to run the query, but it's a starting point.
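If the join order still comes out badly, one commonly suggested rewrite (sketched here without any claim that it beats the original plan) pushes the lookup into an EXISTS subquery so that only the target table is referenced directly by the DELETE:

DELETE FROM oat.oat_user_tag_verification v
WHERE EXISTS (
    SELECT 1
    FROM oatarchival.oat_user_tag_verification outv
    JOIN oat.fp_archived f
      ON f.tag_id = outv.tag_id AND f.is_archived = false
    WHERE v.user_id = outv.user_id
      AND v.tag_id = outv.tag_id
      AND v.verification_status = outv.verification_status
      AND v.created_at = outv.created_at
      AND v.updated_at = outv.updated_at
);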
One hint out of bad experience: try fiddling with the work_mem setting for the session. I had a similar problem with incredibly high query costs on a new PostgreSQL 9.6 and found that it simply needed a higher work_mem limit.
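For the record, raising the limit for the current session only is a one-liner; the value below is just an example and should be sized to your RAM and expected concurrency:

SET work_mem = '256MB';  -- session-local; the server default is only 4MB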
I wanted to combine PostgreSQL Levenshtein and trigram similarity functions.
The main advantage of the trigram similarity function is that it can utilize GIN or GIST indexes and thus can return fuzzy match results quickly. However, if it is called inside another function, it does not use indexes.
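For context, the index type meant here is the pg_trgm one; a minimal setup sketch, with table and column names purely illustrative:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- hypothetical table and column; a trigram GIN index that the % operator can use
CREATE INDEX persons_name_trgm_idx ON persons USING gin (name gin_trgm_ops);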
To illustrate the problem, here is a plpgsql function "trigram_similarity" that simply calls pg_trgm's "similarity" function.
CREATE OR REPLACE FUNCTION public.trigram_similarity(
left_string text,
right_string text)
RETURNS real AS
$BODY$
BEGIN
RETURN similarity(left_string, right_string);
END;$BODY$
LANGUAGE plpgsql IMMUTABLE STRICT
COST 100;
ALTER FUNCTION public.trigram_similarity(text, text)
OWNER TO postgres;
Although the function just calls pg_trgm's similarity function, it behaves completely differently when it comes to GIN index utilization. While a query whose WHERE clause uses the original similarity function does utilize GIN indexes, and thus returns results quickly and without much RAM consumption, the same query using trigram_similarity does not. For fuzzy-match analysis of large datasets, RAM is completely consumed and the application freezes...
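One workaround that is often suggested for this class of problem, sketched here on the assumption that the wrapper adds no logic of its own, is to declare the wrapper in LANGUAGE sql instead of plpgsql, so the planner can inline its body into the calling query (note that trigram index usage still comes from the % operator, not from the similarity function itself):

CREATE OR REPLACE FUNCTION public.trigram_similarity(
  left_string text,
  right_string text)
RETURNS real AS
$BODY$
  -- a single SELECT, simple enough for the planner to inline;
  -- $1 and $2 refer to left_string and right_string
  SELECT similarity($1, $2);
$BODY$
LANGUAGE sql IMMUTABLE;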
For illustration, here is an example query:
SELECT DISTINCT
trigram_similarity(l.l_composite_18, r.r_composite_18)
::numeric(5,4) AS trigram_similarity_composite_score
, (trigram_similarity(l."Name", r."Name")*(0.166666666666667)
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333)
+ trigram_similarity(l."County", r."District")*(0.0416666666666667)
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667)
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667)
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667)
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667)
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667)
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667)
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667)
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667)
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667)
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333)
)
::numeric(5,4) AS trigram_similarity_weighted_score
, l."ClanID" AS "l_ClanID_1"
, l."Name" AS "l_Name_2"
, l."LastName" AS "l_LastName_3"
, l."County" AS "l_County_4"
, l."Town" AS "l_Town_5"
, l."PostalCode" AS "l_PostalCode_6"
, l."PostOffice" AS "l_PostOffice_7"
, l."Street" AS "l_Street_8"
, l."Number" AS "l_Number_9"
, l."Telephone1" AS "l_Telephone1_10"
, l."Telephone2" AS "l_Telephone2_11"
, l."EMail" AS "l_EMail_12"
, l."BirthDate" AS "l_BirthDate_13"
, l."Gender" AS "l_Gender_14"
, l."Aktivan" AS "l_Aktivan_15"
, l."ProgramCode" AS "l_ProgramCode_16"
, l."Card" AS "l_Card_17"
, l."DateOfCreation" AS "l_DateOfCreation_18"
, l."Assigned" AS "l_Assigned_19"
, l."Reserved" AS "l_Reserved_20"
, l."Sent" AS "l_Sent_21"
, l."MemberOfBothPrograms" AS "l_MemberOfBothPrograms_22"
, r."ClanID" AS "r_ClanID_23"
, r."Name" AS "r_Name_24"
, r."Surname" AS "r_Surname_25"
, r."District" AS "r_District_26"
, r."Location" AS "r_Location_27"
, r."ZipCode" AS "r_ZipCode_28"
, r."PostOffice" AS "r_PostOffice_29"
, r."Road" AS "r_Road_30"
, r."HomeNumber" AS "r_HomeNumber_31"
, r."Phone1" AS "r_Phone1_32"
, r."Phone2" AS "r_Phone2_33"
, r."EMail" AS "r_EMail_34"
, r."DateOfBirth" AS "r_DateOfBirth_35"
, r."Sex" AS "r_Sex_36"
, r."Active" AS "r_Active_37"
, r."ProgramCode" AS "r_ProgramCode_38"
, r."CardNo" AS "r_CardNo_39"
, r."CreationDate" AS "r_CreationDate_40"
, r."Assigned" AS "r_Assigned_41"
, r."Reserved" AS "r_Reserved_42"
, r."Sent" AS "r_Sent_43"
, r."BothPrograms" AS "r_BothPrograms_44"
FROM "l_leftdatasetexample3214274191" AS l, "r_rightdatasetexample3214274191" AS r
WHERE
((l."Gender"=r."Sex") AND (l."Card"<>r."CardNo") AND (l."Name"=r."Name"))
AND
((l."ProgramCode"= '1') AND (r."ProgramCode"= '1'))
AND
(((l.l_composite_18 % r.r_composite_18)
)
OR (((l."Name" % r."Name")
OR (l."LastName" % r."Surname")
OR (l."County" % r."District")
OR (l."Town" % r."Location")
OR (l."PostalCode" % r."ZipCode")
OR (l."PostOffice" % r."PostOffice")
OR (l."Street" % r."Road")
OR (l."Number" % r."HomeNumber")
OR (l."Telephone1" % r."Phone1")
OR (l."Telephone2" % r."Phone2")
OR (l."EMail" % r."EMail")
OR (l."BirthDate" % r."DateOfBirth")
OR (l."Gender" % r."Sex"))
)
) AND ((trigram_similarity(l.l_composite_18, r.r_composite_18)
>= 0.7)
OR ((trigram_similarity(l."Name", r."Name")*(0.166666666666667)
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333)
+ trigram_similarity(l."County", r."District")*(0.0416666666666667)
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667)
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667)
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667)
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667)
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667)
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667)
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667)
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667)
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667)
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333)
)
>= 0.7)
) ORDER BY trigram_similarity_composite_score DESC;
This query exhausts RAM and the application freezes. When trigram_similarity is replaced with similarity, the query executes quickly and without RAM overconsumption.
Why do trigram_similarity and similarity behave differently?
Is there a way to force GIN or GiST index utilization for this trigram_similarity function, or for any other function that calls pg_trgm's similarity function internally?
EXPLAIN ANALYZE output when similarity is used:
"Unique (cost=170717.94..177633.17 rows=58853 width=383) (actual time=260362.193..260362.279 rows=99 loops=1)"
" Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text) * '0 (...)"
" Buffers: shared hit=2513871 read=4158"
" -> Sort (cost=170717.94..170865.07 rows=58853 width=383) (actual time=260362.192..260362.198 rows=99 loops=1)"
" Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text (...)"
" Sort Key: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)) DESC, (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname" (...)"
" Sort Method: quicksort Memory: 76kB"
" Buffers: shared hit=2513871 read=4158"
" -> Nested Loop (cost=0.29..155793.36 rows=58853 width=383) (actual time=1851.503..260361.609 rows=99 loops=1)"
" Output: (similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4), ((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::t (...)"
" Buffers: shared hit=2513871 read=4158"
" -> Seq Scan on public.r_rightdatasetexample3214274191 r (cost=0.00..11228.86 rows=101669 width=188) (actual time=9.149..67.134 rows=50837 loops=1)"
" Output: r."ClanID", r."Name", r."Surname", r."District", r."Location", r."ZipCode", r."PostOffice", r."Road", r."HomeNumber", r."Phone1", r."Phone2", r."EMail", r."DateOfBirth", r."Sex", r."Active", r."ProgramCode", r."CardNo", r."Creat (...)"
" Filter: ((r."ProgramCode")::text = '1'::text)"
" Buffers: shared hit=5800 read=4158"
" -> Index Scan using "idxbNameA8D72F00099E4B70885B2E0BB1DFB684l_leftdatasetexample321" on public.l_leftdatasetexample3214274191 l (cost=0.29..1.35 rows=1 width=195) (actual time=5.111..5.119 rows=0 loops=50837)"
" Output: l."ClanID", l."Name", l."LastName", l."County", l."Town", l."PostalCode", l."PostOffice", l."Street", l."Number", l."Telephone1", l."Telephone2", l."EMail", l."BirthDate", l."Gender", l."Aktivan", l."ProgramCode", l."Card", l."D (...)"
" Index Cond: ((l."Name")::text = (r."Name")::text)"
" Filter: (((l."ProgramCode")::text = '1'::text) AND ((l."Card")::text <> (r."CardNo")::text) AND ((r."Sex")::text = (l."Gender")::text) AND (((l.l_composite_18)::text % (r.r_composite_18)::text) OR ((l."Name")::text % (r."Name")::text) O (...)"
" Rows Removed by Filter: 50"
" Buffers: shared hit=2508071"
"Planning time: 13.885 ms"
"Execution time: 260362.730 ms"
The indexes are created on table columns. You need to modify your plpgsql function to query a GIN- or GiST-indexed table column rather than comparing two string values. If you compare two string values, the planner has no index to hit and must decompose both strings into their trigrams before comparing them, which is your problem.
http://www.postgresql.org/docs/9.1/static/pgtrgm.html
It is possible to create a GIN gin_trgm_ops expression index on the compound (concatenated) expression. Such an index can facilitate the % similarity operator, but not the similarity function.
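A sketch of what that can look like, reusing column names from the question (the concatenation is only an example of a compound expression):

CREATE INDEX l_composite_trgm_idx
ON "l_leftdatasetexample3214274191"
USING gin (("Name" || ' ' || "LastName") gin_trgm_ops);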
I'm working with a table that has about 19 million rows and 60 columns (bigtable). Of the 19 million records, about 17 million have x and y coordinates (about 1.8 million distinct combinations of x and y). I needed to add some additional geocoded information to the table from another file (census_geocode). I've created a lookup table (distinct_xy) that has a list of all the distinct x and y coordinate pairs and an ID. I have indexes on bigtable (x_coord, y_coord), census_geocode (x_coord, y_coord), and distinct_xy (x_coord, y_coord), and a primary key in distinct_xy (xy_id) and census_geocode (xy_id). So here's the query:
Update bigtable
set block_grp = cg.blkgrp,
block = cg.block,
tract = cg.tractce10
from census_geocode cg, distinct_xy xy
where bigtable.x_coord = xy.x_coord and
bigtable.y_coord=xy.y_coord and cg.xy_id=xy.xy_id;
This is running extremely slowly, as in:
"Update on bigtable (cost=17675751.51..17827040.74 rows=22 width=327)"
" -> Nested Loop (cost=17675751.51..17827040.74 rows=22 width=327)"
" -> Merge Join (cost=17675751.51..17826856.26 rows=22 width=312)"
" Merge Cond: ((bigtable.x_coord = xy.x_coord) AND (bigtable.y_coord = xy.y_coord))"
" -> Sort (cost=17318145.58..17366400.81 rows=19302092 width=302)"
" Sort Key: bigtable.x_coord, bigtable.y_coord"
" -> Seq Scan on bigtable (cost=0.00..1457709.92 rows=19302092 width=302)"
" -> Materialize (cost=357588.42..366887.02 rows=1859720 width=26)"
" -> Sort (cost=357588.42..362237.72 rows=1859720 width=26)"
" Sort Key: xy.x_coord, xy.y_coord"
" -> Seq Scan on distinct_xy xy (cost=0.00..30443.20 rows=1859720 width=26)"
" -> Index Scan using census_geocode_pkey on census_geocode cg (cost=0.00..8.37 rows=1 width=23)"
" Index Cond: (xy_id = xy.xy_id)"
I've also tried splitting this apart and inserting the lookup key back into the big table to avoid the multi-table join.
Update bigtable
set xy_id = xy.xy_id
from distinct_xy xy
where bigtable.x_coord = xy.x_coord and bigtable.y_coord=xy.y_coord;
This also runs for hours without completing.
"Update on bigtable (cost=0.00..20577101.71 rows=22 width=404)"
" -> Nested Loop (cost=0.00..20577101.71 rows=22 width=404)"
" -> Seq Scan on distinct_xy xy (cost=0.00..30443.20 rows=1859720 width=26)"
" -> Index Scan using rae_xy_idx on bigtable (cost=0.00..11.03 rows=1 width=394)"
" Index Cond: ((x_coord = xy.x_coord) AND (y_coord = xy.y_coord))"
Can someone please help me improve this query's performance?
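For what it's worth, the rows=22 estimate against tables this large suggests the planner statistics may be stale; a cheap first step before restructuring anything is to refresh them and re-check the plan:

ANALYZE bigtable;
ANALYZE distinct_xy;
ANALYZE census_geocode;
-- then re-run EXPLAIN on the UPDATE to see whether the estimates and join strategy change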