I have the following database schema (oversimplified):
create sequence partners_partner_id_seq;
create table partners
(
partner_id integer default nextval('partners_partner_id_seq'::regclass) not null primary key,
name varchar(255) default NULL::character varying,
company_id varchar(20) default NULL::character varying,
vat_id varchar(50) default NULL::character varying,
is_deleted boolean default false not null
);
INSERT INTO partners(name, company_id, vat_id) VALUES('test1','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test2','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test3','3214567890102', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test4','9999999999999', 'GE9999999999999');
I am trying to figure out how to return test1 and test2 (because they share the same company_id value) as well as test3 (because it shares its vat_id value with them).
In other words, I need to find records with duplicate company_id or vat_id values and group them together, so that test1, test2 and test3 end up in one group because they are linked through company_id and vat_id.
So far I have the following query:
SELECT *
FROM (
SELECT *, LEAD(row, 1) OVER () AS nextrow
FROM (
SELECT *, ROW_NUMBER() OVER (w) AS row
FROM partners
WHERE is_deleted = false
AND ((company_id != '' AND company_id IS NOT null) OR (vat_id != '' AND vat_id IS NOT NULL))
WINDOW w AS (PARTITION BY company_id, vat_id ORDER BY partner_id DESC)
) x
) y
WHERE (row > 1 OR nextrow > 1)
AND is_deleted = false
This successfully shows all company_id duplicates but does not show the vat_id ones: the test3 row is missing. Can this be done in a single query?
Here is a db-fiddle with the schema, data and predefined query reproducing my result.
You can do this with recursion, but depending on the size of your data you may want to iterate, instead.
The trick is to make the name just another match key instead of treating it differently than the company_id and vat_id:
create table partners (
partner_id integer generated always as identity primary key,
name text,
company_id text,
vat_id text,
is_deleted boolean not null default false
);
insert into partners (name, company_id, vat_id) values
('test1','1010109191191', 'BG1010109191192'),
('test2','1010109191191', 'BG1010109191192'),
('test3','3214567890102', 'BG1010109191192'),
('test4','9999999999999', 'GE9999999999999'),
('test5','3214567890102', 'BG8888888888888'),
('test6','2983489023408', 'BG8888888888888')
;
I added a couple of test cases and left in the lone partner.
with recursive keys as (
select partner_id,
array['n_'||name, 'c_'||company_id, 'v_'||vat_id] as matcher,
array[partner_id] as matchlist,
1 as size
from partners
), matchers as (
select *
from keys
union all
select p.partner_id, c.matcher,
p.matchlist||c.partner_id as matchlist,
p.size + 1
from matchers p
join keys c
on c.matcher && p.matcher
and not p.matchlist @> array[c.partner_id]
), largest as (
select distinct sort(matchlist) as matchlist -- sort() requires the intarray extension
from matchers m
where not exists (select 1
from matchers
where matchlist @> m.matchlist
and size > m.size)
-- and size > 1
)
select *
from largest
;
matchlist
{1,2,3,5,6}
{4}
fiddle
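For readers who want to check the grouping logic outside the database, here is a minimal Python sketch of the same idea: treat name, company_id and vat_id as prefixed match keys and merge partners that share any key. It uses a union-find structure rather than the recursive CTE above, so it is a different technique that should give the same groups. The function name group_partners is only illustrative.

```python
from collections import defaultdict

# Sample rows from the answer: (partner_id, name, company_id, vat_id).
PARTNERS = [
    (1, "test1", "1010109191191", "BG1010109191192"),
    (2, "test2", "1010109191191", "BG1010109191192"),
    (3, "test3", "3214567890102", "BG1010109191192"),
    (4, "test4", "9999999999999", "GE9999999999999"),
    (5, "test5", "3214567890102", "BG8888888888888"),
    (6, "test6", "2983489023408", "BG8888888888888"),
]


def group_partners(rows):
    """Return sorted lists of partner_ids that are linked by any shared key."""
    parent = {pid: pid for pid, *_ in rows}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # prefixed key -> first partner_id that carried it
    for pid, name, company_id, vat_id in rows:
        for prefix, value in (("n_", name), ("c_", company_id), ("v_", vat_id)):
            if not value:  # skip NULL or empty keys
                continue
            key = prefix + value
            if key in first_seen:
                union(pid, first_seen[key])
            else:
                first_seen[key] = pid

    groups = defaultdict(list)
    for pid in parent:
        groups[find(pid)].append(pid)
    return sorted(sorted(members) for members in groups.values())


print(group_partners(PARTNERS))  # [[1, 2, 3, 5, 6], [4]]
```

The two lists match the matchlist rows shown above.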
EDIT UPDATE
Since the recursive query did not perform well, here is an iterative example in plpgsql that uses a temporary table:
create temporary table match1 (
partner_id int not null,
group_id int not null,
matchkey uuid not null
);
create index on match1 (matchkey);
create index on match1 (group_id);
insert into match1
select partner_id, partner_id, md5('n_'||name)::uuid from partners
union all
select partner_id, partner_id, md5('c_'||company_id)::uuid from partners
union all
select partner_id, partner_id, md5('v_'||vat_id)::uuid from partners;
do $$
declare _cnt bigint;
begin
loop
with consolidate as (
select group_id,
min(group_id) over (partition by matchkey) as new_group_id
from match1
), minimize as (
select group_id, min(new_group_id) as new_group_id
from consolidate
group by group_id
), doupdate as (
update match1
set group_id = m.new_group_id
from minimize m
where m.group_id = match1.group_id
and m.new_group_id != match1.group_id
returning *
)
select count(*) into _cnt from doupdate;
if _cnt = 0 then
exit;
end if;
end loop;
end;
$$;
updated fiddle
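To make the loop easier to follow, here is a rough Python model of what each pass does, assuming the same three-rows-per-partner layout as match1. It skips the md5/uuid hashing and works on the prefixed strings directly; the names build_rows and propagate_groups are made up for this sketch.

```python
# Sample rows from the answer: (partner_id, name, company_id, vat_id).
PARTNERS = [
    (1, "test1", "1010109191191", "BG1010109191192"),
    (2, "test2", "1010109191191", "BG1010109191192"),
    (3, "test3", "3214567890102", "BG1010109191192"),
    (4, "test4", "9999999999999", "GE9999999999999"),
    (5, "test5", "3214567890102", "BG8888888888888"),
    (6, "test6", "2983489023408", "BG8888888888888"),
]


def build_rows(partners):
    """One (partner_id, matchkey) row per key, like the three-way UNION ALL."""
    rows = []
    for pid, name, company_id, vat_id in partners:
        rows += [(pid, "n_" + name), (pid, "c_" + company_id), (pid, "v_" + vat_id)]
    return rows


def propagate_groups(rows):
    """Repeat the consolidate/minimize/update pass until nothing changes."""
    # Like match1: every row starts in a group named after its own partner_id.
    table = [[pid, pid, key] for pid, key in rows]  # [partner_id, group_id, matchkey]
    while True:
        # consolidate: smallest group_id seen for each matchkey
        min_by_key = {}
        for _, gid, key in table:
            min_by_key[key] = min(gid, min_by_key.get(key, gid))
        # minimize: smallest candidate for each current group_id
        new_group = {}
        for _, gid, key in table:
            cand = min_by_key[key]
            new_group[gid] = min(cand, new_group.get(gid, cand))
        # doupdate: rewrite rows whose group shrinks and count the changes
        changed = 0
        for row in table:
            target = new_group[row[1]]
            if target != row[1]:
                row[1] = target
                changed += 1
        if changed == 0:
            break
    return {pid: gid for pid, gid, _ in table}


print(propagate_groups(build_rows(PARTNERS)))
```

With the sample data the loop settles after a few passes, with partners 1, 2, 3, 5 and 6 in group 1 and partner 4 alone in group 4.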
I have the following code in SQL Server and I am stuck on converting this pivot. Please help.
SELECT @QStr = COALESCE(@QStr,'')+',['+ColtoRow+']' FROM(
SELECT DISTINCT PeriodEndDatDisp ColtoRow
from #Esti
)A GROUP BY ColtoRow order by ColtoRow
SELECT @QStr=STUFF(@QStr,1,1,'')
SELECT @Query = '
SELECT GroupID,GroupName,Mes_Type,ParentProductId,ParentProductName,Meas_Nam,fn_ProperCase(SegmentType) SegmentType,SegMeasType,
CurrencyId,ValType,IsPerSha,Mes_Order,ShowSegData,' +@QStr+ '
FROM (
SELECT GroupID,GroupName,Mes_Type,ParentProductId,ParentProductName,Meas_Nam,SegmentType,SegMeasType,EstValue,CurrencyId,ValType,Mes_Order,
ShowSegData,IsPerSha,
PeriodEndDatDisp ColtoRow
FROM #Esti
) AS Src_Table PIVOT
(
MAX(EstValue) FOR ColtoRow IN (' +@QStr+' )
) AS PivotTable ORDER BY GroupID,Mes_Order,Meas_Nam,SegmentType,SegMeasType,ValType;'
EXECUTE (@Query);
The equivalent code in PostgreSQL is:
SELECT string_agg(DISTINCT PeriodEndDateDisplay,',') ColtoRow
from t$Estimate
--o/p FY-2015,FY-2016,Q1-2015,Q1-2016,Q2-2015,Q2-2016,Q3-2015,Q3-2016,Q4-2015,Q4-2016
SELECT *
FROM crosstab(
'SELECT
GroupID,GroupName,MeasureType,ParentProductId,ParentProductName,
MeasureName,est.fn_ProperCase(SegmentType) SegmentType,SegmentMeasureType, EstValue
,CurrencyId,ValType,MeasureOrder,IsPerShare,ShowSegmentData
FROM est.t$estimate'
, $$ SELECT unnest('{Q2-2016,Q4-2015,Q1-2015,Q3-2016,Q4-2016,FY-2016,Q1-2016,Q3-2015,Q2-2015,FY-2015}'::text[])$$
) AS ct ( groupid integer, groupname character varying(500) , measuretype character varying(100) , parentproductid character varying(100) ,
parentproductname character varying(200), measurename character varying(200) , segmenttype character varying ,
segmentmeasuretype character varying(20), EstValue text, currencyid character varying(3), valtype integer,
measureorder integer, ispershare boolean,FY2016 text,
Q1-2015 text,Q1-2016 text,Q2-2015 text,Q2-2016 text,Q3-2015 text,Q3-2016 text,Q4-2015 text,Q4-2016 text);
Problems encountered:
Only 1 row is populated and the quarter values are all null. In MSSQL the result has 25 rows.
The hyphen (-) in the quarter period values (e.g. Q4-2015) is causing an issue.
We have a database that consists of 96 million mortgage loans. In this database we have the original house price at loan origination. We want to update these house prices with a very simple house price index that we extracted from the internet as a CSV file, which I imported into a table in the same database as the mortgage loans. I am already able to join the tables, but it is very slow. I think I'm not using the index correctly. This is what the tables look like:
mortgage loans:
CREATE TABLE mydb.mortgageloans
(
pkrmbloan bigint NOT NULL,
fkdeal bigint NOT NULL,
edcode character varying(50) NOT NULL,
poolcutoffdate character varying(50) NOT NULL,
recno integer NOT NULL,
submissiontimestamp timestamp without time zone NOT NULL,
col1 character varying(10),
col2 character varying(100),
country character varying(10),
col......
col199 character varying(25),
CONSTRAINT rmb_loan_pkey PRIMARY KEY (pkrmbloan),
CONSTRAINT fk_rmbloan2deal FOREIGN KEY (fkdeal)
REFERENCES mydb_data.deal (pkdeal) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE mydb.mortgageloans
OWNER TO mydb_admin;
GRANT ALL ON TABLE mydb.mortgageloans TO mydb_admin;
GRANT SELECT ON TABLE mydb.mortgageloans TO mydb_addon;
CREATE INDEX idx_rmbloan_edcode_poolcod
ON mydb.mortgageloans
USING btree
(edcode COLLATE pg_catalog."default", poolcutoffdate COLLATE pg_catalog."default");
CREATE INDEX idx_rmbloan_fkdeal
ON mydb.mortgageloans
USING btree
(fkdeal);
CREATE INDEX idx_rmbloan_recno
ON mydb.mortgageloans
USING btree
(recno);
The house price index table, which I imported myself:
CREATE TABLE mydb.hpi
(
period character varying(100),
au character varying(100),
be character varying(100),
ca character varying(100),
ch character varying(100),
de character varying(100),
dk character varying(100),
es character varying(100),
fi character varying(100),
fr character varying(100),
uk character varying(100),
ie character varying(100),
it character varying(100),
jp character varying(100),
nl character varying(100),
no character varying(100),
nz character varying(100),
us character varying(100),
pt character varying(100)
)
WITH (
OIDS=FALSE
);
ALTER TABLE mydb.hpi
OWNER TO mydb_admin;
And here is the query that adds the original house price index based on the loan origination date (col55):
ALTER TABLE mydb.mortgageloans ADD COLUMN OriginalHPI varchar(130);
UPDATE mydb.mortgageloans set OriginalHPI = test.rv
FROM
(
select
CASE
WHEN a.country = 'NL'::text THEN c.nl::numeric
WHEN a.country = 'BE'::text THEN c.be::numeric
WHEN a.country = 'ES'::text THEN c.es::numeric
WHEN a.country = 'FR'::text THEN c.fr::numeric
WHEN a.country = 'IT'::text THEN c.IT::numeric
WHEN a.country = 'DE'::text THEN c.de::numeric
WHEN a.country = 'IE'::text THEN c.ie::numeric
else NULL::numeric
END AS rv,
a.pkrmbloan
FROM mydb.mortgageloans a
LEFT JOIN mydb_data.hpi c on a.col55 = c.Period
)
as test
where test.pkrmbloan = mydb.mortgageloans.pkrmbloan
Any help would be much appreciated!
Best regards,
Tim
Edit: added the EXPLAIN output.
It uses slightly different database names because I wanted to anonymize them first.
Actual query:
EXPLAIN
UPDATE edp_data.rmb_loan set OriginalHPI = test.rv
FROM
(
select
CASE
WHEN "substring"(a.edcode::text, 5, 2)::text = 'NL'::text THEN c.nl::numeric
WHEN "substring"(a.edcode::text, 5, 2)::text = 'BE'::text THEN c.be::numeric
WHEN "substring"(a.edcode::text, 5, 2)::text = 'ES'::text THEN c.es::numeric
WHEN "substring"(a.edcode::text, 5, 2)::text = 'FR'::text THEN c.fr::numeric
WHEN "substring"(a.edcode::text, 5, 2)::text = 'IT'::text THEN c.IT::numeric
WHEN "substring"(a.edcode::text, 5, 2)::text = 'DE'::text THEN c.de::numeric
WHEN "substring"(a.edcode::text, 5, 2)::text = 'IE'::text THEN c.ie::numeric
else 12::numeric
END AS rv,
a.pkrmbloan, a.fkdeal
FROM edp_data.rmb_loan a
LEFT JOIN edp_data.hpi c on a.ar55 = c.period
)
as test
where test.pkrmbloan = edp_data.rmb_loan.pkrmbloan and test.fkdeal = edp_data.rmb_loan.fkdeal;
Output
"Update on rmb_loan (cost=22.11..60667621.09 rows=342266 width=4090)"
" -> Hash Left Join (cost=22.11..60667621.09 rows=342266 width=4090)"
" Hash Cond: ((a.ar55)::text = (c.period)::text)"
" -> Merge Join (cost=0.00..60635941.00 rows=341941 width=4049)"
" Merge Cond: (rmb_loan.pkrmbloan = a.pkrmbloan)"
" Join Filter: (rmb_loan.fkdeal = a.fkdeal)"
" -> Index Scan using rmb_loan_pkey on rmb_loan (cost=0.00..28746023.33 rows=179651105 width=4014)"
" -> Index Scan using rmb_loan_pkey on rmb_loan a (cost=0.00..28746023.33 rows=179651105 width=51)"
" -> Hash (cost=15.38..15.38 rows=538 width=56)"
" -> Seq Scan on hpi c (cost=0.00..15.38 rows=538 width=56)"
I think your confusing FROM clause stems from wanting the column default to be 12. To avoid that, just declare the default when adding the column:
alter table mydb.mortgageloans
add column OriginalHPI varchar(130) default '12';
update edp_data.rmb_loan a
set OriginalHPI = (
case substring(a.edcode::text, 5, 2)
when 'NL' then c.nl
when 'BE' then c.be
when 'ES' then c.es
when 'FR' then c.fr
when 'IT' then c.IT
when 'DE' then c.de
when 'IE' then c.ie
else '12'
end)::numeric
from edp_data.hpi c
where a.ar55 = c.period
Why do you cast the case result to numeric just to save it in a varchar column?
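To spell out what the simplified update computes per loan, here is a small Python sketch of the lookup. The period format, the edcode value and the index numbers are invented for illustration; only the country list, the substring position and the default of 12 come from the queries above.

```python
# Country codes handled by the CASE expression in the update.
COUNTRIES = ("nl", "be", "es", "fr", "it", "de", "ie")


def original_hpi(edcode, period, hpi_by_period, default="12"):
    """Pick the index value for one loan, or `default` when nothing matches."""
    row = hpi_by_period.get(period)
    if row is None:
        # No hpi row joins on this period, so the column default stays.
        return default
    # substring(edcode, 5, 2) is 1-based in SQL; the Python slice is 0-based.
    country = edcode[4:6].lower()
    if country not in COUNTRIES:
        return default  # the ELSE branch of the CASE
    return row.get(country)


# Hypothetical index data keyed by period (values are made up).
hpi = {"2010-Q1": {"nl": "101.5", "be": "98.2"}}

print(original_hpi("RMBSNL000123", "2010-Q1", hpi))  # 101.5
print(original_hpi("RMBSUK000001", "2010-Q1", hpi))  # 12
print(original_hpi("RMBSNL000123", "1999-Q4", hpi))  # 12
```

The values stay strings here, which mirrors storing them in a varchar column without a detour through numeric.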
We have a small data warehouse in a PostgreSQL database, and I have to document all the tables.
I thought I could add a comment to every column and table and use a pipe ("|") separator to add more attributes. Then I could use the information schema and array functions to get the documentation and use any reporting software to create the desired output.
select
ordinal_position,
column_name,
data_type,
character_maximum_length,
numeric_precision,
numeric_scale,
is_nullable,
column_default,
(string_to_array(descr.description,'|'))[1] as cs_name,
(string_to_array(descr.description,'|'))[2] as cs_description,
(string_to_array(descr.description,'|'))[3] as en_name,
(string_to_array(descr.description,'|'))[4] as en_description,
(string_to_array(descr.description,'|'))[5] as other
from
information_schema.columns columns
join pg_catalog.pg_class klass on (columns.table_name = klass.relname and klass.relkind = 'r')
left join pg_catalog.pg_description descr on (descr.objoid = klass.oid and descr.objsubid = columns.ordinal_position)
where
columns.table_schema = 'data_warehouse'
order by
columns.ordinal_position;
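As a rough illustration of the pipe-separated layout, here is a minimal Python sketch that splits one comment into the same five attributes the query extracts. The sample comment text is invented, and parse_comment is just an illustrative name.

```python
# Attribute names in the order the query reads them from the comment.
FIELDS = ("cs_name", "cs_description", "en_name", "en_description", "other")


def parse_comment(description):
    """Split a pipe-separated comment into named attributes."""
    parts = description.split("|") if description else []
    # Like indexing past the end of a PostgreSQL array, missing fields become None.
    padded = parts + [None] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, padded))


doc = parse_comment("Zakaznik|Popis zakaznika|Customer|Customer description")
print(doc["en_name"])  # Customer
print(doc["other"])    # None

# A literal pipe inside a description shifts every later field by one.
broken = parse_comment("Name|a|b|Customer|Text")
print(broken["en_name"])  # b
```

The last call shows the main weakness of the scheme: the separator cannot appear inside any attribute without extra escaping rules.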
Is this a good idea, or is there a better approach?
Unless you must include descriptions of the system tables, I wouldn't try to shoehorn your descriptions into pg_catalog.pg_description. Make your own table. That way you get to keep the columns as columns, and not have to use clunky string functions.
Alternatively, consider adding specially formatted comments to your master schema file, along the lines of javadoc. Then write a tool to extract those comments and create a document. That way the comments stay close to the thing they're commenting, and you don't have to mess with the database at all to produce the report. For example:
--* Used for authentication.
create table users
(
--* standard Rails-friendly primary key. Also an example of
--* a long comment placed before the item, rather than on
--* the same line.
id serial primary key,
name text not null, --* Real name (hopefully)
login text not null, --* Name used for authentication
...
);
Your documentation tool reads the file, looks for the --* comments, figures out what comments go with what things, and produces some kind of report, e.g.:
table users: Used for authentication
id: standard Rails-friendly primary key. Also an example of a
long comment placed before the item, rather than on the same
line.
name: Real name
login: Name used for authentication
You might note that with appropriate comments, the master schema file itself is a pretty good report in its own right, and that perhaps nothing else is needed.
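A documentation tool along those lines could be sketched as follows. This is a minimal, line-based Python version under simplifying assumptions: one item per line, comments either directly above the item or trailing on its line, and no table-level constraint lines (which this sketch would mistake for columns). The function names are made up for the example.

```python
import re

SCHEMA = """\
--* Used for authentication.
create table users
(
  --* standard Rails-friendly primary key. Also an example of
  --* a long comment placed before the item, rather than on
  --* the same line.
  id serial primary key,
  name text not null, --* Real name (hopefully)
  login text not null --* Name used for authentication
);
"""

MARKER = "--*"
CREATE_TABLE = re.compile(r"create\s+table\s+(\w+)", re.IGNORECASE)
COLUMN = re.compile(r"(\w+)\s+\w+")  # "name type ..." at the start of a line


def extract_docs(schema_text):
    """Return [(table_name, table_comment, [(column_name, comment), ...])]."""
    tables = []
    pending = []  # comment lines waiting for the item they describe
    for raw in schema_text.splitlines():
        code, marker, comment = raw.partition(MARKER)
        code, comment = code.strip(), comment.strip()
        if marker and not code:
            pending.append(comment)  # a comment on a line of its own
            continue
        if marker:
            pending.append(comment)  # trailing comment on the item's line
        table = CREATE_TABLE.match(code)
        column = COLUMN.match(code)
        if table:
            tables.append((table.group(1), " ".join(pending), []))
        elif column and tables:
            tables[-1][2].append((column.group(1), " ".join(pending)))
        pending = []  # anything else (blank line, parenthesis) drops the comment
    return tables


def render(tables):
    """Format the extracted comments as a plain-text report."""
    lines = []
    for name, comment, columns in tables:
        lines.append(f"table {name}: {comment}")
        for col, text in columns:
            lines.append(f"  {col}: {text}")
    return "\n".join(lines)


print(render(extract_docs(SCHEMA)))
```

A real tool would want a proper SQL tokenizer, but even this much is enough to produce a report like the one shown above.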
If anyone is interested, here is what I used for the initial load of my small documentation project. The documentation lives in two tables: one describing tables and one describing columns and constraints. I appreciate any feedback.
/* -- Initial Load - Tables */
drop table if exists dw_description_table cascade;
create table dw_description_table (
table_description_key serial primary key,
physical_full_name character varying,
physical_schema_name character varying,
physical_table_name character varying,
Table_Type character varying, -- Fact Dimension ETL Transformation
Logical_Name_CS character varying,
Description_CS character varying,
Logical_Name_EN character varying,
Description_EN character varying,
ToDo character varying,
Table_Load_Type character varying, --Manually TruncateLoad AddNewRows
Known_Exclusions character varying,
Table_Clover_Script character varying
);
insert into dw_description_table (physical_full_name, physical_schema_name, physical_table_name) (
select
table_schema || '.' || table_name as physical_full_name,
table_schema,
table_name
from
information_schema.tables
where
table_name like 'dw%' or table_name like 'etl%'
);
/* -- Initial Load - Columns */
CREATE TABLE dw_description_column (
column_description_key serial,
table_description_key bigint,
physical_full_name text,
physical_schema_name character varying,
physical_table_name character varying,
physical_column_name character varying,
ordinal_position character varying,
column_default character varying,
is_nullable character varying,
data_type character varying,
logical_name_cs character varying,
description_cs character varying,
logical_name_en character varying,
description_en character varying,
derived_rule character varying,
todo character varying,
pk_name character varying,
fk_name character varying,
foreign_table_name character varying,
foreign_column_name character varying,
is_primary_key boolean,
is_foreign_key boolean,
CONSTRAINT dw_description_column_pkey PRIMARY KEY (column_description_key ),
CONSTRAINT fk_dw_description_table_key FOREIGN KEY (table_description_key)
REFERENCES dw_description_table (table_description_key) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);
insert into dw_description_column (
table_description_key ,
physical_full_name ,
physical_schema_name ,
physical_table_name ,
physical_column_name ,
ordinal_position ,
column_default ,
is_nullable ,
data_type ,
logical_name_cs ,
description_cs ,
logical_name_en ,
description_en ,
derived_rule ,
todo ,
pk_name ,
fk_name ,
foreign_table_name ,
foreign_column_name ,
is_primary_key ,
is_foreign_key )
(
with
dw_constraints as (
SELECT
tc.constraint_name,
tc.constraint_schema || '.' || tc.table_name || '.' || kcu.column_name as physical_full_name,
tc.constraint_schema,
tc.table_name,
kcu.column_name,
ccu.table_name AS foreign_table_name,
ccu.column_name AS foreign_column_name,
TC.constraint_type
FROM
information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu ON (tc.constraint_name = kcu.constraint_name and tc.table_name = kcu.table_name)
JOIN information_schema.constraint_column_usage AS ccu ON ccu.constraint_name = tc.constraint_name
WHERE
constraint_type in ('PRIMARY KEY','FOREIGN KEY')
AND tc.constraint_schema = 'bizdata'
and (tc.table_name like 'dw%' or tc.table_name like 'etl%')
group by
tc.constraint_name,
tc.constraint_schema,
tc.table_name,
kcu.column_name,
ccu.table_name ,
ccu.column_name,
TC.constraint_type
)
select
dwdt.table_description_key,
col.table_schema || '.' || col.table_name || '.' || col.column_name as physical_full_name,
col.table_schema as physical_schema_name,
col.table_name as physical_table_name,
col.column_name as physical_column_name,
col.ordinal_position,
col.column_default,
col.is_nullable,
col.data_type,
null as Logical_Name_CS ,
null as Description_CS ,
null as Logical_Name_EN,
null as Description_EN ,
null as Derived_Rule ,
null as ToDo,
dwc1.constraint_name pk_name,
dwc2.constraint_name as fk_name,
dwc2.foreign_table_name,
dwc2.foreign_column_name,
case when dwc1.constraint_name is not null then true else false end as is_primary_key,
case when dwc2.constraint_name is not null then true else false end as foreign_key
from
information_schema.columns col
join dw_description_table dwdt on (col.table_schema || '.' || col.table_name = dwdt.physical_full_name )
left join dw_constraints dwc1 on ((col.table_schema || '.' || col.table_name || '.' || col.column_name) = dwc1.physical_full_name and dwc1.constraint_type = 'PRIMARY KEY')
left join dw_constraints dwc2 on ((col.table_schema || '.' || col.table_name || '.' || col.column_name) = dwc2.physical_full_name and dwc2.constraint_type = 'FOREIGN KEY')
where
col.table_name like 'dw%' or col.table_name like 'etl%'
);
I have a table t1 as below:
create table t1 (
person_id int,
item_name varchar(30),
item_value varchar(100)
);
There are five records in this table:
person_id | item_name | item_value
1 'NAME' 'john'
1 'GENDER' 'M'
1 'DOB' '1970/02/01'
1 'M_PHONE' '1234567890'
1 'ADDRESS' 'Some Addresses unknown'
Now I want to use the crosstab function to extract the NAME and GENDER data, so I wrote this query:
select * from crosstab(
'select person_id, item_name, item_value from t1
where person_id=1 and item_name in ('NAME', 'GENDER') ')
as virtual_table (person_id int, NAME varchar, GENDER varchar)
My problem is that the query string passed to crosstab() contains a condition on item_name, and its single quotes clash with the quotes that delimit the string.
How do I solve the problem?
To avoid any confusion about how to escape single quotes and generally simplify the syntax, use dollar-quoting for the query string:
SELECT *
FROM crosstab(
$$
SELECT person_id, item_name, item_value
FROM t1
WHERE person_id = 1
AND item_name IN ('NAME', 'GENDER')
$$
) AS virtual_table (person_id int, name varchar, gender varchar);
See:
Insert text with single quotes in PostgreSQL
And you should add ORDER BY to your query string. I quote the manual for the tablefunc module:
In practice the SQL query should always specify ORDER BY 1,2 to ensure
that the input rows are properly ordered, that is, values with the
same row_name are brought together and correctly ordered within the
row. Notice that crosstab itself does not pay any attention to the
second column of the query result; it's just there to be ordered by,
to control the order in which the third-column values appear across the page.
See:
PostgreSQL Crosstab Query
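To see why the row order matters, here is a small Python emulation of the one-argument crosstab() as the manual describes it: one output row per run of consecutive rows with the same row_name, value columns filled left to right, extra values skipped and missing ones left as NULL. It is a simplified model, not the module's actual code, and the person_id 2 row is invented for the demonstration.

```python
from itertools import groupby


def crosstab(rows, n_value_columns):
    """Emulate single-argument crosstab on (row_name, category, value) rows."""
    out = []
    # One output row per run of *consecutive* rows sharing a row_name.
    for row_name, run in groupby(rows, key=lambda r: r[0]):
        values = [value for _, _, value in run][:n_value_columns]  # extras are skipped
        values += [None] * (n_value_columns - len(values))  # missing ones become NULL
        out.append((row_name, *values))
    return out


rows = [(1, "NAME", "john"), (1, "GENDER", "M")]

# The category column is ignored; values land in arrival order.
print(crosstab(rows, 2))          # [(1, 'john', 'M')]
print(crosstab(sorted(rows), 2))  # [(1, 'M', 'john')]

# Without ordering, rows of the same person may not be adjacent.
scattered = [(1, "NAME", "john"), (2, "NAME", "mary"), (1, "GENDER", "M")]
print(crosstab(scattered, 2))     # [(1, 'john', None), (2, 'mary', None), (1, 'M', None)]
```

Note the second call: with ORDER BY 1,2 the GENDER row sorts before the NAME row, so in this model the output column list would have to name gender before name.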
Double your single quotes to escape them:
select * from crosstab(
'select person_id, item_name, item_value from t1
where person_id=1 and item_name in (''NAME'', ''GENDER'') ')
as virtual_table (person_id int, NAME varchar, GENDER varchar)
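If the query string is assembled in application code, the doubling can be automated. Here is a minimal Python sketch, assuming the server runs with standard_conforming_strings on so that backslashes need no special handling; sql_literal is a made-up helper name, and for user-supplied values the driver's own parameter binding is the safer choice.

```python
def sql_literal(text):
    """Wrap text in single quotes, doubling any single quotes inside it."""
    return "'" + text.replace("'", "''") + "'"


inner = (
    "select person_id, item_name, item_value from t1 "
    "where person_id=1 and item_name in ('NAME', 'GENDER')"
)

# The result can be placed directly as the argument of crosstab(...).
print(sql_literal(inner))
```

The printed string contains (''NAME'', ''GENDER''), matching the hand-escaped version above.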