Efficiently select the most specific result from a table - postgresql

I have a table roughly as follows:
CREATE TABLE t_table (
f_userid BIGINT NOT NULL
,f_groupaid BIGINT
,f_groupbid BIGINT
,f_groupcid BIGINT
,f_itemid BIGINT
,f_value TEXT
);
The groups are orthogonal, so no hierarchy can be implied beyond the fact that every entry in the table will have a user ID. There is no uniqueness in any of the columns.
So for example a simple setup might be:
INSERT INTO t_table VALUES (1, NULL, NULL, NULL, NULL, 'Value for anything by user 1');
INSERT INTO t_table VALUES (1, 5, 2, NULL, NULL, 'Value for anything by user 1 in groupA 5 groupB 2');
INSERT INTO t_table VALUES (1, 4, NULL, 1, NULL, 'Value for anything by user 1 in groupA 4 and groupC 1');
INSERT INTO t_table VALUES (2, NULL, NULL, NULL, NULL, 'Value for anything by user 2');
INSERT INTO t_table VALUES (2, 1, NULL, NULL, NULL, 'Value for anything by user 2 in groupA 1');
INSERT INTO t_table VALUES (2, 1, 3, 4, 5, 'Value for item 5 by user 2 in groupA 1 and groupB 3 and groupC 4');
For any given set of user/groupA/groupB/groupC/item I want to be able to obtain the most specific item in the table that applies. If any of the given set are NULL then it can only match relevant columns in the table which contain NULL. For example:
// Exact match
SELECT MostSpecific(1, NULL, NULL, NULL, NULL) => "Value for anything by user 1"
// Match the second entry because groupC and item were not specified in the table and the other items matched
SELECT MostSpecific(1, 5, 2, 3, NULL) => "Value for anything by user 1 in groupA 5 groupB 2"
// Does not match the second entry because groupA is NULL in the query and set in the table
SELECT MostSpecific(1, NULL, 2, 3, 4) => "Value for anything by user 1"
The obvious approach here is for the stored procedure to work through the parameters, find out which are NULL and which are not, and then call the appropriate SELECT statement. But this seems very inefficient. Is there a better way of doing this?

This should do it: filter out any non-matching rows using a WHERE clause, then rank the remaining rows by how well they match. If any column doesn't match, the whole bop expression evaluates to NULL, so the outer query filters those rows out, orders by match quality, and limits the result to the single best match.
CREATE FUNCTION MostSpecific(BIGINT, BIGINT, BIGINT, BIGINT, BIGINT)
RETURNS TABLE(f_userid BIGINT, f_groupaid BIGINT, f_groupbid BIGINT, f_groupcid BIGINT, f_itemid BIGINT, f_value TEXT) AS
'WITH cte AS (
SELECT *,
CASE WHEN f_groupaid IS NULL THEN 0 WHEN f_groupaid = $2 THEN 1 END +
CASE WHEN f_groupbid IS NULL THEN 0 WHEN f_groupbid = $3 THEN 1 END +
CASE WHEN f_groupcid IS NULL THEN 0 WHEN f_groupcid = $4 THEN 1 END +
CASE WHEN f_itemid IS NULL THEN 0 WHEN f_itemid = $5 THEN 1 END bop
FROM t_table
WHERE f_userid = $1
AND (f_groupaid IS NULL OR f_groupaid = $2)
AND (f_groupbid IS NULL OR f_groupbid = $3)
AND (f_groupcid IS NULL OR f_groupcid = $4)
AND (f_itemid IS NULL OR f_itemid = $5)
)
SELECT f_userid, f_groupaid, f_groupbid, f_groupcid, f_itemid, f_value FROM cte
WHERE bop IS NOT NULL
ORDER BY bop DESC
LIMIT 1'
LANGUAGE SQL;
An SQLfiddle to test with.
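As a quick check, the three cases from the question can be run against the function above. This is only a usage sketch; it assumes t_table holds exactly the six sample rows from the question and that MostSpecific was created as shown.

```sql
-- Exact match on user 1 with no groups or item given.
SELECT f_value FROM MostSpecific(1, NULL, NULL, NULL, NULL);
-- expected: Value for anything by user 1

-- groupA 5 and groupB 2 match the second row; its groupC and item are NULL.
SELECT f_value FROM MostSpecific(1, 5, 2, 3, NULL);
-- expected: Value for anything by user 1 in groupA 5 groupB 2

-- groupA is NULL in the call, so rows with a non-NULL f_groupaid are filtered out.
SELECT f_value FROM MostSpecific(1, NULL, 2, 3, 4);
-- expected: Value for anything by user 1
```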

Try something like:
select *
from t_table t
where f_userid = $p_userid
and (t.f_groupaid is not distinct from $p_groupaid or t.f_groupaid is null) --null in f_groupaid matches both null and not null values
and (t.f_groupbid is not distinct from $p_groupbid or t.f_groupbid is null)
and (t.f_groupcid is not distinct from $p_groupcid or t.f_groupcid is null)
order by (t.f_groupaid is not distinct from $p_groupaid)::int -- order by count of matches
+(t.f_groupbid is not distinct from $p_groupbid)::int
+(t.f_groupcid is not distinct from $p_groupcid)::int desc
limit 1;
It will give you the best match on groups.
A is not distinct from B will return true if A and B are equal or both null.
::int is shorthand for cast(... as int). Casting boolean true to int gives 1 (you cannot add boolean values directly).
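A small standalone illustration of the two points above, using literal values rather than the question's table (the column aliases are just labels for this sketch):

```sql
SELECT NULL::bigint IS NOT DISTINCT FROM NULL::bigint AS both_null,    -- true
       1::bigint    IS NOT DISTINCT FROM NULL::bigint AS one_null,     -- false
       (1::bigint = NULL::bigint)                      AS plain_equals, -- NULL, not false
       true::int + false::int                          AS summed;      -- 1
```

The plain = comparison yields NULL when either side is NULL, which is why is not distinct from is used when NULL should count as a matchable value.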

SQL Fiddle
create or replace function mostSpecific(
p_userid bigint,
p_groupaid bigint,
p_groupbid bigint,
p_groupcid bigint,
p_itemid bigint
) returns t_table as $body$
select *
from t_table
order by
(p_userid is not distinct from f_userid or f_userid is null)::integer
+
(p_groupaid is not distinct from f_groupaid or f_groupaid is null)::integer
+
(p_groupbid is not distinct from f_groupbid or f_groupbid is null)::integer
+
(p_groupcid is not distinct from f_groupcid or f_groupcid is null)::integer
+
(p_itemid is not distinct from f_itemid or f_itemid is null)::integer
desc
limit 1
;
$body$ language sql;

Is it possible to find duplicating records in two columns simultaneously in PostgreSQL?

I have the following database schema (oversimplified):
create sequence partners_partner_id_seq;
create table partners
(
partner_id integer default nextval('partners_partner_id_seq'::regclass) not null primary key,
name varchar(255) default NULL::character varying,
company_id varchar(20) default NULL::character varying,
vat_id varchar(50) default NULL::character varying,
is_deleted boolean default false not null
);
INSERT INTO partners(name, company_id, vat_id) VALUES('test1','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test2','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test3','3214567890102', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test4','9999999999999', 'GE9999999999999');
I am trying to figure out how to return test1, test2 (because the company_id column value duplicates vertically) and test3 (because the vat_id column value duplicates vertically as well).
To put it in other words - I need to find duplicating company_id and vat_id records and group them together, so that test1, test2 and test3 would be together, because they duplicate by company_id and vat_id.
So far I have the following query:
SELECT *
FROM (
SELECT *, LEAD(row, 1) OVER () AS nextrow
FROM (
SELECT *, ROW_NUMBER() OVER (w) AS row
FROM partners
WHERE is_deleted = false
AND ((company_id != '' AND company_id IS NOT null) OR (vat_id != '' AND vat_id IS NOT NULL))
WINDOW w AS (PARTITION BY company_id, vat_id ORDER BY partner_id DESC)
) x
) y
WHERE (row > 1 OR nextrow > 1)
AND is_deleted = false
This successfully shows all company_id duplicates, but does not appear to show vat_id ones - test3 row is missing. Is this possible to be done within one query?
Here is a db-fiddle with the schema, data and predefined query reproducing my result.
You can do this with recursion, but depending on the size of your data you may want to iterate, instead.
The trick is to make the name just another match key instead of treating it differently than the company_id and vat_id:
create table partners (
partner_id integer generated always as identity primary key,
name text,
company_id text,
vat_id text,
is_deleted boolean not null default false
);
insert into partners (name, company_id, vat_id) values
('test1','1010109191191', 'BG1010109191192'),
('test2','1010109191191', 'BG1010109191192'),
('test3','3214567890102', 'BG1010109191192'),
('test4','9999999999999', 'GE9999999999999'),
('test5','3214567890102', 'BG8888888888888'),
('test6','2983489023408', 'BG8888888888888')
;
I added a couple of test cases and left in the lone partner.
with recursive keys as (
select partner_id,
array['n_'||name, 'c_'||company_id, 'v_'||vat_id] as matcher,
array[partner_id] as matchlist,
1 as size
from partners
), matchers as (
select *
from keys
union all
select p.partner_id, c.matcher,
p.matchlist||c.partner_id as matchlist,
p.size + 1
from matchers p
join keys c
on c.matcher && p.matcher
and not p.matchlist @> array[c.partner_id]
), largest as (
select distinct sort(matchlist) as matchlist -- sort() on int[] comes from the intarray extension
from matchers m
where not exists (select 1
from matchers
where matchlist @> m.matchlist
and size > m.size)
-- and size > 1
)
select *
from largest
;
matchlist
{1,2,3,5,6}
{4}
fiddle
EDIT UPDATE
Since recursion did not perform, here is an iterative example in plpgsql that uses a temporary table:
create temporary table match1 (
partner_id int not null,
group_id int not null,
matchkey uuid not null
);
create index on match1 (matchkey);
create index on match1 (group_id);
insert into match1
select partner_id, partner_id, md5('n_'||name)::uuid from partners
union all
select partner_id, partner_id, md5('c_'||company_id)::uuid from partners
union all
select partner_id, partner_id, md5('v_'||vat_id)::uuid from partners;
do $$
declare _cnt bigint;
begin
loop
with consolidate as (
select group_id,
min(group_id) over (partition by matchkey) as new_group_id
from match1
), minimize as (
select group_id, min(new_group_id) as new_group_id
from consolidate
group by group_id
), doupdate as (
update match1
set group_id = m.new_group_id
from minimize m
where m.group_id = match1.group_id
and m.new_group_id != match1.group_id
returning *
)
select count(*) into _cnt from doupdate;
if _cnt = 0 then
exit;
end if;
end loop;
end;
$$;
updated fiddle
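Once the loop exits, each partner's rows in match1 carry the smallest partner_id of its group. A minimal sketch of reading the groups back out, assuming the temporary table match1 from above is still in the session (the alias partner_ids is only a label):

```sql
-- One row per group, listing the partners that ended up sharing a group_id.
SELECT group_id,
       array_agg(DISTINCT partner_id ORDER BY partner_id) AS partner_ids
FROM match1
GROUP BY group_id
ORDER BY group_id;
```

With the six sample partners this should correspond to the matchlist output of the recursive version. Adding HAVING count(DISTINCT partner_id) > 1 would keep only the groups that contain duplicates.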

POSTGRES INSERT/UPDATE ON CONFLICT using WITH CTE

I have a table like below. I am trying to merge into this table based on the value in a CTE. But when I try to update the table when there is a conflict, it cannot get the value in CTE
CREATE TABLE IF NOT EXISTS master_config_details
(
master_config_id INT NOT NULL,
account_id INT NOT NULL,
date_value TIMESTAMP(3) NULL,
number_value BIGINT NULL,
string_value VARCHAR(50) NULL,
row_status SMALLINT NOT NULL,
created_date TIMESTAMP(3) NOT NULL,
modified_date TIMESTAMP(3) NULL,
CONSTRAINT pk_master_config_details PRIMARY KEY (master_config_id, account_id, row_status)
);
INSERT INTO master_config_details VALUES (
1, 11, NULL,100,NULL, 0, '2020-11-18 12:01:18', '2020-11-18 12:02:31');
select * from master_config_details;
Now, using a CTE, I want to insert/update records in this table. Below is the code I am using to do that. When the record already exists in the table, I want to update it based on the data_type_id value in the CTE (cte_input_data.data_type_id), but it fails with the error:
SQL Error [42703]: ERROR: column excluded.data_type_id does not exist
what it should achieve is
if cte_input_data.data_type_id = 1 update master_config_details set date_value = cte.value
if cte_input_data.data_type_id = 2 update master_config_details set number_value = cte.value
if cte_input_data.data_type_id = 3 update master_config_details set string_value = cte.value
The code below should update master_config_details.number_value to 22, as there is already a record with that combination of (master_config_id, account_id, row_status), which is (1,11,1) (run select * from master_config_details; to see the record), but it's throwing an error instead:
SQL Error [42703]: ERROR: column excluded.data_type_id does not exist
WITH cte_input_data AS (
select
1 AS master_config_id
,11 AS account_id
,2 AS data_type_id
,'22' AS value
,1 AS row_status)
INSERT INTO master_config_details
SELECT
cte.master_config_id
,cte.account_id
,CASE WHEN cte.data_type_id = 1 THEN cte.value::timestamp(3) ELSE NULL END AS date_time_value
,CASE WHEN cte.data_type_id = 2 THEN cte.value::integer ELSE NULL END AS number_value
,CASE WHEN cte.data_type_id = 3 THEN cte.value ELSE NULL END AS string_value
,1
,NOW() AT TIME ZONE 'utc'
,NOW() AT TIME ZONE 'utc'
FROM cte_input_data cte
ON CONFLICT (master_config_id,account_id,row_status)
DO UPDATE SET
date_value = CASE WHEN excluded.data_type_id = 1 THEN excluded.date_time_value::timestamp(3) ELSE NULL END
,number_value = CASE WHEN excluded.data_type_id = 2 THEN excluded.number_value::integer ELSE NULL END
,string_value = CASE WHEN excluded.data_type_id = 3 THEN excluded.string_value ELSE NULL END
,modified_date = NOW() AT TIME ZONE 'utc';
The special excluded table is used to reference the values originally proposed for insertion.
So you're getting this error because this column doesn't exist in your target table, and therefore not in the special excluded table either. It exists only in your CTE.
As a workaround, you can select it from the CTE using a nested select in the ON CONFLICT clause.
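A different way around the error, shown here only as a sketch and not as the nested-select workaround itself: the SELECT list already applies the CASE on data_type_id, so the values proposed for insertion (and therefore excluded.date_value, excluded.number_value and excluded.string_value) already hold either the converted value or NULL. The explicit column list and the use of cte.row_status are additions made for this sketch.

```sql
WITH cte_input_data AS (
    SELECT 1    AS master_config_id
         , 11   AS account_id
         , 2    AS data_type_id
         , '22' AS value
         , 1    AS row_status
)
INSERT INTO master_config_details
    (master_config_id, account_id, date_value, number_value, string_value,
     row_status, created_date, modified_date)
SELECT cte.master_config_id
     , cte.account_id
     , CASE WHEN cte.data_type_id = 1 THEN cte.value::timestamp(3) END
     , CASE WHEN cte.data_type_id = 2 THEN cte.value::bigint END
     , CASE WHEN cte.data_type_id = 3 THEN cte.value END
     , cte.row_status
     , NOW() AT TIME ZONE 'utc'
     , NOW() AT TIME ZONE 'utc'
FROM cte_input_data cte
ON CONFLICT (master_config_id, account_id, row_status)
DO UPDATE SET
    -- excluded only exposes columns of the target table, which is all we need here
      date_value    = excluded.date_value
    , number_value  = excluded.number_value
    , string_value  = excluded.string_value
    , modified_date = excluded.modified_date;
```

As in the original statement, the two value columns that do not correspond to data_type_id are set to NULL on update.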

How to create a column that increases according to the value of another column

I have a table where I want to store all the information about article usage, and I need to create an auto-incrementing column where the same ID can be repeated as long as the field (tipo) has a different value, so that each (ID, tipo) pair is unique. For example:
ID / TIPO
1 / AJE -- Is Ok
1 / AJS -- Is Ok (because this Tipo is AJS, different from AJE)
1 / SI -- Is Ok
2 / AJE -- Is Ok (New ID)
2 / AJE -- Is Wrong, because ID=2, TIPO=AJE already exist.
I've added the unique constraint:
ALTER TABLE public.art_movimientos
ADD CONSTRAINT uk_mov UNIQUE (id,tipo) USING INDEX TABLESPACE sistema_index;
But how can I create the auto-increment covering two columns?
My table code:
CREATE TABLE public.art_movimientos
(
id integer NOT NULL DEFAULT nextval('seq_art_movimientos'::regclass),
tipo character(3) NOT NULL, -- Tipos de Valores:...
documento integer,
fecha_doc date[] NOT NULL,
fecha_mov date[] NOT NULL,
calmacen integer NOT NULL,
status character(13) NOT NULL DEFAULT 'PENDIENTE'::bpchar, -- PENDIENTE...
mes integer NOT NULL,
"año" integer NOT NULL,
donado integer NOT NULL DEFAULT 0
)
WITH (
OIDS=FALSE
);
You can manage this situation by using a BEFORE INSERT trigger, mimicking the behaviour described by @dhke:
CREATE TABLE art_movimientos
(
id integer NOT NULL DEFAULT NULL, -- You don't want a serial, nor a default
tipo character(3) NOT NULL, -- Tipos de Valores:...
documento integer,
fecha_doc date[] NOT NULL,
fecha_mov date[] NOT NULL,
calmacen integer NOT NULL,
status character(13) NOT NULL DEFAULT 'PENDIENTE'::bpchar, -- PENDIENTE...
mes integer NOT NULL,
"año" integer NOT NULL,
donado integer NOT NULL DEFAULT 0,
/* You have actually a 2-column Primary Key */
PRIMARY KEY (tipo, id)
);
-- Create a trigger function to generate 'id'
CREATE FUNCTION art_movimientos_insert_trigger()
RETURNS trigger
AS
$$
BEGIN
/* Compute "id", as the following id for a certain "tipo" */
new.id = coalesce(
(SELECT max(id) + 1
FROM art_movimientos a
WHERE a.tipo = new.tipo), 1);
return new;
END
$$
LANGUAGE 'plpgsql'
VOLATILE ;
-- This trigger will be called whenever a new row is inserted, and "id" is
-- not specified (i.e.: it defaults to null), or is specified as null
CREATE TRIGGER art_movimientos_ins_trg
BEFORE INSERT
ON art_movimientos
FOR EACH ROW
WHEN (new.id IS NULL)
EXECUTE PROCEDURE art_movimientos_insert_trigger();
You can then insert the following rows (without specifying the id column):
INSERT INTO art_movimientos
(tipo, documento, fecha_doc, fecha_mov, calmacen, mes, "año")
VALUES
('AJE', 1, array['20170128'::date], array['20170128'::date], 1, 1, 2017),
('AJS', 2, array['20170128'::date], array['20170128'::date], 1, 1, 2017),
('SI', 3, array['20170128'::date], array['20170128'::date], 1, 1, 2017),
('AJE', 4, array['20170128'::date], array['20170128'::date], 1, 1, 2017),
('AJE', 5, array['20170128'::date], array['20170128'::date], 1, 1, 2017) ;
... and see that you get what you intended:
SELECT
id, tipo
FROM
art_movimientos
ORDER BY
documento ;
| id | tipo |
|----|------|
| 1 | AJE |
| 1 | AJS |
| 1 | SI |
| 2 | AJE |
| 3 | AJE |
You can check everything at SQLFiddle (which is a bit picky about PL/pgSQL functions and semicolons).
Side note: There can be a few corner cases where this procedure might fail within a transaction because of deadlocks and/or race conditions, if other transactions are also inserting data into the same table at the same time. So, your overall code should be able to handle aborted transactions, and retry them or show an error to the user.
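One possible way to narrow the race window, offered as a sketch rather than a guaranteed fix: take a transaction-level advisory lock keyed on tipo before computing max(id) + 1, so concurrent inserts for the same tipo queue up behind each other. This assumes the default READ COMMITTED isolation level; hashtext() is an internal PostgreSQL hash function used here only to turn tipo into a lock key, and any stable integer mapping would do.

```sql
CREATE OR REPLACE FUNCTION art_movimientos_insert_trigger()
RETURNS trigger
AS
$$
BEGIN
    -- Serialize inserts per "tipo"; the lock is released at commit or rollback.
    PERFORM pg_advisory_xact_lock(hashtext(new.tipo::text));

    /* Compute "id" as the next id for this "tipo" */
    new.id = coalesce(
        (SELECT max(id) + 1
         FROM art_movimientos a
         WHERE a.tipo = new.tipo), 1);
    RETURN new;
END
$$
LANGUAGE plpgsql
VOLATILE;
```

The primary key on (tipo, id) remains the real safety net: if two transactions still compute the same id, one of them fails with a unique violation and has to be retried.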

How to calculate average mark for each student

I have a table "Register"
with columns:
class_id bigint NOT NULL,
discipline text,
datelesson date NOT NULL,
student_id integer NOT NULL,
note character varying(2)
Now I want to calculate the average score for each student_id and the number of absences:
Select * from "Register" as m
Join
(SELECT AVG(average), COUNT(abs) FROM (SELECT
CASE
WHEN "note" ~ '[0-9]' THEN CAST("note" AS decimal)
END AS average,
CASE
WHEN "note" ='a' THEN "note"
END AS abs
FROM "Register" ) AS average)n
on class_id=0001
And datelesson between '01.01.2012' And '06.06.2012'
And discipline='music' order by student_id
Result is this:
0001;"music";"2012-05-02";101;"6";7.6666666666666667;1
0001;"music";"2012-05-03";101;"a";7.6666666666666667;1
0001;"music";"2012-05-01";101;"10";7.6666666666666667;1
0001;"music";"2012-05-02";102;"7";7.6666666666666667;1
0001;"music";"2012-05-03";102;"";7.6666666666666667;1
0001;"music";"2012-05-01";102;"";7.6666666666666667;1
The result I receive is for the whole column but how do I calculate average marks for each student?
Could look like this:
SELECT student_id
, AVG(CASE WHEN note ~ '^[0-9]+$' THEN note::numeric
ELSE NULL END) AS average
, COUNT(CASE WHEN note = 'a' THEN 1 ELSE NULL END) AS absent
FROM "Register"
WHERE class_id = 1
AND datelesson BETWEEN '2012-01-01' AND '2012-06-06'
AND discipline = 'music'
GROUP BY student_id
ORDER BY student_id;
I added a couple of improvements.
You don't need to double-quote lower case identifiers.
If you want to make sure there are only digits in note, your regular expression must be something like note ~ '^[0-9]+$' (with +, not *, so that an empty string is not matched and then fails the cast to numeric). What you have only checks if there is any digit in the string.
It's best to use the ISO format for dates, which works the same with any locale: YYYY-MM-DD.
The count for absence works because NULL values do not count. You could also use sum for that.
As class_id is a number type, bigint to be precise, leading zeros are just noise.
Use class_id = 1 instead of class_id = 0001.
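For illustration, here are two other ways to write the absence count: the sum variant mentioned above, and an aggregate FILTER clause (available since PostgreSQL 9.4). This is a sketch against the same "Register" table; the aliases absent_sum and absent_filter are only labels.

```sql
SELECT student_id
     , SUM(CASE WHEN note = 'a' THEN 1 ELSE 0 END) AS absent_sum
     , COUNT(*) FILTER (WHERE note = 'a')          AS absent_filter
FROM "Register"
WHERE class_id = 1
  AND datelesson BETWEEN '2012-01-01' AND '2012-06-06'
  AND discipline = 'music'
GROUP BY student_id
ORDER BY student_id;
```

Both columns should give the same number for each student; which form to use is mostly a matter of taste.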
Looks to me like you are missing a "group by" clause. I am not familiar with PostgreSQL, but I suspect the idea applies just the same.
Here is an example in Transact-SQL:
--create table register
--(
--class_id bigint NOT NULL,
-- disciple text,
-- datelesson date NOT NULL,
-- student_id integer NOT NULL,
-- grade_report integer not null,
-- )
--drop table register
delete from register
go
insert into register
values( 1, 'math', '1/1/2011', 1, 1)
insert into register
values( 1, 'reading', '1/1/2011', 1, 2)
insert into register
values( 1, 'writing', '1/1/2011', 1, 5)
insert into register
values( 1, 'math', '1/1/2011', 2, 8)
insert into register
values( 1, 'reading', '1/1/2011', 2, 9)
SELECT student_id, AVG(grade_report) as 'Average', COUNT(*) as 'NumClasses'
from register
WHERE class_id=1
group by student_id
order by student_id
cheers

in T-SQL, is it possible to find names of columns containing NULL in a given row (without knowing all column names)?

Is it possible in T-SQL to write a proper query reflecting this pseudo-code:
SELECT {primary_key}, {column_name}
FROM {table}
WHERE {any column_name value} is NULL
i.e. without referencing each column-name explicitly.
Sounds simple enough but I've searched pretty extensively and found nothing.
You have to use dynamic sql to solve that problem. I have demonstrated how it could be done.
With this SQL you can pick a table and check the row with id = 1 for NULL columns and primary keys. I included a test table at the bottom of the script. The code will not display anything if there are no primary keys and no NULL columns.
DECLARE @table_name VARCHAR(20)
DECLARE @chosencolumn VARCHAR(20)
DECLARE @sqlstring VARCHAR(MAX)
DECLARE @sqlstring2 varchar(100)
DECLARE @text VARCHAR(8000)
DECLARE @t TABLE (col1 VARCHAR(30), dummy INT)
SET @table_name = 'test_table' -- replace with your tablename if you want
SET @chosencolumn = 'ID=1' -- replace with criteria for selected row
SELECT @sqlstring = COALESCE(@sqlstring, '') + 'UNION ALL SELECT '',''''NULL '''' '' + '''+t1.column_name+''', 1000 ordinal_position FROM ['+@table_name+'] WHERE [' +t1.column_name+ '] is null and ' +@chosencolumn+ ' '
FROM INFORMATION_SCHEMA.COLUMNS t1
LEFT JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE t2
ON t1.column_name = t2.column_name
AND t1.table_name = t2.table_name
AND t1.table_schema = t2.table_schema
WHERE t1.table_name = @table_name
AND t2.column_name is null
SET @sqlstring = stuff('UNION ALL SELECT '',''''PRIMARY KEY'''' ''+ column_name + '' '' col1, ordinal_position
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
WHERE table_name = ''' + @table_name+ '''' + @sqlstring, 1, 10, '') + 'order by 2'
INSERT @t
EXEC( @sqlstring)
SELECT @text = COALESCE(@text, '') + col1
FROM @t
SET @sqlstring2 ='select '+stuff(@text,1,1,'')
EXEC( @sqlstring2)
Result:
id host_id date col1
PRIMARY KEY PRIMARY KEY PRIMARY KEY NULL
Test table
CREATE TABLE [dbo].[test_table](
[id] int not null,
[host_id] [int] NOT NULL,
[date] [datetime] NOT NULL,
[col1] [varchar](20) NULL,
[col2] [varchar](20) NULL,
CONSTRAINT [PK_test_table] PRIMARY KEY CLUSTERED
(
[id] ASC,
[host_id] ASC,
[date] ASC
))
Test data
INSERT test_table VALUES (1, 1, getdate(), null, 'somevalue')