How to calculate average mark for each student - postgresql

I have a table "Register"
with columns:
class_id bigint NOT NULL,
discipline text,
datelesson date NOT NULL,
student_id integer NOT NULL,
note character varying(2)
Now I want to calculate the average score for each student_id, along with the number of absences.
Select * from "Register" as m
Join
  (SELECT AVG(average), COUNT(abs)
   FROM (SELECT
           CASE WHEN "note" ~ '[0-9]' THEN CAST("note" AS decimal) END AS average,
           CASE WHEN "note" = 'a' THEN "note" END AS abs
         FROM "Register") AS average) n
on class_id = 0001
And datelesson between '01.01.2012' And '06.06.2012'
And discipline = 'music'
order by student_id
Result is this:
0001;"music";"2012-05-02";101;"6";7.6666666666666667;1
0001;"music";"2012-05-03";101;"a";7.6666666666666667;1
0001;"music";"2012-05-01";101;"10";7.6666666666666667;1
0001;"music";"2012-05-02";102;"7";7.6666666666666667;1
0001;"music";"2012-05-03";102;"";7.6666666666666667;1
0001;"music";"2012-05-01";102;"";7.6666666666666667;1
The result I receive is for the whole column but how do I calculate average marks for each student?

Could look like this:
SELECT student_id
, AVG(CASE WHEN note ~ '^[0-9]+$' THEN note::numeric
ELSE NULL END) AS average
, COUNT(CASE WHEN note = 'a' THEN 1 ELSE NULL END) AS absent
FROM "Register"
WHERE class_id = 1
AND datelesson BETWEEN '2012-01-01' AND '2012-06-06'
AND discipline = 'music'
GROUP BY student_id
ORDER BY student_id;
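Against the six sample rows shown above, this returns one line per student, along the lines of:
101;8.0;1
102;7.0;0
(Student 101 averages the 6 and 10 and has one 'a'; for student 102 the empty notes are ignored.)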
I added a couple of improvements.
You don't need to double-quote lower-case identifiers.
If you want to make sure there are only digits in note, your regular expression must be something like note ~ '^[0-9]+$'. What you have only checks whether there is at least one digit anywhere in the string. (Use + rather than *: the empty strings in your sample data would match '^[0-9]*$', and casting '' to numeric raises an error.)
It's best to use the ISO format for dates, which works the same with any locale: YYYY-MM-DD.
The count for absences works because NULL values are not counted. You could also use SUM() for that.
As class_id is a number type, bigint to be precise, leading zeros are just noise.
Use class_id = 1 instead of class_id = 0001.
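To see the difference between the two patterns, here is a quick self-contained check (the sample values are made up):
SELECT note,
       note ~ '[0-9]'    AS has_a_digit, -- also true for mixed values like '6a'
       note ~ '^[0-9]+$' AS digits_only  -- true only for pure numbers like '6' or '10'
FROM (VALUES ('6'), ('10'), ('a'), (''), ('6a')) AS t(note);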

Looks to me like you are missing a GROUP BY clause. I am not familiar with Postgres, but I suspect the idea applies just the same.
Here is an example in Transact-SQL:
--create table register
--(
--class_id bigint NOT NULL,
-- discipline text,
-- datelesson date NOT NULL,
-- student_id integer NOT NULL,
-- grade_report integer not null,
-- )
--drop table register
delete from register
go
insert into register
values( 1, 'math', '1/1/2011', 1, 1)
insert into register
values( 1, 'reading', '1/1/2011', 1, 2)
insert into register
values( 1, 'writing', '1/1/2011', 1, 5)
insert into register
values( 1, 'math', '1/1/2011', 2, 8)
insert into register
values( 1, 'reading', '1/1/2011', 2, 9)
SELECT student_id, AVG(grade_report) as 'Average', COUNT(*) as 'NumClasses'
from register
WHERE class_id=1
group by student_id
order by student_id
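Against the sample rows above, this should return something like:
student_id  Average  NumClasses
1           2        3
2           8        2
Note that AVG over an integer column performs integer division in T-SQL; cast first, e.g. AVG(CAST(grade_report AS decimal(10,2))), if you want fractional averages.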
cheers

Related

Is it possible to find duplicating records in two columns simultaneously in PostgreSQL?

I have the following database schema (oversimplified):
create sequence partners_partner_id_seq;
create table partners
(
partner_id integer default nextval('partners_partner_id_seq'::regclass) not null primary key,
name varchar(255) default NULL::character varying,
company_id varchar(20) default NULL::character varying,
vat_id varchar(50) default NULL::character varying,
is_deleted boolean default false not null
);
INSERT INTO partners(name, company_id, vat_id) VALUES('test1','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test2','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test3','3214567890102', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test4','9999999999999', 'GE9999999999999');
I am trying to figure out how to return test1, test2 (because the company_id column value duplicates vertically) and test3 (because the vat_id column value duplicates vertically as well).
To put it in other words - I need to find duplicating company_id and vat_id records and group them together, so that test1, test2 and test3 would be together, because they duplicate by company_id and vat_id.
So far I have the following query:
SELECT *
FROM (
SELECT *, LEAD(row, 1) OVER () AS nextrow
FROM (
SELECT *, ROW_NUMBER() OVER (w) AS row
FROM partners
WHERE is_deleted = false
AND ((company_id != '' AND company_id IS NOT null) OR (vat_id != '' AND vat_id IS NOT NULL))
WINDOW w AS (PARTITION BY company_id, vat_id ORDER BY partner_id DESC)
) x
) y
WHERE (row > 1 OR nextrow > 1)
AND is_deleted = false
This successfully shows all company_id duplicates, but does not appear to show vat_id ones - test3 row is missing. Is this possible to be done within one query?
Here is a db-fiddle with the schema, data and predefined query reproducing my result.
You can do this with recursion, but depending on the size of your data you may want to iterate instead.
The trick is to make the name just another match key instead of treating it differently than the company_id and vat_id:
create table partners (
partner_id integer generated always as identity primary key,
name text,
company_id text,
vat_id text,
is_deleted boolean not null default false
);
insert into partners (name, company_id, vat_id) values
('test1','1010109191191', 'BG1010109191192'),
('test2','1010109191191', 'BG1010109191192'),
('test3','3214567890102', 'BG1010109191192'),
('test4','9999999999999', 'GE9999999999999'),
('test5','3214567890102', 'BG8888888888888'),
('test6','2983489023408', 'BG8888888888888')
;
I added a couple of test cases and left in the lone partner.
with recursive keys as (
select partner_id,
array['n_'||name, 'c_'||company_id, 'v_'||vat_id] as matcher,
array[partner_id] as matchlist,
1 as size
from partners
), matchers as (
select *
from keys
union all
select p.partner_id, c.matcher,
p.matchlist||c.partner_id as matchlist,
p.size + 1
from matchers p
join keys c
on c.matcher && p.matcher
and not p.matchlist @> array[c.partner_id]
), largest as (
select distinct sort(matchlist) as matchlist
from matchers m
where not exists (select 1
from matchers
where matchlist @> m.matchlist
and size > m.size)
-- and size > 1
)
select *
from largest
;
matchlist
{1,2,3,5,6}
{4}
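Note that sort() on integer arrays is not built into core PostgreSQL; it comes from the intarray extension, so the query above assumes:
create extension if not exists intarray;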
fiddle
UPDATE:
Since the recursive version did not perform well enough, here is an iterative example in PL/pgSQL that uses a temporary table:
create temporary table match1 (
partner_id int not null,
group_id int not null,
matchkey uuid not null
);
create index on match1 (matchkey);
create index on match1 (group_id);
insert into match1
select partner_id, partner_id, md5('n_'||name)::uuid from partners
union all
select partner_id, partner_id, md5('c_'||company_id)::uuid from partners
union all
select partner_id, partner_id, md5('v_'||vat_id)::uuid from partners;
do $$
declare _cnt bigint;
begin
loop
with consolidate as (
select group_id,
min(group_id) over (partition by matchkey) as new_group_id
from match1
), minimize as (
select group_id, min(new_group_id) as new_group_id
from consolidate
group by group_id
), doupdate as (
update match1
set group_id = m.new_group_id
from minimize m
where m.group_id = match1.group_id
and m.new_group_id != match1.group_id
returning *
)
select count(*) into _cnt from doupdate;
if _cnt = 0 then
exit;
end if;
end loop;
end;
$$;
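Once the loop converges, match1 holds the final grouping; a simple way to read the groups back out (same shape as the recursive result):
select group_id, array_agg(distinct partner_id order by partner_id) as members
from match1
group by group_id
order by group_id;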
updated fiddle

Weighted Random Selection

I have two tables with the most common first and last names. Each table basically has two fields:
Tables
CREATE TABLE "common_first_name" (
"first_name" text PRIMARY KEY, --The text representing the name
"ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.
"inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
"updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);
CREATE TABLE "common_last_name" (
"last_name" text PRIMARY KEY, --The text representing the name
"ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.
"inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
"updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);
P.S: The TOP 1 name occurs only ~ 1.8% of the time. The tables have 1000 rows each.
Function (pseudocode, not ready)
CREATE OR REPLACE FUNCTION create_sample_data(p_number_of_records INT)
RETURNS VOID
AS $$
DECLARE
SUM_OF_WEIGHTS CONSTANT INT := 100;
BEGIN
FOR i IN 1..coalesce(p_number_of_records, 0) LOOP
--Get the random first and last name, taking into consideration their probability (ratio): round(random()*SUM_OF_WEIGHTS);
--create_person (random_first_name || ' ' || random_last_name);
END LOOP;
END
$$
LANGUAGE plpgsql VOLATILE;
P.S.: The ratios for the names in each table sum up to 100%.
I want to run a function N times and get a name and a surname to create sample data... both tables have 1000 rows each.
The sample size can be anywhere from 1000 full names to 1000000 names, so if there is a "fast" way of doing this random weighted function, even better.
Any suggestion of how to do it in PL/PGSQL?
I am using PG 13.3 on SUPABASE.IO.
Thanks
Given the small input dataset, it's straightforward to do this in pure SQL. Use CTEs to build lower & upper bound columns for each row in each of the common_FOO_name tables, then use generate_series() to generate sets of random numbers. Join everything together, and use the random value between the bounds as the WHERE clause.
with first_names_weighted as (
select first_name,
sum(ratio) over (order by first_name) - ratio as lower_bound,
sum(ratio) over (order by first_name) as upper_bound
from common_first_name
),
last_names_weighted as (
select last_name,
sum(ratio) over (order by last_name) - ratio as lower_bound,
sum(ratio) over (order by last_name) as upper_bound
from common_last_name
),
randoms as (
select random() * (select sum(ratio) from common_first_name) as f_random,
random() * (select sum(ratio) from common_last_name) as l_random
from generate_series(1, 32)
)
select r, first_name, last_name
from randoms r
cross join first_names_weighted f
cross join last_names_weighted l
where f.lower_bound <= r.f_random and r.f_random <= f.upper_bound
and l.lower_bound <= r.l_random and r.l_random <= l.upper_bound;
Change the value passed to generate_series() to control how many names to generate. If it's important that it be a function, you can just use a LANGUAGE SQL function definition to parameterize that number:
https://www.db-fiddle.com/f/mmGQRhCP2W1yfhZTm1yXu5/3
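If it helps, here is a minimal sketch of that wrapper, reusing the query above verbatim and swapping the generate_series() bound for the parameter (the function name and signature are illustrative):
CREATE OR REPLACE FUNCTION create_sample_data(p_number_of_records int)
RETURNS TABLE (first_name text, last_name text)
LANGUAGE sql VOLATILE
AS $$
  with first_names_weighted as (
    select cf.first_name,
           sum(cf.ratio) over (order by cf.first_name) - cf.ratio as lower_bound,
           sum(cf.ratio) over (order by cf.first_name) as upper_bound
    from common_first_name cf
  ),
  last_names_weighted as (
    select cl.last_name,
           sum(cl.ratio) over (order by cl.last_name) - cl.ratio as lower_bound,
           sum(cl.ratio) over (order by cl.last_name) as upper_bound
    from common_last_name cl
  ),
  randoms as (
    select random() * (select sum(ratio) from common_first_name) as f_random,
           random() * (select sum(ratio) from common_last_name) as l_random
    from generate_series(1, p_number_of_records)
  )
  select f.first_name, l.last_name
  from randoms r
  cross join first_names_weighted f
  cross join last_names_weighted l
  where f.lower_bound <= r.f_random and r.f_random <= f.upper_bound
    and l.lower_bound <= r.l_random and r.l_random <= l.upper_bound
$$;
-- usage: SELECT * FROM create_sample_data(1000);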

Use a sequence to populate multiple columns with successive values

How can I use default constraints, triggers, or some other mechanism to automatically insert multiple successive values from a sequence into multiple columns on the same row of a table?
A standard use of a sequence in SQL Server is to combine it with default constraints on multiple tables to essentially get a cross-table identity. See for example the section "C. Using a Sequence Number in Multiple Tables" in the Microsoft documentation article "Sequence Numbers".
This works great if you only want to get a single value from the sequence for each row inserted. But sometimes I want to get multiple successive values. So theoretically I would create a sequence and table like this:
CREATE SEQUENCE DocumentationIDs;
CREATE TABLE Product
(
ProductID BIGINT NOT NULL IDENTITY(1, 1) PRIMARY KEY
, ProductName NVARCHAR(100) NOT NULL
, MarketingDocumentationID BIGINT NOT NULL DEFAULT ( NEXT VALUE FOR DocumentationIDs )
, TechnicalDocumentationID BIGINT NOT NULL DEFAULT ( NEXT VALUE FOR DocumentationIDs )
, InternalDocumentationID BIGINT NOT NULL DEFAULT ( NEXT VALUE FOR DocumentationIDs )
);
Unfortunately this will insert the same value in all three columns. This is by design:
If there are multiple instances of the NEXT VALUE FOR function specifying the same sequence generator within a single Transact-SQL statement, all those instances return the same value for a given row processed by that Transact-SQL statement. This behavior is consistent with the ANSI standard.
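So with the Product table above, a single insert demonstrates the problem:
INSERT INTO Product (ProductName) VALUES (N'Widget');
-- all three DocumentationID columns get the same value from the sequence,
-- e.g. (1, 1, 1) rather than (1, 2, 3)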
Increment by hack
The only suggestion I could find online was to use a hack where you have the sequence increment by the number of columns you need to insert (three in my contrived example) and manually add to the NEXT VALUE FOR function in the default constraint:
CREATE SEQUENCE DocumentationIDs START WITH 1 INCREMENT BY 3;
CREATE TABLE Product
(
ProductID BIGINT NOT NULL IDENTITY(1, 1) PRIMARY KEY
, ProductName NVARCHAR(100) NOT NULL
, MarketingDocumentationID BIGINT NOT NULL DEFAULT ( NEXT VALUE FOR DocumentationIDs )
, TechnicalDocumentationID BIGINT NOT NULL DEFAULT ( ( NEXT VALUE FOR DocumentationIDs ) + 1 )
, InternalDocumentationID BIGINT NOT NULL DEFAULT ( ( NEXT VALUE FOR DocumentationIDs ) + 2 )
)
This does not work for me because not all tables using my sequence require the same number of values.
One possible way to do this is with an AFTER INSERT trigger.
The table definition needs to be changed slightly (the DocumentationID columns should be defaulted to 0 or allowed to be nullable):
CREATE TABLE Product
(
ProductID BIGINT NOT NULL IDENTITY(1, 1)
, ProductName NVARCHAR(100) NOT NULL
, MarketingDocumentationID BIGINT NOT NULL CONSTRAINT DF_Product_1 DEFAULT (0)
, TechnicalDocumentationID BIGINT NOT NULL CONSTRAINT DF_Product_2 DEFAULT (0)
, InternalDocumentationID BIGINT NOT NULL CONSTRAINT DF_Product_3 DEFAULT (0)
, CONSTRAINT PK_Product PRIMARY KEY (ProductID)
);
And the trigger doing the job is as follows:
CREATE TRIGGER Product_AfterInsert ON Product
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
IF NOT EXISTS (SELECT 1 FROM INSERTED)
RETURN;
CREATE TABLE #DocIDs
(
ProductID BIGINT NOT NULL
, Num INT NOT NULL
, DocID BIGINT NOT NULL
, PRIMARY KEY (ProductID, Num)
);
-- Draw three successive sequence values per inserted row (Num = 1..3)
INSERT INTO #DocIDs (ProductID, Num, DocID)
SELECT
i.ProductID
, r.n
, NEXT VALUE FOR DocumentationIDs OVER (ORDER BY i.ProductID, r.n)
FROM INSERTED i
CROSS APPLY (VALUES (1), (2), (3)) r(n)
;
-- Pivot the three values into one row per product and write them back
WITH Docs (ProductID, MarketingDocID, TechnicalDocID, InternalDocID)
AS (
SELECT ProductID, [1], [2], [3]
FROM #DocIDs d
PIVOT (MAX(DocID) FOR Num IN ([1], [2], [3])) pvt
)
UPDATE p
SET
p.MarketingDocumentationID = d.MarketingDocID
, p.TechnicalDocumentationID = d.TechnicalDocID
, p.InternalDocumentationID = d.InternalDocID
FROM Product p
JOIN Docs d ON d.ProductID = p.ProductID
;
END
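A quick way to verify the trigger (the exact IDs depend on the sequence state):
INSERT INTO Product (ProductName) VALUES (N'Widget A'), (N'Widget B');
SELECT ProductID, MarketingDocumentationID, TechnicalDocumentationID, InternalDocumentationID
FROM Product;
-- each product now has three distinct, successive documentation IDs, e.g. (1,2,3) and (4,5,6)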

Efficiently select the most specific result from a table

I have a table roughly as follows:
CREATE TABLE t_table (
f_userid BIGINT NOT NULL
,f_groupaid BIGINT
,f_groupbid BIGINT
,f_groupcid BIGINT
,f_itemid BIGINT
,f_value TEXT
);
The groups are orthogonal, so no hierarchy can be implied beyond the fact that every entry in the table will have a user ID. There is no uniqueness in any of the columns.
So for example a simple setup might be:
INSERT INTO t_table VALUES (1, NULL, NULL, NULL, NULL, 'Value for anything by user 1');
INSERT INTO t_table VALUES (1, 5, 2, NULL, NULL, 'Value for anything by user 1 in groupA 5 groupB 2');
INSERT INTO t_table VALUES (1, 4, NULL, 1, NULL, 'Value for anything by user 1 in groupA 4 and groupC 1');
INSERT INTO t_table VALUES (2, NULL, NULL, NULL, NULL, 'Value for anything by user 2');
INSERT INTO t_table VALUES (2, 1, NULL, NULL, NULL, 'Value for anything by user 2 in groupA 1');
INSERT INTO t_table VALUES (2, 1, 3, 4, 5, 'Value for item 5 by user 2 in groupA 1 and groupB 3 and groupC 4');
For any given set of user/groupA/groupB/groupC/item I want to be able to obtain the most specific item in the table that applies. If any of the given set are NULL then it can only match relevant columns in the table which contain NULL. For example:
// Exact match
SELECT MostSpecific(1, NULL, NULL, NULL, NULL) => "Value for anything by user 1"
// Match the second entry because groupC and item were not specified in the table and the other items matched
SELECT MostSpecific(1, 5, 2, 3, NULL) => "Value for anything by user 1 in groupA 5 groupB 2"
// Does not match the second entry because groupA is NULL in the query and set in the table
SELECT MostSpecific(1, NULL, 2, 3, 4) => "Value for anything by user 1"
The obvious approach here is for the stored procedure to work through the parameters, find out which are NULL and which are not, and then call the appropriate SELECT statement. But this seems very inefficient. Is there a better way of doing this?
This should do it: just filter out any non-matching rows using a WHERE clause, then rank the remaining rows by how well they match. If any column doesn't match, the whole bop expression becomes NULL, so we filter that out in an outer query, where we also order by match quality and limit the result to the single best match.
CREATE FUNCTION MostSpecific(BIGINT, BIGINT, BIGINT, BIGINT, BIGINT)
RETURNS TABLE(f_userid BIGINT, f_groupaid BIGINT, f_groupbid BIGINT, f_groupcid BIGINT, f_itemid BIGINT, f_value TEXT) AS
'WITH cte AS (
SELECT *,
CASE WHEN f_groupaid IS NULL THEN 0 WHEN f_groupaid = $2 THEN 1 END +
CASE WHEN f_groupbid IS NULL THEN 0 WHEN f_groupbid = $3 THEN 1 END +
CASE WHEN f_groupcid IS NULL THEN 0 WHEN f_groupcid = $4 THEN 1 END +
CASE WHEN f_itemid IS NULL THEN 0 WHEN f_itemid = $5 THEN 1 END bop
FROM t_table
WHERE f_userid = $1
AND (f_groupaid IS NULL OR f_groupaid = $2)
AND (f_groupbid IS NULL OR f_groupbid = $3)
AND (f_groupcid IS NULL OR f_groupcid = $4)
AND (f_itemid IS NULL OR f_itemid = $5)
)
SELECT f_userid, f_groupaid, f_groupbid, f_groupcid, f_itemid, f_value FROM cte
WHERE bop IS NOT NULL
ORDER BY bop DESC
LIMIT 1'
LANGUAGE SQL
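Calling it with the second example from the question picks the best partial match:
SELECT f_value FROM MostSpecific(1, 5, 2, 3, NULL);
-- Value for anything by user 1 in groupA 5 groupB 2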
An SQLfiddle to test with.
Try something like:
select *
from t_table t
where f_userid = $p_userid
and (t.f_groupaid is not distinct from $p_groupaid or t.f_groupaid is null) --null in f_groupaid matches both null and not null values
and (t.f_groupbid is not distinct from $p_groupbid or t.f_groupbid is null)
and (t.f_groupcid is not distinct from $p_groupcid or t.f_groupcid is null)
order by (t.f_groupaid is not distinct from $p_groupaid)::int -- order by count of matches
+(t.f_groupbid is not distinct from $p_groupbid)::int
+(t.f_groupcid is not distinct from $p_groupcid)::int desc
limit 1;
It will give you the best match on groups.
A is not distinct from B will return true if A and B are equal or both null.
::int means CAST(... AS int). Casting boolean true to int gives 1 (you cannot add boolean values directly).
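A one-row sanity check of both points:
select (null is not distinct from null)::int as both_null, -- 1
       (1 is not distinct from 1)::int as equal,           -- 1
       (1 is not distinct from null)::int as one_null;     -- 0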
SQL Fiddle
create or replace function mostSpecific(
p_userid bigint,
p_groupaid bigint,
p_groupbid bigint,
p_groupcid bigint,
p_itemid bigint
) returns t_table as $body$
select *
from t_table
order by
(p_userid is not distinct from f_userid or f_userid is null)::integer
+
(p_groupaid is not distinct from f_groupaid or f_groupaid is null)::integer
+
(p_groupbid is not distinct from f_groupbid or f_groupbid is null)::integer
+
(p_groupcid is not distinct from f_groupcid or f_groupcid is null)::integer
+
(p_itemid is not distinct from f_itemid or f_itemid is null)::integer
desc
limit 1
;
$body$ language sql;

What's wrong with this T-SQL MERGE statement?

I am new to MERGE, and I'm sure I have some error in my code.
This code will run and create my scenario:
I have two tables, one that is called TempUpsert that fills from a SqlBulkCopy operation (100s of millions of records) and a Sales table that holds the production data which is to be indexed and used.
I wish to merge the TempUpsert table with the Sales one
I am obviously doing something wrong as it fails with even the smallest example
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TempUpsert]') )
drop table TempUpsert;
CREATE TABLE [dbo].[TempUpsert](
[FirstName] [varchar](200) NOT NULL,
[LastName] [varchar](200) NOT NULL,
[Score] [int] NOT NULL
) ON [PRIMARY] ;
CREATE TABLE [dbo].[Sales](
[FullName] [varchar](200) NOT NULL,
[LastName] [varchar](200) NOT NULL,
[FirstName] [varchar](200) NOT NULL,
[lastUpdated] [date] NOT NULL,
CONSTRAINT [PK_Sales] PRIMARY KEY CLUSTERED
(
[FullName] ASC
)
);
---- PROC
CREATE PROCEDURE [dbo].[sp_MoveFromTempUpsert_to_Sales]
(@HashMod int)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
MERGE Sales AS trget
USING (
SELECT
--- Edit: Thanks to Mikal added DISTINCT
DISTINCT
FirstName, LastName , [Score], LastName+'.'+FirstName AS FullName
FROM TempUpsert AS ups) AS src (FirstName, LastName, [Score], FullName)
ON
(
src.[Score] = @HashMod
AND
trget.FullName=src.FullName
)
WHEN MATCHED
THEN
UPDATE SET trget.lastUpdated = GetDate()
WHEN NOT MATCHED
THEN INSERT ([FullName], [LastName], [FirstName], [lastUpdated])
VALUES (FullName, src.LastName, src.FirstName, GetDate())
OUTPUT $action, Inserted.*, Deleted.* ;
--print @@ROWCOUNT
END
GO
--- Insert dummie data
INSERT INTO TempUpsert (FirstName, LastName, Score)
VALUES ('John','Smith',2);
INSERT INTO TempUpsert (FirstName, LastName, Score)
VALUES ('John','Block',2);
INSERT INTO TempUpsert (FirstName, LastName, Score)
VALUES ('John','Smith',2); --make multiple on purpose
----- EXECUTE PROC
GO
DECLARE @return_value int
EXEC @return_value = [dbo].[sp_MoveFromTempUpsert_to_Sales]
@HashMod = 2
SELECT 'Return Value' = @return_value
GO
This returns:
(1 row(s) affected)
(1 row(s) affected)
(1 row(s) affected)
Msg 2627, Level 14, State 1, Procedure sp_MoveFromTempUpsert_to_Sales, Line 12
Violation of PRIMARY KEY constraint 'PK_Sales'. Cannot insert duplicate key in object
'dbo.Sales'. The statement has been terminated.
(1 row(s) affected)
What am I doing wrong please?
Greatly appreciated
The first and third rows in your staging table give you the duplicate PK violation: FullName is the PK, and both ('John','Smith') rows produce the same LastName+'.'+FirstName value.
In summation:
MERGE requires its USING input to be free of duplicates.
The USING clause takes a regular SQL statement, so you can use GROUP BY, DISTINCT, and HAVING as well as WHERE clauses.
My final MERGE looks like this:
MERGE Sales AS trget
USING (
SELECT FirstName, LastName, Score, LastName + '.' + FirstName AS FullName
FROM TempUpsert AS ups
WHERE Score = @HashMod
GROUP BY FirstName, LastName, Score, LastName + '.' + FirstName
) AS src (FirstName, LastName, [Score], FullName)
ON
(
-- src.[Score] = @HashMod
--AND
trget.FullName=src.FullName
)
WHEN MATCHED
THEN
UPDATE SET trget.lastUpdated = GetDate()
WHEN NOT MATCHED
THEN INSERT ([FullName], [LastName], [FirstName], [lastUpdated])
VALUES (FullName, src.LastName, src.FirstName, GetDate())
OUTPUT $action, Inserted.*, Deleted.* ;
--print @@ROWCOUNT
END
And it works!
Thanks to you all :)
Without DISTINCT or a proper aggregate function in the subquery used in the USING part of the MERGE, there will be two rows matching the criteria in the ON part, which is not allowed (two John.Smith rows).
Also, move the condition src.[Score] = @HashMod inside the subquery. If it stays in the ON clause and does not match (for example, John.Smith has a score of 2 and @HashMod = 1), then if you already have a John.Smith row in the target table, the WHEN NOT MATCHED branch fires and you'll get a primary key constraint violation.
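You can see the offending duplicate directly with the question's sample data:
SELECT LastName + '.' + FirstName AS FullName, COUNT(*) AS cnt
FROM TempUpsert
GROUP BY LastName + '.' + FirstName
HAVING COUNT(*) > 1;
-- returns Smith.John with cnt = 2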