How to get a dynamic number of columns in a PostgreSQL crosstab

I'm new to the PostgreSQL crosstab function and have tried out a few solutions here on SO, but I'm still stuck. Basically, I have a query that results in output like the one below:
|student_name|subject_name|marks|
|------------|------------|-----|
|John Doe    |ENGLISH     |65   |
|John Doe    |MATHEMATICS |72   |
|Mary Jane   |ENGLISH     |74   |
|Mary Jane   |MATHEMATICS |70   |
|------------|------------|-----|
And the output I'm aiming for with crosstab is:
|student_name| ENGLISH | MATHEMATICS |
|------------|---------|-------------|
|John Doe    | 65      | 72          |
|Mary Jane   | 74      | 70          |
|------------|---------|-------------|
My query that returns the first table (without crosstab) is:
SELECT student_name, subject_name, sum(marks) AS marks
FROM (
    SELECT student_id, student_name, class_name, exam_type, subject_name,
           total_mark AS marks, total_grade_weight AS out_of, percentage, grade, sort_order
    FROM (
        SELECT student_id, student_name, class_name, exam_type, subject_name,
               total_mark, total_grade_weight,
               ceil(total_mark::float / total_grade_weight::float * 100) AS percentage,
               (SELECT grade FROM app.grading
                WHERE (total_mark::float / total_grade_weight::float) * 100 BETWEEN min_mark AND max_mark) AS grade,
               sort_order
        FROM (
            SELECT --big query with lots of JOINS
        ) q
        ORDER BY sort_order
    ) v
    GROUP BY v.student_id, v.student_name, v.class_name, v.exam_type, v.subject_name,
             v.total_mark, v.total_grade_weight, v.percentage, v.grade, v.sort_order
    ORDER BY student_name ASC, sort_order ASC
) a
GROUP BY student_name, subject_name
ORDER BY student_name
And for the crosstab, this is where I get stuck with the columns.
SELECT * FROM crosstab(
    ' //the query above here ',
    $$VALUES ('MATHEMATICS'::text), ('marks')$$
) AS ct (student_name text, subject_name character varying, marks numeric);
If I run it as shown above, this is what I end up with:
|student_name|subject_name|marks|
|------------|------------|-----|
|John Doe    | 65         |     |
|Mary Jane   | 74         |     |
|------------|------------|-----|
That is, the column header says subject_name, not ENGLISH or MATHEMATICS. Obviously I now see that I don't need the marks column, but how can I get it to pull in all the subject names as the column names? There could be two of them, or there could be 12.

Solved it, but I would have preferred a much more dynamic solution. I replaced this:
$$VALUES ('MATHEMATICS'::text), ('marks')$$
with this:
'SELECT subject_name FROM app.subjects WHERE ... ORDER BY ...'
The downside to my solution is that the last part changes to
(student_name text, english bigint, mathematics bigint, physics bigint, biology bigint, chemistry bigint, history bigint, ...);
That is, I have to list all the subjects manually, in exactly the order they are returned by the SELECT above. I don't find this very convenient, but it works.
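One way to avoid maintaining that column list by hand is to generate the whole crosstab statement from app.subjects and execute the generated text in a second step; crosstab itself can never return a truly dynamic row type, because PostgreSQL must know a function's result columns at plan time. A minimal sketch, assuming the app.subjects table with a sort_order column from above (the inner marks query is abbreviated and has to be pasted in):
SELECT format(
    'SELECT * FROM crosstab(%L, %L) AS ct (student_name text, %s)',
    'SELECT student_name, subject_name, sum(marks) AS marks FROM ... GROUP BY 1, 2 ORDER BY 1',  -- the query above
    'SELECT subject_name FROM app.subjects ORDER BY sort_order',
    string_agg(format('%I bigint', lower(subject_name)), ', ' ORDER BY sort_order)
)
FROM app.subjects;
Running the statement this produces yields the pivoted result without a hand-written column definition list; the two-step dance (generate, then execute) is the unavoidable price of dynamic columns.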

Related

tsql - How to convert multiple rows and columns into one row

| id | acct_num | name    | orderdt  |
|----|----------|---------|----------|
| 1  | 1006A    | Joe Doe | 1/1/2021 |
| 2  | 1006A    | Joe Doe | 1/5/2021 |
EXPECTED OUTPUT
| id | acct_num | name    | orderdt  | id1 | acct_num1 | NAME1   | orderdt1 |
|----|----------|---------|----------|-----|-----------|---------|----------|
| 1  | 1006A    | Joe Doe | 1/1/2021 | 2   | 1006A     | Joe Doe | 1/5/2021 |
My query is the following:
Select id,
acct_num,
name,
orderdt
from order_tbl
where acct_num = '1006A'
and orderdt >= '1/1/2021'
If you always have one or two rows you could do it like this (I'm assuming the latest version of SQL Server because you said TSQL):
NOTE: If you have a known max (e.g. 4) this solution can be converted to support any number by changing the modulus and adding more columns and another join.
WITH order_table_numbered as
(
    SELECT ID, ACCT_NUM, NAME, ORDERDT,
           ROW_NUMBER() OVER (PARTITION BY ACCT_NUM ORDER BY ORDERDT) as RN
    FROM order_tbl
)
SELECT first.id as id, first.acct_num as acct_num, first.name as name, first.orderdt as orderdt,
       second.id as id1, second.acct_num as acct_num1, second.name as name1, second.orderdt as orderdt1
FROM order_table_numbered first
LEFT JOIN order_table_numbered second ON first.ACCT_NUM = second.ACCT_NUM and (second.RN % 2 = 0)
WHERE first.RN % 2 = 1
If you have an unknown number of rows I think you should solve this on the client OR convert the groups to XML -- the XML support in SQL Server is not bad.
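For illustration, here is a minimal sketch of that XML idea: it collapses each account's order dates into one delimited column rather than producing separate columns (the table and column names are carried over from the question; the varchar conversion style is an assumption):
SELECT o.acct_num,
       STUFF((SELECT ', ' + CONVERT(varchar(10), o2.orderdt, 101)
              FROM order_tbl o2
              WHERE o2.acct_num = o.acct_num
              ORDER BY o2.orderdt
              FOR XML PATH('')), 1, 2, '') AS all_order_dates
FROM order_tbl o
GROUP BY o.acct_num;
On SQL Server 2017+ the same effect is a one-liner with STRING_AGG(CONVERT(varchar(10), orderdt, 101), ', ').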

Recursive CTE PostgreSQL Connecting Multiple IDs with Additional Logic for Other Fields

Within my PostgreSQL database, I have an id column that shows each unique lead that comes in. I also have a connected_lead_id column which shows whether accounts are related to each other (i.e. husband and wife, parents and children, groups of friends, groups of investors, etc.).
When we count the number of ids created during a time period, we want to see the number of unique "groups" of connected_ids during a period. In other words, we wouldn't want to count both the husband and wife pair, we would only want to count one since they are truly one lead.
We want to be able to create a view that only has the "first" id based on the "created_at" date and then contains additional columns at the end for "connected_lead_id_1", "connected_lead_id_2", "connected_lead_id_3", etc.
We want to add in additional logic so that we take the "first" id's source, unless that is null, then take the "second" connected_lead_id's source unless that is null and so on. Finally, we want to take the earliest on_boarded_date from the connected_lead_id group.
| id  | created_at    | connected_lead_id | on_boarded_date | source   |
|-----|---------------|-------------------|-----------------|----------|
| 2   | 9/24/15 23:00 | 8                 |                 |          |
| 4   | 9/25/15 23:00 | 7                 |                 | event    |
| 7   | 9/26/15 23:00 | 4                 |                 |          |
| 8   | 9/26/15 23:00 | 2                 |                 | referral |
| 11  | 9/26/15 23:00 | 336               | 7/1/17          | online   |
| 142 | 4/27/16 23:00 | 336               |                 |          |
| 336 | 7/4/16 23:00  | 11                | 9/20/18         | referral |
End Goal:
| id | created_at    | on_boarded_date | source   |
|----|---------------|-----------------|----------|
| 2  | 9/24/15 23:00 |                 | referral |
| 4  | 9/25/15 23:00 |                 | event    |
| 11 | 9/26/15 23:00 | 7/1/17          | online   |
Ideally, we would also have a variable number of extra columns at the end to show each connected_lead_id that is attached to the base id.
Thanks for the help!
OK, the best I can come up with at the moment is to first build maximal groups of related IDs, and then join back to your table of leads to get the rest of the data (see this SQL Fiddle for the setup, full queries and results).
To get the maximal groups you can use a recursive common table expression to first grow the groups, followed by a query to filter the CTE results down to just the maximal groups:
with recursive cte(grp) as (
    select case when l.connected_lead_id is null then array[l.id]
                else array[l.id, l.connected_lead_id]
           end
    from leads l
    union all
    select grp || l.id
    from leads l
    join cte
      on l.connected_lead_id = any(grp)
     and not l.id = any(grp)
)
select * from cte c1
The CTE above outputs several similar groups as well as intermediary groups. The predicate below prunes out the non-maximal groups and limits the results to just one permutation of each possible group (&& tests for array overlap, @> for "contains", <@ for "is contained by"):
where not exists (select 1 from cte c2
                  where c1.grp && c2.grp
                    and ((not c1.grp @> c2.grp)
                         or (c2.grp < c1.grp
                             and c1.grp @> c2.grp
                             and c1.grp <@ c2.grp)));
Results:
| grp        |
|------------|
| 2,8        |
| 4,7        |
| 14         |
| 11,336,142 |
| 12,13      |
Next join the final query above back to your leads table and use window functions to get the remaining column values, along with the distinct operator to prune it down to the final result set:
with recursive cte(grp) as (
...
)
select distinct
first_value(l.id) over (partition by grp order by l.created_at) id
, first_value(l.created_at) over (partition by grp order by l.created_at) create_at
, first_value(l.on_boarded_date) over (partition by grp order by l.created_at) on_boarded_date
, first_value(l.source) over (partition by grp
order by case when l.source is null then 2 else 1 end
, l.created_at) source
, grp CONNECTED_IDS
from cte c1
join leads l
on l.id = any(grp)
where not exists (select 1 from cte c2
                  where c1.grp && c2.grp
                    and ((not c1.grp @> c2.grp)
                         or (c2.grp < c1.grp
                             and c1.grp @> c2.grp
                             and c1.grp <@ c2.grp)));
Results:
| id | create_at | on_boarded_date | source | connected_ids |
|----|----------------------|-----------------|----------|---------------|
| 2 | 2015-09-24T23:00:00Z | (null) | referral | 2,8 |
| 4 | 2015-09-25T23:00:00Z | (null) | event | 4,7 |
| 11 | 2015-09-26T23:00:00Z | 2017-07-01 | online | 11,336,142 |
| 12 | 2015-09-26T23:00:00Z | 2017-07-01 | event | 12,13 |
| 14 | 2015-09-26T23:00:00Z | (null) | (null) | 14 |
demo:db<>fiddle
Main idea - sketch:
Loop through the ordered set. Get all ids that haven't been seen before in any connected_lead_id (cli). These are your starting points for the recursion.
The problem is id 142, which hasn't been seen before but belongs to the same group as 11 because of its cli. So it is better to collect the clis of the unseen ids instead. With these values it's much simpler to calculate the ids of the groups later, in the recursion part. Because of the loop, a function/stored procedure is necessary.
The recursion part: the first step is to get the ids of the starting clis, calculating the first referring id by using the created_at timestamp. After that, a simple tree recursion over the clis can be done.
1. The function:
CREATE OR REPLACE FUNCTION filter_groups() RETURNS int[] AS $$
DECLARE
    _seen_values int[];
    _new_values int[];
    _temprow record;
BEGIN
    FOR _temprow IN
        -- 1:
        SELECT array_agg(id ORDER BY created_at) as ids, connected_lead_id
        FROM groups
        GROUP BY connected_lead_id
        ORDER BY MIN(created_at)
    LOOP
        -- 2:
        IF array_length(_seen_values, 1) IS NULL
           OR (_temprow.ids || _temprow.connected_lead_id) && _seen_values = FALSE THEN
            _new_values := _new_values || _temprow.connected_lead_id;
        END IF;
        _seen_values := _seen_values || _temprow.ids;
        _seen_values := _seen_values || _temprow.connected_lead_id;
    END LOOP;
    RETURN _new_values;
END;
$$ LANGUAGE plpgsql;
1. Group all ids that refer to the same cli.
2. Loop through the id arrays. If no element of the array has been seen before, add the referred cli to the output variable (_new_values). In both cases, add the ids and the cli to the variable that stores all ids seen so far (_seen_values).
3. Return the clis.
The result so far is {8, 7, 336} (which is equivalent to the ids {2,4,11,142}!)
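For example, calling the function directly (a sketch, assuming the groups table used throughout this answer):
SELECT filter_groups();
-- {8,7,336}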
2. The recursion:
-- 1:
WITH RECURSIVE start_points AS (
SELECT unnest(filter_groups()) as ids
),
filtered_groups AS (
-- 3:
SELECT DISTINCT
1 as depth, -- 3
first_value(id) OVER w as id, -- 4
ARRAY[(MIN(id) OVER w)] as visited, -- 5
MIN(created_at) OVER w as created_at,
connected_lead_id,
MIN(on_boarded_date) OVER w as on_boarded_date, -- 6
first_value(source) OVER w as source
FROM groups
WHERE connected_lead_id IN (SELECT ids FROM start_points)
-- 2:
WINDOW w AS (PARTITION BY connected_lead_id ORDER BY created_at)
UNION
SELECT
fg.depth + 1,
fg.id,
array_append(fg.visited, g.id), -- 8
LEAST(fg.created_at, g.created_at),
g.connected_lead_id,
LEAST(fg.on_boarded_date, g.on_boarded_date), -- 9
COALESCE(fg.source, g.source) -- 10
FROM groups g
JOIN filtered_groups fg
-- 7
ON fg.connected_lead_id = g.id AND NOT (g.id = ANY(visited))
)
SELECT DISTINCT ON (id) -- 11
id, created_at,on_boarded_date, source
FROM filtered_groups
ORDER BY id, depth DESC;
1. The WITH part gives out the results of the function; unnest() expands the id array into one row per id.
2. Creating a window: the window function groups all values by their clis and orders the window by the created_at timestamp. In your example all values are in their own window except 11 and 142, which are grouped together.
3. This is a helper variable used to get the latest rows later on.
4. first_value() gives the first value of the ordered window frame. If 142 had a smaller created_at timestamp, the result would have been 142; but it is 11 nevertheless.
5. A variable is needed to record which ids have been visited already. Without this information an infinite loop would be created: 2-8-2-8-2-8-...
6. The minimum date of the window is taken (same thing here: if 142 had a smaller date than 11, that would be the result).
This calculates the starting query of the recursion. The following describes the recursion part:
7. Joining the table (the original function results) against the previous recursion result. The second condition is the stop for the infinite loop mentioned above.
8. Appending the currently visited id to the visited variable.
9. If the current on_boarded_date is earlier, it is taken.
10. COALESCE gives the first NOT NULL value, so the first NOT NULL source is saved throughout the whole recursion.
11. After the recursion, which yields a result for every recursion step, we want to keep only the deepest visit for every starting id. DISTINCT ON (id) gives out the row with the first occurrence of each id; to get the deepest one, the whole set is ordered descending by the depth variable.

Is there a way to see details for groups in a query?

I need to make a report in SSRS that will output data in this format:
Person | DocumentID | Data1 | Data2 | .....
----------------------------------------------
Mr. Smith | | | |
| #123021312 | 01 | 04 | .....
| #132145681 | 07 | 00 | .....
Mr. Black | | | |
| #912205112 | 11 | 08 | .....
| #131135810 | 03 | 05 | .....
..............................................
So, there is a kind of hierarchy to the query. There are detail records (data about documents) and group records (persons). If I did just a GROUP BY, I would only be able to see the group records and display some aggregate information, like the max of Data1 or the count of DocumentID. Instead, I want to be able to see both aggregate and detail rows.
I tried googling and couldn't find any information about whether this is possible in T-SQL (or SSRS, for that matter). Is it?
Yes, it is possible...
Flat Data
DECLARE @T TABLE (Person VARCHAR(25), DocumentID VARCHAR(25), Data1 VARCHAR(25), Data2 VARCHAR(25))
INSERT INTO @T (Person, DocumentID, Data1, Data2) VALUES
('Mr. Smith','#12345678A','01','04'),
('Mr. Smith','#98765432A','02','05'),
('Mr. Black','#12345678B','03','06'),
('Mr. Black','#98765432B','04','07')
SELECT *
FROM @T
Tablix Setup Steps
1. On your tablix that contains each of the fields in SSRS, highlight the data row.
2. Right-click on the now-visible row header with the 3 lines.
3. Select Add Group > Parent Group.
4. In the "Group by" drop-down, select Person, then click OK.
The report will now be grouped by the Person column.
Bonus: if you don't want the Person column showing to the right of the grouping, simply delete the column.

Dynamic "INSERT INTO" Query with User Defined Type

I have a SQL table where each row holds a single value of a kind of virtual table. That means the real existing SQL table looks like this:
-----------------------------------------
|DataRecordset | DataField | DataValue |
-----------------------------------------
| 1 | Firstname | John |
| 1 | Lastname | Smith |
| 1 | Birthday | 18.12.1963 |
| 2 | Firstname | Jane |
| 2 | Lastname | Smith |
| 2 | Birthday | 14.06.1975 |
-----------------------------------------
and I need to get something that looks like this:
-------------------------------------
| Firstname | Lastname | Birthday |
-------------------------------------
| John | Smith | 18.12.1963 |
| Jane | Smith | 14.06.1975 |
-------------------------------------
The reason the real existing SQL table is stored like the first one is that there is a lot more information around the core data: who wrote the data, when the data was written, from which time to which time the data was significant, and so on. So there are a lot of different variables that decide which lines from the first table I use to generate the second one.
I created a user-defined table type on the SQL Server that looks like the second table.
Then I started writing a procedure...
DECLARE @secondTable secondTable_Typ
DECLARE firstTable_Cursor CURSOR FOR SELECT DataRecordset, ... WHERE ...lot of Text
OPEN firstTable_Cursor
FETCH NEXT FROM firstTable_Cursor
INTO @DataRecordset, @...
WHILE @@FETCH_STATUS = 0
BEGIN
IF NOT EXISTS(SELECT * FROM @secondTable WHERE DataRecordset = @DataRecordset)
BEGIN
The problem I have: now I need some kind of dynamic query, because I want to do something like this:
INSERT INTO @secondTable (DataRecordset, @DataField) VALUES (@DataRecordset, @DataValue)
but I can't use the variable @DataField like this... so I used Google and found the function sp_executesql. I wrote the following code:
SET @sqlString = 'INSERT INTO @xsecondTable (DataRecordset, ' + @DataField + ') VALUES (@xDataRecordset, @xDataValue)'
EXEC sp_executesql @sqlString, N'@xsecondTable secondTable_Typ, @xDataRecordset smallint, @xDataValue sql_variant', @secondTable, @DataRecordset, @DataValue
But when I run the procedure I get an error telling me I have to add the "READONLY" parameter to "@xsecondTable"...
I think the problem is that sp_executesql can use variables as input or as output... but I am not sure if it is possible to get this user-defined table type into this procedure at all...
Does someone have an idea how to get this code to run?
Thank you very much
Have you considered doing a PIVOT on the data? Something along the lines of:
SELECT
[Firstname]
, [Lastname]
, [Birthday]
FROM
(
SELECT
[DataRecordset]
, [DataField]
, [DataValue]
FROM [Table]
) DATA
PIVOT
(
MIN ([DataValue]) FOR [DataField] IN
(
[Firstname]
, [Lastname]
, [Birthday]
)
) PVT
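As for the error itself: a table-valued parameter passed to sp_executesql must be declared READONLY, so it can never be the target of an INSERT inside the dynamic SQL; pivoting as above (or writing into a temp table) sidesteps that. And if the set of DataField values is not fixed, the PIVOT column list can be built dynamically. A sketch, assuming SQL Server 2017+ for STRING_AGG (use FOR XML PATH on older versions):
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Collect the distinct field names as a bracketed, comma-separated list.
SELECT @cols = STRING_AGG(QUOTENAME(DataField), ', ')
FROM (SELECT DISTINCT DataField FROM [Table]) d;

SET @sql = N'SELECT ' + @cols + N'
FROM (SELECT DataRecordset, DataField, DataValue FROM [Table]) DATA
PIVOT (MIN(DataValue) FOR DataField IN (' + @cols + N')) PVT;';

EXEC sp_executesql @sql;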

Adding missing dates in a table in PostgreSQL

I have a table that contains data for every day in 2002, but it has some missing dates - namely, 354 records for 2002 instead of 365. For my calculations, I need to have the missing dates in the table with NULL values:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
You can see that 2002-05-08 is missing. I want my final table to look like this:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as an identifier, but that doesn't make it a good idea. I use thedate as the column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as @Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what @a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like @Clodoaldo demonstrated - but his query suffers from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col).
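For example, that variant would look like this (a sketch; the ::int cast is an assumption, to keep the id column integer-typed):
with all_dates as (
  select date '2002-01-01' + i as date_col
  from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select extract(doy from ad.date_col)::int as id,
       t.rainfall,
       ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;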
To fill the gaps (this will not reorder the IDs):
insert into t (rainfall, "date")
select null, d::date as "date"
from generate_series(
    (select date_trunc('year', min("date")) from t)::timestamp,
    (select max("date") from t),
    '1 day'
) s(d)
left join t on t."date" = s.d::date
where t."date" is null
You have to fully re-create your table, as indexes have to change.
A better way to do it is to use your preferred DBI language: make a loop ignoring the ID and putting the values into a new table with new serialized IDs.
for day in (whole needed calendar)
    value = select rainfall from oldbrokentable where date = day
    insert into newcleanedtable date=day, rainfall=value, id=serialized
(That's not real code! Just conceptual, to be adapted to your preferred scripting language. See the PL/pgSQL sketch below.)
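For what it's worth, a minimal PL/pgSQL rendering of that conceptual loop (the table names oldbrokentable and newcleanedtable come from the pseudocode above; the 2002 date range and a serial id column on newcleanedtable are assumptions):
DO $$
DECLARE
    day date;
BEGIN
    FOR day IN
        SELECT generate_series(date '2002-01-01', date '2002-12-31', '1 day')::date
    LOOP
        -- Copy the existing row for this day, or insert a NULL-rainfall row if it is missing.
        INSERT INTO newcleanedtable ("date", rainfall)
        SELECT day, rainfall FROM oldbrokentable WHERE "date" = day
        UNION ALL
        SELECT day, NULL::numeric
        WHERE NOT EXISTS (SELECT 1 FROM oldbrokentable WHERE "date" = day);
    END LOOP;
END $$;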