How would you create a group identifier based on one column, but sorted by another? - tsql

I am attempting to create column Group via T-SQL.
If the same account appears in consecutive rows, consider that cluster one group. If the account is seen again lower in the list (cluster or not), consider it a new group. This seems straightforward, but I cannot see the solution. Below there are three clusters of account 3456, each with a different group number (groups 1, 4, and 6).
+-------+---------+------+
| Group | Account | Sort |
+-------+---------+------+
| 1 | 3456 | 1 |
| 1 | 3456 | 2 |
| 2 | 9878 | 3 |
| 3 | 5679 | 4 |
| 4 | 3456 | 5 |
| 4 | 3456 | 6 |
| 4 | 3456 | 7 |
| 5 | 1295 | 8 |
| 6 | 3456 | 9 |
+-------+---------+------+
UPDATE: I left this out of the original requirements, but a cluster of accounts could have more than two accounts. I updated the example data to include this scenario.

Here's how I'd do it:
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(3456,8);
--Solution
SELECT [Group] = DENSE_RANK() OVER (ORDER BY grouper.groupID), grouper.Account, grouper.Sort
FROM
(
SELECT t.*, groupID = ROW_NUMBER() OVER (ORDER BY t.sort) +
CASE t.Account WHEN LEAD(t.Account,1) OVER (ORDER BY t.sort) THEN 1 ELSE 0 END
FROM @table AS t
) AS grouper;
Results:
Group Account Sort
------- ----------- -----------
1 3456 1
1 3456 2
2 9878 3
3 5679 4
4 3456 5
4 3456 6
5 1295 7
6 3456 8
Update based on OP's comment below (20190508)
I spent a couple of days banging my head on how to handle groups of three or more. It was surprisingly difficult, but what I came up with handles bigger clusters and is way better than my first answer. I updated the sample data to include bigger clusters.
Note that I include a UNIQUE constraint on the sort column; this creates a unique index. You don't need the constraint for this solution to work, but having an index on that column (clustered, nonclustered unique, or just nonclustered) will improve performance dramatically.
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT UNIQUE);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(1295,8),(1295,9),(1295,10),(3456,11);
-- Better solution
WITH Groups AS
(
SELECT t.*, Grouper =
CASE t.Account WHEN LAG(t.Account,1,t.Account) OVER (ORDER BY t.Sort) THEN 0 ELSE 1 END
FROM @table AS t
)
SELECT [Group] = SUM(sg.Grouper) OVER (ORDER BY sg.Sort)+1, sg.Account, sg.Sort
FROM Groups AS sg;
Results:
Group Account Sort
----------- ----------- -----------
1 3456 1
1 3456 2
2 9878 3
3 5679 4
4 3456 5
4 3456 6
5 1295 7
5 1295 8
5 1295 9
5 1295 10
6 3456 11
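To make the logic of the second query easier to follow, here is a minimal Python sketch of the same two steps: flag each row whose account differs from the previous row (the LAG comparison), then take a running total of the flags (the windowed SUM). The function name `assign_groups` is made up for illustration and is not part of the answer; the data is the sample data from above.

```python
from itertools import accumulate

def assign_groups(rows):
    """rows: (account, sort) tuples. Returns (group, account, sort) tuples."""
    ordered = sorted(rows, key=lambda r: r[1])  # ORDER BY Sort
    # Grouper flag: 0 when the account equals the previous row's account, else 1.
    # The first row is compared with itself (the LAG default), so its flag is 0.
    flags = [
        0 if i == 0 or ordered[i][0] == ordered[i - 1][0] else 1
        for i in range(len(ordered))
    ]
    # Running total of the flags, plus 1 so that numbering starts at 1.
    running = accumulate(flags)
    return [(total + 1, acct, sort) for total, (acct, sort) in zip(running, ordered)]

sample = [(3456, 1), (3456, 2), (9878, 3), (5679, 4), (3456, 5), (3456, 6),
          (1295, 7), (1295, 8), (1295, 9), (1295, 10), (3456, 11)]
result = assign_groups(sample)
for row in result:
    print(row)
```

Because the group number only increases when the account changes, a cluster of any length keeps a single number, which is what the first query could not do.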

Related

PostgreSQL - Setting null values to missing rows in a join statement

SQL newbie here. I'm trying to write a query that generates a scoring table, setting null to a student's grades in a module for which they haven't yet taken their exams (on PostgreSQL).
So I start with tables that look something like this:
student_evaluation:
|student_id| module_id | course_id |grade |
|----------|-----------|-----------|-------|
| 1 | 1 | 1 |3 |
| 1 | 1 | 1 |7 |
| 1 | 2 | 1 |8 |
| 2 | 4 | 2 |9 |
course_module:
| module_id | course_id |
| ---------- | --------- |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
In our use case, a course is made up of several modules. Each module has a single exam, but a student who failed his exam may have a couple of retries. The same module may also be present in different courses, but an exam attempt only counts for one instance of the module (i.e., student A passed module 1's exam on course 1; if course 2 also has module 1, student A has to retake the same exam for course 2 if he also has access to that course).
So the output should look like this:
|student_id| module_id | course_id |grade |
|----------|-----------|-----------|-------|
| 1 | 1 | 1 |3 |
| 1 | 1 | 1 |7 |
| 1 | 2 | 1 |8 |
| 1 | 3 | 1 |null |
| 2 | 4 | 2 |9 |
I feel like this should have been a simple task, but I think I have a very flawed understanding of how outer and cross joins work. I have tried stuff like:
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
RIGHT OUTER JOIN course_module ON course_module.course_id = se.course_id
AND course_module.module_id = se.module_id
or
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
CROSS JOIN course_module WHERE course_module.course_id = se.course_id
Neither worked. These all feel wrong, but I'm lost as to what would be the proper way to go about this.
Thank you in advance.
I think you need both join types: first use a cross join to build a list of all combinations of students and course modules, then use an outer join to add the grades.
SELECT sc.student_id,
sc.module_id,
sc.course_id,
se.grade
FROM student_evaluation se
RIGHT JOIN (SELECT s.student_id,
c.module_id,
c.course_id
FROM (SELECT DISTINCT student_id
FROM student_evaluation) AS s
CROSS JOIN course_module AS c) AS sc
USING (student_id, module_id, course_id);
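As a rough illustration of the cross-join-then-outer-join idea, here is a small Python sketch over the question's sample data. It assumes the grades are matched on all three key columns (student, module, course); the function name `scoring_table` is hypothetical.

```python
from itertools import product

# (student_id, module_id, course_id, grade)
student_evaluation = [(1, 1, 1, 3), (1, 1, 1, 7), (1, 2, 1, 8), (2, 4, 2, 9)]
# (module_id, course_id)
course_module = [(1, 1), (2, 1), (3, 1), (4, 2)]

def scoring_table(evaluations, modules):
    # SELECT DISTINCT student_id FROM student_evaluation
    students = sorted({e[0] for e in evaluations})
    # Index the grades by the full key so retries stay as separate rows.
    grades = {}
    for student_id, module_id, course_id, grade in evaluations:
        grades.setdefault((student_id, module_id, course_id), []).append(grade)
    rows = []
    # CROSS JOIN: every student paired with every course module.
    for student_id, (module_id, course_id) in product(students, modules):
        key = (student_id, module_id, course_id)
        # Outer join: keep the pair even without a grade, using None as NULL.
        for grade in grades.get(key, [None]):
            rows.append((student_id, module_id, course_id, grade))
    return rows

rows = scoring_table(student_evaluation, course_module)
for row in rows:
    print(row)
```

Note that this pairs every student with every module of every course, so it also yields null rows for courses in which a student has no grades at all. If that is not wanted, the list of combinations would have to be restricted to the courses each student is enrolled in, which the sample tables do not record.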

How to get the change number?

How to increase value when source value is changed?
I have tried rank, dense_rank, row_number without success =(
id | src | how to get this?
--------
1 | 1 | 1
2 | 1 | 1
3 | 7 | 2
4 | 1 | 3
5 | 3 | 4
6 | 3 | 4
7 | 1 | 5
NOTE: src is guaranteed to be in the order you see.
Is there a simple way to do this?
You can achieve this by nesting two window functions - the first to get whether the src value changed from the previous row, the second to sum the number of changes. Unfortunately Postgres doesn't allow nesting window functions directly, but you can work around that with a subquery:
SELECT
id,
src,
sum(incr) OVER (ORDER BY id)
FROM (
SELECT
*,
(lag(src) OVER (ORDER BY id) IS DISTINCT FROM src)::int AS incr
FROM example
) AS _;
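For readers who want to check the expected numbers outside the database, the same "count the changes" idea can be sketched in Python. This is only an illustration, not part of the answer: it uses `itertools.groupby`, which starts a new run whenever the key differs from the previous item, and the name `change_numbers` is made up.

```python
from itertools import groupby

def change_numbers(rows):
    """rows: (id, src) tuples. Returns (id, src, change_number) tuples."""
    ordered = sorted(rows, key=lambda r: r[0])  # ORDER BY id
    out = []
    # Each run of consecutive equal src values gets the next number, from 1.
    for number, (_, run) in enumerate(groupby(ordered, key=lambda r: r[1]), start=1):
        out.extend((id_, src, number) for id_, src in run)
    return out

example = [(1, 1), (2, 1), (3, 7), (4, 1), (5, 3), (6, 3), (7, 1)]
numbered = change_numbers(example)
for row in numbered:
    print(row)
```

The numbering starts at 1 here for the same reason it does in the query: the first row has no predecessor, so it always counts as a change.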

PostgreSQL for each row, generate new rows and merge

I have a table called example that looks as follows:
ID | MIN | MAX |
1 | 1 | 5 |
2 | 34 | 38 |
I need to take each ID and loop from its min to its max, incrementing by 2, to get the following WITHOUT using INSERT statements, i.e. in a single SELECT:
ID | INDEX | VALUE
1 | 1 | 1
1 | 2 | 3
1 | 3 | 5
2 | 1 | 34
2 | 2 | 36
2 | 3 | 38
Any ideas of how to do this?
The set-returning function generate_series does exactly that:
SELECT
id,
generate_series(1, (max-min)/2+1) AS index,
generate_series(min, max, 2) AS value
FROM
example;
The index can alternatively be generated with RANK() (see also @a_horse_with_no_name's answer) if you don't want to rely on the parallel sets.
Use generate_series() to generate the numbers and a window function to calculate the index:
select e.id,
row_number() over (partition by e.id order by g.value) as index,
g.value
from example e
cross join generate_series(e.min, e.max, 2) as g(value);
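As a quick way to reason about what both queries produce, here is a Python sketch of the same expansion. It relies on the fact that `generate_series(min, max, 2)` includes the upper bound when the step lands on it, which corresponds to `range(lo, hi + 1, 2)`; the function name `expand` is hypothetical.

```python
def expand(rows, step=2):
    """rows: (id, min, max) tuples. Returns (id, index, value) tuples."""
    out = []
    for id_, lo, hi in rows:
        # range() excludes its stop value, so add 1 to include max itself.
        for index, value in enumerate(range(lo, hi + 1, step), start=1):
            out.append((id_, index, value))
    return out

example = [(1, 1, 5), (2, 34, 38)]
expanded = expand(example)
for row in expanded:
    print(row)
```

Numbering with `enumerate` per id plays the role of `row_number() over (partition by e.id order by g.value)` in the second query.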

PostgreSQL WITH RECURSIVE query to get ordered parent-child chain by a Partition Key

I am having trouble writing a SQL script on PostgreSQL 9.6.6 that orders the steps in a process using the steps' parent-child IDs, grouped/partitioned per process ID. I couldn't find this special case here, so I apologize if I missed it; if so, please link the solution in the comments.
The case: I have a table which looks like this:
processID | stepID | parentID
1 1 NULL
1 3 5
1 2 4
1 4 3
1 5 1
2 1 NULL
2 3 5
2 2 4
2 4 3
2 5 1
Now I have to order the steps by starting with the step where parentID is NULL for each processID .
Note: I cannot simply order by stepID or parentID, because new steps I insert into the process get a higher stepID than the last step in the process (a continuously generated surrogate key).
I have to order the steps for every processID so that I get the following output:
processID | stepID | parentID
1 1 NULL
1 5 1
1 3 5
1 4 3
1 2 4
2 1 NULL
2 5 1
2 3 5
2 4 3
2 2 4
I tried to do this with the CTE function WITH RECURSIVE:
WITH RECURSIVE
starting (processID,stepID, parentID) AS
(
SELECT b.processID,b.stepID, b.parentID
FROM process b
WHERE b.parentID ISNULL
),
descendants (processID,stepID, parentID) AS
(
SELECT b.processID,b.stepID, b.parentID
FROM starting b
UNION ALL
SELECT b.processID,b.stepID, b.parentID
FROM process b
JOIN descendants AS c ON b.parentID = c.stepID
)
SELECT * FROM descendants
The result is not what I am looking for. As we have hundreds of processes, I receive a list where the first records are the different processIDs which have a NULL value as parentID.
I guess I have to recurse over the processID as well, but I have no idea how.
Thank you for your help!
You should calculate the level of each step:
with recursive starting as (
select processid, stepid, parentid, 0 as level
from process
where parentid is null
union all
select p.processid, p.stepid, p.parentid, s.level + 1
from starting s
join process p on s.stepid = p.parentid and s.processid = p.processid
)
select *
from starting
order by processid, level
processid | stepid | parentid | level
-----------+--------+----------+-------
1 | 1 | | 0
1 | 5 | 1 | 1
1 | 3 | 5 | 2
1 | 4 | 3 | 3
1 | 2 | 4 | 4
2 | 1 | | 0
2 | 5 | 1 | 1
2 | 3 | 5 | 2
2 | 4 | 3 | 3
2 | 2 | 4 | 4
(10 rows)
Of course, you can skip the last column in the final select if you do not need it.
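The recursion may be easier to follow as an explicit loop. The following Python sketch mirrors the query under the assumption that each process has one root step and no cycles: it starts from the steps without a parent (level 0), repeatedly joins children to the current frontier within the same process, and finally sorts by process and level. The function name `order_steps` is made up for illustration.

```python
def order_steps(rows):
    """rows: (process_id, step_id, parent_id). Returns rows with a level appended."""
    # Index the children by (process, parent) so the join stays within one process.
    children = {}
    for process_id, step_id, parent_id in rows:
        children.setdefault((process_id, parent_id), []).append(step_id)
    # Non-recursive term: the steps whose parent is NULL, at level 0.
    frontier = [(p, s, None, 0)
                for (p, parent), steps in children.items() if parent is None
                for s in steps]
    result = []
    # Recursive term: like the CTE, this would not terminate on cyclic data.
    while frontier:
        result.extend(frontier)
        frontier = [(p, child, s, level + 1)
                    for p, s, _, level in frontier
                    for child in children.get((p, s), [])]
    return sorted(result, key=lambda r: (r[0], r[3]))  # ORDER BY processid, level

steps = [(p, s, parent) for p in (1, 2)
         for s, parent in [(1, None), (3, 5), (2, 4), (4, 3), (5, 1)]]
ordered_steps = order_steps(steps)
for row in ordered_steps:
    print(row)
```

Matching on the process as well as the step is the part the original attempt was missing: without it, a step in one process would pick up children from every other process that reuses the same step IDs.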

How can I uniquely map a long string identifier to a numerical value in a single query (for bandwidth reasons)?

I have a PostgreSQL database (technically Greenplum) with data on individuals over time. The database has three fields: user_id, monthly_date, and account_value. When I run a query, I have to download the results from a remote server, so bandwidth is an issue. Since the user_id field is a very long string (around 50 characters), I'd like to return a numerical value that corresponds 1:1 with each value of user_id, since this will take up less space.
For example, the database might have sample data like this:
63a9364385350b13473279 Jan-2000
63a9364385350b13473279 Feb-2000
2066937e2887w206010393 Apr-2001
036686037e507d01764237 Mar-2003
036686037e507d01764237 Jun-2003
036686037e507d01764237 Jul-2003
036686037e507d01764237 Dec-2003
90829x098327549n286418 Apr-2004
90829x098327549n286418 Sep-2004
67518x834512306933u500 Nov-2000
and I'm trying to work out a query using ROW_NUMBER() and various window functions like PARTITION BY to get results like this:
1 Jan-2000
1 Feb-2000
2 Apr-2001
3 Mar-2003
3 Jun-2003
3 Jul-2003
3 Dec-2003
4 Apr-2004
4 Sep-2004
5 Nov-2000
I know these aren't actual database formats, but I'm just using them as example data. Is this possible? I don't care (although it would be nice and very neat to see) if, for example, 63a9364385350b13473279 maps to 1 in one query and 2 in the next, but in any given query, 63a9364385350b13473279 should always map to the same value regardless of date. The mapped numbers don't need to be in sequence or have any meaningful value besides being unique.
If you just need a unique number, this will do the trick:
SELECT
id,
split_part(t.d, '-', 2),
row_number() OVER all_window - row_number() OVER group_window AS a_unique_number_by_id
FROM (
VALUES
('63a9364385350b13473279','Jan-2000'),
('63a9364385350b13473279','Feb-2000'),
('2066937e2887w206010393','Apr-2001'),
('036686037e507d01764237','Mar-2003'),
('036686037e507d01764237','Jun-2003'),
('036686037e507d01764237','Jul-2003'),
('036686037e507d01764237','Dec-2003'),
('90829x098327549n286418','Apr-2004'),
('90829x098327549n286418','Sep-2004'),
('67518x834512306933u500','Nov-2000')
) as t(id, d)
WINDOW group_window AS (
PARTITION BY id
ORDER BY split_part(t.d, '-', 2)
), all_window AS (
ORDER BY split_part(t.d, '-', 2)
);
Here is the result:
id | split_part | a_unique_number_by_id
------------------------+------------+-----------------------
63a9364385350b13473279 | 2000 | 0
63a9364385350b13473279 | 2000 | 0
67518x834512306933u500 | 2000 | 2
2066937e2887w206010393 | 2001 | 3
036686037e507d01764237 | 2003 | 4
036686037e507d01764237 | 2003 | 4
036686037e507d01764237 | 2003 | 4
036686037e507d01764237 | 2003 | 4
90829x098327549n286418 | 2004 | 8
90829x098327549n286418 | 2004 | 8
(10 rows)
You should re-order it with another column to keep the original ordering.
I think you are looking for dense_rank().
create table sample_data
(userid varchar(50) not null,
monthly_date date not null)
distributed by (userid);
insert into sample_data (userid, monthly_date) values
('63a9364385350b13473279','2000-01-01'),
('63a9364385350b13473279','2000-02-01'),
('2066937e2887w206010393','2001-04-01'),
('036686037e507d01764237','2003-03-01'),
('036686037e507d01764237','2003-06-01'),
('036686037e507d01764237','2003-07-01'),
('036686037e507d01764237','2003-12-01'),
('90829x098327549n286418','2004-04-01'),
('90829x098327549n286418','2004-09-01'),
('67518x834512306933u500','2000-11-01');
select dense_rank() over(order by userid) as new_userid, userid, monthly_date
from sample_data
order by 2;
new_userid | userid | monthly_date
------------+------------------------+--------------
1 | 036686037e507d01764237 | 2003-06-01
1 | 036686037e507d01764237 | 2003-07-01
1 | 036686037e507d01764237 | 2003-12-01
1 | 036686037e507d01764237 | 2003-03-01
2 | 2066937e2887w206010393 | 2001-04-01
3 | 63a9364385350b13473279 | 2000-02-01
3 | 63a9364385350b13473279 | 2000-01-01
4 | 67518x834512306933u500 | 2000-11-01
5 | 90829x098327549n286418 | 2004-09-01
5 | 90829x098327549n286418 | 2004-04-01
(10 rows)
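To show what dense_rank() is doing here, a minimal Python sketch: sort the distinct IDs, number them from 1, and replace each ID by its number. The name `dense_rank_ids` is hypothetical, and plain Python string ordering is assumed to match the database collation, which holds for these digit-and-lowercase-letter IDs but is not guaranteed in general.

```python
def dense_rank_ids(rows):
    """rows: (user_id, date) tuples. Returns (new_user_id, date) tuples."""
    # Distinct ids in sorted order get consecutive numbers without gaps.
    mapping = {uid: n for n, uid in enumerate(sorted({r[0] for r in rows}), start=1)}
    return [(mapping[uid], date) for uid, date in rows]

sample_data = [
    ("63a9364385350b13473279", "2000-01-01"),
    ("63a9364385350b13473279", "2000-02-01"),
    ("2066937e2887w206010393", "2001-04-01"),
    ("036686037e507d01764237", "2003-03-01"),
    ("036686037e507d01764237", "2003-06-01"),
    ("036686037e507d01764237", "2003-07-01"),
    ("036686037e507d01764237", "2003-12-01"),
    ("90829x098327549n286418", "2004-04-01"),
    ("90829x098327549n286418", "2004-09-01"),
    ("67518x834512306933u500", "2000-11-01"),
]
mapped = dense_rank_ids(sample_data)
for row in mapped:
    print(row)
```

The number assigned to an ID depends only on its position among the distinct IDs in the result set, so it is stable within one query but can change when the query returns a different set of users.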
Try the below script
create table test_schema.source_data (id varchar(50), dt varchar(50));
insert into test_schema.source_data
values ('63a9364385350b13473279','Jan-2000'),
('63a9364385350b13473279','Feb-2000'),
('2066937e2887w206010393','Apr-2001'),
('036686037e507d01764237','Mar-2003'),
('036686037e507d01764237','Jun-2003'),
('036686037e507d01764237','Jul-2003'),
('036686037e507d01764237','Dec-2003'),
('90829x098327549n286418','Apr-2004'),
('90829x098327549n286418','Sep-2004'),
('67518x834512306933u500','Nov-2000');
create temporary table id_mapping
as
select t1.id, row_number() over(order by t1.id) rownum
from (
SELECT distinct id
FROM test_schema.source_data
) t1;
select t1.id, t1.dt, t2.rownum
from
test_schema.source_data t1
join id_mapping t2
on t1.id = t2.id;
And here is the result
id dt rownum
------------------------+------------+-----
036686037e507d01764237 Dec-2003 1
036686037e507d01764237 Jul-2003 1
036686037e507d01764237 Jun-2003 1
036686037e507d01764237 Mar-2003 1
2066937e2887w206010393 Apr-2001 2
63a9364385350b13473279 Feb-2000 3
63a9364385350b13473279 Jan-2000 3
67518x834512306933u500 Nov-2000 4
90829x098327549n286418 Sep-2004 5
90829x098327549n286418 Apr-2004 5