SAS: How to calculate date differences between rows in grouped data

In some instances, records in my dataset have been split into multiple rows, and I need to find these instances and recombine the rows. This requires calculating the difference between dates within a group and subgroup, and then effectively "merging the rows" if the difference between one row's End date and the next row's Start date is 1 day or less. My dataset looks like this:
ID | Start | End | Place |
1 | 01-01-2020 | 31-03-2020 | Street 1 |
1 | 01-04-2020 | 31-07-2020 | Street 1 |
1 | 01-08-2020 | 31-12-2020 | Street 1 |
1 | 01-01-2021 | 31-03-2021 | Street 2 |
2 | 01-01-2020 | 31-04-2020 | Street 1 |
2 | 31-04-2020 | 31-08-2020 | Street 1 |
3 | 01-01-2020 | 31-03-2020 | Street 1 |
And I would really like to output this:
ID | Start | End | Place |
1 | 01-01-2020 | 31-12-2020 | Street 1 |
1 | 01-01-2021 | 31-03-2021 | Street 2 |
2 | 01-01-2020 | 31-08-2020 | Street 1 |
3 | 01-01-2020 | 31-03-2020 | Street 1 |
So essentially, within the ID group and Place subgroup, if the difference between the previous End date and the new Start date is 1 day or less, I would like to "combine" the two rows so that the Start and End dates reflect the entire period at that particular place (for instance, the entire stay at Street 1 for ID 1).
I have tried creating multiple data steps that rely heavily on the LAG function, and that seems to deal reasonably well with ID 2, where only two rows need to be considered. However, for ID 1 (Street 1), where I effectively need to join three rows, I have not yet found a good solution. Any suggestions of functions that could be useful would be much appreciated!

You can use a multi-step approach.
Thanks to @whymath for spotting the date issues in the input data.
data stage1;
    set have;
    by id place notsorted;
    if first.place then
        group_number+1;
run;
proc sort data=stage1 out=stage2;
    by id place group_number start;
run;
data want;
    do until (last.place);
        set stage2;
        by id place group_number;
        if first.group_number then
            _start=start;
        if last.place then
            do;
                start=_start;
                output;
            end;
    end;
    drop _start group_number;
run;
id start end place
1 01-01-2020 31-12-2020 Street1
1 01-01-2021 31-03-2021 Street2
2 01-01-2020 31-08-2020 Street1
3 01-01-2020 31-03-2020 Street1

There are some errors in your data (April has only 30 days), so I corrected them to get your desired output:
Row 5: 31-04-2020 → 30-04-2020
Row 6: 31-04-2020 → 01-05-2020
data have;
infile cards dlm='|';
input id start : ddmmyy10. end : ddmmyy10. place$;
format start end ddmmyyd10.;
cards;
1 | 01-01-2020 | 31-03-2020 | street 1 |
1 | 01-04-2020 | 31-07-2020 | street 1 |
1 | 01-08-2020 | 31-12-2020 | street 1 |
1 | 01-01-2021 | 31-03-2021 | street 2 |
2 | 01-01-2020 | 30-04-2020 | street 1 |
2 | 01-05-2020 | 31-08-2020 | street 1 |
3 | 01-01-2020 | 31-03-2020 | street 1 |
;
run;
The difficulty in your question is updating and outputting a row based on the row that follows it. I would suggest the double SET technique.
proc sort data=have;
    by id place start;
run;
data want;
    set have;
    do i=1 to rec;
        set have(rename=(id=tmpid place=tmpplace start=tmpstart end=tmpend)) nobs=rec point=i;
        if id=tmpid and place=tmpplace then do;
            if start=tmpend+1 and _n_=i+1 then used=1;
            else if end=tmpstart-1 and _n_<i then end=tmpend;
        end;
    end;
    if used^=1;
    drop tmp: used;
run;
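For readers who want to check the merge rule independently of SAS, here is a minimal Python sketch of the logic the question describes: sort by ID, Place and Start, then extend the current period whenever the next Start is at most one day after the previous End. The function name `merge_rows` and the `max_gap_days` parameter are illustrative assumptions, not part of either answer, and the sample rows use the corrected dates from above.

```python
from datetime import date, timedelta

def merge_rows(rows, max_gap_days=1):
    """Merge rows of the same (id, place) whose gap is <= max_gap_days."""
    merged = []
    # Sort by id, place, start so rows to be combined are adjacent.
    for rid, start, end, place in sorted(rows, key=lambda r: (r[0], r[3], r[1])):
        last = merged[-1] if merged else None
        if (last is not None and last[0] == rid and last[3] == place
                and start - last[2] <= timedelta(days=max_gap_days)):
            # Extend the open period instead of starting a new one.
            last[2] = max(last[2], end)
        else:
            merged.append([rid, start, end, place])
    return [tuple(r) for r in merged]

rows = [
    (1, date(2020, 1, 1), date(2020, 3, 31), "Street 1"),
    (1, date(2020, 4, 1), date(2020, 7, 31), "Street 1"),
    (1, date(2020, 8, 1), date(2020, 12, 31), "Street 1"),
    (1, date(2021, 1, 1), date(2021, 3, 31), "Street 2"),
    (2, date(2020, 1, 1), date(2020, 4, 30), "Street 1"),
    (2, date(2020, 5, 1), date(2020, 8, 31), "Street 1"),
    (3, date(2020, 1, 1), date(2020, 3, 31), "Street 1"),
]
result = merge_rows(rows)
for row in result:
    print(row)
```

Because the running period is extended row by row, chains of three or more rows (as for ID 1 at Street 1) collapse into one without any special handling.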

Related

T-SQL Update Query to enter recurring data in groups of three

I have a table of data that is duplicated twice within the same table, making three sets.
Its "ReferenceID" is the primary key. I want to group the three rows that share a ReferenceID and put the three values "f2f", "NF2F" and "Travel" into the column called "Type", in any order, while ensuring that each ReferenceID gets each of those values only once.
For Example:
ReferenceID | Type
------------|-------
1 f2f
1 nf2f
1 Travel
2 f2f
2 nf2f
2 Travel
3 f2f
3 nf2f
3 Travel
and so on.
Is this possible?

You can do this with a row_number that you mod by the number of groups you have (in your case 3):
declare @t table(RefType varchar(10));
insert into @t values ('f2f'),('nf2f'),('Travel'),('f2f'),('nf2f'),('Travel'),('f2f'),('nf2f'),('Travel');
select (row_number() over (order by RefType) % 3) + 1 as ReferenceID
      ,RefType
from @t
order by ReferenceID
        ,RefType;
Output
+-------------+---------+
| ReferenceID | RefType |
+-------------+---------+
| 1 | f2f |
| 1 | nf2f |
| 1 | Travel |
| 2 | f2f |
| 2 | nf2f |
| 2 | Travel |
| 3 | f2f |
| 3 | nf2f |
| 3 | Travel |
+-------------+---------+
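To see why the modulo trick gives every ReferenceID exactly one of each type, here is a small Python sketch that emulates the row numbering. It assumes the same nine values as the query above. Python's default string ordering differs from a typical SQL Server collation, but the argument only relies on equal values being numbered consecutively.

```python
ref_types = ["f2f", "nf2f", "Travel"] * 3

# Emulate ROW_NUMBER() OVER (ORDER BY RefType): sort, then number from 1.
ordered = sorted(ref_types)

# Three equal values receive three consecutive row numbers, so taking
# the row number modulo 3 spreads them over three different IDs.
assigned = sorted(
    ((row_number % 3) + 1, ref_type)
    for row_number, ref_type in enumerate(ordered, start=1)
)

for reference_id, ref_type in assigned:
    print(reference_id, ref_type)
```

The same reasoning carries over to any number of groups: replace 3 with the group count, provided each value occurs exactly that many times.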

Postgresql. How to update column with range of integers from 0 to last row that satisfies WHERE criteria

I have the following sample table, called userz:
+----+---------------+----------+
| id | sort_position | type |
+----+---------------+----------+
| 1 | -5 | admin |
| 2 | -3 | customer |
| 3 | 1 | customer |
| 4 | 8 | employee |
| 5 | 200 | customer |
+----+---------------+----------+
With MySQL, if I want the sort_position of all rows of type customer to start from 0 and increase by 1 up to the last row that satisfies the WHERE criteria, I can do the following:
SET @i=-1;
UPDATE userz
SET sort_position=@i:=@i+1
WHERE type = "customer" ORDER BY sort_position;
and I get the expected result:
+----+---------------+----------+
| id | sort_position | type |
+----+---------------+----------+
| 1 | -5 | admin |
| 2 | 0 | customer |
| 3 | 1 | customer |
| 4 | 8 | employee |
| 5 | 2 | customer |
+----+---------------+----------+
As you can see, all customers are now assigned the correct sort_position of 0, 1, 2.
But since I'm working with PostgreSQL, I need to achieve the same there. What I have tried so far:
DO $$
DECLARE
i integer := -1;
BEGIN
UPDATE userz
SET sort_position=@i:=@i+1
WHERE type = "customer" ORDER BY sort_position;
END $$;
and I get errors around =@i:=@i+1. I tried different formatting that I found by searching, such as =i:=i+1, but still no luck.

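Try the SQL below. The row numbers are computed over all customer rows in a derived table and then joined back by id, so rows of other types are left untouched:
update userz k
set sort_position = src.rnum
from (select id,
             row_number() over (order by sort_position) - 1 as rnum
      from userz
      where type = 'customer') src
where k.id = src.id;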

SUM of two level group by in postgresql

I have three table as given below
student
id name stand_id sub_id gender
---------------------------------------
1 | Joe | 1 | 1 | M
2 | Saun | 2 | 1 | F
3 | Paul | 1 | 2 | F
4 | Sena | 2 | 2 | M
Subject
id name
1 Math
2 English
Standard
id name
1 First
2 Second
How can I achieve a multi-level GROUP BY like this: by standard, then by subject, and then the total number of boys and girls?
Should I use WITH, UNION or UNION ALL?
First
    Math
        boys total
        girls total
Second
    Math
        boys total
        girls total

It's not completely clear what you are attempting. My interpretation is that you are looking for the total of students by standard, subject and gender.
If that is correct, you need to join together the tables and count the students at the appropriate grain, like so:
SELECT
sta.name AS standard_name,
sub.name AS subject_name,
CASE stu.gender WHEN 'M' THEN 'Boys' ELSE 'Girls' END AS student_gender,
COUNT(stu.id) AS total
FROM
student stu
JOIN
subject sub
ON (stu.sub_id = sub.id)
JOIN
standard sta
ON (stu.stand_id = sta.id)
GROUP BY
standard_name,
subject_name,
student_gender;
Based on your sample data, it would return this:
standard_name | subject_name | student_gender | total
-----------------------------------------------------
First | Math | Boys | 1
First | English | Girls | 1
Second | Math | Girls | 1
Second | English | Boys | 1
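The same grain (standard, subject, gender) can be checked outside the database. Below is a minimal Python sketch using the sample rows from the question; the lookup dictionaries stand in for the two joins, and the variable names are illustrative only.

```python
from collections import Counter

# (id, name, stand_id, sub_id, gender), copied from the question
students = [
    (1, "Joe", 1, 1, "M"),
    (2, "Saun", 2, 1, "F"),
    (3, "Paul", 1, 2, "F"),
    (4, "Sena", 2, 2, "M"),
]
subjects = {1: "Math", 2: "English"}
standards = {1: "First", 2: "Second"}

# One counter key per (standard, subject, gender) combination,
# which mirrors the three GROUP BY columns.
totals = Counter(
    (standards[stand_id], subjects[sub_id], "Boys" if gender == "M" else "Girls")
    for _id, _name, stand_id, sub_id, gender in students
)

for (standard, subject, gender), total in sorted(totals.items()):
    print(standard, subject, gender, total)
```

Like the SQL, this only lists combinations that actually occur; combinations with zero students do not appear.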

Is this what you are looking for?
SELECT sd.name,
sj.name,
count(st.gender) filter (
WHERE st.gender='M') AS MALE,
count(st.gender) filter (
WHERE st.gender='F') AS FEMALE
FROM Standard sd
INNER JOIN Student st ON (st.stand_id=sd.id)
INNER JOIN Subject sj ON (sj.id=st.sub_id)
GROUP BY sd.name,
sj.name;
name | name | male | female
--------+---------+------+--------
First | Math | 1 | 0
First | English | 0 | 1
Second | English | 2 | 1
Second | Math | 0 | 1
(4 rows)
Note that I added some more rows for Second/English, so the counts differ from your sample data.

Tableau: DATEDIFF( 'days', MIN([Start Date]), [End Date])

Cheers!
I'm trying to get a chart working that shows me the count of work orders that are completed each day after work on a unit (serial number) starts. I'd like to be able to "shadow" multiple serial numbers on top of each other, normalized to a start date of '0'.
Currently I have columns in my data set:
Work order number (0..999), repeats for each serial number
Serial number (0..999)
Work order start date (Datetime)
Work order end date (Datetime)
Say for instance that a new serial number starts each day, contains 5 work orders, and requires 5 days to complete (there are 5 units in WIP at any given time).
The data might look like (dates shown as ints):
| Work order number | Serial number | Work order start date | Work order end date |
| ----------------- | ------------- | --------------------- | ------------------- |
| 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 3 |
| 3 | 1 | 2 | 4 |
| 4 | 1 | 3 | 5 |
| 5 | 1 | 4 | 5 |
| 1 | 2 | 2 | 3 |
| 2 | 2 | 2 | 4 |
| 3 | 2 | 3 | 5 |
| 4 | 2 | 4 | 6 |
| 5 | 2 | 5 | 6 |
I'm assuming I'll need a calculated column that would perhaps go something like:
[Work order end days since start] =
[Work order end date] - MIN(
IF(*serial number matches current*, [Work order start date], NULL)
)
I (clearly) have no idea how to actually create such a calculated field in Tableau.
The values in the column (same order as the data above) should be:
| Work order end days since start |
| ------------------------------- |
| 1 |
| 2 |
| 3 |
| 4 |
| 4 |
| 1 |
| 2 |
| 3 |
| 4 |
| 4 |
Any guidance or help? Happy to clarify anything as well. Many thanks! Cheers!

You will get better results with this kind of data if you reshape it to have a single date column and add a type column indicating whether the row describes the start or the completion of a work order.
| Work order number | Serial number | date | type |
Think of each row representing a state change, not a work order.
Open work orders on a particular date would be those that have a start record prior to that date, but don't have a completion record prior to that date. If you define a calculated field as +1 if type = New and -1 if type = Completion, then you can use a running total of that field to view the number of open work orders over time.
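The reshaping and running total described above can be sketched in Python with the sample data from the question (dates as ints). This is only an illustration of the idea, not Tableau syntax, and it assumes that a work order completed on a given day no longer counts as open at the end of that day.

```python
from collections import defaultdict
from itertools import accumulate

# (work order number, serial number, start date, end date)
work_orders = [
    (1, 1, 1, 2), (2, 1, 1, 3), (3, 1, 2, 4), (4, 1, 3, 5), (5, 1, 4, 5),
    (1, 2, 2, 3), (2, 2, 2, 4), (3, 2, 3, 5), (4, 2, 4, 6), (5, 2, 5, 6),
]

# Reshape: one row per state change, +1 for a start and -1 for a completion.
events = []
for _number, _serial, start, end in work_orders:
    events.append((start, +1))
    events.append((end, -1))

# Net change per day, then a running total over the sorted days.
net_change = defaultdict(int)
for day, delta in events:
    net_change[day] += delta

days = sorted(net_change)
open_orders = dict(zip(days, accumulate(net_change[day] for day in days)))
print(open_orders)
```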

I'm a bit new to PostgreSQL and need how to construct complex query

I need to list all the cities you can get to after stopping off at exactly one other city, starting off from any city of my choice. And list with it the distance to the final city and the intermediate city.
The tables in the database consist of cities, with the attributes:
| city_id | name |
1 Edinburgh
2 Newcastle
3 Manchester
citypairs:
| citypair_id | city_id |
1 1
1 2
2 1
2 3
3 2
3 3
and distances:
| citypair_id | distance |
1 1234
2 1324
3 1324
and trains:
| train_id | departure_city_id | destination_city_id |
1 1 2
2 2 3
3 1 3
4 3 2
I haven't put any of the data in but basically if a city.name is chosen at random by me I need to find out which cities I can get to from this city if I go via another city (i.e. in two journeys) and then the distance to the final and intermediate city.
How would you, or how should I, go about forming a query to return the desired table?
Edited to include data and a missing table! As an example you can go from Edinburgh(1) to Manchester(3) via Newcastle(2) and you can go from Edinburgh to Newcastle via Manchester, however you can not go from Manchester to Edinburgh via Newcastle (since a train departs from 3, arrives at 2, but no train from 2 arrives in 1) and this route should not be returned from the query. Apologies for any confusion beforehand.

I've got a CTE that builds a tree of all the destinations.
WITH RECURSIVE trip AS (
SELECT c.city_id AS start_city,
ARRAY[c.city_id] AS route,
cast(c.name AS varchar(100)) AS route_text,
c.city_id AS leg_start_city,
c.city_id AS leg_end_city,
0 AS trip_count,
0 AS leg_length,
0 AS total_length
FROM cities c
UNION ALL
SELECT
trip.start_city,
trip.route || t.destination_city_id,
cast(trip.route_text || ',' || c.name AS varchar(100)),
t.departure_city_id,
t.destination_city_id,
trip.trip_count + 1,
d.distance,
trip.total_length + d.distance
FROM trains t
INNER JOIN trip
ON t.departure_city_id = trip.leg_end_city
INNER JOIN citypairs cps
ON t.departure_city_id = cps.city_id
INNER JOIN citypairs cpe
ON t.destination_city_id = cpe.city_id AND
cpe.citypair_id = cps.citypair_id
INNER JOIN distances d
ON cps.citypair_id = d.citypair_id
INNER JOIN cities c
ON t.destination_city_id = c.city_id
WHERE NOT (array[t.destination_city_id] <@ trip.route))
SELECT *
FROM trip
WHERE trip_count = 2
AND start_city = (SELECT city_id FROM cities WHERE name = 'Edinburgh');
The CTE starts from each city (in the non-recursive part at the start), then determines all the destination cities it can reach from there. It keeps track of all the cities it has been to in an array (the route column), so it won't loop back on itself. As it goes through the tree, it keeps a running total of the trip distance (total_length) and the number of trains taken (trip_count).
This gives results of
| START_CITY | ROUTE | ROUTE_TEXT | LEG_START_CITY | LEG_END_CITY | TRIP_COUNT | LEG_LENGTH | TOTAL_LENGTH |
--------------------------------------------------------------------------------------------------------------------------------
| 1 | 1,2,3 | Edinburgh,Newcastle,Manchester | 2 | 3 | 2 | 1324 | 2558 |
| 1 | 1,3,2 | Edinburgh,Manchester,Newcastle | 3 | 2 | 2 | 1324 | 2648 |
If you remove the final WHERE clause, it shows all the possible trips in the data; likewise, you can change the trip_count filter to find all single-train destinations, etc.
| START_CITY | ROUTE | ROUTE_TEXT | LEG_START_CITY | LEG_END_CITY | TRIP_COUNT | LEG_LENGTH | TOTAL_LENGTH |
--------------------------------------------------------------------------------------------------------------------------------
| 1 | 1 | Edinburgh | 1 | 1 | 0 | 0 | 0 |
| 2 | 2 | Newcastle | 2 | 2 | 0 | 0 | 0 |
| 3 | 3 | Manchester | 3 | 3 | 0 | 0 | 0 |
| 1 | 1,2 | Edinburgh,Newcastle | 1 | 2 | 1 | 1234 | 1234 |
| 1 | 1,3 | Edinburgh,Manchester | 1 | 3 | 1 | 1324 | 1324 |
| 2 | 2,3 | Newcastle,Manchester | 2 | 3 | 1 | 1324 | 1324 |
| 3 | 3,2 | Manchester,Newcastle | 3 | 2 | 1 | 1324 | 1324 |
| 1 | 1,2,3 | Edinburgh,Newcastle,Manchester | 2 | 3 | 2 | 1324 | 2558 |
| 1 | 1,3,2 | Edinburgh,Manchester,Newcastle | 3 | 2 | 2 | 1324 | 2648 |
The cast( ... as varchar(100)) is a bit hacky, and I'm not sure why it was needed, but I haven't had a chance to get around that yet.
The SQL is here for testing: http://sqlfiddle.com/#!1/93964/24

The first part is easy:
SELECT c2.name
FROM cities AS c
JOIN trains t ON c.city_id=t.departure_city_id
JOIN trains t2 ON t.destination_city_id=t2.departure_city_id
JOIN cities AS c2 ON t2.destination_city_id=c2.city_id
WHERE c2.city_id!=c.city_id
AND c.name='Edinburgh';
http://sqlfiddle.com/#!12/a656f/14
In PG 9.1+ you could even do it with a recursive CTE for any number of cities in between. The distances are a little more complicated, and you would probably be better off transforming citypairs into actual pairs.
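To illustrate what "actual pairs" buys you, here is a hedged Python sketch using the sample data from the question. It turns citypairs into a lookup keyed by the unordered pair of cities, then chains two train legs just as the double join on trains does. The function name `two_leg_trips` and the in-memory data layout are assumptions made for the example.

```python
from collections import defaultdict

cities = {1: "Edinburgh", 2: "Newcastle", 3: "Manchester"}
citypairs = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (3, 3)]  # (citypair_id, city_id)
distances = {1: 1234, 2: 1324, 3: 1324}                       # citypair_id -> distance
trains = [(1, 2), (2, 3), (1, 3), (3, 2)]                     # (departure, destination)

# Transform citypairs into actual pairs: {city_a, city_b} -> distance.
members = defaultdict(set)
for pair_id, city_id in citypairs:
    members[pair_id].add(city_id)
pair_distance = {frozenset(m): distances[pair_id] for pair_id, m in members.items()}

def two_leg_trips(start_name):
    """Return (via, destination, distance to via, total distance) tuples."""
    start = next(cid for cid, name in cities.items() if name == start_name)
    trips = []
    for dep1, via in trains:
        if dep1 != start:
            continue
        for dep2, dest in trains:
            # Second leg must leave from the stop-over and not return home.
            if dep2 == via and dest != start:
                leg1 = pair_distance[frozenset((start, via))]
                leg2 = pair_distance[frozenset((via, dest))]
                trips.append((cities[via], cities[dest], leg1, leg1 + leg2))
    return trips

print(two_leg_trips("Edinburgh"))
print(two_leg_trips("Manchester"))
```

Keying the distances by an unordered pair means a single lookup works in either direction of travel, which is the part that takes two citypairs joins in the SQL.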
