Delete group when its observation does not contain certain values in SAS - group-by

Refer to the table below. An ID is considered complete if at least one of its groups has every day from Day 1 to Day 3 (duplicates allowed).
I need to remove any ID that has no group with the full Day 1 to Day 3.
ID Group Day
1 A 1
1 A 1
1 A 2
1 A 3
1 B 1
1 B 3
2 A 1
2 A 3
2 B 2
Expected result
ID Group Day
1 A 1
1 A 1
1 A 2
1 A 3
1 B 1
1 B 3
I used this question as a reference: "Delete the group that none of its observation contain the certain value in SAS".
I have tried the code below, but it does not remove ID 2.
PROC SQL;
CREATE TABLE TEMP AS SELECT
* FROM HAVE
GROUP BY ID
HAVING MIN(DAY)=1 AND MAX(DAY)=3
;QUIT;
PROC SQL;
CREATE TABLE TEMP1 AS SELECT
* FROM TEMP WHERE ID IN
(SELECT ID FROM TEMP
WHERE DAY=2)
;QUIT;

So you want to find the set of ID values where the ID has at least one GROUP that has all three DAY values? Find the list of IDs as a subquery and use it to subset the original data.
The key point in the subquery is that a qualifying group must have 3 distinct values of DAY. If your data could have other values of DAY (like missing or 4), then use a WHERE clause to keep only the values you want to count.
proc sql;
create table want as
select * from have
where id in
(select id from have
where day in (1,2,3)
group by id,group
having count(distinct day)=3
)
;
quit;
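For readers who want to trace the subquery logic outside SAS, here is a minimal Python sketch of the same idea. The function name `complete_ids` and the tuple layout are illustrative assumptions, not part of the answer above.

```python
from collections import defaultdict

REQUIRED_DAYS = {1, 2, 3}

def complete_ids(rows):
    """Return the IDs that have at least one group containing every required day.

    rows: iterable of (id, group, day) tuples (hypothetical layout).
    """
    days_by_group = defaultdict(set)
    for id_, group, day in rows:
        if day in REQUIRED_DAYS:              # mirrors WHERE day IN (1,2,3)
            days_by_group[(id_, group)].add(day)
    # mirrors GROUP BY id, group HAVING COUNT(DISTINCT day) = 3
    return {id_ for (id_, _group), days in days_by_group.items()
            if len(days) == len(REQUIRED_DAYS)}

have = [(1, "A", 1), (1, "A", 1), (1, "A", 2), (1, "A", 3),
        (1, "B", 1), (1, "B", 3),
        (2, "A", 1), (2, "A", 3), (2, "B", 2)]

keep = complete_ids(have)
want = [row for row in have if row[0] in keep]   # mirrors WHERE id IN (...)
```

On the sample data only ID 1 qualifies, so `want` holds the six rows of ID 1, including its incomplete group B.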

You can query the dataset with a removal list. For example:
proc sql noprint;
create table want as
select *
from have
where cats(group, id) NOT IN(select cats(group, id) from removal_list)
;
quit;
Creating the Removal List
This method will prevent you from having to do a Cartesian product on all IDs, groups, and days to create your removal list.
Assume that your data is sorted by ID, group, and day.
For each ID, the first day in the group must be 1
For each ID, all days in the group after the first day must have a difference of 1 from the previous day
Code:
data removal_list;
set have;
by ID Group Day;
retain flag_remove_group;
lag_day = lag(day);
/* Reset flag_remove_group at the start of each (ID, Group).
Check if the first day is > 1. If it is, set the removal flag.
*/
if(first.group) then do;
call missing(lag_day);
if(day > 1) then flag_remove_group = 1;
else flag_remove_group = 0;
end;
/* If it's not the first (ID, Group), check if days
are skipped between observations
*/
if(NOT first.group AND (day - lag_day) NE 1) then flag_remove_group = 1;
if(flag_remove_group) then output;
keep id group;
run;
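The consecutive-day check in the DATA step can be traced with a short Python sketch. This is only an illustration under assumptions: the function name `removal_list` and the tuple layout are hypothetical, and the rows are assumed to be sorted by ID, group, and day, as the step requires.

```python
def removal_list(rows):
    """Return the (id, group) pairs flagged for removal.

    rows: (id, group, day) tuples, already sorted by id, group, day.
    """
    flagged = set()
    prev_key, prev_day = None, None
    for id_, group, day in rows:
        key = (id_, group)
        if key != prev_key:
            # First row of a new (ID, Group): the first day must be 1.
            if day > 1:
                flagged.add(key)
        elif day - prev_day != 1:
            # Later rows: any step other than exactly +1 flags the group.
            flagged.add(key)
        prev_key, prev_day = key, day
    return flagged

have = [(1, "A", 1), (1, "A", 1), (1, "A", 2), (1, "A", 3),
        (1, "B", 1), (1, "B", 3),
        (2, "A", 1), (2, "A", 3), (2, "B", 2)]

flagged = removal_list(sorted(have))
```

Note that this rule treats a repeated day as a step of 0, so on the sample data group A of ID 1 is flagged as well. If duplicate days should be tolerated, the step check would need to accept 0 in addition to 1.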


how to get last known contiguous value in postgres ltree field?

I have a child table called wbs_numbers. The primary key id is an ltree.
A typical example is:
id series_id
abc.xyz.00001 1
abc.xyz.00002 1
abc.xyz.00003 1
abc.xyz.00101 1
The parent table is called series. It has a field called last_contigous_max.
Given the above example, I want the series with id 1 to have a last_contigous_max of 3.
You can always assume that the ltree id of a wbs number has 3 fragments separated by dots, and that the last fragment is a 5-digit numeric string left-padded with zeros. You can also assume that the first child always ends with 00001 and that the theoretical total number of children of a series will never exceed 9999.
If you think of it as gaps and islands, the wbs_numbers within a series will never start with a gap; they will always start with an island.
That is to say, this is not possible:
id series_id
abc.xyz.00010 1
abc.xyz.00011 1
abc.xyz.00012 1
abc.xyz.00101 1
This is possible
id series_id
abc.xyz.00001 1
abc.xyz.00004 1
abc.xyz.00005 1
abc.xyz.00051 1
abc.xyz.00052 1
abc.xyz.00100 1
abc.xyz.10001 2
abc.xyz.10002 2
abc.xyz.10003 2
abc.xyz.10051 2
abc.xyz.10052 2
abc.xyz.10100 2
abc.xyz.20001 3
abc.xyz.20002 3
abc.xyz.20003 3
abc.xyz.20004 3
abc.xyz.20052 3
abc.xyz.20100 3
So the last contiguous max in this case is:
for series id 1 => 1
for series id 2 => 3
for series id 3 => 4
What's the query to calculate the last_contigous_max number for any given series_id?
I also don't mind having another table just to store "islands".
Also, you can safely assume that wbs_number records will never be deleted once created. The id in the wbs_numbers table will never be altered once filled in as well.
Meaning to say islands will only grow and never shrink.
You can solve your problem by following these steps:
extract the integer value from your "id" field
compute a ranking value alongside that integer value
keep only the rows where the ranking value matches the integer value
get the last such row for each series (fetched with ties, so one row per series)
WITH cte AS (
SELECT *, CAST(RIGHT(id_, 4) AS INTEGER) AS idval
FROM tab
), ranked AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY series_id ORDER BY idval) AS rn
FROM cte
)
SELECT series_id, idval
FROM ranked
WHERE idval = rn
ORDER BY ROW_NUMBER() OVER(PARTITION BY series_id ORDER BY idval DESC)
FETCH FIRST ROWS WITH TIES
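The ranking idea can also be traced in plain Python. This is a minimal sketch under assumptions: the function name `last_contiguous_max` and the (id, series_id) pair layout are hypothetical, and the last four digits are used as the integer value, as in the query above.

```python
from collections import defaultdict

def last_contiguous_max(rows):
    """rows: (id, series_id) pairs, where id looks like 'abc.xyz.00003'."""
    values = defaultdict(list)
    for id_, series_id in rows:
        values[series_id].append(int(id_[-4:]))      # like RIGHT(id_, 4)
    result = {}
    for series_id, vals in values.items():
        best = 0
        # enumerate(..., start=1) plays the role of ROW_NUMBER()
        for rank, val in enumerate(sorted(vals), start=1):
            if val != rank:
                break            # the first gap ends the leading island
            best = val
        result[series_id] = best
    return result

rows = [("abc.xyz.00001", 1), ("abc.xyz.00004", 1), ("abc.xyz.00005", 1),
        ("abc.xyz.10001", 2), ("abc.xyz.10002", 2), ("abc.xyz.10003", 2),
        ("abc.xyz.10051", 2),
        ("abc.xyz.20001", 3), ("abc.xyz.20002", 3), ("abc.xyz.20003", 3),
        ("abc.xyz.20004", 3), ("abc.xyz.20052", 3)]

maxes = last_contiguous_max(rows)
```

Stopping at the first mismatch is equivalent to the `idval = rn` filter: once a gap occurs, every later value is larger than its rank, so no later row can match.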

I need to find the number of users that were invoiced for an amount greater than 0 in the previous month and were not invoiced in the current month

I need to find the number of users that were invoiced for an amount greater than 0 in the previous month and were not invoiced in the current month. This calculation is to be done for 12 months in a single query. The output should be as below.
Month Count
01/07/2019 50
01/08/2019 34
01/09/2019 23
01/10/2019 98
01/11/2019 10
01/12/2019 5
01/01/2020 32
01/02/2020 65
01/03/2020 23
01/04/2020 12
01/05/2020 64
01/06/2020 54
01/07/2020 78
I am able to get the value only for one month. I want to get it for all months in a single query.
This is my current query:
SELECT COUNT(DISTINCT TWO_MONTHS_AGO.USER_ID), TWO_MONTHS_AGO.MONTH AS INVOICE_MONTH
FROM (
SELECT USER_ID, LAST_DAY(invoice_ct_dt) AS MONTH
FROM table a AS ID
WHERE invoice_amt > 0
AND LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 2)
GROUP BY user_id
) AS TWO_MONTHS_AGO
LEFT JOIN (
SELECT user_id, LAST_DAY(invoice_ct_dt) AS MONTH
FROM table a AS ID
WHERE LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 1)
GROUP BY USER_ID
) AS ONE_MONTH_AGO ON TWO_MONTHS_AGO.USER_ID = ONE_MONTH_AGO.USER_ID
WHERE ONE_MONTH_AGO.USER_ID IS NULL
GROUP BY INVOICE_MONTH;
Thank you in advance.
Lona
There are probably lots of different approaches, but the way I would do it is as follows:
Summarise the data by user and month for the last 13 months (you need 12 months plus the month before the first of them)
Compare "this" month (that has data) to "next" month and select records where there is no "next" month data
Summarise this dataset by month and distinct userid
For example, assuming a table created as follows:
create table INVOICE_DATA (
USERID varchar(4),
INVOICE_DT date,
INVOICE_AMT NUMBER(10,2)
);
the following query should give you what you want - you may need to adjust it depending on whether you are including this month, or only up to the end of last month, in your calculation, etc.:
--Summarise data by user and month
WITH MONTH_SUMMARY AS
(
SELECT USERID
,TO_CHAR(INVOICE_DT,'YYYY-MM') "INVOICE_MONTH"
,TO_CHAR(ADD_MONTHS(INVOICE_DT,1),'YYYY-MM') "NEXT_MONTH"
,SUM(INVOICE_AMT) "MONTHLY_TOTAL"
FROM INVOICE_DATA
WHERE INVOICE_DT >= TRUNC(ADD_MONTHS(current_date(),-13),'MONTH') -- Last 13 months of data
GROUP BY 1,2,3
),
--Get data for users with invoices in this month but not the next month
USER_DATA AS
(
SELECT USERID, INVOICE_MONTH, MONTHLY_TOTAL
FROM MONTH_SUMMARY MS_THIS
WHERE NOT EXISTS
(
SELECT USERID
FROM MONTH_SUMMARY MS_NEXT
WHERE
MS_THIS.USERID = MS_NEXT.USERID AND
MS_THIS.NEXT_MONTH = MS_NEXT.INVOICE_MONTH
)
AND MS_THIS.INVOICE_MONTH < TO_CHAR(current_date(),'YYYY-MM') -- Don't include this month as obviously no next month to compare to
)
SELECT INVOICE_MONTH, COUNT(DISTINCT USERID) "USER_COUNT"
FROM USER_DATA
GROUP BY INVOICE_MONTH
ORDER BY INVOICE_MONTH
;
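The same three steps can be sketched in Python to check the logic on a small dataset. This is an illustration under assumptions: the names `lapsed_counts`, `next_month` and the `as_of` parameter (standing in for the current date) are hypothetical, and the 13-month window is left out for brevity.

```python
from collections import defaultdict
from datetime import date

def next_month(ym):
    """Return the (year, month) pair that follows ym."""
    year, month = ym
    return (year + 1, 1) if month == 12 else (year, month + 1)

def lapsed_counts(invoices, as_of):
    """Count, per month, the users invoiced in that month but not in the next.

    invoices: (user_id, invoice_date) pairs; as_of: the date treated as "today".
    """
    # Step 1: summarise by user and month.
    users_by_month = defaultdict(set)
    for user, dt in invoices:
        users_by_month[(dt.year, dt.month)].add(user)
    current = (as_of.year, as_of.month)
    counts = {}
    for ym in sorted(users_by_month):
        if ym >= current:
            continue                      # no "next" month to compare to yet
        # Step 2: users with data this month and none next month.
        lapsed = users_by_month[ym] - users_by_month.get(next_month(ym), set())
        # Step 3: distinct user count per month.
        if lapsed:
            counts[ym] = len(lapsed)
    return counts

invoices = [("u1", date(2020, 1, 15)), ("u1", date(2020, 2, 3)),
            ("u2", date(2020, 1, 20)),
            ("u3", date(2020, 2, 10)), ("u3", date(2020, 3, 1))]

counts = lapsed_counts(invoices, as_of=date(2020, 3, 15))
```

Like the query, the sketch keys each count by the month in which the user was last invoiced, and it reports only months that have at least one lapsed user.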

Get distinct rows based on one column with T-SQL

I have a column in the following format:
Time Value
17:27 2
17:27 3
I want to get the distinct rows based on one column: Time. So my expected result would be a single row, either 17:27 2 or 17:27 3.
Distinct
T-SQL applies DISTINCT to the whole select list rather than to a single column. Distinct would return two rows here, since the combinations of Time and Value are unique (see below).
select distinct [Time], * from SAPQMDATA
would return
Time Value
17:27 2
17:27 3
instead of
Time Value
17:27 2
Group by
Also group by does not appear to work
select * from table group by [Time]
Will result in:
Column 'Value' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Questions
How can I select all unique 'Time' values without taking into account the other columns in the select query?
How can I remove duplicate entries?
This is where ROW_NUMBER will be your best friend. Using this as your sample data...
time value
-------------------- -----------
17:27 2
17:27 3
11:36 9
15:14 5
15:14 6
... below are two solutions that you can copy/paste/run.
DECLARE @youtable TABLE ([time] VARCHAR(20), [value] INT);
INSERT @youtable VALUES ('17:27',2),('17:27',3),('11:36',9),('15:14',5),('15:14',6);
-- The most elegant way to solve this
SELECT TOP (1) WITH TIES t.[time], t.[value]
FROM @youtable AS t
ORDER BY ROW_NUMBER() OVER (PARTITION BY t.[time] ORDER BY (SELECT NULL));
-- A more efficient way to solve this
SELECT t.[time], t.[value]
FROM
(
SELECT t.[time], t.[value], ROW_NUMBER() OVER (PARTITION BY t.[time] ORDER BY (SELECT NULL)) AS RN
FROM @youtable AS t
) AS t
WHERE t.RN = 1;
Each returns:
time value
-------------------- -----------
11:36 9
15:14 5
17:27 2
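The "row number 1 per partition" idea amounts to keeping one row per key. A minimal Python sketch, assuming (time, value) tuples and a hypothetical helper name `first_per_time`:

```python
def first_per_time(rows):
    """Keep the first (time, value) row seen for each time value."""
    seen = {}
    for time, value in rows:
        seen.setdefault(time, value)     # later rows for the same time are ignored
    return sorted(seen.items())

rows = [("17:27", 2), ("17:27", 3), ("11:36", 9), ("15:14", 5), ("15:14", 6)]
result = first_per_time(rows)
```

The sketch keeps the first row in input order. The T-SQL queries order by (SELECT NULL) within each partition, so there the surviving row for each time is not guaranteed; use a real ORDER BY column if a particular row must win.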

Group events by sequence, defining the minimum period between sequences t-SQL

I have a table of events, called tbl_events that looks something like this:
PersonID Date
1 30/03/2015
1 22/04/2015
1 30/06/2015
2 18/07/2016
2 09/12/2016
2 28/04/2017
3 01/10/2014
3 28/11/2016
3 28/11/2016
3 16/01/2017
4 13/04/2017
4 09/05/2017
I want to be able to group these events up by the start date of each 'sequence', with a sequence being defined as a run of events from the first identified to the last identified for each PersonID. The last event in a sequence is defined as the event where thereafter there are no subsequent events for that PersonID for a year.
The result of this I would expect to look like is below:
PersonID FirstDate Sequence Events
1 30/03/2015 1 3
2 18/07/2016 1 3
3 01/10/2014 1 1
3 28/11/2016 2 3
4 13/04/2017 1 2
I am able to identify the sequences in Excel and pivot the data, but I need to be able to do this in SQL.
Here is the formula I have used in Excel to generate the sequence number (I am populating cell C3, with column A being PersonID and B being Date):
=+IF(A2<>A3,1,IF((B3-B2)<365,C2,C2+1))
I have joined the table back on itself using ROW_NUMBER to get the difference between the Date and the previous event date for that ID, but I'm not really sure where to go from there.
Any help is much appreciated.
My solution is based on the sample data you've provided along with your excel formula.
-- easily consumable sample data
DECLARE @tbl_events TABLE (PersonId int, [date] date)
INSERT @tbl_events VALUES
(1,'20150330'),(1,'20150422'),(1,'20150630'),(2,'20160718'),(2,'20161209'),(2,'20170428'),
(3,'20141001'),(3,'20161128'),(3,'20161128'),(3,'20170116'),(4,'20170413'),(4,'20170509');
-- Solution
WITH groupings AS
(
SELECT
PersonId,
-- earliest event date for the person
FirstDate = MIN([date]) OVER (PARTITION BY personId ORDER BY [date]),
-- LAG returns the previous event date (the row's own date for the first event)
NextDate = LAG([date],1,[date]) OVER (PARTITION BY personId ORDER BY [date]),
[date],
-- whole 365-day periods elapsed since the person's first event
grouper =
DATEDIFF(DAY, MIN([date]) OVER (PARTITION BY personId ORDER BY [date]), [date]) / 365
FROM @tbl_events
),
),
Prep AS
(
SELECT
PersonId,
firstDate = IIF(grouper = 0, FirstDate, IIF(FirstDate = NextDate, [date],NextDate))
FROM groupings
)
SELECT
PersonId,
FirstDate,
[Sequence] = ROW_NUMBER() OVER (PARTITION BY personId ORDER BY FirstDate),
[Events] = COUNT(*)
FROM prep
GROUP BY personId, FirstDate;
Results
PersonId FirstDate Sequence Events
----------- ---------- -------------------- -----------
1 2015-03-30 1 3
2 2016-07-18 1 3
3 2014-10-01 1 1
3 2016-11-28 2 3
4 2017-04-13 1 2
First, note that not all years have 365 days; nonetheless, I'm using 365 to emulate your Excel logic. This would need to be updated to account for leap years. Next, this query will only be correct when a person has at most two sequences;
it would not work when, say, a personId has a date of Jan 1 2015, then Jan 10 2016, then Feb 1 2017. Let us know if you need logic to accommodate the aforementioned scenarios.
Lastly, this solution uses LAG, which requires SQL Server 2012+. If you're working with an earlier version of SQL Server, the query will have to be updated accordingly.
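For comparison, here is a Python sketch that applies the rule from the question's Excel formula row by row: a new sequence starts whenever the gap to the person's previous event is 365 days or more. It is not a translation of the T-SQL above, and the function name `sequences` and the tuple layout are assumptions made for illustration.

```python
from datetime import date
from itertools import groupby

def sequences(events, gap_days=365):
    """events: (person_id, date) pairs.

    Returns (person_id, first_date, sequence, event_count) tuples.
    """
    out = []
    for person, grp in groupby(sorted(events), key=lambda e: e[0]):
        seq, first, count, prev = 0, None, 0, None
        for _, d in grp:
            # Start a new sequence on the first event or after a long gap.
            if prev is None or (d - prev).days >= gap_days:
                if count:
                    out.append((person, first, seq, count))
                seq, first, count = seq + 1, d, 0
            count += 1
            prev = d
        out.append((person, first, seq, count))
    return out

events = [(1, date(2015, 3, 30)), (1, date(2015, 4, 22)), (1, date(2015, 6, 30)),
          (2, date(2016, 7, 18)), (2, date(2016, 12, 9)), (2, date(2017, 4, 28)),
          (3, date(2014, 10, 1)), (3, date(2016, 11, 28)), (3, date(2016, 11, 28)),
          (3, date(2017, 1, 16)), (4, date(2017, 4, 13)), (4, date(2017, 5, 9))]

result = sequences(events)
```

Because the sketch keeps a running sequence counter per person, it is not limited to two sequences. The same leap-year caveat applies to the fixed 365-day threshold.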

T-SQL table variable data order

I have a UDF which returns table variable like
--
--
RETURNS @ElementTable TABLE
(
ElementID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
ElementValue VARCHAR(MAX)
)
AS
--
--
Is the order of data in this table variable guaranteed to be the same as the order in which the data was inserted into it? E.g. if I issue
INSERT INTO @ElementTable(ElementValue) VALUES ('1')
INSERT INTO @ElementTable(ElementValue) VALUES ('2')
INSERT INTO @ElementTable(ElementValue) VALUES ('3')
I expect the data will always be returned in that order when I say
select ElementValue from @ElementTable --Here I don't use order by
EDIT:
If the order is not guaranteed, then the following query
SELECT T1.ElementValue,T2.ElementValue FROM dbo.MyFunc() T1
Cross Apply dbo.MyFunc() T2
order by t1.elementid
will not produce the 9-row matrix as
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
consistently.
Is there any possibility that it could be like
1 2
1 1
1 3
2 3
2 2
2 1
3 1
3 2
3 3
How to do it using my above function?
No, the order is not guaranteed to be the same.
Unless, of course you are using ORDER BY. Then it is guaranteed to be the same.
Given your update, you obtain it in the obvious way - you ask the system to give you the results in the order you want:
SELECT T1.ElementValue,T2.ElementValue FROM dbo.MyFunc() T1
Cross join dbo.MyFunc() T2
order by t1.elementid, t2.elementid
You are guaranteed that, if you're using inefficient single row inserts within your UDF, the IDENTITY values will match the order in which the individual INSERT statements were specified.
Order is not guaranteed.
But if all you want is simply to get your records back in the same order you inserted them, then just order by your primary key. Since you already have that field set up as an auto-increment, it should suffice.
...or use a deterministic function
SELECT TOP 9
M1 = (ROW_NUMBER() OVER(ORDER BY id) + 2) / 3,
M2 = (ROW_NUMBER() OVER(ORDER BY id) + 2) % 3 + 1
FROM
sysobjects
M1 M2
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
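The arithmetic behind M1 and M2 can be checked with a short Python sketch, where `rn` stands in for the ROW_NUMBER() values 1 to 9 and `//` plays the role of T-SQL integer division:

```python
# rn runs over the row numbers 1..9 produced by ROW_NUMBER()
pairs = [((rn + 2) // 3, (rn + 2) % 3 + 1) for rn in range(1, 10)]

for m1, m2 in pairs:
    print(m1, m2)
```

Adding 2 before dividing by 3 maps row numbers 1 to 3 onto 1, 4 to 6 onto 2, and 7 to 9 onto 3, while the remainder cycles through 1, 2, 3 within each block.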