SAS-To identify consecutive observations - date

I am trying to identify events that occurred in at least four consecutive years. Assume I have the following sample.
Rungroup Year
1 2003
1 2004
1 2005
1 2006
1 2008
1 2009
2 2003
2 2004
2 2005
2 2007
2 2008
2 2009
3 2003
3 2004
Based on the following code, I want to remove the years that are not part of a run of at least four consecutive years. The method has two steps. The first step assigns a serial number to the consecutive years. The second step uses a look-ahead method.
data have2;
set have;
by rungroup;
lyear=lag(year);
if first.rungroup then lyear=.;
if year =1+ lyear then group1+1;
else group1=0;
run;
data have3;
set have2;
by rungroup;
set have2 ( firstobs = 2 keep = group1 rename = (group1 = next2) )
have2 ( obs = 1 );
next2 = ifn( last.rungroup, (.), next2 );
set have2 ( firstobs = 3 keep = group1 rename = (group1 = next3) )
have2 ( obs = 2 );
next3 = ifn( last.rungroup, (.), next3 );
set have2 ( firstobs = 4 keep = group1 rename = (group1 = next4) )
have2 ( obs = 3 );
next4 = ifn( last.rungroup, (.), next4);
if next4>=3 or next3>=3 or next2>=3 or group1>=3 then output;
run;
Is this an efficient way to identify consecutive observations? Any comments would be greatly appreciated.

If your goal is to flag all the observations that are part of a consecutive sequence of at least 4 years within the same group, here is an approach:
data have;
input Rungroup Year;
datalines;
1 2003
1 2004
1 2005
1 2006
1 2008
1 2009
2 2003
2 2004
2 2005
2 2007
2 2008
2 2009
3 2003
3 2004
;
data want(drop=y);
if _N_=1 then do;
declare hash h(dataset:'have');
h.definekey('Rungroup', 'Year');
h.definedone();
end;
set have;
array _{-3:3} _temporary_;
do y=-3 to 3;
_[y]=h.check(key:Rungroup, key:Year+y);
end;
if _[-3]=0 & _[-2]=0 & _[-1]=0
| _[-2]=0 & _[-1]=0 & _[ 1]=0
| _[-1]=0 & _[ 1]=0 & _[ 2]=0
| _[ 1]=0 & _[ 2]=0 & _[ 3]=0
then flag=1;
run;
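For readers who want to check the logic outside SAS, here is a minimal Python sketch of the same idea. It is not a translation of the hash approach above: it groups sorted years into runs (consecutive years share the same value of year minus position) and flags every year in a run of at least `min_len`. The function name `flag_consecutive_runs` and the `min_len` parameter are made up for this illustration.

```python
from itertools import groupby

def flag_consecutive_runs(rows, min_len=4):
    """Map (group, year) -> True when the year belongs to a run of at
    least min_len consecutive years within its group (hypothetical helper)."""
    years_by_group = {}
    for group, year in rows:
        years_by_group.setdefault(group, set()).add(year)

    flags = {}
    for group, years in years_by_group.items():
        ordered = sorted(years)
        # Years in one consecutive run share the same (year - position) value.
        for _, run in groupby(enumerate(ordered), key=lambda p: p[1] - p[0]):
            run_years = [year for _, year in run]
            for year in run_years:
                flags[(group, year)] = len(run_years) >= min_len
    return flags

rows = [(1, 2003), (1, 2004), (1, 2005), (1, 2006), (1, 2008), (1, 2009),
        (2, 2003), (2, 2004), (2, 2005), (2, 2007), (2, 2008), (2, 2009),
        (3, 2003), (3, 2004)]
flags = flag_consecutive_runs(rows)
flagged = sorted(key for key, value in flags.items() if value)
print(flagged)  # [(1, 2003), (1, 2004), (1, 2005), (1, 2006)]
```

On the sample data only the 2003 to 2006 run of group 1 qualifies; both runs in group 2 are three years long.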

Related

SAS--How to identify the event occurred in three consecutive years?

I am trying to identify traders who place transactions in the same month in each of three consecutive years in one company. Once a trader meets the criteria, these three transactions and all his subsequent transactions in that same month in that company should be identified.
Assume I have a sample data below.
data have;
input ID STOCK trandate :mmddyy10.;
format trandate mmddyy10.;
datalines;
1 1 10/15/2009
1 1 01/01/2010
1 1 01/10/2011
1 1 01/15/2012
1 1 01/01/2013
1 2 01/30/2011
1 2 01/30/2012
1 2 01/30/2012
1 2 01/30/2013
1 2 01/30/2014
1 2 01/30/2015
2 1 01/20/2010
2 1 01/15/2011
2 1 01/16/2012
2 1 02/01/2013
2 2 02/01/2010
2 2 02/10/2011
2 2 02/10/2012
2 2 02/10/2013
2 2 02/10/2014
2 2 01/10/2015
;
run;
What I need:
ID Stock trandate type
1 1 10/15/2009 0
1 1 01/01/2010 1
1 1 01/10/2011 1
1 1 01/15/2012 1
1 1 01/01/2013 1
1 2 01/30/2011 1
1 2 01/30/2012 1
1 2 01/30/2012 1
1 2 01/30/2013 1
1 2 01/30/2014 1
1 2 01/30/2015 1
2 1 01/20/2010 0
2 1 01/15/2011 0
2 1 01/16/2012 0
2 1 02/01/2013 0
2 2 02/01/2010 1
2 2 02/10/2011 1
2 2 02/10/2012 1
2 2 02/10/2013 1
2 2 02/10/2014 1
2 2 01/10/2015 0
I used the following code to achieve this:
proc sort data=have;
by id stock trandate;
run;
data have;
set have;
month=month(trandate);
year=year(trandate);
run;
proc sort data=have;
by id stock month year;
run;
data have;
set have;
by id stock month year;
rungroup + (first.month or not first.month and year - lag(year) > 1);
run;
data temp;
do index = 1 by 1 until (last.rungroup);
set have;
by rungroup;
* distinct number of years in rungroup;
years_runlength = sum (years_runlength, first.rungroup or year ne lag(year));
end;
do index = 1 to index;
set have;
if years_runlength >=4 then output;
end;
run;
The code above identifies traders with transactions in the past three consecutive years. Since I also need the subsequent transactions of these traders, I then apply the following code.
proc sort data=temp;
by id stock rungroup;
run;
data temp;
set temp;
by rungroup;
if first.rungroup then fyear=year;
run;
data temp(drop=fyear rename=(Locf=fyear));
do until (last.id);
set temp;
by id stock;
locf=coalesce(fyear,locf);
output;
end;
run;
data temp;
set temp;
by rungroup;
if first.rungroup then fmonth=month;
run;
data temp;
set temp;
gap=year-fyear;
run;
proc means data=temp;
var gap;
run;
data temp;
set temp;
if gap=3 then type2=1;
type1=1;
run;
The code above marks the first transaction after the three consecutive years. When the identified transactions are combined with the original dataset, all transactions in that same month after the marked transaction can be identified. That way I can achieve the objective that "these three transactions and all his subsequent transactions in that same month in that company should be identified". The following code does this.
proc sort data=have;
by id stock rungroup;
run;
proc sort data=temp;
by id stock rungroup;
run;
data combine;
merge have temp;
by id stock rungroup;
run;
data combine;
set combine;
month=month(trandate);
run;
data combine1 (drop=fmonth rename=(Locf=fmonth));
do until (last.id);
set combine;
by id stock;
locf=coalesce(fmonth,locf);
output;
end;
run;
data combine2 (drop=type2 rename=(Locf=type2));
do until (last.id);
set combine1;
by id stock;
locf=coalesce(type2,locf);
output;
end;
run;
data combine2;
set combine2;
if month^=fmonth then type2=.;
run;
data combine2;
set combine2;
if type1=1 or type2=1 then type=1;
else type=0;
run;
I tried this code and the results look right, but I cannot be 100% sure. Additionally, as you can see, my code is relatively long and complex. Could anyone give me some suggestions about the code?
Here is a bit of a brute-force way. For this example I just limited it to the years 2009 to 2015 in your example, but you could expand the pattern to allow more years. You could use macro logic to generate the wallpaper aspects of the code.
First generate an array you can index by YEAR and MONTH and populate the variables with 1 when the month it represents has a trade. Then check if the series of values for the same month across the years ever has three 1's in a row. You can use two DOW loops to process the data. The first one populates the array and the second tests the array and sets the new flag variable.
data want ;
do until(last.stock) ;
set have ;
by id stock;
array months [1:12,2009:2015]
m1y2009-m1y2015 m2y2009-m2y2015 m3y2009-m3y2015 m4y2009-m4y2015
m5y2009-m5y2015 m6y2009-m6y2015 m7y2009-m7y2015 m8y2009-m8y2015
m9y2009-m9y2015 m10y2009-m10y2015 m11y2009-m11y2015 m12y2009-m12y2015
;
months[month(trandate),year(trandate)]=1;
end;
do until(last.stock);
set have;
by id stock;
select (month(trandate));
when (1) flag=0 ne index(cats(of m1y:),'111');
when (2) flag=0 ne index(cats(of m2y:),'111');
when (3) flag=0 ne index(cats(of m3y:),'111');
when (4) flag=0 ne index(cats(of m4y:),'111');
when (5) flag=0 ne index(cats(of m5y:),'111');
when (6) flag=0 ne index(cats(of m6y:),'111');
when (7) flag=0 ne index(cats(of m7y:),'111');
when (8) flag=0 ne index(cats(of m8y:),'111');
when (9) flag=0 ne index(cats(of m9y:),'111');
when (10) flag=0 ne index(cats(of m10y:),'111');
when (11) flag=0 ne index(cats(of m11y:),'111');
when (12) flag=0 ne index(cats(of m12y:),'111');
otherwise ;
end;
output;
end;
drop m: ;
run;
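The same check can be sketched in Python without a fixed year range. This is only an illustration under my own assumptions: the names `longest_run`, `flag_trades` and `min_years` are hypothetical, and like the `'111'` test above it flags every trade of a trader, stock and month whose years contain a long enough consecutive run.

```python
from collections import defaultdict
from datetime import datetime

def longest_run(years):
    """Length of the longest run of consecutive integers in `years`."""
    best = current = 0
    previous = None
    for year in sorted(set(years)):
        current = current + 1 if previous is not None and year == previous + 1 else 1
        best = max(best, current)
        previous = year
    return best

def flag_trades(trades, min_years=3):
    """Return 0/1 flags, one per trade, in input order (hypothetical helper)."""
    years_by_key = defaultdict(set)
    months = []
    for trader, stock, text in trades:
        day = datetime.strptime(text, "%m/%d/%Y")
        months.append(day.month)
        years_by_key[(trader, stock, day.month)].add(day.year)
    qualifying = {key for key, years in years_by_key.items()
                  if longest_run(years) >= min_years}
    return [int((trader, stock, month) in qualifying)
            for (trader, stock, _), month in zip(trades, months)]

trades = [
    (1, 1, "10/15/2009"), (1, 1, "01/01/2010"), (1, 1, "01/10/2011"),
    (1, 1, "01/15/2012"), (1, 1, "01/01/2013"),
    (2, 1, "01/20/2010"), (2, 1, "01/15/2011"), (2, 1, "01/16/2012"),
    (2, 1, "02/01/2013"),
]
flags3 = flag_trades(trades, min_years=3)
flags4 = flag_trades(trades, min_years=4)
print(flags3)
print(flags4)
```

With `min_years=3` the January trades of ID 2 in stock 1 (2010 to 2012) are flagged; with `min_years=4` they are not, so the threshold is worth checking against the desired output.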

Getting data from alternate dates of same ID column

I have table data as below. I need to fetch the records, within the same code, where (Value2-Value1)*2 of one row >= (Value2-Value1) of the row for the consecutive date. (Dates are uniform across all codes.)
---------------------------------------
code Date Value1 Value2
---------------------------------------
1 1-1-2018 13 14
1 2-1-2018 14 16
1 4-1-2018 15 18
2 1-1-2019 1 3
2 2-1-2018 2 3
2 4-1-2018 3 7
ex: output needs to be
1 1-1-2018 13 14
As I am a beginner at SQL coding, I tried my best, but I cannot work out how to compare only on consecutive dates.
Use a self join.
You can specify all the conditions you've listed in the ON clause:
SELECT T0.code, T0.Date, T0.Value1, T0.Value2
FROM Table As T0
JOIN Table As T1
ON T0.code = T1.code
AND T1.Date = DateAdd(Day, 1, T0.Date)
AND (T0.Value2 - T0.Value1) * 2 >= T1.Value2 - T1.Value1
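To see what such a self join computes, here is a small Python sketch on made-up rows. It assumes that "consecutive date" means the row dated one day later for the same code, which is the reading the question's example output suggests; the sample rows and the names `by_key` and `matches` are illustrative only.

```python
from datetime import date, timedelta

# (code, date, value1, value2); illustrative rows, not the question's table.
rows = [
    ("A", date(2018, 1, 1), 13, 14),
    ("A", date(2018, 1, 2), 14, 16),
    ("A", date(2018, 1, 3), 15, 18),
    ("B", date(2018, 1, 1), 1, 3),
    ("B", date(2018, 1, 3), 2, 3),
]

# Index by (code, date) so the "join partner" is a dictionary lookup.
by_key = {(code, day): (v1, v2) for code, day, v1, v2 in rows}

matches = []
for code, day, v1, v2 in rows:
    partner = by_key.get((code, day + timedelta(days=1)))
    # Keep the row when twice its spread is at least the next day's spread.
    if partner is not None and (v2 - v1) * 2 >= partner[1] - partner[0]:
        matches.append((code, day, v1, v2))

print(matches)
```

Rows without a next-day partner (the last date of a code, or a gap as in code B) drop out, just as they would with an inner join.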

Defining a custom week using days of the month or year in postgresql

I am running a simple query to get weekly revenue from our sales
SELECT date_trunc('week', payment_date) AS week, sum(payment_amount)
FROM payment
WHERE payment_date BETWEEN '2010-jan-1' AND '2016-dec-31'
GROUP BY week
Now I need my week start and end date to be static for every year. All 52 weeks of the year need to be accounted for e.g.
Week 1: Jan 1-7
Week 2: Jan 8-14
Week 3: Jan 15-21
Week 4: Jan 22-28
Week 5: Jan29-Feb4 and so forth
I did some investigation and figured out that I need a user defined function using the payment_date as argument and returning a week value. I can then call this function in the SQL query above, in place of the date_trunc() function.
How can I use an incremental loop to assign a week value to the payment_date?
Can I also use this return value in group by clause in the SQL query?
Some explanation with detailed examples will be highly appreciated since I have basic to intermediate knowledge of SQL.
---------------Edit--------------
I'm now trying to use 2 functions to take the leap year into account, where I would still want March 4th to be included in the 9th week. I've tried to use the function by @klin and convert it to SQL, but I keep getting "syntax error at or near 'int'" on line 9. My code is below.
create or replace function is_leap_year(int)
returns boolean language sql as $$
select $1 % 4 = 0 and ($1 % 100 <> 0 or $1 % 400 = 0)
$$;
create or replace function week_no(timestamp)
returns int language sql as $body$
declare
y int;
day_shift int;
begin
y = extract(year from $1);
day_shift = 1 + (is_leap_year(y) and $1 > make_date(y, 2, 28))::int;
return ((extract(doy from $1)::int)- day_shift) / 7+ 1;
end
$body$;
SELECT week_no(payment_date) as week_number, sum(payment_amount)
from payment p join payment_event pe on p.payment_event_id =
pe.payment_event_id
where payment_date between '2016-jan-1' and '2017-jan-1'
and pe.payment_event_type_id != 2
group by week_number
order by week_number
First, there are problems with your requirements.
Now I need my week start and end date to be static for every year.
They can't be. Leap years happen. February 29 will either shift start and end dates one year out of every four, or you'll need to allow one week to have eight days.
All 52 weeks of the year need to be customized for . . .
I think you mean that all 52 weeks need to be accounted for. But 52 * 7 = 364. You're missing a day.
I think the simplest expression that calculates a week number from a date is (extract(doy from payment_date)::integer / 7) as week. I don't know whether it's worth putting that into a function. Instead, I might start with creating a view that uses that expression.
But a calculation won't do anything special about February 29, or about the fact that every year has more than 52 * 7 days.
I really think your best bet here is to build a table instead of using a calculation.
create table weeks (
calendar_date date primary key,
week_num integer not null
check (week_num between 1 and 53)
);
Populate it with the dates for 2016 and 2017, and with calculated weeks, to give us a starting point. (2016 was a leap year.)
insert into weeks
select
('2016-01-01'::date + (n || ' days')::interval)::date as calendar_date
, extract(doy from ('2016-01-01'::date + (n || ' days')::interval)::date)::integer / 7 + 1 as calendar_week
from generate_series (0, 730) n;
Let's look at week 9.
select *
from weeks
where week_num = 9
order by calendar_date;
calendar_date week_num
--
2016-02-25 9
2016-02-26 9
2016-02-27 9
2016-02-28 9
2016-02-29 9
2016-03-01 9
2016-03-02 9
2017-02-25 9
2017-02-26 9
2017-02-27 9
2017-02-28 9
2017-03-01 9
2017-03-02 9
2017-03-03 9
In 2016, the calculated week 9 ran from 2016-02-25 to 2016-03-02. In 2017, it ran from 2017-02-25 to 2017-03-03. But now that all these week numbers are in a table, you can adjust them any way you like. You can even change the adjustments from year to year if it makes sense to do that.
Use doy (the day of the year) like this:
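As a rough cross-check of the lookup-table idea, the following Python sketch builds the same date-to-week mapping in a dictionary, using the same `doy / 7 + 1` integer arithmetic, and then overrides one entry by hand. The override at the end is purely hypothetical and only shows that a stored mapping can be adjusted freely.

```python
from datetime import date, timedelta

start = date(2016, 1, 1)
weeks = {}
for n in range(731):  # 0..730, the same range as generate_series(0, 730)
    day = start + timedelta(days=n)
    weeks[day] = day.timetuple().tm_yday // 7 + 1

week9_2016 = sorted(d for d, w in weeks.items() if w == 9 and d.year == 2016)
week9_2017 = sorted(d for d, w in weeks.items() if w == 9 and d.year == 2017)
print(week9_2016[0], week9_2016[-1])  # 2016-02-25 2016-03-02
print(week9_2017[0], week9_2017[-1])  # 2017-02-25 2017-03-03

# Hypothetical manual adjustment: move one date into a different week.
weeks[date(2016, 3, 3)] = 9
```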
create or replace function week_no(date)
returns int language sql as $$
select ((extract(doy from $1)::int)- 1) / 7+ 1
$$;
with the_table(a_date) as (
values
('2017-01-01'::date),
('2017-01-07'),
('2017-01-08'),
('2017-01-14'),
('2017-01-15'),
('2017-01-22')
)
select extract(doy from a_date)::int as doy, week_no(a_date)
from the_table;
doy | week_no
-----+---------
1 | 1
7 | 1
8 | 2
14 | 2
15 | 3
22 | 4
(6 rows)
If you want to correct the week number so that March 4th is always in 9th week (even in a leap year), use this handy function:
create or replace function is_leap_year(int)
returns boolean language sql as $$
select $1 % 4 = 0 and ($1 % 100 <> 0 or $1 % 400 = 0)
$$;
Your function may look like this (I've used the plpgsql language for better readability though this also can be coded as an sql function):
create or replace function week_no_corrected(date)
returns int language plpgsql as $$
declare
y int = extract (year from $1);
day_shift int = 1 + (is_leap_year(y) and $1 > make_date(y, 2, 28))::int;
begin
return ((extract(doy from $1)::int)- day_shift) / 7+ 1;
end;
$$;
with the_table(a_date) as (
values
('2016-03-03'::date),
('2016-03-04'),
('2016-03-05'),
('2017-03-03'),
('2017-03-04'),
('2017-03-05')
)
select a_date, week_no(a_date), week_no_corrected(a_date)
from the_table;
a_date | week_no | week_no_corrected
------------+---------+-------------------
2016-03-03 | 9 | 9
2016-03-04 | 10 | 9
2016-03-05 | 10 | 10
2017-03-03 | 9 | 9
2017-03-04 | 9 | 9
2017-03-05 | 10 | 10
(6 rows)
In an SQL function you cannot use variables, but the assignments can be replaced by derived tables:
create or replace function week_no_corrected(date)
returns int language sql as $$
select ((extract(doy from $1)::int)- day_shift) / 7 + 1
from (
select 1 + (is_leap_year(y) and $1 > make_date(y, 2, 28))::int as day_shift
from (
select extract (year from $1)::int as y
) s
) s
$$;
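If it helps to test the arithmetic outside the database, here is a Python sketch that mirrors `week_no` and `week_no_corrected`. The Python function names simply reuse the SQL ones; `tm_yday` plays the role of `extract(doy ...)` and `//` is integer division.

```python
from datetime import date

def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def week_no(day):
    # Days 1-7 of the year are week 1, days 8-14 week 2, and so on.
    return (day.timetuple().tm_yday - 1) // 7 + 1

def week_no_corrected(day):
    # After Feb 28 of a leap year, shift by one extra day so that
    # March 4th stays in week 9.
    day_shift = 1 + int(is_leap_year(day.year) and day > date(day.year, 2, 28))
    return (day.timetuple().tm_yday - day_shift) // 7 + 1

for day in [date(2016, 3, 3), date(2016, 3, 4), date(2016, 3, 5),
            date(2017, 3, 3), date(2017, 3, 4), date(2017, 3, 5)]:
    print(day, week_no(day), week_no_corrected(day))
```

Note that December 31 still lands in week 53 with either function, because 52 * 7 is only 364 days.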
Breaking your problem down to month-day strings lets you use the same logic across multiple years:
mysql> SELECT "01-07" < "01-08";
+-------------------+
| "01-07" < "01-08" |
+-------------------+
| 1 |
+-------------------+
1 row in set (0.08 sec)
A simple date format of %m-%d works for comparing the payment dates to the week buckets you want to assign.
To manually assign all 52 week ranges, you can use a case statement:
SET @md_format="%m-%d";
SELECT
CASE
WHEN (date_format(`input_date`, @md_format) < "01-08") THEN 1
WHEN (date_format(`input_date`, @md_format) < "01-15") THEN 2
WHEN (date_format(`input_date`, @md_format) < "01-22") THEN 3
-- ... All other cases
ELSE 52
END;
See the docs for the syntax to define a function
Functions will allow you to do operations like:
SELECT week_bucket(payment_date) `week`, SUM(revenue) `revenue`
FROM my_table
WHERE week_bucket(payment_date) > 13
AND week_bucket(payment_date) < 15
GROUP BY `week`;
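The month-day bucketing can be sketched in Python as well, generating the 52 boundaries instead of typing them out. This is an assumption-laden illustration: the boundaries are derived from a non-leap reference year (2017), `week_bucket` reuses the function name from the query above, and everything from December 24 onward falls into week 52, matching the `ELSE 52` branch.

```python
from bisect import bisect_right
from datetime import date, timedelta

# "MM-DD" start of each of the 52 weeks, taken from a non-leap reference year.
WEEK_STARTS = [(date(2017, 1, 1) + timedelta(days=7 * k)).strftime("%m-%d")
               for k in range(52)]

def week_bucket(day):
    """Number of week starts that are <= the date's month-day string."""
    return bisect_right(WEEK_STARTS, day.strftime("%m-%d"))

print(WEEK_STARTS[:5])                 # ['01-01', '01-08', '01-15', '01-22', '01-29']
print(week_bucket(date(2016, 2, 29)))  # 9
print(week_bucket(date(2017, 12, 31))) # 52
```

Because zero-padded "MM-DD" strings sort in calendar order, plain string comparison is enough. February 29 simply joins week 9, which then has eight days in a leap year.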

Comparing two rows in SQL

I want to compare two rows in a table. Retrieve record where line 2 value for a particular year is less than the line 1 value for the same year:
Year Line Dollar
2001 1 $50
2001 2 $50
2002 1 $100
2002 2 $100
2003 1 $150
2003 2 $100
The result is
Year Line Dollar
2003 1 $150
2003 2 $100
Thanks
select a.*, b.*
from yourtable a, yourtable b
where a.year = b.year
and a.line = 1
and b.line = 2
and a.dollar > b.dollar
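The join above pairs line 1 and line 2 of the same year in one output row. For comparison, here is a small Python sketch of the same pairing on the sample data, with the dollar amounts taken as plain integers (an assumption, since the question shows them with a `$` sign).

```python
rows = [
    (2001, 1, 50), (2001, 2, 50),
    (2002, 1, 100), (2002, 2, 100),
    (2003, 1, 150), (2003, 2, 100),
]

# Split by line number, keyed by year, then compare the pair for each year.
line1 = {year: dollar for year, line, dollar in rows if line == 1}
line2 = {year: dollar for year, line, dollar in rows if line == 2}
years = {year for year in line1 if year in line2 and line2[year] < line1[year]}

result = [row for row in rows if row[0] in years]
print(result)  # [(2003, 1, 150), (2003, 2, 100)]
```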

Trying to set a variable inside a case statement.

I'm trying to update a date dimension table from the accounting years table of our ERP System. If I run the following Query:
SELECT fcname FYName
,min(fdstart) YearStart
,max(fdend) YearEnd
,max(fnnumber) PeriodCount
FROM M2MData01.dbo.glrule GLR
GROUP BY fcname
I get the following data:
FYName YearStart YearEnd PeriodCount
FY 2000 1/1/2000 12:00:00 AM 12/31/2000 12:00:00 AM 12
FY 2001 1/1/2001 12:00:00 AM 12/31/2001 12:00:00 AM 12
FY 2002 1/1/2002 12:00:00 AM 12/31/2002 12:00:00 AM 12
FY 2003 1/1/2003 12:00:00 AM 12/31/2003 12:00:00 AM 12
FY 2004 1/1/2004 12:00:00 AM 12/31/2004 12:00:00 AM 12
FY 2005 1/1/2005 12:00:00 AM 12/31/2005 12:00:00 AM 12
FY 2006 1/1/2006 12:00:00 AM 12/31/2006 12:00:00 AM 12
FY 2007 1/1/2007 12:00:00 AM 12/31/2007 12:00:00 AM 12
FY 2008 1/1/2008 12:00:00 AM 12/31/2008 12:00:00 AM 12
FY 2009 1/1/2009 12:00:00 AM 12/31/2009 12:00:00 AM 12
FY 2010 1/1/2010 12:00:00 AM 12/31/2010 12:00:00 AM 12
In my case my company has 12 periods per year which roughly correspond to months. Basically, I am trying to create an update statement to set Fiscal Quarters which will follow this logic:
1. If PeriodCount is divisible by 4 then the number of periods in a quarter is PeriodCount/4.
2. If PeriodNumber is in the first quarter (in this case periods 1 through 3) then FiscalQuarter =1 and so on for quarters 2 through 4.
The problem is that I cannot be guaranteed that everyone uses 12 periods, some companies I support use a different number such as 10.
I started creating the following select statement:
DECLARE @QuarterSize INT
DECLARE @SemesterSize INT
SELECT TST.Date,
CASE WHEN glr.PeriodCount % 4 = 0 THEN
-- Can Be divided into quarters. Quarter size is PeriodCount/4
set @quartersize = (GLR.PeriodCount/4)
CASE
END
ELSE 0
End
FROM m2mdata01.dbo.AllDates TST
INNER JOIN (
SELECT fcname FYName
,min(fdstart) YearStart
,MAX(fdend) YearEnd
,MAX(fnnumber) PeriodCount
FROM M2MData01.dbo.glrule GLR
GROUP BY fcname ) GLR
ON TST.DATE >= GLR.YearStart AND TST.DATE <= GLR.YearEnd
Can I set the value of a variable inside a case statement like this? What's the best way to accomplish this? Am I forced to use a cursor statement and check each date in my dimension against the range in the table above?
Not sure what you want to do here. You can assign the variable outside the CASE expression in the SELECT clause, such as:
SELECT
SomeCol,
@var = CASE
WHEN condition1 THEN some value
WHEN condition2 THEN other value
END,
OtherCol
FROM
...
Note that the @var value will be set to the value evaluated at the last row. As said earlier, I am not sure how you intend to use your @quartersize variable. If the value is needed on every row then you shouldn't be using a variable at all.
It may not be the most elegant solution, but here is what I ended up with.
I joined the detail rows to a grouped (GROUP BY) version of the same table.
SELECT fcname FYName, fdstart PeriodStart, fdend PeriodEnd, fnnumber PeriodNo, GLRAGG.AGGFYName,
GLRAGG.QuarterSize, GLRAGG.PeriodCount, GLRAGG.Quarterific, GLRAGG.SemesterSize, GLRAGG.Semesterific
FROM M2MData01.dbo.glrule GLR
INNER JOIN
(SELECT fcname AGGFYName, min(fdstart) YearStart,
MAX(fdend) YearEnd, MAX(fnnumber) PeriodCount,
(Max(fnnumber) / 4) QuarterSize, CASE WHEN Max(fnnumber) % 4 = 0 THEN 'Yes' ELSE 'No' END AS Quarterific,
(Max(fnnumber) / 2) SemesterSize, CASE WHEN Max(fnnumber) % 2 = 0 THEN 'Yes' ELSE 'No' END AS Semesterific
FROM M2MData01.dbo.glrule
GROUP BY fcname) GLRAGG
ON GLR.FCNAME = GLRAGG.AGGFYNAME
This isn't a big deal because that table only has 12 rows for each year, in this case only 132 total rows.
That produces every fiscal period with the total number of periods in each fiscal year and whether that total is evenly divisible by 4 and by 2. The update statement then uses the "Quarterific" value to decide whether to assign quarters, so I can get by without using variables.
It may not be the best way, but it works and is performant given the small data set.
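The quarter rule from the question (quarter size is PeriodCount/4 when PeriodCount is divisible by 4) can be written as a per-row calculation with no variable at all. Here is a minimal Python sketch; the function name `fiscal_quarter` and the choice to return `None` when the year cannot be split evenly are my assumptions, not part of the original scripts.

```python
def fiscal_quarter(period_no, period_count):
    """Quarter (1-4) for a period, or None when the year's period count
    is not divisible by 4 (hypothetical helper)."""
    if period_count % 4 != 0:
        return None
    quarter_size = period_count // 4
    # Periods 1..quarter_size are quarter 1, the next block quarter 2, etc.
    return (period_no - 1) // quarter_size + 1

print([fiscal_quarter(p, 12) for p in range(1, 13)])
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
print(fiscal_quarter(5, 10))  # None
```

The same expression should translate to integer arithmetic inside a CASE in the update statement, guarded by the divisibility test.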