Merge two dataframes by date in SAS

I have two tables and I want to merge them by id and by the latest date before the date in df1 for the relevant id.
data df1;
input id $ date value ;
informat date yymmdd10.;
format date yymmdd10. ;
cards;
a 19991231 1
a 20011231 2
b 20151231 4
;
data df2;
input id $ date ;
informat date yymmdd10.;
format date yymmdd10.;
cards;
a 20020101
c 20160701
;
I tried this, but there's something missing.
proc sql;
create table output
as select a.*, b.date
from df1 as a, df2 as b
where a.id = b.id
group by a.id, b.id
having (a.date) > max(b.date);
quit;
Desired output:
data output;
input id $ date value;
informat date yymmdd10.;
format date yymmdd10.;
cards;
a 20011231 2
;

I'd do it in two steps, with a PROC SQL to join and sort the two tables, then a data step to only output the latest date for each ID.
proc sql;
create table o1 as
select a.id,
a.date,
a.value
from df1 a
join df2 b
on b.id = a.id
and b.date > a.date
order by a.id, a.date
;
quit;
data output;
set o1;
by id;
if last.id then output;
run;

You can use SET to interleave the records. Use RETAIN to keep the last version of VALUE from the first dataset. You didn't indicate whether you have any missing values of VALUE, but let's test for that anyway.
data want;
set df1(in=in1) df2(in=in2);
by id date ;
retain last_value;
if first.id then last_value=.;
if in1 and not missing(value) then last_value=value;
if in2 and not missing(last_value);
run;
Result:
Obs    id          date    value    last_value
  1    a     2002-01-01        .             2
Note that this method takes the value on or before the DATE in the second dataset. If you want it to take only the last value BEFORE that date, reverse the order in which the two datasets are referenced in the SET statement.
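The difference between "on or before" and "strictly before" can also be seen outside SAS. The following is only a rough Python sketch of the same lookup, not a translation of the data step; the function name asof_lookup and its strict flag are made up for illustration, and the rows are the ones from the question.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# df1 rows from the question: (id, date, value)
df1 = [
    ("a", date(1999, 12, 31), 1),
    ("a", date(2001, 12, 31), 2),
    ("b", date(2015, 12, 31), 4),
]
# df2 rows from the question: (id, date)
df2 = [("a", date(2002, 1, 1)), ("c", date(2016, 7, 1))]


def asof_lookup(left, right, strict=False):
    """For each (id, date) in right, return the latest left row for that id
    dated on or before it (strictly before it when strict=True)."""
    by_id = {}
    for id_, d, value in sorted(left):  # sorted by id, then date
        by_id.setdefault(id_, []).append((d, value))
    out = []
    for id_, d in right:
        rows = by_id.get(id_, [])
        dates = [r[0] for r in rows]
        # bisect_right counts dates <= d, bisect_left counts dates < d
        pos = bisect_left(dates, d) if strict else bisect_right(dates, d)
        if pos:  # at least one qualifying row exists
            out.append((id_,) + rows[pos - 1])
    return out


result = asof_lookup(df1, df2)
```

With the sample data only id "a" produces a row; "b" has no record in df2 and "c" has none in df1.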

proc sort data=df1;
by id descending date;
proc sort data=df2;
by id;
data want;
merge df1 (in=in1) df2 (in=in2 rename=(date=date_max));
by id;
** Assume you want only values that are in both datasets **;
if in1 & in2;
retain flag;
if first.id then flag = 0;
** If no dates before max date yet and this one is before max date, we have a winner **;
if flag = 0 & date < date_max then do;
** Set flag to indicate this ID has already found the max date **;
flag = 1;
output;
end;
run;

Related

Grouping consecutive dates in PostgreSQL

I have two tables which I need to combine, as sometimes some dates are found in table A and not in table B and vice versa. My desired result is that ranges which overlap or fall on consecutive days be combined.
I'm using PostgreSQL.
Table A
id startdate enddate
--------------------------
101 12/28/2013 12/31/2013
Table B
id startdate enddate
--------------------------
101 12/15/2013 12/15/2013
101 12/16/2013 12/16/2013
101 12/28/2013 12/28/2013
101 12/29/2013 12/31/2013
Desired Result
id startdate enddate
-------------------------
101 12/15/2013 12/16/2013
101 12/28/2013 12/31/2013
Right. I have a query that I think works. It certainly works on the sample records you provided. It uses a recursive CTE.
First, you need to merge the two tables. Next, use a recursive CTE to get the sequences of overlapping dates. Finally, get the start and end dates, and join back to the "merged" table to get the id.
with recursive allrecords as -- this merges the input tables. Add a unique row identifier
(
select *, row_number() over (ORDER BY startdate) as rowid from
(select * from table1
UNION
select * from table2) a
),
path as ( -- the recursive CTE. This gets the sequences
select rowid as parent,rowid,startdate,enddate from allrecords a
union
select p.parent,b.rowid,b.startdate,b.enddate from allrecords b join path p on (p.enddate + interval '1 day')>=b.startdate and p.startdate <= b.startdate
)
SELECT id,g.startdate,g.enddate FROM -- outer query to get the id
-- inner query to get the start and end of each sequence
(select parent,min(startdate) as startdate, max(enddate) as enddate from
(
select *, row_number() OVER (partition by rowid order by parent,startdate) as row_number from path
) a
where row_number = 1 -- We only want the first occurrence of each record
group by parent)g
INNER JOIN allrecords a on a.rowid = parent
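For comparison, the idea of merging both tables and then chaining ranges that overlap or touch can be sketched in a few lines of Python. This is not what the recursive CTE does internally, just a sort-and-sweep illustration; merge_ranges is a hypothetical helper, and the sketch ignores the id column because the sample data has a single id.

```python
from datetime import date, timedelta


def merge_ranges(ranges):
    """Merge (start, end) date ranges that overlap or fall on consecutive days."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + timedelta(days=1):
            # Overlaps or is adjacent to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Sample rows from the question (id 101 only)
table_a = [(date(2013, 12, 28), date(2013, 12, 31))]
table_b = [
    (date(2013, 12, 15), date(2013, 12, 15)),
    (date(2013, 12, 16), date(2013, 12, 16)),
    (date(2013, 12, 28), date(2013, 12, 28)),
    (date(2013, 12, 29), date(2013, 12, 31)),
]
combined = merge_ranges(table_a + table_b)
```

On the sample rows this yields the two ranges shown in the desired result.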
The fragment below does what you intend (but it will probably be very slow). The problem is that detecting (non-)overlapping date ranges is impossible with standard range operators, since a range could be split into two parts.
So, my code does the following:
split the date ranges from table_A into atomic records, with one date per record
[the same for table_b]
full outer join these two sets (we are only interested in A_not_in_B and B_not_in_A), remembering which side of the outer join each record came from.
re-aggregate the resulting records into date ranges.
-- EXPLAIN ANALYZE
--
WITH RECURSIVE ranges AS (
-- Chop up the a-table into atomic date units
WITH ar AS (
SELECT generate_series(a.startdate,a.enddate , '1day'::interval)::date AS thedate
, 'A'::text AS which
, a.id
FROM a
)
-- Same for the b-table
, br AS (
SELECT generate_series(b.startdate,b.enddate, '1day'::interval)::date AS thedate
, 'B'::text AS which
, b.id
FROM b
)
-- combine the two sets, retaining a_not_in_b plus b_not_in_a
, moments AS (
SELECT COALESCE(ar.id,br.id) AS id
, COALESCE(ar.which, br.which) AS which
, COALESCE(ar.thedate, br.thedate) AS thedate
FROM ar
FULL JOIN br ON br.id = ar.id AND br.thedate = ar.thedate
WHERE ar.id IS NULL OR br.id IS NULL
)
-- use a recursive CTE to re-aggregate the atomic moments into ranges
SELECT m0.id, m0.which
, m0.thedate AS startdate
, m0.thedate AS enddate
FROM moments m0
WHERE NOT EXISTS ( SELECT * FROM moments nx WHERE nx.id = m0.id AND nx.which = m0.which
AND nx.thedate = m0.thedate -1
)
UNION ALL
SELECT rr.id, rr.which
, rr.startdate AS startdate
, m1.thedate AS enddate
FROM ranges rr
JOIN moments m1 ON m1.id = rr.id AND m1.which = rr.which AND m1.thedate = rr.enddate +1
)
SELECT * FROM ranges ra
WHERE NOT EXISTS (SELECT * FROM ranges nx
-- suppress partial subassemblies
WHERE nx.id = ra.id AND nx.which = ra.which
AND nx.startdate = ra.startdate
AND nx.enddate > ra.enddate
)
;
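The steps of this answer (split into single days, keep the days found on only one side, re-aggregate) can be sketched in Python as well. This is an illustration under the assumption of a single id, with hypothetical helper names expand and regroup; it is not a port of the query above.

```python
from datetime import date, timedelta


def expand(ranges):
    """Split (start, end) ranges into a set of individual days."""
    days = set()
    for start, end in ranges:
        for offset in range((end - start).days + 1):
            days.add(start + timedelta(days=offset))
    return days


def regroup(days):
    """Re-aggregate a set of days into runs of consecutive days."""
    runs = []
    for day in sorted(days):
        if runs and day == runs[-1][1] + timedelta(days=1):
            runs[-1] = (runs[-1][0], day)  # extend the current run
        else:
            runs.append((day, day))  # start a new run
    return runs


# Sample rows from the question (id 101 only)
a_days = expand([(date(2013, 12, 28), date(2013, 12, 31))])
b_days = expand([
    (date(2013, 12, 15), date(2013, 12, 15)),
    (date(2013, 12, 16), date(2013, 12, 16)),
    (date(2013, 12, 28), date(2013, 12, 28)),
    (date(2013, 12, 29), date(2013, 12, 31)),
])
only_a = regroup(a_days - b_days)  # days in A that are not in B
only_b = regroup(b_days - a_days)  # days in B that are not in A
```

For the sample data, nothing is left on the A side and the B side collapses into one range, 12/15/2013 to 12/16/2013.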

Dedup using SQL on a huge 1 billion data set

I am running into out-of-memory issues while trying to dedup a table containing a huge amount of data.
Scenario :
Column A | Column B ( Date )
Value1 Date1
Value1 Date2
Value2 Date3
Value2 Date4
I need to dedup on column A, picking the latest record according to column B.
Let's say Date2 and Date4 are the latest dates. My output should be:
Column A | Column B ( Date )
Value1 Date2
Value2 Date4
Currently I am using the query below, which works. Is there a better way of doing this that uses less memory?
CREATE TABLE UNIQUE_TABLENAME AS (
SELECT a.column a, a.column b, a.column c, a.column d
from tablename a,
(select column a,max(column b) from tablename group by column a)b
where a.column a = b.column a
and a.column b= b.column b)
Thanks in advance!
select distinct on (col_a)
col_a as value, col_b as "date"
from t
order by col_a, col_b desc
See DISTINCT ON in the PostgreSQL SELECT documentation.
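What this query computes is the latest date for each value. As a rough illustration of that result (not of how PostgreSQL executes it), here is a one-pass Python sketch whose memory use grows with the number of distinct keys rather than the number of rows. The function name latest_per_key and the sample dates are invented for the example.

```python
def latest_per_key(rows):
    """Keep, for each key, the greatest date seen.

    rows is any iterable of (key, date) pairs and is consumed one row
    at a time, so only one entry per distinct key is held in memory."""
    latest = {}
    for key, d in rows:
        if key not in latest or d > latest[key]:
            latest[key] = d
    return latest


# Invented sample data; ISO date strings compare correctly as text.
rows = [
    ("Value1", "2020-01-01"),
    ("Value1", "2020-03-01"),
    ("Value2", "2019-05-01"),
    ("Value2", "2019-06-01"),
]
deduped = latest_per_key(iter(rows))
```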

Fill table with two datetime columns with random dates

I have a table T1 with two datetime columns (StartDate, EndDate) which I must populate with random dates under one condition:
the EndDate value must be at least one day later than StartDate.
Example:
StartDate EndDate
===========================
2001-04-04 2001-04-06 (2 days)
2001-01-05 2001-01-15 (10 days)
.
.
.
Can I do that in one statement?
P.S. My first idea was to make the EndDate column nullable, populate StartDate in a first step leaving EndDate as NULL, and then in a second statement write some mechanism to update EndDate with dates greater than StartDate (by a different number of days for every record).
Here's a solution that populates the table in one step:
insert into T1 (StartDate, EndDate)
select
X.StartDate,
dateadd(day, 1 + abs(checksum(newid())) % 10, X.StartDate) EndDate -- 1 to 10 days after StartDate
from (
select top 20
dateadd(day, -abs(checksum(newid())) % 100, convert(date, getDate())) StartDate
from sys.columns c1, sys.columns c2
) X
The query above uses some tricks that I personally often use in ad-hoc SQL queries:
NEWID() generates a different random value for each row, as opposed to RAND(), which would be evaluated once per query. The expression abs(checksum(newid())) % N will generate random integer values in the range 0 to N-1.
the TOP X ... FROM sys.columns c1, sys.columns c2 trick allows you to generate X rows whose values can be composed of scalar values, like the ones in our example.
Obviously, you can modify the hardcoded values in the above query to:
generate more rows
increase the range of random start dates
increase the maximum duration of each row.
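The arithmetic behind the query can be checked in isolation. The sketch below is a Python illustration, not T-SQL: random_date_pairs and its parameters are hypothetical names, and it assumes the duration should be 1 to max_len days so that EndDate is always later than StartDate.

```python
import random
from datetime import date, timedelta


def random_date_pairs(n, today, max_back=100, max_len=10, seed=None):
    """Return n (start, end) pairs with end at least one day after start.

    start lies 0..max_back-1 days before today; the duration is
    1..max_len days, which guarantees end > start."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        start = today - timedelta(days=rng.randrange(max_back))
        end = start + timedelta(days=1 + rng.randrange(max_len))
        pairs.append((start, end))
    return pairs


# The reference date is arbitrary; any date works.
pairs = random_date_pairs(20, date(2024, 1, 1), seed=0)
```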
INSERT T1 (StartDate, EndDate)
select T1, T1 + add_days
from
(select DATEADD(day, (ABS(CHECKSUM(NEWID())) % 65530), 0) T1,
ROW_NUMBER() OVER(ORDER BY number) add_days
from [master]..spt_values) X;
Something simple using rand() function:
declare @records int = 100, --Number of records needed
@count int = 0, @start int, @end int
while (@records > @count)
begin
set @start = rand() * 10
set @end = @start + 1 + rand() * 100 --EndDate is always at least one day after StartDate
set @count += 1
insert into mytable
select dateadd(day, @start, getdate()), dateadd(day, @end, getdate())
end
select * from mytable

How can I achieve this selection in SAS

Say I have a SAS table tbl which has a column col. This column holds different values, say {"a","s","d","f",...}, but one of them (say "d") occurs MUCH more often than the others. How can I select only the rows with this value?
It would be something like
data tbl;
set tbl;
where col eq "the most present element of col in this case d";
run;
One of many methods to accomplish this...
data test;
n+1;
input col $;
datalines;
a
b
c
d
d
d
d
e
f
g
d
d
a
b
d
d
;
run;
proc freq data=test order=freq; *order=freq automatically puts the most frequent on top;
tables col/out=test_count;
run;
data want;
set test;
if _n_ = 1 then set test_count(keep=col rename=(col=col_keep));
if col = col_keep;
run;
To put this into a macro variable (see comments):
data _null_;
set test_count;
call symput("mvar",col); *put it to a macro variable;
stop; *only want the first row;
run;
I would use PROC SQL for this.
Here's an example that gets "d" into a macro variable and then filters the original dataset, as requested in your question.
This will work even if there is a multi-way tie for the most frequent observation.
data tbl;
input col: $1.;
datalines;
a
a
b
b
b
b
c
c
c
c
d
d
d
;run;
proc sql noprint;
create table tbl_freq as
select col, count(*) as freq
from tbl
group by col;
select quote(col) into: mode_values separated by ', '
from tbl_freq
where freq = (select max(freq) from tbl_freq);
quit;
%put mode_values = &mode_values.;
data tbl_filtered;
set tbl;
where col in (&mode_values.);
run;
Note the use of QUOTE(), which is needed to wrap the values of col in quotation marks (omit this if col is a numeric variable).
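The tie-handling logic (keep every value whose count equals the maximum count) can be sketched in Python for comparison. This is an illustration only; mode_values is a hypothetical helper, and the data are the rows from the tbl example above.

```python
from collections import Counter


def mode_values(values):
    """Return every value whose frequency equals the maximum frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Same rows as the tbl dataset: a x2, b x4, c x4, d x3
col = list("aabbbbccccddd")
modes = mode_values(col)  # b and c tie for the highest count
filtered = [v for v in col if v in modes]
```

Here "b" and "c" tie, so both survive the filter, which mirrors the multi-way tie case mentioned above.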

Postgresql - get closest datetime row relative to given datetime value

I have a postgres table with a unique datetime field.
I would like to use/create a function that takes as argument a datetime value and returns the row id having the closest datetime relative (but not equal) to the passed datetime value. A second argument could specify before or after the passed value.
Ideally, some combination of native datetime functions could handle this requirement. Otherwise it'll have to be a custom function.
Question: What are methods for querying relative datetime over a collection of rows?
select id, passed_ts - ts_column difference
from t
where
passed_ts > ts_column and positive_interval
or
passed_ts < ts_column and not positive_interval
order by abs(extract(epoch from passed_ts - ts_column))
limit 1
passed_ts is the timestamp parameter and positive_interval is a boolean parameter. If it is true, only rows whose timestamp column is earlier than the passed timestamp are considered; if it is false, only rows that are later.
Simply use the - operator.
Assuming you have a table with attributes Key, Attr and T (timestamp with or without time zone):
you can search with
select min(T - TimeValue) from Table where T > TimeValue;
This will give you the minimum difference. You can combine this value with a join to the same table to get the tuple you are interested in:
select * from (select *, T - TimeValue as diff from Table) as T1 NATURAL JOIN
( select min(T - TimeValue) as diff from Table where T > TimeValue) as T2;
That should do it.
--dmg
You want the first row of a select statement producing all the rows below (or above) the given datetime in descending (or ascending) order.
Pseudo code for the function body:
SELECT id
FROM table
WHERE IF(#above, datecol > #param, datecol < #param)
ORDER BY IF (#above, datecol ASC, datecol DESC)
LIMIT 1
However, this does not work: one cannot condition the ordering direction.
The second idea is to do both queries, and select afterwards:
SELECT *
FROM (
(
SELECT 'below' AS dir, id
FROM table
WHERE datecol < #param
ORDER BY datecol DESC
LIMIT 1
) UNION (
SELECT 'above' AS dir, id
FROM table
WHERE datecol > #param
ORDER BY datecol ASC
LIMIT 1)
) AS t
WHERE dir = #dir
That should be pretty fast with an index on the datetime column.
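The reason an index helps is that both branches become a single lookup in sorted order. The same idea can be sketched in Python with a sorted list and binary search; this is an analogy under that assumption, not a description of PostgreSQL internals, and the function name closest and the sample timestamps are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def closest(sorted_ts, target, above):
    """Return the nearest timestamp strictly after target (above=True) or
    strictly before it (above=False), or None if there is none."""
    if above:
        pos = bisect_right(sorted_ts, target)  # first element > target
        return sorted_ts[pos] if pos < len(sorted_ts) else None
    pos = bisect_left(sorted_ts, target)  # elements before pos are < target
    return sorted_ts[pos - 1] if pos > 0 else None


# Illustrative, already sorted timestamps
ts = [
    datetime(2013, 4, 30, 11, 50),
    datetime(2013, 4, 30, 12, 0),
    datetime(2013, 4, 30, 12, 2),
]
target = datetime(2013, 4, 30, 12, 0)
after = closest(ts, target, above=True)
before = closest(ts, target, above=False)
```

Because the comparison is strict in both directions, a row equal to the passed value is never returned, matching the "but not equal" requirement in the question.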
-- test rig
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE lutser
( dt timestamp NOT NULL PRIMARY KEY
);
-- populate it
INSERT INTO lutser(dt)
SELECT gs
FROM generate_series('2013-04-30', '2013-05-01', '1 min'::interval) gs
;
DELETE FROM lutser WHERE random() < 0.9;
--
-- The query:
WITH xyz AS (
SELECT dt AS hh
, LAG (dt) OVER (ORDER by dt ) AS ll
FROM lutser
)
SELECT *
FROM xyz bb
WHERE '2013-04-30 12:00' BETWEEN bb.ll AND bb.hh
;
Result:
NOTICE: drop cascades to table tmp.lutser
DROP SCHEMA
CREATE SCHEMA
SET
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "lutser_pkey" for table "lutser"
CREATE TABLE
INSERT 0 1441
DELETE 1288
hh | ll
---------------------+---------------------
2013-04-30 12:02:00 | 2013-04-30 11:50:00
(1 row)
Wrapping it into a function is left as an exercise for the reader.
UPDATE: here is a second one with the sandwiched-not-exists-trick (TM):
SELECT lo.dt AS ll
FROM lutser lo
JOIN lutser hi ON hi.dt > lo.dt
AND NOT EXISTS (
SELECT * FROM lutser nx
WHERE nx.dt < hi.dt
AND nx.dt > lo.dt
)
WHERE '2013-04-30 12:00' BETWEEN lo.dt AND hi.dt
;
You have to join the table to itself with the where condition looking for the smallest nonzero (negative or positive) interval between the base table row's datetime and the joined table row's datetime. It would be good to have an index on that datetime column.
P.S. You could also look for the max() of the previous or the min() of the subsequent.
Try something like:
SELECT *
FROM your_table
WHERE (dt_time > argument_time and search_above = 'true')
OR (dt_time < argument_time and search_above = 'false')
ORDER BY CASE WHEN search_above = 'true'
THEN dt_time - argument_time
ELSE argument_time - dt_time
END
LIMIT 1;