Outputting conditionally from merge

I want to update a history file in SAS. I have new observations, which may overlap with existing data lines.
What I need is a file that has the lines from the new dataset (new_data) where they exist and, where they do not, the lines from the old dataset (old_data). What I've come up with is a clunky merge operation that depends on the order of the datasets. (It works only if New_data comes after Old_data.)
data new_data;
input key value;
datalines;
1 10
1 11
2 20
2 21
;
run;
data old_data;
input key value;
datalines;
2 50
2 51
3 30
3 31
;
run;
So I'd like to have the following:
key value
1 10
1 11
2 20
2 21
3 30
3 31
However the following does not work. It produces the output below it.
data updated_history;
merge New_data(in=a) old_data(in=b) ;
by key;
if a or (b and not a );
run;
....
2 50
2 51
...
But for some reason this does:
data updated_history;
merge old_data(in=b) New_data(in=a);
by key;
if a or (b and not a );
run;
Question: Is there an intelligent way to control which dataset the values are selected from? Something like: if a then take the value from dataset a;

The order in which you list the data sets in the MERGE statement is the order in which the data is read. So when the order is old, new, the values from old are read first and then the values from new overwrite them. This is why your second version works and the first does not.
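For illustration with the datasets above, here is a minimal sketch of that overwriting (no subsetting IF at all):
data demo;
merge old_data new_data;
by key;
run;
For key 2, each iteration reads a row from old_data and then a row from new_data, so the value read last (from new_data) is what gets written out: the key 2 rows in demo are 2 20 and 2 21.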

Since you have multiple observations per key value you probably do NOT want to use MERGE to combine these files. You could do it with SET by reading the data twice using two DOW loops. In that case the order of the datasets in the SET statement does not matter, since the records are interleaved instead of being joined. The first loop works out which of the two input datasets have any observations for the current KEY value.
data want ;
anyold=0;
anynew=0;
/* first pass: flag whether this KEY value has observations in each dataset */
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if inold then anyold=1;
if innew then anynew=1;
end;
/* second pass: re-read the same BY group and drop old records when new records exist for this KEY value */
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if not (anyold and anynew and inold) then output;
end;
drop anyold anynew;
run;
This type of combination is probably easier to code using SQL.
proc sql ;
create table want as
select key,value from new_data
union
select key,value from old_data
where key in (select key from old_data except select key from new_data)
order by 1
;
quit;
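One caveat: UNION removes fully duplicate rows, so if the history can legitimately contain identical key/value lines, a UNION ALL variant (a sketch along the same lines) preserves them:
proc sql ;
create table want as
select key,value from new_data
union all
select key,value from old_data
where key not in (select key from new_data)
order by 1
;
quit;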

Related

Merging SAS datasets by different column names across several columns

I have 2 data sets that I want to merge by territory #. The first dataset has territory information including territory #; the second dataset has territory #'s, but they are spread across 4 different columns titled drug_terr1, drug_terr2, drug_terr3, and drug_terr4. I need to merge on all 4 columns because they each hold different territory #'s, and I want those numbers included in my merge with the dataset that has all the territory information. I tried a rename, but that didn't work because it only changed the first column. Is there a way to combine all this data and rename it by territory # so I can do the merge?
Ultimately, I need to get the 4 columns from 'terrfile' to become 1 column called territory_nbr so I can merge.
%let output = E:\Horizon\Adhoc\AH\;
%let terrs =\\uslsasas1\E$\Horizon\IMS Processing\Weekly Data\20161230\Demo\;
libname terrs "&terrs.";
%let curr_process_wk = '12-30-2016';
%let curr_quarter =_q1;
**0 Grab pskw;
data pskw_data;
set PSKW.PSKWMaster ;
where week in ('12-16-2016','12-23-2016','12-30-2016','01-06-2017') and CopayType ="FBD" and FNRX=1 and pme_id in (46,42,55,38) and product in ('DUEXIS','VIMOVO','PENNSAID')
and
(COBPrimaryRejectCode1 in ('75','76') or COBPrimaryRejectCode2 in ('75', '76') or COBPrimaryRejectCode3 in ('75' , '76'));
run;
proc sort data=pskw_data;
by imsid;
run;
** 01 Grab tbl HCP;
proc sort data=ims.tblhcp (where = (week = &curr_process_wk.) keep = week imsid first_name last_name address1 address2 city state zip spec)
out = IMS_demo (drop = week);
by IMSID;
run;
** 02 Grab tbl terrs_by_imsid;
data terrfile;
set terrs.wd2_terrs_by_imsid&curr_quarter.;
run;
proc sort data = terrfile;
by imsid;
run;
** 03 Grab tbl roster;
data roster (keep = territorycode repname territoryname teamname);
set ims.tblRoster;
repname = trim(left(FirstName))||" "||trim(left(LastName));
run;
**04 link ;
data combine_dbs;
merge pskw_data (in=in1)
ims.tblhcp (in=in2);
by imsid;
if in1;
run;
data territories; ***can't merge because territory code is not in terrfile, just 4 columns as I mentioned above***;
merge terrfile (in=in1)
roster (in=in2);
by territorycode;
if in2;
run;
You need to merge the fact table with the lookup table four times. Let's say your territory identifier is called ID in your lookup table and the field you want to take from it is IMS_ID. Let's also assume the four fields in your fact table are named ID1-ID4.
proc sql ;
create table want as
select a.*
, b.ims_id as ims_id1
, c.ims_id as ims_id2
, d.ims_id as ims_id3
, e.ims_id as ims_id4
from FACT a
left join LU b on a.id1=b.id
left join LU c on a.id2=c.id
left join LU d on a.id3=d.id
left join LU e on a.id4=e.id
;
quit;
In your example it looks like ROSTER is your FACT table and TERRFILE is your LU table. Your ID variable looks like it is named TERRITORYCODE, at least in your lookup file. It's hard to tell what the four variables in ROSTER are named.
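If, as the question suggests, the four columns drug_terr1-drug_terr4 actually live in TERRFILE and TERRITORYCODE is the key in ROSTER, the same pattern would look roughly like this (the columns pulled from ROSTER are just an assumption):
proc sql ;
create table territories as
select a.*
, b1.repname as repname1
, b2.repname as repname2
, b3.repname as repname3
, b4.repname as repname4
from terrfile a
left join roster b1 on a.drug_terr1=b1.territorycode
left join roster b2 on a.drug_terr2=b2.territorycode
left join roster b3 on a.drug_terr3=b3.territorycode
left join roster b4 on a.drug_terr4=b4.territorycode
;
quit;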

Putting keyword data into a csv file MATLAB

Given a table of the following format in MATLAB:
userid | itemid | keywords
A = [ 3 10 'book'
3 10 'briefcase'
3 10 'boat'
12 20 'windows'
12 20 'picture'
12 35 'love'
4 10 'day'
12 10 'working day'
... ... ... ];
where A is a table of size (58000*3), I want to write the data in a csv file with the following format:
csv.file
itemid keywords
10 book, briefcase, boat, day, working day, ...
20 windows, picture, ...
35 love, ...
where the list of itemids is stored in Iids = [10,20,35,...]
I would like to avoid using loops for this since, as you can imagine, the matrix is large. Any idea is appreciated.
I wasn't able to think of a solution without loops. But you can optimize your loop by:
using logical indexing
running such a loop only M times (if M is the number of unique itemid elements) instead of N times (if N is the number of elements in your table).
The solution I came up with is this.
First of all, create your table
A=table([3;3;3;12;12;12;4;12], [10;10;10;20;20;35;10;10],{'book','briefcase','boat','windows','picture','love','day','working day'}','VariableNames',{'userid','itemid','keywords'});
Select the unique values for column itemid (your Iids):
Iids=unique(A.itemid);
Create a new, empty, table which will contain the results:
NewTable=table();
And now the minimal loop I've come up with:
for id=Iids'
% select rows with given itemid value
RowsWithGivenId=A(A.itemid==id,:);
% create new row in NewTable with the id and the (joined together) keywords from the selected rows
NewTable=[NewTable; table(id,{strjoin(RowsWithGivenId.keywords,', ')})];
end
Also, assign the new column names in NewTable:
NewTable.Properties.VariableNames = {'itemid','keywords'};
And now NewTable contains one row per itemid, with all of its keywords joined into a single string.
Please note: because the keywords in the new table are themselves separated by commas, a csv file is not the format I recommend. If you export it with writetable() as writetable(NewTable,'myfile.csv'), the commas inside the keywords column collide with the csv delimiters. Using a semicolon instead of a comma as the separator in strjoin() gives a cleaner result.

Is it possible to merge two datasets where a variable's value in the first is used to select a variable in the second?

I would like to know how to merge two datasets in SAS using a variable's value in the first dataset to select and test a variable in the second dataset.
As an example consider two datasets. The first dataset contains four baby names and the days they were born. The second data set contains three doctors and an array of indicator variables noting if each doctor worked on a particular day. For example Dr. Smith worked on days 2 and 3 only. I would like to create a dataset that lists all the possible baby-doctor combinations where the doctor was working on the day the baby was born.
data babies;
input baby_name $ birth_day;
datalines;
Jake 1
Sonny 4
North 5
Apple 6
;
run;
data doctors;
input DrLastname $ day1 day2 day3 day4 day5 day6;
datalines;
Jones 1 0 0 1 1 1
Smith 0 1 1 0 0 0
Lewis 1 1 1 0 0 0
;
run;
The solution seems like it should be something like this
proc sql;
create table merged as
select babies.*, doctors.*
from babies, doctors
where doctors.day(babies.birth_day) = 1; *<--- incorrect;
quit;
The output should be:
baby_name birth_day DrLastName
Jake 1 Jones
Jake 1 Lewis
Sonny 4 Jones
North 5 Jones
Apple 6 Jones
I have run into this problem a few times and would love to know if this kind of merge is possible in SAS. Thanks for any help you can provide.
While I probably would also transpose the dataset, it is possible to do so without transposing.
data babies_doctors;
set babies;
do _i = 1 to nobs_doctors;
set doctors point=_i nobs=nobs_doctors;
array days day1-day6;
if days[birth_Day] then output;
end;
run;
This will not be fast, as it checks all rows in the dataset, but it's possible.
Fastest is probably to load it into a vertical hash table (which you could do easily; see the sketch after the array version below) or a temporary array.
data babies_doctors_array;
array drnames[32767] $80 _temporary_;
array drdays[32767,6] _temporary_;
if _n_=1 then do;
do _i = 1 to nobs_doctors;
set doctors point=_i nobs=nobs_doctors;
array days day1-day6;
drnames[_i]=DrLastname;
do _j = 1 to dim(days);
drdays[_i,_j]=days[_j];
end;
end;
end;
set babies;
do _k = 1 to nobs_doctors;
if drdays[_k,birth_day]=1 then do;
baby_drlastname = drnames[_k];
output;
end;
end;
run;
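The hash option mentioned above isn't shown, so here is a rough, untested sketch of what a multidata hash keyed by day could look like (requires a SAS release with multidata hash support; the dataset and variable names are assumptions carried over from above):
data babies_doctors_hash;
if _n_ = 1 then do;
/* build a hash keyed by day, one entry per doctor working that day */
declare hash h(multidata:'y');
h.defineKey('day');
h.defineData('DrLastname');
h.defineDone();
do _i = 1 to nobs_doctors;
set doctors point=_i nobs=nobs_doctors;
array days day1-day6;
do day = 1 to dim(days);
if days[day] then rc = h.add();
end;
end;
end;
set babies;
day = birth_day;
/* loop over every doctor stored under this day */
rc = h.find();
do while (rc = 0);
output;
rc = h.find_next();
end;
drop day1-day6 day rc _i;
run;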
I might shift the second dataset and then merge on day.
Something like this (untested):
data new_1 new_2 new_3 new_4 new_5 new_6;
set doctors;
array days day1-day6;
do i = 1 to dim(days);
if days[i] = 1 then do;
day = i;
/* route the row to the data set for this day */
select (i);
when (1) output new_1;
when (2) output new_2;
when (3) output new_3;
when (4) output new_4;
when (5) output new_5;
when (6) output new_6;
end;
end;
end;
drop day1-day6 i;
run;
data stacked;
set new_1-new_6;
run;
Then simply merge based on the field day.
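A sketch of that last step, done here with PROC SQL rather than a data step MERGE because several doctors can share a day (assumes the stacked dataset built above, with variables DrLastname and day):
proc sql;
create table babies_doctors_join as
select b.baby_name, b.birth_day, s.DrLastname
from babies b
inner join stacked s
on b.birth_day = s.day
order by b.birth_day, s.DrLastname;
quit;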

MONYY7. and DATE9. operations

I'm working on a very big data set (more than 100 variables and 11 million observations). In this data set, I have a variable named DTDSI (simulation date) in DATE9. format (for example: 01APR2015, 02MAR2015...). I have a macro program to analyse this data set by comparing the observations in 2 different months:
%macro analysis (data_input , m , m_1);
.....
%mend;
The 2 macro variables m and m_1 are the months that I want to compare. Their format is MONYY7. (APR2015, MAR2015...). Keep in mind that I cannot modify my data_input (it is my company's data). At the beginning of my macro program, I want to create a new data set with only the observations of the &m and &m_1 months. I can easily do that by creating a new date variable from DTDSI (real_month, for example) in the MONYY7. format, and then selecting the observations where real_month equals &m or real_month equals &m_1:
Data new;
Set &data_input;
mois_real = put(DTDSI, MONYY7.);
RUN;
PROC SQL;
CREATE TABLE NEW AS
SELECT *
FROM NEW
WHERE mois_real in ("&m" , "&m_1");
....
The problem is that in my first data step I duplicated my data_input, which is bad because it took 30 minutes. So can you tell me how I can make my selection (keeping only the observations whose DTDSI falls in month &m or &m_1) right in that first step?
You can use formulas in your where/if condition, so apply the formula from step 1 in step 2, or vice versa.
Data new;
set &data_input;
WHERE put(DTDSI, MONYY7.) in ("&m" , "&m_1");
run;
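The "vice versa" direction also works: turn the macro values into date constants once and compare the raw DATE9. variable directly. A sketch, assuming &m and &m_1 resolve to values like APR2015:
Data new;
set &data_input;
/* first day of DTDSI's month compared against date literals built from the macro values */
WHERE intnx('month', DTDSI, 0, 'b') in ("01&m"d , "01&m_1"d);
run;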

Perl + PostgreSQL-- Selective Column to Row Transpose

I'm trying to find a way to use Perl to further process a PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to take certain columns (Realtime, Value) in a file and concatenate their values into one row per group, while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns)
CREATE TEMP TABLE output AS SELECT
id, cat, ARRAY_AGG(realtime) as time, ARRAY_AGG(value) as values
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
, cat
, string_agg(realtime, ',') AS realtimes
, string_agg(value::text, ',') AS values
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values.
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.