Using merge in sas - merge

I am a little bit confused about merging in SAS. For example, when people use the merge statement, a dataset name is sometimes followed by (in=a) or (in=b). What does that do exactly?

To elaborate on vknowles's answer, in=a and in=b are useful in different types of merges. Let's say we have the following data step:
data inner left right outer;
merge have1(in=a) have2(in=b);
by ...;
if a and b then output inner;
else if a and not b then output left;
else if not a and b then output right;
else if not (a or b) then output outer;
run;
The data step will create 4 different datasets which are the basis of an inner join, left join, and right join.
The statement if a and b then output inner; will output only records in which the key is found in the datasets have1 and have2 which is the equivalent of a SQL inner join.
else if a and not b then output left; will output only the records that occur in the have1 dataset and not in have2. In SQL terms these are the rows of a left outer join that have no match in have2. If you wanted a full left join you could either append the left dataset to the inner dataset or change the condition to if a (which is the same as (a and b) or (a and not b)) and take it out of the else chain, so that matched records reach left as well.
The third else if is just the opposite of the previous. Here you can perform a right join on the data.
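The last else if writes to the outer dataset. Note that every observation produced by a MERGE comes from at least one of the input datasets, so not (a or b) is never true and this branch mainly serves as a debugging check; for a full outer join, output every record without a condition. Thanks to Robert for this addition.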
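The branching above can be modelled outside SAS. The following is a minimal Python sketch, not SAS itself: it assumes one record per key in each dataset, and the function name classify_merge and the sample keys are made up for illustration.

```python
def classify_merge(have1_keys, have2_keys):
    """Mimic the IN= flags of a two-dataset match-merge, one record per key."""
    result = {"inner": [], "left": [], "right": []}
    for key in sorted(set(have1_keys) | set(have2_keys)):
        a = key in have1_keys  # plays the role of have1(in=a)
        b = key in have2_keys  # plays the role of have2(in=b)
        if a and b:
            result["inner"].append(key)
        elif a and not b:
            result["left"].append(key)
        else:
            # Only "not a and b" is left: a merged key always comes
            # from at least one input, so "not (a or b)" cannot occur.
            result["right"].append(key)
    return result


groups = classify_merge({1, 2, 3}, {2, 3, 4})
print(groups)
```

Here key 1 lands in left, keys 2 and 3 in inner, and key 4 in right; no key can reach a branch where both flags are false.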

When you see a dataset referenced in SAS with () following it the items inside are called dataset options. The IN= dataset option is valid when reading datasets using statements like SET, MERGE, UPDATE. The word following IN= names a variable that SAS will create that will be true (1) when that dataset contributes to the current observation and false (0) otherwise.
A good example would be if you want to use one data set to subset another. For example, if you wanted to merge on data from a master lookup table to add an extra variable like an address or account type, but did not want to add in every id from the lookup table.
data want;
merge my_data(in=in1) master_lookup (in=in2);
by id;
if in1 ;
run;
Or if you are stacking or interleaving data from more than one table and wanted to take action depending on which table this record is from.
data want;
set one(in=in1) two(in=in2);
by id;
if in1 then source='ONE';
if in2 then source='TWO';
run;

Let's say you have MERGE A (in=INA) B (in=INB);
When merging two datasets with a BY statement, you can have the following situations:
Dataset A has an observation with a given by-value and dataset B does not. In this case, INA will be true and INB will be false.
Dataset A and B both have an observation with a given by-value. In this case, INA and INB will be true.
[Edited to correct error] Dataset A and B have different numbers of observations with a given by-value. Let's say A has more than B. Even after B runs out of observations, both INA and INB will be true. If you want to know whether B still has observations, you need something like the following code. As @Tom pointed out, you often want to know which dataset is contributing observations.
data want;
ina=0;
inb=0;
merge a (in=ina) b (in=inb);
by mybyvariable;
...
run;
The above code takes advantage of the fact that SAS retains the variables from the last observation contributed to the by-group by the dataset with the smaller number of observations. If you reinitialize any of the variables before the MERGE statement, and there is no new observation, they will keep their reinitialized values; but if there is a new observation, SAS will reset them. I hope that makes sense.
Dataset A does not have an observation with a given by-value and B does. In this case, INA will be false and INB will be true.
Note that this does not cover the situation of merging without a BY statement.
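To make the unequal by-group case concrete, here is a simplified Python model of the flags, not a SAS emulator. It assumes sorted keys and only tracks, per output observation, the default IN= value and the value you would see after resetting the flag to 0 before the MERGE statement. The function name merge_flags and the sample data are hypothetical.

```python
def merge_flags(a_keys, b_keys):
    """Return (key, ina, inb, ina_reset, inb_reset) for each merged observation.

    ina/inb model the default IN= behaviour; the *_reset values model
    what you see when the flag is set to 0 before the MERGE statement.
    """
    out = []
    for key in sorted(set(a_keys) | set(b_keys)):
        n_a = a_keys.count(key)
        n_b = b_keys.count(key)
        # A by-group produces as many observations as its longer side.
        for i in range(max(n_a, n_b)):
            ina = int(n_a > 0)        # stays true for the whole by-group
            inb = int(n_b > 0)
            ina_reset = int(i < n_a)  # true only while A still supplies new rows
            inb_reset = int(i < n_b)  # true only while B still supplies new rows
            out.append((key, ina, inb, ina_reset, inb_reset))
    return out


rows = merge_flags(["x", "x", "x"], ["x"])
for row in rows:
    print(row)
```

With three A rows and one B row for the same by-value, the default flags are 1 on all three observations, while the reset version of INB drops to 0 from the second observation on.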

Related

Merge SAS datasets but keep only the common observations

Homework problem. Here are the questions:
a. Merge product_list and supplier by
Supplier_ID to create a new data set, work.prodsup.
b. Submit the program and confirm that work.prodsup was created with 556
observations.
c. Modify the DATA step to output only those
observations that are in product_list but not supplier.
Part A and B are done but part C is what I'm having trouble with.
Had to sort product_list first
proc sort data=hw2.product_list;
by Supplier_ID;
run;
data work.prodsup;
merge hw2.product_list hw2.supplier;
by Supplier_ID;
run;
How do I modify the output so that it only includes observations that are in one dataset but not the other?
You can add selection criteria by adding in=X into your merge statement:
data work.prodsup;
merge hw2.product_list(in=a) hw2.supplier(in=b);
by Supplier_ID;
if a and not b;
run;
This is what you want. With the same pattern you can also do neat tricks like left joins, faster than in proc sql:
if a; /*Left join*/
if a and b; /*Inner join*/
if b; /*Right join*/
See more on the IN= option of the MERGE statement here: https://onlinecourses.science.psu.edu/stat481/node/18

Merge two datasets duplicate BY variables Or I want to make following form

I am a novice in SAS program.
I have a question about merging two dataset.
The two data sets look like this (Excel sheet image):
Please let me know key concepts or code to make this happen!
I have searched for the answer through Googling etc., but found no site that solves exactly what I want.
(If it is possible to tackle above question without PROC SQL.)
To get the desired result you need a cross join (Cartesian product) restricted to matching keys: each row in table1 is paired with every row in table2 that has the same yearmonth. I have used Proc SQL to do this and I am eager to see how this can be done using a Data step. Here's what I know:
Proc Sql;
create table test_merge as
select a.*, b.type_rhs, b.rhs1, b.rhs2
from test a, test11 b
where a.yearmonth=b.yearmonth
;
quit;
Again, I am new to SAS as well and I think this is one of the ways to create the desired output.
When working with huge data, you will see a note in log that says "The execution of this query involves performing one or more Cartesian product joins that can not be optimized."
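The pairing that the query performs can be sketched in a few lines of Python. This is only an illustration of the many-to-many logic, not of how PROC SQL executes it; the function name join_on_key, the lhs column, and all sample values are hypothetical, while yearmonth and type_rhs come from the query above.

```python
from collections import defaultdict


def join_on_key(left, right, key):
    """Pair every left row with every right row that shares the same key value."""
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)
    joined = []
    for lrow in left:
        for rrow in right_by_key.get(lrow[key], []):
            # Left columns win on a name clash, like selecting a.* first.
            joined.append({**rrow, **lrow})
    return joined


test = [
    {"yearmonth": 201401, "lhs": "p"},
    {"yearmonth": 201401, "lhs": "q"},
]
test11 = [
    {"yearmonth": 201401, "type_rhs": "A"},
    {"yearmonth": 201401, "type_rhs": "B"},
    {"yearmonth": 201402, "type_rhs": "C"},
]

test_merge = join_on_key(test, test11, "yearmonth")
for row in test_merge:
    print(row)
```

Two rows on each side for the same yearmonth give four joined rows, which is the result a DATA step MERGE does not produce on its own for duplicate BY values on both sides.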

Can subsetting the output dataset mess with the merge data step in SAS?

While merging two datasets using the merge statement, is it fine to subset the output dataset while it is being created?
In a nutshell, which of the two approaches is better?
A)
data merge_output;
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb;
if some_other_column eq 'Y' then output merge_output;
else delete;
run;
B)
data merge_output (where = (some_other_column = 'Y'));
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb;
run;
In my experience, I have seen a situation where using approach A led to erroneous merge, whereas approach B is a sure shot success. I was trying to explain this to a wider team, but could not find any documentation.
I believe that deleting rows or sub-setting the dataset while it is being created in a merge statement somehow screws up the merge process running in the background. Can someone help me with the explanation or the correct answer?
Neither! If you must use a Data step then your WHERE data set option should be used earlier rather than later. I have assumed that some_other_column is already in merge_input1. If it is in the other table then move/copy as is appropriate.
The WHERE data set option and the WHERE statement filter out unwanted rows of data. This reduces processing as unwanted rows are excluded from the PDV.
data merge_output ;
merge
merge_input1 (in = ina where = (some_other_column = 'Y'))
merge_input2 (in = inb)
;
by some_column;
if ina and inb;
run;
Using the Data Step to perform merges as you point out risks unexpected results. Filtering can become implicitly entangled in processing leading to unintended results. SQL is far less risky as it is explicit. You define what you want and the SQL engine will determine the best way to get it.
proc sql;
create table merge_output as
select *
from merge_input1
inner join
merge_input2
on merge_input1.some_column eq merge_input2.some_column
where some_other_column eq 'Y'
;
quit;
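As a sanity check on the idea of filtering early, the sketch below compares filtering the input before an inner join with filtering the joined result afterwards. It is a simplified Python model that assumes unique keys in the second table and that some_other_column lives only in the first table; under other conditions the two orders may not agree. The helper inner_join, the extra column, and the sample rows are made up for illustration.

```python
def inner_join(left, right, key):
    """Inner join on key, assuming each key occurs at most once in right."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**right_by_key[row[key]], **row}
        for row in left
        if row[key] in right_by_key
    ]


merge_input1 = [
    {"some_column": 1, "some_other_column": "Y"},
    {"some_column": 2, "some_other_column": "N"},
    {"some_column": 3, "some_other_column": "Y"},
]
merge_input2 = [
    {"some_column": 1, "extra": "a"},
    {"some_column": 2, "extra": "b"},
]

# Filter the input first (like a WHERE= option on the input dataset).
filter_first = inner_join(
    [r for r in merge_input1 if r["some_other_column"] == "Y"],
    merge_input2,
    "some_column",
)

# Join everything, then filter (like subsetting after the merge).
filter_last = [
    r
    for r in inner_join(merge_input1, merge_input2, "some_column")
    if r["some_other_column"] == "Y"
]

print(filter_first == filter_last)
```

In this simple setting both orders keep only the row with some_column 1; filtering first just means fewer rows go through the join.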
I believe "sure shot success" you mention would be:
data merge_output;
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb and some_other_column eq 'Y';
run;
Working 14 years with SAS, I believe I never used where option on output data set.
The IF statement used above (without THEN) is called subsetting IF, but it's not really subsetting the output (like some post-action), it's simply not allowing some input records to continue through the rest of data step and finally enter output data set.
Regarding option A), using the DELETE statement is maybe more "telling" about what you're doing than the IF statement, and it can be used without an OUTPUT statement, so you can also be more explicit about what you're doing, like this:
data merge_output;
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb; /* "inner join" */
if some_other_column ne 'Y' then delete; /* subset */
run;
In my experience it's OUTPUT statement that often leads to forgetting to add it to all the IF .. THEN .. ELSE.. branches to get expected results. The rule is that once you use OUTPUT statement, there's no automatic output of records at the end of data step, so you have to take care of all the records you need.
So I try to only use OUTPUT statement when using multiple output data sets.

Merge neo4j relationships into one while returning the result if certain condition satisfies

My use case is:
I have to return whole graph in result but the condition is
If there is more than one relationship between two particular nodes in the same direction, then I have to merge them into one relationship. For example, let's say there are two nodes 'm' and 'n' and three relationships between them, say r1, r2, r3 (in the same direction); when I get the result after firing the Cypher query, I should get only one relationship between 'm' and 'n'.
I need to perform some operations on top of it like the resultant relation that we got from merging all the relations should contain the properties and their values that I want to retain. Actually I will retain all the properties of any one of the relations that are merging depending upon the timestamp field that is one of the properties in relation.
Note : I have same properties throughout all my relations (The number of properties and name of properties are same across all relations. Values may differ for sure)
Any help would be appreciated. Thanks in advance.
You mean something like this?
Delete all except the first
MATCH (a)-[r]->(b)
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
Ordering by timestamp first
MATCH (a)-[r]->(b)
WITH a,r,b
ORDER BY r.timestamp DESC
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
If you want to do all those operations virtually just on query results you'd do them in your programming language of choice.
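For the client-side route, a minimal Python sketch could look like the following. It assumes the query results have already been turned into plain dictionaries with start, end, type, and props keys (a made-up shape, not a Neo4j driver API), and it groups by relationship type as the Cypher above does.

```python
def collapse_parallel(rels):
    """Keep, per (start, end, type), the relationship with the newest timestamp."""
    newest = {}
    for rel in rels:
        group = (rel["start"], rel["end"], rel["type"])
        kept = newest.get(group)
        if kept is None or rel["props"]["timestamp"] > kept["props"]["timestamp"]:
            newest[group] = rel
    return list(newest.values())


rels = [
    {"start": "m", "end": "n", "type": "KNOWS", "props": {"timestamp": 1}},
    {"start": "m", "end": "n", "type": "KNOWS", "props": {"timestamp": 3}},
    {"start": "m", "end": "n", "type": "KNOWS", "props": {"timestamp": 2}},
    {"start": "n", "end": "m", "type": "KNOWS", "props": {"timestamp": 5}},
]

merged = collapse_parallel(rels)
for rel in merged:
    print(rel)
```

The three m-to-n relationships collapse into the one with timestamp 3, and the opposite direction is kept separately. Nothing is deleted from the graph; only the returned result is reduced.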

Finding out the hash value of a group of rows

I have a table person in my PostgreSQL database, which contains data of different users.
I need to write a test case, which ensures that some routine does modify the data of user 1, and does not modify data of user 2.
For this purpose, I need to
a) calculate a hash code of all rows of user 1 and those of user 2,
b) then perform the operation under test,
c) calculate the hash code again and
d) compare hash codes from steps a) and c).
I found a way to calculate the hash code for a single row:
SELECT md5(CAST((f.*)AS text))
FROM person f;
In order to achieve my goal (find out whether rows of user 2 have been changed), I need to perform a query like this:
SELECT user_id, SOME_AGGREGATE_FUNCTION(md5(CAST((f.*)AS text)))
FROM person f
GROUP BY user_id;
What aggregate function can I use in order to calculate the hash code of a set of rows?
Note: I just want to know whether any rows of user 2 have been changed. I do not want to know, what exactly has changed.
The simplest way is to concatenate all the md5 strings with string_agg. But to use this aggregate correctly you must specify ORDER BY, because otherwise the concatenation order, and with it the result, is not guaranteed.
Or use md5(string_agg(md5(CAST((f.*)AS text)),'')) with some ORDER BY - it will change if any field of f.* changes and it is cheap to compare.
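The same idea, hashing each row and then hashing the ordered concatenation of the row hashes per user, can be sketched in Python with hashlib. This is only an illustration of the technique: the hash values will not match what PostgreSQL computes, and the function name group_hashes and the sample rows are hypothetical.

```python
import hashlib


def group_hashes(rows, group_key):
    """Return one md5 per group, built from the sorted md5s of the group's rows."""
    per_group = {}
    for row in rows:
        # A stable text form of the row, independent of column order.
        row_text = repr(sorted(row.items()))
        row_hash = hashlib.md5(row_text.encode("utf-8")).hexdigest()
        per_group.setdefault(row[group_key], []).append(row_hash)
    # Sorting the row hashes plays the role of ORDER BY in string_agg.
    return {
        group: hashlib.md5("".join(sorted(hashes)).encode("utf-8")).hexdigest()
        for group, hashes in per_group.items()
    }


before = [
    {"user_id": 1, "name": "Ann", "city": "Oslo"},
    {"user_id": 2, "name": "Bob", "city": "Rome"},
    {"user_id": 2, "name": "Bea", "city": "Bern"},
]
after = [
    {"user_id": 2, "name": "Bea", "city": "Bern"},
    {"user_id": 1, "name": "Ann", "city": "Paris"},
    {"user_id": 2, "name": "Bob", "city": "Rome"},
]

hashes_before = group_hashes(before, "user_id")
hashes_after = group_hashes(after, "user_id")
print(hashes_before[1] != hashes_after[1], hashes_before[2] == hashes_after[2])
```

The hash for user 1 changes because one field changed, while the hash for user 2 stays the same even though its rows arrive in a different order.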
An even simpler way to do it
SELECT user_id, md5(textin(record_out(A))) AS hash
FROM person A