Homework problem. Here are the questions:
a. Merge product_list and supplier by
Supplier_ID to create a new data set, work.prodsup.
b. Submit the program and confirm that work.prodsup was created with 556
observations.
c. Modify the DATA step to output only those
observations that are in product_list but not supplier.
Parts A and B are done, but part C is what I'm having trouble with.
I had to sort product_list first:
proc sort data=hw2.product_list;
by Supplier_ID;
run;
data work.prodsup;
merge hw2.product_list hw2.supplier;
by Supplier_ID;
run;
What do I need to modify the output so that it only includes observations that are in one dataset but not the other?
You can add selection criteria by adding the IN= dataset option (in=x) to the datasets in your MERGE statement:
data work.prodsup;
merge hw2.product_list(in=a) hw2.supplier(in=b);
by Supplier_ID;
if a and not b;
run;
This is what you want, but you can also do neat tricks like left joins, often faster than in PROC SQL.
if a; /*Left join*/
if a and b; /*Inner join*/
if b; /*Right join*/
See more on the IN= option with MERGE here: https://onlinecourses.science.psu.edu/stat481/node/18
I am asking for help on the following topic. I am trying to create an ETL process with two Excel data sources (S1, ~300 rows, and S2, ~7,000 rows). S1 contains project information and employee details, and S2 contains the number of hours each employee worked on each project at a given timestamp.
I want to insert the number of hours each employee worked on each project at a timestamp into the fact table, referencing the existing primary keys in the dimension tables. If an entry is not already present in the dimension tables, I want to add a new entry first and use the newly generated id. The destination table structure (data warehouse, star schema) is shown in the image "Destination Table Structure".
In SSIS, I created three Data Flow Tasks to first fill the dimension tables (project, employee and time) with distinct values (using GROUP BY, as S1 and S2 contain a lot of duplicate rows), and a fourth Data Flow Task (see image below) to insert the fact table data, and this is where I'm running into problems:
(Image: Data Flow Task FactTable)
I am using three Lookup components to retrieve the foreign keys project_id, employee_id and time_id from the dimension tables (using project name, employee number and timestamp). If the id is found, it is passed on all the way to Merge Join 1; if not, a new dimension entry is created (let's say a project) and the generated project_id is passed on instead. The same goes for employee and time, respectively.
There are two issues with this:
1) The "amount of hours" (passed by Multicast four, see image above) is not matched in the final result (No Match).
2) The number of rows being inserted keeps increasing forever (an endless join, I believe due to the Merge Joins).
What I've tried:
I have used one UNION instead of three Merge Joins before, but this resulted in the foreign keys each ending up in separate rows instead of being merged together.
I used Merge (instead of Merge Join) and combined the join and sort conditions in every way I could think of.
I understand that this scenario might be confusing for everybody else, but thank you for taking the time to look at it! Any help is greatly appreciated.
Solved it
For anybody having similar issues:
Separate data flows for filling the dimension tables from those filling the fact tables will do the trick.
It's a clean solution and easier to debug.
Also: don't run the Lookups in parallel, but rather one after the other, and pass the attributes along. This saves unnecessary merges as well.
To sum up:
Four Data Flow Tasks: three for filling the dimension tables ONLY, and one for filling the fact tables ONLY.
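For reference, the fact-table load that the fourth data flow performs corresponds roughly to this set-based SQL sketch (the staging table and column names such as stg_hours and work_ts are assumptions, not the actual package objects):
-- Hypothetical names: stg_hours is the staged S2 data; dim_project, dim_employee
-- and dim_time are the dimension tables loaded by the first three data flows.
INSERT INTO FactTable (project_id, employee_id, time_id, hours)
SELECT p.project_id,
       e.employee_id,
       t.time_id,
       s.hours
FROM stg_hours s
JOIN dim_project  p ON p.project_name    = s.project_name
JOIN dim_employee e ON e.employee_number = s.employee_number
JOIN dim_time     t ON t.work_ts         = s.work_ts;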
Loading Multiple Tables using SSIS keeping foreign key relationships
The answer posted by onupdatecascade is basically it.
Good luck!
I am a novice SAS programmer.
I have a question about merging two datasets.
The two data sets look like this (please see the linked image of the Excel sheet):
Please let me know the key concepts or code to make this happen!
I have searched for the answer through Googling etc., but there is no site that solves exactly what I want.
(Ideally, I'd like to tackle the above question without PROC SQL.)
To get the desired result you should do a Cartesian product (cross join), which returns all the rows in all tables: each row in table1 is paired with all the rows in table2. I have used PROC SQL to do this, and I am eager to see how this can be done using a data step. Here's what I know:
Proc Sql;
create table test_merge as
select a.*, b.type_rhs, b.rhs1, b.rhs2
from test a, test11 b
where a.yearmonth=b.yearmonth
;
quit;
Again, I am new to SAS as well, and I think this is one of the ways to create the desired output.
When working with huge data, you will see a note in the log that says "The execution of this query involves performing one or more Cartesian product joins that can not be optimized."
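For completeness, here is one data-step way to produce the same pairing (a sketch, assuming the same test and test11 dataset names as above; it rescans test11 for every row of test, so it is not efficient for large data):
data test_merge_ds;
  set test;                                  /* read each row of test */
  do i = 1 to nobs11;                        /* scan every row of test11 */
    set test11(rename=(yearmonth=ym11)) point=i nobs=nobs11;
    if yearmonth = ym11 then output;         /* keep each matching pair */
  end;
  drop ym11;                                 /* the POINT= variable i is not written out */
run;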
I am a little bit confused about merging in SAS. For example, when people use the MERGE statement, it is sometimes followed by (in=a) or (in=b). What does that do exactly?
To elaborate more on vknowles' answer, in=a and in=b are useful in different types of merges. Let's say we have the following data step:
data inner left right outer;
merge have1(in=a) have2(in=b);
by ...;
if a and b then output inner;
else if a and not b then output left;
else if not a and b then output right;
else if not (a or b) then output outer;
run;
The data step will create four different datasets, which are the basis of an inner join, a left join, a right join, and an outer join.
The statement if a and b then output inner; will output only records whose key is found in both have1 and have2, which is the equivalent of a SQL inner join.
else if a and not b then output left; will output only the records that occur in the have1 dataset and not in have2. This is the equivalent of a left outer join in SQL. If you wanted a full left join, you could either append the left dataset to the inner dataset or just change the statement to if (a and b) or (a and not b) then output left (see the sketch after this explanation).
The third else if is just the opposite of the previous. Here you can perform a right join on the data.
The last else if will output to the outer dataset which is the equivalent of an outer join. This is useful for debugging purposes as the key is unique to each dataset. Thanks to Robert for this addition.
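As a concrete sketch, that full left join reduces to checking a alone (same have1/have2 names as above; the BY variable id is made up):
data leftjoin;
  merge have1(in=a) have2(in=b);
  by id;   /* hypothetical key variable */
  if a;    /* equivalent to (a and b) or (a and not b): keep every have1 record */
run;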
When you see a dataset referenced in SAS with () following it, the items inside are called dataset options. The IN= dataset option is valid when reading datasets with statements such as SET, MERGE, and UPDATE. The word following IN= names a variable that SAS will create; it is true (1) when that dataset contributes to the current observation and false (0) otherwise.
A good example would be if you want to use one data set to subset another. For example, you might want to merge on data from a master lookup table to add an extra variable like an address or account type, but not add in every id from the lookup table:
data want;
merge my_data(in=in1) master_lookup (in=in2);
by id;
if in1 ;  /* keep only ids that appear in my_data */
run;
Or if you are stacking or interleaving data from more than one table and want to take action depending on which table the current record comes from:
data want;
set one(in=in1) two(in=in2);
by id;
if in1 then source='ONE';
if in2 then source='TWO';
run;
Let's say you have MERGE A (in=INA) B (in=INB);
When merging two datasets with a BY statement, you can have the following situations:
Dataset A has an observation with a given by-value and dataset B does not. In this case, INA will be true and INB will be false.
Dataset A and B both have an observation with a given by-value. In this case, INA and INB will be true.
[Edited to correct error] Datasets A and B have different numbers of observations with a given by-value. Let's say A has more than B. Even after B runs out of observations, both INA and INB will be true. If you want to know whether B still has observations, you need something like the following code. As @Tom pointed out, you often want to know which dataset is contributing observations.
data want;
ina=0;
inb=0;
merge a (in=ina) b (in=inb);
by mybyvariable;
...
run;
The above code takes advantage of the fact that, within a by-group, SAS retains the values last contributed by the dataset with the smaller number of observations. If you reinitialize any of the variables before the MERGE statement and no new observation is read, they keep their reinitialized values; but if there is a new observation, SAS will overwrite them with the newly read values (see the sketch after this list). I hope that makes sense.
Dataset A does not have an observation with a given by-value and B does. In this case, INA will be false and INB will be true.
Note that this does not cover the situation of merging without a BY statement.
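As a concrete sketch of that reset trick (the flag name new_from_b is made up; it is 1 only on iterations where B actually supplies a fresh observation):
data want;
  ina = 0;
  inb = 0;            /* reset before the MERGE executes on this iteration */
  merge a (in=ina) b (in=inb);
  by mybyvariable;
  new_from_b = inb;   /* 0 means B's values are merely retained from its last observation */
run;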
While merging two datasets using the merge statement, is it fine to subset the output dataset while it is being created?
In a nutshell, which of the two approaches is better?
A)
data merge_output;
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb;
if some_other_column eq 'Y' then output merge_output;
else delete;
run;
B)
data merge_output (where = (some_other_column = 'Y'));
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb;
run;
In my experience, I have seen a situation where using approach A led to an erroneous merge, whereas approach B was a sure-shot success. I was trying to explain this to a wider team, but could not find any documentation.
I believe that deleting rows or subsetting the dataset while it is being created in a MERGE somehow screws up the merge process running in the background. Can someone help me with the explanation or the correct answer?
Neither! If you must use a data step, then the WHERE= data set option should be applied earlier rather than later. I have assumed that some_other_column is already in merge_input1. If it is in the other table, then move/copy the option as appropriate.
The WHERE= data set option and the WHERE statement filter out unwanted rows of data. This reduces processing, because unwanted rows never enter the PDV.
data merge_output ;
merge
merge_input1 (in = ina where = (some_other_column = 'Y'))
merge_input2 (in = inb)
;
by some_column;
if ina and inb;
run;
Using the data step to perform merges, as you point out, risks unexpected results: filtering can become implicitly entangled with the merge processing, leading to unintended results. SQL is far less risky because it is explicit. You define what you want and the SQL engine determines the best way to get it.
proc sql;
create table merge_output as
select *
from merge_input1
inner join
merge_input2
on merge_input1.some_column eq merge_input2.some_column
where some_other_column eq 'Y'
;
quit;
I believe the "sure shot success" you mention would be:
data merge_output;
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb and some_other_column eq 'Y';
run;
In 14 years of working with SAS, I believe I have never used the WHERE= option on an output data set.
The IF statement used above (without THEN) is called a subsetting IF, but it is not really subsetting the output (like some post-action); it simply stops some input records from continuing through the rest of the data step and reaching the output data set.
Regarding option A), using the DELETE statement perhaps says more clearly what you are doing than the IF statement, and it can be used without an OUTPUT statement, so you can also be more explicit about what you are doing, like this:
data merge_output;
merge
merge_input1 (in = ina)
merge_input2 (in = inb)
;
by some_column;
if ina and inb; /* "inner join" */
if some_other_column ne 'Y' then delete; /* subset */
run;
In my experience, it is the OUTPUT statement that often leads to trouble: it is easy to forget to add it to every IF..THEN..ELSE branch needed to get the expected results. The rule is that once you use an OUTPUT statement, there is no automatic output of records at the end of the data step, so you have to take care of all the records you need.
So I try to use the OUTPUT statement only when writing to multiple output data sets.
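For example, here is a sketch of that multiple-output pattern, reusing the same inputs (the dataset names matched and unmatched are made up):
data matched unmatched;
  merge merge_input1 (in = ina) merge_input2 (in = inb);
  by some_column;
  if ina and inb then output matched;   /* explicit OUTPUT: nothing is written automatically */
  else output unmatched;
run;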
I have a table person in my PostgreSQL database, which contains data for different users.
I need to write a test case, which ensures that some routine does modify the data of user 1, and does not modify data of user 2.
For this purpose, I need to
a) calculate a hash code of all rows of user 1 and those of user 2,
b) then perform the operation under test,
c) calculate the hash code again and
d) compare hash codes from steps a) and c).
I found a way to calculate the hash code for a single row:
SELECT md5(CAST((f.*)AS text))
FROM person f;
In order to achieve my goal (find out whether rows of user 2 have been changed), I need to perform a query like this:
SELECT user_id, SOME_AGGREGATE_FUNCTION(md5(CAST((f.*)AS text)))
FROM person f
GROUP BY user_id;
What aggregate function can I use in order to calculate the hash code of a set of rows?
Note: I just want to know whether any rows of user 2 have been changed. I do not want to know what exactly has changed.
The simplest way is to just concatenate all the md5 strings with string_agg. But to use this aggregate correctly you must specify ORDER BY.
Or use md5(string_agg(md5(CAST((f.*) AS text)), '')) with some ORDER BY - it will change if any field of f.* changes, and it is cheap to compare.
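Putting the two suggestions above together, a sketch of the full query might look like this (it assumes the person table has an id primary key column to order by; any stable unique ordering works):
SELECT user_id,
       md5(string_agg(md5(CAST((f.*) AS text)), '' ORDER BY f.id)) AS rows_hash
FROM person f
GROUP BY user_id;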
An even simpler way to hash a whole row:
SELECT user_id, md5(textin(record_out(A))) AS hash
FROM person A;