SAS: keep format when merging

I am trying to merge three datasets (A1, A2, A3) with different formats. The datasets are all simple one-row, two-column datasets. However, once I merge them, the individual formats disappear and everything shows the format of A1. Does anyone know how to keep the formats when merging?
DATA A1;
FORMAT _NUMERIC_ 7.1;
RUN;
DATA A2;
SET A0;
RUN;
DATA A3;
FORMAT a b PERCENT7.1;
RUN;
DATA A4; SET A1 A2 A3;
RUN;
Thank you in advance.

The Program Data Vector (PDV) is the storage area that SAS constructs automatically when it compiles and runs a DATA step.
Each item in the PDV corresponds to a variable and its attendant metadata (type, length, format, informat, label, etc.).
The variables in the data sets named in a SET statement are automatically added to the PDV (along with the variables' metadata). Once an item is in the PDV, later data sets in the same SET statement will NOT overwrite existing PDV items.
Your sample code is a bit flawed: A1 has no variables, the A2 step has an error, and only A3 is OK. Because of these flaws, the A4 variables will inherit the A3 metadata.
Given your SET statement, if A1 had a variable X, then X in data set A4 would inherit the A1.X metadata.
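To make the inheritance rule above concrete, here is a minimal sketch. It assumes both inputs contain a numeric variable X (the value 0.1234 and the variable name are illustrative, not from the original question). Because A1 is listed first in the SET statement and its X already has a format, A4.X should show format 7.1 in the PROC CONTENTS output.

```sas
/* Hypothetical inputs: both data sets have a numeric variable X */
data A1;
    x = 0.1234;
    format x 7.1;
run;

data A3;
    x = 0.1234;
    format x percent7.1;
run;

/* X enters the PDV from A1 first, so A1's format is the one that sticks */
data A4;
    set A1 A3;
run;

/* Inspect the format attached to X in the combined data set */
proc contents data=A4;
run;
```

Reversing the order to `set A3 A1;` would be one way to see the PERCENT7.1 format win instead.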

FORMATS are attached to variables, not observations. So you cannot have both 7.1 and PERCENT7.1 attached to the variable A in the dataset A4. One variable can only have one format permanently attached to it. But note that it does not have to have any format permanently attached to it.
When you combine datasets that have variables in common, SAS defines each variable (that is, sets its type and length) based on how it is defined in the first dataset being combined that contains that variable.
For other variable attributes (FORMAT, INFORMAT, LABEL) it is a little more complex. Those will be set to the first non-empty value that is seen for the variable, even if that is not the first time the variable is seen. And those attributes of the variable can be overridden by explicit assignments in the data step. For numeric variables you can even override the LENGTH attribute for the variable since it only applies to how the variable is stored in the output dataset.
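Since a variable can carry only one format, one possible workaround is to store each source's formatted value as text while combining, and to set the numeric variable's format explicitly. The sketch below is only an illustration under assumptions: the data set names PLAIN, PCT and COMBINED, the variable X_TEXT, and the BEST12. choice are all hypothetical.

```sas
/* Hypothetical one-row inputs sharing the variable X */
data plain;
    x = 0.1234;
    format x 7.1;
run;

data pct;
    x = 0.1234;
    format x percent7.1;
run;

data combined;
    length x_text $ 12;
    set plain(in=from_plain) pct;
    /* Keep each source's display as text, since X itself has one format */
    if from_plain then x_text = put(x, 7.1);
    else x_text = put(x, percent7.1);
    /* An explicit FORMAT statement overrides the format inherited from PLAIN */
    format x best12.;
run;

proc print data=combined;
run;
```

The character copy is for display only; keep the numeric X for any calculations.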

Related

How to combine two datasets into one in SAS

I have some SAS code from my editor here. I am learning to use SAS (this is my first time using it), so I'm not sure how much code is relevant.
proc import
datafile="C:\Users\barnedsm\Desktop\SAS\ToothGrowth.csv"
dbms=csv
out=tooth;
proc print data=tooth (obs=5);
run;
6. create two SAS data sets ToothGrowth_OJ and ToothGrowth_VC for the animals with the
delivery method orange juice and ascorbic acid, respectively. (5 points)
data ToothGrowth_OJ;
set tooth;
where (supp="OJ");
proc print data=ToothGrowth_OJ (obs=5);
run;
data ToothGrowth_VC;
set tooth;
where (supp="VC");
proc print data=ToothGrowth_VC (obs=5);
run;
7. save the two SAS data sets in a permanent folder on your computer. (5 points)
libname mylibr "C:\Users\barnedsm\Desktop\SAS";
data mylibr.ToothGrowth_OJ_permanent;
set ToothGrowth_OJ;
run;
libname mylibr "C:\Users\barnedsm\Desktop\SAS";
data mylibr.ToothGrowth_VC_permanent;
set ToothGrowth_VC;
run;
For the final question on my assignment, I am wanting to re-combine the last two datasets I made (ToothGrowth_OJ and ToothGrowth_VC) into one dataset (ToothGrowth_combined). How would I do this? My thoughts would be to use a subset function like I used to separate the two. The code I have in mind is below.
data ToothGrowth_combined;
set ToothGrowth_OJ(where=(supp="OJ"));
keep supp Len;
run;
This would tell SAS to keep the values from the ToothGrowth_OJ dataset that have OJ in the "supp" column (which is all of them) and to keep the variable Len. Assuming this code is correct, I want to add in the values from my ToothGrowth_VC dataset in a similar way, but the output is an empty dataset when I run the same code with "ToothGrowth_OJ" replaced by "ToothGrowth_VC". Is there a way to use the subset code to take these two separate datasets and combine them into one, or is there an easier way?
Your starting code does these steps:
Uses PROC IMPORT to guess how to read the text file into a dataset.
Creates a subset of the data with only some of the observations.
Creates a second subset of the data.
To recombine the two subsets use the SET statement and list all of the input datasets you want. To limit the number of variables written to the output dataset use a KEEP statement.
data ToothGrowth_combined;
set ToothGrowth_OJ ToothGrowth_VC ;
keep supp Len;
run;
I am not sure why you added the WHERE= dataset option in your code attempt, since, given the way they were created, each dataset only has observations with a single value of SUPP.
If you want to combine the permanent datasets instead (for example if you started a new SAS session with an empty WORK library) then use those dataset names instead in the SET. Just make sure the libref that points to them is defined in this SAS session.
libname mylibr "C:\Users\barnedsm\Desktop\SAS";
data ToothGrowth_combined;
set mylibr.ToothGrowth_OJ_permanent mylibr.ToothGrowth_VC_permanent;
keep supp Len;
run;

SAS - how can I read in date data?

I am trying to read in some data in date format and the solution is eluding me. Here are four of my tries using the simplest self-contained examples I could devise. (And the site is making me boost my text-to-code ratio in order for this to post, so please ignore this sentence).
*EDIT - my example was too simplistic. I have spaces in my variables, so I do need to specify positions (the original answer said to ignore positions entirely). The solution below works, but the date variable is not a date.
data clinical;
input
name $ 1-13
visit_date $ 14-23
group $ 25
;
datalines;
John Turner  03/12/1998 D
Mary Jones   04/15/2008 P
Joe Sims     11/30/2009 J
;
run;
No need to specify the lengths. datalines already assumes space-delimited values. A simple way to specify an informat is to use a : after each input variable.
data clinical;
input ID$ visit_date:mmddyy10. group$;
format visit_date mmddyy10.; * Make the date look human-readable;
datalines;
01 03/12/1998 D
02 04/15/2008 P
03 11/30/2009 J
;
run;
Output:
ID visit_date group
01 03/12/1998 D
02 04/15/2008 P
03 11/30/2009 J
A friend of mine suggested this, but it seems odd to have to switch syntax markedly depending on whether the variable is a date or not.
data clinical; 
input
name $ 1-12
@13 visit_date MMDDYY10.
group $ 25 ;
datalines;
John Turner 03/12/1998 D
Mary Jones  04/15/2008 P
Joe Sims    11/30/2009 J
;
run;
SAS provides a lot of different ways to input data, just depending on what you want to do.
Column input, which is what you start with, is appropriate when this is true:
To read with column input, data values must have these attributes:
appear in the same columns in all the input data records
consist of standard numeric form or character form
Your data does not meet this in the visit_date column. So, you need to use something else.
Formatted input is appropriate to use when you want these features:
With formatted input, an informat follows a variable name and defines how SAS reads the values of this variable. An informat gives the data type and the field width of an input value. Informats also read data that is stored in nonstandard form, such as packed decimal, or numbers that contain special characters such as commas.
Your visit_date column matches this requirement, as you have a specific informat (mmddyy10.) you would like to use to read in the data into date format.
List input would also work, especially in modified list format, in some cases, though in your example of course it wouldn't due to the spaces in the name. Here's when you might want to use it:
List input requires that you specify the variable names in the INPUT statement in the same order that the fields appear in the input data records. SAS scans the data line to locate the next value but ignores additional intervening blanks. List input does not require that the data is located in specific columns. However, you must separate each value from the next by at least one blank unless the delimiter between values is changed. By default, the delimiter for data values is one blank space or the end of the input record. List input does not skip over any data values to read subsequent values, but it can ignore all values after a given point in the data record. However, pointer controls enable you to change the order that the data values are read.
(For completeness, there is also Named input, though that's more rare to see, and not helpful here.)
You can mix Column and Formatted inputs, but you don't want to mix List input as it doesn't have the same concept of pointer control exactly so it can be easy to end up with something you don't want. In general, you should use the input type that's appropriate to your data - use Column input if your data is all text/regular numerics, use formatted input if you have particular formats for your data.
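As a sketch of mixing column and formatted input, the step below reads the name by column position and the date with an informat. It assumes the data lines are aligned so that the name occupies columns 1-13, the date starts in column 14, and the group code sits in column 25; adjust the pointers if your file differs.

```sas
data clinical;
    input
        name $ 1-13                 /* column input: plain character data */
        @14 visit_date mmddyy10.    /* formatted input: move to column 14, read a date */
        @25 group $1.;              /* formatted input: one character at column 25 */
    format visit_date mmddyy10.;    /* display the stored date value readably */
    datalines;
John Turner  03/12/1998 D
Mary Jones   04/15/2008 P
Joe Sims     11/30/2009 J
;
run;

proc print data=clinical;
run;
```

With this layout visit_date is stored as a numeric SAS date, so it can be sorted and used in date arithmetic, unlike the character version read with `$ 14-23`.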

Using SELECT IF to recode one date from four variables in SPSS

Self-taught at SPSS here. Need to know the appropriate syntax to recode four DATE variables into one, based on which would be the latest date. I have four DATE variables in a dataset with 165 cases:
wnd_heal_date
wnd_heal_d14_date
wnd_heal_d30_date
wnd_heal_3m_date
And each variable may or may not contain a value for each case. I want to recode a new variable which scans the dates from all four and only selects the one that is the latest and puts it into a new variable (x_final_wound_heal_date).
How to use the SELECT IF function for this purpose?
SELECT IF selects rows (cases) in the data, so it is not appropriate for this case. What you can do instead is this:
compute x_final_wound_heal_date =
max(wnd_heal_date, wnd_heal_d14_date, wnd_heal_d30_date, wnd_heal_3m_date).
VARIABLE LABELS x_final_wound_heal_date 'Time to definitive wound healing (days)'.
VARIABLE LEVEL x_final_wound_heal_date(SCALE).
ALTER TYPE x_final_wound_heal_date(DATE11).
This will put the latest of available date values in the new variable.

In a data flow task, how do I restrict rows flowing using a value from another source?

I have an excel sheet with many tabs. Say one is called wsMain and the other is called wsDate.
In my data flow transformation I am able to successfully load the data from wsMain to my table.
Now I have to update this transformation so that it fetches the maximum date from the worksheet wsDate and only loads data from wsMain where the date is less than or equal to the maximum date in wsDate (that is the only column available).
So far I have figured out that I need to create a new Excel connection manager to read the data from wsDate, and I have used the Aggregate transformation to get the maximum date.
Now the question is how do I use this date to restrict the rows coming from wsMain?
I understand from the link below that you can store the value in a variable but what do I do next?:
SSIS set result set from data flow to variable
I have tried using a merge join but not sure if I am doing it right.
Here is what it looks like now:
I could not achieve the above but would be interested to know if it is possible. As a workaround I have created a separate data flow where I store the value in a variable and then use the variable in the Conditional Split to filter the required rows:
Here is a step by step guide I followed to write the variable:
https://www.proteanit.com/2008/12/11/ssis-writing-to-a-package-variable-in-a-dataflow/
You can obtain the maximum value of the wsDate column first, then use it as a filter to avoid introducing unnecessary records into the data flow which would be discarded by the Conditional Split. An overview of this process is below. I'd also recommend confirming the data types for all columns involved.
Create an SSIS DateTime variable and name this something descriptive such as MaxDate.
Create a Data Flow Task before the current one with an Excel Source component. Use the SQL command option for the Data Access Mode and enter a SQL statement to return the max value of the wsDate column. In the following example ExcelSource is the name of the sheet that you're pulling from. I'd suggest confirming the query with the Preview button on the Excel Source as well.
Add a Script Component (not Task) after the Excel Source. Add the MaxDate variable in the ReadWriteVariables field on the main page of the Script Component. On the Inputs and Outputs pane add the output column from the Excel Source as an Input Column with the ReadOnly usage type. Example C# code for this is below. Note that variables can only be written to in the PostExecute method. The Input0_ProcessInputRow method is called once for each row that passes through; however, there will only be a single row in this case. In the following code MaxExcelDate is the name of the output column from the Excel Source.
On the Excel Source component in the Data Flow Task where the records are imported from Excel, change the Data Access Mode to SQL command and enter a SQL statement to return records that have a date less than or equal to the maximum wsDate value. This is the last example and the ? is a placeholder for the parameter. After entering this SQL, click the Parameters button and select Parameter0 for the Parameters field, the MaxDate variable for Variables field, and a direction of Input. The Conditional Split can then be removed since these records will now be filtered out.
Excel MAX wsDate SELECT:
SELECT MAX(wsDate) AS MaxExcelDate FROM ExcelSource
C# Script Component:
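DateTime maxDate;

public override void PostExecute()
{
    base.PostExecute();
    // Package variables can only be written in PostExecute.
    Variables.MaxDate = maxDate;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Called once per row; the MAX query returns a single row.
    maxDate = Row.MaxExcelDate;
}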
Excel Command with Date Filter:
SELECT
Column1,
Column2,
Column3
FROM ExcelSheet
WHERE DateColumn <= ?
Yes, it is possible. In the data flow, you will need to determine the max date, which you already have. Next, you will need to MERGE JOIN the two data flows on the date column. From there, you will feed it into a CONDITIONAL SPLIT and split where the date columns match [i.e., !ISNULL()] versus do not match [i.e., ISNULL()]. In your case, you only want the matches. The non-matches will be disregarded.
Note: if you use an INNER JOIN on the MERGE JOIN where there is only one date (i.e., MaxDate) to join on, then this will take care of the row filtering for you. You will not need a CONDITIONAL SPLIT.
Welcome to ETL.
Update
It is a real pain that SSIS's MERGE JOINs only perform joins on EQUAL operations as opposed to LESS THAN and GREATER THAN operations. You will need to separate the data flows.
Use a script component to scan the Excel file for the MAX Date and assign that value to a package variable in SSIS. Alternatively, you can have a dates table in SQL Server and then use an Execute SQL Command in SSIS to retrieve the MAX Date from the table and assign that value to a package variable.
Modify your existing data flow to remove the reading of the Excel date file completely. Then add a DERIVED COLUMN transformation and add a new column that is mapped to the package variable in SSIS that stores the MAX date. You can name the Derived Column Name 'MaxDate'.
Add a conditional split transformation with the following CONDITION logic: [AsOfDt] <= [MaxDate]
Set the Output Name to Insert Records
Note: The CONDITIONAL SPLIT creates a new output data flow with restricted/filtered rows. It does not create a new column within the existing data flow. Think of this as a transposition of data flow output from column modification to row modification. Only those rows that match the condition will be sent to the output that you desire. I assume you only want to Insert these records, so I named it that. You can choose whatever naming convention you prefer
Note 2: Sorry for not making the Update my original answer - I haven't used the AGGREGATE transformation before so I was not aware that it restricts row output as opposed to reading a value in the data flow and then assigning it to a variable. That would be a terrific transformation for Microsoft to add to SSIS. It appears that the ROWCOUNT and SCRIPT COMPONENT transformations are the only ones that have the ability to set a package variable value within the data flow.

MATLAB: comparing values of one row

I have a row of recurring values and I want to compare them with one another. For example, the first value is compared with the whole row and if there is a match it's saved in one variable; if not, then a new variable is created. Similarly, the second value will be compared, and so on.
I can't seem to work it out. Example:
A=[5.34
1.20
4.54
3.68
5.34
4.54]
I want the same values to be stored in one variable, the next value (like 1.20) to be stored in another variable, and so on.