SAS macro to print the first N observations

I need a macro that prints the PROC CONTENTS output and the first k observations of any dataset. It should then calculate some summary statistics of those first k observations, such as mean, max, std and skewness. I'm familiar with SAS but absolutely new to macros, to the point where the documentation is confusing me.
I know that to print a range of observations you can use the OBS= and FIRSTOBS= data set options, which I did in my previous work (this example prints observations 5 through 9):
PROC PRINT DATA = WORK.CA(firstobs= 5 obs= 9);
RUN;
I can apply the same logic to summary statistics by using PROC MEANS to calculate std, mean and so on. But how do I use a macro to make this applicable to any dataset?

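%macro MyMacro(data=, k=, var=);
  /* Describe the data set, listing variables in creation order */
  proc contents data=&data varnum;
  run;
  /* Print the first k observations */
  proc print data=&data (obs=&k);
  run;
  /* Summary statistics computed on the same first k observations */
  proc means data=&data (obs=&k) mean max std skewness;
    var &var;
  run;
%mend MyMacro;

/* Example call */
%MyMacro(data=sashelp.class, k=10, var=Age Height Weight);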

Plot Two Regression Lines on Same Scatter Plot By Year: X-Axis Date MM/DD

I have a scatter plot of calls / time. My x variable is the date (Day/Month) and my Y variable is a number of calls on each date. I would like to plot two regression lines using PROC SGPLOT REG, one for 2019 and one for 2020. However, when I try to do this, all I get is a regular scatter plot with no regression lines. Here is my code:
proc sgplot data=intern.bothphase1;
reg x=date y=count / group=Year;
label count="Calls Per Day" year="Year";
Title "Comparison of EMS Calls per Day 1/1 - 3/31 in 2019 vs. 2020";
run;
The scatter plot comes up without issue (2019 and 2020 values in different colors) but I want to see how the trends differed between the two time periods, so I really want to get the regression lines on there. Can anyone help?
I imagine this has to do with the fact that I concatenated my day and month with a / so it is a character variable and so SAS cannot calculate the regression. I did this so I could use year as a class variable. I still have the original date variable in my table, is there a way I could get SAS to give me the month/day from that as a numeric variable?
Thanks!
EDIT: I used a SAS date value and changed the format to mm/dd, but this doesn't help: the regression lines sit at either end of the graph rather than overlapping (picture attached). This is because SAS dates are stored as the number of days since 1/1/1960. What I want is for mm/dd to correspond to the numbers 1-365, so that I get two overlapping regression lines for the same period in 2019 vs. 2020 showing how the trend changed from one year to the next. Does anyone know how I can do this?
So two steps here: first, you need to generate a "day" value that's 1-365, so let's subtract January 1 of the same year from the date value (and add 1).
data have;
do date = '01JAN2019'd to '31DEC2020'd;
count = 25+2*rand('uniform');
year = year(date);
if month(date) le 3 then output;
end;
format date date9.;
run;
data adjusted;
set have;
date_fixed = date - intnx('year',date,0,'b') + 1; *current date minus jan 1 plus 1 (otherwise off by 1);
format date_fixed date5.; *this does not actually affect the graph axis, oddly;
run;
proc sgplot data=adjusted;
reg x=date_fixed y=count / group=Year;
xaxis valuesformat=date5.; *this seems to be needed for some reason;
label count="Calls Per Day" year="Year";
Title "Comparison of EMS Calls per Day 1/1 - 3/31 in 2019 vs. 2020";
run;
Second, we add the xaxis statement because for some reason the graph won't obey the DATE5. format on its own (you could also use MMDDYY5., as Reeza noted in the comments), so we force it here.
Here is what I get. You can use other axis options to limit things further so that, for example, 01APR doesn't show up.
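The "date minus January 1 plus 1" arithmetic from the first step can be checked outside SAS. The following is a minimal sketch in Python (not SAS, purely for illustration); the function name day_of_year is made up for this example. It also shows one thing to keep in mind: after February 28, the same mm/dd maps to a day number that is one higher in a leap year such as 2020.

```python
from datetime import date


def day_of_year(d: date) -> int:
    """Return 1 for January 1, 2 for January 2, and so on."""
    jan1 = date(d.year, 1, 1)
    # Subtract January 1 of the same year, then add 1 so the count starts at 1
    # (without the +1, January 1 would map to 0).
    return (d - jan1).days + 1


print(day_of_year(date(2019, 1, 1)))   # 1
print(day_of_year(date(2019, 3, 31)))  # 90
print(day_of_year(date(2020, 3, 31)))  # 91, because 2020 is a leap year
```

For the January to March comparison in the question the one-day shift after February 29 is small, but it is worth knowing about when reading the overlaid lines.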

na.approx and na.locf not behaving properly

I'm trying to calculate imputed values for a time series for different countries. This piece of code worked fine before, but now the imputed values are all wrong... I can't figure out the problem; I've tried everything I could think of.
Our rules are:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
If values are missing in the middle of a time series, linear interpolation is used.
# load libraries for data manipulation and imputation
library(dplyr)
library(tidyr)
library(zoo)
# expand table to show NAs
output_table_imp = expand(output_table, transport_mode, year, country_code)
output_table_imp = full_join(output_table_imp, output_table)
# add imputed values
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>% # linear interpolation
  mutate(fatalities_imp = na.locf.default(fatalities_imp, na.rm = FALSE)) %>% # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE)) # missing values at the start of a time series (copy first non-NA value)
My data frame consists of a couple of columns: transport_mode, country_code, year, fatalities. I'm not sure how I can share my data here? It's a large table with 3600 observations ...
These are the original numbers:
And these are the imputed values. You can see straight away that there is a problem for CY, IE and LT.
The data frame looks like this:
Your code looks somewhat overly complicated. I don't know the zoo details, but I'm pretty sure you could get it to work that way too.
With the imputeTS package you could just pass your whole data.frame (it assumes each column is a separate time series) and the package performs imputation for each of these series.
(Unfortunately your question has no data, but I guess this would be your output_table_imp data.frame after expansion.)
Just like this:
library("imputeTS")
na_interpolation(output_table_imp, option = "linear")
Nothing needs to change for the NA treatment at the beginning and at the end either, since your requirements are the default behaviour of the na_interpolation function.
These were your requirements:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
Here is a toy example:
# Test time series with NAs at start, middle, end
test <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
# Perform linear interpolation
na_interpolation(test, option = "linear")
#Results
> 1 1 1 2 3 4 5 6 7 8 8 8
So you see, this works perfectly fine.
It also works with a data.frame (as I said, each column is interpreted as a time series):
# Create three time series and combine them into 1 data.frame
ts1 <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
ts2 <- c(NA,1,1,2,3,NA,3,6,7,8,NA,NA)
ts3 <- c(NA,3,1,2,3,NA,3,6,7,8,NA,NA)
df <- data.frame(ts1,ts2,ts3)
na_interpolation(df, option = "linear")
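To make the three rules from the question concrete, here is a hand-written sketch in Python (not R, and not how imputeTS or zoo implement it). The function name interpolate_linear is invented for this illustration, and it assumes the values are already sorted in time order and belong to a single series.

```python
def interpolate_linear(values):
    """Fill None gaps: linear interpolation inside, nearest known value at the ends."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # nothing to interpolate from
    result = [float(v) if v is not None else None for v in values]
    first, last = known[0], known[-1]
    # Leading gap: copy the first known value backwards.
    for i in range(first):
        result[i] = result[first]
    # Trailing gap: carry the last known value forwards.
    for i in range(last + 1, len(values)):
        result[i] = result[last]
    # Interior gaps: straight line between the two neighbouring known values.
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


test = [None, None, 1, 2, 3, None, None, 6, 7, 8, None, None]
print(interpolate_linear(test))
# [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 8.0, 8.0]
```

The output matches the toy series above. Because the result depends on the order of the values, applying something like this to rows that are not sorted by year would give different numbers.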

The logic of macros

I am new to Stata.
Let's say we have a dataset with train and re78 as variables.
Why does this code work
sum train
local a= r(N)*r(mean)
regress re78 train
outreg2 using TABLE_2.xls, addstat(A, `a') excel
but not this
sum train
local a= r(N)*r(mean)
sum `a'
Both snippets are meant to make use of the local variable a.
In Stata the term variable is reserved for columns in the dataset. Local macros are called that, not local variables.
Why does your second snippet fail? After sum train, the local macro a is calculated as r(N) * r(mean), so it contains the sum (total) of the values of train, as computed by summarize. (You could also just use r(sum).)
Let's suppose that the local macro a then contains 42.
Then
sum `a'
is interpreted as
sum 42
and the problem then has nothing to do with using a local macro. The problem is that summarize has nothing legal to do there. The legal syntax for summarize is either to specify variable names or to specify none at all, which is interpreted as meaning all variables. But 42, or whatever your sum is, fits neither syntax, so it's illegal.
I am not clear what you want this syntax to do, but it is not legal.
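The substitution step described above can be imitated with a few lines of Python (not Stata). This is only a toy illustration of the idea that the macro reference is replaced by its text before the command is interpreted; the function expand_locals is made up for this sketch and does not reflect how Stata's parser is actually implemented.

```python
import re


def expand_locals(command, macros):
    """Replace each `name' reference with the text stored under that name."""
    def substitute(match):
        # An undefined local macro expands to an empty string, as in Stata.
        return macros.get(match.group(1), "")
    # A reference is a backtick, a name, and a closing single quote.
    return re.sub(r"`(\w+)'", substitute, command)


macros = {"a": "42"}  # the value assumed in the explanation above
print(expand_locals("sum `a'", macros))
# sum 42
print(expand_locals("outreg2 using TABLE_2.xls, addstat(A, `a') excel", macros))
# outreg2 using TABLE_2.xls, addstat(A, 42) excel
```

After expansion, the second command is still valid because addstat() accepts a number in that position, while the first is not because summarize expects variable names.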

MATLAB Simple - Linear Predictive Coding and Energy Forecasting

I have a dataset with 274 samples (9 months) of the daily energy (in watt-hours) used by a residential household. I'm not sure if I'm applying the lpc function correctly.
My code is the following:
filename='9-months.csv';
energy = csvread(filename);
C=zeros(5,1);
counter=0;
N=3;
for n=274:-1:31
w2=energy(1:n-1,1);
a=lpc(w2,N);
energy_estimated=0;
for X = 1:N
energy_estimated = energy_estimated + (-a(X+1)*energy(n-X));
end
w_real=energy(n);
error2=abs(w_real-energy_estimated);
counter=counter+1;
C(counter,1)=error2;
end
mean_error=round(mean(C));
With n being the sample under analysis, I use the values of the energy array from 1 to n-1 to calculate the lpc coefficients (with N=3).
The inner "for" loop then applies the calculated coefficients to compute the estimated energy.
Finally, error2 holds the absolute error between the real energy and the estimated value.
The example in the documentation ( http://www.mathworks.com/help/signal/ref/lpc.html ) uses some filters. Do I need to apply any filter here? Is my methodology correct?
Thank you very much in advance!
lpc seems to be used correctly, but there are a few other things to note about your code. I am addressing the part starting at the "for n" loop:
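estimates = zeros(274,1); % preallocate; N = 3 as in the question
for n = 31:274 % to me it seems more logical to go forward in time
    w2 = energy(1:n-1,1);
    a = lpc(w2,N);
    % output sample k of this filter predicts input sample k from the N samples
    % before it, so a placeholder sample is appended to get the prediction for energy(n)
    energy_estimate = filter([0 -a(2:end)],1,[w2; 0]);
    estimates(n) = energy_estimate(end);
end
pred_error = energy(31:274,1) - estimates(31:274); % named pred_error to avoid shadowing the built-in error()
meanerror = mean(pred_error); % you don't really round mean errors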
filter does exactly what you are trying to do with the X=1:N loop, but it performs the calculation for the entire input vector. If you just want the last value, index the result with (end).
There is also no reason to calculate the error for every single value inside the loop and add it to a vector; you can do that faster in one step after the loop.
If you're trying to estimate future values with lpc, it could work like that, but you are implying that every value depends only on the last 3 values. Have you tried something like a polynomial approach? I would think that this would be closer to reality.
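The one-step prediction that the X=1:N loop in the question computes can be written out as a small Python sketch (not MATLAB). The function name predict_next and the coefficient vector below are made up for illustration and are not the output of lpc; the sketch only assumes the same sign convention, where the first coefficient is 1 and the prediction is minus the weighted sum of the previous samples.

```python
def predict_next(history, a):
    """One-step linear prediction: x_hat[n] = -(a[1]*x[n-1] + ... + a[N]*x[n-N])."""
    order = len(a) - 1
    if len(history) < order:
        raise ValueError("need at least as many past samples as the model order")
    # history[-k] is the sample k steps back from the one being predicted.
    return -sum(a[k] * history[-k] for k in range(1, order + 1))


# Hypothetical coefficients encoding x[n] = 2*x[n-1] - x[n-2],
# which predicts a straight-line ramp exactly.
ramp_coeffs = [1.0, -2.0, 1.0]
print(predict_next([1.0, 2.0, 3.0, 4.0], ramp_coeffs))  # 5.0
```

Note that the prediction for sample n uses only samples up to n-1, which is the alignment to keep in mind when comparing an estimate against energy(n).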

MATLAB Loop Programming

I've been stuck on a MATLAB coding problem where I needed to create market weights for many stocks from a large data file with multiple days and portfolios.
I received help from an expert the other day using 'nested loops'. It worked, but I don't understand what he has done in the final line. I was wondering if anyone could shed some light on it and explain that last line of code.
xp = x; % where x = market value
dates = unique(x(:,1)); % finds the unique dates in the data set (dates are in column 1)
for i = 1:size(dates,1) % creates an empty matrix to fill the data in
    for j = 5:size(xp,2)
        xp(xp(:,1)==dates(i),j) = xp(xp(:,1)==dates(i),j)./sum(xp(xp(:,1)==dates(i),j)); % help???
    end
end
Any comments are much appreciated!
To understand the code, you have to understand the colon operator, logical indexing and the difference between / and ./. If any of these is unclear, please look it up in the documentation.
The following code does the same, but is easier to read because I separated each step into a single line:
xp = x; % work on a copy, as in the question
dates = unique(x(:,1));
% iterate over all dates; size(dates,1) returns the number of dates
for i = 1:size(dates,1)
    % iterate over columns 5 to the last, which contain the data to be normalised
    for j = 5:size(xp,2)
        % mdate is a logical vector that selects the rows with the currently processed date
        mdate = (xp(:,1)==dates(i));
        % s is the sum of all values in column j for that date
        s = sum(xp(mdate,j));
        % divide all values in column j for that date by s, so that they sum to 1
        xp(mdate,j) = xp(mdate,j)./s;
    end
end
With this code, I suggest using the debugger and stepping through it line by line.
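For readers who find the logical indexing hard to follow, the same per-date normalisation can be sketched in plain Python (not MATLAB) with explicit loops. The function name market_weights and the small data set are invented for this example; it assumes the date is in the first column, the values to normalise start at the fifth column (index 4 in Python), and no date has a column total of zero.

```python
def market_weights(rows, first_value_col=4):
    """Divide each value by the total of its column over all rows with the same date."""
    dates = sorted({row[0] for row in rows})
    result = [list(row) for row in rows]  # copy, so the input is left untouched
    for d in dates:
        # Row positions with the current date (the role of the logical vector mdate).
        idx = [i for i, row in enumerate(rows) if row[0] == d]
        for j in range(first_value_col, len(rows[0])):
            total = sum(rows[i][j] for i in idx)  # the role of s
            for i in idx:
                result[i][j] = rows[i][j] / total
    return result


# Hypothetical data: date, three unused columns, one market value column.
rows = [
    [1, 0, 0, 0, 10.0],
    [1, 0, 0, 0, 30.0],
    [2, 0, 0, 0, 5.0],
    [2, 0, 0, 0, 15.0],
]
for row in market_weights(rows):
    print(row)
# [1, 0, 0, 0, 0.25]
# [1, 0, 0, 0, 0.75]
# [2, 0, 0, 0, 0.25]
# [2, 0, 0, 0, 0.75]
```

Within each date the weights in a column add up to 1, which is what the single MATLAB line achieves in one step.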