Subset based on criteria in sas - macros

Let's hope I have worded the question correctly.
I have pulled a series of cricketing scorecards and now have 'x' scorecards (datasets) each containing 'n' rows of observations. What I want to do is create 'k' subsets from the 'x' scorecards by automatically dividing each scorecard dataset by 8. (for e.g. one of my scorecards has 168 observations therefore this scorecard will be broken into 21 subsets, while another scorecard contains 128 entries so it will be broken into 16 subsets).
I then want to transpose each of the 'k' subsets, this will then give me a dataset containing one row. Finally I want to stack the 'k' transposed datasets to create one big dataset.
small example:
NT Broom
b Henry
21
12
15
3
1
140
JD ryder
b Henry
1
3
2
0
0
50.00
(small extract from one of the scorecards) The above dataset will be divided into 2 subsets, then each of the 2 subsets will be transposed to produce the 2 datasets below:
NT Broom b henry 21 12 15 3 1 140.00
JD Ryder b Henry 1 3 2 0 0 50.00
The two datasets will then be stacked on top of each other:
NT Broom b henry 21 12 15 3 1 140.00
JD Ryder b Henry 1 3 2 0 0 50.00
Hope this makes sense.
Thanks in advance,
Ankit
What I've done so far:
/* Batting Subset Macro */
proc sql;
select count (8) into:total from compile_bat_cleaned_&match;
quit;
%macro create_subsets(count,compile_bat_cleaned_665647);
%let cnt = %sysfunc(ceil(%sysevalf(&count/8)));
%let num = 1;
%do i = 1 %to &cnt;
%if(&i = &cnt) %then %do;
%let toread = &count;
%end;
%else %do;
%let toread = &num + 7;
%end;
data compile_bat_cleaned_665647_&i;
set compile_bat_cleaned_665647 (firstobs=&num obs=%eval(&toread));
run;
proc transpose data = compile_bat_cleaned_665647_&i out = compile_bat_cleaned_665647_&i (drop=_name_);
var bat_det_2_term details_4;
data compile_bat_cleaned_665647_&i;
set compile_bat_cleaned_665647_&i (firstobs = 2);
rename COL1 = Batsman
COL2 = Dismissal
COL3 = Runs_Scored
COL4 = Minutes
COL5 = Balls
COL6 = Fours
COL7 = Sixes
COL8 = Strike_Rate;
%let num = %eval(&num + 8);
%end;
data batters_merged_665647;
set
%do i = 1 %to &cnt;
compile_bat_cleaned_665647_&i
%end;
;
run;
%mend create_subsets;
%create_subsets(&total,compile_bat_cleaned_665647);
The above piece of code works for an individual scorecard (match=665647), but not for a series of scorecards. I changed the macro definition to %macro create_subsets(count,compile_bat_cleaned_&match)
but it doesn't seem to work.

The thing you're missing is BY-group processing. For most things in SAS you don't have to split the data into one dataset per ID or anything like that; you assign an ID variable and then do whatever you need by that variable.
In this case, if you have a variable playerID that is 1 for the first 8 rows, 2 for the second 8 rows (9-16), and so on, then you can run proc transpose with by playerID; and do just the one transpose, all from one dataset to another single dataset. Such a variable can be created in a data step with, for example, playerID = ceil(_n_/8);. No macro, no fuss.

Related

Creating a loop to manipulate a table

I have a a table with target temperature and actual temperature like this (table1):
Time Target actual diffrence
____ ______ ______ _________
1 40 40.2 0.2
2 40 41 1
3 40 40.3 0.3
I want to create a table that only contains the rows with a difference <= 0.5.
So the goal should look like this (table2):
Time Target actual
____ ______ ______
1 40 40.2
3 40 40.3
I don't know how to create a loop that solves my problem.
I tried to create an if-loop within a for loop:
for n = 1:3
if difference(n) <= 0.5
table2 = table(table1.Time(n), table1.Target(n), table1.actual(n))
end
end
But when I execute it, my table2 consists of only the third row.
3 40 40.3
Can somebody please help me create the loop? (Maybe my loop always overwrites table2 and only saves the last iteration?)
Your analysis of the problem is correct. The statement in the loop just sets table2 to the latest row that meets the criterion, overwriting whatever was stored before.
You do not need to use loops for this at all. Create a boolean mask based on the difference column (column 4):
mask = table1{:, 4} <= 0.5;
You can then select a subset of the entire table using the mask as a row index. Use parentheses so that the result is still a table (curly braces would extract the contents as a plain matrix):
table2 = table1(mask, 1:3);
You could even combine the two lines into one:
table2 = table1(table1{:, 4} <= 0.5, 1:3);
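As a self-contained sketch of the masking approach, the snippet below rebuilds the example table from the question and filters it by variable name rather than by column number. The variable names (including the spelling diffrence) are copied from the question's table header and are an assumption about how the real table is labelled.

```matlab
% Rebuild the example table from the question (names assumed from its header)
table1 = table([1;2;3], [40;40;40], [40.2;41;40.3], [0.2;1;0.3], ...
    'VariableNames', {'Time', 'Target', 'actual', 'diffrence'});

% Logical mask: true for rows whose difference is at most 0.5
mask = table1.diffrence <= 0.5;

% Parentheses keep the result a table; select the three columns by name
table2 = table1(mask, {'Time', 'Target', 'actual'});
```

Selecting by name keeps working if the column order changes, which indexing by 1:3 does not.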

Generate pairs of points using a nested for loop

As an example, I have a matrix [1,2,3,4,5]'. This matrix contains one column and 5 rows, and I have to generate a pair of points like (1,2),(1,3)(1,4)(1,5),(2,3)(2,4)(2,5),(3,4)(3,5)(4,5).
I have to store these values in 2 columns in a matrix. I have the following code, but it isn't quite giving me the right answer.
for s = 1:5;
for tb = (s+1):5;
if tb>s
in = sub2ind(size(pairpoints),(tb-1),1);
pairpoints(in) = s;
in = sub2ind(size(pairpoints),(tb-1),2);
pairpoints(in) = tb;
end
end
end
With this code, I got (1,2),(2,3),(3,4),(4,5). What should I do, and what is the general formula for the number of pairs?
One way, though it becomes impractical when there are many elements to choose from, is to use nchoosek as follows:
pairpoints = nchoosek([1:5],2)
pairpoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
See the limitations of this function in the provided link.
An alternative is to just iterate over each element and combine it with the remaining elements in the list (assumes that all are distinct)
pairpoints = [];
data = [1:5]';
len = length(data);
for k=1:len
pairpoints = [pairpoints ; [repmat(data(k),len-k,1) data(k+1:end)]];
end
This method just concatenates each element in data with the remaining elements in the list to get the desired pairs.
Try either of the above and see what happens!
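On the question about a general formula: the number of unordered pairs from N distinct elements is N*(N-1)/2, which is 10 for N = 5. The original loop appears to go wrong because it writes to row tb-1, so each new value of s overwrites rows filled earlier; a running row counter avoids that. The sketch below is one possible fix along those lines (the counter name idx is made up for illustration), and it also uses the formula to preallocate the result.

```matlab
N = 5;
numPairs = N*(N-1)/2;              % number of unordered pairs: N choose 2
pairpoints = zeros(numPairs, 2);   % preallocate one row per pair

idx = 0;                           % running row counter (illustrative name)
for s = 1:N-1
    for tb = s+1:N
        idx = idx + 1;             % move to the next free row
        pairpoints(idx, :) = [s tb];
    end
end
```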
Another suggestion I can add to the mix if you don't want to rely on nchoosek is to generate an upper triangular matrix full of ones, disregarding the diagonal, and use find to generate the rows and columns of where the matrix is equal to 1. You can then concatenate both of these into a single matrix. By generating an upper triangular matrix this way, the locations of the matrix where they're equal to 1 exactly correspond to the row and column pairs that you are seeking. As such:
%// Highest value in your data
N = 5;
[rows,cols] = find(triu(ones(N),1));
pairPoints = [rows,cols]
pairPoints =
1 2
1 3
2 3
1 4
2 4
3 4
1 5
2 5
3 5
4 5
Bear in mind that this will be unsorted (i.e. not in the order that you specified in your question). If order matters to you, then use the sortrows command in MATLAB so that we can get this into the proper order that you're expecting:
pairPoints = sortrows(pairPoints)
pairPoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Take note that I specified an additional parameter to triu which denotes how much of an offset you want away from the diagonal. The default offset is 0, which includes the diagonal when you extract the upper triangular matrix. I specified 1 as the second parameter because I want to move away from the diagonal towards the right by 1 unit so I don't want to include the diagonal as part of the upper triangular decomposition.
for loop approach
If you truly desire the for loop approach, going with your model, you'll need two for loops, and you need to keep track of the row you are currently at so that you can step through the remaining columns until the end. You can also use @GeoffHayes's approach of using just a single for loop to generate your indices, but when you're new to a language, one key piece of advice I will always give is to code for readability and not for efficiency. Once you get it working, if you have some way of measuring performance, you can then try to make the code faster and more efficient. This kind of programming is also endorsed by Jon Skeet, the resident StackOverflow ninja, and I got that from this post here.
As such, you can try this:
pairPoints = []; %// Initialize
N = 5; %// Highest value in your data
for row = 1 : N
for col = row + 1 : N
pairPoints = [pairPoints; [row col]]; %// Add row-column pair to matrix
end
end
We get the equivalent output:
pairPoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Small caveat
This method will only work if your data is enumerated from 1 to N.
Edit - August 20th, 2014
You wish to generalize this to any array of values. You also want to stick with the for loop approach. You can still keep the original for loop code there. You would simply have to add a couple more lines to index your new array. As such, supposing your data array was:
dat = [12, 45, 56, 44, 62];
You would use the pairPoints matrix and use each column to index into the data array to access your values. Also, you need to make sure your data is a column vector, or this won't work. If it were a row vector, indexing with each column of pairPoints would give two row vectors, and concatenating them would produce one long row, which is obviously not what we're looking for. In other words:
dat = [12, 45, 56, 44, 62];
dat = dat(:); %// Make column vector - Important!
N = numel(dat); %// Total number of elements in your data array
pairPoints = []; %// Initialize
%// Skip if the array is empty
if (N ~= 0)
for row = 1 : N
for col = row + 1 : N
pairPoints = [pairPoints; [row col]]; %// Add row-column pair to matrix
end
end
vals = [dat(pairPoints(:,1)) dat(pairPoints(:,2))];
else
vals = [];
end
Take note that I have made a provision where if the array is empty, don't even bother doing any calculations. Just output an empty matrix.
We thus get:
vals =
12 45
12 56
12 44
12 62
45 56
45 44
45 62
56 44
56 62
44 62
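If the loops are not a requirement, the same generalization can be sketched without them by letting nchoosek produce the index pairs and then indexing the data with the whole index matrix. This is an alternative to the loop version above, not a replacement for it, and the guard on the number of elements is an added assumption to keep short inputs from reaching nchoosek.

```matlab
dat = [12, 45, 56, 44, 62];
dat = dat(:);                            % column vector, as in the loop version

if numel(dat) >= 2
    idx  = nchoosek(1:numel(dat), 2);    % every index pair (row < col)
    vals = dat(idx);                     % matrix index: result has the shape of idx
else
    vals = [];                           % fewer than two elements: no pairs
end
```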

To get sum values of other columns based on first column values

An excel file contains 5 columns; first column contains year (1987 to 2080), second column contains month, third column contains days, fourth and fifth column contain values. I would like to get the sum values of columns four and five according to year in column one. For example, I would like to get the sum values of column four and five for year 1987, then 1988, then 1989...so on.!
Example of data file is attached
I have tried the following code considering that each year contains 365 days.
n=1;
for i=1:365:size(data,1)
Total(n,:) = sum(data(i:i+365-1,:));
n=n+1;
end
But the problem is that not all the years contain 365 days. Some of them (e.g. 1988, 1992) contain 366 days in a year as they are leap year. In those cases, the sum results become incorrect.
Looking for your help to get the sum values of columns 4 and 5 according to the year in column 1.
It would be greatly appreciated.
UPDATE: much faster solution at the end!
It can be done with one line for each column. First, some example data:
% some example data
years = ceil(1987:0.3:2080)';
months = randi(12,numel(years),1);
days = randi(30,numel(years),1);
values = randi(42,numel(years),2);
% data similar to yours;
data = [ years months days values ];
The long but easily readable way would be:
% years
y = data(:,1)
% unique years
uy = unique(y);
% for column 4
C4 = arrayfun(@(x) sum( data(y == x, 4) ), uy )
% for column 5
C5 = arrayfun(@(x) sum( data(y == x, 5) ), uy )
or, in short, one line per column:
C4 = arrayfun(@(x) sum( data( (data(:,1) == x), 4) ), unique(data(:,1)) )
returning a 94x1 double array with all sums for all 94 unique years of the example data.
If you want to arrange it somehow you could do it as follows:
summary = [uy, C4, C5]
returning something like:
summary = %// columns: year, sum of column 4, sum of column 5
1987 3 3
1988 40 40
1989 56 56
1990 96 96
1991 54 54
1992 15 15
1993 73 73
1994 42 42
1995 66 66
1996 56 56
...
You could also do all columns at once. Even for just 2 columns this should be about 50% faster.
cols = 4:5;
C = cell2mat( arrayfun(@(x) sum( data(y == x, cols),1 ), uy,'uni',0 ) )
The problem with that solution is that you have a matrix of about 30000x5, and for every unique year the indexing is applied to the whole matrix to "search" for the rows of the year being summed. But there is actually a built-in function doing exactly that.
A simpler and much faster solution can be achieved using accumarray:
[~,~, i_uy] = unique(data(:,1));
C4 = accumarray(i_uy,data(:,4));
C5 = accumarray(i_uy,data(:,5));
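To see that the accumarray approach copes with leap years, here is a small sketch on made-up data: one 365-day year and one 366-day year, with constant values so the expected sums are easy to check by hand. The month and day columns are placeholders and play no part in the result.

```matlab
% Made-up daily data: 1987 has 365 rows, 1988 (a leap year) has 366 rows
years = [repmat(1987, 365, 1); repmat(1988, 366, 1)];
nDays = numel(years);
data  = [years, ones(nDays, 1), ones(nDays, 1), ones(nDays, 1), 2*ones(nDays, 1)];

% Third output of unique maps every row to the position of its year in uy
[uy, ~, i_uy] = unique(data(:,1));

C4 = accumarray(i_uy, data(:,4));   % per-year sum of column 4
C5 = accumarray(i_uy, data(:,5));   % per-year sum of column 5
summary = [uy, C4, C5];
```

Because the grouping comes from the year column itself, the number of rows per year no longer matters.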

How do I get only a few specified columns from a text file in MATLAB?

I have a text file with 20 columns (the columns are separated by |) and many rows. How can I read only columns 5, 9 and 17?
If you want to read a file like this (called text.txt in my example)
1 | 2 | 3 | 4
2 | 3 | 4 | 5
3 | 4 | 5 | 6
just do
matrix = dlmread('text.txt');
which gives you
1 2 3 4
2 3 4 5
3 4 5 6
You can then use standard matlab matrix notation to extract for example columns 1 and 4
col1 = matrix(:, 1) % the colon is used to tell matlab to take all rows
col4 = matrix(:, 4)
You will have to create another variable that selects the specific columns from the array produced by importing the text file.
With the right input parameters, textscan can pull this off:
Ncols = 20;
colExtract = [5 9 17];
fspec = cell(1,Ncols);
fspec(:)={'%*f '}; % the asterisk tells textscan to ignore the column
fspec(colExtract)={'%f '};
fspec{end}=fspec{end}(1:end-1); % removes the space from the last parameter
fspecstr = horzcat(fspec{:});
fid = fopen(filename);
indata = textscan(fid,fspecstr,'HeaderLines',1,'delimiter','\t');
fclose(fid);
col5 = indata{1};
col9 = indata{2};
col17= indata{3};
As you can see, I assumed there was a single header line and that the data is tab delimited. If your file is different (the file in the question is delimited by |), change these settings accordingly.
I guess it pays off if you're working with huge files from which you only want a small portion, or if you can't hold all the content in memory.
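Adapted to the file described in the question, the same idea might look like the sketch below. It writes a small made-up sample file first so that it can run on its own, and it assumes purely numeric columns, '|' as the delimiter and no header line; the temporary file name and the sample values are illustrative only.

```matlab
% Write a made-up sample file: 3 rows, 20 numeric columns, '|' delimited
Ncols  = 20;
sample = reshape(1:3*Ncols, Ncols, 3)';   % row r holds (r-1)*20 + (1:20)
fname  = [tempname '.txt'];
fid = fopen(fname, 'w');
for r = 1:size(sample, 1)
    fprintf(fid, '%d', sample(r, 1));
    fprintf(fid, '|%d', sample(r, 2:end));
    fprintf(fid, '\n');
end
fclose(fid);

% Build the format string: skip every column except the wanted ones
colExtract = [5 9 17];
fspec = repmat({'%*f'}, 1, Ncols);        % the asterisk skips a field
fspec(colExtract) = {'%f'};
fspecstr = [fspec{:}];

fid = fopen(fname, 'r');
indata = textscan(fid, fspecstr, 'Delimiter', '|');
fclose(fid);
delete(fname);                            % remove the temporary sample file

selected = [indata{:}];                   % one column per extracted field
```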

Merge tables Without sorting on Keys

I have two tables (actually I have several) that should be merged using innerjoin. It works as expected, except that it sorts on the key variables, which destroys my data, since the rows must stay in their original order.
From the help section:
C is sorted by the values in the key variables
t1 = Case Val
3 1
1 1
2 1
t2 = Val2 Case Subset
2 2 2
1 2 2
2 3 1
tnew = innerjoin(t1,t2)
tnew = Case Val Val2 Subset
2 ...
% will start with "2" since its a lower value than "3", but in t1 "3" was in lower row than "2", it is rearranged
How can I avoid the sorting? Is it hopeless to use innerjoin?
In addition to the resulting table, innerjoin returns two extra outputs: the row indices of the first table and the row indices of the second table that correspond to the rows in the output.
You can simply use the second output of innerjoin to determine which rows of t1 were used, and sort those indices. Then use the resulting sort order to reorder the rows of the joined table.
%// Setup the data
t1 = table([3;1;2], [1;1;1], 'VariableNames', {'Case', 'Val'});
t2 = table([2;1;2],[2;2;3],[2;2;1], 'VariableNames', {'Val2', 'Case', 'Subset'});
%// Perform the inner join and keep track of where the rows were in t1
[tnew, rows_in_t1] = innerjoin(t1, t2);
%// Sort them in order to maintain the order in t1
[~, sortinds] = sort(rows_in_t1);
%// Apply this sort order to the new table
tnew = tnew(sortinds,:);
%// Case Val Val2 Subset
%// ____ ___ ____ ______
%// 3 1 2 1
%// 2 1 2 2
%// 2 1 1 2
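Since the question mentions several tables, a variation that may be easier to chain is to carry the original row position along as an extra variable and sort on it after joining. This is a sketch of a different technique from the index-output approach above; the helper variable name RowOrder is made up, and the key is given explicitly as Case.

```matlab
t1 = table([3;1;2], [1;1;1], 'VariableNames', {'Case', 'Val'});
t2 = table([2;1;2], [2;2;3], [2;2;1], 'VariableNames', {'Val2', 'Case', 'Subset'});

% Remember where each row sat in t1 (RowOrder is a made-up helper name)
t1.RowOrder = (1:height(t1))';

% Join on the key, then restore the original t1 order
tnew = innerjoin(t1, t2, 'Keys', 'Case');
tnew = sortrows(tnew, 'RowOrder');

% Drop the helper variable once the order is restored
tnew.RowOrder = [];
```

The helper column survives further joins, so the final sort can be postponed until all tables have been merged.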