SAS how groupformat works

I found this on the official SAS website.
Use the GROUPFORMAT option in the BY statement to ensure that
1. formatted values are used to group observations when a FORMAT statement and a BY statement are used together in a DATA step
2. the FIRST.variable and LAST.variable are assigned by the formatted values of the variable
And the example it uses to illustrate the usage of groupformat is
proc format;
value range
low -55 = 'Under 55'
55-60 = '55 to 60'
60-65 = '60 to 65'
65-70 = '65 to 70'
other = 'Over 70';
run;
proc sort data=class out=sorted_class;
by height;
run;
data _null_;
format height range.;
set sorted_class;
by height groupformat;
if first.height then
put 'Shortest in ' height 'measures ' height:best12.;
run;
But I don't understand how this example shows groupformat "ensures"
formatted values are used to group observations when a FORMAT statement and a BY statement are used together in a DATA step.

Look at the results with and without the GROUPFORMAT option:
4805
4806 data _null_;
4807 format height range.;
4808 set sorted_class;
4809 by height groupformat;
4810 if first.height then
4811 put 'Shortest in ' height 'measures ' height:best12.;
4812 run;
Shortest in Under 55 measures 51.3
Shortest in 55 to 60 measures 56.3
Shortest in 60 to 65 measures 62.5
Shortest in 65 to 70 measures 65.3
Shortest in Over 70 measures 72
NOTE: There were 19 observations read from the data set WORK.SORTED_CLASS.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.01 seconds
4813
4814 data _null_;
4815 format height range.;
4816 set sorted_class;
4817 by height ;
4818 if first.height then
4819 put 'Shortest in ' height 'measures ' height:best12.;
4820 run;
Shortest in Under 55 measures 51.3
Shortest in 55 to 60 measures 56.3
Shortest in 55 to 60 measures 56.5
Shortest in 55 to 60 measures 57.3
Shortest in 55 to 60 measures 57.5
Shortest in 55 to 60 measures 59
Shortest in 55 to 60 measures 59.8
Shortest in 60 to 65 measures 62.5
Shortest in 60 to 65 measures 62.8
Shortest in 60 to 65 measures 63.5
Shortest in 60 to 65 measures 64.3
Shortest in 60 to 65 measures 64.8
Shortest in 65 to 70 measures 65.3
Shortest in 65 to 70 measures 66.5
Shortest in 65 to 70 measures 67
Shortest in 65 to 70 measures 69
Shortest in Over 70 measures 72
NOTE: There were 19 observations read from the data set WORK.SORTED_CLASS.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
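From this it is obvious that GROUPFORMAT forms the BY groups from the FORMATTED value, so FIRST.height is true only once per format range. Without it, the BY groups are formed from the RAW value of HEIGHT, so every distinct height starts a new group.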
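The difference can also be sketched outside SAS. The following Python snippet is only an analogy and does not reproduce SAS semantics in full: `range_label` is a hypothetical stand-in for the `range.` format, and the heights are the distinct values visible in the second log above. Grouping consecutive rows by the label mimics `by height groupformat`, while grouping by the raw value mimics a plain `by height`.

```python
from itertools import groupby

# Distinct heights taken from the log output above (already sorted).
heights = [51.3, 56.3, 56.5, 57.3, 57.5, 59, 59.8, 62.5, 62.8,
           63.5, 64.3, 64.8, 65.3, 66.5, 67, 69, 72]

def range_label(h):
    """Hypothetical stand-in for the range. format.
    Assumption: a shared endpoint falls into the first matching range."""
    if h <= 55:
        return "Under 55"
    if h <= 60:
        return "55 to 60"
    if h <= 65:
        return "60 to 65"
    if h <= 70:
        return "65 to 70"
    return "Over 70"

def first_of_each_group(values, key):
    """Return the first value of every run of consecutive equal keys."""
    return [next(group) for _, group in groupby(values, key=key)]

by_formatted = first_of_each_group(heights, key=range_label)  # like GROUPFORMAT
by_raw = first_of_each_group(heights, key=lambda h: h)        # like plain BY

for h in by_formatted:
    print("Shortest in", range_label(h), "measures", h)
```

Grouping by the label yields five "first" rows, one per range, while grouping by the raw value yields one per distinct height, which is the same pattern the two logs show.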

How can I efficiently convert the output of one KDB function into three table columns?

I have a function that takes as input some of the values in a table and returns a tuple if you will - three separate return values, which I want to transpose into the output of a query. Here's a simplified example of what I want to achieve:
multiplier:{(x*2;x*3;x*4)};
select twoX:multiplier[price][0], threeX:multiplier[price][1], fourX:multiplier[price][2] from data
The above basically works (I think I've got the syntax right for the simplified example - if not then hopefully my intention is clear), but is inefficient because I'm calling the function three times and throwing away most of the output each time. I want to rewrite the query to only call the function once, and I'm struggling.
Update
I think I missed a crucial piece of information in my explanation of the problem which affects the outcome - I need to get other data in the query alongside the output of my function. Here's a hopefully more realistic example:
multiplier:{(x*2;x*3;x*4)};
select average:avg price, total:sum price, twoX:multiplier[sum price][0], threeX:multiplier[sum price][1], fourX:multiplier[sum price][2] by category from data
I'll have a go at adapting your answers to fit this requirement anyway, and apologies for missing this bit of information. The real function is a proprietary and fairly complex algorithm and the real query has about 30 output columns, hence the attempt at simplifying the example :)

If you're just looking for the results themselves, you can extract them as lists (exec), create a dictionary keyed by the new column names, and then flip the dictionary into a table:
q)exec flip`twoX`threeX`fourX!multiplier[price] from ([]price:til 10)
twoX threeX fourX
-----------------
0 0 0
2 3 4
4 6 8
6 9 12
8 12 16
10 15 20
12 18 24
14 21 28
16 24 32
18 27 36
If you need other columns from the original table too, then it's trickier, but you could join the tables sideways using ,'
q)t:([]price:til 10)
q)t,'exec flip`twoX`threeX`fourX!multiplier[price] from t
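For readers less familiar with q, the same idea (call the function once, then name the three results and join them to the original columns) can be sketched in Python. This is an illustration only, not kdb+ code: the table is modelled as a plain dict of columns, and `price` is `range(10)` to mirror `til 10`.

```python
def multiplier(prices):
    """Mirror of the q function: returns three lists from a single call."""
    return ([p * 2 for p in prices],
            [p * 3 for p in prices],
            [p * 4 for p in prices])

table = {"price": list(range(10))}  # like ([]price:til 10)

# One call; the three results are paired with their column names.
new_cols = dict(zip(("twoX", "threeX", "fourX"), multiplier(table["price"])))

# Join sideways: original columns plus the three new ones.
joined = {**table, **new_cols}

for row in zip(*joined.values()):
    print(*row)
```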

A functional amend with @ can also achieve what you want. Here data is just a table with 10 random prices. @ is then used to assign the three lists returned by multiplier to three new columns of the table, naming each one:
q)data:([] price:10?100)
q)multiplier:{(x*2;x*3;x*3)}
q)@[data;`twoX`threeX`fourX;:;multiplier data`price]
price twoX threeX fourX
-----------------------
80 160 240 240
24 48 72 72
41 82 123 123
0 0 0 0
81 162 243 243
10 20 30 30
36 72 108 108
36 72 108 108
16 32 48 48
17 34 51 51

Need to calculate until a specific date in tableau?

There are three columns, date, x, y
I need to calculate the running sum/total of y for a specific date (today's date more specifically). The data is in two datasources and looks like this in first data source.
DATE X Z
5-Sep
6-Sep 26 101
7-Sep 27 100
8-Sep 28 99
9-Sep 29 98
10-Sep 30 98
11-Sep 30 98
12-Sep 30 97
13-Sep 31 96
14-Sep 32 95
15-Sep 33 94
16-Sep 34 93
17-Sep 35 92
18-Sep 35 92
and like this in the second data source
DATE Y
5-Sep 166
6-Sep 182
7-Sep 130
8-Sep 93
9-Sep 107
10-Sep 95
11-Sep 128
12-Sep 173
13-Sep 154
14-Sep 136
15-Sep 79
16-Sep 61
17-Sep 156
18-Sep 66
Let's say that today's date is 17th Sep, then I need to calculate the running sum of 'Z' until today and display it next to the 'X' column. Something like this
17-Sep 35 1499.
How do I do that?
(I tried using sets with date by limiting the date to today but then the running sum doesn't work, also there are some errors in calculated field which is because the data is in two different sources)
Please ask if you need more clarification.

Using the Superstore sample data, I created a date parameter, then created a calculated field as follows:
if [date param] >= [Order Date] then [Sales] end
Now this will display sales prior to your selected date parameter. I also created a filter calc to only see data prior to the selected date in the param.
[date param]>=[Order Date]
Place this in the filter shelf and select True.
Now place date field on Rows and your sales calculated field on Text pill. Right click on it and select Quick Table Calculation > Running Total.
See sample workbook here: https://www.dropbox.com/s/p42tx86v4qidlvn/170327%20stack%20question.twbx?dl=0
EDIT:
If you just want to see the total and the date selected, create a calc field for "last" as last() then filter that for zero.

Calculate Variance of a Group data

I have a table containing heights and their frequencies. I want to calculate the variance of the heights.
Height 140 150 160 170 180 190
Frequency 3 5 57 63 30 2
I have tried the below code:
height=[140 150 160 170 180 190;3 5 57 63 30 2]
height=height(:)
V = var(height) %Calculate Variance
This gives an answer of 5.7316e+03, while the grouped-data formula gives 81.8594. How can I get the correct result?

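Use the weighted variance, passing the frequencies as the weights. This assumes height is still the original 2-by-6 matrix, i.e. before the height=height(:) line flattens it; the result is 81.8594, matching the formula:
h=height;
var(h(1,:),h(2,:))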
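As a cross-check of the arithmetic, here is a minimal Python sketch of the same frequency-weighted (population) variance, using only the numbers from the question. The variable names are chosen for illustration.

```python
heights = [140, 150, 160, 170, 180, 190]
freqs = [3, 5, 57, 63, 30, 2]

n = sum(freqs)  # total number of observations

# Weighted mean: each height counts as many times as its frequency.
mean = sum(h * f for h, f in zip(heights, freqs)) / n

# Weighted variance: frequency-weighted mean of squared deviations.
variance = sum(f * (h - mean) ** 2 for h, f in zip(heights, freqs)) / n

print(round(mean, 3), round(variance, 4))  # 167.375 81.8594
```

The original attempt gave 5.7316e+03 because height(:) mixes the heights and the frequencies into one sample of 12 numbers, so the frequencies were treated as data rather than as weights.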

Matlab code crashing unexpectedly

Does anyone have a clue why the following code crashes with an "Index exceeds matrix dimensions." error for N_SUBJ = 17 or N_SUBJ = 14, but not, for example, for the values 13, 15 or 16?
N_PICS = 7
COLR = hsv;
N_COLR = size(COLR,1);
COLR = COLR(1+[0:(N_PICS-1)]*round(N_COLR/N_PICS),:);
SUBJ_COLR = hsv;
N_SUBJ_COLR = size(SUBJ_COLR,1);
SUBJ_COLR = SUBJ_COLR(1+[0:(N_SUBJ-1)]*round(N_SUBJ_COLR/N_SUBJ),:);
Also, could somebody please explain to me what it's doing exactly and how it works?

When you say crashing, I assume you mean you are seeing the error, Index exceeds matrix dimensions.? If you are seeing this error then the matrix returned by hsv does not have enough rows for the sub-sample operation you are doing.
SUBJ_COLR = SUBJ_COLR(1+[0:(N_SUBJ-1)]*round(N_SUBJ_COLR/N_SUBJ),:);
selects a subset of the original matrix. 1+[0:(N_SUBJ-1)]*round(N_SUBJ_COLR/N_SUBJ) calculates which row to select, and : means all columns.

The matrix SUBJ_COLR is 64-by-3, thus N_SUBJ_COLR is equal to 64. You're indexing into the 64 rows of SUBJ_COLR and in some cases the largest index is greater than the number of rows, resulting in an "Index exceeds matrix dimensions." error. So the question is really why does this snippet
1+[0:(N_SUBJ-1)]*round(N_SUBJ_COLR/N_SUBJ)
evaluate to numbers greater than 64 for some values of N_SUBJ? This expression can be rewritten as:
1+(0:round(64/N_SUBJ):round(64/N_SUBJ)*(N_SUBJ-1))
or
1:round(64/N_SUBJ):round(64/N_SUBJ)*(N_SUBJ-1)+1
where I've replaced N_SUBJ_COLR by 64 for clarity. This latter expression more clearly shows what the largest index in the vector will be and how it depends on the value of N_SUBJ. You can print out this largest index as a function of N_SUBJ:
N_SUBJ = 1:30;
round(64./N_SUBJ).*(N_SUBJ-1)+1
which returns
ans =
Columns 1 through 13
1 33 43 49 53 56 55 57 57 55 61 56 61
Columns 14 through 26
66 57 61 65 69 55 58 61 64 67 70 73 51
Columns 27 through 30
53 55 57 59
As you can see, there are several values that exceed 64. This nonlinear behavior comes down to the use of round. The integers created by the round part don't appear to get small enough fast enough as they multiply (N_SUBJ-1) which is growing in order to keep the total term less than 64. One option might be to replace round with floor, but there are probably other ways.
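The table above can be reproduced with a short Python sketch that lists which values of N_SUBJ push the largest index past 64, and checks the suggested floor variant over the same range. The helper names are hypothetical, and `matlab_round` is a small stand-in because Python's built-in `round` treats ties differently from MATLAB's.

```python
import math

N_COLR = 64  # number of rows in the colormap, as stated above

def matlab_round(x):
    """Round half away from zero for positive x, like MATLAB's round."""
    return math.floor(x + 0.5)

def largest_index(n_subj, step_fn):
    """Largest row index produced by 1+[0:(n_subj-1)]*step."""
    return step_fn(N_COLR / n_subj) * (n_subj - 1) + 1

too_big_round = [n for n in range(1, 31)
                 if largest_index(n, matlab_round) > N_COLR]
too_big_floor = [n for n in range(1, 31)
                 if largest_index(n, math.floor) > N_COLR]

print(too_big_round)  # [14, 17, 18, 23, 24, 25]
print(too_big_floor)  # []
```

With floor the step is never larger than 64/N_SUBJ, so the largest index is at most 64 - 64/N_SUBJ + 1, which stays within 64 as long as N_SUBJ does not exceed 64.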

How to compare the pairs of coordinates most efficiently without using nested loops in Matlab?

If I have 20 pairs of coordinates, whose x and y values are say :
x y
27 182
180 81
154 52
183 24
124 168
146 11
16 90
184 153
138 133
122 79
192 183
39 25
194 63
129 107
115 161
33 14
47 65
65 2
1 124
93 79
Now if I randomly generate 15 pairs of coordinates (x,y) and want to compare with these 20 pairs of coordinates given above, how can I do that most efficiently without nested loops?

If you're trying to see if any of your 15 randomly generated coordinate pairs are equal to any of your 20 original coordinate pairs, an easy solution is to use the function ISMEMBER like so:
oldPts = [...]; %# A 20-by-2 matrix with x values in column 1
%# and y values in column 2
newPts = randi(200,[15 2]); %# Create a 15-by-2 matrix of random
%# values from 1 to 200
isRepeated = ismember(newPts,oldPts,'rows');
And isRepeated will be a 15-by-1 logical array with ones where a row of newPts exists in oldPts and zeroes otherwise.
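The same membership test can be sketched in Python with a set of tuples, which gives the row-wise comparison without any explicit nested loop. The old points are the 20 pairs from the question; the new points are hypothetical values chosen for illustration (two of them deliberately repeat old points) rather than random draws.

```python
old_pts = [(27, 182), (180, 81), (154, 52), (183, 24), (124, 168),
           (146, 11), (16, 90), (184, 153), (138, 133), (122, 79),
           (192, 183), (39, 25), (194, 63), (129, 107), (115, 161),
           (33, 14), (47, 65), (65, 2), (1, 124), (93, 79)]

# Hypothetical new points for illustration; two of them repeat old points.
new_pts = [(10, 10), (183, 24), (200, 1), (93, 79)]

old_set = set(old_pts)  # hashing gives O(1) average lookup per point
is_repeated = [pt in old_set for pt in new_pts]

print(is_repeated)  # [False, True, False, True]
```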

If your coordinates are 1) actually integers and 2) their span is reasonable (otherwise use a sparse matrix), I'd use a simple truth table, like this:
x_0= [27 180 ...
y_0= [182 81 ...
s= [200 200]; %# span of coordinates
T= false(s);
T(sub2ind(s, x_0, y_0))= true;
%# now obtain some other coordinates
x_1= [...
y_1= [...
%# and common coordinates of (x_0, y_0) and (x_1, y_1) are just
T(sub2ind(s, x_1, y_1))

If your original twenty points aren't going to change, you'd get better efficiency if you sorted them once, at a cost of O(n log n); then you could check whether each random point is in the list with an O(log n) binary search.
If your "original" points list changes (insertions / deletions), you could get equivalent performance with a binary tree.
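A minimal Python sketch of the sort-then-search idea, assuming a static list of points. It uses only the first five points from the question to keep it short, and `contains` is a hypothetical helper name.

```python
from bisect import bisect_left

# First five points from the question; sorting is the one-time O(n log n) cost.
sorted_pts = sorted([(27, 182), (180, 81), (154, 52), (183, 24), (124, 168)])

def contains(sorted_points, pt):
    """Binary search: O(log n) membership test on a sorted list of tuples."""
    i = bisect_left(sorted_points, pt)
    return i < len(sorted_points) and sorted_points[i] == pt

print(contains(sorted_pts, (154, 52)))  # True
print(contains(sorted_pts, (154, 53)))  # False
```

Tuples compare lexicographically (x first, then y), so a plain sort is enough to make the binary search valid.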
BUT: If the number of points you're working with is really as low as in your question, your double loop might just be the fastest method! Algorithms with low Big-O curves will be faster as the amount of data gets really big, but it's often at the cost of a one-time slowdown (in your case, the sort) - and with only 15x20 data points... There won't be a human-perceptible difference; you might see one if you're timing it on your system clock. Or you might not.
Hope this helps!