I am trying to generate a variable in Stata that is the mean of two other column variables. How can I do this? So far, I have
generate var = mean(var1 var2)
but I know that this is not correct, since mean isn't a command.
Thanks!
The problem is that mean() is not a Stata function. No Stata command has that kind of syntax.
To get the mean of two variables, you can just divide their sum by 2:
gen var = (var1 + var2)/2
If either variable is missing, the result will be missing. If you want to use the non-missing value, you could go
gen var = cond(missing(var1, var2), max(var1, var2), (var1 + var2) / 2)
or use the egen function rowmean().
Related
I am trying to split a string variable into multiple dummy coded variables. I used these sources to get an idea of how one would achieve this task in SPSS:
https://www.ibm.com/support/pages/making-multiple-string-variables-single-multiply-coded-field
https://www.spss-tutorials.com/spss-split-string-variable-into-separate-variables/
But when I try to adapt the first one to my needs or when I try to convert the second one to a macro, I fail.
In my dataset I have (multiple) variables that contain a comma seperated string that represents different combinations of selected items (as well as missing values). For each item of a specific variable I want to create a dummy variable. If the item was selected, it should be represented with a 1 in the new dummy variable. If it was not selected, that case should be represented with a 0.
Different input variables can contain different numbers of items.
For example:
ID
VAR1
VAR2
DMMY1_1
DMMY1_2
DMMY1_3
1
1, 2
8
1
1
0
2
1
1, 3
1
0
0
3
3, 1
2, 3, 1
1
0
1
4
2, 8
0
0
0
Here is what I came up with so far ...
* DEFINE DATA.
DATA LIST /ID 1 (F) VAR1 2-5 (A) VAR2 6-12 (A).
BEGIN DATA
11, 28
21 1, 3
33, 12, 3, 1
4 2, 8
END DATA.
* MACRO SYNTAX.
* DEFINE VARIABLES (in the long run these should/will be inside the macro function, but for now I will leave them outside).
NUMERIC v1 TO v3 (F1).
VECTOR v = v1 TO v3.
STRING #char (A1).
DEFINE split_var(vr = TOKENS(1)).
!DO !#pos=1 !TO char.length(!vr).
COMPUTE #char = char.substr(!vr, !#pos, 1).
!IF (!#char !NE "," !AND !#char !NE " ") !THEN
COMPUTE v(NUMBER(!#char, F1)) = 1.
!IFEND.
!DOEND.
!ENDDEFINE.
split_var vr=VAR1.
EXECUTE.
As I got more errors than I can count, it's hard to narrow down my problem. But I think the problem has something to do with the way I use the char.length() function (and I am a bit confused when to use the bang operator).
If anyone has some insights, I would really appreciate some help :)
There is a fundamental issue to understand about SPSS macro - the macro does not read or interact in any way with the data. All the macro does is manipulate text to write syntax. The syntax created will later work on the actual data when you run it.
So, for example, Your first error is using char.length(!vr) within the syntax. You are trying to get the macro to read the data, calculate the length and use, but that simply can't be done - the macro can only work with what you gave it.
Another example in your code: you calculate #char and then try to use it in the macro as !#char. So that obviously won't work. ! precedes only macro functions or arguments. #char, in your code, is neither, and it can't become one - can't read the data into the macro...
To give you a litte push forward: I understand you want the macro loop to run a different number of times for each variable, but you can't use char.length(!vr). I suggest instead have the macro loop as many times as necessary to be sure you can deal with the longest variable you'll need to work with.
And another general strategy hint - first, create syntax to deal with one specific variable and one specific delimiter. Once this works, start working on a macro, keeping in mind that the only purpose of the macro is to recreate the same working syntax, only changing the parameters of variable name and delimiter.
With my new understanding of the SPSS macro logic (thanks to #eli-k) the problem was quite easy to solve. Here is the working solution.
* DEFINE DATA.
DATA LIST /ID 1 (F) VAR1 2-5 (A) VAR2 6-12 (A).
BEGIN DATA
11, 28
21 1, 3
33, 12, 3, 1
4 2, 8
END DATA.
* DEFINE MACRO.
DEFINE #split_var(src_var = !TOKENS(1)
/dmmy_var_label = !DEFAULT(dmmy) !TOKENS(1)
/dmmy_var_lvls = !TOKENS(1))
NUMERIC !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls) (F1).
VECTOR #dmmy_vec = !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls).
STRING #char (A1).
LOOP #pos=1 TO char.length(!src_var).
COMPUTE #char = char.substr(!src_var, #pos, 1).
DO IF (#char NE "," AND #char NE " ").
COMPUTE #index = NUMBER(#char, F1).
COMPUTE #dmmy_vec(#index) = 1.
END IF.
END LOOP.
RECODE !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls) (SYSMIS=0) (ELSE=COPY).
EXECUTE.
!ENDDEFINE.
* CALL MACRO.
#split_var src_var=VAR2 dmmy_var_lvls=8.
I am looking for some style/best practice advice. I often find myself writing scripts which need many (several tens of) parameters to be defined at the beginning. Then these parameters are used by many functions within the script. A minimum, simplified example might look something like the following:
params.var1 = 1;
params.var2 = 10;
params.var3 = 100;
params.var4 = 1e3;
result1 = my_func1(params);
result2 = my_func2(params);
Now, I don't want to pass many inputs into every function, so am reluctant to do something like result1 = my_func1(var1,var2,var3,var4,...). Therefore, I always find myself making each variable a field of a structure (e.g. params), and then passing this structure alone into each function, as above. The structure is not modified by the functions, only the parameters are used for further calculations.
One of the functions might look like this then:
function result = my_func1(params)
var1 = params.var1;
var2 = params.var2;
var3 = params.var3;
var4 = params.var4;
result = var1.^2 + var2.^2 -var3.^3 + var4;
end
Now, because I don't want to refer to each variable within the function as params.var1, etc. (in the interest of keeping the expression for result as clear as possible), I first do all this unpacking at the beginning using var1 = params.var1.
I suppose the best thing to be doing in situations like this might be to use classes (because I have some data and also want to perform functions on that data). Are there any better ways for me to be doing this kind of thing without moving fully to object-oriented code?
I would simply leave the unpacking out. Call the struct params something shorter inside the function, to keep clutter to a minimum:
function result = my_func1(p)
result = p.var1.^2 + p.var2.^2 - p.var3.^3 + p.var4;
end
I would keep calling it params elsewhere, so you don’t have to deal with cryptic names.
You can define constant functions:
function out = var1
out = 1;
end
function out = var2
out = 10;
end
function result = my_func1
result = var1.^2 + var2.^2;
end
Based on your actual application you may pass array of numbers:
var = [var1 var2 var3 var4];
my_func1(var);
my_func1(var1,var2,var3,var4,...) in my opinion is preferred over passing struct.
I am trying to define the following function in MATLAB:
file = #(var1,var2,var3,var4) ['var1=' num2str(var1) 'var2=' num2str(var2) 'var3=' num2str(var3) 'var4=' num2str(var4)'];
However, I want the function to expand as I add more parameters; if I wanted to add the variable vark, I want the function to be:
file = #(var1,var2,var3,var4,vark) ['var1=' num2str(var1) 'var2=' num2str(var2) 'var3=' num2str(var3) 'var4=' num2str(var4) 'vark=' num2str(vark)'];
Is there a systematic way to do this?
Use fprintf with varargin for this:
f = #(varargin) fprintf('var%i= %i\n', [(1:numel(varargin));[varargin{:}]])
f(5,6,7,88)
var1= 5
var2= 6
var3= 7
var4= 88
The format I've used is: 'var%i= %i\n'. This means it will first write var then %i says it should input an integer. Thereafter it should write = followed by a new number: %i and a newline \n.
It will choose the integer in odd positions for var%i and integers in the even positions for the actual number. Since the linear index in MATLAB goes column for column we place the vector [1 2 3 4 5 ...] on top, and the content of the variable in the second row.
By the way: If you actually want it on the format you specified in the question, skip the \n:
f = #(varargin) fprintf('var%i= %i', [(1:numel(varargin));[varargin{:}]])
f(6,12,3,15,5553)
var1= 6var2= 12var3= 3var4= 15var5= 5553
Also, you can change the second %i to floats (%f), doubles (%d) etc.
If you want to use actual variable names var1, var2, var3, ... in your input then I can only say one thing: Don't! It's a horrible idea. Use cells, structs, or anything else than numbered variable names.
Just to be crytsal clear: Don't use the output from this in MATLAB in combination with eval! eval is evil. The Mathworks actually warns you about this in the official documentation!
How about calling the function as many times as the number of parameters? I wrote this considering the specific form of the character string returned by your function where k is assumed to be the index of the 'kth' variable to be entered. Array var can be the list of your numeric parameters.
file=#(var,i)[strcat('var',num2str(i),'=') num2str(var) ];
var=[2,3,4,5];
str='';
for i=1:length(var);
str=strcat(str,file(var(i),i));
end
If you want a function to accept a flexible number of input arguments, you need varargin.
In case you want the final string to be composed of the names of your variables as in your workspace, I found no way, since you need varargin and then it looks impossible. But if you are fine with having var1, var2 in your string, you can define this function and then use it:
function str = strgen(varargin)
str = '';
for ii = 1:numel(varargin);
str = sprintf('%s var%d = %s', str, ii, num2str(varargin{ii}));
end
str = str(2:end); % to remove the initial blank space
It is also compatible with strings. Testing it:
% A = pi;
% B = 'Hello!';
strgen(A, B)
ans =
var1 = 3.1416 var2 = Hello!
I would like to use a dataset filename "AUDUSD" in several functions. It would be easier for me, just to change the filename "AUDUSD" to a more general name like "FX" and then using the abbreviation "FX" in other_matlab functions, e.g. double(). But matlab does not know the name "FX" (that should be assigned to the dataset "AUDUSD") in the code below... Any suggestions?
CODE:
FX = 'AUDUSD';
load(FX); %OKAY !!! FX works as input to open file AUDUSD!
Svars = {'S_bid','S_offer'};
Fvars = {'F_bid','F_offer'};
vS = double(FX,Svars); % FX does NOT work as input for the file AUDUSD
There is no double() function that accepts multiple cell arrays as arguments (this is what happens when you call double(FX,Svars)).
If you call double(FX), then each character in FX is interpreted for its ASCII value and then cast to double. So you get [ 65.0 85.0 68.0 85.0 83.0 68.0 ]. This is the behavior for the double() function if you provide a vector: each individual value in the vector is cast to double.
You'd have to provide more details on what you're trying to accomplish to give any more suggestions.
I have a different example, maybe you will better understand my point. The key work I would like to process is as follows:
I have got a folder with "dataset" files. I would like to loop through this folder, entering in any datasetfile, extracting the 2nd and 3rd column of each dataset file, and constructing only ONE new datasetfile with all 2nd and 3rd columns of the datasetfiles.
One problem is that the size of the datasetfiles are not the same, so I tried to translate a datasetfile into a double-matrix and then consolidate all double matrices into ONE double matrx.
Here my code:
folder_string = 'Diss_Data/Raw';
FolderContent = dir(folder_string);
No_ds = numel(FolderContent);
for i = 1:No_ds
if isdir(FolderContent(i).name)==0
file_string = FolderContent(i).name;
file_path = strcat(folder_string,'/',file_string)
dataset_filename = file_string(1:6);
load(file_path); %loads the suggested datasetfile; OKAY
M = double(dataset_filename);% returns an ASCII code number; WRONG; should transfer the datasetfile into a matrix M
vS = M(:,2:3);
%... to be continued
end
end
I have the following example which expresses the type of problem that I'm trying to solve:
clear all
textdata = {'DateTime','St','uSt','Ln','W'};
data = rand(365,4);
Final = struct('data',data,'textdata',{textdata})
clear textdata data
From this, Final.data contains values which correspond to the headings in Final.textdata excluding the first ('DateTime') thus Final.data(:,1) corresponds to the heading 'St'... and so on. What I'm trying to do is to create a variable in the workspace for each of these vectors. So, I would have a variable for St, uSt, Ln, and W in the workspace with the corresponding values given in Final.data.
How could this be done?
Will this solve your problem:
for ii=2:length( textdata )
assignin('base',Final.textdata{ii},Final.data(:,ii-1));
end
Let me know if I misunderstood.
The direct answer to your question is to use the assignin function, like so (edit: just like macduff suggested 10 minutes ago):
%Starting with a Final structure containing the data, like this
Final.textdata = {'DateTime','St','uSt','Ln','W'};
Final.data = rand(365,4);
for ix = 1:4
assignin('base',Final.textdata{ix+1}, Final.data(:,ix));
end
However, I strongly discourage using dynamic variable names to encode data like this. Code that starts this way usually ends up as spaghetti code full of long string concatenations and eval statements. Better is to use a structure, like this
for ix = 1:4
dataValues(Final.textdata{ix+1}) = Final.data(:,ix);
end
Or, to get the same result in a single line:
dataValues = cell2struct(num2cell(Final.data,1), Final.textdata(2:end),2)