SPSS/macro: split string into multiple variables - macros

I am trying to split a string variable into multiple dummy coded variables. I used these sources to get an idea of how one would achieve this task in SPSS:
https://www.ibm.com/support/pages/making-multiple-string-variables-single-multiply-coded-field
https://www.spss-tutorials.com/spss-split-string-variable-into-separate-variables/
But when I try to adapt the first one to my needs or when I try to convert the second one to a macro, I fail.
In my dataset I have (multiple) variables that contain a comma seperated string that represents different combinations of selected items (as well as missing values). For each item of a specific variable I want to create a dummy variable. If the item was selected, it should be represented with a 1 in the new dummy variable. If it was not selected, that case should be represented with a 0.
Different input variables can contain different numbers of items.
For example:
ID
VAR1
VAR2
DMMY1_1
DMMY1_2
DMMY1_3
1
1, 2
8
1
1
0
2
1
1, 3
1
0
0
3
3, 1
2, 3, 1
1
0
1
4
2, 8
0
0
0
Here is what I came up with so far ...
* DEFINE DATA.
DATA LIST /ID 1 (F) VAR1 2-5 (A) VAR2 6-12 (A).
BEGIN DATA
11, 28
21 1, 3
33, 12, 3, 1
4 2, 8
END DATA.
* MACRO SYNTAX.
* DEFINE VARIABLES (in the long run these should/will be inside the macro function, but for now I will leave them outside).
NUMERIC v1 TO v3 (F1).
VECTOR v = v1 TO v3.
STRING #char (A1).
DEFINE split_var(vr = TOKENS(1)).
!DO !#pos=1 !TO char.length(!vr).
COMPUTE #char = char.substr(!vr, !#pos, 1).
!IF (!#char !NE "," !AND !#char !NE " ") !THEN
COMPUTE v(NUMBER(!#char, F1)) = 1.
!IFEND.
!DOEND.
!ENDDEFINE.
split_var vr=VAR1.
EXECUTE.
As I got more errors than I can count, it's hard to narrow down my problem. But I think the problem has something to do with the way I use the char.length() function (and I am a bit confused when to use the bang operator).
If anyone has some insights, I would really appreciate some help :)

There is a fundamental issue to understand about SPSS macro - the macro does not read or interact in any way with the data. All the macro does is manipulate text to write syntax. The syntax created will later work on the actual data when you run it.
So, for example, Your first error is using char.length(!vr) within the syntax. You are trying to get the macro to read the data, calculate the length and use, but that simply can't be done - the macro can only work with what you gave it.
Another example in your code: you calculate #char and then try to use it in the macro as !#char. So that obviously won't work. ! precedes only macro functions or arguments. #char, in your code, is neither, and it can't become one - can't read the data into the macro...
To give you a litte push forward: I understand you want the macro loop to run a different number of times for each variable, but you can't use char.length(!vr). I suggest instead have the macro loop as many times as necessary to be sure you can deal with the longest variable you'll need to work with.
And another general strategy hint - first, create syntax to deal with one specific variable and one specific delimiter. Once this works, start working on a macro, keeping in mind that the only purpose of the macro is to recreate the same working syntax, only changing the parameters of variable name and delimiter.

With my new understanding of the SPSS macro logic (thanks to #eli-k) the problem was quite easy to solve. Here is the working solution.
* DEFINE DATA.
DATA LIST /ID 1 (F) VAR1 2-5 (A) VAR2 6-12 (A).
BEGIN DATA
11, 28
21 1, 3
33, 12, 3, 1
4 2, 8
END DATA.
* DEFINE MACRO.
DEFINE #split_var(src_var = !TOKENS(1)
/dmmy_var_label = !DEFAULT(dmmy) !TOKENS(1)
/dmmy_var_lvls = !TOKENS(1))
NUMERIC !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls) (F1).
VECTOR #dmmy_vec = !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls).
STRING #char (A1).
LOOP #pos=1 TO char.length(!src_var).
COMPUTE #char = char.substr(!src_var, #pos, 1).
DO IF (#char NE "," AND #char NE " ").
COMPUTE #index = NUMBER(#char, F1).
COMPUTE #dmmy_vec(#index) = 1.
END IF.
END LOOP.
RECODE !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls) (SYSMIS=0) (ELSE=COPY).
EXECUTE.
!ENDDEFINE.
* CALL MACRO.
#split_var src_var=VAR2 dmmy_var_lvls=8.

Related

Minizinc: declare explicit set in decision variable

I'm trying to implement the 'Sport Scheduling Problem' (with a Round-Robin approach to break symmetries). The actual problem is of no importance. I simply want to declare the value at x[1,1] to be the set {1,2} and base the sets in the same column upon the first set. This is modelled as in the code below. The output is included in a screenshot below it. The problem is that the first set is not printed as a set but rather some sort of range while the values at x[2,1] and x[3,1] are indeed printed as sets and x[4,1] again as a range. Why is this? I assume that in the declaration of x that set of 1..n is treated as an integer but if it is not, how to declare it as integers?
EDIT: ONLY the first column of the output is of importance.
int: n = 8;
int: nw = n-1;
int: np = n div 2;
array[1..np, 1..nw] of var set of 1..n: x;
% BEGIN FIX FIRST WEEK $
constraint(
x[1,1] = {1, 2}
);
constraint(
forall(t in 2..np) (x[t,1] = {t+1, n+2-t} )
);
solve satisfy;
output[
"\(x[p,w])" ++ if w == nw then "\n" else "\t" endif | p in 1..np, w in 1..nw
]
Backend solver: Gecode
(Here's a summarize of my comments above.)
The range syntax is simply a shorthand for contiguous values in a set: 1..8 is a shorthand of the set {1,2,3,4,5,6,7,8}, and 5..6 is a shorthand for the set {5,6}.
The reason for this shorthand is probably since it's often - and arguably - easier to read the shorthand version than the full list, especially if it's a long list of integers, e.g. 1..1024. It also save space in the output of solutions.
For the two set versions, e.g. {1,2}, this explicit enumeration might be clearer to read than 1..2, though I tend to prefer the shorthand version in all cases.

Iterate (/) a multivalent function

How do you iterate a function of multivalent rank (>1), e.g. f:{[x;y] ...} where the function inputs in the next iteration step depend on the last iteration step? Examples in the reference manual only iterate unary functions.
I was able to achieve this indirectly (and verbosely) by passing a dictionary of arguments (state) into unary function:
f:{[arg] key[arg]!(min arg;arg[`y]-2)}
f/[{0<x`x};`x`y!6 3]
Note that projection, e.g. f[x;]/[whilecond;y] would only work in the scenario where the x in the next iteration step does not depend on the result of the last iteration (i.e. when x is path-independent).
In relation to Rahul's answer, you could use one of the following (slightly less verbose) methods to achieve the same result:
q)g:{(min x,y;y-2)}
q)(g .)/[{0<x 0};6 3]
-1 -3
q).[g]/[{0<x 0};6 3]
-1 -3
Alternatively, you could use the .z.s self function, which recursively calls the function g and takes the output of the last iteration as its arguments. For example,
q)g:{[x;y] x: min x,y; y:y-2; $[x<0; (x;y); .z.s[x;y]]}
q)g[6;3]
-1 -3
Function that is used with '/' and '\' can only accept result from last iteration as a single item which means only 1 function parameter is reserved for the result. It is unary in that sense.
For function whose multiple input parameters depends on last iteration result, one solution is to wrap that function inside a unary function and use apply operator to execute that function on the last iteration result.
Ex:
q) g:{(min x,y;y-2)} / function with rank 2
q) f:{x . y}[g;] / function g wrapped inside unary function to iterate
q) f/[{0<x 0};6 3]
Over time I stumbled upon even shorter way which does not require parentheses or brackets:
q)g:{(min x,y;y-2)}
q){0<x 0} g//6 3
-1 -3
Why does double over (//) work ? The / adverb can sometimes be used in place of the . (apply) operator:
q)(*) . 2 3
6
q)(*/) 2 3
6

Creating a function with variable number of inputs?

I am trying to define the following function in MATLAB:
file = #(var1,var2,var3,var4) ['var1=' num2str(var1) 'var2=' num2str(var2) 'var3=' num2str(var3) 'var4=' num2str(var4)'];
However, I want the function to expand as I add more parameters; if I wanted to add the variable vark, I want the function to be:
file = #(var1,var2,var3,var4,vark) ['var1=' num2str(var1) 'var2=' num2str(var2) 'var3=' num2str(var3) 'var4=' num2str(var4) 'vark=' num2str(vark)'];
Is there a systematic way to do this?
Use fprintf with varargin for this:
f = #(varargin) fprintf('var%i= %i\n', [(1:numel(varargin));[varargin{:}]])
f(5,6,7,88)
var1= 5
var2= 6
var3= 7
var4= 88
The format I've used is: 'var%i= %i\n'. This means it will first write var then %i says it should input an integer. Thereafter it should write = followed by a new number: %i and a newline \n.
It will choose the integer in odd positions for var%i and integers in the even positions for the actual number. Since the linear index in MATLAB goes column for column we place the vector [1 2 3 4 5 ...] on top, and the content of the variable in the second row.
By the way: If you actually want it on the format you specified in the question, skip the \n:
f = #(varargin) fprintf('var%i= %i', [(1:numel(varargin));[varargin{:}]])
f(6,12,3,15,5553)
var1= 6var2= 12var3= 3var4= 15var5= 5553
Also, you can change the second %i to floats (%f), doubles (%d) etc.
If you want to use actual variable names var1, var2, var3, ... in your input then I can only say one thing: Don't! It's a horrible idea. Use cells, structs, or anything else than numbered variable names.
Just to be crytsal clear: Don't use the output from this in MATLAB in combination with eval! eval is evil. The Mathworks actually warns you about this in the official documentation!
How about calling the function as many times as the number of parameters? I wrote this considering the specific form of the character string returned by your function where k is assumed to be the index of the 'kth' variable to be entered. Array var can be the list of your numeric parameters.
file=#(var,i)[strcat('var',num2str(i),'=') num2str(var) ];
var=[2,3,4,5];
str='';
for i=1:length(var);
str=strcat(str,file(var(i),i));
end
If you want a function to accept a flexible number of input arguments, you need varargin.
In case you want the final string to be composed of the names of your variables as in your workspace, I found no way, since you need varargin and then it looks impossible. But if you are fine with having var1, var2 in your string, you can define this function and then use it:
function str = strgen(varargin)
str = '';
for ii = 1:numel(varargin);
str = sprintf('%s var%d = %s', str, ii, num2str(varargin{ii}));
end
str = str(2:end); % to remove the initial blank space
It is also compatible with strings. Testing it:
% A = pi;
% B = 'Hello!';
strgen(A, B)
ans =
var1 = 3.1416 var2 = Hello!

Variable labels in SPSS Macro

I'm new to the SPSS macro syntax and had a hard time trying to label variables based on a simple loop counter. Here's what I tried to do:
define !make_indicatorvars()
!do !i = 1 !to 10.
!let !indicvar = !concat('indexvar_value_', !i, '_ind')
compute !indicvar = 0.
if(indexvar = !i) !indicvar = 1.
variable labels !indicvar 'Indexvar has value ' + !quote(!i).
value labels !indicvar 0 "No" 1 "Yes".
!doend
!enddefine.
However, when I run this, I get the following warnings:
Warning # 207 on line ... in column ... Text: ...
A '+' was found following a text string, indicating continuation, but the next non-blank character was not a quotation mark or an apostrophe.
Warning # 4465 in column ... Text: ...
An invalid symbol appears on the VAR LABELS command where a slash was
expected. All text through the next slash will be be ignored.
Indeed the label is then only 'Indexvar has value '.
Upon using "set mprint on printback on", the following code was printed:
variable labels indexvar_value_1_ind 'Indexvar has value ' '1'
So it appears that SPSS seems to somehow remove the "+" which is supposed to concatenate the two strings, but why?
The rest of the macro worked fine, it's only the variable labels command that's causing problems.
Try:
variable labels !indicvar !quote(!concat('Indexvar has value ',!i)).
Also note:
compute !indicvar = 0.
if(indexvar = !i) !indicvar = 1.
Can be simplified as:
compute !indicvar = (indexvar = !i).
Where the right hand side of the COMPUTE equation evaluates to equal TRUE a 1 (one) is assigned else if FALSE a 0 (zero) is assigned. Using just a single compute in this way not only reduce the lines of code, it will also make the transformations more efficient/quicker to run.
You might consider the SPSSINC CREATE DUMMIES extension command. It will automatically construct a set of dummies for a variable and label them with the values or value labels. It also creates a macro that lists all the variables. There is no need to enumerate the values. It creates dummies for all the values in the data.
It appears on the Transform menu as long as the Python Essentials are installed. Here is an example using the employee data.sav file shipped with Statistics.
SPSSINC CREATE DUMMIES VARIABLE=jobcat
ROOTNAME1=job
/OPTIONS ORDER=A USEVALUELABELS=YES USEML=YES OMITFIRST=NO
MACRONAME1="!jobcat".

SAS IML use of Mattrib with Macro (symget) in a loop

In an IML proc I have several martices and several vectors with the names of columns:
proc IML;
mydata1 = {1 2 3, 2 3 4};
mydata2 = {1 2, 2 3};
names1 = {'red' 'green' 'blue'};
names2 = {'black' 'white'};
To assign column names to columns in matrices one can copypaste the mattrib statement enough times:
/* mattrib mydata1 colname=names1;*/
/* mattrib mydata2 colname=names2;*/
However, in my case the number of matrices is defined at execution, thus a do loop is needed. The following code
varNumb=2;
do idx=1 to varNumb;
call symputx ('mydataX', cat('mydata',idx));
call symputx ('namesX', cat('names',idx));
mattrib (symget('mydataX')) colname=(symget('namesX'));
end;
print (mydata1[,'red']) (mydata2[,'white']);
quit;
however produces the "Expecting a name" error on the first symget.
Similar question Loop over names in SAS-IML? offers the macro workaround with symget, what produces an error here.
What is the correct way of using mattrib with symget? Is there other way of making a variable from a string than macro?
Any help would be appreciated.
Thanks,
Alex
EDIT1
The problem is in the symget function. The &-sign resolves the name of the matrix contained in the macro variable, the symget only returns the name of the macro.
proc IML;
mydata1 = {1 2 3};
call symputx ('mydataX', 'mydata1');
mydataNew = (symget('mydataX'));
print (&mydataX);
print (symget("mydataX"));
print mydataNew;
quit;
results in
mydata1 :
1 2 3
mydata1
mydataNew :
mydata1
Any ideas?
EDIT2
Function value solves the symget problem in EDIT1
mydataNew = value(symget('mydataX'));
print (&mydataX);
print (value(symget("mydataX")));
print mydataNew;
The mattrib issue but persists.
SOLVED
Thanks Rick, you have opened my eyes to CALL EXECUTE() statement.
When you use CALL SYMPUTX, you should not use quotes for the second argument. Your statement
call symputx ('mydataX', 'mydata1');
assigns the string 'mydata1' to the macro variable.
In general, trying to use macro variables in SAS/IML loops often results in complicated code. See the article Macros and loops in the SAS/IML language for an indication of the issues caused by trying to combine a macro preprocessor with an interactive language. Because the MATTRIB statement expects a literal value for the matrix name, I recomend that you use CALL EXECUTE rather than macro substitution to execute the MATTRIB statement.
You are also having problems because a macro variable is always a scalar string, whereas the column name is a vector of strings. Use the ROWCAT function to concatenate the vector of names into a single string.
The following statements accomplish your objective without using macro variables:
/* Use CALL EXECUTE to set matrix attributes dynamically.
Requires that matrixName and varNames be defined at main scope */
start SetMattrib;
cmd = "mattrib " + matrixName + " colname={" + varNames + "};";
*print cmd; /* for debugging */
call execute(cmd);
finish;
varNumb=2;
do idx=1 to varNumb;
matrixName = cat('mydata',idx);
varNames = rowcat( value(cat('names',idx)) + " " );
run SetMattrib;
end;