SAS to PySpark Conversion - pyspark

I have the following SAS code:
data part1;
set current.part;
by DEVICE_ID part_flag_d
if first.DEVICE_ID or first.part_flag_d;
ITEM_NO = 0;
end;
else do;
ITEM_NO + 1;
end;
run;
I am converting this to PySpark and getting stuck. I have the 'part' DataFrame. Where I am getting stuck is trying to convert the following line:
if first.DEVICE_ID or first.part_flag_d;
I know it's getting the first entry of each column, but is it also checking for null? What is the OR condition saying?
Would appreciate any direction on how to script that line.

Please refer to the documentation for the by statement:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/a000202968.htm#a002651313
Processing BY Groups
SAS assigns the following values to
FIRST.variable and LAST.variable:
FIRST.variable has a value of 1 under the following conditions:
when the current observation is the first observation that is read
from the data set.
when you do not use the GROUPFORMAT option and the internal value of
the variable in the current observation differs from the internal
value in the previous observation.
If you use the GROUPFORMAT option, FIRST.variable has a value of 1
when the formatted value of the variable in the current observation
differs from the formatted value in the previous observation.
FIRST.variable has a value of 1 for any preceding variable in the BY
statement.
In all other cases, FIRST.variable has a value of 0.
LAST.variable has a value of 1 under the following conditions:
when the current observation is the last observation that is read from
the data set.
when you use the GROUPFORMAT option and the internal value of the
variable in the current observation differs from the internal value in
the next observation.
If you use the GROUPFORMAT option, LAST.variable has a value of 1 when
the formatted value of the variable in the current observation differs
from the formatted value in the next observation.
LAST.variable has a value of 1 for any preceding variable in the BY
statement.
In all other cases, LAST.variable has a value of 0.

Please see the code below. It is generated by an automated conversion tool called SPROCKET you can find the info about it here
It should take care of the first.variable
# DATASTEP
part1 = (
CURRENT_part_df # change to your set df name
.withColumn('_first_DEVICE_ID', when(
row_number().over(Window.partitionBy(['part_flag_d']).orderBy('DEVICE_ID')) == 1, 1)
.otherwise(lit(0)))
# WARNING: NO ORDER DETECTED, MANUALLY SPECIFY ORDER VARIABLE
.withColumn('_first_part_flag_d', when(
row_number().over(Window.partitionBy(['DEVICE_ID','part_flag_d']).orderBy("")) == 1, 1)
.otherwise(lit(0)))
.withColumn('ITEM_NO', expr("""case
when (_first_DEVICE_ID > 0 or _first_part_flag_d > 0) then 0
when not ((_first_DEVICE_ID > 0 or _first_part_flag_d > 0)) then ITEM_NO + 1
else ITEM_NO end"""))
.drop('_first_DEVICE_ID')
.drop('_first_part_flag_d')
)

Related

How to save and pass a variable in Matlab

I can find / identify all digits in a number using a recursive function.
My issue is trying to sum the number through each recursion.
I had to initialize sum = 0 at the top of my function statement, and when I return through recursion, I'm always resetting sum back to zero. I do not know how to save a variable without first initalizing it.
Code is below;
function output= digit_sum(input)
sum=0
if input < 10
output = input
else
y=rem(input,10);
sum=sum+y
z=floor(input/10);
digit_sum(z)
end
output=sum
end

SPSS/macro: split string into multiple variables

I am trying to split a string variable into multiple dummy coded variables. I used these sources to get an idea of how one would achieve this task in SPSS:
https://www.ibm.com/support/pages/making-multiple-string-variables-single-multiply-coded-field
https://www.spss-tutorials.com/spss-split-string-variable-into-separate-variables/
But when I try to adapt the first one to my needs or when I try to convert the second one to a macro, I fail.
In my dataset I have (multiple) variables that contain a comma seperated string that represents different combinations of selected items (as well as missing values). For each item of a specific variable I want to create a dummy variable. If the item was selected, it should be represented with a 1 in the new dummy variable. If it was not selected, that case should be represented with a 0.
Different input variables can contain different numbers of items.
For example:
ID
VAR1
VAR2
DMMY1_1
DMMY1_2
DMMY1_3
1
1, 2
8
1
1
0
2
1
1, 3
1
0
0
3
3, 1
2, 3, 1
1
0
1
4
2, 8
0
0
0
Here is what I came up with so far ...
* DEFINE DATA.
DATA LIST /ID 1 (F) VAR1 2-5 (A) VAR2 6-12 (A).
BEGIN DATA
11, 28
21 1, 3
33, 12, 3, 1
4 2, 8
END DATA.
* MACRO SYNTAX.
* DEFINE VARIABLES (in the long run these should/will be inside the macro function, but for now I will leave them outside).
NUMERIC v1 TO v3 (F1).
VECTOR v = v1 TO v3.
STRING #char (A1).
DEFINE split_var(vr = TOKENS(1)).
!DO !#pos=1 !TO char.length(!vr).
COMPUTE #char = char.substr(!vr, !#pos, 1).
!IF (!#char !NE "," !AND !#char !NE " ") !THEN
COMPUTE v(NUMBER(!#char, F1)) = 1.
!IFEND.
!DOEND.
!ENDDEFINE.
split_var vr=VAR1.
EXECUTE.
As I got more errors than I can count, it's hard to narrow down my problem. But I think the problem has something to do with the way I use the char.length() function (and I am a bit confused when to use the bang operator).
If anyone has some insights, I would really appreciate some help :)
There is a fundamental issue to understand about SPSS macro - the macro does not read or interact in any way with the data. All the macro does is manipulate text to write syntax. The syntax created will later work on the actual data when you run it.
So, for example, Your first error is using char.length(!vr) within the syntax. You are trying to get the macro to read the data, calculate the length and use, but that simply can't be done - the macro can only work with what you gave it.
Another example in your code: you calculate #char and then try to use it in the macro as !#char. So that obviously won't work. ! precedes only macro functions or arguments. #char, in your code, is neither, and it can't become one - can't read the data into the macro...
To give you a litte push forward: I understand you want the macro loop to run a different number of times for each variable, but you can't use char.length(!vr). I suggest instead have the macro loop as many times as necessary to be sure you can deal with the longest variable you'll need to work with.
And another general strategy hint - first, create syntax to deal with one specific variable and one specific delimiter. Once this works, start working on a macro, keeping in mind that the only purpose of the macro is to recreate the same working syntax, only changing the parameters of variable name and delimiter.
With my new understanding of the SPSS macro logic (thanks to #eli-k) the problem was quite easy to solve. Here is the working solution.
* DEFINE DATA.
DATA LIST /ID 1 (F) VAR1 2-5 (A) VAR2 6-12 (A).
BEGIN DATA
11, 28
21 1, 3
33, 12, 3, 1
4 2, 8
END DATA.
* DEFINE MACRO.
DEFINE #split_var(src_var = !TOKENS(1)
/dmmy_var_label = !DEFAULT(dmmy) !TOKENS(1)
/dmmy_var_lvls = !TOKENS(1))
NUMERIC !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls) (F1).
VECTOR #dmmy_vec = !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls).
STRING #char (A1).
LOOP #pos=1 TO char.length(!src_var).
COMPUTE #char = char.substr(!src_var, #pos, 1).
DO IF (#char NE "," AND #char NE " ").
COMPUTE #index = NUMBER(#char, F1).
COMPUTE #dmmy_vec(#index) = 1.
END IF.
END LOOP.
RECODE !CONCAT(!dmmy_var_label,1) TO !CONCAT(!dmmy_var_label, !dmmy_var_lvls) (SYSMIS=0) (ELSE=COPY).
EXECUTE.
!ENDDEFINE.
* CALL MACRO.
#split_var src_var=VAR2 dmmy_var_lvls=8.

q - internal state in while loop

In q, I am trying to call a function f on an incrementing argument id while some condition is not met.
The function f creates a table of random length (between 1 and 5) with a column identifier which is dependent on the input id:
f:{[id] len:(1?1 2 3 4 5)0; ([] identifier:id+til len; c2:len?`a`b`c)}
Starting with id=0, f should be called while (count f[id])>1, i.e. so long until a table of length 1 is produced. The id should be incremented each step.
With the "repeat" adverb I can do the while condition and the starting value:
{(count x)>1} f/0
but how can I keep incrementing the id?
Not entirely sure if this will fix your issue but I was able to get it to work by incrementing id inside the function and returning it with each iteration:
q)g:{[id] len:(1?1 2 3 4 5)0; id[0]:([] identifier:id[1]+til len; c2:len?`a`b`c);#[id;1;1+]}
In this case id is a 2 element list where the first element is the table you are returning (initially ()) and the second item is the id. By amending the exit condition I was able to make it stop whenever the output table had a count of 1:
q)g/[{not 1=count x 0};(();0)]
+`identifier`c2!(,9;,`b)
10
If you just need the table you can run first on the output of the above expression:
q)first g/[{not 1=count x 0};(();0)]
identifier c2
-------------
3 a
The issue with the function f is that when using over and scan the output if each iteration becomes the input for the next. In your case your function is working on a numeric value put on the second pass it would get passed a table.

Variable labels in SPSS Macro

I'm new to the SPSS macro syntax and had a hard time trying to label variables based on a simple loop counter. Here's what I tried to do:
define !make_indicatorvars()
!do !i = 1 !to 10.
!let !indicvar = !concat('indexvar_value_', !i, '_ind')
compute !indicvar = 0.
if(indexvar = !i) !indicvar = 1.
variable labels !indicvar 'Indexvar has value ' + !quote(!i).
value labels !indicvar 0 "No" 1 "Yes".
!doend
!enddefine.
However, when I run this, I get the following warnings:
Warning # 207 on line ... in column ... Text: ...
A '+' was found following a text string, indicating continuation, but the next non-blank character was not a quotation mark or an apostrophe.
Warning # 4465 in column ... Text: ...
An invalid symbol appears on the VAR LABELS command where a slash was
expected. All text through the next slash will be be ignored.
Indeed the label is then only 'Indexvar has value '.
Upon using "set mprint on printback on", the following code was printed:
variable labels indexvar_value_1_ind 'Indexvar has value ' '1'
So it appears that SPSS seems to somehow remove the "+" which is supposed to concatenate the two strings, but why?
The rest of the macro worked fine, it's only the variable labels command that's causing problems.
Try:
variable labels !indicvar !quote(!concat('Indexvar has value ',!i)).
Also note:
compute !indicvar = 0.
if(indexvar = !i) !indicvar = 1.
Can be simplified as:
compute !indicvar = (indexvar = !i).
Where the right hand side of the COMPUTE equation evaluates to equal TRUE a 1 (one) is assigned else if FALSE a 0 (zero) is assigned. Using just a single compute in this way not only reduce the lines of code, it will also make the transformations more efficient/quicker to run.
You might consider the SPSSINC CREATE DUMMIES extension command. It will automatically construct a set of dummies for a variable and label them with the values or value labels. It also creates a macro that lists all the variables. There is no need to enumerate the values. It creates dummies for all the values in the data.
It appears on the Transform menu as long as the Python Essentials are installed. Here is an example using the employee data.sav file shipped with Statistics.
SPSSINC CREATE DUMMIES VARIABLE=jobcat
ROOTNAME1=job
/OPTIONS ORDER=A USEVALUELABELS=YES USEML=YES OMITFIRST=NO
MACRONAME1="!jobcat".

For Matlab, Import data from xlsx, how to get 1st row as variable name, and rest of the column as data for variable name

In my excel file. The very first row in each column is a string. The rest of the column is data for that string, i.e
'time'
1
2
3
4
I want to take the first row in excel, and make that the variable name in Matlab, and the rest of the column data is numerical data for that variable. So in Matlab, time would be a column vector of numbers 1, 2 ,3 ,4.
I can't get this to work.
how about
[val nms] = xlsread( xlsFileName );
assert( size(val,2) == size(nms,2), 'mismatch number of columns and number of names');
for ci=1:size(val,2)
eval( [ nms{ci}, ' = val(:,ci);' ] ); % name the column
end
What makes this work:
This code calls xlsread with two output variables. This way xlsread puts the numeric data into the first variable, and text data into the second. See xlsread doc for more info.
Using eval to assign values to variable (time) which name is stored in another variable (nms{1}). The argument of the eval commands is a string time = val(:,1);, which the Matlab command that assigns the values of the first column of data (val(:,1)) to a new variable named time.