charToDate(x) error when using "seqformat" in TraMineR

I'm using TraMineR to inspect work trajectories.
When using the seqformat function (from SPELL data) with process = TRUE and an external data frame for pdata, as follows:
situations <- seqformat(data[,1:4], id = 1, from = "SPELL", to = "STS",
                        begin = 3, end = 4, status = 2, right = NA,
                        process = TRUE, limit = 7644, pdata = pdata,
                        pvar = c("id","birth"))
I get an error message:
Error in charToDate(x) :
character string is not in a standard unambiguous format
I read many threads about that issue, but could not find any helpful solution.
Here are the structures of my data frames data and pdata :
str(data)
'data.frame': 2428 obs. of 9 variables:
 $ ID_SQ           : Factor w/ 798 levels "1","2","3","5",..: 1 1 1 1 1 2 2 ...
 $ SITUATION       : chr "En poste" "En poste" "En poste" "En poste" ...
 $ DATE_DE         : Date, format: "1997-09-01" "1999-05-03" "2003-01-01" ...
 $ DATE_A          : Date, format: "1999-04-26" "2002-12-31" "2006-04-28" ...
 $ SEXE            : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 1 1 1 ...
 $ PROMO           : Factor w/ 6 levels "1997","1998",..: 1 1 1 1 1 2 2 ...
 $ DEPARTEMENT     : Factor w/ 10 levels "BC","GCU","GE",..: 1 1 1 1 1 4 4 4 4 4 ...
 $ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA NA NA NA NA 1 1 1 1 1 ...
 $ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA NA NA NA NA 4 4 4 4 4 ...
str(pdata)
'data.frame': 798 obs. of 2 variables:
 $ id   : Factor w/ 798 levels "1","2","3","5",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ birth: Date, format: "1997-01-01" "1998-01-01" "1998-01-01" "2000-01-01" ...
It seems to me that all date formats are OK.
But, clearly, something's wrong.
What am I doing wrong?
Thank you in advance for your help,
Best,
Arnaud.

The seqformat function expects integer values for the begin and end dates of the spells. These integers are the (time-)positions in the state sequence and, in your example, will correspond to column numbers in the resulting STS format.
So you need to transform your dates into integer values.
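For instance, with a monthly time axis you could derive those positions from your Date columns before calling seqformat. This is only a sketch: the monthly granularity, the origin date, and the month_index helper are assumptions, not part of seqformat, and limit (if you keep it) must be expressed on the same scale.

month_index <- function(d, origin = as.Date("1997-01-01")) {
  # number of months elapsed since the origin, starting at 1
  12 * (as.integer(format(d, "%Y")) - as.integer(format(origin, "%Y"))) +
    (as.integer(format(d, "%m")) - as.integer(format(origin, "%m"))) + 1
}

data$BEGIN  <- month_index(data$DATE_DE)
data$END    <- month_index(data$DATE_A)
pdata$birth <- month_index(pdata$birth)

situations <- seqformat(data[, c("ID_SQ", "SITUATION", "BEGIN", "END")],
                        id = 1, from = "SPELL", to = "STS",
                        begin = 3, end = 4, status = 2, right = NA,
                        process = TRUE, pdata = pdata,
                        pvar = c("id", "birth"))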
=============
The error
Error in charToDate(x) : character string is not in a standard unambiguous format
occurs when the function tests whether pdata is the string "auto" with if (pdata == "auto"). When pdata contains dates, the comparison attempts to coerce "auto" into a date, which fails. The workaround is to supply the dates as integers.
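You can reproduce the coercion problem outside seqformat; this minimal illustration just mimics the comparison of a Date with the string "auto":

birth <- as.Date("1997-01-01")
birth == "auto"
# Error in charToDate(x) :
#   character string is not in a standard unambiguous format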

Related

Is there a simple way in Pyspark to find out the number of promotions it took to convert someone into a customer?

I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2+3) promotions to convert ID 1 into a customer and (19) to convert ID 2.
E.g., the expected output:
ID  Promotions
1   5
2   19
I am unable to think of a way to solve it. Can you please help me?
Corralien and mozway have helped with the solution in Python, but I am unable to implement it in Pyspark because of the huge dataframe size (>1 TB).
You can use:
prom = (df.groupby('ID')['Promotions'].cumsum()
          .where(df['Converted to customer'].eq(1))
          .dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)
# Output
   ID    Date  Promotion
1   1  10-Jan          5
3   2  10-Jan         19
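Since the question asks for Pyspark, here is a sketch of the same cumulative-sum idea using a window function. It assumes df is a Spark DataFrame with the columns shown above and that Date sorts chronologically (with strings like "2-Jan" you would first parse them into a real date); the names cum_prom and Promotions_to_convert are illustrative.

from pyspark.sql import functions as F, Window

# running total of promotions per ID, ordered by Date
w = (Window.partitionBy("ID").orderBy("Date")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

result = (df.withColumn("cum_prom", F.sum("Promotions").over(w))
            .filter(F.col("Converted to customer") == 1)             # keep conversion rows only
            .groupBy("ID")
            .agg(F.min("cum_prom").alias("Promotions_to_convert")))  # first conversion per ID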
Use one groupby to generate a mask to hide the rows, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1     5
2    19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
   ID  Number
0   1       5
1   2      19
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max')
              })
      )
Output:
   ID  Number  Converted
0   1       5          1
1   2      19          1
2   3      39          0
Alternative input:
   ID    Date  Promotions  Converted to customer
0   1   2-Jan           2                      0
1   1  10-Jan           3                      1
2   1  14-Jan           3                      0
3   2  10-Jan          19                      1
4   2  10-Jan           8                      0
5   2  10-Jan          12                      0
6   3  10-Jan          19                      0   # this group has
7   3  10-Jan           8                      0   # no conversion
8   3  10-Jan          12                      0   # to customer
You want to compute something by ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct which computes the result for a
dataframe consisting of only one ID.
Assuming data are ordered by Date, I guess that
def agg_fct(df):
    index_of_conv = df["Converted to customer"].argmax()
    return df.iloc[0:index_of_conv, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted.
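A possible version of those adjustments (a sketch only, not the answer above: the helper name promotions_to_convert is illustrative, rows within each ID are assumed to be sorted by Date, and the conversion row itself is included so the totals match the expected 5 and 19):

def promotions_to_convert(df):
    converted = df["Converted to customer"].eq(1)
    if not converted.any():
        return float("nan")                     # this customer never converted
    first_conv = converted.to_numpy().argmax()  # position of the first conversion row
    return df["Promotions"].iloc[:first_conv + 1].sum()

out = data.groupby("ID").apply(promotions_to_convert)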

Convolution of two symbolic arrays on Matlab

I have two arrays:
p1=[sym(1) sym(2)]
p2=[sym(3) sym(4)]
I want to compute the convolution of those two arrays using the conv function.
Matlab outputs the following:
Error using conv2
Invalid data type. First and second arguments must be numeric or logical.
Error in conv (line 43)
c = conv2(a(:),b(:),shape);
Can anyone help me deal with that?
Edit 1: I do not have the Symbolic Math Toolbox, so I demonstrated the matrix-wise operation on numeric values; I think it should do the trick on symbolic values as well.
The conv function only accepts numeric values, but there is a way to calculate conv matrix-wise.
I demonstrate it with an example:
Assume u and v are as follows:
u =
1 2 1 3
v =
2 7 1
>> conv(u,v)
ans =
2 11 17 15 22 3
Instead, we could first calculate u'*v, then do some rearranging and summing to calculate conv.
So first:
>> c=u'*v
c=
2 7 1
4 14 2
2 7 1
6 21 3
Then we do some rearranging:
>> d=[c;zeros(3,3)]
d =
2 7 1
4 14 2
2 7 1
6 21 3
0 0 0
0 0 0
0 0 0
>> e = reshape(d(1:end-3),[6,3])
e=
2 0 0
4 7 0
2 14 1
6 7 2
0 21 1
0 0 3
And finally, adding the values together:
>> sum(e,2)
ans =
2
11
17
15
22
3
You can write your own code that uses the size of v to do this in general (append a numel(v)-by-numel(v) block of zeros to the end of u'*v, and so on), as sketched below.
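A sketch of that generalisation (the name conv_outer is illustrative; it works for numeric row vectors and should also work for sym inputs if the Symbolic Math Toolbox is installed, which I could not test):

function w = conv_outer(u, v)
% Convolution via shifted columns of the outer product u(:)*v(:).'
    n = numel(u); m = numel(v);
    c = u(:) * v(:).';                   % n-by-m outer product
    d = [c; zeros(m, m)];                % pad so every column can be shifted down
    e = reshape(d(1:end-m), n+m-1, m);   % column k ends up shifted down by k-1 rows
    w = sum(e, 2).';                     % add the shifted columns: length n+m-1
end

For the example above, conv_outer([1 2 1 3],[2 7 1]) returns 2 11 17 15 22 3, the same as conv(u,v).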

Precision Function Numeric Variable Error in R

I am currently working on neural networks and, for some odd reason, I am getting an error saying that the results need to be numeric when passing them to the Precision function, as can be seen below.
Code Snippet:
originalValues <- max.col(dataTest[,11:15])
originalValues
predictedValues <- max.col(predictedResult)
predictedValues
CrossTable(originalValues, predictedValues, prop.r = FALSE, prop.c = FALSE, prop.t = FALSE,
           prop.chisq = FALSE)
originalValues
predictedValues
Accuracy(originalValues, predictedValues)
Precision(originalValues, predictedValues)
Recall(originalValues, predictedValues)
When running this code, the Accuracy function works, but for some reason the Precision and Recall functions behave strangely, as seen below.
Result:
> originalValues
[1] 1 1 1 1 1 2 2 3 3 3 4 4 4 4 4 5 5 5
> predictedValues
[1] 2 2 2 2 2 2 2 3 3 3 5 5 5 5 5 5 5 5
> Accuracy(originalValues, predictedValues)
[1] 0.4444444
> Precision(originalValues, predictedValues)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
> Recall(originalValues, predictedValues)
[1] NA

Calculated field in Tableau

I have a very simple problem, but I am totally new to Tableau, so I need some help solving it.
My data set contains
Year_Track_4, Year_Track_5, Year_Track_6, Year_Track_7, ..., N
Each Year_Track contains 1/0 values: 1 means graduated and 0 means did not graduate (failed).
y4  y5  N
1       8
0       5
1       6
0       1
1       2
1       5
1       7
1       8
1       5
0       7
1       5
1       8
1       6
1       1
So I want to create a placeholder, calculated field, or parameter in Tableau to select one year and count the number who graduated and who did not.
I also need to create OverAll_0 and OverAll_1 as one calculated field containing the values 1 and 0, so that I can use SUM(N) to do the calculation.
I used an IF statement to solve this problem:
IF [Year_Track_4] = 1 THEN 'graduated in 4 years'
.......
......
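To handle the year selection as well, one option (just a sketch; the parameter name [Select Year] and the field name are assumptions) is a string parameter listing the tracked years plus a calculated field like:

// Calculated field "Graduated in selected year"
CASE [Select Year]
    WHEN '4' THEN [Year_Track_4]
    WHEN '5' THEN [Year_Track_5]
    WHEN '6' THEN [Year_Track_6]
    WHEN '7' THEN [Year_Track_7]
END

Putting this field on the view next to SUM(N) then splits the counts into graduated (1) and not graduated (0) for the selected year.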

Matlab (textscan), read characters from specified column and row

I have a number of text files with data, and want to read a specific part of each file (time information), which is always located at the end of the first row of each file. Here's an example:
%termo2, 30-Jan-2016 12:27:20
I.e. I would like to get "12:27:20".
I've tried using textscan, which I have used before for similar problems. I figured there are 3 columns of this row, with single white space as delimiter.
I first tried to specify these as strings (%s):
fid = fopen(fname);
time = textscan(fid,'%s %s %s');
I also tried to specify the date and time using datetime format:
time = textscan(fid,'%s %{dd-MMM-yyyy}D %{HH:mm:ss}D')
Both of these just produce a blank cell. (I've also tried a number of variations, such as defining the delimiter as ' ', with the same result)
Thanks for any help!
Here's the entire file (not sure pasting it here is the right way to do this; I'm new to both Matlab and Stack Overflow):
%termo2, 30-Jan-2016 12:27:20
%
%102
%
%stimkod stimtyp
% 1 Next:Pain
% 2 Next:Brush
% vaskod text
% 1 Obeh -> Beh
% 2 Inte alls intensiv -> Mycket intensiv
% stimnr starttid stimkod vaskod VASstart VASmark VAS
1 78.470 2 1 96.470 100.708 6.912
1 78.470 2 2 96.470 104.739 2.763
2 138.822 1 2 156.821 162.619 7.615
2 138.822 1 1 156.821 166.659 2.496
3 199.117 2 2 217.116 222.978 2.897
3 199.117 2 1 217.116 224.795 5.773
4 258.612 2 1 276.612 280.419 5.395
4 258.612 2 2 276.612 284.145 4.622
5 320.068 1 1 338.068 340.689 4.396
5 320.068 1 2 338.068 346.090 2.722
6 377.348 1 2 395.347 398.809 6.336
6 377.348 1 1 395.347 404.465 3.391
7 443.707 2 1 461.707 464.840 6.604
7 443.707 2 2 461.707 473.703 3.652
8 503.122 1 2 521.122 526.009 4.285
8 503.122 1 1 521.122 529.808 3.646
9 568.546 2 2 586.546 586.546 5.000
9 568.546 2 1 586.546 595.496 6.412
10 629.953 2 1 647.953 650.304 7.034
10 629.953 2 2 647.953 655.600 6.615
11 694.305 1 1 712.305 714.416 4.669
11 694.305 1 2 712.305 721.079 2.478
12 751.537 2 2 769.537 773.511 7.307
12 751.537 2 1 769.537 777.423 8.225
13 813.944 1 2 831.944 834.958 7.731
13 813.944 1 1 831.944 839.255 1.363
14 872.448 2 1 890.448 893.829 6.813
14 872.448 2 2 890.448 899.439 2.600
15 939.880 1 2 957.880 963.811 4.332
15 939.880 1 1 957.880 966.603 2.786
16 998.328 2 1 1016.327 1020.707 5.837
16 998.328 2 2 1016.327 1025.275 2.664
17 1062.911 1 2 1080.910 1082.967 2.792
17 1062.911 1 1 1080.910 1088.674 4.094
18 1125.182 1 1 1143.182 1144.379 0.619
18 1125.182 1 2 1143.182 1151.786 8.992
If you're not reading in the entire file, you could just read the first line using fgetl, split it on whitespace (using regexp), and then grab the last element.
parts = regexp(fgetl(fid), '\s+', 'split');
last = parts{end};
That being said, there doesn't seem to be anything wrong with the way you're using textscan if your file is actually how you say. You could alternatively do something like:
parts = textscan(fid, '%s', 3);
last = parts{1}{end}
Update
Also, be sure to rewind the file pointer using frewind before trying to parse the file to ensure that it starts at the top of the file.
frewind(fid)
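Putting it together, a minimal sketch (fname is assumed to hold the path to one of your files):

fid = fopen(fname);
parts = regexp(fgetl(fid), '\s+', 'split');   % read the first line and split on whitespace
timeStr = parts{end};                         % e.g. '12:27:20'
fclose(fid);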