Comparing results of factor analysis and PCA in R - cluster-analysis

I did a PCA and a factor analysis on the same dataset so that I could compare the results of both, since I was unsure which would be best for me.
My data has 60 observations and 15 variables (the vectors below are truncated):
Cartoon_RTF = c(247, 226, 251, 202, 252,
GFA = c(2.71, 2.06, 2.16,
Motion_RTF = c(1283, 442, 420, 282, 640,
VA = c(0.63, 1, 1, 0.8, 0.5,
Contrast = c(2.5, 1.5, 1.25, 1.25, 1.5,
Myopia = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
Hyperopia = c(0, 0, 0, 1, 0, 1, 1, 0,
eso = c(1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0
exo = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
VF_ver = c(0, 0, 0, 0, 0,
VF_hor = c(0, 0, 0, 0,
VF_con = c(0, 0, 0, 0, 0,
Op_small = c(0, 0, 0, 0, 0, 0,
Op_pale = c(1, 1, 0, 1, 0, 1, 1,
Op_asym = c(0, 0, 0, 0, 1, 0, 0
I used the prcomp function for PCA, and for the factor analysis I used:
ml <- fa(r = Pca_for_R, nfactors = 6, rotate = "varimax", fm = "ml", residuals = TRUE, mixed.cor)
I got 15 PCs and kept the first 6, whose eigenvalues are above 1.
These are the loadings of my 6 PCs:
                    PC1         PC2         PC3         PC4          PC5         PC6
Cartoon_RTF  0.32995961 -0.23679023  0.09714211 -0.14249623  0.128112561 -0.39507864
GFA          0.26021772 -0.39216973 -0.29859533 -0.08483508 -0.257002491  0.01493252
Motion_RTF   0.17615437 -0.43962825  0.03128701 -0.07902869 -0.368439500 -0.09148335
VA          -0.27921081  0.35660990  0.02185057 -0.03979934 -0.295195633 -0.26417882
Contrast     0.08390296 -0.35382829  0.24149068 -0.06051401  0.355325330  0.09587548
Myopia       0.30481801  0.20996028  0.05114086  0.07703303 -0.121456162  0.44897396
Hyperopia   -0.30051841 -0.02882153 -0.54720537  0.01999867  0.188038057 -0.15846731
eso          0.12474663 -0.05688751 -0.28609792  0.54639892 -0.330644483  0.23368702
exo         -0.11932742 -0.16026206 -0.15166024 -0.57499917 -0.008976848  0.20917168
VF_ver       0.02236080 -0.06079847 -0.19005456  0.32629761  0.402828877 -0.35772386
VF_hor       0.29169117  0.35039327  0.05005377 -0.23540141  0.260904035  0.20401225
VF_con      -0.26406120  0.02269619 -0.30517810 -0.35231236 -0.171637725  0.04884035
Op_small     0.41420388  0.19405489 -0.36959672 -0.10208947  0.011126186 -0.18014461
Op_pale      0.40489249  0.29254657 -0.23278445 -0.15377193 -0.002690577 -0.16102471
Op_asym      0.05952916  0.14294505  0.33492997 -0.07362392 -0.385932437 -0.44825488
These are the loadings of the 6 ML factors:
ML4 ML3 ML1 ML2 ML5 ML6
Cartoon_RTF 0.14 0.05 0.95 0.00 -0.15 0.22
GFA 0.14 0.01 0.04 -0.01 -0.10 0.69
Motion_RTF -0.10 -0.10 0.11 0.00 -0.11 0.50
VA -0.12 -0.18 0.00 -0.03 0.82 -0.22
Contrast -0.07 0.00 0.06 -0.14 -0.33 0.10
Myopia 0.38 -0.31 -0.15 0.12 -0.13 0.02
Hyperopia -0.06 0.89 -0.29 -0.16 0.28 0.03
eso 0.02 0.10 -0.14 0.94 0.04 0.27
exo -0.07 0.03 -0.07 -0.27 -0.02 0.14
VF_ver 0.02 0.23 0.07 0.00 -0.05 -0.05
VF_hor 0.40 -0.12 0.08 -0.06 -0.11 -0.22
VF_con -0.13 0.11 -0.14 -0.17 0.30 0.12
Op_small 0.83 0.14 0.11 0.05 0.08 0.17
Op_pale 0.64 -0.06 0.05 0.09 0.00 0.00
Op_asym 0.09 -0.26 0.01 -0.11 0.11 -0.02
I would like to know why there is such a clear difference between the two when I compare them, and which one would be better for me, also considering that I have mixed data.
My aim is to use these factors or components to perform a cluster analysis.
I am pretty new to R, and any help/advice would be great. Thanks!
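For intuition about why the two loading tables differ, the same comparison can be sketched outside R. Below is a minimal Python/scikit-learn sketch; the random stand-in data and the 6-component choice mirror the question, but nothing here is the poster's actual dataset:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))  # hypothetical stand-in for the 60 x 15 dataset

# PCA: components maximize total explained variance (like prcomp in R)
pca = PCA(n_components=6).fit(X)

# FA: factors model only the shared variance, allowing per-variable noise
fa = FactorAnalysis(n_components=6).fit(X)

# Both produce 6 x 15 loading-like matrices, but they answer different
# questions, which is one reason the two loading tables look so different.
print(pca.components_.shape, fa.components_.shape)  # (6, 15) (6, 15)
```

PCA components are also unrotated by default, while the `fa` call in the question applies a varimax rotation, which by itself changes the loadings substantially.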


Array in pyspark don't return the groupby

I have a Spark DataFrame from which I need to build an array field.
I tried:
df_array = (df_csv.select(
    df_csv.depth, df_csv.height, df_csv.weight, df_csv.width,
    df_csv.seller_id, df_csv.sku, df_csv.navigation_id, df_csv.category,
    df_csv.subcategory, df_csv.max_dimension_p, df_csv.max_side_p,
    sf.struct(df_csv.id_distribution_center, df_csv.id_modality, df_csv.id_modality,
              df_csv.zipcode_initial, df_csv.zipcode_final, df_csv.cost,
              df_csv.city, df_csv.state).alias("infos_gerais_product")))
But it returns the following DataFrame:
depth height weight width seller_id sku navigation_id category subcategory max_dimension_p max_side_p infos_gerais_product
0.21 0.06 1.417 0.38 aabb 111122 333333333 QU GGGG 0.65 0.38 {2500, 2, 2, 8680...
0.21 0.06 1.417 0.38 aabb 111122 333333333 QU GGGG 0.65 0.38 {490, 1, 1, 85685...
0.21 0.06 1.417 0.38 aabb 111122 333333333 QU GGGG 0.65 0.38 {2500, 1, 1, 68556...
I need it to return only one row, because all the other columns hold the same information for each observation; only 'infos_gerais_product' should be aggregated. Can anyone help me?
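A common way to collapse duplicate rows like this in PySpark is to group by the repeated columns and aggregate the struct column with `sf.collect_list`, i.e. `df_csv.groupBy(<shared columns>).agg(sf.collect_list(<struct>).alias("infos_gerais_product"))`. Since the original DataFrame isn't available here, the sketch below mimics that group-and-collect logic in plain Python; the keys and values are hypothetical:

```python
from collections import defaultdict

# Hypothetical flattened rows: the shared columns repeat, only "infos" differs.
rows = [
    {"sku": "111122", "seller_id": "aabb", "infos": (2500, 2)},
    {"sku": "111122", "seller_id": "aabb", "infos": (490, 1)},
    {"sku": "111122", "seller_id": "aabb", "infos": (2500, 1)},
]

# Group by the columns that are identical across duplicates and collect the
# struct-like values into one list per group -- the same idea as PySpark's
# df.groupBy(shared_cols).agg(collect_list("infos")).
groups = defaultdict(list)
for r in rows:
    key = (r["sku"], r["seller_id"])
    groups[key].append(r["infos"])

collapsed = [
    {"sku": k[0], "seller_id": k[1], "infos_gerais_product": v}
    for k, v in groups.items()
]
print(collapsed)  # one row per (sku, seller_id), with an array of structs
```

In Spark the grouping key would be every non-struct column in the select, so that the output keeps one row per product with the array field attached.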

Fitting a curve through scipy.optimize.curve_fit with endpoints fixed

The following script fits a bowing-like curve via curve_fit (from scipy.optimize); see below:
ydata = numpy.array([1.6504, 1.63928044, 1.62855028, 1.6181874, 1.60817119, 1.59848249,
                     1.58910347, 1.58001759, 1.57120948, 1.56266487, 1.55437054, 1.54631424,
                     1.5384846, 1.53087109, 1.52346397, 1.5162542, 1.5092334, 1.50239383,
                     1.4957283, 1.48923013, 1.48289315, 1.47671162, 1.4706802, 1.46479393,
                     1.45904821, 1.45343874, 1.44796151, 1.44261281, 1.43738913, 1.43228723,
                     1.42730406, 1.42243677, 1.4176827, 1.41303936, 1.40850439, 1.40407561,
                     1.39975096, 1.39552851, 1.39140647, 1.38738314, 1.38345695, 1.37962642,
                     1.37589018, 1.37224696, 1.36869555, 1.36523487, 1.36186389, 1.35858169,
                     1.35538741, 1.35228028, 1.34925958, 1.34632469, 1.34347504, 1.34071015,
                     1.33802957, 1.33543295, 1.33291998, 1.33049042, 1.32814407, 1.32588081,
                     1.32370057, 1.32160331, 1.31958908, 1.31765795, 1.31581005, 1.31404556,
                     1.31236472, 1.3107678, 1.30925513, 1.30782709, 1.30648411, 1.30522666,
                     1.3040553, 1.30297062, 1.30197327, 1.30106398, 1.30024355, 1.29951286,
                     1.29887287, 1.29832464, 1.29786933, 1.29750821, 1.29724268, 1.29707426,
                     1.29700463, 1.29703564, 1.29716927, 1.29740773, 1.2977534, 1.29820885,
                     1.29877688, 1.29946049, 1.3002629, 1.30118751, 1.30223793, 1.30341792,
                     1.30473139, 1.30618232, 1.30777475, 1.30951267, 1.3114])
xdata = numpy.array([0., 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1,
                     0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21,
                     0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32,
                     0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43,
                     0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54,
                     0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65,
                     0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
                     0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
                     0.88, 0.89, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
                     0.99, 1.])
sigma = numpy.ones(len(xdata))
sigma[[0, -1]] = 0.01

def function_cte(x, b):
    return 1.31*x + 1.57*(1-x) - b*x*(1-x)

def function_linear(x, c1, c2):
    return 1.31*x + 1.57*(1-x) - (c1+c2*x)*x*(1-x)

popt_cte, pcov_cte = curve_fit(function_cte, xdata, ydata, sigma=sigma)
popt_lin, pcov_lin = curve_fit(function_linear, xdata, ydata, sigma=sigma)
But the resulting plot shows that the initial points of both fitted functions disagree with the data to fit (xdata, ydata).
I would like a fit constrained to pass through the endpoints (0.0, 1.57) and (1.0, 1.31) while at the same time minimizing the error. Any idea based on this code, or is it better to take another approach?
Thanks!
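It may help to notice that both model functions already pass through the requested endpoints exactly, because the bowing term vanishes at x = 0 and x = 1 regardless of the fitted parameters. A quick numerical check of that algebra (a sketch, not the poster's full script; the b values are arbitrary):

```python
# At x = 0 the model reduces to 1.57 and at x = 1 to 1.31, for any bowing
# parameter b, since the term b*x*(1-x) is zero at both endpoints.
def function_cte(x, b):
    return 1.31 * x + 1.57 * (1 - x) - b * x * (1 - x)

for b in (-3.0, 0.0, 2.5):  # arbitrary test values for b
    assert abs(function_cte(0.0, b) - 1.57) < 1e-12
    assert abs(function_cte(1.0, b) - 1.31) < 1e-12
print("endpoints are fixed for any b")
```

So the mismatch seen in the plot near x = 0 is the model's shape disagreeing with the data between the endpoints, not the fit failing to hit the endpoints themselves.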

Calculate the mean values if the numbers are same

I want to calculate the mean values of the rows whose first-column numbers are the same, e.g.:
M = [ 2 0.99 0.15 0.60 0.12 0.76 0.16 0.81 0.02 0.75 0.32
2 0.17 0.38 0.34 0.02 0.74 0.67 0.75 0.92 0.23 0.81
2 0.26 0.16 0.30 0.29 0.74 0.89 0.12 0.65 0.06 0.79
3 0.40 0.76 0.45 0.32 0.11 0.52 0.53 0.93 0.77 0.85
3 0.07 0.87 0.42 0.65 0.68 0.70 0.33 0.16 0.67 0.51
3 0.68 0.35 0.36 0.96 0.46 0.15 0.55 0.92 0.72 0.64
3 0.40 0.69 0.56 0.94 0.21 0.95 0.40 0.79 0.64 0.95
4 0.98 0.29 0.74 0.46 0.10 0.54 0.42 0.58 0.42 0.44
4 0.40 0.53 0.42 0.24 0.82 0.68 0.18 0.44 0.39 0.06
4 0.62 0.83 0.43 0.76 0.18 0.04 0.26 0.26 0.82 0.87 ]
Out=[ 2 0.47 0.23 0.41 0.15 0.75 0.57 0.56 0.53 0.35 0.64
3 0.39 0.67 0.45 0.72 0.37 0.58 0.45 0.70 0.70 0.74
4 0.67 0.55 0.53 0.49 0.37 0.42 0.28 0.43 0.54 0.46 ]
G = findgroups(M(:,1));
Out = [unique(M(:,1)) splitapply(@mean, M(:,2:end), G)]
G = findgroups(A) returns G, a vector of group numbers created from the grouping variable A. Here G = findgroups(M(:,1)) uses the first column of the matrix as the grouping variable.
Y = splitapply(func,X,G) splits X into groups specified by G and applies the function func to each group.
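The same group-mean idea can be sketched in Python with NumPy, using np.unique to build the group indices; this is an illustrative equivalent on a small hypothetical matrix, not the MATLAB answer itself:

```python
import numpy as np

# Small example matrix: the first column is the group key, as in the question.
M = np.array([
    [2, 0.99, 0.15],
    [2, 0.17, 0.38],
    [3, 0.40, 0.76],
    [3, 0.07, 0.87],
])

# keys: the unique group labels; groups: for each row, the index of its label
# (groups plays the role of findgroups(M(:,1)) in MATLAB)
keys, groups = np.unique(M[:, 0], return_inverse=True)

# Average the remaining columns within each group, like splitapply(@mean, ...)
out = np.array([M[groups == g, 1:].mean(axis=0) for g in range(len(keys))])
out = np.column_stack([keys, out])
print(out)  # one row per group: [key, mean of col 2, mean of col 3]
```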

matrix multiplication result value range

Here is the initial question:
About the output value range of LeGall 5/3 wavelet
Today I found that the transform can actually be seen as a matrix multiplication. It is easy to express the wavelet coefficients as a matrix (to estimate the value, all rounding-down is ignored, which does not affect the estimate of the max value).
The 1st level of DWT2 has two steps: performing the LeGall 5/3 filter in the two directions. Let I be the 8*8 input matrix and A the wavelet coefficient matrix.
For the horizontal direction:
output1 = I.A
Then the vertical direction is calculated:
In fact, it can be represented as output2 = output1'.A (using the transpose of output1 to multiply A again), which gives the transpose of the result we want.
Transposing output2:
output_lvl1 = output2' = (output1'.A)' = ((I.A)'.A)' = (A'.I'.A)' = A'.I.A (I put the details here to make it clear without math symbols...)
The 2nd level of the wavelet is performed only on the LL area, which is output_lvl1(1:4,1:4). The process is basically the same (let its coefficient matrix be B).
Here are the coefficients of the matrices A and B based on my calculation (I hope they are correct...):
A = [0.75 -0.125 0 0 -0.5 0 0 0;
0.5 0.25 0 0 1 0 0 0;
-0.25 0.75 -0.125 0 -0.5 -0.5 0 0;
0 0.25 0.25 0 0 1 0 0;
0 -0.125 0.75 -0.125 0 -0.5 -0.5 0;
0 0 0.25 0.25 0 0 1 0;
0 0 -0.125 0.625 0 0 -0.5 -1;
0 0 0 0.25 0 0 0 1];
B = [0.75 -0.125 -0.5 0;
0.5 0.25 1 0;
-0.25 0.75 -0.5 -1;
0 0.125 0 1];
And now the question became:
1. If we know A and the range of the input (matrix I), which is -128 to +127, what is the value range of output_lvl1 = A'.I.A?
2. If we use output_lvl1(1:4,1:4) as input I2, what is the value range of B'.I2.B?
I really need some math help here. Thank you in advance.
Well, finally I found a way to solve this: the SymPy library is what I really needed.
Since the max value can only occur in the results of B'.I2.B, a program will do this:
from sympy import *

def calcu_max(strin):
    # Sum the absolute values of all numeric coefficients in the expression string
    x = 0
    strin1 = str(strin).replace('*', ' ').replace('+', ' ').replace('-', ' ')
    strin1 = strin1.split(' ')
    for ele in strin1:
        if '[' in ele or ']' in ele or ele == '':
            continue
        x = x + float(ele)
    return x

DWT1 = Matrix(8, 8, [0.75, -0.125, 0, 0, -0.5, 0, 0, 0,
                     0.5, 0.25, 0, 0, 1, 0, 0, 0,
                     -0.25, 0.75, -0.125, 0, -0.5, -0.5, 0, 0,
                     0, 0.25, 0.25, 0, 0, 1, 0, 0,
                     0, -0.125, 0.75, -0.125, 0, -0.5, -0.5, 0,
                     0, 0, 0.25, 0.25, 0, 0, 1, 0,
                     0, 0, -0.125, 0.625, 0, 0, -0.5, -1,
                     0, 0, 0, 0.25, 0, 0, 0, 1])
Input1 = MatrixSymbol('A', 8, 8)
DWT1_t = Transpose(DWT1)
output_lvl1_1d = DWT1_t * Input1
output_lvl1_2d = output_lvl1_1d * DWT1
# print('output_lvl1_2d[0,0]:')
# print(simplify(output_lvl1_2d[0,0]))
# build the 2nd-lvl input from the lvl1 output (1:4,1:4)
input_lvl2 = output_lvl1_2d[0:4, 0:4]
DWT2 = Matrix(4, 4, [0.75, -0.125, -0.5, 0,
                     0.5, 0.25, 1, 0,
                     -0.25, 0.75, -0.5, -1,
                     0, 0.125, 0, 1])
DWT2_t = Transpose(DWT2)
output_lvl2_1d = DWT2_t * input_lvl2
output_lvl2_2d = output_lvl2_1d * DWT2
# Lvl 2: calculate the max for each output entry
max_lvl2 = zeros(4, 4)
for i in range(4):
    for j in range(4):
        max_lvl2[i, j] = 128.0 * calcu_max(simplify(output_lvl2_2d[i, j]))
        print(str(i) + ' ' + str(j) + ' ' + str(max_lvl2[i, j]))
print(max_lvl2)
Well, here is the result (putting all possible max values in one matrix, and min values are correspondingly negative):
[338.000000000000, 266.500000000000, 468.000000000000, 468.000000000000],
[266.500000000000, 210.125000000000, 369.000000000000, 369.000000000000],
[468.000000000000, 369.000000000000, 648.000000000000, 648.000000000000],
[468.000000000000, 369.000000000000, 648.000000000000, 648.000000000000]
Then 648 is what I am looking for.
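The single-level version of this bound can also be computed directly in NumPy: for out = A'.I.A with |I| bounded by 128, entry (i, j) is a linear combination of the entries of I, and choosing the sign of each I[k, l] to align every term gives a bound of 128 times the product of the absolute column sums of A. This is a sketch of that idea only; it does not reproduce the two-level SymPy computation above, and the 2x2 test matrix is a hypothetical example, not the LeGall 5/3 matrix:

```python
import numpy as np

# One-level bound for out = A.T @ I @ A with entries of I in [-128, 128]:
# out[i, j] = sum_{k,l} A[k, i] * I[k, l] * A[l, j]; picking the sign of each
# I[k, l] to match its coefficient gives |out[i, j]| <= 128 * s[i] * s[j],
# where s[i] is the absolute column sum of A.
def level_bound(A, amp=128.0):
    s = np.abs(A).sum(axis=0)
    return amp * np.outer(s, s)

# Tiny sanity check with a Haar-like matrix (hypothetical example).
A = np.array([[0.5, 0.5],
              [0.5, -0.5]])
print(level_bound(A))  # each absolute column sum is 1.0, so the bound is 128 everywhere
```

For the second level the input I2 is itself an output of the first level, so its entries exceed the original -128..127 range; that is why the symbolic composition above is needed rather than applying this bound twice naively.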

finding out the scaling factors to match two curves with fmincon in matlab

This is a follow-up question to: how to find out the scaling factors to match two curves in MATLAB?
I use the following code to figure out the scaling factors that match the two curves:
function err = sqrError(coeffs, x1, y1, x2, y2)
y2sampledInx1 = interp1(coeffs(1)*x2,y2,x1);
err = sum((coeffs(2)*y2sampledInx1-y1).^2);
end
and I used fmincon to optimize the result.
options = optimset('Algorithm','active-set','MaxFunEvals',10000,'TolCon',1e-7)
A0(1)=1; A0(2)=1; LBA1=0.1; UBA1=5; LBA2=0.1; UBA2=5;
LB=[LBA1 LBA2]; UB=[UBA1 UBA2];
coeffs = fmincon(@(c) sqrError(c, x1, y1, x2, y2), A0, [], [], [], [], LB, UB, [], options);
When I test the function with my data,
x1 = [-0.3 -0.24 -0.18 -0.12 -0.06 0 0.06 0.12 0.18 0.24 0.3 0.36 0.42 0.48 ...
      0.54 0.6 0.66 0.72 0.78 0.84 0.9 0.96 1.02 1.08 1.14 1.2 1.26 1.32 ...
      1.38 1.44 1.5 1.56 1.62 1.68 1.74 1.8 1.86 1.92 1.98 2.04]';
y1 = [0.00 0.00 0.00 0.01 0.03 0.09 0.13 0.14 0.14 0.16 0.20 0.22 0.26 0.34 ...
      0.41 0.52 0.62 0.72 0.81 0.91 0.95 0.99 0.98 0.96 0.90 0.82 0.74 0.66 ...
      0.58 0.52 0.47 0.40 0.36 0.32 0.27 0.22 0.19 0.15 0.12 0.10]';
x2 = [-0.3 -0.24 -0.18 -0.12 -0.06 0 0.06 0.12 0.18 0.24 0.3 0.36 0.42 0.48 ...
      0.54 0.6 0.66 0.72 0.78 0.84 0.9 0.96 1.02 1.08 1.14 1.2 1.26 1.32 ...
      1.38 1.44 1.5 1.56 1.62 1.68 1.74 1.8 1.86 1.92 1.98 2.04]';
y2 = [0.00 0.00 0.00 0.00 0.05 0.15 0.15 0.13 0.11 0.11 0.13 0.18 0.24 0.33 ...
      0.43 0.54 0.66 0.76 0.84 0.90 0.93 0.94 0.94 0.91 0.87 0.81 0.75 0.69 ...
      0.63 0.55 0.49 0.43 0.37 0.32 0.27 0.23 0.19 0.16 0.13 0.10]';
The error message shows up as follows:
??? Error using ==> interp1 at 172
NaN is not an appropriate value for X.

Error in ==> sqrError at 2
y2sampledInx1 = interp1(coeffs(1)*x2,y2,x1);

Error in ==> @(c)sqrError(c,x1,y1,x2,y2)

Error in ==> nlconst at 805
f = feval(funfcn{3},x,varargin{:});

Error in ==> fmincon at 758
[X,FVAL,LAMBDA,EXITFLAG,OUTPUT,GRAD,HESSIAN]=...

Error in ==> coeffs = fmincon(@(c) sqrError(c,x1, y1, x2, y2),A0,[],[],[],[],LB,UB,[],options);
What is wrong in the code, and how should I get around it?
Thanks for the help.
Your scaling is likely pushing the interpolation points outside the range of the x-data, i.e.
x1 < min(x2*coeffs(1)) or x1 > max(x2*coeffs(1))
for at least one element of x1, given the value of coeffs(1) chosen by the fitting algorithm.
You can fix this by supplying a penalty value for points outside the range, or you can let interp1 extrapolate to guess at those values. So try one of these:
y2sampledInx1 = interp1(coeffs(1)*x2, y2, x1, 'linear', 'extrap');
y2sampledInx1 = interp1(coeffs(1)*x2, y2, x1, 'linear', Inf);
y2sampledInx1 = interp1(coeffs(1)*x2, y2, x1, 'linear', 1E18); % if Inf messes with the algorithm
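The same out-of-range behavior exists in SciPy's 1-D interpolation, and both fixes carry over. A small Python sketch (using scipy.interpolate.interp1d on hypothetical data, not the poster's MATLAB code):

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical data: a tent curve on [0, 1]
x2 = np.array([0.0, 0.5, 1.0])
y2 = np.array([0.0, 1.0, 0.0])

# Option 1: return a fixed penalty value outside the data range
# (analogous to interp1(..., 'linear', Inf) in MATLAB).
f_penalty = interp1d(x2, y2, kind="linear", bounds_error=False, fill_value=np.inf)

# Option 2: linearly extrapolate outside the range
# (analogous to interp1(..., 'linear', 'extrap')).
f_extrap = interp1d(x2, y2, kind="linear", fill_value="extrapolate")

print(f_penalty(2.0))  # inf: x = 2.0 lies outside [0, 1]
print(f_extrap(2.0))   # -2.0: continues the last segment's slope of -2
```

An infinite penalty makes out-of-range scalings look maximally bad to the optimizer, while extrapolation keeps the objective finite and smooth; which behaves better depends on the fitting algorithm.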