Multicollinearity when VIF is 0 - linear-regression

What does it mean when the VIF (Variance Inflation Factor) is 0? Does it indicate no multicollinearity?


Coefficients of correlation in AMPL

I have a specific question and would deeply appreciate any help.
I am working on a project in AMPL (A Mathematical Programming Language).
I need to implement an objective function that minimizes the risk on the cost of a variable; the cost is a parameter, and I also have correlation coefficients.
The risk is estimated using the variance of the cost, and I have my correlation matrix data.
My correlation matrix looks like this:
Correlation coefficients (%), 2015 (columns appear in the same order as the rows):

                         Coal ST  Gas CT  Wind  PV    Hydro  Nuc III  Nuc IV  Coal CCS
Coal steam turbine       1        0.47    0     0     0      0.12     0.12    1
Gas combustion turbine   0.47     1       0     0     0      0.06     0.06    0.47
Wind                     0        0       1     0     0      0        0       0
Central PV               0        0       0     1     0      0        0       0
Hydro non pumped         0        0       0     0     1      0        0       0
Nuclear GenIII           0.12     0.06    0     0     0      1        1       0.12
Nuclear GenIV            0.12     0.06    0     0     0      1        1       0.12
Coal steam turbine CCS   1        0.47    0     0     0      0.12     0.12    1
In my case, the risk on cost that I want to minimize is on fuel prices (fuel types are correlated, the correlation coefficients vary yearly, and fuel prices depend on the technology type, the province, and the year).
I need an efficient way to enter the correlation matrix into a table (a database in pgAdmin (psql)), read it with the appropriate statements, and use it in my objective function.
The table that I have so far looks like this:
table fuel_prices "inputs/fuel_prices.tab" IN:
[province, fuel, year], fuel_price, cv_fuel_price;
read table fuel_prices;
I need to modify it to add correlation coefficients.
# Table for the correlation coefficients
# table fuel_prices_corr "inputs/fuel_prices_corr.tab" IN:
# [province, year], fuel, correl_coeff1, correl_coeff2;
# read table fuel_prices_corr;
The technologies I am using are extracted from tables such as the following:
table generator_info "inputs/generator_info.tab" IN:
TECHNOLOGIES <- [technology], technology_id, fuel;
read table generator_info;
table gen_cap_cost "inputs/gen_cap_cost.tab" IN:
[technology, year], overnight_cost_yearly ~ overnight_cost, fixed_o_m_yearly ~ fixed_o_m, variable_o_m_yearly ~ variable_o_m;
read table gen_cap_cost;
table existing_plants "inputs/existing_plants.tab" IN:
EXISTING_PLANTS <- [project_id, province, technology],
ep_plant_name ~ plant_name, ep_carma_plant_id ~ carma_plant_id,
ep_capacity_mw ~ capacity_mw, ep_heat_rate ~ heat_rate, ep_cogen_thermal_demand ~ cogen_thermal_demand_mmbtus_per_mwh,
ep_vintage ~ start_year,
ep_overnight_cost ~ overnight_cost, ep_connect_cost_per_mw ~ connect_cost_per_mw, ep_fixed_o_m ~ fixed_o_m, ep_variable_o_m ~ variable_o_m,
ep_location_id;
read table existing_plants;
table new_projects "inputs/new_projects.tab" IN:
PROJECTS <- [project_id, province, technology], location_id, ep_project_replacement_id,
capacity_limit, capacity_limit_conversion, heat_rate, cogen_thermal_demand, connect_cost_per_mw;
read table new_projects;
My objective function looks like this, where pid = project-specific id, a = province, t = technology, p = investment period (the start of an investment period, which is also the date when a power plant starts running), and h = study hour (the unique timepoint considered):
sum{(pid, a, t, p) in PROJECT} Gen[pid, a, t, p, h] * fuel_cost[pid, a, t, p]
Does anyone have a hint on this, or an example of a project that uses MPT and correlated variables?
Here's an example of a table declaration for reading a two-dimensional parameter amt taken from here:
table dietAmts IN "ODBC" (ConnectionStr) "Amounts":
[NUTR, FOOD], amt;
In your case, you'll have the same set twice in the key section, something like [ENERGY_SOURCE, ENERGY_SOURCE], where ENERGY_SOURCE is a set of energy sources such as Coal steam turbine, etc. Since the matrix is symmetric you only need to store half of it.

Is it possible to use a pre-calculated factorization to accelerate backslash (mldivide) with a sparse matrix?

I perform many iterations of solving a linear system of equations Mx = b with a large, sparse M.
M doesn't change between iterations but b does. I've tried several methods and so far found backslash (mldivide) to be the most efficient and accurate.
The following code is very similar to what I'm doing:
for ii=1:num_iter
    x = M\x;
    x = x+dx;
end
Now I want to accelerate the computation even more by utilizing the fact that M is fixed.
Setting the flag spparms('spumoni',2) makes MATLAB print detailed information about the solver algorithm.
I ran the following code:
spparms('spumoni',2);
x = M\B;
The output (monitoring):
sp\: bandwidth = 2452+1+2452.
sp\: is A diagonal? no.
sp\: is band density (0.01) > bandden (0.50) to try banded solver? no.
sp\: is A triangular? no.
sp\: is A morally triangular? no.
sp\: is A a candidate for Cholesky (symmetric, real positive diagonal)? no.
sp\: use Unsymmetric MultiFrontal PACKage with Control parameters:
UMFPACK V5.4.0 (May 20, 2009), Control:
Matrix entry defined as: double
Int (generic integer) defined as: UF_long
0: print level: 2
1: dense row parameter: 0.2
"dense" rows have > max (16, (0.2)*16*sqrt(n_col) entries)
2: dense column parameter: 0.2
"dense" columns have > max (16, (0.2)*16*sqrt(n_row) entries)
3: pivot tolerance: 0.1
4: block size for dense matrix kernels: 32
5: strategy: 0 (auto)
6: initial allocation ratio: 0.7
7: max iterative refinement steps: 2
12: 2-by-2 pivot tolerance: 0.01
13: Q fixed during numerical factorization: 0 (auto)
14: AMD dense row/col parameter: 10
"dense" rows/columns have > max (16, (10)*sqrt(n)) entries
Only used if the AMD ordering is used.
15: diagonal pivot tolerance: 0.001
Only used if diagonal pivoting is attempted.
16: scaling: 1 (divide each row by sum of abs. values in each row)
17: frontal matrix allocation ratio: 0.5
18: drop tolerance: 0
19: AMD and COLAMD aggressive absorption: 1 (yes)
The following options can only be changed at compile-time:
8: BLAS library used: Fortran BLAS. size of BLAS integer: 8
9: compiled for MATLAB
10: CPU timer is ANSI C clock (may wrap around).
11: compiled for normal operation (debugging disabled)
computer/operating system: Microsoft Windows
size of int: 4 UF_long: 8 Int: 8 pointer: 8 double: 8 Entry: 8 (in bytes)
sp\: is UMFPACK's symbolic LU factorization (with automatic reordering) successful? yes.
sp\: is UMFPACK's numeric LU factorization successful? yes.
sp\: is UMFPACK's triangular solve successful? yes.
sp\: UMFPACK Statistics:
UMFPACK V5.4.0 (May 20, 2009), Info:
matrix entry defined as: double
Int (generic integer) defined as: UF_long
BLAS library used: Fortran BLAS. size of BLAS integer: 8
MATLAB: yes.
CPU timer: ANSI clock ( ) routine.
number of rows in matrix A: 3468
number of columns in matrix A: 3468
entries in matrix A: 60252
memory usage reported in: 16-byte Units
size of int: 4 bytes
size of UF_long: 8 bytes
size of pointer: 8 bytes
size of numerical entry: 8 bytes
strategy used: symmetric
ordering used: amd on A+A'
modify Q during factorization: no
prefer diagonal pivoting: yes
pivots with zero Markowitz cost: 1284
submatrix S after removing zero-cost pivots:
number of "dense" rows: 0
number of "dense" columns: 0
number of empty rows: 0
number of empty columns 0
submatrix S square and diagonal preserved
pattern of square submatrix S:
number rows and columns 2184
symmetry of nonzero pattern: 0.904903
nz in S+S' (excl. diagonal): 62184
nz on diagonal of matrix S: 2184
fraction of nz on diagonal: 1.000000
AMD statistics, for strict diagonal pivoting:
est. flops for LU factorization: 2.76434e+007
est. nz in L+U (incl. diagonal): 306216
est. largest front (# entries): 31329
est. max nz in any column of L: 177
number of "dense" rows/columns in S+S': 0
symbolic factorization defragmentations: 0
symbolic memory usage (Units): 174698
symbolic memory usage (MBytes): 2.7
Symbolic size (Units): 9196
Symbolic size (MBytes): 0
symbolic factorization CPU time (sec): 0.00
symbolic factorization wallclock time(sec): 0.00
matrix scaled: yes (divided each row by sum of abs values in each row)
minimum sum (abs (rows of A)): 1.00000e+000
maximum sum (abs (rows of A)): 9.75375e+003
symbolic/numeric factorization: upper bound actual %
variable-sized part of Numeric object:
initial size (Units) 149803 146332 98%
peak size (Units) 1037500 202715 20%
final size (Units) 787803 154127 20%
Numeric final size (Units) 806913 171503 21%
Numeric final size (MBytes) 12.3 2.6 21%
peak memory usage (Units) 1083860 249075 23%
peak memory usage (MBytes) 16.5 3.8 23%
numeric factorization flops 5.22115e+008 2.59546e+007 5%
nz in L (incl diagonal) 593172 145107 24%
nz in U (incl diagonal) 835128 154044 18%
nz in L+U (incl diagonal) 1424832 295683 21%
largest front (# entries) 348768 30798 9%
largest # rows in front 519 175 34%
largest # columns in front 672 177 26%
initial allocation ratio used: 0.309
# of forced updates due to frontal growth: 1
number of off-diagonal pivots: 0
nz in L (incl diagonal), if none dropped 145107
nz in U (incl diagonal), if none dropped 154044
number of small entries dropped 0
nonzeros on diagonal of U: 3468
min abs. value on diagonal of U: 4.80e-002
max abs. value on diagonal of U: 1.00e+000
estimate of reciprocal of condition number: 4.80e-002
indices in compressed pattern: 13651
numerical values stored in Numeric object: 295806
numeric factorization defragmentations: 0
numeric factorization reallocations: 0
costly numeric factorization reallocations: 0
numeric factorization CPU time (sec): 0.05
numeric factorization wallclock time (sec): 0.00
numeric factorization mflops (CPU time): 552.22
solve flops: 1.78396e+006
iterative refinement steps taken: 1
iterative refinement steps attempted: 1
sparse backward error omega1: 1.80e-016
sparse backward error omega2: 0.00e+000
solve CPU time (sec): 0.00
solve wall clock time (sec): 0.00
total symbolic + numeric + solve flops: 2.77385e+007
Observe the lines:
numeric factorization flops 5.22115e+008 2.59546e+007 5%
solve flops: 1.78396e+006
total symbolic + numeric + solve flops: 2.77385e+007
This indicates that the factorization of M took 2.59546e+007/2.77385e+007 = 93.6% of the total flops (and, as a proxy, the time) required to solve the equations.
I would like to compute the factorization in advance, outside of my iterations, and then run only the final solve stage, which takes about 6.5% of the work.
I know how to calculate the factorization ([L,U,P,Q,R] = lu(M);) but I don't know how to utilize its output as input to a solver.
I would like to run something in the spirit of:
[L,U,P,Q,R] = lu(M);
for ii=1:num_iter
    dx = solve_pre_factored(M,P,Q,R,x);
    x = x+dx;
end
Is there a way to do that in Matlab?
You have to ask yourself what all these matrices from the LU factorization do.
As the documentation states:
[L,U,P,Q,R] = lu(A) returns unit lower triangular matrix L, upper triangular matrix U, permutation matrices P and Q, and a diagonal scaling matrix R so that P*(R\A)*Q = L*U for sparse non-empty A. Typically, but not always, the row-scaling leads to a sparser and more stable factorization. The statement lu(A,'matrix') returns identical output values.
In more mathematical terms we have P R^-1 A Q = L U, and thus A = R P^-1 L U Q^-1.
Then x = M\x can be rewritten as the following sequence of steps:
y = R^-1 * x
z = P * y
u = L^-1 * z
v = U^-1 * u
w = Q * v
x = w
To apply the inverses of U, L and R you can use \, which will recognize that they are triangular matrices (and diagonal, in R's case), as the spumoni monitoring should confirm, and will use the appropriate trivial solvers for them.
Thus, in a denser, MATLAB-written way: x = Q*(U\(L\(P*(R\x))));
Doing this is exactly what happens inside the solver \, but with only a single factorization, as you asked.
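Putting it together, here is a minimal runnable sketch of the pre-factored loop. The matrix, its size (borrowed from the UMFPACK log above) and num_iter are stand-ins, not the asker's actual data:

n = 3468;                                  % size taken from the UMFPACK log above
M = sprandn(n, n, 0.005) + 10*speye(n);    % stand-in sparse matrix, kept well-conditioned
x = randn(n, 1);
num_iter = 100;

[L, U, P, Q, R] = lu(M);                   % factor once, outside the loop

for ii = 1:num_iter
    dx = Q * (U \ (L \ (P * (R \ x))));    % equivalent to dx = M \ x, reusing the factors
    x  = x + dx;
end

Note that M itself is no longer needed inside the loop; only the five factors are.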
However, as stated in the comments, for a large number of solves it can become faster to compute N = M^-1 (i.e. inv(M)) once, and then do only a simple matrix-vector multiplication per iteration, which is much simpler than the process explained above. The initial computation, inv(M), is slower and has some limitations (for sparse M the inverse is typically dense), so this trade-off also depends on the properties of your matrix.
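For completeness, a sketch of that alternative, using the same stand-in data as above:

N = inv(M);          % one-time cost; N is generally dense even when M is sparse
for ii = 1:num_iter
    dx = N * x;      % a single matrix-vector product per iteration
    x  = x + dx;
end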

ttest results in MATLAB?

I have these three vectors:
A = [1 2 3 4 5 6]';
B = [50987548463 45764568 606978 7318 1674 4]';
C = [50 45 60 78 1 4]';
Why on earth does
ttest(A,B) return 0 (no rejection of the null hypothesis, meaning the means are deemed equal at the 95% confidence level), while
ttest(A,C) returns 1 (rejection of the null hypothesis, meaning the means are deemed different at the 95% confidence level)?
I would expect rejection of the null hypothesis for both t-tests, but even more so for ttest(A,B)!
The mean and standard deviation of the (A-B) differences are both huge, and the resulting t-statistic is -1.0011, which is not sufficient to reject H0. The mean and std of the (A-C) differences are much smaller, but the t-statistic is -2.7612, which is sufficient to reject H0 with only 5 degrees of freedom (the two-tailed critical value is about 2.57). Remember that the paired t-statistic divides the mean difference by its standard error, so the few enormous entries in B inflate the standard deviation enough to keep |t| small. You can check this using
[h1,p1,ci1,stats1] = ttest(A,B)
[h2,p2,ci2,stats2] = ttest(A,C)
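A quick manual check of the paired t-statistic, mirroring what ttest computes internally (2.5706 is the critical value tinv(0.975,5)):

A = [1 2 3 4 5 6]';
B = [50987548463 45764568 606978 7318 1674 4]';
C = [50 45 60 78 1 4]';

% paired t-statistic: mean difference divided by its standard error
tstat = @(d) mean(d) / (std(d) / sqrt(numel(d)));

tstat(A - B)   % about -1.0011; |t| < 2.5706, so ttest returns h = 0
tstat(A - C)   % about -2.7612; |t| > 2.5706, so ttest returns h = 1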

Meaning of result in SVM using libsvm

I have recently started using the libsvm package in MATLAB. I am getting the accuracy as a vector and I don't understand it. Can someone explain this output?
Thanks in advance.
predict_label =
1
-1
1
-1
-1
-1
-1
-1
1
-1
-1
1
-1
-1
-1
accuracy =
86.6667
0.5333
0.5455
prob_values =
0.6648 0.3352
0.0275 0.9725
0.5591 0.4409
0.3320 0.6680
0.2842 0.7158
0.1899 0.8101
0.4817 0.5183
0.1820 0.8180
0.7234 0.2766
0.2326 0.7674
0.0189 0.9811
0.7356 0.2644
0.2289 0.7711
0.0743 0.9257
0.0285 0.9715
This is my output from the command
[predict_label, accuracy, prob_values] = svmpredict(testLabel, [(1:N2)' testData*trainData'], model, '-b 1')
where N2 is a fixed value. The part I don't understand is the accuracy term.
From this reference:
The function 'svmpredict' has three outputs. The first one, predicted_label, is a vector of predicted labels. The second output, accuracy, is a vector including accuracy (for classification), mean squared error, and squared correlation coefficient (for regression). The third is a matrix containing decision values or probability estimates (if '-b 1' is specified).
So for your classification problem only the first entry is relevant: 86.6667% accuracy, i.e. 13 of your 15 test samples were predicted correctly. The other two entries (0.5333 and 0.5455) are the regression metrics.
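A minimal sketch of how the entries are typically used, reusing the variables from the svmpredict call above:

% accuracy(1): per cent correct (classification)
% accuracy(2): mean squared error (regression)
% accuracy(3): squared correlation coefficient (regression)
classification_accuracy = accuracy(1);

% the first entry can be reproduced directly from the labels:
manual_accuracy = 100 * mean(predict_label == testLabel);   % equals accuracy(1)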

Training a Decision Tree in MATLAB over binary train data

I want to train a decision tree in MATLAB on binary data. Here is a sample of the data I use:
traindata <87*239> [array of data with 239 features]
1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 ... [till 239]
1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 ... [till 239]
....
The data comes from a form that has only yes/no options. The outcome of the form is also binary, and it indicates whether a patient has some medical disorder or not. We have used a classification tree, but the classifier shows us real-valued splits: for example, it branches the first node on whether x137 is bigger than 0.75. Since we don't have 0.75 in our data and it has no yes/no meaning, we want a decision tree that is trained on Boolean variables, not doubles, and that understands the data is not continuous, so that, for example, the split above would instead read "x137 is yes or no (1 or 0)". Can someone help me with this? I would also appreciate a way to map our data to double variables and features if a Boolean decision tree is not applicable. I am currently using classregtree in MATLAB with an <87*237> array as training data and an <87*1> array as results.
classregtree has an optional input parameter, 'categorical'. Using this option you can pass in a vector of the indices of the input variables that are categorical (in your case 1:239, since every one of your predictors is categorical). The decision tree should then contain yes/no decisions rather than numerical thresholds.
From the help of classregtree:
t = classregtree(X,y) creates a decision tree t for predicting the response y as a function of the predictors in the columns of X. X is an n-by-m matrix of predictor values. If y is a vector of n response values, classregtree performs regression. If y is a categorical variable, character array, or cell array of strings, classregtree performs classification.
What's the type of y in your case? It seems that classregtree is doing regression in your case but you want classification. So, y should be a categorical variable.
EDIT: To make your y categorical, you can try "nominal(y)".
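A minimal sketch under these assumptions, using synthetic stand-in data in place of the asker's 87x239 matrix (classregtree belongs to the older Statistics Toolbox API; in newer releases fitctree with the 'CategoricalPredictors' option plays the same role):

% synthetic stand-in: 87 patients, 239 yes/no answers, binary outcome
X = double(rand(87, 239) > 0.5);
y = double(rand(87, 1) > 0.5);

% nominal(y) forces classification; 'categorical' marks every column as unordered
t = classregtree(X, nominal(y), 'categorical', 1:size(X, 2));
view(t)   % splits now read as category membership (0 vs 1), not thresholds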