Spearman's rank correlation significance - matlab

I'm calculating Spearman's rank correlation in matlab with the following code:
[RHO,PVAL] = corr(x,y,'Type','Spearman');
RHO =
0.7211
PVAL =
4.9473e-04
and then with different variables
[RHO,PVAL] = corr(x2,y2,'Type','Spearman');
RHO =
0.3277
PVAL =
0.0060
How do you categorize these as p < 0.05, p < 0.01, p < 0.001 etc. Commonly in scientific journals these pvalues are represented as the examples I've shown and not as one number. Would both of these be p < 0.01? When defining whether a correlation is significant to a specific value do you always look for the smallest error i.e if its PVAL = 0.0005, both p > 0.05 and p > 0.001 would be correct here, do we simply write the lowest i.e. p > 0.001?

As Martin Dinov wrote, this is at least partially a matter of journal policy. But, as long as there is no explicit journal convention against it, I would recommend to always report the actual p-value, in this case in the form p = 4.9·10-4 and p = 0.006, respectively. You can then proceed to say that the effect you found is statistically significant, usually based on comparison with a previously chosen significance level, typically 0.05, unless you need to correct for multiple comparisons.
The reason is that the commonly used significance levels are purely a matter of convention. By only saying that p is below one conventional threshold means to withhold valuable information from the reader, which she might use to make up her own mind about the result – and this truncation is not even justified by relevant saving of print space.
You should also, of course, report the value of the correlation coefficient itself (which in this case doubles as a test statistic and an effect size) as well as the sample size.
At least for the field of psychology, these are official recommendations:
Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval.
…
Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d).
L. Wilkinson and the Task Force on Statistical Inference, "Statistical Methods in Psychology Journals. Guidelines and Explanations"

You mean pval is < 0.05 and also < 0.001 and not >. In general, you do want to show that it is smaller than the smallest significance level (alpha) threshold that you can. So yes, it is best to say for the second example that the p-value is < 0.001. Depending on the journal convention, it may be preferable to put the actual p-value in (so, for the first example, 4.9473e-04) or just that it's < some good alpha (0.0001 for the first case).

Related

P value in backward elimination > 0.05 are eliminated from the features, why do we eliminate? (multi variable linear regression)

In general if the p value is less than 0.05 significance level we reject the null,
In backward elimination we delete the features whose p value is greater than 0.05, why not delete the delete the terms whose p value is less.
and on what condition do the regression model calculate the p value ?
can anyone explain in simple and clear terms.
Thanks for ur time and help.
when the p value is >.05 the significance level is <0.05,
significance level + p_value =1
so it means if high p_value less significance of the variable hence we delete that variable.
if p_value is <0.05 then significance level is > 0.95 hence we take columns with high significance i.e low p_value and del columns with high p_value
Your answer and the assumption are not right, because in Backward Elimination the very first Step is to set SL (significance level) for your model to a constant value (usually SL = 0.05).
The answer is simple. The goal is to keep the significant occurrences. so if for example you set the SL = 0.1, the P_values greater 0.1 are insignificant and the P_values smaller or equal to 0.1 are considered significant.
The higher the P_value, more insignificant the predictor (occurrence) and the coefficient closer to zero (H_0). Such an occurrence is not suitable for the model to train with.

Are the conditional probabilities of MATLAB's mnrval correct?

An MWE (stats toolbox required, tested on MATLAB R2014b):
x = (1:3)';
b = mnrfit(x,x,'model','hierarchical');
pihat = mnrval(b,x,'model','hierarchical','type','conditional')
Output:
pihat =
1 1
2.2204e-16 1
2.2204e-16 2.2204e-16
(Ignore the issued warning, it's because of the trivial example, which is linearly separable (I'm predicting x using itself). It doesn't matter: I've tried this as well with a non-trivial (and not-so minimal) example without the warning and the results are similar.)
My problem is the result. I've specified I want the conditional probabilities. According to MATLAB's documentation on mnrval:
Specify ['conditional'] to return predictions [...] in terms of the first k – 1 conditional category probabilities [...], i.e., the probability [...] for category j, given an outcome in category j or higher.
In my example this means rows of pihat contain the probability of
x=1 given x>=1
x=2 given x>=2
(A third column for x=3 is not necessary, because if the first two probabilities are known, the third is too. It follows logically from P(x=1) + P(x=2) + P(x=3) = 1.)
Am I interpreting this correctly? Thus, if x=1 is predicted, then the first column value should be large (close to one), because P(x=1) given x>=1 is large. The second column should be close to zero, because P(x=2) given x>=2 can't be large if x=1.
However, as you can see in the first row, the second column value is large as well as the first! I believe this is incorrect according to what the documentation specifies, am I right? The current (incorrect?) result implies the predicted probabilities in the rows are not of x=j given x>=j, but what are they then? Or how should I be interpreting them?
They are not equal to the cumulative probabilities, i.e. the probability of x<=j, which increases with j. I've checked this by calculating pihat2 = mnrval(b,x,'model','hierarchical','type','cumulative'); pihat2-pihat.

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criteria is that each partition must be in between 60 to 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem into a linear programming problem. You would choose an objective function that would maximize the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
(source: codecogs.com)
.
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
(source: codecogs.com)
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are matrices and vectors that define your inequality constraints. Because you want equality, you'd set this to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set that you are bounding with, so this would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each value of the parameters are bounded between these two ranges.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Due to the fact that linprog can generate floating point solutions, what you can do is if you want to ensure that for a given value of n, you could loop over your results, take the floor of the result and subtract with the final result, and sum over the result. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(#(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem. This problem is NP-complete. Basically, what this means is that it is not known whether there is a polynomial-time algorithm that can be used to solve this kind of problem and that the only way to get a valid solution is to brute-force each possible solution and check if it works with the specified problem. Usually, brute-forcing solutions requires exponential time, which is very intractable for large problems. Another interesting fact is that modern cryptography algorithms use NP-Complete intractability as part of their ciphertext and encrypting. Basically, they're banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key will take more than the current age of the universe. Check out this cool Wikipedia post for more details in intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
(source: quickmeme.com)
The NP problem is amongst one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

Match Two Sets of Measurement Data With Different Logging Start Times and End Times

Problem
I have two arrays (Xa and Xb) that contain measurements of the same physical signal, but they are taken at different sample rates. Lastly, physical logging of Xa data starts at a different time, than that of Xb. The logging of data also stops at different time.
i.e.
(The following is just a summary of important statements, not code.)
sampleRatea > sampleRateb % Resolution of Xa is greater than that of Xb
t0a ~= t0b % Start times are not equal
t1a ~= t1b % End times are not equal
Objective
Find the necessary shift in indices that will best line up these sets of data.
Approach
Use fmincon to find the index that minimizes the mean squared error (MSE) between versions Xa and Xb that are edited to have the same sample rate (perhaps using the interpolation function).
I have tried to do this but it always seems that I have too many degrees of freedom. Is there anyone who can shed some light on a process that might facilitate this process?
Assuming you have two samples with constant frequencies, the problem reduces to something quite simple:
Find scale, location such that:
Xa , at timestamps corresponding to its index, makes the best match with Xb at timstamps corresponding to location + scale * its index.
If you agree with this you can see that only two degrees of freedom are left, if you know the ratio of sample rates it even reduces to just 1 degree of freedom.
I believe that now the hard part is done, but some work still remains:
Judge how good two samples with timestamps and values match
Find the optimal combination of your location and scale parameter
Note that, assuming you complete these 2 steps properly, the solution should be optimal for finding the optimal timestamps. As you are looking for a shift in (integer) indices, translating these timestamps back to indices may not be result in the real optimum but it should be pretty close.
Here is a quick-and-dirty solution that should be enough to get you started. Given your input signals Xa and Xb sampled at sampleRatea and sampleRateb respectively:
g = gcd(sampleRatea,sampleRateb);
Ya = interp(Xa,sampleRateb/g);
Yb = interp(Xb,sampleRatea/g);
Yfs = sampleRatea*sampleRateb/g;
[acor,lag] = xcorr(Ya,Yb);
time_shift = lag(acor == max(acor))/Yfs;
The variable time_shift will tell you the time elapsed between the start of A and the start of B. If B starts first, the result will be negative.
If your sampling rates are relatively prime, this will be horribly inefficient. If one is an integer multiple of the other, or they have a relatively large GCD, it will be much better.

Dijkstra's algorithm with negative weights

Can we use Dijkstra's algorithm with negative weights?
STOP! Before you think "lol nub you can just endlessly hop between two points and get an infinitely cheap path", I'm more thinking of one-way paths.
An application for this would be a mountainous terrain with points on it. Obviously going from high to low doesn't take energy, in fact, it generates energy (thus a negative path weight)! But going back again just wouldn't work that way, unless you are Chuck Norris.
I was thinking of incrementing the weight of all points until they are non-negative, but I'm not sure whether that will work.
As long as the graph does not contain a negative cycle (a directed cycle whose edge weights have a negative sum), it will have a shortest path between any two points, but Dijkstra's algorithm is not designed to find them. The best-known algorithm for finding single-source shortest paths in a directed graph with negative edge weights is the Bellman-Ford algorithm. This comes at a cost, however: Bellman-Ford requires O(|V|·|E|) time, while Dijkstra's requires O(|E| + |V|log|V|) time, which is asymptotically faster for both sparse graphs (where E is O(|V|)) and dense graphs (where E is O(|V|^2)).
In your example of a mountainous terrain (necessarily a directed graph, since going up and down an incline have different weights) there is no possibility of a negative cycle, since this would imply leaving a point and then returning to it with a net energy gain - which could be used to create a perpetual motion machine.
Increasing all the weights by a constant value so that they are non-negative will not work. To see this, consider the graph where there are two paths from A to B, one traversing a single edge of length 2, and one traversing edges of length 1, 1, and -2. The second path is shorter, but if you increase all edge weights by 2, the first path now has length 4, and the second path has length 6, reversing the shortest paths. This tactic will only work if all possible paths between the two points use the same number of edges.
If you read the proof of optimality, one of the assumptions made is that all the weights are non-negative. So, no. As Bart recommends, use Bellman-Ford if there are no negative cycles in your graph.
You have to understand that a negative edge isn't just a negative number --- it implies a reduction in the cost of the path. If you add a negative edge to your path, you have reduced the cost of the path --- if you increment the weights so that this edge is now non-negative, it does not have that reducing property anymore and thus this is a different graph.
I encourage you to read the proof of optimality --- there you will see that the assumption that adding an edge to an existing path can only increase (or not affect) the cost of the path is critical.
You can use Dijkstra's on a negative weighted graph but you first have to find the proper offset for each Vertex. That is essentially what Johnson's algorithm does. But that would be overkill since Johnson's uses Bellman-Ford to find the weight offset(s). Johnson's is designed to all shortest paths between pairs of Vertices.
http://en.wikipedia.org/wiki/Johnson%27s_algorithm
There is actually an algorithm which uses Dijkstra's algorithm in a negative path environment; it does so by removing all the negative edges and rebalancing the graph first. This algorithm is called 'Johnson's Algorithm'.
The way it works is by adding a new node (lets say Q) which has 0 cost to traverse to every other node in the graph. It then runs Bellman-Ford on the graph from point Q, getting a cost for each node with respect to Q which we will call q[x], which will either be 0 or a negative number (as it used one of the negative paths).
E.g. a -> -3 -> b, therefore if we add a node Q which has 0 cost to all of these nodes, then q[a] = 0, q[b] = -3.
We then rebalance out the edges using the formula: weight + q[source] - q[destination], so the new weight of a->b is -3 + 0 - (-3) = 0. We do this for all other edges in the graph, then remove Q and its outgoing edges and voila! We now have a rebalanced graph with no negative edges to which we can run dijkstra's on!
The running time is O(nm) [bellman-ford] + n x O(m log n) [n Dijkstra's] + O(n^2) [weight computation] = O (nm log n) time
More info: http://joonki-jeong.blogspot.co.uk/2013/01/johnsons-algorithm.html
Actually I think it'll work to modify the edge weights. Not with an offset but with a factor. Assume instead of measuring the distance you are measuring the time required from point A to B.
weight = time = distance / velocity
You could even adapt velocity depending on the slope to use the physical one if your task is for real mountains and car/bike.
Yes, you could do that with adding one step at the end i.e.
If v ∈ Q, Then Decrease-Key(Q, v, v.d)
Else Insert(Q, v) and S = S \ {v}.
An expression tree is a binary tree in which all leaves are operands (constants or variables), and the non-leaf nodes are binary operators (+, -, /, *, ^). Implement this tree to model polynomials with the basic methods of the tree including the following:
A function that calculates the first derivative of a polynomial.
Evaluate a polynomial for a given value of x.
[20] Use the following rules for the derivative: Derivative(constant) = 0 Derivative(x) = 1 Derivative(P(x) + Q(y)) = Derivative(P(x)) + Derivative(Q(y)) Derivative(P(x) - Q(y)) = Derivative(P(x)) - Derivative(Q(y)) Derivative(P(x) * Q(y)) = P(x)*Derivative(Q(y)) + Q(x)*Derivative(P(x)) Derivative(P(x) / Q(y)) = P(x)*Derivative(Q(y)) - Q(x)*Derivative(P(x)) Derivative(P(x) ^ Q(y)) = Q(y) * (P(x) ^(Q(y) - 1)) * Derivative(Q(y))