P value in backward elimination > 0.05 are eliminated from the features, why do we eliminate? (multi variable linear regression) - linear-regression

In general if the p value is less than 0.05 significance level we reject the null,
In backward elimination we delete the features whose p value is greater than 0.05, why not delete the delete the terms whose p value is less.
and on what condition do the regression model calculate the p value ?
can anyone explain in simple and clear terms.
Thanks for ur time and help.

when the p value is >.05 the significance level is <0.05,
significance level + p_value =1
so it means if high p_value less significance of the variable hence we delete that variable.
if p_value is <0.05 then significance level is > 0.95 hence we take columns with high significance i.e low p_value and del columns with high p_value

Your answer and the assumption are not right, because in Backward Elimination the very first Step is to set SL (significance level) for your model to a constant value (usually SL = 0.05).
The answer is simple. The goal is to keep the significant occurrences. so if for example you set the SL = 0.1, the P_values greater 0.1 are insignificant and the P_values smaller or equal to 0.1 are considered significant.
The higher the P_value, more insignificant the predictor (occurrence) and the coefficient closer to zero (H_0). Such an occurrence is not suitable for the model to train with.

Related

Assessing performance of a zero inflated negative binomial model

I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero inflated negative binomial model (package: pscl)
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, type = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate RMSPE (root mean square predictive error), but is probably best to transform your counts beforehand, to ensure that the really big counts do not dominate this sum.
You may find false negative and false positive error rates (FNR and FPR) to be useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google on these terms to find a paper in your favourite quantitative journals or a chapter in a book that helps explain these simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example, that anyone can use (not sure where your trainData comes from). This is from help on zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these. But here's the by hand approach. First calculate observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then get the confusion matrix (basis of FNR, FPR)
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[2,1] / sum(confusion.matrix[,1])
FNR
In terms of calibration we can do it visually or via calibration
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="aqua")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[!is.na(output$log.obs) & !is.na(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlations, between the predicted counts and the actual counts, either including or excluding the zeros.
So you could fit a regression as a kind of calibration to evaluate this!
However, since the predictions are not necessarily counts, we can't use a poisson
regression, so instead we can use a lognormal, by regressing the log
prediction against the log observed, assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are more fancy ways of assessing calibration I suppose, as in any modelling exercise ... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log likelihood can be used, in the references provided for the zeroinfl function. This requires a bit of finesse.

Spearman's rank correlation significance

I'm calculating Spearman's rank correlation in matlab with the following code:
[RHO,PVAL] = corr(x,y,'Type','Spearman');
RHO =
0.7211
PVAL =
4.9473e-04
and then with different variables
[RHO,PVAL] = corr(x2,y2,'Type','Spearman');
RHO =
0.3277
PVAL =
0.0060
How do you categorize these as p < 0.05, p < 0.01, p < 0.001 etc. Commonly in scientific journals these pvalues are represented as the examples I've shown and not as one number. Would both of these be p < 0.01? When defining whether a correlation is significant to a specific value do you always look for the smallest error i.e if its PVAL = 0.0005, both p > 0.05 and p > 0.001 would be correct here, do we simply write the lowest i.e. p > 0.001?
As Martin Dinov wrote, this is at least partially a matter of journal policy. But, as long as there is no explicit journal convention against it, I would recommend to always report the actual p-value, in this case in the form p = 4.9·10-4 and p = 0.006, respectively. You can then proceed to say that the effect you found is statistically significant, usually based on comparison with a previously chosen significance level, typically 0.05, unless you need to correct for multiple comparisons.
The reason is that the commonly used significance levels are purely a matter of convention. By only saying that p is below one conventional threshold means to withhold valuable information from the reader, which she might use to make up her own mind about the result – and this truncation is not even justified by relevant saving of print space.
You should also, of course, report the value of the correlation coefficient itself (which in this case doubles as a test statistic and an effect size) as well as the sample size.
At least for the field of psychology, these are official recommendations:
Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval.
…
Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d).
L. Wilkinson and the Task Force on Statistical Inference, "Statistical Methods in Psychology Journals. Guidelines and Explanations"
You mean pval is < 0.05 and also < 0.001 and not >. In general, you do want to show that it is smaller than the smallest significance level (alpha) threshold that you can. So yes, it is best to say for the second example that the p-value is < 0.001. Depending on the journal convention, it may be preferable to put the actual p-value in (so, for the first example, 4.9473e-04) or just that it's < some good alpha (0.0001 for the first case).

Matlab linprog constraints: how to stop charge and discharge storage at same time?

I'm having issues with matlab linprog code. The optimisation function is the overall cost for the 24 period, just considering fuel costs of the boiler.
Purpose of simulation:
Optimisation of the charge/discharge behaviour of a Thermal Energy Storage (TES) for a 24h operation of a system consisting of a boiler, heat demand, and the TES. The price of the gas are time-varying.
Problem:
If the TES is ideal (efficiency=100%), I have no constraint that stops the system from charging and discharging at the same time. I CANNOT use one variable to describe charge and discharge. I do need them separated
At the moment I have the following constraints to describe the min/max charge/discharge rates (and of course some others):
maxChargeThermalTES>=ChargeThermalTES<=0
maxDischargeThermalTES >= DischargeThermalTES <=0
is it possible to realise the following logical rule within the constraints of linprog?
if ChargeThermalTES<0,
DischargeThermalTES=0
end
all approaches, e.g. with a binary variable (to describe if the system is charging or discharging) do not work, as the binary variable always depends on the output of the optimisation.
You cannot enforce such logical rule in linear programming.
However, what you can do is the following :
1\ solve your linear program, without this constraint. Get the optimal cost of your objective function (lets name it OldCost).
2\ Then change your linear program this way :
add a constraint : old objective function should be between OldCost * (1-Epsilon) and OldCost * (1+Epsilon)
the new objective function to minimize is ChargeThermalTES + DischargeThermalTES.
Cheers
Yes, it is possible. You can add your If-Then condition to linprog using one 0-1 binary variable and Big-M.
To realize the Logical rule:
if ChargeThermalTES<0,
DischargeThermalTES=0
end
Condition: If ChargeThermalTES<0, then DischargeThermalTES=0
Let's introduce a binary variable y
So we can rewrite the condition as
ChargeThermalTES - M y < 0
which implies that
if y = 0, then DischargeThermalTES must be = 0
if y=1, DischargeThermalTES can be anything
Let's split the equal to constraint into two inequalities
DischargeThermalTES < M y
DischargeThermalTES > -M y
If y=1, the above two are essentially non-binding.
If y= 0, it will force DischargeThermalTES to become 0.
So with the following constraints combined, you can enforce your logical constraint to the Linear Program.
ChargeThermalTES - M y < 0
DischargeThermalTES < M y
DischargeThermalTES > -M y
y = {0,1} binary, M is a large number.
Hope that helps.

explanation of roulette wheel selection example

I have a lisp program on roulette wheel selection,I am trying to understand the theory behind it but I cannot understand anything.
How to calculate the fitness of the selected strng?
For example,if I have a string 01101,how did they get the fitness value as 169?
Is it that the binary coding of 01101 evaluates to 13,so i square the value and get the answer as 169?
That sounds lame but somehow I am getting the right answers by doing that.
The fitness function you have is therefore F=X^2.
The roulette wheel calculates the proportion (according to its fitness) of the whole that that individual (string) takes, this is then used to randomly select a set of strings for the next generation.
Suggest you read this a few times.
The "fitness function" for a given problem is chosen (often) arbitrarily keeping in mind that as the "fitness" metric rises, the solution should approach optimality. For example for a problem in which the objective is to minimize a positive value, the natural choice for F(x) would be 1/x.
For the problem at hand, it seems that the fitness function has been given as F(x) = val(x)*val(x) though one cannot be certain from just a single value pair of (x,F(x)).
Roulette-wheel selection is just a commonly employed method of fitness-based pseudo-random selection. This is easy to understand if you've ever played roulette or watched 'Wheel of Fortune'.
Let us consider the simplest case, where F(x) = val(x),
Suppose we have four values, 1,2,3 and 4.
This implies that these "individuals" have fitnesses 1,2,3 and 4 respectively. Now the probability of selection of an individual 'x1' is calculated as F(x1)/(sum of all F(x)). That is to say here, since the sum of the fitnesses would be 10, the probabilities of selection would be, respectively, 0.1,0.2,0.3 and 0.4.
Now if we consider these probabilities from a cumulative perspective the values of x would be mapped to the following ranges of "probability:
1 ---> (0.0, 0.1]
2 ---> (0.1, (0.1 + 0.2)] ---> (0.1, 0.3]
3 ---> (0.3, (0.1 + 0.2 + 0.3)] ---> (0.3, 0.6]
4 ---> (0.6, (0.1 + 0.2 + 0.3 + 0.4)] ---> (0.6, 1.0]
That is, an instance of a uniformly distributed random variable generated, say R lying in the normalised interval, (0, 1], is four times as likely to be in the interval corresponding to 4 as to that corresponding to 1.
To put it another way, suppose you were to spin a roulette-wheel-type structure with each x assigned a sector with the areas of the sectors being in proportion to their respective values of F(x), then the probability that the indicator will stop in any given sector is directly propotional to the value of F(x) for that x.

Dijkstra's algorithm with negative weights

Can we use Dijkstra's algorithm with negative weights?
STOP! Before you think "lol nub you can just endlessly hop between two points and get an infinitely cheap path", I'm more thinking of one-way paths.
An application for this would be a mountainous terrain with points on it. Obviously going from high to low doesn't take energy, in fact, it generates energy (thus a negative path weight)! But going back again just wouldn't work that way, unless you are Chuck Norris.
I was thinking of incrementing the weight of all points until they are non-negative, but I'm not sure whether that will work.
As long as the graph does not contain a negative cycle (a directed cycle whose edge weights have a negative sum), it will have a shortest path between any two points, but Dijkstra's algorithm is not designed to find them. The best-known algorithm for finding single-source shortest paths in a directed graph with negative edge weights is the Bellman-Ford algorithm. This comes at a cost, however: Bellman-Ford requires O(|V|·|E|) time, while Dijkstra's requires O(|E| + |V|log|V|) time, which is asymptotically faster for both sparse graphs (where E is O(|V|)) and dense graphs (where E is O(|V|^2)).
In your example of a mountainous terrain (necessarily a directed graph, since going up and down an incline have different weights) there is no possibility of a negative cycle, since this would imply leaving a point and then returning to it with a net energy gain - which could be used to create a perpetual motion machine.
Increasing all the weights by a constant value so that they are non-negative will not work. To see this, consider the graph where there are two paths from A to B, one traversing a single edge of length 2, and one traversing edges of length 1, 1, and -2. The second path is shorter, but if you increase all edge weights by 2, the first path now has length 4, and the second path has length 6, reversing the shortest paths. This tactic will only work if all possible paths between the two points use the same number of edges.
If you read the proof of optimality, one of the assumptions made is that all the weights are non-negative. So, no. As Bart recommends, use Bellman-Ford if there are no negative cycles in your graph.
You have to understand that a negative edge isn't just a negative number --- it implies a reduction in the cost of the path. If you add a negative edge to your path, you have reduced the cost of the path --- if you increment the weights so that this edge is now non-negative, it does not have that reducing property anymore and thus this is a different graph.
I encourage you to read the proof of optimality --- there you will see that the assumption that adding an edge to an existing path can only increase (or not affect) the cost of the path is critical.
You can use Dijkstra's on a negative weighted graph but you first have to find the proper offset for each Vertex. That is essentially what Johnson's algorithm does. But that would be overkill since Johnson's uses Bellman-Ford to find the weight offset(s). Johnson's is designed to all shortest paths between pairs of Vertices.
http://en.wikipedia.org/wiki/Johnson%27s_algorithm
There is actually an algorithm which uses Dijkstra's algorithm in a negative path environment; it does so by removing all the negative edges and rebalancing the graph first. This algorithm is called 'Johnson's Algorithm'.
The way it works is by adding a new node (lets say Q) which has 0 cost to traverse to every other node in the graph. It then runs Bellman-Ford on the graph from point Q, getting a cost for each node with respect to Q which we will call q[x], which will either be 0 or a negative number (as it used one of the negative paths).
E.g. a -> -3 -> b, therefore if we add a node Q which has 0 cost to all of these nodes, then q[a] = 0, q[b] = -3.
We then rebalance out the edges using the formula: weight + q[source] - q[destination], so the new weight of a->b is -3 + 0 - (-3) = 0. We do this for all other edges in the graph, then remove Q and its outgoing edges and voila! We now have a rebalanced graph with no negative edges to which we can run dijkstra's on!
The running time is O(nm) [bellman-ford] + n x O(m log n) [n Dijkstra's] + O(n^2) [weight computation] = O (nm log n) time
More info: http://joonki-jeong.blogspot.co.uk/2013/01/johnsons-algorithm.html
Actually I think it'll work to modify the edge weights. Not with an offset but with a factor. Assume instead of measuring the distance you are measuring the time required from point A to B.
weight = time = distance / velocity
You could even adapt velocity depending on the slope to use the physical one if your task is for real mountains and car/bike.
Yes, you could do that with adding one step at the end i.e.
If v ∈ Q, Then Decrease-Key(Q, v, v.d)
Else Insert(Q, v) and S = S \ {v}.
An expression tree is a binary tree in which all leaves are operands (constants or variables), and the non-leaf nodes are binary operators (+, -, /, *, ^). Implement this tree to model polynomials with the basic methods of the tree including the following:
A function that calculates the first derivative of a polynomial.
Evaluate a polynomial for a given value of x.
[20] Use the following rules for the derivative: Derivative(constant) = 0 Derivative(x) = 1 Derivative(P(x) + Q(y)) = Derivative(P(x)) + Derivative(Q(y)) Derivative(P(x) - Q(y)) = Derivative(P(x)) - Derivative(Q(y)) Derivative(P(x) * Q(y)) = P(x)*Derivative(Q(y)) + Q(x)*Derivative(P(x)) Derivative(P(x) / Q(y)) = P(x)*Derivative(Q(y)) - Q(x)*Derivative(P(x)) Derivative(P(x) ^ Q(y)) = Q(y) * (P(x) ^(Q(y) - 1)) * Derivative(Q(y))