Shouldn't the norm.ppf() method only give positive output? - scipy

I am taking an online Udemy course on finance.
The author uses .ppf() in one of his Monte Carlo simulations to simulate daily stock returns.
My understanding is that .ppf() (the percent point function) gives the distance from the mean for a given percentage, and that it is always positive. Since the normal distribution is symmetric, in reality there are two values, +/- (output), but I thought ppf() just showed the absolute distance.
To my surprise the code gives both positive and negative outputs. Can somebody help me: is there something wrong with my understanding of statistics, or is there something about the ppf() method I do not know? I couldn't find documentation explaining how ppf() works. Thank you!
The code is simple:
norm.ppf(np.random.rand(10, 2))

The ppf method is the inverse of the CDF. It is also known as the quantile function.
You said "and it is always positive", but that is not correct. It will return values from the support of the distribution, which for the normal distribution is the entire real line.
The expression norm.ppf(np.random.rand(10, 2)) generates random samples from the standard normal distribution, using the inverse transform method. Instead of using that expression, you could simply call the rvs method: norm.rvs(size=(10, 2)).
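As a quick sanity check, here is a minimal sketch (using only scipy.stats.norm and numpy, as in the question) showing that ppf maps uniform values below 0.5 to negative quantiles, and that the rvs call is equivalent for sampling:

import numpy as np
from scipy.stats import norm

# Inverse transform sampling: uniform draws on (0, 1) pushed through the
# inverse CDF (ppf) become standard normal samples.
u = np.random.rand(10, 2)
samples_inverse_transform = norm.ppf(u)

# Any probability below 0.5 maps to a negative quantile, e.g.
print(norm.ppf(0.25), norm.ppf(0.75))   # about -0.674 and +0.674

# Direct sampling gives statistically equivalent output:
samples_direct = norm.rvs(size=(10, 2))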

Related

Don't understand the need of "grad" in lrCostFunction.m

I am coding lrCostFunction.m in Octave for the Coursera Machine Learning course (Neural Networks, "ex3"). I don't get why we need to compute "grad". Does anybody have a clue?
Thanks in advance.
Grad refers to the 'gradient' of the cost function.
Your objective is to minimize the cost function. In order to do that, most optimisation algorithms also need the equation for its gradient at each point, so that they can move the next search step in a direction that is likely to lower the cost function.
Specifically, since the gradient at a point is defined as the direction of the maximal rate of increase of the underlying function, optimisation algorithms typically take a small step from the current point in the direction opposite to the gradient.
In any case, since you're asking an abstract optimisation algorithm to find parameters that minimize a cost function by making use of its gradient at each step, you need to provide both of those quantities to the algorithm. That is why you need to calculate the 'grad' value as well as the value of the cost function itself at each point.
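To make that concrete, here is a minimal Python sketch of plain gradient descent (not the Octave code or the optimiser used in the course); the only thing it needs from the cost function is exactly the (cost, grad) pair that lrCostFunction.m returns:

import numpy as np

def gradient_descent(cost_and_grad, theta0, lr=0.1, n_iters=100):
    # cost_and_grad(theta) returns (cost, grad), analogous to the
    # [J, grad] outputs of lrCostFunction.m.
    theta = theta0.copy()
    for _ in range(n_iters):
        cost, grad = cost_and_grad(theta)
        theta -= lr * grad   # small step opposite to the gradient
    return theta

# Example: minimise f(theta) = theta . theta, whose gradient is 2 * theta.
theta_min = gradient_descent(lambda t: (t @ t, 2 * t), np.array([3.0, -4.0]))
print(theta_min)   # approaches [0, 0]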

MSE Cost Function for Training Neural Network

In an online textbook on neural networks and deep learning, the author illustrates neural net basics in terms of minimizing a quadratic cost function which he says is synonymous with mean squared error. Two things have me confused about his function, though (pseudocode below).
MSE ≡ (1/(2n)) * ∑ ‖y_true − y_pred‖^2
Instead of dividing the sum of squared errors by the number of training examples n, why is it divided by 2n? How is this the mean of anything?
Why is double-bar notation used instead of parentheses? This had me thinking there was some other calculation going on, such as an L2 norm, that is not shown explicitly. I suspect this is not the case and that the term is meant to express the plain old sum of squared errors. It is super confusing, though.
Any insight you can offer is greatly appreciated!
The 0.5 factor by which the cost function is multiplied is not important. In fact you could multiply it by any positive constant you want, and the learning would be the same. It is only there so that the derivative of the cost function with respect to the output is simply $$y - y_{t}$$, which is convenient in some applications, like backpropagation.
The notation ∥v∥ just denotes the usual length (Euclidean norm) of a vector v, as defined in the online textbook you referenced.
You can find more information on the double bars elsewhere, but from what I understand you can basically view it as the vector analogue of an absolute value.
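A tiny numpy example (not from the textbook) of what that double-bar term works out to:

import numpy as np

y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.8, 0.1, 0.1])

# ||y_true - y_pred|| is the Euclidean length of the error vector...
length = np.linalg.norm(y_true - y_pred)

# ...so its square is just the plain sum of squared componentwise errors.
print(length**2, np.sum((y_true - y_pred)**2))   # both approximately 0.06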
I'm not sure why it says 2n, but it's not always 2n. Wikipedia, for example, writes the function with a plain 1/n factor, MSE = (1/n) * ∑ (Y_i − Ŷ_i)^2.
Googling Mean Squared Error also turns up a lot of sources using the Wikipedia form instead of the one from the online textbook.
The double bar is a distance measure (a norm), and plain parentheses would be incorrect if y is multi-dimensional.
For the mean squared error proper there is no 2 with the n, but the factor is unimportant: it will be absorbed by the learning rate.
However, it is often included to cancel the factor of 2 that comes down from the square when evaluating the derivative.
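To see the cancellation concretely, here is a small numpy sketch under the textbook's 1/(2n) convention (the function names are mine, not the textbook's):

import numpy as np

def half_mse(y_true, y_pred):
    # Textbook convention: divide by 2n so the gradient has no stray constants.
    n = y_true.shape[0]
    return np.sum((y_true - y_pred) ** 2) / (2 * n)

def half_mse_grad(y_true, y_pred):
    # d/dy_pred of (1/(2n)) * sum((y_true - y_pred)^2): the 2 from the square
    # cancels the 1/2, leaving the plain mean error.
    n = y_true.shape[0]
    return (y_pred - y_true) / n

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.0])
print(half_mse(y_true, y_pred))        # 0.25
print(half_mse_grad(y_true, y_pred))   # approximately [ 0.167 -0.167 -0.333]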

Tolerances in Numerical quadrature - MATLAB

What is the difference between abstol and reltol in MATLAB when performing numerical quadrature?
I have a triple integral that is supposed to produce a number between 0 and 1, and I am wondering what the best tolerances for my application would be.
Any other ideas on decreasing the execution time of integral3 would also be welcome.
Also, does anyone know whether integral3 or quadgk is faster?
When performing the integration, MATLAB (or most any other integration software) computes a low-order solution qLow and a high-order solution qHigh.
There are a number of different methods of computing the true error (i.e., how far either qLow or qHigh is from the actual solution qTrue), but MATLAB simply computes an absolute error as the difference between the high and low order integral solutions:
errAbs = abs(qLow - qHigh).
If the integral is truly a large value, that difference may be large in an absolute sense but not in a relative sense. For example, errAbs might be 1E3 while qTrue is 1E12; in that case, the method could be said to have converged relatively, since at least 8 digits of accuracy have been reached.
So MATLAB also considers the relative error:
errRel = abs(qLow - qHigh)/abs(qHigh).
You'll notice I'm treating qHigh as qTrue since it is our best estimate.
Over a given sub-region, if the error estimate falls below either the absolute limit or the relative limit times the current integral estimate, the integral is considered converged. If not, the region is divided, and the calculation repeated.
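In pseudocode form, that acceptance test amounts to something like the following (a schematic Python sketch of the idea, not MATLAB's actual implementation):

def accept_subregion(q_low, q_high, abstol, reltol):
    # Error estimate: difference between the low- and high-order rules.
    err = abs(q_low - q_high)
    # Converged if the error meets EITHER the absolute OR the relative tolerance.
    return err <= abstol or err <= reltol * abs(q_high)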
For the integral function, and for the integral2/integral3 functions with the 'iterated' method, the low/high solutions are a Gauss-Kronrod 7-15 pair (the same 7th-order/15th-order set used by quadgk).
For the integral2/integral3 functions with the tiled method, the low-high solutions are a Gauss-Kronrod 3-7 pair (I've never used this option, so I'm not sure how it compares to others).
Since all of these methods come down to a Gauss-Kronrod quadrature rule, I'd say sticking with integral3 and letting it do the adaptive refinement as needed is the best course.
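For comparison, scipy's triple integrator exposes an analogous pair of tolerance knobs; this is only an illustrative sketch of how absolute and relative tolerances are passed (with a toy integrand over the unit cube), not a performance comparison with integral3:

from scipy import integrate

# scipy.integrate.tplquad takes epsabs/epsrel, analogous to integral3's
# 'AbsTol'/'RelTol' name-value pairs. The integrand is given as f(z, y, x).
val, err_est = integrate.tplquad(lambda z, y, x: x * y * z,
                                 0, 1,   # x limits
                                 0, 1,   # y limits
                                 0, 1,   # z limits
                                 epsabs=1e-10, epsrel=1e-6)
print(val, err_est)   # exact value is 1/8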

Implementing Bootstrap Confidence Intervals into Matlab

I apologise if this is quite obvious to some, but I have been trying to get my head around bootstrapping for a few hours and, for something so simple, I am really struggling.
I have a large data set, but it is not normally distributed, and I am trying to find confidence intervals, which is why I have turned to the bootstrap. I want to apply the bootstrap to the fourth column of a data set, which I can do.
However, I am having trouble with the bootci function itself:
ci = bootci(10000, ....., array);
I am having trouble implementing the function, as I don't fully understand what the second argument of bootci, denoted ....., does.
I have seen @mean used in other examples; I'm assuming this calculates the mean of each column and applies it to the function.
If anyone could confirm my thinking or explain the function to me it would be much appreciated!
I am also unsure about how to change the sample size; could someone point me in the right direction?
From what I understand of the question:
ci = bootci(10000, @mean, X);
will determine a 95% confidence interval for the mean of the dataset X, using 10000 subsamples generated by random sampling with replacement from dataset X.
The second argument, the function handle @mean, indicates that the function applied to each subsample is mean, and hence that the confidence interval is for the mean. You could equally pass @std to get a confidence interval on the standard deviation, or pass in any other suitable function for that matter.
From what I have read in the documentation, it does not seem to be possible to directly control the size of the subsamples used by the bootci function.
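For reference, here is a hedged Python sketch of the same procedure in plain numpy (not MATLAB's bootci, and using made-up data); it also shows where the resample size enters if you do the resampling by hand:

import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(size=500)          # stand-in for the non-normal 4th column

n_boot = 10000
stats = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement; by convention the resample size equals
    # len(data), which is why bootci does not expose a sample-size argument.
    resample = rng.choice(data, size=data.size, replace=True)
    stats[i] = resample.mean()          # swap in np.std, np.median, etc.

ci = np.percentile(stats, [2.5, 97.5])  # simple 95% percentile interval
print(ci)                               # (bootci's default is the BCa variant)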

Naive bayes classifier calculation

I'm trying to use a naive Bayes classifier to classify my dataset. My questions are:
1- Usually when we try to calculate the likelihood we use the formula:
P(c|x) = P(c|x1) * P(c|x2) * ... * P(c|xn) * P(c). But in some examples it says that, in order to avoid getting very small results, we use P(c|x) = exp(log(c|x1) + log(c|x2) + ... + log(c|xn) + logP(c)). Can anyone explain the difference between these two formulas to me, and are they both used to calculate the "likelihood", or is the second one used to calculate something called "information gain"?
2- In some cases, when we try to classify our datasets, some joint probabilities are null. Some people use the "Laplace smoothing" technique in order to avoid null joints. Doesn't this technique influence the accuracy of our classification?
Thanks in advance for all your time. I'm new to this algorithm and trying to learn more about it, so are there any recommended papers I should read? Thanks a lot.
I'll take a stab at your first question, assuming you lost most of the P's in your second equation. I think the equation you are ultimately driving towards is:
log P(c|x) = log P(c|x1) + log P(c|x2) + ... + log P(c)
If so, the examples are pointing out that in many statistical calculations, it's often easier to work with the logarithm of a distribution function, as opposed to the distribution function itself.
Practically speaking, it's related to the fact that many statistical distributions involve an exponential function. For example, you can find where the maximum of a Gaussian distribution K*exp(-s_0*(x-x_0)^2) occurs by solving the mathematically less complex problem (if we're going through the whole formal process of taking derivatives and finding equation roots) of finding where the maximum of its logarithm, log(K) - s_0*(x-x_0)^2, occurs.
This leads to many places where "take the logarithm of both sides" is a standard step in an optimization calculation.
Also, computationally, when you are optimizing likelihood functions that may involve many multiplicative terms, adding logarithms of small floating-point numbers is less likely to cause numerical problems than multiplying small floating point numbers together is.
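Here is a small numeric demonstration of that last point (generic Python, not tied to any particular naive Bayes implementation):

import math

# Fifty small conditional probabilities, as might arise from many features.
probs = [1e-8] * 50

product = 1.0
for p in probs:
    product *= p           # underflows to 0.0 partway through the loop

log_sum = sum(math.log(p) for p in probs)   # stays an ordinary, well-behaved float

print(product)             # 0.0
print(log_sum)             # about -921; log-scores can still be compared across classes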