How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)? - perl

I have written a Perl script that performs many one-sample t-tests. I get thousands of t-statistics with their degrees of freedom (df). I need to upgrade the script to also return their p-values (there are too many to look them up manually in a table). Is there some kind of formula I can use for this, with the t-statistic and d.f. as input? I hope someone can help me with this, many thanks in advance!
A.A.

Using Statistics::Distributions seems pretty straightforward:
use Statistics::Distributions;
print Statistics::Distributions::tprob($dof, $tstat);

A search of MetaCPAN reveals the following:
https://metacpan.org/pod/Statistics::Distributions

If you are doing a two-tailed test, then your p-value = 2*P(T > |t|), where t is your calculated test statistic. So essentially, you need a way to model the t-distribution in order to integrate its density from |t| to infinity. Here is a demo: http://www.stat.tamu.edu/~west/applets/tdemo.html
I'm not familiar with Perl and its libraries, but hopefully this gets you started. You can write a rudimentary integrator and check some values to make sure that it is accurate enough.
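Putting the two answers together, here is a minimal sketch (the t-statistic and df values below are made up for illustration); tprob returns the upper-tail probability P(T > t), so doubling it at |t| gives the two-tailed p-value:

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Distributions;

my $tstat = 2.3;   # example t-statistic
my $df    = 14;    # example degrees of freedom

# tprob($df, $t) returns the upper-tail probability P(T > $t).
my $p_one = Statistics::Distributions::tprob($df, abs($tstat));
my $p_two = 2 * $p_one;

print "one-tailed p = $p_one, two-tailed p = $p_two\n";

If you need very high precision in the extreme tails, note that the module's precision is limited; for typical significance flagging it should be fine.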

Related

Using Gurobi to run a MIQP: how can I improve time performance?

I am using Gurobi to run a MIQP (Mixed Integer Quadratic Programming) with linear constraints in Matlab. The solver is very slow and I would like your help to understand whether I can do something about it.
These are the lines I use to launch the problem:
clear model;
clear params;
model.A=[Aineq; Aeq];
model.rhs=[bineq; beq];
model.sense=[repmat('<', size(Aineq,1),1); repmat('=', size(Aeq,1),1)];
model.Q=Q;
model.obj=c;
model.vtype=type;
model.lb=total_lb;
model.ub=total_ub;
params.MIPGap=10^(-1);
result=gurobi(model,params);
This is a screenshot of the output in the Matlab window.
Question 1: This is the first time I am trying to run a MIQP, and I would like your advice on what I can do to improve performance. Let me tell you what I have tried so far:
I cheated by imposing params.MIPGap=10^(-1). This shortens the node exploration phase. What are the cons of doing this?
I have big-M coefficients, and I have tightened them to the smallest possible values.
I have tried setting params.ScaleFlag=2; params.ObjScale=2, but it makes things slower.
I have changed params.method, but it does not seem to help (unless you have some specific recommendation).
I have increased params.Threads, but it does not seem to help.
Question 2 (minor): Why do I get a negative objective in the root simplex log? How can the objective function be negative?
Without having the full model here, there is not much advice to give. Tight big-M formulations are important, but you said you have checked them already. Sometimes splitting them up might help, but this is a complex field.
What might give great benefits for some problems is the Gurobi parameter tuning tool. Try exporting your model and feeding it to the tuning tool. It automatically tries different combinations of the hundreds of tuning parameters and might give some nice results.
Regarding the question about negative objectives in the simplex logs, I can think of a couple of possible explanations. First, note that the negative objective values occur in the presence of dual infeasibilities in the dual simplex run. In such a case, I'm not sure exactly what the primal objective values correspond to. Second, if you have a MIQP with products of binaries in the objective, Gurobi may convexify the objective in a way that makes it possible for a negative objective to appear in the reformulated model even when the original model must have a nonnegative objective in any feasible solution.
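To make the second explanation concrete, here is a toy example (not from the original model): take f(x) = x_1 x_2 over binaries x_1, x_2. Since x_i^2 = x_i for binary variables, a solver may add multiples of (x_i^2 - x_i) to convexify the objective, e.g.

\tilde{f}(x) = x_1 x_2 + (x_1^2 - x_1) + (x_2^2 - x_2).

\tilde{f} is convex (its Hessian has eigenvalues 1 and 3) and agrees with f at every binary point, yet at the fractional point x = (1/2, 1/2) the relaxation gives \tilde{f} = 1/4 - 1/2 = -1/4 < 0, even though f >= 0 everywhere on [0,1]^2. A negative objective in the root log can therefore be a pure artifact of such a reformulation.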

random value from lognormal distribution in perl

I want to randomly pick numbers from a lognormal distribution that I can define myself, and I want to use Perl for this. There's probably a really simple solution using only a couple of lines of code, but I can't find anything just now, and when I try to think about it, my thoughts get stuck somewhere... So, I'd appreciate any kind of help.
Thanks in advance.
Thanks in advance.
Start at CPAN to get a normal generator, then apply the formula on the Wikipedia page you linked to by replacing Z with a call to Normal.
Or you could just grab a lognormal generator from CPAN...
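In case it helps, here is a self-contained sketch along the first answer's lines: a Box-Muller transform supplies the standard normal Z with no extra CPAN module, and exp(mu + sigma*Z) turns it into a lognormal draw. The parameters mu and sigma are the log-scale parameters you define yourself; the values below are just examples.

use strict;
use warnings;

use constant PI => 4 * atan2(1, 1);

# Box-Muller: turn two uniform(0,1) draws into one standard normal draw.
sub random_normal {
    my $u1 = rand();
    $u1 = rand() while $u1 == 0;    # avoid log(0)
    my $u2 = rand();
    return sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
}

# Lognormal draw: exp(mu + sigma * Z), with mu and sigma on the log scale.
sub random_lognormal {
    my ($mu, $sigma) = @_;
    return exp($mu + $sigma * random_normal());
}

print random_lognormal(0, 0.5), "\n" for 1 .. 5;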

max likelihood fminsearch

I used MATLAB's fminsearch for a negative max-likelihood model for a binomially distributed function. I don't get any error notice, but the parameter I want to estimate always takes the start value. Apparently, there is a mistake. I know that I'm asking a totally general question, but is it possible that anybody has made the same mistake and knows how to deal with it?
Thanks a lot,
@woodchips, thank you a lot. Step by step, I've tried to do what you advised me. First of all, I actually maximized (-log(likelihood)), and this is not the problem. I think I found the problem, but I still have some questions, if I'm not bothering you. I have a model(param) to maximize at paramstart=p1. This model is built for (-log(likelihood(F))), and my F is a vectorized function like F(t,Z,X,T,param,m2,m3,k,l). I have data like (tdata,kdata,ldata), X and T are grids, Z is a function on this grid, and (m1,m2,m3) are given parameters. When I look at the value of F(tdata,Z,X,T,m1,m2,m3,kdata,ldata), I get a good output. But I think fminsearch treats F(tdata,Z,X,T,p,m2,m3,kdata,ldata) like a constant, and that's why I always get the start value as the estimated parameter. I will be happy if you have any advice on how to tweak that.
You have some options you can try to tweak. I'd start with the algorithm.
It's also problematic when the function value practically doesn't change around your start point. Maybe switching to the log-likelihood helps.
I always use fminunc or fmincon. They also allow providing the Hessian (typically better than an estimated one) or 'typical values', so the algorithm doesn't spend time in infeasible regions.
It is virtually always true that you should NEVER maximize a likelihood function, but ALWAYS maximize the log of that function. Floating point issues will almost always corrupt the problem otherwise. That your optimization starts and stops at the same point is a good indicator this is the problem.
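To spell that out for the binomial model in the question (the symbols n, k_i, p below are illustrative, not from the original post): instead of working with the likelihood

L(p) = \prod_i \binom{n}{k_i} p^{k_i} (1 - p)^{n - k_i},

hand the optimizer its negative logarithm

-\log L(p) = -\sum_i \left[ k_i \log p + (n - k_i) \log(1 - p) \right] + \text{const},

where the constant collects the binomial coefficients and can be dropped for optimization. A product of thousands of small probabilities underflows to zero in floating point, at which point the objective is flat and the optimizer stops at the start value; the sum of logs does not have this problem.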
You may well need to dig a little deeper than the above, but even so, this next test is the test I recommend that all users of optimization tools do for every one of their problems, BEFORE they throw a function into an optimizer. Evaluate your objective for several points in the vicinity. Does it yield significantly different values? If not, then look to see why not. Are you creating a non-smooth objective to optimize, or a zero objective? I.e., zero to within the supplied tolerances?
If it does yield different values but still not converge, then make sure you know how to call the optimizer correctly. Yeah, right, like nobody has ever made this mistake before. This is actually a very common cause of failure of optimizers.
If it does yield good values that vary, and you ARE calling the optimizer correctly, then think if there are regions into which the optimizer is trying to diverge that yield garbage results. Is the objective generating complex or imaginary results?

Is there a CPAN module for creating the CDF for a given list?

It's not too difficult to implement, but it seems like the cumulative distribution function should be a very basic Statistics::Descriptive feature, doesn't it?
It seems Statistics::Descriptive::Weighted has it, but using Weighted (with equal weights for all data...) instead of the simpler Statistics::Descriptive seems to have a large overhead.
There is Statistics::KernelEstimation. I have not used it but it looks OK.
Using Statistics::Descriptive::Weighted, you can omit the weights and each point will be assigned a unit weight by default.
A CPAN search came up with this: PDL::GSL::CDF. It's part of PDL, the Perl Data Language (though this CDF module makes use of the GNU Scientific Library).
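And if rolling your own turns out to be simplest after all, an empirical CDF really is only a few lines of Perl; a sketch (make_ecdf is a made-up helper name):

use strict;
use warnings;

# Build an empirical CDF from a list of values: returns a closure
# that maps x to the fraction of data points <= x.
sub make_ecdf {
    my @data = @_;
    my $n = scalar @data;
    return sub {
        my ($x) = @_;
        my $count = grep { $_ <= $x } @data;
        return $count / $n;
    };
}

my $cdf = make_ecdf(1, 3, 3, 7, 9);
print $cdf->(3), "\n";   # 0.6, since three of the five points are <= 3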

Using MATLAB's plotting features as an interactive part of a Fortran program

Although many of you will have a decent idea of what I'm aiming at just from reading the title -- allow me a simple introduction still.
I have a Fortran program - it consists of a main program, some internal subroutines, 7 modules with their own procedures, and ... uhmm, that's it.
Without going into much detail, for I don't think it's necessary at this point: what would be the easiest way to use MATLAB's plotting features (mainly plot(x,y) with some customizations) as an interactive part of my program? For now I'm using some of my own custom plotting routines (based on HPGL and Calcomp's routines), but just as an exercise on my part, I'd like to see where this could go and how it would work (is what I'm suggesting even possible?). Also, how much effort would it take on my part?
I know this subject has been rather extensively described in many "tutorials" on the net, but for some reason I have trouble finding the really simple yet illustrative introductory ones. So if anyone can post an example or two, simple ones, I'd be really grateful. Or just take me by the hand and guide me through one working example.
platform: IVF 11.something :) on Win XP SP2, Matlab 2008b
The easiest way would be to have your Fortran program write to file, and have your Matlab program read those files for the information you want to plot. I do most of my number-crunching on Linux, so I'm not entirely sure how Windows handles one process writing a file and another reading it at the same time.
That's a bit of a kludge though, so you might want to think about using Matlab to call the Fortran program (or parts of it) and get data directly for plotting. In this case you'll want to investigate Creating Fortran MEX Files in the Matlab documentation. This is relatively straightforward to do and would serve your needs if you were happy to use Matlab to drive the process and Fortran to act as a compute service. I'd look in the examples distributed with Matlab for simple Fortran MEX files.
Finally, you could call Matlab from your Fortran program, search the documentation for Calling the Matlab Engine. It's a little more difficult for me to see how this might fit your needs, and it's not something I'm terribly familiar with.
If you post again with more detail I may be able to provide more specific tips, but you should probably start rolling your sleeves up and diving in to MEX files.
Continuing the discussion of DISLIN as a solution, with an answer that won't fit into a comment...
@M. S. B. - hello. I apologize for writing in your answer, but these comments are much too short, and replying to an answer with another answer is ... anyway ...
There is the Quick Plot feature of DISLIN -- the routine QPLOT needs only three arguments to plot a curve: an X array, a Y array, and the number of points N. See Chapter 16 of the manual. Only a few additional calls are needed to select the output device and label the axes. I haven't used this, so I don't know how good the auto-scaling is.
Yes, I know of Quickplot and its related routines, but it is too rigid for my needs (you cannot change anything), and yes, its auto-scaling is somewhat quirky. Also, the margins inside the graph are too big.
Or, if you want to use the power of GRAF to set up your graph box, there is the subroutine GAXPAR to automatically generate recommended values. Passing -2 as the first argument to LABDIG automatically determines the number of digits in tick-mark labels.
Have you tried the routines?
Sorry, I cannot find the GAXPAR routine you're referring to in DISLIN's index. Are you sure it is called exactly that?
Reply by M.S.B.: Yes, I am sure about the spelling of GAXPAR. It is the last routine in Chapter 4 of the DISLIN 9.5 PDF manual. Perhaps it is a new routine? Also there is another path to automatic scaling: SETSCL -- see Chapter 6.
So far, what I've been doing (apart from some "duct tape" solutions) is:
use dislin
implicit none
real, dimension(5) :: &
    x = [.5, 2., 3., 4., 5.], &
    y = [10., 22., 34., 43., 15.]
real :: xa, xe, xor, xstp, &   ! axis limits and steps; with setscl in
        ya, ye, yor, ystp      ! effect, graf recalculates them from the data

call setpag('da4p')
call metafl('xwin')
call disini()
call winkey('return')
call setscl(x, size(x), 'x')
call setscl(y, size(y), 'y')
call axslen(1680, 2376)        ! (8/10)*2100 and (8/10)*2970, respectively
call setgrf('name', 'name', 'line', 'line')
call incmrk(1)                 ! mark each data point with a symbol
call hsymbl(3)                 ! symbol size
call graf(xa, xe, xor, xstp, ya, ye, yor, ystp)
call curve(x, y, size(x))
call disfin()
end
which will put the extreme values right on the axis. Do you perhaps know how I could get one "major tick margin" on the outside, so as to put some space between the curve and the axis (while still keeping setscl's effects)?
Even if you don't like the built-in auto-scaling, if you are already using DISLIN, rolling your own auto-scaling will be easier than calling Fortran from MATLAB. You can use the Fortran intrinsic functions minval and maxval to find the smallest and largest values in the data, then write a subroutine to round outwards to "nice" round values. Similarly, a subroutine to decide on the tick-mark spacing.
This is actually not so easy to accomplish (ideas to prove me wrong will be gladly appreciated). Or should I say, it is easy if you know the rough range in which your values will lie. But if you don't, and you don't know whether your values will lie in the range 13-34 or 1330-3440, then ...
... if I'm on the wrong track completely here, please explain if you meant something different. My English is somewhat lacking, so I can only hope the above is understandable.
Inside a subroutine to determine round graph start/end values, you could scale the actual min/max values to always be between 1 and 10, then have a table to pick nice round values, then unscale back to the correct range.
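That normalize-snap-unscale recipe is quick to prototype; here is a language-agnostic sketch in Perl (nice_ceil is a made-up name, and a Fortran port is mechanical): factor out the power of ten to bring the value into [1, 10), snap the mantissa up to the next "nice" value, and rescale. A mirror-image routine rounding downwards gives the lower axis limit.

use strict;
use warnings;
use POSIX qw(floor);

# Round a positive value up to the next "nice" number
# (1, 2, 2.5, 5 or 10 times a power of ten).
sub nice_ceil {
    my ($x) = @_;
    my $exp  = floor(log($x) / log(10));   # decimal exponent
    my $mant = $x / 10**$exp;              # mantissa in [1, 10)
    for my $nice (1, 2, 2.5, 5, 10) {
        return $nice * 10**$exp if $mant <= $nice;
    }
}

print nice_ceil(34), "\n";     # 50
print nice_ceil(3440), "\n";   # 5000

This handles the 13-34 versus 1330-3440 case from the comment above identically, since only the exponent differs.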
--
Dump MATLAB: it's proprietary, expensive, and bloated/slow, and the code is not easy to parallelize.
What you should do is use something along the lines of DISLIN, PLplot, GINO, gnuplotfortran, etc.