How to judge the results of the Wilcoxon test? - scipy

Hi all! I want to conduct the Wilcoxon rank-sum test on two datasets, x1 and x2. My code is as follows:
from scipy import stats
x1 = [9,5,8,7,10,6,7] # len(x1) = 7
x2 = [7,4,5,6,3,6,4,4] # len(x2) = 8
stats.mannwhitneyu(x1, x2)
Then I get a result like this:
> MannwhitneyuResult(statistic=6.5, pvalue=0.006966479792405637)
I do know how the statistic variable is calculated.
But should I reject the null hypothesis or not, according to the result returned by mannwhitneyu()?
I also noticed that some other Python functions return statistic and pvalue values. I am quite confused about the meaning of the p-value generated by these functions, such as ranksums(), wilcoxon(), etc. Can I simply compare it with 0.05 to judge whether I should reject the null hypothesis or not?
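In general, yes: you pick a significance level before running the test (0.05 is only the common convention, not something scipy prescribes) and reject the null hypothesis when the p-value falls below it. A minimal sketch of that decision rule with the data above (alpha and the printed messages are my own illustrative choices):
from scipy import stats

x1 = [9, 5, 8, 7, 10, 6, 7]
x2 = [7, 4, 5, 6, 3, 6, 4, 4]

alpha = 0.05  # conventional significance level, chosen before looking at the data
# the default 'alternative' of mannwhitneyu has changed across scipy versions,
# so it is safer to state it explicitly
res = stats.mannwhitneyu(x1, x2, alternative='two-sided')
print(res.statistic, res.pvalue)
if res.pvalue < alpha:
    print("Reject the null hypothesis: the two samples appear to differ.")
else:
    print("Fail to reject the null hypothesis.")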

Related

predicting ODE parameters with DiffEqFlux

I'm trying to build a neural network that will take in the solutions to a system of ODEs and predict the parameters of the system. I'm using Julia and in particular, the DiffEqFlux package. The structure of the network is a few simple Dense layers chained together that predict some intermediate parameters (in this case, some chemical reaction free energies), which then feed into some deterministic (non-trained) layers that convert those parameters into the ones that go into the system of equations (in this case, reaction rate constants). I've tried two different approaches from here:
Chain the ODE solve directly on as the last layer of the network. In this case, the loss function is just comparing the inputs to the outputs.
Have the ODE solve in the loss function, so the network output is just the parameters.
However, in neither case can I get Flux.train! to actually run.
A silly little example for the first option that gives the same error I'm getting (I've tried to keep as many things parallel to my actual case as possible, i.e. the solver, etc., although I did omit the intermediate deterministic layers since they don't seem to make a difference) is shown below.
using Flux, DiffEqFlux, DifferentialEquations
# let's use Chris' favorite example, Lotka-Volterra
function lotka_volterra(du,u,p,t)
    x, y = u
    α, β, δ, γ = p
    du[1] = dx = α*x - β*x*y
    du[2] = dy = -δ*y + γ*x*y
end
u0 = [1.0,1.0]
tspan = (0.0,10.0)
# generate a couple sets of solutions to train on
training_params = [[1.5,1.0,3.0,1.0], [1.4,1.1,3.1,0.9]]
training_sols = [solve(ODEProblem(lotka_volterra, u0, tspan, tp)).u[end] for tp in training_params]
model = Chain(Dense(2,3), Dense(3,4), p -> diffeq_adjoint(p, ODEProblem(lotka_volterra, u0, tspan, p), Rodas4())[:,end])
# in this case we just want outputs to match inputs
# (actual parameters we're after are outputs of next-to-last layer)
training_data = zip(training_sols, training_sols)
# mean squared error loss
loss(x,y) = Flux.mse(model(x), y)
p = Flux.params(model[1:2])
Flux.train!(loss, p, training_data, ADAM(0.001))
# gives TypeError: in typeassert, expected Float64, got ForwardDiff.Dual{Nothing, Float64, 8}
I've tried all three solver layers, diffeq_adjoint, diffeq_rd, and diffeq_fd, none of which work, but all of which give different errors that I'm having trouble parsing.
For the other option (which I'd actually prefer, but either way would work), just replace the model and loss function definitions as:
model = Chain(Dense(2,3), Dense(3,4))
function loss(x,y)
    p = model(x)
    sol = diffeq_adjoint(p, ODEProblem(lotka_volterra, u0, tspan, p), Rodas4())[:,end]
    Flux.mse(sol, y)
end
The same error is thrown as above.
I've been hacking at this for over a week now and am completely stumped; any ideas?
You're running into https://github.com/JuliaDiffEq/DiffEqFlux.jl/issues/31, i.e. forward-mode AD for the Jacobian doesn't play nice with Flux.jl right now. To get around this, use Rodas4(autodiff=false) instead.

Indefinite integration with Matlab's Symbolic Toolbox - complex solution

I'm using Matlab 2014b. I've tried:
clear all
syms x real
assumeAlso(x>=5)
This returned:
ans =
[ 5 <= x, in(x, 'real')]
Then I tried:
int(sqrt(x^2-25)/x,x)
But this still returned a complex answer:
(x^2 - 25)^(1/2) - log(((x^2 - 25)^(1/2) + 5*i)/x)*5*i
I tried the simplify command, but it still returned a complex answer. Now, this might be fixed in the latest version of Matlab. If so, can people let me know or offer a suggestion for getting the real answer?
The hand-calculated answer is sqrt(x^2-25)-5*asec(x/5)+C.
This behavior is present in R2017b, though when converted to floating point the imaginary components are different.
Why does this occur?
This occurs because Matlab's int function returns the full general solution when you ask for the indefinite integral. This solution is valid over the entire domain of real values, including your restricted domain of x>=5.
With a bit of math you can show that the solution is always real for x>=5 (see complex logarithm). Or you can use more symbolic math via the isAlways function to show this:
syms x real
assume(x>=5)
y = int(sqrt(x^2-25)/x, x)
isAlways(imag(y)==0)
This returns true (logical 1). Unfortunately, Matlab's simplification routines appear unable to reduce this expression when assumptions are included. You might also submit this case to The MathWorks as a service request in case they'd consider improving the simplification for this and similar equations.
How can this be "fixed"?
If you want to get rid of the zero-valued imaginary part of the solution you can use sym/real:
real(y)
which returns 5*atan2(5, (x^2-25)^(1/2)) + (x^2-25)^(1/2).
Also, as @SardarUsama points out, when the full solution is converted to floating point (or variable precision) there will sometimes be numeric imprecision when converting from the exact symbolic form. Using the symbolic real form above should avoid this.
The answer is not really complex.
Take a look at this:
clear all; %To clear the conditions of x as real and >=5 (simple clear doesn't clear that)
syms x;
y = int(sqrt(x^2-25)/x, x)
which, as we know, gives:
y =
(x^2 - 25)^(1/2) - log(((x^2 - 25)^(1/2) + 5i)/x)*5i
Now put some real values of x≥5 to check what result it gives:
n = 1004; %We'll be putting 1000 values of x in y from 5 to 1004
yk = zeros(1000,1); %Preallocation
for k=5:n
    yk(k-4) = subs(y,x,k); %Putting the value of x
end
Now let's check the imaginary part of the result we have:
>> imag(yk)
ans =
1.0e-70 *
0
0
0
0
0.028298997121333
0.028298997121333
0.028298997121333
%and so on...
Notice the multiplier 1e-70.
Let's check the maximum value of imaginary part in yk.
>> max(imag(yk))
ans =
1.131959884853339e-71
This implies that the imaginary part is extremely small and not an amount worth worrying about. Ideally it would be zero; it appears only because of imprecise numerical calculations. Hence, it is safe to call your result real.

Understanding the chi2gof function in matlab

I'm trying to understand how to use the chi2gof function in Matlab with a very simple test. Let's assume that I toss a coin 190 times and get 94 heads and 96 tails. The null hypothesis should be that I get 95 heads and 95 tails. As far as I understand the documentation, I should be able to test the hypothesis by running
[h,p,stats] = chi2gof([94,96], 'expected', [95,95])
However, this returns h = 1, which supposedly means that the null hypothesis is rejected, which makes no sense. Another peculiar thing is that the O parameter in stats returns as O: [0 2] - but shouldn't this be my input ([94,96])? What am I doing wrong?
What am I doing wrong?
The problem is that you are passing the aggregated counts of your coin tosses to chi2gof. The goodness-of-fit test must be performed on the full sample. From the official documentation (reference here):
x = sample data for the hypothesis test, specified as a vector (the wrong part of your code)
Expected = expected counts for each bin (the correct part of your code)
Let's make an example using the correct variables:
ct = randsample([0 1],190,true,[0.49 0.51]);
[h,p,stats] = chi2gof(ct,'Expected',[95 95]);
The returned value of h is 0, meaning the null hypothesis is not rejected, which is what we expect for a roughly fair coin.
Now, let's make an example that is supposed to fail:
ct = randsample([0 1],190,true,[0.05 0.95]);
[h,p,stats] = chi2gof(ct,'Expected',[95 95]);
As you can see, h returned from this second test will be equal to 1.
On a final note, don't forget to take a look at the second output argument, which is the p-value of the test and an important element for evaluating the significance of the result.

matlab global stream: Any correlation between generated sets of numbers?

I'm just looking for some clarification on creating sets of random numbers in Matlab and how this relates to the 'global stream.'
I know that I can set the global stream for reproducibility of my results should I run the code again:
s = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(s);
A = rand(1,10);
Every time I run this, A is the same. For example,
s = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(s);
B = rand(1,10);
I should find that isequal(A,B) is true.
Now my question pertains to the following,
s = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(s);
A = rand(1,10);
B = rand(1,10);
If I run this then A and B are different sets of numbers. Can I take them to be independent sets, or is there some correlation between them? If I wanted to ensure stronger independence between A and B, should I create a new and different global stream after creating A, but before creating B? For example,
sA = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(sA);
A = rand(1,10);
sB = RandStream('mt19937ar','Seed',3);
RandStream.setGlobalStream(sB);
B = rand(1,10);
All pseudorandom number generators are based on deterministic algorithms, and all will fail a sufficiently specific statistical test for randomness.
When you change the seed (which you could also do with rng(your_desired_seed_number)), the generator just uses another part of the same sequence, which is not necessarily unrelated to the previous random number sequence (at least that is how I think of it; it is really a mathematical question).
So I suggest using different generators to get maximally independent random numbers:
rng(5,'twister'); % you could also use randstream instead of rng
A=rand(1,10);
rng(3,'combRecursive');
B=rand(1,10);

Tensorflow max-margin loss training?

I want to train a neural network in tensorflow with a max-margin loss function using one negative sample per positive sample:
max(0, 1 - pos_score + neg_score)
What I'm currently doing is this:
The network takes three inputs: input1, and then one positive example input2_pos and one negative example input2_neg. (These are indices to a word embeddings layer.) The network is supposed to calculate a score that expresses how related two examples are.
Here's a simplified version of my code:
input1 = tf.placeholder(dtype=tf.int32, shape=[batch_size])
input2_pos = tf.placeholder(dtype=tf.int32, shape=[batch_size])
input2_neg = tf.placeholder(dtype=tf.int32, shape=[batch_size])
# f is a neural network outputting a score
pos_score = f(input1,input2_pos)
neg_score = f(input1,input2_neg)
cost = tf.maximum(0., 1. - pos_score + neg_score)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
What I see when I run this is that the network just learns which input holds the positive example; it always predicts similar scores along the lines of:
pos_score = 0.9965983
neg_score = 0.00341663
How can I structure the variables/training so that the network learns the task instead?
I want just one network that takes two inputs and calculates a score expressing the correlation between them, and to train it with a max-margin loss.
Calculating scores for the positive and negative examples separately does not seem like an option to me, since then it won't backpropagate properly. Another option seems to be randomizing the inputs, but then the loss function needs to know which example is the positive one, and passing that in as another parameter would give away the solution again.
Any ideas?
Given your results (1 for every positive, 0 for every negative) it seems you have two different networks learning:
to predict 1 for the first one
to predict 0 for the second one
When using max-margin loss, you need to use the same network for computing both pos_score and neg_score. The way to do that is to share the variables. I will give you a small example using tf.get_variable():
with tf.variable_scope("network"):
    w = tf.get_variable("weights", shape=..., initializer=...)

def f(x, y):
    with tf.variable_scope("network", reuse=True):
        w = tf.get_variable("weights")
        res = w * (x - y)  # some computation
    return res
With this function f as model, the training will optimize the shared variable with name "network/weights".
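To connect this back to the original setup, here is a minimal sketch of how the shared f could be used for both scores (it reuses input1, input2_pos, input2_neg and learning_rate from the question; the tf.reduce_mean is my addition to average the hinge loss over the batch):
pos_score = f(input1, input2_pos)  # both calls reuse the same "network/weights"
neg_score = f(input1, input2_neg)
cost = tf.reduce_mean(tf.maximum(0., 1. - pos_score + neg_score))  # max-margin (hinge) loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)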