Is a multinomial logistic regression the appropriate "test" for this situation? - statistical-test

I have two columns in my dataset. y is the dependent variable and is categorical with three levels (unordered levels A, B and C) and x is the numeric independent variable. The example below illustrates the situation, but my actual dataset is larger, with over 1000 rows.
+------+---+
| x | y |
+------+---+
| 5.93 | A |
| 4.46 | A |
| 4.63 | A |
| 5.07 | A |
| 5.71 | A |
| 6.81 | B |
| 6.45 | B |
| 6.07 | B |
| 7.26 | C |
| 8.24 | C |
| 6.25 | C |
| 7.34 | C |
| 7.17 | C |
+------+---+
My null hypothesis is that the proportions of A, B and C in column y are independent of the x values. That is, the proportions of A, B and C associated with any given x value are independent of x. The alternative hypothesis is that these proportions are dependent on x.
I am looking for a statistical test for this.
I am wondering if performing a multinomial logistic regression and assessing the significance of the coefficients is a reasonable way to go, or if there is a better test.

If you need a formal hypothesis test, multinomial regression is most likely the way to go. The other option is to discretize your continuous variable and then test for an association between the bins and your categories.
You can check this post's accepted answer for how to test each coefficient separately. The downside of that approach is that you need to set one level of y as the reference.
Since your null is that "proportions of A, B and C in column y are independent of the x values", you can test your model against a null model. Normally a likelihood ratio test is used. Below is how to do it in R:
df = structure(list(x = c(5.93, 4.46, 4.63, 5.07, 5.71, 6.81, 6.45,
6.07, 7.26, 8.24, 6.25, 7.34, 7.17), y = structure(c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor")), class = "data.frame", row.names = 0:12)
library(car)
library(nnet)
fit = multinom(y ~ x,data=df)
Anova(fit)
# weights: 6 (2 variable)
initial value 14.281960
final value 13.954126
converged
Analysis of Deviance Table (Type II tests)
Response: y
LR Chisq Df Pr(>Chisq)
x 18.717 2 8.624e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
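As a rough cross-check of the table above, the p-value can be recomputed by hand from the LR statistic. The sketch below is in Python rather than R and assumes the statistic is chi-square distributed with 2 degrees of freedom (the Df column); for exactly 2 degrees of freedom the chi-square upper-tail probability reduces to exp(-x/2). The function name chi2_sf_df2 is illustrative only and is not part of car or nnet.

```python
import math

def chi2_sf_df2(stat):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    if stat < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    # For df = 2 the survival function has the closed form exp(-x / 2).
    return math.exp(-stat / 2.0)

lr_stat = 18.717  # "LR Chisq" value copied from the Anova output
p_value = chi2_sf_df2(lr_stat)
print(f"p-value = {p_value:.3e}")
```

The result should land very close to the Pr(>Chisq) value in the table; a small difference is expected because the statistic is copied with only three decimals.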


PySpark - Creating a single column from multiple columns with some basic math

Consider the following PySpark dataframe
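+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| A, B | D, G | A, G |
| C, F | C, D | A, G |
| C, F | C, D | A, G |
+------+------+------+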
I'd like to create a new dataframe with 2 columns, the first with all the different combinations, and the second column is the ratio: Frequency of Combination / Total Number of Combinations. For example,
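+-------------+-------------+
| Combination | Ratio       |
+-------------+-------------+
| A, B        | 0.111 (1/9) |
| C, F        | 0.222 (2/9) |
| D, G        | 0.111 (1/9) |
| C, D        | 0.222 (2/9) |
| A, G        | 0.333 (3/9) |
+-------------+-------------+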
You can unpivot, then group by and count:
from pyspark.sql import functions as F, Window

df2 = df.selectExpr(
    'stack(' + str(len(df.columns)) + ', ' + ', '.join(df.columns) + ') as combination'
).groupBy('combination').count().withColumn(
    'ratio',
    F.col('count') / F.sum('count').over(Window.orderBy())
).drop('count')

df2.show()
+-----------+------------------+
|combination| ratio|
+-----------+------------------+
| A, B|0.1111111111111111|
| C, F|0.2222222222222222|
| C, D|0.2222222222222222|
| D, G|0.1111111111111111|
| A, G|0.3333333333333333|
+-----------+------------------+
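For readers without a Spark session at hand, the same unpivot, count and divide logic can be sketched in plain Python. This is only an illustration of the idea on the question's nine cells; the names rows and combination_ratios are made up for the example, and it is not a replacement for the PySpark code above.

```python
from collections import Counter

# The three rows of the example dataframe (Col1, Col2, Col3).
rows = [
    ("A, B", "D, G", "A, G"),
    ("C, F", "C, D", "A, G"),
    ("C, F", "C, D", "A, G"),
]

def combination_ratios(rows):
    """Flatten all cells into one list, count each value, divide by the total."""
    cells = [cell for row in rows for cell in row]  # the "unpivot" step
    counts = Counter(cells)                         # the "group by and count" step
    total = len(cells)
    return {combo: n / total for combo, n in counts.items()}

ratios = combination_ratios(rows)
for combo, ratio in sorted(ratios.items()):
    print(combo, round(ratio, 3))
```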

Optimization under constraints

I have a question regarding optimization.
I have a matrix x with 3 columns and a certain number of rows (max 200). Each row represents a candidate. Column 1 contains a score (between 0 and 1), column 2 contains the kind of candidate (there are 10 kinds in total, labeled from 1 to 10) and column 3 contains the amount of each candidate. One thing to take into consideration: the amount can be NEGATIVE.
What I would like to do is select at most 35 of these candidates so as to maximize the sum of their scores (column 1), under the constraint that each kind accounts for at most 10%, computed in the following way: percentage of kind 1 = sum of the amounts of kind 1 divided by the sum of all amounts.
At the end, I would like to have a set of at most 35 candidates that satisfies the constraints and maximizes the sum of their scores.
Here is the code I have come up with so far, but I am struggling with the 10% constraint, as it does not seem to be taken into account:
rng('default');
clc;
clear;
n = 100;
maxSize = 35;
%%%TOP BASKET
nbCandidates = 100;
score = rand(100,1)/10+0.9;
quantity = rand(100,1)*100000;
type = ceil(rand(100,1)*10)
typeMask = zeros(n,10);
for i=1:10
typeMask(:,i) = type(:,1) == i;
end
fTop = -score;
intconTop = [1:1:n];
%Write the linear INEQUALITY constraints:
A = [ones(1,n);bsxfun(@times,typeMask,quantity)'/sum(type.*quantity)];
b = [maxSize;0.1*ones(10,1)];
%Write the linear EQUALITY constraints:
Aeq = [];
beq = [];
%Write the BOUND constraints:
lb = zeros(n,1);
ub = ones(n,1); % Enforces i1,i2,...in binary
x = intlinprog(fTop,intconTop,A,b,Aeq,beq,lb,ub);
I would be grateful for some advice on where I am going wrong!
A linear program for your model might look something like this:
n is the number of candidates.
S[x] is candidate x's score.
A[i][x] is the amount of candidate x for kind i (A[i][x] can be positive or negative, like you said).
T[i] is the total amount of all candidates for kind i.
I[x] is 1 if element x is to be included, and 0 if element x is to be excluded.
The function f which you want to optimize is a function of S[x] and I[x]. You can think of S and I as n-dimensional vectors, so the function you want to optimize is their dot-product.
f() = DotProduct(I, S)
This is equivalent to the linear function I1 * S1 + I2 * S2 + ... + In * Sn.
We can formulate all of the constraints in this way to get a set of linear functions whose coefficients are the components of an n-dimensional vector that we can dot with I, the parameters to optimize.
For the constraint that we can only take 35 elements at most, let C1() be a function which computes the total number of elements.
Then the first constraint can be formalized as C1() <= 35 and C1() is a linear function which can be computed thusly:
Let j be an n dimensional vector with each component equal to 1: j = <1,1,...,1>.
C1() = DotProduct(I, j)
So C1() <= 35 is a linear inequality equivalent to:
I1 * 1 + I2 * 1 + ... + In * 1 <= 35
I1 + I2 + ... + In <= 35
We need to add a slack variable x1 here to turn this into an equality relation:
I1 + I2 + ... + In + x1 = 35
For the constraint that we can only take 10% of each kind, we will have a function C2[i]() for each kind i (you said there are 10 in all). C2[i]() computes the amount taken for kind i, given the candidates we have selected:
C2[1]() <= .1 * T[1]
C2[2]() <= .1 * T[2]
...
C2[10]() <= .1 * T[10]
We compute C2[i]() like this:
Let k be an n-dimensional vector equal to <A[i][1], A[i][2], ..., A[i][n]>; each component is the amount of the corresponding candidate for kind i.
Then DotProduct(I, k) = I1 * A[i][1] + I2 * A[i][2] + ... + In * A[i][n] is the total amount we are taking of kind i given I, the vector which captures which elements we are including.
So C2[i]() = DotProduct(I, k)
Now that we know how to compute C2[i](), we need to add a slack variable to turn this into an equality relation:
C2[i]() + x[i + 1] = .1 * T[i]
Here x's subscript is [i + 1] because x1 is already used as a slack variable for the previous constraint.
In summary, the linear program would look like this (adding 11 slack variables x1, x2, ..., x11 for each constraint that is an inequality):
Let:
V = <I1, I2, ..., In, x1, x2, ..., x11> (variables)

    |S1|
    |S2|
    |. |
    |. |
    |. |
P = |Sn| (parameters of objective function)
    |0 |
    |0 |
    |. |
    |. |
    |. |
    |0 |

    |35    |
    |.1*T1 |
C = |.1*T2 | (right-hand sides of constraining equality relations)
    |...   |
    |.1*T10|

     |1    |1    |...|1    |1|0|...|0|
     |A1,1 |A1,2 |...|A1,n |0|1|...|0|
CP = |A2,1 |A2,2 |...|A2,n |0|0|...|0| (parameters of constraint functions)
     |...  |...  |...|...  |0|0|...|0|
     |A10,1|A10,2|...|A10,n|0|0|...|1|
Maximize:
V x P
Subject to:
CP x Transpose(V) = C
Hopefully this is clear, sorry for terrible formatting.
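The formulation above can be tried out on a tiny instance. The sketch below is in Python, not MATLAB, and simply enumerates every 0/1 inclusion vector I instead of calling a solver, so it only works for a handful of candidates. The data, MAX_SIZE and CAP values are made up for illustration (the question uses 35 and 0.1), and T[i] follows this answer's reading of the constraint (total amount of kind i over all candidates), which may not be the reading the question intends.

```python
from itertools import product

# Hypothetical toy data, one (score, kind, amount) tuple per candidate.
candidates = [
    (0.95, 1, 10.0),
    (0.90, 1, 40.0),
    (0.99, 2, 30.0),
    (0.92, 2, 5.0),
    (0.97, 1, 50.0),
]
MAX_SIZE = 2  # the question uses 35
CAP = 0.5     # the question uses 0.1; 0.5 keeps this tiny instance non-trivial

kinds = sorted({kind for _, kind, _ in candidates})
# T[i]: total amount of kind i over all candidates.
T = {i: sum(amt for _, kind, amt in candidates if kind == i) for i in kinds}

def objective(I):
    """f() = DotProduct(I, S)."""
    return sum(inc * score for inc, (score, _, _) in zip(I, candidates))

def feasible(I):
    """Check C1() <= MAX_SIZE and C2[i]() <= CAP * T[i] for every kind i."""
    if sum(I) > MAX_SIZE:
        return False
    for i in kinds:
        taken = sum(inc * amt for inc, (_, kind, amt) in zip(I, candidates) if kind == i)
        if taken > CAP * T[i]:
            return False
    return True

# Enumerate every 0/1 inclusion vector; only viable for very small n.
best = max(
    (I for I in product((0, 1), repeat=len(candidates)) if feasible(I)),
    key=objective,
)
print(best, objective(best))
```

For realistic sizes (up to 200 candidates) the same objective and constraint rows would be handed to an integer programming solver such as intlinprog rather than enumerated.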
I believe the MIP model can look like:
Here i are the data points and j indicates the type. For simplicity I assumed here every type has the same number of data points (i.e. Amount(i,j), Score(i,j) are matrices). It is easy to handle the more irregular case by restricting the summations.
The 10% rule is simply applied on the sum of the amounts. I hope that is the correct interpretation. Not sure if this is true if we have negative sums.

Maple: specify variable over which to maximize

This is a very simple question, but I found surprisingly little about it online...
I want to find the minimizer of a function in Maple, but I am not sure how to indicate which variable is the one of interest. Let us take a very simple case: I want the symbolic minimizer of a quadratic expression in x, with parameters a, b and c.
If I do not specify anything, it minimizes over all variables: a, b, c and x.
f4 := a+b*x+c*x^2
minimize(f4, location)
I tried to specify the variable by defining a function of x, but that did not work either:
f5 :=(x) ->a+b*x+c*x^2
minimize(f5, location)
How should I do this? And how would I do it if I wanted to minimize over two variables, x and y?
fxy := a+b*x+c*x^2 + d*y^2 +e*y
f4 := a+b*x+c*x^2:
extrema(f4, {}, x);
        {(4*a*c - b^2)/(4*c)}
fxy := a+b*x+c*x^2 + d*y^2 +e*y:
extrema(fxy, {}, {x,y});
        {(4*a*c*d - b^2*d - c*e^2)/(4*c*d)}
The nature of the extrema will depend on the values of the parameters. For your first example above (quadratic in x) it will depend on the sign of c.
The command extrema accepts an optional fourth argument, such as an unassigned name (or an uneval-quoted name), to which it assigns the candidate solution points (as a side effect of its calculation). E.g.,
restart;
f4 := a+b*x+c*x^2:
extrema(f4, {}, x, 'cand');
        {(4*a*c - b^2)/(4*c)}
cand;
        {{x = -b/(2*c)}}
fxy := a+b*x+c*x^2 + d*y^2 +e*y:
extrema(fxy, {}, {x,y}, 'cand');
        {(4*a*c*d - b^2*d - c*e^2)/(4*c*d)}
cand;
        {{x = -b/(2*c), y = -e/(2*d)}}
Alternatively, you may set up the partial derivatives and solve them manually. Note that for these two examples solve returns just one result each.
restart:
f4 := a+b*x+c*x^2:
solve({diff(f4,x)},{x});
        {x = -b/(2*c)}
normal(eval(f4,%));
        (4*a*c - b^2)/(4*c)
fxy := a+b*x+c*x^2 + d*y^2 +e*y:
solve({diff(fxy,x),diff(fxy,y)},{x,y});
        {x = -b/(2*c), y = -e/(2*d)}
normal(eval(fxy,%));
        (4*a*c*d - b^2*d - c*e^2)/(4*c*d)
The code for the extrema command can be viewed, by issuing the command showstat(extrema). You can see how it accounts for the case of solve returning multiple results.
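The closed forms above can be sanity-checked numerically. The sketch below uses Python with exact rational arithmetic rather than Maple; it assumes integer parameters with c and d nonzero, and the function name and the sample values a=1, b=2, c=3, d=2, e=4 are chosen only for illustration.

```python
from fractions import Fraction

def quadratic_stationary_point(a, b, c):
    """Return (x*, f(x*)) for f(x) = a + b*x + c*x^2 with integer a, b, c and c != 0."""
    if c == 0:
        raise ValueError("c must be nonzero for a unique stationary point")
    x_star = Fraction(-b, 2 * c)               # root of f'(x) = b + 2*c*x
    value = a + b * x_star + c * x_star ** 2   # f evaluated at the stationary point
    return x_star, value

a, b, c, d, e = 1, 2, 3, 2, 4

# One variable: compare with (4*a*c - b^2)/(4*c).
x_star, f4_value = quadratic_stationary_point(a, b, c)
print(x_star, f4_value)

# Two variables: fxy is separable, so the y part can be handled the same way.
y_star, y_part = quadratic_stationary_point(0, e, d)
fxy_value = f4_value + y_part
print(y_star, fxy_value)
```

Whether the stationary point is a minimum still depends on the signs of c and d, as noted above; with the positive sample values used here it is one.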

Compute the change of basis matrix in Matlab

I have an assignment where I need to create a function which, given two bases (each of which I'm representing as a matrix of vectors), returns the change of basis matrix from one basis to the other.
So far this is the function I came up with, based on the algorithm that I will explain next:
function C = cob(A, B)
    % Returns C, which is the change of basis matrix from A to B,
    % that is, given basis A and B, we represent B in terms of A.
    % Assumes that A and B are square matrices
    n = size(A, 1);
    % Creates a square matrix full of zeros
    % of the same size as the number of rows of A.
    C = zeros(n);
    for i=1:n
        C(i, :) = (A\B(:, i))';
    end
end
And here are my tests:
clc
clear out
S = eye(3);
B = [1 0 0; 0 1 0; 2 1 1];
D = B;
disp(cob(S, B)); % Returns cob matrix from S to B.
disp(cob(B, D));
disp(cob(S, D));
Here's the algorithm that I used, based on some notes. Basically, if I have two bases B = {b1, ... , bn} and D = {d1, ... , dn} for a certain vector space, and I want to represent basis D in terms of basis B, I need to find a change of basis matrix S. The vectors of these bases are related in the following form:
(d1 ... dn)^T = S * (b1, ... , bn)^T
Or, by splitting up all the rows:
d1 = s11 * b1 + s12 * b2 + ... + s1n * bn
d2 = s21 * b1 + s22 * b2 + ... + s2n * bn
...
dn = sn1 * b1 + sn2 * b2 + ... + snn * bn
Note that d1, b1, d2, b2, etc, are all column vectors. This can be further represented as
d1 = [b1 b2 ... bn] * [s11; s12; ... s1n];
d2 = [b1 b2 ... bn] * [s21; s22; ... s2n];
...
dn = [b1 b2 ... bn] * [sn1; sn2; ... snn];
Let's call A the matrix [b1 b2 ... bn], whose columns are the vectors of B, so we have:
d1 = A * [s11; s12; ... s1n];
d2 = A * [s21; s22; ... s2n];
...
dn = A * [sn1; sn2; ... snn];
Note that what we need now to find are all the entries sij for i=1...n and j=1...n. We can do that by left-multiplying both sides by the inverse of A, i.e. by A^(-1).
So, S might look something like this
S = [s11 s12 ... s1n;
s21 s22 ... s2n;
...
sn1 sn2 ... snn;]
If this idea is correct, then finding the change of basis matrix S from B to D is exactly what I'm doing in the code.
Is my idea correct? If not, what's wrong? If yes, can I improve it?
Things become much easier when one has an intuitive understanding of the algorithm.
There are two key points to understand here:
1. C(B,B) is the identity matrix (i.e., do nothing to change from B to B)
2. C(E,D)C(B,E) = C(B,D), think of this as B -> E -> D = B -> D
A direct corollary of 1 and 2 is
C(E,D)C(D,E) = C(D,D), the identity matrix
in other words
C(E,D) = C(D,E)^(-1)
Summarizing.
Algorithm to calculate the matrix C(B,D) to change from B to D (E denotes the standard basis):
1. Define C(B,E) = [b1, ..., bn] (column vectors)
2. Define C(D,E) = [d1, ..., dn] (column vectors)
3. Compute C(E,D) as the inverse of C(D,E).
4. Compute C(B,D) as the product C(E,D)C(B,E).
Example
B = {(1,2), (3,4)}
D = {(1,1), (1,-1)}
C(B,E) = | 1 3 |
         | 2 4 |
C(D,E) = | 1  1 |
         | 1 -1 |
C(E,D) = | .5  .5 |
         | .5 -.5 |
C(B,D) = | .5  .5 | | 1 3 | = | 1.5  3.5 |
         | .5 -.5 | | 2 4 |   | -.5  -.5 |
Verification
1.5 d1 + -.5 d2 = 1.5(1,1) + -.5(1,-1) = (1,2) = b1
3.5 d1 + -.5 d2 = 3.5(1,1) + -.5(1,-1) = (3,4) = b2
which shows that the columns of C(B,D) are in fact the coordinates of b1 and b2 in the basis D.
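The worked example can be reproduced with a few lines of code. The sketch below is in Python with exact fractions instead of Matlab, and it is deliberately limited to the 2x2 case by using the explicit 2x2 inverse formula; the helper names inverse_2x2 and matmul are made up for the illustration.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists, via the adjugate formula."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("the vectors do not form a basis (singular matrix)")
    return [[s / det, -q / det], [-r / det, p / det]]

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

F = Fraction
B_to_E = [[F(1), F(3)], [F(2), F(4)]]    # columns are b1 = (1,2), b2 = (3,4)
D_to_E = [[F(1), F(1)], [F(1), F(-1)]]   # columns are d1 = (1,1), d2 = (1,-1)

E_to_D = inverse_2x2(D_to_E)             # C(E,D) = C(D,E)^(-1)
B_to_D = matmul(E_to_D, B_to_E)          # C(B,D) = C(E,D) C(B,E)

for row in B_to_D:
    print([str(v) for v in row])
```

In Matlab the same product would normally be written with the backslash operator (D\B) rather than an explicit inverse.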

How to obtain jaccard similarity in matlab

I have a table:
x y z
A 2 0 3
B 0 3 0
C 0 0 4
D 1 4 0
I want to calculate the Jaccard similarity in Matlab, between the vectors A, B, C and D.
The formula is: jaccard(x, y) = |x intersect y| / (|x| + |y| - |x intersect y|)
In this formula, |x| and |y| indicate the number of items that are not zero. For example, |A| is 2, |B| and |C| are 1, and |D| is 2.
|x intersect y| indicates the number of common items that are not zero. |A intersect B| is 0. |A intersect D| is 1, because the value of x is not zero in both.
e.g.: jaccard(A,D)= 1/3=0.33
How can I implement this in Matlab?
Matlab's Statistics Toolbox has a function that computes the Jaccard distance: pdist.
Here is some code
X = rand(2,100);
X(X>0.5) = 1;
X(X<=0.5) = 0;
JD = pdist(X,'jaccard') % jaccard distance
JI = 1 - JD; % jaccard index
EDIT
A calculation that does not require the Statistics Toolbox
a = X(1,:);
b = X(2,:);
JD = 1 - sum(a & b)/sum(a | b)
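To tie this back to the table in the question, here is a minimal sketch in Python (not Matlab) of the same nonzero-based calculation applied to rows A to D. The function name jaccard_similarity is illustrative, and returning 0.0 when both vectors are entirely zero is an assumption made here, since the question does not cover that case.

```python
def jaccard_similarity(u, v):
    """Nonzero-based Jaccard similarity of two equal-length numeric vectors."""
    both = sum(1 for a, b in zip(u, v) if a != 0 and b != 0)    # |u intersect v|
    either = sum(1 for a, b in zip(u, v) if a != 0 or b != 0)   # |u union v|
    # Assumption: two all-zero vectors are given similarity 0.0.
    return both / either if either else 0.0

# Rows of the question's table, columns x, y, z.
vectors = {
    "A": [2, 0, 3],
    "B": [0, 3, 0],
    "C": [0, 0, 4],
    "D": [1, 4, 0],
}

names = sorted(vectors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        sim = jaccard_similarity(vectors[first], vectors[second])
        print(first, second, round(sim, 2))
```

The pair (A, D) should give 1/3, matching the value worked out in the question.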