ANCOVA in Python with Scipy/Numpy stats

I would like to know a way of performing ANCOVA (analysis of covariance) using Python with scipy. It is basically a statistical comparison of regression lines. I know Python can do ANOVA and it can also do regression line fitting with scipy.stats. I'm not sure how to put those together to get an effective ANCOVA, though, if it is possible.

ANCOVA can be done with regression by using dummy variables in the design matrix for the effects that depend on the categorical variable.
A simple example is at
http://groups.google.com/group/pystatsmodels/browse_thread/thread/aaa31b08f3df1a69?hl=en
using the OLS class from scikits.statsmodels
Relevant part of the construction of the design matrix, where xg contains the group numbers/labels and x1 is the continuous explanatory variable:
>>> dummy = (xg[:,None] == np.unique(xg)).astype(float)
>>> X = np.c_[x1, dummy[:,1:], np.ones(nsample)]
Estimate the model
>>> res2 = sm.OLS(y, X).fit()
>>> print res2.params
[ 1.00901524 3.08466166 -2.84716135 9.94655423]
>>> print res2.bse
[ 0.07499873 0.71217506 1.16037215 0.38826843]
>>> prstd, iv_l, iv_u = wls_prediction_std(res2)
"Test hypothesis that all groups have same intercept"
>>> R = [[0, 1, 0, 0],
... [0, 0, 1, 0]]
>>> print res2.f_test(R)
<F test: F=array([[ 91.69986847]]), p=[[ 8.90826383e-17]],
df_denom=46, df_num=2>
The hypothesis is strongly rejected because the differences in intercept are very large.
Update (two and a half years later):
scikits.statsmodels has been renamed to statsmodels
As for the question: with the latest release of statsmodels, it is more convenient to use formulas for specifying categorical effects and interaction effects. statsmodels uses patsy to handle the formulas and create the design matrices.
More information is available at the links to the statsmodels documentation in https://stackoverflow.com/a/19495920/333700
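For reference, a minimal formula-based sketch (the DataFrame and the column names y, x1 and group are illustrative placeholders, not part of the original example):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Toy data: continuous response y, continuous covariate x1, categorical group.
df = pd.DataFrame({'y': np.random.randn(30),
                   'x1': np.random.randn(30),
                   'group': np.repeat(['a', 'b', 'c'], 10)})

# ANCOVA design: common slope for x1, group-specific intercepts via C(group).
res = smf.ols('y ~ x1 + C(group)', data=df).fit()
print(res.summary())

# ANCOVA table: F-test for the group effect adjusted for the covariate x1.
print(anova_lm(res, typ=2))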

How embedding_bag exactly works in PyTorch

In PyTorch, torch.nn.functional.embedding_bag seems to be the main function responsible for doing the real job of embedding lookup. PyTorch's documentation mentions that embedding_bag does its job "without instantiating the intermediate embeddings". What does that mean exactly? Does it mean, for example, that when the mode is "sum" it does in-place summation? Or does it just mean that no additional Tensors will be produced when calling embedding_bag, but that from the system's point of view all the intermediate row-vectors are still fetched into the processor to be used for calculating the final Tensor?
In the simplest case, torch.nn.functional.embedding_bag is conceptually a two step process. The first step is to create an embedding and the second step is to reduce (sum/mean/max, according to the "mode" argument) the embedding output across dimension 0. So you can get the same result that embedding_bag gives by calling torch.nn.functional.embedding, followed by torch.sum/mean/max. In the following example, embedding_bag_res and embedding_mean_res are equal.
>>> weight = torch.randn(3, 4)
>>> weight
tensor([[ 0.3987, 1.6173, 0.4912, 1.5001],
        [ 0.2418, 1.5810, -1.3191, 0.0081],
        [ 0.0931, 0.4102, 0.3003, 0.2288]])
>>> indices = torch.tensor([2, 1])
>>> embedding_res = torch.nn.functional.embedding(indices, weight)
>>> embedding_res
tensor([[ 0.0931, 0.4102, 0.3003, 0.2288],
        [ 0.2418, 1.5810, -1.3191, 0.0081]])
>>> embedding_mean_res = embedding_res.mean(dim=0, keepdim=True)
>>> embedding_mean_res
tensor([[ 0.1674, 0.9956, -0.5094, 0.1185]])
>>> embedding_bag_res = torch.nn.functional.embedding_bag(indices, weight, torch.tensor([0]), mode='mean')
>>> embedding_bag_res
tensor([[ 0.1674, 0.9956, -0.5094, 0.1185]])
However, the conceptual two step process does not reflect how it's actually implemented. Since embedding_bag does not need to return the intermediate result, it doesn't actually generate a Tensor object for the embedding. It just goes straight to computing the reduction, pulling in the appropriate data from the weight argument according to the indices in the input argument. Avoiding the creation of the embedding Tensor allows for better performance.
So the answer to your question (if I understand it correctly)
it just means that no additional Tensors will be produced when calling embedding_bag but still from the system's point of view all the intermediate row-vectors are already fetched into the processor to be used for calculating the final Tensor?
is yes.
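For multiple bags, the third argument (offsets) marks where each bag starts in the flat indices tensor; a small sketch reusing the weight above (the extra indices are illustrative):
>>> indices = torch.tensor([2, 1, 0, 2])
>>> offsets = torch.tensor([0, 2])  # bag 0 is indices[0:2], bag 1 is indices[2:4]
>>> res = torch.nn.functional.embedding_bag(indices, weight, offsets, mode='mean')
>>> res.shape  # one reduced row per bag; row 0 equals embedding_bag_res above
torch.Size([2, 4])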

Calculating transpose of a tensor in Paraview

I am required to calculate the following in ParaView: the antisymmetric part of the velocity gradient tensor, Omega = (grad(u) - transpose(grad(u))) / 2.
How can I calculate the transpose used in the above formula? Basically, I would like to know how to calculate the transpose of a matrix in ParaView.
As suggested by @Nico Vuaille, you should make use of the NumPy support in ParaView. Simply apply a Programmable Filter to the dataset of interest and supply a script comparable to the following.
import numpy as np
u = inputs[0].PointData['Velocity']
# Calculate gradient here, say uGrad
output.PointData.append(uGrad, 'Gradient')
EDIT: I have actually tried to generate your calculation with one of my datasets and realised that my answer and comments are not so helpful. Therefore, this is what I would suggest now, which should work:
Load your dataset in ParaView
Apply a Gradient / Gradient Of Unstructured Dataset filter on your dataset and select the velocity field as the input field (I used Gradient Of Unstructured Dataset, from which you have the possibility to also directly work out both divergence and vorticity fields).
Apply a Programmable Filter filter to the resulting dataset you obtained from the previous step and supply the code below.
Script
import numpy as np
grad = inputs[0].PointData['Gradients']
omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2
output.PointData.append(omega, 'Omega')
You should end up with another item in your ParaView pipeline that only contains the expected Omega.
EDIT 2: The input file uses the XDMF format. When loaded into ParaView, it is interpreted as a Multi-Block Dataset. As a result, the code snippet provided to the Script argument of the Programmable Filter has to be updated to:
import numpy as np
import paraview.vtk.numpy_interface.dataset_adapter as dsa

for i in range(inputs[0].GetNumberOfBlocks()):
    data = dsa.WrapDataObject(inputs[0].GetBlock(i))
    grad = data.PointData['Gradients']
    omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2
    data.PointData.append(omega, 'Omega')
    output.SetBlock(i, data.VTKObject)
I think this can be easily computed using the Python Calculator filter (no need for a Programmable Filter):
To compute the gradient, type:
gradient(u)
To compute the symmetric part of the tensor gradient(u):
strain(u)
To compute the non-symmetric part, Omega, of the gradient tensor:
gradient(u) - strain(u)
Note that the gradient(u) tensor can be written as the sum of its symmetric part, strain(u), and its antisymmetric part, Omega.
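A small standalone NumPy sketch of that decomposition (the array shapes are illustrative):
import numpy as np

grad = np.random.rand(10, 3, 3)                           # per-point 3x3 gradient tensors
strain = (grad + np.transpose(grad, axes=(0, 2, 1))) / 2  # symmetric part
omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2   # antisymmetric part
assert np.allclose(grad, strain + omega)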

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised on [0] is that I don't need to specify the number of components, which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> import numpy
>>> data = numpy.random.normal(loc = 0.0, scale = 1.0, size = 1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1,1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
      n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
      tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
       [ 0.06259168],
       [ 0.00390097],
       [ 0.02934676],
       [-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you'll notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant components, which can be done with the predict function.
Unfortunately, this important piece is hidden in a comment in the code example.
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
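A small sketch of that trick applied to the fitted model d from the question (the old DPGMM API shown here has since been removed from scikit-learn):
import numpy
used = numpy.unique(d.predict(data.reshape(-1, 1)))  # components that actually receive points
print(len(used))                                     # effective number of components
print(d.means_[used])                                # means of the used components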
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture. With this model, the suggested prior number of components must be given, but once trained, the model assigns weightings to each component, which essentially indicate their relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate to the number of components for your given problem.
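A minimal sketch with the current scikit-learn API (the parameter values are illustrative):
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

data = np.random.normal(loc=0.0, scale=1.0, size=1000).reshape(-1, 1)
bgmm = BayesianGaussianMixture(n_components=5,
                               weight_concentration_prior_type='dirichlet_process')
bgmm.fit(data)
print(bgmm.weights_)  # components with near-zero weight are effectively unused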

Using SparseTensor as a trainable variable?

I'm trying to use SparseTensor to represent weight variables in a fully-connected layer.
However, it seems that TensorFlow 0.8 doesn't allow using a SparseTensor as a tf.Variable.
Is there any way to go around this?
I've tried
import tensorflow as tf
a = tf.constant(1)
b = tf.SparseTensor([[0,0]],[1],[1,1])
print a.__class__ # shows <class 'tensorflow.python.framework.ops.Tensor'>
print b.__class__ # shows <class 'tensorflow.python.framework.ops.SparseTensor'>
tf.Variable(a) # Variable is declared correctly
tf.Variable(b) # Fail
By the way, my ultimate goal in using SparseTensor is to permanently mask some of the connections in dense form, so that these pruned connections are ignored while calculating and applying gradients.
In my current implementation of an MLP, SparseTensor and its sparse form of the matmul op successfully produce the inference outputs. However, the weights declared using SparseTensor aren't updated as the training steps go.
As a workaround for your problem, you can provide a tf.Variable for the values of a sparse tensor (directly until TensorFlow v0.8, via tf.identity from v0.9, see below). The sparsity structure has to be pre-defined in that case; the weights, however, remain trainable.
weights = tf.Variable(<initial-value>)
sparse_var = tf.SparseTensor(<indices>, weights, <shape>) # v0.8
sparse_var = tf.SparseTensor(<indices>, tf.identity(weights), <shape>) # v0.9
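A minimal sketch of wiring that into a layer (TF 1.x-era graph API; the indices, shapes and names below are illustrative assumptions, not from the original answer):
import tensorflow as tf

indices = [[0, 0], [1, 2], [2, 1]]                       # fixed sparsity pattern
weights = tf.Variable(tf.random_normal([len(indices)]))  # trainable values only
sparse_w = tf.SparseTensor(indices, tf.identity(weights), [3, 3])

x = tf.placeholder(tf.float32, shape=[3, 4])
y = tf.sparse_tensor_dense_matmul(sparse_w, x)           # sparse weights times dense input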
TensorFlow doesn't currently support sparse tensor variables. However, it does support sparse lookups (tf.embedding_lookup) and sparse gradient updates (tf.sparse_add) of dense variables. I suspect these two will suffice for your use case.
TensorFlow doesn't support training on sparse tensors yet. You can initialize a sparse tensor as you wish, then convert it into a dense tensor and create a variable from it like that:
# You need to correctly initialize the sparse tensor with indices, values and a shape
b = tf.SparseTensor(indices, values, shape)
b_dense = tf.sparse_tensor_to_dense(b)
b_variable = tf.Variable(b_dense)
Now you have initialized a sparse tensor as a variable. You then need to take care of the gradient update (in other words, make sure the entries in the variable stay 0, since backpropagation would otherwise compute a non-vanishing gradient for them when this is used naively).
In order to do this, TensorFlow optimizers have a method called tf.train.Optimizer.compute_gradients(loss, [list_of_variables]). This calculates all the gradients in the graph necessary to minimize the loss function, but doesn't apply them yet. The method returns a list of tuples of the form (gradient, variable). You can modify these gradients freely, but in your case it makes sense to mask the gradients that are not needed to 0 (i.e. by creating another sparse tensor with default values 0.0 and values 1.0 where the weights in your network are present).
After having modified them, you call the optimizer method tf.train.Optimizer.apply_gradients(grads_and_vars) to actually apply the gradients. An example code would look like this:
# Create optimizer instance
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
# Get the gradients for your weights
grads_and_vars = optimizer.compute_gradients(loss, [b_variable])
# Modify the gradients at will
# In your case it would look similar to this
modified_grads_and_vars = [(tf.multiply(gv[0], mask_tensor), gv[1]) for gv in grads_and_vars]
# Apply modified gradients to your model
optimizer.apply_gradients(modified_grads_and_vars)
This makes sure your entries stay 0 in your weight matrix and no unwanted connections are created. You need to take care of all the other gradients for all other variables later.
The above code works with a minor correction, like this:
def optimize(loss, mask_tensor):
    optimizer = tf.train.AdamOptimizer(0.001)
    grads_and_vars = optimizer.compute_gradients(loss)
    modified_grads_and_vars = [
        (tf.multiply(gv[0], mask_tensor[gv[1]]), gv[1]) for gv in grads_and_vars
    ]
    return optimizer.apply_gradients(modified_grads_and_vars)
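For context, a hedged sketch of how the mask_tensor dictionary assumed above might be built: one binary mask per trainable variable, keyed by the variable object (the name w and the mask values are illustrative):
import tensorflow as tf

w = tf.Variable(tf.random_normal([3, 3]), name='w')
# 1.0 keeps a connection, 0.0 prunes it; the element-wise multiply in
# optimize() zeroes out the gradient of every pruned entry.
mask_tensor = {w: tf.constant([[1.0, 0.0, 1.0],
                               [0.0, 1.0, 1.0],
                               [1.0, 1.0, 0.0]])}
Note that the initial value of w should also be masked, so that the pruned entries start (and then stay) at zero.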

What are some packages that implement semi-supervised (constrained) clustering?

I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance level pairwise constraints (Must-Link or Cannot-Link constraints). I would like to know if there are any good open-source packages that implement semi-supervised clustering? I tried to look at PyBrain, mlpy, scikit and orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained K-Means or constrained density based clustering algorithms (like C-DBSCAN).
Packages in Matlab, Python, Java or C++ would be preferred, but need not be limited to these languages.
The Python package scikit-learn now has algorithms for Ward hierarchical clustering (since 0.15) and agglomerative clustering (since 0.14) that support connectivity constraints.
Besides, I have a real-world application, namely the identification of tracks from cell positions, where each track can only contain one position from each time point.
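A minimal sketch of the connectivity-constrained agglomerative clustering mentioned above (the k-nearest-neighbours connectivity graph is just one possible choice of constraint; data and parameters are illustrative):
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(100, 2)
connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)
model = AgglomerativeClustering(n_clusters=3, connectivity=connectivity, linkage='ward')
labels = model.fit_predict(X)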
The R package conclust implements a number of algorithms:
There are 4 main functions in this package: ckmeans(), lcvqe(), mpckm() and ccls(). They take an unlabeled dataset and two lists of must-link and cannot-link constraints as input and produce a clustering as output.
There's also an implementation of COP-KMeans in python.
Maybe it's a bit late, but have a look at the following.
An extension of Weka (in Java) that implements PKM, MKM and PKMKM:
http://www.cs.ucdavis.edu/~davidson/constrained-clustering/
Gaussian mixture model using EM and constraints in Matlab
http://www.scharp.org/thertz/code.html
I hope that this helps.
Full disclosure. I am the author of k-means-constrained.
Here is a Python implementation of K-Means clustering where you can specify the minimum and maximum cluster sizes. It uses the same API as scikit-learn and so is fairly easy to use. It is also based on a fast C++ package and so has good performance.
You can pip install it:
pip install k-means-constrained
Example use:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
The GitHub package semisupervised has a usage similar to the sklearn API.
pip install semisupervised
Step 1. The unlabeled samples should be labeled as -1.
Step 2. model.fit(X, y)
Step 3. model.predict(X_test)
Examples:
import numpy as np
from sklearn import metrics
from semisupervised.TSVM import S3VM

model = S3VM()
model.fit(np.vstack((label_X_train, unlabel_X_train)), np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
acc = metrics.accuracy_score(y_test, predict)
# metric
print("accuracy", acc)
Check out this Python package: active-semi-supervised-clustering.
GitHub: https://github.com/datamole-ai/active-semi-supervised-clustering
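Based on that project's README, usage with must-link/cannot-link pairs looks roughly like the sketch below; the module path and the fit(..., ml=..., cl=...) signature are assumptions and may differ between versions.
import numpy as np
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans

X = np.random.rand(20, 2)
ml = [(0, 1), (2, 3)]  # must-link pairs (indices into X)
cl = [(0, 4)]          # cannot-link pairs
clusterer = PCKMeans(n_clusters=3)
clusterer.fit(X, ml=ml, cl=cl)
print(clusterer.labels_)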