What are some packages that implement semi-supervised (constrained) clustering? - cluster-analysis

I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance-level pairwise constraints (must-link or cannot-link constraints). Are there any good open-source packages that implement semi-supervised clustering? I looked at PyBrain, mlpy, scikit and Orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained K-Means or constrained density-based clustering algorithms (like C-DBSCAN).
Packages in Matlab, Python, Java or C++ would be preferred, but need not be limited to these languages.

The Python package scikit-learn now has algorithms for Ward hierarchical clustering (since 0.15) and agglomerative clustering (since 0.14) that support connectivity constraints.
Besides, I have a real-world application, namely the identification of tracks from cell positions, where each track can only contain one position from each time point.
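To illustrate the connectivity-constraint support mentioned above, here is a minimal sketch with scikit-learn's AgglomerativeClustering (the toy data and parameter values are illustrative). Note that these are structural connectivity constraints, not must-link/cannot-link pairs:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Toy data: two well-separated blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# Connectivity constraint: only k-nearest-neighbour pairs may be merged
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(n_clusters=2, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
print(labels[:10], labels[-10:])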

The R package conclust implements a number of algorithms:
There are 4 main functions in this package: ckmeans(), lcvqe(), mpckm() and ccls(). They take an unlabeled dataset and two lists of must-link and cannot-link constraints as input and produce a clustering as output.
There's also an implementation of COP-KMeans in Python.
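For reference, a minimal sketch of the COP-KMeans idea (Wagstaff et al.): standard k-means, except that each point is assigned to the nearest centroid that does not violate any must-link or cannot-link constraint. This is an illustrative toy implementation, not the linked package:
import numpy as np

def violates(i, cluster, labels, must_link, cannot_link):
    """Check whether putting point i into `cluster` breaks a constraint."""
    for a, b in must_link:
        if i == a and labels[b] not in (-1, cluster):
            return True
        if i == b and labels[a] not in (-1, cluster):
            return True
    for a, b in cannot_link:
        if i == a and labels[b] == cluster:
            return True
        if i == b and labels[a] == cluster:
            return True
    return False

def cop_kmeans(X, k, must_link=(), cannot_link=(), n_iter=100, seed=0):
    """X: (n, d) array. Returns (labels, centers) or raises if infeasible."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        labels[:] = -1
        for i, x in enumerate(X):
            # Try clusters in order of increasing distance to the point
            for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            else:
                raise ValueError("no feasible assignment for point %d" % i)
        # Recompute centroids; keep old centroid if a cluster is empty
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels, centers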

Maybe it's a bit late, but have a look at the following.
An extension of Weka (in Java) that implements PKM, MKM and PKMKM:
http://www.cs.ucdavis.edu/~davidson/constrained-clustering/
A Gaussian mixture model using EM and constraints, in Matlab:
http://www.scharp.org/thertz/code.html
I hope that this helps.

Full disclosure: I am the author of k-means-constrained.
Here is a Python implementation of K-Means clustering where you can specify the minimum and maximum cluster sizes. It uses the same API as scikit-learn, so it is fairly easy to use. It is also based on a fast C++ package, so it has good performance.
You can pip install it:
pip install k-means-constrained
Example use:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)

The GitHub package semisupervised has a usage similar to the scikit-learn API.
pip install semisupervised
Step 1. Label the unlabeled samples as -1.
Step 2. model.fit(X, y)
Step 3. model.predict(X_test)
Example:
import numpy as np
from sklearn import metrics
from semisupervised.TSVM import S3VM

# label_X_train / label_y_train: labeled data;
# unlabel_X_train / unlabel_y: unlabeled data with all targets set to -1 (Step 1)
model = S3VM()
model.fit(np.vstack((label_X_train, unlabel_X_train)), np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
acc = metrics.accuracy_score(y_test, predict)
# metric
print("accuracy", acc)

Check out the Python package active-semi-supervised-clustering.
GitHub: https://github.com/datamole-ai/active-semi-supervised-clustering
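A rough usage sketch based on that repository's README as I recall it; the exact module path and the ml/cl keyword names are assumptions, so double-check them against the repo:
import numpy as np
from sklearn import datasets
# Module path as documented in the repo's README (assumption; verify locally)
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans

X, y = datasets.load_iris(return_X_y=True)

# Instance-level pairwise constraints, given as pairs of sample indices
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 100), (1, 120)]

clusterer = PCKMeans(n_clusters=3)
clusterer.fit(X, ml=must_link, cl=cannot_link)
print(clusterer.labels_[:10])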

Related

Unable to figure out nInputPlane in SpatialConvolution in torch?

The documentation for SpatialConvolution defines it as
module = nn.SpatialConvolution(nInputPlane, nOutputPlane, kW, kH, [dW], [dH], [padW], [padH])
nInputPlane: The number of expected input planes in the image given into forward().
nOutputPlane: The number of output planes the convolution layer will produce.
I don't have any experience with Torch, but I think I have used a similar function in Keras:
Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256))
which takes as input the shape of the image, i.e. 256x256 in RGB.
I have seen SpatialConvolution used in Torch as below, but I'm unable to figure out what the nInputPlane and nOutputPlane parameters correspond to.
local convLayer = nn.SpatialConvolutionMM(384, 384, 1, 1, 1, 1, 0, 0)
In the code above, what do these 384, 384 represent?
In case I'm not speaking the common language, you can refer to this.
nInputPlane is the number of layers (channels) coming into the convolution, and nOutputPlane is the number of layers coming out of it. If you have an RGB image, nInputPlane = 3 (assuming your tensor is set up correctly). nOutputPlane can be any number of layers you want the spatial convolution to produce, but make sure the next layer's input matches nOutputPlane.
If that isn't clear, I'd recommend the 60-minute blitz.
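As an illustration of the same in/out channel idea (my analogy, not from the original answer), PyTorch's nn.Conv2d uses in_channels/out_channels where Lua Torch uses nInputPlane/nOutputPlane; the spatial size 13x13 below is an arbitrary example:
import torch
import torch.nn as nn

# Rough equivalent of nn.SpatialConvolutionMM(384, 384, 1, 1, 1, 1, 0, 0):
# 384 input channels, 384 output channels, 1x1 kernel, stride 1, no padding
conv = nn.Conv2d(in_channels=384, out_channels=384, kernel_size=1, stride=1, padding=0)

x = torch.randn(1, 384, 13, 13)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                    # torch.Size([1, 384, 13, 13])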

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised on [0] is that I don't need to specify the number of components; which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> import numpy
>>> data = numpy.random.normal(loc=0.0, scale=1.0, size=1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1, 1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
      n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
      tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
       [ 0.06259168],
       [ 0.00390097],
       [ 0.02934676],
       [-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you will notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant components, which can be done with the predict function.
Unfortunately, this important piece is hidden in a comment in the code example:
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
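A minimal sketch of that pruning step, assuming the fitted DPGMM object d and the data array from the question above (the DPGMM class has since been removed from scikit-learn, so this only applies to old versions):
import numpy as np

# Keep only the components that predict() actually assigns points to
labels = d.predict(data.reshape(-1, 1))
used = np.unique(labels)
print("effective number of components:", len(used))
print("their means:", d.means_[used].ravel())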
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture. With this model, the suggested prior number of components must be given, but once trained, the model assigns weightings to each component, which essentially indicate their relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate to the number of components for your given problem.
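A minimal sketch with scikit-learn's BayesianGaussianMixture (the data and parameter values are illustrative); components with negligible weights_ can be treated as unused:
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.normal(loc=0.0, scale=1.0, size=(1000, 1))

bgmm = BayesianGaussianMixture(
    n_components=5,                                    # upper bound only
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

print(bgmm.weights_)                   # most of the mass should land on one component
print(np.sum(bgmm.weights_ > 0.01))    # rough count of "effective" components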

automatically determine number of clusters k-means

I want to build a cluster model in RapidMiner that can determine the number of clusters automatically and then continue with the k-means algorithm. Is there any way to determine k automatically in RapidMiner?
In k-means, the value of k is supplied by the user. The clusters that are produced can be assessed using a cluster validity measure (such as Davies-Bouldin) to give a score. By varying k, different cluster validity scores can be produced, and the optimum score (for Davies-Bouldin, a minimum) is a candidate for the most interesting value of k. Follow the link for details on how this might be done in RapidMiner Examples.
There are many caveats associated with this. The most important point is that a domain expert must be involved to check that the value of k and the clustering that is produced have meaning.
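Outside RapidMiner, the same idea is easy to sketch in Python with scikit-learn's davies_bouldin_score (the toy data and the range of k are illustrative):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)   # lower Davies-Bouldin is better
print(best_k, scores[best_k])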
One trick for determining k is to run DBSCAN on your dataset first: determine the number of clusters from DBSCAN, and then get the cluster centers using k-means.
Here is some Python code:
from sklearn.cluster import DBSCAN #python -m pip install scikit-learn
import cv2 as cv #python -m pip install opencv-python
import numpy as np #python -m pip install numpy
Z=np.array([0.0,1.0,0.25,0.11,0.12,0.27,0.99,1.1,0.05,0.06])
Z=np.unique(Z) #speed up the DBSCAN by considering only unique points
Z=Z.reshape((-1,1)).astype(np.float32)
K=int(np.max(DBSCAN(eps=0.05,min_samples=2).fit(Z).labels_))+1
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_,label,center=cv.kmeans(Z,K,None,criteria,10,cv.KMEANS_RANDOM_CENTERS)
print(f"\nK={K}")
print("\nlabel=")
print(label)
print("\ncenter=")
print(center)
Output from the code:
K=4
label=
[[0]
[0]
[0]
[3]
[3]
[1]
[1]
[2]
[2]
[2]]
center=
[[0.03666667]
[0.26 ]
[1.0300001 ]
[0.11499999]]

Plotting points in Maple

I want to plot a single point [2, -2, 0] in Maple.
I am trying to use the command:
pointplot3d([2, -2, 0], axes=normal, symbol=cross)
It does not work (maybe because pointplot3d is for a list of points). Help please.
It worked for me, provided that I either invoked it as plots:-pointplot3d(...), or plots[pointplot3d](...), or as pointplot3d(...) only after having loaded the plots package with with(plots).
The default color and size may not be to your liking. Here's a screenshot (Maple 15.01, Windows 7):
plots[pointplot3d]([[2, -2, 0]], axes=normal,
symbolsize=20, symbol=cross, color=red);
You mentioned that it is a single point. Well, in all of Maple 13.02, 14.01, 15.01, and 16.00 it also works as plots[pointplot3d]([[2, -2, 0]], ...), which is a list of lists.

ANCOVA in Python with Scipy/Numpy stats

I would like to know a way of performing ANCOVA (analysis of covariance) using Python with SciPy. It is basically a statistical comparison of regression lines. I know Python can do ANOVA, and it can also do regression-line fitting with scipy.stats. I'm not sure how to put those together to get an effective ANCOVA, though, if it is possible.
ANCOVA can be done with regression by using dummy variables in the design matrix for the effects that depend on the categorical variable.
A simple example, using the OLS class from scikits.statsmodels, is at
http://groups.google.com/group/pystatsmodels/browse_thread/thread/aaa31b08f3df1a69?hl=en
The relevant part of the construction of the design matrix (xg contains the group numbers/labels, x1 is the continuous explanatory variable):
>>> dummy = (xg[:,None] == np.unique(xg)).astype(float)
>>> X = np.c_[x1, dummy[:,1:], np.ones(nsample)]
Estimate the model
>>> res2 = sm.OLS(y, X).fit()
>>> print res2.params
[ 1.00901524 3.08466166 -2.84716135 9.94655423]
>>> print res2.bse
[ 0.07499873 0.71217506 1.16037215 0.38826843]
>>> prstd, iv_l, iv_u = wls_prediction_std(res2)
"Test hypothesis that all groups have same intercept"
>>> R = [[0, 1, 0, 0],
... [0, 0, 1, 0]]
>>> print res2.f_test(R)
<F test: F=array([[ 91.69986847]]), p=[[ 8.90826383e-17]],
df_denom=46, df_num=2>
This is strongly rejected because the differences in intercept are very large.
Update (two and a half years later):
scikits.statsmodels has been renamed to statsmodels.
Regarding the question: with the latest release of statsmodels, it is more convenient to use formulas for specifying categorical effects and interaction effects. statsmodels uses patsy to handle the formulas and to create the design matrices.
More information is available at the links to the statsmodels documentation in https://stackoverflow.com/a/19495920/333700
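As an illustration of the formula approach, here is a minimal ANCOVA-style sketch with statsmodels' formula API; the column names and toy data are made up for the example:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "x1": rng.normal(size=n),                       # continuous covariate
    "group": rng.choice(["a", "b", "c"], size=n),   # categorical factor
})
df["y"] = (1.0 * df["x1"]
           + df["group"].map({"a": 0.0, "b": 3.0, "c": -3.0})
           + rng.normal(scale=0.5, size=n))

# C(group) adds dummy variables for the categorical effect (separate intercepts)
res = smf.ols("y ~ x1 + C(group)", data=df).fit()
print(res.summary())

# ANCOVA-style table: tests the group effect adjusted for the covariate x1
print(anova_lm(res, typ=2))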