Software solutions for visualizing similarity/dissimilarity between pairs of people

I have calculations of the similarity/dissimilarity between every pair of ~1200 people on a scale of 0-1.
I would like to visualize these relationships on an X-Y plane. Are there any software tools that can take these relationships and place people close to each other when they are highly similar, and far apart when they are highly dissimilar?

You need to use multidimensional scaling (MDS). The objective is to find a configuration of your data that expresses the relative dissimilarities as linear distances in two dimensions. You will want classical (metric) scaling for this.
Here is an example with R:
The code is fairly straightforward. The magic is all in cmdscale, plus the use of vegdist to create the distance matrix that cmdscale needs. Then you can plot in R or export the data elsewhere for visualization.
## load libraries
library(ggplot2) # for charting
library(vegan) # for jaccard
## simulate some data - 1200 rows, 5 features/cols/fields
features <- matrix(abs(rnorm(1200*5)),ncol=5)
rownames(features) <- paste("P", 1:1200, sep="")
## calculate jaccard distances
d <- vegdist(features, method = "jaccard")
## Multidimensional Scaling
fit <- cmdscale(d,eig=TRUE, k=2)
# prepare the data for plotting
mds <- data.frame(
  x = fit$points[,1],
  y = fit$points[,2],
  name = rownames(features))
# plot
ggplot(mds, aes(x = x, y = y, label = name)) + geom_text(size = 1)
## bonus visualization! - a dendrogram
fit <- hclust(d, method = "ward.D") # "ward" was renamed "ward.D" (see also "ward.D2") in newer R
plot(fit)


Morton encoding of nearest neighbours?

I am using Morton encoding on a 3D grid so that a set of points (x,y,z) gives me a 1D array of Morton encodings M(x,y,z), where x, y, z are integers. For every M(x,y,z), my calculations also require the 26 nearest neighbours on the grid, i.e. M(x-1,y-1,z-1), M(x-1,y-1,z+0), M(x-1,y-1,z+1), M(x-1,y+0,z-1)...
My question is: how do I directly compute these neighbour encodings from M(x,y,z)? I know Wikipedia has a solution for 8-bit integers in 2D:
M(x,y-1) = ((M(x,y) & 0b10101010) - 1 & 0b10101010) | (M(x,y) & 0b01010101)
What do the equivalent algorithms look like for a 3 dimensional grid?
Is it a strict requirement that you compute the neighbours in a manner similar to the formula you have written? If not, you can use the (x, y, z) coordinates you already have, from which you can obtain all the neighbour Morton indices simply by performing regular Morton encoding on them. Here is a simple function in Python syntax that shows what I mean:
def get_neighbour_indices_3d(point):
    x, y, z = point  # the point you are currently looking at and know the coordinates of
    neighbours_indices = []
    for x_new in range(x-1, x+2):
        for y_new in range(y-1, y+2):
            for z_new in range(z-1, z+2):
                # maybe do some check that you're not beyond the grid edge or at the original point
                neighbours_indices.append(morton_encode(x_new, y_new, z_new))
    return neighbours_indices

Assessing performance of a zero inflated negative binomial model

I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero-inflated negative binomial model (package: pscl):
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, dist = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate RMSPE (root mean square prediction error), but it is probably best to transform your counts beforehand so that the really big counts do not dominate the sum.
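For example, a minimal sketch of an RMSPE on log-transformed counts, assuming vectors of observed counts and out-of-sample predictions such as the obs and preds.count computed further down:
rmspe <- sqrt(mean((log1p(obs) - log1p(preds.count))^2))  # log1p damps the influence of very long chains
rmspe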
You may find false negative and false positive error rates (FNR and FPR) to be useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google on these terms to find a paper in your favourite quantitative journals or a chapter in a book that helps explain these simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example that anyone can use (I'm not sure where your trainData comes from). This is from the help page of the zeroinfl function in the pscl package:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these, but here's the by-hand approach. First calculate the observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then get the confusion matrix (basis of FNR, FPR)
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[1,2] / sum(confusion.matrix[,2]) # predicted zero among chains that are actually non-zero
FNR
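The false positive rate can be read off the other column of the same table:
FPR <- confusion.matrix[2,1] / sum(confusion.matrix[,1]) # predicted non-zero among chains that are actually zero
FPR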
In terms of calibration, we can assess it visually or via a calibration regression.
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="blue")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[is.finite(output$log.obs) & is.finite(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlation between the predicted counts and the actual counts, either including or excluding the zeros.
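For example (a sketch reusing obs and preds.count from above):
cor(obs, preds.count)                      # correlation including the zero-length chains
cor(obs[obs > 0], preds.count[obs > 0])    # correlation for the non-zero chains only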
So you could fit a regression as a kind of calibration to evaluate this! However, since the predictions are not necessarily counts, we can't use a Poisson regression; instead we can use a lognormal model, regressing the log prediction against the log observed and assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are more fancy ways of assessing calibration I suppose, as in any modelling exercise ... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log-likelihood can be used, described in the references on the zeroinfl help page. This requires a bit of finesse.
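For example, one common use of the log-likelihood is to compare the zero-inflated fit with a plain negative binomial fitted to the same data, e.g. via AIC or the Vuong test shipped with pscl (a sketch based on the bioChemists example above):
library(MASS)                      # for glm.nb
fm_nb <- glm.nb(art ~ ., data = bioChemists)
logLik(fm_nb); logLik(fm_zinb2)    # raw log-likelihoods
AIC(fm_nb, fm_zinb2)               # penalised comparison
vuong(fm_nb, fm_zinb2)             # Vuong test for non-nested models (pscl)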

Using SURF algorithm to match objects on MATLAB

The objective is to see whether two images, each capturing one object, match.
The object/image I have stored, which will be used as the baseline:
item1 (this is what is matched against in the code)
The object/image that needs to be matched with the stored one:
input (I need to see if this matches what is stored)
My method:
Convert the images to grayscale.
Extract SURF interest points.
Obtain features.
Match features.
Get 50 strongest features.
Match the number of strongest features with each image.
Take the ratio: number of features matched / number of strongest features (which is 50).
If I have two images of the same object (two images taken separately on a camera), ideally the ratio should be near 1 or near 100%.
However, this is not the case; the best ratio I am getting is near 0.5, or even worse, 0.3.
I am aware that SURF detectors and features can be used in neural networks or in a statistics-based approach. I believe I have taken the statistics-based approach to some extent by using the 50 strongest features.
Is there something I am missing? What do I add onto this or how do I improve it? Please provide me a point to start from.
%Clearing the workspace and all variables
clc;
clear;
%ITEM 1
item1 = imread('Loreal.jpg');%Retrieve item 1 and digitize it.
item1Grey = rgb2gray(item1);%convert to grayscale, 2 dimensional matrix
item1KP = detectSURFFeatures(item1Grey,'MetricThreshold',600);%get SURF detectors or interest points
strong1 = item1KP.selectStrongest(50);
[item1Features, item1Points] = extractFeatures(item1Grey, strong1,'SURFSize',128); % using SURFSize of 128
%INPUT : Aquire Image
input= imread('MakeUp1.jpg');%Retrieve input and digitize it.
inputGrey = rgb2gray(input);%convert to grayscale, 2 dimensional matrix
inputKP = detectSURFFeatures(inputGrey,'MetricThreshold',600);%get SURF detectors or interest points
strongInput = inputKP.selectStrongest(50);
[inputFeatures, inputPoints] = extractFeatures(inputGrey, strongInput,'SURFSize',128); % using SURFSize of 128
pairs = matchFeatures(item1Features, inputFeatures, 'MaxRatio',1); %matching SURF Features
totalFeatures = length(item1Features); %baseline number of features
numPairs = length(pairs); %the number of pairs
percentage = numPairs/50;
if percentage >= 0.49
    disp('We have this');
else
    disp('We do not have this');
    disp(percentage);
end
The baseline image
The input image
I would try not doing selectStrongest and not setting MaxRatio. Just call matchFeatures with the default options and compare the number of resulting matches.
The default behavior of matchFeatures is to use the ratio test to exclude ambiguous matches. So the number of matches it returns may be a good indicator of the presence or absence of the object in the scene.
If you want to try something more sophisticated, take a look at this example.

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data, each having 1000 attributes; 3 of them should be alike or similar in some way. Applying clustering will reveal the clusters. Since I know the number of clusters, I only need to find similar rows. The Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked-for number of clusters]
I want to take that knowledge and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6-by-6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as from b to a) and the diagonal is 0 (the distance from a string to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i = 1:6
    for j = 1:6
        hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
    end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description of the problem helped shape this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (here a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1; this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not in the same cluster has ~50% dissimilarity (which is what you would expect by chance for binary data).
Presentation
I recommend a Kohonen network or a similar technique to present your data in, say, 2 dimensions. In general this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started, and similar libraries exist in Matlab, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
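If you want a quick two-dimensional picture along those lines without leaving R, here is a small sketch (a made-up 6x1000 binary matrix standing in for your data) using classical MDS via cmdscale; note that the Manhattan distance on 0/1 rows equals the Hamming distance:
set.seed(42)
base <- matrix(rbinom(3 * 1000, 1, 0.5), nrow = 3)    # 3 distinct random bit strings
xb <- base[rep(1:3, each = 2), ]                      # duplicate each to get 3 pairs of rows
flip <- matrix(rbinom(length(xb), 1, 0.1), nrow = 6)  # flip ~10% of bits so pairs are similar, not identical
xb <- abs(xb - flip)
d <- dist(xb, method = "manhattan")                   # = Hamming distance for binary rows
mds <- cmdscale(d, k = 2)
plot(mds, type = "n"); text(mds, labels = 1:6)        # similar rows end up close together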
Concerns
I have a concern over your data set though. 6 data points is a really small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points you have only 2 ** 6 = 64 distinct attribute patterns, which means 936 of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even from time to time, you cannot really trust your result.
I used a barcode-type visualization for my data. The code posted earlier by Oleg was too heavy for my solution (image files were over 500 kB), so I used image() to make the figures:
function barcode(A)
    B = (A+1)*2;
    image(B);
    colormap flag;
    set(gca,'Ydir','Normal')
    axis([0 size(B,2) 0 size(B,1)]);
    ax = gca;
    ax.TickDir = 'out';
end

How can I find low regions in a graph using Perl/R?

I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well this position in the genome is covered. Here is a graphical example for a data set:
I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.
Note that the size of the valleys I'm looking for is not really known - it may range from 50 bases to a few thousands. Defining what is a valley is of course one of the questions I'm struggling with, but the previous examples are relatively easy for me:
What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.
Thanks!
We do peak detection (and valley detection) using running medians and median absolute deviation. You can specify how much deviation from the running median means a peak.
In a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.
The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.
The application of the weighted median:
center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)
Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values at each side of the data vector. The weights are constructed using a customized function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.
Running Sample code :
require(aroma.light)
# make.weights : function to make weights on basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
                         type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
  type <- match.arg(type)
  x <- seq(-1,1,length.out=n)
  out <- switch(type,
                gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
                epanechnikov=0.75*(1-x^2),
                biweight=15/16*(1-x^2)^2,
                triweight=35/32*(1-x^2)^3,
                cosinus=pi/4*cos(x*pi/2))
  out <- out/sum(out)*n
  return(out)
}
# score.test : function to become a p-value based on the score test
# uses normal approximation, but is still quite correct when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
  n <- length(x)
  if(missing(w)) w <- rep(1,n)
  w <- w[!is.na(x)]
  x <- x[!is.na(x)]
  if(sum(w)!=n) w <- w/sum(w)*n
  phat <- sum(x*w)/n
  z <- (phat-p0)/sqrt(p0*(1-p0)/n)
  p <- 1-pnorm(z)
  return(p)
}
# embed.na is a modification of embed, adding NA strings
# to the beginning and end of x. window size= 2n+1
embed.na <- function(x,n){
  extra <- rep(NA,n)
  x <- c(extra,x,extra)
  out <- embed(x,2*n+1)
  return(out)
}
# running.score : function to calculate the weighted p-value for the chance of being in
# a run of peaks. This chance is based on the weighted proportion of the neighbourhood
# the null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# for a correct calculation, the weights have to sum up to n
running.score <- function(sel,n=20,w,p0){
  if(missing(w)) w <- rep(1,2*n+1)
  if(missing(p0)) p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)]) # null hypothesis
  out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
  return(out)
}
# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
  wdw <- 2*n+1
  if(missing(w)) w <- rep(1,wdw)
  center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
  mad <- median(abs(x-center))*cte
  return(list(med=center,mad=mad))
}
##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)
peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))
pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)
plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")
The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series larger than 20000 values you need a whole lot of memory; I can't run it on my computer any more.
Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:
#first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
#second derivative
f.sec.deriv <- diff(f.deriv)
#minima and maxima defined by where f.sec.deriv changes sign :
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)
op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
You can define a valley by different criteria:
depth
width
volume (depth * width)
You might also have a valley inside a big mountain; do you want these too?
For example, there is a valley here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3 (see the sketch below for one way to measure these criteria).
Try to explain with more details how YOU (or any expert in your field) would choose the valleys given the data
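A rough sketch in R of measuring those three criteria on the example series above; the cut-off of 900 is just an assumed threshold for what counts as "low", and you would pick it (or a running-median baseline) to suit your data:
x <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)
cutoff <- 900                                  # assumed threshold: below this we are "in a valley"
r <- rle(x < cutoff)                           # runs of consecutive low/high points
run.end <- cumsum(r$lengths)
run.start <- run.end - r$lengths + 1
valleys <- data.frame(start = run.start[r$values], end = run.end[r$values])
valleys$width  <- valleys$end - valleys$start + 1
valleys$depth  <- cutoff - mapply(function(s, e) min(x[s:e]), valleys$start, valleys$end)
valleys$volume <- valleys$width * valleys$depth
valleys                                        # three "valleys", including the shallow one inside the mountain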
You might want to look at watershed
You might want to try the peak detection function to identify the regions of interest. The desired minimum width of the valleys can be specified with the span parameter.
It might be a good idea to smooth the data first, to get rid of the noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter should be enough.
The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the peak value with the median level of the data.
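As a minimal sketch of that pipeline in R (the simulated coverage vector, the window width, and the depth cut-off are all placeholders to replace with your own data and requirements):
set.seed(1)
coverage <- pmax(0, 100 + cumsum(rnorm(5000)))        # hypothetical coverage track
# smooth with a simple moving average to suppress narrow noise peaks
k <- 101                                              # assumed window width; tune to the expected valley size
smoothed <- stats::filter(coverage, rep(1/k, k), sides = 2)
# flag positions whose smoothed level drops well below the overall median
cutoff <- 0.5 * median(coverage)                      # assumed depth criterion
in.valley <- !is.na(smoothed) & smoothed < cutoff
plot(coverage, type = "l")
points(which(in.valley), coverage[in.valley], col = "red", pch = ".")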