To check similarity between text data - cluster-analysis

Please guide me on how to measure the similarity of text data for clustering. For numeric data we can use Euclidean distance or another distance measure. The first data set consists of search keywords collected from a website, and the second is a collection of snippets returned by some searches. The similarity should also reflect similarity in meaning.

Read about tf–idf and cosine similarity.
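As a rough illustration of that suggestion, here is a minimal pure-Python sketch of tf–idf weighting followed by cosine similarity. The whitespace tokenizer, the smoothed idf formula, and the sample queries are assumptions made for the example. Plain tf–idf only rewards shared words, so it will not by itself capture similarity in meaning between texts that use different vocabulary.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption): terms found in every document keep a small weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical search keywords, for illustration only.
queries = ["cheap flights paris", "cheap flights to paris france", "python pandas tutorial"]
vecs = tfidf_vectors(queries)
sim_related = cosine(vecs[0], vecs[1])    # shares three terms
sim_unrelated = cosine(vecs[0], vecs[2])  # shares no terms
```

The resulting pairwise similarities (or 1 minus them, as distances) can then be handed to a clustering algorithm that accepts a precomputed matrix.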

Related

Pyspark columnSimilarities() usage for calculation of cosine similarities between products

I have a big dataset and need to calculate cosine similarities between products in the context of item-item collaborative filtering for product recommendations. As the data contains more than 50000 items and 25000 rows, I opted for using Spark and found the function columnSimilarities() which can be used on DistributedMatrix, specifically on a RowMatrix or IndexedRowMatrix.
However, there are two issues I'm wondering about.
1) In the documentation, it's mentioned that:
A RowMatrix is backed by an RDD of its rows, where each row is a local
vector. Since each row is represented by a local vector, the number of
columns is limited by the integer range but it should be much smaller
in practice.
As I have many products, it seems that RowMatrix is not the best choice for building the similarity matrix from my input, which is a Spark DataFrame. That's why I decided to start by converting the DataFrame to a CoordinateMatrix and then use toRowMatrix(), because columnSimilarities() requires a RowMatrix as input. However, I'm not sure how well this performs.
2) I found out that:
the columnSimilarities method only returns the off diagonal entries of
the upper triangular portion of the similarity matrix.
reference
Does this mean I cannot get the similarity vectors of all the products?
So your current strategy is to compute the similarity between each item and every other item. At best this means computing the upper triangle of the distance matrix, which for n items is n(n-1)/2 calculations (roughly n^2 / 2). Then you have to sort the results for each of those n items.
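On the second question: because cosine similarity is symmetric, the off-diagonal upper-triangular entries are enough to rebuild the similarity vector of every product. The sketch below shows the idea on plain Python tuples of the form (i, j, similarity) with i < j. That tuple layout and the helper names are assumptions for illustration, not Spark objects or Spark API calls.

```python
from collections import defaultdict

def pair_count(n):
    """Number of distinct unordered pairs among n items."""
    return n * (n - 1) // 2

def full_similarity_rows(upper_entries):
    """Mirror (i, j, sim) entries with i < j into one {neighbor: sim} dict per item."""
    rows = defaultdict(dict)
    for i, j, sim in upper_entries:
        rows[i][j] = sim
        rows[j][i] = sim  # similarity is symmetric, so the lower triangle is a copy
    return dict(rows)

# Hypothetical upper-triangular output for three products.
upper = [(0, 1, 0.9), (0, 2, 0.1), (1, 2, 0.4)]
rows = full_similarity_rows(upper)
pairs_for_50000_items = pair_count(50000)
```

The diagonal is omitted because every item has similarity 1 with itself.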
If you are willing to trade off a little accuracy for runtime you can use approximate nearest neighbors (ANN). You might not find exactly the nearest neighbors of an item, but you will find very similar items and it will be orders of magnitude faster. No one dealing with moderately sized datasets calculates (or has the time to wait to calculate) the full set of distances.
Each ANN search method creates an index that will only generate a small set of candidates and compute distances within that subset (this is the fast part). The way the index is constructed provides different guarantees about the accuracy of the NN retrieval (this is the approximate part).
There are various ANN search libraries and techniques out there, such as annoy, nmslib, and locality-sensitive hashing (LSH). An accessible introduction is here: https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html
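To make the "index, then small candidate set" idea concrete, here is a toy sketch of one LSH family for cosine similarity, random hyperplane hashing. It is not the implementation used by any of the libraries named above. The function names, the number of hyperplanes, and the tiny vectors are all assumptions for illustration.

```python
import math
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplane normals; each one contributes one bit of the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vec, planes):
    """Hash a vector to a tuple of booleans: which side of each hyperplane it lies on."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_index(vectors, planes):
    """Group item ids into buckets by signature (the index)."""
    index = {}
    for item_id, vec in vectors.items():
        index.setdefault(signature(vec, planes), []).append(item_id)
    return index

def query(item_id, vectors, planes, index, k=2):
    """Rank only the items that share a bucket with the query item."""
    bucket = index[signature(vectors[item_id], planes)]
    candidates = [c for c in bucket if c != item_id]
    candidates.sort(key=lambda c: cosine(vectors[item_id], vectors[c]), reverse=True)
    return candidates[:k]

# Hypothetical item vectors: two pointing the same way, one pointing the opposite way.
vectors = {"a": [1.0, 0.0, 0.0], "a_scaled": [2.0, 0.0, 0.0], "opposite": [-1.0, 0.0, 0.0]}
planes = make_hyperplanes(n_planes=4, dim=3)
index = build_index(vectors, planes)
neighbors_of_a = query("a", vectors, planes, index)
```

Exact distances are computed only inside the matching bucket, which is the fast part. Items that are similar but fall into a different bucket are missed, which is the approximate part. Real libraries reduce such misses with several hash tables or trees.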
HTH. Tim

Clustering techniques for similarity matrix

I have binary data for 128 respondents, based on the digital camera features they selected, where '1' means the feature was selected and '0' means it was not. There are 92 product features in the columns and the respondents are in the rows. Each respondent selected exactly 20 of the 92 features. I want to create clusters of different user groups based on the features they selected. I have tried some clustering algorithms, such as fuzzy clustering and hierarchical clustering, on this binary data, but they did not give good results and the clusters were really bad. So I have now computed a Dice coefficient similarity matrix over the respondents, which gives me the similarity score of each respondent with every other respondent. Is it possible to apply a clustering technique to this similarity matrix to get good clusters? Also, which clustering techniques could I apply to this user similarity matrix to identify clusters of users based on their similarity scores? Any suggestions and comments would be really appreciated.
Since your data set is tiny, go with hierarchical clustering.
It can be implemented with distance or with similarity.
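A small sketch of what that could look like: the Dice coefficient on 0/1 rows, turned into a distance as 1 minus similarity, followed by a naive average-linkage agglomerative merge. The choice of average linkage, the stopping rule (a fixed number of clusters), and the toy respondents are assumptions for the example. The naive loop is only sensible for tiny data sets such as 128 rows.

```python
def dice(a, b):
    """Dice coefficient between two equal-length 0/1 vectors."""
    both = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * both / total if total else 1.0

def agglomerate(rows, n_clusters):
    """Average-linkage agglomerative clustering on Dice distance (1 - similarity)."""
    n = len(rows)
    dist = [[1.0 - dice(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical respondents, each selecting 3 of 6 features.
respondents = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1],
]
groups = agglomerate(respondents, n_clusters=2)
```

When every respondent selects the same number of features (20 in the question), the Dice coefficient reduces to the number of shared features divided by that number, so it ranks pairs by plain overlap.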

Hierarchical agglomerative clustering

Can we use hierarchical agglomerative clustering for clustering data in this format?
"beirut,proff,email1"
"beirut,proff,email2"
"swiss,aproff,email1"
"france,instrc,email2"
"swiss,instrc,email2"
"beirut,proff,email1"
"swiss,instrc,email2"
"france,aproff,email2"
If not, which clustering algorithm is suitable for data with string values?
Thank you for your help!
Any type of clustering requires a distance metric. If all you're willing to do with your strings is treat them as equal to each other or not equal to each other, the best you can really do is the field-wise Hamming distance... that is, the distance between "abc,def,ghi" and "uvw,xyz,ghi" is 2, and the distance between "abc,def,ghi" and "abw,dez,ghi" is also 2. If you want to cluster similar strings within a particular field -- say clustering "Slovakia" and "Slovenia" because of the name similarity, or "Poland" and "Ukraine" because they border each other -- you'll need more complex metrics. Given a distance metric, hierarchical agglomerative clustering should work fine.
All this assumes, however, that clustering is what you actually want to do. Your dataset seems like sort of an odd use-case for clustering.
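The field-wise Hamming distance described above can be sketched in a few lines. The function name and the comma separator are assumptions, and the records are the ones from the question.

```python
def fieldwise_hamming(a, b, sep=","):
    """Count the fields in which two delimited records differ."""
    fields_a, fields_b = a.split(sep), b.split(sep)
    if len(fields_a) != len(fields_b):
        raise ValueError("records must have the same number of fields")
    return sum(x != y for x, y in zip(fields_a, fields_b))

records = [
    "beirut,proff,email1",
    "beirut,proff,email2",
    "swiss,aproff,email1",
    "france,instrc,email2",
    "swiss,instrc,email2",
    "beirut,proff,email1",
    "swiss,instrc,email2",
    "france,aproff,email2",
]

# Pairwise distance matrix that an agglomerative clustering routine could consume.
distance_matrix = [[fieldwise_hamming(r, s) for s in records] for r in records]
```

With three fields the distance can only take the values 0 to 3, which is why the resulting hierarchy has very few distinct levels.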
Hierarchical clustering is a rather flexible clustering algorithm. Except for some linkages (Ward?) it does not have any requirement on the "distance" - it could be a similarity as well, usually negative values will work just as well, you don't need triangle inequality etc.
Other algorithms - such as k-means - are much more limited. K-means minimizes variance; so it can only handle (squared) Euclidean distance; and it needs to be able to compute means, thus the data needs to be in a continuous, fixed dimensionality vector space; and sparsity may be an issue.
One algorithm that is probably even more flexible is Generalized DBSCAN. Essentially, it needs a binary decision "x is a neighbor of y" (e.g. distance less than epsilon), and a predicate to measure "core point" (e.g. density). You can come up with arbitrarily complex predicates of this kind, which may no longer correspond to a single "distance".
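To illustrate the two-predicate idea, here is a simplified sketch in the spirit of Generalized DBSCAN, not the published algorithm in full. The function and predicate names are hypothetical. The example predicates ("differs in at most one field" and "has at least three neighbors, itself included") are assumptions chosen to fit the records from the question.

```python
def generalized_dbscan(points, is_neighbor, is_core):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(points)
    # Neighborhoods come from the user-supplied binary predicate.
    neighbors = [[j for j in range(n) if is_neighbor(points[i], points[j])] for i in range(n)]
    labels = [None] * n
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if not is_core(neighbors[i]):
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if is_core(neighbors[j]):
                frontier.extend(neighbors[j])  # only core points expand the cluster
    return labels

records = [
    ("beirut", "proff", "email1"), ("beirut", "proff", "email2"),
    ("swiss", "aproff", "email1"), ("france", "instrc", "email2"),
    ("swiss", "instrc", "email2"), ("beirut", "proff", "email1"),
    ("swiss", "instrc", "email2"), ("france", "aproff", "email2"),
]

def differs_in_at_most_one_field(a, b):
    return sum(x != y for x, y in zip(a, b)) <= 1

def has_at_least_three_neighbors(neighbor_ids):
    return len(neighbor_ids) >= 3

labels = generalized_dbscan(records, differs_in_at_most_one_field, has_at_least_three_neighbors)
```

Either predicate can be swapped for any other rule without touching the clustering loop.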
Either way: if you can measure the similarity of these records, hierarchical clustering should work. The question is whether you can get enough similarity out of that data, and not just 3 bits: "has the same email", "has the same name", "has the same location" -- 3 bits will not provide a very interesting hierarchy.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, e.g., salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. Using binary indicator variables solves this problem implicitly, and has the benefit of letting you keep your (probably matrix-based) implementation with this kind of data. A much simpler way, appropriate for most distance-based methods, is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment to see which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
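One possible combination, as a sketch only: range-normalized absolute differences on the numeric fields plus a weighted mismatch count on the categorical fields, similar in spirit to Gower's distance. The field names, the weight `cat_weight`, and the toy training data are all assumptions for the example. The right scaling has to be found by experiment, as noted above.

```python
from collections import Counter

NUMERIC = ("salary", "age")
CATEGORICAL = ("bank", "account_type")

def fit_ranges(rows):
    """Record the min and max of each numeric field so differences can be normalized."""
    return {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in NUMERIC}

def mixed_distance(a, b, ranges, cat_weight=1.0):
    """Normalized numeric differences plus weighted categorical mismatches."""
    d = 0.0
    for f in NUMERIC:
        lo, hi = ranges[f]
        if hi > lo:
            d += abs(a[f] - b[f]) / (hi - lo)
    for f in CATEGORICAL:
        d += cat_weight * (a[f] != b[f])
    return d

def knn_predict(query, train, labels, ranges, k=3):
    """Majority vote among the k training rows closest to the query."""
    order = sorted(range(len(train)), key=lambda i: mixed_distance(query, train[i], ranges))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data and class labels.
train = [
    {"salary": 30000, "age": 25, "bank": "A", "account_type": "savings"},
    {"salary": 32000, "age": 27, "bank": "A", "account_type": "savings"},
    {"salary": 31000, "age": 30, "bank": "B", "account_type": "savings"},
    {"salary": 90000, "age": 50, "bank": "C", "account_type": "cheque"},
    {"salary": 85000, "age": 45, "bank": "C", "account_type": "cheque"},
    {"salary": 95000, "age": 55, "bank": "B", "account_type": "cheque"},
]
labels = ["low", "low", "low", "high", "high", "high"]
ranges = fit_ranges(train)
query = {"salary": 33000, "age": 28, "bank": "A", "account_type": "savings"}
prediction = knn_predict(query, train, labels, ranges)
```

Raising or lowering `cat_weight` changes how much a categorical mismatch counts relative to a full-range numeric difference.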
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
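For reference, two of these set-theoretic measures are easy to write down when each record is treated as a set of "field=value" tokens. That encoding is an assumption for the example. On binary data the Tanimoto coefficient coincides with Jaccard similarity.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dice(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

# Hypothetical records encoded as sets of field=value tokens.
x = {"bank=A", "type=savings", "sex=f"}
y = {"bank=A", "type=cheque", "sex=f"}
j = jaccard(x, y)
d = dice(x, y)
```

The two are monotonically related (Dice = 2J / (1 + J)), so they produce the same neighbor ranking and differ only in scale.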
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric data is to use indicator vectors. See the reference I posted in my previous comment.
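A minimal sketch of indicator (one-hot) vectors, assuming rows are dicts and that the set of categories is learned from the data. The helper names and the sample rows are hypothetical.

```python
def fit_categories(rows, fields):
    """Collect the sorted distinct values of each categorical field."""
    return {f: sorted({row[f] for row in rows}) for f in fields}

def to_indicator_vector(row, categories):
    """One 0/1 slot per (field, value) pair; exactly one slot per field is set."""
    vec = []
    for field, values in categories.items():
        vec.extend(1 if row[field] == v else 0 for v in values)
    return vec

rows = [
    {"bank": "A", "account_type": "savings"},
    {"bank": "B", "account_type": "cheque"},
    {"bank": "C", "account_type": "savings"},
]
categories = fit_categories(rows, ["bank", "account_type"])
encoded = [to_indicator_vector(r, categories) for r in rows]
```

With this encoding, the squared Euclidean distance between two rows equals twice the number of fields in which they differ, so every category is treated as equally far from every other.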
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

A Matlab histogram application

In my application, I have a number of data points, and each is associated with a number and a strength. I am trying to figure out how to sort these data points so that I can find the most frequent data point with the highest strength -- the answer will be sort of like an average between these two.
I can use hist() to generate the histogram of the data points and find which number occurs most often. However, I'm having trouble thinking of a way to sort the data point strengths by number easily. (I figure I can just multiply the hist of numbers with hist of strengths to find the best bin.) I don't think hist() can do this. Is there another way? Or am I limited to just binning the data point strengths manually by going through each number of bin?
I may be severely misinterpreting your problem, but why don't you use a 2D histogram routine (there are many in the FEX, such as this) and find the bin - corresponding to a range of numbers and a range of strengths - with the highest incidence of data points?
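The 2D-histogram idea can be sketched as follows. The example is in Python rather than MATLAB, and the bin widths and sample data are assumptions. Each point is assigned to a (number bin, strength bin) cell, and the fullest cell is reported.

```python
from collections import Counter

def histogram2d(numbers, strengths, number_width, strength_width):
    """Count data points per (number bin, strength bin) cell."""
    counts = Counter()
    for n, s in zip(numbers, strengths):
        cell = (int(n // number_width), int(s // strength_width))
        counts[cell] += 1
    return counts

# Hypothetical data: the number 3 occurs most often, mostly with high strength.
numbers = [3, 3, 4, 3, 7, 7, 3]
strengths = [0.9, 0.85, 0.8, 0.2, 0.95, 0.1, 0.92]

counts = histogram2d(numbers, strengths, number_width=1, strength_width=0.25)
best_bin, best_count = counts.most_common(1)[0]
```

Here `best_bin` holds the indices of the number bin and the strength bin with the highest incidence. Multiplying an index by its bin width gives the lower edge of that bin.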