Clustering strings according to multiple string similarity ratings

I have a list of ~75000 strings that represent suburb names. These strings are often misspelled or are shortened variations. The list was captured manually over decades so do not underestimate how dirty this dataset is.
My goal is to find which of these strings belong to a suburb that I'm interested in, i.e. which of these strings were meant to be "HUMEWOOD" but were instead written as "HWOOD"/"HUMEWOOD STRAND"/"HUMWOOD" etc. Other suburbs have similar names, such as "HUMERAIL" or "SHERWOOD", which score quite high for substring similarity but are not the same suburb. I've used various string similarity algorithms and the results are OK, except that some algorithms are better suited to spelling mistakes (Levenshtein) and others are better suited to finding the shortened variations (longest common substring).
I thought of plotting two normalised similarity ratings for each string and then using a clustering algorithm to find the strings that are describing my suburb. So I've got a similarity rating according to two different algorithms for each string like this:
[Image: String similarity ratings]
Now I'd like to use a clustering algorithm to group the strings that might represent my suburb, plotting these ratings results in the following:
[Image: Plot of two string similarity ratings]
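For illustration, here is a minimal sketch of what I have in mind, assuming Python with NumPy and scikit-learn; the candidate strings and the DBSCAN parameters are made up:

    # Sketch only: two normalised similarity ratings per string, then DBSCAN.
    from difflib import SequenceMatcher

    import numpy as np
    from sklearn.cluster import DBSCAN

    TARGET = "HUMEWOOD"

    def levenshtein_similarity(a, b):
        """Normalised Levenshtein similarity in [0, 1]."""
        if not a and not b:
            return 1.0
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return 1.0 - prev[-1] / max(len(a), len(b))

    def lcs_similarity(a, b):
        """Longest common substring length over the longer string's length."""
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return m.size / max(len(a), len(b))

    candidates = ["HWOOD", "HUMEWOOD STRAND", "HUMWOOD", "HUMERAIL", "SHERWOOD"]
    X = np.array([[levenshtein_similarity(TARGET, s), lcs_similarity(TARGET, s)]
                  for s in candidates])

    # Cluster the 2-D similarity points; eps and min_samples would need tuning.
    labels = DBSCAN(eps=0.15, min_samples=2).fit_predict(X)
    for s, (lev, lcs), lab in zip(candidates, X, labels):
        print(f"{s:20s} lev={lev:.2f} lcs={lcs:.2f} cluster={lab}")

Each string becomes a 2-D point (Levenshtein similarity, longest-common-substring similarity) against the target, and DBSCAN groups the points without my having to pick the number of clusters up front.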
Obviously there are many combinations of string similarity algorithms I can use, and many clustering algorithms that can be used on those combinations. So before I take a deep dive into which combinations work best, I'd like to know:
Is this even a viable approach? I can't find any implementations similar to this and I'm sure there must be a good reason. Maybe I'm over-complicating this entirely, in which case I'd appreciate a nudge in the right direction.

Related

Reverse TF-IDF vector (vec2text)

Given a doc2vec vector generated for some document, is it possible to reverse the vector back to the original document?
If so, does there exist any hash algorithm that would make the vector irreversible but still comparable to other vectors of the same type (using cosine/Euclidean distance)?
It's unclear why you've mentioned "TF-IDF vector" in your question title, but then asked about a Doc2Vec vector – which is very different from a TF-IDF approach. I'll assume your main interest is Doc2Vec vectors.
In general, a Doc2Vec vector has far too little information to actually reconstruct the document for which the vector was calculated. It's essentially a compressed summary, based on evolving (in training or inference) a vector that's good (within the limits of the model) at predicting the document's words.
For example, one commonly-used dimensionality for Doc2Vec vectors is 300. Those 300 dimensions are each represented by a 4-byte floating-point value. So the vector is 1200 bytes in total - but could be the summary vector for a document of many hundreds or thousands of words, far far larger than 1200 bytes.
It's theoretically plausible that with a Doc2Vec vector, and the associated model from which it was trained or inferred, you could generate a ranked list of words most likely to be in the document. There's a pending feature request to offer this in Gensim (#2459), but no implemented code yet. But such a list of words wouldn't be grammatical, and the top 10 words in such a list might not be in the document at all. (It might be entirely made up of other similar words.)
With a large set of calculated vectors, as you get when training of a model has finished, you could take a vector (from that set, or from inferring a new text), and look through the set-of-vectors for whichever one has a vector closest to your query vector. That would point you at one of your known documents - but that's more of a lookup (when you already know many example documents) than reversing a vector into a document directly.
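For example, a minimal sketch of that lookup idea, assuming Gensim 4.x; the toy corpus and the parameter values are invented for illustration:

    # Sketch only (assumes Gensim 4.x, where the doc-vectors live on model.dv).
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words="the cat sat on the mat".split(), tags=[0]),
        TaggedDocument(words="dogs chase cats around the yard".split(), tags=[1]),
        TaggedDocument(words="stock prices fell sharply today".split(), tags=[2]),
    ]
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    # Infer a vector for a new text, then look up the closest known documents.
    query_vec = model.infer_vector("a cat on a mat".split())
    print(model.dv.most_similar([query_vec], topn=2))  # tags of the nearest docs

This finds which already-known documents are closest to the query vector; it does not reconstruct any document's text.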
You'd have to say more about your need for an 'irreversible' vector that is still good for document-to-document comparisons for me to make further suggestions to meet that need.
To an extent, Doc2Vec vectors already meet that need, as they can't regenerate an exact document. But given that they could generate a list of likely words (per above), if your needs are more rigorous, you might need extra steps. For example, if you used a model to calculate all needed vectors, but then threw away the model, even that theoretical capability to list most-likely words would go away.
But to the extent you still have the vectors, and potentially their mappings to full documents, a vector still implies one, or a few, closest-documents from the known set. And even if you somehow had a novel vector, without its text, simply looking among your known documents that are closest would be highly suggestive (but not dispositive) about what words are in the source document.
(If your needs are very demanding, there might be something in the genre of 'Fully Homomorphic Encryption' and/or 'Private Information Retrieval' that would help. Those use advanced cryptography to allow queries on encrypted data that only reveal final results, hiding the details of what you're doing even from the system answering your query. But those techniques are much newer and more complicated, with few if any sources of ready-to-use code, and adapting them specifically for vector-similarity-style calculations might require significant custom advanced-cryptography work.)

How do I evaluate clustering accuracy for mixed data, e.g. with K-Prototypes?

There are a lot of validity indices for clustering, but only for numeric data. What about clustering for mixed data (numeric and categorical)?
The same way, mostly.
You obviously can't use inertia, but anything that is distance based (and doesn't use the cluster mean) will work with the distance you used for clustering. E.g., Silhouette.
Unfortunately, the distance functions for such data are not very trustworthy in my opinion. So good luck, and triple check all results before using them, as you may have non-meaningful results that only look good when condensed to this single score number.
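For example, a rough sketch of Silhouette computed on a precomputed, hand-rolled Gower-style distance matrix; the toy data, the equal 50/50 weighting and the use of scikit-learn are assumptions, not a recommendation:

    # Sketch only: Silhouette on a hand-rolled mixed (Gower-style) distance matrix.
    import numpy as np
    from sklearn.metrics import silhouette_score

    numeric = np.array([[30, 50_000], [32, 52_000], [55, 90_000], [58, 95_000]], float)
    categorical = np.array([["savings"], ["savings"], ["cheque"], ["cheque"]])
    labels = np.array([0, 0, 1, 1])  # cluster assignment from e.g. k-prototypes

    # Range-normalise numeric columns so both parts contribute on a [0, 1] scale.
    rng = numeric.max(axis=0) - numeric.min(axis=0)
    num_scaled = (numeric - numeric.min(axis=0)) / np.where(rng == 0, 1, rng)

    n = len(labels)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            num_part = np.abs(num_scaled[i] - num_scaled[j]).mean()
            cat_part = (categorical[i] != categorical[j]).mean()
            D[i, j] = 0.5 * num_part + 0.5 * cat_part  # assumed 50/50 weighting

    print(silhouette_score(D, labels, metric="precomputed"))

The same precomputed matrix should be the one you clustered with; otherwise the score says little about the clustering.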

What mechanism can be used to quantify similarity between non-numeric lists?

I have a database of recipes which is essentially structured as a list of ingredients and their associated quantities. If you are given a recipe how would you identify similar recipes allowing for variations and omissions? For example using milk instead of water, or honey instead of sugar or entirely omitting something for flavour.
The current strategy is to do multiple inner joins for combinations of the main ingredients, but this can be exceedingly slow with a large database. Is there another way to do this? Something equivalent to perceptual hashing would be ideal!
How about cosine similarity?
This technique is commonly used in Machine Learning for text recognition as a similarity measure. With it, you can calculate the distance between two texts (actually, between any two vectors), which can be interpreted as how alike those texts are (the closer, the more alike).
Take a look at this great question that explains cosine similarity in a simple way. In general, you could use any similarity measure to obtain a distance to compare your recipe. This article talks about different similarity measures, you can check it out if you wish to know more.
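For instance, a minimal sketch of treating each recipe as an ingredient-quantity vector and comparing the vectors with cosine similarity; the recipes, quantities and ingredient vocabulary are made up:

    # Sketch only: recipes as ingredient-quantity vectors, compared with cosine.
    import numpy as np

    ingredients = ["flour", "sugar", "honey", "milk", "water", "eggs"]

    def to_vector(recipe):
        return np.array([recipe.get(name, 0.0) for name in ingredients])

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pancakes       = {"flour": 200, "sugar": 30, "milk": 250, "eggs": 2}
    pancakes_honey = {"flour": 200, "honey": 30, "milk": 250, "eggs": 2}
    bread          = {"flour": 500, "water": 300}

    print(cosine_similarity(to_vector(pancakes), to_vector(pancakes_honey)))  # high
    print(cosine_similarity(to_vector(pancakes), to_vector(bread)))           # lower

Substituting honey for sugar only changes one coordinate, so the two pancake variants stay close, while the bread recipe scores much lower.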

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for example, salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your probably matrix-based implementation with this kind of data, but a much simpler way - and appropriate for most distance-based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
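To make the modified-distance idea concrete, here is a rough sketch plugged into scikit-learn's KNeighborsClassifier; the column layout, the equal weighting of numeric and categorical parts, and the data are assumptions, not a definitive recipe:

    # Sketch only: a hand-rolled mixed distance plugged into KNeighborsClassifier.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Assumed column layout: [age, salary, bank_id, account_type_id, is_male]
    X = np.array([
        [25, 30_000, 1, 0, 1],
        [27, 32_000, 1, 0, 0],
        [52, 90_000, 2, 1, 1],
        [49, 85_000, 3, 1, 0],
    ], dtype=float)
    y = np.array([0, 0, 1, 1])

    NUMERIC = slice(0, 2)      # age, salary
    CATEGORICAL = slice(2, 5)  # bank, account type, gender

    ranges = X[:, NUMERIC].max(axis=0) - X[:, NUMERIC].min(axis=0)

    def mixed_distance(a, b):
        # Range-normalised Manhattan distance on the numeric part ...
        num = np.abs(a[NUMERIC] - b[NUMERIC]) / np.where(ranges == 0, 1, ranges)
        # ... plus a simple mismatch count on the categorical part.
        cat = (a[CATEGORICAL] != b[CATEGORICAL]).astype(float)
        return num.sum() + cat.sum()

    knn = KNeighborsClassifier(n_neighbors=3, metric=mixed_distance)
    knn.fit(X, y)
    print(knn.predict([[30, 40_000, 1, 0, 1]]))

Scaling the two parts differently (or per attribute) is exactly the domain-specific tuning described above, and a callable metric like this is slow on large data; it is meant to show the idea, not to be the final implementation.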
The most straightforward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted in my previous comment.
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Combining different similarities to build one final similarity

I'm pretty much new to data mining and recommendation systems, and am now trying to build some kind of rec system for users that have the following parameters:
city
education
interest
To calculate similarity between them, I'm going to apply cosine similarity and discrete similarity.
For example:
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
education : here I will use cosine similarity on the words that appear in the name of the department or bachelor's degree
interest : there will be a hardcoded number of interests the user can choose, and cosine similarity will be calculated based on two vectors like this:
1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n
where 1 means the presence of the interest and n is the total number of all interests.
My question is:
How do I combine those three similarities in an appropriate way? I mean, just summing them doesn't sound quite smart, does it? Also, I would like to hear comments on my "newbie similarity system", hah.
There are no hard-and-fast answers, since the answers here depend greatly on your input and problem domain. A lot of the work of machine learning is the art (not science) of preparing your input, for this reason. I could give you some general ideas to think about. You have two issues: making meaningful similarities out of each of these items, and then combining them.
The city similarity sounds reasonable but really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example does being in similarly-sized cities count for anything? In the same state? If they do your similarity should reflect that.
Education: I understand why you might use cosine similarity but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, and "ba" and "bachelors", things like that. Once you prepare the tokens that way it might give good results.
Interest: I don't think cosine will be the best choice here, try a simple tanimoto coefficient similarity (just size of intersection over size of union).
You can't just sum them, as I assume you still want a value in the range [0,1]. You could average them. That makes the assumption that the outputs of each of these are directly comparable, that they're the same "units" if you will. They aren't here; for example, it's not as if they are probabilities.
It might still work OK in practice to average them, perhaps with weights. For example, being in the same city here is as important as having exactly the same interests. Is that true or should it be less important?
You can try and test different variations and weights as hopefully you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.
However all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and like Mahout, Mahout in Action has good coverage in the clustering chapters on how to select and encode features and then how to make one similarity out of them.
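To make the weighted-average idea concrete, here is a purely illustrative sketch in Python; the weights, the tiny synonym map and the example users are all assumptions you would have to replace with something tuned on your own data:

    # Sketch only: weighted average of three hand-crafted similarities.
    EDU_SYNONYMS = {"eng": "engineering", "ba": "bachelors"}  # assumed token map

    def city_similarity(a, b):
        return 1.0 if a == b else 0.0

    def normalise(tokens):
        return {EDU_SYNONYMS.get(t.lower(), t.lower()) for t in tokens}

    def tanimoto(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def education_similarity(a, b):
        return tanimoto(normalise(a.split()), normalise(b.split()))

    def combined_similarity(u, v, weights=(0.2, 0.3, 0.5)):  # assumed weights
        parts = (city_similarity(u["city"], v["city"]),
                 education_similarity(u["education"], v["education"]),
                 tanimoto(u["interests"], v["interests"]))
        return sum(w * p for w, p in zip(weights, parts)) / sum(weights)

    alice = {"city": "Riga", "education": "BA eng", "interests": {"music", "hiking"}}
    bob = {"city": "Riga", "education": "bachelors engineering", "interests": {"music"}}
    print(combined_similarity(alice, bob))

The weights are exactly the heuristic, hacky part: you would pick them by testing against historical data rather than by intuition alone.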
Here's the usual trick in machine learning.
city : if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
I take this to mean you use a one-of-K coding. That's good.
education : here i will use cosine similarity as words appear in the name of the department or bachelors degree
You can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.
If you now normalize the interest number so that it always falls in the range [0,1], then you can use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. The latter corresponds (on length-normalized vectors) to the cosine similarity metric of information retrieval.
Experiment with L1 and L2 to decide which is best.
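For illustration, a small sketch of the one-of-K encoding and L1/L2 comparison described above; the city list, vocabulary and example users are invented:

    # Sketch only: one-of-K encoding for city, bag-of-words for education,
    # binary interest vector, then L1 and L2 distances on the concatenation.
    import numpy as np

    cities = ["riga", "vilnius", "tallinn"]          # assumed city list
    vocab = ["bachelors", "engineering", "arts"]     # assumed education vocabulary
    n_interests = 5

    def encode(city, edu_tokens, interest_ids):
        city_vec = [1.0 if city == c else 0.0 for c in cities]        # one-of-K
        edu_vec = [1.0 if w in edu_tokens else 0.0 for w in vocab]    # bag of words
        int_vec = [1.0 if i in interest_ids else 0.0 for i in range(n_interests)]
        return np.array(city_vec + edu_vec + int_vec)

    u = encode("riga", {"bachelors", "engineering"}, {0, 3})
    v = encode("riga", {"engineering"}, {0, 1, 3})

    print("L1:", np.abs(u - v).sum())
    print("L2:", np.linalg.norm(u - v))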