What mechanism can be used to quantify similarity between non-numeric lists? - cluster-analysis

I have a database of recipes which is essentially structured as a list of ingredients and their associated quantities. If you are given a recipe how would you identify similar recipes allowing for variations and omissions? For example using milk instead of water, or honey instead of sugar or entirely omitting something for flavour.
The current strategy is to do multiple inner joins for combinations of the main ingredients, but this can be exceedingly slow with a large database. Is there another way to do this? Something equivalent to perceptual hashing would be ideal!

How about cosine similarity?
This technique is commonly used in machine learning as a similarity measure for text. With it, you can calculate the distance between two texts (actually, between any two vectors), which can be interpreted as how alike those texts are (the closer, the more alike).
Take a look at this great question that explains cosine similarity in a simple way. In general, you could use any similarity measure to obtain a distance to compare your recipes. This article talks about different similarity measures; you can check it out if you wish to know more.
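For instance, here is a minimal sketch of the idea, assuming each recipe can be turned into a vector of ingredient quantities over a shared ingredient list (the recipes, ingredients and quantities below are made up for illustration):

```python
# Sketch: represent recipes as ingredient-quantity vectors and compare them
# with cosine similarity. All names and numbers here are illustrative only.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ingredients = ["flour", "sugar", "honey", "milk", "water", "egg"]

def to_vector(recipe):
    """Map a {ingredient: quantity} dict onto the shared ingredient axis."""
    return [recipe.get(ing, 0.0) for ing in ingredients]

pancakes       = {"flour": 200, "sugar": 20, "milk": 300, "egg": 2}
vegan_pancakes = {"flour": 200, "honey": 25, "water": 300}

vectors = np.array([to_vector(pancakes), to_vector(vegan_pancakes)])
print(cosine_similarity(vectors)[0, 1])   # closer to 1 means more alike
```

Substitutions (milk for water, honey for sugar) and omissions only change a few components of the vector, so similar recipes still end up close together.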

Related

Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)

I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is: does it matter which type of algorithm is used, i.e. does one perform the best? Is there any empirical evidence?
I am struggling to find any info on the web.
Thank you
Yes, depending on your task, it can matter quite a lot which algorithm you choose.
You can also be sure that the mice developers wouldn't put effort into providing different algorithms if there were one algorithm that always performed best. As in machine learning generally, the "No free lunch" theorem is also relevant for imputation.
In general, the default settings of mice are often a good choice.
Look at this example from the miceRanger vignette to see how far imputations can differ between algorithms (the real distribution is marked in red, the respective multiple imputations in black).
The predictive mean matching (pmm) algorithm, for example, makes sure that only values that actually occur in the dataset are imputed. This is useful, for example, where only integer values like 0, 1, 2, 3 appear in the data (and no values in between). Other algorithms won't do this, so during their regression they will also produce interpolated values like in the picture on the right (i.e. imputations such as 1.1, 1.3, ...). Both behaviours can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.
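mice itself is an R package, so the following is only a rough Python analogue, not the mice API: scikit-learn's IterativeImputer refit with two different estimators on synthetic data, just to illustrate that the choice of model changes the imputed values (the data and parameters are made up):

```python
# Sketch: same iterative imputation scheme, two different estimators,
# compared against the known ground truth on synthetic data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + X[:, 1]          # a column the others can predict
mask = rng.random(X.shape) < 0.2         # knock out ~20% of the values
X_missing = X.copy()
X_missing[mask] = np.nan

for estimator in (BayesianRidge(),
                  RandomForestRegressor(n_estimators=50, random_state=0)):
    imputed = IterativeImputer(estimator=estimator,
                               random_state=0).fit_transform(X_missing)
    rmse = np.sqrt(np.mean((imputed[mask] - X[mask]) ** 2))
    print(type(estimator).__name__, "imputation RMSE:", round(rmse, 3))
```

With real data you of course don't have the ground truth, which is exactly why the diagnostic plots mentioned above matter.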

Clustering strings according to multiple string similarity ratings

I have a list of ~75000 strings that represent suburb names. These strings are often misspelled or are shortened variations. The list was captured manually over decades, so do not underestimate how dirty this dataset is.
My goal is to find which of these strings belong to a suburb that I'm interested in, i.e. which of these strings were meant to be "HUMEWOOD" but were instead written as "HWOOD"/"HUMEWOOD STRAND"/"HUMWOOD" etc. Other suburbs have similar names such as "HUMERAIL" or "SHERWOOD", which score quite high for substring similarity but are not the same suburb. I've used various string similarity algorithms and the results are OK, except that some algorithms are better suited to spelling mistakes (Levenshtein) and others are better suited to finding the shortened variations (longest common substring).
I thought of plotting two normalised similarity ratings for each string and then using a clustering algorithm to find the strings that are describing my suburb. So I've got a similarity rating according to two different algorithms for each string like this:
String similarity ratings
Now I'd like to use a clustering algorithm to group the strings that might represent my suburb, plotting these ratings results in the following:
Plotting two string similarity ratings
Obviously there are many combinations of string similarity algorithms I can use, and many clustering algorithms that can be used on those combinations. So before I take a deep dive into which combinations work best I'd like to know:
Is this even a viable approach? I can't find any implementations similar to this and I'm sure there must be a good reason. Maybe I'm over-complicating this entirely, in which case I'd appreciate a nudge in the right direction.
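For what it's worth, a minimal sketch of the approach described above, assuming the names are available as a Python list; the two scores (a normalised Levenshtein similarity and a longest-common-substring ratio) and the DBSCAN parameters are illustrative placeholders that would need tuning:

```python
# Sketch: score each string against the target with two similarity measures,
# then cluster the resulting 2-D points. Strings and parameters are toy values.
from difflib import SequenceMatcher
import numpy as np
from sklearn.cluster import DBSCAN

TARGET = "HUMEWOOD"

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def lev_similarity(s: str, t: str) -> float:
    """Normalised to [0, 1]: 1 means identical."""
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

def lcs_similarity(s: str, t: str) -> float:
    """Longest common substring length, normalised by the shorter string."""
    m = SequenceMatcher(None, s, t).find_longest_match(0, len(s), 0, len(t))
    return m.size / min(len(s), len(t))

strings = ["HUMEWOOD", "HWOOD", "HUMEWOOD STRAND", "HUMWOOD",
           "HUMERAIL", "SHERWOOD"]          # stand-in for the 75k names

# Each string becomes a 2-D point: (edit-distance score, substring score).
X = np.array([[lev_similarity(s, TARGET), lcs_similarity(s, TARGET)]
              for s in strings])

labels = DBSCAN(eps=0.15, min_samples=2).fit_predict(X)  # eps/min_samples to tune
for s, (a, b), lab in zip(strings, X, labels):
    print(f"{s:18s} lev={a:.2f} lcs={b:.2f} cluster={lab}")
```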

Text patterns from classification

Say I have some kind of multi-class text/conversation classifier (naive Bayes or so) and I wanted to find text patterns that were significant for the classification. How would I best go about finding these text patterns? The motivation behind this is that you could use these patterns to better understand the process behind the classification.
A pattern is defined as a (multi)set of words s = {w1, ..., wn}; this pattern has a class probability for each class c, P(c|s), inferred by the classifier. A pattern is then significant if the inferred probability is high (local maximum, top n, something like that).
Now it wouldn't be such a problem to run the classifier on parts of the text in the dataset you are looking at. However, these patterns do not have to be natural sentences or anything like that, but any (multi)subset of the vocabulary. You would then be looking at running the classification on all the (multi)subsets of the vocabulary, which is computationally unrealistic.
I think what could work is to search the text space using a heuristic search algorithm such as hill climbing to maximise the likelihood of a certain class. You could run the hill climber a bunch of times from different initial conditions and then just take the top 10 or so unique results as patterns.
Is this a good approach, or are there better ones? Thanks for any suggestions.
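A rough sketch of the hill-climbing idea, assuming a fitted scikit-learn pipeline (CountVectorizer + MultinomialNB here as a stand-in for the classifier); the toy documents, pattern size and number of restarts are all assumptions for illustration:

```python
# Sketch: random-restart hill climbing over word multisets to find patterns
# with a high predicted class probability. Classifier and data are toy stand-ins.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["refund my order please", "great product thanks",
        "cancel subscription now", "love this item"]
labels = ["complaint", "praise", "complaint", "praise"]

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
vocab = list(model.named_steps["countvectorizer"].get_feature_names_out())
target = list(model.classes_).index("complaint")

def score(words):
    """P(target class | bag of words) according to the fitted classifier."""
    return model.predict_proba([" ".join(words)])[0][target]

def hill_climb(pattern_size=3, steps=200):
    current = random.sample(vocab, pattern_size)
    best = score(current)
    for _ in range(steps):
        cand = current.copy()
        cand[random.randrange(pattern_size)] = random.choice(vocab)  # swap one word
        s = score(cand)
        if s > best:                       # keep the move only if it improves
            current, best = cand, s
    return tuple(sorted(current)), best

# Random restarts; keep the top unique patterns.
results = {p: s for p, s in (hill_climb() for _ in range(20))}
for pattern, prob in sorted(results.items(), key=lambda kv: -kv[1])[:5]:
    print(round(prob, 3), pattern)
```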

How many and which parents should we select for crossover in genetic algorithm

I have read many tutorials and papers and I understand the concept of a Genetic Algorithm, but I am having some problems implementing it in Matlab.
In summary, I have:
A chromosome containing three genes [ a b c ], with each gene constrained by its own limits.
Objective function to be evaluated to find the best solution
What I did:
Generated random values of a, b and c for a population of, say, 20 solutions, i.e.
[a1 b1 c1] [a2 b2 c2]…..[a20 b20 c20]
At each solution, I evaluated the objective function and ranked the solutions from best to worst.
Difficulties I faced:
Now, why should we go for crossover and mutation? Is the best solution I found not enough?
I know the concept of doing crossover (generating a random number, probability, etc.), but which parents, and how many of them, will be selected for crossover or mutation?
Should I do the crossover for the entire 20 solutions (parents) or only two of them?
Generally a Genetic Algorithm is used to find a good solution to a problem with a huge search space, where finding an absolute solution is either very difficult or impossible. Obviously, I don't know the range of your values, but since you have only three genes it's likely that a good solution will be found by a Genetic Algorithm (or even a simpler search strategy) without any additional operators. Selection and crossover are usually carried out on all chromosomes in the population (although it's not uncommon to carry some of the best from each generation forward as-is). The general idea is that the fitter chromosomes are more likely to be selected and to undergo crossover with each other.
Mutation is usually used to stop the Genetic Algorithm from prematurely converging on a non-optimal solution. You should analyse the results without mutation to see if it's needed. Mutation is usually run on the entire population, at every generation, but with a very small probability; giving every gene a 0.05% chance of mutating isn't uncommon. You usually want a small chance of mutation, without it completely overriding the results of selection and crossover.
As has been suggested, I'd do a little more general background reading on Genetic Algorithms to get a better understanding of the concepts.
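To make the selection/crossover/mutation loop concrete, here is a small illustrative sketch (in Python rather than Matlab, with a made-up objective function and gene bounds) of fitness-proportionate selection, single-point crossover on a three-gene chromosome and per-gene mutation:

```python
# Sketch of one common GA scheme; objective, bounds and parameters are placeholders.
import random

BOUNDS = [(0, 10), (0, 5), (-1, 1)]          # assumed limits for a, b, c
POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 50, 0.05

def objective(chrom):
    a, b, c = chrom
    return -(a - 3) ** 2 - (b - 2) ** 2 - (c - 0.5) ** 2   # toy: maximise

def random_chromosome():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def select(pop, fits):
    """Roulette-wheel selection: fitter chromosomes are picked more often."""
    shifted = [f - min(fits) + 1e-9 for f in fits]          # make weights positive
    return random.choices(pop, weights=shifted, k=2)

def crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)                  # single-point crossover
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom):
    """Each gene has a small chance of being resampled within its bounds."""
    return [random.uniform(lo, hi) if random.random() < MUTATION_RATE else g
            for g, (lo, hi) in zip(chrom, BOUNDS)]

population = [random_chromosome() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    fitness = [objective(c) for c in population]
    children = [max(population, key=objective)]             # carry the best forward
    while len(children) < POP_SIZE:
        pa, pb = select(population, fitness)
        for child in crossover(pa, pb):
            children.append(mutate(child))
    population = children[:POP_SIZE]

print("best solution:", max(population, key=objective))
```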
Sharing a bit of advice from the book 'Practical Neural Network Recipes in C++'... It is a good idea to have a significantly larger population for your first epoch; then you're likely to include features which will contribute to an acceptable solution. Later epochs, which can have smaller populations, will then tune and combine or obsolete these favourable features.
And Handbook-Multiparent-Eiben seems to indicate that four parents are better than two. However, bed manufacturers have not caught on to this yet and seem to only produce single and double beds.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for example, salary and age. There are also some binary types (e.g. male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of letting you keep your (probably matrix-based) implementation with this kind of data, but a much simpler way - and one appropriate for most distance-based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment to find which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance-based algorithms, this is the most difficult part: optimizing your domain-specific distance function. You can see it as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set-theoretic measures which may be much more appropriate in your case, for example the Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
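As a concrete illustration of such a modified distance function, here is a hedged sketch: min-max-scaled numeric differences plus a simple mismatch count on the categorical columns, passed to scikit-learn's KNeighborsClassifier as a callable metric (the column layout, weighting and data are assumptions for illustration):

```python
# Sketch of a mixed numeric/categorical distance used with KNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns 0-1 numeric (salary, age), columns 2-3 categorical codes (bank, account type).
X = np.array([[50_000, 25, 0, 1],
              [52_000, 27, 0, 1],
              [90_000, 45, 2, 0],
              [30_000, 22, 1, 1]], dtype=float)
y = np.array([0, 0, 1, 0])

num_cols, cat_cols = [0, 1], [2, 3]
ranges = X[:, num_cols].max(axis=0) - X[:, num_cols].min(axis=0)
CAT_WEIGHT = 1.0   # how much one categorical mismatch counts; needs tuning

def mixed_distance(a, b):
    numeric = np.abs(a[num_cols] - b[num_cols]) / ranges      # scaled to ~[0, 1]
    categorical = (a[cat_cols] != b[cat_cols]).astype(float)  # 0/1 per mismatch
    return numeric.sum() + CAT_WEIGHT * categorical.sum()

knn = KNeighborsClassifier(n_neighbors=3, metric=mixed_distance).fit(X, y)
print(knn.predict([[51_000, 26, 0, 1]]))
```

The relative weighting of the numeric and categorical parts is exactly the domain-specific choice discussed above.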
The most straightforward way to convert categorical data into numeric form is by using indicator vectors. See the reference I posted in my previous comment.
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.