How to find similar wiki pages with n-gram? - distance

Let's suppose there's a wiki, and for every wiki page I'd like to show a widget with a list of similar pages.
It could be done in two steps:
Step 1 - convert each page into a feature vector using n-grams (there are better ways to do this, for example word2vec, but I'd like to keep the approach simple to better understand the core concepts).
And here's the first problem - the dimensionality of the feature vector will be huge. Say we use n = 3 with only the symbols a-z; there are 26 letters, so the dimensionality of the feature vector will be 26 * 26 * 26 = 17576.
What is a simple way to handle the huge dimensionality of the feature vector? (There are sophisticated ways to solve this, for example the very clever approach word2vec uses, but I'd really like something simpler, so I can implement it myself and get a feeling for how it works.)
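To make Step 1 concrete, here is a minimal MATLAB sketch of a character-trigram count vector (the sample text and variable names are made up); storing the vector as sparse is one simple way to cope with the 17576 dimensions, since most trigrams never occur on any given page:

text = regexprep(lower('Example wiki page text'), '[^a-z]', '');   % keep only a-z
d = 26^3;                              % 17576 possible trigrams
v = sparse(1, d);                      % sparse: most trigrams never appear on one page
for i = 1:numel(text)-2
    c = double(text(i:i+2)) - double('a');     % map each character to 0..25
    idx = c(1)*26^2 + c(2)*26 + c(3) + 1;      % 1-based index of this trigram
    v(idx) = v(idx) + 1;
end
v = v / norm(v);                       % normalise so page length matters less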
Step 2 - we now have a bunch of vectors, one per page, in an N-dimensional vector space. We need to find similar vectors; again, there are multiple ways to define similarity, and I'd like to use Euclidean distance.
What is an efficient way to get the k closest vectors to a given vector X? We could use brute force and compare X with every other vector, but that's inefficient. Is there a better way?
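For Step 2, the brute-force scan is only a few lines in MATLAB, and the Statistics Toolbox's knnsearch offers a KD-tree-backed alternative; the names V, x and k below are illustrative, and note that in 17576 dimensions a KD-tree tends to degrade towards a linear scan anyway, which is one reason to reduce the dimensionality first:

% V: numPages x d matrix of page vectors, x: 1 x d query vector, k: neighbours wanted
dists = sqrt(sum((V - x).^2, 2));      % Euclidean distance to every page (implicit expansion, R2016b+)
[~, order] = sort(dists);
nearest = order(1:k);                  % indices of the k closest pages
% With the Statistics Toolbox, an index structure avoids hand-writing the scan:
% nearest = knnsearch(V, x, 'K', k);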

Related

Dot product with huge vectors

I am facing the following problem: I have a system of 160000 linear equations in 160000 variables. I am going to write two programs, one for the conjugate gradient method and one for the steepest descent method, to solve it. The matrix is block tridiagonal with only 5 non-zero diagonals, so it is not necessary to create and store the matrix. But I am having the following problem: when I get to the iteration step, there are dot products of vectors involved. I have tried the commonly used commands dot(u,v) and u'*v, but when I run the program, MATLAB tells me the data size is too large for memory.
To work around this, I tried to decompose the huge vector into sparse vectors with small support, then calculate the dot products of the small vectors and finally glue them together. But this method seems more complicated and not very efficient, and it is easy (especially for beginners like me) to make mistakes. I wonder if there are any more efficient ways to deal with this problem. Thanks in advance.
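A rough sketch of the "small pieces" idea described above, assuming u and v are the two length-160000 vectors: accumulate the dot product block by block, so only one slice of each vector is touched at a time (the block size is arbitrary; if the vectors themselves do not fit in memory, the slices would have to be read from disk as well):

n = numel(u);
blockSize = 10000;                              % arbitrary chunk size
s = 0;
for i = 1:blockSize:n
    j = min(i + blockSize - 1, n);
    s = s + dot(u(i:j), v(i:j));                % partial dot product of one slice
end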

better building of kd-trees

Has anyone ever tried improving kd-trees using the following method?
1. Dividing each numeric dimension via some 1-d clustering method (e.g. Jenks Natural Breaks Optimization, or Fayyad-Irani, or xyz...)
2. Sorting the dimensions by the expected value of the variance reduction within each division of that dimension
3. Building the kd-tree top-down, selecting attributes in the order found in (2)
4. Breaking dimensions at each level of the kd-tree using the divisions found in (1)
And just to state the obvious: if (3) terminates when the number of rows is (say) less than 30, then a nearest-neighbor query would require 30 distance measures, not N.
You want the tree to be balanced, so there is not much leeway in terms of where to split.
Also, you want the construction to be fast.
If you put in an O(n^2) method during construction, construction will likely be the new bottleneck.
In many cases, the very simple (original) k-d-tree is just as fast as any of the "optimized" techniques that try to determine the "best" splitting axis.
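For comparison, here is what the stock (median-split style) KD-tree from MATLAB's Statistics Toolbox looks like in use; the question isn't tied to any language, so this is just one concrete baseline with made-up data:

X = rand(100000, 3);                   % made-up data: n points in 3 dimensions
q = rand(1, 3);                        % made-up query point
tree = KDTreeSearcher(X);              % build the tree once with the default splits
idx = knnsearch(tree, q, 'K', 30);     % indices of the 30 nearest rows of X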

Find the nearest positions

Each day I have 10000 position pairs of the form (x,y); up to now I have collected 30 days' worth. I want to select one position pair from each day so that all the selected pairs have similar (x,y) values. By similar I mean that the Euclidean distance between any two pairs is minimized. How can I do this efficiently in MATLAB? With brute force it is almost impossible.
In the brute-force case there are 10000^30 possibilities; even if each operation took only 10^-9 seconds, it would run forever.
One idea would be to use the k-means algorithm or one of its variations. It is relatively easy to implement (it is also part of the Statistics Toolbox) and has a runtime of about O(nkl).
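For reference, the toolbox function mentioned above is essentially a one-liner; P, days and k below are assumptions about how the data might be laid out:

P = vertcat(days{:});          % hypothetical: stack each day's 10000x2 matrix of positions
[idx, C] = kmeans(P, k);       % idx: cluster label per row of P, C: the k cluster centroids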
Analysing all the possibilities will certainly give you the best result you are looking for.
If you want an approximate result, you can consider the first two days, analyse all the possibilities for those 2 days, and pick the best result. Then, when analysing the next day, keep the result obtained previously and find the point of the third day closest to the previous two.
In this way you will obtain an approximate solution, but with much lower computational complexity.
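A loose sketch of this greedy idea, simplified to seed from an arbitrary day-1 point rather than the exhaustive first-two-days step; days is assumed to be a 30x1 cell array, each cell a 10000x2 matrix of that day's (x,y) positions:

chosen = zeros(numel(days), 2);
chosen(1,:) = days{1}(1,:);                   % arbitrary seed from day 1
for d = 2:numel(days)
    ref = mean(chosen(1:d-1,:), 1);           % reference built from the days picked so far
    dist2 = sum((days{d} - ref).^2, 2);       % squared distance to every candidate (implicit expansion)
    [~, best] = min(dist2);
    chosen(d,:) = days{d}(best,:);            % keep day d's closest point
end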

MANOVA - huge matrices

First, sorry for tagging this as "ANOVA"; it is about MANOVA (yet to become a tag...).
All the examples in the tutorials I found use small matrices, and following them is not feasible for big ones, which is the case in many studies.
I have 2 matrices for my 14 sampling points: one for the organism IDs (4493 IDs) and the other for the chemical profile (190 variables).
The 2 matrices were correlated by Spearman correlation and, based on the correlation, split into 4 clusters (k-means on the squared Euclidean clustering values), with the IDs as rows and the chemical profile as columns.
The differences among them are somewhat clear, but to show this in a more robust way I want to perform MANOVA on the differences between and within the clusters - that is a key factor for the conclusion, of course.
The problem is that, after 8 hours of trying, I could not even get the data into a format acceptable to the analysis.
The tutorials I found are designed for very few variables, and even when I think I have overcome that, the program says my matrices can't be compared because of their difference in length.
Each cluster has its own set of IDs, all sharing the same set of variables.
What should I do?
Thanks in advance.
Diogo Ogawa
If you have missing values in your data (which practically all data sets seem to contain) you can either remove those observations or you can create a model using those observations. Use the first approach if something about your methodology gives you conviction that there is something different about those observations. Most of the time, it is better to run the model using the missing values. In this case, use the general linear model instead of a balanced ANOVA model. The balanced model will struggle with those missing data.
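If the analysis is being done in MATLAB (the question does not say which program), the Statistics Toolbox's manova1 expects one observation per row, the variables as columns, and a separate grouping vector rather than one matrix per cluster; the names below are assumptions:

% X: observations x variables (rows = samples, columns = the 190 chemical variables)
% group: one cluster label (1..4) per row of X
[d, p, stats] = manova1(X, group);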

Generating multivariate normal random numbers with zero covariances in matlab

Suppose we want to generate an n-dimensional normal random vector with distribution N(u, diag(sigma_1^2, ..., sigma_n^2)) in MATLAB, where u is a column vector.
There are two ways.
randn(n,1).*[sigma_1, ..., sigma_n]' + u;
mvnrnd(u', diag(sigma_1^2, ..., sigma_n^2))';
I think they are both correct, but I wonder whether there is any reason to prefer one over the other. I ask because I saw another person always choose the first way, while I chose the second without having thought about it.
Thanks and regards!
They are equivalent methods. Personally, I would prefer the second option because it's one function that can be used to generate this sort of data for arbitrarily-shaped arrays. If all of a sudden you wanted a whole matrix of Gaussian values, you can get that more easily from the second function call, without doing any calls to reshape(). I also think the second example is easier to read because it relies on a built-in of Matlab's that has been ubiquitous for a long time.
I suppose that if n is large, one could argue that it's inefficient to actually form diag(sigma_1^2, ..., sigma_n^2). But if you're needing to make random draws from a matrix that large, then Matlab is already the wrong tool for the job and you should use Boost::Probability in C++, or perhaps SciPy / scikits.statsmodels in Python.
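To make the two one-liners from the question concrete, here is a sketch with n = 3 and arbitrarily chosen values:

u = [1; 2; 3];                              % made-up mean vector
sigma = [0.5; 1.0; 2.0];                    % made-up standard deviations
x1 = randn(3,1).*sigma + u;                 % method 1: element-wise scaling
x2 = mvnrnd(u', diag(sigma.^2))';           % method 2: mvnrnd with a diagonal covariance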
If there are correlations between the random variables then the covariance matrix is no longer diagonal. In that case you may use mvnrnd, or use randn with a Cholesky decomposition as follows.
U = chol(SIGMA);       % upper-triangular U with U'*U = SIGMA
x = U'*randn(n,1);     % x then has covariance U'*U = SIGMA (add the mean u if it is non-zero)
Whenever possible, use basic functions instead of toolbox functions. Basic functions are faster and more portable.