Clustering on a large dataset - matlab

I'm trying to cluster a large (gigabyte) dataset. In order to cluster, you need the distance from every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly of course ;)
Is there a way to cluster subsets of the large data first, and then maybe do some merging of similar clusters?
I don't know if this helps any, but the data are fixed-length binary strings, so I'm calculating their distances using the Hamming distance (the number of set bits in string1 XOR string2).
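For reference, a small MATLAB sketch of that distance, assuming the strings are stored either as 1-by-128 logical bit vectors or packed into uint64 words (both layouts are just assumptions for illustration):
% logical layout: a, b are 1-by-128 logical arrays
d = sum(xor(a, b));                              % Hamming distance
% packed layout: a64, b64 are 1-by-2 uint64 arrays holding the 128 bits
x = bitxor(a64, b64);
d = 0;
for w = 1:numel(x)
    d = d + sum(double(bitget(x(w), 1:64)));     % popcount of each 64-bit word
end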

A simplified version of the nice method from
Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search,
say for pairs with Hammingdist 1:
sort all the bit strings on the first 32 bits
look at blocks of strings where the first 32 bits are all the same;
these blocks will be relatively small
pdist each of these blocks: within a block, Hammingdist( left 32 ) = 0, so keep pairs with Hammingdist( the rest ) <= 1.
This misses the fraction, e.g. 32/128, of the nearby pairs which have
Hammingdist( left 32 ) = 1 and Hammingdist( the rest ) = 0.
If you really want these, repeat the above with "first 32" -> "last 32".
The method can be extended.
Take for example Hammingdist <= 2 on 4 32-bit words; the mismatches must split like one of
2000 0200 0020 0002 1100 1010 1001 0110 0101 0011,
so at least 2 of the 4 words must have 0 mismatches; sort on those words in the same way.
(Btw, sketchsort-0.0.7.tar is 99 % src/boost/, build/, .svn/ .)
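A minimal MATLAB sketch of the Hammingdist-1 case, assuming the strings are stored as rows of an N-by-128 logical matrix B (the variable names and the 32-bit prefix width are only illustrative):
prefixWidth = 32;
weights = 2.^(prefixWidth-1:-1:0);
key = double(B(:, 1:prefixWidth)) * weights(:);        % 32-bit prefix as one integer
[keySorted, order] = sort(key);
Bsorted = B(order, :);
runStart = [1; find(diff(keySorted) ~= 0) + 1];        % blocks with identical prefixes
runEnd   = [runStart(2:end) - 1; numel(keySorted)];
for r = 1:numel(runStart)
    blk = Bsorted(runStart(r):runEnd(r), :);
    if size(blk, 1) > 1
        % pdist 'hamming' returns the fraction of differing bits, so rescale;
        % the prefixes already match, so D <= 1 means overall Hammingdist <= 1
        D = squareform(pdist(double(blk), 'hamming') * size(blk, 2));
        [i, j] = find(triu(D <= 1, 1));                % candidate near pairs in this block
    end
end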

How about sorting them first? Maybe something like a modified merge sort? You could start with chunks of the dataset which will fit in memory to perform a normal sort.
Once you have the sorted data, clustering could be done iteratively. Maybe keep a rolling centroid of N-1 points and compare against the Nth point being read in. Then depending on your cluster distance threshold, you could pool it into the current cluster or start a new one.
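A rough MATLAB sketch of that streaming idea, under the assumption that points arrive as rows of an N-by-nbits logical matrix B and that a single distance threshold decides whether a point joins its nearest existing cluster (the threshold value and the running-mean centroid are choices of mine, not the only possibilities):
thresh = 10;                               % illustrative distance threshold
centroids = [];                            % one (real-valued) centroid per row
counts = [];
for i = 1:size(B, 1)                       % read one point per iteration
    x = double(B(i, :));
    if isempty(centroids)
        centroids = x; counts = 1; continue;
    end
    d = sum(abs(centroids - x), 2);        % L1 distance; equals Hamming distance for 0/1 centroids
    [dmin, j] = min(d);
    if dmin <= thresh
        counts(j) = counts(j) + 1;
        centroids(j, :) = centroids(j, :) + (x - centroids(j, :)) / counts(j);  % running mean
    else
        centroids = [centroids; x];        % start a new cluster
        counts = [counts; 1];
    end
end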

The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.
Additionally, these algorithms can cluster bit strings directly where all cluster representatives and data points are bit strings, and the similarity measure that is used is Hamming distance. This minimizes the Hamming distance within each cluster found.

Related

Number of buckets in LSH

In LSH, you hash slices of the documents into buckets. The idea is that documents that fall into the same bucket are potentially similar, and thus possibly nearest neighbours.
For 40,000 documents, what is (roughly) a good value for the number of buckets?
I have it as number_of_buckets = 40000/4 now, but I feel it can be reduced further.
Any ideas, please?
Related: How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?
A common starting point is to use sqrt(n) buckets for n documents. You can try doubling and halving that and run some analysis to see what kind of document distributions you got. Naturally any other exponent can be tried as well, and even K * log(n) if you expect that the number of distinct clusters grows "slowly".
I don't think this is an exact science yet; it belongs to the same family of problems as choosing the optimal k for k-means clustering.
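As an illustration of that sweep, something like the following could be run in MATLAB; bucketOf() is a stand-in (hypothetical name) for whatever banding/hash step you already use, and docs is assumed to be your document collection:
n = 40000;
candidates = round(sqrt(n) * [0.5 1 2 4]);            % halve and double around sqrt(n)
for c = 1:numel(candidates)
    nBuckets = candidates(c);
    idx = zeros(n, 1);
    for d = 1:n
        idx(d) = bucketOf(docs{d}, nBuckets);         % assumed to return an index in 1..nBuckets
    end
    occupancy = accumarray(idx, 1, [nBuckets 1]);
    fprintf('buckets = %6d   max = %6d   mean = %6.1f   empty = %6d\n', ...
        nBuckets, max(occupancy), mean(occupancy), sum(occupancy == 0));
end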
I think it should be at least n. If it is less than that, say n/2, you ensure that for every band each document has at least 1 potentially similar document on average, due to collisions. So your complexity when calculating the similarities will be at least O(n).
On the other hand, you will have to pass through the buckets at least K times, which is O(K*B), where B is the number of buckets. I believe the latter is faster, because it is just iterating over your data structure (namely a dictionary of some kind) and counting the number of documents that hashed to each bucket.

Which data structure to store binary strings and query with Hamming distance

I'm looking for a data structure to handle billions of binary strings, each containing 512 bits.
My goal is to send queries to the structure and get a result set containing all strings within a given distance.
My first idea was to use a k-d tree, but those trees are very slow in high dimensions.
My second idea was to use an LSH approach (minHash/superbit), but for that I also need a structure that supports efficient search.
Any ideas how to handle this much data?
Update
Some detailed notes:
for the Hamming distance there is only an upper limit, which is maybe 128, but I don't know the exact limit in advance
insertion or deletion would be nice, but I can also rebuild the graph (the database is only updated once a week)
the result set must contain all relevant nodes (I'm not looking for kNN)
Without knowing your intended search parameters, it's hard to optimize too much. That said, I think a good approach would be to build a B- or T-tree and then optimize that structure for the binary nature of the data.
Specifically, you have 64 bytes of data as a 512-element bit string. Your estimate is that you will have "billions" of records. That's on the order of 2^32 values, so 1/16th of the space will be full? (Does this agree with your expectations?)
Anyway, try breaking the data into bytes, let each byte be a key level. You can probably compress the level records, if the probability of set bits is uniform. (If not, if say set bits are more likely at the front of the key, then you might want to just allocate 256 next-level pointers and have some be null. It's not always worth it.)
All of your levels will be the same: each will represent 8 more bits of the string. So compute a table that maps, for a byte, all the byte values that are within distance S of that byte, for 0 <= S <= 8. Also compute a table that maps two bytes a, b to the Hamming distance hamming(a,b) between them.
To traverse the tree, let your search distance be SD. Set D = SD. Read the top-level block. Find all 8-bit values in the block within distance min(8, D) of your query. For each value, compute the exact distance hamming(query, value) and recurse into the lower block with D = D - hamming(query, value) for that sub-tree.
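A short MATLAB sketch of the two byte-level tables plus the pruning test described above; the tree itself is omitted and all names are illustrative:
vals = 0:255;
[a, b] = meshgrid(vals, vals);
x = bitxor(a, b);
byteDist = zeros(256);                     % byteDist(v+1, w+1) = hamming(v, w)
for bit = 1:8
    byteDist = byteDist + bitget(x, bit);
end
withinS = cell(256, 9);                    % withinS{v+1, s+1} = byte values within distance s of v
for v = 0:255
    for s = 0:8
        withinS{v + 1, s + 1} = find(byteDist(v + 1, :) <= s) - 1;
    end
end
% while walking one key level with remaining budget D and query byte q:
%   candidates = withinS{q + 1, min(8, D) + 1};
%   for each child byte c among candidates that exists in the block,
%   recurse with D - byteDist(q + 1, c + 1)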
The biggest design problem I see here is the closure requirement: we need to return all items within distance N of a given vector, for arbitrary N. The data space is sparse: "billions" is on the order of 2^33, but we have 512 bits of information, so there is only 1 entry per 2^(512-33) possibilities. For randomly distributed keys, the expected distance between any two nodes is 256; the expected nearest-neighbour distance is somewhere around 180.
This leads me to expect that your search will hinge on non-random clusters of data, and that your search will be facilitated by recognition of that clustering. This will be a somewhat painful pre-processing step on the initial data, but should be well worthwhile.
My general approach to this is to first identify those clusters in some generally fast way. Start with a hashing function that returns a very general distance metric. For instance, for any vector, compute the distances to each of a set of orthogonal reference vectors. For 16 bits, you might take the following set (listed in hex): 0000, 00FF, 0F0F, 3333, 5555, i.e. successively finer "grains" of alternating bits. Return this hash as a simple tuple of the 4-bit distances, a total of 20 bits (there are actual savings for long vectors, as one of the sizes is 2^(2^N)).
Now, this hash tuple allows you a rough estimate of the hamming distance, such that you can cluster the vectors more easily: vectors that are similar must have similar hash values.
Within each cluster, find a central element, and then characterize each element of the cluster by its distance from that center. For more speed, give each node a list of its closest neighbors with distances, all of them within the cluster. This gives you a graph for each cluster.
Similarly connect all the cluster centers, giving direct edges to the nearer cluster centers. If your data are reasonably amenable to search, then we'll be able to guarantee that, for any two nodes A, B with cluster centers Ac and Bc, we will have d(A, Ac) + d(B, Bc) < d(A, B). Each cluster is a topological neighbourhood.
The query process is now somewhat faster. For a target vector V, find the hash value. Find cluster centers that are close enough to that value that something in their neighbourhoods might match ([actual distance] - [query range] - [cluster radius]). This will allow you to eliminate whole clusters right away, and may give you an entire cluster of "hits". For each marginal cluster (some, but not all, nodes qualify), you'll need to find a node that works by something close to brute force (start in the middle of the range of viable distances from the cluster center), and then do a breadth-first search of each node's neighbors.
I expect that this will give you something comparable to optimal performance. It also adapts decently to additions and deletions, so long as those are not frequent enough to change cluster membership for other nodes.
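To make the cluster-elimination test concrete, here is a hedged MATLAB sketch of the pruning step; centers (one centre per row) and radii would come from the preprocessing described above, and all the names are mine:
% q: query bit vector (1-by-nbits logical), maxDist: the query range
dToCenters = sum(xor(centers, q), 2);               % Hamming distance from q to each centre
skip     = dToCenters - radii > maxDist;            % triangle inequality: nothing inside can match
whole    = dToCenters + radii <= maxDist;           % every member of the cluster is a hit
marginal = ~skip & ~whole;                          % only these clusters need a per-node search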
The set of vectors is straightforward. Write out the bit patterns for the 16-bit case:
0000 0000 0000 0000 16 0s
0000 0000 1111 1111 8 0s, 8 1s
0000 1111 0000 1111 4 0s, 4 1s, repeat
0011 0011 0011 0011 2 0s, 2 1s, repeat
0101 0101 0101 0101 alternating 0s and 1s
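A hedged MATLAB sketch of that rough hash for the 16-bit case, assuming the data sit in an N-by-16 logical matrix called vectors; longer vectors just extend the reference set with the same doubling pattern:
refsHex = {'0000', '00FF', '0F0F', '3333', '5555'};
nBits = 16;
refs = false(numel(refsHex), nBits);
for r = 1:numel(refsHex)
    refs(r, :) = dec2bin(hex2dec(refsHex{r}), nBits) == '1';
end
hashTuple = zeros(size(vectors, 1), numel(refsHex));
for r = 1:numel(refsHex)
    hashTuple(:, r) = sum(xor(vectors, refs(r, :)), 2);   % distance to reference r
end
% |d(v, ref) - d(w, ref)| <= hamming(v, w) for every reference, so vectors whose
% tuples differ a lot in any coordinate cannot be close; bucket/cluster on the tuple.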

Clustering in Matlab

Hi, I am trying to cluster using linkage(). Here is the code I am trying:
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting the following error:
The number of elements exceeds the maximum allowed size in
MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
The data size is 56710*128. How can I apply the code to small chunks of data and then merge those clusters optimally? Or is there any other solution to this problem?
Matlab probably cannot cluster this many objects with this algorithm.
Most likely they use distance matrices in their implementation. A pairwise distance matrix for 56710 objects needs 56710*56709/2 = 1,607,983,695 entries, or some 12 GB of RAM; most likely a working copy of this is needed as well. Chances are that the default Matlab data structures are not prepared to handle this amount of data (and you wouldn't want to wait for the algorithm to finish either; that is probably why they only "allow" a certain amount).
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?
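Something along these lines would give you those scaling numbers; the subset sizes are arbitrary, and whether plain Euclidean pdist or another metric fits your 56710-by-128 data is for you to decide:
sizes = [1000 2000 4000 8000];
for s = 1:numel(sizes)
    n = sizes(s);
    idx = randperm(size(data, 1), n);          % random subset of the rows
    tic;
    Y = pdist(data(idx, :));
    Z = linkage(Y);
    T = cluster(Z, 'maxclust', min(4096, n));
    fprintf('n = %5d  ->  %.1f seconds\n', n, toc);
end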

Why are vectors so shallow?

What is the rationale behind Scala's vectors having a branching factor of 32, and not some other number? Wouldn't smaller branching factors enable more structural sharing? Clojure seems to use the same branching factor. Is there anything magic about the branching factor 32 that I am missing?
It would help if you explained what a branching factor is:
The branching factor of a tree or a graph is the number of children at each node.
So, the answer appears to be largely here:
http://www.scala-lang.org/docu/files/collections-api/collections_15.html
Vectors are represented as trees with a high branching factor. Every
tree node contains up to 32 elements of the vector or contains up to
32 other tree nodes. Vectors with up to 32 elements can be represented
in a single node. Vectors with up to 32 * 32 = 1024 elements can be
represented with a single indirection. Two hops from the root of the
tree to the final element node are sufficient for vectors with up to
2^15 elements, three hops for vectors with 2^20, four hops for vectors
with 2^25 elements and five hops for vectors with up to 2^30 elements.
So for all vectors of reasonable size, an element selection involves
up to 5 primitive array selections. This is what we meant when we
wrote that element access is "effectively constant time".
So, basically, they had to make a design decision as to how many children to have at each node. As they explained, 32 seemed reasonable, but, if you find that it is too restrictive for you, then you could always write your own class.
For more information on why it may have been 32, you can look at this paper, as in the introduction they make the same statement as above, about it being nearly constant time, but this paper deals with Clojure it seems, more than Scala.
http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf
James Black's answer is correct. Another argument for choosing 32 items might have been that the cache line size in many modern processors is 64 bytes, so two lines can hold 32 ints of 4 bytes each, or 32 pointers on a 32-bit machine or on a 64-bit JVM with a heap size up to 32 GB thanks to pointer compression.
It's the "effectively constant time" for updates. With that large of a branching factor, you never have to go beyond 5 levels, even for terabyte-scale vectors. Here's a video with Rich talking about that and other aspects of Clojure on Channel 9. http://channel9.msdn.com/Shows/Going+Deep/Expert-to-Expert-Rich-Hickey-and-Brian-Beckman-Inside-Clojure
Just adding a bit to James's answer.
From an algorithm analysis standpoint, the growth of the two functions is logarithmic either way, so they scale the same way.
But, in practical applications, having log32(N) hops is a much smaller number of hops than, say, log2(N) with a branching factor of 2, sufficiently so that it stays closer to constant time, even for fairly large values of N.
I'm sure they picked 32 exactly (as opposed to a higher number) because of some memory block size, but the main reason is the fewer number of hops, compared to smaller sizes.
I also recommend you watch this presentation on InfoQ, where Daniel Spiewak discusses Vectors starting about 30 minutes in: http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
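To put rough numbers on the "fewer hops" argument, here is a throwaway calculation (nothing Scala-specific, just tree depths for different branching factors):
N = [1e3 1e6 1e9 1e12];                    % vector sizes
depth32 = ceil(log(N) / log(32));          % tree depth with branching factor 32
depth2  = ceil(log2(N));                   % tree depth with branching factor 2
[N; depth32; depth2]                       % e.g. a billion elements: depth 6 vs depth 30
% (the Scala docs count "hops from the root", which is one less than the depth)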

An instance of online data clustering

I need to derive clusters of integers from an input array of integers in such a way that the variation within the clusters is minimized. (The integers, or data values, in the array correspond to the gas usage of 16 cars running between cities. At the end I will group the 16 cars into 4 clusters based on the clusters of the data values.)
Constraints: the number of elements is always 16, the number of clusters is 4, and the size of each cluster is 4.
One simple way I am planning to do is that I will sort the input array and then divide them into 4 groups as shown below. I think that I can also use k-means clustering.
However, here is where I got stuck: the data in the array change over time. Basically, I need to monitor the array every second and regroup/recluster it so that the variation within each cluster is minimized, while still satisfying the constraints above. One idea I have is to select two groups based on their means and variances and move data values between the groups to minimize the within-group variation. However, I don't see how to select the data values to move between the groups, or how to select those groups. I cannot sort the array every second because I cannot afford N log N every second. It would be great if you could guide me to a simple solution.
sorted `input array: (12 14 16 16 18 19 20 21 24 26 27 29 29 30 31 32)`
cluster-1: (12 14 16 16)
cluster-2: (18 19 20 21)
cluster-3: (24 26 27 29)
cluster-4: (29 30 31 32)
Let me first point out that sorting a small number of objects is very fast. In particular when they have been sorted before, an "evil" bubble sort or insertion sort is usually linear. Consider in how few places the order may actually have changed! None of the classic complexity discussion really applies when the data fits into the CPU's first-level cache.
Did you know that most quicksort implementations fall back to insertion sort for small arrays? Because it does a fairly good job on small arrays and has little overhead.
All the complexity discussions are only for really large data sets; they are in fact proven only for infinitely sized data. Before you reach infinity, a simple algorithm of higher complexity order may still perform better. And for n < 10, quadratic insertion sort often outperforms O(n log n) sorting.
k-means, however, won't help you much:
Your data is one-dimensional. Don't even bother looking at multidimensional methods; they will perform worse than proper one-dimensional methods (which can exploit the fact that the data can be ordered).
If you want guaranteed runtime, k-means with possibly many iterations is quite uncontrolled.
You can't easily add constraints such as the 4-cars rule to k-means.
I believe the solution to your task (because the data is one-dimensional and because of the constraints you added) is, as sketched below:
Sort the integers.
Divide the sorted list into k even-sized groups.
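A minimal MATLAB sketch of that recipe; the incremental part relies on the observation above that re-sorting an almost-sorted 16-element array with insertion sort is effectively linear, and newUsage is a hypothetical variable holding the fresh readings:
usage = [12 14 16 16 18 19 20 21 24 26 27 29 29 30 31 32];   % current readings
k = 4;  clusterSize = numel(usage) / k;
[sortedUsage, order] = sort(usage);
clusters = reshape(order, clusterSize, k).';   % row i holds the car indices of cluster i
% one second later: taking the new readings in the previous order gives an
% almost-sorted array, which an insertion sort fixes in near-linear time
sortedUsage = newUsage(order);
for i = 2:numel(sortedUsage)
    v = sortedUsage(i);  p = order(i);  j = i - 1;
    while j >= 1 && sortedUsage(j) > v
        sortedUsage(j + 1) = sortedUsage(j);  order(j + 1) = order(j);  j = j - 1;
    end
    sortedUsage(j + 1) = v;  order(j + 1) = p;
end
clusters = reshape(order, clusterSize, k).';   % re-partition into 4 groups of 4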