An instance of online data clustering - cluster-analysis

I need to derive clusters of integers from an input array of integers in such a way that the variation within the clusters is minimized. (The data values in the array correspond to the gas usage of 16 cars running between cities. In the end I will derive 4 clusters of the 16 cars based on the clusters of the data values.)
Constraints: the number of elements is always 16, the number of clusters is 4, and each cluster has exactly 4 elements.
One simple approach I am planning is to sort the input array and then divide it into 4 groups, as shown below. I think I could also use k-means clustering.
However, here is where I am stuck: the data in the array change over time. Basically, I need to monitor the array every second and regroup/recluster it so that the variation within each cluster is minimized, while still satisfying the constraints above. One idea I have is to select two groups based on their means and variances and move data values between them to minimize the variation within each group. However, I have no idea how to select the data values to move, or how to select those two groups. I cannot sort the array every second, because I cannot afford N log N per second. It would be great if you could guide me towards a simple solution.
Sorted input array: `(12 14 16 16 18 19 20 21 24 26 27 29 29 30 31 32)`
cluster-1: (12 14 16 16)
cluster-2: (18 19 20 21)
cluster-3: (24 26 27 29)
cluster-4: (29 30 31 32)

Let me first point out that sorting a small number of objects is very fast. In particular, when they have been sorted before, even an "evil" bubble sort or insertion sort usually runs in linear time: consider how few places the order can have changed in! The classic complexity discussion doesn't really apply when the data fits into the CPU's first-level cache.
Did you know that most quicksort implementations fall back to insertion sort for small arrays? Because it does a fairly good job on small arrays and has little overhead.
All the complexity discussions apply only to really large data sets; strictly speaking, they are proven only for infinitely sized input. Before you reach infinity, a simple algorithm of higher complexity order may still perform better, and for n < 10, quadratic insertion sort often outperforms O(n log n) sorting.
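As a concrete illustration of why the N log N worry may be misplaced here, a plain insertion sort (a generic sketch, not code from the question) does work proportional to the number of out-of-place elements, so re-sorting an almost-sorted 16-element array each second is effectively linear:

```python
def insertion_sort(a):
    """In-place insertion sort: ~linear when the input is already almost sorted."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        # Shift larger elements right; on nearly-sorted data this inner loop
        # rarely runs, so the total work stays close to O(n).
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

insertion_sort([12, 14, 16, 15, 18, 19])  # → [12, 14, 15, 16, 18, 19]
```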
k-means, however, won't help you much:
Your data is one-dimensional. Do not even bother looking at multidimensional methods; they will perform worse than proper one-dimensional methods, which can exploit the fact that the data can be ordered.
If you want guaranteed runtime, k-means, with its possibly many iterations, is quite uncontrolled.
You can't easily add constraints such as the 4-cars rule to k-means.
I believe the solution to your task (given that the data is one-dimensional and the constraints you added) is:
Sort the integers
Divide the sorted list into k even-sized groups
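For the constraints in the question (16 values, 4 clusters of 4), those two steps amount to something like this sketch (the function name and layout are my own):

```python
def cluster_equal_sizes(values, k=4):
    """Sort the values and slice them into k contiguous, equal-sized groups.
    Assumes len(values) is divisible by k (16 cars, 4 clusters of 4 here)."""
    s = sorted(values)
    size = len(s) // k
    return [s[i * size:(i + 1) * size] for i in range(k)]

data = [12, 14, 16, 16, 18, 19, 20, 21, 24, 26, 27, 29, 29, 30, 31, 32]
cluster_equal_sizes(data)
# → [[12, 14, 16, 16], [18, 19, 20, 21], [24, 26, 27, 29], [29, 30, 31, 32]]
```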

Related

Is there an efficient way to calculate the smallest N numbers from a set of numbers in hardware (HDL)?

I am trying to calculate the smallest N numbers from a set and I've found software algorithms to do this. I'm wondering if there is an efficient way to do this in hardware (i.e. HDL - in System Verilog or Verilog)? I am specifically trying to calculate the smallest 2 numbers from a set.
I am trying to do this combinationally, optimizing for area and speed (for a large set of signals), but I can only think of comparator trees to do this. Is there a more efficient way?
Thank you, any help is appreciated~
I don't think you can avoid comparator trees if you want to find the two smallest elements combinationally. However, if your goal isn't low latency, then a (possibly pipelined) sequential circuit could also be an option.
One approach I can come up with on the spot is to perform a kind of incomplete bubble sort in hardware using small sorting networks. Depending on the amount of area you are willing to spend, you can use a smaller or larger p-sorting network that combinationally sorts p elements at a time, where p >= 3. You then apply this network to your input set of size N, sorting p elements at a time, and store the two smallest elements of each group in some sort of memory (e.g. an SRAM, if you want to process larger numbers of elements).
Here is an example for p=3 (the brackets indicate the grouping of elements the p-sorter is applied to):
(4 0 9) (8 6 7) (4 2 1) --> (0 4 9) (6 7 8) (1 2 4) --> 0 4 6 7 1 2
Now you start the next round:
You apply the p-sorter on the results of the first round.
Again you store the two smallest outputs of your p-sorter into the same memory overwriting values from the previous round.
Here the continuation of the example:
(0 4 6) (7 1 2) --> (0 4 6) (1 2 7) --> 0 4 1 2
In each round the number of elements to look at is reduced by a factor of 2/p. E.g. with p == 4 you discard half the elements in each round, until the smallest two elements end up in the first two memory locations. Since the rounds shrink geometrically, you need O(log N) rounds but only O(N/p) p-sorter applications in total (a geometric series). For an actual hardware implementation, you probably want to stick to powers of two for the size p of the sorting network.
Although the control logic of such a circuit is not trivial to implement, the area should be dominated by the size of your sorting network and by the memory needed to hold the first 2/p*N intermediate results (assuming your input signals are not already stored in a memory that you can reuse for that purpose). If you want to tune the circuit for throughput, you can increase p and pipeline the sorting network at the expense of additional area. Additional speedup could be gained by replacing the single memory with up to p two-port memories (1 read and 1 write port each), which would allow you to fetch and write back the data for the sorting network in a single cycle, thus increasing the utilization of the comparators in the sorting network.
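To make the round structure concrete, here is a small software model of the scheme (my own sketch, not HDL): each round applies a p-sorter to groups of p elements and keeps only the two smallest of each group:

```python
def smallest_two(values, p=3):
    """Software model of the iterative p-sorter approach: each round sorts
    groups of p elements and keeps only the two smallest of each group,
    until just the two overall smallest remain."""
    while len(values) > 2:
        kept = []
        for i in range(0, len(values), p):
            group = sorted(values[i:i + p])  # the p-sorting network
            kept.extend(group[:2])           # its two smallest outputs survive
        values = kept
    return sorted(values)

smallest_two([4, 0, 9, 8, 6, 7, 4, 2, 1])  # → [0, 1]
```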

What is the shortest human-readable hash without collision?

I have a total of W workers with long worker IDs. They work in groups, with a maximum of M members in each group.
To generate a unique group name for each worker combination, concatenating the IDs is not feasible. I am thinking of applying MD5() to the flattened, sorted worker ID list. I am not sure how many digits I should keep for it to be memorable to humans while staying safe from collisions.
Will log base (26+10) of W^M be enough? How many redundant chars should I keep? Is there any other specialized hash function that works better for this scenario?
The total number of combinations of 500 objects taken up to 10 at a time is approximately 2.5091E+20, which would fit in 68 bits (about 13 characters in base36), but I don't see an easy algorithm for assigning each combination a number. An easier scheme would be this: if you assign each person a 9-bit number (0 to 511) and concatenate up to 10 such numbers, you get 90 bits. To encode those in base36, you would need 18 characters.
If you want to use a hash with just 6 characters in base36 (about 31 bits), the probability of a collision depends on the total number of groups used during the lifetime of the application. If we assume that each day there are 10 new groups (that were not encountered before) and that the application will be used for 10 years, we get 36500 groups. Using the calculator provided by Nick Barnes shows that there is a 27% chance of a collision in this case. You can adjust the assumptions to your particular situation and then change the hash length to fit your desired maximum chance of a collision.
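The arithmetic behind that 27% figure can be reproduced with the standard birthday-bound approximation P ≈ 1 - exp(-n²/2^(b+1)) for n items and a b-bit hash (a sketch; Nick Barnes's calculator uses the exact formula, so the numbers differ slightly):

```python
import math

def collision_probability(n_items, bits):
    """Birthday-bound approximation: P ≈ 1 - exp(-n^2 / 2^(bits + 1))."""
    return 1.0 - math.exp(-n_items**2 / 2.0**(bits + 1))

bits = 6 * math.log2(36)                 # 6 base-36 characters ≈ 31 bits
p = collision_probability(36500, bits)   # ≈ 0.26 with this approximation
```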

Hash operator in Matlab for linear indices of vectors

I am clustering a large set of points. Across iterations, I want to avoid re-computing cluster properties if the assigned points are the same as in the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element-wise, and comparing the sum of the ID vector is risky (a small ID can be compensated by a large one). Maybe I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d, for example, loses points 11, 103 and gains 9, 105. Then I would mistakenly conclude that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application, the better we can help. In the absence of better information, and in the spirit of exposing the weakness of answers given in such an absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit set, that is, an array of bits whose length equals the cardinality of the underlying universe of things, in which each bit is set on or off according to the thing's membership of the (sub)set. You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a logical matrix with as many columns as there are points in your data and one row for each cluster. Set the (cluster, point) entry to true if the point is a member of the cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA == rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities, of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points; then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, in the bit set representing that cluster, bit 563 (which is probably bit 51 of the 9th element of the array) to 1.
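The bit-set idea is language-agnostic; as a sketch of the principle (in Python, using an arbitrary-precision integer as the bit array), the XOR of two cluster bit sets exposes exactly the membership changes that a plain sum would miss:

```python
def make_bitset(ids):
    """Pack a cluster's point IDs into one integer: bit i is 1 iff point i belongs."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits

# Cluster d from the question: it loses 11, 103 and gains 9, 105 (equal sums!)
d_old = make_bitset([11, 57, 74, 83, 86, 90, 92, 102, 103, 104])
d_new = make_bitset([9, 57, 74, 83, 86, 90, 92, 102, 104, 105])
d_old == d_new        # False: the change is detected despite the equal sums
bin(d_old ^ d_new)    # the XOR has exactly the 4 changed bits set
```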
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the IDs I would recommend using the Matlab Java interface to access the Java hashing algorithms:
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like this (idVector here stands for your vector of point IDs):
md = java.security.MessageDigest.getInstance('SHA');
hash = md.digest(int8(idVector));  % digest(byte[]) hashes the IDs in one call
Hope this helps.
I found the function DataHash on the FEX; it is quite fast for vectors, and strcmp on the keys is a lot faster than I expected.

What element of the array would be the median if the size of the array was even and not odd?

I read that it's possible to make quicksort run in O(n log n).
The algorithm says to choose the median as the pivot on each step.
but, suppose we have this array:
10 8 39 2 9 20
which value will be the median?
In math if I remember correct the median is (39+2)/2 = 41/2 = 20.5
I don't have a 20.5 in my array though
thanks in advance
You can choose either of them; in the limit, as the input scales up, it does not matter which.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think that in context, by "median" they probably meant not the mathematical median of the values in the list, but rather the middle point of the list, i.e. the median INDEX, which in this case would be 3 or 4. As coffNjava says, you can take either one.
The median is actually found by sorting the array first, so in your example the numbers are arranged as 2 8 9 10 20 39, and the median is the mean of the two middle elements, (9+10)/2 = 9.5, which doesn't help you at all. Using the median is a sort of ideal situation, but it would work if the array were at least already partially sorted, I think.
With an even-sized array you can't find an exact pivot point, so I believe you can use either of the two middle numbers. It'll throw off the efficiency a bit, but not substantially, unless you always end up sorting even-sized arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.
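For completeness, Sedgewick's median-of-three rule from the quoted passage looks like this (a sketch; the helper name is my own):

```python
def median_of_three(a, lo, hi):
    """Return the index of the median of a[lo], a[mid], a[hi], the classic
    pivot choice recommended by R. Sedgewick for longer partitions."""
    mid = (lo + hi) // 2
    # Sort the three (value, index) pairs and take the middle one.
    return sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])[1][1]

a = [10, 8, 39, 2, 9, 20]
pivot = median_of_three(a, 0, len(a) - 1)
a[pivot]  # → 20, the median of the candidates 10, 39 and 20
```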

Clustering on a large dataset

I'm trying to cluster a large (gigabyte-scale) dataset. In order to cluster, you need the distance from every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)
Is there a way to cluster subsets of the large data first, and then maybe do some merging of similar clusters?
I don't know if this helps any, but the data are fixed-length binary strings, so I'm calculating their distances using the Hamming distance (Distance = string1 XOR string2).
A simplified version of the nice method from
Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search,
say, for pairs with Hamming distance 1:
sort all the bit strings on their first 32 bits
look at blocks of strings where the first 32 bits are all the same;
these blocks will be relatively small
pdist each of these blocks, checking for Hammingdist( left 32 ) = 0 + Hammingdist( the rest ) <= 1.
This misses the fraction (e.g. 32/128) of nearby pairs that have
Hammingdist( left 32 ) = 1 + Hammingdist( the rest ) = 0.
If you really want these too, repeat the above with "first 32" -> "last 32".
The method can be extended.
Take for example Hammingdist <= 2 on 4 32-bit words; the mismatch counts must split across the words like one of
2000 0200 0020 0002 1100 1010 1001 0110 0101 0011,
so at least 2 of the words must have 0 mismatches; sort on those in the same way.
(Btw, sketchsort-0.0.7.tar is 99 % src/boost/, build/, .svn/ .)
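In case it helps, here is a toy software model of the single-sort blocking step above (my own sketch, with 64-bit strings split into a 32-bit sort key and a remainder):

```python
from collections import defaultdict
from itertools import combinations

def near_pairs(strings, max_dist=1):
    """Group 64-bit strings on their leftmost 32 bits, then compare pairs only
    within each (small) block.  This misses pairs whose differing bit lies in
    the top half; run it again keyed on the low 32 bits to catch those."""
    blocks = defaultdict(list)
    for s in strings:
        blocks[s >> 32].append(s)                    # key = left 32 bits
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if bin(a ^ b).count("1") <= max_dist:    # Hamming distance
                pairs.append((a, b))
    return pairs

x = (5 << 32) | 0b1010
y = (5 << 32) | 0b1011   # differs from x in one low bit
z = (7 << 32) | 0b1010   # different top half: lands in another block
near_pairs([x, y, z])    # → [(x, y)]
```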
How about sorting them first? Maybe something like a modified merge sort? You could start with chunks of the dataset that fit in memory and perform a normal sort on each.
Once you have the sorted data, clustering could be done iteratively. Maybe keep a rolling centroid of the first N-1 points and compare it against the Nth point being read in. Then, depending on your cluster distance threshold, you either pool the point into the current cluster or start a new one.
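That rolling-centroid idea might look roughly like this one-pass sketch over scalar values (the names and the threshold rule are my own assumptions; for bit strings you would substitute a Hamming-style centroid such as a majority-vote bit pattern):

```python
def stream_cluster(values, threshold):
    """One-pass clustering of sorted values: keep extending the current
    cluster while the next value stays within `threshold` of its rolling
    centroid; otherwise start a new cluster."""
    clusters, current, centroid = [], [], None
    for v in values:
        if centroid is None or abs(v - centroid) <= threshold:
            current.append(v)
            centroid = sum(current) / len(current)   # rolling centroid
        else:
            clusters.append(current)
            current, centroid = [v], float(v)
    if current:
        clusters.append(current)
    return clusters

stream_cluster([1, 2, 3, 50, 51, 52], threshold=10)
# → [[1, 2, 3], [50, 51, 52]]
```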
The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.
Additionally, these algorithms can cluster bit strings directly where all cluster representatives and data points are bit strings, and the similarity measure that is used is Hamming distance. This minimizes the Hamming distance within each cluster found.