Cluster external validation

I am using ELKI to perform location clustering with DBSCAN and OPTICS. My data set includes 30 participants. It is not labeled, but I do have pairs of coordinates (e.g. home, work, etc.) for each participant's frequent places.
I want to know which cluster each of these coordinate pairs belongs to (for each person). One way is to check each pair against each cluster manually, using some minimum distance threshold.
What would be a better way to achieve this?

Can you format your input data like this:
123 456 work1
124 457 work1
789 123 home2
123 123 unknown
The labels should be non-numeric; that is why I chose "work1", "work2", etc. for this example.
Then ELKI can automatically evaluate the result.
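If you nevertheless want the manual check described in the question (assign each known place to the nearest cluster within a distance threshold), a minimal sketch could look like this; Place, Cluster and assign are illustrative names, not part of ELKI:

```scala
// Sketch of the manual assignment: for each known place, pick the cluster
// whose closest member lies within maxDist (if any). Illustrative only.
case class Place(name: String, x: Double, y: Double)
case class Cluster(id: Int, points: Seq[(Double, Double)])

def euclidean(ax: Double, ay: Double, bx: Double, by: Double): Double =
  math.hypot(ax - bx, ay - by)

def assign(place: Place, clusters: Seq[Cluster], maxDist: Double): Option[Int] = {
  val candidates = for {
    c <- clusters
    d = c.points.map { case (px, py) => euclidean(place.x, place.y, px, py) }.min
    if d <= maxDist
  } yield (c.id, d)
  if (candidates.isEmpty) None else Some(candidates.minBy(_._2)._1)
}
```

Returning Option[Int] keeps the "no cluster within the threshold" case explicit, which DBSCAN's noise points will produce regularly.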

YOLOv3 convolutional layers count

I am really confused about the number of convolutional layers in YOLOv3!
According to the paper, they are using Darknet-53, and they don't mention any further details or additions to that structure!
However, according to AlexeyAB's build, it is composed of 106 layers!
Moreover, the towardsdatascience website claims that the additional 53 layers are added for the detection process. What does that really mean? Are the first 53 layers only for feature extraction, then?
So my question is: what is the purpose of these extra 53 layers that are not mentioned in the paper? Where did they come from, and why?
According to AlexeyAB (creator of the very popular forked Darknet version), https://groups.google.com/forum/?nomobile=true#!topic/darknet/9WppEzRouMU (this link seems to be dead now):
YOLO has 75 convolutional layers + 31 other layers (shortcut, route, upsample, yolo) = 106 layers in total.
You can count the total number of convolutional layers in the cfg file; there are 75. Also remember that YOLOv3 does detection at 3 different scales, at layers 82, 94 and 106.
Darknet-53 is the name of the feature extractor developed by Joseph Redmon et al., and it does indeed constitute the first 53 layers of YOLOv3. The next 53 layers are dedicated to resizing, concatenating and upsampling feature maps to prepare them for detection at three different scales, at layers 82, 94 and 106 respectively. The first detection layer detects the largest objects, the second medium-sized ones, and the last layer everything that remains (in theory at least).
I think the idea behind this hierarchical structure is that the further one moves into YOLOv3, the more high-level the information it is able to extract.
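If you want to verify those counts yourself, you can tally the bracketed section headers in a local copy of yolov3.cfg; a minimal sketch (the file path is an assumption):

```scala
// Sketch: tally Darknet cfg section headers (e.g. [convolutional], [shortcut],
// [route], [upsample], [yolo]) in a local yolov3.cfg.
import scala.io.Source

val source = Source.fromFile("yolov3.cfg")
val sections = source.getLines()
  .map(_.trim)
  .filter(l => l.startsWith("[") && l.endsWith("]"))
  .filterNot(_ == "[net]")            // the [net] header is configuration, not a layer
  .toList
source.close()

sections.groupBy(identity)
  .map { case (name, occurrences) => (name, occurrences.size) }
  .toSeq.sortBy(-_._2)
  .foreach { case (name, n) => println(f"$name%-16s $n") }
// For yolov3.cfg this should report 75 [convolutional] sections.
```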

How to revert One-Hot Encoding in Spark (Scala)

After running k-means (Spark MLlib, Scala), I want to make sense of the cluster centers I obtained from data which I pre-processed using (among other transformers) MLlib's OneHotEncoder.
A center looks like this:
Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
Which is obviously not very human friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features?
What if I look for the data point which is closest (using the same distance metric that is used by k-means, which I assume is Euclidean distance) to the centroid and then revert the encoding of that particular data point?
For the cluster centroids it is not possible (or at least strongly discouraged) to reverse the encoding. Imagine the original feature is category "3" out of 6, encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it is easy to extract 3 as the correct category from the encoding.
But after applying k-means you may get a cluster centroid that, for this feature, looks like [0.0,0.13,0.0,0.77,0.1,0.0]. If you decode this back to the representation you had before, say category "4" out of 6 because position 4 has the largest value, then you lose information and the decoded centroid may be misleading.
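As a small illustration, a naive argmax decode of such a centroid slice (plain Scala; indices are 0-based here) simply discards the mass sitting on the other categories:

```scala
// Naive (lossy) decode of a one-hot slice of a centroid: take the argmax.
// The 0.13 and 0.1 "memberships" are silently dropped.
val centroidSlice = Array(0.0, 0.13, 0.0, 0.77, 0.1, 0.0)
val decodedIndex = centroidSlice.zipWithIndex.maxBy(_._1)._2  // = 3, i.e. category "4" 1-based
```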
Edit: adding a possible way to revert the encoding on data points, from the comments to this answer.
If you have IDs on the data points, you can perform a select/join operation on the ID after you have assigned the data points to clusters, to get back the original state from before the encoding.
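A minimal sketch of that join, assuming Spark DataFrames that share an "id" column; the paths and column names are placeholders:

```scala
// Sketch: join cluster assignments back to the raw (pre-encoding) features by ID.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("decode-clusters").getOrCreate()

val rawFeatures = spark.read.parquet("raw_features.parquet")          // id + original categorical columns
val assignments = spark.read.parquet("cluster_assignments.parquet")   // id + cluster column from k-means

// Each data point keeps its human-readable categories plus the cluster it was assigned to.
val decoded = assignments.join(rawFeatures, "id")
decoded.show(10)
```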

Which data structure to store binary strings and query with Hamming distance

I'm looking for a data structure to handle billions of binary strings, each containing 512 binary values.
My goal is to send queries to the structure and get a result set containing all entries that are within a given distance of the query.
My first idea was to use a k-d tree, but those trees are very slow in high dimensions.
My second idea is to use an LSH approach (MinHash/SuperBit). But for that I must also have a structure to perform an efficient search.
Any ideas how to handle this much data?
**Update**
Some detailed notes:
for the Hamming distance there is only an upper limit, which is maybe 128, but at the moment I don't know the exact upper limit
insertion or deletion would be nice, but I can also rebuild the graph (the database is only updated once a week)
the result set must contain all relevant nodes (I'm not looking for k-NN)
Without knowing your intended search parameters, it's hard to optimize too much. That said, I think a good approach would be to build a B-tree or T-tree and then adapt that structure to the binary nature of the data.
Specifically, you have 64 bytes of data as a 512-element bit string. Your estimate is that you will have "billions" of records. That's on the order of 2^32 values, so 1/16th of the space will be full? (Does this agree with your expectations?)
Anyway, try breaking the data into bytes and let each byte be a key level. You can probably compress the level records if the probability of set bits is uniform. (If not, say if set bits are more likely at the front of the key, then you might want to just allocate 256 next-level pointers and leave some null. It's not always worth it.)
All of your levels will be the same: each represents 8 more bits of the string. So compute a table that maps, for each byte, all the byte values that are within distance S of that byte, for 0 <= S <= 8. Also compute a table that maps two bytes to the distance between them, hamming(a, b).
To traverse the tree, let your search distance be SD and set D = SD. Read the top-level block. Find all 8-bit values in the block within distance min(8, D) of the corresponding byte of your query. For each such value, compute the exact distance hamming(query_byte, value) and recurse into the lower block with D = D - hamming(query_byte, value) for that subtree.
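A minimal sketch of the two byte-level tables described above (plain Scala; names are illustrative):

```scala
// Precomputed byte-level Hamming tables.
// byteDist(a)(b): Hamming distance between byte values a and b.
// withinDist(a)(s): all byte values within distance s of a, for 0 <= s <= 8.
val byteDist: Array[Array[Int]] =
  Array.tabulate(256, 256)((a, b) => Integer.bitCount(a ^ b))

val withinDist: Array[Array[Array[Int]]] =
  Array.tabulate(256, 9)((a, s) => (0 until 256).filter(b => byteDist(a)(b) <= s).toArray)

// Example: bytes within distance 1 of 0x00 are 0x00 itself plus the 8 single-bit values.
assert(withinDist(0)(1).length == 9)
```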
The biggest design problem I see here is the closure requirement: we need to return all items within distance N of a given vector, for arbitrary N. The data space is sparse: "billions" is on the order of 2^33, but we have 512 bits of information, so there is only 1 entry per 2^(512-33) possibilities. For randomly distributed keys, the expected distance between any two nodes is 256; the expected nearest-neighbour distance is somewhere around 180.
This leads me to expect that your search will hinge on non-random clusters of data, and that your search will be facilitated by recognition of that clustering. This will be a somewhat painful pre-processing step on the initial data, but should be well worthwhile.
My general approach to this is to first identify those clusters in some generally fast way. Start with a hashing function that returns a very general distance metric. For instance, for any vector, compute the distances to each of a set of orthogonal reference vectors. For 16 bits, you might take the following set (listed in hex): 0000, 00FF, 0F0F, 3333, 5555, each a successively finer "grain" of alternating bits. Return this hash as a simple tuple of the 4-bit distances, a total of 20 bits (there are actual savings for long vectors, as one of the sizes is 2^(2^N)).
Now, this hash tuple gives you a rough estimate of the Hamming distance, so that you can cluster the vectors more easily: vectors that are similar must have similar hash values.
Within each cluster, find a central element, and then characterize each element of the cluster by its distance from that center. For more speed, give each node a list of its closest neighbors with distances, all of them within the cluster. This gives you a graph for each cluster.
Similarly connect all the cluster centers, giving direct edges to the nearer cluster centers. If your data are reasonably amenable to search, then we'll be able to guarantee that, for any two nodes A, B with cluster centers Ac and Bc, we will have d(A, Ac) + d(B, Bc) < d(A, B). Each cluster is a topological neighbourhood.
The query process is now somewhat faster. For a target vector V, find the hash value. Find cluster centers that are close enough to that value that something in their neighbourhoods might match ([actual distance to the center] - [query range] - [cluster radius] <= 0). This will allow you to eliminate whole clusters right away, and may give you an entire cluster of "hits". For each marginal cluster (some, but not all, nodes qualify), you'll need to find a node that works by something close to brute force (start in the middle of the range of viable distances from the cluster center), and then do a breadth-first search of each node's neighbors.
I expect that this will give you something comparable to optimal performance. It also adapts decently to additions and deletions, so long as those are not frequent enough to change cluster membership for other nodes.
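The cluster-level pruning test from the query step can be written compactly; a sketch, with the bounds following from the triangle inequality (names are illustrative):

```scala
// Decide how a cluster relates to a query ball of radius queryRange, using only
// the distance from the query to the cluster center and the cluster radius
// (maximum distance from the center to any of its members).
sealed trait ClusterStatus
case object AllMatch extends ClusterStatus   // every member is within range
case object NoMatch  extends ClusterStatus   // the whole cluster can be skipped
case object Marginal extends ClusterStatus   // individual members must be inspected

def classify(distToCenter: Int, clusterRadius: Int, queryRange: Int): ClusterStatus =
  if (distToCenter + clusterRadius <= queryRange) AllMatch
  else if (distToCenter - clusterRadius > queryRange) NoMatch
  else Marginal
```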
The set of vectors is straightforward. Write out the bit patterns for the 16-bit case:
0000 0000 0000 0000 16 0s
0000 0000 1111 1111 8 0s, 8 1s
0000 1111 0000 1111 4 0s, 4 1s, repeat
0011 0011 0011 0011 2 0s, 2 1s, repeat
0101 0101 0101 0101 1 0s, 1 1s, repeat
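A minimal sketch of this hash for the 16-bit case (plain Scala; generalizing to 512 bits would use wider words and more reference patterns):

```scala
// Hash a 16-bit vector as the tuple of Hamming distances to the reference
// patterns listed above (0x0000, 0x00FF, 0x0F0F, 0x3333, 0x5555).
val references = Array(0x0000, 0x00FF, 0x0F0F, 0x3333, 0x5555)

def hashTuple(v: Int): Array[Int] =
  references.map(r => Integer.bitCount((v ^ r) & 0xFFFF))

// Vectors with similar hash tuples are candidates for the same cluster; the
// triangle inequality gives |d(v, r) - d(w, r)| <= d(v, w) for each reference r.
println(hashTuple(0x0F3C).mkString(", "))
```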

Clustering/Nearest Neighbor

I have thousands to tens of thousands of data points (x, y) coming from 5 to 6 different sources. I need to uniquely group them based on certain distance criteria, in such a way that each formed group contains exactly one input from each source and every point in the group is within a certain distance d. The groups formed should be the best possible match.
Is this a combination of clustering and nearest neighbor?
What are the recommendation for the algorithms?
Are there any open source available for it?
I see many references suggesting k-d tree implementations, k-clustering, etc. I am not sure how I can tailor them to this specific need.

Clustering on a large dataset

I'm trying to cluster a large (gigabyte) dataset. In order to cluster, you need the distance from every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)
Is there a way to cluster subsets of the large data first, and then maybe do some merging of similar clusters?
I don't know if this helps any, but the data are fixed-length binary strings, so I'm calculating their distances using the Hamming distance (the number of set bits in string1 XOR string2).
A simplified version of the nice method from
Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search,
say for pairs with Hammingdist 1:
sort all the bit strings on the first 32 bits
look at blocks of strings where the first 32 bits are all the same;
these blocks will be relatively small
pdist each of these blocks; pairs within a block have Hammingdist( left 32 ) = 0, so the total distance is just Hammingdist( the rest ) <= 1.
This misses the fraction (e.g. 32/128) of the nearby pairs which have Hammingdist( left 32 ) = 1 and Hammingdist( the rest ) = 0.
If you really want these, repeat the above with "first 32" -> "last 32".
The method can be extended.
Take for example Hammingdist <= 2 on 4 32-bit words; the mismatches must split across the words like one of
2000 0200 0020 0002 1100 1010 1001 0110 0101 0011,
so at least 2 of the per-word distances must be 0; sort on those word pairs in the same way.
(Btw, sketchsort-0.0.7.tar is 99 % src/boost/, build/, .svn/ .)
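A minimal sketch of the distance-1 case (plain Scala; the bit strings are assumed to be packed into arrays of 32-bit words, and the names are illustrative):

```scala
// Simplified Tabei-style method for Hamming distance <= 1:
// group strings by their first 32-bit word, then compare only within groups.
def hamming(a: Array[Int], b: Array[Int]): Int =
  a.zip(b).map { case (x, y) => Integer.bitCount(x ^ y) }.sum

def pairsWithinOne(strings: IndexedSeq[Array[Int]]): Seq[(Int, Int)] =
  strings.indices
    .groupBy(i => strings(i)(0))                   // block = identical first 32 bits
    .values
    .flatMap { block =>
      for {
        i <- block
        j <- block if i < j
        if hamming(strings(i), strings(j)) <= 1    // left 32 bits already match
      } yield (i, j)
    }
    .toSeq
// As noted above, this misses pairs that differ only in the first word;
// repeat with the grouping key switched to another word to catch those.
```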
How about sorting them first? Maybe something like a modified merge sort? You could start with chunks of the dataset which will fit in memory to perform a normal sort.
Once you have the sorted data, clustering could be done iteratively. Maybe keep a rolling centroid of N-1 points and compare against the Nth point being read in. Then depending on your cluster distance threshold, you could pool it into the current cluster or start a new one.
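A rough sketch of that streaming pass (plain Scala; the packed representation and the use of the most recent cluster member in place of a true rolling centroid are assumptions):

```scala
// Single streaming pass: either add the incoming point to the current cluster
// or start a new one, based on a Hamming-distance threshold. Bit strings are
// assumed packed as Array[Int]; the most recently added member stands in for
// the rolling centroid described above.
def streamingClusters(points: Iterator[Array[Int]], threshold: Int): List[List[Array[Int]]] = {
  def hamming(a: Array[Int], b: Array[Int]): Int =
    a.zip(b).map { case (x, y) => Integer.bitCount(x ^ y) }.sum

  var clusters = List.empty[List[Array[Int]]]
  var current = List.empty[Array[Int]]

  for (p <- points) {
    if (current.isEmpty || hamming(current.head, p) <= threshold)
      current = p :: current
    else {
      clusters = current :: clusters
      current = List(p)
    }
  }
  if (current.nonEmpty) clusters = current :: clusters
  clusters
}
```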
The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.
Additionally, these algorithms can cluster bit strings directly where all cluster representatives and data points are bit strings, and the similarity measure that is used is Hamming distance. This minimizes the Hamming distance within each cluster found.