Best Method to Intersect Huge HyperLogLogs in Redis - hash

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
Two Strategies
There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and I'm having a bit of trouble (not big in C, myself) figuring out whether it's better from a precision and efficiency perspective to use their built-in structures/routines or develop my own. For what it's worth, I'm willing to sacrifice space and to some degree errors (stdev +-2%) in the pursuit of efficiency with extremely large sets.
1. Inclusion Principle
By far the simplest of the two--essentially I would just use the lossless union (PFMERGE) in combination with this principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40% which is unacceptable in this use case).
Basically:
aCardinality + bCardinality - intersectionCardinality
or, in the case of multiple sets...
aCardinality + (bCardinality x cCardinality) - intersectionCardinality
seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know if the issue of wild inaccuracy (using inclusion/exclusion) is still present with sets of high disparity in size...
2. Jaccard Index Intersection/MinHash
This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (ie, I'm not implementing my own HLL algorithm from scratch).
With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using minhash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index for a more accurate count.
Problem is, I'm not very well versed in HLL's and while I'd love to dig into the Google paper I need a viable implementation in short order. Chances are I'm overlooking some basic considerations either of Redis' existing optimizations, or else in the algorithm itself that allows for computationally-cheap intersection estimates with pretty lax confidence bounds.
thus, my question:
How do I most effectively get a computationally-cheap intersection estimate of N huge (billions) sets, using redis, if I'm willing to sacrifice space (and to a small degree, accuracy)?

Read this paper some time back. Will probably answer most of your questions. Inclusion Principle inevitably compounds error margins a large number of sets. Min-Hash approach would be the way to go.
http://tech.adroll.com/media/hllminhash.pdf

There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.
For more details see the paper available at
http://oertl.github.io/hyperloglog-sketch-estimation-paper/.

Related

Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)

I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is that does it matter which type of algorithm i.e. does one perform the best? Is there any empirical evidence?
I am struggling to find any info on the web
Thank you
Yes, (depending on your task) it can matter quite a lot, which algorithm you choose.
You also can be sure, the mice developers wouldn't out effort into providing different algorithms, if there was one algorithm that anyway always performs best. Because, of course like in machine learning the "No free lunch theorem" is also relevant for imputation.
In general you can say, that the default settings of mice are often a good choice.
Look at this example from the miceRanger Vignette to see, how far imputations can differ for different algorithms. (the real distribution is marked in red, the respective multiple imputations in black)
The Predictive Mean Matching (pmm) algorithm e.g. makes sure that only imputed values appear, that were really in the dataset. This is for example useful, where only integer values like 0,1,2,3 appear in the data (and no values in between). Other algorithms won't do this, so while doing their regression they will also provide interpolated values like on the picture to the right ( so they will provide imputations that are e.g. 1.1, 1.3, ...) Both solutions can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.

kmean clustering: variable selection

I'm applying a kmean algorithm for clustering my customer base. I'm struggling conceptually on the selection process of the dimensions (variables) to include in the model. I was wondering if there are methods established to compare among models with different variables. In particular, I was thinking to use the common SSwithin / SSbetween ratio, but I'm not sure if that can be applied to compare models with a different number of dimensions...
Any suggestions>?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward elimination (SFBS). Those are heuristic methods where you eliminate (or add) one feature at the time based on your performance metric, e.g,. mean squared error (MSE). Also, you could use a genetic algorithm for that if you like.
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again, I initially had the question in mind "how do I select the k (a fixed number) best features (where k < d)," e.g., for computational efficiency or visualization purposes. Now, I think what you where asking is more like "What is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I really don't think you can really improve the performance via feature selection unless you have the ground truth labels.
I have to admit that I have more experience with supervised rather than unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I use dimensionality reduction frequently for data compression though.

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily know, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan find:
You can see that there really are two separate populations here, though adjusting the epsilon factor (the max. distance between neighboring clusters parameter), I simply cannot get it to see those two populations of particles.
Is there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that in the documentation is states that it may not scale well.
When using DBSCAN it can be helpful to scale/normalize data or
distances beforehand, so that estimation of epsilon will be relative.
There is a implementation of DBSCAN - I think its the one
Anony-Mousse somewhere denoted as 'floating around' - , which comes
with a epsilon estimator function. It works, as long as its not fed
with large datasets.
There are several incomplete versions of OPTICS at github. Maybe
you can find one to adapt it for your purpose. Still
trying to figure out myself, which effect minPts has, using one and
the same extraction method.
You can try a minimum spanning tree (zahn algorithm) and then remove the longest edge similar to alpha shapes. I used it with a delaunay triangulation and a concave hull:http://www.phpdevpad.de/geofence. You can also try a hierarchical cluster for example clusterfck.
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which does no longer need the epsilon parameter of DBSCAN.

How many and which parents should we select for crossover in genetic algorithm

I have read many tutorials, papers and I understood the concept of Genetic Algorithm, but I have some problems to implement the problem in Matlab.
In summary, I have:
A chromosome containing three genes [ a b c ] with each gene constrained by some different limits.
Objective function to be evaluated to find the best solution
What I did:
Generated random values of a, b and c, say 20 populations. i.e
[a1 b1 c1] [a2 b2 c2]…..[a20 b20 c20]
At each solution, I evaluated the objective function and ranked the solutions from best to worst.
Difficulties I faced:
Now, why should we go for crossover and mutation? Is the best solution I found not enough?
I know the concept of doing crossover (generating random number, probability…etc) but which parents and how many of them will be selected to do crossover or mutation?
Should I do the crossover for the entire 20 solutions (parents) or only two of them?
Generally a Genetic Algorithm is used to find a good solution to a problem with a huge search space, where finding an absolute solution is either very difficult or impossible. Obviously, I don't know the range of your values but since you have only three genes it's likely that a good solution will be found by a Genetic Algorithm (or a simpler search strategy at that) without any additional operators. Selection and Crossover is usually carried out on all chromosome in the population (although it's not uncommon to carry some of the best from each generation forward as is). The general idea is that the fitter chromosomes are more likely to be selected and undergo crossover with each other.
Mutation is usually used to stop the Genetic Algorithm prematurely converging on a non-optimal solution. You should analyse the results without mutation to see if it's needed. Mutation is usually run on the entire population, at every generation, but with a very small probability. Giving every gene 0.05% chance that it will mutate isn't uncommon. You usually want to give a small chance of mutation, without it completely overriding the results of selection and crossover.
As has been suggested I'd do a lit bit more general background reading on Genetic Algorithms to give a better understanding of its concepts.
Sharing a bit of advice from 'Practical Neural Network Recipies in C++' book... It is a good idea to have a significantly larger population for your first epoc, then your likely to include features which will contribute to an acceptable solution. Later epocs which can have smaller populations will then tune and combine or obsolete these favourable features.
And Handbook-Multiparent-Eiben seems to indicate four parents are better than two. However bed manufactures have not caught on to this yet and seem to only produce single and double-beds.

Choosing Clustering Method based on results

I'm using WEKA for my thesis and have over 1000 lines of data. The database includes demographical information (Age, Location, status etc.) followed by name of products (valued 1 or 0). The end results is a recommender system.
I used two methods of clustering, K-Means and DBScan.
When using K-means I tried 3 different number of cluster, while using DBscan I chose 3 different epsilons (Epsilon 3 = 48 clusters with ignored 17% of data, Epsilone 2.5 = 19 clusters while cluster 0 holds 229 items with ignored 6%.) Meaning i have 6 different clustering results for same data.
How do I choose what's best suits my data ?
What is "best"?
As some smart people noticed:
the validity of a clustering is often in the eye of the beholder
There is no objectively "better" for clustering, or you are not doing cluster analysis.
Even when a result actually is "better" on some mathematical measure such as separation, silhouette or even when using a supervised evaluation using labels - its still only better at optimizing towards some mathematical goal, not to your use case.
K-means finds a local optimal sum-of-squares assignment for a given k. (And if you increase k, there exists a better assignment!) DBSCAN (it's actually correctly spelled all uppercase) always finds the optimal density-connected components for the given MinPts/Epsilon combination. Yet, both just optimize with respect to some mathematical criterion. Unless this critertion aligns with your requirements, it is worthless. So there is no best, until you know what you need. But if you know what you need, you would not need to do cluster analysis.
So what to do?
Try different algorithms and different parameters and analyze the output with your domain knowledge, if they help you with the problem you are trying to solve. If they help you solving your problem, then they are good. If they do not help, try again.
Over time, you will collect some experience. For example, if the sum-of-squares is meaningless for your domain, don't use k-means. If your data does not have meaningful density, don't use density based clustering such as DBSCAN. It's not that these algorithms fail. They just don't solve your problem, they solve a different problem that you are not interested in. And they might be really good at solving this other problem...