How to reduce the false positive rate of a Bloom filter without generating any false negatives?

A key characteristic of a Bloom filter is that it can produce false positives but never false negatives. However, in many applications such as network monitoring, besides accelerating the membership check, we must also guarantee that no false negatives occur.
So how can the false positive rate of a Bloom filter be reduced effectively without introducing any false negatives?
I tried falling back to a search of the raw dataset after using the Bloom filter for approximate membership checking, but this incurs a large computational cost and requires extra memory to store the raw dataset.

There are a few ways to reduce the false positive rate. First, make sure you are using the optimal number of hash functions: for a filter of m bits holding n elements, that is k = (m/n) ln 2 (the Wikipedia page on Bloom filters has the derivation). Second, you can increase the size of the Bloom filter. Third, you can use a more space-efficient filter, such as a ribbon filter.
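As a rough sketch of the standard sizing formulas (textbook results, not tied to any particular library), here is how you might pick the filter size m and hash count k for a target false positive rate p:

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Bits m and hash count k for n elements at target FP rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # required bits
    k = max(1, round((m / n) * math.log(2)))            # optimal hash count
    return m, k

def false_positive_rate(m: int, n: int, k: int) -> float:
    # Probability that all k probes hit set bits after n insertions
    return (1 - math.exp(-k * n / m)) ** k

m, k = bloom_parameters(n=1_000_000, p=0.001)
print(m, k, false_positive_rate(m, n=1_000_000, k=k))  # ~14.4M bits, k = 10
```

Note that halving the target p costs only about 1.44 extra bits per element, so memory grows logarithmically in 1/p.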

Related

The performance of k-means evaluated by different metrics

I am trying to evaluate the clusters generated by k-means with different metrics, but I am not sure whether the results are good.
I have 40 documents to cluster into 6 categories.
I first converted them into tf-idf vectors, then clustered them with k-means (k = 6). Finally, I tried to evaluate the results with different metrics.
Because I have the true labels of the documents, I calculated the F1 score and accuracy. But I also want to know the performance under metrics that do not need true labels, such as the silhouette score.
The F1 score and accuracy are about 0.65 and 0.88 respectively, while the silhouette score is only about 0.05, which suggests I may have overlapping clusters.
In this case, can I say that the results are acceptable? Or should I address the overlap by trying a representation other than tf-idf, or a different clustering algorithm?
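For concreteness, a minimal sketch of the pipeline described above (scikit-learn; the documents below are placeholders, not the real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [f"placeholder document number {i}" for i in range(40)]  # stand-ins

X = TfidfVectorizer().fit_transform(docs)                 # tf-idf vectors
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # values near 0 indicate overlapping clusters
```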
With such a tiny data set, you really need to use a measure that is adjusted for chance.
Do the following: label each document randomly with an integer 1..6. What F1 score do you get? Now repeat this 100 times: what is the best result you get? A completely random result can score pretty well on such tiny data!
Because of this problem, the standard measure used in clustering is the adjusted Rand index (ARI). A similar chance adjustment also exists for NMI: Adjusted Mutual Information (AMI), although AMI is much less common.
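A sketch of that chance baseline (scikit-learn; the true labels here are synthetic stand-ins for the 40 real ones):

```python
import numpy as np
from sklearn.metrics import f1_score, adjusted_rand_score, adjusted_mutual_info_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=40)   # stand-in for the real document labels

# Best F1 over 100 completely random labelings: often surprisingly high at n = 40
best_f1 = max(
    f1_score(y_true, rng.integers(0, 6, size=40), average="macro")
    for _ in range(100)
)
print("best random F1:", best_f1)

# Chance-adjusted measures hover around 0 for random labelings
y_rand = rng.integers(0, 6, size=40)
print("ARI:", adjusted_rand_score(y_true, y_rand))
print("AMI:", adjusted_mutual_info_score(y_true, y_rand))
```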

Neural networks for an imbalanced dataset

I have a very imbalanced dataset of 186219 rows by 6 dimensions, containing 132 true positives against 186087 false positives. What types of neural network would you recommend trying? The spreadsheet IPDC_algorithm_training_dataset in my Google Drive contains my training dataset. If the value in the output tab is 100, that feature is a true positive; if it is 0, that feature is a false positive.
I am tied to MATLAB at the moment, so it would be more convenient for me to use MATLAB for this problem.
With a dataset that imbalanced you have limited options. If you trained a neural network on the entire dataset, it would achieve 99.9% accuracy just by always predicting the majority class. You need to deal with the imbalance in some way, such as discarding (vast swathes of) the majority-class samples or weighting your loss function to account for the imbalance. With an imbalance as extreme as this, you would probably need to apply both (along with regularisation to prevent overfitting the remaining data).
In terms of what network type to use, you could try just a basic MLP (multi-layer perceptron), at least as a baseline – there is no point in building a complicated architecture, with more parameters to train, on a very limited dataset.
In reality, you would probably be better off using a shallow learning algorithm, such as boosted trees or naive Bayes, or getting more data to enable the use of a neural network. If new data is likely to remain this imbalanced, you would need a very large amount of extra data.
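A sketch of the loss-weighting idea (Python/PyTorch here rather than MATLAB, with random placeholder data using the question's class counts):

```python
import torch
import torch.nn as nn

n_pos, n_neg = 132, 186_087               # counts from the question
X = torch.randn(n_pos + n_neg, 6)         # placeholder for the real 6-D data
y = torch.cat([torch.ones(n_pos), torch.zeros(n_neg)])

model = nn.Sequential(                    # basic MLP baseline
    nn.Linear(6, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
# pos_weight scales the loss on positives by n_neg / n_pos (~1410x),
# so the 132 positives carry as much weight as the ~186k negatives
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(n_neg / n_pos))
# weight_decay adds the regularisation mentioned above
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    opt.step()
```

The same weighted-loss idea can be expressed in MATLAB with a custom loss; the sketch above is only a statement of the technique, not a drop-in solution.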

trainCascadeObjectDetector FalseAlarmRate - lower or higher?

I read this info about cascade training in MATLAB:
"Lower values for FalseAlarmRate increase complexity of each stage. Increased complexity can achieve fewer false detections but can result in longer training and detection times. Higher values for FalseAlarmRate can require a greater number of cascade stages to achieve reasonable detection accuracy."
Shouldn't the second sentence also say lower values? If lower values increase complexity, then lower values should require a greater number of cascade stages, not higher ones...
So I'm a little confused.
https://www.mathworks.com/help/vision/ref/traincascadeobjectdetector.html

Distance between two Bloom filters

I want to compare two big lists of strings (possibly up to 4^31 elements).
I tried Jaccard distance and MinHash (using Perl for the moment) which give good results, but I have a memory issue. So I represented my lists as Bloom filters.
Is there any way to approximate the Jaccard distance using Bloom filters as input? Or is there any alternative to Bloom filters for this purpose?
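For what it's worth, one known approach (not from this thread; a sketch under the assumption that both filters use the same length m and the same k hash functions): estimate each set's cardinality from its filter's popcount with the Swamidass-Baldi formula, estimate the union from the bitwise OR of the filters, and get the intersection by inclusion-exclusion:

```python
import math

def popcount(bf: int) -> int:
    return bin(bf).count("1")

def est_size(bf: int, m: int, k: int) -> float:
    # Swamidass-Baldi estimate: n ~ -(m/k) * ln(1 - t/m), t = set bits
    # (assumes the filter is not completely saturated)
    t = popcount(bf)
    return -(m / k) * math.log(1 - t / m)

def jaccard_estimate(bf_a: int, bf_b: int, m: int, k: int) -> float:
    # Filters represented as plain Python ints used as m-bit arrays (an
    # assumption); the OR of two filters is exactly the filter of the union.
    union = est_size(bf_a | bf_b, m, k)
    inter = est_size(bf_a, m, k) + est_size(bf_b, m, k) - union
    return max(inter, 0.0) / union
```

The estimate degrades as the filters fill up, so m has to be sized generously; compact MinHash variants (e.g. b-bit MinHash) are an alternative worth considering for the memory issue.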

ELKI - Clustering Statistics

When a data set is analyzed by a clustering algorithm in ELKI 0.5, the program produces a number of statistics: the Jaccard index, F1-Measures, etc. In order to calculate these statistics, there have to be 2 clusterings to compare. What is the clustering created by the algorithm compared to?
The automatic evaluation (note that you can configure the evaluation manually!) is based on labels in your data set. At least in the current version (why are you using 0.5 and not 0.6.0?) it should only automatically evaluate if it finds labels in the data set.
We have not yet published internal measures. There are some implementations, such as evaluation/clustering/internal/EvaluateSilhouette.java, some of which will be in the next release.
In my experiments, internal evaluation measures were badly misleading. For example, with the silhouette coefficient, the labeled "solution" would often score a negative silhouette (i.e. worse than not clustering at all).
Also, these measures do not scale: the silhouette coefficient is O(n^2) to compute, which usually makes the evaluation more expensive than the actual clustering!
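A rough illustration of that quadratic cost (scikit-learn here rather than ELKI; data and labels are random placeholders):

```python
import time
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
for n in (1_000, 2_000, 4_000):
    X = rng.normal(size=(n, 8))            # placeholder data
    labels = rng.integers(0, 6, size=n)    # placeholder clustering
    t0 = time.perf_counter()
    silhouette_score(X, labels)            # needs all pairwise distances
    print(n, round(time.perf_counter() - t0, 3), "s")  # ~4x per doubling of n
```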
We do appreciate contributions!
You are more than welcome to contribute your favorite evaluation measure to ELKI, to share with others.