Using a Bloom filter for composite objects

As far as I understand, a Bloom filter can tell with a 100% guarantee that an element does not exist in the set. But it might claim, with some small probability (say 1%), that an element exists when in fact it doesn't.
But can it be used for complex objects and keys, not just a single password, id, or name? For example, assuming I have millions of objects with distinctive characteristics (id, name, some other fields), can I use a Bloom filter to check object non-existence with all those characteristics at the same time?

Sure, you can. You have multiple choices:
1. Combine all those fields (id, name, some other field) into one combined key, and calculate the hash functions from that combined key. A short sketch of this option follows below.
2. Maintain a separate Bloom filter for each field (one Bloom filter for the id, another for the name, one for the other field). When querying, query each Bloom filter separately. The object is most likely in the set only if every Bloom filter returns yes. If one or more of them returns no, then the object is definitely not in the set. This also allows you to query when you only have partial information about the object.
3. Use a combination of the two, for example one Bloom filter for the id and one for the combination of name and other field.
Of course, having multiple Bloom filters uses more memory.
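
A minimal sketch of option 1 in Python; the filter sizing is arbitrary, and id/name/other are placeholders for whatever distinctive fields your objects have:

import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from salted SHA-256."""

    def __init__(self, m_bits=1 << 20, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Option 1: combine the fields into one key. The separator avoids ambiguous
# concatenations such as ("ab", "c") colliding with ("a", "bc").
def composite_key(obj):
    return "\x1f".join(str(obj[f]) for f in ("id", "name", "other"))

bf = BloomFilter()
bf.add(composite_key({"id": 42, "name": "Rocket", "other": "pilot"}))

print(bf.might_contain(composite_key({"id": 42, "name": "Rocket", "other": "pilot"})))  # True: probably in the set
print(bf.might_contain(composite_key({"id": 7, "name": "Groot", "other": "tree"})))     # False: definitely not in the set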

Related

What does the distinct on clause mean in Cloud Datastore and how does it affect the reads?

This is what the Cloud Datastore documentation says, but I'm having a hard time understanding what exactly it means:
A projection query that does not use the distinct on clause is a small operation and counts as only a single entity read for the query itself.
Grouping
Projection queries can use the distinct on clause to ensure that only the first result for each distinct combination of values for the specified properties will be returned. This will return only the first result for entities which have the same values for the properties that are being projected.
Let's say I have a table for questions and I only want to get the question text sorted by the created date. Would this be counted as a single read and the rest as small operations?
If your goal is just to project the date and text fields, you can create a composite index on those two fields. Since you are not trying to de-duplicate (no distinct on), the projection query is a small operation, and the query itself counts as a single entity read.
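
A minimal sketch with the Python google-cloud-datastore client, assuming a Question kind with text and created properties (the names are placeholders) and a composite index covering both, declared in index.yaml:

from google.cloud import datastore

client = datastore.Client()

# Projection query without "distinct on": the query itself counts as a
# single entity read, and the projected results are small operations.
query = client.query(kind="Question")
query.projection = ["text", "created"]
query.order = ["created"]

for entity in query.fetch():
    print(entity["created"], entity["text"])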

In Algolia, how do you construct records to allow for alphabetical sorting of query results?

As far as I know, you can only sort on numeric fields in Algolia, so how do you efficiently set up your records to allow for results to be returned alphabetically based on a specific string field?
For example, let's say in each record in an index you have a field called "title" that contains an arbitrary string value. How would you create a sibling field called "title_sort" that contains a number that allows the results to be sorted such that the records come out in alphabetical order by "title"? Is there a particularly well-accepted algorithm for creating such a number from the string in "title"?
If you have a static dataset, then you can just sort your data and put an index on it. This works as long as you are willing to re-sort the data every time you update your indices.
I'm also thinking that if you can deal with a partial sorting, meaning that you can accept orc < orb as long as or < os, then you could treat the first few characters of the string as digits in a base-64 number and use that as your index. You can then sort on as many characters as you have numeric precision for. It's only a partial sorting, but it might be acceptable for your use case. You just need to choose the base64 -> base10 mapping so that the numeric order accommodates the desired string order.
Additionally, if you don't care about the difference between capital and lowercase letters, then you can do base26 -> base10. The more I think about this the more limited it is, but it might work for your use case.
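
A sketch of the base-26 variant in Python; "title_sort" is the hypothetical numeric sibling field, and 8 characters of precision is an arbitrary choice (27^8 stays well under 2^53, so the value survives as a double):

def title_sort(title, n=8):
    """Encode the first n letters as a base-27 integer (0 = padding,
    1..26 = 'a'..'z') so that numeric order matches alphabetical order."""
    digits = []
    for ch in title.lower():
        if "a" <= ch <= "z":          # ASCII letters only, for the sketch
            digits.append(ord(ch) - ord("a") + 1)
        if len(digits) == n:
            break
    digits += [0] * (n - len(digits))  # pad short titles; "or" sorts before "orb"
    value = 0
    for d in digits:
        value = value * 27 + d
    return value

assert title_sort("Apple") < title_sort("Banana")
assert title_sort("or") < title_sort("os")
# Only a partial order: titles agreeing on the first 8 letters collide.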

MongoDB Aggregate Framework - Grouping with Multiple Fields in _id

Before marking this question as a duplicate - please read through. I don't think a sufficiently conclusive and general answer has been given yet, as most questions have focused on specific examples.
The MongoDB documentation says that you can specify an aggregate key for the _id value of a $group operation. There are a number of previously answered questions about using MongoDB's aggregation framework to group over multiple fields in this way, i.e.:
{$group: {_id:{field_a:'$field_a', field_b:'$field_b'} } }
Q: In the most general sense, what does this action do?
If grouping documents by field A condenses any documents sharing the same value of field A into a single document, does grouping by fields A and B condense documents with matching values of both A and B into a single document?
Is the grouping operation sequential?
If so, does that imply any level of precedence between 'field_a' and 'field_b' depending on their ordering?
If grouping documents by field A condenses any documents sharing the same value of field A into a single document, does grouping by fields A and B condense documents with matching values of both A and B into a single document?
Yes: let A' = { a: A, b: B }; then grouping by fields A and B is just grouping by the single value A', so this follows automatically from your assumption. Note that you didn't make any assumption about the type of the grouping value, which is correct: the type doesn't matter. If the type is a document, the usual comparison rules apply (equal content is considered equal).
Is the grouping operation sequential?
I'm not sure what that means. The aggregation pipeline runs accumulator functions on all items in each stage, so it certainly iterates the entire set, but I'd refrain from making assumptions about the exact order in which that happens, i.e. I'd avoid performing any non-associative operations.
If so, does that imply any level of precedence between 'field_a' and 'field_b' depending on their ordering?
No, documents are compared field-by-field, and there are no strict guarantees on the ordering of fields ("attempts to...") in MongoDB. However, one can, in principle, create documents that contain multiple fields of the same name, where the ordering might matter. But it's hard to do so, since most client interfaces don't allow different fields of equal name.
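
A minimal pymongo sketch of the compound $group key; the collection and field names are placeholders taken from the question:

from pymongo import MongoClient

coll = MongoClient()["test"]["docs"]
coll.insert_many([
    {"field_a": 1, "field_b": "x"},
    {"field_a": 1, "field_b": "x"},   # same (A, B) pair: condensed with the one above
    {"field_a": 1, "field_b": "y"},   # same A, different B: its own group
])

pipeline = [
    {"$group": {"_id": {"field_a": "$field_a", "field_b": "$field_b"},
                "count": {"$sum": 1}}}
]
for doc in coll.aggregate(pipeline):
    print(doc)
# Prints, in some unspecified order:
#   {'_id': {'field_a': 1, 'field_b': 'x'}, 'count': 2}
#   {'_id': {'field_a': 1, 'field_b': 'y'}, 'count': 1}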

What should the indexing strategy be to support queries that are a combination of different fields?

Let's say I have a User collection, where a document looks like this:
{
  "name": "Starlord",
  "age": 24,
  "gender": "Male",
  "height": 180,
  "weight": 230,
  "hobbies": "Flying Spaceships"
}
Now, I want someone to be able to search for User based on one or more of these fields. So I add a compound index containing all these fields in the order above.
The issue is that MongoDB indexing works great when the query fields are a prefix of the indexed fields. For example, if I query by name, age and gender then the performance of the query is great. If I query by name, gender and weight, then the performance of the query is not so great (although it still uses the index and is faster than no-index).
What indexing strategy do you use when you have a use case like this?
The reason why your query by name, age and gender works great while the query by name, gender and weight does not is that the order of the fields matters significantly for compound indexes in MongoDB, in particular the index's prefixes. As explained in this page of the documentation, a compound index can support queries on any prefix of its fields. Assuming you created the index on the fields in the order you presented them, the query for name, age and gender uses a prefix of your compound index, while name, gender and weight can only take advantage of the name part of the index. A sketch follows below.
Supporting all possible combinations of queries on these fields would require you to create enough compound indexes that every possible query is a prefix of some index. That is not something you would want to do. Since your question asks about indexing strategies for queries with multiple fields, I would suggest that you look into the specific data access patterns that are most useful for your data set and create a few compound indexes that support these, taking advantage of the prefixes concept and omitting fields with low cardinality, such as gender, from the index.
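
A pymongo sketch of the prefix rule, using the fields from the question's sample document:

from pymongo import ASCENDING, MongoClient

users = MongoClient()["test"]["users"]
users.create_index([("name", ASCENDING), ("age", ASCENDING),
                    ("gender", ASCENDING), ("height", ASCENDING),
                    ("weight", ASCENDING), ("hobbies", ASCENDING)])

# Served efficiently: {name, age, gender} is a prefix of the index.
fast = users.find({"name": "Starlord", "age": 24, "gender": "Male"})

# Only the leading "name" part of the index helps here, because
# {name, gender, weight} skips "age" and so is not a prefix.
slower = users.find({"name": "Starlord", "gender": "Male", "weight": 230})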
If you need to be able to query for all combinations, the number of indexes required explodes quickly. The feature that comes to the rescue is called "index intersection".
Create a simple index on each field and trust the query optimizer to perform the correct index intersection. This feature is relatively new (introduced in 2.6) and not as feature-complete as in the well-known RDBMSes. It makes sense to track the Jira ticket for index intersections to know the limitations, because the limitations are quite severe. It usually makes sense to carefully mix simple indexes (that can be intersected) and compound indexes (for very common queries).
In your specific case, you can utilize the fact that many fields are numeric and the range of valid values is very limited (e.g., for age, height and weight). The gender field has low selectivity and shouldn't be indexed in any case; filter on gender in the last step, because on average it will only double the amount of data that must be processed.
Creating n! compound indexes is almost certainly not an option for n > 3...
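
A sketch of that mixed strategy in pymongo; which fields get simple indexes versus a compound index is an assumption standing in for your real access patterns:

from pymongo import ASCENDING, MongoClient

users = MongoClient()["test"]["users"]

# Simple indexes: candidates for index intersection on ad-hoc combinations.
users.create_index([("age", ASCENDING)])
users.create_index([("height", ASCENDING)])
users.create_index([("weight", ASCENDING)])

# Compound index for a known hot query shape.
users.create_index([("name", ASCENDING), ("age", ASCENDING)])

# gender is deliberately left unindexed (low selectivity) and applied
# as a plain filter predicate.
cursor = users.find({"age": {"$gte": 21}, "weight": {"$gte": 200}, "gender": "Male"})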

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, it doesn't seem like we can create such indexes, because each query involves multiple arrays and MongoDB does not allow a compound index on two array-valued fields (parallel arrays). How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes serving queries with ranges, the value of the additional index fields after the first range in the query drops significantly.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees of candidates for the query. The first range chops off a large branch, the second a smaller one, the third an even smaller one, and so on. My general rule of thumb is that only the first range from the query gets real value out of the index.
The caveat to that rule is that additional fields in the index can still be useful to aid sorting of the returned results.
For the first query, I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte). For the integer and the float it is hard to tell without knowing the domain. A sketch of this layout follows below.
If the second query's two additional attributes also use ranges and do not have a significantly higher exclusion value, then I would just work with the one index.
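
A pymongo sketch of that layout; all field names are hypothetical stand-ins for the five attributes, and because of the parallel-arrays restriction mentioned in the question only one of the two array fields joins the compound index here:

from pymongo import ASCENDING, MongoClient

docs = MongoClient()["test"]["docs"]

# One array field plus the single most selective range field.
docs.create_index([("tags", ASCENDING), ("score", ASCENDING)])

# Array membership behaves like an equality match; "score" carries the one
# range the index really serves, the remaining predicates filter afterwards.
cursor = docs.find({
    "tags": "red",
    "codes": "x1",                 # second array: matched without index support
    "score": {"$gte": 10},         # the first (and only valuable) range
    "date": {"$lte": "2015-01-01"},
    "ratio": {"$gte": 0.5},
})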
Rob.