I am starting to use the MongoDB C# driver, but have run into a slight issue.
So I have a document with two embedded collections (of distinct types). I want to search on fields of both of these collections, but I have discovered that if I try to index the searchable fields of the two collections together I get "cannot index parallel arrays". Reading the MongoDB documentation on multikey indexes, I discovered that this is indeed a limitation.
My question is: what is the normal workaround for this issue? I can't really combine these collections, since they are pretty distinct. What pattern should I follow?
public class Capture
{
    [BsonId]
    public Guid Id { get; set; }

    // ... some other fields

    public IList<CustomerInformation> CustomerInformations { get; set; }
    public IList<VehicleLicenseDisk> VehicleLicenseDisks { get; set; }
}
Before talking about possible workarounds, I just want to highlight why MongoDB enforces this restriction on indexing parallel arrays. When you index an array in MongoDB, it creates a multikey index with one key per array element. Therefore, if you create a compound index on two arrays, one with M distinct values and one with N distinct values, the index essentially has M×N keys. This is very bad: it's nonlinear in the number of distinct array elements. Consider the amount of work it takes to maintain such an index when you add or remove array elements.
OK, justification aside: to work around this restriction, it will help to be on the current MongoDB version (2.6), which supports index intersection. You can create one index on CustomerInformations and another on VehicleLicenseDisks, and MongoDB can intersect the two indexes to serve queries that have restrictions on both.
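For example, something like this in the shell (field names are made up):
// Two separate single-field (multikey) indexes -- each is legal on its own.
db.captures.createIndex({ "CustomerInformations.Name": 1 })
db.captures.createIndex({ "VehicleLicenseDisks.DiskNumber": 1 })
// A query constraining both fields can then be served by intersecting the two
// indexes; .explain() shows an AND_SORTED or AND_HASH stage when that happens.
db.captures.find({
    "CustomerInformations.Name": "Smith",
    "VehicleLicenseDisks.DiskNumber": "ABC123"
}).explain()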
If you are, for whatever reason, stuck with MongoDB < 2.6, then your options are either to consider redesigning the schema or to depend on indexes that use at most one of the array fields.
It will help if you think about MongoDB schema concerns in terms of MongoDB -- not in terms of programming-language objects. Do the arrays really need to be arrays? Can they be replaced with concrete field names? In your case, why is CustomerInformations an array? What does the Capture object really represent? You might have to split out, for example, CustomerInformation into a separate collection where each record contains a link/reference back to the Capture document it belongs to, as sketched below. I don't know the details of what you are trying to model, but whatever it is, forget about object-oriented programming and put on a MongoDB hat -- objects will come later.
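A rough sketch of that split (field names invented; parentCaptureId stands in for the owning Capture's _id):
// CustomerInformation documents live in their own collection and hold a
// reference back to the Capture they belong to.
db.customerInformations.insert({
    captureId: parentCaptureId,
    name: "Smith"
})
// With no arrays involved, an ordinary compound index is allowed again:
db.customerInformations.createIndex({ name: 1, captureId: 1 })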
Related
I have my index and a virtual index. On my main index, a query like:
{
    "facetFilters": [["objectID:12345", "tag:Luxury", "tag:Makeup"]], // 12345 OR Luxury OR Makeup
    "optionalFilters": "objectID:12345" // put 12345 first
}
will return all documents that have the given object ID, the tag Luxury, or the tag Makeup, and puts the object with ID "12345" first. It behaves as expected.
But when I run the same query on my virtual index, it only returns the document with the given ID "12345". So it behaves like a filter, whereas the docs say:
https://www.algolia.com/doc/guides/managing-results/rules/merchandising-and-promoting/in-depth/optional-filters/
Unlike filters, optional filters don’t remove records from your search results when your query doesn’t match them. Instead, they divide your records into two sets: the results that match the optional filter, and the ones that don’t.
Weird. I just set this up and am seeing the same results. I don't see anything in the docs that would explain why the behavior would be different, so I'm reaching out to some engineering colleagues to see what's going on.
UPDATE:
Algolia virtual replicas and optionalFilters both do out-of-band sorting of results at query time. It looks like those two features cause strangeness when they both try to do their sort. I've cut a ticket on this, but for now, to get the expected results you'll want to use a standard replica with optionalFilters -- the standard replica does index-time sorting, and the optionalFilters can then layer their query-time filtering on top of it.
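Something like this with the JavaScript API client (the app ID, API key, and index name are placeholders):
// Query the standard replica (index-time sorting) instead of the virtual one.
const algoliasearch = require('algoliasearch');
const client = algoliasearch('YourAppID', 'YourSearchOnlyAPIKey');
const index = client.initIndex('products_standard_replica');

index.search('', {
    facetFilters: [['objectID:12345', 'tag:Luxury', 'tag:Makeup']], // OR group
    optionalFilters: ['objectID:12345'] // boost 12345 to the top
}).then(({ hits }) => console.log(hits.map(h => h.objectID)));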
Example:
{
    shortName: "KITT",
    longName: "Knight Industries Two Thousand",
    fromZeroToSixty: 2,
    year: 1982,
    manufacturer: "Pontiac",
    /* 25 more fields */
}
I need the ability to query by at least 20 fields, which means that only 10 fields are left unindexed.
There are 3 fields (all numbers) that could be used for sorting (both ways).
This leaves me wondering how sites with lots of searchable fields do it: e.g. real estate or car sale sites, where you can filter by every small detail and choose between several sort options.
How could I pull this off with MongoDB? How should I index that kind of collection?
I'm aware that there are databases made specifically for searching, but there must be general rules of thumb for doing this (even if less performantly) in any database. I'm sure not everybody uses Elasticsearch or similar.
---
Optional reading:
My reasoning is that the index could be huge, but index order matters. You'd always make sure that the fields that return the fewest results come first in the index and the most generic fields come last. However, what if the user chooses only generic fields? Should I include non-generic fields in the query anyway? How do I handle sorting in both directions? Or does index intersection save the day, so I should just add 20 different indexes?
text index is your friend.
Read up on it here: https://docs.mongodb.com/v3.2/core/index-text/
In short, it's a way to tell MongoDB that you want full-text search over a specific field, multiple fields, or all fields (yay!)
To allow text indexing of all fields, use the special symbol $** and define it with type 'text':
db.collection.createIndex( { "$**": "text" } )
You can also configure it with case insensitivity or diacritic insensitivity, and more.
To perform text searches using the index, use the $text query helper, see: https://docs.mongodb.com/v3.2/reference/operator/query/text/#op._S_text
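For example, to return matches ranked by relevance (the search terms are arbitrary):
// Find documents matching the text query, best matches first.
db.collection.find(
    { $text: { $search: "luxury makeup" } },
    { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })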
Update:
To allow the user to select specific fields to search on, it's possible to use weights when creating the text index: https://docs.mongodb.com/v3.2/core/index-text/#specify-weights
If you carefully select your fields' weights, for example using only distinct prime numbers, and then add the $meta textScore to your results, you may be able to figure out from the "textScore" which field was matched by the query, and so filter out the results that didn't get a hit on a selected search field.
Read more here: https://docs.mongodb.com/v3.2/tutorial/control-results-of-text-search/
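A sketch of the idea, reusing the car document from the question (the weights are only illustrative):
// Distinct prime weights per text-indexed field.
db.cars.createIndex(
    { shortName: "text", longName: "text", manufacturer: "text" },
    { weights: { shortName: 2, longName: 3, manufacturer: 5 } }
)
// textScore is a weighted sum, so carefully chosen weights let you infer
// which field produced the hit and drop results from unselected fields.
db.cars.find(
    { $text: { $search: "Knight" } },
    { score: { $meta: "textScore" } }
)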
How costly is it to index some fields in MongoDB?
I have a collection where I want uniqueness over the combination of two fields. Everywhere I searched, the suggestion was a compound index with unique set to true. But what I was doing instead was appending the two values as field1_field2 and making that the key, so that field2 is always unique for a given field1 (plus application logic), because I thought indexing was costly.
Also, since the MongoDB documentation advises against custom ObjectIds such as auto-incrementing numbers, I ended up giving big numbers to models like Classes, Students, etc. (where I could easily have used 1, 2, 3 in SQLite), and I didn't think to add a new field for numbering and index that field for querying.
What is the best-practice advice for production?
The advantage of compound indexes over your own concatenated-field system is that compound indexes allow quicker sorting than regular indexed fields. They also lower the size of every document.
In your case, if you want to get the documents sorted with field1 ascending and field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value in field1_field2, it does not really matter whether you use a compound index or a regular indexed field.
However, if you already have field1 and field2 as separate fields in your documents, and you also have a field containing field1_field2, it would be better to use a compound index on field1 and field2 and simply delete the field containing field1_field2. This lowers the size of every document and ultimately reduces the size of your database.
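For example:
// One compound unique index replaces the concatenated field1_field2 key;
// duplicate (field1, field2) pairs are then rejected by the database itself.
db.collection.createIndex({ field1: 1, field2: -1 }, { unique: true })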
Regarding the cost of indexing: you almost have to index field1_field2 if you want to go down that route anyway, because queries on unindexed fields in MongoDB are really slow. And adding a document does not take much more time when the document has an indexed field (we're talking 1 millisecond or so). Note that building an index over many existing documents can take a few minutes; this is why you usually plan the indexing strategy before inserting any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!
We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, it doesn't seem like we can actually create those indexes, because each query involves multiple arrays and Mongo restricts indexing parallel arrays. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes serving range queries: after the first range in the query, the value of the additional index fields drops significantly.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees from the query. The first range chops off a large branch, the second a smaller one, the third smaller still, etc. My general rule of thumb is that only the first range from the query is of value in the index.
The caveat to that rule is that additional fields in the index can be useful to aid sorting returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte). For the integer and float it is hard to tell without knowing the domain.
If the second query's two additional attributes also use ranges and do not have a significantly higher exclusion value, then I would just work with the one index.
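A rough sketch with invented field names (keeping in mind that, per the parallel-arrays restriction, only one array field can appear in any single compound index):
// Pair the more selective array with the most exclusive range field...
db.docs.createIndex({ arrayFieldA: 1, intField: 1 })
// ...and index the other array separately; the planner can intersect if useful.
db.docs.createIndex({ arrayFieldB: 1 })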
Rob.
I have collections with a huge number of Documents on which I need to do custom searches with various different queries.
Each Document has a boolean property. Let's call it "isInTop".
I need to show Documents which have this property first in all queries.
Yes, I can easily sort on this field like:
.sort( { isInTop: -1 } );
And create a proper index with the field "isInTop" as the last field in it. But this will work slowly, as indexes in Mongo work best with unique fields.
So is there a solution to show Documents with the field "isInTop" at the top of each query?
I see two solutions here.
First: give the Documents which need to be on top an _id from the "future". As you know, ObjectId contains a timestamp, so I can create an ObjectId with a timestamp from the future and use natural order.
Second: create a separate collection for Documents which need to be on top, and query it first.
Are there any other solutions for this problem? Which will work faster?
UPDATE
I ended up solving this by sorting on a custom field which represents rank.
Using the _id field trick you mention has the problem that at some point in time you will reach the special time, and you can't change the _id field (without inserting a new document and removing the old one).
Creating a special collection which just holds the ones you care about is probably the best option. It gives you the ability to logically (and to some extent, physically) separate the documents.
Newly introduced in MongoDB, there is also support for a "sparse" index, which may fulfill your needs as well. You could set the "isInTop" field only when you want a document to be special, and then create a sparse index on it, which would not have the problems you would normally have with a single indexed boolean field (in btrees).
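For example (collection name invented):
// Set isInTop only on the documents that should float to the top, then:
db.docs.createIndex({ isInTop: 1 }, { sparse: true })
// Only documents that actually carry the field appear in the index, so the
// low-cardinality boolean no longer bloats a full btree.
db.docs.find({ isInTop: true })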