Retrieving sequential documents based on _id - mongodb

I've got a scenario where documents are indexed in Elasticsearch, and I need to retrieve the matched document in mongo along with the preceding and following documents as sorted by a timestamp. The idea is to retrieve the context of the document along with the original document.
I am able to do this successfully now if I use a sequential _id. As an example, using the following data:
[
  { _id: 1, value: 'Example One' },
  { _id: 2, value: 'Example Two' },
  { _id: 3, value: 'Example Three' },
  { _id: 4, value: 'Example Four' },
  { _id: 5, value: 'Example Five' },
  { _id: 6, value: 'Example Six' },
  ...
]
If I search for 'Four' in ES, I get back the document _id of 4. Since the ids are sequential, I can create a mongo query to pull a range between id - 2 and id + 2, in this case 2 to 6. This works well as long as I never delete documents. When I delete a document I'll have to re-index the entire series to eliminate the gap. I'm looking for a way of achieving the same results while being able to delete documents without having to update all of the remaining documents.
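For reference, the range approach currently in use looks something like this (a sketch; matchedId is the _id returned by Elasticsearch):
// Pull the matched document plus two documents on either side.
db.collection.find({ _id: { $gte: matchedId - 2, $lte: matchedId + 2 } }).sort({ _id: 1 })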
I'm open to using other technologies to achieve this; I am not necessarily tied to MongoDB.

I can get the desired results using something like the following:
collection.find( {_id: { $gte: matchedId } } ).limit(3);
collection.find( {_id: { $lt: matchedId } } ).sort({$natural: -1}).limit(2);
Not quite as nice as using an explicit range, but no need to recalculate anything on document deletion.
Yes, I am aware of the limitations of natural order, and it is not a problem for my particular use case.

This problem has nothing to do with MongoDB in particular and is no different from using another database here (e.g. an RDBMS). You will have to query for document ids smaller/larger than the current id and take the first two matches on either side. Yes, this means that you need to perform multiple queries. The only other option is to implement a linked list on top of MongoDB where each document stores pointers to its left and right neighbour nodes. And yes, in case of a deletion you need to adjust those pointers (basic data-structure algorithms). The downside is that you will need multiple operations in order to perform the changes, and since those operations are not transactional in MongoDB you may run into inconsistent previous/next pointers; that's why MongoDB completely sucks here.
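A minimal sketch of that linked-list idea (the prev/next field names are illustrative assumptions, and the pointer fix-up on delete is two separate, non-atomic updates):
// Example document shape: { _id: 4, value: 'Example Four', prev: 3, next: 5 }

// Fetch the matched document plus one neighbour on each side (repeat to widen the context):
var doc = db.coll.findOne({ _id: matchedId });
var before = doc.prev != null ? db.coll.findOne({ _id: doc.prev }) : null;
var after = doc.next != null ? db.coll.findOne({ _id: doc.next }) : null;

// Deleting a document means re-linking its neighbours before removing it:
function removeDoc(id) {
    var d = db.coll.findOne({ _id: id });
    if (d.prev != null) db.coll.update({ _id: d.prev }, { $set: { next: d.next } });
    if (d.next != null) db.coll.update({ _id: d.next }, { $set: { prev: d.prev } });
    db.coll.remove({ _id: id });
}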

How to speed up a query which uses several string filters?

I have a collection in MongoDB 3.4 that stores the contacts of all users of an application. Each contact has a large list of string fields (100+). I use MongoDB, but the question is valid for any other engine (MySQL, Elasticsearch, etc.).
Almost all the queries to retrieve contacts have the same four base conditions (user_id, base_field1, base_field2, base_field3), so I created a compound index with those fields to improve the queries. The base query looks like this:
db.contacts.find({
  user_id: 1434,
  base_field1: {$in: [0, 10]},
  base_field2: true,
  base_field3: "some value"
}).limit(10)
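For reference, the compound index mentioned above would be created along these lines (a sketch; the exact key order is an assumption):
db.contacts.createIndex({ user_id: 1, base_field1: 1, base_field2: 1, base_field3: 1 })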
The execution time of the base query is good (less than 2 seconds), but keep in mind that there are 25K contacts that match the base conditions.
However, the application lets the user filter contacts by any other field and even add any number of filters. All the filters use a contains operator, so the query looks like:
db.contacts.find({
  user_id: 1434,
  base_field1: {$in: [0, 10]},
  base_field2: true,
  base_field3: "some value",
  field4: {$regex: "foobar", $options: "i"},
  field5: {$regex: "foobar", $options: "i"},
  field6: {$regex: "foobar", $options: "i"},
  ...
}).limit(10)
So the execution time is not good enough (between 9 and 10 seconds) for our requirements. Also, as you can expect, increasing the number of filters increases the execution time, so:
Is there any way to speed up the query from a design and query point of view?
Is there any other DB engine better suited than MongoDB for this kind of query?
Please take into account the following comments and restrictions before replying:
A text index is useless here: if I create a compound text index with all the possible fields but the user filters only by field4 contains "foobar", then the result might include contacts that contain "foobar" in field5.
Simply creating a compound index with more than 31 fields is not possible in MongoDB.
Creating a simple index for each field doesn't make sense because when the user filters by several fields, only one index will be used by MongoDB. Also, you can create only 64 indexes per collection.
I actually use a MongoDB sharded cluster with a hashed shard key (user_id), but for the sake of simplification I reduced the problem to the scope of a single shard; the problem exists even when I add a shard per user.
Edit: I changed the OR conditions (field4 OR field5 ...) to AND conditions, which is the real case.

MongoDB indexing on variable query

I have a collection of user-generated posts. They contain the following fields:
_id: String
groupId: String // id of the group this was posted in
authorId: String
tagIds: [String]
latestActivity: Date // updated whenever someone comments on this post
createdAt: Date
numberOfVotes: Number
...some more...
My queries always look something like this...
Posts.find({
  groupId: {$in: [...]},
  authorId: 'xyz', // only SOMETIMES included
  tagIds: {$in: [...]}, // only SOMETIMES included
}, {
  sort: {latestActivity/createdAt/numberOfVotes: +1/-1, _id: -1}
});
So I'm always querying on the groupId, but only sometimes adding tagIds or authorId. I'm also switching out the field on which this is sorted. What would my best indexing strategy look like?
From what I've read so far here on SO, I would probably create multiple compound indices and have them always start with {groupId: 1, _id: -1} - because they are included in every query, they are good prefix candidates.
Now, I'm guessing that creating a new index for every possible combination wouldn't be a good idea memory wise. Therefore, should I just keep it like that and only index groupId and _id?
Thanks.
You are going in the right direction. With compound indexes, you want the most selective fields on the left and the ranges on the right. {groupId: 1, _id: -1} satisfies this.
It's also important to remember that compound indexes are used when the keys appear in the query from left to right, so one compound index can cover many common scenarios. If, for example, your index was {groupId: 1, authorId: 1, tagIds: 1} and your query was Posts.find({groupId: {$in: [...]}, authorId: 'xyz'}), that index would get used even though tagIds is absent. Posts.find({groupId: {$in: [...]}, tagIds: {$in: [...]}}) could also use this index (only the groupId prefix can be matched efficiently, since authorId is skipped, but if Mongo doesn't find a more specific index this one will still be used). However, Posts.find({authorId: 'xyz', tagIds: {$in: [...]}}) would not use the index because the first field of the index is missing from the query.
Given all of that, I would suggest starting with {groupId: 1, authorId: 1, tagIds: 1, _id: -1}. groupId is the only non-optional field in your queries, so it goes on the left, before the optional ones. It looks like authorId is more selective than tagIds, so it should go to the left of tagIds, right after groupId. You're sorting by _id, so that goes on the right. Be sure to analyze query performance (e.g. with explain()) on the different ways you query the data, and make sure they are all choosing this index; otherwise you'll need more tweaks or possibly a second compound index. You could then create other indexes and force the query to use them (with hint()) to do some A/B testing on performance.
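A minimal sketch of creating the suggested index and checking which plan a query shape picks (assuming a posts collection in the shell; adapt the name to your driver):
db.posts.createIndex({ groupId: 1, authorId: 1, tagIds: 1, _id: -1 })

// Verify the chosen index for one of the query shapes:
db.posts.find({ groupId: { $in: ["g1", "g2"] }, authorId: "xyz" }).explain("queryPlanner")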

Mongodb findAndModify on an array

I have an array of objects:
[{
  _id: 1,
  data: 'one'
}, {
  _id: 2,
  data: 'two'
}]
I am receiving a new array every so often. Is there a way to shove all the data back into mongo (without dups) in bulk?
I.e. I know that I can loop over each element and do a findAndModify (with upsert: true for new records coming in). But I can't do an insert with the array each time because the ids will collide.
At least in the shell, if you try to insert the whole array in one step it cycles through each element of the array and works, so the instruction:
db.coll.insert([{ _id: 1, data: 'one' },{ _id: 2, data: 'two' }])
Works and inserts two different records.
The _id check also works: if you try it again you will receive a duplicate key error, as expected.
Anyway, there's a downside: Mongo really does cycle through every single record, and if you try something like:
db.coll.insert([{ _id: 1, data: 'one again' },{ _id: 5, data: 'five' }])
It won't work for skipping duplicates, because Mongo stops at the first record (the duplicate _id: 1) and the second one is never processed.
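Note, as a side point the answer above does not rely on: the shell's insert also accepts an ordered option, and with ordered: false the remaining documents are still processed after a duplicate key error, which gets closer to the "without dups" behaviour asked about:
db.coll.insert([{ _id: 1, data: 'one again' }, { _id: 5, data: 'five' }], { ordered: false })
// The duplicate _id: 1 is reported as an error, but _id: 5 is still inserted.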
There are a couple of other tricks, for example inserting this into a collection as a single document with a field named "data" and processing it faster; however, you're always limited to 16 MB per document, and if your bulk data is too large no method will work.
If you use mongoimport it's possible to use the --jsonArray parameter, but you're still limited to 16 MB.
There's no other way around it if you need larger chunks of data.

In MongoDB, which index would be more efficient? One that queries an array with two values, or one that uses an $or statement?

Let's say I have a document that looks like this:
{
  _id: ObjectId("5260ca3a1606ed3e76bf3835"),
  event_id: "20131020_NFL_SF_TEN",
  team: {
    away: "SF",
    home: "TEN"
  }
}
I want to query for any game with "SF" as the away team or home team. So I put an index on team.away and team.home and run an $or query to find all San Francisco games.
Another option:
{
  _id: ObjectId("5260ca3a1606ed3e76bf3835"),
  event_id: "20131020_NFL_SF_TEN",
  team: [
    {
      name: "SF",
      loc: "AWAY"
    },
    {
      name: "TEN",
      loc: "HOME"
    }
  ]
}
In the array above, I could put an index on team.name instead of two indexes as before. Then I would query team.name for any game with "SF" inside.
Which query would be more efficient? Thanks!
I believe that you would want to use the second example you gave with the single index on team.name.
There are some special considerations that you need to know when working with the $or operator. Quoting from the documentation (with some additional formatting):
When using indexes with $or queries, remember that each clause of an $or query will execute in parallel. These clauses can each use their own index.
db.inventory.find ( { $or: [ { price: 1.99 }, { sale: true } ] } )
For this query, you would create one index on price: db.inventory.ensureIndex({ price: 1 }),
and another index on sale: db.inventory.ensureIndex({ sale: 1 }),
rather than a compound index.
Taking your first example into consideration, it doesn't make much sense to index a field that you are not going to query specifically. Since you don't mind whether SF is playing an away game or a home game, you would always include both the away and home fields in your query, so you're using two indexes where all you really need to query is one value: SF.
It seems appropriate to mention at this stage that you should always consider the majority of your queries when thinking about the format of your documents. Think about the queries that you are planning to make most often and build your documents accordingly. It's always better to handle 80% of the cases as best you can rather than trying to solve all the possibilities (which might lead to worse performance overall).
Looking at your second example, of nested documents, as you said, you would only need to use one index (saving valuable space on your server).
Some more relevant quotes from the $or docs (again with added formatting):
Also, when using the $or operator with the sort() method in a query, the query will not use the indexes on the $or fields. Consider the following query which adds a sort() method to the above query:
db.inventory.find ({ $or: [{ price: 1.99 }, { sale: true }] }).sort({item:1})
This modified query will not use the index on price nor the index on sale.
So the question now is: are you planning to use the sort() method? If the answer is yes, then you should be aware that your indexes might turn out to be useless! :(
The take-away from this is pretty much "it depends!". Consider the queries you plan to make, and consider what document structure and indexes will be most beneficial to you according to your usage projections.
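To make the array option concrete, here is a minimal sketch in the shell (the games collection name is illustrative; createIndex is the current name for ensureIndex):
db.games.createIndex({ "team.name": 1 })

// A single equality query on the multikey index finds SF games, home or away:
db.games.find({ "team.name": "SF" })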

Mongodb update limited number of documents

I have a collection with 100 million documents. I want to safely update a number of the documents (by safely I mean update a document only if it hasn't already been updated). Is there an efficient way to do it in Mongo?
I was planning to use the $isolated operator with a limit clause but it appears mongo doesn't support limiting on updates.
This seems simple but I'm stuck. Any help would be appreciated.
Per Sammaye, it doesn't look like there is a "proper" way to do this.
My workaround was to create a sequence as outlined on the mongo site and simply add a 'seq' field to every record in my collection. Now I have a unique field which is reliably sortable to update on.
Being reliably sortable is important here. I was going to just sort on the auto-generated _id, but I quickly realized that natural order is NOT the same as ascending order for ObjectIds (from this page it looks like the string value takes precedence over the object value, which matches the behavior I observed in testing). Also, it is entirely possible for a record to be relocated on disk, which makes natural order unreliable for sorting.
So now I can query for the record with the smallest 'seq' that has NOT already been updated to get an inclusive starting point. Next I query for records with 'seq' greater than my starting point and skip the number of records I want to update (it is important to use skip rather than arithmetic, since 'seq' may be sparse if you remove documents, etc.). Put a limit of 1 on that query and you've got a non-inclusive endpoint. Now I can issue an update with a query of 'updated' = 0 and 'seq' >= my starting point and < my endpoint. Assuming no other thread has beaten me to the punch, the update should give me what I want.
Here are the steps again:
create an auto-increment sequence using findAndModify (see the sketch after these steps)
add a field to your collection which uses the auto-increment sequence
query to find a suitable starting point: db.xx.find({ updated: 0 }).sort({ seq: 1 }).limit(1)
query to find a suitable endpoint: db.xx.find({ seq: { $gt: startSeq }}).sort({ seq: 1 }).skip(updateCount).limit(1)
update the collection using the starting and ending points: db.xx.update({ updated: 0, seq: { $gte: startSeq, $lt: endSeq }, $isolated: 1 }, { $set: { updated: 1 } }, { multi: true })
Pretty painful but it gets the job done.
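For reference, step 1 typically follows the counters-collection pattern from the MongoDB docs; a minimal sketch (the counter collection and field names here are illustrative):
// One counter document per sequence.
db.counters.insert({ _id: "record_seq", value: 0 })

// Atomically increment and return the next value.
function getNextSeq(name) {
    var counter = db.counters.findAndModify({
        query: { _id: name },
        update: { $inc: { value: 1 } },
        new: true
    });
    return counter.value;
}

// Stamp every existing record with its own seq value.
db.xx.find({ seq: { $exists: false } }).forEach(function (doc) {
    db.xx.update({ _id: doc._id }, { $set: { seq: getNextSeq("record_seq") } });
});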