Optimize array query match with operator $all in MongoDB

In a collection of 130k documents with this structure:
{
    "tags": ["restaurant", "john doe"]
}
there are 40k documents with the "restaurant" tag but only 2 with "john doe", so the following queries perform very differently:
// 0.100 seconds (40,000 objects scanned)
{"tags": {$all: [/^restaurant/, /^john doe/]}}
// 0.004 seconds (2 objects scanned)
{"tags": {$all: [/^john doe/, /^restaurant/]}}
Is there a way to optimize the query without sorting the tags on the client? The only way I can think of right now is putting the less frequent tags at the start of the search array.

I found a feature request for this in the MongoDB team's JIRA:
https://jira.mongodb.org/browse/SERVER-1000
In the end I implemented a statistics system that puts the tags with the highest cardinality at the end of the array.
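A minimal sketch of what that could look like in the mongo shell, assuming a hypothetical tag_counts collection (e.g. { _id: "restaurant", count: 40000 }) maintained by that statistics system:
// Order the tags by frequency before building the $all query, so the
// most selective (least frequent) prefix regex comes first.
function buildAllQuery(tags) {
    var withCounts = tags.map(function (t) {
        var doc = db.tag_counts.findOne({ _id: t });    // hypothetical stats collection
        return { tag: t, count: doc ? doc.count : 0 };  // unknown tags treated as rare
    });
    withCounts.sort(function (a, b) { return a.count - b.count; });
    // Note: assumes the tags contain no regex metacharacters.
    return { tags: { $all: withCounts.map(function (c) { return new RegExp("^" + c.tag); }) } };
}
// db.collection.find(buildAllQuery(["restaurant", "john doe"]))
// produces {"tags": {$all: [/^john doe/, /^restaurant/]}}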

Related

MongoDB bulkWrite bad performance

I have a MongoDB collection that contains around 10M documents.
I am trying to update (upsert) around ~2,500 documents; each update is around 1 KB.
I tried using bulkWrite with ordered=false.
It took around 10 seconds, which is about 3-4 ms per document.
So I then updated the same ~2,500 documents with updateOne (iteratively).
I measured the average time per document and it was about 3.5 ms per update.
Why don't I get a better result with bulkWrite, and how can I improve the bulkWrite update time?
Example of a bulkWrite with 1 document:
db.collections.bulkWrite( [
    { updateOne :
        {
            "filter": { "Name": "someName", "Namespace": "someNs", "Node": "someNode", "Date": 0 },
            "update": { "$addToSet": { "Data": { "$each": ["1", "2"] } } },
            "upsert": true
        }
    }
] )
Document example:
{
    "Name": "someName",
    "Namespace": "someNs",
    "Node": "SomeNode",
    "Date": 23245,
    "Data": ["a", "b"]
}
I have a compound index that contains: Name, Namespace, Node, Date.
When I try to find documents I get good results.
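For reference, a compound index like the one described could be created as follows (the field order is assumed from the list above):
db.collections.createIndex({ "Name": 1, "Namespace": 1, "Node": 1, "Date": 1 })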
TL;DR: play around with your batch sizes to find the sweet spot.
Bulk writes or updateMany will be faster than single updates, simply because there is less chatter going on (fewer round trips). The amount of data being transferred is exactly the same.
What you need to do is find the optimum batch size that gives you the highest throughput based on your network, cluster config, etc.
Typically, if the batch size is too small you are not using it to its full potential, and if it is too big you spend too much time just transferring the batch to the database.
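A rough sketch of such an experiment in the mongo shell (the field names follow the example above; docs is a placeholder for your own update payloads):
// Time an unordered bulkWrite at a given batch size; returns total milliseconds.
function timeBatches(docs, batchSize) {
    var start = new Date();
    for (var i = 0; i < docs.length; i += batchSize) {
        var ops = docs.slice(i, i + batchSize).map(function (d) {
            return { updateOne: {
                filter: { Name: d.Name, Namespace: d.Namespace, Node: d.Node, Date: d.Date },
                update: { $addToSet: { Data: { $each: d.Data } } },
                upsert: true
            } };
        });
        db.collections.bulkWrite(ops, { ordered: false });
    }
    return new Date() - start;
}
// Try e.g. 100, 500, 1000, 2500 and keep the batch size with the best throughput.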

MongoDB compound shard key

I have a question about Mongo compound shard keys. Let's suppose I have a document structured like this:
{
    "players": [
        {
            "id": "12345",
            "name": "John"
        },
        {
            "id": "23415",
            "name": "Doe"
        }
    ]
}
The players embedded documents are always present, and there are always exactly 2 of them. I think that "players.0.id" and "players.1.id" should be a good choice as a shard key because they are not monotonic and are evenly distributed.
What I can't understand from the documentation is whether:
All documents with the same "players.0.id" OR the same "players.1.id" are supposed to be saved into the same chunk, or
All documents with the same "players.0.id" AND the same "players.1.id" are supposed to be saved into the same chunk.
In other words, if I query the collection to get all games played by John (as player 1 or player 2), will the query be sent to one chunk or to all chunks?
You cannot create a shard key where part of the key is a multikey index (i.e. index on an array field). This is mentioned in Shard Key Index Type:
A shard key index cannot be an index that specifies a multikey index, a text index or a geospatial index on the shard key fields.
If you have exactly two items under the players field, why not create two sub-documents instead of using an array? An array is typically useful for use cases where you have multiple items of indeterminate number in a document. For example, this structure might work for your use case:
{
    "players": {
        "player_1": {
            "id": 12345,
            "name": "John"
        },
        "player_2": {
            "id": 54321,
            "name": "Doe"
        }
    }
}
You can then create an index like:
> db.test.createIndex({'players.player_1.id':1, 'players.player_2.id':1})
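If you go this route, the collection can then be sharded on the same compound key (the database name "mydb" here is just illustrative):
sh.enableSharding("mydb")
sh.shardCollection("mydb.test", { "players.player_1.id": 1, "players.player_2.id": 1 })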
To answer your questions, if you're using this shard key, then:
There is no guarantee that the same player_1.id and player_2.id will be on the same chunk. This will depend on your data distribution.
If you query John as player_1 OR player_2, the query will be sent to all shards. This is because you have a compound index as the shard key, and you're searching for an exact match on the non-prefix field.
To elaborate on question 2:
The query you're doing is this:
db.test.find({$or: [
{'players.player_1.id':123},
{'players.player_2.id':123}
]})
In a compound index, the index is first sorted by player_1.id; then, within each player_1.id, the player_2.id values are sorted. For example, if you have 10 documents with some combination of values for player_1.id and player_2.id, you can visualize the index like this:
player_1.id | player_2.id
------------|-------------
          0 | 10
          0 | 123
          1 | 100
          1 | 123
          2 | 123
          2 | 150
        123 | 10
        123 | 100
        123 | 123
        123 | 150
Note that the value player_2.id: 123 occurs multiple times in the table, once for each player_1.id. Also note that within each player_1.id value, the player_2.id values are sorted.
This is how MongoDB's compound index works and how it's sorted. There are more nuances to compound indexes than can be covered here, but the details are explained on the Compound Indexes page.
The effect of this ordering is that there are many identical player_2.id values spread across the index. Since the overall index is sorted only by player_1.id, it is not possible to find an exact player_2.id without also specifying player_1.id. Hence, the above query will be sent to all shards.
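To illustrate the difference: a query that includes the shard key prefix can be targeted, while a query on only the non-prefix field cannot (you can confirm which shards were contacted with explain()):
// Targeted: includes the prefix field, so mongos routes it to the owning chunk(s)
db.test.find({ "players.player_1.id": 123 })
// Scatter-gather: only the non-prefix field, so every shard must be asked
db.test.find({ "players.player_2.id": 123 })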

MongoDB: How to get all the documents with a specific value in a field?

I have an array of 10 unique ObjectIds named Arr.
I have 10,000 documents in a collection named xyz.
How can I find the documents matching the ObjectIds in the array Arr from the collection xyz with only one request?
There are the $all and $in operators, but they are used to query fields that contain an array.
Or do I need to make as many requests as the length of Arr and fetch each document individually using findOne?
EDIT:
I'm expecting something like this:
db.getCollection("xyz").find({"_id" : [array containing 10 unique IDs]})
...where the result callback will contain an array of all the documents matching the IDs in the query array.
According to the documentation here: https://docs.mongodb.com/manual/reference/operator/query/in/
you should use the following query:
db.getCollection("xyz").find({ "_id": { $in: Arr } });

MongoDB count differs from aggregate sum

I have a query, and when I validate it I see that the count command returns a different result from the aggregate result.
I have an array of sub-documents like so:
{
...
wished: [{'game':'dayz','appid':'1234'}, {'game':'half-life','appid':'1234'}]
...
}
I am trying to count all games in the collection and return each game name along with the number of times that name is found.
If I run
db.user_info.count({'wished.game':'dayz'})
it returns 106, while
db.user_info.aggregate([{'$unwind':'$wished'},{'$group':{'_id':'$wished.game','total':{'$sum':1}}},{'$sort':{'total':-1}}])
returns 110.
I don't understand why the counts are different. The only thing I can think of is that it has to do with the data being in an array of sub-documents, as opposed to being in a plain array or directly in the document.
The $unwind statement will cause one user with multiple wished games to appear as several users. Imagine this data:
{
_id: 1,
wished: [{game:'a'}, {game:'b'}]
}
{
_id: 2,
wished: [{game:'a'}, {game:'c'}, {game:'a'}]
}
With this data, a count() can NEVER return more than 2, because there are only 2 documents.
But with this same data, an $unwind will give you 5 different documents. Summing them up will then give you a:3, b:1, c:1.
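If the goal is to make the aggregation count documents (users) per game, the way count() does, rather than occurrences, one approach is to deduplicate per user before counting:
db.user_info.aggregate([
    { '$unwind': '$wished' },
    // collapse duplicate games within the same user document
    { '$group': { '_id': { 'user': '$_id', 'game': '$wished.game' } } },
    // now count distinct users per game
    { '$group': { '_id': '$_id.game', 'total': { '$sum': 1 } } },
    { '$sort': { 'total': -1 } }
])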

How to fetch between a range of indexes in MongoDB?

I need help. Is there any method available to fetch documents between a range of indexes while using find in Mongo, like [2:10] (from 2 to 10)?
If you are talking about the "index" position within an array in your document, then you want the $slice projection operator. The first argument is the index to start at and the second is how many elements to return. So, with 0-based indexing, position 2 is the "third" element:
db.collection.find({}, { "list": { "$slice": [ 2, 8 ] } })
Within the collection itself, you use the .skip() and .limit() cursor modifiers to move through a range of documents:
db.collection.find({}).skip(2).limit(8)
Keep in mind that in the collection context MongoDB has no concept of "ordered" records; the order depends on the query and/or sort order that is given.
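For example, adding an explicit sort makes the skipped/limited range deterministic:
db.collection.find({}).sort({ "_id": 1 }).skip(2).limit(8)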