Find query doesn't limit fields - MongoDB

I got the following query:
query =
    {
        application: application,
        creationTime: {
            $gte: start_date.valueOf(),
            $lte: end_date.valueOf()
        }
    },
    {
        application: 1,
        creationTime: 1,
        buildSystem: 1,
        _id: 1
    }
var ut_data = db.my_collection.find(query).sort({ _id: -1 }).forEach(function(doc) {print(doc.testStatus)})
Where I want to limit the fields in the result to application, creationTime and buildSystem, so as not to load the whole documents matching the condition.
But when I print testStatus, it turns out it is available too; in fact all fields are available. How can I limit the fields in the result?
(I also tried {fields: {application: 1, creationTime: 1, buildSystem: 1, _id: 1}}, as proposed in "Limit Field in Mongodb find() query not working".)

I was passing the two parameters of the query incorrectly, so the second part (the projection) was not being considered. I fixed it like this:
query = [
    {
        application: application,
        creationTime: {
            $gte: start_date.valueOf(),
            $lte: end_date.valueOf()
        }
    },
    {
        _id: 1,
        application: 1,
        creationTime: 1,
        buildSystem: 1
    }
]
And then, passing each argument separately:
db.my_collection.find(query[0], query[1]).sort({ _id: -1 }).forEach(function(doc) {print(doc.testStatus)})
Or by simply passing the filter and projection directly to the find method (without variables), as sketched below.
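For clarity, the direct call would look roughly like this (a sketch reusing the same variables application, start_date and end_date from above; the filter is the first argument, the projection the second):
db.my_collection.find(
    {
        application: application,
        creationTime: { $gte: start_date.valueOf(), $lte: end_date.valueOf() }
    },
    { _id: 1, application: 1, creationTime: 1, buildSystem: 1 }
).sort({ _id: -1 }).forEach(function (doc) { printjson(doc) })
The printed documents then contain only the projected fields, so doc.testStatus would come back undefined.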

Related

MongoDB partial index doesn't work with $elemMatch

I have the following document:
user {
    _id: ObjectId(..),
    name: "John Doe",
    transactions: [
        {
            _id: 1,
            amount: 10.00,
            item_id: 123,
            condition: "SUCCESS"
        },
        {
            _id: 2,
            amount: 5.00,
            item_id: 124,
            condition: "FAILED"
        },
        ..
    ],
    ..
}
I tried placing a partial index for failed transactions using:
db.user.createIndex(
{ "transactions.condition": 1 },
{ partialFilterExpression: {"transactions.condition": "FAILED"} }
)
But whenever I do a query, or a $match in an aggregation pipeline, with the following:
{$match: {"transactions": {$elemMatch: {"condition": "FAILED"}}}}
I always get a full collection scan (COLLSCAN) with explain(). I am guessing the filter needs to strictly follow the expression transactions.condition: "FAILED", but I thought {"transactions": {$elemMatch: {"condition": "FAILED"}}} was identical to transactions.condition: "FAILED" when there is just a single condition. What am I missing here?
Yes, try this instead:
db.user.explain().aggregate([{$match: {"transactions.condition": "FAILED"}}])
The reason is that MongoDB only selects a partial index when the query predicate matches the partialFilterExpression; the indexed key here is the plain "transactions.condition" path, so the query has to use that same dotted form rather than $elemMatch.
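As a quick check (a sketch, assuming the partial index above is the only relevant index on the collection), comparing the two query shapes with explain() should show the difference:
// Predicate matches the partialFilterExpression, so the partial index is eligible:
db.user.find({ "transactions.condition": "FAILED" }).explain("queryPlanner")
// The $elemMatch form is not recognized as matching the filter expression, so it falls back to COLLSCAN:
db.user.find({ transactions: { $elemMatch: { condition: "FAILED" } } }).explain("queryPlanner")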
You'll get an even faster query if you can reduce it to:
db.user.explain().aggregate([
    {$match: {"transactions.condition": "FAILED"}},
    {$project: {_id: 0, "transactions.condition": 1}}
])
but that may not be the case here.

Aggregation pipeline "latest for all distinct id" is very slow, need to create proper indexes?

Consider the following aggregation pipeline code to return the newest entry for each distinct "internal_id":
db.locations.aggregate({$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}})
This call takes up to 10 seconds, which is not acceptable. The collection is not so huge:
db.locations.count()
1513671
So I guess there's something wrong with the indexes. I tried to create many indexes and none of them made an improvement; currently I have kept the two that were supposed to be enough imho: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.
MongoDB is NOT sharded; it is a 3-host replica set running version 3.6.14.
The MongoDB log shows the following:
2020-05-30T12:21:18.598+0200 I COMMAND [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677ms
Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:
{$sort: {timestamp: -1}}: sort all documents by the timestamp.
{$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}: go through these timestamp-sorted documents and then group them by the id. Because everything is sorted by timestamp at this point (rather than by id), it'll end up being a decent amount of work.
You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.
You want to force mongo to come up with a better query plan than that, one that uses the
{internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work -- it depends on how well it's able to calculate the query plan.
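In the mongo shell, that hint can be passed in the aggregate options; a minimal sketch, assuming the {internal_id: 1, timestamp: -1} index already exists:
db.locations.aggregate(
    [
        { $sort: { timestamp: -1 } },
        { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } }
    ],
    { hint: { internal_id: 1, timestamp: -1 } }  // hint in aggregate options requires MongoDB 3.6+
)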
If providing that hint doesn't work, one alternative would be to break this query into 2 parts, each of which uses an appropriate index.
Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id you may need to dedupe).
If you wanted to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.
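One possible shape of that self-join, as an untested sketch using the $lookup pipeline form available since MongoDB 3.6:
db.locations.aggregate([
    // step 1: newest timestamp per internal_id (should use the {internal_id: 1, timestamp: -1} index)
    { $group: { _id: "$internal_id", timestamp: { $max: "$timestamp" } } },
    // step 2: join back to the same collection to fetch the matching full documents
    { $lookup: {
        from: "locations",
        let: { iid: "$_id", ts: "$timestamp" },
        pipeline: [
            { $match: { $expr: { $and: [
                { $eq: ["$internal_id", "$$iid"] },
                { $eq: ["$timestamp", "$$ts"] }
            ] } } }
        ],
        as: "doc"
    } },
    { $unwind: "$doc" }
])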
So finally I've been able to do all the testing. Here are all the versions I wrote, thanks to willis' answer, and the results:
Original aggregate query
mongo_query = [
{"$match": group_filter},
{"$sort": {"timestamp": -1}},
{"$group": {"_id": "$internal_id", "doc": {"$first": "$$ROOT"}}},
]
res = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query)
res = await res.to_list(None)
9.61 seconds
Give MongoDB a hint to use the proper index (filter on internal_id first)
from bson.son import SON
cursor = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query, hint=SON([("internal_id", 1), ("timestamp", -1)]))
res = await cursor.to_list(None)
Not working: MongoDB replies with an exception saying the sort consumes too much memory.
Split the aggregation, first finding the latest timestamp for each internal_id
cursor = mongo.db[self.factory.config.mongo_collection].aggregate([{"$group": {"_id": "$internal_id", "timestamp": {"$max": "$timestamp"}}}])
res = await cursor.to_list(None)
or_query = []
for entry in res:
or_query.append({"internal_id": entry["_id"], "timestamp": entry["timestamp"]})
cursor = mongo.db[self.factory.config.mongo_collection].find({"$or": or_query})
fixed_res = await cursor.to_list(None)
1.88 seconds, a lot better but still not that fast
Parallel coroutines (and the winner is....)
In the meantime, as I already have the list of internal_id values and I'm using asynchronous Python, I went for parallel coroutines, getting the latest entry for a single internal_id at a time:
fixed_res: List[Dict] = []
async def get_one_result(db_filter: Dict) -> None:
""" Coroutine getting one result for each known internal ID """
cursor = mongo.db[self.factory.config.mongo_collection].find(db_filter).sort("timestamp", -1).limit(1)
res = await cursor.to_list(1)
if res:
fixed_res.append(res[0])
coros: List[Awaitable] = []
for internal_id in self.list_of_internal_ids:
coro = get_one_result({"internal_id": internal_id})
coros.append(coro)
await asyncio.gather(*coros)
0.5 seconds, way better than the others
If you don't have a list of internal_id values
There's an alternative I did not implement, but I confirmed the call is very fast: use the low-level distinct command against the {internal_id: 1} index to retrieve the list of individual IDs, then use the parallel calls above.
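For reference, the low-level command form is just this in the shell (a sketch, using the collection name from the log above):
db.runCommand({ distinct: "locations", key: "internal_id" })
which returns the array of distinct internal_id values to fan the parallel queries out over.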

MongoDB - Filter on First Field of a Two Field Compound Index and Sort on the Second Field?

I am inserting data into my MongoDB collection of the following format.
{'customer_id': 1, 'timestamp': 200}
{'customer_id': 2, 'timestamp': 210}
{'customer_id': 3, 'timestamp': 300}
I have a compound index created with keys: { 'customer_id': 1, 'timestamp': -1 }
db.collection.createIndex( { customer_id: 1, timestamp: -1 } , { name: "query for inventory" } )
Now, I need to filter such that I get the documents with customer_id = 1 or 2, and then sort the documents by timestamp (in descending order, so the latest is at the top).
My query looks like this:
db.collection.find( { 'customer_id': { '$in': [ 1, 2 ] } } ).sort( { 'timestamp': -1 } ).limit( 100 )
I know how to write the query, but I am unsure whether I should be using this compound index, two separate single-field indexes, or both.
It would be really helpful if I could get a clarification on which approach to use and why that approach is better.
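A quick way to see which of the candidate indexes the planner actually picks is to run explain on the exact query, e.g.:
db.collection.find({ customer_id: { $in: [1, 2] } }).sort({ timestamp: -1 }).limit(100).explain("executionStats")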

How to show a specific column in a MongoDB collection

I tried to show only particular columns from a MongoDB collection, but it's not working. How do I show particular columns?
user_collection
[{
"user_name":"hari",
"user_password":"123456"
}]
find_query
db.use_collection.find({},{projection:{user_name:1}})
I got this output:
[{
"user_name":"hari",
"user_password":"123456"
}]
Expected output:
[{
"user_name":"hari",
}]
Try:
db.use_collection.find({}, {user_name:1, _id: 0 })
In that way you get the field user_name and exclude the _id.
Extra info:
project fields and project fields excluding the id
With aggregate:
db.use_collection.aggregate( [ { $project : { _id: 0, user_name : 1 } } ] )
You can try this
Mongo query:
db.users.aggregate([
{
"$project":
{
"_id": 0,
"first_name": 1,
}
}
])
Or in ruby (Mongoid)
User.collection.aggregate(
[
"$project":
{
"_id": 0,
"first_name": 1,
}
]
)
If you try to inspect the record, you can convert it into an array first (e.g. User.collection.aggregate(...).to_a)
You can follow the official MongoDB reference when writing this in Mongoid; usually you just need to double-quote the property name on the left-hand side to make it work in Mongoid.
Try:
db.use_collection.find({}, { user_password: 0, _id: 0 })
(Mixing inclusion and exclusion in one projection is not allowed, except for _id, so excluding the unwanted fields leaves only user_name.)

How to count the number of documents on a date field in MongoDB

Scenario: Consider that I have the following collection in MongoDB:
{
"_id" : "CustomeID_3723",
"IsActive" : "Y",
"CreatedDateTime" : "2013-06-06T14:35:00Z"
}
Now I want to know the count of documents created on a particular day (say 2013-03-04).
So, I am trying to find a solution using the aggregation framework.
Information:
So far I have the following query built:
collection.aggregate([
{ $group: {
_id: '$CreatedDateTime'
}
},
{ $group: {
_id: null,
count: { $sum: 1 }
}
},
{ $project: {
_id: 0,
"count" :"$count"
}
}
])
Issue: Now, considering the above query, it gives me a count, but not based on the date alone! It takes the time into consideration as well for the unique count.
Question: Considering the field holds an ISO date, can anyone tell me how to count the documents based on the date only (i.e. excluding the time)?
Replace your two groups with:
{$project: {day: {$dayOfMonth: '$CreatedDateTime'}, month: {$month: '$CreatedDateTime'}, year: {$year: '$CreatedDateTime'}}},
{$group: {_id: {day: '$day', month: '$month', year: '$year'}, count: {$sum: 1}}}
You can read more about the date operators here: http://docs.mongodb.org/manual/reference/aggregation/#date-operators
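Putting the two stages together on the sample collection, and assuming CreatedDateTime is stored as a real ISODate rather than a string, the full call would look roughly like this:
collection.aggregate([
    { $project: {
        day:   { $dayOfMonth: "$CreatedDateTime" },
        month: { $month: "$CreatedDateTime" },
        year:  { $year: "$CreatedDateTime" }
    } },
    { $group: {
        _id: { day: "$day", month: "$month", year: "$year" },
        count: { $sum: 1 }
    } }
])
Each result document then has an _id holding the calendar date parts and a count of the documents created on that day.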