Mongo queries to search all the collections of a database (Mongo/PyMongo) - mongodb

I am stuck on how to query a database in which every document has this common structure:
{
"_id": {
"$oid": "5e0983863bcf0dab51f2872b"
},
"word": "never", // get the `word` value for each of below queries
"wordset_id": "a42b50e85e",
"meanings": [{
"id": "1f1bca9d9f",
"def": "not ever",
"speech_part": "adverb",
"synonyms": ["ne'er"]
}, {
"id": "d35f973ed0",
"def": "not at all",
"speech_part": "adverb"
}]
}
1) A query to get every word with speech_part: "adverb" (e.g. never, ...)
2) A query to get every word with a word length of 6 and speech_part: "adverb"
I have learnt from SO that, to search all collections, I first have to retrieve every collection in the database, but writing the query itself is where I am stuck.

db.collection.find({"meanings.speech_part":"adverb"},{"_id":0, "word":1})
The query above returns every word that has a specific speech_part.
The first part of the query is the filter predicate, which in your scenario matches speech_part. If the field you are matching were not inside another object, or inside an object within an array, you could just write {field_name: "something"}.
Because speech_part is inside an object which is itself inside an array, you have to use dot notation, {"parentField.key": "something"}, which in your case is {"meanings.speech_part": "adverb"}.
The second part of the query is the projection, where you define which fields you want in the result. To get only the word values you write {word: 1}; to get more fields you write {word: 1, etc: 1}. MongoDB projects _id by default, so to remove _id from the result you have to set {_id: 0} explicitly.
db.collection.find({
"meanings.speech_part":"adverb",
"$expr": { "$gt": [ { "$strLenCP": "$word" }, 6 ] }
},{"_id":0, "word":1})
The query above returns every word of a specific speech_part whose length is greater than 6; if you need words of exactly 6 characters, use $eq in place of $gt. This query is a bit more complex. You can look up the $expr documentation: inside $expr you can apply an aggregation expression to a field and match on the result. Here $strLenCP calculates the length of the word value, and the $gt comparison operator checks whether it is greater than 6.

You may try the query below to get the matching documents. You will have to translate it to PyMongo yourself.
db.getCollection('test-collection').find(
{
'meanings.speech_part': 'adverb'
},
{
_id: 0,
word: 1
}
);
Read about the projections in mongodb here:
https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results
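For the PyMongo side of the question, a minimal sketch might look like the following. It assumes db is a pymongo Database object, that every collection holds documents shaped like the one in the question (in particular that word is always a string, since $strLenCP fails on anything else), and it reads "word length of 6" as exactly six characters. The helper names build_word_query and find_words are made up for this illustration.

```python
def build_word_query(speech_part, length=None):
    """Return (filter, projection) for the word lookups discussed above."""
    # Dot notation reaches into the objects stored in the "meanings" array.
    query = {"meanings.speech_part": speech_part}
    if length is not None:
        # $strLenCP counts code points; $eq keeps words of exactly `length`.
        query["$expr"] = {"$eq": [{"$strLenCP": "$word"}, length]}
    # Return only the "word" field and suppress the default _id.
    projection = {"_id": 0, "word": 1}
    return query, projection


def find_words(db, speech_part, length=None):
    """Run the same query against every collection of the database."""
    query, projection = build_word_query(speech_part, length)
    words = []
    for name in db.list_collection_names():
        for doc in db[name].find(query, projection):
            if "word" in doc:  # skip documents without a word field
                words.append(doc["word"])
    return words


# Hypothetical usage (requires pymongo and a running server):
#   from pymongo import MongoClient
#   db = MongoClient()["dictionary"]   # "dictionary" is an assumed name
#   find_words(db, "adverb")           # question 1
#   find_words(db, "adverb", 6)        # question 2
```

Looping over list_collection_names() is the simple approach; if only some collections hold dictionary entries, filtering the names first avoids pointless queries.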

Related

MongoDB match on document and subdocuments, what to use as indexes?

I have a lot of documents looking like this:
[{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"price": 433,
"automatic": false,
"destination": "5d26fc92f72acc7a0b19f2c4",
"date": "2020-01-19T00:00:00.000+00:00",
"days": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"automatic": true,
"destination": "5d26fc92f72acc7a0b19f2c4",
"prices": [{
"price": 433,
"date_from": "2020-01-19T00:00:00.000+00:00",
"date_to": "2020-01-28T00:00:00.000+00:00",
"day_count": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"price": 899,
"date_from": "2020-04-19T00:00:00.000+00:00",
"date_to": "2020-04-28T00:00:00.000+00:00",
"day_count": 19,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
}
]
}
]
As you can see, automatic deals have multiple prices (possibly a lot, between 1000 and 4000) and do not have the original fields available.
Now I need to search the original document as well as the subdocuments to look for a match.
This is the aggregation I use to search through the documents:
[{
"$match": {
"destination": {
"$in": ["5d26fc9af72acc7a0b19f313"]
}
}
}, {
"$match": {
"$or": [{
"prices": {
"$elemMatch": {
"price": {
"$lte": 1500,
"$gte": 400
},
"date_to": {
"$lte": "2020-04-30T22:00:00.000Z"
},
"date_from": {
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}
}
}, {
"price": {
"$lte": 1500,
"$gte": 400
},
"date": {
"$lte": "2020-04-30T22:00:00.000Z",
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}]
}
}, {
"$limit": 20
}]
I would like to speed things up, because it can be quite slow. I was wondering, what is the best index strategy for this aggregate, what fields do I use? Is this the best way of doing it or is there a better way?
From Mongo's $or docs:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
So with that in mind, to avoid a collection scan in this pipeline you have to create a compound index containing both the price and prices fields.
Remember that order matters in compound indexes, so the order of the fields should depend on how you expect to use the index.
It seems to me that the index you want to create looks something like:
{destination: 1, date: 1, board_type: 1, price: 1, prices: 1}
A compound index that includes the $match filter fields is required to make the aggregation run fast. In aggregation queries, a $match stage early in the pipeline (preferably the first stage) can use indexes, if any are defined on the filter fields. That is the case in the posted query, so defining the right indexes is all that is needed for a fast query. But an index on which fields?
The index is going to be a compound index, i.e. an index on multiple fields of the query criteria. The index prefix starts with the destination field. The remaining index fields are still to be determined.
Most of these fields are sub-document fields of the prices array: price, date_from, date_to and board_type. There is also the date field from the main document. Which of these fields need to be used in the compound index?
Defining indexes on array elements (or on fields of sub-documents in an array) creates a lot of index keys. This means a lot of storage and, when the index is used, a lot of memory (RAM). This is an important consideration. Indexes on array elements are called multikey indexes. For an index to be used effectively, the collection's documents and the index used by the query (together called the working set) must fit into RAM.
Another aspect you need to consider is query selectivity: how many documents are selected by a filter that uses an indexed field. To be effective, the filter on the indexed field must select a small subset of the input documents. See Create Queries that Ensure Selectivity.
Based on these two factors, it is difficult to determine which other fields need to be included (certainly some of the fields of prices). So the index is going to be something like this:
{ destination: 1, fld1: 1, fld2: 1, ... }
The fld1, fld2, ..., are going to be the prices array sub-document fields and / or the date field. I think only one set of date fields can be used with the index. An example index can be one of these:
{ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1}
{ destination: 1, "prices.price": 1, "prices.date_from": 1, "prices.date_to": 1, "prices.board_type": 1}
Note that the order of the index keys, and whether price, date_from, date_to and board_type are needed at all, must be decided based on the two main factors above: the working-set requirement and the query selectivity. This is important.
NOTE: A test on a small sample data set with a similar structure showed the compound index being used, with destination as the leading field followed by two fields from prices (one with an equality condition and one with a range condition). The query plan from explain showed an IXSCAN (index scan) on the compound index, and using an index will certainly improve query performance.
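To check a candidate index the same way, one option is to walk the winning plan from the explain output and list its stages. The sketch below assumes the classic plan shape, where each stage is a dict with a stage name and an optional inputStage or inputStages; where exactly the winning plan sits in the explain output depends on the server version and on whether it came from find or aggregate. The helper names and the sample plan are hand-written for illustration, not real server output.

```python
def plan_stages(plan):
    """Collect stage names from a winning-plan dict, outermost stage first."""
    stages = [plan.get("stage")]
    children = []
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    children.extend(plan.get("inputStages", []))
    for child in children:
        stages.extend(plan_stages(child))
    return stages


def uses_only_index_scans(plan):
    """True if the plan reads through indexes and never scans the collection."""
    stages = plan_stages(plan)
    return "IXSCAN" in stages and "COLLSCAN" not in stages


# Hand-written example of an $or query where both clauses are index-backed.
sample_plan = {
    "stage": "FETCH",
    "inputStage": {
        "stage": "OR",
        "inputStages": [
            {"stage": "IXSCAN", "indexName": "destination_1_prices.price_1"},
            {"stage": "IXSCAN", "indexName": "destination_1_date_1"},
        ],
    },
}

print(plan_stages(sample_plan))
print(uses_only_index_scans(sample_plan))
```

A COLLSCAN anywhere in the plan means at least one $or clause was not covered by an index, which matches the behaviour quoted from the $or docs above.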

MongoDB: How to do a text search and sort by a date

Context: I have a MongoDB populated with large number of emails. I'd like to do a search for all emails that include a given email address within any of the following fields: To, From, CC and BCC. The result needs to be sorted by the field Date. We're currently trying the following query:
db.collection.find({ $text : {$search: "\"email@domain.com\""}}).sort({Date:1})
I've tried doing a compound index including the date but it does not work.
With this index...
db.collection.createIndex({Date: 1, From:"text", To:"text", CC:"text", BCC:"text"})
it gives error 17007 as Date should have an equality match as it's a prefix. This is not an option as we'd like all emails regardless of the date.
Also with this other index...
db.collection.createIndex({From:"text", To:"text", CC:"text", BCC:"text", Date:1})
Then it gives error 17144 as it goes over the internal limit for the sort.
We've read the following:
Stackoverflow ref
Stackoverflow ref
mongoDB doc on compound index
In these references and others I'm getting the idea that this is not possible but I don't think what we're trying to do is atypical or so much out of the box.
Are we doing something wrong? Is there a way to do this query with compound index or any other MongoDB feature?
thanks!
Regardless of other compound index keys, you need to include the $meta for the "textScore" in order to get the correct sorting:
db.collection.find(
{ "$text": { "$search": "\"email@domain.com\""}},
{ "score": { "$meta": "textScore" } }
).sort({
"score": { "$meta": "textScore" }, "Date": 1
})
So naturally you want that "score" to sort first, and then by "Date" in order for things to be correctly ranked by relevance of the search.
The order of the index does not matter, but of course you can only have "one" text index. So make sure you drop all others before creating:
db.collection.createIndex({
"From": "text",
"To": "text",
"CC":"text",
"BCC": "text",
"Date":1
})
List the indexes that currently exist with:
db.collection.getIndexes()
Or just drop everything and start fresh:
db.collection.dropIndexes()
For the data you appear to be searching on though, I would have thought a regular compound index on each field should suit you better. Looking for "email" addresses should be an "exact match", and if you expect multiple items for each field then they should be arrays of strings, like so:
{
"TO": ["bill@example.com"],
"FROM": ["ted@example.com"],
"CC": ["marty@example.com","sarah@example.com"],
"BCC": [],
"Date": ISODate("2015-07-27T13:42:05.535Z")
}
Then you need separate indexes on each field, possibly in compound with "Date" like so:
db.email.createIndex({ "TO": 1, "Date": 1 })
db.email.createIndex({ "FROM": 1, "Date": 1 })
db.email.createIndex({ "CC": 1, "Date": 1 })
db.email.createIndex({ "BCC": 1, "Date": 1 })
And query with an $or condition:
db.email.find({
"$or": [
{ "TO": "sarah@example.com" },
{ "FROM": "sarah@example.com" },
{ "CC": "sarah@example.com" },
{ "BCC": "sarah@example.com" }
],
"Date": { "$lt": new Date() }
})
If you look at the .explain(true) (verbose) output from that, you should see that the winning plan is an "index intersection" of all the specified indexes. This works out to be very efficient as every field ( and index selected ) has an exact match value, and a range match on the indexed date.
That's going to be a lot better for you than the "fuzzy matching" of text searches. Even regular expressions should work better here in general ( for e-mail addresses ) and especially if they are "anchored" ^ to the start of the string.
Text indexes are meant to match "word like tokens", but that should not be your data. The $or does not look as nice, but it should do a much better job.
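As a rough PyMongo-flavoured sketch of that exact-match approach: the helper below builds the $or filter over the four address fields and sorts by Date. It assumes the fields are stored as arrays of strings named TO, FROM, CC and BCC as in the example document above; the function names are hypothetical.

```python
def address_query(address, fields=("TO", "FROM", "CC", "BCC")):
    """Build an exact-match $or filter over the given address fields."""
    # An equality match on an array field matches any element of the array.
    return {"$or": [{field: address} for field in fields]}


def find_by_address(collection, address):
    """Return a cursor over matching e-mails, oldest first."""
    # 1 means ascending order on the Date field.
    return collection.find(address_query(address)).sort("Date", 1)


# Hypothetical usage (requires pymongo and a running server):
#   from pymongo import MongoClient
#   emails = MongoClient()["mail"]["email"]   # assumed names
#   for doc in find_by_address(emails, "sarah@example.com"):
#       print(doc["Date"])
```

Keeping the filter builder separate from the query call makes it easy to inspect the generated document or pass it to explain.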

MongoDB Compound Index to Optimize Update with Key and Range Condition

I have read this doc, which states that an index can optimize update operations, so I am adding an index to my collection to optimize the update operation I am using.
Records in the collection have an object as _id, and a timestamp:
{_id: {userId: "sample"}, firstTimestamp: 123, otherField: "abc"}
What I want to do is run an update using the query below:
db.userFirstTimestamp.update(
{_id: {userId: "sample"}, firstTimestamp: {$gt: 100}},
{_id: {userId: "sample"}, firstTimestamp: 100, otherField2: "efg"})
I want to store the 'first document' based on 'firstTimestamp'. The fields of the old and new document can differ, hence it cannot be a $set query; it should rewrite the document instead. In the sample above, "otherField" should no longer exist; "otherField2" should exist instead.
Based on my understanding of the MongoDB doc and this article, I created the index below:
db.sample.createIndex({_id:1, timestamp:1})
Then I benchmarked the query on an isolated experimental node running MongoDB 3.0.4, with the spec below:
MongoDB 3.0.4
Machine is empty, no other operation, only mongo
RAM ~30GB
Disk is RAID 0 striped
Collection has 60 million records
Average object size 1001 bytes
Index size 5.34 gig
When I check the log, many update queries take more than 100ms, and when I run mongotop, the top entry is the write query, which takes ~1000ms. That is a bit slow for a single query.
When I run mongostat, throughput is only 400-500 queries per second.
Then I tried explaining the query using find (since update does not support explain).
When I am not using a projection, it uses the default index {_id:1}.
When I project only _id and timestamp, it uses the {_id:1, timestamp:1} index.
My questions are:
Does the index I have created help this update query?
If it does not help, what should the index look like?
Is there any other way to optimize this update query?
Somewhat. But not optimally.
It should really be this, an index on the "element" of the object in the _id key:
db.sample.createIndex({ "_id.userId": 1, "firstTimestamp": 1 })
Use the $set operator and stop overwriting your documents:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": { "otherfield": "cfg" }
}
)
But really your data "should" look like this:
{
"_id": "sample",
"firstTimestamp": 200,
"otherfield2": "sam"
}
And update like:
db.sample.update(
{
"_id": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"firstTimestamp": 100,
"otherfield2": "efg"
}
}
)
Or if you insist that fields other than "_id" and "firstTimestamp" are going to change a lot, then rather do this:
{
"_id": "sample",
"firstTimestamp": 200,
"data": {
"otherfield2": "sam"
}
}
Then, if you just want to replace data, do:
db.sample.update(
{
"_id": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"firstTimestamp": 100,
"data": {
"overwritingField": "efg"
}
}
}
)
Since "data" can be replaced as an entire object if you wish, or just update a single key:
db.sample.update(
{
"_id": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"firstTimestamp": 100,
"data.newfield": "efg"
}
}
)
In all cases, try to use the update operators rather than replacing the whole object, as replacement typically works out as more traffic and more load on the server.
But overall, what makes sense here is that the "userId" part "should" be the portion of the index that narrows down the results the most. So it definitely goes before the timestamp, of which there should be a lot more possible values.
Compound primary keys are fine, but make sure you actually use them. A singular value would not make any sense and could just be assigned to _id. If you can just query on one field of the key as you are here, then you probably don't need a compound object as the primary key.
Your _id in the update suggests that you are getting exact matches for the _id, therefore it is not a compound field with other keys. With this being the case, it should just be a value in the _id itself.
Also a "range" is okay, but again consider that you are trying to match a single document (well, you don't mention "multi" anywhere), so again question why it is needed, and then either go for an exact match or at "least" an upper limit.
The $set will "only" update the fields that you specify. I think you made a mistake in typing your question though, as the syntax for the "update" portion would not be valid. But use update operators anyway, as they send less traffic by sending a single field, or just the fields you intend to update.
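If the old and new field sets really do differ, one way to stay with update operators is to pair $set with $unset. The sketch below is only an illustration under a strong assumption: the old document is already known on the client, which the question's conditional update does not require. The helper name build_update is made up.

```python
def build_update(old_doc, new_doc):
    """Build a $set/$unset update that turns old_doc's fields into new_doc's."""
    # Fields that are new or whose value changed go into $set.
    to_set = {
        key: value
        for key, value in new_doc.items()
        if key != "_id" and (key not in old_doc or old_doc[key] != value)
    }
    # Fields that disappeared go into $unset (the value given is ignored).
    to_unset = {
        key: ""
        for key in old_doc
        if key != "_id" and key not in new_doc
    }
    update = {}
    if to_set:
        update["$set"] = to_set
    if to_unset:
        update["$unset"] = to_unset
    return update


old = {"_id": {"userId": "sample"}, "firstTimestamp": 123, "otherField": "abc"}
new = {"_id": {"userId": "sample"}, "firstTimestamp": 100, "otherField2": "efg"}
print(build_update(old, new))
```

When the old document is not known in advance, nesting the changeable fields under a single key such as "data", as shown earlier, lets one $set replace them all at once.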

MongoDB find query with $ref and $id

I'm struggling with a mongodb query.
Given this query 1:
db.procedures.find({'procedure.name':'nameOfMyProcedure'})
and this query 2:
db.procedure_executions.find({'foo.bar':'whatever'})
Query 1 returns a lot of procedure objects that look like this in shortened version:
{
"_id": ObjectId("5564df8d30041530fb68e1eb"),
"_class": "eu.whatever.model.db.impl.DbProcedureExecutionImpl",
"procedure": {
"_class": "eu.whatever.common.model.impl.ProcedureImpl",
"className": "eu.whatever",
"name": "nameOfMyProcedure",
"kind": "METHOD",
"arguments": []
},
"caller": {
"$ref": "procedure_executions",
"$id": ObjectId("5564df8d30041530fb68e1e8")
}
}
The resulting objects of query 2 are referenced as "caller" in query 1.
How can I filter procedures (query 1) by the referenced caller and its attributes (query 2) in a single nested query?
I came across $in. Is it possible to add a query to another collection (procedure_executions) within $in?
"I'm new here myself", but try:
db.procedures.find({
'procedure.name':'nameOfMyProcedure',
'caller.$ref': 'procedure_executions'})
From $and:
NOTE:
MongoDB provides an implicit AND operation when specifying a comma separated list of expressions. Using an explicit AND with the $and operator is necessary when the same field or operator has to be specified in multiple expressions.
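Since find cannot run a sub-query against another collection, a common workaround is two queries: fetch the _id values of the matching callers first, then filter procedures on caller.$id with $in. The sketch below assumes db is a pymongo Database with the two collections named as in the question; the function names are hypothetical.

```python
def caller_id_query(procedure_name, caller_ids):
    """Filter procedures by name and by the _id of the referenced caller."""
    return {
        "procedure.name": procedure_name,
        # "$id" is the field of the DBRef that holds the referenced _id.
        "caller.$id": {"$in": list(caller_ids)},
    }


def procedures_by_caller(db, procedure_name, caller_filter):
    """Two-step lookup: matching callers first, then their procedures."""
    # Step 1 (query 2 in the question): only the _id of each caller is needed.
    caller_ids = [
        doc["_id"]
        for doc in db["procedure_executions"].find(caller_filter, {"_id": 1})
    ]
    # Step 2 (query 1 in the question), restricted to those callers.
    query = caller_id_query(procedure_name, caller_ids)
    return list(db["procedures"].find(query))


# Hypothetical usage (requires pymongo and a running server):
#   procedures_by_caller(db, "nameOfMyProcedure", {"foo.bar": "whatever"})
```

With a very large number of matching callers the $in list grows accordingly, so an aggregation with $lookup may be worth considering instead.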

Dot notation vs. $elemMatch

I have a unitScores collection, where each document has an id and an array of documents like this:
"_id": ObjectId("52134edd5b1c2bb503000001"),
"scores": [
{
"userId": ObjectId("5212bf3869bf351223000002"),
"unitId": ObjectId("521160695483658217000001"),
"score": 33
},
{
"unitId": ObjectId("521160695483658217000001"),
"userId": ObjectId("5200f6e4006292d308000008"),
"score": 17
}
]
I have two find queries:
_id:new ObjectID(scoreId)
"scores.userId":new ObjectID(userId)
"scores.unitId":new ObjectID(unitId)
and
_id:new ObjectID(scoreId)
scores:
$elemMatch:
userId:new ObjectID(userId)
unitId:new ObjectID(unitId)
I would expect them to give the same result, but using the input userId and unitId of
userId: 5212bf3869bf351223000002
unitId: 521160695483658217000001
the dot notation version returns the wrong array entry (the one with score:17) and the $elemMatch returns the correct entry (the one with score:33). Why is that?
$elemMatch is not the same logic as dot notation. $elemMatch requires a single array element to satisfy all of the conditions. Dot notation allows each condition to be satisfied by a different array element. Thus, you're seeing different results because the query logic is different.
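The difference can be illustrated without a server. The sketch below is a plain-Python emulation of the two matching rules for simple equality conditions only; it is not how MongoDB implements them, and the function names and sample values are made up.

```python
def matches_dot_notation(doc, array_field, conditions):
    """Each condition may be satisfied by a different element of the array."""
    elements = doc.get(array_field, [])
    return all(
        any(element.get(key) == value for element in elements)
        for key, value in conditions.items()
    )


def matches_elem_match(doc, array_field, conditions):
    """One single element must satisfy every condition."""
    elements = doc.get(array_field, [])
    return any(
        all(element.get(key) == value for key, value in conditions.items())
        for element in elements
    )


doc = {
    "scores": [
        {"userId": "alice", "unitId": "unit-1", "score": 33},
        {"userId": "bob", "unitId": "unit-2", "score": 17},
    ]
}
# alice never took unit-2, yet each value appears somewhere in the array.
conditions = {"userId": "alice", "unitId": "unit-2"}

print(matches_dot_notation(doc, "scores", conditions))
print(matches_elem_match(doc, "scores", conditions))
```

Here the dot-notation rule accepts the document because "alice" and "unit-2" each occur in some element, while the $elemMatch rule rejects it because no single element holds both.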