Could someone tell me the difference between these two MongoDB queries?

I was given the following problem in MongoDB:
How many companies in the sample_training.companies dataset were
(founded in 2004 AND (category_code is "social" OR category_code is "web"))
OR (founded in the month of October AND (category_code is "social" OR category_code is "web"))
The actual query for the above question is given below, and it returned 149 documents.
Correct query:
db.companies.find({
    "$or": [
        {
            "founded_year": 2004,
            "$or": [
                { "category_code": "social" },
                { "category_code": "web" }
            ]
        },
        {
            "founded_month": 10,
            "$or": [
                { "category_code": "social" },
                { "category_code": "web" }
            ]
        }
    ]
}).count()
But I tried to formulate another query for the same problem; unfortunately, it returned the incorrect value of 668 documents.
Incorrect query:
db.companies.find({
    "$or": [
        { "category_code": "social" },
        { "category_code": "web" }
    ],
    "$or": [
        { "founded_year": 2004 },
        { "founded_month": 10 }
    ]
}).count()
Could someone help me understand the difference between these queries?

The difference is not boolean precedence at all; it is how the query document is built. A query document is a JavaScript object, and an object cannot contain the same key twice. In your first example each inner "$or" lives in its own sub-document, so there is no collision and the logic matches the logic in your problem. In your second example you put two "$or" keys into the same document, so the second "$or" silently replaces the first before the query is ever sent to the server.
What your second example actually executes is therefore just:
db.companies.find({
    "$or": [
        { "founded_year": 2004 },
        { "founded_month": 10 }
    ]
}).count()
That matches every company founded in 2004 or founded in October, with no category_code restriction at all, which is why the count jumps to 668.
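If you do want the two OR clauses applied together, name them explicitly with $and. A minimal sketch: this is logically equivalent to the correct query above (the AND distributes over both ORs), so it should also count 149:
db.companies.find({
    "$and": [
        {
            "$or": [
                { "category_code": "social" },
                { "category_code": "web" }
            ]
        },
        {
            "$or": [
                { "founded_year": 2004 },
                { "founded_month": 10 }
            ]
        }
    ]
}).count()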


Writing down $subtract in NoSQL MongoDB in a sophisticated way

I'm doing MongoDB Academy, and for this question:
What is the difference between the number of people born in 1998 and the number of people born after 1998 in the sample_training.trips collection?
The simplest way to do this (the way they expect you to answer) is:
db.trips.find({ "birth year": 1998 }).count()
and:
db.trips.find({ "birth year": { $gt: 1998 } }).count()
then calculate the difference manually.
I'm not familiar with programming and code syntax, but I'm wondering about a more sophisticated method; something like the code below, which could surely be improved:
db.trips.aggregate({ $subtract: [ db.trips.find({ "birth year": 1998 }).count(), db.trips.find({ "birth year": { $gt: 1998 } }).count() ] })
Note: the Atlas free tier doesn't allow $subtract in queries, so I couldn't even test whether this would work.
A much simpler way is to use:
(query 1) - (query 2)
db.trips.find({ "birth year": { "$gt": 1998 } }).count() - db.trips.find({ "birth year": 1998 }).count()
You'll get the result as an integer:
6
You can group the number of people born in 1998 and after 1998 using the following approach, with just one aggregate query.
Aggregate pipeline stages:
Select only those records where the birth year is greater than or equal to 1998
Group the number of people by birth year
Add a new field born_on to every record using the $cond operator in the $project stage
Group the records again by the newly added field born_on, which will now give you just two records, with counts for on_1998 and after_1998
The query is as follows:
db.trips.aggregate([
    {
        "$match": {
            "birth year": { "$gte": 1998 }
        }
    },
    {
        "$group": {
            "_id": "$birth year",
            "count": { "$sum": 1 }
        }
    },
    {
        "$project": {
            "count": "$count",
            "born_on": {
                "$cond": [
                    { "$eq": [ "$_id", 1998 ] },
                    "on_1998",
                    "after_1998"
                ]
            }
        }
    },
    {
        "$group": {
            "_id": "$born_on",
            "total": { "$sum": "$count" }
        }
    }
])
** There could be another, more optimized way of achieving the same result, but this works too.
You can use the following queries to get this value:
db.trips.find( { "birth year": { "$gt": 1998 } } ).count()
db.trips.find( { "birth year": 1998 } ).count()
Use $gt instead of the $gte operator to exclude everyone born in 1998, count the people born exactly in 1998 with implicit equality, and then subtract the two values to get the difference.
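If you really do want the subtraction done server-side in a single query, here is a minimal sketch (assuming the same sample_training.trips collection and the "birth year" field used above) that buckets both counts with conditional sums and applies $subtract in a final $project:
db.trips.aggregate([
    // keep only people born in or after 1998
    { "$match": { "birth year": { "$gte": 1998 } } },
    // count both buckets in a single pass
    { "$group": {
        "_id": null,
        "on_1998": { "$sum": { "$cond": [ { "$eq": [ "$birth year", 1998 ] }, 1, 0 ] } },
        "after_1998": { "$sum": { "$cond": [ { "$gt": [ "$birth year", 1998 ] }, 1, 0 ] } }
    } },
    // subtract the two counts server-side
    { "$project": { "_id": 0, "difference": { "$subtract": [ "$after_1998", "$on_1998" ] } } }
])
With the sample data this should return a single document like { "difference": 6 }.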

mongodb how to get a document which has max value of each "group with the same key" [duplicate]

This question already has answers here:
MongoDB - get documents with max attribute per group in a collection
(2 answers)
Closed 5 years ago.
I have a collection:
{'_id':'008','name':'ada','update':'1504501629','star':3.6,'desc':'ok', ...}
{'_id':'007','name':'bob','update':'1504501614','star':4.2,'desc':'gb', ...}
{'_id':'005','name':'ada','update':'1504501532','star':3.2,'desc':'ok', ...}
{'_id':'003','name':'bob','update':'1504501431','star':4.5,'desc':'bg', ...}
{'_id':'002','name':'ada','update':'1504501378','star':3.4,'desc':'no', ...}
{'_id':'001','name':'ada','update':'1504501325','star':3.6,'desc':'ok', ...}
{'_id':'000','name':'bob','update':'1504501268','star':4.3,'desc':'gg', ...}
...
What I want as a result is, for each 'name', the document with the max value of 'update' (that is, the newest document for that 'name'), returned whole:
{'_id':'008','name':'ada','update':'1504501629','star':3.6,'desc':'ok', ...}
{'_id':'007','name':'bob','update':'1504501614','star':4.2,'desc':'gb', ...}
...
What is the most effective way to do this?
Currently I do it in Python like this:
result = []
for name in db.collection.distinct('name'):
    result.append(db.collection.find({'name': name}).sort('update', -1)[0])
Doesn't this run 'find' too many times?
=====
I build this collection by crawling data keyed by 'name', collecting many other fields, and every time I insert a document I set a key named 'update'.
When I use the database, I want the newest document for a specific 'name', so it looks like I can't just use $group.
What should I do? Redesign the database structure, or find a better way to query?
=====
Improved!
I've tried creating an index on 'name' & 'update', and the process went from half an hour down to 30 seconds!
But I'd still welcome a better solution ^_^
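For reference, the compound index described here would presumably be created like this, with 'update' descending so that a newest-first sort can walk it directly:
db.collection.createIndex({ "name": 1, "update": -1 })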
Your use case suits aggregation really well. As I see in your question you already know that, but can't figure out how to use $group and take the whole document that has the max update. If you $sort your documents before $group, you can use the $first operator. Then there is no need to send a find query for each name.
db.collection.aggregate([
    { $sort: { "name": 1, "update": -1 } },
    { $group: { _id: "$name", "update": { $first: "$update" }, "doc_id": { $first: "$_id" } } }
])
I did not add an extra $project stage to the pipeline; you can simply add the fields you want in the result to $group, each with the $first operator.
Additionally, if you look closer at the $sort stage, you can see that it uses your newly created index, so you did well to add it; otherwise I would have recommended it too :)
Update: for your question in the comments:
You have to write all the keys in $group. But if you think that will look bad, or new fields will come in the future and you don't want to rewrite $group each time, I would do this:
First get all the _id fields of the desired documents in the aggregation, and then fetch those documents in one find query with the $in operator.
db.collection.find( { "_id": { $in: [ <ids returned by the aggregation> ] } } )
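Putting the two steps together, a minimal sketch in the shell (same collection and field names as in the question) might look like this:
// step 1: one aggregation to find the _id of the newest document per name
var ids = db.collection.aggregate([
    { $sort: { "name": 1, "update": -1 } },
    { $group: { _id: "$name", "doc_id": { $first: "$_id" } } }
]).toArray().map(function(group) { return group.doc_id; });
// step 2: one find to fetch those documents whole
db.collection.find({ "_id": { $in: ids } })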

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
    _id: "Tree",
    id: "18955875",
    linksFrom: [
        {
            name: "Forest",
            count: 6
        },
        [...]
    ],
    categories: [
        "Trees",
        "Forest_ecology",
        [...]
    ]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
    {
        aggregate: "articles",
        pipeline: [
            { $unwind: "$linksFrom" },
            { $sort: { "linksFrom.count": -1 } },
            {
                $project: {
                    name: "$_id",
                    linksFrom: "$linksFrom"
                }
            },
            {
                $group: {
                    _id: "$linksFrom.name",
                    linksTo: { $push: { name: "$name", count: { $sum: "$linksFrom.count" } } }
                }
            },
            { $out: "TEMPORARY" }
        ],
        allowDiskUse: true
    }
)
However, on a large dataset being the english Wikipedia I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles, which link to United_States article and the corresponding document's size grows above 16MB, currently almost 24MB. Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationship between articles is stored with IDs rather than long names but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
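As a rough sanity check that United_States really is the oversized group, one option is to count its inbound links instead of materializing them; a sketch against the same articles collection, using $group with a counter rather than $push so no huge array is ever built:
db.articles.aggregate([
    // expand each outgoing link into its own document
    { $unwind: "$linksFrom" },
    // keep only links pointing at the suspect article
    { $match: { "linksFrom.name": "United_States" } },
    // count them instead of pushing them into one array
    { $group: { _id: null, inboundLinks: { $sum: 1 } } }
], { allowDiskUse: true })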
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase, and then I can immediately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for a label with a given name.
Get all articles it might point to.
Compare the incoming and outgoing links for these articles.
This is where the main question comes. I thought that it's better if I store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me, that - as long as the lookups are indexed - the overall performance might be the same for one big document or many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongo shell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Let's begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to document the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTo could occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have used the following statements to create a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks = []
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that if you have already answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this:
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link?
Again, you can answer this question by reading the length of the array from question 2, or by using the .count() method. The aggregation again is simple:
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields are probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index would not help much, since order matters and we do not always query on the first field. So this is probably as optimized as it can get.
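If you want to verify that a given query actually uses one of these indexes, the shell's standard explain() helper shows the chosen plan; look for an IXSCAN stage in the winning plan:
db.links.find({ "linksTo": "Royal_Navy" }).explain("executionStats")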
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to answer the questions you have about the data.

How to get a sorted result starting from a given point in MongoDB?

For example, I got some data in MongoDB
db: people
{_id:1, name:"Tom", age:26}
{_id:2, name:"Jim", age:22}
{_id:3, name:"Mac", age:22}
{_id:4, name:"Zoe", age:22}
{_id:5, name:"Ray", age:18}
....
If I want the result sorted by "age", that's easy: just create an index on "age" and use sort. Then I get back a long list, something like this:
{_id:5, name:"Ray", age:18}
{_id:2, name:"Jim", age:22}
{_id:3, name:"Mac", age:22}
{_id:4, name:"Zoe", age:22}
{_id:1, name:"Tom", age:26}
...
What if I want this list, still sorted by "age", but starting from "Mac"? Like below:
{_id:3, name:"Mac", age:22}
{_id:4, name:"Zoe", age:22}
{_id:1, name:"Tom", age:26}
...
I can't use $gte on age alone because that would include "Jim"; ages can be the same.
What's the right way to query this? Thanks.
I think this is more of a "terminology" problem, in that what you call a "start point" others call something different. I see two approaches here: one I would consider the "wrong" approach and one I would consider the "right" approach to what you want to do, though both give the desired result on this sample. There is of course also the "obvious" approach, if that is simple enough for your needs.
For the "wrong" approach I would basically say to use $gte in both cases, for the "name" and "age". This basically gives you are "starting point" at "Mac":
db.collection.find(
{ "name": { "$gte": "Mac" }, "age": { "$gte": 22 } }
).sort({ "age": 1 })
But of course this would not work if you had "Alan" of age "27" since the name is less than the starting name value. Works on your sample though of course.
What I believe to be the "right" reading of what you are asking is that you are talking about "paging" data in a more efficient way than using .skip(). In this case what you want to do is "exclude" results in a similar way.
So this means essentially keeping the last "page" of documents seen, or possibly more depending on how much the "range" value changes, and excluding by the unique _id values. Best demonstrated as:
// First iteration
var seenIds = [];
var lastAge = null;
var cursor = db.collection.find({}).sort({ "age": 1 }).limit(2);
cursor.forEach(function(result) {
    seenIds.push(result._id);
    lastAge = result.age;
    // do other things
});

// Next iteration
var cursor = db.collection.find(
    { "_id": { "$nin": seenIds }, "age": { "$gte": lastAge } }
).sort({ "age": 1 }).limit(2);
Since in the first instance you had already "seen" the first two results, you submit the _id values as a $nin operation to exclude them and also ask for anything "greater than or equal to" the "last seen" age value.
That is an efficient way to "page" data in a forwards direction, and may indeed be what you are asking, but of course it requires that you know "which data" comes "before Mac" in order to do things properly. So that leaves the final "obvious" approach:
The simplest way to just start at "Mac" is to iterate over the results and "discard" everything before the desired value appears:
var startSeen = false;
db.collection.find(
    { "age": { "$gte": 22 } }
).sort({ "age": 1 }).forEach(function(result) {
    if (!startSeen)
        startSeen = (result.name == 'Mac');
    if (startSeen) {
        // Mac has been seen. Do something with your data
    }
})
At the end of the day, there is of course no arbitrary way to "start from where 'Mac' appears in a sorted list". You are either going to:
Lexically exclude any other results occurring before it
Store results and page through them to "cut points" for last seen values
Just live with iterating the cursor and discarding results until the "first" desired match is found.
I did a test and found the solution.
db.collection.find({
    $or: [
        { name: { $gte: 'Mac' }, age: 22 },
        { age: { $gt: 22 } }
    ]
})
.sort({ age: 1, name: 1 })
really did the magic.
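For what it's worth, this is the classic "keyset pagination" pattern on a compound sort key. A compound index matching the sort (a sketch, using the same generic collection name as above) should allow an index-backed sort instead of an in-memory one:
db.collection.createIndex({ age: 1, name: 1 })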

MongoDB multiple conditions in WHERE clause

In MongoDB, how do I write an OR condition like in a SQL WHERE clause?
I am a beginner and don't know how to achieve this.
DELETE FROM tablename WHERE id = 6 OR id = 8
What is the equivalent query in MongoDB?
Just use the $or operator, as described in the MongoDB documentation.
db.tablename.remove({ $or: [ { _id: 6 }, { _id: 8 } ] })
You may also find the SQL-to-MongoDB comparison section of the manual useful.
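As a side note, when every condition tests the same field, the $in operator is the usual shorthand and should behave identically here (same collection as above):
db.tablename.remove({ _id: { $in: [ 6, 8 ] } })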