how do I do 'not-in' operation in mongodb? - mongodb

I have two collections: shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and on any day a person can be on the beach, shopping, doing both, or doing neither. I now want to query for all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL databases. I saw similar questions about joins, and in most cases the suggestion was to denormalize. So one solution I could think of is to create an activity collection, indexed on date, with the user's actions embedded. Something like:
{
user_id
date
actions {
[action_type, ..]
}
}
Insertion now becomes costly, as now I will have to query before insert.

A few suggestions.
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When new activity comes in you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, activity then your inserts are much faster but your queries now have to do a LOT of work querying for both users, dates and activities.
With the proposed schema, here is the insert/update statement:
db.coll.update({"user": "username", "date": "somedate"}, {$inc: {"shopped": 1}}, true)
What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist", aka an "upsert" (that's the last 'true' argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date":"somedate","shopped":0,"danced":{$gt:1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for explanation of this - and keep in mind that large documents will keep getting into your working data set and if they are huge and have a lot of useless (old) data in them, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes of each user, where you track how many friends they have or other semi-stable information about them, and also a user_activity collection where you add one record per user per day with the activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are was the first suggestion I made.

Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with RDBMS, insertion can be (relatively) costly when there are indices in place on the table (ie, usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
user_id
date
action
}
Then you could run a query like this:
// JavaScript months are 0-based: 5 is June, 6 is July
var start = new Date(2012, 5, 27);
var end = new Date(2012, 6, 3);
db.activities.find({
date: {$gte: start, $lt: end },
action: { $in: ["beach", "shopping" ] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
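That last client-side step can be sketched in plain JavaScript. This is only a minimal illustration, assuming the driver has already returned the matching documents as an array of objects with user_id and action fields; the function name shoppersWithoutBeach and the sample data are made up for the example.

```javascript
// Hypothetical helper: given activity documents already fetched from the
// "flat" activities collection, return the user ids that have a "shopping"
// record but no "beach" record.
function shoppersWithoutBeach(activities) {
  const shoppers = new Set();
  const beachGoers = new Set();
  for (const doc of activities) {
    if (doc.action === "shopping") shoppers.add(doc.user_id);
    if (doc.action === "beach") beachGoers.add(doc.user_id);
  }
  // Keep only the shoppers that never appear among the beach-goers.
  return [...shoppers].filter((id) => !beachGoers.has(id));
}

// Made-up sample data standing in for the query result.
const sample = [
  { user_id: "u1", action: "shopping" },
  { user_id: "u1", action: "beach" },
  { user_id: "u2", action: "shopping" },
  { user_id: "u3", action: "beach" },
];
console.log(shoppersWithoutBeach(sample)); // [ 'u2' ]
```

Using two sets keeps the pass linear in the number of returned documents, which matters if the 7-day window holds many records.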

One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
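// dates are stored as Date objects so they can be compared in range queries (months are 0-based)
{ action_type: "beach", date: new Date(2012, 5, 1) },
{ action_type: "shopping", date: new Date(2012, 5, 2) }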
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find( {
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
}
});
Expanding on this, you can use the $and operator to find all people who went shopping but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find( {
  $and: [
    // each condition must be its own document inside the $and array
    { actions : {
        $elemMatch : {
          action_type : { $in: ["shopping"] },
          date : { $gt : start }
        }
    } },
    { actions : {
        $not: {
          $elemMatch : {
            action_type : { $in: ["beach"] },
            date : { $gt : start }
          }
        }
    } }
  ]
});

Related

how to populate field of one collection with count query results of another collection?

Kind of a complex one here, and I'm pretty new to Mongo, so hopefully someone can help. I have a db of users. Each user has a state/province listed. I'm trying to create another collection holding the total number of users in each state/province. Because users sign up pretty regularly, this will be a growing total that I'm trying to generate and display on a map.
I'm able to query the database to find the total number of users in a specific state, but I want to do this for all users and come out with a list of totals for all states/provinces. I want a separate collection in the DB with all states/provinces listed and a TOTAL field that is dynamically populated with the count query on the other collection. But I'm not sure how to make the result of a query the value of a field in another collection.
I used this to get the user totals:
db.users.aggregate([
{"$group" : {_id:"$state", count:{$sum:1}}}
])
My main question is how to make the results of a query the value of a field in each corresponding record in another collection. Or if that's even possible.
Thanks for any help or guidance.
It looks like On-Demand Materialized Views (added in MongoDB 4.2) should solve your problem!
You can create an On-Demand Materialized View using the $merge operator.
A possible definition of the Materialized View could be:
updateUsersLocationTotal = function() {
db.users.aggregate( [
// optional: add a { $match: { ... } } stage here if you need to filter (for example on state)
{ $group: { _id:"$state", users_quantity: { $sum: 1} } },
{ $merge: { into: "users_total", whenMatched: "replace" } }
] );
};
You then refresh the totals just by calling updateUsersLocationTotal().
After that you can query the view just like a normal collection, using db.users_total.find() or db.users_total.aggregate().

MongoDB, how to choose the right design?

I have to create a collection of documents and I have a doubt about the right design.
Each document is an "identity"; each identity has a list of "partner data"; each partner data entry is defined by an ID and a set of data.
One approach can be (1):
{
_id: ...
partners: [
{
id: partner1,
data: {
}
},
{
id: partner2,
data: {
}
},
]
}
Another approach can be (2)
{
_id: ...
partners: {
partner1: {
data: {
}
},
partner2: {
data: {
}
},
}
}
I prefer the first one, but considering that I could have millions of these identities, which would be the best-performing schema?
A typical query can be: "how many identities have partner with ID N".
With the second example, a query can be:
db.identities.find({"partners.partner1": {$exists: true}})
With first approach, how can I get this count?
The second solution is easier to handle server-side; each document will have a map where each KEY is the partner ID, so instead of scanning the whole list, I can simply get the partner data by key...
What do you think about these solutions? I prefer the first one, but I think the second is more "usable"...
Thanks
I prefer the first one, but considering that I could have millions of these identities, which would be the best-performing schema?
If you are going to have millions of identities, then neither approach is really scalable. Each document in Mongo has a size limit (16MB) (read about it here).
If you are going to have a really large number of identities, the scalable approach would be to create a different collection, only for the relations and partnership data.
Now, I also want you to consider how you treat a "partnership": if I'm a user and I have you on my partners list, will you see me as a partner on your list?
If we both see each other as partners, then MongoDB may not be the best solution. Graph DBs are more appropriate for dealing with relations of this type.
All solutions within Mongo for two-way relations will be built on double updates (your id on my partners list, my id on your partners list).
(In SQL you could add an extra join condition, so you would not need to store each partnership twice, as "me and you" plus "you and me"; a single "you and me" row would do. Mongo offers no such option.)
Do you see where this is going?
If you only need to go one way, then just create a second collection, "partnerships":
{
_id: <should be unique>,
user_id: 'your_id',
partner_id: 'his_id',
data: {} // or just flatten the fields into the root object
}
Please note that you create one document for each partnership!
Then you could use $lookup to query for a user together with all of his partners, something like:
db.getCollection('partners').aggregate([
{
$lookup: {
from: 'partnerships',
localField: '_id',
foreignField: 'user_id',
as: 'partners'
}
},
{
$project: {
name: 1,
partners: 1,
num_partners: { $size: "$partners" }
}
}
])
Read more about the aggregation stages here.
If you are not going to have lots of partnerships, then please continue with your first approach, which is good.
The second approach will make most queries on this collection pretty awkward, and you will always have to write code in order to query it. They won't be "straightforward" Mongo queries.
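To make the difference between the two shapes concrete for the question "how many identities have a partner with ID N", here is a small in-memory sketch in plain JavaScript. It is not MongoDB code, and the sample documents and function names are invented. The point it tries to show: in shape (1) the partner ID is a value under a fixed field path (in MongoDB that would be a dot-notation filter on partners.id), while in shape (2) the field name itself changes with every partner.

```javascript
// Shape (1): partners is an array of { id, data } sub-documents.
const arrayShape = [
  { _id: 1, partners: [{ id: "partner1", data: {} }, { id: "partner2", data: {} }] },
  { _id: 2, partners: [{ id: "partner2", data: {} }] },
];

// Shape (2): partners is an object keyed by partner ID.
const keyedShape = [
  { _id: 1, partners: { partner1: { data: {} }, partner2: { data: {} } } },
  { _id: 2, partners: { partner2: { data: {} } } },
];

// Shape (1): match on the value of a fixed field ("id").
function countInArrayShape(docs, partnerId) {
  return docs.filter((d) => d.partners.some((p) => p.id === partnerId)).length;
}

// Shape (2): test for the existence of a field whose name varies.
function countInKeyedShape(docs, partnerId) {
  return docs.filter((d) =>
    Object.prototype.hasOwnProperty.call(d.partners, partnerId)
  ).length;
}

console.log(countInArrayShape(arrayShape, "partner2")); // 2
console.log(countInKeyedShape(keyedShape, "partner1")); // 1
```

Because shape (1) always queries the same path, one index on that path can serve every partner ID, whereas shape (2) would need a different field path per partner.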

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
_id : "Tree",
id: "18955875",
linksFrom: [
{
name: "Forest",
count: 6
},
[...]
],
categories: [
"Trees",
"Forest_ecology",
[...]
]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
{
aggregate: "articles",
pipeline : [
{
$unwind: "$linksFrom"
},
{
$sort: { "linksFrom.count": -1 }
},
{
$project:
{
name: "$_id",
linksFrom: "$linksFrom"
}
},
{
$group:
{
_id: "$linksFrom.name",
linksTo: { $push: { name: "$name", count: { $sum : "$linksFrom.count" } } },
}
},
{
$out: "TEMPORARY"
}
] ,
allowDiskUse: true
}
)
However, on a large dataset (the English Wikipedia) I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles that link to the United_States article, so the corresponding document's size grows above 16MB, currently to almost 24MB. Unfortunately, I cannot even check whether that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationship between articles is stored with IDs rather than long names, but I'm afraid that might not be enough, especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection, links, serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and can then immediately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for a label with a given name.
Get all articles it might point to.
Compare the incoming and outgoing links for these articles.
This is where the main question comes. I thought that it's better if I store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me, that - as long as the lookups are indexed - the overall performance might be the same for one big document or many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongoshell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Let's begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link to?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link to?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to record the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTo could occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
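To illustrate why the upsert keeps each (article, linksTo) pair unique, here is a small in-memory sketch in plain JavaScript. It only mimics the effect of the statement above with a Map and is not MongoDB code; the names links and upsertLink are made up for the illustration.

```javascript
// In-memory stand-in for the links collection, keyed by the
// (article, linksTo) pair. This only mimics the effect of an upsert.
const links = new Map();

function upsertLink(article, linksTo, date) {
  const key = JSON.stringify([article, linksTo]);
  // An existing entry for the pair is replaced; a missing one is created.
  links.set(key, { date, article, linksTo });
}

upsertLink("HMS_Warrior_(1860)", "Royal_Navy", new Date());
upsertLink("HMS_Warrior_(1860)", "Royal_Navy", new Date()); // same pair again
upsertLink("Royal_Navy", "Mutiny_on_the_Bounty", new Date());

console.log(links.size); // 2
```

A plain insert would correspond to pushing onto an array instead, which would leave three entries, two of them for the same pair.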
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have used the following statements to create a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks =[]
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that if you have already answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this:
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 or by using the .count() method. The aggregation is again simple:
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields is probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do not always ask for the first field. So this is probably as optimized as it can get.
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to answer the questions you have about the data.

Execution time of a query - MongoDB

I have two collections: coach and team.
Coach collection contains information about coaches like name, surname, age and an array coached_Team that contains the _id of the team that a coach coached.
The team collection contains data about teams like _id, common name, official name, country, championship....
If I want to find, for example, the official name of all teams coached by Allegri, I have to do two queries, the first on coach collection:
>var x = db.coach.find({surname:"Allegri"},{_id:0, "coached_Team.team_id":1})
>var AllegriTeams
>while(x.hasNext()) AllegriTeams=x.next()
{
"coached_Team" : [
{
"team_id" : "Juv.26"
},
{
"team_id" : "Mil.74"
},
{
"team_id" : "Cag.00"
}
]
}
>AllegriTeams=AllegriTeams.coached_Team
[
{
"team_id" : "Juv.26"
},
{
"team_id" : "Mil.74"
},
{
"team_id" : "Cag.00"
}
]
And then I have to execute three queries on team collection:
> db.team.find({ _id:AllegriTeams[0].team_id}, {official_name:1,_id:0})
{official_name : "Juventus Football Club S.p.A."}
> db.team.find({ _id:AllegriTeams[1].team_id}, {official_name:1,_id:0})
{official_name : "Associazione Calcio Milan S.p.A"}
> db.team.find({ _id:AllegriTeams[2].team_id}, {official_name:1,_id:0})
{official_name:"Cagliari Calcio S.p.A"}
Now consider that I have about 100k documents in both the team and coach collections. The first query, on the coach collection, takes about 71 ms plus the time of the while loop. The three queries on the team collection, measured with cursor.explain("executionStats"), take 0 ms. I don't understand why these queries take 0 ms.
I need the executionTimeMillis of these three queries to get the execution time of the overall query "find the official names of all teams coached by Allegri". I want to add the execution time of the query on the coach collection (71 ms) to the execution time of these three. If the time of these three queries is 0, what can I say about the execution time of the overall query?
I think the more important observation here is that 71ms is a long time for a simple fetch of one item. Looks like your "surname" field needs an index. The other "three" queries are simple lookups of a primary key, which is why they are relatively fast.
db.coach.createIndex({ "surname": 1 })
If that surname is actually "unique" then add that too:
db.coach.createIndex({ "surname": 1 },{ "unique": true })
You can also simplify your "three" queries into one by simply mapping the array and applying the $in operator:
var teamIds = [];
db.coach.find(
  { "surname": "Allegri" },
  { "_id": 0, "coached_Team.team_id": 1 }
).forEach(function(coach) {
  // collect the team ids of every matching coach
  teamIds = coach.coached_Team.map(function(team) {
    return team.team_id;
  }).concat(teamIds);
});
db.team.find(
  { "_id": { "$in": teamIds } },
  { "official_name": 1, "_id": 0 }
).forEach(function(team) {
  printjson(team);
});
The overall execution time then certainly goes down, and the overhead of multiple operations is reduced to just the two queries required.
Also remember that, despite what the execution plan stats say, the more queries you make to the server, the longer the overall real execution time will be, because each request has to be sent and its data retrieved. So it is best to keep things as minimal as possible.
It would therefore be even more logical, if you "need" this information regularly, to store the "coach name" on the "team" itself (and index that data), which leads to the fastest possible response and only a single query operation.
It's easy to get caught up in observing execution stats. But really, think of what is "best" and "fastest" as a pattern for the sort of queries you want to do.

Ways to implement data versioning in MongoDB

Can you share your thoughts on how you would implement data versioning in MongoDB? (I've asked a similar question regarding Cassandra. If you have any thoughts on which db is better for that, please share.)
Suppose that I need to version records in a simple address book. (Address book records are stored as flat JSON objects.) I expect that the history:
will be used infrequently
will be used all at once to present it in a "time machine" fashion
there won't be more versions than few hundred to a single record.
history won't expire.
I'm considering the following approaches:
Create a new object collection to store the history of records or changes to the records. It would store one object per version with a reference to the address book entry. Such records would look as follows:
{
'_id': 'new id',
'user': user_id,
'timestamp': timestamp,
'address_book_id': 'id of the address book record',
'old_record': {'first_name': 'Jon', 'last_name':'Doe' ...}
}
This approach can be modified to store an array of versions per document. But this seems to be slower approach without any advantages.
Store versions as serialized (JSON) object attached to address book entries. I'm not sure how to attach such objects to MongoDB documents. Perhaps as an array of strings.
(Modelled after Simple Document Versioning with CouchDB)
The first big question when diving in to this is "how do you want to store changesets"?
Diffs?
Whole record copies?
My personal approach would be to store diffs. Because the display of these diffs is really a special action, I would put the diffs in a different "history" collection.
I would use the different collection to save memory space. You generally don't want a full history for a simple query. So by keeping the history out of the object you can also keep it out of the commonly accessed memory when that data is queried.
To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:
{
_id : "id of address book record",
changes : {
1234567 : { "city" : "Omaha", "state" : "Nebraska" },
1234568 : { "city" : "Kansas City", "state" : "Missouri" }
}
}
To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save() method to make this change at the same time.
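As a rough illustration of the diff-based history document described above, here is a plain JavaScript sketch. It assumes flat records (as in the question) and assumes each diff stores the new values of the fields that changed; the helper names diffRecords and recordChange are hypothetical and not part of any library.

```javascript
// Hypothetical helper: return only the fields of newRecord whose values
// differ from oldRecord. Flat records only; fields that were removed are
// recorded as null.
function diffRecords(oldRecord, newRecord) {
  const diff = {};
  for (const key of Object.keys(newRecord)) {
    if (oldRecord[key] !== newRecord[key]) diff[key] = newRecord[key];
  }
  for (const key of Object.keys(oldRecord)) {
    if (!(key in newRecord)) diff[key] = null;
  }
  return diff;
}

// Hypothetical helper: add a time-stamped diff to a history document of
// the form { _id, changes: { <timestamp>: <diff> } }.
function recordChange(history, timestamp, oldRecord, newRecord) {
  const diff = diffRecords(oldRecord, newRecord);
  if (Object.keys(diff).length > 0) history.changes[timestamp] = diff;
  return history;
}

const history = { _id: "id of address book record", changes: {} };
const v1 = { name: "Jon Doe", city: "Omaha", state: "Nebraska" };
const v2 = { name: "Jon Doe", city: "Kansas City", state: "Missouri" };
recordChange(history, 1234568, v1, v2);
console.log(JSON.stringify(history.changes));
// {"1234568":{"city":"Kansas City","state":"Missouri"}}
```

In a data-access wrapper, a call like recordChange would sit inside the overridden save() method, so the history document is written alongside the main record.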
UPDATE: 2015-10
It looks like there is now a spec for handling JSON diffs. This seems like a more robust way to store the diffs / changes.
There is a versioning scheme called "Vermongo" which addresses some aspects which haven't been dealt with in the other replies.
One of these issues is concurrent updates, another one is deleting documents.
Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.
https://github.com/thiloplanz/v7files/wiki/Vermongo
Here's another solution using a single document for the current version and all old versions:
{
_id: ObjectId("..."),
data: [
{ vid: 1, content: "foo" },
{ vid: 2, content: "bar" }
]
}
data contains all versions. The data array is ordered, new versions will only get $pushed to the end of the array. data.vid is the version id, which is an incrementing number.
Get the most recent version:
find(
{ "_id":ObjectId("...") },
{ "data":{ $slice:-1 } }
)
Get a specific version by vid:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } } }
)
Return only specified fields:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } }, "data.content":1 }
)
Insert new version: (and prevent concurrent insert/update)
update(
{
"_id":ObjectId("..."),
$and:[
{ "data.vid":{ $not:{ $gt:2 } } },
{ "data.vid":2 }
]
},
{ $push:{ "data":{ "vid":3, "content":"baz" } } }
)
2 is the vid of the current most recent version and 3 is the new version being inserted. Because you need the most recent version's vid anyway, it's easy to get the next version's vid: nextVID = oldVID + 1.
The $and condition will ensure, that 2 is the latest vid.
This way there's no need for a unique index, but the application logic has to take care of incrementing the vid on insert.
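The effect of that guarded update can be illustrated with a small in-memory sketch in plain JavaScript. It is not MongoDB code; pushVersion is a made-up name, and the function only mirrors the two conditions of the $and clause: the expected vid must exist in the array, and no higher vid may exist.

```javascript
// In-memory illustration of the guarded $push; not MongoDB code.
// The push succeeds only if expectedVid is still the highest vid in doc.data.
function pushVersion(doc, expectedVid, content) {
  const hasExpected = doc.data.some((v) => v.vid === expectedVid);
  const hasNewer = doc.data.some((v) => v.vid > expectedVid);
  if (!hasExpected || hasNewer) return false; // another writer got there first
  doc.data.push({ vid: expectedVid + 1, content });
  return true;
}

const doc = {
  data: [
    { vid: 1, content: "foo" },
    { vid: 2, content: "bar" },
  ],
};

console.log(pushVersion(doc, 2, "baz")); // true: vid 3 is appended
console.log(pushVersion(doc, 2, "qux")); // false: vid 3 already exists
```

A writer whose call returns false (in MongoDB terms, an update that matched no document) would re-read the document and retry with the new latest vid.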
Remove a specific version:
update(
{ "_id":ObjectId("...") },
{ $pull:{ "data":{ "vid":2 } } }
)
That's it!
(remember the 16MB per document limit)
If you're looking for a ready-to-roll solution -
Mongoid has built in simple versioning
http://mongoid.org/en/mongoid/docs/extras.html#versioning
mongoid-history is a Ruby plugin that provides a significantly more complicated solution with auditing, undo and redo
https://github.com/aq1018/mongoid-history
I worked through this solution that accommodates a published, draft and historical versions of the data:
{
published: {},
draft: {},
history: {
"1" : {
metadata: <value>,
document: {}
},
...
}
}
I explain the model further here: http://software.danielwatrous.com/representing-revision-data-in-mongodb/
For those that may implement something like this in Java, here's an example:
http://software.danielwatrous.com/using-java-to-work-with-versioned-data/
Including all the code that you can fork, if you like
https://github.com/dwatrous/mongodb-revision-objects
If you are using mongoose, I have found the following plugin to be a useful implementation of the JSON Patch format
mongoose-patch-history
Another option is to use mongoose-history plugin.
let mongoose = require('mongoose');
let mongooseHistory = require('mongoose-history');
let Schema = mongoose.Schema;
let MySchema = new Schema({
title: String,
status: Boolean
});
MySchema.plugin(mongooseHistory);
// The plugin will automatically create a new collection with the schema name + "_history".
// In this case, collection with name "my_schema_history" will be created.
I have used the package below for a Meteor/MongoDB project, and it works well. The main advantage is that it stores history/revisions in an array within the same document, so no additional publications or middleware are needed to access the change history. It can keep a limited number of previous versions (e.g. the last ten), and it also supports change concatenation (so all changes that happened within a specific period are covered by one revision).
nicklozon/meteor-collection-revisions
Another sound option is to use Meteor Vermongo (here)