MongoDB “manual reference” performance compared to a traditional DB's “table joining” - mongodb

According to the official documentation, a "manual reference" is usually preferred over DBRef; experienced users even suggest never using DBRef. So I am seriously concerned about the performance penalty of querying twice whenever I want to fetch entities from related collections, especially compared with a traditional relational DB, where we can retrieve the expected result in a single query using table joins.
Denormalized example:
db.blogs.insert({
  _id: 1,
  title: "Investigation on MongoDB",
  content: "some investigation contents",
  post_date: Date.now(),
  permalink: "http://foo.bar/investigation_on_mongodb",
  comments: [
    { content: "Gorgeous post!!!", nickname: "Scott", email: "foo@bar.org", timestamp: "1377742184305" },
    { content: "Splendid article!!!", nickname: "Guthrie", email: "foo@bar.org", timestamp: "1377742184305" }
  ]
});
We can simply use db.blogs.find() to get everything we want: blog posts with the comments belonging to them.
Normalized example:
db.books.insert({
  _id: 1,
  name: "MongoDB Applied Design Patterns",
  price: 35,
  rate: 5,
  author: "Rick Copeland",
  ISBN: "1449340040",
  publisher_id: 1,
  reviews: [
    { isUseful: true, content: "Cool book!", reviewer: "Dick", timestamp: "1377742184305" },
    { isUseful: true, content: "Cool book!", reviewer: "Xiaoshen", timestamp: "1377742184305" }
  ]
});

db.publishers.insert({
  _id: 1,
  name: "Packtpub INC",
  address: "2nd Floor, Livery Place 35 Livery Street Birmingham",
  telephone: "+44 0121 265 6484"
});
Now, if I want to get the complete information about a single book, I have to query twice manually, similar to the below:
> var book = db.books.findOne({ "name": { $regex: 'mongo*', $options: 'i' } })
> db.publishers.findOne({ _id: book.publisher_id })
What I do know is that the above operations will be processed by Mongo "in memory"; still, my summarized question is below:
In simple words: a document-oriented database advocates "denormalizing" data so that results can be retrieved in a single query, yet when we have to store relational data it "suggests" using "manual references", which means querying twice, while a relational DB needs only one query thanks to table joins.
This makes no sense to me :)

A relational database also performs a JOIN by querying both tables, but it has the advantage that it can do this internally and doesn't have to communicate with the client in between: it can query the 2nd table immediately.
MongoDB first needs to send the results of the first query to the client before the application can formulate and send the 2nd query back to the database. The time lost by this is:
Network latency between database server and application server (a couple of ms)
Interpreting the response on the application server and generating an $in query from it (a couple of µs)
Network latency between application server and database server (a couple of ms)
Depending on how well the application server and the database server are interconnected, we are talking about a penalty of a few ms here.
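To make that round trip concrete, here is a minimal sketch of the client-side pattern described above, batching the second lookup with $in so the manual reference costs one extra round trip no matter how many books match (collection and field names follow the example from the question):
// First round trip: fetch the matching books.
var books = db.books.find({ name: { $regex: 'mongo*', $options: 'i' } }).toArray();
// Collect the referenced publisher ids from the first result set.
var publisherIds = books.map(function (b) { return b.publisher_id; });
// Second round trip: fetch all referenced publishers at once.
var publishers = db.publishers.find({ _id: { $in: publisherIds } }).toArray();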


Mongodb create index for boolean and integer fields

user collection
[
  {
    deleted: false,
    otp: 3435,
    number: '+919737624720',
    email: 'Test@gmail.com',
    name: 'Test child name',
    coin: 2
  },
  {
    deleted: false,
    otp: 5659,
    number: '+917406732496',
    email: 'anand.satyan@gmail.com',
    name: 'Nivaan',
    coin: 0
  }
]
I am using the command below to create the index. It looks like it is working for strings,
but I am not sure it is correct for the number and boolean fields.
db.users.createIndex({name:"text", email: "text", coin: 1, deleted: 1})
I am using this command to filter data:
db.users.find({$text:{$search:"anand.satya"}}).pretty()
db.users.find({$text:{$search:"test"}}).pretty()
db.users.find({$text:{$search:2}}).pretty()
db.users.find({$text:{$search:false}}).pretty()
The string fields work, but the numeric and boolean fields do not.
Please help me understand how to create an index for them.
The title and comments in this question are misleading. Part of the question is focused on how to query fields that contain boolean and integer values, while another part is focused on overall indexing strategies.
Regarding indexing, the index that was shown in the question is perfectly capable of satisfying some queries that include predicates on coin and deleted. We can see that when looking at the explain output for a query of .find({$text:{$search:"test"}, coin:123, deleted: false}):
> db.users.find({$text:{$search:"test"}, coin:123, deleted: false}).explain().queryPlanner.winningPlan.inputStage
{
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    filter: {
      '$and': [ { coin: { '$eq': 123 } }, { deleted: { '$eq': false } } ]
    },
    keyPattern: { _fts: 'text', _ftsx: 1, coin: 1, deleted: 1 },
    indexName: 'name_text_email_text_coin_1_deleted_1',
    isMultiKey: false,
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'backward',
    indexBounds: {}
  }
}
Observe here that the index scan stage (IXSCAN) is responsible for applying the filter for the coin and deleted predicates (as opposed to the database having to do that after FETCHing the full document).
Separately, you mentioned in the question that these two particular queries aren't working:
db.users.find({$text:{$search:2}}).pretty()
db.users.find({$text:{$search:false}}).pretty()
And by 'not working' you are referring to the fact that no results are returned. This is also related to the following discussion in the comments, which seemed to have a misleading takeaway:
You'll have to convert your coin and deleted fields to string, if you want it to be picked up by $search – Charchit Kapoor
So there is no way of searching a boolean or integer field? – Kiran S youtube channel
Nope, not that I know of. – Charchit Kapoor
You can absolutely use boolean and integer values in your query predicate to filter data. This playground demonstrates that.
What @Charchit Kapoor is saying can't be done is using the $text operator to match and return results whose field values are not strings. Said another way, the $text operator specifically performs a text search.
If what you are trying to achieve is direct equality matches for the field values, both strings and otherwise, then you can delete the text index, as there is no need for the $text operator in your query. A simplified query might be:
db.users.find({ name: "test"})
Demonstrated in this playground.
A few additional things come to mind:
Regarding indexing overall, databases will generally consider using an index only if the first key of the index is used in the query. You can read more about this for MongoDB specifically on this page. The takeaway is that you will want to create the appropriate set of indexes to align with your most commonly executed queries. If you have a query that just filters on coin, for example, then you may wish to create an index that has coin as its first key (see the sketch after this list).
If you want to check whether an exact string value is present in multiple fields, then you may want to do so using the $or operator (and have appropriate indexes for the database to use); this is also shown in the sketch below.
If you do indeed need more advanced text searching capabilities, then it would be appropriate to either continue using the $text operator or consider Atlas Search if the cluster is running in Atlas. Doing so does not prevent you from also having indexes that would support your other queries, such as on { coin: 2 }. It's simply that the syntax for performing such a query needs to be updated.
There is a lot going on here, but the big takeaway is that you can absolutely filter data based on any data type. Doing so simply requires using the appropriate syntax, and doing so efficiently requires an appropriate indexing strategy to be used alongside the queries.
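As a concrete illustration of the indexing and $or points above, here is a hedged sketch; the exact index shape and the values being queried are assumptions based on the documents in the question:
// An index with coin as its first key can support queries that filter on coin alone.
db.users.createIndex({ coin: 1, deleted: 1 })
db.users.find({ coin: 2 })
// Direct equality matching across several fields, without a text index:
db.users.find({ $or: [ { name: "Nivaan" }, { email: "Nivaan" } ] })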

MongoDB, how to choose the right design?

I have to create a collection of documents and I have a doubt about the right design.
Each document is an "identity"; each identity has a list of "partner data"; each partner's data is identified by an ID plus a set of data.
One approach can be (1):
{
  _id: ...,
  partners: [
    {
      id: "partner1",
      data: {
      }
    },
    {
      id: "partner2",
      data: {
      }
    }
  ]
}
Another approach can be (2):
{
  _id: ...,
  partners: {
    partner1: {
      data: {
      }
    },
    partner2: {
      data: {
      }
    }
  }
}
I prefer the first one, but considering that I could have millions of these identities, which would be the most performant schema?
A typical query can be: "how many identities have a partner with ID N?"
With the second example, a query can be:
db.identities.find({ "partners.partnerName": { $exists: true } })
With the first approach, how can I get this count?
The second solution is easier to handle server side; each document will have a map where each key is the partner ID, so instead of scanning all documents I can simply get the partner data by key...
What do you think about these solutions? I prefer the first one, but the second seems more "usable"...
Thanks
I prefer the first one, but considering that I could have millions of these identities, which would be the most performant schema?
If you are going to have millions of identities, then both approaches
are not really scalable.
Each document in mongo has a size limit (16MB) (read about it here)
In case you are going to have a really large number of identities,
the scalable approach would be to create a separate collection,
only for the relations and partnership data.
Now, I also want you to consider how you treat a "partnership":
if I'm a user and I have you on my partners list, will you see me as a partner on your list?
In case we both see each other as partners, then mongo-db may not be the best solution; graph DBs are more appropriate for dealing with relations of this type.
All solutions within mongo for two-way relations will be built on double updates (your id on my partner's list, my id on your partner's list).
(In SQL you could add an extra condition to the join, but not in mongo),
so you don't want to save each partnership twice (me and you, you and me),
just you and me.
Do you see where this is going?
If you only need to go one way,
then just create a second collection, "partnerships":
{
  _id: should be unique,
  user_id: 'your_id',
  partner_id: 'his_id',
  data: {} // or just flatten the fields into the root object
}
Note that you create one document per partnership!
Then you could use $lookup in order to query for a user with all of his partners, something like:
db.getCollection('partners').aggregate([
  {
    $lookup: {
      from: 'partnerships',
      localField: '_id',
      foreignField: 'user_id',
      as: 'partners'
    }
  },
  {
    $project: {
      name: 1,
      partners: 1,
      num_partners: { $size: "$partners" }
    }
  }
])
Read more about the aggregation stages here.
In case you are not going to have lots of partnerships, then please continue with
your first approach, which is good.
The second approach will make most queries to this collection pretty weird, and you will always have to write code in order to query it;
the queries won't be "straightforward" mongo queries.
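To answer the count question directly: with the first (array) approach, a hedged sketch of the query would use dot notation into the array elements (the collection and field names are assumed from the examples above):
// Count identities having a partner with a given ID; "partners.id" is a multikey
// path, so a regular index on it can support this query.
db.identities.countDocuments({ "partners.id": "partner1" })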

How to store related records in mongodb?

I have a number of associated records such as below.
Parent record:
{
  _id: 234,
  title: "title1",
  name: "name1",
  association: "assoc1"
}
Child record:
{
  _id: 21,
  title: "child title",
  name: "child name1"
}
I want to store such records into MongoDB. Can anyone help?
Regards.
Even though MongoDB doesn't support joins, you can organize data in several different ways:
1) First of all, you can inline (or embed) related documents. This is useful if you have a hierarchy of documents, e.g. a post and its comments. In that case you can do it like so:
{
  _id: <post_id>,
  title: 'asdf',
  text: 'asdf asdf',
  comments: [
    {<comment #1>},
    {<comment #2>},
    ...
  ]
}
In this case, all related data lives in the same document. You can fetch it with one query, but pushing new comments to a post causes the document to be moved on disk, so frequent updates will increase disk load and space usage.
2) Referencing is another technique you can use: in each document, you can put a special field that contains the _id of the parent/related object:
{
  _id: 1,
  type: 'post',
  title: 'asdf',
  text: 'asdf asdf'
},
{
  _id: 2,
  type: 'comment',
  text: 'yep!',
  parent_id: 1
}
In this case you store posts and comments in the same collection, and therefore you have to store the additional type field. MongoDB doesn't support constraints or any other way to enforce data consistency, which means that if you delete the post with _id=1, the comment with _id=2 is left holding a broken link in parent_id.
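As a small illustration of reading this layout back, here is a sketch assuming the posts and comments live in a single collection named items (the name is not from the original answer):
// Two queries: fetch a post, then fetch its comments via parent_id.
var post = db.items.findOne({ _id: 1, type: 'post' });
var comments = db.items.find({ type: 'comment', parent_id: post._id }).toArray();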
You can separate posts from comments in different collections or even databases by using database references, see your driver documentation for more details.
Both solutions can store tree-structured data, but in different ways.

How do I do a 'not-in' operation in mongodb?

I have two collections - shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and a person can be on the beach, shopping, doing both, or doing neither on any day. I now want to do a query: all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL DBs. I saw similar questions around joins and in most cases it was suggested to denormalize. So one solution I could think of is to create a collection - activity - with an index on date, embedding the actions of each user. Something like:
{
  user_id,
  date,
  actions: [action_type, ...]
}
Insertion now becomes costly, as now I will have to query before insert.
A few suggestions.
Figure out all the queries you'll be running and all the types of data you will need to store. For example, do you expect to add activities in the future, or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When a new activity comes in, you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date and activity, then your inserts would be much faster, but your queries would have to do a LOT of work querying across users, dates and activities.
With the proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date":"somedate"}, {$inc:{"shopped":1}}, true)
What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist" - aka "upsert" (that's the last 'true' argument).
Here is the query for all users on a particular day who danced more than once but didn't shop at all:
db.coll.find({"date":"somedate","shopped":0,"danced":{$gt:1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for an explanation - and keep in mind that large documents will keep getting pulled into your working set, and if they are huge and full of useless (old) data, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes for each user, where you track how many friends they have or other semi-stable information about them, and also a user_activity collection where you add a record per user per day recording what activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are is the first suggestion I made.
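A hedged sketch of that two-collection split (the collection and field names are illustrative, not prescribed by the answer):
// Semi-stable per-user attributes live in one collection...
db.users.insert({ _id: "user1", name: "User One", num_friends: 42 })
// ...while one small document per user per day records activity counts.
db.user_activity.insert({ user: "user1", date: "2012-12-01", shopped: 0, beached: 1 })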
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with an RDBMS, insertion can be (relatively) costly when there are indices on the table (i.e., usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
  actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case, though. I think the best would be to have a "flat" activities collection with documents like this:
{
  user_id,
  date,
  action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27);
var end = new Date(2012, 6, 3);
db.activities.find({
  date: { $gte: start, $lt: end },
  action: { $in: ["beach", "shopping"] }
});
The last step would be in your client driver: find the user ids for which records exist for "shopping" but not for "beach" activities.
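A minimal sketch of that client-side step in shell JavaScript (it assumes the flat activities documents and the start/end variables shown above):
// Build the set of users seen at the beach, then keep shoppers not in that set.
var docs = db.activities.find({
  date: { $gte: start, $lt: end },
  action: { $in: ["beach", "shopping"] }
}).toArray();
var beach = {};
docs.forEach(function (d) { if (d.action === "beach") beach[d.user_id] = true; });
var shoppersOnly = docs
  .filter(function (d) { return d.action === "shopping" && !beach[d.user_id]; })
  .map(function (d) { return d.user_id; });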
One possible structure is to use an embedded array of documents (a users collection):
{
  user_id: 1234,
  actions: [
    { action_type: "beach", date: "6/1/2012" },
    { action_type: "shopping", date: "6/2/2012" }
  ]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find({
  actions: {
    $elemMatch: {
      action_type: { $in: ["shopping"] },
      date: { $gt: start }
    }
  }
});
Expanding on this, you can use the $and operator to find all people went shopping, but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find({
  $and: [
    {
      actions: {
        $elemMatch: {
          action_type: { $in: ["shopping"] },
          date: { $gt: start }
        }
      }
    },
    {
      actions: {
        $not: {
          $elemMatch: {
            action_type: { $in: ["beach"] },
            date: { $gt: start }
          }
        }
      }
    }
  ]
});

Ways to implement data versioning in MongoDB

Can you share your thoughts on how you would implement data versioning in MongoDB? (I've asked a similar question regarding Cassandra. If you have any thoughts on which DB is better for this, please share.)
Suppose that I need to version records in a simple address book. (Address book records are stored as flat JSON objects.) I expect that the history:
will be used infrequently
will be used all at once to present it in a "time machine" fashion
won't contain more than a few hundred versions of a single record
won't expire
I'm considering the following approaches:
Create a new collection to store the history of records, or the changes to the records. It would store one object per version, with a reference to the address book entry. Such records would look as follows:
{
  '_id': 'new id',
  'user': user_id,
  'timestamp': timestamp,
  'address_book_id': 'id of the address book record',
  'old_record': { 'first_name': 'Jon', 'last_name': 'Doe' ... }
}
This approach can be modified to store an array of versions per document, but that seems slower without offering any advantages.
Store versions as serialized (JSON) objects attached to the address book entries. I'm not sure how to attach such objects to MongoDB documents - perhaps as an array of strings.
(Modelled after Simple Document Versioning with CouchDB)
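For what it's worth, a minimal sketch of that second option (the addressBook collection name and the versions field are assumptions for illustration):
// Append a serialized snapshot of the current document before overwriting it.
var current = db.addressBook.findOne({ _id: someId });
db.addressBook.updateOne(
  { _id: someId },
  { $push: { versions: JSON.stringify(current) } }
);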
The first big question when diving into this is "how do you want to store changesets"?
Diffs?
Whole record copies?
My personal approach would be to store diffs. Because displaying these diffs is really a special action, I would put the diffs in a separate "history" collection.
I would use a separate collection to save memory. You generally don't want the full history for a simple query, so by keeping the history out of the object you also keep it out of the commonly accessed memory when that data is queried.
To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:
{
  _id: "id of address book record",
  changes: {
    1234567: { "city": "Omaha", "state": "Nebraska" },
    1234568: { "city": "Kansas City", "state": "Missouri" }
  }
}
To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save() method to make this change at the same time.
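A hedged sketch of what that save() hook might record (the history collection name and the shape of the helper are assumptions, not from the answer):
// Inside a hypothetical save(): store the changed fields, keyed by timestamp.
function recordDiff(recordId, changedFields) {
  var update = { $set: {} };
  update.$set["changes." + Date.now()] = changedFields;
  db.history.updateOne({ _id: recordId }, update, { upsert: true });
}
recordDiff(addressBookId, { city: "Omaha", state: "Nebraska" });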
UPDATE: 2015-10
It looks like there is now a spec for handling JSON diffs. This seems like a more robust way to store the diffs / changes.
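The spec referred to is presumably JSON Patch (RFC 6902); as an illustration, a change like the one above could be expressed as an array of operations:
// A JSON Patch document (illustrative, assuming RFC 6902 is the spec meant above).
[
  { "op": "replace", "path": "/city", "value": "Omaha" },
  { "op": "replace", "path": "/state", "value": "Nebraska" }
]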
There is a versioning scheme called "Vermongo" which addresses some aspects which haven't been dealt with in the other replies.
One of these issues is concurrent updates; another is deleting documents.
Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.
https://github.com/thiloplanz/v7files/wiki/Vermongo
Here's another solution using a single document for the current version and all old versions:
{
  _id: ObjectId("..."),
  data: [
    { vid: 1, content: "foo" },
    { vid: 2, content: "bar" }
  ]
}
data contains all versions. The data array is ordered; new versions only get $pushed to the end of the array. data.vid is the version id, an incrementing number.
Get the most recent version:
find(
  { "_id": ObjectId("...") },
  { "data": { $slice: -1 } }
)
Get a specific version by vid:
find(
  { "_id": ObjectId("...") },
  { "data": { $elemMatch: { "vid": 1 } } }
)
Return only specified fields:
find(
  { "_id": ObjectId("...") },
  { "data": { $elemMatch: { "vid": 1 } }, "data.content": 1 }
)
Insert a new version (and prevent concurrent inserts/updates):
update(
  {
    "_id": ObjectId("..."),
    $and: [
      { "data.vid": { $not: { $gt: 2 } } },
      { "data.vid": 2 }
    ]
  },
  { $push: { "data": { "vid": 3, "content": "baz" } } }
)
2 is the vid of the current most recent version and 3 is the new version being inserted. Because you need the most recent version's vid, it's easy to get the next version's vid: nextVID = oldVID + 1.
The $and condition ensures that 2 is the latest vid.
This way there's no need for a unique index, but the application logic has to take care of incrementing the vid on insert.
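A small sketch of that application logic in shell JavaScript (the docs collection name and the retry comment are assumptions):
// Read the latest vid, then attempt the guarded update; if another writer got
// there first, nothing matches and the operation can be retried.
var doc = db.docs.findOne({ _id: id }, { data: { $slice: -1 } });
var latest = doc.data[0].vid;
var res = db.docs.updateOne(
  { _id: id, $and: [ { "data.vid": { $not: { $gt: latest } } }, { "data.vid": latest } ] },
  { $push: { data: { vid: latest + 1, content: "baz" } } }
);
if (res.modifiedCount === 0) {
  // Lost the race: re-read the latest vid and retry.
}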
Remove a specific version:
update(
  { "_id": ObjectId("...") },
  { $pull: { "data": { "vid": 2 } } }
)
That's it!
(remember the 16MB per document limit)
If you're looking for a ready-to-roll solution:
Mongoid has built-in simple versioning
http://mongoid.org/en/mongoid/docs/extras.html#versioning
mongoid-history is a Ruby plugin that provides a significantly more sophisticated solution, with auditing, undo and redo
https://github.com/aq1018/mongoid-history
I worked through this solution, which accommodates published, draft and historical versions of the data:
{
  published: {},
  draft: {},
  history: {
    "1": {
      metadata: <value>,
      document: {}
    },
    ...
  }
}
I explain the model further here: http://software.danielwatrous.com/representing-revision-data-in-mongodb/
For those that may implement something like this in Java, here's an example:
http://software.danielwatrous.com/using-java-to-work-with-versioned-data/
It includes all the code, which you can fork if you like:
https://github.com/dwatrous/mongodb-revision-objects
If you are using Mongoose, I have found the following plugin to be a useful implementation of the JSON Patch format:
mongoose-patch-history
Another option is to use the mongoose-history plugin.
let mongoose = require('mongoose');
let mongooseHistory = require('mongoose-history');
let Schema = mongoose.Schema;

let MySchema = new Schema({
  title: String,
  status: Boolean
});
MySchema.plugin(mongooseHistory);
// The plugin will automatically create a new collection with the schema name + "_history".
// In this case, a collection with the name "my_schema_history" will be created.
I have used the package below for a Meteor/MongoDB project, and it works well. Its main advantage is that it stores history/revisions within an array in the same document, so no additional publications or middleware are needed to access the change history. It can keep a limited number of previous versions (e.g. the last ten), and it also supports change concatenation (so all changes that happened within a specific period are covered by one revision).
nicklozon/meteor-collection-revisions
Another sound option is to use Meteor Vermongo (here)