Many to many in MongoDB

I decided to give MongoDB a try and see how well we get along. I do have some questions though.
Premise
I have users(id, name, address, password, email, etc)
I have stamps(id, type, value, price, etc)
Users browse through a stamp archive and filter it in various ways (pagination, filter by price, type, name, etc.), select a stamp, then add it to their collection.
Users can add more than one stamp to their collection (one piece of mint and one used, or just two pieces of used).
Users can flag some of their stamps for sale or trade and perhaps specify a price.
So far
Here's what I have so far:
{
    _id: objectid,
    Name: "bob",
    Email: "bob@bob.com",
    ...
    Stamps: [stampid-1, stampid-543, ..., stampid-23]
}
Questions
How should I add the state of the owned stamp, the quantity and condition?
What would be some sample queries for the situations described earlier?
As far as I know, ensureIndex makes it so you reduce the number of "scanned" entries.
The accepted answer here keeps changing the index. Is that just for the purpose of explaining it, or is this the way to do it? I mean, it does make sense somehow, but I keep thinking of it in SQL terms and... it does not make ANY sense...

The only change I would make is how you store the stamps that a user owns. I would store an array of objects representing the stamps, duplicating the values that are accessed most often.
For example, something like this:
{
    _id: objectid,
    Name: "bob",
    Email: "bob@bob.com",
    ...
    Stamps: [
        {
            _id: id,
            type: 'type',
            price: 20,
            forSale: true/false,
            quantity: 2
        },
        {
            _id: id2,
            type: 'type2',
            price: 5,
            forSale: false,
            quantity: 10
        }
    ]
}
You can see that some data is duplicated between the stamps collection and the stamps array in the user collection. You do this with the properties you access most often, because otherwise you would have to do a findOne for each stamp, and in MongoDB it is better to read the data directly than to run those extra queries. This way you can also add other properties such as quantity and forSale here.
The goal of the duplication is to avoid running a query for each stamp in the array.
Here is a link to a video that discusses MongoDB design and also explains what I tried to explain here:
http://lacantine.ubicast.eu/videos/3-mongodb-deployment-strategies/
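To make the sample-query question above concrete, here is a minimal sketch of what such queries could look like against this structure. The field names follow the example document, and someStampId is a placeholder for a real stamp's _id:

// All stamps that user "bob" has flagged for sale; the positional
// projection returns only the first matching array element:
db.users.find(
    { Name: "bob", "Stamps.forSale": true },
    { "Stamps.$": 1 }
)

// Increment the quantity of one owned stamp in place:
db.users.update(
    { Name: "bob", "Stamps._id": someStampId },
    { $inc: { "Stamps.$.quantity": 1 } }
)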

Coming from a SQL background, I'm struggling with NoSQL too. It seems to me that a lot hinges on how unchanging the types of data may or may not be. One thing that puzzles me in RDBMS systems is why it is not possible to declare a particular column/field "immutable". If you know a field is immutable (or nearly so) in a NoSQL context, it seems to me more acceptable to duplicate the info. Is it complete heresy to suggest that in many contexts you might actually want a combination of SQL and NoSQL structures?

Related

MongoDb many to many with big relations

I've read a lot of documentation and examples here on Stack Overflow, but I'm not really sure about my conclusions, so this is why I'm asking for help.
Imagine we have a collection Films and a collection Users, and we want to know which users have seen a film, and which films a user has seen.
One way to design this in MongoDb is:
User:
{
    "name": "User1",
    "films": [filmId1, filmId2, filmId3, filmId4] // ObjectIds from Films
}
Film:
{
    "name": "The incredible MongoDb Developer",
    "watched_by": [userId1, userId2, userId3] // ObjectIds from Users
}
OK, this may work if the number of users/films is low, but if, for example, we expect one film to have 800k viewers, the size of the array will be near 800k * 12 bytes ~ 9.5MB, which is close to the 16MB maximum size of a BSON document.
In this case, is there another approach besides the typical relational-world way, which is to create an intermediate collection for the relations?
Also, I don't know whether reading and parsing a JSON document of about 10MB will perform better than the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
Putting the films a user has seen into an array is a viable way, depending on your use cases. But especially if you want relations with attributes (say, date and place of viewing), updates and statistical analysis become less performant (you would need to $unwind your docs first, subsequent $matches become more costly, and so on).
If your relations have or may have attributes, I'd go with what you describe as the classical relational way, since it answers your most likely use cases as well as embedding does and allows for higher performance, in my experience.
Given a collection with a structure like
{
    _id: someObjectId,
    date: ISODate("2016-05-05T03:42:00Z"),
    movie: "nameOfMovie",
    user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies has he seen in the last 3 months, in descending order of date?
db.views.aggregate([
    { $match: { user: userName, date: { $gte: threeMonthAgo } } },
    { $sort: { date: -1 } },
    { $group: { _id: "$user", viewed: { $push: { movie: "$movie", date: "$date" } } } }
])
or, if you are ok with an iterator, even easier with:
db.views.find({ user: username, date: { $gte: threeMonthAgo } }).sort({ date: -1 })
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
    { $match: {
        movie: movieName,
        date: {
            $gte: ISODate("2016-05-30T00:00:00"),
            $lt: ISODate("2016-05-31T00:00:00")
        }
    } },
    { $group: {
        _id: "$movie",
        views: { $sum: 1 }
    } }
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645
For a given movie, show all users which have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
One thing to note: since we use the usernames and movie names directly, we do not need a JOIN (or anything similar), which should give us good performance. Plus, we do not have to do rather costly update operations when adding entries. Instead of an update, we simply insert the data.
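As a sketch of that last point, using the views schema above (the index definitions are my assumption, chosen to support the sample queries):

// Recording a new view is a plain insert; no existing document is touched:
db.views.insert({
    date: new Date(),
    movie: "nameOfMovie",
    user: "username"
})

// Indexes covering the per-user and per-movie queries shown above:
db.views.ensureIndex({ user: 1, date: -1 })
db.views.ensureIndex({ movie: 1, date: 1 })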

MongoDB One to Many Relationship

I'm starting to learn MongoDB, and at one point I asked myself how to solve the "one to many" relationship design in MongoDB. While searching, I found many comments in other posts/articles like "you are thinking relationally".
OK, I agree. There will be some cases where duplication of information won't be a problem, as in the classic CLIENTS-ORDERS example.
But suppose you have the tables ORDERS, which has an embedded DETAIL structure with the PRODUCTS that a client bought.
Then, for one reason or another, you need to change a product name (or some other piece of information) that is already embedded in several orders.
In the end you are forced to model a one-to-many relationship in MongoDB (that is, putting an ObjectID field as a link to another collection) to solve this simple problem, aren't you?
But every time I find some article/comment about this, it says that this will be a performance penalty in Mongo. It's kind of disappointing.
Is there another way to solve/design this without a performance penalty in MongoDB?
One to Many Relations
In this relationship, many entities map to one entity. For example:
- a city has many people who live in that city. Say NYC has 8 million people.
Let's assume the below data model:
//city
{
    _id: 1,
    name: 'NYC',
    area: 30,
    people: [
        {
            _id: 1,
            name: 'name',
            gender: 'gender'
            .....
        },
        ....
        // 8 million people entries inside this array
        ....
    ]
}
This won't work because that's going to be REALLY HUGE. Let's try to flip it on its head.
//people
{
    _id: 1,
    name: 'John Doe',
    gender: 'gender',
    city: {
        _id: 1,
        name: 'NYC',
        area: '30'
        .....
    }
}
The problem with this design is that there are obviously multiple people living in NYC, so we've duplicated the city data many times.
Probably, the best way to model this data is to use true linking.
//people
{
    _id: 1,
    name: 'John Doe',
    gender: 'gender',
    city: 'NYC'
}

//city
{
    _id: 'NYC',
    ...
}
In this case, the people collection can be linked to the city collection. Knowing we don't have foreign key constraints, we have to be consistent about it ourselves. So this is a one-to-many relation, and it requires 2 collections. For small one-to-few relations (which are also one-to-many), like blog posts to comments, the comments can be embedded inside the post documents as an array.
So, if it's truly one-to-many, 2 collections with linking work best, as sketched below. But for one-to-few, a single collection is generally enough.
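A minimal sketch of how the linking resolves at query time, using the people/city documents above (two simple queries instead of a join):

// Which city does John Doe live in? Fetch the person, then the city:
var person = db.people.findOne({ name: 'John Doe' })
var city = db.city.findOne({ _id: person.city })

// Going the other direction needs only one query:
db.people.find({ city: 'NYC' })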
The problem is that you over-normalize your data. An order is defined by a customer who lives at a certain place at a given point in time and pays a certain price valid at the time of the order (which might change heavily over the application's lifetime, and which you have to document anyway), plus several other parameters which are all valid only at a certain point in time. So to document an order (pun intended), you need to persist all data for that certain point in time. Let me give you an example:
{ _id: "order123456789",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: ObjectId("53fb38f0040980c9960ee270"),
items:[ ObjectId("53fb3940040980c9960ee271"),
ObjectId("53fb3940040980c9960ee272"),
ObjectId("53fb3940040980c9960ee273")
],
Total:400
}
Now, as long as neither the customer nor the details of the items change, you are able to reproduce where this order was sent, what the prices on the order were, and so on. But what happens if the customer changes their address? Or if the price of an item changes? You would need to keep track of those changes in their respective documents. It would be much easier and sufficiently efficient to store the order like this:
{
    _id: "order987654321",
    date: ISODate("2014-08-01T16:25:00.141Z"),
    customer: {
        userID: ObjectId("53fb3940040980c9960ee283"),
        recipientName: "Foo Bar",
        address: {
            street: "742 Evergreen Terrace",
            city: "Springfield",
            state: null
        }
    },
    items: [
        { count: 1, productId: ObjectId("53fb3940040980c9960ee300"), price: 42.00 },
        { count: 3, productId: ObjectId("53fb3940040980c9960ee301"), price: 0.99 },
        { count: 5, productId: ObjectId("53fb3940040980c9960ee302"), price: 199.00 }
    ]
}
With this data model and the use of aggregation pipelines, you have several advantages:
You don't need to independently keep track of prices, addresses, name changes or gift purchases of a customer: it is already documented.
Using aggregation pipelines, you can create price trends without needing to store pricing data independently. You simply store the current price of an item in the order document.
Even complex aggregations such as price elasticity, turnover by state/city and the like can be done with pretty simple aggregations, as sketched below.
In general, it is safe to say that in a document-oriented database, every property or field which is subject to change in the future, where that change would create a different semantic meaning, should be stored inside the document. Everything which is subject to change in the future but doesn't touch the semantic meaning (the user's password in the example) may be linked via a GUID.
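As an illustration of the turnover point, here is a sketch of such an aggregation over the order documents above (the exact pipeline and the orders collection name are my assumptions, not part of the original answer):

// Turnover by city, computed from the prices embedded in each order:
db.orders.aggregate([
    { $unwind: "$items" },
    { $group: {
        _id: "$customer.address.city",
        turnover: { $sum: { $multiply: ["$items.count", "$items.price"] } }
    } },
    { $sort: { turnover: -1 } }
])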

mongodb: Embed only id or both id and name

I'm new to MongoDB; please suggest how to correctly design the schema for a situation like the one below:
I have a User collection and a Product collection. Product contains info like id, title, description, price... Users can bookmark or like a Product. Currently, in the User collection, I store one array for liked products and one array for bookmarked products. So when I need to view info about a user, I have to read out these two arrays, then search the Product collection to get the titles of the liked and bookmarked products.
//User collection
{
    _id: 12345,
    name: "John",
    liked: [123, 456, 789],
    bkmark: [123, 125]
}

//Product collection
{
    _id: 123,
    title: "computer",
    desc: "awesome computer",
    price: 12
}
Now I think I can speed up this process by embedding both the product id and title in the User collection, so that I don't have to search the Product collection; I can just read them out and display them. But if I choose this way, whenever a Product's title gets updated, I have to search and update the User collection too. I can't evaluate the update cost of the second way, so I don't know which one is correct. Please help me choose between them.
Thanks & Regards.
You should consider what happens more often: a product gets renamed, or the information of a user is requested.
You should also consider what's the bigger problem: some time lag during which users see an outdated product name (we are talking about seconds, maybe minutes when you have a really large number of users), or a consistently longer response time when requesting a user profile.
Without knowing your actual usage patterns and requirements, I would guess that it's the latter in both cases, so you should rather optimize for this situation.
In general it is not recommended to normalize a MongoDB database as radically as you would normalize a relational database. The reason is that MongoDB cannot perform JOINs. So it's usually not such a bad idea to duplicate some relevant information in multiple documents, while accepting a higher cost for updates and a potential risk of inconsistencies.
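To give a feel for the update cost of the second way: if the liked array held { _id, title } pairs instead of plain ids, a product rename would become one extra multi-document update. A sketch under that assumption (collection names are illustrative):

// Rename the product itself:
db.products.update({ _id: 123 }, { $set: { title: "new title" } })

// Propagate the new title to every user embedding this product; the
// positional operator rewrites the matching array element in each document:
db.users.update(
    { "liked._id": 123 },
    { $set: { "liked.$.title": "new title" } },
    { multi: true }
)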

best possible schema design for log analysis database in mongodb

I have to store the following data in MongoDB: uid, gender, country, city, date_of_visit, url_of_visit.
I would like to store uid, gender, country and city in one collection, because this information will never change for a particular user.
In the other collection I would like to store uid, date_of_visit, url_of_visit.
I want to know the best practice for storing uid, date_of_visit and url_of_visit. There are two options in my mind:
(a) { uid: 100, date: xxxxxxxxxxxxxxx, url: abc.php }
    { uid: 100, date: xxxxxx, url: ref.php }
    { uid: 200, date: xxxxxxxxx, url: ref.php }

(b) { uid: 100, visit: [ { date: xxxxxxx, url: abc.php },
                         { date: xxxx, url: def.php },
                         { ......................... } ] }
I want to have the following index: date:1, uid:1, url:1. The problem with approach (a) is that with each row inserted into the database, the database size and index size will grow, and there will come a point when the index will not fit into RAM.
The problem with approach (b) is that at some point each document will exceed the 16 MB limit, and the approach will fail then.
Please suggest the best schema design for this scenario. I would also have queries which include uid, gender, country, date_of_visit and url_of_visit.
I know this thread is a bit older but I'm wondering if you've decided on a structure and if it works well.
My idea was, instead of risking creating documents that are too large, to structure it similarly to your second approach, but to include the date in the main collection. This way each document would be one user's activity within one day. It would be indexed by user and date, easy to update and query, and it would keep things organized.
Something like:
{ uid: 100, date: xxxxxxx, event: [ { time: xxxxxxx, url: abc.php },
                                    { time: xxxx, url: def.php },
                                    { ......................... } ] }
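A sketch of how such a daily bucket could be maintained (collection and field names are illustrative): the upsert creates the day's document on a user's first visit, and later visits simply push into the event array.

db.visits.update(
    { uid: 100, date: ISODate("2013-02-01T00:00:00Z") },
    { $push: { event: { time: new Date(), url: "abc.php" } } },
    { upsert: true }
)

// Index by user and date, as described above:
db.visits.ensureIndex({ uid: 1, date: 1 })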
I think the second approach is better than the first because it corresponds to the idea of grouping similar data together. As for exceeding the 16MB document limit: you can reach it, but the user would have to be very active. :)
Also, you can pull some data out into another collection and make a reference using an ObjectId or a DBRef.
See more info at http://www.mongodb.org/display/DOCS/Database+References#DatabaseReferences-DBRef
Your second approach will force you to fetch a huge amount of data from the embedded document, which cannot be filtered by Mongo. In other words, if you have a million entries stored inside the "event" field for a particular user, then when you fetch those embedded documents with dot notation, the entire document, including the parent, will be returned. There's no way you can filter the results.
I would recommend the first approach which makes the data easier to retrieve and work with.

MongoDB / NOSQL: Best approach to handling read/unread status on messages

Suppose you have a large number of users (M) and a large number of documents (N) and you want each user to be able to mark each document as read or unread (just like any email system). What's the best way to represent this in MongoDB? Or any other document database?
There are several questions on StackOverflow asking this question for relational databases but I didn't see any with recommendations for document databases:
What's the most efficient way to remember read/unread status across multiple items?
Implementing an efficient system of "unread comments" counters
Typically the answers involve a table listing everything a user has read (i.e. tuples of user id, document id), with a possible optimization of a cut-off date, allowing mark-all-as-read to wipe the table and start again, knowing that anything prior to that date is 'read'.
So, MongoDB / NOSQL experts, what approaches have you seen in practice to this problem and how did they perform?
// Per-recipient preferences for a message:
{
    _id: messagePrefs_uniqueId,
    type: 'prefs',
    timestamp: unix_timestamp,
    ownerId: recipientId,
    messageId: messageId,
    read: true / false
}

// The message itself:
{
    _id: message_uniqueId,
    timestamp: unix_timestamp,
    type: 'message',
    contents: 'this is the message',
    senderId: senderId,
    recipients: [recipientId1, recipientId2]
}
Say you have 3 messages you want to retrieve preferences for; you can get them via something like:
db.messages.find({
    messageId: { $in: [messageId1, messageId2, messageId3] },
    ownerId: recipientId,
    type: 'prefs'
})
If all you need is read/unread, you could use this with MongoDB's upsert capabilities, so you are not creating prefs for each message unless the user actually reads it: you basically create the prefs object with your own unique id and upsert it into MongoDB (see the sketch after these examples). If you want more flexibility (like, say, tags or folders), you'll probably want to make the pref for each recipient of the message. For example you could add:
tags: ['inbox','tech stuff']
to the prefs object, and then to get the prefs of all messages tagged with 'tech stuff' you'd do something like:
db.messages.find({type: 'prefs', ownerId: recipientId, tags: 'tech stuff'})
You could then use the messageIds you find within the prefs to query for the corresponding messages:
db.messages.find({ type: 'message', _id: { $in: [array of messageIds from prefs] } })
It might be a little tricky if you want to do something like efficiently counting how many messages each 'tag' contains. If it's only a handful of tags, you can just add .count() to the end of each query. If it's hundreds or thousands, then you might do better with a map/reduce server-side script, or maybe an object that keeps track of message counts per tag per user.
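A sketch of the upsert mentioned above (variable names are placeholders): the prefs document only comes into existence when the recipient actually reads the message.

db.messages.update(
    { type: 'prefs', ownerId: recipientId, messageId: messageId },
    { $set: { read: true, timestamp: unix_timestamp } },
    { upsert: true }
)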
If you're only storing a simple boolean value, like read/unread, another method is to embed an array in each Document that contains a list of the Users who have read it.
{
    _id: 'document#42',
    ...
    read_by: ['user#83', 'user#2702']
}
You should then be able to index that field, making for fast queries for Documents-read-by-User and Users-who-read-Document.
db.documents.find({ read_by: 'user#83' })
db.documents.find({ _id: 'document#42' }, { read_by: 1 })
However, I find that I'm usually querying for all Documents that have not been read by a particular User, and I can't think of any solution that can make use of the index in this case. I suspect it's not possible to make this fast without having both read_by and unread_by arrays, so that every User is included in every Document (or a join table), but that would have a large storage cost. A rough sketch of that variant follows.
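For what it's worth, here is what that double-array variant could look like (with the storage cost noted above, since every user id would have to be seeded into unread_by when a document is created):

// "Not yet read by user#83" becomes an indexable equality match:
db.documents.find({ unread_by: 'user#83' })

// Marking as read moves the user from one array to the other:
db.documents.update(
    { _id: 'document#42' },
    { $pull: { unread_by: 'user#83' }, $addToSet: { read_by: 'user#83' } }
)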