MongoDB Collection Structure - mongodb

I have a collection named User.
User has a lot of information that can be grouped together, for example:
{address : {street:"xx", city:"xx"}}
{history: {school:"xx", job: "xx"}}
{..}
So I want to know what the best practice is
1. First way, Using nested fields:
{user: {
  address: {street: "xx", city: "xx"},
  history: {school: "xx", job: "xx"},
  ...
}}
2. Second way: Just put them all together.
{user: {
  street: "xx",
  city: "xx",
  ...
  school: "xx",
  job: "xx",
  ...
}}
The first way is obviously more readable and makes it easier for a human to find relevant information.
What are the downsides of grouping/nesting data like the first way?
Does it make querying nested fields slower? Are there indexing issues? Any ideas?

If you store the extra data under the user, it enables faster reading and writing of the entire user document.
If you store the extra data in separate collections, it may enable faster find/update access to that data (depending on your indexes). MongoDB does enable indexing fields in arrays.
My suggestion is to list your common data access use cases, create a test database with a LOT of mock data, then measure the performance of your queries and aggregations to differentiate between the storage modelling options.
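For the nested layout, MongoDB addresses sub-fields with dot notation (for example `address.city`), both in queries and in index definitions. The sketch below is plain Python rather than a MongoDB call: a hypothetical `get_path` helper mimics how a dotted path resolves against a nested document, and the sample values are made up. It only illustrates that nesting changes a field's name, not whether it can be reached; it says nothing about speed, which still needs the measurement suggested above.

```python
from typing import Any


def get_path(doc: dict, path: str) -> Any:
    """Resolve a dotted path such as "address.city" against a nested dict."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # field is missing at this level
        current = current[key]
    return current


# Hypothetical sample documents modelled on the two layouts in the question.
nested_user = {
    "address": {"street": "1 Main St", "city": "Springfield"},
    "history": {"school": "State U", "job": "Engineer"},
}
flat_user = {
    "street": "1 Main St",
    "city": "Springfield",
    "school": "State U",
    "job": "Engineer",
}

nested_city = get_path(nested_user, "address.city")  # nested layout
flat_city = get_path(flat_user, "city")              # flat layout
print(nested_city, flat_city)
```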

Related

Server side paging and grouping of large dataset

I'll try to explain the issue as best I can. Implement a grid with server paging. On request for N entities, DB should return a set of data which should be grouped or better said transformed in such a way that when the transformation phase is done it should result in those N entities.
Best way as I can see is something like this:
Query_all_data() => Result; (10000000 documents)
Transform(Result) => Transformed (100 groups)
Transformed.Skip(N).Take(N)
Transformation phase should be something like this:
Result = [d0, d1, d2..., dN]
Transformed = [
{ info: "foo", docs: [d0, d2, d21, d67, d100042] },
{ info: "bar", docs: [d3, d28, d121, d6271, d100042] },
{ info: "baz", docs: [d41, d26, d221, d567, d100043] },
{ info: "waz", docs: [d22, d24, d241, d167, d1000324] }
]
Every object in Transformed is an entity in grid.
I'm not sure if it's important, but the DB in question is MongoDB and all documents are stored in one collection. The huge pitfall of this approach is that it's way too slow on a large dataset, which will most certainly be the case.
Is there a better approach? Maybe a different DB design?
@dakt, you can store your data in a couple of different ways, depending on how you are going to use it. In the process it may also be useful to store data in denormalized form, where some duplication of data may occur.
1. Store the data as individual documents, as described in your problem statement.
2. Store the data in the transformed format from your problem statement. It looks like you have a consistent way of mapping the docs to some tag. If so, why not maintain documents so that the docs are always embedded under their tag? This limits the number of docs each group can contain, because of the 16MB document size limit.
I would suggest looking at the MongoDB use-cases - http://docs.mongodb.org/ecosystem/use-cases/ and see if any of those are similar to what you are trying to achieve.
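The grouping and paging steps from the question can be sketched in plain Python as below. The `id` and `info` field names and the `group_docs` and `page` helpers are assumptions made for illustration. If the groups are maintained ahead of time, as the second option suggests, only the `page` step would need to run per request.

```python
from collections import defaultdict
from itertools import islice


def group_docs(docs, key):
    """Group documents by `key`, keeping groups in first-seen order."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc["id"])
    return [{"info": info, "docs": ids} for info, ids in groups.items()]


def page(groups, skip, take):
    """Return `take` groups after skipping the first `skip`."""
    return list(islice(groups, skip, skip + take))


# Hypothetical sample data standing in for the stored documents.
docs = [
    {"id": 0, "info": "foo"},
    {"id": 1, "info": "bar"},
    {"id": 2, "info": "foo"},
    {"id": 3, "info": "baz"},
    {"id": 4, "info": "bar"},
]

transformed = group_docs(docs, "info")
second_page = page(transformed, skip=1, take=2)
print(second_page)
```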

Links vs References in Document databases

I am confused by the term 'link' for connecting documents.
The OrientDB page http://www.orientechnologies.com/orientdb-vs-mongodb/ states that they use links to connect documents, while in MongoDB documents are embedded.
Since in MongoDB (http://docs.mongodb.org/manual/core/data-modeling-introduction/) documents can be referenced as well, I don't see the difference between linking documents and referencing them.
The goal of document-oriented databases is to reduce "impedance mismatch": the degree to which objects residing in memory at runtime must be split up to fit a database schema. With a document, the entire object is serialized to disk without splitting it across multiple tables and joining the pieces back together on retrieval.
That being said, a linked document is the same as a referenced document; they are simply two ways of saying the same thing. How those links are resolved at query time varies from one database implementation to another.
An embedded document, by contrast, is an object that relates to a parent type and is stored inside the parent. For example, I have a class as follows:
class User
{
string Name
List<Achievement> Achievements
}
Where Achievement is an arbitrary class (its contents don't matter for this example).
If I were to save this using linked documents, I would save User in a Users collection and Achievement in an Achievements collection with the List of Achievements for the user being links to the Achievement objects in the Achievements collection. This requires some sort of joining procedure to happen in the database engine itself. However, if you use embedded documents, you would simply save User in a Users collection where Achievements is inside the User document.
A JSON representation of the data for an embedded document would look (roughly) like this:
{
"name":"John Q Taxpayer",
"achievements":
[
{
"name":"High Score",
"point":10000
},
{
"name":"Low Score",
"point":-10000
}
]
}
Whereas a linked document might look something like this:
{
"name":"John Q Taxpayer",
"achievements":
[
"somelink1", "somelink2"
]
}
Inside an Achievements Collection
{
"somelink1":
{
"name":"High Score",
"point":10000
},
"somelink2":
{
"name":"Low Score",
"point":-10000
}
}
Keep in mind these are just approximate representations.
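To make the "joining procedure" concrete, here is a minimal Python sketch in which plain dicts stand in for the two collections. The `resolve_links` helper is hypothetical, and the data mirrors the embedded example; a real engine resolves links internally, in its own way.

```python
# Hypothetical in-memory stand-ins for the Achievements and Users collections.
achievements = {
    "somelink1": {"name": "High Score", "point": 10000},
    "somelink2": {"name": "Low Score", "point": -10000},
}
linked_user = {
    "name": "John Q Taxpayer",
    "achievements": ["somelink1", "somelink2"],
}


def resolve_links(user: dict, collection: dict) -> dict:
    """Return a copy of `user` with each link replaced by the linked document."""
    resolved = dict(user)
    resolved["achievements"] = [collection[link] for link in user["achievements"]]
    return resolved


# One extra lookup per link turns the linked form into the embedded form.
embedded_user = resolve_links(linked_user, achievements)
print(embedded_user)
```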
So to summarize, linked documents function much like RDBMS PK/FK relationships. This allows multiple documents in one collection to reference a single document in another collection, which can help deduplicate stored data. However, it adds a layer of complexity, requiring the database engine to make multiple disk I/O calls to form the final document returned to user code. An embedded document more closely matches the object in memory, which reduces impedance mismatch and (in theory) the number of disk I/O calls.
You can read up on Impedance Mismatch here: http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch
UPDATE
I should add that choosing the right database for your needs is very important from the start. If you have a lot of questions about each database, it might make sense to contact each supplier and get some of their training material. MongoDB offers 2 free courses at MongoDB University that you can take to learn more about their product and its best uses. OrientDB does offer training, but it is not free. It might be best to contact them directly and get some sort of pre-sales training (if you are looking to license the db); usually they will put you in touch with a pre-sales consultant to help you evaluate their product.
MongoDB works like an RDBMS, where the object id acts like a foreign key. This means a "JOIN" that is expensive at run time. OrientDB, instead, has direct links that are created only once and have a very low run-time cost.

Ensure data coherence across documents in MongoDB

I'm working on a Rails app that implements some social network features such as relationships, following, etc. Everything was fine until I came across a problem with many-to-many relations. As you know, mongo lacks joins, so the recommended workaround is to store the relation as an array of ids on both related documents. OK, it's a bit redundant but it should work. Let's say:
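field :followers, type: Array, default: []
field :following, type: Array, default: []

# Make this user follow `who`, recording the relation on both documents.
def follow!(who)
  who.followers << self.id   # self becomes a follower of `who`
  self.following << who.id   # `who` is now followed by self
  who.save                   # two separate writes: the second can fail after the first succeeds
  self.save
end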
That works pretty well, but this is one of those cases where we would need a transaction, and mongo doesn't support transactions. What if the id is added to the followed user's followers list but not to the follower's following list? I mean, what if the first document is modified properly but the second, for some reason, can't be updated?
Maybe I'm too pessimistic, but isn't there a better solution?
I would recommend storing relationships in one direction only: store the users someone follows in their user document as "following". Then, if you need all followers of user U1, you can query the users collection for {following: "U1"}. Since you can have a multi-key index on an array, this query will be fast if you index this field.
The other reason to go in that direction only is that a single user has a practical limit to how many different users they may be following, but the number of followers a really popular user may have could be close to the total number of users in your system. You want to avoid creating an array in a document that could grow that large.
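A minimal sketch of the one-directional design, using a plain Python list in place of the users collection. The `follow` and `followers_of` helpers and the sample ids are assumptions; in MongoDB the follower lookup would be a query on the indexed `following` array rather than a Python scan.

```python
# Hypothetical users "collection": the relation is stored only on the follower.
users = [
    {"_id": "U1", "following": []},
    {"_id": "U2", "following": ["U1"]},
    {"_id": "U3", "following": ["U1", "U2"]},
]


def follow(users, follower_id, followed_id):
    """Record the relation once, on the follower's document only."""
    for user in users:
        if user["_id"] == follower_id and followed_id not in user["following"]:
            user["following"].append(followed_id)


def followers_of(users, user_id):
    """Derive followers by searching every `following` array."""
    return [u["_id"] for u in users if user_id in u["following"]]


follow(users, "U1", "U2")  # a single-document write, so nothing can get out of sync
print(followers_of(users, "U1"))
print(followers_of(users, "U2"))
```

Because each follow touches one document only, there is no second write that could fail and leave the two sides inconsistent.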

Returning custom fields in MongoDB

I have a mongoDB collection with an array field that represents the lists the user is a member of.
user {
screen_name: string
listed_in: ['list1', 'list2', 'list3', ...] //Could be more than 10000 elements (I'm aware of the BSON 16MB limits)
}
I am using the *listed_in* field to get the members of a list:
db.user.find({'listed_in': 'list2'});
I also need to query for a specific user and know whether they are a member of certain lists:
var user1 = db.user.findOne({'screen_name': 'user1'});
In this case I will get the *listed_in* field with all its elements.
My question is:
Is there a way to pre-compute custom fields in mongoDB?
I would need to be able to get fields like these, user1.isInList1, user1.isInList2
Right now I have to do it on the client side by iterating through the *listed_in* array to know if the user is a member of "list1", but *listed_in* could have thousands of elements.
"My question is: Is there a way to pre-compute custom fields in mongoDB?"
Not really. MongoDB does not have any notion of "computed columns". So the query you're looking for doesn't exist.
"Right now I have to do it in the client side by iterating through the *listed_in* array to know if the user is member of "list1" but *listed_in* could have thousand elements"
In your case you're basically trying to push a client-side for loop onto the server. However, some process still has to do the for loop. And frankly, looping through 10k items is not really that much work for either client or server.
The only real savings here is preventing extra data on the network.
If you really want to save that network traffic, you will need to restructure your data model. This re-structure will likely involve two queries to read and write, but less data over the wire. But that's the trade-off.
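The answer does not say what the restructured model would look like. One possible shape, shown here purely as an assumption, is a separate membership collection with one small record per (user, list) pair. The Python sketch below uses a set in place of that collection; `is_in_list` and `members_of` are hypothetical helpers.

```python
# Hypothetical membership "collection": one record per (screen_name, list) pair,
# instead of one large listed_in array on each user document.
memberships = {
    ("user1", "list1"),
    ("user1", "list3"),
    ("user2", "list2"),
    ("user3", "list2"),
}


def is_in_list(screen_name: str, list_name: str) -> bool:
    """Answer a single membership question without loading a user's whole array."""
    return (screen_name, list_name) in memberships


def members_of(list_name: str) -> list:
    """Return the members of one list, sorted for stable output."""
    return sorted(name for name, lst in memberships if lst == list_name)


print(is_in_list("user1", "list1"))
print(members_of("list2"))
```

The cost is the one the answer names: reading a full user profile plus its memberships now takes two lookups, in exchange for sending less data over the wire.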

When to embed documents in Mongo DB

I'm trying to figure out how to best design Mongo DB schemas. The Mongo DB documentation recommends relying heavily on embedded documents for improved querying, but I'm wondering if my use case actually justifies referenced documents.
A very basic version of my current schema is basically:
(Apologies for the pseudo-format, I'm not sure how to express Mongo schemas)
users {
email (string)
}
games {
user (reference user document)
date_started (timestamp)
date_finished (timestamp)
mode (string)
score: {
total_points (integer)
time_elapsed (integer)
}
}
Games are short (about 60 seconds long) and I expect a lot of concurrent writes to be taking place.
At some point, I'm going to want to calculate a high score list, possibly in a segregated fashion (e.g., a high score list for a particular game.mode or date).
Are embedded documents the best approach here? Or is this truly a problem that relations solve better? How would these use cases best be solved in Mongo DB?
"... is this truly a problem that relations solve better?"
The key here is less about "is this a relation?" and more about "how am I going to access this?"
MongoDB is not "anti-reference". MongoDB does not have the benefits of joins, but it does have the benefit of embedded documents.
As long as you understand these trade-offs then it's perfectly fair to use references in MongoDB. It's really about how you plan to query these objects.
"Are embedded documents the best approach here?"
Maybe. Some things to consider.
Do games have value outside of the context of the user?
How many games will a single user have?
Are games transactional in nature?
How are you going to access games? Do you always need all of a user's games?
If you're planning to build leaderboards and a user can generate hundreds of game documents, then it's probably fair to keep games in their own collection. Storing ten thousand instances of "game" inside each user document isn't particularly useful.
But depending on your answers to the above, you could really go either way. As a litmus test, I would try running some Map/Reduce jobs (e.g. building a simple leaderboard) to see how you feel about the structure of your data.
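As a rough idea of what such a leaderboard needs from the data, here is a Python sketch over games kept in their own collection. The field names follow the schema in the question; the sample values and the `leaderboard` helper are assumptions, and in MongoDB the same result would come from a query or aggregation on the games collection.

```python
import heapq

# Hypothetical games "collection", one document per finished game.
games = [
    {"user": "u1", "mode": "classic", "score": {"total_points": 120, "time_elapsed": 58}},
    {"user": "u2", "mode": "classic", "score": {"total_points": 300, "time_elapsed": 60}},
    {"user": "u1", "mode": "classic", "score": {"total_points": 250, "time_elapsed": 55}},
    {"user": "u3", "mode": "timed", "score": {"total_points": 500, "time_elapsed": 30}},
    {"user": "u3", "mode": "classic", "score": {"total_points": 90, "time_elapsed": 61}},
]


def leaderboard(games, mode, limit=3):
    """Return the top `limit` (user, points) pairs for one game mode."""
    in_mode = (g for g in games if g["mode"] == mode)
    top = heapq.nlargest(limit, in_mode, key=lambda g: g["score"]["total_points"])
    return [(g["user"], g["score"]["total_points"]) for g in top]


classic_top = leaderboard(games, "classic")
print(classic_top)
```

With games in their own collection, the leaderboard filters and sorts game documents directly. Had they been embedded in users, every user document would first need to be unpacked.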
Why would you use a relation here? If 'email' is the only user property, then denormalizing and using an embedded document would be perfectly fine. If the user object contains other information, I would go for a reference.
I think you should use the "entity" and "value object" definitions from DDD. For an entity use a reference, but for a value object use an embedded document.
You can also denormalize your objects, meaning that you duplicate part of your data, e.g.
// root document
game
{
//duplicate part that you need of root user
user: { FirstName: "Some name", Id: "some ID"}
}
// root document
user
{
Id:"ID",
FirstName:"someName",
LastName:"last name",
...
}
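The duplication above has a maintenance cost: when the root user changes, every copy must be updated too. A small Python sketch of that trade-off follows. The `make_game` and `rename_user` helpers, the `mode` field and the sample values are assumptions for illustration.

```python
def make_game(user: dict, mode: str) -> dict:
    """Build a game document that embeds only the user fields it needs."""
    return {
        "mode": mode,
        "user": {"Id": user["Id"], "FirstName": user["FirstName"]},
    }


def rename_user(user: dict, games: list, new_first_name: str) -> int:
    """Update the root user, then every duplicated copy; return the copies touched."""
    user["FirstName"] = new_first_name
    touched = 0
    for game in games:
        if game["user"]["Id"] == user["Id"]:
            game["user"]["FirstName"] = new_first_name
            touched += 1
    return touched


# Hypothetical root user and two games that duplicate part of it.
user = {"Id": "42", "FirstName": "Ada", "LastName": "Lovelace"}
games = [make_game(user, "classic"), make_game(user, "timed")]

updated = rename_user(user, games, "Augusta")
print(updated, games[0]["user"])
```

Reads of a game never need a second lookup for the user's name, but a rename has to touch every game that carries a copy.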