Mongodb architecture required [closed] - mongodb

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I am creating a project that requires data storage and I am considering useing MongoDB but am having trouble finding the logical / optimal way of organising the data
My simplified data needs to store Place information like so:
{place_city : "London",
place_owner: "Tim",
place_name: "Big Ben"}
{place_city : "Paris",
place_owner: "Tim",
place_name: "Eifel Tower"}
{place_city : "Paris",
place_owner: "Ben",
place_name: "The Louvre"}
And here are the main operations I need
Retrieve all my places
Retrieve all my friends places
Retrieve all my friends cities
If I use mongoDB a collection document max size is 16meg right? If that is correct then I can't store all the information in a PLACES similar to my example above right?
I would probably need to create a "OWNER" collection? like so:
{
owner: "Tim",
cities: [ {
name: "London",
places:[ {name:"Big Ben"}]
},
{
name: "Paris",
places:[ {name:"Eifel Tower"}, {name: "The Louvre"}]
}
]
}
but the problem now is that Retrieving my friends places becomes cumbersome and my friends cities even more so....
any advice from a cunning DB architect woudl be much appreciated.

The size limit of 16MB is per document, not per collection.
{place_city : "London", place_owner: "Tim", place_name: "Big Ben"}
is a very little document, so don't worry. The design of your collections depends heavily on how you query your data.

The data size limitation is per document and not per collection. Collections can easily become several GB (or even TB) large.
I would suggest you keep your data as simple as you have, like:
{place_city : "London",
place_owner: "Tim",
place_name: "Big Ben"}
{place_city : "Paris",
place_owner: "Tim",
place_name: "Eifel Tower"}
{place_city : "Paris",
place_owner: "Ben",
place_name: "The Louvre"}
I am thinking that friends are stored like this:
{
username: "Ben",
friends: [ "Tim", "Bob" ]
}
Then your three queries can be done as:
All your places: db.places.find( { place_owner: "Ben" } );
All your friends' places with two queries (pseudo code):
friends = db.friends.find( { username: "Ben" } );
// friends = [ "Tim", "Bob" ], you do need to do some code to make this change
db.places.find( { place_owner: { $in: [ "Tim", "Bob" ] } } );
All your friends' cities with two queries (pseudo code):
friends = db.friends.find( { username: "Ben" } );
db.so.distinct( 'name', { place_owner: { $in: [ "Tim", "Bob" ] } } );
Even with millions of documents, this should work fine, providing you have an index on the fields that you query for: { place_owner: 1 } and { username: 1 }.

I love MongoDB, but this data is not a good candidate for MongoDB. MongoDB does not support relationships, and that is basically all you are tracking here. Use a relational database to store the relationships.
Think of it like this: under the skin of the DBMS, MongoDB or SQL, an index is an index, a table is a table (basically). You get more performance from MongoDB not because it does the same things faster, but because you are able to use it to get your DB server to do less. (E.g. pull an entire document containing nested arrays and subdocs instead of joining a bunch of tables together). There are some fundamental differences in the way MongoDB handles updates, but for querying simple data sets most systems are going to be relatively similar. One big difference between the two rooted in the way MongoDB works is that it cannot use data in a collection as parameters for a another query, which is basically the whole point of a relational database. Since 2 of your use cases require "joins" (to "all my friends"), you need two queries.
So what you're doing with two queries is the same thing as a join, except relational databases are optimized to do this extremely efficiently; I promise you it will be slower to do this join manually, plus you're sending all data (friends' IDs) over the wire and making an extra DB connection. Now, if you could store all your friends' cities and places in a single document, MongoDB will probably be (slightly) faster than joining, but now you've got a new problem, because you now have to manage all this, anytime anyone adds a city or place all their friends have to be modified--this is unrealistic.
But there is even more to the story than that, because SQL DBMS are extremely mature applications with lots of features to improve query performance. They let you do things like create "materialize views" that store all your friends cities and places in memory and update themselves automatically any time one of their source tables is updated so you don't have to do anything special, you'd just query and you'd get your data without actually executing any joins. (A materialized table is not the right fit here, but just as an example, it is possible if you needed it.)
ALSO, when you are using MongoDB, a guideline I've found helpful is this, anytime you are asking yourself whether your document will be large enough to store all the data you will EVENTUALLY have, you probably have a design problem on your hands. If a document's growth is unbound, it should probably be enumerated within a collection instead. Or put another way, your collections should grow as your application is used, not your document's size (much).
If breaking apart your schema like this means that for primary operations you are doing a lot of manual joins, it is worth considering the question of whether or not you should be using a relational database instead.

Related

MongoDB Embedding alongside referencing

There is a lot of content of what kind of relationships should use in a database schema. However, I have not seen anything about mixing both techniques. 
The idea is to embed only the necessaries attributes and with them a reference. This way the application have the necessary data for rendering and the reference for the updating methods.
The problem I see here is that the logic for handle any CRUD operations becomes more tricky because its mandatory to update multiples collections however I have all the information in one single read.
Basic schema for a page that only wants the students names of a classroom:
CLASSROOM COLLECTION
{"_id": ObjectID(),
"students": [{"studentId" : ObjectID(),
"name" : "John Doe",
},
...
]
}
STUDENTS COLLECION
{"_id": ObjectID(),
"name" : "John Doe",
"address" : "...",
"age" : "...",
"gender": "..."
}
I use the students' collection in a different page and there I do not want any information about the classroom. That is the reason not to embed the students.
I started to learning mongo a few days ago and I don't know if this kind of schema bring some problems.
You can embed some fields and store other fields in a different collection as you are suggesting.
The issues with such an arrangement in my opinion would be:
What is the authority for a field? For example, what if a field like name is both embedded and stored in the separate collection, and the values differ?
Both updating and querying become awkward as you need to do it differently depending on which field is being worked with. If you make a mistake and go in the wrong place, you create/compound the first issue.

MongoDB "join" (1-to-many and many-to-many manual references?) Help needed. NO ENBEDDING METHODS [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 7 years ago.
Improve this question
I am new with MongoDB, I know quite some stuff in MySQL. I work in RoboMongo.
Let's say I have two collections:
Instruments:
db.createCollection('Instruments');
db.Instruments.insert(
[{
instrumentID : 1,
instrumentName : "Squier",
instrumentManufacturerID : 3,
instrumentMadeIn: 'someCountry'
},{
...
])
and Manufacturers:
db.createCollection('Manufacturers')
db.Manufacturers.insert([
{
manufactID : 1,
manufactName : 'Fender',
manufactCountry : 'United States',
manufactFounded : 1946
},{
...
])
I want to "SELECT" (find?) the instruments whom manufacturers' country is the 'United States'
...and I want to "UPDATE" all the 'Fender' instruments' instrumentMadeIn country to 'Japan'
I don't want to nest the ManufacturerInformation inside each Instrument, becouse repeating the same information oven and over again is a waste of time and space.
I saw some code, where the ID field was called '_id' and had a large object value. It had to do something with the references, not sure tho.
I just want some simple codes to achieve my goals, no need for large explanations, maybe some short, simple ones, but that's all I ask. I won't be using MongoDB ever, just for this assignment. Thank you!
MongoDB is a document database, not a relational database. It does not do JOINs. A schema which depends on "JOINs" between collections to do its basic operations is not a good fit for a document database and should be denormalized to a single collection with embedded data. When you feel icky about this, you should consider to use a relational database instead.
When you must perform such an operation, you need to do this on the client-side by running multiple queries. You didn't specify the programming language your application is written in, so I will explain how to do it in the mongo shell.
// get the id of Fender
var manufacturerId = db.Manufacturers.findOne({"manufactName" : "Fender",}).manufactID;
// update all instruments with that manufacturer ID which are made in japan
db.Instruments.update(
{ "instrumentManufacturerID" : manufacturerId, "instrumentMadeIn": "Japan" },
{ $set: { "foo":"bar"}},
{ multi:true }
);
Yes, it is ugly and slow that you need two queries for such a simple operation, but that's what you get when you try to use a screwdriver as a hammer.
Regarding your question about the _id field: Every document gets this field automatically, and when it is set automatically it is set to a (practically) globally unique ObjectId. You can use this as an unique ID for documents, but you don't have to. When your data already has an unique key, you can also set _id to that value yourself.

Why does MongoDB not support queries of properties of embedded documents that are stored in hashed arrays?

Why does MongoDB not support queries of properties of embedded documents that are stored using hashes?
For example say you have a collection called "invoices" which was created like this:
db.invoices.insert(
[
{
productsBySku: {
12432: {
price: 49.99,
qty_in_stock: 4
},
54352: {
price: 29.99,
qty_in_stock: 5
}
}
},
{
productsBySku: {
42432: {
price: 69.99,
qty_in_stock: 0
},
53352: {
price: 19.99,
qty_in_stock: 5
}
}
}
]
);
With such a structure, MongoDB queries with $elemMatch, dot syntax, or the positional operator ($) fail to access any of the properties of each productsBySku member.
For example you can't do any of these:
db.invoices.find({"productsBySku.qty_in_stock":0});
db.invoices.find({"productsBySku.$.qty_in_stock":0});
db.invoices.find({"productsBySku.qty_in_stock":{$elemMatch:{$eq:0}}});
db.invoices.find({"productsBySku.$.qty_in_stock":{$elemMatch:{$eq:0}}});
To find out-of-stock products therefore you have to resort to using a $where query like:
db.invoices.find({
$where: function () {
for (var i in this.productsBySku)
if (!this.productsBySku[i].qty_in_stock)
return this;
}
});
On a technical level... why did they design MongoDB with this very severe limitation on queries? Surely there must be some kind of technical reason for this seeming major flaw. Is this inability to deal with an a list of objects as an array, ignoring the keys, just a limitation of JavaScript as a language? Or was this the result of some architectural decision within MongoDB?
Just curious.
As a rule of thumb: Usually, these problems aren't technical ones, but problems with data modeling. I have yet to find a use case where it makes sense to have keys hold semantic value.
If you had something like
'products':[
{sku:12432,price:49.99,qty_in_stock:4},
{sku:54352,price:29.99,qty_in_stock:5}
]
It would make a lot more sense.
But: you are modelling invoices. An invoice should – for many reasons – reflect a status at a certain point in time. The ever changing stock rarely belongs to an invoice. So here is how I would model the data for items and invoices
{
'_id':'12432',
'name':'SuperFoo',
'description':'Without SuperFoo, you can't bar or baz!',
'current_price':49.99
}
Same with the other items.
Now, the invoice would look quite simple:
{ _id:"Invoice2",
customerId:"987654"
date:ISODate("2014-07-07T12:42:00Z"),
delivery_address:"Foo Blvd 42, Appt 42, 424242 Bar, BAZ"
items:
[{id:'12432', qty: 2, price: 49.99},
{id:'54352', qty: 1, price: 29.99}
]
}
Now the invoice would hold things that may only be valid at a given point in time (prices and delivery address may change) and both your stock and the invoices are queried easily:
// How many items of 12432 are in stock?
db.products.find({_id:'12432'},{qty_in_stock:1})
// How many items of 12432 were sold during July and what was the average price?
db.invoices.aggregate([
{$unwind:"$items"},
{
$match:{
"items.id":"12432",
"date":{
$gt:ISODate("2014-07-01T00:00:00Z"),
$lt:ISODate("2014-08-01T00:00:00Z")
}
}
},
{$group : { _id:"$items.id", count: { $sum:"$items.qty" }, avg:{$avg:"$items.price"} } }
])
// How many items of each product sold did I sell yesterday?
db.invoices.aggregate([
{$match:{ date:{$gte:ISODate("2014-11-16T00:00:00Z"),$lt:ISODate("2014-11-17T00:00:00Z")}}},
{$unwind:"$items"},
{$group: { _id:"$items.id",count:{$sum:"$qty"}}}
])
Combined with the query on how many items of each product you have in stock, you can find out wether you have to order something (you have to do that calculation in your code, there is no easy way to do this in MongoDB).
You see, with a "small" change, you get a lot of questions answered.
And that's basically how it works. With relational data, you model your data so that the entities are reflected properly and then you ask
How do I get my answers out of this data?
In NoSQL in general and especially with MongoDB you first ask
Which questions do I need to get answered?
and model your data accordingly. A subtle, but important difference.
If I am honest I am not sure, you would have to ask MongoDB Inc. (10gen) themselves. I will attempt to explain some of my reasoning.
I have searched on Google a little and nothing seems to appear: https://www.google.co.uk/search?q=mognodb+jira+support+querying+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=as9pVOW3OMyq8wfhtYCgCw#rls=org.mozilla:en-GB:official&channel=fflb&q=mongodb+jira+querying+objects
It is quick to see how using objectual propeties for keys could be advantageous, for example: remove queries would not have to search every object and its properties within the array but instead just find the single object property in the parent object and unset it. Essentially it would be the difference of:
[
{id:1, d:3, e:54},
{id:3, t:6, b:56}
]
and:
{
1: [d:3, e: 54],
3: [t:6, b:56]
}
with the latter, obviously, being a lot quicker to delete an id of 3.
Not only that but all array operations that MongoDB introduces, from $elemMatch to $unwind would work wth objects as well, I mean how is unwinding:
[
{id:5, d:4}
]
much different to unwinding:
{
5: {d:4}
}
?
So, if I am honest, I cannot answer your question. There is no defense on Google as to their decision and there is no extensive talk from what I can find.
In fact I went as far as to search up on this a couple of times, including: https://www.google.co.uk/search?q=array+operations+that+do+not+work+on+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=DtNpVLrwDsPo7AaH4oCoDw and I found results that went as far as underscore.js who actually comply their array functions to all objects as well.
The only real reason, I can think of, is standardisation. Instead of catering to all minority groups etc on how subdocuments may work they just cater to a single minority turned majority by their choice.
It is one of the points about MongoDB which does confuse me even now, since there are many times within my own programming where it seems advantageous for speed and power to actually use objects instead of arrays.

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 Billion user responses, from the 2 options below which will be a good design? We would want to query based on userID as well as Reponse ID.
Having 2 Collections one for the user information and another to store the responses along with user ID. Each response is stored as a document so we will have 1 billion documents.
User Collection
{
"userid" : "userid1",
"password" : "xyz",
,
"City" : "New York",
},
{
"userid" : "userid2",
"password" : "abc",
,
"City" : "New York",
}
responses Collection
{
"userid": "userid1",
"responseID": "responseID1",
"response" : "xyz"
},
{
"userid": "userid1",
"responseID": "responseID2",
"response" : "abc"
},
{
"userid": "userid2",
"responseID": "responseID3",
"response" : "mno"
}
Having 1 Collection to store both the information as below. Each response is represented by a new key (responseIDX).
{
"userid" : "userid1",
"responseID1" : "xyz",
"responseID2" : "abc",
,
"responseN"; "mno",
"city" : "New York"
}
If you use your first options, I'd use a relational database (like MySQL) opposed to MongoDB. If you're heartfelt on MongoDB, use it to your advantage.
{
"userId": n,
"city": "foo"
"responses": {
"responseId1": "response message 1",
"responseId2": "response message 2"
}
}
As for which would render a better performance, run a few benchmark tests.
Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design - but do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth - as the document grows - and outgrows the allocated amount of space for it on disk, MongoDB must move that document to a new location to accommodate the new document size. That can be expensive and have severe performance penalties when it happens often or in high concurrency environments.
Also, querying on those embedded documents can become troublesome when you are looking to selectively return only a subset of responses, especially across users. As in - you can not return only the matching embedded documents. Using the positional operator, it is possible to get the first matching embedded document however.
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection. A document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and compliments how you would query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any type of aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great - having, say, the total response count for a user pre-aggregated is far more convenient then trying to get a count by running a aggregate query on the full dataset.

MongoDB data structure with large number internal documents

I am relatively new to MongoDB, and so far am really impressed. I am struggling with the best way to setup my document stores though. I am trying to do some summary analytics using twitter data and I am not sure whether to put the tweets into the user document, or to keep those as a separate collection. It seems like putting the tweets inside the user model would quickly hit the limit with regards to size. If that is the case then what is a good way to be able to run MapReduce across a group of user's tweets?
I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.
As I am sure you are all bored of hearing, I am used to RDB land where I would lay out my schema like
| USER |
--------
|ID
|Name
|Etc.
|TWEET__|
---------
|ID
|UserID
|Etc
It seems like the logical schema in Mongo would be
User
|-Tweet (0..3000)
|-Entities
|-Hashtags (0..10+)
|-urls (0..5)
|-user_mentions (0..12)
|-GeoData (0..20)
|-somegroupID
but wouldn't that quickly bloat the User document beyond capacity. But I would like to run analysis on tweets belonging to users with similar somegroupID. It conceptually makes sense to to the model layout as above, but at what point is that too unweildy? And what are viable alternatives?
You're right that you'll probably run into the 16MB MongoDB document limit here. You are not saying what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
Instead of putting your tweets in a user, you can of course quite easily do the opposite, add a user-id and group-id into the tweet documents itself. Then, if you need additional fields from the user, you can always pull that in a second query upon display.
I mean a design for a tweet document as:
{
'hashtags': [ '#foo', '#bar' ],
'urls': [ "http://url1.example.com", 'http://url2.example.com' ],
'user_mentions' : [ 'queen_uk' ],
'geodata': { ... },
'userid': 'derickr',
'somegroupid' : 40
}
And then for a user collection, the documents could look like:
{
'userid' : 'derickr',
'realname' : Derick Rethans',
...
}
All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ
Chris Winslett # MongoHQ
You will find this video interesting:
http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale
Essentially, in one document, store one days of tweets for one
person. The reasoning:
Querying typically consists of days and users
Therefore, you can have the following index:
{user_id: 1, date: 1} # Date needs to be last because you will range
and sort on the date
Have fun!
Chris MongoHQ
I think it makes the most sense to implement the following:
user
{ user_id: 123123,
screen_name: 'cledwyn',
misc_bits: {...},
groups: [123123_group_tall_people, 123123_group_techies, ],
groups_in: [123123_group_tall_people]
}
tweet
{ tweet_id: 98798798798987987987987,
user_id: 123123,
tweet_date: 20120220,
text: 'MongoDB is pretty sweet',
misc_bits: {...},
groups_in: [123123_group_tall_people]
}