MongoDB, sharding and horizontal scaling

In MongoDB, I want to use mongos and shard over 2 machines. Is it common to have a single collection and add documents to it such as:
{type:'user',name:'xxx',id:1,.........}
{type:'userentry',userid:1.........}
{type:'usersettings',userid:1.......}
{type:'userevent',userid:1.......}
{type:'SomethingNotRelated',....}
Is my understanding correct as to how you should use MongoDB?
And is this the way to do horizontal scaling and avoid vertical scaling, by avoiding adding more collections?
What are the disadvantages of my approach?
If a user had a very big array, wouldn't it be better to put it in a separate document rather than in the user document itself?

"shredding" no such word for MongoDB. It is "sharding", since you cannot get the name right I would strongly suggest you read the documentation right here: http://docs.mongodb.org/manual/core/sharding/
Is my understanding correct as to how you should use MongoDB?
From what I understand, yes.
And is this the way to do horizontal scaling and avoid vertical scaling, by avoiding adding more collections?
More or less. Sometimes an aggregate collection derived from another, containing unique or summed entries, is also helpful for scaling.
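As a rough illustration of that summary-collection idea, a minimal shell sketch; the collection and field names (events, userid, userEventCounts) are invented, and $merge requires MongoDB 4.2 or later:

// Periodically roll raw events up into a summed collection
db.events.aggregate([
  { $group: { _id: "$userid", eventCount: { $sum: 1 } } },
  { $merge: { into: "userEventCounts" } }
]);

Queries that only need per-user counts can then hit the small summary collection instead of scanning the raw one.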
What are the disadvantages of my approach?
You haven't really described a specific approach to anything, so I cannot answer that.
If a user had a very big array, wouldn't it be better to put it in a separate document rather than in the user document itself?
Depends on the operations on that array. If the array is consistently and continuously updated so that it regularly shifts dramatically in size, then yes, you would be better off splitting it off.
Such subdocuments are normally separate entities in themselves when thought of logically.
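A minimal sketch of what splitting such an array off might look like; the collection and field names (users, userEvents, userId) are invented for illustration:

// Keep the user document small and stable
db.users.insertOne({ _id: 1, name: "xxx" });

// Each array entry becomes its own document, referencing the user
db.userEvents.insertOne({ userId: 1, type: "login", at: new Date() });

// Index the reference so per-user lookups stay cheap
db.userEvents.createIndex({ userId: 1 });
db.userEvents.find({ userId: 1 });

This way a constantly growing event list never forces the user document itself to be rewritten or moved.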

Sharding is the ability of Mongo to split a single collection (any collection) into shards (pieces of the collection) spread over different small databases (to put it simply). For you it is completely transparent: you use a sharded collection "colX", and it can be split across several machines if you want. The only recommendation is to read the documentation and pick a proper shard key, one that splits your collection in the most balanced way possible.
You can use your single collection, and if type is statistically representative enough to balance the collection well (meaning, if you have 10 million records and 10 types, you have roughly 1 million of each), you can shard by type.
Your approach is correct, you just need to use the correct shard key.
One more comment to add to my note: the wrong shard key won't accelerate things much. If you query by type and your shard key is type, it is fast to pick the proper shard to return your information. If, on the other hand, you need to query by, say, date, and date is not in your shard key, Mongo has to send your query to every shard and merge the results at the end. Sharding helps a lot in some cases and not much in others; you always add processing capacity, but you won't always see a big difference if you didn't choose your shard key properly.
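A minimal sketch of sharding by type, run against mongos; the database and collection names (mydb, events) are invented for illustration:

sh.enableSharding("mydb");
// shardCollection creates the { type: 1 } index itself if the collection is empty
sh.shardCollection("mydb.events", { type: 1 });

var events = db.getSiblingDB("mydb").events;
// Routed to a single shard, because the shard key is in the query:
events.find({ type: "user", name: "xxx" });
// Scatter-gathered across every shard, because the shard key is absent:
events.find({ userid: 1 });

One caveat: a field with only 10 distinct values gives the balancer at most 10 chunk ranges to work with, so a compound key such as { type: 1, userid: 1 } is one common way to add granularity.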

Do a lot of collections impact MongoDB performance?

Do many MongoDB collections have a big impact on MongoDB performance, memory and capacity? I am designing an API with the MVC pattern, and a collection is created for each model. I question the way I am doing it now.
MongoDB with the WiredTiger engine supports an unlimited number of collections, so you are not going to run into any hard technical limitations.
When you wonder if something should be in one collection or in multiple collections, these are some of the considerations you need to keep in mind:
More collections = more maintenance work. Sharding is configured at the collection level, so having a large number of collections makes shard configuration a lot more work. You also need to set up indexes for each collection separately, but this is quite easy to automate, because createIndex on an index which already exists does nothing (see the sketch after these considerations).
The MongoDB API is designed in a way that every database query operates on one collection at a time. That means when you need to search for a document in n different collections, you need to perform n queries. When you need to aggregate data stored in multiple collections, you run into even more problems. So any data which is queried together should be stored together in the same collection.
Creating one collection for each class in your model is usually a good rule of thumb, but it is not a golden-hammer solution. There are situations where you want to embed objects in their parent objects' documents instead of putting them into a separate collection. There are also cases where you want to put all objects with the same base class in the same collection to benefit from MongoDB's ability to handle heterogeneous collections. But that goes beyond the scope of this question.
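On the "easy to automate" point above, a small hedged sketch; the collection and field names are invented:

// Safe to re-run at startup: createIndex is a no-op if the index already exists
["users", "orders", "events"].forEach(function (name) {
  db.getCollection(name).createIndex({ createdAt: 1 });
});

Running this twice creates no duplicates and raises no error, which is what makes per-collection index setup easy to script.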
Why don't you use this and test your application?
https://docs.mongodb.com/manual/tutorial/evaluate-operation-performance/
By the way, your question is not completely clear... it is more of a "discussion" than a question. And you're asking others to evaluate your work instead of searching the web for the right approach.

mongodb/mongoose - when to use a subdocument and when to use a new collection

I would like to know if there is a rule of thumb about when to use a new document and when to use a subdocument. In SQL databases I used to break all relations into separate tables by the rules of normalization and connect them with keys, but I can't find a good approach for what to do in MongoDB (I don't know how other NoSQL databases handle this).
Any help will be appreciated.
Kind regards.
Though there are no fixed rules, there are some general guidelines which are intuitive enough to follow while modeling data in NoSQL.
Nearly all cases of 1-1 can be handled with subdocuments. For example: a user has an address. In all likelihood the address is unique to each user (in the context of your system, say a social website), so keeping addresses in another collection would be a waste of space and queries. An address subdocument is the best choice.
Another example: hundreds of employees share the same building/address. In this case keeping it 1-1 is a poor use of space and will cost you a lot of updates whenever any address changes slightly, because it is replicated across multiple employee documents as a subdocument. Therefore one address should have many employees, i.e. a 1-to-many relationship.
You must have noticed that in NoSQL there are multiple ways to represent a 1-to-many relationship (see the sketch after this list):
1. Keep an array of references. Choose this if you're sure the size of the array won't get too big, and preferably the document containing the array is not expected to be updated a lot.
2. Keep an array of subdocuments. A handy option if the subdocuments don't qualify for a separate collection and you don't run the risk of hitting the 16MB document size limit (thanks greyfairer for reminding!).
3. SQL-style foreign key. Use this if 1 and 2 are not good enough, or you prefer this style over them.
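A compact sketch of the three options; all names (buildings, employees, users, addresses) are invented for illustration:

// 1. Array of references
db.buildings.insertOne({ _id: "hq", employeeIds: [101, 102, 103] });

// 2. Array of subdocuments (fine while well under the 16MB document limit)
db.users.insertOne({ _id: 1, addresses: [{ city: "Berlin" }, { city: "Paris" }] });

// 3. SQL-style foreign key: the "many" side points at the "one" side
db.employees.insertOne({ _id: 101, name: "Ada", buildingId: "hq" });
db.employees.createIndex({ buildingId: 1 });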
Modeling documents for retrieval, Document design considerations and Modeling Relationships from Couchbase (another noSql database) are really good reads and equally applicable to mongodb.

Uniqueness of _id within a shard

I'm looking into sharding using MongoDB, and most of it is rather straightforward. I have some experience with sharding in other databases, so I'm not asking about the concept itself. There's one thing I'm confused by, and there doesn't seem to be anything in the documentation about this, so here goes.
Is _id required to be unique within the shard, regardless of shard key?
A small-scale (single-shard) test seems to confirm that this is the case. It does, however, seem like a less-than-stellar approach to sharding, which has me confused. To me it would make more sense to require shard key + _id to be unique (i.e. use a compound key), or you'll get inconsistent behavior depending on where your shard keys end up being routed. My data model uses deterministic keys, and the shard key is an intrinsic part of it. So I guess it comes down to: did I do something wrong in my small-scale test? Do I need to store the shard key twice, once as a shard-key field and once as part of _id? Or is there some special case where I can somehow declare a compound key using the shard key and _id?
Update
For completeness, this is the trivial case I'm testing, inserting the following two documents:
{"_id": 1, "shardkey": 1}
{"_id": 1, "shardkey": 2}
The first one obviously goes through; the second one fails. If I had two shards, and the shard keys were routed to different shards, I assume both would have succeeded.
I can obviously just combine the shard key and the id to create the _id field for MongoDB, since this is really the key I'm using, but it seems like a weird way to approach the problem from a database-architecture standpoint.
_id is always backed by a unique index, but on a sharded collection that uniqueness is only enforced per shard: unless _id is (a prefix of) the shard key, MongoDB cannot guarantee _id is unique across the whole collection, which matches what you observed on a single shard. The shard key itself does not need to be unique. It is used to split the collection into chunks which can be spread onto the shards making up the cluster, so it needs to provide enough granularity to split the documents in the collection into chunks. It's obviously a good idea to link the shard key to how you query the data, and to use a shard key which relates to the fields that you query on. This way the queries you run can be directed straight to the relevant shards. If the shard key isn't selective enough, the query will need to go to multiple shards to find the correct documents. You can create a compound index on shard key + _id and make it unique if you want (on a sharded collection a unique index must be prefixed by the shard key).
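Two hedged sketches of what that looks like in practice, reusing the question's field names and assuming sharding is already enabled on the database:

// A unique index on a sharded collection must be prefixed by the shard key:
sh.shardCollection("test.docs", { shardkey: 1 });
db.docs.createIndex({ shardkey: 1, _id: 1 }, { unique: true });

// Or fold the shard key into _id itself, so the pair is unique by construction:
db.docs.insertOne({ _id: { shardkey: 1, id: 1 } });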
I realise this doesn't fully answer the question; to be honest, I am struggling to understand what you're asking. Perhaps if you could post an example of the documents you're storing and the queries you're running, it would help.

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents into a single primary-key index would take forever, but as far as I'm aware Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk-insert batch size is -- this partly depends on the size of the objects you're inserting and other hard-to-measure factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
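For instance, a hedged sketch of batched inserts in the shell; the field names are invented, and the 1,000-document batch size is exactly the kind of knob you would experiment with:

var batch = [];
for (var i = 0; i < 1000000; i++) {
  batch.push({ w1: "word" + i, w2: "word" + (i + 1), count: 1 });
  // Flush a batch at a time instead of one insert per document
  if (batch.length === 1000) { db.bigrams.insertMany(batch); batch = []; }
}
if (batch.length > 0) db.bigrams.insertMany(batch);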
Mongo can easily handle billions of documents and can have billions of documents in the one collection but remember that the maximum document size is 16mb. There are many folk with billions of documents in MongoDB and there's lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, other users run multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer (see the sketch below). It would be counter-productive to be moving data around to keep things balanced, which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
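A sketch of that pre-split recipe, run against mongos; the namespace (bigramdb.bigrams) and split points are invented:

sh.stopBalancer();   // keep chunks where you put them during the load
sh.shardCollection("bigramdb.bigrams", { w1: 1 });

// Carve out chunk boundaries up front so the bulk load spreads across shards
["g", "n", "t"].forEach(function (p) {
  sh.splitAt("bigramdb.bigrams", { w1: p });
});

// ...run the bulk load...
sh.startBalancer();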
Here are some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of its core strengths. There is no need to build that logic into your application.
For most use cases, I would strongly recommend sharding for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

using array fields in mongoDB

I'm working on a schema that will heavily use array fields in Mongo documents.
Are there any known problems with the approach of holding rather large arrays of other documents? Performance issues?
Being rather new to Mongo and coming from a SQL background, the approach seems "out of place" to me, since it's a bit different from grouping all records in a table by a set of "primary keys"; instead you hold the "primary keys" once and keep the rest of the data in arrays.
The "primary keys" approach is my other option in Mongo as well.
What is best?
I'm not aware of specific performance issues. What you need to keep in mind is that you have only about 16MB at most for one document. So if you have several hundred or thousand subdocuments of some reasonable size, you may run into trouble with the document limit. Depending on how often you need those subdocs and your primary docs, you may consider splitting them off. Otherwise your primary doc carries a lot of extra weight (your subdocs), blocking other "business objects" from being kept in RAM. So this may be a point.
Additionally, although I haven't actually worked with arrays much myself, it may be wise to use only primary keys if those subdocs need to be displayed or queried across all subdocuments, without their parents being needed. I think it highly depends on whether you need those subdocs separately and often (see the sketch below).
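A short sketch contrasting the two options; all names (posts, comments, postId) are invented:

// Embedded array: parent and children always load (and approach the 16MB cap) together
db.posts.insertOne({ _id: 1, title: "t", comments: [{ by: "a", text: "hi" }] });

// Referenced "primary key" style: children live alone and are queryable on their own
db.comments.insertOne({ postId: 1, by: "a", text: "hi" });
db.comments.createIndex({ postId: 1 });
db.comments.find({ by: "a" });   // across all parents, no post needed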
What is best is a question of your use case, nothing in general :)