MongoDB multikey : max (recommended, effective) number of keys? - mongodb

I have "feeds" collection, each feed have comments. So when someone comments on feed, he is added to "subsribers" which is Mongo's multikey field.
feeds: {
_id: ...,
text: "Text",
comments: [{by: "A", text: "asd"},{by: "B", text: "sdf"}],
subscribers: ["A","B"]
}
Then when I need to get all feeds with new comments for user A, I ask for feeds with {subscribers: "A"}.
Usually there're 2-5 comments, but sometimes (on hot feeds) there might be >100 comments and >100 subscribers.
I know that it's not recommended to have multikey fields with too many keys. So how much is too much?
I ask because I need to decide - if I will use multikeys or is it better to send comments directly to each user. In this case I have to copy feed for each subscriber - and collection will grow VERY fast - which I think isn't good also: 1000 user, each followed by 10 users, each making 10 actions a day = 1 000 000 records every 10 days!

Although you may experience problems with really large documents, particularly if MongoDB must scan the entire document to fulfill a query, as you would expect; arrays with large numbers of values are not in and of themselves problematic in MongoDB, even if they're multi-key indexes.
There is one caveat: the index will not store keys (in the case of a multi-key document, this is the item in the array,) longer than 1024 bytes. As long as the items in the array are shorter than this limit, you should be fine.
Having said this, you do want to avoid data models where an array or other part of the document will grow unbounded and forever. While MongoDB adds a little bit of padding on disk for every document, if the document grows substantially after creation, the database must move it to another location on disk. However you decide to model your data, ensure that your documents do not tend to grow much after creation.
Reference:
Index Size Limit
Multikey Indexes

Related

Efficient way for mongodb.find() to search through 1 million document?

I have a blog post server that will contains million of articles, and i need to be able to get all Articles written by User A.
What would be the best schema design.
1) Separate both User and Articles documents, and to get user A Articles search in all the million record for the User's id
articles.find({Writer_id: User_A.id})
2) Put a article id reference inside User schema. ex:
userSchema = {
name: "name",
age: "age",
articles: [ {type:mongoose.Article_id}, {type:mongoose.Article_id} ]
}
And search for User A and make a join to get Articles back.
It's far better to keep the Writer_id approach and create an index on that property. If you store an array of references, then you'll need to perform an $in operation on your find() calls. This will result in your query "jumping" from one matching Article_id to another. If instead you have a Writer_id and an index built for that property, all of user's articles will exist in the same sequential "block" in the index, requiring no jumping around whatsoever. The result is a far more read-efficient find() operation.
Additionally, the articles array approach would require frequent updates to the user document, whereas the Writer_id approach only requires inserts. Inserts are incredibly efficient, whereas frequent updates are relatively inefficient. Finally, an array of Article_ids can potentially (if unlikely) result in hitting the 16 MB document size limit. The Writer_id approach runs into no such limitation.
The difference should be relatively negligible for a smaller project, but if you're looking for scalability, then you're better off with the Writer_id approach.

Mongodb Index or not to index

quick question on whether to index or not. There are frequent queries to a collection that looks for a specific 'user_id' within an array of a doc. See below -
_id:"bQddff44SF9SC99xRu",
participants:
[
{
type:"client",
user_id:"mi7x5Yphuiiyevf5",
screen_name:"Bob",
active:false
},
{
type:"agent",
user_id:"rgcy6hXT6hJSr8czX",
screen_name:"Harry",
active:false
}
]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured you data.

Aggregate collection that have an aggregate collectin

I am having some trouble which schema design to pick, i have a document which holds user info each user have a very big set of items that can be up to 20k items.
an item have a date and an id and 19 other fields and also an internal array which can have 20-30 items , and it can be modified,deleted and of course newly inserted and queried by any property that it holds.
so i came up with 2 possible schemas.
1.Putting everything into a single docment
{_id:ObjectId("") type:'user' name:'xxx' items:[{.......,internalitems:[]},{.......,internalitems:[]},...]}
{_id:ObjectId("") type:'user' name:'yyy' items:[{.......,internalitems:[]},{.......,internalitems:[]},...]}
2.Seperating the items from the user and letting eachitem have its own
document
{_id:ObjectId(""), type:'user', username:'xxx'}
{_id:ObjectId(""), type:'user', username:'yyy'}
{_id:ObjectId(""), type:'useritem' username:'xxx' item:{.......,internalitems:[]}]}
{_id:ObjectId(""), type:'useritem' username:'xxx' item:{.......,internalitems:[]}]}
{_id:ObjectId(""), type:'useritem' username:'yyy' item:{.......,internalitems:[]}]}
{_id:ObjectId(""), type:'useritem' username:'yyy' item:{.......,internalitems:[]}]}
as i explained before a single user can have thousands of items and i have tens of users, internalitems can have 20-30 items, and it has 9 fields
considering that a single item can be queried by different users and can be modified only by the owner and another process.
if performance is really important which design would you pick?
if you pick neither of them what schema can you suggest?
on a side note i will be sharding and i have a single collection for everything.
I wouldn't recommend the first approach, there is a limit to the maximum document size:
"The maximum BSON document size is 16 megabytes.
The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API. See mongofiles and the documentation for your driver for more information about GridFS."
Source: http://docs.mongodb.org/manual/reference/limits/
There is also a performance implication if you exceed the current allocated document space when updating (http://docs.mongodb.org/manual/core/write-performance/ "Document Growth").
Your first solution is susceptible to both of these issues.
The second one is (Disclaimer: In the case of 20-30 internal items) is less susceptible of reaching the limit but still might require reallocation when doing updates. I haven't had this issue with a similar scenario, so this might be the way to go. And you might wanna look into Record Padding(http://docs.mongodb.org/manual/core/record-padding/) for some more details.
And, if all else fails, you can always split the internal items out as well.
Hope this helps!

Updating large number of records in a collection

I have collection called TimeSheet having few thousands records now. This will eventually increase to 300 million records in a year. In this collection I embed few fields from another collection called Department which is mostly won't get any updates and only rarely some records will be updated. By rarely I mean only once or twice in a year and also not all records, only less than 1% of the records in the collection.
Mostly once a department is created there won't any update, even if there is an update, it will be done initially (when there are not many related records in TimeSheet)
Now if someone updates a department after a year, in a worst case scenario there are chances collection TimeSheet will have about 300 million records totally and about 5 million matching records for the department which gets updated. The update query condition will be on a index field.
Since this update is time consuming and creates locks, I'm wondering is there any better way to do it? One option that I'm thinking is run update query in batches by adding extra condition like UpdatedDateTime> somedate && UpdatedDateTime < somedate.
Other details:
A single document size could be about 3 or 4 KB
We have a replica set containing three replicas.
Is there any other better way to do this? What do you think about this kind of design? What do you think if there numbers I given are less like below?
1) 100 million total records and 100,000 matching records for the update query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: The collection names department and timesheet, and their purpose are fictional, not the real collections but the statistics that I have given are true.
Let me give you a couple of hints based on my global knowledge and experience:
Use shorter field names
MongoDB stores the same key for each document. This repetition causes a increased disk space. This can have some performance issue on a very huge database like yours.
Pros:
Less size of the documents, so less disk space
More documennt to fit in RAM (more caching)
Size of the do indexes will be less in some scenario
Cons:
Less readable names
Optimize on index size
The lesser the index size is, the more it gets fit in RAM and less the index miss happens. Consider a SHA1 hash for git commits for example. A git commit is many times represented by first 5-6 characters. Then simply store the 5-6 characters instead of the all hash.
Understand padding factor
For updates happening in the document causing costly document move. This document move causing deleting the old document and updating it to a new empty location and updating the indexes which is costly.
We need to make sure the document don't move if some update happens. For each collection there is a padding factor involved which tells, during document insert, how much extra space to be allocated apart from the actual document size.
You can see the collection padding factor using:
db.collection.stats().paddingFactor
Add a padding manually
In your case you are pretty sure to start with a small document that will grow. Updating your document after while will cause multiple document moves. So better add a padding for the document. Unfortunately, there is no easy way to add a padding. We can do it by adding some random bytes to some key while doing insert and then delete that key in the next update query.
Finally, if you are sure that some keys will come to the documents in the future, then preallocate those keys with some default values so that further updates don't cause growth of document size causing document moves.
You can get details about the query causing document move:
db.system.profile.find({ moved: { $exists : true } })
Large number of collections VS large number of documents in few collection
Schema is something which depends on the application requirements. If there is a huge collection in which we query only latest N days of data, then we can optionally choose to have separate collection and old data can be safely archived. This will make sure that caching in RAM is done properly.
Every collection created incur a cost which is more than cost of creating collection. Each of the collection has a minimum size which is a few KBs + one index (8 KB). Every collection has a namespace associated, by default we have some 24K namespaces. For example, having a collection per User is a bad choice since it is not scalable. After some point Mongo won't allow us to create new collections of indexes.
Generally having many collections has no significant performance penalty. For example, we can choose to have one collection per month, if we know that we are always querying based on months.
Denormalization of data
Its always recommended to keep all the related data for a query or sequence of queries in the same disk location. You something need to duplicate the information across different documents. For example, in a blog post, you'll want to store post's comments within the post document.
Pros:
index size will be very less as number of index entries will be less
query will be very fast which includes fetching all necessary details
document size will be comparable to page size which means when we bring this data in RAM, most of the time we are not bringing other data along the page
document move will make sure that we are freeing a page, not a small tiny chunk in the page which may not be used in further inserts
Capped Collections
Capped collection behave like circular buffers. They are special type of fixed size collections. These collection can receive very high speed writes and sequential reads. Being fixed size, once the allocated space is filled, the new documents are written by deleting the older ones. However document updates are only allowed if the updated document fits the original document size (play with padding for more flexibility).

MongoDB - How much data is too much data?

I'm building an application that uses MongoDB as a database. I have a lot of products, and I want to log what products a user looks at to the user's database entry. For instance, a user profile looks like this:
{
"email" : "foo#bar.com",
"name" : "John Snow",
"_id" : ObjectId("51ecbcc6896652a008000001"),
"productsViewed" : [
product1,
product2,
product3,
product4
]
}
I have two options here. I can log just the _id of each product, or I could log entire objects representing the product (name, price, ~100 word description, categories, that sort of thing). The difference in object size is 1 line of text per product vs about 30 lines per product.
I realise that this is probably a trivial amount of data to be concerned about, but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact? Logging more data is far more useful for my purposes but I'd like to avoid my database calls lagging if the user profile becomes quite large.
Question is: At what point (in character length, I guess?) is too much data to store with one MongoDB record?
16 Meg is the limitation for the entire document. This means that all strings etc have to fit within 16 meg. However, before that there are more limitation on your schema which you, yourself hint at:
but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact?
And the answer is yes. First off with the added data of the root user you will probably be over the 16 meg limit, however, further on from this the in-memory $pull, $push and other sub document operators might have a hard time keeping peformance up. You can sort of mitigate that problem by batching your subdocuments into groups of 100.
However, yet again, you have an even bigger problem: Fragmentation. Since MongoDB stores the record in a single contigeous space on the disk, hence it has settings like padding, you could see considerable fragmentation from odd sized record objects not being reused here.
I would personally say that you should factor off this relation to a separate collection.