I have a MongoDB collection with 580+ fields: should I split it up?

The vast majority of the fields will not need to be indexed and will never be queried (think display ONLY). There are maybe 20-30 fields that need to be queried.
Likewise, the vast majority of the fields will be simple field-value pairs (not embedded docs).
Finally, there will be some fields that store embedded docs, but not many, and the embedded docs will not be large.
I was thinking of maybe breaking up the collection into two collections:
A collection with fields that need to be indexed/queried and fields that need to be displayed in larger result sets (really, anything more than a single document).
A collection with: _id, related_id and data (where data is an embedded doc with all the extra data in it). This collection would only ever be accessed when viewing the "detailed" display of the document.
Also, there will be 100s of millions of documents (eventually).
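Roughly, the split I have in mind would look like this (field names are made up for illustration):

// "hot" collection: only the indexed/queried fields and list-view fields
{ _id: ObjectId("..."), sku: "A-123", status: "active", title: "Widget" }

// "cold" collection: everything else, fetched only for the detail view
{
  _id: ObjectId("..."),
  related_id: ObjectId("..."),  // _id of the matching "hot" document
  data: { dimensions: { w: 10, h: 4 }, notes: "display-only stuff" }
}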

Related

Is the Firestore collection write limit (imposed by sequentially-updated indexed fields) affected by collection-group queries?

From how I understand it, if a collection has a monotonically increasing indexed field, a write limit is imposed on that collection. If that collection is split into two separate collections, each collection would have its own write limit. However, if we split that collection into two separate collections but give them the same name (putting them under different documents), would they still have their own independent write limits if the monotonically increasing field were part of a collection-group query that queried them both together?
No, that's not the way it works. A collection group query requires its own index, and the limit you're talking about is the write rate of the index itself, not the collection. Each collection automatically indexes fields from documents for just that specific collection, but that does not apply to collection group queries that span collections.
Note that the documentation states the limit as:
Maximum write rate to a collection in which documents contain sequential values in an indexed field
On a related note, disabling the indexing for a specific field on a collection allows you to bypass the normal monotonic write limits for that one field on that collection because it's no longer being indexed.
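For reference, a collection group query in the Node.js Admin SDK looks something like this (collection and field names are hypothetical); it requires its own collection-group index on the queried field, and it is that index's write rate that the limit applies to:

const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

// Spans every subcollection named 'events', under any parent document.
// (run inside an async function)
const snap = await db
  .collectionGroup('events')
  .where('timestamp', '>', cutoff)
  .get();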

Storing visiting IP addresses in document array or separate collection

I have a collection Items. Each document in this collection has a view counter that increments every time a user who hasn't viewed the item before visits its page.
Currently, I am storing an array of IP addresses in each item document so that I can keep track of who has viewed it, and only increment the view counter when a new user visits.
I am, however, concerned that this may affect performance, since I have no way of retrieving the item document without also getting the IP array.
I expect this array to range between 1 and 5,000 entries.
Would I be better off having a separate collection with an item id and the array, or am I overblowing the potential performance risks?
Quoting the official documentation.
In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models make it possible to update related data in a single atomic write operation.
However, embedding related data in documents may lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation.
Since your array will grow after creation, embedding it is not a good option.
You may want to go for a one-to-one relationship instead, keeping the IP array in its own collection.
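A minimal sketch of that approach in the shell (collection and field names are made up; assumes mongosh, where updateOne reports modifiedCount and upsertedCount):

// One views document per item, kept out of the frequently-read items document.
var res = db.itemViews.updateOne(
  { itemId: itemId },
  { $addToSet: { ips: ip } },  // adds the IP only if it isn't already present
  { upsert: true }
);

// Bump the counter only when the IP was actually new.
if (res.modifiedCount === 1 || res.upsertedCount === 1) {
  db.items.updateOne({ _id: itemId }, { $inc: { viewCount: 1 } });
}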

Best way to organize subdocuments in Mongo?

I have a project that I'm working on that will require me to store a large number of objects in an array linked to a parent object, akin to storing social media comments under their original post. What is the best way for me to organize the data for the array of child documents/comments?
Is it considered best practice to have the child objects under a different collection and reference to their parent or would it be more ideal just to put them all within the parent object directly?
I discuss this a little here, read this first:
https://stackoverflow.com/a/27285313/68567
For your case, Option 3 (keeping some of the data in your primary model) is probably the best. The key is to avoid unbounded array growth.
This has to do with how MongoDB allocates documents. http://docs.mongodb.org/manual/core/storage/
"Every document in MongoDB is stored in a record which contains the document itself and extra space, or padding, which allows the document to grow as the result of updates."
When MongoDB allocates new documents, it allocates space based on the size of the inserted document and the sizes of documents already in your collection. (Read more in the link above.) If you have some documents that are orders of magnitude larger than others, this will likely lead to fragmentation.
The way to avoid having too many documents in your 'comments' sub-document array is with the $push and $slice update operators.
http://docs.mongodb.org/manual/reference/operator/update/slice/
So store the 'most recent 5' and display those when the item first loads. (Or oldest, or whatever other sorting criteria you want to use.) Then provide a way for the user to load more which will do a separate round-trip to the collection that has all of them.
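A sketch of that pattern (collection and field names are illustrative):

// Keep only the five most recent comments embedded in the parent document.
db.posts.updateOne(
  { _id: postId },
  { $push: { comments: { $each: [newComment], $slice: -5 } } }
);

// Load older comments on demand from the separate, complete collection.
db.comments.find({ postId: postId }).sort({ created: -1 }).skip(5).limit(20);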

Dynamic Data inside MongoDB

I'm building dynamic data storage, which was modeled in a relational database as:
form -> field
form -> form_data
field -> field_data
form_data -> field_data
where field_data contains field_id, form_data_id, and value.
For scalability and performance, I'm planning to move form_data and field_data to MongoDB, and my problem now is how to design the MongoDB collections: do I use one collection for all form_data and move field_data into a map inside it, where the key is field_id and the value is the value of that field_data, or do I make a collection for each form record and store the data directly in form_data without a map, since all the data in that case will be consistent?
In document-oriented databases like MongoDB you should always favor aggregation over references, because such databases don't support JOIN operations to link multiple documents.
That means that an "A has many B" relationship between two entities should not be modeled as two tables A and B, but rather as one collection of A where each A has an embedded array of B objects.
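For instance, a hypothetical "post has many comments" pair becomes a single document:

{
  _id: ObjectId("..."),
  title: "Post A",
  comments: [
    { user: "bob", text: "First!" },
    { user: "eve", text: "Nice post." }
  ]
}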
In the context of MongoDB, there is, however, an additional limitation: MongoDB doesn't like objects which grow. When an object accumulates more and more data over its lifetime, it will grow, which means that the hard drive space MongoDB allocated for it will run out again and again, requiring space reallocation. This costs performance and fragments the database. Also, MongoDB has an artificial size limit for documents, mainly to discourage developers from designing growing objects.
Therefore:
When the data exists at creation, embed it directly.
When more and more data is added after creation, put it into a different collection.
When a form has X fields, the number of fields will likely not change much over its lifetime. So you should embed the fields and their descriptions directly into the form object.
But the number of answers entered into these forms will grow over time, which means that these should be treated as separate objects in a separate collection.
So I would recommend having two collections, forms and form_data.
Each document in forms embeds a sub-object of fields with the static field properties.
Each document in form_data has a field with the _id of the corresponding form, and embeds a sub-object field_data which uses the same keys as the fields sub-object of forms and stores the entries the user made in that form.
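Concretely, the two documents might look like this (shapes are illustrative, not prescriptive):

// forms: field definitions are embedded because they are static after creation
{
  _id: ObjectId("..."),
  name: "customer_survey",
  fields: {
    age:  { label: "Age",  type: "number" },
    city: { label: "City", type: "string" }
  }
}

// form_data: one document per submitted entry, keyed like the fields sub-object
{
  _id: ObjectId("..."),
  form_id: ObjectId("..."),
  field_data: { age: 42, city: "Berlin" }
}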
When your use-case requires frequent access to aggregated data (like when you want to publish up-to-date statistics on a public website), you could also store this information in the fields of forms to avoid an expensive aggregation query over many form_data documents. In general, MongoDB recommends orienting your database schema around your performance requirements rather than the semantics of your data.
Regarding your remark "all data in this case will be consistent": keep in mind that MongoDB does not enforce referential integrity. When an application deletes or changes a document, it's the responsibility of the application to fix any outdated references to it in other documents.

MongoDB: multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging different actions users make on our website. Each action can be of a different type: a comment, a search query, a page view, a vote, etc. Each of these types has its own schema plus some common fields. For instance:
comment : {"_id":(mongoId), "type":"comment", "date":4/7/2012,
"user":"Franck", "text":"This is a sample comment"}
search : {"_id":(mongoId), "type":"search", "date":4/6/2012,
"user":"Franck", "query":"mongodb"} etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDB is schemaless, I'm inclined to set up a single collection ("Actions") where I would store these objects, instead of multiple collections (an Actions collection plus a Comments collection with a link key to its parent Action, etc.).
My question is: what about performance / response time if I try to search by specific columns?
As I understand indexing best practices, if I want "every user searching for mongodb", I would index the columns "type" + "query". But that index will not cover the whole data set, only the documents of type "search".
Will the MongoDB engine scan the whole collection, or merely focus on the data having this specific schema?
If you create a sparse index, Mongo will ignore any documents that don't have the indexed key. There is, though, a specific limitation of sparse indexes: they can only index one field.
However, if you are only going to query using the common fields, there's absolutely no reason not to use a single collection.
That is, if an index on user+type (or date+user+type) will satisfy all your querying needs, there's no reason to create multiple collections.
Tip: use date objects for dates, and use ObjectIds, not names, where appropriate.
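A sketch of the single-collection setup (assuming the collection is called actions):

// Compound index on the fields every action type shares.
db.actions.createIndex({ type: 1, user: 1, date: -1 });

// Sparse index on a type-specific field: documents without 'query' are skipped.
db.actions.createIndex({ query: 1 }, { sparse: true });

// "every user searching for mongodb"
db.actions.find({ type: "search", query: "mongodb" });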
Here is some useful information from MongoDB's Best Practices:
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data for a record is stored in a single document the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document. For large media documents, such as video, consider using GridFS, a convention implemented by all the drivers that stores the binary data across many smaller documents.
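As a concrete illustration of the last point (names are made up): rather than one document holding an ever-growing list,

// { _id: "user42_log", entries: [ /* thousands of entries */ ] }  <- grows unbounded

make each record its own document:

db.logEntries.insertOne({ user: "user42", ts: new Date(), action: "login" });
db.logEntries.find({ user: "user42" }).sort({ ts: -1 }).limit(10);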