How much space does it take to store data in MongoDB?

I have a MongoDB database with approximately 50 collections, and that number may increase in the future. Each collection will have between 5 and 11 fields.
My question is: how do I optimize MongoDB so that storage space is not wasted on field names like superLongCollectionFieldName? How are characters/words counted when storing the data?
Let's say I have a field called userID and another field called IP: do they both take up a full-size block regardless of the name length?

The overall storage required for your data will depend on many use case specific factors including schema, indexes, how compressible the data is, and your data update/deletion patterns. The length of field names does not significantly affect index size (since indexes only store key values and document locations), but long names may have some impact on storage usage. The best way to guesstimate storage usage would be to generate some representative test data using a data generator or by extrapolating from existing data.
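For example, after loading a batch of representative test documents from the mongo shell, db.collection.stats() reports the compressed storage size, average document size, and index sizes (testcoll and the field values below are just placeholders):
var docs = [];
for (var i = 0; i < 10000; i++) {
    docs.push({ userID: i, IP: "10.0.0." + (i % 255), superLongCollectionFieldName: "value " + i });
}
db.testcoll.insertMany(docs);
// storageSize, avgObjSize, and totalIndexSize give a feel for real disk usage
db.testcoll.stats()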
MongoDB (as at 4.0) does not maintain a central catalog of field names: field names are stored in each document so documents are self-describing in a distributed deployment. In all modern versions of MongoDB (3.2+) data is compressed by default so the size of field names is not a typical concern for most use cases.
You could implement a mapping to shorter names via application code, but that will add translation overhead and reduce clarity of the documents stored in the server. For more discussion, see: SERVER-863: Tokenize the field names.
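A minimal sketch of what such an application-level mapping could look like (the fieldMap table and shorten() helper here are hypothetical, not a MongoDB feature):
// hypothetical mapping from readable field names to the short names actually stored
var fieldMap = { userID: "u", IP: "ip", superLongCollectionFieldName: "s" };
function shorten(doc) {
    var out = {};
    for (var key in doc) {
        out[fieldMap[key] || key] = doc[key];
    }
    return out;
}
db.mycoll.insert(shorten({ userID: 42, IP: "10.0.0.1", superLongCollectionFieldName: "example" }));
The same mapping has to be applied in reverse when reading, which is exactly the translation overhead mentioned above.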

Related

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go extreme, let's say each object has an authorName field, bookTitles field, and bookFullText field (where the content of all their novels was collected.)
If there was no index and you looked for a list of authorNames, would it have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields and the names but not content of the other fields?
Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, it will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to next document in collection).
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in the case of MongoDB, table rows in an RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect it is set so that the server can read documents into memory completely without worrying that they might be very large. Postgres, for example, distinguishes VARCHAR from TEXT types in terms of where the data is stored: VARCHAR data is stored inline in the table row and TEXT data is stored separately, presumably to avoid this exact issue of having to read it from disk whenever any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.
BSON documents are, in the common case (WiredTiger with snappy compression), kept in 32 KB blocks within 64 MB (default size) chunks on storage. If your document's compressed size is 48 KB, two 32 KB blocks must be loaded into memory, uncompressed, and searched for your non-indexed field, which is an expensive operation. Moreover, if you search multiple documents, they are usually not written in sequential blocks, which increases the IOPS demands on your backend storage. This is why it is best to do some initial analysis and create indexes on the fields you will search most often; indexes (B-trees) are very effective since they are kept in memory most of the time, compressed (prefix compression), and are very fast for field searches.
There are text indexes in MongoDB that are enough for some simple text searches, or you can use regex expressions.
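For example (a minimal sketch, assuming a books collection shaped like the one described in the question):
db.books.createIndex({ authorName: 1 })          // regular B-tree index for exact matches and prefixes
db.books.createIndex({ bookFullText: "text" })   // text index for simple full-text search
db.books.find({ $text: { $search: "whale" } })   // served by the text index
db.books.find({ authorName: /^Mel/ })            // an anchored, case-sensitive regex can still use the B-tree index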
If you will be doing full-text search most of the time, you are better off putting a search engine like Elasticsearch, which supports inverted indexes, in front of the database, since the inverted index has your full-text results already calculated and can return them many times faster than a similar operation using standard B-tree indexes.
If you use Atlas (the MongoDB cloud service), there is already a Lucene engine (inverted index) integrated that can do the full-text search for you.
I hope my answer throws some light on the subject ... :)

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query for a collection, filtering by one field. I thought I could speed up the query if, based on this field, I made many separate collections, where each collection's name would contain the field I previously filtered on. Practically, I could remove the filter component from the query, because I would only need to pick the right collection and return its documents as the response. But this way documents will be stored redundantly: a document that was earlier stored only once might now be stored in several collections. Is this approach worth following? I use Heroku as the cloud provider. By increasing the number of dynos, it is easy to serve more user requests. As I understand it, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course an index exists for that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 secondary indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each collection also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
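For illustration, enabling sharding looks roughly like this in the mongo shell (mydb, busyCollection, and customerId are placeholders; choosing a good shard key is the part that needs the planning mentioned below):
sh.enableSharding("mydb")
// a hashed shard key spreads writes evenly across the shards
db.busyCollection.createIndex({ customerId: "hashed" })
sh.shardCollection("mydb.busyCollection", { customerId: "hashed" })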
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.

Can I store data that won't affect query performance in MongoDB?

We have an application which requires saving of data that should be in documents, for querying and sorting purposes. The data should be schema-less, as some of the fields would be known only via usage. For this, MongoDB is a great solution and it works great for us.
Part of the data in each document, is for displaying purposes. Meaning the data can be objects (let's say json) that the client side uses in order to plot diagrams.
I tried to save this data using GridFS, but the use case makes it not responsive enough. Also, the documents won't exceed the 16 MB limit even with the diagram data inside them. And in fact, while trying to save this data directly within the documents, we got better results.
This data is used only for client side responses, meaning we should never query it. So my question is, can I insert this data to MongoDB, and set it as a 'not for query' data? Meaning, can I insert this data without affecting Mongo's performance? The data is strict and once a document is inserted, there might be only updating of existing fields, not adding new ones.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for?
Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
As at MongoDB 3.4, read and write operations are atomic on the level of a single document from the storage/memory point of view. If the MongoDB server needs to fetch a document from memory or disk (even when projecting a subset of fields to return) the full document generally has to be loaded into memory on a mongod. The only exception is if you can take advantage of covered queries where all of the fields returned are also included in the index used.
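For example, a covered query could look like this (orders, status, and total are placeholder names; note that _id must be excluded from the projection unless it is part of the index):
db.orders.createIndex({ status: 1, total: 1 })
// covered: the filter and every returned field are satisfied by the index alone,
// so the server does not need to fetch the full documents
db.orders.find({ status: "shipped" }, { _id: 0, status: 1, total: 1 })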
This data is used only for client side responses, meaning we should never query it.
Data fields which aren't queried directly do not need to be in any indexes. However, there is currently no concept like "not for query" fields in MongoDB. You can query or project any field (with or without an index).
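If you do keep the display data inside the same document, you can at least exclude it from the results returned to other consumers with a projection (reports and diagramData are placeholder names). As noted above, the full document is still loaded into memory on the server, so this mainly saves network transfer:
db.reports.find({ reportId: 123 }, { diagramData: 0 })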
Meaning, can I insert this data without affecting Mongo's performance?
Data with very different access or growth patterns (such as your infrequently requested client data) is a recommended candidate for storing separately from a parent document with frequently accessed data. This will improve the efficiency of memory usage for mongod by avoiding unnecessary retrieval of data when working with documents in the parent collection.
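A rough sketch of that separation, where the rarely requested display payload lives in its own collection keyed by the parent document's _id (all names here are placeholders):
// frequently accessed fields stay in the parent collection
db.reports.insert({ _id: 123, title: "Q3 summary", owner: "alice" })
// the display payload is stored separately and only fetched when the client asks for it
db.reportDiagrams.insert({ _id: 123, diagramData: { series: [1, 2, 3], labels: ["a", "b", "c"] } })
db.reportDiagrams.findOne({ _id: 123 })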
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for? Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
You should use a type that is most appropriate for the data that you are storing. Storing text data as binary will not gain you any obvious efficiencies in server storage. However, storing a complex object as a single value (for example, a JSON document serialized as a string) could save some serialization overhead if that object will only be interpreted via your client-side code. Binary data stored in MongoDB will be an opaque blob as far as indexing or querying, which sounds fine for your purposes.
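For example, serializing the diagram object to a string before storing it (a sketch only; plain subdocuments work just as well if you don't need the serialization savings):
var diagram = { series: [1, 2, 3], labels: ["a", "b", "c"] };
// the server treats diagramJson as an opaque string; the client calls JSON.parse() on it
db.reports.update({ _id: 123 }, { $set: { diagramJson: JSON.stringify(diagram) } })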

MongoDB GridFS Size Limit

I am using MongoDB as a convenient way of storing a dataset as a series of columns: there is a document that stores the values for a given column, and another document that stores the details of the dataset and a mapping to the other documents with the associated column values. The issue I'm now facing as things get bigger is that I can no longer store the entire column in a single document.
I'm aware that there is also the GridFS option. The only downside is that I believe it stores the files as blobs, meaning I would lose random access to a chunk of the column, or to the value at a specified index, something that was incredibly useful with the document store; however, I may not have any other option.
So my question is: does GridFS also impose an upper limit on the size of documents, and if so, does anyone know what this is? I've looked in the docs and haven't found anything, but it may be that I'm not looking in the correct place or that there is a limit but it's not well documented.
Thanks,
Vackar
GridFS
Per the GridFS documentation:
Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 256k. GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
GridFS will allow you to store arbitrarily large files however this really won't help your use case. A file in GridFS will effectively be a large binary blob and you will not get any of the benefits of structured documents and indexing.
Schema Design
The fundamental challenge you have is your approach to schema design. If you are creating documents that are likely to grow beyond the 16 MB document limit, these will also have a significant impact on your database storage and fragmentation as the documents grow in size.
The appropriate solution would be to rethink your schema approach so that you do not have unbounded document growth. This probably means flattening the array of "columns" that you are growing so it is represented by a collection of documents rather than an array.
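For example, instead of one dataset document holding an ever-growing array of column values, each value (or small chunk of values) can become its own document (a sketch; datasetId, column, and index are hypothetical field names):
// one document per value, so no document ever grows
db.columnValues.insert({ datasetId: "ds1", column: "price", index: 0, value: 9.99 })
db.columnValues.insert({ datasetId: "ds1", column: "price", index: 1, value: 12.50 })
// this index preserves the random access by position that the question mentions
db.columnValues.createIndex({ datasetId: 1, column: 1, index: 1 })
db.columnValues.findOne({ datasetId: "ds1", column: "price", index: 1 })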
A better (and separate) question to ask would be how to refactor your schema given the expected data growth patterns.

Why are key names stored in the document in MongoDB

I'm curious about this quote from Kyle Banker's MongoDB In Action:
It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
I am interested in the reason why this is not optimized on the database server side. Would an in-memory lookup table with all key names in the collection be too much of a performance penalty to be worth the potential space savings?
What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:
If you want it done, you can currently do it at the Application/ORM/ODM level quite easily.
It's not necessarily a performance** advantage in all cases — think collections with lots of key names, and/or key names that vary wildly between documents.
It might not provide a measurable performance** advantage at all until you have millions of documents.
If the server does it, the full key names still have to be transmitted over the network.
If compressed key names are transmitted over the network, then readability really suffers using the javascript console.
Compressing the entire JSON document offers an even better performance advantage.
Like all features, there's a cost benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".
Full document compression was being considered for a future MongoDB version and is available as of version 3.0 (see below).
* An in-memory lookup table for key names is basically a special case of LZW style compression — that's more or less what most compression algorithms do.
** Compression provides both a space advantage and a performance advantage. Smaller documents means that more documents can be read per IO, which means that in a system with fixed IO, more documents per second can be read.
Update
MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.
Two compression algorithms are available: snappy, and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage capacity.
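For example, the default can be overridden per collection when it is created (a sketch; archive is a placeholder name and zlib is just there to show the option):
db.createCollection("archive", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})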
In my personal (non-scientific, but related to a commercial project) experimentation, snappy compression (we didn't evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, there was slightly better performance in some cases, roughly in line with my previous comments/predictions.
I believe one of the original reasons behind storing the key names with the documents is to allow a more easily scalable schema-less database. Each document is self-contained to a greater extent, in that if you move the document to another server (for example, via replication or sharding) you can index the contents of the document without having to reference separate or centralized metadata such as a mapping of key names to more compact key IDs.
Since there is no enforced schema for a MongoDB collection, the field names can potentially be different for every document in the same collection. In a sharded environment, inserts to each shard are (intentionally) independent so at a document level the raw data could end up differing unless the key mapping was able to be consistent per shard.
Depending on your use case, the key names may or may not consume a significant amount of space relative to the accompanying data. You could always workaround the storage concern from the application / ODM implementation by mapping YourFriendlyKeyNames to shorter DB key equivalents.
There is an open MongoDB Jira issue and some further discussion to have the server tokenize field names, which you can vote on to help prioritize including this feature in a future release.
MongoDB's current design goals include performance with dynamic schemas, replication & high availability, auto-sharding, and in-place updates, with one potential tradeoff being some extra disk usage.
Having to look this up within the database for each and every query would be a serious penalty.
Most drivers allow you to specify ElementName, so that MyLongButReadablePropertyName in your domain model becomes mlbrpn in mongodb.
Therefore, when you query in your application, it's the application that transforms the query that would have been:
db.myCollection.find({"MyLongButReadablePropertyName" : "some value"})
into
db.myCollection.find({"mlbrpn" : "some value"})
Efficient drivers, like the C# driver cache this mapping, so it doesn't need to look this up for each and every query.
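In Node.js, for example, Mongoose can do the equivalent through field aliases (a sketch, assuming Mongoose 4.10+; the names are hypothetical):
var mongoose = require('mongoose');
// 'mlbrpn' is what is stored in MongoDB; the alias is what application code reads and writes
var schema = new mongoose.Schema({ mlbrpn: { type: String, alias: 'myLongButReadablePropertyName' } });
var Thing = mongoose.model('Thing', schema);
var doc = new Thing({ myLongButReadablePropertyName: 'some value' }); // persisted as { mlbrpn: 'some value' }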
Coming back to the title of your question:
Why are key names stored in the document in MongoDB
Because this is the only way documents can be searched: without the key names stored, there'd be no key to search on.
Hope this helps