I'm curious about this quote from Kyle Banker's MongoDB In Action:
It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
I am interested in the reason why this is not optimized on the database server side. Would an in-memory lookup table with all key names in the collection carry too great a performance penalty to be worth the potential space savings?
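For reference, the per-document overhead can be checked in the mongo shell with Object.bsonsize() (a quick illustration; the example values are arbitrary):

// 4-byte document length + (1-byte type + field name + 8-byte date value) + 1-byte terminator
Object.bsonsize({ date_of_birth: new Date() })   // 28 bytes
Object.bsonsize({ dob: new Date() })             // 18 bytes, i.e. 10 bytes saved per document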
What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:
If you want it done, you can currently do it at the Application/ORM/ODM level quite easily.
It's not necessarily a performance** advantage in all cases — think collections with lots of key names, and/or key names that vary wildly between documents.
It might not provide a measurable performance** advantage at all until you have millions of documents.
If the server does it, the full key names still have to be transmitted over the network.
If compressed key names are transmitted over the network, then readability really suffers using the javascript console.
Compressing the entire JSON document offers an even better performance advantage.
Like all features, there's a cost benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".
Full document compression was being considered for a future MongoDB version and is available as of version 3.0 (see below).
* An in-memory lookup table for key names is basically a special case of LZW style compression — that's more or less what most compression algorithms do.
** Compression provides both a space advantage and a performance advantage. Smaller documents mean that more documents can be read per IO, which means that in a system with fixed IO, more documents per second can be read.
Update
MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.
Two compression algorithms are available: snappy and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage capacity.
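For reference, a compressor can be selected per collection at creation time, and a server-wide default can be set via the storage.wiredTiger.collectionConfig.blockCompressor setting. A minimal shell sketch (the collection name is just an example):

db.createCollection("logs", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})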
In my personal (non-scientific, but related to a commercial project) experimentation, snappy compression (we didn't evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, there was slightly better performance in some cases, roughly in line with my previous comments/predictions.
I believe one of the original reasons behind storing the key names with the documents is to allow a more easily scalable schema-less database. Each document is self-contained to a greater extent, in that if you move the document to another server (for example, via replication or sharding) you can index the contents of the document without having to reference separate or centralized metadata such as a mapping of key names to more compact key IDs.
Since there is no enforced schema for a MongoDB collection, the field names can potentially be different for every document in the same collection. In a sharded environment, inserts to each shard are (intentionally) independent so at a document level the raw data could end up differing unless the key mapping was able to be consistent per shard.
Depending on your use case, the key names may or may not consume a significant amount of space relative to the accompanying data. You could always work around the storage concern from the application / ODM implementation by mapping YourFriendlyKeyNames to shorter DB key equivalents.
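For example, a minimal application-level mapping might look like this (a sketch only; the field and collection names are hypothetical):

// Friendly application names -> short names actually stored in MongoDB
var FIELD_MAP = { dateOfBirth: "dob", emailAddress: "em" };

function toDbDoc(doc) {
    var out = {};
    for (var key in doc) {
        out[FIELD_MAP[key] || key] = doc[key];
    }
    return out;
}

db.users.insert(toDbDoc({ dateOfBirth: new Date("1990-01-01"), emailAddress: "a@example.com" }));
db.users.find(toDbDoc({ emailAddress: "a@example.com" }));   // stored and queried as "em"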
There is an open MongoDB Jira issue and some further discussion to have the server tokenize field names, which you can vote on to help prioritize including this feature in a future release.
MongoDB's current design goals include performance with dynamic schemas, replication & high availability, auto-sharding, and in-place updates, with one potential tradeoff being some extra disk usage.
Having to look up key names within the database for each and every query would be a serious penalty.
Most drivers allow you to specify ElementName, so that MyLongButReadablePropertyName in your domain model becomes mlbrpn in MongoDB.
Therefore, when you query in your application, it's the application that transforms the query that would have been:
db.myCollection.find({"MyLongButReadablePropertyName" : "some value"})
into
db.myCollection.find({"mlbrpn" : "some value"})
Efficient drivers, like the C# driver, cache this mapping, so they don't need to look it up for each and every query.
Coming back to the title of your question:
Why are key names stored in the document in MongoDB?
Because this is the only way documents can be searched.
Without the key names stored, there'd be no key to search on.
Hope this helps
Related
I have a MongoDB database with approximately 50 collections, but this can increase in the future. Each collection will have between 5 and 11 fields.
My question is: how do I optimize MongoDB so that I do not take up storage space because of superLongCollectionFieldName? How are characters/words counted when storing the data?
Let's say I have a field called userID and another field called IP; do they both take the full size of the bit block?
The overall storage required for your data will depend on many use case specific factors including schema, indexes, how compressible the data is, and your data update/deletion patterns. The length of field names does not significantly affect index size (since indexes only store key values and document locations), but long names may have some impact on storage usage. The best way to guesstimate storage usage would be to generate some representative test data using a data generator or by extrapolating from existing data.
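For example, you could insert a batch of representative documents and compare the reported sizes (a rough sketch; the collection and field names are placeholders):

for (var i = 0; i < 100000; i++) {
    db.testdata.insert({ userID: i, IP: "10.0.0." + (i % 255), superLongCollectionFieldName: "value" + i });
}
var s = db.testdata.stats();
print("logical size: " + s.size + " bytes, on-disk size: " + s.storageSize + " bytes");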
MongoDB (as at 4.0) does not maintain a central catalog of field names: field names are stored in each document so documents are self-describing in a distributed deployment. In all modern versions of MongoDB (3.2+) data is compressed by default so the size of field names is not a typical concern for most use cases.
You could implement a mapping to shorter names via application code, but that will add translation overhead and reduce clarity of the documents stored in the server. For more discussion, see: SERVER-863: Tokenize the field names.
I have a query for a collection where I am filtering by one field. I thought I could speed up the query if, based on this field, I made many separate collections whose names contain the field I previously filtered with. Practically, I could remove the filter component of the query, because I would only need to pick the right collection and return its documents as the response. But this way documents will be stored redundantly: a document that was stored only once before might now be stored in several collections. Is this approach worth following?

I use Heroku as cloud provider. By increasing the number of dynos, it is easy to serve more user requests. As far as I know, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course an index exists for that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each collection also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.
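For completeness, sharding the busy collection looks roughly like this in the shell (a sketch; the database, collection and shard key names are placeholders):

sh.enableSharding("mydb")
sh.shardCollection("mydb.busyCollection", { fieldF: "hashed" })   // hashed key spreads writes evenly; { fieldF: 1 } would give a range-based key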
Schemaless is a term that is currently floating around in the NoSQL world.
What does this mean ?
I have a document with 3 properties today and I move to production with it, then what happens to my data when I need to add 2 more properties to my document?
Is this purely a migrations problem where I need to manage my data migration, or can a NoSQL database create as much friction as an RDBMS, or make it easier in some way?
Schema-less is a bit of a misnomer; it's better to think of it as:
SQL = Schema enforced by an RDBMS on Write
NoSQL = Partial schema enforced by the DBMS on Write, PLUS schema fully enforced by the Application on Read (externalised schema)
So while a supposedly schema-less NoSQL data-store will in theory allow you to store any data you like (typically key-value pairs, in a document) without prior knowledge of the keys or data types, it will be pointless unless you have some mechanism to retrieve and use the data. So essentially the schema is partially moved from the RDBMS into the application code. I say partially because you'll have added indexes to document collections and/or partitioned the data for performance, so the NoSQL DBMS will have a partial schema defined locally, and possibly enforced via unique constraints.
As for adding additional attributes to a document/object in the store: depending on how much padding (unused space) is around the document in its physical data block, adding a few more key-value pairs may result in the document having to be physically moved to a larger contiguous block of storage, and the associated indexes rebuilt. If you plan to use the new keys in a frequently used query, you'll also want to add a suitable new index, which will obviously require some physical storage, take a while to build initially, and possibly lead you to ask the sysadmin to allocate more memory to the DBMS so the new index(es) can be cached.
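In MongoDB, for instance, the new keys simply appear on new documents, and an index for them is added explicitly when needed (a minimal shell sketch; the names are placeholders):

db.customers.insert({ name: "Alice" })                      // original document shape
db.customers.insert({ name: "Bob", loyaltyTier: "gold" })   // shape after adding a new property
db.customers.createIndex({ loyaltyTier: 1 })                // index the new key for frequent queries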
A bit late in the day, but while searching on the topic again I found this article:
http://tech.pro/tutorial/1189/basics-of-ravendb-nosql
Refer to section 3 in the article; I will quote it here for ease.
Adding and changing data models to RavenDB couldn't be simpler. Since it is a NoSQL database, it can handle additions and deletions to your models very simply. If a property is added to your class, it will be set to the default value of that type. If a property is deleted, then upon deserialization that value will be ignored. No more futzing with SQL Scripts.
This seems to be the logical answer for RavenDB.
I'm looking for a good approach to do the following:
Given a document with some field F, I want to set up sharding so that my application can generate a static hash for that field value (meaning the hash will always be the same if the value is the same) and then use that hash to target the appropriate shard in a normal MongoDB sharded setup.
Questions:
Is this a safe/good approach?
What is a good way to go about implementing it?
Are there any gotchas concerning shard cluster setup that I should be aware of?
Thanks!
I've actually implemented this and it's very doable and results in very good write performance. I'm assuming you have the same reasons for implementing it as I did (instant shard targeting without warmup/balancing, write throughput, no performance degradation during chunk moves/splits, etc.).
Your questions:
Yes, it is, provided you implement it correctly.
What I do in our in-house ORM layer is mark certain fields in a document as hash-sharded fields. Our ORM will then automatically generate a hash value for that field value just prior to writing or reading the document. Outgoing queries are then decorated with that hash value (in our case always called "hash"), which MongoDB sharding then uses for shard targeting. Obviously in this scenario "hash" is always the only shard key.
Most important by far is to generate good hashes. A lot of field values (most commonly an ObjectId based _id field) are incremental, so your hash algorithm must ensure that the hashes generated for incremental values hit different shards. Other issues include selecting an appropriate chunk size.
Some downsides to consider:
Default MongoDB chunk balancing becomes less useful since you typically set up your initial cluster with a lot of chunks (this facilitates adding shards to your cluster while maintaining good chunk spread across all shards). This means the balancer will only start splitting if you have enough data in your premade chunks to require splitting.
It's likely to become an officially supported MongoDB feature in the near future which may make this entire effort a bit wasteful. Like me you may not have the luxury of waiting though.
Good luck.
UPDATE 25/03/2013: As of version 2.4, MongoDB supports hashed indexes natively.
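With native support, the manual hashing described above can be replaced by a hashed index and shard key (a sketch; names are placeholders):

db.events.ensureIndex({ F: "hashed" })              // standalone hashed index (2.4+)
sh.shardCollection("mydb.events", { F: "hashed" })  // or shard directly on a hashed key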
This is a good and safe idea.
However, the choice of the hash function is crucial:
do you want it to be uniform? (you smooth the load across all your shards, but you lose some semantic bulk access)
do you want it human readable? (you lose efficiency compared to binary hashes, which are very fast, but you gain, well, readability)
can you make it consistent? (beware of language-provided hash functions)
can you enforce uniqueness if you want to?
I have successfully chosen uniformity, binary form, consistency and uniqueness with a MurmurHash3 function:
value -> murmurHash(valueInBinaryForm) followed by the valueInBinaryForm
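A rough sketch of that construction in Node.js (here the built-in crypto module's md5 merely stands in for MurmurHash3, which is not built in; the function name is hypothetical):

var crypto = require("crypto");

function shardKeyValueFor(value) {
    var valueInBinaryForm = Buffer.from(String(value), "utf8");
    // In practice a MurmurHash3 implementation would be used here; md5 is only a placeholder.
    var hash = crypto.createHash("md5").update(valueInBinaryForm).digest();
    // hash prefix followed by the binary form of the original value
    return Buffer.concat([hash, valueInBinaryForm]);
}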
Is SimpleDB similar to MongoDB?
The most substantial similarity is the fact that they both avoid the relational model. Other than that, they are mainly different any way you look at them. Here is a breakdown of a dozen or so ways to compare them.
SimpleDB
An Amazon service hosted, maintained and scaled by Amazon. You are billed for what you use each month beyond the free usage tier.
All data is replicated live in the background across multiple data centers
All replicas are able to service live requests
After a network or server failure any out of sync nodes will resync automatically
Background replication results in eventual consistency but higher (theoretical) availability
All data is stored as String name / String value pairs, each associated with an ItemName
Each item is limited to half a megabyte (each name or value can only be 1024 bytes long, each item holds 256 name / value pairs) and each domain can hold 10GB
These limits make it suitable for data sets that can be broken down into small pieces.
SimpleDB is optimized for many small requests executed in parallel
Throughput limits are in place for each domain of data
Horizontal Scalability is achieved by spreading your data across more domains
All attribute values are indexed automatically; compound indexes don't exist (but can be simulated)
Queries are performed using a (stripped down) SQL Select-like query language
MongoDB
An open source product that you install and maintain on your own servers.
Data can be replicated in master-slave mode
Only the master can service live write requests; slaves can service queries (except in the non-recommended limited master-master mode)
After a network or server failure or when a replica falls too far behind, operator intervention will always be required.
The single master is strongly consistent.
All data is stored as serialized JSON documents, allowing a large set of data types
Each document is limited to 4MB, larger documents can be stored using a special document chunking system
Most suitable for small and medium-sized data, and small binary objects
Throughput limits are dictated by MongoDB and your hardware
Vertical scalability via a bigger server, potential for future horizontal scalability across your own server cluster via a sharding module currently in development.
The document id is indexed automatically. Indexes can be created and deleted as needed. Indexes can be for a single key or compound.
Queries are performed using a JSON style query language.
SimpleDB is described as:
The data model is simply: Large collections of items organized into domains. Items are little hash tables containing attributes of key, value pairs. Attributes can be searched with various lexicographical queries.
MongoDB is a bit simpler:
The database manages collections of JSON-like documents which are stored in a binary format referred to as BSON.
I have a decent knowledge of MongoDB and just started to work with SimpleDB. First of all, neither of them is a key-value store. MongoDB and SimpleDB are document-based, schema-free NoSQL databases. This means that you do not need to create a schema for a 'table' before entering data into it (basically, you can store anything you want in there).
Basically here the similarity ends. I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and can run only on Amazon AWS. You also have to pay for a whole bunch of stuff with S
S has a whole bunch of strange limitations; M's limitations are way more reasonable. The strangest limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return you) - 1Mb
S supports only a few languages (java, php, python, ruby, .net); M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (though it is straightforward to learn the basics)
with M you have to learn how to architect your database. Because many people think that schemaless means you can throw any junk into the database and extract it with ease, they might be surprised that the junk in, junk out maxim still applies. I assume the same is true of S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to somewhat (ugly / no index) overcome this limitation without introducing an additional lowercase field or application logic (see the sketch after this list).
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely: if 5 seconds pass and the query has not finished, you end up with a partial number and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers and dates) in S. M's type support is way richer.
both do not have transactions and joins
M supports compression which is really helpful for nosql stores, where the same field name is stored all-over again.
S supports just single-attribute indexes; M has single-field, compound, multikey, geospatial indexes, etc.
both support replication and sharding
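Regarding case-insensitive search in M (mentioned above), the regex workaround looks like this; an unanchored or case-insensitive regex cannot use an index efficiently (names are placeholders):

db.users.find({ name: /^john smith$/i })   // matches regardless of case, but scans rather than seeks the index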
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average, and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. On the other hand, Mongo supports a rich query language.
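For instance, a group-by with a sum, which S cannot express, is straightforward in M's aggregation framework (a sketch; the field names are placeholders):

db.orders.aggregate([
    { $group: { _id: "$userID", total: { $sum: "$amount" }, count: { $sum: 1 } } }
])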