I'm looking for a good approach to do the following :
Given a document with some field F, I want to set up sharding so that my application can generate a static hash for that field value (meaning the hash will always be the same if the value is the same) and then use that hash to target the appropriate shard in a normal MongoDB sharded setup.
Questions :
Is this a safe/good approach?
What is a good way to go about implementing it?
Are there any gotchas concerning shard cluster setup that I should be aware of?
Thanks!
I've actually implemented this and it's very doable and results in very good write performance. I'm assuming you have the same reasons for implementing it as I did (instant shard targeting without warmup/balancing, write throughput, no performance degradation during chunk moves/splits, etc.).
Your questions :
Yes, it is, provided you implement it correctly.
What I do in our in-house ORM layer is mark certain fields in a document as hash sharded fields. Our ORM will then automatically generate a hash value for that field value just prior to writing or reading the document. Outgoing queries are then decorated with that hash value (in our case always called "hash"), which MongoDB sharding then uses for shard targeting. Obviously in this scenario "hash" is always the only sharding key.
Most important by far is to generate good hashes. A lot of field values (most commonly an ObjectId based _id field) are incremental, so your hash algorithm must be such that the hashes generated for incremental values hit different shards. Other issues include selecting the appropriate chunk size.
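To make the ORM step above a bit more concrete, here is a minimal sketch in Python. It is not our actual ORM code: the helper names shard_hash and decorate are made up for illustration, and MD5 is just one possible stable hash (chosen for its distribution, not for security).

```python
import hashlib

def shard_hash(value) -> int:
    """Return a stable 63-bit hash of a field value.

    Unlike Python's built-in hash(), the result is identical across
    processes and machines, which is what shard targeting needs.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    # Take 8 bytes and drop one bit so the value fits a signed 64-bit integer.
    return int.from_bytes(digest[:8], "big") >> 1

def decorate(document: dict, field: str) -> dict:
    """Return a copy of a document (or query) with the derived "hash" key added."""
    decorated = dict(document)
    decorated["hash"] = shard_hash(document[field])
    return decorated

# Incremental values should not cluster: spread 10,000 sequential ids
# over 4 equal hash ranges and count how many land in each range.
NUM_BUCKETS = 4
counts = [0] * NUM_BUCKETS
for i in range(10_000):
    counts[shard_hash(i) * NUM_BUCKETS >> 63] += 1
```

The same decorate call would be applied to both the document being written and the query being sent, so that the "hash" key is always present for the router to target on.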
Some downsides to consider :
Default MongoDB chunk balancing becomes less useful since you typically set up your initial cluster with a lot of chunks (this facilitates adding shards to your cluster while maintaining good chunk spread across all shards). This means the balancer will only start splitting if you have enough data in your premade chunks to require splitting.
It's likely to become an officially supported MongoDB feature in the near future which may make this entire effort a bit wasteful. Like me you may not have the luxury of waiting though.
Good luck.
UPDATE 25/03/2013 : As of version 2.4 MongoDB supports hashed indexes and hashed shard keys natively.
This is a good and safe idea.
However, the choice of hash function is crucial:
do you want it to be uniform? (You smooth the load across all your shards, but you lose some semantic bulk access.)
do you want it to be human readable? (You lose efficiency compared to binary hashes, which are very fast, but you gain, well, readability.)
can you make it consistent? (Beware of language-provided hash functions.)
can you enforce uniqueness if you want to?
I have successfully chosen uniformity, binary form, consistency and uniqueness with a murmurHash3 function:
value -> murmurHash3(valueInBinaryForm) followed by the valueInBinaryForm
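A rough sketch of that "hash prefix followed by the value" layout, in Python. Note the swap: murmurHash3 is not in the Python standard library, so this uses a 4-byte BLAKE2b digest as a stand-in, and the function names are only illustrative. The point is the structure: the prefix spreads keys uniformly, and the appended value keeps keys unique even when two prefixes collide.

```python
import hashlib

PREFIX_BYTES = 4  # assumed prefix width; murmurHash3's 32-bit variant is also 4 bytes

def prefixed_key(value: bytes) -> bytes:
    """Build a key as <hash of value> + <value>."""
    # BLAKE2b here is a stand-in for murmurHash3, not the original function.
    prefix = hashlib.blake2b(value, digest_size=PREFIX_BYTES).digest()
    return prefix + value

def original_value(key: bytes) -> bytes:
    """Recover the original value by stripping the hash prefix."""
    return key[PREFIX_BYTES:]
```

Because the original value is part of the key, two different values can never produce the same key, whatever the hash does.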
in our DB we have a large text field which we want to filter on exists/does not exist basis. So we don't need to perform any text search in it.
we assume that an index would help, but it's not guaranteed that the field won't exceed 1024 bytes, so an ordinary index is not an option.
does hashed index on such field support $exists-filtering queries?
do hashed indexes impose any field-size limitations (in our experiments, hashed index is well capable of indexing fields where ordinary index fails)? We haven't found any explicit statement on this in docs though.
is the chosen approach as a whole the correct one?
Yes, your approach is the correct one given the constraints. However, there are some caveats.
The performance advantage of an index compared to a collection scan is limited by the RAM available, since mongod tries to keep indices in RAM. If it can't (due to queries, for example), even an index will be read from disk, more or less eliminating the performance advantage of using it. So you should test whether the additional index pushes the RAM needed beyond the limits of your planned deployment.
The other, more severe problem is that you can not use said index to reliably distinguish unique documents with it, since there is no guarantee for uniqueness on hashes. Albeit a bit theoretical, you have to keep that in mind.
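To put a number on "a bit theoretical", here is a back-of-the-envelope estimate using the standard birthday-bound approximation. The 64-bit hash width is an assumption on my part about how hashed index values are stored; adjust hash_bits if that does not match your version.

```python
import math

def collision_probability(num_values: int, hash_bits: int = 64) -> float:
    """Approximate chance that at least two of num_values distinct inputs
    share a hash: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = num_values * (num_values - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-exponent)

print(collision_probability(1_000_000))      # tiny for a million values
print(collision_probability(1_000_000_000))  # a few percent for a billion values
```

So for small collections a collision is very unlikely, but the risk is not zero and grows roughly with the square of the number of distinct values.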
I'm working on a multi-tenant application running on mongodb. Each tenant can create multiple applications. The schema for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
If you use a hash based shard key with the input constantly changing (ObjectID can generally be considered to be unique for each record), then you will get no locality of data on shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade off with this kind of approach, the same is true of the built in hash based sharding, those trade offs don't change just because it is a manual hash constructed of two fields.
Basically because MongoDB uses range based chunks to split up the data for a given shard key you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, then the data in a single sequential range will basically be random. Hence, even within a single chunk you will have no data locality, let alone on a shard, it will be completely random (by design).
If you wanted to be able to have applications grouped together in ranges, and hence more likely to be on a particular shard then you would be better off to pre-pend the app_id to make it the leftmost field in a compound shard key. Something like sharding on the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
Although the ObjectID is monotonically increasing over time (more discussion on that here), this might still work well if there are a decent number of application IDs and you are going to be doing range-based or targeted queries on the ObjectID. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you want the shard key (ideally) to satisfy it if at all possible. It has to be indexed, and it has to be used by the mongos to route the query (if not, the query is scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
If you go with the manual hashed key approach not only will you have a random distribution, but unless you are going to be querying on that hash it's not going to be very useful.
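The locality difference can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: chunks are modelled as equal-sized slices of the sorted key space, a sequence number stands in for the ObjectId, and MD5 stands in for murmurhash.

```python
import hashlib

NUM_APPS, DOCS_PER_APP, NUM_CHUNKS = 8, 500, 8
docs = [(app, seq) for app in range(NUM_APPS) for seq in range(DOCS_PER_APP)]

def manual_hashed_key(doc):
    """Hash of the record's unique id first, then the app id (the proposed key)."""
    app, seq = doc
    return hashlib.md5(f"{app}:{seq}".encode()).hexdigest() + str(app)

def chunks_per_app(key_func):
    """Sort docs by shard key, cut into equal ranges, count chunks each app touches."""
    ordered = sorted(docs, key=key_func)
    chunk_size = len(ordered) // NUM_CHUNKS
    touched = {}
    for position, (app, _seq) in enumerate(ordered):
        touched.setdefault(app, set()).add(position // chunk_size)
    return {app: len(chunks) for app, chunks in touched.items()}

compound = chunks_per_app(lambda doc: doc)   # models {app_id : 1, _id : 1}
hashed = chunks_per_app(manual_hashed_key)   # models the hash-first manual key

print(compound)  # every app sits in a single chunk
print(hashed)    # every app is spread over all chunks
```

With the compound key each application's documents stay together in one range; with the hash-first key each application is scattered across every chunk, which is exactly the trade-off described above.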
I'm curious about this quote from Kyle Banker's MongoDB In Action:
It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
I am interested in the reason why this is not optimized on the database server side. Would an in-memory lookup table with all key names in the collection be too much of a performance penalty that is not worth the potential space savings?
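For reference, the arithmetic behind the quoted figure can be checked in a few lines of Python. The only assumption is that the saving per document equals the difference in key-name length in bytes.

```python
LONG_KEY, SHORT_KEY = "date_of_birth", "dob"
DOCUMENTS = 1_000_000_000

# Key names are stored as byte strings, so the saving is the length difference.
saved_per_document = len(LONG_KEY.encode("utf-8")) - len(SHORT_KEY.encode("utf-8"))
saved_total_bytes = saved_per_document * DOCUMENTS

print(saved_per_document)                    # 10
print(saved_total_bytes / 10**9)             # 10.0 (decimal gigabytes)
print(round(saved_total_bytes / 2**30, 2))   # 9.31 (binary gibibytes)
```

That matches the "nearly 10 GB" in the quote when the total is expressed in binary units.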
What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:
If you want it done, you can currently do it at the Application/ORM/ODM level quite easily.
It's not necessarily a performance** advantage in all cases — think collections with lots of key names, and/or key names that vary wildly between documents.
It might not provide a measurable performance** advantage at all until you have millions of documents.
If the server does it, the full key names still have to be transmitted over the network.
If compressed key names are transmitted over the network, then readability really suffers using the javascript console.
Compressing the entire JSON document might offer an even better performance advantage.
Like all features, there's a cost benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".
Full document compression is available as of version 3.0 (see below).
* An in-memory lookup table for key names is basically a special case of LZW style compression — that's more or less what most compression algorithms do.
** Compression provides both a space advantage and a performance advantage. Smaller documents means that more documents can be read per IO, which means that in a system with fixed IO, more documents per second can be read.
Update
MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.
Two compression algorithms are available: snappy, and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage capacity.
In my personal (non-scientific, but related to a commercial project) experimentation, snappy compression (we didn't evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, there was slightly better performance in some cases, roughly in line with my previous comments/predictions.
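If you want a feel for why repeated key names stop mattering once documents are compressed, here is a rough illustration in Python. It is only a proxy: it compresses JSON text with the standard zlib module, whereas the server compresses its own on-disk format, and the sample documents are made up.

```python
import json
import zlib

# 1,000 made-up documents that all repeat the same long key names.
documents = [
    {"date_of_birth": f"19{i % 100:02d}-01-01", "customer_identifier": i}
    for i in range(1000)
]

raw = json.dumps(documents).encode("utf-8")
compressed = zlib.compress(raw)
ratio = len(compressed) / len(raw)

print(len(raw), len(compressed), round(ratio, 2))
```

The repeated key names are exactly the kind of redundancy a general-purpose compressor removes, so the compressed size ends up a small fraction of the raw size.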
I believe one of the original reasons behind storing the key names with the documents is to allow a more easily scalable schema-less database. Each document is self-contained to a greater extent, in that if you move the document to another server (for example, via replication or sharding) you can index the contents of the document without having to reference separate or centralized metadata such as a mapping of key names to more compact key IDs.
Since there is no enforced schema for a MongoDB collection, the field names can potentially be different for every document in the same collection. In a sharded environment, inserts to each shard are (intentionally) independent so at a document level the raw data could end up differing unless the key mapping was able to be consistent per shard.
Depending on your use case, the key names may or may not consume a significant amount of space relative to the accompanying data. You could always workaround the storage concern from the application / ODM implementation by mapping YourFriendlyKeyNames to shorter DB key equivalents.
There is an open MongoDB Jira issue and some further discussion to have the server tokenize field names, which you can vote on to help prioritize including this feature in a future release.
MongoDB's current design goals include performance with dynamic schemas, replication & high availability, auto-sharding, and in-place updates .. with one potential tradeoff being some extra disk usage.
Having to look this up within the database for each and every query would be a serious penalty.
Most drivers allow you to specify ElementName, so that MyLongButReadablePropertyName in your domain model becomes mlbrpn in mongodb.
Therefore, when you query in your application, it's the application that transforms the query that would have been:
db.myCollection.find({"MyLongButReadablePropertyName" : "some value"})
into
db.myCollection.find({"mlbrpn" : "some value"})
Efficient drivers, like the C# driver, cache this mapping, so they don't need to look it up for each and every query.
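If your driver or ODM doesn't do this for you, the same idea is easy to sketch by hand. The following Python is only an illustration: the mapping table and function names are hypothetical, and it handles top-level keys only.

```python
# Hypothetical mapping from readable names to short stored names.
FIELD_MAP = {"MyLongButReadablePropertyName": "mlbrpn"}
# Built once and reused, so no lookup cost beyond a dict access per key.
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}

def to_db(document: dict) -> dict:
    """Translate readable key names to short ones before writing or querying."""
    return {FIELD_MAP.get(key, key): value for key, value in document.items()}

def from_db(document: dict) -> dict:
    """Translate short key names back to readable ones after reading."""
    return {REVERSE_MAP.get(key, key): value for key, value in document.items()}

query = to_db({"MyLongButReadablePropertyName": "some value"})
print(query)  # {'mlbrpn': 'some value'}
```

Keys that are not in the mapping pass through unchanged, so the mapping can be introduced gradually.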
Coming back to the title of your question:
Why are key names stored in the document in MongoDB?
Because this is the only way documents can be searched.
Without the key names stored, there'd be no key to search on.
Hope this helps
I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is -- this partly depends on the size of the objects you're inserting and on other factors that are hard to measure. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
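Trying a few batch sizes is easier if the batching is separated from the insert call. A minimal sketch in Python (the helper name is made up, and the actual insert call is left out since it depends on your driver):

```python
from itertools import islice

def batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Each batch would be handed to the driver's bulk insert call;
# here the batches are just printed.
for batch in batches(range(7), 3):
    print(batch)
```

Because it works on any iterable, the input can be streamed from a file without loading all 6.6 billion records into memory, and the batch size can be varied between test runs.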
Mongo can easily handle billions of documents and can have billions of documents in the one collection, but remember that the maximum document size is 16 MB. There are many folk with billions of documents in MongoDB and there's lots of discussion about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, other users run multiple mongods on a single server to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying: a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out, your working set is smaller than RAM and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
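Deciding the splits up front mostly comes down to computing evenly spaced boundaries over the shard key range. A small sketch in Python, assuming a numeric shard key spread uniformly over the signed 64-bit range (for example a hash value); for other key types you would pick boundaries from a sample of your data instead.

```python
def split_points(num_chunks, low=-2**63, high=2**63):
    """Return the num_chunks - 1 boundaries that cut [low, high) into equal ranges."""
    step = (high - low) // num_chunks
    return [low + step * i for i in range(1, num_chunks)]

print(split_points(4))  # [-4611686018427387904, 0, 4611686018427387904]
```

Each boundary would then be passed to the split command for the collection, and the resulting chunks moved to their target shards before the load starts.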
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.
Is SimpleDB similar to MongoDB?
The most substantial similarity is the fact that they both avoid the relational model. Other than that, they are mainly different any way you look at them. Here is a breakdown of a dozen or so ways to compare them.
SimpleDB
An Amazon service hosted, maintained and scaled by Amazon. You are billed for what you use each month beyond the free usage tier.
All data is replicated live in the background across multiple data centers
All replicas are able to service live requests
After a network or server failure any out of sync nodes will resync automatically
Background replication results in eventual consistency but higher (theoretical) availability
All data is stored as String name / String value pairs, each associated with an ItemName
Each item is limited to half a megabyte (each name or value can only be 1024 bytes long, each item holds 256 name / value pairs) and each domain can hold 10GB
These limits make it suitable for data sets that can be broken down into small pieces.
SimpleDB is optimized for many small requests executed in parallel
Throughput limits are in place for each domain of data
Horizontal Scalability is achieved by spreading your data across more domains
All attribute values are indexed automatically; compound indexes don't exist (but can be simulated)
Queries are performed using a (stripped down) SQL Select-like query language
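The "half a megabyte" item limit listed above follows directly from the per-attribute limits. A quick check in Python, treating the 10GB domain limit as binary gigabytes (an assumption; the result changes slightly with decimal units):

```python
MAX_PAIRS = 256                             # name / value pairs per item
MAX_NAME_BYTES = MAX_VALUE_BYTES = 1024     # limit for each name and each value

max_item_bytes = MAX_PAIRS * (MAX_NAME_BYTES + MAX_VALUE_BYTES)
print(max_item_bytes)           # 524288
print(max_item_bytes / 2**20)   # 0.5 (megabytes)

# How many maximum-sized items fit in one 10GB domain.
domain_bytes = 10 * 2**30
print(domain_bytes // max_item_bytes)  # 20480
```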
MongoDB
An open source product that you install and maintain on your own servers.
Data can be replicated in master-slave mode
Only the master can service live write requests; slaves can service queries (except in the non-recommended limited-master-master mode)
After a network or server failure or when a replica falls too far behind, operator intervention will always be required.
The single master is strongly consistent.
All data is stored as serialized JSON documents, allowing a large set of data types
Each document is limited to 4MB, larger documents can be stored using a special document chunking system
Most suitable for small and medium sized data, and small binary objects
Throughput limits are dictated by MongoDB and your hardware
Vertical scalability via a bigger server, potential for future horizontal scalability across your own server cluster via a sharding module currently in development.
The document id is indexed automatically. Indexes can be created and deleted as needed. Indexes can be for a single key or compound.
Queries are performed using a JSON style query language.
SimpleDB is described as:
The data model is simply:
Large collections of items organized into domains.
Items are little hash tables containing attributes of key, value pairs.
Attributes can be searched with various lexicographical queries.
MongoDB is a bit simpler:
The database manages collections of JSON-like documents which are stored in a binary format referred to as BSON.
I have a decent knowledge of MongoDB and just started to work with SimpleDB. So first of all, neither of them is a plain key-value store. MongoDB and SimpleDB are document-based NoSQL databases which are schema-free. This means that you do not need to create a schema for a 'table' before entering data into it (basically it means you can store anything you want there).
Basically here the similarity ends. I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and can run only on Amazon AWS. You also have to pay for a whole bunch of stuff with S
S has whole bunch of strange limitations. M limitations are way more reasonable. The most strange limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return to you) - 1 MB
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (although it is straightforward to learn the basics)
with M you have to learn how you architect your database. Because many people think that schemaless means that you can throw any junk in the database and extract this with ease, they might be surprised that Junk in, Junk out maxim works. I assume that the same is in S, but can not claim it with certainty.
both do not allow case insensitive search. In M you can use regex to somehow (ugly/no index) overcome this limitation without introducing the additional lowercase field/application logic.
in S sorting can be done only on one field
because of the 5s time limit, count in S can behave strangely. If 5 seconds pass and the query has not finished, you end up with a partial number and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non string values (like numbers, dates) in S. M type support is way richer.
both do not have transactions and joins
M supports compression which is really helpful for nosql stores, where the same field name is stored all-over again.
S supports just a single index, M has single, compound, multi-key, geospatial etc.
both support replication and sharding
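The count behaviour mentioned in the list above (partial number plus a continuation token) means the application has to loop and add up the pieces. Here is a sketch of that loop in Python. The fake_count_page function is a made-up stand-in that simulates a service returning partial counts; it is not the SimpleDB API.

```python
def fake_count_page(total, token=None, per_call=2500):
    """Hypothetical stand-in for one count call that may stop early.

    Returns (partial_count, next_token); next_token is None when finished.
    """
    start = token or 0
    counted = min(per_call, total - start)
    finished = start + counted >= total
    return counted, (None if finished else start + counted)

def full_count(total):
    """Keep calling with the continuation token and sum the partial counts."""
    count, token = 0, None
    while True:
        partial, token = fake_count_page(total, token)
        count += partial
        if token is None:
            return count

print(full_count(6000))  # 6000
```

With a real client the loop has the same shape: pass the token from the previous response into the next request until no token comes back.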
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. On the other hand, Mongo supports a rich query language.