How does MongoDB do a 42T drive per node?

We had heard MongoDB had one client with 42T per node and I am wondering more about this. I know Cassandra has Bloom filters that skip hitting disk to find out which file a row might be in.
Does MongoDB have something similar to Bloom filters?
Is MongoDB using something similar to SSTables?
I did read that MongoDB does compaction just like Cassandra; wouldn't this be an awfully long process with a 42T node?
I guess I don't know what terms to search for as I research MongoDB here (in Cassandra they are called SSTables).
thanks,
Dean

MongoDB does not support online compaction. In fact, data fragmentation is a current problem in systems with many document updates. To limit fragmentation, MongoDB tries to calculate an automatic padding factor, minimizing the number of data moves.
The compact command blocks the entire database until it finishes. Besides, MongoDB does not support dictionary compression, so field names take space in every object stored. I guess the on-disk layout used by MongoDB is not any fancy data structure: it is simply a header (offset, length, ...), the BSON data, and padding.
Since MongoDB is not a key/value or columnar database, it doesn't use SSTables (an efficient data structure for a columnar layout). The data files are divided into regions called "extents".
AFAIK, MongoDB doesn't use Bloom filters.
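For illustration, here is a minimal pymongo sketch of how you might inspect the padding factor and run the blocking compact command mentioned above. It assumes an MMAPv1-era deployment, and the database/collection names ("mydb", "events") are hypothetical:

```python
# Minimal sketch (pymongo; MMAPv1-era MongoDB assumed, names are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Inspect storage stats; on MMAPv1 the output includes a paddingFactor field
# reflecting the automated padding described above.
stats = db.command("collstats", "events")
print(stats.get("paddingFactor"), stats.get("storageSize"))

# Run the blocking compact command on a single collection. It locks the
# database it runs against, so it is typically run node by node during a
# maintenance window.
db.command("compact", "events")
```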

Related

MongoDB integration with Solr

I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea about the integration steps, but I need info on the points below.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries inserted after the integration?
If we need to periodically update Solr, it becomes extra overhead to maintain the data in Solr as well as in MongoDB. What are the best approaches to overcoming this?
As far as I know there is no official (supported/complete) solution to integrate MongoDB and Solr, but let me give you some ideas and directions.
For me the best approach, when it is possible to modify the application, is to add to the persistence layer the logic to perform all write operations in MongoDB and Solr at the "same" time. That way you control exactly what you send to the database and what you index for full-text search. But as I said, this means you have to change your application code. (You will have to change it anyway to be able to query Solr when needed.) And yes, you have to index all the existing documents the first time.
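A minimal sketch of that dual-write persistence layer, assuming a local Solr core named "articles" exposing the standard JSON update endpoint; the database, collection and field names are hypothetical:

```python
# Dual-write sketch: persist in MongoDB, index the searchable fields in Solr.
# (Core name, URLs and field names are assumptions.)
import requests
from pymongo import MongoClient

articles = MongoClient("mongodb://localhost:27017")["mydb"]["articles"]
SOLR_UPDATE = "http://localhost:8983/solr/articles/update?commit=true"

def save_article(doc):
    # 1. Persist the full document in MongoDB.
    result = articles.insert_one(doc)

    # 2. Index only the fields needed for full-text search in Solr,
    #    keyed by the MongoDB _id so results can be looked up later.
    solr_doc = {
        "id": str(result.inserted_id),
        "title": doc.get("title", ""),
        "body": doc.get("body", ""),
    }
    requests.post(SOLR_UPDATE, json=[solr_doc]).raise_for_status()
    return result.inserted_id

save_article({"title": "MongoDB and Solr", "body": "dual-write example"})
```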
You can use a "connector" approach where MongoDB and Solr are kind of connected together; this can be done in various ways.
You can use for example the MongoDB Connector available here : https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr, also has a connector for MongoDB, documented here: http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it, so I cannot comment, but it is also an approach)
Your point #2 is true: you have to manage two clusters and keep the data in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB... So you need to see whether the best approach for your application is MongoDB alone or MongoDB with Solr (see the comment below)
Just a small comment in addition to this answer:
You are talking about "faster retrieval"; I am not sure that should be the reason. If you write correct queries with correct indexes in MongoDB, you should be able to achieve it without Solr. If your requirement is really oriented towards the power of Solr, meaning full-text search (with all its related features), then it makes sense.
How large is your data? MongoDB has a few good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to identify whether your need fits into MongoDB or you need to spill over to Solr.
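For illustration, a small pymongo sketch of the built-in text search mentioned above, assuming MongoDB 2.6+ where the $text operator is available; the collection and field names are made up:

```python
# Minimal sketch of MongoDB's built-in text search (names are hypothetical).
from pymongo import MongoClient, TEXT

coll = MongoClient()["mydb"]["articles"]

# One text index per collection; it can cover several string fields.
coll.create_index([("title", TEXT), ("body", TEXT)], name="article_text")

# Query with the $text operator and sort by relevance score.
cursor = coll.find(
    {"$text": {"$search": "mongodb solr integration"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])
for doc in cursor:
    print(doc["title"], doc["score"])
```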
About the indexing part: how often is your data updated? If you can afford infrequent updates, then a batch job with once-a-day re-indexing may work for you. Ideally Solr would work well for some form of master data.

MongoDB with LOTS OF data?

I'm a beginner with non-SQL databases like MongoDB, and I can't find anyone talking about collections with lots of data, like 1,000,000 entries or more.
I saw a company page on the official site, but nothing about companies with large data.
I heard about a combo with SQL: large data is stored in SQL tables, and only the "cache" is in MongoDB. Is that the only solution for MongoDB and large data?
We're using MongoDB to power Where's it Up, and the API behind it. We're currently pushing in >3 million documents per day. MongoDB is the only storage engine in use. We were keeping a bunch around for a while, but we're now using TTL to delete old records.
Things are going super well; just make sure you have all the indexes you need. Querying a million+ records without an index is bad, regardless of your storage engine. Auto-failover has been super helpful.
Something to watch out for is updating records to include more information: it can be pretty expensive if the document grows past its pre-allocated space. We ended up changing how we stored data to avoid updates, and create new documents instead.
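A sketch of that insert-only pattern, writing one small document per measurement instead of growing a single document; the database and field names are invented:

```python
# Insert-only pattern sketch (names are hypothetical).
import datetime
from pymongo import MongoClient

checks = MongoClient()["monitoring"]["checks"]

def record_check(url, status, latency_ms):
    # Instead of $push-ing results into one ever-growing document per URL
    # (which forces a document move once it outgrows its padding),
    # write one small, fixed-size document per measurement.
    checks.insert_one({
        "url": url,
        "status": status,
        "latency_ms": latency_ms,
        "checked_at": datetime.datetime.utcnow(),
    })

record_check("https://example.com", 200, 73)
```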
MongoDB in its current incarnation is explicitly designed to make it easy to scale out.
As for the numbers: one of my test databases has 10M records and runs easily on my MacBook Air, which is 4 years old now.
So here is what you can do when your current cluster cannot handle the data stored (either because the indices are too big for your RAM or because processing the queries takes too long): add another node to your MongoDB cluster. Your performance gain should be somewhere between slightly below linear (if your cluster was otherwise in perfect condition) and several orders of magnitude (when indices didn't fit into RAM and/or IO was pushed to its limits before, and that situation changed after scaling out).
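To make "add another node" concrete, here is a hypothetical sketch of registering an additional shard through mongos; the replica-set name and hostnames are placeholders:

```python
# Sketch of scaling out by adding a shard (run against mongos; names below
# are placeholders, not real hosts).
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.net:27017")

# Register the new replica set as an additional shard.
mongos.admin.command("addShard", "rs2/mongodb3.example.net:27017")

# The balancer then migrates chunks of sharded collections onto it,
# spreading data and index size across the enlarged cluster.
print(mongos.admin.command("listShards"))
```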
A word of warning: you should have somebody who knows about MongoDB administration if you want to put your deployment into production. Though MongoDB administration seems easy, it is by no means something to be done by a layman, especially not for production use.

MongoDB fast deletion best approach

My application currently uses MySQL. In order to support very fast deletion, I organize my data in partitions, according to timestamp. Then when data becomes obsolete, I just drop the whole partition.
It works great, and cleaning up my DB doesn't harm my application performance.
I would like to replace MySQL with MongoDB, and I'm wondering if there's something similar in MongoDB, or whether I would just need to delete the records one by one (which, I'm afraid, would be really slow, keep my DB busy, and slow down query response times).
In MongoDB, if your requirement is to delete data to limit the collection size, you should use a capped collection.
On the other hand, if your requirement is to delete data based on a timestamp, then a TTL index might be exactly what you're looking for.
From the official docs regarding capped collections:
Capped collections automatically remove the oldest documents in the collection without requiring scripts or explicit remove operations.
And regarding TTL indexes:
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
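A minimal pymongo sketch of both options described above; the collection names, size and expiry period are arbitrary:

```python
# Capped collection vs. TTL index (names and sizes are hypothetical).
from pymongo import MongoClient

db = MongoClient()["mydb"]

# Option 1: a capped collection, bounded by size; the oldest documents
# are removed automatically as new ones are inserted.
db.create_collection("recent_events", capped=True, size=512 * 1024 * 1024)

# Option 2: a TTL index; mongod's background thread deletes documents
# once their "createdAt" value is older than expireAfterSeconds.
db["events"].create_index("createdAt", expireAfterSeconds=7 * 24 * 3600)
```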
I thought, even though I am late and an answer has already been accepted, I would add a little more.
The problem with capped collections is that they regularly reside on one shard in a cluster. Even though capped collections might appear shardable in later versions of MongoDB, they normally are not (see the edit below). Adding to this, a capped collection MUST be allocated on the spot, so if you wish to have a long history before clearing the data, you might find your collection uses up significantly more space than it should.
TTL is a good answer; however, it is not as fast as drop(). TTL is basically MongoDB doing, server-side, the same thing you would do in your application: judging when a row is historical and deleting it. If done excessively it will have a detrimental effect on performance. Not only that, but it isn't good at freeing up space to your $freelists, which is key to stopping fragmentation in MongoDB.
drop()ing a collection will literally just "drop" the collection on the spot, instantly and gracefully giving that space back to MongoDB (not the OS), giving you absolutely no fragmentation whatsoever. Not only that, but the operation is a lot faster, 90% of the time, than most other alternatives.
So I would stick by my comment:
You could factor the data into time series collections based on how long it takes for data to become historical, then just drop() the collection
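A rough sketch of that time-bucketed approach, assuming monthly collections; the naming convention and retention period are made up:

```python
# Time-bucketed collections plus drop() (naming and retention are hypothetical).
import datetime
from pymongo import MongoClient

db = MongoClient()["mydb"]

def collection_for(ts):
    # Route each write to a per-month collection, e.g. events_2014_05.
    return db["events_%04d_%02d" % (ts.year, ts.month)]

def drop_older_than(months_to_keep):
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=30 * months_to_keep)
    for name in db.list_collection_names():
        if name.startswith("events_"):
            year, month = map(int, name.split("_")[1:3])
            if datetime.datetime(year, month, 1) < cutoff:
                # drop() hands the space back to MongoDB immediately, with
                # no per-document deletes and no fragmentation.
                db[name].drop()

collection_for(datetime.datetime.utcnow()).insert_one({"msg": "hello"})
drop_older_than(6)
```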
Edit
As @Zaid pointed out, capped collections are not shardable, even on the _id field.
One solution to this is using TokuMX which supports partitioning:
https://www.percona.com/blog/2014/05/29/introducing-partitioned-collections-for-mongodb-applications/
Advantages over capped collections: capped collections use a fixed amount of space (even when you don't have that much data) and they can't be resized on the fly. Partitioned collections' usage depends on the data; you can add and remove partitions (for newly inserted data) as you see fit.
Advantages over TTL: TTL is slow; it just takes care of removing old data automatically. Partitions are fast: removing data is basically just a file removal.
HOWEVER: after the acquisition by Percona, development of TokuMX appears to have stopped (I would love to be corrected on this point). Unfortunately MongoDB doesn't support this functionality, and with TokuMX on its way out it looks like we will be left without a proper solution.

MongoDB: BIllions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is; this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
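As a starting point for experimenting with batch sizes, here is a rough pymongo sketch of a batched loader; the TSV layout and field names are assumptions, not from the question:

```python
# Batched bulk-insert sketch (file format and field names are hypothetical).
from pymongo import MongoClient

bigrams = MongoClient()["corpus"]["bigrams"]

def load(path, batch_size=10000):
    batch = []
    with open(path) as f:
        for line in f:
            w1, w2, count = line.rstrip("\n").split("\t")
            batch.append({"w1": w1, "w2": w2, "count": int(count)})
            if len(batch) >= batch_size:
                # ordered=False keeps inserting after individual failures
                # and is usually faster for large loads.
                bigrams.insert_many(batch, ordered=False)
                batch = []
    if batch:
        bigrams.insert_many(batch, ordered=False)

load("bigrams.tsv", batch_size=10000)
```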
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. There are many folks with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have as well, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB, and the guy's blog post.
It does look like sharding would be a good solution for you, but typically sharding is used for scaling across multiple servers, and a lot of folks do it because they want to scale their writes or are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica set as your data grows or you need extra redundancy and resilience.
However, there are other users who use multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect there will be much less reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
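To put the pre-split / stop-the-balancer advice above into concrete form, here is a hypothetical sketch using admin commands through mongos; the namespace, shard key, split points and shard names are placeholders, not recommendations:

```python
# Pre-splitting sketch (run against mongos; all names/values are placeholders).
from pymongo import MongoClient

admin = MongoClient("mongodb://mongos.example.net:27017").admin

# Stop the balancer while pre-splitting and loading (the balancerStop command
# is MongoDB 3.4+; older versions use sh.stopBalancer() in the shell instead).
admin.command("balancerStop")

# Shard the target collection on the chosen key.
admin.command("enableSharding", "corpus")
admin.command("shardCollection", "corpus.bigrams", key={"w1": 1, "w2": 1})

# Pre-split the key space so the bulk load doesn't funnel into one chunk.
for prefix in ["b", "f", "k", "p", "u"]:
    admin.command("split", "corpus.bigrams", middle={"w1": prefix, "w2": ""})

# Place the empty chunks by hand, since the balancer is off
# ("shard0001" etc. come from listShards output).
admin.command("moveChunk", "corpus.bigrams", find={"w1": "k", "w2": ""}, to="shard0001")
```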
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

NoSql database suitable for long value

I am looking to use a NoSQL database for my applications. I have searched the internet and found Berkeley DB, MongoDB, Redis, Tokyo Cabinet, etc. There are some suggestions and use cases about which database to use when. Some useful links I found are:
http://perfectmarket.com/blog/not_only_nosql_review_solution_evaluation_guide_chart
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
But I didn't find which database performs well when the value (in a key-value pair) is very big, like 1 MB or so.
MongoDB looks good to me because of its query features. How does it perform when you store very big documents?
RavenDB has the notion of attachments. In a document, instead of having a property 1 MB in size (usually a byte array), you'd put a minimalistic document with the data you want to Map/Reduce on, and save that large piece of data as an attachment. That speeds things up very well.