I want to understand more about the system and DB architecture of MongoDB.
I am trying to understand how MongoDB stores and retrieves documents, and whether it is all held in memory.
A comparative analysis between MongoDB and Oracle would be a bonus, but I am mostly focused on understanding the MongoDB architecture per se.
Any pointers will be helpful.
MongoDB (with its original MMAPv1 storage engine) memory-maps the database files. It lets the OS manage the mapping and allocate as much RAM to it as possible. As MongoDB updates and reads from the DB, it is reading from and writing to RAM. All indexes on the documents in the database are also held in RAM. The memory-mapped data is flushed to disk every 60 seconds. To prevent data loss in the event of a power failure, the default is to run with journaling switched on. The journal file is flushed to disk every 100 ms and, if power is lost, it is used to bring the database back to a consistent state.
An important design decision with Mongo is the amount of RAM. You need to figure out your working set size - i.e. if you are only going to be reading and writing the most recent 10% of your data, then that 10% is your working set and should be held in memory for maximum performance. So if your working set is 10 GB, you are going to need 10 GB of RAM for maximum performance - otherwise your queries/updates will run slower as pages are read from disk into memory.
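For illustration, here is a minimal sketch (using pymongo, with a placeholder connection string and hypothetical database/collection names) of how you might inspect memory usage and per-collection data/index sizes when estimating whether your working set fits in RAM:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]                                # hypothetical database name

# Overall memory picture (resident/virtual sizes, in MB).
print(db.command("serverStatus")["mem"])

# Per-collection data and index sizes help estimate the working set.
stats = db.command("collStats", "mycollection")    # hypothetical collection name
print(stats["size"], stats["storageSize"], stats["totalIndexSize"])
```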
Other important aspects of MongoDB are replication for redundancy and failover, and sharding for scaling.
There are a lot of great online resources for learning. MongoDB is free and open source.
EDIT:
It's a good idea to check out the tutorial: http://www.mongodb.org/display/DOCS/Tutorial
and the manual: http://www.mongodb.org/display/DOCS/Manual
and the Admin Zone is useful too: http://www.mongodb.org/display/DOCS/Admin+Zone
And if you get bored of reading, the presentations are worth checking out: http://www.10gen.com/presentations
We have certain Linux devices which send data like battery percentage, CPU utilization, RAM utilization, etc. at certain intervals. We want to run analytics on this data. Should we capture this data in MongoDB (https://www.mongodb.com/blog/post/time-series-data-and-mongodb-part-1-introduction) or use a dedicated time series database like InfluxDB or another TSDB? The data generated is around 100 GB per day and we want to keep it for the last 3 months.
TSDB benchmarks (TimescaleDB vs MongoDB, InfluxDB vs MongoDB) show that dedicated time series databases outperform MongoDB. At 100 GB per day over 3 months, on-disk data compression is also important. VictoriaMetrics seems to be leading in ingestion rate, query speed and compression for typical use cases, although TimescaleDB has recently improved its data compression. Have a look at the Yandex ClickHouse benchmarks too.
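For illustration, a minimal sketch of ingesting this kind of device telemetry into a TSDB, here using the influxdb-client Python package (the URL, token, org, bucket and measurement names are placeholders):

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder URL, token, org and bucket.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One sample from one device: tags identify the series, fields carry the values.
point = (
    Point("device_metrics")
    .tag("device_id", "device-42")
    .field("battery_pct", 87.5)
    .field("cpu_util", 0.31)
    .field("ram_util", 0.62)
)
write_api.write(bucket="telemetry", record=point)
```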
For another alternative, check out QuestDB at Questdb.io. QuestDB outperforms all of the above-mentioned TSDBs and is SQL-based.
You can try it out for speed at http://try.questdb.io:9000/, which is a live instance loaded with 1.9B rows of data from the NYC Taxi dataset.
For time series data, it's highly recommended to use a time series database instead of an RDBMS or a general-purpose NoSQL DB, because a TSDB's storage and query engine are optimized for time series data.
Here I want to recommend a lightweight, high-performance, open source time series database, TDengine. TDengine is a distributed TSDB whose clustering solution is also open source, and it supports SQL for ease of use.
https://tdengine.com/
I've got a problem with the gfix sweep command: it doesn't seem to clean up the garbage. What could the problem be? The database backup is 900 MB smaller than the database itself. What could be wrong if a manually started gfix sweep doesn't work?
A backup is smaller because it doesn't contain indexes, but just the database data itself, and it only contains data of the latest committed transaction, no earlier record versions. In addition, the storage format of the backup is more efficient, because it is written and read serially and doesn't need the more complex layout used for the database itself.
In other words, in almost all cases a backup will be smaller than the database itself, sometimes significantly smaller (if you have a lot of indexes or a lot of transaction churn, or a lot of blobs).
Garbage collection in Firebird will remove old record versions; a sweep will also clean up transaction information. Neither will release allocated pages, that is, the database file will not shrink. See Firebird for the Database Expert: Episode 4 - OAT, OIT, & Sweep.
If you want to shrink a database, you need to back it up and restore it, but generally there is no need for that: Firebird will re-use free space on its data pages automatically.
See also Firebird for the Database Expert: Episode 6 - Why can't I shrink my databases.
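For reference, a minimal sketch of the backup-and-restore cycle mentioned above, driving Firebird's gbak tool from Python (file names and credentials are placeholders):

```python
import subprocess

# Back up the database (gbak -b) and restore it into a new file (gbak -c).
subprocess.run(
    ["gbak", "-b", "-v", "-user", "SYSDBA", "-password", "masterkey",
     "employee.fdb", "employee.fbk"],
    check=True,
)
subprocess.run(
    ["gbak", "-c", "-v", "-user", "SYSDBA", "-password", "masterkey",
     "employee.fbk", "employee_restored.fdb"],
    check=True,
)
```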
I want to implement MongoDB as a distributed database, but I cannot find good tutorials for it. Whenever I search for "distributed database" with MongoDB, I get links about sharding, so I am confused: are the two the same thing?
Generally speaking, if you have a read-heavy system, you may want to use replication: one primary with at most 50 secondaries. The secondaries share the read load while the primary takes care of writes. It is an auto-failover system, so when the primary goes down, one of the secondaries takes over and becomes the new primary.
Sharding, however, is more flexible. All the shards share the write load as well as the read load; that is to say, the data is distributed across different shards. And each shard can itself consist of a replica set, with auto-failover working as described above.
I would choose replication first because it's simpler and is basically enough for most scenarios. Once it's not enough, you can convert from replication to sharding.
There is also another discussion of differences between replication and sharding for your reference.
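For illustration, a minimal pymongo sketch of reading from a replica set with reads preferring secondaries (the hosts, replica set name, database and collection are hypothetical):

```python
from pymongo import MongoClient

# Hypothetical three-member replica set named "rs0".
client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0",
    readPreference="secondaryPreferred",  # route reads to secondaries when available
)
db = client["mydb"]
doc = db["customers"].find_one({"_id": 42})  # read can be served by a secondary
print(doc)
```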
Just some perspective on distributed databases:
In the early nineties a lot of applications were desktop based and had a local database which contained MBs/GBs of data.
Now, with the advent of web-based applications, there can be millions of users who use and store their data, and this data can run into GBs/TBs/PBs. Storing all this data on a single server is economically expensive, so there is a cluster of servers (or commodity hardware) across which the data is horizontally partitioned. Sharding is another term for horizontal partitioning of data.
For example, say you have a Customer table which contains 100 rows and you want to shard it across 4 servers. You can pick key-based sharding, in which customers will be distributed as follows: SHARD-1 (1-25), SHARD-2 (26-50), SHARD-3 (51-75), SHARD-4 (76-100).
Sharding can be done in 2 ways (both are sketched below):
Hash based
Key (range) based
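For illustration, a minimal pymongo sketch of the two approaches, run against a mongos router of a sharded cluster (the host, database, collection and shard key are hypothetical):

```python
from pymongo import MongoClient

# Connect to a mongos router of a hypothetical sharded cluster.
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the collection.
client.admin.command("enableSharding", "mydb")

# Key (range) based sharding on customer_id:
client.admin.command("shardCollection", "mydb.customers", key={"customer_id": 1})

# Hash-based sharding would instead use:
# client.admin.command("shardCollection", "mydb.customers", key={"customer_id": "hashed"})
```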
We have a set of data in MongoDB that we are map-reducing (twice). We're going to be using Mongo's map reduce for now, but I'm thinking about how to scale and improve performance in the future, and am considering Hadoop.
Most of the stuff I'm reading about Hadoop talks about big data - terabytes of the stuff - whereas we're going to be dealing with megabytes, tens or maybe hundreds of thousands of records. (There may be many of these tasks running concurrently though, so whilst a single task is small, the total could be large.)
We really want to get insane performance out of small data rather than make it possible to do big data, i.e. get map reduce results that take tens of seconds in MongoDB to take seconds or sub-second in Hadoop.
Is this possible?
Is Hadoop a good fit for this?
If not what other technologies are there that will make this possible?
Details of the exact problem this is needed for, and my solution to date, can be found in this question: Linear funnel from a collection of events with MongoDB aggregation, is it possible?
Is this possible?
NO. No matter how small your data is, there will always be some initial delay when running MR jobs, incurred because a lot of things happen under the hood: checking input/output paths, split creation, mapper creation, etc. This is unavoidable.
Is Hadoop a good fit for this?
NO. You can't expect Hadoop to give you results in nanoseconds or even a few milliseconds.
If not what other technologies are there that will make this possible?
If you need something really fast that also scales well, have a look at Storm.
Most of the stuff I'm reading about Hadoop talks about big data - terabytes of the stuff - whereas we're going to be dealing with megabytes, tens or maybe hundreds of thousands of records.
One of the things that gives Hadoop its speed is its clustering ability with map reduce, and such things, of course, only really apply to "big data" (whatever that means these days).
In fact, map reduce is normally slower than, say, the aggregation framework on small data, because of how long it takes to actually run an average map reduce job.
Map reduce is really designed for something other than what you're doing.
You could look into storing your data in a traditional database and using that database's aggregation capabilities, i.e. SQL, or MongoDB's aggregation framework.
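For illustration, a minimal pymongo sketch of replacing a small map reduce with an aggregation pipeline (the collection name and fields are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder
events = client["mydb"]["events"]                  # hypothetical events collection

# Count events per type for one user; this runs inside the server without
# the startup overhead of a map-reduce job.
pipeline = [
    {"$match": {"user_id": 42}},
    {"$group": {"_id": "$type", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in events.aggregate(pipeline):
    print(row)
```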
Hadoop is not going to fulfill your requirements. The very first issue is the infrastructure requirement and its administration. The cost of running map-reduce will be higher on Hadoop than on Mongo or other similar technologies if your data is in MBs.
Furthermore, I'd suggest expanding your existing MongoDB infrastructure. Its querying and document-based flexibility (like easy indexing and data retrieval) cannot easily be matched with Hadoop technologies.
Hadoop 'in general' is moving toward lower-latency processing, through projects like Tez for instance. And there are Hadoop-like alternatives, like Spark.
But for event processing, and usually that means Storm, the future may already be here, see Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing (also see the slideshare from Hadoop Summit).
Hadoop is a vast ecosystem. There are huge differences in capabilities between the old (1.0), the new (1.3) and the bleeding edge (2.0 and beyond). Can some of these technologies replace Mongo's own M/R? I certainly think so. Can your problem be split out into many parallel tasks (this is actually not clear to me)? Then somewhere between Spark/YARN/Tez there is a solution that would go faster as you throw more hardware at it.
And of course, for a working set that fits in one host's RAM there will always be an SMP RDBMS that will run circles around clusters...
We are looking at a document DB storage solution with failover clustering, for a read/write-intensive application.
We will be having an average of 40K concurrent writes per second written to the DB (with peaks of up to 70,000) - and may have almost a similar number of reads happening.
We also need a mechanism for the db to notify about the newly written records (some kind of trigger at db level).
What will be a good option in terms of a proper choice of document db and related capacity planning?
Updated
More details on the expectation.
On average, we are expecting 40,000 (40K) inserts (new documents) per second across 3-4 databases/document collections.
The peak may go up to 120,000 (120K) inserts per second.
The inserts should be readable right away - almost in real time.
Along with this, we expect around 5,000 updates or deletes per second.
Along with this, we also expect 500-600 concurrent queries accessing the data. These queries and execution plans are somewhat known, though they might have to be updated, say, once a week or so.
The system should support failover clustering on the storage side
If "20,000 concurrent writes" means inserts, then I would go for CouchDB and use the "_changes" API for triggers. But with 20,000 writes you would need stable sharding as well, so you'd better take a look at BigCouch.
And if the "20,000" concurrent writes consist "mostly" of updates, I would go for MongoDB for sure, since its "update in place" is pretty awesome. You would then have to handle triggers manually, but using another collection whose documents you update in place can be a handy solution. Again, be careful about sharding.
Finally, I don't think you can select a database based on concurrency alone; you need to plan the API (how you will retrieve the data) and then look at the options at hand.
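For illustration, a minimal sketch of consuming CouchDB's _changes feed as a trigger-like notification, using Python's requests library (the URL and database name are placeholders):

```python
import requests

COUCH = "http://localhost:5984"  # placeholder CouchDB URL
DB = "mydb"                      # hypothetical database name

# Long-poll the _changes feed: the request blocks until something is written,
# giving a simple trigger-like notification of new or updated documents.
resp = requests.get(
    f"{COUCH}/{DB}/_changes",
    params={"feed": "longpoll", "since": "now", "include_docs": "true"},
    timeout=60,
)
for change in resp.json().get("results", []):
    print(change["id"], change.get("doc"))
```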
I would recommend MongoDB. My requirements weren't nearly as high as yours, but they were reasonably close. Assuming you'll be using C#, I recommend the official MongoDB C# driver and the InsertBatch method with SafeMode turned on. It will literally write data as fast as your file system can handle. A few caveats:
MongoDB does not support triggers (at least the last time I checked).
MongoDB initially caches data in RAM before syncing to disk. If you need real-time durability, you might want to set the fsync interval lower. This will have a significant performance hit.
The C# driver is a little wonky. I don't know if it's just me, but I get odd errors whenever I try to run any long-running operations with it. The C++ driver is much better and actually faster than the C# driver (or any other driver, for that matter).
That being said, I'd also recommend looking into RavenDB. It supports everything you're looking for, but for the life of me, I couldn't get it to perform anywhere close to Mongo.
The only other database that came close to MongoDB was Riak. Its default Bitcask backend is ridiculously fast as long as you have enough memory to store the keyspace, but as I recall it doesn't support triggers.
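For illustration, a rough modern-pymongo equivalent of batched inserts with acknowledged ("SafeMode"-style) writes (the connection string, collection and documents are placeholders):

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")  # placeholder
db = client["mydb"]

# Acknowledged, journaled writes combined with an unordered batch insert.
coll = db.get_collection("events", write_concern=WriteConcern(w=1, j=True))
docs = [{"seq": i, "payload": "x" * 100} for i in range(10_000)]  # dummy documents
result = coll.insert_many(docs, ordered=False)
print(len(result.inserted_ids))
```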
Membase (and the soon-to-be-released Couchbase Server) will easily handle your needs and provide dynamic scalability (on-the-fly adding or removing of nodes) and replication with failover. The memcached caching layer on top will easily handle 200k ops/sec, and you can scale out linearly with multiple nodes to support getting the data persisted to disk.
We've got some recent benchmarks showing extremely low latency (which roughly equates to high throughput): http://10gigabitethernet.typepad.com/network_stack/2011/09/couchbase-goes-faster-with-openonload.html
Don't know how important it is for you to have a supported Enterprise class product with engineering and QA resources behind it, but that's available too.
Edit: Forgot to mention that there is a built-in trigger interface already, and we're extending it even further to track when data hits disk (persisted) or is replicated.
Perry
We are looking at a document db storage solution with fail over clustering, for some read/write intensive application
Riak with Google's LevelDB backend [here is an awesome benchmark from Google], given enough cache and solid disks, is very fast. Depending on the structure of the documents and their size (you mentioned 2KB), you would of course need to benchmark it. [Keep in mind that if you are able to shard your data (business-wise), you do not have to sustain 40K/s throughput on a single node.]
Another advantage of LevelDB is data compression => less storage. If storage is not an issue, you can disable compression, in which case LevelDB would literally fly.
Riak with secondary indices allows you to make your data structures as document-oriented as you like => you index only those fields that you care about searching by.
Successful and painless failover is Riak's middle name. It really shines here.
We also need a mechanism for the db to notify about the newly written records (some kind of trigger at db level)
You can rely on pre-commit and post-commit hooks in Riak to achieve that behavior, but, as with any triggers, it comes at a price => performance / maintainability.
The Inserts should be readable right away - almost realtime
Riak writes to disk (no async MongoDB surprises) => reliably readable right away. In case you need better consistency, you can configure Riak's quorum for inserts: e.g. how many nodes should acknowledge the write before the insert is treated as successful.
In general, if fault tolerance / concurrency / failover / scalability are important to you, I would go with data stores written in Erlang, since Erlang has been successfully solving these problems for many years.
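For illustration, a rough sketch of a quorum-controlled insert with the Basho Python client for Riak (the host, bucket and values are placeholders; exact option names may vary between client versions):

```python
import riak  # Basho's Python client

client = riak.RiakClient(host="127.0.0.1", pb_port=8087, protocol="pbc")
bucket = client.bucket("telemetry")  # hypothetical bucket

obj = bucket.new("device-42", data={"battery_pct": 87, "cpu_util": 0.31})
# w=2: at least two replicas must acknowledge the write; dw=1: at least one
# must report a durable write before the insert is treated as successful.
obj.store(w=2, dw=1)
```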