mongodb single node configuration - mongodb

I am going to configure MongoDB on a small number of cloud servers.
I am coming from MySQL, and I remember that if I needed to change settings such as RAM usage, I had to modify the "my.cnf" file. This came in handy when resizing a cloud server.
Now, how can I check or modify how much RAM or disk space the database is going to take on each node?
Thank you in advance.

I don't think MongoDB itself has any built-in, broad-stroke limiting tools or flags, most likely because this is something you should be doing at the operating system level.
Most modern multi-user operating systems have built-in ways to set per-user quotas on disk space and the like, so you could set up a mongo user and place the limits on that user if you really wanted to. MongoDB works best when it has enough memory to hold the working set of data and indexes, and it does a good job of managing that on its own.
However, if you want to get granular, take a look at the output of mongod --help.
I see the following options that you could tweak:
--nssize arg (=16) .ns file size (in MB) for new databases
--quota limits each database to a certain number of files (8 default)
--quotaFiles arg number of files allowed per db, requires --quota
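As a rough illustration, the flags listed above could be assembled into a mongod command line as follows. This is only a sketch: the helper name build_mongod_args and the /data/db path are made up for the example, and the flags are the ones from the help output quoted above, so run mongod --help on your own version to confirm they are still available.

```python
def build_mongod_args(dbpath, nssize_mb=None, quota_files=None):
    """Assemble a mongod argument list from the options quoted above.

    Hypothetical helper: it only builds strings and never starts mongod.
    """
    args = ["mongod", "--dbpath", dbpath]
    if nssize_mb is not None:
        # .ns file size in MB for new databases (16 by default per the help text)
        args += ["--nssize", str(nssize_mb)]
    if quota_files is not None:
        # --quotaFiles requires --quota, so both flags are emitted together
        args += ["--quota", "--quotaFiles", str(quota_files)]
    return args


print(" ".join(build_mongod_args("/data/db", nssize_mb=16, quota_files=4)))
```

The resulting list could be handed to subprocess.run or pasted into a service definition; the values shown are placeholders, not recommendations.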

Related

Postgres auto tuning

As we all know, Postgres performance depends heavily on config parameters. E.g. if I have an SSD drive or more RAM, I need to tell Postgres by changing the relevant config parameter.
I wonder if there is any tool (for Linux) which can suggest the best Postgres configuration for the current hardware?
I'm aware of websites (e.g. pgtune) where I can enter the server spec and get a suggested config.
However, each piece of hardware is different (e.g. I might have a better RAID controller, or some processes that consume more RAM). My wish would be for Postgres to tune itself by analysing query execution times, available resources, etc.
I understand there is no such mechanism, so maybe there is some tool or script I can run which can do this job for me (checking e.g. sequential and random disk read speed, available memory, etc.) and tell me what to change in the config.
There are parameters that you can tweak to get better performance from PostgreSQL.
This article is a good read on that.
There are a few scripts that can do that. One that is mentioned in the Postgres wiki is this one.
To get a better idea of what further tuning your database needs, you need to log its requests and performance; after analysing those logs you can tune more parameters. For this there is the pgbadger log analyzer.
After running the database in production, you get a better idea of what requirements you have and how to approach them, rather than making changes based only on the OS or hardware configuration.
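To give a feel for what hardware-based tuning scripts do, here is a minimal sketch that derives two memory settings from the amount of RAM. The function name is made up, and the ratios (25% of RAM for shared_buffers, 50% for effective_cache_size) are commonly quoted starting points for a dedicated server rather than values from this thread, so treat them as assumptions to be revisited once you have real logs.

```python
def suggest_postgres_memory_settings(total_ram_mb):
    """Return rough starting values for two memory-related settings.

    Assumption: a dedicated database server. The 25% and 50% ratios are
    rule-of-thumb starting points, not measured optima.
    """
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return {
        "shared_buffers": f"{total_ram_mb // 4}MB",
        "effective_cache_size": f"{total_ram_mb // 2}MB",
    }


for name, value in suggest_postgres_memory_settings(16384).items():
    print(f"{name} = {value}")
```

This deliberately ignores disk speed and workload, which is exactly why the log-based analysis mentioned above matters more in the long run.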

GridFS: what it gives us

I'm reading "Seven Databases in Seven Weeks". Could you please explain the text below to me:
One downside of a distributed system can be the lack of a single
coherent filesystem. Say you operate a website where users can upload
images of themselves. If you run several web servers on several
different nodes, you must manually replicate the uploaded image to
each web server’s disk or create some alternative central system.
Mongo handles this scenario by its own distributed filesystem called
GridFS.
Why do you need to manually replicate uploaded images? Do they mean that some of the servers will run Linux and some of them Windows?
Do all replicated data stores tend to implement their own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.
Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application. And this is where GridFS comes to help. It uses the rather easy means of MongoDB to distribute data across many machines, a process which is called sharding. Basically, it works like this: Instead of accessing the local filesystem, you access GridFS (and the contained files within) via the MongoDB database driver.
As for the OS: Usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that you have multiple machines involved. They can perfectly well run the same OS.
Sharding vs. replication
Note: The following is vastly simplified, because going into detail would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Let's first make the two concepts involved a bit clearer.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: Each machine only holds a subset of the data, yet you can access the whole data set (files in this case) transparently, and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file).
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: The data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an odd number of MongoDB instances which all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the non-hidden members per replica set.
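The three layouts can be illustrated with a toy model in plain Python. This is a sketch of the concepts only, not of how MongoDB places data: the function names are made up, and hashing the key with crc32 is an assumption chosen because it is deterministic.

```python
import zlib


def replicate(documents, node_count):
    """Replication (RAID1-like): every node holds a full copy of the data."""
    return [dict(documents) for _ in range(node_count)]


def shard(documents, shard_count):
    """Sharding (RAID0-like): every key lands on exactly one shard."""
    shards = [{} for _ in range(shard_count)]
    for key, value in documents.items():
        # crc32 is used only because it gives the same result on every run
        index = zlib.crc32(key.encode("utf-8")) % shard_count
        shards[index][key] = value
    return shards


def sharded_cluster(documents, shard_count, replication_factor):
    """Mixed form (RAID10-like): each shard is itself a set of replicas."""
    return [replicate(part, replication_factor)
            for part in shard(documents, shard_count)]


files = {"a.png": 1, "b.png": 2, "c.png": 3, "d.png": 4}
cluster = sharded_cluster(files, shard_count=2, replication_factor=3)
for number, replica_set in enumerate(cluster):
    print("shard", number, "holds", sorted(replica_set[0]),
          "on", len(replica_set), "members")
```

Losing one member of a replica set in this model loses no data, while adding a shard spreads the keys over more machines.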
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB 3.2, they form a replica set themselves, so it is not much of a problem if a node fails.
You access a sharded cluster via the mongos query router, of which you usually have one per instance of your application (most often on the same server as your application instance). But: most drivers can be given multiple mongos instances to connect to. So if one of those mongos instances fails, the driver will happily use the next one you configured.
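Giving a driver several mongos instances usually just means listing more than one host in the connection string. A minimal sketch, in which the helper name, hostnames and database name are placeholders:

```python
def build_mongos_uri(hosts, database=""):
    """Join several mongos addresses into one mongodb:// connection string."""
    if not hosts:
        raise ValueError("at least one mongos host is required")
    return "mongodb://" + ",".join(hosts) + "/" + database


uri = build_mongos_uri(
    ["app1.example.net:27017", "app2.example.net:27017"],  # hypothetical hosts
    "uploads",                                             # hypothetical database
)
print(uri)  # → mongodb://app1.example.net:27017,app2.example.net:27017/uploads
```

The resulting string is what you would pass to your driver's client constructor; how failover between the listed hosts behaves depends on the driver.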
Another advantage is that in case you need additional storage or more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is of course that this setup is only feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:
At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data stores tend to implement their own filesystem?
Replicated? Not necessarily. There is DRBD for example, which could be described as "RAID1 over ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case, IMHO, the author was stating that each web server has its own disk storage, not shared with the others. So the upload path could be /home/server/app/uploads, and since it is part of the server's local filesystem it is not shared at all, as a kind of security measure by the service provider. To populate the other servers we would need a script or job which syncs the data to them behind the scenes.
This scenario could be a case to use GridFS with mongo.
How GridFS works:
GridFS divides the file into parts, or chunks, and stores each
chunk as a separate document. By default, GridFS uses a chunk size of
255 kB; that is, GridFS divides a file into chunks of 255 kB with the
exception of the last chunk. The last chunk is only as large as
necessary. Similarly, files that are no larger than the chunk size
only have a final chunk, using only as much space as needed plus some
additional metadata.
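The chunking described in the quote can be sketched in a few lines of plain Python. This is an illustration, not the driver's implementation: split_into_chunks is a made-up name, and the dictionaries only loosely mirror the chunk documents GridFS stores (files_id, n, data).

```python
CHUNK_SIZE = 255 * 1024  # the 255 kB default chunk size quoted above


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Return one dict per chunk; only the last chunk may be shorter."""
    chunks = []
    for n, offset in enumerate(range(0, len(data), chunk_size)):
        chunks.append({
            "files_id": file_id,  # which file the chunk belongs to
            "n": n,               # position of the chunk within the file
            "data": data[offset:offset + chunk_size],
        })
    return chunks


payload = b"x" * (2 * CHUNK_SIZE + 100)
chunks = split_into_chunks("avatar.png", payload)
print([len(chunk["data"]) for chunk in chunks])  # → [261120, 261120, 100]
```

Reading the file back is the reverse: fetch the chunks for one files_id, order them by n and concatenate the data.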
In reply to comment:
BSON is a binary format, and MongoDB has a special mechanism for replicating collection data (GridFS is a special set of 2 collections). It uses the oplog to send diffs to the other servers in the replica set. More here
Any comments welcome!

Setting up MongoDB to work with two data directories

I have a MongoDB machine with a 1 TB drive (on AWS that is the limit).
However, I need to store more than 1 TB of data on this MongoDB setup, although it is not heavy on reads or writes.
Is there a way to split the data directory across two mounts, i.e. two different directories (instead of using LVM)?
You can use the directoryperdb configuration option so that each database's files are stored in a separate subdirectory, and then use different mount points depending on the volume you want to use.
This option can be helpful in provisioning additional storage or different storage configurations depending on the database usage (i.e. SSD or PIOPS for a database which is more I/O intensive, while using normal EBS storage for archival data).
Important caveats:
There is a highlighted note in the documentation for directoryperdb which shows how to migrate the files for an existing deployment.
If you separate your data files onto multiple volumes, this may change your Backup Strategy. In particular, if you are using filesystem/EC2 snapshots to get a consistent backup of a running mongod you can no longer do so if the data and/or journal files are on different volumes.
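Whether your data and journal directories still share a volume can be checked by comparing device IDs. The sketch below assumes a POSIX-like system where os.stat().st_dev identifies the underlying device; the helper name is made up, and the temporary directory merely stands in for a real dbpath.

```python
import os
import tempfile


def on_same_volume(paths):
    """True if every path reports the same device ID via os.stat()."""
    if not paths:
        raise ValueError("at least one path is required")
    devices = {os.stat(path).st_dev for path in paths}
    return len(devices) == 1


# Stand-in for a real dbpath with a journal subdirectory.
with tempfile.TemporaryDirectory() as dbpath:
    journal = os.path.join(dbpath, "journal")
    os.mkdir(journal)
    print(on_same_volume([dbpath, journal]))
```

If this returns False for your dbpath, its journal and the per-database directories, a single-volume snapshot is no longer enough for a consistent backup.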
I haven't solved anything like your problem. However, I've read about the sharding technique, which is defined as:
Sharding is a method for storing data across multiple machines.
MongoDB uses sharding to support deployments with very large data sets
and high throughput operations.
It sounds promising for your problem. Want to try it? Sharding MongoDB
(sorry, not enough rep for commenting)

MongoDB: Can different databases be placed on separate drives?

I am working on an application in which there is a pretty dramatic difference in usage patterns between "hot" data and other data. We have selected MongoDB as our data repository, and in most ways it seems to be a fantastic match for the kind of application we're building.
Here's the problem. There will be a central document repository, which must be searched and accessed fairly often: its size is about 2 GB now, and it will grow to 4 GB in the next couple of years. To increase performance, we will be placing that DB on a server-class mirrored SSD array, and given the total size of the data, we don't imagine that memory will become a problem.
The system will also be keeping record versions, audit trails, customer interactions, notification records, and the like, which will be referenced only rarely and which could grow quite large. We would like to place this data on more traditional spinning disks, as it would be accessed rarely (we're guessing that a typical record might be accessed four or five times per year, and will be needed only to satisfy research and customer service inquiries).
I haven't found any reference material that indicates whether MongoDB would allow us to place different databases on different disks (we're running mongod under Windows, but that doesn't have to be the case when we go into production).
Sorry about all the detail here, but these are primary factors we have to think about as we plan for deployment. Given Mongo's proclivity to grab all available memory, and that it'll be running on a machine that maxes out at 24GB memory, we're trying to work out the best production configuration for our database(s).
So here are what our options seem to be:
Single instance of Mongo with multiple databases This seems to have the advantage of simplicity, but I still haven't found any definitive answer on how to split databases to different physical drives on the machine.
Two instances of Mongo, one for the "hot" data, and the other for the archival stuff. I'm not sure how well Mongo will handle two instances of mongod contending for resources, but we were thinking that, since the 32-bit version of the server is limited to 2GB of memory, we could use that for the archival stuff without having it overwhelm the resources of the machine. For the "hot" data, we could then easily configure a 64-bit instance of the database engine to use an SSD array, and given the relatively small size of our data, the whole DB and indexes could be directly memory mapped without page faults.
Two instances of Mongo in two separate virtual machines We could use VMware, or something similar, to create two Linux machines which could host Mongo separately. While it might up the administrative burden a bit, this seems to me to provide the most fine-grained control of system resource usage, while still leaving the Windows Server host enough memory to run IIS and its own processes.
But all this is speculation, as none of us have ever done significant MongoDB deployments before, so we don't have a great experience base to draw upon.
My actual question is whether there are options to have two databases in the same mongod server instance utilize entirely separate drives. But any insight into the advantages and drawbacks of our three identified deployment options would be welcome as well.
That's actually a pretty easy thing to do when using Linux:
Activate the directoryPerDB config option
Create the databases you need.
Shut down the instance.
Copy over the data from the individual database directories to the different block devices (disks, RAID arrays, Logical volumes, iSCSI targets and alike).
Mount the respective block devices at their corresponding positions beneath the dbpath directory (don't forget to add the corresponding lines to /etc/fstab!)
Restart mongod.
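For the mount step above, the /etc/fstab entries follow directly from the dbpath and the database names, since each database gets a subdirectory of that name. A small sketch that generates them; the device names, database names, filesystem type and mount options are placeholders for illustration, not taken from the question.

```python
import posixpath


def fstab_lines(dbpath, device_by_database, fstype="xfs", options="defaults"):
    """Build one fstab line per database directory beneath dbpath."""
    lines = []
    for database, device in sorted(device_by_database.items()):
        mount_point = posixpath.join(dbpath, database)
        # Fields: device, mount point, filesystem type, options, dump, fsck pass
        lines.append(f"{device} {mount_point} {fstype} {options} 0 0")
    return lines


devices = {"hot": "/dev/sdb1", "archive": "/dev/sdc1"}  # hypothetical layout
for line in fstab_lines("/var/lib/mongodb", devices):
    print(line)
```

Review the generated lines before appending them to /etc/fstab, and test with mount -a while mongod is still shut down.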
Edit: As a side note, I would like to add that you should not use Windows as the OS for a production MongoDB. The available filesystems, NTFS and ReFS, perform horribly when compared to ext4 or XFS (the latter being the suggested filesystem for production, see the MongoDB production notes for details). For this reason alone, I would suggest Linux. Another reason is the RAM used by rather unnecessary subsystems of Windows, like the GUI.

mongo db --smallfiles switch drawbacks

I want to use MongoDB for my new project. The problem is that MongoDB uses preallocated files:
Each datafile is preallocated to a particular size. (This is done to prevent file system fragmentation, among other reasons.) The first filename for a database is .0, then .1, etc. .0 will be 64MB, .1 128MB, et cetera, up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB. Thus, if the last datafile present is, say, 1GB, that file might be 90% empty if it was recently created.
From here: http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
And it's normal to have many 2GB files with nothing in them. There is a --smallfiles switch to limit these files to 512MB:
--smallfiles => Use a smaller initial file size (16MB) and maximum size (512MB)
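To see what the switch means for disk usage, here is a small sketch that lists the preallocated file sizes under both settings. It assumes the doubling pattern from the quoted documentation also applies with --smallfiles (starting at 16MB and capped at 512MB); the function name is made up.

```python
def datafile_sizes_mb(file_count, smallfiles=False):
    """Sizes (in MB) of the first file_count preallocated datafiles."""
    size, cap = (16, 512) if smallfiles else (64, 2048)
    sizes = []
    for _ in range(file_count):
        sizes.append(size)
        size = min(size * 2, cap)  # each file doubles until the cap is reached
    return sizes


print(datafile_sizes_mb(7))                   # → [64, 128, 256, 512, 1024, 2048, 2048]
print(datafile_sizes_mb(7, smallfiles=True))  # → [16, 32, 64, 128, 256, 512, 512]
```

Summing the lists shows how much less disk space is reserved with --smallfiles, and the difference is multiplied by the number of databases you create.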
I want to know whether using smallfiles is a good idea in production, and what its drawbacks are.
There is a noprealloc switch, but it's not good in production. There is no such note about smallfiles, though.
You would usually only use smallfiles if you are creating a whole bunch of databases; if you're only operating out of a few databases, it doesn't save you enough to bother with.
We haven't seen any performance problems with it for customers that have many, many databases (and actually benefit from small files). Their activity level is normally somewhat low compared to other installs, though. Based on what Mongo is doing, it might be slightly slower for some operations, but I don't think you'll ever notice.
Additionally, if you are running in the AWS cloud and using the m3.small instances with SSDs, you are limited to 4GB of storage. Setting this option allows you to have a small SSD-backed MongoDB node, which could be sufficient for small tasks.