Maximum number of databases supported by MongoDB

I would like to create a database for each customer. But before doing that, I would like to know how many databases can be created in a single MongoDB instance.

There's no explicit limit, but there are probably some implicit limits
due to max number of open file handles / files in a directory on the
host OS/filesystem.
see: http://groups.google.com/group/mongodb-user/browse_thread/thread/01727e1af681985a?fwc=2

By default, you can have roughly 12,000 collections in a single instance of MongoDB (that is, if each collection also has one index).
If you want to create more collections than that, use the --nssize option when you run the mongod process. See this link for more details:
http://www.mongodb.org/display/DOCS/Using+a+Large+Number+of+Collections
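As a rough sketch (this only applies to the legacy MMAPv1 storage engine, which uses namespace files; the dbpath and size below are just placeholders), you would start mongod with a larger namespace file like this:
mongod --dbpath /data/db --nssize 64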

Related

Where should I use sharding in MongoDB or run multiple instances of MongoDB?

Issue
I have at least 10 text files (CSV), each of which reaches 5 GB in size. There is no issue when I import the first text file, but when I start importing the second text file it shows the maximum size limit error (16 MB).
My primary purpose for using the database is to search for customers using a customer_id index.
Given below are the details of one CSV file.
Collection Name | Documents | Avg. Document Size | Total Document Size | Num. Indexes | Total Index Size | Properties
Customers | 8,874,412 | 1.8 KB | 15.7 GB | 3 | 262.0 MB |
To overcome this, the MongoDB community recommended GridFS, but the problem with GridFS is that the data is stored as binary chunks, so it is not possible to query for a specific index in the text file.
I don't know if it is possible to query for a specific index in a text file when using GridFS. If someone knows, any help is appreciated.
The other solution I thought about was creating multiple instances of MongoDB running on different ports to solve the issue. Is this method feasible?
But most of the tutorials on multiple instances show how to create a replica set, thereby storing the same data on the PRIMARY and the SECONDARY.
The SECONDARY instances don't allow writes and only allow reads.
Is it possible to create multiple instances of MongoDB without creating a replica set, with both read and write operations on them? If yes, how? Can this method overcome the 16 MB limit?
The second solution I thought about was creating shards of the collections, or simply sharding. Can this method overcome the 16 MB limit? If yes, any help regarding this is appreciated.
Of the two solutions, which is more efficient for searching for data (in terms of speed)? As I mentioned earlier, I just want to search for customers in this database.
The error message shows exactly where the problem is: entry #8437: line 13530, column 627.
Have a look at the file and correct it there.
The error extraneous " in field ... is quite clear: in your CSV file you have an opening quote " that is never closed, i.e. the rest of the entire file is considered one single field.
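For example (assuming the file is called customers.csv; the name and line range are just placeholders), you can print the lines around the reported position to spot the stray quote:
sed -n '13525,13535p' customers.csv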

Intercept or filter out oplog transactions from MongoDB

There is a MongoDB database which has interesting data I want to examine. Unfortunately, due to size concerns, the database is purged of "old" records once every 48 hours.
I created a replica set with a secondary database system that has priority 0 and votes 0, so as not to interfere with the main database's performance. This works great, as I can query the secondary and get my data. However, there are many occasions when my system cannot process all the records in time, and it will lose some old records if I do not get to them within 48 hours.
Is there a way where I can cache the oplog on another system which I can then process at my leisure, possibly filtering out the deletes until I am ready?
I considered the slaveDelay parameter, but that will affect all transactions. I also looked into Tungsten Replicator as a solution so I could essentially cache the oplogs; however, it does not support MongoDB as a data source.
Is the oplog stored in plain text on the secondary, such that I can read it and extract what I want from it?
Any pointers to this would be helpful; unfortunately, I could not find much documentation on the oplog on the MongoDB website.
The MongoDB oplog is stored as a capped collection called oplog.rs in your local database:
use local
db.oplog.rs.find()
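If you only want certain operations (for instance, to skip the deletes the question mentions), you can filter on the op field. A rough sketch, where lastTs is a placeholder for the timestamp your processing last reached and "d" is the op code for deletes:
use local
var lastTs = Timestamp(1700000000, 1)  // placeholder: wherever your processing left off
db.oplog.rs.find({ op: { $ne: "d" }, ts: { $gt: lastTs } })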
If you want to keep older data in the oplog for later use, you can try to increase the size of that collection. See http://docs.mongodb.org/manual/tutorial/change-oplog-size/
Alternatively, you can recreate oplog.rs as an uncapped collection (though this is not recommended, since you will have to manually clean up the oplog). Follow the same steps as for increasing the size above, but when recreating the oplog, use this command:
db.runCommand( { create: "oplog.rs", capped: false})
Another solution is to create a cron job that runs the following command to dump the oplog into a folder named YYYYMMDD:
mongodump --db local --collection oplog.rs -o $(date +%Y%m%d)
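For example, a crontab entry to dump the oplog every six hours might look like this (a sketch only; the schedule and target directory are placeholders, and note that % has to be escaped in crontab):
0 */6 * * * mongodump --db local --collection oplog.rs -o /backups/oplog/$(date +\%Y\%m\%d)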
Hope that helps.
I wonder why you would do that manually. The "canonical" way to do it is to identify either the lifetime or the expiration date of a record. If it is a lifetime, you'd do something like
db.collection.insert({'foo':'bar' [...], created: ISODate("2014-10-06T09:00:05Z")})
and
db.collection.ensureIndex({'created':1},{expireAfterSeconds:172800})
By doing so, a thread called TTLMonitor will wake up every minute and remove all documents whose created field is older than two days.
If you have a fixed expiration date for each document, you'd basically do the same:
db.collection.insert({'foo':'bar' [...], expirationDate: ISODate("2100-01-01T00:00:00Z")})
and
db.collection.ensureIndex({expirationDate:1},{expireAfterSeconds:0})
This will purge the documents in the first run of TTLMonitor after the expirationDate.
You could adjust expireAfterSeconds to a value that safely allows you to process the records before they are purged, keeping the overall size at an acceptable level and making sure the records are removed even if your application goes down during its own purging work. (Not to mention that you don't need to maintain the purging logic yourself.)
That being said, and in the hope it might be useful to you, I think your problem is a conceptual one.
You have a scaling problem. Your system is unable to deal with peaks, hence it occasionally cannot process all the data in time. Instead of fiddling with the internals of MongoDB (which might be quite dangerous, as @chianh correctly pointed out), you should rather scale accordingly by identifying your bottleneck and scaling it to match your peaks.

MongoDB - How to fix high CPU usage

I'm a MongoDB / PyMongo newbie. I've noticed that MongoDB uses a lot of CPU (in htop) and narrowed it down to one line of code.
db[collection].update({}, {'$inc':large_python_dictionary})
Although there is only ONE document in this particular database, a few thousand fields are being updated by this one line, as specified in large_python_dictionary.
Is there something I can do to fix this or should I be looking into a different schema / different type of database?

How can I discover a mongo database's structure

I have a Mongo database that I did not create or architect. Is there a good way to introspect the db or print out its structure, to start getting a handle on what types of data are being stored, how the data types are nested, etc.?
Just query the database by running the following commands in the mongo shell:
use mydb //this switches to the database you want to query
show collections //this command will list all collections in the database
db.collectionName.find().pretty() //this will show the documents in that collection in a readable format; repeat for each collection in the database
You should then be able to examine the document structure.
There is actually a tool to help you out here called Variety:
http://blog.mongodb.org/post/21923016898/meet-variety-a-schema-analyzer-for-mongodb
You can view the Github repo for it here: https://github.com/variety/variety
I should probably warn you that:
It uses map-reduce (MR) to accomplish its tasks
It uses certain other queries that could bring a production set-up to a near halt in terms of performance.
As such, I recommend you run it on a development server or a hidden member of a replica set or something similar.
Depending on the size and depth of your documents, it may take a very long time to work out the rough structure of your database this way, but it will eventually give you one.
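A rough usage sketch (assuming you have downloaded variety.js into the current directory and want to analyze a customers collection in a database named mydb; check the repo's README for the exact invocation):
mongo mydb --eval "var collection = 'customers'" variety.js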
This will print each field name and its type:
var schematodo = db.collection_name.findOne();
for (var key in schematodo) { print(key, typeof schematodo[key]); }
I would recommend limiting the result set rather than issuing an unrestricted find command.
use mydb
db.collectionName.find().limit(10)
var z = db.collectionName.find().limit(10)
Object.keys(z[0])
Object.keys(z[1])
This will help you begin to understand your database structure, or lack thereof.
This is an open-source tool that I created along with a friend: https://pypi.python.org/pypi/mongoschema/
It is a Python library with pretty simple usage. You can try it out (and even contribute).
One option is to use Mongoeye. It is an open-source tool similar to Variety.
The difference is that Mongoeye is a stand-alone program (the Mongo shell is not required) and has more features (histograms, most frequent values, etc.).
https://github.com/mongoeye/mongoeye
A few days ago I found the GUI client MongoDB Compass, which has some nice visualizations. See the product overview. It comes directly from the MongoDB people, and according to their docs:
MongoDB Compass is designed to allow users to easily analyze and understand the contents of their data collections within MongoDB...
You may have been asking about the validation schema. Here's an answer on how to retrieve it:
How to retrieve MongoDb collection validator rules?
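As a quick sketch in the mongo shell (the collection name is just a placeholder; the validator, if one is defined, appears under options.validator in the result):
db.getCollectionInfos({ name: "customers" })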
Use MongoDB Compass, which takes a random sample of 1,000 documents to infer the schema, as explained in its documentation. It could miss something, but it's the only rational option if your database is several GBs in size. The schema can then be exported as JSON.
You can use MongoDB's tool mongodump. When you run it, a dump folder is created in the directory from which you executed mongodump. In that folder there is a subfolder for each database in MongoDB, and inside each of those there is a .bson file and a .metadata.json file for each collection.
This method is the best I know of, as you can also make out the schema of empty collections.
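For example (assuming a database named mydb with a customers collection; both names are placeholders), the output looks roughly like:
mongodump --db mydb -o dump
which produces dump/mydb/customers.bson (the raw documents) and dump/mydb/customers.metadata.json (indexes and collection options).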

How to store user logs on a large server (best practice)

From my experience, this is what I have come up with.
I'm currently saving Users and Statistic classes into MongoDB and everything works great.
But how do I save the logs each user generates?
I was thinking of using the Logback SiftingAppender and delegating the log information to separate MongoDB collections, so that every MongoDB collection carries the id of the user.
That way I don't have to create advanced map-reduce queries, since the logs are neatly stacked.
Or use the SiftingAppender with a FileAppender so each user has a separate log file.
Is there a problem with this if MongoDB has one million log collections, each one named with the user id? (Is that even possible, by the way?)
If everything is stored in MongoDB, the MongoDB master-slave replication makes recovery easy if a master node dies.
What about the FileAppender approach? It feels like there would be a whole lot of log files to administer. One could maybe save them in folders according to the alphabet: folder A for users/ids with names/ids starting with A.
What are other options to make this work?
On your question of 1M collections: the default namespace file for a database is 16 MB, which allows about 24,000 namespaces (roughly 12,000 collections plus their _id indexes).
And you can set the maximum .ns (namespace) file size to 2 GB with the --nssize option, which will allow roughly 3,072,000 namespaces.
Make use of embedded documents and have one document per user with an array of embedded documents containing the log entries. You can also benefit from sharding if collections get large.
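A minimal sketch of that embedded-document approach in the mongo shell (the collection and field names below are placeholders, not anything from the question):
// one document per user; each log entry is pushed onto an embedded array
db.userlogs.update(
    { userId: 42 },
    { $push: { logs: { ts: new Date(), level: "INFO", msg: "logged in" } } },
    { upsert: true }
)
Keep in mind that an ever-growing embedded array will eventually hit the 16 MB document limit, so in practice the array is usually capped or split into one document per user per time period.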