Where should I use sharding in mongodb or run multiple instance of mongodb? - mongodb

Issue
I have at least 10 text files(CSV), each reaches to 5GB in size. There is no issue when I import the first text file. But when I start importing the second text file it shows the Maximum Size Limit (16MB).
My primary purpose for using the database is for searching the customers from the database using customer_id index.
Given Below is the details of One CSV File.
Collection Name|Documents|Avg.Document Size|Total Document Size|Num.Indexes| Total Index Size|Properties
Customers|8,874,412|1.8 KB|15.7 GB|3|262.0 MB
To overcome this MongoDB community were recommending GridFS, but the problem with GridFS is that the data is stored in bytes and its not possible to query for a specific index in the textfile.
I don't know if its possible to query for a specific index in a textfile when using GridFS. If some one knows any help is appreciated.
Then the other solution I thought about was creating multiple instance of MonogDB running in different ports to solve the issue. Is this method feasible?
But lot of the tutorial on multiple instance shows how to cerate a replica set. There by storing the same data in the PRIMARY and the SECONDARY.
The SECONDARY instances don't allow to write and only allows to read data.
Is it possible to create multiple instance of MongoDB without creating replica set and with write and read operations on them? If Yes How? Can this method overcome the 16MB limit.
Second Solution I thought about was creating shards of the collections or simply sharding. Can this method overcome the 16MB limit. If yes any help regarding this.
Of the two solutions which is more efficient for searching for data (in terms of speed). As I mentioned earlier I just want to search of customers from this database.

The error message shows exactly where the problem is: entry #8437: line 13530, column 627
Have a look at the file and correct it in the file.
The error extraneous " in field ... is quite clear. In your CSV file you have an opening quote " but it is not closed, i.e. the rest of entire file is considered as one single field.

Related

How to store lookup values in MongoDB?

I have a collection in db which represents mediafiles.
And among other info I shoud store format name. I wonder if there best practices to store info like that. Is it better to create new collection for file formats and use link to that collection or to store format name right in file documents as a plain text? What about perfomance and compression? It supposed to be more than a billion documents in db. What would mongo expers suggest in this situation?
Embedded documents are the preferred approach.
In your case, it means it is better to store file format in the same collection.
Putting the file format into the separate collection means creating a new file on the disk.
It is a slower option and should be used if your document ( any of them ) exceeds 16 MB in size.
See these links for more information
6 Rules of Thumb for MongoDB Schema Design
and
How to Program with MongoDB Using the .NET Driver
I've done some benchmarks and figured out that in my case storing "lookup values" as plaintext is more efficient in terms of disk space than embedded document and than reference to outstanding collection. Sorry for poor terminology.

how does mongodb do a 42T drive per node

We had heard mongodb had one client with 42T per node and I am wondering more about this. I know cassandra has Bloomfilters that skipp hitting disk to find out which file a row might be in.
Does mongodb have something similar to bloomfilters?
IS mongodb using something similar to SSTables?
I did read mongodb does compaction just like cassandra, I would think this would be an awfully long process with a 42T node????
I guess I don't know what terms to search for as I research mongodb here(in cassandra they are called SSTables).
thanks,
Dean
MongoDB does not support online compaction. In fact, data fragmentation is a current problem in systems with many doc updates. To prevent data fragmentation MongoDB tries to calculate an automated padding factor, minimizing the number of data moves.
The compact command blocks the entire database until it finished. Besides, MongoDB does not support dictionary compression, so field names takes space on every object stored. I guess the layout used by MongoDB is not any fancy data structure. It's simply composed of header (offset, length...), bson data and padding factor.
Since MongoDB is not a key/value or columnar database it doesn't use SSTables (efficient data structure for columnar layout). Every file created for the database is named "extent".
AFAIK, MongoDB doesn't use bloom filters.

Mongoid create vs collection.insert

I'm not sure how to put this. Well, recently I worked on a rails project with mongoid, and I had the task of inserting multiple records in Mongodb.
Say insert multiple records of PartPriceRecord in the database. After googling this I came across the collection.insert commands:
PartPriceRecord.collection.insert(multiple_part_price_records)
But on inserting large number of records, MongoDb always seemed to prompt me with error message:
Exceded maximum insert size of 16,000,000 bytes
Googling around I found that the the upper limit for MongoDb for a single document, but surprisingly when I changed my above query to this:
multiple_part_price_records.each do|mppr|
PartPriceRecord.create(mppr)
end
the above errors do not seem to appear any more.
Can anyone explain in depth under the hood what is exactly is the difference between the two?
Thanks
The maximum size for a single, bulk insert is 16M bytes. That's what you're trying to do in your first example.
In your second example, you're inserting each document individually. Therefore, each insert is under the max limit for an insert.
#Kyle explained the difference in his answer quite succinctly (+1'd), but as for a solution to your problem you may want to look at doing batch inserts:
BATCH_SIZE = 200
multiple_part_price_records.each_slice(BATCH_SIZE) do |batch|
PartPriceRecord.collection.insert(batch)
end
This will slice the records into batches of 200 (or whatever size is best for your situation) and insert them within that limit. This will be a lot faster than running save on each one individually which would be sending far more requests to the database.
A few quick things to note about collection.insert:
It does not run validations on your models prior to insertion, you may want to check this prior to insert
It is required to be in a document format unlike save which requires it be a model. You can easily convert to a document by calling as_document on the model.

how to store User logs on a large server best practice

From my experience this is what i come up with.
Im currently saving Users and Statistic classes into the MongoDb and everything works grate.
But how do i save the log each user generate?
Was thinking to use the LogBack SiftingAppender and delegate the log information
to separate MongoDb Collections. Like every MongoDb Collection have the id of the user.
That way i don't have to create advanced mapreduce query's since logs are neatly stacked.
Or use SiftingAppender with a FileAppender so each user have a separate log file.
I there is a problem with this if the MongoDB have one million Log Collections each one named with the User Id. (is it even possible btw)
If everything is stored in the MongoDb the MongoDb master-slave replication makes
it easy if a master node dies.
What about the FileAppender approach. Feels like there will be a hole lot of log files
to administrate. One could maybe save them in folders according to Alphabet. Folder A
for user/id with names/id starting with A.
What are other options to make this work?
On your qn of 1M collections, the default namespace file for a db is 16MB which allows about 24000 namespaces (12000 collections + their _id indexes). More info on this website
And you can set maximum .ns (namespace) file size to 2GB with --nssize option, which will allow probably 3072000 namespaces.
Make use of Embedded Documents and have one Document for each user with an array of embedded documents containing log files. You can also benefit from sharding if collections get large.

Maximum number of databases supported by MongoDB

I would like to create a database for each customer. But before, I would like to know how many databases can be created in a single instance of MongoDB ?
There's no explicit limit, but there are probably some implicit limits
due to max number of open file handles / files in a directory on the
host OS/filesystem.
see: http://groups.google.com/group/mongodb-user/browse_thread/thread/01727e1af681985a?fwc=2
By default, you can run some 12,000 collections in a single instance of MongoDB( that is, if each collection also has 1 index).
If you want to create more number of collections, then use --nssize when you run mongod process. You can see this link for more details:
http://www.mongodb.org/display/DOCS/Using+a+Large+Number+of+Collections