Is this a normal size for MongoDB?

I just imported my MySQL database (2.8 MB in size) into my new MongoDB database with a very simple PHP script I built. The import finished without errors, but when I look at my MongoDB database (with RockMongo) I see this: Data Size 8.01m, Storage Size 13.7m.
MongoDB is bigger than MySQL for the same amount of data. Is this normal?
Thanks for your help, and sorry for my English.

Yes, it's normal for the "same" data to take up more space in MongoDB. There are a few things you need to take into account:
1) The _id that's stored for each document (unless you specify your own value for it) is 12 bytes per doc.
2) You're storing the key for each key-value pair in every document, whereas in MySQL the column name is not stored for every single row, so you have that extra overhead in your MongoDB documents too. One way to reduce this is to use shortened key names ("column names") in your docs (see the sketch below).
3) MongoDB automatically adds padding to allow documents to grow.
In similar tests, loading data from SQL Server into MongoDB with shortened 2-character document key names instead of the full names from SQL Server, I see about 25-30% extra space being used in MongoDB.
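
To illustrate point 2, here is a minimal mongo shell sketch; the collection and field names are made up for the example, not taken from the question:

// Hypothetical document with full key names: every key name is stored
// in every document.
db.customers.insertOne({
  customerFirstName: "Ada",
  customerLastName: "Lovelace",
  customerEmailAddress: "ada@example.com"
})

// The same data with shortened key names: less per-document overhead,
// at the cost of readability.
db.customers.insertOne({
  fn: "Ada",
  ln: "Lovelace",
  em: "ada@example.com"
})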

Where should I use sharding in MongoDB, or run multiple instances of MongoDB?

Issue
I have at least 10 text files (CSV), each about 5 GB in size. There is no issue when I import the first text file, but when I start importing the second text file I hit the maximum size limit (16 MB) error.
My primary purpose for using the database is searching for customers using a customer_id index.
Given below are the details of one CSV file.
Collection Name | Documents | Avg. Document Size | Total Document Size | Num. Indexes | Total Index Size | Properties
Customers       | 8,874,412 | 1.8 KB             | 15.7 GB             | 3            | 262.0 MB         |
To overcome this, the MongoDB community recommended GridFS, but the problem with GridFS is that the data is stored as bytes and it is not possible to query on a specific index in the text file.
I don't know if it is possible to query on a specific index in a text file when using GridFS. If someone knows, any help is appreciated.
The other solution I thought about was creating multiple instances of MongoDB running on different ports. Is this method feasible?
But a lot of the tutorials on multiple instances show how to create a replica set, thereby storing the same data on the PRIMARY and the SECONDARY.
The SECONDARY instances don't allow writes and only allow reads.
Is it possible to create multiple instances of MongoDB without creating a replica set, with both read and write operations on them? If yes, how? Can this method overcome the 16 MB limit?
The second solution I thought about was creating shards of the collections, or simply sharding. Can this method overcome the 16 MB limit? If yes, any help regarding this is appreciated.
Of the two solutions, which is more efficient for searching data (in terms of speed)? As I mentioned earlier, I just want to search for customers in this database.
The error message shows exactly where the problem is: entry #8437: line 13530, column 627.
Have a look at the file and correct it there.
The error extraneous " in field ... is quite clear: in your CSV file you have an opening quote " that is never closed, i.e. the rest of the entire file is considered one single field.
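If the file is too large to open comfortably in an editor, a small Node.js sketch like the one below can print the reported line and count its double quotes, so the unbalanced quote is easy to spot. The file name and line number are placeholders taken from the error message, and this assumes a plain newline-delimited file:

// Sketch only: stream the CSV and print the line reported by the import error.
// An odd quote count on that line points at the unbalanced quote.
const fs = require("fs");
const readline = require("readline");

const reported = 13530;  // line number from the error message
let current = 0;

const rl = readline.createInterface({ input: fs.createReadStream("customers.csv") });
rl.on("line", (line) => {
  current += 1;
  if (current === reported) {
    console.log(line);
    console.log("quote count:", (line.match(/"/g) || []).length);  // odd => unbalanced
    rl.close();
  }
});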

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go to an extreme, let's say each object has an authorName field, a bookTitles field, and a bookFullText field (where the content of all their novels has been collected).
If there were no index and you searched for a list of authorNames, would the server have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields (the names, but not the content of the other fields)?
Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, it will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to next document in collection).
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in case of MongoDB, table rows in a RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect it is set so that the server can read documents into memory completely without worrying that they might be very large. Postgres, for example, distinguishes VARCHAR from TEXT types by where the data is stored: VARCHAR data is stored inline in the table row and TEXT data is stored separately, presumably to avoid this exact issue of having to read it from disk if any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.
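
One thing that is easy to verify, regardless of how the bytes are read, is that such a query is answered with a full collection scan. A quick mongo shell sketch, using the hypothetical collection and field names from the question:

// With no index on authorName, the winning plan is a COLLSCAN: every document
// in the collection is examined.
db.books.find({ authorName: "Herman Melville" }).explain("executionStats")
// queryPlanner.winningPlan.stage is "COLLSCAN" and
// executionStats.totalDocsExamined equals the total number of documents.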
BSON documents are, in the common case (WiredTiger, snappy compressed), kept in 32 KB blocks within 64 MB (default size) chunks on storage. If your document's compressed size is 48 KB, two 32 KB blocks must be loaded into memory, uncompressed, and searched for your non-indexed field, which is an expensive operation. Moreover, if you search multiple documents, they are usually not written in sequential blocks, which increases the demand for IOPS on your backend storage. This is why it is best to do some initial analysis and create indexes on the fields you will search most often. Indexes (B-tree) are very effective, since they are kept in memory most of the time, compressed (prefix compression), and are very fast for field search.
There are text indexes in MongoDB that are enough for some simple text searches, or you can use regex expressions.
If you will be doing full-text search most of the time, you are better off putting a search engine like Elasticsearch, which supports inverted indexes, in front of the database, since the inverted indexes have your full-text results already calculated and can return results much faster than a similar operation using standard B-tree indexes.
If you use Atlas (the MongoDB cloud service), there is already a Lucene engine (inverted index) integrated that can do the full-text search for you.
I hope my answer throws some light on the subject ... :)
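
For completeness, a minimal mongo shell sketch of the text-index option mentioned above, using the hypothetical bookFullText field from the question:

// Create a text index and run a $text search; for heavier full-text workloads
// an inverted-index engine (Elasticsearch, Atlas Search) is usually a better fit.
db.books.createIndex({ bookFullText: "text" })
db.books.find({ $text: { $search: "white whale" } })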

How to store lookup values in MongoDB?

I have a collection in my DB which represents media files.
Among other info, I should store the format name. I wonder if there are best practices for storing info like that. Is it better to create a new collection for file formats and link to that collection, or to store the format name right in the file documents as plain text? What about performance and compression? There are supposed to be more than a billion documents in the DB. What would Mongo experts suggest in this situation?
Embedded documents are the preferred approach.
In your case, it means it is better to store the file format in the same collection.
Putting the file format into a separate collection means creating a new file on disk.
It is a slower option and should be used only if your documents (any of them) exceed 16 MB in size.
See these links for more information
6 Rules of Thumb for MongoDB Schema Design
and
How to Program with MongoDB Using the .NET Driver
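
A minimal mongo shell sketch of the two options; collection and field names are illustrative, not taken from the question:

// Option 1 (recommended above): embed the format name directly in each document.
db.mediafiles.insertOne({ name: "clip001.mkv", format: "mkv", sizeBytes: 1048576 })

// Option 2: keep formats in a separate collection and join at read time.
db.formats.insertOne({ _id: 1, name: "mkv" })
db.mediafiles.insertOne({ name: "clip001.mkv", formatId: 1, sizeBytes: 1048576 })
db.mediafiles.aggregate([
  { $lookup: { from: "formats", localField: "formatId", foreignField: "_id", as: "format" } }
])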
I've done some benchmarks and figured out that, in my case, storing "lookup values" as plain text is more efficient in terms of disk space than either an embedded document or a reference to a separate collection. Sorry for the poor terminology.

Why does 24 MB of CSV data become 230 MB in a MongoDB collection?

My Meteor app takes a CSV file, parses it with Baby Parse (Papa Parse for the server), and inserts the data into a MongoDB collection.
Each CSV row is inserted as a document. The 24 MB CSV file contains ~900,000 rows; hence, ~900,000 documents in the collection. Each document has 5 fields, including the document's unique id.
When I use dataSize() to get collection size, I receive the number 230172976; if I'm not mistaken, this number is in bytes; therefore it is 230 MB.
Why is this gigantic increase happening? How can I fix this?
This is because the value returned by .dataSize() includes the record padding. Also note that if your documents don't have an _id field, one will be added, and each _id value is 12 bytes. You may want to read Record Allocation Strategies.
How can I fix this:
Use the collMod command with the noPadding flag, or the db.createCollection() method with the noPadding option (see the sketch below). But you shouldn't do that, because, as mentioned in the documentation:
Only set noPadding to true for collections whose workloads have no update operations that cause documents to grow, such as for collections with workloads that are insert-only.
As Pete Garafano mentioned in the comment below, this is applicable to the MMAPv1 storage engine only, which is the default storage engine in MongoDB 3.0 and all previous versions.
MongoDB 3.2 uses the WiredTiger storage engine by default, and you would need to change the storage engine (in your configuration file or with the --storageEngine option) in order to use that option.
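
For reference, a mongo shell sketch of the two noPadding variants mentioned above; this only makes sense on MMAPv1, and "myCollection" is a placeholder name:

// Disable padding on an existing, insert-only collection (MMAPv1 only).
db.runCommand({ collMod: "myCollection", noPadding: true })

// Or disable it at creation time.
db.createCollection("myCollection", { noPadding: true })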

Exceded maximum insert size of 16,000,000 bytes + mongoDB + ruby

I have an application where I'm using MongoDB as a database for storing records; the Ruby wrapper for MongoDB I'm using is Mongoid.
Now, everything was working fine until I hit the above error:
Exceded maximum insert size of 16,000,000 bytes
Can anyone pinpoint how to get rid of this error?
I'm running a MongoDB server which does not have a configuration file (no configuration file was provided with the MongoDB source files).
Can anyone help?
You have hit the maximum size limit of a single document in MongoDB.
If you save large data files in MongoDB, use GridFS instead (see the sketch below).
If your document has too many subdocuments, consider splitting it up and using relations instead of nesting.
The 16 MB limit on data per document is a very well-known limitation.
Use GridFS for storing arbitrary binary data of arbitrary size, plus metadata.
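
The question uses Ruby/Mongoid, but the GridFS idea is driver-agnostic; below is a minimal sketch using the Node.js driver's GridFSBucket. The connection string, database name, and file name are placeholders:

// Sketch: store a file larger than 16 MB via GridFS instead of a single document.
const { MongoClient, GridFSBucket } = require("mongodb");
const fs = require("fs");

async function uploadLargeFile() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const bucket = new GridFSBucket(client.db("mydb"));

  // Stream the file into GridFS; it is split into chunks (255 KB by default).
  await new Promise((resolve, reject) => {
    fs.createReadStream("large-record.bin")
      .pipe(bucket.openUploadStream("large-record.bin"))
      .on("finish", resolve)
      .on("error", reject);
  });

  await client.close();
}

uploadLargeFile().catch(console.error);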