MongoDB / mongoose: split large documents

We are extending an existing Node + MongoDB app. We need to add documents that could become large, but we do not yet know how big they might get.
MongoDB has a 16 MB maximum document size; I am aware there are ways to work around this, but I would rather not.
Has anyone ever seen an automatic document-splitting module? Something that splits a document into partials once it exceeds a certain size?

If you have large CSV data to be stored in MongoDB, then there are two approaches which will both work well in different ways:
1: Save in MongoDB format
This means that your application reads the CSV and writes it to a MongoDB collection one row at a time, so each row is saved as a separate document, perhaps something like this:
{
"filename" : "restaurants.csv",
"version" : "2",
"uploadDate" : ISODate("2017-06-15"),
"name" : "Ace Cafe",
"cuisine" : "British",
etc
},
{
"filename" : "restaurants.csv",
"version" : "2",
"uploadDate" : ISODate("2017-06-15"),
"name" : "Bengal Tiger",
"cuisine" : "Bangladeshi",
etc
}
This takes work on your application's part: it has to render the data into this format and decide how and where to store the metadata (a minimal sketch follows below).
You can index and query the data, field by field and row by row.
You never have to worry about any single document getting too large.
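As a minimal sketch of this first approach with the Node.js driver (the collection name, database name, and the csv-parse package are assumptions, not something from the original question):
// Sketch: store each CSV row as its own MongoDB document.
// Assumes the `mongodb` and `csv-parse` npm packages and a local mongod.
const fs = require('fs');
const { MongoClient } = require('mongodb');
const { parse } = require('csv-parse');

async function importCsv(path) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const rows = client.db('test').collection('restaurantRows');

  const parser = fs.createReadStream(path).pipe(parse({ columns: true }));
  for await (const record of parser) {
    // Each row becomes a separate document, tagged with file-level metadata.
    await rows.insertOne({
      filename: path,
      version: '2',
      uploadDate: new Date(),
      ...record, // e.g. { name: 'Ace Cafe', cuisine: 'British', ... }
    });
  }
  await client.close();
}

importCsv('restaurants.csv').catch(console.error);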
2: Save in CSV format using GridFS
This means that your file is uploaded as an un-analysed blob and automatically divided into chunks (255 kB by default, well under the 16 MB document limit) so it can be stored across multiple MongoDB documents.
This is easy to do, and does not disturb your original CSV structure.
However, the data is opaque to MongoDB: you cannot scan it or read it row by row.
To work with the data, your application has to download the entire file from MongoDB and work on it in memory (the upload side is sketched below).
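For the upload side of this second approach, a rough sketch with the Node.js driver's GridFSBucket could look like this (database and bucket names are assumptions):
// Sketch: store the raw CSV in GridFS and let the driver handle the chunking.
const fs = require('fs');
const { MongoClient, GridFSBucket } = require('mongodb');

async function uploadCsv(path) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const bucket = new GridFSBucket(client.db('test'), { bucketName: 'csvFiles' });

  await new Promise((resolve, reject) => {
    fs.createReadStream(path)
      .pipe(bucket.openUploadStream(path, { metadata: { version: '2' } }))
      .on('error', reject)
      .on('finish', resolve);
  });
  await client.close();
}

uploadCsv('restaurants.csv').catch(console.error);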
Hopefully one of these approaches will suit your needs.

Related

MongoDB workaround for document above 16mb size?

The MongoDB collection I am working on receives sensor data from a cellphone, which is pinged to the server roughly every 2-6 seconds.
The data is huge and the 16 MB limit is crossed after 4-5 hours; there doesn't seem to be any workaround for this.
I have tried searching for it on Stack Overflow and went through various questions, but no one actually shared their hack.
Is there any way... on the DB side maybe, which would distribute the chunks the way it is done for big files via GridFS?
To fix this problem you will need to make some small amendments to your data structure. By the sounds of it, for your documents to exceed the 16 MB limit, you must be embedding your sensor data in an array within a single document.
I would not suggest using GridFS here; I do not believe it is the best solution, and here is why.
There is a technique known as bucketing that you could employ, which essentially splits your sensor readings into separate documents, solving this problem for you.
The way it works is this:
Let's say I have a document with some embedded readings for a particular sensor that looks like this:
{
_id : ObjectId("xxx"),
sensor : "SensorName1",
readings : [
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" }
]
}
With the structure above there is already a major flaw: the readings array can grow without bound and exceed the 16 MB document limit.
So what we can do is change the structure slightly to include a count property:
{
_id : ObjectId("xxx"),
sensor : "SensorName1",
readings : [
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" },
{ date : ISODate("..."), reading : "xxx" }
],
count : 3
}
The idea behind this is, when you $push your reading into your embedded array, you increment ($inc) the count variable for every push that is performed. And when you perform this update (push) operation, you would include a filter on this "count" property, which might look something like this:
{ count : { $lt : 500} }
Then, set your Update Options so that you can set "upsert" to "true":
db.sensorReadings.update(
    { sensor: "SensorName1", count: { $lt: 500 } },
    {
        // $push the new reading and $inc the count
        $push: { readings: ReadingDocumentToPush },
        $inc: { count: 1 }
    },
    { upsert: true }
)
See here for more info on MongoDB update and the upsert option:
MongoDB update documentation
What will happen is this: when the filter condition is not met (i.e. when there is either no existing document for this sensor, or the count is greater than or equal to 500, because you increment it every time an item is pushed), a new document is created and the readings are embedded in that new document. So you will never hit the 16 MB limit if you do this properly.
Now, when querying the database for readings of a particular sensor, you may get back multiple documents for that sensor (instead of just one with all the readings in it); for example, if you have 10,000 readings, you will get 20 documents back, each with 500 readings.
You can then use the aggregation pipeline and $unwind to filter your readings as if they were their own individual documents, as sketched below.
For more information on $unwind see here; it's very useful:
MongoDB Unwind
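As a minimal shell sketch of reading the bucketed data back (the date range is only an illustrative assumption):
db.sensorReadings.aggregate([
    // Only consider buckets for the sensor we care about
    { $match: { sensor: "SensorName1" } },
    // Turn each embedded reading into its own document
    { $unwind: "$readings" },
    // Filter the individual readings, e.g. by date
    { $match: { "readings.date": { $gte: ISODate("2017-01-01") } } },
    // Keep only the fields we need
    { $project: { _id: 0, sensor: 1, date: "$readings.date", reading: "$readings.reading" } }
])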
I hope this helps.
You can handle this type of situation using GridFS in MongoDB.
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
The GridFS documentation contains almost everything you need to implement GridFS. You can follow it.
Since your data arrives as a stream, you can try something like the following:
gs.write(data, callback)
where data is a Buffer or a string, and callback gets two parameters: an error object (if an error occurred) and a result value indicating whether the write was successful. While the GridStore is not closed, every write is appended to the opened GridStore.
You can follow this GitHub page for streaming-related information.
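Note that gs.write above belongs to the legacy GridStore API; recent versions of the Node.js driver expose the same idea through GridFSBucket write streams. A rough sketch (database, bucket, and file names are assumptions):
// Sketch: write incoming payloads to a GridFS file as they arrive.
const { MongoClient, GridFSBucket } = require('mongodb');

async function writeChunks(chunks) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const bucket = new GridFSBucket(client.db('sensors'));

  const upload = bucket.openUploadStream('sensor-log.bin');
  for (const chunk of chunks) {
    upload.write(chunk); // each write is appended until the stream is ended
  }
  await new Promise((resolve, reject) => {
    upload.on('error', reject).on('finish', resolve);
    upload.end(); // finalises the fs.files document
  });
  await client.close();
}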

How to Query GridFS for particular text in MongoDB [duplicate]

I have a blogging system that stores uploaded files in the GridFS system. The problem is, I don't understand how to query it!
I am using Mongoose with Node.js, which doesn't yet support GridFS, so I am using the actual mongodb module for the GridFS operations. There doesn't SEEM to be a way to query the file metadata the way you query documents in a regular collection.
Would it be wise to store the metadata in a document pointing to the GridFS ObjectId, so it can easily be queried?
Any help would be GREATLY appreciated, I'm kinda stuck :/
GridFS works by storing a number of chunks for each file. This way, you can deliver and store very large files without having to store the entire file in RAM. Also, this enables you to store files that are larger than the maximum document size. The recommended chunk size is 256kb.
The file metadata field can be used to store additional file-specific metadata, which can be more efficient than storing the metadata in a separate document. This greatly depends on your exact requirements, but the metadata field, in general, offers a lot of flexibility. Keep in mind that some of the more obvious metadata is already part of the fs.files document, by default:
> db.fs.files.findOne();
{
"_id" : ObjectId("4f9d4172b2ceac15506445e1"),
"filename" : "2e117dc7f5ba434c90be29c767426c29",
"length" : 486912,
"chunkSize" : 262144,
"uploadDate" : ISODate("2011-10-18T09:05:54.851Z"),
"md5" : "4f31970165766913fdece5417f7fa4a8",
"contentType" : "application/pdf"
}
To actually read the file from GridFS you'll have to fetch the file document from fs.files and the chunks from fs.chunks. The most efficient way to do that is to stream this to the client chunk-by-chunk, so you don't have to load the entire file in RAM. The chunks collection has the following structure:
> db.fs.chunks.findOne({}, {"data" :0});
{
"_id" : ObjectId("4e9d4172b2ceac15506445e1"),
"files_id" : ObjectId("4f9d4172b2ceac15506445e1"),
"n" : 0, // this is the 0th chunk of the file
"data" : /* loads of data */
}
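With the Node.js driver you do not have to read fs.chunks by hand; a rough sketch of streaming a stored file to an HTTP response via GridFSBucket (database name and routing are assumptions):
const { MongoClient, GridFSBucket, ObjectId } = require('mongodb');

// Sketch: pipe a GridFS file to the client chunk by chunk, never loading it all into RAM.
async function serveFile(res, fileId) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const bucket = new GridFSBucket(client.db('blog'));

  bucket.openDownloadStream(new ObjectId(fileId))
    .on('error', () => { res.statusCode = 404; res.end(); })
    .on('end', () => client.close())
    .pipe(res);
}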
If you want to use the metadata field of fs.files for your queries, make sure you understand the dot notation, e.g.
> db.fs.files.find({"metadata.OwnerId": new ObjectId("..."),
"metadata.ImageWidth" : 280});
Also make sure your queries can use an index; you can verify this with explain(), as sketched below.
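For example, a minimal shell sketch (the metadata field names follow the query above and are assumptions about your data):
// Index the metadata fields you query on
db.fs.files.createIndex({ "metadata.OwnerId": 1, "metadata.ImageWidth": 1 })

// Confirm the query uses the index (look for an IXSCAN stage in the plan)
db.fs.files.find({ "metadata.OwnerId": ObjectId("..."),
                   "metadata.ImageWidth": 280 }).explain("executionStats")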
As the specification says, you can store whatever you want in the metadata field.
Here's what a document from the files collection looks like:
Required fields
{
"_id" : <unspecified>, // unique ID for this file
"length" : data_number, // size of the file in bytes
"chunkSize" : data_number, // size of each of the chunks. Default is 256k
"uploadDate" : data_date, // date when object first stored
"md5" : data_string // result of running the "filemd5" command on this file's chunks
}
Optional fields
{
"filename" : data_string, // human name for the file
"contentType" : data_string, // valid mime type for the object
"aliases" : data_array of data_string, // optional array of alias strings
"metadata" : data_object, // anything the user wants to store
}
So store anything you want in the metadata and query it normally like you would in MongoDB:
db.fs.files.find({"metadata.some_info" : "sample"});
I know the question doesn't ask about the Java way of querying for metadata, but here it is, assuming you add gender as a metadata field:
// Get your database's GridFS (legacy mongo-java-driver API)
MongoClient mongoClient = new MongoClient();
DB db = mongoClient.getDB("myDatabase");
GridFS gfs = new GridFS(db);
// Build the query as a DBObject; dot notation matches a single metadata field
DBObject dbObject = (DBObject) JSON.parse("{'metadata.gender': 'Male'}");
// Querying action (find)
List<GridFSDBFile> gridFSDBFiles = gfs.find(dbObject);
// Loop through the results
for (GridFSDBFile gridFSDBFile : gridFSDBFiles) {
    System.out.println(gridFSDBFile.getFilename());
}
Metadata is stored in the metadata field. You can query it with dot notation, like
db.fs.files.find({ "metadata.content_type": "text/html" })

How to add data into mongo collections

I have the following MongoDB collection structure:
{
"_id" : ObjectId("52204f5b24c8cbf03ca16f8e"),
"Date" : 1377849179,
"cpuUtilization" : 31641,
"memory" : 20623801,
"hostId" : "600.6.6.6"
}
In the above collection I have 1000 hostIds, and every hostId produces cpuUtilization and memory readings every 5 minutes. Can anyone suggest whether I should put my data into a single collection, or create 1000 separate collections named after the hostId (100.1.12.2, 101.2.10.1, ...)?
I also want indexing on the collection for searching records.
From the structure you have shared it would be a sensible choice to put the data into separate documents, since the memory and cpuUtilization values will always differ. Also, if you store a timestamp in the Date field, that will always be different.
It is far easier to query your database if you store the readings as separate documents, and you can avoid aggregation as well, which gives you better query performance when you use appropriate indexes (see the sketch after the sample records below).
So your records should look like below:
{ "_id" : ObjectId("someID1"),"Date" : 1377849179,"cpuUtilization" : 31641,"memory" : 20623801,"hostId" : "600.6.6.6"}
{ "_id" : ObjectId("someID2"),"Date" : 1377849210,"cpuUtilization" : 20141,"memory" : 28787801,"hostId" : "600.6.6.6"}
One collection will be good enough to store the information. One thing you have to take care of is write performance: because MongoDB takes a lock while writing, writes may be slow. One suggestion is to use two or three databases, each holding the collections for a specific range of hosts; this can help you write faster, since beginning with version 2.2 MongoDB implements locks on a per-database basis for most read and write operations.

MongoDB Table Design and Query Performance

I'm new to MongoDB. When creating a new collection, a question came to my mind about how to design it and about query performance. My document structure looks like this:
{
"name" : string,
"data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", .... }
}
The "data" field could grow until it reaches an amount of 100.000 elements ( "data100.000" : "aaaXXX"). However the number of rows in this table would be under control (between 500 and 1000).
This table will be accessed many times in my application and I'd like to maximize the performance of any queries. I would do queries like this one (I'll put an example in java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know if this would get slower as the number of "dataX" elements grows.
So the question is: is this design correct? Should I change something?
I'd be pleased to read your advice, many thanks in advance.
A document can be viewed like a table row with columns, but you have to be careful: it has different usage characteristics. The maximum document size is 16 MB, and keep in mind that MongoDB tries to hold frequently used documents in memory.
With your query the whole document will be returned. Ask yourself whether you need all the entries, or whether you will work with a single entry on its own.
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.
What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing it as an array of strings; then you can index the array field, which indexes all of the values.
{
"name" : string,
"data" : [ "xxx", "yyy", "zzz" ]
}
If like in your query you then wanted the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz' you can do:
db.Collection.find( { "data" : "zzz" } )
100,000 strings is not going to get anywhere near 16 MB, so you don't need to worry about that, but having 100,000 fields in a nested document or array suggests something is wrong with the design; without knowing what data actually is, I couldn't say for sure.
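As a rough sketch of that indexing suggestion (the collection name follows the queries above; everything else is an assumption):
// Indexing the array field creates a multikey index, so every element of data is indexed
db.Collection.createIndex({ name: 1, data: 1 })

// This query can then be satisfied by the index
db.Collection.find({ name: "someName", data: "zzz" })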