The MongoDB collection I am working on stores sensor data from a cellphone, which is sent to the server roughly every 2-6 seconds.
The data is huge and the 16 MB limit is crossed after 4-5 hours. There doesn't seem to be any workaround for this?
I have searched Stack Overflow and gone through various questions, but no one has actually shared their hack.
Is there any way, on the DB side maybe, to distribute the chunks as is done for big files via GridFS?
To fix this problem you will need to make some small amendments to your data structure. By the sounds of it, for your documents to exceed the 16 MB limit, you must be embedding your sensor data into an array in a single document.
I would not suggest using GridFS here; I do not believe it to be the best solution, and here is why.
There is a technique known as bucketing that you could employ, which will essentially split your sensor readings out into separate documents, solving this problem for you.
The way it works is this:
Let's say I have a document with some embedded readings for a particular sensor that looks like this:
{
  _id : ObjectId("xxx"),
  sensor : "SensorName1",
  readings : [
    { date : ISODate("..."), reading : "xxx" },
    { date : ISODate("..."), reading : "xxx" },
    { date : ISODate("..."), reading : "xxx" }
  ]
}
The structure above already has a major flaw: the readings array can grow without bound and eventually exceed the 16 MB document limit.
So what we can do is change the structure slightly to include a count property, like this:
{
  _id : ObjectId("xxx"),
  sensor : "SensorName1",
  readings : [
    { date : ISODate("..."), reading : "xxx" },
    { date : ISODate("..."), reading : "xxx" },
    { date : ISODate("..."), reading : "xxx" }
  ],
  count : 3
}
The idea behind this is, when you $push your reading into your embedded array, you increment ($inc) the count variable for every push that is performed. And when you perform this update (push) operation, you would include a filter on this "count" property, which might look something like this:
{ count : { $lt : 500} }
Then set "upsert" to "true" in your update options:
db.sensorReadings.update(
  // Filter on the same "sensor" field used in the document above
  { sensor: "SensorName1", count: { $lt: 500 } },
  {
    // Your update: $push a single reading document and $inc your count
    $push: { readings: readingDocumentToPush },
    $inc: { count: 1 }
  },
  { upsert: true }
)
See the MongoDB update documentation for more information on update and the upsert option.
What will happen is, when the filter condition is not met (i.e. when there is either no existing document for this sensor, or the count is greater than or equal to 500, because you are incrementing it every time an item is pushed), a new document will be created, and the readings will now be embedded in this new document. So you will never hit the 16 MB limit if you do this properly.
Now, when querying the database for readings of a particular sensor, you may get back multiple documents for that sensor (instead of just one with all the readings in it). For example, if you have 10,000 readings, you will get 20 documents back, each with 500 readings.
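To make the bucketing behaviour concrete, here is a minimal in-memory sketch in plain JavaScript. It does not talk to MongoDB; the `pushReading` helper and the array standing in for the collection are hypothetical names used only to mimic what the filter, `$push`, `$inc` and upsert do together.

```javascript
// In-memory sketch only: `collection` is a plain array standing in for a
// MongoDB collection, and pushReading() is a hypothetical helper.
const BUCKET_SIZE = 500; // same threshold as the { count: { $lt: 500 } } filter

function pushReading(collection, sensor, reading) {
  // Equivalent of the filter { sensor: ..., count: { $lt: 500 } }
  let bucket = collection.find(
    (doc) => doc.sensor === sensor && doc.count < BUCKET_SIZE
  );
  if (!bucket) {
    // Equivalent of upsert: true, nothing matched so a new bucket is created
    bucket = { sensor, readings: [], count: 0 };
    collection.push(bucket);
  }
  bucket.readings.push(reading); // $push
  bucket.count += 1;             // $inc
  return bucket;
}

const sensorReadings = [];
for (let i = 0; i < 10000; i++) {
  pushReading(sensorReadings, "SensorName1", {
    date: new Date(i * 1000),
    reading: String(i),
  });
}

console.log(sensorReadings.length);                             // 20
console.log(sensorReadings.every((doc) => doc.count === 500));  // true
```

With 10,000 readings and a bucket size of 500 this ends up with the 20 documents mentioned above.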
You can then use the aggregation pipeline and $unwind to filter your readings as if they were their own individual documents.
For more information on $unwind, see the MongoDB $unwind documentation; it's very useful.
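As a rough illustration of what $unwind does to the bucket documents, here is a plain JavaScript sketch. It only imitates the stage in memory; `unwindReadings` and the sample dates are made up for the example.

```javascript
// In-memory imitation of { $unwind: "$readings" }: every bucket document is
// turned into one document per reading, with "readings" holding that single
// element. unwindReadings() is a hypothetical helper, not a MongoDB API.
function unwindReadings(buckets) {
  return buckets.flatMap((doc) =>
    doc.readings.map((reading) => ({ ...doc, readings: reading }))
  );
}

const buckets = [
  {
    sensor: "SensorName1",
    count: 2,
    readings: [
      { date: new Date("2020-01-01T00:00:00Z"), reading: "a" },
      { date: new Date("2020-01-01T00:00:05Z"), reading: "b" },
    ],
  },
  {
    sensor: "SensorName1",
    count: 1,
    readings: [{ date: new Date("2020-01-01T00:00:10Z"), reading: "c" }],
  },
];

const flat = unwindReadings(buckets);

// After unwinding, a filter on the reading date works per reading,
// much like a $match stage placed after $unwind.
const cutoff = new Date("2020-01-01T00:00:05Z");
const recent = flat.filter((doc) => doc.readings.date >= cutoff);

console.log(flat.length);                               // 3
console.log(recent.map((doc) => doc.readings.reading)); // [ 'b', 'c' ]
```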
I hope this helps.
You can handle this type of situation using GridFS in MongoDB.
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
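The chunking described above can be sketched in plain JavaScript as follows. This is not the GridFS driver code, just an illustration of the splitting rule; `splitIntoChunks` and the `"file-1"` id are hypothetical, while `files_id`, `n` and `data` mirror the field names GridFS uses in its chunks collection.

```javascript
// Illustration of the splitting rule only, not the driver implementation.
const CHUNK_SIZE = 255 * 1024; // the 255 kB default, i.e. 261120 bytes

function splitIntoChunks(fileId, data, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n++) {
    // The last chunk is only as large as the remaining data
    chunks.push({
      files_id: fileId,
      n,
      data: data.subarray(offset, offset + chunkSize),
    });
  }
  return chunks;
}

const file = new Uint8Array(600 * 1024); // a 600 KiB dummy file
const chunks = splitIntoChunks("file-1", file);

console.log(chunks.length);                    // 3
console.log(chunks.map((c) => c.data.length)); // [ 261120, 261120, 92160 ]
```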
The GridFS documentation contains almost everything you need to implement it. You can follow it.
As your data is a stream, you can try the following:
gs.write(data, callback)
where data is a Buffer or a string, and callback gets two parameters: an error object (if an error occurred) and a result value which indicates whether the write was successful. While the GridStore is not closed, every write is appended to the opened GridStore.
You can follow this GitHub page for streaming-related information.
We are extending an existing node+mongo app. We need to add what could be large docs, but we currently do not know how big they could get.
MongoDB has a default max document size of 16 MB. I am aware we can increase this but would rather not.
Has anyone ever seen an auto doc-split module? Something to automatically split docs into partials if they exceed a certain size?
If you have large CSV data to be stored in MongoDB, then there are two approaches which will both work well in different ways:
1: Save in MongoDB format
This means that you have your application read the csv, and write it to a MongoDB collection one row at a time. So each row is saved as a separate document, perhaps something like this:
{
  "filename" : "restaurants.csv",
  "version" : "2",
  "uploadDate" : ISODate("2017-06-15"),
  "name" : "Ace Cafe",
  "cuisine" : "British",
  etc
},
{
  "filename" : "restaurants.csv",
  "version" : "2",
  "uploadDate" : ISODate("2017-06-15"),
  "name" : "Bengal Tiger",
  "cuisine" : "Bangladeshi",
  etc
}
This will take work on your application's part, to render the data into this format and to decide how and where to save the metadata
You can index and query on the data, field by field and row by row
You have no worries about any single document getting too large
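As a rough sketch of the first approach, the following plain JavaScript turns CSV text into one document per row, with the file-level metadata copied onto each. The `csvToDocuments` helper is hypothetical, and the parsing is deliberately naive: it assumes no quoted fields or embedded commas, so a real application would use a proper CSV parser before inserting the result.

```javascript
// Naive sketch: assumes a header row, "\n" line endings and no quoted fields.
// csvToDocuments() is a hypothetical helper, not part of any driver.
function csvToDocuments(csvText, metadata) {
  const [headerLine, ...rows] = csvText.trim().split("\n");
  const headers = headerLine.split(",");
  return rows.map((row) => {
    const values = row.split(",");
    const doc = { ...metadata }; // file-level metadata repeated on each row
    headers.forEach((header, i) => {
      doc[header] = values[i];
    });
    return doc;
  });
}

const csv = "name,cuisine\nAce Cafe,British\nBengal Tiger,Bangladeshi\n";
const restaurantDocs = csvToDocuments(csv, {
  filename: "restaurants.csv",
  version: "2",
});

console.log(restaurantDocs.length);     // 2
console.log(restaurantDocs[1].cuisine); // Bangladeshi
```

Each element of the resulting array is small and independent, so it can be inserted as its own document and indexed field by field.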
2: Save in CSV format using GridFS
This means that your file is uploaded as an un-analysed blob and automatically divided into chunks (255 kB each by default), each of which is saved as a MongoDB document.
This is easy to do, and does not disturb your original CSV structure
However the data is opaque to MongoDB: you cannot scan it or read it row by row
To work with the data, your application will have to download the entire file from MongoDB and work on it in memory
Hopefully one of these approaches will suit your needs.
I removed some documents with my last query by mistake. Is there any way to roll back the last query on my mongo collection?
Here is my last query:
db.foo.remove({ "name" : "some_x_name"})
Is there any rollback/undo option? Can I get my data back?
There is no rollback option (rollback has a different meaning in a MongoDB context), and strictly speaking there is no supported way to get these documents back - the precautions you can/should take are covered in the comments. With that said however, if you are running a replica set, even a single node replica set, then you have an oplog. With an oplog that covers when the documents were inserted, you may be able to recover them.
The easiest way to illustrate this is with an example. I will use a simplified example with just 100 deleted documents that need to be restored. To go beyond this (huge number of documents, or perhaps you wish to only selectively restore etc.) you will either want to change the code to iterate over a cursor or write this using your language of choice outside the MongoDB shell. The basic logic remains the same.
First, let's create our example collection foo in the database dropTest. We will insert 100 documents without a name field and 100 documents with an identical name field so that they can be mistakenly removed later:
use dropTest;
for(i=0; i < 100; i++){db.foo.insert({_id : i})};
for(i=100; i < 200; i++){db.foo.insert({_id : i, name : "some_x_name"})};
Now, let's simulate the accidental removal of our 100 name documents:
> db.foo.remove({ "name" : "some_x_name"})
WriteResult({ "nRemoved" : 100 })
Because we are running in a replica set, we still have a record of these documents in the oplog (being inserted) and thankfully those inserts have not (yet) fallen off the end of the oplog (the oplog is a capped collection, remember). Let's see if we can find them:
use local;
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}).count();
100
The count looks correct, we seem to have our documents still. I know from experience that the only piece of the oplog entry we will need here is the o field, so let's add a projection to only return that (output snipped for brevity, but you get the idea):
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1});
{ "o" : { "_id" : 100, "name" : "some_x_name" } }
{ "o" : { "_id" : 101, "name" : "some_x_name" } }
{ "o" : { "_id" : 102, "name" : "some_x_name" } }
{ "o" : { "_id" : 103, "name" : "some_x_name" } }
{ "o" : { "_id" : 104, "name" : "some_x_name" } }
To re-insert those documents, we can just store them in an array, then iterate over the array and insert the relevant pieces. First, let's create our array:
var deletedDocs = db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1}).toArray();
> deletedDocs.length
100
Next we remind ourselves that we only have 100 docs in the collection now, then loop over the 100 inserts, and finally revalidate our counts:
use dropTest;
db.foo.count();
100
// simple for loop to re-insert the relevant elements
for (var i = 0; i < deletedDocs.length; i++) {
db.foo.insert({_id : deletedDocs[i].o._id, name : deletedDocs[i].o.name});
}
// check total and name counts again
db.foo.count();
200
db.foo.count({name : "some_x_name"})
100
And there you have it, with some caveats:
This is not meant to be a true restoration strategy, look at backups (MMS, other), delayed secondaries for that, as mentioned in the comments
It's not going to be particularly quick to query the documents out of the oplog (any oplog query is a table scan) on a large busy system.
The documents may age out of the oplog at any time (you can, of course, make a copy of the oplog for later use to give you more time)
Depending on your workload you might have to de-dupe the results before re-inserting them
Larger sets of documents will be too large for an array as demonstrated, so you will need to iterate over a cursor instead
The format of the oplog is considered internal and may change at any time (without notice), so use at your own risk
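To illustrate the de-duplication and one-entry-at-a-time iteration mentioned in the caveats, here is a plain JavaScript sketch that works on an array of oplog-shaped objects. It does not query a real oplog; `recoverableDocs` and the sample entries are made up, and the entry shape (`op`, `ns`, `o`) is only the part used in the example above.

```javascript
// In-memory sketch: `oplogEntries` can be any iterable, so the same loop
// would work over a cursor. recoverableDocs() is a hypothetical helper.
function recoverableDocs(oplogEntries, ns, matches) {
  const byId = new Map();
  for (const entry of oplogEntries) {
    // Keep only inserts into the namespace we care about
    if (entry.op !== "i" || entry.ns !== ns || !matches(entry.o)) continue;
    // De-dupe on _id: a later insert of the same _id replaces the earlier one
    byId.set(entry.o._id, entry.o);
  }
  return [...byId.values()];
}

const oplogEntries = [
  { op: "i", ns: "dropTest.foo", o: { _id: 100, name: "some_x_name" } },
  { op: "i", ns: "dropTest.foo", o: { _id: 101, name: "some_x_name" } },
  { op: "d", ns: "dropTest.foo", o: { _id: 100 } },
  { op: "i", ns: "dropTest.foo", o: { _id: 100, name: "some_x_name" } },
  { op: "i", ns: "dropTest.bar", o: { _id: 7, name: "some_x_name" } },
];

const toRestore = recoverableDocs(
  oplogEntries,
  "dropTest.foo",
  (doc) => doc.name === "some_x_name"
);

console.log(toRestore.map((doc) => doc._id)); // [ 100, 101 ]
```

The Map key here is a number; for ObjectId values you would likely need to key on their string form instead.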
I understand this is a bit old, but I wanted to share something that I researched in this area that may be useful to others with a similar problem.
The fact is that MongoDB does not physically delete data immediately; it only marks it for deletion. This is, however, version specific, and there is currently no documentation or standardization that would enable a third-party tool developer (or someone in desperate need) to build a tool or write a simple script that works reliably across versions. I opened a ticket for this: https://jira.mongodb.org/browse/DOCS-5151.
I did explore one option which is at a much lower level and may need fine-tuning based on the version of MongoDB used. Understandably it is too low level for most people's liking; however, it works and can be handy when all else fails.
My approach involves directly working with the binary in the file and using a Python script (or commands) to identify, read and unpack (BSON) the deleted data.
My approach is inspired by this GitHub project (I am NOT the developer of this project). Here on my blog I have tried to simplify the script and extract a specific deleted record from a Raw MongoDB file.
Currently a record is marked for deletion as "\xee" at the start of the record. This is what a deleted record looks like in the raw db file:
'\xee\xee\xee\xee\x07_id\x00U\x19\xa6g\x9f\xdf\x19\xc1\xads\xdb\xa8\x02name\x00\x04\x00\x00\x00AAA\x00\x01marks\x00\x00\x00\x00\x00\x00#\x9f#\x00'
I replaced the first block with the size of the record, which I identified earlier based on other records.
y = "3\x00\x00\x00" + x[20804:20800+51]
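In case the "3" in that prefix looks odd: a BSON document starts with its total size as a little-endian 32-bit integer, and the record here is 51 bytes long. The small JavaScript sketch below (the `int32le` helper is made up for the illustration) shows that 51 encodes to the bytes 0x33 0x00 0x00 0x00, and 0x33 happens to be the ASCII character "3".

```javascript
// Encode a small non-negative integer as 4 little-endian bytes,
// the way BSON stores the document length. int32le() is a hypothetical helper.
function int32le(n) {
  return [n & 0xff, (n >> 8) & 0xff, (n >> 16) & 0xff, (n >> 24) & 0xff];
}

const sizePrefix = int32le(51);

console.log(sizePrefix);                         // [ 51, 0, 0, 0 ]
console.log(String.fromCharCode(sizePrefix[0])); // 3
```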
Finally using the BSON package (that comes with pymongo), I decoded the binary to a Readable object.
bson.decode_all(y)
[{u'_id': ObjectId('5519a6679fdf19c1ad73dba8'), u'name': u'AAA', u'marks': 2000.0}]
This BSON is a python object now and can be dumped into a recover collection or simply logged somewhere.
Needless to say this or any other recovery technique should be ideally done in a staging area on a backup copy of the database file.
I am collecting data from a streaming API and I want to create a real-time analytics dashboard. Every time a new record appears at the end of the stream I update a counter in the below document.
From a design perspective, am I correct to use only one document, as in the example below?
{
  "_id" : ObjectId("5238beb4d4bed9e444c99978"),
  "counts" : {
    "hours" : {
      "1" : 835,
      "2" : 1007,
      ...
      "3" : 174
    }
  }
}
The benefit with this approach is that only one document needs to be sent to the real-time analytics dashboard. Also after a year this document would have only 365 * 24 fields, 1 for each hour in that year?
What about indexing? Can I create an index on counts.hours if I only have one document? Or do indexes only work across collections in mongodb? Do indexes help with finding documents faster or also fields inside documents?
If I could create an index on counts.hours, then the counter increment process could find the correct hour to increment (per new document at the end of the stream) much more efficiently.
You can create indexes on fields embedded in a document. In the case above:
yourCollection.createIndex({ 'counts.hours':1 });
The index will help you optimize queries to return documents based on the 'counts.hours' field.
yourCollection.find({ 'counts.hours':1 });
Your data structure design should depend on the kind of queries and updates you are planning to do. In the case you described, I imagine you will be adding members to the 'hours' object. Updates like that might be expensive, since MongoDB pads each collection record, optimizing for the case where the record size is stable across updates.
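On the increment itself: an update can address a single hour with dot notation, for example `$inc` on a path such as `counts.hours.2`, so locating the field inside the document does not depend on an index. The plain JavaScript sketch below imitates that behaviour in memory; `incPath` is a hypothetical helper, not a driver function.

```javascript
// In-memory imitation of { $inc: { "counts.hours.<hour>": 1 } }.
// Missing intermediate objects are created, and a missing counter starts
// at the increment amount, which is how $inc treats an absent field.
function incPath(doc, path, amount) {
  const keys = path.split(".");
  let node = doc;
  for (const key of keys.slice(0, -1)) {
    if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
    node = node[key];
  }
  const last = keys[keys.length - 1];
  node[last] = (node[last] || 0) + amount;
  return doc;
}

const stats = { counts: { hours: { "1": 835, "2": 1007 } } };
incPath(stats, "counts.hours.2", 1); // existing hour
incPath(stats, "counts.hours.3", 1); // new hour

console.log(stats.counts.hours); // { '1': 835, '2': 1008, '3': 1 }
```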
I'm new to MongoDB. When creating a new table, a question came to mind about how to design it for performance. My table structure looks like this:
{
"name" : string,
"data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", .... }
}
The "data" field could grow until it reaches 100,000 elements ("data100.000" : "aaaXXX"). However, the number of rows in this table would be under control (between 500 and 1000).
This table will be accessed many times in my application and I'd like to maximize the performance of any queries. I would do queries like this one (I'll give an example in Java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know if this would get slower when the amount of "dataX"... elements grows.
So the question is: Is this design correct? Should I change something?
I'll be pleased to read your advice, many thanks in advance
A document can be viewed like a table with columns, but you have to be careful, because it has different usage characteristics. The maximum document size is 16 MB, and you have to keep in mind that documents are held in memory by MongoDB.
With your query the whole document will be returned. Ask yourself: do you need all the entries, or will you only need to use a single entry on its own?
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.
What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing it as an array of strings; then you can index the array field, which would index all the values.
{
"name" : string,
"data" : [ "xxx", "yyy", "zzz" ]
}
If like in your query you then wanted the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz' you can do:
db.Collection.find( { "data" : "zzz" } )
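The difference between the two queries can be imitated in plain JavaScript as below. This is only an in-memory sketch of the matching rules; `atPosition`, `containing` and the sample documents are made up for the illustration.

```javascript
// In-memory imitation of the two query shapes on an array field.
const sampleDocs = [
  { name: "a", data: ["xxx", "yyy", "zzz"] },
  { name: "b", data: ["zzz", "xxx"] },
  { name: "c", data: ["xxx", "yyy"] },
];

// Like { "data.2": "zzz" }: the value must sit at that exact position
const atPosition = (docs, index, value) =>
  docs.filter((doc) => doc.data[index] === value);

// Like { "data": "zzz" }: the value may appear anywhere in the array
const containing = (docs, value) =>
  docs.filter((doc) => doc.data.includes(value));

console.log(atPosition(sampleDocs, 2, "zzz").map((doc) => doc.name)); // [ 'a' ]
console.log(containing(sampleDocs, "zzz").map((doc) => doc.name));    // [ 'a', 'b' ]
```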
100,000 strings is not going to get anywhere near 16 MB, so you don't need to worry about that. However, having 100,000 fields in a nested document or array indicates something is wrong with the design, although without knowing what data is I couldn't say for sure.
I have some scripts which update MongoDB records that look like this:
{ "_id" : "c12345", "arr" : [
  { "i" : 270099850, "a" : 772 },
  { "i" : 286855630, "a" : 622 }
] }
The scripts append elements to the "arr" array of the object using "$pushAll", which works fine and is very fast.
My requirement:
1. Keep modifying these objects, but process them once the size of arr exceeds 1000. When arr exceeds 1000, I choose some important records, discard some less important ones and some old ones, and reduce the size of arr to 500.
Current implementation:
1. Script A takes some data from somewhere, finds the object in another collection using the "_id" field, and appends that data to the "arr" array.
2. When the same script finds the element, it checks the size of "arr". If it is less than 1000, it does a normal append to arr; otherwise it processes the PHP object retrieved through find, modifies it, and updates the mongo record using "$set".
Current bottlenecks:
1. I want the updating script to run very fast. Upserts are fast, but the find and modify operations are slower for each record.
Ideas in mind:
1. Instead of processing EXCEEDED items within the scripts, set a bool flag in the object and process it using a separate Data Cleaner script (but this also requires me to FIND the object before doing the UPSERT).
2. Always maintain a COUNT variable in the object, which stores the current length of "arr", and use it in the Data Cleaner script, which cleans all the objects fetched through a MongoDB query "count" > 1000. (As MongoDB does not allow the $size operator to take ranges, only an equality condition, I need to maintain my own COUNT counter.)
Any other clean and efficient ideas you can suggest?
Thanks.
In version 2.3.2 of mongo a new feature has been added. There is now a $slice that can be used to keep an array to a fixed size.
E.g.:
t.update( {_id:7}, { $push: { x: { $each: [ {a:{b:3}} ], $slice:-2, $sort: {'a.b':1} } } } )
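To show what that update does to the array, here is a plain JavaScript sketch of the same steps: the $each elements are appended, the array is sorted, and then it is sliced. `pushSortSlice` is a hypothetical helper, and the sketch assumes a numeric, ascending sort key.

```javascript
// In-memory imitation of $push with $each, $sort and $slice.
// Order of operations: append, then sort, then slice.
function pushSortSlice(array, each, sortKey, slice) {
  const getPath = (obj, path) =>
    path.split(".").reduce((value, key) => (value == null ? undefined : value[key]), obj);
  const result = array.concat(each);                                 // $each
  result.sort((p, q) => getPath(p, sortKey) - getPath(q, sortKey));  // $sort: { 'a.b': 1 }
  // A negative $slice keeps the last N elements, a positive one the first N
  return slice < 0 ? result.slice(slice) : result.slice(0, slice);
}

const x = [{ a: { b: 5 } }, { a: { b: 1 } }];
const updated = pushSortSlice(x, [{ a: { b: 3 } }], "a.b", -2);

console.log(JSON.stringify(updated)); // [{"a":{"b":3}},{"a":{"b":5}}]
```

Note that this only covers "keep the N largest or smallest by some key"; it does not help with a custom mix of important, less important and old records.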
There's no easy way to do this, however, this is a good idea:
Instead of processing EXCEEDED items within the scripts, set a bool flag in the object, and process it using a separate Data Cleaner script.
Running a separate script definitely makes sense for this.
MongoDB does not have a method for "fixed-length" arrays. And it definitely does not have a method for doing something like this:
choose some important records, discard some less important ones, and discard some old ones
The only exception I would make is the "bool" flag. You probably want just a straight counter. If you can index on this counter then it should be fast to find those arrays that are "too big".