MongoDB Performance with Upsert

We are trying to build a "real time" statistics feature for our application, and we want to use MongoDB.
To do this, I have a database named storage with a statistics collection in it.
I store my data like this:
{
"_id" : ObjectId("55642d270528055b171fedf5"),
"cat" : "module",
"name" : "Injector",
"ts_min" : ISODate("2015-05-22T13:16:00Z"),
"nb_action" : {
"0" : 156
},
"tps_action" : {
"0" : 45016
},
"min_tps" : 10,
"max_tps" : 879
}
So a category, a name and a date together identify a unique object. In this object, I store:
Number of uses per second (nb_action.[0..59])
Total time per second (tps_action.[0..59])
Min time
Max time
To inject my data, I use an upsert:
db.statistics.update(
    {
        ts_min: ISODate("2015-05-22T13:16:00.000Z"),
        name: "Injector",
        cat: "module"
    },
    {
        $inc: { "nb_action.0": 1, "tps_action.0": 250 },
        $min: { min_tps: 250 },
        $max: { max_tps: 250 }
    },
    { upsert: true }
)
So I perform two $inc operations to maintain my counters, and use $min and $max to maintain my stats.
All of this works.
With 1 thread injecting 50,000 records on a single machine (no sharding), for 10 modules, I observe 3,000 to 3,500 ops per second.
My problem is that I can't tell whether that is good or not.
Any suggestions?
PS: I use long field names here for the sake of the example, and I also add a set part to initialize each second in case of an insert.
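
For readers who want to see how the second index in the upsert above could be derived from an event timestamp, here is a minimal sketch. The helper name buildStatUpsert is hypothetical, and truncating the timestamp to the minute in UTC is an assumption based on the ts_min field in the sample document; it only builds the documents and does not talk to a server.

```javascript
// Hypothetical helper: builds the filter and update documents for one measurement.
function buildStatUpsert(cat, name, when, durationMs) {
  // Assumption: ts_min is the event time truncated to the minute (UTC).
  const tsMin = new Date(when.getTime());
  tsMin.setUTCSeconds(0, 0);

  // The second within the minute (0..59) selects the sub-field to increment.
  const second = when.getUTCSeconds();
  const inc = {};
  inc["nb_action." + second] = 1;
  inc["tps_action." + second] = durationMs;

  return {
    filter: { ts_min: tsMin, name: name, cat: cat },
    update: {
      $inc: inc,
      $min: { min_tps: durationMs },
      $max: { max_tps: durationMs }
    },
    options: { upsert: true }
  };
}

const op = buildStatUpsert("module", "Injector", new Date("2015-05-22T13:16:07Z"), 250);
// In the mongo shell this would be passed as:
//   db.statistics.update(op.filter, op.update, op.options)
console.log(JSON.stringify(op, null, 2));
```

Building the dotted key from the timestamp keeps one document per module per minute, which matches the layout described in the question.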

Related

Matching elements in array documents sometimes gets very slow

I have a MongoDB collection with about 100,000 documents.
Each document has an array with about 100 elements. It is an array of strings like this:
features: [
"0_Toyota",
"29776_Grey",
"101037_Hybrid",
"240473_Iron Gray",
"46290_Aluminium,Magnesium",
"2787_14",
"9350_1920 x 1080",
"36303_Y",
"310870_N",
"57721_Y"
...
Queries like the one below are usually very fast, but they sometimes become very slow when one specific extra condition is included inside $and. I have no idea why this happens. When it is slow, it takes more than 40 seconds. It always happens with the same extra condition, and it may well happen with other conditions too.
db.products.find({
$and:[
{
"features" : {
"$eq" : "36303_N"
}
},
{
"features" : {
"$eq" : "91135_IPS"
}
},
{
"features" : {
"$eq" : "9350_1366 x 768"
}
},
{
"features" : {
"$eq" : "178874_Y"
}
},
{
"features" : {
"$eq" : "43547_Y"
}
}
...
I'm running the same MongoDB version on my Unix laptop and on a Linux server instance.
I also tried indexing the "features" field, with the same results.

Use $all in your query; it is meant for matching several values in an array.
First, create an index on features.
Then try a query like this:
db.products.find( { features: { $all: ["36303_N", "91135_IPS","others..."] } } )
By the way, if the query is still very slow:
get the slow operation from your mongod log;
tell us your MongoDB version;
check whether anything is writing while you query (in some versions, writes block reads).

I have realized that the order inside $all matters. I now sort the elements by the number of documents that contain them, ascending, so the most selective element comes first.
Before, the query took about 40 seconds to execute; now, with the elements ordered, it takes about 22 seconds.
That is still far too long.
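
The reordering described above can be sketched as follows. The helper name buildSelectiveAllQuery is hypothetical, and the counts are made-up numbers for illustration only; in practice they would come from something like one count query per feature value.

```javascript
// Hypothetical helper: sorts $all terms so the rarest term comes first.
// `counts` maps each term to the number of documents containing it.
function buildSelectiveAllQuery(terms, counts) {
  function countOf(term) {
    // Terms with no known count are treated as the least selective.
    return Object.prototype.hasOwnProperty.call(counts, term) ? counts[term] : Infinity;
  }
  const ordered = terms.slice().sort(function (a, b) {
    return countOf(a) - countOf(b);
  });
  return { features: { $all: ordered } };
}

// Made-up counts, for illustration only.
const exampleCounts = { "36303_N": 52000, "91135_IPS": 310, "178874_Y": 4800 };
const query = buildSelectiveAllQuery(["36303_N", "91135_IPS", "178874_Y"], exampleCounts);
// In the mongo shell: db.products.find(query)
console.log(JSON.stringify(query));
```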

MongoDb TTL index for non ISO date format

Below is my sample JSON message. Its Timestamp has the format YYYY-MM-DDThh:mm:ssTZD (e.g. 2015-08-18T22:43:01-04:00).
I also have a TTL index set up for 30 days, but my data is not getting removed. I know that MongoDB uses a format like ISODate("2015-09-03T14:21:30.177-04:00"), but is that absolutely necessary? What can I change in my index to get the TTL working?
We have millions of documents across multiple collections, and we run out of space every now and then.
JSON:
{
"_id" : ObjectId("55d3ed35817f4809e14e2"),
"AuditEnvelope" : {
"TrackingInformation" : {
"CorelationId" : "2703-4ce2-af68-47832462",
"Timestamp" : "2015-08-18T22:43:01-04:00",
"LogData" : {
"msgDetailJson" : "[Somedata here]"
}
}
}
}
Index
"1" : {
"v" : 1,
"key" : {
"AuditEnvelope.TrackingInformation.Timestamp" : 1
},
"name" : "TTL",
"ns" : "MyDB.MyColl",
"expireAfterSeconds" : 2592000
},
MongoDB version : 3.0.1

In order for the TTL clean-up process to work with a defined TTL index, the specified field must contain a BSON Date, as is covered in the documentation for TTL indexes:
If the indexed field in a document is not a date or an array that holds a date value(s), the document will not expire.
You will need to convert such strings to BSON dates. This is also a wise thing to do because the internal storage of a BSON Date is a numeric timestamp value, which takes up a lot less storage than a string does.
The transformation requires an update to "cast" each value to a date object. As a one-off operation, this is probably best done through the MongoDB shell, using Bulk Operations to minimize the network overhead when writing back the data.
var bulk = db.MyColl.initializeOrderedBulkOp(),
    count = 0;

// Only match documents where the field is still a string (BSON type 2)
db.MyColl.find({
    "AuditEnvelope.TrackingInformation.Timestamp": { "$type": 2 }
}).forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": {
            "AuditEnvelope.TrackingInformation.Timestamp":
                new Date(doc.AuditEnvelope.TrackingInformation.Timestamp)
        }
    });
    count++;

    // Send to the server once per 1000 queued operations
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.MyColl.initializeOrderedBulkOp();
    }
});

// Flush whatever is left in the last, partial batch
if ( count % 1000 != 0 )
    bulk.execute();
Also note that the BSON $type operation there is designed to match strings, so even if you have begun a conversion or changed some code to start producing BSON date objects in the field, the query only picks up the remaining string values for conversion.
Ideally you should drop the indexes already on the "Timestamp" fields and then re-create them after the update. This removes the overhead of writing to the index with the updated information. You can also set a foreground index build on the new index creation and this will also save some space in what the index itself consumes.
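
To make the difference between a string and a Date concrete, here is a small sketch that runs outside the database. The function isEligibleForTtl is a hypothetical stand-in for the rule quoted above (only Date values expire, once the indexed date plus expireAfterSeconds has passed); it is not how the server is implemented, and it ignores arrays of dates for brevity.

```javascript
// Hypothetical stand-in for the TTL rule: only Date values can expire.
function isEligibleForTtl(value, expireAfterSeconds, now) {
  if (!(value instanceof Date)) {
    return false; // strings and other types are never expired
  }
  return value.getTime() + expireAfterSeconds * 1000 <= now.getTime();
}

const raw = "2015-08-18T22:43:01-04:00";   // what the documents hold today
const converted = new Date(raw);            // what the bulk update writes back
const thirtyDays = 2592000;                 // expireAfterSeconds from the index
const later = new Date("2015-10-01T00:00:00Z");

console.log(converted.toISOString());                         // same instant, in UTC
console.log(isEligibleForTtl(raw, thirtyDays, later));        // string value
console.log(isEligibleForTtl(converted, thirtyDays, later));  // Date value
```

The string form of this timestamp carries its UTC offset, so new Date() preserves the instant when converting.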

How to store only certain number of document versions in mongodb?

Let's assume I would like to track computers on my network, storing each MAC address together with the device name, port name, VLAN number and timestamp.
I could grab the MAC address table from all switches at regular intervals, parse it and dump that data into MongoDB.
The problem is how to STORE only the last 100 unique entries for each MAC address.
Capped collections are a no-go, because I would have to create a separate collection for each MAC address, which is a bad idea.
The number of switches and MAC addresses may change over time, and new data might be inserted at irregular intervals.
The other idea I have is to write a query that finds the timestamp of the 100th most recent entry for each MAC address and removes all older entries, and to run it after each batch of inserts. It may work, but doesn't seem very efficient.
Do you have any better ideas?

Hmm... found something interesting:
how about storing each version in an array using $push operator with $slice modifier?
there are some examples in the docs:
http://docs.mongodb.org/manual/reference/operator/update/slice/

The other idea I have is to write some query which looks for the timestamp of 100th oldest entry for each mac-address and remove all older entries, and run this queries after each batch of inserts. It may work, but doesn't seem very efficient.
That sounds good to me. It might be cleaner to use cyclic buffer for this:
{
mac : "AA:AA:AA:AA:AA",
entryPointer : 2, // pointer of the next entry to be written
lastEntries : [
{ "ip" : "127.0.0.1", "service" : "foo", ts : ISODate(...), ... },
{ "ip" : "127.0.0.1", "service" : "foo", ts : ISODate(...), ... },
{ "ip" : "255.255.255.255", "service" : "longestProbableServiceName",
ts : ISODate(0001-01-01), ... }
...
{ "ip" : "255.255.255.255", "service" : "longestProbableServiceName",
ts : ISODate(0001-01-01), ... }
]
}
An update would have to increase the pointer and overwrite the position in the array given by pointer % 100. It will be helpful to preallocate the memory in the array as demonstrated to avoid fragmentation and reallocation overhead.
As you pointed out, the modulo-update can be done using $slice and $push:
db.foo.update(
{ mac : "AA:AA:AA..." },
{
$push: {
lastEntries : {
$each: [ { "ip" : "012.002.003.012", ... } ],
$slice: -100
}
}
}
)
Pre-populating the array also comes with the advantage that the most recent entry is always at the last position.
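
As a rough illustration of what $push with $each and a negative $slice does to the array, here is a plain JavaScript imitation. The function pushAndKeepLast is hypothetical and only mimics the effect on an in-memory array; it assumes limit is a positive number.

```javascript
// Mimics the effect of { $push: { lastEntries: { $each: items, $slice: -limit } } }:
// append the new items, then keep only the last `limit` elements.
function pushAndKeepLast(entries, items, limit) {
  return entries.concat(items).slice(-limit);
}

let lastEntries = [];
for (let seq = 0; seq < 105; seq++) {
  lastEntries = pushAndKeepLast(lastEntries, [{ seq: seq }], 100);
}

console.log(lastEntries.length);                      // capped at the limit
console.log(lastEntries[0].seq);                      // oldest entry still kept
console.log(lastEntries[lastEntries.length - 1].seq); // most recent entry
```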

event streaming via mongodb. Get last inserted events

I am consuming data from an existing database that stores system events. My service should check this database on a timer, see whether any new events have been created, then load and handle them. Something like a simple queue implementation.
The question is: how can I get only the new docs each time I check the database? I can't use timestamps, because events come into the database from different sources and there is no ordering among them. So I can only rely on insertion order.

There are a couple of options.
The first, and easiest if it matches your use case, is to use a capped collection. A capped collection is a collection of a pre-defined size that acts as a sort of ring buffer. Once the collection is full, it starts overwriting the oldest documents. To iterate over the collection you simply create a "tailable" cursor. You will need some way of identifying the last document processed (even a simple "done" flag in the document could work, but it would have to exist when the document is inserted). If you truly can't modify the documents in any way, then you could save off the last processed document somewhere, use a coarse timestamp to approximate the start position, and look for the last processed document before handling more documents.
The only real issue with this solution is that the collection holds a limited number of documents and won't grow over time. There are also limits on the write operations you can perform on the documents (they can't grow), but it does not sound like you are modifying the documents.
The second option, which is more complex, is to use the oplog. For a standalone configuration you will still need to pass the --replSet option to create and use the oplog. You will just not configure the oplog. In a sharded configuration you will need to track each "replica set" separately. The oplog contains a document for each insert, update, and delete done to all collections/documents on the server. Each entry contains a timestamp, operation and id (at a minimum). Here are examples of each.
Insert
{ "ts" : { "t" : 1362958492000, "i" : 1 },
"h" : NumberLong("5915409566571821368"), "v" : 2,
"op" : "i",
"ns" : "test.test",
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Delete
{ ... "op" : "d", ..., "b" : true,
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Update
{ ... "op" : "u", ...,
"o2" : { "_id" : "513d189c8544eb2b5e000001" },
"o" : { "$set" : { "i" : 1 } } }
The timestamps are generated on the server and are guaranteed to be monotonically increasing, which allows you to quickly find the documents of interest.
This option is the most robust but requires some work on your part.
I wrote some demo code to create a "watcher" on a collection that is almost what you want. You can find that code on GitHub. Specifically look at the code in the com.allanbank.mongodb.demo.coordination package.
HTH, Rob

You can actually use timestamps if your _id is of type ObjectId:
// Months are zero-based in JavaScript, so 3 means April
prefix = Math.floor((new Date( 2013 , 3 , 11 )).getTime()/1000).toString(16)
db.foo.find( { _id : { $gt : new ObjectId( prefix + "0000000000000000" ) } } )
This way, it doesn't matter where the event came from or when it occurred;
it only matters when the document insertion was recorded (later than the previous timer run).
Of course, MongoDB is schema-less, so you can always set a field such as isNew to true
and set it to false in conjunction with your query / cursor.
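
A minimal sketch of the ObjectId trick, written as plain JavaScript so it can be tried without a server. The helper names objectIdHexFloor and timestampOfObjectIdHex are hypothetical; the only fact relied on is that the first 8 hex characters of an ObjectId encode seconds since the Unix epoch.

```javascript
// Hypothetical helper: smallest ObjectId hex string for a given instant.
function objectIdHexFloor(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  // Pad to 8 hex digits so early dates still yield a 24-character id.
  const prefix = seconds.toString(16).padStart(8, "0");
  return prefix + "0000000000000000";
}

// Hypothetical helper: reads the creation time back out of an ObjectId hex string.
function timestampOfObjectIdHex(hex) {
  return new Date(parseInt(hex.substring(0, 8), 16) * 1000);
}

const since = new Date(Date.UTC(2013, 2, 11)); // 11 March 2013, months are zero-based
const floorHex = objectIdHexFloor(since);
// In the mongo shell: db.foo.find({ _id: { $gt: new ObjectId(floorHex) } })
console.log(floorHex);

// The _id from the oplog example earlier on this page decodes to a time
// shortly before that cutoff, so this query would not return it.
console.log(timestampOfObjectIdHex("513d189c8544eb2b5e000001").toISOString());
```

Using Date.UTC avoids the result depending on the local time zone of the machine that builds the prefix.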

Multiple update of embedded documents' properties

I have the following collection:
{
"Milestones" : [
{ "ActualDate" : null,
"Index": 0,
"Name" : "milestone1",
"TargetDate" : ISODate("2011-12-13T22:00:00Z"),
"_id" : ObjectId("4ee89ae7e60fc615c42e28d1")},
{ "ActualDate" : null,
"Index" : 0,
"Name" : "milestone2",
"TargetDate" : ISODate("2011-12-13T22:00:00Z"),
"_id" : ObjectId("4ee89ae7e60fc615c42e28d2") } ]
,
"Name" : "a", "_id" : ObjectId("4ee89ae7e60fc615c42e28ce")
}
I want to update specific embedded documents: those in the project with the specified _id whose Milestones._id is in a given list and whose ActualDate is null.
In .NET my code looks like:
var query = Query.And(new[] { Query.EQ("_id", ObjectId.Parse(projectId)),
Query.In("Milestones._id", new BsonArray(values.Select(ObjectId.Parse))),
Query.EQ("Milestones.ActualDate", BsonNull.Value) });
var update = Update.Set("Milestones.$.ActualDate", DateTime.Now.Date);
Coll.Update(query, update, UpdateFlags.Multi, SafeMode.True);
Or in the mongo shell:
db.Projects.update(
    {
        "_id" : ObjectId("4ee89ae7e60fc615c42e28ce"),
        "Milestones._id" : { "$in" : [
            ObjectId("4ee89ae7e60fc615c42e28d1"),
            ObjectId("4ee89ae7e60fc615c42e28d2"),
            ObjectId("4ee8a648e60fc615c41d481e")
        ] },
        "Milestones.ActualDate" : null
    },
    { "$set" : { "Milestones.$.ActualDate" : ISODate("2011-12-13T22:00:00Z") } },
    false,
    true
)
But only the first matching milestone is updated.

This is not possible at the moment. The multi flag in update means updating multiple root documents. The positional operator can match only one nested array item. There is a feature request for this in the MongoDB JIRA; you can vote it up and wait.
For now, the only solutions are to load the document, update it as you wish and save it back, or to issue multiple atomic updates, one for each nested array id.
From the documentation at mongodb.org:
Currently the $ operator only applies to the first matched item in the query
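
The "one atomic update per nested array id" workaround could be sketched like this. The helper buildMilestoneUpdates is hypothetical and uses plain strings where the shell would use ObjectId values. It also uses $elemMatch, which is a deliberate change from the question's query: it makes both conditions apply to the same array element.

```javascript
// Hypothetical helper: one filter/update pair per milestone id.
function buildMilestoneUpdates(projectId, milestoneIds, actualDate) {
  return milestoneIds.map(function (milestoneId) {
    return {
      filter: {
        _id: projectId,
        // $elemMatch requires one element to satisfy both conditions,
        // so the positional $ points at exactly that element.
        Milestones: { $elemMatch: { _id: milestoneId, ActualDate: null } }
      },
      update: { $set: { "Milestones.$.ActualDate": actualDate } }
    };
  });
}

const updates = buildMilestoneUpdates(
  "4ee89ae7e60fc615c42e28ce",
  ["4ee89ae7e60fc615c42e28d1", "4ee89ae7e60fc615c42e28d2"],
  new Date("2011-12-13T22:00:00Z")
);
// In the mongo shell, each pair would be sent separately:
//   updates.forEach(function (u) { db.Projects.update(u.filter, u.update); })
console.log(updates.length);
```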

As answered by Andrew Orsich, this is not possible for the moment, at least not as you wish. But loading the document, modifying the array then saving it back will work. The risk is that some other process could modify the array in the meantime, so you would overwrite its changes. To avoid this, you can use optimistic locking, especially if the array is not modified every second.
load the document, including a new attribute: milestones_version
modify the array as needed
save back to mongodb, but now add a query constraint on the milestones_version, and increment it:
db.Projects.findAndModify({
    query: {
        _id: your_project_id,
        milestones_version: expected_milestones_version
    },
    update: {
        $set: {
            Milestones: modified_milestones
        },
        $inc: {
            milestones_version: 1
        }
    },
    new: 1
})
If another process modified the milestones array (and hence the milestones_version) before we did, then this command will do nothing and simply return null. We just need to reload the document and try again. If the array is not modified every second, then this will be very rare and will not have any impact on performance.
The main problem with this solution is that you have to edit every Project, one by one (no multi: true). You could still write a javascript function and have it run on the server though.
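
The reload-and-retry loop described above can be sketched against an in-memory stand-in for the collection. Everything here is hypothetical scaffolding: store plays the role of one Projects document, and compareAndSave imitates a findAndModify call that returns null when milestones_version no longer matches.

```javascript
// In-memory stand-in for a single Projects document.
const store = {
  _id: "p1",
  milestones_version: 0,
  Milestones: [{ Name: "milestone1", ActualDate: null }]
};

// Imitates findAndModify with a version constraint:
// returns the saved document, or null if someone else saved first.
function compareAndSave(expectedVersion, modifiedMilestones) {
  if (store.milestones_version !== expectedVersion) {
    return null;
  }
  store.Milestones = modifiedMilestones;
  store.milestones_version += 1;
  return store;
}

function updateWithRetry(modify, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Load a snapshot (deep copy, so edits do not touch the store directly).
    const snapshot = JSON.parse(JSON.stringify(store));
    const modified = modify(snapshot.Milestones, attempt);
    if (compareAndSave(snapshot.milestones_version, modified) !== null) {
      return attempt;
    }
    // Version mismatch: loop around, reload and try again.
  }
  throw new Error("gave up after " + maxAttempts + " attempts");
}

const attemptsUsed = updateWithRetry(function (milestones, attempt) {
  if (attempt === 1) {
    store.milestones_version += 1; // simulate another process winning the race
  }
  milestones[0].ActualDate = "2011-12-13T22:00:00Z";
  return milestones;
}, 5);

console.log(attemptsUsed, store.milestones_version);
```

The first attempt fails because the simulated concurrent writer bumps the version; the second attempt reloads and succeeds.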

According to their JIRA page "This new feature is available starting with the MongoDB 3.5.12 development version, and included in the MongoDB 3.6 production version"
https://jira.mongodb.org/browse/SERVER-1243
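
With MongoDB 3.6 or later, the original request can be expressed with a filtered positional operator and the arrayFilters option. The sketch below only builds the documents; the helper name buildArrayFilterUpdate and the identifier m are hypothetical, and plain strings stand in for ObjectId values.

```javascript
// Hypothetical helper: builds an update that touches every matching milestone.
function buildArrayFilterUpdate(projectId, milestoneIds, actualDate) {
  return {
    filter: { _id: projectId },
    // $[m] applies the $set to every array element accepted by the filter below.
    update: { $set: { "Milestones.$[m].ActualDate": actualDate } },
    options: {
      arrayFilters: [
        { "m._id": { $in: milestoneIds }, "m.ActualDate": null }
      ]
    }
  };
}

const cmd = buildArrayFilterUpdate(
  "4ee89ae7e60fc615c42e28ce",
  ["4ee89ae7e60fc615c42e28d1", "4ee89ae7e60fc615c42e28d2"],
  new Date("2011-12-13T22:00:00Z")
);
// In the mongo shell (3.6+): db.Projects.updateOne(cmd.filter, cmd.update, cmd.options)
console.log(JSON.stringify(cmd, null, 2));
```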
