How to store only certain number of document versions in mongodb? - mongodb

Let's assume I would like to track computers on my network storing information about mac-address with the device name, port name, vlan number and timestamp.
I could grab mac-address-table from all switches in regular intervals, parse it and dump that data into mongodb.
The problem is: how to STORE only last 100 unique entries for each mac-address.
Capped collections is no-go, because to do that I would have to create separate collection for each mac, which is bad idea.
The number of switches and mac-addresses may change over time, and new data might be inserted at irregular intervals.
The other idea I have is to write some query which looks for the timestamp of 100th oldest entry for each mac-address and remove all older entries, and run this queries after each batch of inserts. It may work, but doesn't seem very efficient.
Do you have any better ideas?

Hmm... found something interesting:
how about storing each version in an array using $push operator with $slice modifier?
there are some examples in the docs:
http://docs.mongodb.org/manual/reference/operator/update/slice/

The other idea I have is to write some query which looks for the timestamp of 100th oldest entry for each mac-address and remove all older entries, and run this queries after each batch of inserts. It may work, but doesn't seem very efficient.
That sounds good to me. It might be cleaner to use cyclic buffer for this:
{
mac : "AA:AA:AA:AA:AA",
entryPointer : 2, // pointer of the next entry to be written
lastEntries : [
{ "ip" : "127.0.0.1", "service" : "foo", ts : ISODate(...), ... },
{ "ip" : "127.0.0.1", "service" : "foo", ts : ISODate(...), ... },
{ "ip" : "255.255.255.255", "service" : "longestProbableServiceName",
ts : ISODate(0001-01-01), ... }
...
{ "ip" : "255.255.255.255", "service" : "longestProbableServiceName",
ts : ISODate(0001-01-01), ... }
]
}
An update would have to increase the pointer and overwrite the position in the array given by pointer % 100. It will be helpful to preallocate the memory in the array as demonstrated to avoid fragmentation and reallocation overhead.
As you pointed out, the modulo-update can be done using $slice and $push:
db.foo.update(
{ mac : "AA:AA:AA..." },
{
$push: {
lastEntries : {
$each: [ { "ip" : "012.002.003.012", ... } ],
$slice: -100
}
}
}
)
Pre-populating the array also comes with the advantage that the most recent entry is always at the last position.

Related

Matching elements in array documents sometimes gets very slow

I have a mongodb collection with about 100.000 documents.
Each document has an array with about ~ 100 elements. Is an array of strings like this:
features: [
"0_Toyota",
"29776_Grey",
"101037_Hybrid",
"240473_Iron Gray",
"46290_Aluminium,Magnesium",
"2787_14",
"9350_1920 x 1080",
"36303_Y",
"310870_N",
"57721_Y"
...
Making queries like this, are very fast. But sometimes gets very slow, including an specific extra condition inside $and. I have no idea why this happens. When gets slow, it takes more than 40 seconds. Always happens with the same extra condition. It is very possible that it happens with other conditions.
db.products.find({
$and:[
{
"features" : {
"$eq" : "36303_N"
}
},
{
"features" : {
"$eq" : "91135_IPS"
}
},
{
"features" : {
"$eq" : "9350_1366 x 768"
}
},
{
"features" : {
"$eq" : "178874_Y"
}
},
{
"features" : {
"$eq" : "43547_Y"
}
}
...
I'm running the same mongodb in my unix laptop and on a linux server instance.
Also trying indexing the field "features" with the same results.
use $all in your mongo query with your data helps you to query for an array
first create index on features
use this query may helps to you
db.products.find( { features: { $all: ["36303_N", "91135_IPS","others..."] } } )
by the way ,
if your query is very slow ,get the slow operation from your mongod log
show your mongodb version .
any writing when query (write will blocking read in some version)
I have realized that order inside $all matters. I change the order of elements by its number of documents that exists inside the collection, ascending. Making the query more selective.
Before, the query takes ~ 40 seconds to execute, now, with elements ordered, it takes ~ 22 seconds.
Still many seconds anyway.

Storing a query in Mongo

This is the case: A webshop in which I want to configure which items should be listed in the sjop based on a set of parameters.
I want this to be configurable, because that allows me to experiment with different parameters also change their values easily.
I have a Product collection that I want to query based on multiple parameters.
A couple of these are found here:
within product:
"delivery" : {
"maximum_delivery_days" : 30,
"average_delivery_days" : 10,
"source" : 1,
"filling_rate" : 85,
"stock" : 0
}
but also other parameters exist.
An example of such query to decide whether or not to include a product could be:
"$or" : [
{
"delivery.stock" : 1
},
{
"$or" : [
{
"$and" : [
{
"delivery.maximum_delivery_days" : {
"$lt" : 60
}
},
{
"delivery.filling_rate" : {
"$gt" : 90
}
}
]
},
{
"$and" : [
{
"delivery.maximum_delivery_days" : {
"$lt" : 40
}
},
{
"delivery.filling_rate" : {
"$gt" : 80
}
}
]
},
{
"$and" : [
{
"delivery.delivery_days" : {
"$lt" : 25
}
},
{
"delivery.filling_rate" : {
"$gt" : 70
}
}
]
}
]
}
]
Now to make this configurable, I need to be able to handle boolean logic, parameters and values.
So, I got the idea, since such query itself is JSON, to store it in Mongo and have my Java app retrieve it.
Next thing is using it in the filter (e.g. find, or whatever) and work on the corresponding selection of products.
The advantage of this approach is that I can actually analyse the data and the effectiveness of the query outside of my program.
I would store it by name in the database. E.g.
{
"name": "query1",
"query": { the thing printed above starting with "$or"... }
}
using:
db.queries.insert({
"name" : "query1",
"query": { the thing printed above starting with "$or"... }
})
Which results in:
2016-03-27T14:43:37.265+0200 E QUERY Error: field names cannot start with $ [$or]
at Error (<anonymous>)
at DBCollection._validateForStorage (src/mongo/shell/collection.js:161:19)
at DBCollection._validateForStorage (src/mongo/shell/collection.js:165:18)
at insert (src/mongo/shell/bulk_api.js:646:20)
at DBCollection.insert (src/mongo/shell/collection.js:243:18)
at (shell):1:12 at src/mongo/shell/collection.js:161
But I CAN STORE it using Robomongo, but not always. Obviously I am doing something wrong. But I have NO IDEA what it is.
If it fails, and I create a brand new collection and try again, it succeeds. Weird stuff that goes beyond what I can comprehend.
But when I try updating values in the "query", changes are not going through. Never. Not even sometimes.
I can however create a new object and discard the previous one. So, the workaround is there.
db.queries.update(
{"name": "query1"},
{"$set": {
... update goes here ...
}
}
)
doing this results in:
WriteResult({
"nMatched" : 0,
"nUpserted" : 0,
"nModified" : 0,
"writeError" : {
"code" : 52,
"errmsg" : "The dollar ($) prefixed field '$or' in 'action.$or' is not valid for storage."
}
})
seems pretty close to the other message above.
Needles to say, I am pretty clueless about what is going on here, so I hope some of the wizzards here are able to shed some light on the matter
I think the error message contains the important info you need to consider:
QUERY Error: field names cannot start with $
Since you are trying to store a query (or part of one) in a document, you'll end up with attribute names that contain mongo operator keywords (such as $or, $ne, $gt). The mongo documentation actually references this exact scenario - emphasis added
Field names cannot contain dots (i.e. .) or null characters, and they must not start with a dollar sign (i.e. $)...
I wouldn't trust 3rd party applications such as Robomongo in these instances. I suggest debugging/testing this issue directly in the mongo shell.
My suggestion would be to store an escaped version of the query in your document as to not interfere with reserved operator keywords. You can use the available JSON.stringify(my_obj); to encode your partial query into a string and then parse/decode it when you choose to retrieve it later on: JSON.parse(escaped_query_string_from_db)
Your approach of storing the query as a JSON object in MongoDB is not viable.
You could potentially store your query logic and fields in MongoDB, but you have to have an external app build the query with the proper MongoDB syntax.
MongoDB queries contain operators, and some of those have special characters in them.
There are rules for mongoDB filed names. These rules do not allow for special characters.
Look here: https://docs.mongodb.org/manual/reference/limits/#Restrictions-on-Field-Names
The probable reason you can sometimes successfully create the doc using Robomongo is because Robomongo is transforming your query into a string and properly escaping the special characters as it sends it to MongoDB.
This also explains why your attempt to update them never works. You tried to create a document, but instead created something that is a string object, so your update conditions are probably not retrieving any docs.
I see two problems with your approach.
In following query
db.queries.insert({
"name" : "query1",
"query": { the thing printed above starting with "$or"... }
})
a valid JSON expects key, value pair. here in "query" you are storing an object without a key. You have two options. either store query as text or create another key inside curly braces.
Second problem is, you are storing query values without wrapping in quotes. All string values must be wrapped in quotes.
so your final document should appear as
db.queries.insert({
"name" : "query1",
"query": 'the thing printed above starting with "$or"... '
})
Now try, it should work.
Obviously my attempt to store a query in mongo the way I did was foolish as became clear from the answers from both #bigdatakid and #lix. So what I finally did was this: I altered the naming of the fields to comply to the mongo requirements.
E.g. instead of $or I used _$or etc. and instead of using a . inside the name I used a #. Both of which I am replacing in my Java code.
This way I can still easily try and test the queries outside of my program. In my Java program I just change the names and use the query. Using just 2 lines of code. It simply works now. Thanks guys for the suggestions you made.
String documentAsString = query.toJson().replaceAll("_\\$", "\\$").replaceAll("#", ".");
Object q = JSON.parse(documentAsString);

MongoDB Performance with Upsert

We are trying to make a "real time" statistics part for our application,
and we want to use MongoDB.
So, to do this, I basically imagine a DB named storage. In this db, I create a statistics collection.
And I store my data like this :
{
"_id" : ObjectId("55642d270528055b171fedf5"),
"cat" : "module",
"name" : "Injector",
"ts_min" : ISODate("2015-05-22T13:16:00Z"),
"nb_action" : {
"0" : 156
},
"tps_action" : {
"0" : 45016
},
"min_tps" : 10,
"max_tps" : 879
}
So, I have a category, a name and a date to determine an unique Object. In this object, I store :
number of used per second (nb_action.[0..59])
Total time per second (tps_action.[0..59])
Min time
Max time
Now, to inject my data I use an Upsert method:
db.statistics.update({
ts_min: ISODate("2015-05-22T13:16:00.000Z"),
name: "Injector",
cat: "module"
},
{
$inc: {"nb_action.0":1, "tps_action.0":250},
$min: {min_tps:250},
$max: {max_tps:250}
},
{ upsert: true })
So, I perform 2 $inc to manage my counter and used $min and $max to manage my stats.
All of this works...
With 1 thread injecting 50.000 data on one single machine (no shard) (for 10 modules), I observe 3.000/3.500 ops per second.
And my problem is.... I can't say if it's good or not.
Any suggestions?
PS: I use long name field for the example and add a set part for initialize each second in case of insert

Search full document in mongodb for a match

Is there a way to match a value with every array and sub document inside the document in mongodb collection and return the document
{
"_id" : "2000001956",
"trimline1" : "abc",
"trimline2" : "xyz",
"subtitle" : "www",
"image" : {
"large" : 0,
"small" : 0,
"tiled" : 0,
"cropped" : false
},
"Kytrr" : {
"count" : 0,
"assigned" : 0
}
}
for eg if in the above document I am searching for xyz or "ab" or "xy" or "z" or "0" this document should be returned.
I actually have to achieve this at the back end using C# driver but a mongo query would also help greatly.
Please advice.
Thanks
You could probably do this using '$where'
db.mycollection({$where:"JSON.stringify(this).indexOf('xyz')!=-1"})
I'm converting the whole record to a big string and then searching to see if your element is in the resulting string. Probably won't work if your xyz is in the fieldnames!
You can make it iterate through the fields to make a big string and then search it though.
This isn't the most elegant way and will involve a full tablescan. It will be faster if you look through the individual fields!
While Malcolm's answer above would work, when your collection gets large or you have high traffic, you'll see this fall over pretty quickly. This is because of 2 things. First, dropping down to javascript is a big deal and second, this will always be a full table scan because $where can't use an index.
MongoDB 2.6 introduced text indexing which is on by default (it was in beta in 2.4). With it, you can have a full text index on all the fields in the document. The documentation gives the following example where a text index is created for every field and names the index "TextIndex".
db.collection.ensureIndex(
{ "$**": "text" },
{ name: "TextIndex" }
)

event streaming via mongodb. Get last inserted events

I consuming data from existing database. this database store system events. My service should check this database by timer, check if some new events created, then upload it and handle. Something like simple queue implementation.
The question is - how can I get new docs each time, when I check database. I can't use timestamps, because events goes to database from different sources and there are no any order for events. So I just need to use inserting order only.
There are a couple of options.
The first, and easiest if it matches your use case, is to use a capped collection. The capped collection is a collection as a pre-defined size that acts as a sort of ring-buffer. Once then collection is full it starts overwriting the documents. For iterating over the collection you simply create a "tailable" cursor you will need some way of identifying the "last document processed (even a simple "done" flag in the document could work but it would have to exist when the document is inserted). If you truly can't modify the documents in any way then you could even save off the last processed document somewhere and use a course time stamp to (approximate the start position) and look for the last document before processing more documents.
The only real issue with this solution is that you will be limited in the number of documents you can write in the collection and it won't grow over time. There are limits on the write operations you can perform on the documents (they can't grow) but it does not sound like you are modifying the documents.
The second option, which is more complex, is to use the oplog. For a standalone configuration you will need to still pass the -replSet option to create and use the oplog. You will just not configure the oplog. In a sharded configuration you will need to track each "replica set" separately. The oplog contains a document for each insert, update, delete done to all collections/documents on the server. Each entry contains a timestamp, operation and id (at a minimum). Here are examples of each.
Insert
{ "ts" : { "t" : 1362958492000, "i" : 1 },
"h" : NumberLong("5915409566571821368"), "v" : 2,
"op" : "i",
"ns" : "test.test",
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Delete
{ ... "op" : "d", ..., "b" : true,
"o" : { "_id" : "513d189c8544eb2b5e000001" } }
Update
{ ... "op" : "u", ...,
"o2" : { "_id" : "513d189c8544eb2b5e000001" },
"o" : { "$set" : { "i" : 1 } } }
The timestamps are generated on the server and are guaranteed to be monotonically increasing. which allows you to quickly find the documents of interest.
This option is the most robust but requires some work on your part.
I wrote some demo code to create a "watcher" on a collection that is almost what you want. You can find that code on GitHub. Specifically look at the code in the com.allanbank.mongodb.demo.coordination package.
HTH, Rob
You can actually use timestamps if your _id is of type ObjectId:
prefix = Math.floor((new Date( 2013 , 03 , 11 )).getTime()/1000).toString(16)
db.foo.find( { _id : { $gt : new ObjectId( prefix + "0000000000000000" ) } } )
This way, it doesn't matter where the source of the event was or when it was,
it only matters when document insertion was recorded (higher than previous timer)
Of course, it is schema-less and you can always set a field such as isNew to true,
and set it to false in conjunction with your query / cursor