Offset value before grouping - mongodb

I have a number of items in which one field is a UTC-based Unix timestamp, multiplied by 1000 in order to include milliseconds while keeping it a long (integer) value:
{
"title" : "Merkel 'explains' refugee convention to Trump in phone call",
"iso" : "2017-01-31T04:03:53.807+0000",
"id" : NumberLong(1485835433807)
}
{
"title" : "NASA to Explore an Asteroid Containing Enough Mineral Wealth to Collapse the World Economy",
"iso" : "2017-01-30T23:20:27.327+0000",
"id" : NumberLong(1485818427327)
}
{
"title" : "IMGKit: Python library of HTML to IMG wrapper",
"iso" : "2017-01-30T23:15:39.488+0000",
"id" : NumberLong(1485818139488)
}
The iso field is just a text string to ease debugging; it has no other purpose.
I intend to use the method described in https://stackoverflow.com/a/26550803/277267 to resample the items, to create a summary of items per day, initially just a count of items per day.
The problem is that the timestamp (the "id" field) can't really be used to achieve this, because of the UTC offset. Depending on the location of the user (or the local insertion time, i.e. 00:30 Monday local time vs. 23:30 Sunday UTC time, if the timezone is +1h), an item would belong either to one day or the other, so the field lacks this information.
Assuming I just want to add an offset to the "id" field, e.g. 3600000, which is one hour expressed in milliseconds, before starting to resample the data based on the "id" field, how can I achieve this in the aggregation pipeline?
Is there a way to have a first stage which takes the "id" field value, adds 3600000 to it and stores it into an "id_offsetted" field, on which I can then execute the next stages?

Version before 3.4
{$project: {
"title" : 1,
"iso" : 1,
"id" : 1,
"id_offsetted" : {$add: ["$id", 3600000]}
} }
Version 3.4 onwards
{$addFields: {
"id_offsetted" : {$add: ["$id", 3600000]}
} }
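The effect of that offset on day bucketing can be sketched in plain Python. The timestamps below are taken from the sample documents above, and the one-hour offset of 3600000 ms is the example value from the question:

```python
MS_PER_DAY = 86_400_000
OFFSET_MS = 3_600_000  # +1h, expressed in milliseconds

def day_bucket(ts_ms, offset_ms=0):
    """Return the number of whole days since the Unix epoch,
    after shifting the timestamp by the given offset."""
    return (ts_ms + offset_ms) // MS_PER_DAY

# 2017-01-30T23:20:27.327 UTC -- still Jan 30 in UTC...
print(day_bucket(1485818427327))             # day 17196 (2017-01-30)
# ...but Jan 31 once the +1h offset is applied
print(day_bucket(1485818427327, OFFSET_MS))  # day 17197 (2017-01-31)
```

This is exactly the edge case described in the question: an item inserted at 23:20 UTC lands in the next local day once the offset is applied, which is why the offset must be added before grouping.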

Related

MongoDB date with time as part of a compound index

Hello, I have objects like this one in the DB:
{ "_id" : ObjectId("56d9fc68bb9dcdcc9f73e6b7"), "ApplicationId" : 9, "CreatedDateUtc" : ISODate("2016-01-01T21:26:57.116Z"), "Message" : "yolo", "EventType" : 4 }
And I will search by the ApplicationId and CreatedDateUtc (optionally EventType) fields, so I made a compound index (I kept the default index on the _id field intact):
{
"v" : 1,
"key" : {
"ApplicationId" : 1,
"CreatedDateUtc" : -1
},
"name" : "ApplicationId_1_CreatedDateUtc_-1",
"ns" : "test.testLogs"
}
Is it a good idea to use a field which is unique almost every time (the date) as part of the index? If I get the whole index idea, this approach will bloat the index, making it harder to find stuff fast. Am I correct?
With 770k entries I have a ~19MB index. I have no idea whether that is large or not, but it seems big.
> db.testLogs.count()
770999
> db.testLogs.totalIndexSize()
18952192
I was thinking about making a field unique for each hour (maybe the date with the minutes and smaller parts floored) and using it for indexing. Any better ideas?
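The hour-truncation idea from the question can be sketched in plain Python (`floor_to_hour` is a hypothetical helper, not part of any driver; the sample date is the one from the document above):

```python
from datetime import datetime

def floor_to_hour(dt):
    """Zero out minutes, seconds, and microseconds so that all events
    within the same hour share a single index value."""
    return dt.replace(minute=0, second=0, microsecond=0)

created = datetime(2016, 1, 1, 21, 26, 57, 116000)
print(floor_to_hour(created))  # 2016-01-01 21:00:00
```

Storing this floored value in a separate field and indexing on it trades precision for far fewer distinct index keys.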

MongoDB Performance with Upsert

We are trying to make a "real time" statistics part for our application,
and we want to use MongoDB.
So, to do this, I basically imagine a DB named storage. In this DB, I create a statistics collection.
And I store my data like this:
{
"_id" : ObjectId("55642d270528055b171fedf5"),
"cat" : "module",
"name" : "Injector",
"ts_min" : ISODate("2015-05-22T13:16:00Z"),
"nb_action" : {
"0" : 156
},
"tps_action" : {
"0" : 45016
},
"min_tps" : 10,
"max_tps" : 879
}
So, I have a category, a name and a date to determine a unique object. In this object, I store:
Number of uses per second (nb_action.[0..59])
Total time per second (tps_action.[0..59])
Min time
Max time
Now, to inject my data I use an Upsert method:
db.statistics.update({
ts_min: ISODate("2015-05-22T13:16:00.000Z"),
name: "Injector",
cat: "module"
},
{
$inc: {"nb_action.0":1, "tps_action.0":250},
$min: {min_tps:250},
$max: {max_tps:250}
},
{ upsert: true })
So, I perform two $inc operations to manage my counters and use $min and $max to manage my stats.
All of this works...
With 1 thread injecting 50,000 data points on one single machine (no shard) (for 10 modules), I observe 3,000-3,500 ops per second.
And my problem is... I can't say whether that's good or not.
Any suggestions?
PS: I used long field names for the example, and I add a $set part to initialize each second in the case of an insert.
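The semantics of that upsert (one document per minute-bucket, with $inc applied to the per-second counters and $min/$max to the aggregates) can be simulated in plain Python with a dict. This is only a sketch of the update logic, not a driver call:

```python
def record_sample(stats, key, second, duration_ms):
    """Apply the equivalent of $inc on the per-second counters and
    $min/$max on the aggregates, upserting the document if missing."""
    doc = stats.setdefault(key, {"nb_action": {}, "tps_action": {},
                                 "min_tps": float("inf"),
                                 "max_tps": float("-inf")})
    s = str(second)
    doc["nb_action"][s] = doc["nb_action"].get(s, 0) + 1            # $inc by 1
    doc["tps_action"][s] = doc["tps_action"].get(s, 0) + duration_ms  # $inc by duration
    doc["min_tps"] = min(doc["min_tps"], duration_ms)               # $min
    doc["max_tps"] = max(doc["max_tps"], duration_ms)               # $max

stats = {}
key = ("module", "Injector", "2015-05-22T13:16")  # cat, name, ts_min
record_sample(stats, key, 0, 250)
record_sample(stats, key, 0, 45)
print(stats[key]["nb_action"]["0"])  # 2
print(stats[key]["min_tps"])         # 45
```

This also shows why the pattern scales: each (cat, name, ts_min) combination absorbs up to a minute's worth of writes into a single document.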

Mongodb tail subdocuments

I have a collection with users. Each user has comments. I want to track, for some specific users (according to their ids), whether there is a new comment.
Tailable cursors, I guess, are what I need, but my main problem is that I want to track subdocuments and not documents.
Sample of tracking documents in Python (pymongo; note that tailable cursors only work on capped collections):
import time

from pymongo import CursorType, MongoClient

db = MongoClient().my_db
coll = db.my_collection
cursor = coll.find(cursor_type=CursorType.TAILABLE)
while cursor.alive:
    try:
        doc = cursor.next()
        print(doc)
    except StopIteration:
        time.sleep(1)
One solution is to poll at some interval and see whether the number of comments has changed. However, I do not find the interval solution very appealing. Is there any better way to track changes? Probably with tailable cursors.
PS: I have a comment_id field (which is an ObjectID) in each comment.
Small update:
Since I have the comment_id ObjectId, I can store the biggest (= latest) one in each user, then at intervals compare whether it is still the latest one. I don't mind it not being a precisely real-time method; even 10 minutes of delay is fine. However, I now have 70k users and 180k comments, so I worry about the scalability of this method.
This would be my solution; evaluate whether it fits your requirements.
I am assuming a data structure as follows:
db.user.find().pretty()
{
"_id" : ObjectId("5335123d900f7849d5ea2530"),
"user_id" : 200,
"comments" : [
{
"comment_id" : 1,
"comment" : "hi",
"createDate" : ISODate("2012-01-01T00:00:00Z")
},
{
"comment_id" : 2,
"comment" : "bye",
"createDate" : ISODate("2013-01-01T00:00:00Z")
}
]
}
{
"_id" : ObjectId("5335123e900f7849d5ea2531"),
"user_id" : 201,
"comments" : [
{
"comment_id" : 3,
"comment" : "hi",
"createDate" : ISODate("2012-01-01T00:00:00Z")
},
{
"comment_id" : 4,
"comment" : "bye",
"createDate" : ISODate("2013-01-01T00:00:00Z")
}
]
}
I added a createDate attribute to the document. Add an index as follows -
db.user.ensureIndex({"user_id":1,"comments.createDate":-1})
You can search for the latest comments with the query -
db.user.find({"user_id":200,"comments.createDate":{$gt:ISODate('2012-12-31')}})
The time used for the "greater than" comparison would be the last-checked time. Since you are using an index, the search will be faster. You can follow the same idea of checking for new comments at some interval.
You can also use a UTC timestamp instead of ISODate. That way you don't have to worry about the BSON data type.
Note that while creating the index on createDate, I have specified a descending index.
If, over a period of time, you end up with too many comments within a user document, I would suggest that you move the comments to a different collection, using user_id as one of the attributes in each comment document. That will give better performance in the long run.
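The interval-check idea can be sketched without a database: keep the last-checked time and filter each user's comments against it. This plain-Python function mirrors the $gt query above, using the sample data from this answer:

```python
from datetime import datetime

users = [
    {"user_id": 200, "comments": [
        {"comment_id": 1, "comment": "hi", "createDate": datetime(2012, 1, 1)},
        {"comment_id": 2, "comment": "bye", "createDate": datetime(2013, 1, 1)},
    ]},
]

def new_comments(users, user_id, since):
    """Return one user's comments created after `since`,
    i.e. the client-side equivalent of the $gt query."""
    for user in users:
        if user["user_id"] == user_id:
            return [c for c in user["comments"] if c["createDate"] > since]
    return []

# Only comment_id 2 (created 2013-01-01) is newer than the cutoff.
print(new_comments(users, 200, datetime(2012, 12, 31)))
```

In the real deployment the filtering happens server-side via the indexed query; this sketch just makes the comparison logic explicit.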

Mongodb returns capitalized strings first when sorting

When I try to sort a collection by a string field (here, Title), the sorting does not work as expected. Please see below:
db.SomeCollection.find().limit(50).sort({ "Title" : -1 });
Actual Result order
"Title" : "geog.3 students' book"
"Title" : "geog.2 students' book"
"Title" : "geog.1 students' book"
"Title" : "Zoe and Swift"
"Title" : "Zip at the Theme Park"
"Title" : "Zip at the Supermarket"
Expected Result order
"Title" : "Zoe and Swift"
"Title" : "Zip at the Theme Park"
"Title" : "Zip at the Supermarket"
"Title" : "geog.3 students' book"
"Title" : "geog.2 students' book"
"Title" : "geog.1 students' book"
The same issue occurs when I try to sort by a Date field.
Any suggestions?
Update: Version 3.4 has case insensitive indexes
This is a known issue. MongoDB doesn't support locale-aware lexicographical sorting for strings (JIRA: String lexicographical ordering). You should sort the results in your application code, or sort using a numeric field. It should sort date fields reliably, though. Can you give an example where sorting by date doesn't work?
What exactly surprises you?
It sorts based on the numerical representation of each symbol. If you look at an ASCII table (MongoDB stores strings in UTF-8, so this is just for illustration), you will see that the uppercase letters have lower code points than the lowercase letters, so in ascending order they go in front.
MongoDB cannot sort letters based on localization or case-insensitively.
In your case, g has a higher code point than Z, so it goes first (sorting in decreasing order). And then 3 has a higher code point than 2 and 1. So basically everything is correct.
If you use aggregation, the expected output is possible; see below:
db.collection.aggregate([
{
"$project": {
"Title": 1,
"output": { "$toLower": "$Title" }
}},
{ "$sort": { "output":-1 } },
{"$project": {"Title": 1, "_id":0}}
])
It will give you the expected output, as below:
{
"result" : [
{
"Title" : "Zoe and Swift"
},
{
"Title" : "Zip at the Theme Park"
},
{
"Title" : "Zip at the Supermarket"
},
{
"Title" : "geog.3 students' book"
},
{
"Title" : "geog.2 students' book"
},
{
"Title" : "geog.1 students' book"
}
],
"ok" : 1
}
Starting with the dates not sorting correctly....
If you're storing a date as a string, it needs to be sortable as a string. It's quite simple:
2013-11-08 // yyyy-mm-dd (the dashes would be optional)
As long as every piece of the date string is padded with 0 correctly, the strings will all sort naturally and in the way you would expect.
A full date-time is typically stored in UTC:
2013-11-23T10:46:01.914Z
But rather than storing a date value as a string, I'd also suggest you consider whether using a native MongoDB Date would make more sense (reference). If you look at MongoDB's aggregation framework, you'll find there are many functions that can manipulate these dates, while a string is very limited.
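The point about zero-padded date strings is easy to verify: for ISO-8601-style strings, lexicographic order and chronological order coincide, because every field is fixed-width and ordered from most to least significant:

```python
dates = ["2013-11-23T10:46:01.914Z", "2013-11-08", "2012-01-31"]

# A plain string sort yields chronological order: the year is compared
# first, then month, then day, each zero-padded to a fixed width.
print(sorted(dates))
```

This is exactly why "2013-11-08" works as a sortable date string, while an unpadded form like "2013-11-8" would not.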
As to the string sorting, it's been pointed out that it sorts the way a computer stores the data rather than the way a person would sort. If you consider that each string is stored as its ASCII/UTF-8 byte representation, you should see why the sorting works the way it does:
Zoe = [90, 111, 101]
geo = [103, 101, 111]
If you were to sort those in descending order as you've specified, you should see that "geo"'s internal byte representation is larger than that of "Zoe" (103 sorting higher than 90 in this case).
Typically, the recommendation when using MongoDB is to store the string twice if you need to sort a string that has mixed case:
The original string ("Title")
A normalized string: for example, all lowercase, possibly with accented characters also converted to a common character. You'd end up with a new field named, for example, "SortedTitle"; your code would use that to sort, but display the actual "Title" to users.
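The two-field recommendation can be sketched in Python with the question's own titles: sort on a lowercased copy, display the original ("SortedTitle" above is a hypothetical field name):

```python
titles = ["geog.3 students' book", "Zoe and Swift", "Zip at the Theme Park"]

# Byte-wise descending sort: 'g' (103) outranks 'Z' (90), reproducing
# the surprising order the question reports.
print(sorted(titles, reverse=True)[0])                 # geog.3 students' book

# Case-insensitive descending sort via a normalized key, giving the
# order the question expected.
print(sorted(titles, key=str.lower, reverse=True)[0])  # Zoe and Swift
```

Storing the normalized form as its own indexed field moves this `key=str.lower` step from the application into the database.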
If you are using Ruby on Rails with MongoMapper, then follow the steps below.
I have used a model named Abc and fetch the result for Title.
#test_abc_details_array_full=Abc.collection.aggregate([
{"$project"=> {
"Title"=> 1,
"output"=> { "$toLower"=> "$Title" }
}},
{ "$sort"=> { "output"=>1 } },
{"$project"=> {Title: 1, _id:0}},
]);

Get the latest record from mongodb collection

I want to know the most recent record in a collection. How do I do that?
Note: I know the following command line queries works:
1. db.test.find().sort({"idate":-1}).limit(1).forEach(printjson);
2. db.test.find().skip(db.test.count()-1).forEach(printjson)
where idate has the timestamp added.
The problem is that the longer the collection is, the longer it takes to get the data back, and my 'test' collection is really, really huge. I need a query with a constant-time response.
If there is a better MongoDB command-line query, do let me know.
This is a rehash of the previous answer but it's more likely to work on different mongodb versions.
db.collection.find().limit(1).sort({$natural:-1})
This will give you the last document of a collection:
db.collectionName.findOne({}, {sort:{$natural:-1}})
$natural:-1 means the order opposite to the one in which the records were inserted.
Edit: for all the downvoters, the above is Mongoose syntax;
the mongo CLI syntax is: db.collectionName.find({}).sort({$natural:-1}).limit(1)
Yet another way of getting the last item from a MongoDB collection (don't mind the examples):
> db.collection.find().sort({'_id':-1}).limit(1)
Normal Projection
> db.Sports.find()
{ "_id" : ObjectId("5bfb5f82dea65504b456ab12"), "Type" : "NFL", "Head" : "Patriots Won SuperBowl 2017", "Body" : "Again, the Pats won the Super Bowl." }
{ "_id" : ObjectId("5bfb6011dea65504b456ab13"), "Type" : "World Cup 2018", "Head" : "Brazil Qualified for Round of 16", "Body" : "The Brazilians are happy today, due to the qualification of the Brazilian Team for the Round of 16 for the World Cup 2018." }
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
Sorted Projection ( _id: reverse order )
> db.Sports.find().sort({'_id':-1})
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
{ "_id" : ObjectId("5bfb6011dea65504b456ab13"), "Type" : "World Cup 2018", "Head" : "Brazil Qualified for Round of 16", "Body" : "The Brazilians are happy today, due to the qualification of the Brazilian Team for the Round of 16 for the World Cup 2018." }
{ "_id" : ObjectId("5bfb5f82dea65504b456ab12"), "Type" : "NFL", "Head" : "Patriots Won SuperBowl 2018", "Body" : "Again, the Pats won the Super Bowl" }
sort({'_id':-1}) returns all documents sorted in descending order by their _ids.
Sorted Projection ( _id: reverse order ): getting the latest (last) document from a collection.
> db.Sports.find().sort({'_id':-1}).limit(1)
{ "_id" : ObjectId("5bfb60b1dea65504b456ab14"), "Type" : "F1", "Head" : "Ferrari Lost Championship", "Body" : "By two positions, Ferrari loses the F1 Championship, leaving the Italians in tears." }
I need a query with constant time response
By default, the indexes in MongoDB are B-trees. Searching a B-tree is an O(log N) operation, so even find({_id:...}) will not provide constant-time, O(1) responses.
That said, you can also sort by _id if you are using ObjectId for your IDs. See here for details. Of course, even that is only accurate to the last second.
You may have to resort to "writing twice": write once to the main collection and again to a "last updated" collection. Without transactions this will not be perfect, but with only one item in the "last updated" collection it will always be fast.
php7.1 mongoDB:
$data = $collection->findOne([],['sort' => ['_id' => -1],'projection' => ['_id' => 1]]);
My solution (Node.js driver syntax):
db.collection("name of collection").find({}, {limit: 1}).sort({$natural: -1})
If you are using auto-generated MongoDB ObjectIds in your documents, the first 4 bytes contain a timestamp, from which the latest document inserted into the collection can be found. I understand this is an old question, but if someone is still ending up here looking for one more alternative:
db.collectionName.aggregate([
  {$group: {_id: null, latestDocId: {$max: "$_id"}}},
  {$project: {_id: 0, latestDocId: 1}}
])
The above query gives the _id of the latest document inserted into the collection.
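Because the first 4 bytes of an ObjectId are a big-endian creation timestamp (seconds since the Unix epoch), comparing same-length ObjectId hex strings also compares creation times, which is why $max on "$_id" finds the newest document. A plain-Python sketch using the sample ids from the Sports collection above:

```python
def objectid_seconds(hex_id):
    """Extract the creation time (Unix seconds) from the first
    4 bytes of a 24-character ObjectId hex string."""
    return int(hex_id[:8], 16)

older = "5bfb5f82dea65504b456ab12"  # the NFL document
newer = "5bfb60b1dea65504b456ab14"  # the F1 document

print(objectid_seconds(newer) > objectid_seconds(older))  # True
print(max(older, newer) == newer)  # hex-string order matches time order here
```

The caveat from the other answers still applies: the embedded timestamp has only one-second resolution, so ids created within the same second are ordered by their remaining bytes, not by insertion time.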
This is how to get the last record among all MongoDB documents in the "foo" collection (change foo, x, y, etc. to your own names):
db.foo.aggregate([
  {$sort: {x: 1, date: 1}},
  {$group: {_id: "$x", y: {$last: "$y"}, yz: {$last: "$yz"}, date: {$last: "$date"}}}
], {allowDiskUse: true})
You can add fields to, or remove them from, the $group stage.
help articles: https://docs.mongodb.com/manual/reference/operator/aggregation/group/#pipe._S_group
https://docs.mongodb.com/manual/reference/operator/aggregation/last/
Mongo CLI syntax:
db.collectionName.find({}).sort({$natural:-1}).limit(1)
Let Mongo create the _id: it embeds a creation timestamp, so it increases (roughly) with insertion time.
pymongo:
self._collection.find().sort("_id", -1).limit(1)