Update query for the given case in MongoDB - mongodb

{ "_id" :,
"final_terms" : [
{
"np" : "the role",
"tf" : 28571.000,
"idf" : 0
}]
}
How can I set the flag to 1 for the top 30% of documents, sorted in decreasing order by final_terms.idf, and to 0 for the rest? The desired result looks like this:
{ "_id" :,
"final_terms" : [
{
"np" : "the role",
"tf" : 28571.000,
"idf" : 0,
"flag": 0
}]
}
I am new to MongoDB and I need to do this for NLP. The MongoDB docs are light on detail, and it is difficult to get a grip on MongoDB using them.

I would do this in steps. Firstly, you need to know how many documents will be in your result set, so that you can figure out what the top 30% is. Secondly, you do a query that will sort the documents in decreasing order by final_terms.idf and figure out what the value of final_terms.idf is for the last document in the top 30% of the result set. Once you know that, you can update all documents with a final_terms.idf value greater than or equal to that with flag: 1 and all others with flag: 0. The exact implementation would depend on your programming language, but an implementation in the mongo shell would look as follows:
// Get count
> db.collection.find().count();
100
Now you know that you have 100 documents, so the top 30% will be the first 30 documents. Skip the first 29 in the sorted results and find the value for the 30th document:
// Sort and get value for 30th document
> db.collection.find({}, { "final_terms.idf" : 1, "_id" : 0} ).sort({ "final_terms.idf" : -1 }).skip(29).limit(1);
{ "final_terms" : [ { "idf" : <SOME_VALUE> } ] }
You now have the value at the bottom limit of the first 30%. Use that value to do the respective updates:
// Update top 30% (final_terms is an array, so the positional $ targets the first matching element)
db.collection.update({ "final_terms.idf" : { $gte : <SOME_VALUE> }}, { $set : { "final_terms.$.flag" : 1 } }, { "multi" : true });
// Update bottom 70%
db.collection.update({ "final_terms.idf" : { $lt : <SOME_VALUE> }}, { $set : { "final_terms.$.flag" : 0 } }, { "multi" : true });
That should give you an idea of how to solve your problem.
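The arithmetic behind the steps above can be checked without a database. The following is a minimal sketch in plain JavaScript, assuming each document contributes a single idf value. The helper name flagTopPercent and the sample values are made up for illustration. Ties at the threshold all receive flag 1, just as the $gte update would treat them.

```javascript
// Hypothetical helper: mirrors the count / sort / skip / update steps in memory.
function flagTopPercent(idfValues, percent) {
  // Step 1: how many values fall in the top percentage (at least one).
  const topCount = Math.max(1, Math.ceil((idfValues.length * percent) / 100));

  // Step 2: sort descending and read the value at the last top position,
  // the equivalent of .sort({ ... : -1 }).skip(topCount - 1).limit(1).
  const sorted = [...idfValues].sort((a, b) => b - a);
  const threshold = sorted[topCount - 1];

  // Step 3: flag 1 for values >= threshold ($gte), flag 0 for the rest ($lt).
  return idfValues.map((idf) => ({ idf, flag: idf >= threshold ? 1 : 0 }));
}

// Ten sample values, so the top 30% is the three largest.
const flagged = flagTopPercent([5, 1, 9, 3, 7, 2, 8, 4, 6, 0], 30);
console.log(flagged.filter((t) => t.flag === 1).map((t) => t.idf)); // [ 9, 7, 8 ]
```

Passing the percentage as an integer keeps the cut-off calculation free of floating point surprises such as 0.1 * 3.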

Related

How to improve aggregate pipeline

I have this pipeline:
[
{'$match':{templateId:ObjectId('blabla')}},
{
"$sort" : {
"_id" : 1
}
},
{
"$facet" : {
"paginatedResult" : [
{
"$skip" : 0
},
{
"$limit" : 100
}
],
"totalCount" : [
{
"$count" : "count"
}
]
}
}
]
Index:
"key" : {
"templateId" : 1,
"_id" : 1
}
The collection has 10.6M documents, 500k of which have the needed templateId.
The aggregation uses the index:
"planSummary" : "IXSCAN { templateId: 1, _id: 1 }",
But the request takes 16 seconds. What did I do wrong? How can I speed it up?
To start, you should get rid of the $sort stage. The documents are already guaranteed to come back sorted by _id because of the { templateId: 1, _id: 1 } index, so the stage ends up sorting 500k documents that are already in order.
Next, you shouldn't use the $skip approach. For high page numbers you will skip a large number of documents, up to almost 500k (index entries rather than documents, but still).
I suggest an alternative approach:
For the first page, calculate an _id that you know for sure sorts before every entry in the index. Say, if you know that you don't have entries dated 2019 or earlier, you can build a starting value like this:
var pageStart = ObjectId.fromDate(new Date("2020/01/01"))
Then, your match operator should look like this:
{'$match' : {templateId:ObjectId('blabla'), _id: {$gt: pageStart}}}
For the next pages, keep track of the last document of the previous page: if the _id of the last document on a page is x, then pageStart should be x for the next page.
So your pipeline may look like this:
[
{'$match' : {templateId:ObjectId('blabla'), _id: {$gt: pageStart}}},
{
"$facet" : {
"paginatedResult" : [
{
"$limit" : 100
}
]
}
}
]
Note that the $skip stage is now missing from the $facet stage as well.
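To see how pageStart moves from page to page, here is a minimal in-memory sketch in plain JavaScript. It is only an illustration under stated assumptions: plain numbers stand in for ObjectId values, a sorted array stands in for the index, and fetchPage is a made-up helper. A real index would seek straight to pageStart instead of scanning from the beginning.

```javascript
// Hypothetical stand-in for the index on { templateId: 1, _id: 1 }:
// documents kept sorted by _id, with numbers in place of ObjectIds.
const indexedDocs = Array.from({ length: 250 }, (_, i) => ({ _id: i + 1 }));

// Equivalent of { $match: { _id: { $gt: pageStart } } } followed by { $limit: limit }.
function fetchPage(sortedDocs, pageStart, limit) {
  const page = [];
  for (const doc of sortedDocs) {
    if (doc._id > pageStart) page.push(doc);
    if (page.length === limit) break;
  }
  return page;
}

// Walk through every page, carrying the last _id forward as the next pageStart.
let pageStart = 0; // smaller than any _id in the sample data
const pageSizes = [];
while (true) {
  const page = fetchPage(indexedDocs, pageStart, 100);
  if (page.length === 0) break;
  pageSizes.push(page.length);
  pageStart = page[page.length - 1]._id;
}
console.log(pageSizes); // [ 100, 100, 50 ]
```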

MongoDB querying to with changing values for key

I'm trying to get back into MongoDB and I've come across something that I can't figure out.
I have this data structure
> db.ratings.find().pretty()
{
"_id" : ObjectId("55881e43424cbb1817137b33"),
"e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
"type" : "like",
"time" : 1434984003156,
"u_id" : ObjectId("55817c072e48b4b60cf366a7")
}
{
"_id" : ObjectId("55893be1e6a796c0198e65d3"),
"e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
"type" : "dislike",
"time" : 1435057121808,
"u_id" : ObjectId("55817c072e48b4b60cf366a7")
}
{
"_id" : ObjectId("55893c21e6a796c0198e65d4"),
"e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
"type" : "null",
"time" : 1435057185089,
"u_id" : ObjectId("55817c072e48b4b60cf366a7")
}
What I want to be able to do is count the documents that have either a like or dislike leaving the "null" out of the count. So I should have a count of 2. I tried to go about it like this whereby I set the query to both fields:
db.ratings.find({e_id: ObjectId("5565e106cd7a763b2732ad7c")}, {type: "like", type: "dislike"})
But this just prints out all three documents. Is there any reason?
If it's glaringly obvious, I'm sorry; I'm pulling out my hair at the moment.
Use the db.collection.count() method, which returns the count of documents that would match a find() query:
db.ratings.count({
"e_id": ObjectId("5565e106cd7a763b2732ad7c"),
type: {
"$in": ["like", "dislike"]
}
})
The db.collection.count() method is equivalent to the db.collection.find(query).count() construct. Your query selection criteria above can be interpreted as:
Get me the count of all documents which have the e_id field values as ObjectId("5565e106cd7a763b2732ad7c") AND the type field which has either value "like" or "dislike", as depicted by the $in operator that selects the documents where the value of a field equals any value in the specified array.
db.ratings.find({e_id: ObjectId("5565e106cd7a763b2732ad7c")}, {type: "like", type: "dislike"})
But this just prints out all three documents. Is there any reason? If its glaringly obvious im sorry pulling out my hair at the moment.
The second argument here is the projection used by the find method. It specifies which fields should be included, regardless of their value. Normally, you specify 1 or true to include a field. Evidently, MongoDB accepts other values as true as well.
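There is a second, purely JavaScript reason the original call cannot work as intended: an object literal cannot hold the same key twice, so the later value silently replaces the earlier one before anything is sent to the server. A short sketch that runs in any JavaScript engine:

```javascript
// The same key written twice in an object literal: the later value wins.
const projection = { type: "like", type: "dislike" };

console.log(Object.keys(projection)); // [ 'type' ]
console.log(projection.type); // dislike
```

So even as a query document, this literal could never express "like or dislike"; that needs an operator such as $in or $or.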
If you only need to count documents, you should issue a count command:
> db.runCommand({count: 'collection',
query: { "e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
type: { $in: ["like", "dislike"]}}
})
{ "n" : 2, "ok" : 1 }
Please note the Mongo Shell provides the count helper for that:
> db.collection.find({ "e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
type: { $in: ["like", "dislike"]}}).count()
2
That being said, to quote the documentation, using the count command "can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress." To avoid that, you might prefer using the aggregation framework:
> db.collection.aggregate([
{ $match: { "e_id" : ObjectId("5565e106cd7a763b2732ad7c"),
type: { $in: ["like", "dislike"]}}},
{ $group: { _id: null, n: { $sum: 1 }}}
])
{ "_id" : null, "n" : 2 }
This query should solve your problem
db.ratings.find({$or : [{"type": "like"}, {"type": "dislike"}]}).count()
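For equality checks on a single field, $in and $or select the same documents. The following plain JavaScript sketch imitates both filters on data shaped like the question's three ratings. The array and variable names are made up, and strings stand in for ObjectId values.

```javascript
// In-memory copy of the three sample ratings (ObjectIds shortened to strings).
const ratings = [
  { e_id: "5565e106cd7a763b2732ad7c", type: "like" },
  { e_id: "5565e106cd7a763b2732ad7c", type: "dislike" },
  { e_id: "5565e106cd7a763b2732ad7c", type: "null" },
];

// Imitates { type: { $in: ["like", "dislike"] } }
const wantedTypes = ["like", "dislike"];
const countWithIn = ratings.filter((r) => wantedTypes.includes(r.type)).length;

// Imitates { $or: [{ type: "like" }, { type: "dislike" }] }
const countWithOr = ratings.filter(
  (r) => r.type === "like" || r.type === "dislike"
).length;

console.log(countWithIn, countWithOr); // 2 2
```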

MongoDB - Get highest value of child

I'm trying to get the highest value of a child value. If I have two documents like this
{
"_id" : ObjectId("5585b8359557d21f44e1d857"),
"test" : {
"number" : 1,
"number2" : 1
}
}
{
"_id" : ObjectId("5585b8569557d21f44e1d858"),
"test" : {
"number" : 2,
"number2" : 1
}
}
How would I get the highest value of key "number"?
Using dot notation:
db.testSOF.find().sort({'test.number': -1}).limit(1)
To get the highest value of the key "number", you could use two approaches here. You could use the aggregation framework, where the pipeline would look like this:
db.collection.aggregate([
{
"$group": {
"_id": 0,
"max_number": {
"$max": "$test.number"
}
}
}
])
Result:
/* 0 */
{
"result" : [
{
"_id" : 0,
"max_number" : 2
}
],
"ok" : 1
}
or you could use the find() cursor as follows
db.collection.find().sort({"test.number": -1}).limit(1)
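Both approaches can be compared on the two sample documents with a small in-memory sketch in plain JavaScript. The array name is made up and the _id values are shortened to strings. The reduce call plays the role of $group with $max, and the sort plays the role of sort(...).limit(1).

```javascript
// The two sample documents from the question (ObjectIds shortened to strings).
const sampleDocs = [
  { _id: "5585b8359557d21f44e1d857", test: { number: 1, number2: 1 } },
  { _id: "5585b8569557d21f44e1d858", test: { number: 2, number2: 1 } },
];

// Approach 1: a single running maximum, like $group with $max.
const maxNumber = sampleDocs.reduce(
  (max, doc) => Math.max(max, doc.test.number),
  -Infinity
);

// Approach 2: sort descending on test.number and take the first document.
const topDoc = [...sampleDocs].sort((a, b) => b.test.number - a.test.number)[0];

console.log(maxNumber); // 2
console.log(topDoc._id); // 5585b8569557d21f44e1d858
```

The first approach returns only the value, while the second returns the whole document that holds it.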
max() does not work the way you would expect it to in SQL for Mongo. This is perhaps going to change in future versions but as of now, max,min are to be used with indexed keys primarily internally for sharding.
see http://www.mongodb.org/display/DOCS/min+and+max+Query+Specifiers
Unfortunately for now the only way to get the max value is to sort the collection desc on that value and take the first.
db.collection.find("_id" => x).sort({"test.number" => -1}).limit(1).first()
quoted from: Getting the highest value of a column in MongoDB

MongoDB fetch documents with sort by count

I have a document with sub-document which looks something like:
{
"name" : "some name1",
"like" : [
{ "date" : ISODate("2012-11-30T19:00:00Z") },
{ "date" : ISODate("2012-12-02T19:00:00Z") },
{ "date" : ISODate("2012-12-01T19:00:00Z") },
{ "date" : ISODate("2012-12-03T19:00:00Z") }
]
}
Is it possible to fetch the "most liked" documents (average value for the last 7 days) and sort them by the count?
There are a few different ways to solve this problem. The solution I will focus on uses mongodb's aggregation framework. First, here is an aggregation pipeline that will solve your problem, following it will be an explanation/breakdown of what is happening in the command.
db.testagg.aggregate(
{ $unwind : '$likes' },
{ $group : { _id : '$_id', numlikes : { $sum : 1 }}},
{ $sort : { 'numlikes' : 1}})
This pipeline has 3 main commands:
1) Unwind: this splits up the 'likes' field so that there is one 'likes' element per document.
2) Group: this regroups the documents by the _id field, incrementing the numlikes field for every document it finds. As a result, numlikes equals the number of elements that were in "likes" before.
3) Sort: finally, we sort the results in ascending order by numlikes. In a test I ran, the output of this command is:
{"result" : [
{
"_id" : 1,
"numlikes" : 1
},
{
"_id" : 2,
"numlikes" : 2
},
{
"_id" : 3,
"numlikes" : 3
},
{
"_id" : 4,
"numlikes" : 4
}....
This is for data inserted via:
for (var i=0; i < 100; i++) {
db.testagg.insert({_id : i})
for (var j=0; j < i; j++) {
db.testagg.update({_id : i}, {'$push' : {'likes' : j}})
}
}
Note that this does not completely answer your question as it avoids the issue of picking the date range, but it should hopefully get you started and moving in the right direction.
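One possible way to add the date range is a $match stage between $unwind and $group. The sketch below is plain JavaScript that only builds the pipeline array, so it runs without a database. It assumes the array field is called like and each entry carries a date, as in the question. The helper name buildRecentLikesPipeline is made up, and the sort is descending so the most liked documents come first.

```javascript
// Hypothetical helper: builds a pipeline that counts only likes inside a time window.
function buildRecentLikesPipeline(now, days) {
  const since = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return [
    { $unwind: "$like" }, // one document per like entry
    { $match: { "like.date": { $gte: since } } }, // keep likes inside the window
    { $group: { _id: "$_id", numlikes: { $sum: 1 } } }, // count likes per document
    { $sort: { numlikes: -1 } }, // most liked first
  ];
}

const recentPipeline = buildRecentLikesPipeline(new Date("2012-12-04T00:00:00Z"), 7);
console.log(recentPipeline[1].$match["like.date"].$gte.toISOString());
// 2012-11-27T00:00:00.000Z
```

In the shell this would be passed along as db.collection.aggregate(buildRecentLikesPipeline(new Date(), 7)). Documents without any likes in the window drop out of the result, and dividing numlikes by 7 on the client gives the daily average.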
Of course, there are other ways to solve this problem. One solution might be to just do all of the sorting and manipulations client-side. This is just one method for getting the information you desire.
EDIT: If you find this somewhat tedious, there is a ticket to add a $size operator to the aggregation framework. I invite you to watch it and potentially upvote it to speed up the addition of this new operator if you are interested.
https://jira.mongodb.org/browse/SERVER-4899
A better solution would be to keep a count field that will record how many likes for this document. While you can use aggregation to do this, the performance will likely be not very good. Having a index on the count field will make read operation fast, and you can use atomic operation to increment the counter when inserting new likes.
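A minimal sketch of that idea, in plain JavaScript that only builds the update document: the counter field name likeCount and the helper buildLikeUpdate are assumptions made for illustration, not part of the original schema.

```javascript
// Hypothetical helper: one update document that records a like and bumps the counter.
function buildLikeUpdate(date) {
  return {
    $push: { like: { date } }, // append the new like entry
    $inc: { likeCount: 1 }, // keep the stored count in step with the array
  };
}

const likeUpdate = buildLikeUpdate(new Date("2012-12-04T19:00:00Z"));
console.log(Object.keys(likeUpdate)); // [ '$push', '$inc' ]
```

Because both operators travel in a single update on one document, the array and the counter change together. In the shell this could be used as db.collection.update({ _id: someId }, buildLikeUpdate(new Date())), where someId is a placeholder, and the "most liked" query becomes a plain sort on likeCount.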
From MongoDB v3.4 onwards, you can simplify the above aggregation query as follows:
> db.test.aggregate([
{ $unwind: "$like" },
{ $sortByCount: "$_id" }
]).pretty()
{ "_id" : ObjectId("5864edbfa4d3847e80147698"), "count" : 4 }
Also, as @ACE said, you can now use $size within a projection instead:
db.test.aggregate([
{ $project: { count: { $size : "$like" } } }
]);
{ "_id" : ObjectId("5864edbfa4d3847e80147698"), "count" : 4 }

mongodb micro-optimization of batch inserts ? or is this an important optimization?

Premise: update statements are harmless since the driver by default works with one-way messaging (as long as getLastError isn't used).
Question: Is the following fragment the best way to do this in MongoDB for high-volume inserts? Is it possible to fold steps 2 and 3?
Edit: old buggy form, see below
// step 1 : making sure the top-level document is present (an upsert in the real example)
db.test.insert( { x :1} )
// step 2 : making sure the sub-document entry is present
db.test.update( { x:1 }, { "$addToSet" : { "u" : { i : 1, p : 2 } } }, false)
// step 3 : increment an integer within the sub-document
db.test.update( { x : 1, "u.i" : 1}, { "$inc" : { "u.$.c" : 1 } },false)
I have a feeling there is no way out of operation 3, since the $ operator requires priming in the query part of an update. amirite? iamrite?
If this is the best way to do things, can I get creative in my code and go nuts with update operations ?
edit : new form
There was a bug in my logic, thanks Gates. Still want to fold the updates if possible :D
// make sure the top-level entry exists and increase the incidence counter
db.test.update( { x : 1 }, { $inc : { i : 1 } }, true ) // 1
// implicitly creates the array
db.test.update( { x : 1 , u : { $not : { $elemMatch : { i : 1 } } } } ,
{ $push : { u : { i : 1 , p :2 , c:0} } }) // 2
db.test.update( { x :1 , "u.i" : 1}, { $inc : { "u.$.c" : 1 } },false) // 3
Notes: $addToSet is not useful in this case since it does an element-wise match; there is no way to express which elements in an array may be mutable, as in C++ OO bitwise comparison parlance.
The question is pointless. The data model is wrong. Please vote to close (OP).
So, the first thing to note is that the $ positional operator is a little sketchy. It has a lot of "gotchas": it doesn't play well with upserts, it only affects the first true match, etc.
To understand "folding" of #2 and #3, you need to look at the output of your commands:
db.test.insert( { x :1} )
{ x:1 } // DB value
db.test.update( { x:1 }, { "$addToSet" : { "u" : { i : 1, p : 2 } } }, false)
{ x:1, u: [ {i:1,p:2} ] } // DB value
db.test.update( { x : 1, "u.i" : 1}, { "$inc" : { "u.$.c" : 1 } },false)
{ x:1, u: [ {i:1,p:2,c:1} ] } // DB value
Based on the sequence you provided, the whole thing can be rolled into a single update.
If you're only looking to roll together #2 & #3, then you're worried about matching 'u.i':1 with u.$.c. But there are some edge cases here you have to clarify.
Let your starting document be the following:
{
x:1,
u: [
{i:1, p:2},
{i:1, p:3}
]
}
What do you expect from running update #3?
As written you get:
{
x:1,
u: [
{i:1, p:2, c:1},
{i:1, p:3}
]
}
Is this correct? Is that first document legal (semantically correct)? Depending on the answers, this may actually be an issue of document structure.
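The first-match behaviour in this edge case can be imitated in plain JavaScript. This is only an in-memory sketch: incFirstMatch is a made-up helper that mimics what update #3 does to a single document, not a MongoDB API.

```javascript
// Hypothetical in-memory imitation of
//   update({ x: 1, "u.i": 1 }, { $inc: { "u.$.c": 1 } })
// The positional $ resolves to the first array element that matches the query.
function incFirstMatch(doc, i) {
  const index = doc.u.findIndex((entry) => entry.i === i);
  if (index === -1) return doc; // no matching element: nothing to update
  const u = doc.u.map((entry, n) =>
    n === index ? { ...entry, c: (entry.c || 0) + 1 } : entry
  );
  return { ...doc, u };
}

const startingDoc = { x: 1, u: [{ i: 1, p: 2 }, { i: 1, p: 3 }] };
const updatedDoc = incFirstMatch(startingDoc, 1);
console.log(JSON.stringify(updatedDoc));
// {"x":1,"u":[{"i":1,"p":2,"c":1},{"i":1,"p":3}]}
```

Only the first entry with i equal to 1 gains a counter; the second is left untouched, which is why the uniqueness of i within u matters for the data model.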