MongoDB Java API - How do I perform a Group By operation?

I'm struggling to find an answer to what is probably a simple question, and my question on the MongoDB forum has gone unanswered for days.
I'm trying to perform a Group By operation in the MongoDB Java Aggregation API and failing miserably.
I have this section of old GMongo code that performs the Group By:
[$group: [_id : [shop_id: '$shop.id'], // this "_id" expression isn't an Accumulator operation
count: [$sum: 1],
n : [$first: '$shop.n'],
dn : [$first: '$shop.dn']]]
And I'm trying to convert it to the new Java API like so:
// MongoDB Java Sync Driver API query
final List<Bson> aggregationPipeline = asList(
        Aggregates.match(Filters.and(Filters.gte("date.d", fromDate), Filters.lte("date.d", toDate))),
        Aggregates.match(Filters.eq("product.cl", clientId)),
        Aggregates.match(Filters.in("platform.id", platformIds)),
        Aggregates.match(Filters.exists("shop.inf." + clientId, true)),
        Aggregates.match(Filters.exists("shop.cl_pt." + clientId, false)),
        Aggregates.match(Filters.eq("product.a", true)),
        Aggregates.group(null,
                asList(
                        // _id : [shop_id: '$shop.id'] // how to replicate this Group By line in Java API?
                        Accumulators.sum("count", 1),
                        Accumulators.first("n", "$shop.n"),
                        Accumulators.first("dn", "$shop.dn")
                )
        )
);
I can replicate the logic of the last three lines of the group statement by using Accumulators (sum, first, etc.), but the very first line, this one:
_id : [shop_id: '$shop.id']
is what is confusing me. I can't figure out how to replicate it in the Java API: it's not an Accumulator operation, and it looks more like an expression that I can't find any documentation on.
Can anyone help? This one issue has held me up for a couple of days now.
Any clarification is much appreciated.
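For reference, in the sync Java driver the group key is not an accumulator at all: it is the first argument of Aggregates.group(), and it accepts an arbitrary expression, such as a Document mapping shop_id to the field path. A minimal sketch reusing the names from the question:
import static java.util.Arrays.asList;

import org.bson.Document;
import org.bson.conversions.Bson;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;

// Sketch using the question's field names:
// _id : [shop_id: '$shop.id'] becomes the group key expression,
// passed as the first argument instead of null.
Bson groupStage = Aggregates.group(
        new Document("shop_id", "$shop.id"),
        asList(
                Accumulators.sum("count", 1),
                Accumulators.first("n", "$shop.n"),
                Accumulators.first("dn", "$shop.dn")));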

Related

Compare a MongoDB collection field with the current date to get the active user count for a day

var dt= new Date();
var datestring = dt.getFullYear()+"-"+("0"+(dt.getMonth()+1)).slice(-2)+"-"+("0"+dt.getDate()).slice(-2);
db.getCollection('Profile Section').count({Last_Active : {$regex : datestring}})
I have written this query in MongoDB and it gives the correct count value, so the query itself is correct. I am using Spring Boot for the backend, with a REST API that I call from Postman. How can I write the equivalent of this query in Java?
Can you suggest how I can write the equivalent Spring Boot backend with the help of this query? I also have to insert the resulting count value into another collection called ActiveUsers in MongoDB.
@GetMapping("/Activeusers")
is the API call in my Java application.
You have to use the $gte & $lt query operators.
db.getCollection('Profile').find({
    "lastActive" : {
        $gte: new Date('2020-05-19'),
        $lt: new Date('2020-05-20')
    }
}).pretty()
In Spring,
@Query("{'lastActive' : { $gte: ?0, $lt: ?1 } }")
public List<Profile> getProfileByLastActive(LocalDate today, LocalDate nextDay);
Use either LocalDate or Date as per your convenience.
If you are going to implement this in Node.js, the best option is to use moment.js. I have been working with moment.js for calendar-related activities, which is why I suggest it.
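To cover the rest of the question (exposing the count via @GetMapping and inserting it into the ActiveUsers collection), here is a hypothetical sketch using Spring Data's MongoTemplate; the collection and field names (Profile, lastActive, ActiveUsers) are assumptions taken from the question:
import java.time.LocalDate;

import org.bson.Document;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sketch: collection and field names are taken from the question.
@RestController
public class ActiveUsersController {

    private final MongoTemplate mongoTemplate;

    public ActiveUsersController(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @GetMapping("/Activeusers")
    public long activeUsers() {
        LocalDate today = LocalDate.now();
        // Count profiles whose lastActive falls within today's range.
        Query query = new Query(Criteria.where("lastActive")
                .gte(today).lt(today.plusDays(1)));
        long count = mongoTemplate.count(query, "Profile");
        // Insert the resulting count into the ActiveUsers collection.
        mongoTemplate.save(new Document("date", today.toString())
                .append("count", count), "ActiveUsers");
        return count;
    }
}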
Your code is accurate. The problem is that it's too accurate.
If you were to update those records every millisecond, you would be able to get them. But this is overkill for most usages, and this is where the architecture of your system matters.
Querying a date field with an $eq operator fetches results with an exact match, to millisecond precision. The "active users" logic might tempt us to check for users who are active right now, and our intuition suggests an $eq operator. While this is correct, we would miss lots of users who are active but whose records in the db are not updated at a millisecond rate (this depends on the way you update your db records).
As implied above, one solution would be to update the db with dozens of updates just to keep an accurate, near-real-time picture of active users on the db. This might be too much for many systems.
Another solution would be to query the active users within an interval / gap of a few seconds (or more). Each added second widens the match window by another 1,000 milliseconds, greatly increasing the chance of catching active users. You can see this here:
db.getCollection('Profile').find({"lastActive" : {$gte: new Date(ISODate().getTime() - 1000 * 60 * 2)}}).pretty() // fetches records from the last 2 minutes

How to compare all records of two collections in mongodb using mapreduce?

I have a use case in which I want to compare each record of two collections in MongoDB, and after comparing them I need to find the mismatched fields of every record.
Let us take an example: in collection1 I have one record, {id : 1, name : "bks"},
and in collection2 I have a record, {id : 1, name : "abc"}.
When I compare the above two records with the same key, the field name is a mismatch, as its value differs.
I am thinking of achieving this use case using mapreduce in MongoDB, but I am facing some problems accessing the other collection's name in the map function. When I tried to compare it in the map function, I got this error: "errmsg" : "exception: ReferenceError: db is not defined near '
Can anyone give me some thoughts on how to compare records using mapreduce?
It might have helped you to read the documentation:
When upgrading to MongoDB 2.4, you will need to refactor your code if your map-reduce operations, group commands, or $where operator expressions include any global shell functions or properties that are no longer available, such as db.
So from your error fragment, you appear to be referencing db in order to access another collection. You cannot do that.
If indeed you intend to "compare" items in one collection to those in another, then there is no approach other than looping code:
db.collection.find().forEach(function(doc) {
    var another = db.anothercollection.findOne({ "_id": doc._id });
    // Code to compare
});
There is simply no concept of "joins" as such available to MongoDB, and operations such as mapReduce or aggregate or others strictly work with one collection only.
The exception is db.eval(), but as per all of strict warnings in the documentation, this is almost always a very bad idea.
Live with your comparison in looping code.
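Since the rest of this page is Java-centric, here is the same looping comparison sketched with the Java sync driver; the collection names and the _id join key are assumptions taken from the example above:
import java.util.Objects;

import org.bson.Document;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;

public class CollectionDiff {
    // Sketch: collection names from the example; _id assumed as the join key.
    // Loop over the first collection, look up the matching document in the
    // second by _id, and report any fields whose values differ.
    static void compare(MongoDatabase db) {
        MongoCollection<Document> coll1 = db.getCollection("collection1");
        MongoCollection<Document> coll2 = db.getCollection("collection2");
        for (Document doc : coll1.find()) {
            Document other = coll2.find(Filters.eq("_id", doc.get("_id"))).first();
            if (other == null) {
                continue; // no counterpart in collection2
            }
            for (String field : doc.keySet()) {
                if (!Objects.equals(doc.get(field), other.get(field))) {
                    System.out.println("_id " + doc.get("_id")
                            + ": field '" + field + "' differs");
                }
            }
        }
    }
}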

Basic GROUP BY statement using OPA MongoDB high level API

My question is quite simple: I'd like to perform a GROUP BY-like statement with MongoDB using the Opa high-level database API. But I don't think that is possible (?)
If I do want to perform a MongoDB $group operation, do I necessarily need to use the low-level API (stdlib.apis.mongo)?
Finally, can I use both the low-level and high-level APIs to communicate with my MongoDB?
Thanks.
I am afraid that, taking into account the latest published Opa compiler code, no aggregation is supported :( See the thread in the Opa forum. Also note Quentin's comment about using both the low- and high-level APIs:
"You can use this [low level] library and the built-in [high level] library together, [...]"
See the auto-increment implementation advice from the guys at MLstate in this thread. Note the high-level DB field /next_id definition and its initialization with a low-level read and increment.
I just got a different idea.
All MongoDB commands (e.g. the group command you are using) are accessible through the virtual collection named $cmd. You just ask the server to find the document {command_name: command_parameter, additional: "options", are: ["listed", "here"]}. You should be able to use every fancy feature of your MongoDB server that is not yet supported by the Opa API with a single find query. This includes the aggregation framework introduced in version 2.2 and the full-text search still in beta since version 2.4.
For example, I want to use the new text command to search the full-text index of collection coll_name for the query string query. I am currently using this code (where onsuccess is the function that parses the answer and extracts the ids of the documents found):
{ search: query, project: {_id:0, id:1}, }
|> Bson.opa2doc
|> MongoCommands.simple_str_command_opts(ll_db, db_name, "text", coll_name, opts)
|> MongoCommon.outcome_map(_, onsuccess, onfailure)
And if you take a look at the source code of the API, simple_str_command_opts is implemented as a findOne() call to Mongo.
But instead I could use the high level DB support:
/test/`$cmd`[{text: coll_name, search: query, project: {_id: 0, id: 1}}]
What you have to do is declare the high-level DB collection with a type including:
all the fields that you use to make the query,
all the fields that you can get in a possible answer.
For the text command:
type commands = {
// command
string text,
// query
string search,
{
int _id,
int id,
} project,
// result of executing command "text"
string queryDebugString,
string language,
list({
float score,
{int id} obj,
}) results,
{
int nscanned,
int nscannedObjects,
int n,
int nfound,
int timeMicros,
} stats,
int ok,
// in case of failure (`ok: 0`)
string errmsg,
}
Unfortunately, it is not working :( During application start-up, Opa's run-time DB support tries to create the unique index for the primary key of the set (for the following example, {text, search, project}):
database test {
article /article[{id}]
commands /`$cmd`[{text, search, project}]
}
Using a primary key is necessary, since you have to use findOne(), not find(). Creating an index on the virtual collection $cmd is not allowed, and DB initialization fails.
If you find a way to stop Opa from creating the index, you will be able to use all the fancy features of Mongo with no more than the high-level API ;)
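As an aside, this $cmd mechanism is the same one that driver-level command helpers historically wrapped; for example, with the MongoDB Java driver a command document can be sent directly (a sketch; the connection string and database name are placeholders):
import org.bson.Document;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class CmdDemo {
    public static void main(String[] args) {
        // Placeholder connection string and database name.
        MongoDatabase db = MongoClients.create("mongodb://localhost")
                .getDatabase("test");
        // runCommand() sends the command document to the server, much as if
        // it were a findOne() against the virtual $cmd collection.
        Document result = db.runCommand(new Document("ping", 1));
        System.out.println(result.toJson());
    }
}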

mongodb: how can I see the execution time for the aggregate command?

I execute the following MongoDB command in the mongo shell:
db.coll.aggregate(...)
and I see the list of results, but is it possible to see the query execution time? Is there an equivalent of the explain method for aggregation queries?
var before = new Date();
// aggregation query goes here
var after = new Date();
var execution_mills = after - before;
You can add a time function to your .mongorc.js file (in your home directory):
function time(command) {
const t1 = new Date();
const result = command();
const t2 = new Date();
print("time: " + (t2 - t1) + "ms");
return result;
}
and then you can use it like so:
time(() => db.coll.aggregate(...))
Caution
This method doesn't give relevant results for db.collection.find(), because find() only returns a cursor; the documents are not actually fetched until the cursor is iterated.
I see that in MongoDB there is the possibility to use these two commands:
db.setProfilingLevel(2)
and then, after the query, you can use db.system.profile.find() to see the query execution time and other details.
Or you can install the excellent mongo-hacker, which automatically times every query, pretty()fies it, colorizes the output, sorts the keys, and more.
I will write an answer to explain this better.
Basically there is no explain() functionality for the aggregation framework yet: https://jira.mongodb.org/browse/SERVER-4504
However, there is a way to measure client side, though not without its downsides:
You are not measuring the database.
You are measuring the application.
There are too many unknowns about the in-between parts to get an accurate reading; i.e. you can't say that it took 0.04 ms for the result document to be formulated by the MongoDB server, serialised, sent over the wire, de-serialised by the app and then stored into a hash, allowing you to subtract that sum from the total to get an aggregation benchmark.
That being said, you might be able to get a reasonably accurate result by doing it in the MongoDB console on the same server as the mongos / mongod. This creates very few in-betweens, still too many, but perhaps enough to get a reading you could roughly trust. In that position you could use @Zagorulkin's answer.
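For completeness, the same client-side stopwatch idea from application code; a sketch with the Java sync driver (this measures the full round trip, including the overheads listed above, not server-side execution time):
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import com.mongodb.client.MongoCollection;

public class AggregateTimer {
    // Sketch: times running the pipeline AND draining all results, so the
    // network and deserialisation costs discussed above are included.
    static long timeAggregate(MongoCollection<Document> coll,
                              List<Document> pipeline) {
        long before = System.currentTimeMillis();
        coll.aggregate(pipeline).into(new ArrayList<>());
        return System.currentTimeMillis() - before;
    }
}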

MongoDB Aggregation as slow as MapReduce?

I'm just starting out with MongoDB and trying to do some simple things. I filled up my database with a collection of documents containing an "item" property. I wanted to try to count how many times each item occurs in the collection.
example of a document:
{ "_id" : ObjectId("50dadc38bbd7591082d920f0"), "item" : "Pons", "lines" : 37 }
So I designed these two functions for doing the MapReduce (written in Python using pymongo):
from bson.code import Code

all_map = Code("function () {"
               "    emit(this.item, 1);"
               "}")

all_reduce = Code("function (key, values) {"
                  "    var sum = 0;"
                  "    values.forEach(function(value){"
                  "        sum += value;"
                  "    });"
                  "    return sum;"
                  "}")

# run with e.g.: db.wikipedia.map_reduce(all_map, all_reduce, "item_counts")
This worked like a charm, so I began filling the collection. At around 30,000 documents the mapreduce already takes longer than a second... Because NoSQL brags about speed, I thought I must have been doing something wrong!
A question here at Stack Overflow made me check out the aggregation feature of MongoDB. So I tried to use the group + sum + sort thingies. Came up with this:
db.wikipedia.aggregate(
    { $group: { _id: "$item", count: { $sum: 1 } } },
    { $sort: { count: 1 } }
)
This code works just fine and gives me the same results as the mapreduce, but it is just as slow. Am I doing something wrong? Do I really need to use other tools like Hadoop to get better performance?
I will place an answer basically summing up my comments. I cannot speak for other techs like Hadoop, since I have not yet had the pleasure of finding time to use them, but I can speak for MongoDB.
Unfortunately, you are using two of the worst operators for any database: computed fields and grouping (or distinct) on a full table scan. The aggregation framework in this case must compute the field, group, and then sort the computed field in memory ( http://docs.mongodb.org/manual/reference/aggregation/#_S_sort ). This is an extremely inefficient task for MongoDB, and in fact most likely for any database.
There is no easy way to do this in real time inline with your own application. Map-reduce could be a way out if you didn't need to return the results immediately, but since I am guessing you don't really want to wait for this kind of thing, the default method is simply to eradicate the group altogether.
You can do this by pre-aggregation. You create another collection, grouped_wikipedia, and in your application you maintain it using upserts with atomic operators like $set and $inc (to count the occurrences), ensuring you only get one row per item. This is probably the sanest way to solve this problem.
This does, however, raise the problem of having to manage this extra collection alongside the detail collection wikipedia, but I believe that to be an unavoidable side effect of getting the right performance here. The benefits will be greater than the cost of managing the extra collection.
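A sketch of that pre-aggregation upsert with the MongoDB Java driver, for consistency with the rest of this page (the question itself uses pymongo); the collection name grouped_wikipedia comes from the answer above:
import org.bson.Document;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;

public class PreAggregation {
    // Sketch: on every insert into "wikipedia", also bump the per-item
    // counter in "grouped_wikipedia"; upsert(true) creates the counter
    // document the first time an item is seen, so there is exactly one
    // row per item.
    static void countItem(MongoCollection<Document> grouped, String item) {
        grouped.updateOne(
                Filters.eq("_id", item),
                Updates.inc("count", 1),
                new UpdateOptions().upsert(true));
    }
}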