Filter ResultSet after MongoDB MapReduce - mongodb

Consider the following MongoDB collection / prototype that keeps track of how many cookies a given person has at a given point in time:
{
    "_id": ObjectId("5c5b5c1865e463c5b6a5b748"),
    "person": "Drew",
    "cookies": 1,
    "timestamp": ISODate("2019-02-05T20:34:48.922Z")
}
{
    "_id": ObjectId("5c5b5c2265e463c5b6a5b749"),
    "person": "Max",
    "cookies": 3,
    "timestamp": ISODate("2019-02-06T20:34:48.922Z")
}
{
    "_id": ObjectId("5c5b5c2e65e463c5b6a5b74a"),
    "person": "Max",
    "cookies": 0,
    "timestamp": ISODate("2019-02-07T20:34:48.922Z")
}
Ultimately, I need to get all people who currently have more than 0 cookies. In the above example, only "Drew" would qualify; "Max" had 3, but later had 0.
I've written the following map / reduce functions to sort this out:
var map = function() {
    emit(this.person, { 'timestamp': this.timestamp, 'cookies': this.cookies });
};
var reduce = function(person, cookies) {
    // Keep only the most recently timestamped value for each person.
    let latestCookie = cookies.sort(function(a, b) {
        if (a.timestamp > b.timestamp) {
            return -1;
        } else if (a.timestamp < b.timestamp) {
            return 1;
        } else {
            return 0;
        }
    })[0];
    return {
        'timestamp': latestCookie.timestamp,
        'cookies': latestCookie.cookies
    };
};
This works fine and I get the following resultSet:
db.cookies.mapReduce(map, reduce, {out:{inline:1}})
...
"results": [
{
"_id": "Drew",
"value": {
"timestamp": ISODate("2019-02-05T20:34:48.922Z"),
"cookies": 1
}
},
{
"_id": "Max",
"value": {
"timestamp": ISODate("2019-02-07T20:34:48.922Z"),
"cookies": 0
}
}
],
...
Max is included in the results, but I'd like for him not to be included (he has 0 cookies, after all).
What are my options here? I'm still relatively new to MongoDB. I have looked at finalize as well as writing to a temporary collection ({out: "temp.cookies"}), but I'm curious whether there's an easier option or parameter I am overlooking.
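For illustration, the temporary-collection route would mean filtering with a plain follow-up query; a minimal sketch (the output collection name is just illustrative):
db.cookies.mapReduce(map, reduce, { out: "cookies_latest" })
// Everyone whose latest value has more than 0 cookies:
db.cookies_latest.find({ "value.cookies": { $gt: 0 } })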
Am I crazy for using MapReduce to solve this problem? The actual workload behind this scenario will involve millions of documents.
Thanks in advance

Related

Search and update in array of objects MongoDB

I have a collection in MongoDB containing the search history of a user, where each document is stored like:
"_id": "user1"
searchHistory: {
"product1": [
{
"timestamp": 1623482432,
"query": {
"query": "chocolate",
"qty": 2
}
},
{
"timestamp": 1623481234,
"query": {
"query": "lindor",
"qty": 4
}
},
],
"product2": [
{
"timestamp": 1623473622,
"query": {
"query": "table",
"qty": 1
}
},
{
"timestamp": 1623438232,
"query": {
"query": "ike",
"qty": 1
}
},
]
}
Here the _id of the document acts like a foreign key to the user document in another collection.
I have a backend running on Node.js, and this function is used to store a new search history entry in the record:
exports.updateUserSearchCount = function (userId, productId, searchDetails) {
    let addToSetData = {}
    let key = `searchHistory.${productId}`
    addToSetData[key] = { "timestamp": new Date().getTime(), "query": searchDetails }
    return client.db("mydb").collection("userSearchHistory")
        .updateOne({ "_id": userId }, { "$addToSet": addToSetData }, { upsert: true }, async (err, res) => {
        })
}
Now, I want to get the search history of a user based on the query alone, using db.find().
I want something like this:
db.find({"_id": "user1", "searchHistory.somewildcard.query": "some query"})
I need a wildcard that replaces ".somewildcard." so the search runs across all products searched.
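(For reference: find() has no such wildcard, but an aggregation using $objectToArray, available since MongoDB 3.4.4, can approximate it. A sketch under that assumption, with "chocolate" as a stand-in query string:)
db.userSearchHistory.aggregate([
    { $match: { "_id": "user1" } },
    // Turn the searchHistory object into an array of { k: <productId>, v: <searches> }.
    { $project: { "entries": { $objectToArray: "$searchHistory" } } },
    { $unwind: "$entries" },
    { $unwind: "$entries.v" },
    // Match the query string regardless of which product it was searched under.
    { $match: { "entries.v.query.query": "chocolate" } }
])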
I saw a suggestion that we should store the document like:
"_id": "user1"
searchHistory: [
{
"key": "product1",
"value": [
{
"timestamp": 1623482432,
"query": {
"query": "chocolate",
"qty": 2
}
}
]
}
]
However, if I store the document like this, then adding a search history entry to an existing document becomes a tedious and confusing task.
What should I do?
It's always a bad idea to save values as keys, for exactly the reason you're facing: it heavily limits querying on that field. The obvious trade-off is that it makes updates much easier.
I personally recommend you do not save these searches in nested form at all, as this will cause you scaling issues quite quickly. Assuming these fields are indexed, you will start seeing performance issues once the arrays get too large (a few hundred searches).
So my personal recommendation is for you to save it in a new collection like so:
{
"user_id": "1",
"key": "product1",
"timestamp": 1623482432,
"query": {
"query": "chocolate",
"qty": 2
}
}
Now querying a specific user, a specific product, or even a query substring is all easily supported by creating some basic indexes, as sketched below. An "update" in this case is just inserting a new document, which is also much faster.
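For instance (the userSearches collection name is just illustrative):
// Indexes supporting the common lookups: by user, by product key, by query string.
db.userSearches.createIndex({ "user_id": 1, "timestamp": -1 })
db.userSearches.createIndex({ "key": 1 })
db.userSearches.createIndex({ "query.query": 1 })
// An "update" is now just an insert of a new search document.
db.userSearches.insertOne({
    "user_id": "1",
    "key": "product1",
    "timestamp": 1623482432,
    "query": { "query": "chocolate", "qty": 2 }
})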
If you still prefer to keep the nested structure, then I recommend you switch to the structure you posted. As you mentioned, updates become slightly more tedious, but you can still do them quite easily, using arrayFilters to update a specific element or just $push to add a new search, as sketched below.
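A minimal sketch of that update against the nested structure (the new search values are placeholders):
db.userSearchHistory.updateOne(
    { "_id": "user1" },
    // $push the new search into the "value" array of the matching product element.
    { "$push": { "searchHistory.$[elem].value": {
        "timestamp": 1623482500,
        "query": { "query": "truffle", "qty": 1 }
    } } },
    // arrayFilters pins $[elem] to the product being updated.
    { "arrayFilters": [ { "elem.key": "product1" } ] }
)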

Can we skip documents in MongoDB Map Reduce based on the values computed in reduce?

In Mongo's mapReduce, can we skip storing a document via the out option?
E.g., sample documents in the animals collection:
{
    "type": "bird",
    "value": 10
},
{
    "type": "fish",
    "value": 30
},
{
    "type": "fish",
    "value": 20
},
{
    "type": "plant",
    "value": 40
}
Map / reduce functions:
function map() {
    emit(this.type, this.value);
}
function reduce(key, values) {
    var total = Array.sum(values);
    // --> IF total is less than 20 then skip
    if (total < 20) {
        // skip... maybe something like returning null? --> Is something like this possible?
        return;
    }
    return total;
}
db.animals.mapReduce(map, reduce, { out: 'mr_test' })
I tried searching for a solution around this in the documentation and I wasn't able to find it. Can anyone help me with this?
No, you cannot. Note that reduce is not even invoked for keys with only a single emitted value, so a "skip" inside reduce could never cover those. But you can always delete documents from the out collection afterwards:
db.mr_test.deleteMany({ value: { $lt:20 } })
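Alternatively, an aggregation pipeline can apply the cutoff before anything is written, which avoids the extra delete pass; a sketch (assuming the values are stored as numbers):
db.animals.aggregate([
    // Sum values per type, keep only totals of 20 or more, then write out.
    { $group: { _id: "$type", value: { $sum: "$value" } } },
    { $match: { value: { $gte: 20 } } },
    { $out: "mr_test" }
])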

RxJava complicated async task

I am trying to use an async RxJava flow to handle a complicated task.
I get a list of metrics and a list of cities, and I fetch their values (a list of date/value pairs). Each metric is stored in a different DB, so lots of HTTP calls are expected.
The result order means nothing, so I want to:
run over all the metrics asynchronously;
for every metric, run over all the cities asynchronously;
for every city and metric, make an HTTP call to fetch the data.
The end response should look like this:
"result": {
"metric1": [
{
"cityId": 8,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
},
{
"cityId": 9,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
}
],
"metric2": [
{
"cityId": 8,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
},
{
"cityId": 9,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
}
]
}
}
}
How can I achieve this? I got lost with all the observables. Which should be blocking, and which should return a real value?
EDIT: added code sample
Here is what I have so far:
1. For each metric (in parallel, I guess):
Observable.from(request.getMetrics())
    .subscribe(metric -> {
        List metricDataList = getMetricData(metric, citiesList);
        result.put(metric.getName(), metricDataList);
    });
return ImmutableMap.of("result", result);
2. Get metric data (according to cities):
final List<Map<String, Object>> result = new ArrayList<>();
Observable.from(citiesList).subscribe(city ->
    result.add(ImmutableMap.of(
        "id", city.getId(),
        "name", city.getName(),
        "value", getMetricValues(metricType, city.getId())
            .toBlocking()
            .firstOrDefault(new ArrayList<>()))));
return result;
3. According to the metric, I decide which service to invoke:
private Observable<List<AggregatedObject>> getMetricValues(AggregationMetricType metric, Integer cityId) {
    switch (metric) {
        case METRIC_1: return fetchMetric1(cityId);
        case METRIC_2: return fetchMetric2(cityId);
    }
    return Observable.empty();
}
4. Invoke the service:
public Observable<List<AggregatedObject>> fetchMetric1(Integer city) {
    return Observable.just(httpClient.getData(city))
        .map(this::transformCountResult)
        .onErrorResumeNext(err -> Observable.from(new ArrayList<>()));
}
5. Transform the received data:
protected List<AggregatedObject> transformCountResult(JsonNode jsonNode) {
...
}
It is working, though I am not sure that I have implemented it correctly with regard to blocking and concurrency.
Help please.
Regards,
Ido
In the step you call "2) get metric data", you could refactor as
// Block once for the whole list; assigning to a local variable from inside
// subscribe would not compile (the local would have to be effectively final).
List<Map<String, Object>> result = Observable.from(citiesList)
    .flatMap(city -> getMetricValues(metricType, city.getId())
        .firstOrDefault(new ArrayList<>())
        .map(metric -> ImmutableMap.<String, Object>of(
            "id", city.getId(),
            "name", city.getName(),
            "value", metric)))
    .toList()
    .toBlocking()
    .single();
Moving the toBlocking() outside of the processing means that individual requests will occur in a more parallel fashion.

Adding Fields to an existing Array

I have another problem to solve here. Thinking in arrays can sometimes be very challenging. Here is what I am lined up with. This is what my data looks like:
{
    "_id": { "Firm": "ABC", "year": 2014 },
    "Headings": [
        { "costHead": "MNF", "amount": 500000 },
        { "costHead": "SLS", "amount": 25000 },
        { "costHead": "OVRHD", "amount": 100 }
    ]
}
{
    "_id": { "Firm": "CDF", "year": 2015 },
    "Headings": [
        { "costHead": "MNF", "amount": 15000 },
        { "costHead": "SLS", "amount": 100500 },
        { "costHead": "MNTNC", "amount": 7500 }
    ]
}
As you can see, I have a list with a whole bunch of sub-documents.
Here is what I want to do: I need to add more elements to this "Headings" list, which should look like:
{
    "costHead": "FxdCost",
    "amount": "$Headings.amount (for costHead MNF) + $Headings.amount (for costHead OVRHD)"
}
I am unsure how to produce the above. Here are some challenges:
I can $addToSet the new subdocument I wish to add, but the problem is that $addToSet can only be used in the $group stage, which would be expensive (unless, of course, there is no other way).
Even if I use $addToSet, I always have to use the $ operator to refer to elements that I read from my JSON file. The element I am trying to add here (costHead: FxdCost) is not present in my JSON file, so I cannot use the $ operator.
Does anyone have advice on how to go about this? This is, after all, basic ETL.
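For what it's worth, one way to express this without a $group stage is $addFields with $concatArrays (MongoDB 3.4+); a sketch, assuming the collection is named firms (illustrative):
db.firms.aggregate([
    { $addFields: {
        "Headings": { $concatArrays: [
            "$Headings",
            [ {
                "costHead": "FxdCost",
                // Sum the amounts of the MNF and OVRHD headings.
                "amount": { $sum: { $map: {
                    input: { $filter: {
                        input: "$Headings",
                        cond: { $in: [ "$$this.costHead", [ "MNF", "OVRHD" ] ] }
                    } },
                    in: "$$this.amount"
                } } }
            } ]
        ] }
    } }
])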

Get a list of all unique tags in mongodb

I am getting started with MongoDB and have a collection with documents that look like the following:
{
    "type": 1,
    "tags": ["tag1", "tag2", "tag3"]
}
{
    "type": 2,
    "tags": ["tag2", "tag3"]
}
{
    "type": 3,
    "tags": ["tag1", "tag3"]
}
{
    "type": 1,
    "tags": ["tag1", "tag4"]
}
With this, I want the set of all tags for a particular type. For example, for type 1, I want the set tag1, tag2, tag3, tag4 (in any order).
All I could think of is to fetch the tags and add them to a set in Python, but I wanted to know if there is a way to do it with MongoDB's mapReduce or something else. Please advise.
If you just want a (distinct) list of the tags, then using distinct will be best. Map/Reduce will be slower and can't use an index for the JavaScript part.
http://docs.mongodb.org/manual/reference/method/db.collection.distinct/
db.coll.distinct("tags", { type: 1 }) will return the set of tags for type = 1.
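Against the sample documents above, that call would be expected to return (in some order):
[ "tag1", "tag2", "tag3", "tag4" ]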
You are right, a Map/Reduce might work for what you are trying to accomplish, but a Set might be faster and less code.
> m = function() {
... for (var tag in this.tags) {
... emit(this.tags[tag], 1);
... }
... }
> r = function(key, values) {
... return 1;
... }
> db.tags.mapReduce(m, r).find()
{ "_id" : "tag1", "value" : 1 }
{ "_id" : "tag2", "value" : 1 }
{ "_id" : "tag3", "value" : 1 }