Can we skip documents in MongoDB Map Reduce based on the values computed in reduce?

In MongoDB's mapReduce, can we skip storing a document in the out collection based on a value computed in reduce?
E.g., sample documents in the animals collection:
{
  "type": "bird",
  "value": 10
},
{
  "type": "fish",
  "value": 30
},
{
  "type": "fish",
  "value": 20
},
{
  "type": "plant",
  "value": 40
}
Map and reduce functions:
function map() {
  emit(this.type, this.value);
}
function reduce(key, values) {
  var total = Array.sum(values);
  // --> IF total is less than 20 then skip
  if (total < 20) {
    // skip... maybe by returning null? --> Is something like this possible?
    return;
  }
  return total;
}
db.animals.mapReduce(map, reduce, { out: 'mr_test' })
I searched the documentation for a way to do this but couldn't find one. Can anyone help me with this?

No, you cannot: reduce must return a value of the same shape as the emitted values, because MongoDB may re-invoke reduce on its own output (re-reduce), so returning null or undefined is not an option. But you can always delete the unwanted documents from the out collection afterwards:
db.mr_test.deleteMany({ value: { $lt:20 } })
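Alternatively, the same output can be produced with the aggregation pipeline, filtering before anything is written. A minimal sketch, assuming the numeric value field from the sample documents above:
db.animals.aggregate([
  // group by type and total the values (mirrors the map/reduce pair)
  { $group: { _id: "$type", value: { $sum: "$value" } } },
  // skip the groups the reduce function wanted to drop
  { $match: { value: { $gte: 20 } } },
  // write the result to the output collection
  { $out: "mr_test" }
])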

Related

Elasticsearch Java high level client group by and max

I am using Scala 2.12 and Elasticsearch 6.5, querying ES through the high level Java client.
As a simple example, suppose the required document was published twice, with a different id and timestamp each time: id_123 at 10 AM and id_234 at 11 AM (times are representative only). I need only the document that is the latest of the two, i.e. the 11 AM one.
So I have some filter conditions, and then need to group on field1 and take the max of field2 (which is the timestamp).
val searchRequest = new SearchRequest("index_name")
val searchSourceBuilder = new SearchSourceBuilder()
val qb = QueryBuilders.boolQuery()
  .must(QueryBuilders.matchQuery("myfield.date", "2019-07-02"))
  .must(QueryBuilders.matchQuery("myfield.data", "1111"))
  .must(QueryBuilders.boolQuery()
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex1"))
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex2"))
  )
val myAgg = AggregationBuilders.terms("group_by_Id").field("field1.Id")
  .subAggregation(AggregationBuilders.max("timestamp").field("field1.timeStamp"))
searchSourceBuilder.query(qb)
searchSourceBuilder.aggregation(myAgg)
searchSourceBuilder.size(1000)
searchRequest.source(searchSourceBuilder)
val searchResponse = client.search(searchRequest, RequestOptions.DEFAULT)
Everything works fine as long as I do not use the aggregation. When I add the aggregation, I get the following error:
ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Expected numeric type on field [field1.timeStamp], but got [keyword]]]
So what am I missing here?
I am basically looking for an SQL-like query that filters (WHERE with AND/OR clauses), then groups by a field (Id) and keeps only the documents where timeStamp is max.
UPDATE:
I tried the above query via cURL from the command prompt and get the same error on the max aggregation:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "myfield.date": "2019-07-02" } },
        { "match": { "myfield.data": "1111" } },
        {
          "bool": {
            "should": [
              { "regexp": { "myOtherFieldId": "myregex1" } },
              { "regexp": { "myOtherFieldId": "myregex2" } }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_Id": {
      "terms": {
        "field": "field1.Id"
      },
      "aggs": {
        "max_timestamp": {
          "max": {
            "field": "field1.timeStamp"
          }
        }
      }
    }
  },
  "size": "10000"
}
I checked the mappings of the index, and timeStamp is indeed mapped as keyword. So how can I take the max of such a field?
Adding the relevant mappings below. Note that the dynamic templates map every detected type (boolean, double, long, string) to keyword, which is why timeStamp ends up as keyword:
{"index_name":{"mappings":{"data":{"dynamic_templates":[{"boolean_as_keyword":{"match":"*","match_mapping_type":"boolean","mapping":{"ignore_above":256,"type":"keyword"}}},{"double_as_keyword":{"match":"*","match_mapping_type":"double","mapping":{"ignore_above":256,"type":"keyword"}}},{"long_as_keyword":{"match":"*","match_mapping_type":"long","mapping":{"ignore_above":256,"type":"keyword"}}},{"string_as_keyword":{"match":"*","match_mapping_type":"string","mapping":{"ignore_above":256,"type":"keyword"}}}],"date_detection":false,"properties":{"header":{"properties":{"Id":{"type":"keyword","ignore_above":256},"otherId":{"type":"keyword","ignore_above":256},"someKey":{"type":"keyword","ignore_above":256},"dataType":{"type":"keyword","ignore_above":256},"processing":{"type":"keyword","ignore_above":256},"otherKey":{"type":"keyword","ignore_above":256},"sender":{"type":"keyword","ignore_above":256},"receiver":{"type":"keyword","ignore_above":256},"system":{"type":"keyword","ignore_above":256},"timeStamp":{"type":"keyword","ignore_above":256}}}}}}}}
UPDATE 2:
I think I need to aggregate on the keyword field. Please note that timeStamp is a subfield, i.e. it lives under field1, so the usual .keyword suffix syntax below doesn't seem to work, or I am missing something else:
"aggs": {
"NAME" : {
"terms": {
"field": "field1.Id"
},
"aggs": {
"NAME": {
"max" : {
"field": "field1.timeStamp.keyword"
}
}
}
}
}
It fails now saying:
"Invalid aggregator order path [field1.timeStamp]. Unknown aggregation [field1]"

Filter ResultSet after MongoDB MapReduce

Consider the following MongoDB collection / prototype that keeps track of how many cookies a given person has at a given point in time:
{
  "_id": ObjectId("5c5b5c1865e463c5b6a5b748"),
  "person": "Drew",
  "cookies": 1,
  "timestamp": ISODate("2019-02-05T20:34:48.922Z")
}
{
  "_id": ObjectId("5c5b5c2265e463c5b6a5b749"),
  "person": "Max",
  "cookies": 3,
  "timestamp": ISODate("2019-02-06T20:34:48.922Z")
}
{
  "_id": ObjectId("5c5b5c2e65e463c5b6a5b74a"),
  "person": "Max",
  "cookies": 0,
  "timestamp": ISODate("2019-02-07T20:34:48.922Z")
}
Ultimately, I need to get all people who currently have more than 0 cookies. In the above example, only "Drew" would qualify ("Max" had 3, but later had 0).
I've written the following map / reduce functions to sort this out:
var map = function() {
  emit(this.person, { 'timestamp': this.timestamp, 'cookies': this.cookies })
}
var reduce = function(person, cookies) {
  let latestCookie = cookies.sort(function(a, b) {
    if (a.timestamp > b.timestamp) {
      return -1;
    } else if (a.timestamp < b.timestamp) {
      return 1;
    } else {
      return 0;
    }
  })[0];
  return {
    'timestamp': latestCookie.timestamp,
    'cookies': latestCookie.cookies
  };
}
This works fine and I get the following resultSet:
db.cookies.mapReduce(map, reduce, {out:{inline:1}})
...
"results": [
  {
    "_id": "Drew",
    "value": {
      "timestamp": ISODate("2019-02-05T20:34:48.922Z"),
      "cookies": 1
    }
  },
  {
    "_id": "Max",
    "value": {
      "timestamp": ISODate("2019-02-07T20:34:48.922Z"),
      "cookies": 0
    }
  }
],
...
Max is included in the results, but I'd like him to be excluded (he has 0 cookies, after all).
What are my options here? I'm still relatively new to MongoDB. I have looked at finalize as well as creating a temporary collection ({out: "temp.cookies"}), but I'm curious whether there's an easier option or a parameter I am overlooking.
Am I crazy for using MapReduce to solve this problem? The actual workload behind this scenario will involve millions of documents.
Thanks in advance
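One possible approach, offered as a sketch rather than the thread's accepted answer: the aggregation pipeline can sort by timestamp, keep the latest entry per person with $group/$first, and then filter out the zero-cookie people with $match:
db.cookies.aggregate([
  // newest entries first
  { $sort: { timestamp: -1 } },
  // keep the latest entry per person
  { $group: {
      _id: "$person",
      timestamp: { $first: "$timestamp" },
      cookies: { $first: "$cookies" }
  } },
  // drop people whose latest count is 0
  { $match: { cookies: { $gt: 0 } } }
])
This also avoids the JavaScript engine overhead that mapReduce carries, which matters at millions of documents.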

elasticsearch_dsl: Generate multiple buckets in aggregation

I want to generate this:
GET /packets-2017-09-25/_search
{
  "size": 0,
  "query": {
    "match": {
      "transport_protocol": "tcp"
    }
  },
  "aggs": {
    "clients": {
      "terms": {
        "field": "layers.ip.src.keyword",
        "size": 1000,
        "order": { "num_servers.value": "desc" }
      },
      "aggs": {
        "num_servers": {
          "cardinality": {
            "field": "layers.ip.dst.keyword",
            "precision_threshold": 40000
          }
        },
        "server_list": {
          "terms": {
            "field": "layers.ip.dst.keyword"
          }
        }
      }
    }
  }
}
i.e. I want two sub-aggregations (num_servers and server_list) under clients.
I am trying the piece of code below, which errors out:
def get_streams_per_client(proto='tcp', max=40000):
    s = Search(using=client, index="packets-2017-09-25") \
        .query("match", transport_protocol=proto)
    s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value": "desc"}) \
        .bucket('num_servers', 'cardinality', field='layers.ip.dst.keyword', precision_threshold=40000) \
        .bucket('server_list', 'terms', field='layers.ip.dst.keyword')
    s = s.execute()
    <snip>
I think I am missing the right syntax. I'd appreciate some guidance.
Each .bucket() call returns the newly created bucket, so chaining them nests every new aggregation inside the previous one (and a cardinality metric cannot hold sub-aggregations, hence the error). You can always reach an existing aggregation using the ["name"] notation if you want to define other sub-aggregations on it:
s = Search().query('match', transport_protocol='tcp')
s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value":"desc"})
s.aggs['clients'].metric('num_servers', 'cardinality', field=..., precision_threshold=...)
s.aggs['clients'].bucket('server_list', 'terms', ...)
Hope this helps!
I got the answer from Honza on the elasticsearch_dsl project page:
s = Search(using=client, index="packets-2017-09-25").query('match', transport_protocol=proto)
s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value":"desc"})
s.aggs['clients'].bucket('num_servers', 'cardinality', field='layers.ip.dst.keyword', precision_threshold=40000)
s.aggs['clients'].bucket('server_list', 'terms', field='layers.ip.dst.keyword')
print(json.dumps(s.to_dict(), indent=4))
s = s.execute()

RxJava complicated async task

I am trying to use an async RxJava flow to handle a complicated task.
I get a list of metrics and a list of cities, and need to fetch their values (each a list of date/value pairs). Each metric is stored in a different DB, so lots of HTTP calls are expected.
The result order means nothing, so I want to:
run over all the metrics asynchronously;
for every metric, run over all the cities asynchronously;
for every city and metric, make an HTTP call to fetch the data.
The end response should look like this:
"result": {
"metric1": [
{
"cityId": 8,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
},
{
"cityId": 9,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
}
],
"metric2": [
{
"cityId": 8,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
},
{
"cityId": 9,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
}
]
}
}
}
How can I achieve this? I got lost among all the observables. Which should be blocking, and which should return a real value?
EDIT: added a code sample.
Here is what I have so far:
1. For each metric (in parallel, I guess):
Observable.from(request.getMetrics())
    .subscribe(metric -> {
        List metricDataList = getMetricData(metric, citiesList);
        result.put(metric.getName(), metricDataList);
    });
return ImmutableMap.of("result", result);
2. Get the metric data (per city):
2. Get metric data (according to cities):
final List<Map<String, Object>> result = new ArrayList<>();
Observable.from(citiesList).subscribe(city ->
    result.add(ImmutableMap.of(
        "id", city.getId(),
        "name", city.getName(),
        "value", getMetricValues(metricType, city.getId())
            .toBlocking()
            .firstOrDefault(new ArrayList<>()))));
return result;
3. Depending on the metric, I decide which service to invoke:
private Observable<List<AggregatedObject>> getMetricValues(AggregationMetricType metric, Integer cityId) {
    switch (metric) {
        case METRIC_1: return fetchMetric1(cityId);
        case METRIC_2: return fetchMetric2(cityId);
    }
    return Observable.empty();
}
4. Invoke the service:
public Observable<List<AggregatedObject>> fetchMetric1(Integer city) {
    return Observable.just(httpClient.getData(city))
        .map(this::transformCountResult)
        .onErrorResumeNext(err -> Observable.from(new ArrayList<>()));
}
5. Transform the received data:
protected List<AggregatedObject> transformCountResult(JsonNode jsonNode) {
...
}
It is working, though I am not sure I have implemented it correctly with regard to blocking and concurrency.
Help, please.
Regards,
Ido
In the step you call "2. Get the metric data", you could refactor it as follows:
List<Map<String, Object>> result = Observable.from(citiesList)
    .flatMap(city -> getMetricValues(metricType, city.getId())
        .firstOrDefault(new ArrayList<>())
        .map(metric -> ImmutableMap.<String, Object>of(
            "id", city.getId(),
            "name", city.getName(),
            "value", metric)))
    .toList()
    .toBlocking()
    .single();
Moving the toBlocking() outside of the per-city processing means the individual requests can run in a more parallel fashion.
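The same idea can be applied to the outer metric loop from step 1, removing the subscribe() side effect there as well. A sketch, assuming a hypothetical non-blocking variant of getMetricData (here called getMetricDataObservable) that returns an Observable of the per-city result list:
Map<String, List<Map<String, Object>>> result = Observable.from(request.getMetrics())
    // fan out per metric; each inner observable performs its own city fan-out
    .flatMap(metric -> getMetricDataObservable(metric, citiesList)  // hypothetical, not in the original code
        // pair each metric's result list with the metric name
        .map(dataList -> new java.util.AbstractMap.SimpleEntry<>(metric.getName(), dataList)))
    // collect the (metric name, data list) pairs into a single map
    .toMap(entry -> entry.getKey(), entry -> entry.getValue())
    .toBlocking()
    .single();
return ImmutableMap.of("result", result);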

Find all objects whose nested properties have a desired value

I have collection with the following (sample) documents:
{
  "label": "Tree",
  "properties": {
    "height": {
      "type": "int",
      "label": "Height",
      "description": "In meters"
    },
    "coordinates": {
      "type": "coords",
      "label": "Coordinates"
    },
    "age": {
      "type": "int",
      "label": "Age"
    }
  }
}
Keys in the properties attribute differ across almost all documents in the collection.
I want to find all documents that have at least one property of a given type.
What I'm looking for is a query like {"properties.*.type": "coords"}, but that is not valid; it is only my invention of a Mongo query.
Every bit of help I was able to find concerned the $elemMatch operator, which I cannot use here because properties is an object, not an array.
As far as I know, MongoDB does not provide this kind of search directly. As a workaround, you can first extract all the property keys with a map-reduce pass and then run a find query per key, so the code below should help:
var mapReduce = db.runCommand({
    "mapreduce": "collectionName",
    "map": function() {
        for (var key in this.properties) {
            emit(key, null);
        }
    },
    "reduce": function(key, stuff) {
        return null;
    },
    "out": "collectionName" + "_keys"
});

db[mapReduce.result].distinct("_id").forEach(function(data) {
    var findKey = "properties." + data + ".type";
    var query = {};
    query[findKey] = "coords";
    var myCursor = db.collectionName.find(query);
    while (myCursor.hasNext()) {
        print(tojson(myCursor.next()));
    }
});
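If the MongoDB version allows it (an assumption; $objectToArray requires 3.4.4+ and the version is not mentioned in the thread), the same search can be done in a single aggregation without the map-reduce pass:
db.collectionName.aggregate([
  // turn the properties object into an array of { k: <key>, v: <value> } pairs
  { $addFields: { props: { $objectToArray: "$properties" } } },
  // keep documents where at least one property has the desired type
  { $match: { "props.v.type": "coords" } },
  // drop the helper field from the output
  { $project: { props: 0 } }
])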
MongoDB doesn't support searches on keys - things like properties.* to match all subkeys of properties, etc. Generally you shouldn't have arbitrary keys, or keys you don't know about, in your schema (unless they are only for display), because you won't be able to interact with them easily in MongoDB.
If you do want to store dynamic attributes, the best approach is usually an array like the following:
{
  "properties" : [
    {
      "key" : "height",
      "value" : {
        "type" : "Int",
        "label" : "Height",
        "description" : "In meters"
      }
    },
    ...
  ]
}
Efficient querying for your use case - find all documents that have at least one property of a given type - then results from a multikey index on { "properties.value.type" : 1 }:
db.test.find({ "properties.value.type" : "coords" })
Queries on the property names work the same way with an index on { "properties.key" : 1 }:
db.test.find({ "properties.key" : { "$in" : ["height", "coordinates", "age"] } })