Converting an Elasticsearch aggregation into code using the Java REST API gives an error - rest

This works fine in the Kibana console without any error:
POST rangedattr_4_1/_search
{
  "size": 0,
  "aggs": {
    "user_field": {
      "terms": {
        "field": "FRONTLINK_OBJECT_GUID",
        "size": 10
      },
      "aggs": {
        "group_members": {
          "max": {
            "field": "MERGED_LINKS_TYPE"
          }
        },
        "groupmember_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "groupmembers": "group_members"
            },
            "script": "params.groupmembers %2==1"
          }
        },
        "aggs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
But when I converted this into Java code, like this:
Map<String, String> bucketsPathsMap = new HashMap<>();
bucketsPathsMap.put("bucketselector", "FRONTLINK_OBJECT_GUID");
AggregationBuilder aggregation = AggregationBuilders
    .terms("aggs").field(linkBackGuid).size(100000)
    .subAggregation(
        AggregationBuilders.max("maxAgg").field("MERGED_LINKS_TYPE")
    )
    .subAggregation(
        PipelineAggregatorBuilders.bucketSelector("bucket_filter", bucketsPathsMap, script)
    )
    .subAggregation(
        AggregationBuilders.topHits("top").size(1)
    );
it gives an "all shards failed" error. But when I run this:
AggregationBuilder aggregation = AggregationBuilders
    .terms("aggs").field(linkBackGuid).size(100000)
    .subAggregation(
        AggregationBuilders.max("maxAgg").field("MERGED_LINKS_TYPE")
    )
    .subAggregation(
        AggregationBuilders.topHits("top").size(1)
    );
it works fine.
As far as I knew, "all shards failed" occurs only when documents are corrupted. But here, the max aggregation works fine while, at the same moment, the bucket selector aggregation gives an error.

It worked when I changed my bucketsPathsMap from
bucketsPathsMap.put("bucketselector", "FRONTLINK_OBJECT_GUID")
to
bucketsPathsMap.put("bucketselector", "maxAgg")
because FRONTLINK_OBJECT_GUID is a document field, while maxAgg is a sibling aggregation that produces the values: buckets_path must reference aggregation names, not fields.
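Putting it together, a minimal sketch of the corrected builder. The Script value is not shown in the question, so the script line below is an assumption mirroring the Kibana query; note that the buckets_path key must match the parameter name the script reads:

Map<String, String> bucketsPathsMap = new HashMap<>();
// Key = the params name used in the script; value = the NAME of a sibling
// aggregation ("maxAgg"), never a document field.
bucketsPathsMap.put("groupmembers", "maxAgg");
Script script = new Script("params.groupmembers % 2 == 1"); // assumed, per the Kibana query
AggregationBuilder aggregation = AggregationBuilders
    .terms("aggs").field(linkBackGuid).size(100000)
    .subAggregation(AggregationBuilders.max("maxAgg").field("MERGED_LINKS_TYPE"))
    .subAggregation(PipelineAggregatorBuilders.bucketSelector("bucket_filter", bucketsPathsMap, script))
    .subAggregation(AggregationBuilders.topHits("top").size(1));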

Related

Elasticsearch java high level client group by and max

I am using Scala 2.12 and Elasticsearch 6.5, with the Java high-level client to query ES.
A simple example of the required data: the documents exist in two versions (published twice) with different ids and timestamps: id_123 with timestamp 10 AM and id_234 with timestamp 11 AM (representation only).
I need only the documents that are latest among these, i.e. the 11 AM one.
I have some filter conditions, and then need to group on field1 and take the max of field2 (which is a timestamp).
val searchRequest = new SearchRequest("index_name")
val searchSourceBuilder = new SearchSourceBuilder()
val qb = QueryBuilders.boolQuery()
  .must(QueryBuilders.matchQuery("myfield.date", "2019-07-02"))
  .must(QueryBuilders.matchQuery("myfield.data", "1111"))
  .must(QueryBuilders.boolQuery()
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex1"))
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex2")))
val myAgg = AggregationBuilders.terms("group_by_Id").field("field1.Id")
  .subAggregation(AggregationBuilders.max("timestamp").field("field1.timeStamp"))
searchSourceBuilder.query(qb)
searchSourceBuilder.aggregation(myAgg)
searchSourceBuilder.size(1000)
searchRequest.source(searchSourceBuilder)
val searchResponse = client.search(searchRequest, RequestOptions.DEFAULT)
Basically, everything works well if I do not use the aggregation.
When I use the aggregation, I get the following error:
ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Expected numeric type on field [field1.timeStamp], but got [keyword]]]
So what am I missing here?
I am basically looking for an SQL-like query that has a filter (WHERE with AND/OR clauses), then groups by a field (Id) and takes only the documents where timeStamp is max.
UPDATE:
I tried the above query in cURL via the command prompt and got the same error when using "max" in the aggregation.
{
  "query": {
    "bool": {
      "must": [
        {
          "match": { "myfield.date": "2019-07-02" }
        },
        {
          "match": { "myfield.data": "1111" }
        },
        {
          "bool": {
            "should": [
              {
                "regexp": { "myOtherFieldId": "myregex1" }
              },
              {
                "regexp": { "myOtherFieldId": "myregex2" }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "field1.Id"
      },
      "aggs": {
        "NAME": {
          "max": {
            "field": "field1.timeStamp"
          }
        }
      }
    }
  },
  "size": "10000"
}
I am getting the same error.
I checked the mappings of the index: the field shows as keyword. So how do I take a max on such a field?
Adding the relevant mappings:
{"index_name":{"mappings":{"data":{"dynamic_templates":[{"boolean_as_keyword":{"match":"*","match_mapping_type":"boolean","mapping":{"ignore_above":256,"type":"keyword"}}},{"double_as_keyword":{"match":"*","match_mapping_type":"double","mapping":{"ignore_above":256,"type":"keyword"}}},{"long_as_keyword":{"match":"*","match_mapping_type":"long","mapping":{"ignore_above":256,"type":"keyword"}}},{"string_as_keyword":{"match":"*","match_mapping_type":"string","mapping":{"ignore_above":256,"type":"keyword"}}}],"date_detection":false,"properties":{"header":{"properties":{"Id":{"type":"keyword","ignore_above":256},"otherId":{"type":"keyword","ignore_above":256},"someKey":{"type":"keyword","ignore_above":256},"dataType":{"type":"keyword","ignore_above":256},"processing":{"type":"keyword","ignore_above":256},"otherKey":{"type":"keyword","ignore_above":256},"sender":{"type":"keyword","ignore_above":256},"receiver":{"type":"keyword","ignore_above":256},"system":{"type":"keyword","ignore_above":256},"timeStamp":{"type":"keyword","ignore_above":256}}}}}}}}
UPDATE2:
I think I need to aggregate on the keyword field (timeStamp). Note that timeStamp is a subfield, i.e. it sits under field1, so the .keyword syntax below doesn't seem to work, or I am missing something else:
"aggs": {
"NAME" : {
"terms": {
"field": "field1.Id"
},
"aggs": {
"NAME": {
"max" : {
"field": "field1.timeStamp.keyword"
}
}
}
}
}
It now fails with:
"Invalid aggregator order path [field1.timeStamp]. Unknown aggregation [field1]"

Filter ResultSet after MongoDB MapReduce

Consider the following MongoDB collection / prototype that keeps track of how many cookies a given person has at a given point in time:
{
  "_id": ObjectId("5c5b5c1865e463c5b6a5b748"),
  "person": "Drew",
  "cookies": 1,
  "timestamp": ISODate("2019-02-05T20:34:48.922Z")
}
{
  "_id": ObjectId("5c5b5c2265e463c5b6a5b749"),
  "person": "Max",
  "cookies": 3,
  "timestamp": ISODate("2019-02-06T20:34:48.922Z")
}
{
  "_id": ObjectId("5c5b5c2e65e463c5b6a5b74a"),
  "person": "Max",
  "cookies": 0,
  "timestamp": ISODate("2019-02-07T20:34:48.922Z")
}
Ultimately, I need to get all people who currently have more than 0 cookies. In the above example, only "Drew" would qualify ("Max" had 3, but later had 0).
I've written the following map/reduce functions to sort this out:
var map = function() {
  emit(this.person, { 'timestamp': this.timestamp, 'cookies': this.cookies });
};
var reduce = function(person, cookies) {
  let latestCookie = cookies.sort(function(a, b) {
    if (a.timestamp > b.timestamp) {
      return -1;
    } else if (a.timestamp < b.timestamp) {
      return 1;
    } else {
      return 0;
    }
  })[0];
  return {
    'timestamp': latestCookie.timestamp,
    'cookies': latestCookie.cookies
  };
};
This works fine and I get the following resultSet:
db.cookies.mapReduce(map, reduce, {out:{inline:1}})
...
"results": [
{
"_id": "Drew",
"value": {
"timestamp": ISODate("2019-02-05T20:34:48.922Z"),
"cookies": 1
}
},
{
"_id": "Max",
"value": {
"timestamp": ISODate("2019-02-07T20:34:48.922Z"),
"cookies": 0
}
}
],
...
Max is included in the results, but I'd like him not to be included (he has 0 cookies, after all).
What are my options here? I'm still relatively new to MongoDB. I have looked at finalize as well as writing to a temporary collection ({out: "temp.cookies"}), but I'm curious whether there's an easier option or a parameter I am overlooking.
Am I crazy for using MapReduce to solve this problem? The actual workload behind this scenario will involve millions of rows.
Thanks in advance
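For what it's worth, the same "latest per person, then filter" logic also fits an aggregation pipeline, which MongoDB generally recommends over mapReduce for queries like this. A sketch using the MongoDB Java driver (the db handle is an assumption; the collection and field names come from the documents above):

import static com.mongodb.client.model.Accumulators.first;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Aggregates.sort;
import static com.mongodb.client.model.Filters.gt;
import static com.mongodb.client.model.Sorts.descending;

import com.mongodb.client.MongoCollection;
import java.util.Arrays;
import org.bson.Document;

// `db` is an assumed MongoDatabase handle.
MongoCollection<Document> cookies = db.getCollection("cookies");
cookies.aggregate(Arrays.asList(
    sort(descending("timestamp")),           // newest document first
    group("$person",                         // collapse to one row per person,
        first("cookies", "$cookies"),        // keeping the latest cookie count
        first("timestamp", "$timestamp")),
    match(gt("cookies", 0))                  // drop anyone whose latest count is 0
)).forEach(doc -> System.out.println(doc.toJson()));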

elasticsearch_dsl: Generate multiple buckets in aggregation

I want to generate this:
GET /packets-2017-09-25/_search
{
  "size": 0,
  "query": {
    "match": {
      "transport_protocol": "tcp"
    }
  },
  "aggs": {
    "clients": {
      "terms": {
        "field": "layers.ip.src.keyword",
        "size": 1000,
        "order": { "num_servers.value": "desc" }
      },
      "aggs": {
        "num_servers": {
          "cardinality": {
            "field": "layers.ip.dst.keyword",
            "precision_threshold": 40000
          }
        },
        "server_list": {
          "terms": {
            "field": "layers.ip.dst.keyword"
          }
        }
      }
    }
  }
}
i.e. I want two sub-aggregations (num_servers and server_list) directly under clients.
I am trying the piece of code below, which errors out:
def get_streams_per_client(proto='tcp', max=40000):
    s = Search(using=client, index="packets-2017-09-25") \
        .query("match", transport_protocol=proto)
    s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value": "desc"})\
        .bucket('num_servers', 'cardinality', field='layers.ip.dst.keyword', precision_threshold=40000)\
        .bucket('server_list', 'terms', field='layers.ip.dst.keyword')
    s = s.execute()
    <snip>
I think I am missing the right syntax. I'd appreciate some guidance.
You can always reach an existing aggregation using the ["name"] notation if you want to define other sub-aggregations on it; chaining .bucket() calls, as above, nests each new aggregation inside the previous one instead of making them siblings:
s = Search().query('match', transport_protocol='tcp')
s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value":"desc"})
s.aggs['clients'].metric('num_servers', 'cardinality', field=..., precision_threshold=...)
s.aggs['clients'].bucket('server_list', 'terms', ...)
Hope this helps!
Got the answer from Honza on the elasticsearch_dsl project page:
s = Search(using=client, index="packets-2017-09-25").query('match', transport_protocol=proto)
s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value":"desc"})
s.aggs['clients'].bucket('num_servers', 'cardinality', field='layers.ip.dst.keyword', precision_threshold=40000)
s.aggs['clients'].bucket('server_list', 'terms', field='layers.ip.dst.keyword')
print(json.dumps(s.to_dict(), indent=4))
s = s.execute()

RxJava complicated async task

I am trying to use an async RxJava flow to handle a complicated task.
I get a list of metrics and a list of cities, and fetch their values (a list of dates and values). Each metric is stored in a different DB, so lots of HTTP calls are expected.
The result order means nothing, so I want to:
run over all the metrics asynchronously;
for every metric, run over all the cities asynchronously;
for every city and metric, make an HTTP call to fetch the data.
The end response should look like this:
"result": {
"metric1": [
{
"cityId": 8,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
},
{
"cityId": 9,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
}
],
"metric2": [
{
"cityId": 8,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
},
{
"cityId": 9,
"values": [
{
"date": "2017-09-26T10:49:49",
"value": 445
},
{
"date": "2017-09-27T10:49:49",
"value": 341
}
]
}
]
}
}
}
How can I achieve this? I got lost with all the observables. Which should be blocking, and which should return a real value?
EDIT: added a code sample.
Here is what I have so far:
1. For each metric (in parallel, I guess):
Observable.from(request.getMetrics())
    .subscribe(metric -> {
        List metricDataList = getMetricData(metric, citiesList);
        result.put(metric.getName(), metricDataList);
    });
return ImmutableMap.of("result", result);
2. Get the metric data (per city):
final List<Map<String, Object>> result = new ArrayList<>();
Observable.from(citiesList).subscribe(city ->
    result.add(ImmutableMap.of(
        "id", city.getId(),
        "name", city.getName(),
        "value", getMetricValues(metricType, city.getId())
            .toBlocking()
            .firstOrDefault(new ArrayList<>()))));
return result;
3. Depending on the metric, I decide which service to invoke:
private Observable<List<AggregatedObject>> getMetricValues(AggregationMetricType metric, Integer cityId) {
    switch (metric) {
        case METRIC_1: return fetchMetric1(cityId);
        case METRIC_2: return fetchMetric2(cityId);
    }
    return Observable.empty();
}
4. Invoke the service:
public Observable<List<AggregatedObject>> fetchMetric1(Integer city) {
    return Observable.just(httpClient.getData(city))
        .map(this::transformCountResult)
        .onErrorResumeNext(err -> Observable.from(new ArrayList<>()));
}
5. Transform the received data:
protected List<AggregatedObject> transformCountResult(JsonNode jsonNode) {
...
}
It is working, though I am not sure I have implemented it correctly with regard to blocking and concurrency.
Please help.
Regards,
Ido
In the step you call "2) get metric data", you could refactor it as:
final List<Map<String, Object>> result = new ArrayList<>();
Observable.from(citiesList)
    .flatMap(city -> getMetricValues(metricType, city.getId())
        .firstOrDefault(new ArrayList<>())
        .map(metric -> ImmutableMap.of(
            "id", city.getId(),
            "name", city.getName(),
            "value", metric)))
    .toList()
    .toBlocking()
    .subscribe(resultList -> result.addAll(resultList)); // a final local can't be reassigned from a lambda
Moving the toBlocking() outside of the processing means that individual requests will occur in a more parallel fashion.
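The same idea applies to the step you call "1) for each metric": instead of filling result from inside subscribe() and returning immediately (the map may still be empty when the method returns), collect everything and block exactly once at the edge. A rough sketch, where getMetricDataAsync is a hypothetical non-blocking variant of getMetricData returning Observable<List<Map<String, Object>>>:

Map<String, List<Map<String, Object>>> result = Observable.from(request.getMetrics())
    .flatMap(metric -> getMetricDataAsync(metric, citiesList)      // hypothetical async variant
        .map(data -> new AbstractMap.SimpleEntry<>(metric.getName(), data)))
    .toMap(e -> e.getKey(), e -> e.getValue())
    .toBlocking()
    .single();                                                     // block once, at the edge
return ImmutableMap.of("result", result);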

How to use id, find an element and only return one field in Meteor + MongoDB

How do I use an id to find an element and return only one field in Meteor + MongoDB? I want to return only status, but this doesn't work; it returns the whole document. What am I missing?
stuCourse.classId = awquMqKMrYKqNueGx
stuCourse.courseId = m7pcWesZnhWxJgojG
Client side:
const clas = Col_AllClasses.findOne({
  _id: stuCourse.classId,
  "courseList.courseId": stuCourse.courseId
}, {
  field: {
    "courseList.status": 1
  }
})
MongoDB data:
{
  "_id": "awquMqKMrYKqNueGx",
  "title": "haha1",
  "password": "123",
  "courseList": [
    {
      "courseId": "52Eo6XJ33CMGLo4rL",
      "status": 0
    },
    {
      "courseId": "m7pcWesZnhWxJgojG",
      "status": 0
    }
  ]
}
You are writing the query incorrectly for what you want: you need to replace the field keyword with fields. Your Meteor Mongo query should then look like:
Col_AllClasses.findOne({
  _id: stuCourse.classId,
  "courseList.courseId": stuCourse.courseId
}, {
  fields: {
    "courseList.status": 1
  }
});
field: {
  "courseList.status": 1
}
should be
fields: {
  "courseList.status": 1
}