elasticsearch_dsl: Generate multiple buckets in aggregation

I want to generate this:
GET /packets-2017-09-25/_search
{
  "size": 0,
  "query": {
    "match": {
      "transport_protocol": "tcp"
    }
  },
  "aggs": {
    "clients": {
      "terms": {
        "field": "layers.ip.src.keyword",
        "size": 1000,
        "order": { "num_servers.value": "desc" }
      },
      "aggs": {
        "num_servers": {
          "cardinality": {
            "field": "layers.ip.dst.keyword",
            "precision_threshold": 40000
          }
        },
        "server_list": {
          "terms": {
            "field": "layers.ip.dst.keyword"
          }
        }
      }
    }
  }
}
i.e. I want two sub-aggregations (num_servers and server_list) under clients.
I am trying the piece of code below, which errors out:
def get_streams_per_client(proto='tcp', max=40000):
    s = Search(using=client, index="packets-2017-09-25") \
        .query("match", transport_protocol=proto)
    s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value": "desc"})\
        .bucket('num_servers', 'cardinality', field='layers.ip.dst.keyword', precision_threshold=40000)\
        .bucket('server_list', 'terms', field='layers.ip.dst.keyword')
    s = s.execute()
<snip>
I think I am missing the right syntax. I'd appreciate some guidance.

You can always reach an existing aggregation using the ["name"] notation if you want to define other sub-aggregations:
s = Search().query('match', transport_protocol='tcp')
s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value":"desc"})
s.aggs['clients'].metric('num_servers', 'cardinality', field=..., precision_threshold=...)
s.aggs['clients'].bucket('server_list', 'terms', ...)
Hope this helps!

Got the answer from Honza on the elasticsearch_dsl project page:
s = Search(using=client, index="packets-2017-09-25").query('match', transport_protocol=proto)
s.aggs.bucket('clients', 'terms', field='layers.ip.src.keyword', size=max, order={"num_servers.value":"desc"})
s.aggs['clients'].bucket('num_servers', 'cardinality', field='layers.ip.dst.keyword', precision_threshold=40000)
s.aggs['clients'].bucket('server_list', 'terms', field='layers.ip.dst.keyword')
print(json.dumps(s.to_dict(), indent=4))
s = s.execute()
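Once executed, the nested results can be read back off the response with attribute access; a minimal sketch (the loop and variable names are mine, the response layout is standard elasticsearch_dsl):

for client_bucket in s.aggregations.clients.buckets:
    # each terms bucket exposes its key plus both sub-aggregations
    print(client_bucket.key, client_bucket.num_servers.value)
    for server_bucket in client_bucket.server_list.buckets:
        print('  ', server_bucket.key, server_bucket.doc_count)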

Related

Elasticsearch java high level client group by and max

I am using Scala 2.12 and Elasticsearch 6.5, with the high level Java client to query ES.
As a simple example, the required documents exist as two sets of data (published twice) with different ids and timestamps:
ids id_123 and id_234 (these are the two different ids of the required documents), with timestamps (representation only) of 10 AM (for id_123) and 11 AM (for id_234).
I just need whichever of these documents is latest, i.e. the 11 AM one.
I have some filter conditions, and then need to group on field1 and take the max of field2 (which is the timestamp).
val searchRequest = new SearchRequest("index_name")
val searchSourceBuilder = new SearchSourceBuilder()
val qb = QueryBuilders.boolQuery()
  .must(QueryBuilders.matchQuery("myfield.date", "2019-07-02"))
  .must(QueryBuilders.matchQuery("myfield.data", "1111"))
  .must(QueryBuilders.boolQuery()
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex1"))
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex2"))
  )
val myAgg = AggregationBuilders.terms("group_by_Id").field("field1.Id")
  .subAggregation(AggregationBuilders.max("timestamp").field("field1.timeStamp"))
searchSourceBuilder.query(qb)
searchSourceBuilder.aggregation(myAgg)
searchSourceBuilder.size(1000)
searchRequest.source(searchSourceBuilder)
val searchResponse = client.search(searchRequest, RequestOptions.DEFAULT)
Basically, everything works fine if I do not use the aggregation.
When I use the aggregation, I get the following error:
ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Expected numeric type on field [field1.timeStamp], but got [keyword]]]
So what am I missing here?
I am basically looking for an SQL-like query that has a filter (WHERE, AND/OR clauses), then groups by a field (Id) and takes only the documents where timeStamp is the max.
UPDATE:
I tried the above query in cURL via the command prompt and get the same error when using "max" in the aggregation.
{
  "query": {
    "bool": {
      "must": [
        {
          "match": { "myfield.date": "2019-07-02" }
        },
        {
          "match": { "myfield.data": "1111" }
        },
        {
          "bool": {
            "should": [
              {
                "regexp": { "myOtherFieldId": "myregex1" }
              },
              {
                "regexp": { "myOtherFieldId": "myregex2" }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "field1.Id"
      },
      "aggs": {
        "NAME": {
          "max": {
            "field": "field1.timeStamp"
          }
        }
      }
    }
  },
  "size": "10000"
}
I am getting the same error.
I checked the mappings of the index, and the field is showing as keyword.
So how do I take the max of such a field?
Adding the relevant mappings:
{"index_name":{"mappings":{"data":{"dynamic_templates":[{"boolean_as_keyword":{"match":"*","match_mapping_type":"boolean","mapping":{"ignore_above":256,"type":"keyword"}}},{"double_as_keyword":{"match":"*","match_mapping_type":"double","mapping":{"ignore_above":256,"type":"keyword"}}},{"long_as_keyword":{"match":"*","match_mapping_type":"long","mapping":{"ignore_above":256,"type":"keyword"}}},{"string_as_keyword":{"match":"*","match_mapping_type":"string","mapping":{"ignore_above":256,"type":"keyword"}}}],"date_detection":false,"properties":{"header":{"properties":{"Id":{"type":"keyword","ignore_above":256},"otherId":{"type":"keyword","ignore_above":256},"someKey":{"type":"keyword","ignore_above":256},"dataType":{"type":"keyword","ignore_above":256},"processing":{"type":"keyword","ignore_above":256},"otherKey":{"type":"keyword","ignore_above":256},"sender":{"type":"keyword","ignore_above":256},"receiver":{"type":"keyword","ignore_above":256},"system":{"type":"keyword","ignore_above":256},"timeStamp":{"type":"keyword","ignore_above":256}}}}}}}}
UPDATE 2:
I think I need to aggregate on the keyword version of timeStamp.
Please note that timeStamp is a subfield, i.e. under field1, so the keyword syntax below doesn't seem to work, or I am missing something else.
"aggs": {
"NAME" : {
"terms": {
"field": "field1.Id"
},
"aggs": {
"NAME": {
"max" : {
"field": "field1.timeStamp.keyword"
}
}
}
}
}
It fails now saying:
"Invalid aggregator order path [field1.timeStamp]. Unknown aggregation [field1]"

Can we skip documents in MongoDB Map Reduce based on the values computed in reduce?

In Mongo's mapReduce function, can we skip storing a document with the out option?
Eg:
Sample Documents - animals -
{
  "type": "bird",
  "value": "10"
},
{
  "type": "fish",
  "value": "30"
},
{
  "type": "fish",
  "value": "20"
},
{
  "type": "plant",
  "value": "40"
}
Map reduce functions
function map() {
  emit(this.type, this.value);
}

function reduce(key, value) {
  var total = Array.sum(value);
  // --> IF total is less than 20 then skip
  if (total < 20) {
    // skip... Maybe something like returning null? --> Is something like this possible?
    return;
  }
  return total;
}

db.animals.mapReduce(map, reduce, { out: 'mr_test' })
I tried searching the documentation for a solution to this and wasn't able to find one. Can anyone help me with this?
No, you cannot, but you can always delete documents from the out collection afterwards:
db.mr_test.deleteMany({ value: { $lt:20 } })
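As an aside, the same skip-if-below-threshold logic is a natural fit for the aggregation pipeline, where a $match stage can filter on the computed total before it is written out. A minimal pymongo sketch (the database name is assumed, and $toInt needs MongoDB 4.0+ since the sample values are strings):

from pymongo import MongoClient

db = MongoClient().get_database('test')  # database name assumed

db.animals.aggregate([
    # sum the per-type values, converting the string values to ints
    {'$group': {'_id': '$type', 'value': {'$sum': {'$toInt': '$value'}}}},
    # skip groups with a total below 20 instead of deleting them afterwards
    {'$match': {'value': {'$gte': 20}}},
    # write the surviving groups to the output collection
    {'$out': 'mr_test'},
])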

Filter ResultSet after MongoDB MapReduce

Consider the following MongoDB collection / prototype that keeps track of how many cookies a given person has at a given point in time:
{
  "_id": ObjectId("5c5b5c1865e463c5b6a5b748"),
  "person": "Drew",
  "cookies": 1,
  "timestamp": ISODate("2019-02-05T20:34:48.922Z")
}
{
  "_id": ObjectId("5c5b5c2265e463c5b6a5b749"),
  "person": "Max",
  "cookies": 3,
  "timestamp": ISODate("2019-02-06T20:34:48.922Z")
}
{
  "_id": ObjectId("5c5b5c2e65e463c5b6a5b74a"),
  "person": "Max",
  "cookies": 0,
  "timestamp": ISODate("2019-02-07T20:34:48.922Z")
}
Ultimately, I need to get all people who currently have more than 0 cookies - In the above example, only "Drew" would qualify - ("Max" had 3, but later only had 0).
I've written the following map / reduce functions to sort this out:
var map = function() {
  emit(this.person, {'timestamp': this.timestamp, 'cookies': this.cookies})
}

var reduce = function(person, cookies) {
  let latestCookie = cookies.sort(function(a, b) {
    if (a.timestamp > b.timestamp) {
      return -1;
    } else if (a.timestamp < b.timestamp) {
      return 1;
    } else {
      return 0;
    }
  })[0];
  return {
    'timestamp': latestCookie.timestamp,
    'cookies': latestCookie.cookies
  };
}
This works fine and I get the following resultSet:
db.cookies.mapReduce(map, reduce, {out: {inline: 1}})
...
"results": [
  {
    "_id": "Drew",
    "value": {
      "timestamp": ISODate("2019-02-05T20:34:48.922Z"),
      "cookies": 1
    }
  },
  {
    "_id": "Max",
    "value": {
      "timestamp": ISODate("2019-02-07T20:34:48.922Z"),
      "cookies": 0
    }
  }
],
...
Max is included in the results, but I'd like for him to not be included (he has 0 cookies, after all).
What are my options here? I'm still relatively new to MongoDB. I have looked at finalize as well as creating a temporary collection ({out: "temp.cookies"}) but I'm just curious if there's an easier option or parameter I am overlooking.
Am I crazy for using MapReduce to solve this problem? The actual workload behind this scenario will include millions of rows.
Thanks in advance
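For what it's worth, here is a sketch (mine, not from the thread) of how the same result drops out of an aggregation pipeline, which avoids mapReduce entirely: sort newest-first, take the first document per person, then filter on cookies.

from pymongo import MongoClient

db = MongoClient().get_database('test')  # database name assumed

latest = db.cookies.aggregate([
    # newest document first, so $first picks each person's latest entry
    {'$sort': {'timestamp': -1}},
    {'$group': {
        '_id': '$person',
        'cookies': {'$first': '$cookies'},
        'timestamp': {'$first': '$timestamp'},
    }},
    # keep only people whose latest count is above zero
    {'$match': {'cookies': {'$gt': 0}}},
])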

Conversion of elasticsearch aggregation into code using java REST API is giving error

This works fine in the Kibana console without any error:
POST rangedattr_4_1/_search
{
  "size": 0,
  "aggs": {
    "user_field": {
      "terms": {
        "field": "FRONTLINK_OBJECT_GUID",
        "size": 10
      },
      "aggs": {
        "group_members": {
          "max": {
            "field": "MERGED_LINKS_TYPE"
          }
        },
        "groupmember_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "groupmembers": "group_members"
            },
            "script": "params.groupmembers % 2 == 1"
          }
        },
        "aggs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
But when I converted this into Java code, like this:
Map<String, String> bucketsPathsMap = new HashMap<>();
bucketsPathsMap.put("bucketselector", "FRONTLINK_OBJECT_GUID");
AggregationBuilder aggregation = AggregationBuilders
    .terms("aggs").field(linkBackGuid).size(100000)
    .subAggregation(
        AggregationBuilders.max("maxAgg").field("MERGED_LINKS_TYPE")
    )
    .subAggregation(
        PipelineAggregatorBuilders.bucketSelector("bucket_filter", bucketsPathsMap, script)
    )
    .subAggregation(
        AggregationBuilders.topHits("top").size(1)
    );
it gives an "all shards failed" error.
But when I run this:
AggregationBuilder aggregation = AggregationBuilders
    .terms("aggs").field(linkBackGuid).size(100000)
    .subAggregation(
        AggregationBuilders.max("maxAgg").field("MERGED_LINKS_TYPE")
    )
    .subAggregation(
        AggregationBuilders.topHits("top").size(1)
    );
it works fine.
As I understood it, "all shards failed" occurs only if the docs are corrupted. But as seen here, the max aggregation works fine while, at the same moment, the bucket selector aggregation gives the error.
It worked when I changed my bucketsPathsMap from
bucketsPathsMap.put("bucketselector", "FRONTLINK_OBJECT_GUID")
to
bucketsPathsMap.put("bucketselector", "maxAgg")
where FRONTLINK_OBJECT_GUID is a plain field and maxAgg is a sibling aggregation: the buckets_path of a bucket_selector must reference an aggregation, not a document field.
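For comparison with the main question's library, the corrected aggregation would look like this as a sketch in Python's elasticsearch_dsl (names taken from the Kibana query above; client is assumed to exist):

from elasticsearch_dsl import Search

s = Search(using=client, index='rangedattr_4_1').extra(size=0)
s.aggs.bucket('user_field', 'terms', field='FRONTLINK_OBJECT_GUID', size=10)
s.aggs['user_field'].metric('group_members', 'max', field='MERGED_LINKS_TYPE')
# buckets_path references the sibling aggregation, not a document field
s.aggs['user_field'].pipeline(
    'groupmember_bucket_filter', 'bucket_selector',
    buckets_path={'groupmembers': 'group_members'},
    script='params.groupmembers % 2 == 1')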

Elasticsearch: Find substring match

I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But if I search for "en's shaver" then I should also be able to find "men's shaver" in the result.
I am using the following settings and mappings:
Index settings:
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Mappings:
PUT /my_index/my_type/_mapping
{
  "my_type": {
    "properties": {
      "name": {
        "type": "string",
        "index_analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Insert records:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "men's shaver" }
{ "index": { "_id": 2 }}
{ "name": "women's shaver" }
Query:
1. To search by exact phrase match --> "men's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "men's"
    }
  }
}
The above query returns "men's shaver" in the result.
2. To search by Partial word match --> "en's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "en's"
    }
  }
}
The above query DOES NOT return anything.
I have also tried the following query:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "%en's%"
      }
    }
  }
}
Still not getting anything.
I figured it is because the "edge_ngram" filter on the index cannot find partial word/substring matches.
I tried an "n-gram" filter as well, but it slows down the search a lot.
Please suggest how I can achieve both exact phrase match and partial phrase match using the same index settings.
To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query.
See also this.
To use a wildcard query, append * on both ends of the string you are searching for:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*en's*"
      }
    }
  }
}
For case insensitivity, use a custom analyzer with a lowercase filter and the keyword tokenizer.
Custom Analyzer:
"custom_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
Then make the search string lowercase: if you get the search string AsD, change it to *asd*.
The answer given by @BlackPOP will work, but it uses the wildcard approach, which is not preferred: it has performance issues and, if abused, can create a huge domino effect (performance degradation) in the Elastic cluster.
I have written a detailed blog on partial search/autocomplete covering the latest options available in Elasticsearch as of today (Dec 2020) with performance in mind. For more trade-off information please refer to this answer.
IMHO a better approach is to use a custom n-gram tokenizer tailored to the use case: the tokens needed for the search terms are already in the index, so searches are faster. The index will be bigger, but storage is not that costly, and you get more control over exactly how substring search works.
Index size can also be kept in check by being conservative when defining the min and max gram in the tokenizer settings. A sketch of such an index follows below.
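To make that concrete, here is a sketch of such an index (the index name, gram sizes, and analyzer names are mine, not from the answer), created with the Python elasticsearch client against a 7.x cluster:

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index='my_index_v2', body={
    'settings': {
        'analysis': {
            'tokenizer': {
                # emit every 3- and 4-character substring at index time
                'substring_tokenizer': {
                    'type': 'ngram', 'min_gram': 3, 'max_gram': 4,
                    'token_chars': ['letter', 'digit'],
                },
            },
            'analyzer': {
                'substring_analyzer': {
                    'type': 'custom',
                    'tokenizer': 'substring_tokenizer',
                    'filter': ['lowercase'],
                },
            },
        },
    },
    'mappings': {
        'properties': {
            'name': {
                'type': 'text',
                'analyzer': 'substring_analyzer',
                # analyze queries with the standard analyzer so a plain
                # match on "en's" hits the substrings built at index time
                'search_analyzer': 'standard',
            },
        },
    },
})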
To search with any string or substring, use:
query: {
  or: [{
    match_phrase_prefix: {
      name: str
    }
  }, {
    match_phrase_prefix: {
      surname: str
    }
  }]
}
Happy coding with Elasticsearch!