Elasticsearch java high level client group by and max - scala

I am using Scala 2.12 and Elasticsearch 6.5, with the Java high level REST client to query ES.
As a simple example of the required data, a document is published twice, with a different id and timestamp each time:
id: id_123 and id_234 (these are two different ids of the required documents), and timestamps (representation only) of 10 AM (for id_123) and 11 AM (for id_234).
I only need the latest of these documents, i.e. the 11 AM one.
I have some filter conditions and then need to group on field1 and take the max of field2 (which is timestamp).
val searchRequest = new SearchRequest("index_name")
val searchSourceBuilder = new SearchSourceBuilder()

// Filter conditions
val qb = QueryBuilders.boolQuery()
  .must(QueryBuilders.matchQuery("myfield.date", "2019-07-02"))
  .must(QueryBuilders.matchQuery("myfield.data", "1111"))
  .must(QueryBuilders.boolQuery()
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex1"))
    .should(QueryBuilders.regexpQuery("myOtherFieldId", "myregex2"))
  )

// Group by field1.Id and take the max of field1.timeStamp
val myAgg = AggregationBuilders.terms("group_by_Id").field("field1.Id")
  .subAggregation(AggregationBuilders.max("timestamp").field("field1.timeStamp"))

searchSourceBuilder.query(qb)
searchSourceBuilder.aggregation(myAgg)
searchSourceBuilder.size(1000)
searchRequest.source(searchSourceBuilder)

val searchResponse = client.search(searchRequest, RequestOptions.DEFAULT)
Basically, everything works fine if I do not use the aggregation.
When I use the aggregation, I am getting the following error:
ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Expected numeric type on field [field1.timeStamp], but got [keyword]]]
So what am I missing here?
I am basically looking for an SQL-like query which has a filter (WHERE with AND/OR clauses), then groups by a field (Id) and takes only the documents where timeStamp is the max.
UPDATE:
I tried the above query with cURL from the command prompt and get the same error when using "max" in the aggregation.
{
  "query": {
    "bool": {
      "must": [
        { "match": { "myfield.date": "2019-07-02" } },
        { "match": { "myfield.data": "1111" } },
        {
          "bool": {
            "should": [
              { "regexp": { "myOtherFieldId": "myregex1" } },
              { "regexp": { "myOtherFieldId": "myregex2" } }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "field1.Id"
      },
      "aggs": {
        "NAME": {
          "max": {
            "field": "field1.timeStamp"
          }
        }
      }
    }
  },
  "size": "10000"
}
I am getting the same error.
I checked the mappings of the index; the field is showing as keyword. So how can I do max on such a field?
Adding the relevant mappings:
{"index_name":{"mappings":{"data":{"dynamic_templates":[{"boolean_as_keyword":{"match":"*","match_mapping_type":"boolean","mapping":{"ignore_above":256,"type":"keyword"}}},{"double_as_keyword":{"match":"*","match_mapping_type":"double","mapping":{"ignore_above":256,"type":"keyword"}}},{"long_as_keyword":{"match":"*","match_mapping_type":"long","mapping":{"ignore_above":256,"type":"keyword"}}},{"string_as_keyword":{"match":"*","match_mapping_type":"string","mapping":{"ignore_above":256,"type":"keyword"}}}],"date_detection":false,"properties":{"header":{"properties":{"Id":{"type":"keyword","ignore_above":256},"otherId":{"type":"keyword","ignore_above":256},"someKey":{"type":"keyword","ignore_above":256},"dataType":{"type":"keyword","ignore_above":256},"processing":{"type":"keyword","ignore_above":256},"otherKey":{"type":"keyword","ignore_above":256},"sender":{"type":"keyword","ignore_above":256},"receiver":{"type":"keyword","ignore_above":256},"system":{"type":"keyword","ignore_above":256},"timeStamp":{"type":"keyword","ignore_above":256}}}}}}}}
UPDATE2:
I think I need to aggregate (timeStamp) on keyword.
Please note that timeStamp is a subfield, i.e. it sits under field1. So the syntax below for keyword doesn't seem to work, or I am missing something else.
"aggs": {
"NAME" : {
"terms": {
"field": "field1.Id"
},
"aggs": {
"NAME": {
"max" : {
"field": "field1.timeStamp.keyword"
}
}
}
}
}
It fails now saying:
"Invalid aggregator order path [field1.timeStamp]. Unknown aggregation [field1]"

Related

ElasticSearch Multi Index Query

Simple question: I have multiple indexes in my Elasticsearch engine, mirrored from PostgreSQL using Logstash. Elasticsearch performs well for fuzzy searches, but now I need to use references between the indexes, and these need to be handled by the queries.
Index A:
{
  name: "alice",
  _id: 5
}
...
Index B:
{
  name: "bob",
  _id: 3,
  best_friend: 5
}
...
How do I query:
Get every match from index B with the field name starting with "b", and the index A document referenced by "best_friend" with a name starting with "a".
Is this even possible with Elasticsearch?
Yes, that's possible: POST A,B/_search will query multiple indexes.
In order to match a record from a specific index, you can use the metadata field _index.
Below is a query that gets every match from index B with the name starting with "b" and from index A with the name starting with "a", but it does not follow a reference the way you would in a relational SQL database. Foreign-key reference matching (a join) in Elasticsearch, and in every NoSQL store, is your responsibility as far as I know. Refer to the Elasticsearch Definitive Guide to find the best approach for your needs. Lastly, NoSQL is not SQL, so change your mindset.
POST A,B/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              { "prefix": { "name": "a" } },
              { "term": { "_index": "A" } }
            ]
          }
        },
        {
          "bool": {
            "must": [
              { "prefix": { "name": "b" } },
              { "term": { "_index": "B" } }
            ]
          }
        }
      ]
    }
  }
}
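For completeness, the same multi-index search can be issued from the Java high level REST client used above (a sketch; client is assumed to be an already configured RestHighLevelClient):

// Sketch: query indices A and B in one request and tell the hits apart
// by the metadata field _index of each hit.
import org.elasticsearch.action.search.SearchRequest
import org.elasticsearch.client.RequestOptions
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder

val source = new SearchSourceBuilder().query(
  QueryBuilders.boolQuery()
    .should(QueryBuilders.boolQuery()
      .must(QueryBuilders.prefixQuery("name", "a"))
      .must(QueryBuilders.termQuery("_index", "A")))
    .should(QueryBuilders.boolQuery()
      .must(QueryBuilders.prefixQuery("name", "b"))
      .must(QueryBuilders.termQuery("_index", "B")))
)

val response = client.search(new SearchRequest("A", "B").source(source), RequestOptions.DEFAULT)
response.getHits.getHits.foreach(hit => println(s"${hit.getIndex}: ${hit.getSourceAsString}"))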

RemoteTransportException, Fielddata is disabled on text fields when doing aggregation on text field

I am migrating from 2.x to 5.x
I am adding values to the index like this
indexInto (indexName / indexType) id someKey source foo
However, I also want to fetch all values by field:
def getValues(tag: String) = {
  client execute {
    search(indexName / indexType) query ("_field_names", tag) aggregations (termsAggregation("agg") field tag size 1)
  }
}
But I am getting this exception :
RemoteTransportException[[8vWOLB2][172.17.0.5:9300][indices:data/read/search[phase/query]]];
nested: IllegalArgumentException[Fielddata is disabled on text fields
by default. Set fielddata=true on [my_tag] in order to load fielddata
in memory by uninverting the inverted index. Note that this can
however use significant memory.];
I thought of maybe using keyword as shown here, but the fields are not known in advance (they are sent by the user), so I cannot use predefined mappings.
By default, all unknown fields (fields not specified in the mappings) will be indexed/added to Elasticsearch as text fields.
If you take a look at the mapping of such a field, you can see that a keyword sub-field is enabled for it, and this keyword field is indexed but not analyzed.
GET new_index2/_mappings
{
  "new_index2": {
    "mappings": {
      "type": {
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
So for such text fields you can use the keyword sub-field in aggregations, like the following:
POST new_index2/_search
{
  "aggs": {
    "NAME": {
      "terms": {
        "field": "name.keyword",
        "size": 10
      }
    }
  }
}
Note the name.keyword field.
So your Scala query can work if you shift to the keyword sub-field.
def getValues(tag: String) = {
  client.execute {
    search(indexName / indexType)
      .query("_field_names", tag)
      .aggregations {
        termsAgg("agg", s"$tag.keyword")
      }.size(1)
  }
}
Hope this helps.
Thanks

Find all objects whose nested properties have a desired value

I have a collection with the following (sample) documents:
{
  "label": "Tree",
  "properties": {
    "height": {
      "type": "int",
      "label": "Height",
      "description": "In meters"
    },
    "coordinates": {
      "type": "coords",
      "label": "Coordinates"
    },
    "age": {
      "type": "int",
      "label": "Age"
    }
  }
}
Keys in the properties attribute differ for almost every document in the collection.
I want to find all documents that have at least one property of a given type.
What I'm looking for is to query this with something like {"properties.*.type": "coords"}, but this does not work, as it is only my invented Mongo query syntax.
All the help I was able to find concerned the $elemMatch operator, which I cannot use here because properties is an object, not an array.
Hi, as far as I know MongoDB does not provide this kind of search. So to find this, I first separated out all the keys using map-reduce and then formed the find queries, so the code below should help you:
var mapReduce = db.runCommand({
  "mapreduce": "collectionName",
  "map": function() {
    for (var key in this.properties) {
      emit(key, null);
    }
  },
  "reduce": function(key, stuff) {
    return null;
  },
  "out": "collectionName" + "_keys"
})

db[mapReduce.result].distinct("_id").forEach(function(data) {
  findkey = [];
  findkey.push("properties." + data + ".type");
  var query = {};
  query[findkey] = "coords";
  var myCursor = db.collectionName.find(query);
  while (myCursor.hasNext()) {
    print(tojson(myCursor.next()));
  }
})
MongoDB doesn't support searches on keys - things like properties.* to match all subkeys of properties, etc. You shouldn't have arbitrary keys or keys that you don't know about in your schema, unless they are just for display, generally, because you will not be able to interact with them very easily in MongoDB.
If you do want to store dynamic attributes, the best approach is usually an array like the following:
{
  "properties": [
    {
      "key": "height",
      "value": {
        "type": "Int",
        "label": "Height",
        "description": "In meters"
      }
    },
    ...
  ]
}
Efficient querying for your use case
find all documents that have at least one property of a given type
results from an index on { "properties.key" : 1 }:
db.test.find({ "properties.key" : { "$in" : ["height", "coordinates", "age"] } })

Elasticsearch: Find substring match

I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the results. But if I search for "en's shaver" then I should also be able to find "men's shaver" in the results.
I am using the following settings and mappings:
Index settings:
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Mappings:
PUT /my_index/my_type/_mapping
{
  "my_type": {
    "properties": {
      "name": {
        "type": "string",
        "index_analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Insert records:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "men's shaver" }
{ "index": { "_id": 2 }}
{ "name": "women's shaver" }
Query:
1. To search by exact phrase match --> "men's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "men's"
    }
  }
}
The above query returns "men's shaver" in the results.
2. To search by partial word match --> "en's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "en's"
    }
  }
}
The above query DOES NOT return anything.
I have also tried the following query:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "%en's%"
      }
    }
  }
}
Still not getting anything.
I figured it is because of the "edge_ngram" type filter on the index, which is not able to find a "partial word/substring match".
I tried an "n-gram" type filter as well, but it slows down the search a lot.
Please suggest how to achieve both exact phrase match and partial phrase match using the same index settings.
To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query.
See also this.
To use a wildcard query, append * on both ends of the string you are searching for:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*en's*"
      }
    }
  }
}
To use with case insensitivity, use a custom analyzer with a lowercase filter and keyword tokenizer.
Custom Analyzer:
"custom_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
Make the search string lowercase.
If you get the search string as AsD, change it to *asd*.
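For example, with the Java high level REST client the lowercasing can be done on the client side before building the wildcard query (a sketch; name.keyword is an assumed keyword sub-field and client an already configured RestHighLevelClient):

import org.elasticsearch.action.search.SearchRequest
import org.elasticsearch.client.RequestOptions
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder

// Sketch: lowercase the user input and wrap it in * wildcards before querying.
def substringSearch(input: String) = {
  val pattern = s"*${input.toLowerCase}*" // e.g. "AsD" becomes "*asd*"
  val source = new SearchSourceBuilder().query(QueryBuilders.wildcardQuery("name.keyword", pattern))
  client.search(new SearchRequest("my_index").source(source), RequestOptions.DEFAULT)
}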
The answer given by @BlackPOP will work, but it uses the wildcard approach, which is not preferred as it has performance issues and, if abused, can create a huge domino effect (performance degradation) in the Elastic cluster.
I have written a detailed blog on partial search/autocomplete covering the latest options available in Elasticsearch as of today (Dec 2020) with performance in mind. For more trade-off information please refer to this answer.
IMHO a better approach is to use a customized n-gram tokenizer according to the use case, which will already have the tokens needed for the search term, so it will be faster; it will have a bigger index size, but storage is not that costly, speed will be better, and you get more control over how exactly you want substring search to work.
The index size can also be controlled by being conservative when defining the min and max gram in the tokenizer settings.
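Roughly, such a setup could look like the following sketch using the same Java high level REST client (the index name, analyzer name and gram sizes are illustrative and should be tuned to the use case):

// Sketch: create an index with an n-gram tokenizer so that substrings are
// indexed as tokens at index time; attach substring_analyzer to the searched fields.
import org.elasticsearch.action.admin.indices.create.CreateIndexRequest
import org.elasticsearch.client.RequestOptions
import org.elasticsearch.common.xcontent.XContentType

val body =
  """{
    |  "settings": {
    |    "analysis": {
    |      "tokenizer": {
    |        "substring_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 4 }
    |      },
    |      "analyzer": {
    |        "substring_analyzer": {
    |          "type": "custom",
    |          "tokenizer": "substring_tokenizer",
    |          "filter": ["lowercase"]
    |        }
    |      }
    |    }
    |  }
    |}""".stripMargin

client.indices().create(new CreateIndexRequest("products").source(body, XContentType.JSON), RequestOptions.DEFAULT)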
To search with any string or substring, use:
query: {
  bool: {
    should: [{
      match_phrase_prefix: {
        name: str
      }
    }, {
      match_phrase_prefix: {
        surname: str
      }
    }]
  }
}
Happy coding with Elastic Search....

How can I find records greater than or equal to a time in MongoDB?

I have a MongoDB document structured like this:
{
  "_id": ObjectId("50cf904a07ef604c8cc3d091"),
  "lessons": {
    "0": {
      "lesson_name": "View and Edit Lists",
      "release_time": ISODate("2012-12-17T00:00:00Z"),
      "requires_anim": false,
      "requires_qq": true
    },
    "1": {
      "lesson_name": "Leave a Tip",
      "release_time": ISODate("2012-12-18T00:00:00Z"),
      "requires_anim": false,
      "requires_qq": true
    }
  }
}
I have a number of such documents. I'd like to get all documents for which the release time of a lesson is greater than or equal to a given time. Here's the query I wrote:
db.lessons.find({"lessons.release_time":{"$gte": ISODate("2012-12-16")}});
But this is not returning any documents. Any ideas on what I'm doing wrong and how to correct it? Thanks.
Here's the result of my testing:
> db.testc.insert( { lessons: [
    { release_time: ISODate("2012-12-17T00:00:00Z") },
    { release_time: ISODate("2012-12-18T00:00:00Z") }
  ] } )
> db.testc.find({ "lessons.release_time": { "$gte": ISODate("2012-12-16") } })
{ "_id" : ObjectId("50cfa093ab08a4592c73f927"),
  "lessons" : [
    { "release_time" : ISODate("2012-12-17T00:00:00Z") },
    { "release_time" : ISODate("2012-12-18T00:00:00Z") }
  ] }
Your query is fine but, as others have pointed out, most likely your data is not structured as an array.
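For reference, the equivalent query from Scala with the MongoDB Java driver, assuming lessons has been restructured as an array as in the test above (connection details are placeholders):

import java.util.Date
import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters

// Placeholder connection; adjust the URI, database and collection names.
val mongo = MongoClients.create("mongodb://localhost:27017")
val coll = mongo.getDatabase("test").getCollection("testc")

// BSON dates map to java.util.Date in the Java driver.
val cutoff = Date.from(java.time.Instant.parse("2012-12-16T00:00:00Z"))

// Matches documents where any element of lessons has release_time >= cutoff.
val cursor = coll.find(Filters.gte("lessons.release_time", cutoff)).iterator()
while (cursor.hasNext) println(cursor.next().toJson)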