Elasticsearch uses wrong case folding for Unicode characters

In one of my projects, I am trying to use Elasticsearch (1.7) to query data. But it returns different results for Unicode characters depending on whether they are uppercase or not. I tried to use the icu_analyzer to get rid of the problem.
Here is a small example to demonstrate my problem. My index is like this,
$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "analysis": {
        "filter": {
          "nfkc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc"
          }
        },
        "analyzer": {
          "my_lowercaser": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfkc_normalizer"
            ]
          }
        }
      }
    }
  }
}'
Here is some test data to demonstrate my problem.
$ curl -X POST http://10.22.20.140:9200/tr-test/_bulk -d '
{"index": {"_type":"names", "_index":"tr-test"}}
{"name":"BAHADIR"}'
Here is a simple query. If I query using BAHADIR as the query_string, I can easily find my test data.
$ curl -X POST http://10.22.20.140:9200/tr-test/_search -d '
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "BAHADIR"
        }
      }
    }
  }
}'
In Turkish, the lowercased version of BAHADIR is bahadır. I am expecting the same result while querying with bahadır, but Elasticsearch cannot find my data, and I cannot fix that by using ICU for analysis. It works perfectly fine if I query with bahadir.
I have already read Living in a Unicode World and Unicode Case Folding, but I cannot fix my problem. I still cannot make Elasticsearch use correct case folding.
Update
I also tried to create my index like this.
$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  }
}'
But I am getting the same results. My data can be found if I search using BAHADIR or bahadir, but it cannot be found by searching bahadır, which is the correct lowercased version of BAHADIR.

You should try to use the Turkish Language Analyzer in your settings.
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish"
        }
      }
    }
  }
}
As you can see in the implementation details, it also defines a turkish_lowercase filter, so I guess it'll take care of your problem for you. If you don't want all the other features of the Turkish analyzer, define a custom one with only turkish_lowercase, as sketched below.
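For instance, a minimal sketch of such a custom analyzer might look like this (the analyzer name my_turkish_lowercaser is made up here; the lowercase token filter with "language": "turkish" is the same filter the built-in Turkish analyzer uses):
{
  "settings": {
    "analysis": {
      "filter": {
        "turkish_lowercase": {
          "type": "lowercase",
          "language": "turkish"
        }
      },
      "analyzer": {
        "my_turkish_lowercaser": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["turkish_lowercase"]
        }
      }
    }
  }
}
You would then set "analyzer": "my_turkish_lowercaser" on the name field, just like "turkish" above.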
If you need full text search on your name field, you should also change the query to a match query, which is the basic full text search method on a single field.
{
  "query": {
    "match": {
      "name": "bahadır"
    }
  }
}
On the other hand, the query_string query is more complex: it searches on multiple fields and allows an advanced syntax. It also has an option to pass the analyzer you want to use, so if you really need this kind of query, you could try passing "analyzer": "turkish" within the query. I'm not an expert on the query_string query, though.
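For instance, an untested sketch of that option (the analyzer parameter is documented for query_string):
{
  "query": {
    "query_string": {
      "query": "bahadır",
      "analyzer": "turkish"
    }
  }
}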

Related

Which analyzer to use on specific strings?

I have a document in my collection with a property name like this:
name: [{Value: "steel 0.8x1000x2000mm"}]
Now I'm trying to create a search index for it; so far it looks like this:
...
"name": {
  "fields": {
    "Value": [
      {
        "analyzer": "lucene.finnish",
        "searchAnalyzer": "lucene.finnish",
        "type": "string"
      },
      {
        "dynamic": true,
        "type": "document"
      }
    ]
  },
  "type": "document"
},
...
It works pretty well, except for documents like this one. The issue is that the query 0.8x1000x2000 doesn't match anything, though 0.8x1000x2000mm works fine.
I guess I'm using the wrong analyzer, but I can't really figure out which one I should use. Or should I make a custom one?

Searching numbers as keywords or strings with Mongo Atlas Search (as possible in Elastic Search)

Sometimes it's useful to allow numbers to be treated as keywords or strings when using a search index. For example, suppose I have transaction data something like this:
[
  { "amount": 715, "description": "paypal payment" },
  { "amount": 7500, "description": "second visa payment" },
  { "amount": 7500, "description": "third visa payment" }
]
I might want to allow a search box entry such as "7500 second" to produce the last two rows, with the "second visa payment" row scoring highest.
How can I achieve this with Mongo DB Atlas, using its search index facility?
In Elastic Search, it's possible by adding a keyword field on the numeric field, as per this example:
INDEX=localhost:9200/test
curl -X DELETE "$INDEX?pretty"
curl -X PUT "$INDEX?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "amount": {
        "type": "long",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}'
curl -X POST "$INDEX/_bulk?pretty" -H 'Content-Type: application/x-ndjson' -d '
{ "index": {"_id":"61d244595c590a67157d5f82"}}
{ "amount": 512, "description": "paypal payment" }
{ "index": {"_id":"61d244785c590a67157d62b3"}}
{ "amount": 7500, "description": "second visa payment" }
{ "index": {"_id":"61d244785c590a67157d62b4"}}
{ "amount": 7500, "description": "third visa payment" }
'
sleep 1
curl -s -X GET "$INDEX/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "75* second"
    }
  }
}
' # | jq '.hits.hits[] | {_source,_score}'
Here the search on "75* second" gives the desired result:
{
  "_source": {
    "amount": 7500,
    "description": "second visa payment"
  },
  "_score": 1.9331132
}
{
  "_source": {
    "amount": 7500,
    "description": "third visa payment"
  },
  "_score": 1
}
With equivalent data in Mongo Atlas (v5.0), I've tried setting up an index with a lucene.keyword analyzer on the amount field as a string, but it has no effect on the results (which only pay attention to the description field). Similarly, adding a string field type on the amount field doesn't produce any rows: it seems Mongo Atlas Search insists on using number-type queries on numeric fields.
I'm aware that I can use a more complex compound query, combining numeric and string fields, to get the result (example below), but this isn't necessarily convenient for a user, who just wants to chuck terms in a box without worrying about field names. I may wish to search over ALL number fields in a row, rather than just one, and include results where only some of the terms match, potentially fuzzily. (A possible use case here is searching over transaction data, with a question like "when was my last payment for about 200 dollars to Steven?" in mind).
One possibility might be to create an "all text" field in the Mongo DB, allowing the numbers to be stored as strings, similar to what happens (or used to happen) in Elastic Search. This might require a materialized view on the data, or else an additional, duplicative field, which would then be indexed. Is there an easier solution, or one that involves less data duplication? (The table in question is large, so storage costs matter.)
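For illustration, one hedged sketch of that duplicative-field idea (the field name amountText and collection name transactions_searchable are hypothetical): an aggregation pipeline using the standard $toString and $merge stages could maintain a string copy of amount, which the search index would then map as a string.
[
  { "$addFields": { "amountText": { "$toString": "$amount" } } },
  { "$merge": { "into": "transactions_searchable" } }
]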
The data in mongo look something like this. amount could be a float or an integer (or likely both, in different fields).
{"_id":{"$oid":"61d244595c590a67157d5f82"},"amount":{"$numberInt":"512"},"description":"paypal payment"}
{"_id":{"$oid":"61d244785c590a67157d62b3"},"amount":{"$numberInt":"7500"},"description":"second visa payment"}
{"_id":{"$oid":"61d244785c590a67157d62b4"},"amount":{"$numberInt":"7500"},"description":"third visa payment"}
An example of a search index definition I've tried (among many!) is:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "amount": {
        "multi": {
          "test": {
            "analyzer": "lucene.keyword",
            "ignoreAbove": null,
            "searchAnalyzer": "lucene.keyword",
            "type": "string"
          }
        },
        "type": "string"
      },
      "description": {
        "type": "string"
      }
    }
  },
  "storedSource": true
}
...and a sample search pipeline is:
[
  {
    "$search": {
      "index": "test",
      "text": {
        "path": {
          "wildcard": "*"
        },
        "query": "7500 second"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "description": 1,
      "amount": 1,
      "score": {
        "$meta": "searchScore"
      }
    }
  }
]
This gives only the second row (i.e. the "7500" in the query is effectively ignored, and only the description field matches):
[
  {
    "_id": "61d244785c590a67157d62b3",
    "amount": 7500,
    "description": "second visa payment",
    "score": 0.42414236068725586
  }
]
The following compound query does work, but it's overly complex to produce, especially with many numeric and string fields:
{
  "index": "test",
  "compound": {
    "should": [
      {
        "text": {
          "query": "second",
          "path": "description"
        }
      },
      {
        "near": {
          "path": "amount",
          "origin": 7500,
          "pivot": 1
        }
      }
    ]
  }
}
Documentation on field types and mappings is at https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings/, operators and collectors at https://www.mongodb.com/docs/atlas/atlas-search/operators-and-collectors/ .
See https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html for Elastic's guidance on why and when it can be useful to index numeric fields as keywords.
In Atlas Search, the data type defined in your index definition determines which operators you can use to query the values. In this case, I used the default index definition and the following query options to target the values you are looking for above.
For only the numeric value:
{
  compound: {
    should: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      }
    ]
  }
}
If I want to query for both the text and the number, it's also simply a compound query, though an edgeGram autocomplete field type would be desirable in an optimal setup. It's really important to simplify before optimizing:
{
  compound: {
    must: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      },
      {
        wildcard: {
          query: "*",
          path: 'description',
          allowAnalyzedField: true
        }
      }
    ]
  }
}
I hope this is helpful. keyword is only a good analyzer for the description field if you want to do the wildcard; standard, or the analyzer for whatever language description is written in, would both be better.
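As a footnote on the edgeGram idea mentioned above, a hedged sketch of what an autocomplete mapping on description might look like (the gram sizes here are illustrative, not recommendations):
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": [
        { "type": "string" },
        {
          "type": "autocomplete",
          "tokenization": "edgeGram",
          "minGrams": 2,
          "maxGrams": 15
        }
      ]
    }
  }
}
Queries would then use the autocomplete operator against the description path.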

Elastic Search Fuzzy Search root and nested fields

I am new to Elastic Search and facing a couple of issues when querying. I have a simple MongoDB database with collections of cities and places of interest. Each collection has a cityName and other details like website etc., as well as a places object array. This is my mapping:
{
  "mappings": {
    "properties": {
      "cityName": {
        "type": "text"
      },
      "phone": {
        "type": "keyword"
      },
      "email": {
        "type": "keyword"
      },
      "website": {
        "type": "keyword"
      },
      "notes": {
        "type": "keyword"
      },
      "status": {
        "type": "keyword"
      },
      "places": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "text"
          },
          "status": {
            "type": "keyword"
          },
          "category": {
            "type": "keyword"
          },
          "reviews": {
            "properties": {
              "rating": {
                "type": "long"
              },
              "comment": {
                "type": "keyword"
              },
              "user": {
                "type": "nested"
              }
            }
          }
        }
      }
    }
  }
}
I need a fuzzy query where the user can search both cityName and places.name. However, I only get results when I search a single word; adding multiple words returns 0 hits. I am sure I am missing something here, because I started learning Elastic Search 2 days ago. The following query returns results because I have a document with cityName: Islamabad and a places array whose objects have the keyword Islamabad in their name; in some places objects the keyword Islamabad is at the beginning of place.name, and in others it is in the middle or at the end.
This is what I am using; it returns results only with one word:
{
  "query": {
    "bool": {
      "should": [
        {
          "fuzzy": {
            "cityName": "Islamabad"
          }
        },
        {
          "nested": {
            "path": "places",
            "query": {
              "fuzzy": {
                "places.name": "Islamabad"
              }
            }
          }
        }
      ]
    }
  }
}
Adding another word, say club, to the above query returns 0 hits, even though I actually do have places named Islamabad Club and Islamabad Golf Club.
Problem
The search query is sent from an app and so it is dynamic; the term to search is the same for both cityName and places.name, and places.name doesn't always have the cityName in it.
What do I need exactly?
I need a query where I can search cityName and the array of places (only searching places.name). The query should be of fuzzy type, so that it still returns results if the word Islamabad is spelled like Islambad, or even returns results for Islam or Abad. The query should also return results for multiple words; I am sure I am doing something wrong there. Any help would be appreciated.
**P.S.:** I am actually using MongoDB as my database but am migrating to Elastic Search ONLY to improve our search feature. I tried different approaches with MongoDB and used the mongoose-fuzzy-searching npm module, but that didn't work, so if there's a simpler solution for MongoDB please share that too.
Thanks.
EDIT 1:
I had to change the structure (mapping) of my data. Now I have 2 separate indices: one for cities, with city details and a cityId, and another index for all places, where each place has a cityId which can be used for joining later if needed. Each place also has a cityName key, so I will only be searching the places index because it has all the details (place name and city name).
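For reference, a sketch of what that new places index mapping might look like (this is an assumption based on the description above, not the actual mapping):
{
  "mappings": {
    "properties": {
      "cityId": { "type": "keyword" },
      "cityName": { "type": "text" },
      "name": { "type": "text" }
    }
  }
}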
I have a city with the word Welder's in its name, and some places inside the same city also have the word Welder's in their name; both fields have type text. However, when I search for welder, both of the following queries fail to return these documents, while a search for welders OR welder's does return them. I am not sure why welder won't match Welder's. I didn't specify any analyzer during the creation of either index, and I am not explicitly defining one in the query. Can anyone help me out with this query so it behaves as expected?
Query 1:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": {
              "query": "welder",
              "fuzziness": 20
            }
          }
        },
        {
          "match": {
            "cityName": {
              "query": "welder",
              "fuzziness": 20
            }
          }
        }
      ]
    }
  }
}
Query 2:
{
  "query": {
    "match": {
      "name": {
        "query": "welder",
        "fuzziness": 20
      }
    }
  }
}
The fuzzy query is meant to be used to find approximations of your complete query term within a certain edit distance:
"To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion."
If you want to allow fuzzy matching of individual terms in your query, you need to use a match query with fuzziness activated.
POST <your_index>/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "cityName": {
              "query": "Islamabad golf",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "nested": {
            "path": "places",
            "query": {
              "match": {
                "places.name": {
                  "query": "Islamabad golf",
                  "fuzziness": "AUTO"
                }
              }
            }
          }
        }
      ]
    }
  }
}
Reminder: fuzziness in Elasticsearch allows at most 2 edits per term, so you will never be able to match Islam with Islamabad, since there are 4 changes between those terms.
For more information on distance and fuzziness parameters, please refer to the fuzziness documentation page.

Representing a Kibana query in a REST, curl form

I have a Kibana server in a classic ELK configuration, querying an Elasticsearch instance.
I use the Kibana console to execute sophisticated queries on Elasticsearch. I would like to use some of these queries on the command line, using cURL or any other HTTP tool.
How can I convert a Kibana search into a direct, cURL-like REST call to elasticsearch?
At the bottom of your visualization, there is a small caret you can click in order to view more details about the underlying query:
Then you can click on the "Request" button in order to view the underlying query, which you can copy/paste and do whatever suits you with it.
UPDATE
Then you can copy/paste the query from the "Request" textarea and simply use it in a curl call like this:
curl -XPOST localhost:9200/your_index/your_type/_search -d '{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "blablabla AND blablabla"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1439762400000,
                  "lte": 1439848799999
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    }
  },
  "size": 420,
  "sort": {
    "@timestamp": "desc"
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "30m",
        "pre_zone": "+02:00",
        "pre_zone_adjust_large_interval": true,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1439762400000,
          "max": 1439848799999
        }
      }
    }
  },
  "fields": [
    "*",
    "_source"
  ],
  "script_fields": {},
  "fielddata_fields": [
    "@timestamp"
  ]
}'
You may need to tweak a few things (like the pre/post highlight tags, etc.).
If you are using a Chrome browser, you can go to your Kibana dashboard, open the developer console, and run your query while having the Network tab open. When you search for your query in the Kibana dashboard, you will see the request appear in the Network tab. There you can right-click and select Copy as cURL, which will copy the curl command to your clipboard. Note that the credentials of your basic auth may be copied as well, so be careful where you paste it.
Another option would be to query Elasticsearch using Lucene queries (the same syntax Kibana uses) via the ES search API's query_string queries:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Taken from one of the doc examples, you would query ES using something like this:
GET /_search
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "this AND that OR thus"
    }
  }
}
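Translated into a direct cURL call (assuming a local instance on the default port), that would be something like:
curl -s -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "this AND that OR thus"
    }
  }
}'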

Elasticsearch - Incoming birthdays

I'm new to Elasticsearch and I'm stuck with a query.
I want to get the upcoming (now+3d) birthdays among my users. It looks simple, but it's not, because I only have the birthdate of my users.
How can I compare only month and day directly in the query when I only have a birthdate (e.g. 1984-04-15, or sometimes 2015-04-15)?
My field mapping:
"birthdate": {
"format": "dateOptionalTime",
"type": "date"
}
My current query, which doesn't work at all:
{
  "query": {
    "range": {
      "birthdate": {
        "format": "dd-MM",
        "gte": "now",
        "lte": "now+3d"
      }
    }
  }
}
I saw this post: Elasticsearch filtering by part of date. But I'm not a big fan of the solution, and I would prefer a "now+3d" instead of a wildcard.
Maybe I can do something with a script?
"Format" field was added in 1.5.0 version of elasticsearch. If your
version is below 1.5.0 format will not work. We had a same problem where we had to send an email on user's birthday and we were using version 1.4.4. So we created a separate dob field where we stored date in "dd-MM" format.
We added the following mapping for the dob field:
PUT /user
{
  "mappings": {
    "user": {
      "properties": {
        "dob": {
          "type": "date",
          "format": "dd-MM"
        }
      }
    }
  }
}
Then you can search:
GET /user/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "dob": {
            "from": "01-01",
            "to": "01-01",
            "include_upper": true
          }
        }
      }
    }
  }
}
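A hedged sketch of how the three-day window could then be queried, reusing the 1.x filtered syntax from above (the from/to values are hypothetical and would be computed by the client for "today" and "today + 3 days", e.g. April 14 through April 17; date math like now+3d does not apply directly to a dd-MM field):
GET /user/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "dob": {
            "from": "14-04",
            "to": "17-04",
            "include_upper": true
          }
        }
      }
    }
  }
}
Note that a window crossing the end of the year (e.g. 30-12 through 02-01) would need to be split into two ranges combined with an or filter.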