Highlight characters within words in Opensearch query - opensearch

I have set up a custom analyser that uses an edge_ngram filter for a text field. I'm then trying to highlight only the characters a user types, but OpenSearch highlights the entire word, even if only a few characters have been typed.
E.g. typing "Man" in the search bar results in the whole word "Manly" being highlighted: <em>Manly</em> Trail Running Tour. What I really want is <em>Man</em>ly Trail Running Tour.
According to the docs, this should be possible with the fvh highlighter type and chars as the boundary_scanner argument: https://opensearch.org/docs/2.1/opensearch/search/highlight/#highlighting-options
Settings
"title_autocomplete": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "autocomplete"
}
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete": {
"filter": [
"lowercase",
"edge_ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
Query:
{
"track_total_hits": true,
"highlight": {
"type": "fvh",
"boundary_scanner": "chars",
"fields": {
"title_autocomplete": {}
}
},
"size": 6,
"query": {
"multi_match": {
"query": "Man",
"fields": [
"title_autocomplete^2"
]
}
}
}
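A possible explanation, offered as an assumption rather than something the question confirms: Lucene's edge n-gram token filter does not update offsets, so every gram produced from "Manly" still reports the start and end of the whole word, and the highlighter can only wrap that span. An edge n-gram tokenizer, by contrast, reports offsets per gram. The sketch below is a simplified Python model of the two behaviours, not OpenSearch code; the function names are hypothetical and the whitespace split is only a rough stand-in for the standard tokenizer.

```python
from typing import List, Tuple

Token = Tuple[str, int, int]  # (term, start_offset, end_offset)


def standard_words(text: str) -> List[Token]:
    """Rough stand-in for the standard tokenizer: split on whitespace, lowercase."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word.lower(), start, end))
        pos = end
    return tokens


def edge_ngram_filter(tokens: List[Token], min_gram: int = 3, max_gram: int = 20) -> List[Token]:
    """Model of a token filter: each gram keeps the parent word's offsets."""
    out = []
    for term, start, end in tokens:
        for n in range(min_gram, min(max_gram, len(term)) + 1):
            out.append((term[:n], start, end))
    return out


def edge_ngram_tokenizer(text: str, min_gram: int = 3, max_gram: int = 20) -> List[Token]:
    """Model of a tokenizer: each gram gets its own end offset."""
    out = []
    for term, start, _ in standard_words(text):
        for n in range(min_gram, min(max_gram, len(term)) + 1):
            out.append((term[:n], start, start + n))
    return out


def highlight(text: str, tokens: List[Token], query: str) -> str:
    """Wrap the span of the first token equal to the query in <em> tags."""
    for term, start, end in tokens:
        if term == query.lower():
            return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]
    return text


title = "Manly Trail Running Tour"
as_filter = highlight(title, edge_ngram_filter(standard_words(title)), "Man")
as_tokenizer = highlight(title, edge_ngram_tokenizer(title), "Man")
print(as_filter)     # <em>Manly</em> Trail Running Tour
print(as_tokenizer)  # <em>Man</em>ly Trail Running Tour
```

If this model matches what is happening, moving the n-gram step from the filter chain into an edge_ngram tokenizer would be the thing to try; whether that suits the rest of the analysis chain is for the index owner to judge.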

Related

MongoDB Atlas Search not showing results when typing few characters

I want to build an autocomplete search bar with the MEAN stack, like the one on this site, but when I type, for example, 'ag', it does not return the location I expect, which is 'Aguascalientes'.
I have two different search indexes set up and a different query for each.
First Index:
{
"mappings": {
"dynamic": false,
"fields": {
"name": {
"foldDiacritics": false,
"maxGrams": 7,
"minGrams": 3,
"tokenization": "edgeGram",
"type": "autocomplete"
},
"searchName": {
"foldDiacritics": false,
"maxGrams": 7,
"minGrams": 3,
"tokenization": "edgeGram",
"type": "autocomplete"
}
}
}
}
First Query:
[
{
$search: {
index: "autocomplete2",
compound: {
must: [
{
text: {
query: search,
path: "searchName",
fuzzy: {
maxEdits: 2,
},
},
},
],
},
},
},
{
$limit: 10,
},
]
The first index and query do not return any documents at all. The second setup does return some; here is its index:
{
"mappings": {
"dynamic": false,
"fields": {
"name": {
"analyzer": "lucene.standard",
"type": "string"
},
"searchName": {
"analyzer": "lucene.standard",
"type": "string"
}
}
}
}
Query:
[
{
$search: {
index: 'default',
compound: {
must: [
{
text: {
query: search,
path: 'name',
fuzzy: {
maxEdits: 1,
},
},
},
{
text: {
query: search,
path: 'searchName',
fuzzy: {
maxEdits: 1,
},
},
},
],
},
},
},
{
$limit: 5,
},
]
The second example only returns documents if the search term is nearly complete, e.g. 'aguascalient'; it returns nothing for shorter terms, unlike the site. Maybe it has something to do with the fuzzy edits, but if I set maxEdits to a value greater than 2 I get an error.
Also, the order is not right: it returns the CITY first and the STATE second, but I need the STATE first because it is a closer match to the search term. To explain: the search field for the STATE is just 'Aguascalientes', while the search field for the city is 'Aguascalientes Aguascalientes', so I don't understand why the STATE does not rank higher. Maybe I should assign weights, but I'm not sure that is the right approach.
My data structure:
{
"_id": "638d0ffc34ad076c6bd12cb6",
"depth": 2,
"label": "CITY",
"location_id": "V1-C-247",
"name": "Aguascalientes",
"parent": "Aguascalientes",
"fullName": "Aguascalientes, Aguascalientes",
"parentId": "V1-B-61",
"searchName": "Aguascalientes Aguascalientes",
}
{
"_id": "638d0ffc34ad076c6bd12cb6",
"depth": 1,
"label": "STATE",
"location_id": "V1-C-248",
"name": "Aguascalientes",
"parent": null,
"fullName": "Aguascalientes",
"parentId": null,
"searchName": "Aguascalientes",
}
For the first index + query setup:
Note that you are indexing the name field but are not searching on it. I will remove it from the code snippets for readability, but you can add it back to your index definition if you find you need to search on it.
There are two problems with this index + query setup if you want a query for "ag" to return results. First, you have searchName defined as a field mapping of type autocomplete, but you also need to use the autocomplete operator in your query:
[
{
$search: {
index: "autocomplete2",
compound: {
must: [
{
autocomplete: {
query: search,
path: "searchName",
},
},
],
},
},
},
{
$limit: 10,
},
]
Second, in your index definition field mapping for searchName, you have minGrams set to 3 and maxGrams set to 7. Based on the documentation for the autocomplete field mapping, this means that your data will be tokenized into sequences of between 3 and 7 characters, using the selected tokenization strategy. Since you have selected edgeGram, the text "Aguascalientes" is tokenized starting from the left edge, resulting in the tokens "agu", "agua", "aguas", "aguasc", "aguasca". Since the search term "ag" does not match any of these tokens, nothing is returned. So you must change minGrams to 2 to get the token "ag":
{
"mappings": {
"dynamic": false,
"fields": {
"searchName": {
"foldDiacritics": false,
"maxGrams": 7,
"minGrams": 2,
"tokenization": "edgeGram",
"type": "autocomplete"
}
}
}
}
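The effect of minGrams can be checked with a small model of left-anchored tokenization. This is an illustrative Python sketch, not Atlas Search code; edge_grams is a hypothetical helper and it handles only a single word.

```python
def edge_grams(text, min_grams, max_grams):
    """Return the lowercased prefixes of `text` whose length is between the two bounds."""
    term = text.lower()
    longest = min(max_grams, len(term))
    return [term[:n] for n in range(min_grams, longest + 1)]


with_min_3 = edge_grams("Aguascalientes", 3, 7)
with_min_2 = edge_grams("Aguascalientes", 2, 7)

print(with_min_3)          # ['agu', 'agua', 'aguas', 'aguasc', 'aguasca']
print("ag" in with_min_3)  # False
print("ag" in with_min_2)  # True
```

With a minimum of 3 there is no token equal to "ag", so a two-character query has nothing to match; lowering the minimum to 2 adds that token.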
Finally, if you want a document with an exact match to rank above a partial match, i.e. "Aguascalientes" should come before "Aguascalientes Aguascalientes", you need to implement exact matching. Here is a MongoDB blog post outlining a few options.
One option that I tried: in the index, additionally map the "searchName" field as a string type with the keyword analyzer. In the query, add a text operator in a should clause so that exact matches score higher than other results.
Index:
{
"mappings": {
"dynamic": false,
"fields": {
"searchName": [
{
"foldDiacritics": false,
"maxGrams": 7,
"type": "autocomplete"
},
{
"analyzer": "lucene.keyword",
"searchAnalyzer": "lucene.keyword",
"type": "string"
}
]
}
}
}
Query:
[
{
$search: {
compound: {
must: [
{
autocomplete: {
query: search,
path: "searchName"
}
}
],
should:[
{
text: {
query: search,
path: "searchName"
}
}
],
},
},
},
]
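For readers who call the same pipeline from Python, the stages above can be built as plain dictionaries. This is a sketch under assumptions: build_search_pipeline is a hypothetical helper, the optional $limit stage is borrowed from the earlier queries, and passing the result to a driver's aggregate() call (for example PyMongo's collection.aggregate) is left to the caller.

```python
def build_search_pipeline(search, limit=None):
    """Build the $search stage: autocomplete must match, an exact text match boosts the score."""
    pipeline = [
        {
            "$search": {
                "compound": {
                    # Every result has to match the autocomplete tokens.
                    "must": [
                        {"autocomplete": {"query": search, "path": "searchName"}}
                    ],
                    # Documents that also match the keyword-analyzed field score higher.
                    "should": [
                        {"text": {"query": search, "path": "searchName"}}
                    ],
                }
            }
        }
    ]
    if limit is not None:
        pipeline.append({"$limit": limit})
    return pipeline


pipeline = build_search_pipeline("Aguascalientes", limit=10)
print(pipeline[0]["$search"]["compound"]["should"])
```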

JSON Schema - can array / list validation be combined with anyOf?

I have a JSON document of this form that I'm trying to validate:
...
"products": [{
"prop1": "foo",
"prop2": "bar"
}, {
"prop3": "hello",
"prop4": "world"
},
...
There are multiple different forms an object may take. My schema looks like this:
...
"definitions": {
"products": {
"type": "array",
"items": { "$ref": "#/definitions/Product" }
},
"Product": {
"type": "object",
"oneOf": [
{ "$ref": "#/definitions/Product_Type1" },
{ "$ref": "#/definitions/Product_Type2" },
...
]
},
"Product_Type1": {
"type": "object",
"properties": {
"prop1": { "type": "string" },
"prop2": { "type": "string" }
}
},
"Product_Type2": {
"type": "object",
"properties": {
"prop3": { "type": "string" },
"prop4": { "type": "string" }
}
}
...
On top of this, certain properties of the individual product objects may be defined indirectly through further use of anyOf or oneOf.
I'm running into issues with the built-in schema validation in VS Code: it reports errors for every item in the products array that doesn't match Product_Type1.
So it seems the validator latches onto the first oneOf entry it finds and won't validate against any of the other types.
I didn't find any limitations on the oneOf mechanism documented on json-schema.org, and it isn't mentioned on the page that deals specifically with arrays: https://json-schema.org/understanding-json-schema/reference/array.html
Is what I'm attempting possible?
Your general approach is fine. Let's take a slightly simpler example to illustrate what's going wrong.
Given this schema
{
"oneOf": [
{ "properties": { "foo": { "type": "integer" } } },
{ "properties": { "bar": { "type": "integer" } } }
]
}
And this instance
{ "foo": 42 }
At first glance, this looks like it matches /oneOf/0 and not /oneOf/1. It actually matches both schemas, which violates the one-and-only-one constraint imposed by oneOf, so the oneOf fails.
Remember that every keyword in JSON Schema is a constraint. Anything that is not explicitly excluded by the schema is allowed. There is nothing in the /oneOf/1 schema that says a "foo" property is not allowed. Nor does it say that "foo" is required. It only says that if the instance has a property "foo", then it must be an integer.
To fix this, you will need required and maybe additionalProperties, depending on the situation. I show here how you would use additionalProperties, but I recommend you don't use it unless you need to, because it does have some problematic properties.
{
"oneOf": [
{
"properties": { "foo": { "type": "integer" } },
"required": ["foo"],
"additionalProperties": false
},
{
"properties": { "bar": { "type": "integer" } },
"required": ["bar"],
"additionalProperties": false
}
]
}
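To see the difference concretely, here is a minimal Python sketch that counts how many oneOf branches accept { "foo": 42 } before and after the fix. It is a hand-written checker for only the keywords used in this example (properties with an integer type, required, and additionalProperties set to false), not a real JSON Schema validator; matches and one_of_count are hypothetical names.

```python
def matches(schema, instance):
    """Check an object against the few keywords this example uses."""
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in instance:
            return False
    for name, value in instance.items():
        if name in props:
            wants_int = props[name].get("type") == "integer"
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if wants_int and not is_int:
                return False
        elif schema.get("additionalProperties") is False:
            return False
    return True


def one_of_count(branches, instance):
    """Number of branches the instance satisfies; oneOf passes only when this is 1."""
    return sum(1 for branch in branches if matches(branch, instance))


loose = [
    {"properties": {"foo": {"type": "integer"}}},
    {"properties": {"bar": {"type": "integer"}}},
]
strict = [
    {"properties": {"foo": {"type": "integer"}},
     "required": ["foo"], "additionalProperties": False},
    {"properties": {"bar": {"type": "integer"}},
     "required": ["bar"], "additionalProperties": False},
]

instance = {"foo": 42}
loose_count = one_of_count(loose, instance)
strict_count = one_of_count(strict, instance)
print(loose_count)   # 2, so oneOf fails
print(strict_count)  # 1, so oneOf passes
```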

Set language type in pattern - vs code language extension

I want VS Code to treat the code between <go> tags in an HTML file as golang code, including validation.
So given:
<go>
// I want to get intellisense and syntax highlighting for golang here
</go>
I currently have the following inside grammars in package.json:
{
"scopeName": "go.html.injection",
"path": "./syntaxes/go.tmLanguage.json",
"injectTo": [
"text.html"
],
"embeddedLanguages": {
"source.go": "go"
}
}
and in syntaxes/go.tmLanguage.json:
{
"scopeName": "go.html.injection",
"injectionSelector": "L:text.html",
"patterns": [
{
"include": "#go-tag"
}
],
"repository": {
"go-tag": {
"begin": "<go>",
"end": "<\/go>",
"name": "go"
}
}
}
Inspecting it with the debug inspector shows the name go as a TextMate scope, but the language is still set to html. How can I set the language of the match to golang?
Inspecting the content of script tags shows the language set to javascript, so this should be possible. I also realise that the match currently includes the <go> tag itself, so I understand I now need to add pattern matching and evaluation for that.
Update 20/02/20:
After referring to the vscode svelte extension I figured out how to get syntax highlighting for the tag and innerHTML using this inside syntaxes/go.tmLanguage.json (same package.json):
{
"scopeName": "go.html.injection",
"injectionSelector": "L:text.html",
"patterns": [
{
"include": "#go-tag"
}
],
"repository": {
"go-tag": {
"begin": "(<)(go)",
"beginCaptures": {
"1": {
"name": "punctuation.definition.tag.begin.html"
},
"2": {
"name": "entity.name.tag.html"
},
"3": {
"name": "punctuation.definition.tag.end.html"
}
},
"end": "(<\/)(go)(>)",
"endCaptures": {
"1": {
"name": "punctuation.definition.tag.begin.html"
},
"2": {
"name": "entity.name.tag.html"
},
"3": {
"name": "punctuation.definition.tag.end.html"
}
},
"patterns": [
{
"contentName": "source.go",
"begin": "(>)",
"beginCaptures": {
"1": {
"name": "punctuation.definition.tag.end.html"
}
},
"end": "(?=</go>)",
"patterns": [
{
"include": "source.go"
}
]
}
]
}
}
}
I can now see that VS Code correctly highlights the syntax for the tag and uses the imported golang syntax tokens. However, it still displays the language as "html".

Elasticsearch ngram and PostgreSQL trigram search results do not match

I've created an index in Elasticsearch as below:
"settings" : {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"issue": {
"properties": {
"description": {
"type": "string",
"analyzer": "trigrams"
}
}
}
}
My test items are below:
"alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis"
"otomatik onay işlemi gecikmiş"
"************* nolu iade islemi urun kargoya verilmedi zamaninda iade islemlerinde urun erorr hata veriyor"
I tested this index with the query below:
GET issue/_search
{
"query": {
"match": {
"description":{
"query": "otomatik onay istemi zamaninda gerceklesmemis"
}
}
}
}
And result:
{
....
"hits": {
....
"max_score": 2.3507352,
"hits": [
{
....
"_score": 2.3507352,
"_source": {
"issue_id": "*******",
"description": "alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis"
}
}
]
}
}
But the same data in PostgreSQL, queried with the SQL below, gives a different result:
SELECT
public.tbl_issue_descriptions_big.description,
similarity(description, 'otomatik onay islemi zamaninda gerceklesmemis') AS sml
FROM
public.tbl_issue_descriptions_big
WHERE
description %'otomatik onay islemi zamaninda gerceklesmemis'
ORDER BY
sml DESC
LIMIT 10
Result is:
description | sml
======================================================|======
otomatik onay islemi gecikmis |0,351852
What causes this difference?
I don't know enough about Postgres to give a qualified answer there (this also depends on the documents that are indexed and on whether the scoring formulas are exactly the same, which I doubt), but Elasticsearch has an explain API and an explain parameter in the search request that help you find out why a certain document was scored the way it was.
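To make the doubt about the scoring formulas concrete, here is a Python sketch of how pg_trgm's similarity() is understood to work: each word is lowercased, padded with two spaces in front and one behind, cut into trigrams, and the score is the number of shared trigrams divided by the number of distinct trigrams in either string. This is an approximation written for illustration (word splitting and non-ASCII handling in PostgreSQL depend on the database locale), and it is not expected to reproduce the exact figures above. An Elasticsearch match query, by contrast, sums relevance scores of the matching n-gram terms and is not normalized this way.

```python
import re


def trigrams(text):
    """Set of trigrams per word, padded with two leading spaces and one trailing space."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


query = "otomatik onay islemi zamaninda gerceklesmemis"
short_doc = "otomatik onay islemi gecikmis"
long_doc = ("alici onay verdi basarili satisiniz gerceklesti diyor "
            "ama hesabima para transferi gerceklesmemis")

print(sorted(trigrams("cat")))
print(similarity(short_doc, query))
print(similarity(long_doc, query))
```

Because the denominator grows with every trigram that appears in only one of the two strings, a long document that shares a few words with the query is penalized, which a sum of per-term scores does not do in the same way.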

Using unicode characters in Elasticsearch synonyms

I am trying to set up an Elasticsearch index using synonyms and have almost succeeded. My index configuration:
{
"index": {
"analysis": {
"analyzer": {
"syns": {
"filter": [
"standard",
"lowercase",
"syns_filter"
],
"type": "custom",
"tokenizer": "standard"
}
},
"filter": {
"syns_filter": {
"type": "synonym",
"synonyms": ["Киев , Kyiv", "jee,java"],
}
}
}
}
}
The only thing I could not solve is that it works for jee (searching for it returns the same results as java), but it does not work for Kyiv.