Elastic Search Fuzzy Search root and nested fields - mongodb

I am new to Elastic Search and am facing a couple of issues when querying. I have a simple MongoDB database with collections of cities and places of interest. Each collection has a cityName and other details like website etc., and also a places object array. This is my mapping:
{
  "mappings": {
    "properties": {
      "cityName": { "type": "text" },
      "phone": { "type": "keyword" },
      "email": { "type": "keyword" },
      "website": { "type": "keyword" },
      "notes": { "type": "keyword" },
      "status": { "type": "keyword" },
      "places": {
        "type": "nested",
        "properties": {
          "name": { "type": "text" },
          "status": { "type": "keyword" },
          "category": { "type": "keyword" },
          "reviews": {
            "properties": {
              "rating": { "type": "long" },
              "comment": { "type": "keyword" },
              "user": { "type": "nested" }
            }
          }
        }
      }
    }
  }
}
I need a fuzzy query where the user can search both cityName and places.name. However, I get results when I search a single word; adding multiple words returns 0 hits. I am sure I am missing something here because I started learning Elastic Search two days ago. The following query returns results because I have a document with cityName: Islamabad and a places array whose objects have the keyword Islamabad in their name; in some places objects the keyword Islamabad is at the beginning of place.name, and in others it might be in the middle or at the end.
This is what I am using; it returns results when there is only one word:
{
  "query": {
    "bool": {
      "should": [
        {
          "fuzzy": {
            "cityName": "Islamabad"
          }
        },
        {
          "nested": {
            "path": "places",
            "query": {
              "fuzzy": {
                "places.name": "Islamabad"
              }
            }
          }
        }
      ]
    }
  }
}
Adding another word, say club, to the above query returns 0 hits, even though I do have places named Islamabad Club and Islamabad Golf Club.
Problem
The search query is sent from an app, so it is dynamic: the term to search is the same for both cityName and places.name, AND places.name doesn't always contain the cityName.
What do I need exactly?
I need a query where I can search cityName and the array of places (searching only places.name). The query should be fuzzy, so that it still returns results if the word Islamabad is spelled like Islambad, and ideally even returns results for Islam or Abad. The query should also return results for multiple words; I am sure I am doing something wrong there. Any help would be appreciated.
P.S.: I am actually using MongoDB as my database and am migrating to Elastic Search ONLY to improve our search feature. I tried different approaches with MongoDB, including the mongoose-fuzzy-searching npm module, but that didn't work, so if there's a simpler solution for MongoDB please share that too.
Thanks.
EDIT 1:
I had to change the structure (mapping) of my data. Now I have 2 separate indices: one for cities, holding the city details and a cityId, and another for all places, where each place carries a cityId that can be used for joins later if needed. Each place also has a cityName key, so I will only be searching the places index, because it has all the details (place name and city name).
I have a city whose name includes the word Welder's, and some places in the same location also have the word Welder's in their name; both fields have type: text. However, when I search for welder, neither of the following queries returns these documents, while a search for welders or welder's does. I am not sure why welder won't match Welder's. I didn't specify any analyzer when creating either index, and I am not explicitly setting one in the query. Can anyone help me make this query behave as expected?
Query 1 :
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": {
              "query": "welder",
              "fuzziness": 20
            }
          }
        },
        {
          "match": {
            "cityName": {
              "query": "welder",
              "fuzziness": 20
            }
          }
        }
      ]
    }
  }
}
Query 2 :
{
  "query": {
    "match": {
      "name": {
        "query": "welder",
        "fuzziness": 20
      }
    }
  }
}

The fuzzy query is meant to find approximations of your complete query term within a certain edit distance:
To find similar terms, the fuzzy query creates a set of all possible
variations, or expansions, of the search term within a specified edit
distance. The query then returns exact matches for each expansion.
If you want to allow fuzzy matching of the individual terms in your query, you need to use a match query with fuzziness enabled.
POST <your_index>/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "cityName": {
              "query": "Islamabad golf",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "nested": {
            "path": "places",
            "query": {
              "match": {
                "places.name": {
                  "query": "Islamabad golf",
                  "fuzziness": "AUTO"
                }
              }
            }
          }
        }
      ]
    }
  }
}
Reminder: fuzziness in Elasticsearch allows at most 2 corrections per term, so you will never be able to match Islam to Islamabad, since there are 4 edits between those terms.
For more information on the distance and fuzziness parameters, please refer to the fuzziness documentation page.
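To see why Islam can never fuzzily match Islamabad, you can compute the edit distance yourself. A minimal standalone Python sketch (plain Levenshtein; Elasticsearch actually uses Damerau-Levenshtein, which also counts transpositions, but the two agree for these examples):

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]

# "Islambad" is a single insertion away from "Islamabad", so fuzzy
# matching can catch that typo; "Islam" is 4 edits away, well past
# the maximum of 2 that Elasticsearch allows.
print(levenshtein("Islambad", "Islamabad"))  # 1
print(levenshtein("Islam", "Islamabad"))     # 4
```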

Related

Searching numbers as keywords or strings with Mongo Atlas Search (as possible in Elastic Search)

Sometimes it's useful to allow numbers to be treated as keywords or strings when using a search index. For example, suppose I have transaction data something like this:
[
{ "amount": 715, "description": "paypal payment" },
{ "amount": 7500, "description": "second visa payment" },
{ "amount": 7500, "description": "third visa payment" }
]
I might want to allow a search box entry such as "7500 second" to produce the last two rows, with the "second visa payment" row scoring highest.
How can I achieve this with MongoDB Atlas, using its search index facility?
In Elastic Search, it's possible by adding a keyword field on the numeric field, as per this example:
INDEX=localhost:9200/test
curl -X DELETE "$INDEX?pretty"
curl -X PUT "$INDEX?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "amount": {
        "type": "long",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}'
curl -X POST "$INDEX/_bulk?pretty" -H 'Content-Type: application/x-ndjson' -d '
{ "index": {"_id":"61d244595c590a67157d5f82"}}
{ "amount": 512,"description": "paypal payment" }
{ "index": {"_id":"61d244785c590a67157d62b3"}}
{ "amount": 7500, "description": "second visa payment" }
{ "index": {"_id":"61d244785c590a67157d62b4"}}
{ "amount": 7500, "description": "third visa payment" }
'
sleep 1
curl -s -X GET "$INDEX/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "75* second"
    }
  }
}
' # | jq '.hits.hits[] | {_source,_score}'
Here the search on "75* second" gives the desired result:
{
  "_source": {
    "amount": 7500,
    "description": "second visa payment"
  },
  "_score": 1.9331132
}
{
  "_source": {
    "amount": 7500,
    "description": "third visa payment"
  },
  "_score": 1
}
With equivalent data in Mongo Atlas (v5.0), I've tried setting up an index with a lucene.keyword on the "amount" field as a string, but it has no effect on the results (which only pay attention to the description field). Similarly, adding a string field type on the amount field doesn't produce any rows: it seems Mongo Atlas Search insists on using number-type queries on numeric fields.
I'm aware that I can use a more complex compound query, combining numeric and string fields, to get the result (example below), but this isn't necessarily convenient for a user, who just wants to chuck terms in a box without worrying about field names. I may wish to search over ALL number fields in a row, rather than just one, and include results where only some of the terms match, potentially fuzzily. (A possible use case here is searching over transaction data, with a question like "when was my last payment for about 200 dollars to Steven?" in mind).
One possibility might be to create an "all text" field in the Mongo DB, allowing the numbers to be stored as strings, similar to what happens (or used to happen) in Elastic Search. This might require a materialized view on the data, or else an additional, duplicative field, which would then be indexed. Is there an easier solution, or one that involves less data duplication? (The table in question is large, so storage costs matter.)
The data in mongo look something like this. amount could be a float or an integer (or likely both, in different fields).
{"_id":{"$oid":"61d244595c590a67157d5f82"},"amount":{"$numberInt":"512"},"description":"paypal payment"}
{"_id":{"$oid":"61d244785c590a67157d62b3"},"amount":{"$numberInt":"7500"},"description":"second visa payment"}
{"_id":{"$oid":"61d244785c590a67157d62b4"},"amount":{"$numberInt":"7500"},"description":"third visa payment"}
An example of a search index definition I've tried (among many!) is:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "amount": {
        "multi": {
          "test": {
            "analyzer": "lucene.keyword",
            "ignoreAbove": null,
            "searchAnalyzer": "lucene.keyword",
            "type": "string"
          }
        },
        "type": "string"
      },
      "description": {
        "type": "string"
      }
    }
  },
  "storedSource": true
}
...and a sample search pipeline is:
[
  {
    "$search": {
      "index": "test",
      "text": {
        "path": {
          "wildcard": "*"
        },
        "query": "7500 second"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "description": 1,
      "amount": 1,
      "score": {
        "$meta": "searchScore"
      }
    }
  }
]
This gives only the second row (i.e. the "7500" in the query is effectively ignored, and only the description field matches):
[
  {
    "_id": "61d244785c590a67157d62b3",
    "amount": 7500,
    "description": "second visa payment",
    "score": 0.42414236068725586
  }
]
The following compound query does work, but it's overly complex to produce, especially with many numeric and string fields:
{
  "index": "test",
  "compound": {
    "should": [
      {
        "text": {
          "query": "second",
          "path": "description"
        }
      },
      {
        "near": {
          "path": "amount",
          "origin": 7500,
          "pivot": 1
        }
      }
    ]
  }
}
Documentation on field types and mappings is at https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings/, operators and collectors at https://www.mongodb.com/docs/atlas/atlas-search/operators-and-collectors/ .
See https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html for Elastic's guidance on why and when it can be useful to index numeric fields as keywords.
In Atlas Search, the data type defined in your index definition determines which operators you can use to query the values. In this case, I used the default index definition and the following query options to target the values you are looking for above.
For only the numeric value:
{
  compound: {
    should: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      }
    ]
  }
}
If I want to query for both the text and the number, it's also simply a compound query, though an edgeGram autocomplete field type would be desirable in an optimal state. It's really important for me to simplify before optimizing:
{
  compound: {
    must: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      },
      {
        wildcard: {
          query: "*",
          path: 'description',
          allowAnalyzedField: true
        }
      }
    ]
  }
}
I hope this is helpful. keyword is only a good analyzer for the description field if you want to do the wildcard search; standard, or the analyzer for the language the description is written in, would both be better.

What is the best way to query an array of subdocument in MongoDB?

Let's say I have a collection with documents like so:
{
  "id": "2902-48239-42389-83294",
  "data": {
    "location": [
      {
        "country": "Italy",
        "city": "Rome"
      }
    ],
    "time": [
      {
        "timestamp": "1626298659",
        "data": "2020-12-24 09:42:30"
      }
    ],
    "details": [
      {
        "timestamp": "1626298659",
        "data": {
          "url": "https://example.com",
          "name": "John Doe",
          "email": "john@doe.com"
        }
      },
      {
        "timestamp": "1626298652",
        "data": {
          "url": "https://www.myexample.com",
          "name": "John Doe",
          "email": "doe@john.com"
        }
      },
      {
        "timestamp": "1626298652",
        "data": {
          "url": "http://example.com/sub/directory",
          "name": "John Doe",
          "email": "doe@johnson.com"
        }
      }
    ]
  }
}
Now the main focus is on the array of subdocuments ("data.details"): I want to get output only for relevant matches, e.g.:
db.info.find({"data.details.data.url": "example.com"})
How can I get a match for every "data.details.data.url" that contains "example.com" without also matching "myexample.com"? When I do it with $regex I get too many results: if I query for "example.com" it also returns "myexample.com".
Even when I do get partial results (with $match), it's very slow. I tried these aggregation stages:
{ "$unwind": "$data.details" },
{
  "$match": {
    "data.details.data.url": /.*example.com.*/
  }
},
{
  "$project": {
    id: 1,
    "data.details.data.url": 1,
    "data.details.data.email": 1
  }
},
I really don't understand the pattern: with $match, sometimes Mongo does recognize prefixes like "https://" or "https://www." and sometimes it does not.
More info:
My collection is dozens of GB, and I created two indexes:
Compound like so:
"data.details.data.url": 1,
"data.details.data.email": 1
Text Index:
"data.details.data.url": "text",
"data.details.data.email": "text"
They did improve the query performance, but not enough, and I still have the issue with $match vs $regex. Thanks in advance!
Your mistake is in the regex. It matches all URLs because the substring example.com occurs in all of them; for example, https://www.myexample.com contains example.com.
To avoid this you have to use a different regex, for example one that only matches when the domain starts right after the protocol or a www. prefix:
(http[s]?:\/\/|www\.)YOUR_SEARCH
will check that what you are searching for sits right behind an http:// or www. marker.
https://regex101.com/r/M4OLw1/1
Here is the full query:
[
  {
    '$unwind': {
      'path': '$data.details'
    }
  },
  {
    '$match': {
      'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
    }
  }
]
Note: you must escape special characters in the regex. A dot matches any character, and an unescaped slash will close your regex, causing an error.
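The behavior of the two patterns can be sanity-checked outside Mongo. A small Python sketch (illustrative URLs, not the real collection) showing how the anchored pattern excludes myexample.com while the naive substring pattern does not:

```python
import re

details = [
    "https://example.com",
    "https://www.myexample.com",
    "http://example.com/sub/directory",
]

# Naive substring pattern: matches all three URLs, including
# myexample.com, because "example.com" occurs inside it.
naive = re.compile(r".*example\.com.*")
naive_hits = [u for u in details if naive.search(u)]

# Anchored pattern: the domain must follow "http(s)://" or "www."
term = re.escape("example.com")  # escape the dot in the search term
anchored = re.compile(r"(http[s]?://|www\.)" + term)
anchored_hits = [u for u in details if anchored.search(u)]
```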

How can I count all possible subdocument elements for a given top element in Mongo?

Not sure I am using the right terminology here, but assume the following oversimplified JSON structure available in Mongo:
{
  "_id": 1234,
  "labels": {
    "label1": { "id": "l1", "value": "abc" },
    "label3": { "id": "l2", "value": "def" },
    "label5": { "id": "l3", "value": "ghi" },
    "label9": { "id": "l4", "value": "xyz" }
  }
}
{
  "_id": 5678,
  "labels": {
    "label1": { "id": "l1", "value": "hjk" },
    "label5": { "id": "l5", "value": "def" },
    "label10": { "id": "l10", "value": "ghi" },
    "label24": { "id": "l24", "value": "xyz" }
  }
}
I know my base element name (labels in the example), but I do not know which sub-elements can occur (in this case, the labelx names).
How can I group/count the existing elements (as if I were using a wildcard) so that I get a distinct overview like
"label1":2
"label3":1
"label5":2
"label9":1
"label10":1
"label24":1
as a result? So far I have only found examples where you actually need to know the element names. But I don't know them, and I want some way to get all possible sub-element names for a given top element for easy review.
In reality the label names can be pretty wild, I used labelx for readability in the example.
You can try the below aggregation in 3.4.
Use $objectToArray to transform the object into an array of key-value pairs, followed by $unwind and $group on the key to count occurrences.
db.col.aggregate([
  {"$project": {"labels": {"$objectToArray": "$labels"}}},
  {"$unwind": "$labels"},
  {"$group": {"_id": "$labels.k", "count": {"$sum": 1}}}
])
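For intuition, the pipeline is just turning each labels object into its keys and counting them. A plain-Python sketch of the same logic (the two documents above, simplified to empty subdocuments since only the keys matter for the count):

```python
from collections import Counter

# Simplified stand-ins for the two sample documents.
docs = [
    {"_id": 1234, "labels": {"label1": {}, "label3": {}, "label5": {}, "label9": {}}},
    {"_id": 5678, "labels": {"label1": {}, "label5": {}, "label10": {}, "label24": {}}},
]

# $objectToArray + $unwind + $group boils down to counting keys:
counts = Counter(key for doc in docs for key in doc["labels"])
print(counts["label1"], counts["label3"])  # 2 1
```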

Filtering nested results an OData Query

I have an OData query returning a bunch of items. The results come back looking like this:
{
  "d": {
    "__metadata": {
      "id": "http://dev.sp.swampland.local/_api/SP.UserProfiles.PeopleManager/GetPropertiesFor(accountName=@v)",
      "uri": "http://dev.sp.swampland.local/_api/SP.UserProfiles.PeopleManager/GetPropertiesFor(accountName=@v)",
      "type": "SP.UserProfiles.PersonProperties"
    },
    "UserProfileProperties": {
      "results": [
        {
          "__metadata": { "type": "SP.KeyValue" },
          "Key": "UserProfile_GUID",
          "Value": "66a0c6c2-cbec-4abb-9e25-cc9e924ad390",
          "ValueType": "Edm.String"
        },
        {
          "__metadata": { "type": "SP.KeyValue" },
          "Key": "ADGuid",
          "Value": "System.Byte[]",
          "ValueType": "Edm.String"
        },
        {
          "__metadata": { "type": "SP.KeyValue" },
          "Key": "SID",
          "Value": "S-1-5-21-2355771569-1952171574-2825027748-500",
          "ValueType": "Edm.String"
        }
      ]
    }
  }
}
In reality, there are a lot of items (100+) coming back in the UserProfileProperties collection, but I'm only looking for the few whose Key matches certain values, and I can't figure out exactly what my filter needs to be. I've tried $filter=UserProfileProperties/Key eq 'SID' but that still gives me everything. I'm also trying to figure out how to pull back multiple items.
Ideas?
I believe you overlooked that each of the results has a Key, not the UserProfileProperties object itself, so UserProfileProperties/Key doesn't actually exist. Because results is an array, you must either address a specific position (e.g. results(1)) or use the OData lambda operators any or all.
Try $filter=UserProfileProperties/results/any(r: r/Key eq 'SID') if you want all the profiles where just one of the keys is SID or use
$filter=UserProfileProperties/results/all(r: r/Key eq 'SID') if you want the profiles where every result has a key equaling SID.
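If it helps to see the semantics, the OData any/all lambdas behave like Python's built-ins applied over the results array. A sketch using a trimmed version of the response above (metadata and ValueType fields dropped for brevity):

```python
# Trimmed version of the "results" array from the response above.
results = [
    {"Key": "UserProfile_GUID", "Value": "66a0c6c2-cbec-4abb-9e25-cc9e924ad390"},
    {"Key": "ADGuid", "Value": "System.Byte[]"},
    {"Key": "SID", "Value": "S-1-5-21-2355771569-1952171574-2825027748-500"},
]

# any(r: r/Key eq 'SID') keeps the profile if at least one entry matches:
matches_any = any(r["Key"] == "SID" for r in results)

# all(r: r/Key eq 'SID') keeps it only if every entry matches:
matches_all = all(r["Key"] == "SID" for r in results)

print(matches_any, matches_all)  # True False
```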

Querying Multi Level Nested fields on Elastic Search

I'm new to Elastic Search and to the non-SQL paradigm.
I've been following the ES tutorial, but there is one thing I couldn't get to work.
In the following code (I'm using PyES to interact with ES) I create a single document with a nested field (subjects) that contains another nested field (concepts).
from pyes import *

conn = ES('127.0.0.1:9200')  # Use HTTP

# Delete and create a new index.
conn.indices.delete_index("documents-index")
conn.create_index("documents-index")

# Create a single document.
document = {
    "docid": 123456789,
    "title": "This is the doc title.",
    "description": "This is the doc description.",
    "datepublished": 2005,
    "author": ["Joe", "John", "Charles"],
    "subjects": [
        {
            "subjectname": 'subject1',
            "subjectid": [210, 311, 1012, 784, 568],
            "subjectkey": 2,
            "concepts": [
                {"name": "concept1", "score": 75},
                {"name": "concept2", "score": 55}
            ]
        },
        {
            "subjectname": 'subject2',
            "subjectid": [111, 300, 141, 457, 748],
            "subjectkey": 0,
            "concepts": [
                {"name": "concept3", "score": 88},
                {"name": "concept4", "score": 55},
                {"name": "concept5", "score": 66}
            ]
        }
    ],
}

# Define the nested elements.
mapping1 = {
    'subjects': {
        'type': 'nested'
    }
}
mapping2 = {
    'concepts': {
        'type': 'nested'
    }
}
conn.put_mapping("document", {'properties': mapping1}, ["documents-index"])
conn.put_mapping("subjects", {'properties': mapping2}, ["documents-index"])

# Insert the document into the 'documents-index' index.
conn.index(document, "documents-index", "document", 1)

# Refresh the index so it can be queried.
conn.refresh()
I'm able to query the subjects nested field:
query1 = {
    "nested": {
        "path": "subjects",
        "score_mode": "avg",
        "query": {
            "bool": {
                "must": [
                    {"text": {"subjects.subjectname": "subject1"}},
                    {"range": {"subjects.subjectkey": {"gt": 1}}}
                ]
            }
        }
    }
}
results = conn.search(query=query1)
for r in results:
    print r  # as expected, it returns the entire document
but I can't figure out how to query based on the concepts nested field.
The ES documentation states that:
Multi level nesting is automatically supported, and detected,
resulting in an inner nested query to automatically match the relevant
nesting level (and not root) if it exists within another nested query.
So, I tried to build a query with the following format:
query2 = {
    "nested": {
        "path": "concepts",
        "score_mode": "avg",
        "query": {
            "bool": {
                "must": [
                    {"text": {"concepts.name": "concept1"}},
                    {"range": {"concepts.score": {"gt": 0}}}
                ]
            }
        }
    }
}
which returned 0 results.
I can't figure out what is missing, and I haven't found any example of queries based on two levels of nesting.
OK, after trying a ton of combinations, I finally got it using the following query:
query3 = {
    "nested": {
        "path": "subjects",
        "score_mode": "avg",
        "query": {
            "bool": {
                "must": [
                    {"text": {"subjects.concepts.name": "concept1"}}
                ]
            }
        }
    }
}
So, the nested path attribute (subjects) is always the same, no matter how deeply the attribute is nested, and in the query definition I used the attribute's full path (subjects.concepts.name).
Shot in the dark since I haven't tried this personally, but have you tried the fully qualified path to Concepts?
query2 = {
    "nested": {
        "path": "subjects.concepts",
        "score_mode": "avg",
        "query": {
            "bool": {
                "must": [
                    {"text": {"subjects.concepts.name": "concept1"}},
                    {"range": {"subjects.concepts.score": {"gt": 0}}}
                ]
            }
        }
    }
}
I have a question about JCJS's answer: shouldn't your mapping look like this?
mapping = {
    "subjects": {
        "type": "nested",
        "properties": {
            "concepts": {
                "type": "nested"
            }
        }
    }
}
Defining two separate type mappings may not work; it can leave the data flattened. I think the inner field should be nested inside the outer nested field's properties.
Finally, if we use this mapping, the nested query should look like this:
{
  "query": {
    "nested": {
      "path": "subjects.concepts",
      "query": {
        "term": {
          "name": {
            "value": "concept1"
          }
        }
      }
    }
  }
}
It's vital to use the full path for the path attribute, but the term key can be either a full path or a relative path.