MongoDB Atlas Search not showing results when typing a few characters - mongodb

The problem I am facing is that I want to develop an autocomplete search bar using the MEAN stack, like the one on this site, but when I type, for example, 'ag', it's not returning the expected location, which should be 'Aguascalientes'.
I have two different search indexes set up and a different query for each.
First Index:
{
"mappings": {
"dynamic": false,
"fields": {
"name": {
"foldDiacritics": false,
"maxGrams": 7,
"minGrams": 3,
"tokenization": "edgeGram",
"type": "autocomplete"
},
"searchName": {
"foldDiacritics": false,
"maxGrams": 7,
"minGrams": 3,
"tokenization": "edgeGram",
"type": "autocomplete"
}
}
}
}
First Query:
[
{
$search: {
index: "autocomplete2",
compound: {
must: [
{
text: {
query: search,
path: "searchName",
fuzzy: {
maxEdits: 2,
},
},
},
],
},
},
},
{
$limit: 10,
},
]
The first index + query setup is not returning any documents at all, but the second one is. Second index:
{
"mappings": {
"dynamic": false,
"fields": {
"name": {
"analyzer": "lucene.standard",
"type": "string"
},
"searchName": {
"analyzer": "lucene.standard",
"type": "string"
}
}
}
}
Query:
[
{
$search: {
index: 'default',
compound: {
must: [
{
text: {
query: search,
path: 'name',
fuzzy: {
maxEdits: 1,
},
},
},
{
text: {
query: search,
path: 'searchName',
fuzzy: {
maxEdits: 1,
},
},
},
],
},
},
},
{
$limit: 5,
},
]
The second example only returns documents if the search term is something like 'aguascalient', but it returns nothing if the search term is shorter, as it is on the site I mentioned. Maybe it has something to do with the fuzzy maxEdits, but if I set it to greater than 2 I get an error.
Also, the order is not right: it returns the CITY first and the STATE second, but I need the STATE first because the search term is more similar to it than to the city. Let me explain: the search field for the STATE is just 'Aguascalientes', while the search field for the city is 'Aguascalientes Aguascalientes', so I don't know why it is not working properly. Maybe I should assign weights accordingly, but I'm not sure that is the right approach to solve this.
My data structure:
{
"_id": "638d0ffc34ad076c6bd12cb6",
"depth": 2,
"label": "CITY",
"location_id": "V1-C-247",
"name": "Aguascalientes",
"parent": "Aguascalientes",
"fullName": "Aguascalientes, Aguascalientes",
"parentId": "V1-B-61",
"searchName": "Aguascalientes Aguascalientes",
}
{
"_id": "638d0ffc34ad076c6bd12cb6",
"depth": 1,
"label": "STATE",
"location_id": "V1-C-248",
"name": "Aguascalientes",
"parent": null,
"fullName": "Aguascalientes",
"parentId": null,
"searchName": "Aguascalientes",
}

For the first index + query setup:
First, you are indexing the name field but are not searching on it. I will remove it from the code snippets for readability, but you can add it back to your index definition if you find you need to search on it.
There are two problems with this index + query setup if you want to return results with a query for "ag". You have searchName defined as a field mapping of type autocomplete, but you also need to use the autocomplete operator in your query:
[
{
$search: {
index: "autocomplete2",
compound: {
must: [
{
autocomplete: {
query: search,
path: "searchName",
},
},
],
},
},
},
{
$limit: 10,
},
]
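As a side note, if you still want the typo tolerance your original query had with the text operator, the autocomplete operator also accepts a fuzzy option. A minimal sketch against the same index (tune maxEdits and prefixLength to your data):
[
  {
    $search: {
      index: "autocomplete2",
      autocomplete: {
        query: search,
        path: "searchName",
        // fuzzy is optional; maxEdits can be 1 or 2
        fuzzy: {
          maxEdits: 1,
          prefixLength: 1,
        },
      },
    },
  },
  {
    $limit: 10,
  },
]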
Second, in your index definition field mapping for searchName, you have minGrams set to 3 and maxGrams set to 7. Based on the documentation for the autocomplete field mapping, this means that your data will be tokenized into sequences of character lengths between 3 and 7, using the selected tokenization strategy. Since you have selected edgeGram, the text "Aguascalientes" will be tokenized starting from the left edge, resulting in the tokens "agu", "agua", "aguas", "aguasc", "aguasca". Since the search term "ag" does not match any of these tokens, nothing is returned. So, you must change minGrams to 2 to get the token "ag":
{
"mappings": {
"dynamic": false,
"fields": {
"searchName": {
"foldDiacritics": false,
"maxGrams": 7,
"minGrams": 2,
"tokenization": "edgeGram",
"type": "autocomplete"
}
}
}
}
Finally, if you want the document with an exact match to rank above a partial match, i.e. "Aguascalientes" should return before "Aguascalientes Aguascalientes", you need to implement exact matching. There is a MongoDB blog post outlining a few options.
One option that I tried: In the index, use a keyword analyzer on the "searchName" field typed as a string data type. In the query, use the text operator nested in a should clause so that exact matches will return higher than other results.
Index:
{
"mappings": {
"dynamic": false,
"fields": {
"searchName": [
{
"foldDiacritics": false,
"maxGrams": 7,
"type": "autocomplete"
},
{
"analyzer": "lucene.keyword",
"searchAnalyzer": "lucene.keyword",
"type": "string"
}
]
}
}
}
Query:
[
{
$search: {
compound: {
must: [
{
autocomplete: {
query: search,
path: "searchName"
}
}
],
should:[
{
text: {
query: search,
path: "searchName"
}
}
],
},
},
},
]
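For completeness, a minimal sketch of running this pipeline from the Node/Express side of a MEAN app. The locations collection name is a placeholder (adjust to your schema), and db is assumed to be an already-connected Db instance from the official MongoDB Node.js driver:
// Sketch: run the $search pipeline from Node; `locations` is a placeholder name.
async function searchLocations(db, search) {
  return db
    .collection("locations")
    .aggregate([
      {
        $search: {
          compound: {
            must: [{ autocomplete: { query: search, path: "searchName" } }],
            should: [{ text: { query: search, path: "searchName" } }],
          },
        },
      },
      { $limit: 10 },
    ])
    .toArray();
}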

Related

Highlight characters within words in Opensearch query

I have set up a custom analyser that uses an edge_ngram filter for a text field. I'm then trying to highlight the characters a user types, but OpenSearch is highlighting the entire word, even if only a small number of characters have been typed.
E.g. typing "Man" in the search bar results in the whole word "Manly" being highlighted: <em>Manly</em> Trail Running Tour. What I really want is <em>Man</em>ly Trail Running Tour.
This should be possible with the fvh highlighting type and chars as the boundary_scanner argument, per the docs: https://opensearch.org/docs/2.1/opensearch/search/highlight/#highlighting-options
Settings
"title_autocomplete": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "autocomplete"
}
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete": {
"filter": [
"lowercase",
"edge_ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
Query:
{
"track_total_hits": true,
"highlight": {
"type": "fvh",
"boundary_scanner": "chars",
"fields": {
"title_autocomplete": {}
}
},
"size": 6,
"query": {
"multi_match": {
"query": "Man",
"fields": [
"title_autocomplete^2"
]
}
}
}
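As far as I understand, an edge_ngram token filter keeps the start/end offsets of the original token, and those offsets are what the highlighter wraps in <em> tags, which would explain the whole-word highlight regardless of the fvh/boundary_scanner settings. One way to check is the _analyze API; a sketch (my-index is a placeholder for your index name):
GET my-index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Manly"
}
If every gram ("Man", "Manl", "Manly") reports the same start_offset and end_offset as the full word, the behaviour comes from the analysis chain rather than from the highlighter options.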

MongoDB: $set specific fields for a document array elements only if not null

I have a collection with the following documents (for example):
{
"_id": {
"$oid": "61acefe999e03b9324czzzzz"
},
"matchId": {
"$oid": "61a392cc54e3752cc71zzzzz"
},
"logs": [
{
"actionType": "CREATE",
"data": {
"talent": {
"talentId": "qq",
"talentVersion": "2.10",
"firstName": "Joelle",
"lastName": "Doe",
"socialLinks": [
{
"type": "FACEBOOK",
"url": "https://www.facebook.com"
},
{
"type": "LINKEDIN",
"url": "https://www.linkedin.com"
}
],
"webResults": [
{
"type": "VIDEO",
"date": "2021-11-28T14:31:40.728Z",
"link": "http://placeimg.com/640/480",
"title": "Et necessitatibus",
"platform": "Repellendus"
}
]
},
"createdBy": "DEVELOPER"
}
},
{
"actionType": "UPDATE",
"data": {
"talent": {
"firstName": "Joelle new",
"webResults": [
{
"type": "VIDEO",
"date": "2021-11-28T14:31:40.728Z",
"link": "http://placeimg.com/640/480",
"title": "Et necessitatibus",
"platform": "Repellendus"
}
]
}
}
}
]
},
{
"_id": {
"$oid": "61acefe999e03b9324caaaaa"
},
"matchId": {
"$oid": "61a392cc54e3752cc71zzzzz"
},
"logs": [....]
}
A brief breakdown: I have many objects like this one in the collection. They are a kind of audit log for actions taken on other documents, 'Match(es)', for example CREATE + the data, UPDATE + the data, etc.
As you can see, the logs field of the document is an array of objects, each describing one of these actions.
The data for each action may or may not contain specific fields, which in turn can also be arrays of objects: socialLinks and webResults.
I'm trying to remove sensitive data from all of these documents with specified Match ids.
For each document, I want to go over the logs array field and change the value of specific fields only if they exist, for example: change firstName to *****, and the same for lastName, if those appear. Also, go over the socialLinks array if it exists, and for each element inside it, if a url field exists, change it to ***** as well.
What I've tried so far are many minor variations of this query:
$set: {
'logs.$[].data.talent.socialLinks.$[].url': '*****',
'logs.$[].data.talent.webResults.$[].link': '*****',
'logs.$[].data.talent.webResults.$[].title': '*****',
'logs.$[].data.talent.firstName': '*****',
'logs.$[].data.talent.lastName': '*****',
},
and some experimenting with this kind of aggregation pipeline:
[{
$set: {
'talent.socialLinks.$[el].url': {
$cond: [{ $ne: ['el.url', null] },'*****', undefined],
},
},
}]
resulting in errors like: message: "The path 'logs.0.data.talent.socialLinks' must exist in the document in order to apply array updates."
But I just can't get it to work... :(
I would love an explanation of how exactly to achieve this kind of set-only-if-exists behaviour.
A working example would also be much appreciated, thanks.
I would suggest using $[<identifier>] (the filtered positional operator) together with arrayFilters to update the nested document(s) in the array field.
In arrayFilters, use $exists to check for the existence of the field, so that only the documents which match the condition are updated.
db.collection.update({},
{
$set: {
"logs.$[a].data.talent.socialLinks.$[].url": "*****",
"logs.$[b].data.talent.webResults.$[].link": "*****",
"logs.$[b].data.talent.webResults.$[].title": "*****",
"logs.$[c].data.talent.firstName": "*****",
"logs.$[d].data.talent.lastName": "*****",
}
},
{
arrayFilters: [
{
"a.data.talent.socialLinks": {
$exists: true
}
},
{
"b.data.talent.webResults": {
$exists: true
}
},
{
"c.data.talent.firstName": {
$exists: true
}
},
{
"d.data.talent.lastName": {
$exists: true
}
}
]
})
Sample Mongo Playground
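Since the question mentions doing this only for specified Match ids, the same update can also be restricted with a filter on matchId. A sketch using updateMany (the id below is just the placeholder value from the sample documents):
db.collection.updateMany(
  { matchId: { $in: [ObjectId("61a392cc54e3752cc71zzzzz")] } }, // placeholder id
  {
    $set: {
      "logs.$[a].data.talent.socialLinks.$[].url": "*****",
      "logs.$[b].data.talent.webResults.$[].link": "*****",
      "logs.$[b].data.talent.webResults.$[].title": "*****",
      "logs.$[c].data.talent.firstName": "*****",
      "logs.$[d].data.talent.lastName": "*****"
    }
  },
  {
    arrayFilters: [
      { "a.data.talent.socialLinks": { $exists: true } },
      { "b.data.talent.webResults": { $exists: true } },
      { "c.data.talent.firstName": { $exists: true } },
      { "d.data.talent.lastName": { $exists: true } }
    ]
  }
)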

What is the best way to query an array of subdocument in MongoDB?

let's say I have a collection like so:
{
"id": "2902-48239-42389-83294",
"data": {
"location": [
{
"country": "Italy",
"city": "Rome"
}
],
"time": [
{
"timestamp": "1626298659",
"data":"2020-12-24 09:42:30"
}
],
"details": [
{
"timestamp": "1626298659",
"data": {
"url": "https://example.com",
"name": "John Doe",
"email": "john#doe.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "https://www.myexample.com",
"name": "John Doe",
"email": "doe#john.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "http://example.com/sub/directory",
"name": "John Doe",
"email": "doe#johnson.com"
}
}
]
}
}
Now, the main focus is on the array of subdocuments ("data.details"): I want to get output only for relevant matches, e.g.:
db.info.find({"data.details.data.url": "example.com"})
How can I match every "data.details.data.url" that contains "example.com" without also matching "myexample.com"? When I do it with $regex I get too many results, so if I query for "example.com" it also returns "myexample.com".
Even when I do get partial results (with $match), it's very slow. I tried these aggregation stages:
{ $unwind: "$data.details" },
{
$match: {
"data.details.data.url": /.*example.com.*/,
},
},
{
$project: {
id: 1,
"data.details.data.url": 1,
"data.details.data.email": 1,
},
},
I really don't understand the pattern: with $match, sometimes Mongo does recognize prefixes like "https://" or "https://www." and sometimes it does not.
More info:
My collection is dozens of GB in size, and I created two indexes:
A compound index, like so:
"data.details.data.url": 1,
"data.details.data.email": 1
A text index:
"data.details.data.url": "text",
"data.details.data.email": "text"
They did improve the query performance, but not enough, and I still have this issue with $match vs $regex. Thanks to all helpers!
Your mistake is in the regex. It matches all the URLs because the substring example.com appears in all of them; for example, https://www.myexample.com contains example.com.
To avoid this you have to use another regex, for example one that requires the match to start at the domain.
For example:
(http[s]?:\/\/|www\.)YOUR_SEARCH
will check that what you are searching for comes right after an http://, https:// or www. marker.
https://regex101.com/r/M4OLw1/1
Here is the full query:
[
{
'$unwind': {
'path': '$data.details'
}
}, {
'$match': {
'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
}
}
]
Note: you must escape special characters in the regex. A dot matches any character, and an unescaped slash will close your regex, causing an error.
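As a side note (a sketch, not part of the original answer): if you only need the matching documents rather than the individual array elements, an anchored version of the same regex works directly in find() without the $unwind stage, and anchoring with ^ also lets MongoDB use an index on the field for the literal prefix:
// Sketch: return documents where any details entry's url is on example.com.
// Anchoring with ^ excludes "myexample.com"; dots are escaped to match literally.
db.info.find(
  { "data.details.data.url": /^http[s]?:\/\/(www\.)?example\.com/ },
  { "data.details.data.url": 1, "data.details.data.email": 1 }
)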

elasticsearch ngram and postgresql trigram search results do not match

I've created an index on Elasticsearch, as below:
"settings" : {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"issue": {
"properties": {
"description": {
"type": "string",
"analyzer": "trigrams"
}
}
}
}
My test items are below:
"alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis"
"otomatik onay işlemi gecikmiş"
"************* nolu iade islemi urun kargoya verilmedi zamaninda iade islemlerinde urun erorr hata veriyor"
I've tested this index with the query below:
GET issue/_search
{
"query": {
"match": {
"description":{
"query": "otomatik onay istemi zamaninda gerceklesmemis"
}
}
}
}
And result:
{
....
"hits": {
....
"max_score": 2.3507352,
"hits": [
{
....
"_score": 2.3507352,
"_source": {
"issue_id": "*******",
"description": "alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis"
}
}
]
}
}
But the same data on PostgreSQL, with the SQL below, gives a different result:
SELECT
public.tbl_issue_descriptions_big.description,
similarity(description, 'otomatik onay islemi zamaninda gerceklesmemis') AS sml
FROM
public.tbl_issue_descriptions_big
WHERE
description % 'otomatik onay islemi zamaninda gerceklesmemis'
ORDER BY
sml DESC
LIMIT 10
Result is:
description | sml
======================================================|======
otomatik onay islemi gecikmis |0,351852
Why does this difference occur?
I don't know enough about Postgres to give a qualified answer there (this also depends on the documents that are indexed and on whether the scoring formulas are exactly the same, which I doubt), but Elasticsearch has an explain API and an explain parameter in the search that help you find out why a certain document was scored this way.
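For example, a sketch of the explain parameter with the query from the question; each hit in the response then carries an _explanation of its score:
GET issue/_search
{
  "explain": true,
  "query": {
    "match": {
      "description": {
        "query": "otomatik onay istemi zamaninda gerceklesmemis"
      }
    }
  }
}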

Querying Multi Level Nested fields on Elastic Search

I'm new to Elasticsearch and to the NoSQL paradigm.
I've been following the ES tutorial, but there is one thing I couldn't get to work.
In the following code (I'm using PyES to interact with ES) I create a single document with a nested field (subjects), which contains another nested field (concepts).
from pyes import *
conn = ES('127.0.0.1:9200') # Use HTTP
# Delete and Create a new index.
conn.indices.delete_index("documents-index")
conn.create_index("documents-index")
# Create a single document.
document = {
"docid": 123456789,
"title": "This is the doc title.",
"description": "This is the doc description.",
"datepublished": 2005,
"author": ["Joe", "John", "Charles"],
"subjects": [{
"subjectname": 'subject1',
"subjectid": [210, 311, 1012, 784, 568],
"subjectkey": 2,
"concepts": [
{"name": "concept1", "score": 75},
{"name": "concept2", "score": 55}
]
},
{
"subjectname": 'subject2',
"subjectid": [111, 300, 141, 457, 748],
"subjectkey": 0,
"concepts": [
{"name": "concept3", "score": 88},
{"name": "concept4", "score": 55},
{"name": "concept5", "score": 66}
]
}],
}
# Define the nested elements.
mapping1 = {
'subjects': {
'type': 'nested'
}
}
mapping2 = {
'concepts': {
'type': 'nested'
}
}
conn.put_mapping("document", {'properties': mapping1}, ["documents-index"])
conn.put_mapping("subjects", {'properties': mapping2}, ["documents-index"])
# Insert document in 'documents-index' index.
conn.index(document, "documents-index", "document", 1)
# Refresh connection to make queries.
conn.refresh()
I'm able to query the subjects nested field:
query1 = {
"nested": {
"path": "subjects",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"subjects.subjectname": "subject1"}
},
{
"range": {"subjects.subjectkey": {"gt": 1}}
}
]
}
}
}
}
results = conn.search(query=query1)
for r in results:
print r # as expected, it returns the entire document.
but I can't figure out how to query based on the concepts nested field.
The ES documentation states that
Multi level nesting is automatically supported, and detected,
resulting in an inner nested query to automatically match the relevant
nesting level (and not root) if it exists within another nested query.
So, I tried to build a query with the following format:
query2 = {
"nested": {
"path": "concepts",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"concepts.name": "concept1"}
},
{
"range": {"concepts.score": {"gt": 0}}
}
]
}
}
}
}
which returned 0 results.
I can't figure out what is missing, and I haven't found any examples of queries based on two levels of nesting.
OK, after trying a ton of combinations, I finally got it using the following query:
query3 = {
"nested": {
"path": "subjects",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"subjects.concepts.name": "concept1"}
}
]
}
}
}
}
So, the nested path attribute (subjects) stays the same no matter the nesting level of the attribute, and in the query definition I used the attribute's full path (subjects.concepts.name).
Shot in the dark since I haven't tried this personally, but have you tried the fully qualified path to Concepts?
query2 = {
"nested": {
"path": "subjects.concepts",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"text": {"subjects.concepts.name": "concept1"}
},
{
"range": {"subjects.concepts.score": {"gt": 0}}
}
]
}
}
}
}
I have a question about JCJS's answer: why shouldn't your mapping look like this?
mapping = {
"subjects": {
"type": "nested",
"properties": {
"concepts": {
"type": "nested"
}
}
}
}
Defining two separate type mappings may not work as intended; it can produce flattened data. I think we should nest the inner field inside the outer nested field's properties.
Finally, if we use this mapping, the nested query should look like this:
{
"query": {
"nested": {
"path": "subjects.concepts",
"query": {
"term": {
"name": {
"value": "concept1"
}
}
}
}
}
}
It's vital to use the full path for the path attribute, but the term key can be either a full path or a relative path.