I have a collection. The document structure is,
{
model: {
name: 'string name'
}
}
I have enabled Atlas Search and created a search index for the model.name field. Search works fine, but the only issue is that I can't get results for very short query strings.
Example:
I have a document,
{
model: {
name: "space1duplicate"
}
}
If I query space, I don't get the result.
{
index: 'search_index',
compound: {
must: [
{
text: {
query: 'space',
path: 'model.name'
}
}
]
}
}
But if I query space1duplica, it returns the result.
During indexing, the full-text search engine tokenizes the input by splitting the text into searchable chunks. Check out the relevant section in the documentation.
By default, Atlas Search does not split words on digits, but if you need that, try defining a custom analyzer with the regex tokenizer and using it for your field. Since model.name is an embedded field, the mapping nests name under a model document:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "model": {
        "type": "document",
        "fields": {
          "name": [
            {
              "analyzer": "digitSplitter",
              "type": "string"
            }
          ]
        }
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "digitSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[0-9]+",
        "type": "regexSplit"
      }
    }
  ]
}
Also note that you can use multiple analyzers for string fields, if needed.
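With this index in place, the text operator can match the digit-separated chunks. A minimal sketch in mongosh, assuming the collection is called models (a hypothetical name) and the index is called search_index as in the question:

db.models.aggregate([
  {
    $search: {
      index: 'search_index',
      text: {
        query: 'space', // matches the "space" token produced by digitSplitter
        path: 'model.name'
      }
    }
  }
])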
Atlas Search uses Lucene to do the job. The documentation on the MongoDB site is mostly focused on the Mongo-specific syntax for passing the query to Lucene, and it might be a bit confusing if you are not familiar with Lucene's query language.
First of all, there are a number of tokenizers and analyzers available, each serving a specific purpose. You really need to include the index definition when you ask questions about Atlas Search.
The default tokenizer uses word separators to build the index, then removes endings to store stems, depending on the language (English by default). Since "space1duplicate" contains no separators, it is indexed as a single token, which is why the prefix "space" alone doesn't match.
So in order to find "space1duplicate" by the beginning of the word, you can use the "autocomplete" field type with nGram tokenization. The index should be created as follows:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "model": {
        "type": "document",
        "fields": {
          "name": {
            "tokenization": "nGram",
            "type": "autocomplete"
          }
        }
      }
    }
  },
  "storedSource": {
    "include": [
      "model.name"
    ]
  }
}
Once it's indexed (you may need to wait a bit if you have a larger dataset), you can find the document with the following search:
{
index: 'search_index',
compound: {
must: [
{
autocomplete: {
query: 'spa',
path: 'model.name'
}
}
]
}
}
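Since the index also stores model.name via storedSource, you can optionally skip the full document fetch and return the stored field directly. A sketch, assuming a MongoDB version recent enough to support returnStoredSource and a hypothetical collection name models:

db.models.aggregate([
  {
    $search: {
      index: 'search_index',
      autocomplete: {
        query: 'spa',
        path: 'model.name'
      },
      returnStoredSource: true // return fields stored in the index instead of fetching full documents
    }
  }
])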
Related
The most common use case for this would probably be a users table with name, lname, email, phone.
I might search for name contains "paul", email contains "2@yahoo"
I might search for phone = 01234567890
I might search for email = "foo@bar.com"
It is my understanding that a Mongo compound index works in field order. So an index that looks like
name: 1, lname: 1, email: 1, phone: 1 wouldn't work for any of the above queries?
What's the best indexing strategy to account for search tables like this?
So, Paul, you will need to create an index definition before you can run the query. Creating your first search index definition in the collection view in Atlas Data Explorer can be tricky.
Here's what I would recommend for an index definition based on those docs:
{
"mappings": {
"fields": {
"email": {
"analyzer": "lucene.keyword",
"type": "string"
},
"phone": {
"analyzer": "lucene.keyword",
"type": "string"
},
"name": {
"analyzer": "lucene.keyword",
"type": "string"
},
"lname": {
"analyzer": "lucene.keyword",
"type": "string"
}
}
}
}
Here is what I would recommend for a contains-style query on the email and name fields:
{
  $search: {
    index: 'default',
    compound: {
      must: [{
        wildcard: {
          query: '*paul*',
          path: 'name',
          allowAnalyzedField: true // required when the path uses an analyzer, including lucene.keyword
        }
      }, {
        wildcard: {
          query: '*2@yahoo*',
          path: 'email',
          allowAnalyzedField: true
        }
      }]
    }
  }
}
This should be a lightning-fast query, even for a large index, and it combines multiple clauses just as you described (each clause sits in the compound.must array above). Let me know if you have any more trouble. There are also features like highlighting that should be helpful; see the sketch below.
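As a sketch of that highlighting feature (the highlights projection field name is arbitrary, and it is worth verifying how highlighting behaves on lucene.keyword-analyzed fields):

db.users.aggregate([
  {
    $search: {
      index: 'default',
      wildcard: {
        query: '*paul*',
        path: 'name',
        allowAnalyzedField: true
      },
      highlight: { path: 'name' } // ask Atlas Search to return matching fragments
    }
  },
  {
    $project: {
      name: 1,
      email: 1,
      highlights: { $meta: 'searchHighlights' } // surface the fragments in the output
    }
  }
])

The collection name users is assumed from the question.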
In the documentation of MongoDB Atlas Search, it says the following for the autocomplete operator:
query: String or strings to search for. If there are multiple terms in
a string, Atlas Search also looks for a match for each term in the
string separately.
For the text operator, the same thing applies:
query: The string or strings to search for. If there are multiple
terms in a string, Atlas Search also looks for a match for each term
in the string separately.
Matching each term separately seems like odd behaviour to me. We need multiple searches in our app, and for each we expect fewer results the more words you type, not more.
Example: When searching for "John Doe", I expect only results with both "John" and "Doe". Currently, I get results that match either "John" or "Doe".
Is this not possible using MongoDB Atlas Search, or am I doing something wrong?
Update
Currently, I have solved it by splitting the search term on spaces (' ') and adding each individual keyword to a separate must sub-clause (with the compound operator). However, the search query then no longer returns any results if one of the keywords has only one character. To account for that, I handle keywords with one character separately from those with multiple characters.
The snippet below works, but for this I need to save two generated fields on each document:
searchString: a string with all the searchable fields concatenated, e.g. "John Doe Man Streetstreet Citycity"
searchArray: the above string uppercased & split on space (' ') into an array
const must = [];
const searchTerms = 'John D'.split(' ');
for (let i = 0; i < searchTerms.length; i += 1) {
if (searchTerms[i].length === 1) {
must.push({
regex: {
path: 'searchArray',
query: `${searchTerms[i].toUpperCase()}.*`,
},
});
} else if (searchTerms[i].length > 1) {
must.push({
autocomplete: {
query: searchTerms[i],
path: 'searchString',
fuzzy: {
maxEdits: 1,
prefixLength: 4,
maxExpansions: 20,
},
},
});
}
}
db.getCollection('someCollection').aggregate([
{
$search: {
compound: { must },
},
},
]).toArray();
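For reference, a sketch of how those two generated fields might be maintained when a document is written; the source field names here are illustrative:

const doc = {
  firstName: 'John',
  lastName: 'Doe',
  street: 'Streetstreet',
  city: 'Citycity',
};
// Concatenate the searchable fields, then derive the uppercased token array
const searchString = [doc.firstName, doc.lastName, doc.street, doc.city].join(' ');
const searchArray = searchString.toUpperCase().split(' ');

db.getCollection('someCollection').insertOne(
  Object.assign({}, doc, { searchString, searchArray })
);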
Update 2 - Full example of unexpected behaviour
Create collection with following documents:
db.getCollection('testing').insertMany([{
"searchString": "John Doe ExtraTextHere"
}, {
"searchString": "Jane Doe OtherName"
}, {
"searchString": "Doem Sarah Thisistestdata"
}])
Create search index 'default' on this collection:
{
"mappings": {
"dynamic": false,
"fields": {
"searchString": {
"type": "autocomplete"
}
}
}
}
Do the following query:
db.getCollection('testing').aggregate([
{
$search: {
autocomplete: {
query: "John Doe",
path: 'searchString',
fuzzy: {
maxEdits: 1,
prefixLength: 4,
maxExpansions: 20,
},
},
},
},
]).toArray();
When a user searches for "John Doe", this query returns all the documents that have either "John" OR "Doe" in the path searchString. In this example, that means all 3 documents. The more words the user types, the more results are returned. This is not the expected behaviour; I would expect more words to match fewer results because the search term gets more precise.
An edgeGram tokenization strategy might be better for your use case because it works left-to-right.
Try this index definition taken from the docs:
{
"mappings": {
"dynamic": false,
"fields": {
"searchString": [
{
"type": "autocomplete",
"tokenization": "edgeGram",
"minGrams": 3,
"maxGrams": 10,
"foldDiacritics": true
}
]
}
}
}
Also, change your query clause from must to filter. That will exclude the documents that do not contain all the tokens.
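One way to apply that is to split the input and give each term its own filter clause. A sketch against the testing collection from your example (note that with minGrams: 3, a term shorter than three characters still won't match):

const terms = 'John Doe'.split(' ');
const filter = terms.map((term) => ({
  autocomplete: {
    query: term,
    path: 'searchString',
  },
}));

db.getCollection('testing').aggregate([
  { $search: { compound: { filter } } },
]).toArray();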
I'm not quite understanding how fuzzy works in the $searchBeta stage of aggregation. I'm not getting the desired result when trying to implement full-text search on my backend. Full-text search for MongoDB was released last year (2019), so there really aren't many tutorials and/or references to go by besides the documentation. I've read the documentation, but I'm still confused, so I would like some clarification.
Let's say I have these 5 documents in my db:
{
"name": "Lightning Bolt",
"set_name": "Masters 25"
},
{
"name": "Snapcaster Mage",
"set_name": "Modern Masters 2017"
},
{
"name": "Verdant Catacombs",
"set_name": "Modern Masters 2017"
},
{
"name": "Chain Lightning",
"set_name": "Battlebond"
},
{
"name": "Battle of Wits",
"set_name": "Magic 2013"
}
And this is my aggregation in MongoDB Compass:
db.cards.aggregate([
{
$searchBeta: {
search: { //search has been deprecated, but it works in MongoDB Compass; replace with 'text'
query: 'lightn',
path: ["name", "set_name"],
fuzzy: {
maxEdits: 1,
prefixLength: 2,
maxExpansions: 100
}
}
}
}
]);
What I'm expecting my result to be:
[
{
"name": "Lightning Bolt", //lightn is in 'Lightning'
"set_name": "Masters 25"
},
{
"name": "Chain Lightning", //lightn is in 'Lightning'
"set_name": "Battlebond"
}
]
What I actually get:
[] //empty array
I don't really understand why my result is empty, so it would be much appreciated if someone explained what I'm doing wrong.
What I think is happening:
db.cards.aggregate... is looking in the "name" and "set_name" fields for words within a maximum edit distance of one character from the "lightn" query. Turning "lightn" into "lightning" takes three insertions, which is more than the one edit allowed, and therefore your expected result is an empty array. "Fuzzy is used to find strings which are similar to the search term or terms", together with maxEdits and prefixLength.
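To see fuzzy succeed, pick a query term within one edit of the indexed word: 'lightnin' is a single insertion away from 'lightning'. A sketch using the current $search/text syntax (the default index is assumed):

db.cards.aggregate([
  {
    $search: {
      text: {
        query: 'lightnin', // one edit away from 'lightning'
        path: ['name', 'set_name'],
        fuzzy: {
          maxEdits: 1,
          prefixLength: 2
        }
      }
    }
  }
])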
Have you tried the term operator with the wildcard option? I think the below aggregation would get you the results you were actually expecting.
e.g.
db.cards.aggregate([
  {
    $searchBeta: {
      term: {
        path: ["name", "set_name"],
        query: "l*h*",
        wildcard: true
      }
    }
  }
]).pretty()
You need to provide an index to use with your search query.
The index is basically the analyzer that your query will use to process your results, depending on whether you want a full match of the text, a partial match, etc.
You can read more about analyzers here.
In your case, an index based on the standard analyzer (lucene.standard) will help.
After you create your index, your code, modified below, will work:
db.cards.aggregate([
  {
    $search: {
      index: 'index_name_for_analyzer', // the name you gave the index (standard analyzer in your case)
      text: { // 'search' has been deprecated; use 'text'
        query: 'lightn',
        path: ["name"], // since you only want to search in one field
        fuzzy: {
          maxEdits: 1,
          prefixLength: 2,
          maxExpansions: 100
        }
      }
    }
  }
]);
Simple question: I have multiple indexes in my Elasticsearch engine, mirrored from PostgreSQL using Logstash. Elasticsearch performs well for fuzzy searches, but now I need to use references between the indexes, which have to be handled by the queries.
Index A:
{
name: "alice",
_id: 5
}
...
Index B:
{
name: "bob",
_id: 3,
best_friend: 5
}
...
How do I query:
Get every match of index B with field name starting with "b" and index A referenced by "best_friend" with the name starting with "a"
Is this even possible with Elasticsearch?
Yes, that's possible: POST A,B/_search will query multiple indexes.
In order to match a record from a specific index, you can use the metadata field _index.
Below is a query that gets every match of index B with field name starting with "b" and index A with name starting with "a", but it does not match the reference as you would with a join in a relational SQL database. Foreign-key reference matching (a join) in Elasticsearch, and in every NoSQL store, is YOUR responsibility, AFAIK. Refer to the Elasticsearch Definitive Guide to find the best approach for your needs. Lastly, NoSQL is not SQL, so change your mindset.
POST A,B/_search
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"prefix": {
"name": "a"
}
},
{
"term": {
"_index": "A"
}
}
]
}
},
{
"bool": {
"must": [
{
"prefix": {
"name": "b"
}
},
{
"term": {
"_index": "B"
}
}
]
}
}
]
}
}
}
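If you need the actual reference match (only B records whose best_friend points at an A record whose name starts with "a"), the join has to happen in your application, in two round trips. A sketch, assuming best_friend is indexed as a plain numeric field:

# Step 1: collect the _id values of the matching A records
POST A/_search
{
  "query": { "prefix": { "name": "a" } },
  "_source": false
}

# Step 2: filter B by those ids (here [5], taken from step 1) plus the name prefix
POST B/_search
{
  "query": {
    "bool": {
      "must": [
        { "prefix": { "name": "b" } },
        { "terms": { "best_friend": [5] } }
      ]
    }
  }
}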
I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But if I search for "en's shaver" then I should also be able to find "men's shaver" in the result.
I am using the following settings and mappings:
Index settings:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Mappings:
PUT /my_index/my_type/_mapping
{
"my_type": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Insert records:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "men's shaver" }
{ "index": { "_id": 2 }}
{ "name": "women's shaver" }
Query:
1. To search by exact phrase match --> "men's"
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": "men's"
}
}
}
Above query returns "men's shaver" in the return result.
2. To search by Partial word match --> "en's"
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": "en's"
}
}
}
Above query DOES NOT return anything.
I have also tried following query
POST /my_index/my_type/_search
{
"query": {
"wildcard": {
"name": {
"value": "%en's%"
}
}
}
}
Still not getting anything.
I figured it is because of the "edge_ngram" type filter on the index, which is not able to find "partial word/substring" matches.
I tried the "ngram" type filter as well, but it slows down the search a lot.
Please suggest how to achieve both exact phrase match and partial phrase match using the same index settings.
To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query.
To use a wildcard query, append * on both ends of the string you are searching for:
POST /my_index/my_type/_search
{
"query": {
"wildcard": {
"name": {
"value": "*en's*"
}
}
}
}
To make it case-insensitive, use a custom analyzer with a lowercase filter and the keyword tokenizer.
Custom Analyzer:
"custom_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
Lowercase the search string on the client side: if you get the search string as AsD, change it to *asd*.
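Putting the pieces together, a minimal sketch of an index using that analyzer, in the same mapping style as the question (you would need to recreate the index for this to apply):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}

The whole value of name is then indexed as a single lowercased token, so a lowercased wildcard query like *en's* can match inside it.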
The answer given by @BlackPOP will work, but it uses the wildcard approach, which is not preferred as it has performance issues and, if abused, can create a huge domino effect (performance degradation) across the Elastic cluster.
I have written a detailed blog on partial search/autocomplete covering the latest options available in Elasticsearch as of today (Dec 2020) with performance in mind. For more trade-off information please refer to this answer.
IMHO a better approach is to use a customized n-gram tokenizer suited to the use case: the tokens needed for the search terms already exist in the index, so searching is faster. The index will be bigger, but storage is not that costly, speed will be better, and you get more control over exactly how substring search works.
Index size can also be controlled by being conservative when defining the min and max gram in the tokenizer settings, as in the sketch below.
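A hedged sketch of such a setup, targeting ES 7.x; the gram sizes (3 to 6) and the index/field names are only examples to tune:

PUT /my_index_ngram
{
  "settings": {
    "index": { "max_ngram_diff": 3 },
    "analysis": {
      "tokenizer": {
        "substring_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 6
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "substring_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "substring_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

A plain match query on name then hits any indexed 3-to-6-character substring; search_analyzer: standard keeps the query string itself from being exploded into grams.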
To search with any string or substring, use a bool/should query with match_phrase_prefix (the old or query has been removed from recent Elasticsearch versions; str is your search input):
query: {
  bool: {
    should: [{
      match_phrase_prefix: {
        name: str
      }
    }, {
      match_phrase_prefix: {
        surname: str
      }
    }]
  }
}
Happy coding with Elasticsearch!