What is the indexing strategy for a variable query? - mongodb

The most common use case for this would probably be a user table, with name, lname, email, phone.
I might search for name contains "paul", email contains 2#yahoo"
I might search for phone = 01234567890
I might search for email = "foo#bar.com"
It is my understanding that in a mongo index works in order. So an index that looks like
name:1, lname:1, email:1, phone:1 wouldn't work for any of the above queries?
What's the best indexing strategy to account for search tables like this?

so, paul you will need to create an index definition before you can run the query. Creating your first search index definition in the collection view in Atlas Data Explorer can be tricky.
Here's what I would recommend for an index definition based on those docs:
{
"mappings": {
"fields": {
"email": {
"analyzer": "lucene.keyword",
"type": "string"
},
"phone": {
"analyzer": "lucene.keyword",
"type": "string"
},
"name": {
"analyzer": "lucene.keyword",
"type": "string"
},
"lname": {
"analyzer": "lucene.keyword",
"type": "string"
}
}
}
}
Here is what I would recommend for a contains-style query on the email and name fields:
{
$search: {
index: 'default',
compound: {
must: [{
wildcard: {
query: '*paul*',
path: 'name'
}
},{
wildcard: {
query: '*2#yahoo*',
path: 'email'
}
}]
}
}
}
Should be a lightning fast query, even for a large index, and as one of multiple clauses as you have described. Let me know if you have any more trouble. There's lot of features like highlighting that should be helpful as well. Note that this query is a single clause. If you want multiple clauses as you have described, embed this clause in a compound operator as seen here.

Related

Some documents not appear in atlas-search when query by few letters

I have a collection. The document structure is,
{
model: {
name: 'string name'
}
}
I have enabled atlas search, Also created a search index for model.name field. Search works fine, But the only issue is couldn't get results for very minimal query letters.
Example:
I have a document,
{
model: {
name: "space1duplicate"
}
}
If I query space, I couldn't get the result.
{
index: 'search_index',
compound: {
must: [
{
text: {
query: 'space',
path: 'model.name'
}
}
]
}
}
But If I query space1duplica, It returns the result.
During indexing, full text search engine tokenizes the input by splitting up text into searchable chunks. Check out the relevant section in the documentation.
By default Atlas Search does not split words by digits, but if you need that, try to define a custom analyzer with the regex tokenizer and use it for your field:
{
"mappings": {
"dynamic": false,
"fields": {
"name": [
{
"analyzer": "digitSplitter",
"type": "string"
}
]
}
},
"analyzers": [
{
"charFilters": [],
"name": "digitSplitter",
"tokenFilters": [],
"tokenizer": {
"pattern": "[0-9]+",
"type": "regexSplit"
}
}
]
}
Also note that you can use multiple analyzers for string fields, if needed.
Atlas search uses Lucene to do the job. Documentation on mongodb site is mostly focused on mongo specific syntax to pass the query to Lucene and might be a bit confusing if you are not familiar with its query language.
First of all, there are number of tokenizers and analizers available, each serve specific purpose. You really need include index definition when you ask quetions about atlas search.
Default tokeniser uses word separators to build the index, then removes endings to store stems, again depending on language, English by default.
So in order to find "space1duplicate" by beginning of the word you can use "autocomplete" analizer with nGram tokens. The index should be created as following:
{
"mappings": {
"dynamic": false,
"fields": {
"name": {
"tokenization": "nGram",
"type": "autocomplete"
}
}
},
"storedSource": {
"include": [
"name"
]
}
}
Once it's indexed (you may need to wait a bit you you have larger dataset), you can find the document with following search:
{
index: 'search_index',
compound: {
must: [
{
autocomplete: {
query: 'spa',
path: 'name'
}
}
]
}
}

Search and update in array of objects MongoDB

I have a collection in MongoDB containing search history of a user where each document is stored like:
"_id": "user1"
searchHistory: {
"product1": [
{
"timestamp": 1623482432,
"query": {
"query": "chocolate",
"qty": 2
}
},
{
"timestamp": 1623481234,
"query": {
"query": "lindor",
"qty": 4
}
},
],
"product2": [
{
"timestamp": 1623473622,
"query": {
"query": "table",
"qty": 1
}
},
{
"timestamp": 1623438232,
"query": {
"query": "ike",
"qty": 1
}
},
]
}
Here _id of document acts like a foreign key to the user document in another collection.
I have backend running on nodejs and this function is used to store a new search history in the record.
exports.updateUserSearchCount = function (userId, productId, searchDetails) {
let addToSetData = {}
let key = `searchHistory.${productId}`
addToSetData[key] = { "timestamp": new Date().getTime(), "query": searchDetails }
return client.db("mydb").collection("userSearchHistory").updateOne({ "_id": userId }, { "$addToSet": addToSetData }, { upsert: true }, async (err, res) => {
})
}
Now, I want to get search history of a user based on query only using the db.find().
I want something like this:
db.find({"_id": "user1", "searchHistory.somewildcard.query": "some query"})
I need a wildcard which will replace ".somewildcard." to search in all products searched.
I saw a suggestion that we should store document like:
"_id": "user1"
searchHistory: [
{
"key": "product1",
"value": [
{
"timestamp": 1623482432,
"query": {
"query": "chocolate",
"qty": 2
}
}
]
}
]
However if I store document like this, then adding search history to existing document becomes a tideous and confusing task.
What should I do?
It's always a bad idea to save values are keys, for this exact reason you're facing. It heavily limits querying that field, obviously the trade off is that it makes updates much easier.
I personally recommend you do not save these searches in nested form at all, this will cause you scaling issues quite quickly, assuming these fields are indexed you will start seeing performance issues when the arrays get's too large ( few hundred searches ).
So my personal recommendation is for you to save it in a new collection like so:
{
"user_id": "1",
"key": "product1",
"timestamp": 1623482432,
"query": {
"query": "chocolate",
"qty": 2
}
}
Now querying a specific user or a specific product or even a query substring is all very easily supported by creating some basic indexes. an "update" in this case would just be to insert a new document which is also much faster.
If you still prefer to keep the nested structure, then I recommend you do switch to the recommended structure you posted, as you mentioned updates will become slightly more tedious, but you can still do it quite easily using arrayFilters for updating a specific element or just using $push for adding a new search

MongoDB Atlas Search - Multiple terms in search-string with 'and' condition (not 'or')

In the documentation of MongoDB Atlas search, it says the following for the autocomplete operator:
query: String or strings to search for. If there are multiple terms in
a string, Atlas Search also looks for a match for each term in the
string separately.
For the text operator, the same thing applies:
query: The string or strings to search for. If there are multiple
terms in a string, Atlas Search also looks for a match for each term
in the string separately.
Matching each term separately seems odd behaviour to me. We need multiple searches in our app, and for each we expect less results the more words you type, not more.
Example: When searching for "John Doe", I expect only results with both "John" and "Doe". Currently, I get results that match either "John" or "Doe".
Is this not possible using MongoDB Atlas Search, or am I doing something wrong?
Update
Currently, I have solved it by splitting the search-term on space (' ') and adding each individual keyword to a separate must-sub-clause (with the compound operator). However, then the search query no longer returns any results if there is one keyword with only one character. To account for that, I split keywords with one character from those with multiple characters.
The snippet below works, but for this I need to save two generated fields on each document:
searchString: a string with all the searchable fields concatenated. F.e. "John Doe Man Streetstreet Citycity"
searchArray: the above string uppercased & split on space (' ') into an array
const must = [];
const searchTerms = 'John D'.split(' ');
for (let i = 0; i < searchTerms.length; i += 1) {
if (searchTerms[i].length === 1) {
must.push({
regex: {
path: 'searchArray',
query: `${searchTerms[i].toUpperCase()}.*`,
},
});
} else if (searchTerms[i].length > 1) {
must.push({
autocomplete: {
query: searchTerms[i],
path: 'searchString',
fuzzy: {
maxEdits: 1,
prefixLength: 4,
maxExpansions: 20,
},
},
});
}
}
db.getCollection('someCollection').aggregate([
{
$search: {
compound: { must },
},
},
]).toArray();
Update 2 - Full example of unexpected behaviour
Create collection with following documents:
db.getCollection('testing').insertMany([{
"searchString": "John Doe ExtraTextHere"
}, {
"searchString": "Jane Doe OtherName"
}, {
"searchString": "Doem Sarah Thisistestdata"
}])
Create search index 'default' on this collection:
{
"mappings": {
"dynamic": false,
"fields": {
"searchString": {
"type": "autocomplete"
}
}
}
}
Do the following query:
db.getCollection('testing').aggregate([
{
$search: {
autocomplete: {
query: "John Doe",
path: 'searchString',
fuzzy: {
maxEdits: 1,
prefixLength: 4,
maxExpansions: 20,
},
},
},
},
]).toArray();
When a user searches for "John Doe", this query returns all the documents that have either "John" OR "Doe" in the path "searchString". In this example, that means all 3 documents. The more words the user types, the more results are returned. This is not expected behaviour. I would expect more words to match less results because the search term gets more precise.
An edgeGram tokenization strategy might be better for your use case because it works left-to-right.
Try this index definition take from the docs:
{
"mappings": {
"dynamic": false,
"fields": {
"searchString": [
{
"type": "autocomplete",
"tokenization": "edgeGram",
"minGrams": 3,
"maxGrams": 10,
"foldDiacritics": true
}
]
}
}
}
Also, add change your query clause from must to filter. That will exclude the documents that do not contain all the tokens.

Why doesn't this Cloudant/couchdb $regex query work?

I am trying to pull (and delete) all records from our database that don't have a URL with the word 'box' in it. This is the query I'm using:
{
"selector": {
"$not": {
"url": {
"$regex": ".*box.*"
}
}
},
"limit": 50
}
This query returns no records. But if I remove the $not, I get all records that do have the word 'box' in the url, but that's the opposite of what I want. Why do I get no results when adding the $not?
I have tried adding a simple base to the query like "_id":{"$gte":0} but that doesn't help.
from the Cloudant doc:
You can create more complex selector expressions by combining
operators. However, for Cloudant NoSQL DB Query indexes of type json,
you cannot use 'combination' or 'array logical' operators such as
$regex as the basis of a query.
$not is a combination operator and therefore cannot be the basis of a query
i am able to get the following to work:
index
{
"index": {
"fields": ["url"]
},
"name" : "url-json-index",
"type" : "json"
}
query
{
"selector": {
"url": {
"$not": {
"$regex": ".*box.*"
}
}
},
"limit": 50,
"use_index": "url-json-index"
}
if you are still seeing problems, can you provide the output from _/explain and the indexes you have in place.
The "no results" issue is due to a bug in text indexes that has been recently fixed. However, neither $not nor $regex operators are able to take advantage of global indexes so will always result in a full database or index scan.
The way to optimise this query is to use a partial index. A partial index filters documents at indexing time rather than at query time, creating an index over a subset of the database. You then need to tell the _find endpoint to explicitly use the partial index. For example, create an index which only includes documents not matching your regex:
POST /<db>/_index
{
"index": {
"partial_filter_selector": {
"url": {
"$not": {
"$regex": ".*box.*"
}
}
},
"fields": ["type"]
},
"ddoc" : "url-not-box",
"type" : "json"
}
then at query time:
{
"selector": {
"url": {
"$not": {
"$regex": ".*box.*"
}
}
},
"limit": 50,
"use_index": "url-not-box"
}
You can see how many documents are scanned to fulfil the query in the Cloudant UI - the execution statistics are displayed in a popup underneath the query text area.
You may also find this This article about partial indexes helpful.

Elasticsearch: Find substring match

I want to perform both exact word match and partial word/substring match. For example if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But in case case I search for "en's shaver" then also I should be able to find "men's shaver" in the result.
I using following settings and mappings:
Index settings:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Mappings:
PUT /my_index/my_type/_mapping
{
"my_type": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Insert records:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "men's shaver" }
{ "index": { "_id": 2 }}
{ "name": "women's shaver" }
Query:
1. To search by exact phrase match --> "men's"
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": "men's"
}
}
}
Above query returns "men's shaver" in the return result.
2. To search by Partial word match --> "en's"
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": "en's"
}
}
}
Above query DOES NOT return anything.
I have also tried following query
POST /my_index/my_type/_search
{
"query": {
"wildcard": {
"name": {
"value": "%en's%"
}
}
}
}
Still not getting anything.
I figured it is because of "edge_ngram" type filter on Index which is not able to find "partial word/sbustring match".
I tried "n-gram" type filter as well but it is slowing down the search alot.
Please suggest me how to achieve both excact phrase match and partial phrase match using same index setting.
To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query.
See also this.
To use a wildcard query, append * on both ends of the string you are searching for:
POST /my_index/my_type/_search
{
"query": {
"wildcard": {
"name": {
"value": "*en's*"
}
}
}
}
To use with case insensitivity, use a custom analyzer with a lowercase filter and keyword tokenizer.
Custom Analyzer:
"custom_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
Make the search string lowercase
If you get search string as AsD: change it to *asd*
The answer given by #BlackPOP will work, but it uses the wildcard approach, which is not preferred as it has a performance issue and if abused can create a huge domino effect (performance issue) in the Elastic cluster.
I have written a detailed blog on partial search/autocomplete covering the latest options available in Elasticsearch as of today (Dec 2020) with performance in mind. For more trade-off information please refer to this answer.
IMHO a better approach will be to use the customized n-gram tokenizer according to use-case, which will have already tokens needed for search term so it will be faster, although it will have a bigger index size, but you size is not that costly and speed will be better with more control on how exactly you want substring search to work.
Also size can be controlled if you are conservative in defining the min and max gram in tokenizer setting.
By searching with any string or substring Use:
query: {
or: [{
match_phrase_prefix: {
name: str
}
}, {
match_phrase_prefix: {
surname: str
}
}]
}
Happy coding with Elastic Search....