I am trying to search for multiple words in text inclusively (an AND operation)
without losing word stemming.
For example:
db.supplies.runCommand("text", {search:"printers inks"})
should return results with (printer and ink) or (printers ink) or (printer inks) or (printers inks), instead of all results with either printer or ink.
This post covers searching for multiple words as an AND operation, but its solution doesn't match stemmed words -> MongoDB Text Search AND multiple search words.
The only way I could think of is to generate every combination of the word forms and run one search per combination (which could be a large number).
This may not be an effective way to search on a large collection.
Is there a better and smarter way to do it?
So is there a reason you have to use a text search? If it were me, I would use a regular expression.
https://docs.mongodb.com/manual/reference/operator/query/regex/
Off the top of my head something like this.
db.collection.find({products:/printers inks|printers|inks/})
Now I suppose you can do the same thing with a text search too.
db.collection.find({$text:{$search : "\"printers inks\" printers inks"}})
Note the escaped quotes.
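The escaped quotes only exist to get literal double quotes into the search string. A minimal sketch of building that same string from Python, assuming the quoted phrase is matched verbatim (without stemming, as the question notes) while the bare terms are stemmed. The helper name build_search is hypothetical, not part of any MongoDB driver:

```python
import json

def build_search(phrase, terms):
    # Quote the phrase so it has to appear verbatim; leave the
    # individual terms bare so the text index can still stem them.
    return '"{}" {}'.format(phrase, " ".join(terms))

search = build_search("printers inks", ["printers", "inks"])
query = {"$text": {"$search": search}}

print(search)             # "printers inks" printers inks
print(json.dumps(query))  # JSON encoding re-adds the backslash escapes
```

The resulting dict is what you would pass as the filter to find(); whether it gives you the AND behaviour you want is worth checking against your own data.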
I saw an SO post that says you can search multiline text using either a regex or literal text. But what if you want to (quickly) search for two or three words within a specified range of lines?
For example, suppose you want to find a multiline text area that contains both "ruby" and "regex", say to locate a note you took in a txt (or Markdown or rich text) file. You might be looking for "how to use regex in ruby" or "the ruby regex tutorial", right?
Now you can use a simple (but redundant) regex like ruby(.*\n)+regex|regex(.*\n)+ruby. But to me it doesn't look beautiful. For three or more words this workaround gets much worse, because every possible ordering of the words needs its own alternative.
So is there a smarter way to do this? Thanks.
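One common way to avoid listing every ordering is one lookahead per word, anchored at the start of the text. A minimal sketch in Python, assuming your regex flavor supports lookaheads and a "dot matches newline" mode; the function name contains_all is just for illustration:

```python
import re

def contains_all(text, words):
    # Build (?=.*word1)(?=.*word2)... : each lookahead scans the whole
    # text from the start, so the order of the words does not matter.
    pattern = "".join("(?=.*{})".format(re.escape(w)) for w in words)
    # DOTALL lets "." cross line breaks, which covers the multiline case.
    return re.match(pattern, text, flags=re.DOTALL) is not None

note = "the ruby tutorial\nsome other lines\nhow to use regex"
print(contains_all(note, ["ruby", "regex"]))   # True
print(contains_all(note, ["regex", "ruby"]))   # True
print(contains_all(note, ["ruby", "python"]))  # False
```

The pattern grows by one lookahead per word instead of one alternative per ordering. Written out by hand for two words it would be (?s)\A(?=.*ruby)(?=.*regex).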
I was experimenting with PostgreSQL's text search feature - particularly with the normalization function to_tsquery.
I was using the english dictionary (config), and for some reason s and t won't normalize. I understand why i and a would not, but s and t? Interesting.
Are they matched to single space and tab?
Here is the query:
select
to_tsquery('english', 'a:*') as for_a,
to_tsquery('english', 's:*') as for_s,
to_tsquery('english', 't:*') as for_t,
to_tsquery('english', 'u:*') as for_u
fiddle just in case.
You will see that 'u:*' is returned as 'u:*', while 'a:*' returns nothing.
The letters s and t are considered stop words in the english text search dictionary, therefore they get discarded. You can read the stop word list under tsearch_data/english.stop in the postgres shared folder, which you can locate by typing pg_config --sharedir
With pg 11 on ubuntu/debian/mint, that would be
cat /usr/share/postgresql/11/tsearch_data/english.stop
Quoting from the docs,
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching.
It is best to set English grammar aside and think of words in a programmatic and logical way, as described above. Full text search does not try to infer context from sentence structure, so it has no use for these words. After all, it's called full text search and not natural language search.
As to how they arrived at the conclusion to add s and t to the stop word list, statistical analysis must have revealed these characters to be noise.
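To make the effect concrete, here is a toy emulation of the stop word step in Python. It is only a sketch: the set below is a small hand-picked excerpt (the full list is in the english.stop file mentioned above), and the function name surviving_prefixes is made up for illustration. It is not how PostgreSQL implements to_tsquery.

```python
# Excerpt only: a few entries that appear in english.stop.
STOP_WORDS_EXCERPT = {"a", "i", "s", "t", "the", "and"}

def surviving_prefixes(prefixes, stop_words=STOP_WORDS_EXCERPT):
    # A term that is a stop word is dropped before any prefix
    # matching happens; everything else is kept as typed.
    return [p for p in prefixes if p.lower() not in stop_words]

print(surviving_prefixes(["a", "s", "t", "u"]))  # ['u']
```

This mirrors the query in the question: only u survives, because a, s and t are all in the stop list.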
Our project is built on MongoDB and I'm doing a full text search with a $and operator on words.
I've searched for similar questions about this problem (MongoDB Text Search AND multiple search words), but the solution of surrounding words with quotation marks disables word stemming, which is the whole point of using full text search.
For example, this code will not find "Items that want to be found":
find({$text:{$search:"\"Item\"",$language:"en"}})
Can anyone provide a solution to this WITH word stemming?
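One possible workaround, sketched under assumptions: run the unquoted search (stemmed, but OR) to fetch candidates, then keep only the documents that contain every stemmed term in application code. In the Python below, toy_stem is a hypothetical stand-in that just strips a trailing "s"; it is not MongoDB's stemmer, and the candidates list stands in for whatever the unquoted $text query returned.

```python
import re

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: lowercase the word
    # and strip one trailing "s" from words longer than 3 letters.
    w = word.lower()
    return w[:-1] if len(w) > 3 and w.endswith("s") else w

def stems(text):
    return {toy_stem(w) for w in re.findall(r"[A-Za-z]+", text)}

def filter_all_terms(candidates, query):
    # Keep a candidate only if every query stem occurs in it (AND).
    wanted = stems(query)
    return [doc for doc in candidates if wanted <= stems(doc)]

candidates = ["printer ink cartridge", "laser printers",
              "inks for printers", "paper"]
print(filter_all_terms(candidates, "printers inks"))
# ['printer ink cartridge', 'inks for printers']
```

The extra filtering pass costs something on large result sets, so this is only a starting point, not a drop-in replacement for a stemmed AND inside the database.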
How can I apply multiple search criteria to a document to get a refined search result? I tried using wildcards -> ?[!a-z][!0-9][!^s] <- to find a character outside the range a-z, outside the range 0-9, and other than the non-breaking space (^s). That is, I do not want to find any letter, digit or space, but rather tabs, operators, special characters, etc. At least that's what I think it does. How can I use multiple "find what" criteria together in a document?
As a starting point, use wildcards and
[!0-9,a-z,A-Z, ]
should help. It may be possible to refine that further, but if not, VBA or equivalent and either a character-by-character check or multiple find loops are your options.
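If you go the "VBA or equivalent" route, the character check itself is small. A sketch in Python rather than VBA, assuming you have already exported the document text to a plain string; the name find_special is made up for illustration. Note that a single negated class does the job, whereas several bracket groups in a row each consume one character of their own:

```python
import re

# One negated class: anything that is not a digit, a letter or a space.
NOT_ALNUM_OR_SPACE = re.compile(r"[^0-9a-zA-Z ]")

def find_special(text):
    # Return (position, character) for every character outside the class.
    return [(m.start(), m.group()) for m in NOT_ALNUM_OR_SPACE.finditer(text)]

print(find_special("a1 b\tc+d"))  # [(4, '\t'), (6, '+')]
```

Tabs, operators and other special characters are reported; letters, digits and plain spaces are skipped.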
I am using the Lucene search engine for full-text search. It returns results for non-ASCII characters too, but there is a problem. Suppose I add the text 帕普部分分配数量. If I search with
only one character, 帕, I get a result, but if I search with the full non-ASCII word 帕普部分分配数量, I get no result. The strange thing is that when I put spaces between the characters, for example 帕 普 部 分 分 配 数 量, and then search, I do get a result.
I would really appreciate any help.
Thanks
Be sure to use the same Analyzer when indexing and searching.
What happens is that your Analyzer indexes each character as an individual Term, and if you then search with a different analyzer (e.g. WhitespaceAnalyzer), it searches for a single Token containing all the characters in your query.
To search for a sequence of characters the way you want, you need to use the same Analyzer and have the QueryParser build a PhraseQuery from all the individual Tokens.
Some sample code of your indexing and searching routines would make it easier to help you.
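In the meantime, here is a toy model of the mismatch in Python. It is not Lucene code: char_analyzer, whitespace_analyzer, build_index and phrase_match are made-up names that only imitate the idea of an analyzer producing tokens and a phrase query checking consecutive positions.

```python
def char_analyzer(text):
    # One token per non-space character (what the index seems to hold).
    return [c for c in text if not c.isspace()]

def whitespace_analyzer(text):
    # One token per whitespace-separated chunk.
    return text.split()

def build_index(doc, analyzer):
    # Map each token to the list of positions where it occurs.
    index = {}
    for pos, tok in enumerate(analyzer(doc)):
        index.setdefault(tok, []).append(pos)
    return index

def phrase_match(index, query, analyzer):
    # True if the query tokens occur at consecutive positions.
    tokens = analyzer(query)
    if not tokens or any(t not in index for t in tokens):
        return False
    return any(all(start + i in index[t] for i, t in enumerate(tokens))
               for start in index[tokens[0]])

doc = "帕普部分分配数量"
index = build_index(doc, char_analyzer)
print(phrase_match(index, "帕", whitespace_analyzer))  # True
print(phrase_match(index, doc, whitespace_analyzer))   # False
print(phrase_match(index, "帕 普 部 分 分 配 数 量", whitespace_analyzer))  # True
print(phrase_match(index, doc, char_analyzer))         # True
```

The second call fails because the query becomes one long token that was never indexed; the last call succeeds because the same analyzer splits the query the same way the document was split.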