Problem with sphinx search when searching by second word in phrase

I have a sphinx index set up with a field that has the following data:
James Smith
When I search for 'James' or 'James Smi', it will return the proper result, but when I search for 'James S' it doesn't return anything.
I also have a real time index (RT) and it has the same data in it and I'm able to search for 'James S' and it returns the proper result.
Here is what I have in my config file for the main index
min_word_len = 1
html_strip = 0
min_infix_len = 2
expand_keywords = 1
min_prefix_len = 1
index_exact_words = 1

min_infix_len = 2 means partial-word matches only work for keywords of two or more characters. A single-letter keyword, such as the 'S' in 'James S', therefore cannot match part of a word; it only matches a standalone one-letter word, which min_word_len = 1 permits.

count number of elements with a specific value in a field of a structure in Matlab

I have a structure myS with several fields, including myField, which in turn includes several other fields such as BB. I need to count how many times 'R_value' appears in BB.
I have tried:
sum(myS.myField.BB = 'R_value')
and this:
count = 0;
for i = 1:numel(myS.myField)
number_of_element = numel(myS.myField(i).BB)=='R_value'
count = count+number_of_element;
end
but it doesn't work. Any suggestion?
If you are just checking if BB is that literal string, then your loop is just:
count = 0;
for i = 1:numel(myS.myField)
count = count+strcmp(myS.myField(i).BB,'R_value');
end
numel counts how many elements there are, and a zero (or false) is still an element, so it cannot tell matches from non-matches. Sum the logical array instead:
count = 0;
for i = 1:numel(myS.myField)
number_of_element = sum(myS.myField(i).BB==R_value);
count = count+number_of_element;
end
Also note that your parentheses were misplaced, so you were counting how many elements BB has in total and then comparing that number to R_value. I am assuming R_value is a number.
e.g.:
myS.myField(1).BB=[1 2 3 4 1 1 1]
myS.myField(2).BB=[4 5 65 1]
R_value=1
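For comparison, the same "sum the comparison, don't count the elements" idea can be sketched in Python using the example data above. This is only an analogue of the MATLAB loop; the list-of-dicts layout standing in for myS.myField is an assumption made for illustration.

```python
# Stand-in for myS.myField: one dict per struct element, each holding a BB list.
my_field = [
    {"BB": [1, 2, 3, 4, 1, 1, 1]},
    {"BB": [4, 5, 65, 1]},
]
r_value = 1

# Sum the element-wise comparison results instead of counting elements.
count = sum(sum(x == r_value for x in item["BB"]) for item in my_field)
print(count)  # 5
```

With this data the value 1 occurs four times in the first BB and once in the second.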

MongoDB: combining limit and sort when using the find function

I have an example MongoDB database with collections a and b:
a_id type
1 1
2 2
3 3
4 4
Now I want to extract the last N (1, 2, 3, 4, 5, ...) values in table b, in the same order as in the example above. But if I use the skip function:
b.find().skip(M)
then if M > N the result is empty, which is wrong. I want M to be dynamic.
If I use sort and limit then it does not give the correct order.
b.find().sort({$natural:-1}).limit(M)
result:
4 4
3 3
I want a solution!
You can use the same skip() to access the last N documents in the collection.
N = Last N documents to be accessed
So the query is
b.find().skip(b.count() - N).pretty()
Or, since the mongo shell accepts JavaScript, you can store the count in a variable first:
var totalCount = b.count()
b.find().skip(totalCount - N).pretty()
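The skip arithmetic can be checked without a database. Below is a minimal Python sketch that uses a plain list as a stand-in for the collection in natural order; the helper name last_n is hypothetical, and clamping the skip to zero when N exceeds the document count is an assumption about the desired behavior rather than part of the answer above.

```python
def last_n(docs, n):
    """Return the last n items of docs, keeping their original order."""
    # Mirrors skip(count - N); clamp so that n > len(docs) returns everything.
    skip = max(len(docs) - n, 0)
    return docs[skip:]


docs = [(1, 1), (2, 2), (3, 3), (4, 4)]
print(last_n(docs, 2))   # [(3, 3), (4, 4)]
print(last_n(docs, 10))  # all four documents
```

Unlike sort({$natural:-1}).limit(N), this keeps the documents in ascending order.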

MongoDB Full-Text Search Score: what does the score mean?

I'm working on a MongoDB project for my school. I have a Collection of sentences, and I do a normal Text search to find the most similar sentence in the collection, this is based on the scoring.
I run this Query
db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})
Take a look at these results when i query sentences,
"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*
"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
*Score: 1.0*
What is the score value? what does it mean?
What if I want to show the results that only have similarity of 70% and above.
How can I interpret the score result so I can display a similarity percentage, I'm using C# to do this, but don't worry about the implementation. I don't mind a Pseudo-code solution!
When you use a MongoDB text index, it generates a score for every matching document. This score indicates how strongly your search string matches the document: the higher the score, the more closely the document resembles the searched text. The score is calculated as follows:
Step 1: Let the search text = S
Step 2: Break S into tokens (If you are not doing a Phrase search). Let's say T1, T2..Tn. Apply Stemming to each token
Step 3: For every search token, calculate score per index field of text index as follows:
score = (weight * data.freq * coeff * adjustment);
Where:
weight = user-defined weight for the field (default is 1 when no weight is specified)
data.freq = how frequently the search token appears in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = number of matching tokens
numTokens = total number of tokens in the text
adjustment = 1 by default; if the search token is exactly equal to the document field, then adjustment = 1.1
Step 4: Final score of document is calculated by adding all tokens scores per text index field
Total Score = score(T1) + score(T2) + .....score(Tn)
So, as we can see above, the score is influenced by the following factors:
Number of terms matching the searched text: the more matches, the higher the score
Number of tokens in the document field
Whether the searched text exactly matches the document field or not
The following is the derivation for one of your documents:
Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?
Score Calculation:
Step 1: Tokenize the search string. Apply stemming and remove stop words.
Token 1: "sentence"
Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
Step 3: Take Sample Document and Remove Stop Words
Input Document: Who is the “He” in this sentence?
Document after stop word removal: "sentence"
Step 4: Apply Stemming
Document in Step 3: "sentence"
After Stemming : "sentence"
Step 5: Calculate data.count per search token
data.count(sentence)= 1
data.count(nothing)= 1
Step 6: Calculate total number of token in document
numTokens = 1
Step 7: Calculate coefficient per search token
coeff = (0.5 * data.count / numTokens) + 0.5
coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
coeff(nothing) = 0.5*(1/1) + 0.5 = 1.0
Step 8: Calculate adjustment per search token (adjustment is 1 by default; only if the search text exactly matches the raw document is adjustment = 1.1)
adjustment(sentence) = 1
adjustment(nothing) = 1
Step 9: weight of field (1 is default weight)
weight = 1
Step 10: Calculate frequency of search token in document (data.freq)
For every i-th occurrence (counting from i = 0), the data frequency is 1/(2^i). All occurrences are summed.
a. Data.freq(sentence)= 1/(2^0) = 1
b. Data.freq(nothing)= 0
Step 11: Calculate score per search token per field:
score = (weight * data.freq * coeff * adjustment);
score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
score(nothing) = (1 * 0 * 1.0 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0
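The steps above can be sketched in Python to check the arithmetic. This is a minimal sketch that follows the formula as described in this answer, not MongoDB's actual implementation; the function names token_score and total_score are hypothetical, and the inputs are assumed to be already tokenized, stemmed, and stripped of stop words.

```python
def token_score(token, doc_tokens, weight=1.0, exact_match=False):
    """Score one search token against one indexed field."""
    occurrences = doc_tokens.count(token)
    if occurrences == 0:
        return 0.0  # data.freq is 0, so the whole product is 0
    # data.freq: the i-th occurrence (counting from 0) contributes 1 / 2**i
    freq = sum(1 / 2**i for i in range(occurrences))
    coeff = 0.5 * occurrences / len(doc_tokens) + 0.5
    adjustment = 1.1 if exact_match else 1.0
    return weight * freq * coeff * adjustment


def total_score(search_tokens, doc_tokens):
    """Add up the per-token scores for a single field."""
    return sum(token_score(t, doc_tokens) for t in search_tokens)


# Search tokens and document tokens from the derivation above.
print(total_score(["sentence", "nothing"], ["sentence"]))  # 1.0
```

The result matches the score of 1.0 reported for this pair of sentences in the question.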
In the same way, you can derive the other one.
For more detailed MongoDB analysis, check:
Mongo Scoring Algorithm Blog
Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.
The default weight is 1 for the indexed fields.
https://docs.mongodb.com/manual/tutorial/control-results-of-text-search/

Sphinx is matching beginning of words?

I thought a Sphinx search only matched whole words by default. In searching for various acronyms, I seem to match the beginnings of words as well. For instance, 'prov' will match 'provide' but not 'improve'. I only want it to match the standalone word " prov ". Is there some setting I need to change to make it match whole words only?
Index Settings:
wordforms = /home/indexer/wordform.txt
stopwords = /home/indexer/stopwords.txt
stopword_step = 0
morphology = stem_en
index_sp=1
html_strip = 1
min_word_len = 1
min_infix_len = 1
Follow-up: I set min_infix_len = 0 (and rotated the index), to no avail.

Power Query - remove characters from number values

I have a table field where the data contains our memberID numbers followed by character or character + number strings
For example:
My Data
1234567Z1
2345T10
222222T10Z1
111
111A
Should Become
123456
12345
222222
111
111
I want to get just the member number (as shown in Should Become above), i.e. all the digits that are LEFT of the first letter.
As the length of the member number can be different for each person (the first 1 to 7 digits) and the letters used can be different (a to z, 0 to 8 characters long), I don't think I can SPLIT the field.
Right now, in Power Query, I do 27 search and replace commands to clean this data (e.g. find T10 replace with nothing, find T20 replace with nothing, etc)
Can anyone suggest a better way to achieve this?
I did successfully create a formula for this in Excel...but I am now trying to do this in Power Query and I don't know how to convert the formula - nor am I sure this is the most efficient solution.
=iferror(value(left([MEMBERID],7)),
iferror(value(left([MEMBERID],6)),
iferror(value(left([MEMBERID],5)),
iferror(value(left([MEMBERID],4)),
iferror(value(left([MEMBERID],3)),0)
)
)
)
)
Thanks
There are likely several ways to do this. Here's one way:
Create a query Letters:
let
Source = { "a" .. "z" } & { "A" .. "Z" }
in
Source
Create a query GetFirstLetterIndex:
let
Source = (text) => let
// For each letter find out where it shows up in the text. If it doesn't show up, we will have a -1 in the list. Make that positive so that we return the index of the first letter which shows up.
firstLetterIndex = List.Transform(Letters, each let pos = Text.PositionOf(text, _), correctedPos = if pos < 0 then Text.Length(text) else pos in correctedPos),
minimumIndex = List.Min(firstLetterIndex)
in minimumIndex
in
Source
In the table containing your data, add a custom column with this formula:
Text.Range([ColumnWithData], 0, GetFirstLetterIndex([ColumnWithData]))
That formula will take everything from your data text until the first letter.
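To sanity-check the expected output outside Power Query, the same "keep everything before the first letter" rule can be sketched in Python. This is only an illustration: the helper name leading_digits is hypothetical, and it stops at the first non-digit character, which for IDs made only of digits and letters gives the same result as stopping at the first letter.

```python
from itertools import takewhile


def leading_digits(member_id):
    """Return the run of digits at the start of member_id."""
    return "".join(takewhile(str.isdigit, member_id))


for raw in ["222222T10Z1", "111", "111A"]:
    print(raw, "->", leading_digits(raw))
```

An ID with no letters at all, such as 111, is returned unchanged, which corresponds to the Text.Length(text) fallback in GetFirstLetterIndex.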