MongoDB Full-Text Search Score "What does Score means?"

MongoDB Full-Text Search Score "What does Score means?" - mongodb

I'm working on a MongoDB project for my school. I have a Collection of sentences, and I do a normal Text search to find the most similar sentence in the collection, this is based on the scoring.
I run this Query
db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})
Take a look at these results when i query sentences,
"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*
"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
*Score: 1.0*
What is the score value? what does it mean?
What if I want to show the results that only have similarity of 70% and above.
How can I interpret the score result so I can display a similarity percentage, I'm using C# to do this, but don't worry about the implementation. I don't mind a Pseudo-code solution!

When you use a MongoDB text index, it generates a score for every matching document. This score indicates how strongly your search string matches the document. The higher the score more is the chances of resemblance to the searched text. The score is calculated by:
Step 1: Let the search text = S
Step 2: Break S into tokens (If you are not doing a Phrase search). Let's say T1, T2..Tn. Apply Stemming to each token
Step 3: For every search token, calculate score per index field of text index as follows:
score = (weight * data.freq * coeff * adjustment);
Where :
weight = user Defined Weight for any field. Default is 1 when no weight is specified
data.freq = how frequently the search token appeared in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = Number of matching token
numTokens = Total number of tokens in the text
adjustment = 1 (By default).If the search token is exactly equal to the document field then adjustment = 1.1
Step 4: Final score of document is calculated by adding all tokens scores per text index field
Total Score = score(T1) + score(T2) + .....score(Tn)
So as we can see above a score is influenced by the following factors:
Number of Terms matching with the actual searched text, more the match more will be the score
Number of tokens in the document field
Whether the searched text exactly matches the document field or not
Following is the derivation for one of your document:
Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?
Score Calculation:
Step 1: Tokenize search string.Apply Stemming and remove stop words.
Token 1: "sentence"
Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
Step 3: Take Sample Document and Remove Stop Words
Input Document: Who is the “He” in this sentence?
Document after stop word removal: "sentence"
Step 4: Apply Stemming
Document in Step 3: "sentence"
After Stemming : "sentence"
Step 5: Calculate data.count per search token
data.count(sentence)= 1
data.count(nothing)= 1
Step 6: Calculate total number of token in document
numTokens = 1
Step 7: Calculate coefficient per search token
coeff = (0.5 * data.count / numTokens) + 0.5
coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
coeff(nothing) = 0.5*(1/1) + 0.5 = 1.0
Step 8: Calculate adjustment per search token (Adjustment is 1 by default. If the search text match exactly with the raw document only then adjustment = 1.1)
adjustment(sentence) = 1
adjustment(nothing) = 1
Step 9: weight of field (1 is default weight)
weight = 1
Step 10: Calculate frequency of search token in document (data.freq)
For ever ith occurrence, the data frequency = 1/(2^i). All occurrences are summed.
a. Data.freq(sentence)= 1/(2^0) = 1
b. Data.freq(nothing)= 0
Step 11: Calculate score per search token per field:
score = (weight * data.freq * coeff * adjustment);
score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
score(nothing) = (1 * 0 * 1.0 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0
In the same way, you can derive the other one.
For more detailed MongoDB analysis, check:
Mongo Scoring Algorithm Blog

Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.
The default weight is 1 for the indexed fields.
https://docs.mongodb.com/manual/tutorial/control-results-of-text-search/

Related

How can you access an element in a list with over 10k values for a dataset in code.org

so I am trying to recreate a wordle game inside of code.org using the wordle dataset.
//Getting Wordle Answer
var answers = getColumn("Wordle", "validWordleAnswer");
var letters = ["letter1", "letter2", "letter3", "letter4", "letter5"];
var index = (randomNumber(0, answers.length));
console.log(index);
So ofcourse the console.log outputs my index number, but out of the 10k words that are stored in the dataset, how can I access the specific one that the randomnumber generated? Foe example the index was 500, what can I write inorder to console.log the correct answer? Thanks?

The length of the array doesn't matter, 1 element or 20 billion. You still access it with bracket notation:
console.log(answers[index]);
Here's a small example (remember, indexing starts at 0):
const answers = ["chewy", "rocks", "piano", "forte", "phone"];
const index = Math.floor(Math.random() * answers.length);
console.log(index);
console.log(answers[index]);

Summary Percentage Yes vs No

Basically I'm looking for a formula to see how many times Yes was used vs. No.
I have something like this:
(({Command.result} ="Yes") / {Command.result})*100
Which makes sense in my head, but I keep getting:
A number, or currency amount is required.

Your current formula attempts to divide a boolean type through by a string. You can only perform division with numbers.
Instead, create two formulas as individual counts of Yes or No:
#YesCount:
If ({Command.result} = "Yes") Then 1 Else 0
#NoCount:
If ({Command.result} = "No") Then 1 Else 0
For the percentage, create two more formulas:
#YesPercent:
100 / Count ({Command.result}) * Sum ({#YesCount})
#NoPercent:
100 / Count ({Command.result}) * Sum ({#NoCount})

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fir a partial db-RDA with field.ID to correct for the repeated measurements character of the samples. However including Condition(field.ID) leads to Disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seem to erroneously remove the F1 factor. However using Sampling campaign (SC) as Condition does not. Is the latter the rigth way to correct for repeated measurments in partial db-RDA??
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12,13,14,15,1:12,16,17,18)),
SC = factor(rep(c(1,2), each=15)),
F1 = factor(rep(rep(c("A","B","C","D","E"),each=3),2)),
Nitrogen = rnorm(30,mean=0.16, sd=0.07),
Temp = rnorm(30,mean=13.5, sd=3.9),
Moist = rnorm(30,mean=19.4, sd=5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
Spec2 = rpois(30,1),
Spec3 = rpois(30,4.5),
Spec4 = rpois(30,3),
Spec5 = rpois(30,7),
Spec6 = rpois(30,7),
Spec7 = rpois(30,5))
data=cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)

You partial out variation due to ID and then you try to explain variable aliased to this ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1? variables were constant within ID which already was partialled out, and nothing was left to explain.

Power Query - remove characters from number values

I have a table field where the data contains our memberID numbers followed by character or character + number strings
For example:
My Data
1234567Z1
2345T10
222222T10Z1
111
111A
Should Become
123456
12345
222222
111
111
I want to get just the member number (as shown in Should Become above). I.E. all the digits that are LEFT of the first character.
As the length of the member number can be different for each person (the first 1 to 7 digit) and the letters used can be different (a to z, 0 to 8 characters long), I don't think I can SPLIT the field.
Right now, in Power Query, I do 27 search and replace commands to clean this data (e.g. find T10 replace with nothing, find T20 replace with nothing, etc)
Can anyone suggest a better way to achieve this?
I did successfully create a formula for this in Excel...but I am now trying to do this in Power Query and I don't know how to convert the formula - nor am I sure this is the most efficient solution.
=iferror(value(left([MEMBERID],7)),
iferror(value(left([MEMBERID],6)),
iferror(value(left([MEMBERID],5)),
iferror(value(left([MEMBERID],4)),
iferror(value(left([MEMBERID],3)),0)
)
)
)
)
Thanks

There are likely several ways to do this. Here's one way:
Create a query Letters:
let
Source = { "a" .. "z" } & { "A" .. "Z" }
in
Source
Create a query GetFirstLetterIndex:
let
Source = (text) => let
// For each letter find out where it shows up in the text. If it doesn't show up, we will have a -1 in the list. Make that positive so that we return the index of the first letter which shows up.
firstLetterIndex = List.Transform(Letters, each let pos = Text.PositionOf(text, _), correctedPos = if pos < 0 then Text.Length(text) else pos in correctedPos),
minimumIndex = List.Min(firstLetterIndex)
in minimumIndex
in
Source
In the table containing your data, add a custom column with this formula:
Text.Range([ColumnWithData], 0, GetFirstLetterIndex([ColumnWithData]))
That formula will take everything from your data text until the first letter.

Random Sampling from Mongo

I have a mongo collection with documents. There is one field in every document which is 0 OR 1. I need to random sample 1000 records from the database and count the number of documents who have that field as 1. I need to do this sampling 1000 times. How do i do it ?

For people coming to the answer, you should now use the new $sample aggregation function, new in 3.2.
https://docs.mongodb.org/manual/reference/operator/aggregation/sample/
db.collection_of_things.aggregate(
[ { $sample: { size: 15 } } ]
)
Then add another step to count up the 0s and 1s using $group to get the count. Here is an example from the MongoDB docs.

For MongoDB 3.0 and before, I use an old trick from SQL days (which I think Wikipedia use for their random page feature). I store a random number between 0 and 1 in every object I need to randomize, let's call that field "r". You then add an index on "r".
db.coll.ensureIndex(r: 1);
Now to get random x objects, you use:
var startVal = Math.random();
db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x);
This gives you random objects in a single find query. Depending on your needs, this may be overkill, but if you are going to be doing lots of sampling over time, this is a very efficient way without putting load on your backend.

Here's an example in the mongo shell .. assuming a collection of collname, and a value of interest in thefield:
var total = db.collname.count();
var count = 0;
var numSamples = 1000;
for (i = 0; i < numSamples; i++) {
var random = Math.floor(Math.random()*total);
var doc = db.collname.find().skip(random).limit(1).next();
if (doc.thefield) {
count += (doc.thefield == 1);
}
}

I was gonna edit my comment on #Stennies answer with this but you could also use a seprate auto incrementing ID index here as an alternative if you were to skip over HUGE amounts of record (talking huge here).
I wrote another answer to another question a lot like this one where some one was trying to find nth record of the collection:
php mongodb find nth entry in collection
The second half of my answer basically describes one potential method by which you could approach this problem. You would still need to loop 1000 times to get the random row of course.

If you are using mongoengine, you can use a SequenceField to generate an incremental counter.
class User(db.DynamicDocument):
counter = db.SequenceField(collection_name="user.counters")
Then to fetch a random list of say 100, do the following
def get_random_users(number_requested):
users_to_fetch = random.sample(range(1, User.objects.count() + 1), min(number_requested, User.objects.count()))
return User.objects(counter__in=users_to_fetch)
where you would call
get_random_users(100)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

MongoDB Full-Text Search Score "What does Score means?" - mongodb

Related

How can you access an element in a list with over 10k values for a dataset in code.org

Summary Percentage Yes vs No

partial Distance Based RDA - Centroids vanished from Plot

Power Query - remove characters from number values

Random Sampling from Mongo

Categories

Resources