Sphinx: Understanding Proximity Factor Ranking for a large field

This document explains the sphinx proximity factor ranking algorithm (see section: Phrase proximity factor).
Will the proximity factor ranker give a higher field weight if the keyword is found more than once in the field?
E.g., using similar logic to the referenced section of the linked document, for a single-instance match the phrase weight would be 2:
1) query = one two three, field = one and two three
field_phrase_weight = 2 (because 2-keyword long "two three" subphrase matched)
What about if the same phrase was matched twice? Would the weight be double?:
2) query = one two three, field = one and two three one and two three
field_phrase_weight = 4? (because 2-keyword long "two three" subphrase matched twice?)
I suspect that the answer to the above question is no: Sphinx will return the same field weight whether the keyword/keyword subsequence is found once or multiple times. If that is the case, how can the proximity algorithm be put to good use for large Sphinx fields, like an essay, if it returns the same field weight regardless of content size? This matters especially because the go-to Sphinx ranker is proximity_bm25, which relies very heavily on proximity ranking (for a multi-field document, at least 60% of the formula's weight would go to proximity ranking over BM25).

Will the proximity factor ranker give a higher field weight if the keyword is found more than once in the field?
No. The same field weight will apply.
For example 2 above, where the same "two three" subphrase matches twice, the weight would not be doubled; the field phrase weight stays at 2.
How can you make good use of the proximity algorithm for large Sphinx fields, like an essay, if it returns the same field weight regardless of content size?
The only way I can figure is a combination: use proximity to give higher weights to multi-keyword phrase matches, but also give BM25 enough weight in the formula to add value through the "rare keywords occurring more often in documents" factor. The BM25 part of the proximity_bm25 ranking algorithm is designed for exactly this purpose.
This is the proximity_bm25 expression: sum(lcs*user_weight)*1000+bm25. The bm25 component becomes progressively less relevant the more Sphinx fields a document has, because the sum(lcs*user_weight)*1000 part of the formula applies to each individual field while the bm25 part applies to the document as a whole.
In my situation, with 10 Sphinx fields, bm25 accounted for just 5% of the total weight. I bumped the bm25 portion up to around 20% of the total weight by changing the formula as follows:
sum(lcs*user_weight)*1000+bm25*4
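
If it helps, here is a minimal sketch (not from the original post) of applying a custom ranking expression like the one above through SphinxQL, using the pymysql client; the index name, field names, weights and query below are hypothetical.

import pymysql

# Connect to searchd's SphinxQL listener (default port 9306).
conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, WEIGHT() AS w FROM essays_index "
            "WHERE MATCH('one two three') "
            "ORDER BY w DESC LIMIT 20 "
            "OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25*4'), "
            "field_weights=(title=10, body=1)"
        )
        for doc_id, weight in cur.fetchall():
            print(doc_id, weight)
finally:
    conn.close()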

Related

How to predict word using trained CBOW

I have a question about CBOW prediction. Suppose my job is to use 3 surrounding words w(t-3), w(t-2), w(t-1) as input to predict one target word w(t). Once the model is trained, I want to predict a missing word at the end of a sentence. Does this model only work for a sentence of four words, where the first three are known and the last is unknown? If I have a sentence of 10 words and the first nine words are known, can I use those 9 words as input to predict the last, missing word in that sentence?
Word2vec CBOW mode typically uses symmetric windows around a target word. But it simply averages the (current in-training) word-vectors for all words in the window to find the 'inputs' for the prediction neural-network. Thus, it is tolerant of asymmetric windows – if fewer words are available on either side, fewer words on that side are used (and perhaps even zero on that side, for words at the front/end of a text).
Additionally, during each training example, it doesn't always use the maximum-window specified, but some random-sized window up-to the specified size. So for window=5, it will sometimes use just 1 on either side, and other times 2, 3, 4, or 5. This is done to effectively overweight closer words.
Finally and most importantly for your question, word2vec doesn't really do a full-prediction during training of "what exact word does the model say should be at this target location?" In either the 'hierarchical softmax' or 'negative-sampling' variants, such an exact prediction can be expensive, requiring calculations of neural-network output-node activation levels proportionate to the size of the full corpus vocabulary.
Instead, it does the much-smaller number-of-calculations required to see how strongly the neural-network is predicting the actual target word observed in the training data, perhaps in contrast to a few other words. In hierarchical-softmax, this involves calculating output nodes for a short encoding of the one target word – ignoring all other output nodes encoding other words. In negative-sampling, this involves calculating the one distinct output node for the target word, plus a few output nodes for other randomly-chosen words (the 'negative' examples).
In neither case does training know if this target word is being predicted in preference over all other words – because it's not taking the time to evaluate all other words. It just looks at the current strength-of-outputs for a real example's target word, and nudges them (via back-propagation) to be slightly stronger.
The end result of this process is the word-vectors that are usefully-arranged for other purposes, where similar words are close to each other, and even certain relative directions and magnitudes also seem to match human judgements of words' relationships.
But the final word-vectors, and model-state, might still be just mediocre at predicting missing words from texts – because it was only ever nudged to be better on individual examples. You could theoretically compare a model's predictions for every possible target word, and thus force-create a sort of ranked-list of predicted-words – but that's more expensive than anything needed for training, and prediction of words like that isn't the usual downstream application of sets of word-vectors. So indeed most word2vec libraries don't even include any interface methods for doing full target-word prediction. (For example, the original word2vec.c from Google doesn't.)
A few versions ago, the Python gensim library added an experimental method for prediction, predict_output_word(). It only works for negative-sampling mode, and it doesn't quite handle window-word-weighting the same way as is done in training. You could give it a try, but don't be surprised if the results aren't impressive. As noted above, making actual predictions of words isn't the usual real goal of word2vec-training. (Other more stateful text-analysis, even just large co-occurrence tables, might do better at that. But they might not force word-vectors into interesting constellations like word2vec.)
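
As a rough illustration (not from the answer above), here is a minimal gensim sketch that trains a CBOW model with negative sampling on a made-up toy corpus and then calls predict_output_word(); with so little data the predictions are meaningless, it only shows the mechanics.

from gensim.models import Word2Vec

# Made-up toy corpus; real usage needs far more text.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "and", "a", "dog", "played", "on", "the", "mat"],
]

model = Word2Vec(
    sentences,
    vector_size=32,  # tiny vectors, just for the demo
    window=5,
    min_count=1,
    sg=0,            # 0 = CBOW
    negative=5,      # negative sampling, required by predict_output_word()
    epochs=50,
)

# Any number of known context words can be passed; gensim averages their vectors.
context = ["the", "dog", "sat", "on", "the"]
print(model.predict_output_word(context, topn=3))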

Algolia search only by distance

Is it possible to rank search results based solely on proximity (and other filters) to a specific point and ignore the custom ranking formula (likes)?
For example, I'd like to rank results closest to Times Square strictly by distance and for the query to not care about the likes attribute.
I suppose the likes attribute is one of your own.
You can do what you need by removing the likes attribute from the custom ranking.
If you need two ways to search, you can create a replica index with the same data but a different ranking formula: https://www.algolia.com/doc/guides/ranking/sorting/?language=go#multiple-sorting-strategies-for-an-index
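
For what it's worth, here is a rough sketch of that setup with the Algolia Python client (the link above shows the Go docs, but the idea is the same); the app ID, API key, index names and coordinates are placeholders, and the exact ranking list may need adjusting for your data.

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
primary = client.init_index("places")

# Declare a replica that drops the custom ranking (likes) and puts "geo"
# first, so distance to aroundLatLng dominates the ordering.
primary.set_settings({"replicas": ["places_by_distance"]})

replica = client.init_index("places_by_distance")
replica.set_settings({
    "customRanking": [],
    "ranking": ["geo", "typo", "words", "filters", "proximity", "attribute", "exact"],
})

# Search the replica around Times Square; hits come back ordered by distance.
results = replica.search("", {"aroundLatLng": "40.7580,-73.9855", "aroundRadius": "all"})
print([hit["objectID"] for hit in results["hits"]])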

Classify many documents with many different unordered items

I've the following documents with different unordered items, e.g.,
doc_1:
item 1,
item 2,
...
item n
doc_2:
item 7,
item 3,
...
item n
.
.
.
doc_n:
item 20,
item 17,
...
item n
How to classify them into similar groups, e.g.,
1- doc_1:
all item 1
2- doc_2:
all item 2
...etc.
From your comment ("How to collect similar items in different documents?"), I am assuming you want to compute a similarity score between the lines in one document and the lines in another document.
One simple approach is to represent each line/sentence of a document through a bag-of-words model. Then you can compute the cosine similarity of two sentence/line representations.
If two lines from two different documents are textually very close, you should observe a higher cosine similarity between their sentence representations.
Please note, a useful representation and similarity computation depends on the problem that you are trying to solve. Say for example, if different documents contain product reviews (one or more lines) associated with different products (cell phones, laptops) and you want to collect the lines associated with a single product, you can simply follow the approach mentioned above.
Also note that the solution I have suggested is pretty naive. To achieve higher accuracy in your desired task, you may need to design a more focused representation that will be effective for the target task.
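
To make that concrete, here is a small sketch of the bag-of-words plus cosine similarity idea using scikit-learn; the example lines are made up.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up lines from two documents.
doc1_lines = ["item 1 red widget", "item 2 blue gadget"]
doc2_lines = ["item 7 blue gadget", "item 3 green widget"]

# Bag-of-words representation over the combined vocabulary.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(doc1_lines + doc2_lines)

# Cosine similarity of every doc_1 line against every doc_2 line.
sims = cosine_similarity(vectors[:len(doc1_lines)], vectors[len(doc1_lines):])
for i, row in enumerate(sims):
    for j, score in enumerate(row):
        print(f"doc_1 line {i} vs doc_2 line {j}: {score:.2f}")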

Function to check similarity of a string with a list of strings

I have a collection of similar strings in a bucket and have multiple such buckets. What kind of function should I use on the strings to compare a random string with the buckets to find out which bucket it belongs to?
To clarify: each entity in the bucket is a sentence that can have multiple words.
An example:
Consider the list of strings in the bucket:
1. round neck black t-shirt
2. printed tee shirt
3. brown polo t-shirt
If we have as input, "blue high neck t-shirt", we want to check if this can be added to the same bucket. This might be a simpler example but consider doing this for a bucket of let's say 100s of strings.
Any reference to an article or paper will be really helpful.
First of all, I am thinking about two kinds of similarities: syntactic and semantic.
1) Syntactic
Levenshtein distance can be used to measure the distance between two sequences (of characters)
Stemming can be used to increase the match probability. There are various implementations (e.g. here). You get the stem (or root) word for your random string and compare with stems from your buckets. Of course, bucket stems should be precomputed for efficiency.
2) Semantic
for general info you can read the article from Wikipedia
for actual implementation you can read this article from CodeProject. It is based on WordNet, a great ontology for the English language that stores concepts in synsets and also provides various relations between these synsets.
To get more details you should tell us what kind of similarity you need.
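
As a small sketch of the syntactic side (point 1 above), not taken from the answer itself: stem the words with NLTK's Porter stemmer, then compare the stemmed strings with edit (Levenshtein) distance. The sample strings are made up.

from nltk.stem import PorterStemmer
from nltk.metrics.distance import edit_distance

stemmer = PorterStemmer()

def stem_sentence(text):
    # Lowercase, split on whitespace and stem each word.
    return " ".join(stemmer.stem(word) for word in text.lower().split())

a = stem_sentence("printed tee shirts")
b = stem_sentence("printed tee shirt")

# Edit distance between the stemmed sentences; 0 means identical after stemming.
print(edit_distance(a, b))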
[edit]
Based on the provided information, I think you can do something like this:
1) split all strings into words => the random string will be named Array1 and the current bucket Array2
2) compute similarity as number_of_common_words(Array1, Array2) / count(Array2)
3) choose maximum similarity
Specificity may also be increased by adding points for position matches: Array1[i] == Array2[i]
For better performance I would store buckets as hash tables, dictionaries, etc., so that existence checks are done in O(1).
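
Here is a rough Python sketch of steps 1-3 above, with each bucket's vocabulary precomputed as a set so that existence checks are O(1); the bucket contents and query string are made up.

def similarity(query_words, bucket_words):
    # number_of_common_words(Array1, Array2) / count(Array2)
    return len(query_words & bucket_words) / len(bucket_words)

# Precomputed bucket vocabularies (made-up data).
buckets = {
    "t-shirts": {"round", "neck", "black", "t-shirt", "printed", "tee", "shirt", "brown", "polo"},
    "trousers": {"slim", "fit", "blue", "jeans", "cotton", "chino"},
}

query_words = set("blue high neck t-shirt".lower().split())

# Step 3: pick the bucket with the maximum similarity.
best = max(buckets, key=lambda name: similarity(query_words, buckets[name]))
print(best, similarity(query_words, buckets[best]))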

Aggregate bins in Tableau

I want to aggregate bins in tableau.
See the following figure:
I want to aggregate (merge) the NumberM values from 6 until 16 into one category, for example "5+" (6 and higher), and sum the values of 6-16 in that category. I think this can be done with a few simple clicks but I am not able to manage it.
Thanks in advance,
Tim
There are several ways to classify data rows into different groups or classes: each with different strengths.
Create a calculated field: As emh mentioned, one approach is to create a calculated field to assign a value to a new field indicating which group each data row belongs to. For the effect you want, the calculated field should be discrete (blue). If your calculation doesn't return a value in one case, e.g. an if statement without an else clause, then the field will be null in that case, which is a group in itself. This is a very general approach, and can handle much more complex cases. The only downsides are the need to maintain the calculated field definition and that the cutoff values are hard coded and by themselves can't be changed dynamically via a control on the view. BUT those issues can be easily resolved by using a parameter instead of a numeric literal in your calculated field. In fact, that's probably the number one use case for parameters. If you think in SQL, a discrete field on a shelf is like a group by clause.
Use a filter: If you only want a subset of the data in your view, e.g. data rows with NumberM in [6, 16], then you can drag the NumberM field onto the filters shelf and select the range you want. Note that for continuous (green) numeric fields, filter ranges include their endpoints. Filters are very quick and easy to drop on a view. They can be made dynamically adjustable by right clicking on them and creating a quick filter. It's obvious from the view that a filter is in use, and the caption will include the filter settings in its description. But a filter doesn't let you define multiple bins. If you think in SQL, a filter is like a where clause (or in some cases, using the condition tab, like a having clause).
Define histogram bins: If you want to create regular sized bins to cover a numeric range, such as values in [1,5], [6,10], [11,15] ..., Tableau can create the bin field for you automatically. Just right click on a numeric field, and select Create Bins.
Define a group: Very useful for aggregating discrete values, such as string fields, into categories. Good for rolling up detail or handling multiple spellings or variants in your data. Just right click on a field and select Create Group. Or select some discrete values on an axis or legend and press the paperclip option. If you then edit a group, you'll see what's going on. If you think in SQL, a group is like a SQL case statement.
Define a set: Another way to roll up values. The definition of a set can be dynamically computed or a hard coded list of members. Both kinds are useful. You can combine sets with union, intersection, and set difference operators, and can test set membership in calculated fields. Sets are useful for binary decisions: rows are divided into those that are members of the set and those that are not.
Filters, sets, groups, calculated fields and parameters can often be combined to accomplish different effects.
Most if not all of these features can be implemented using calculated fields, especially if the business rules get complicated. But if a filter, bin, group or set fits your problem well, then it's often best to start with that, rather than define a calculated field for each and every situation. That said, learning about the 4 kinds of calculated fields really makes a difference in being able to use Tableau well.
You can do this with calculated fields.
Go to Analysis > Create Calculated Field.
Then use this formula:
IF [NumberM] > 5 THEN "OVER 5" END
You can then use that calculated field as a filter on the worksheet in your screenshot.
Answering my own question:
With Tableau 9 this can be easily done with the increased flexibility of the level of detail expressions (LOD). I can really recommend this blog on that subject and many more Tableau functions.