Sphinx Search and rank based on word position - sphinx

Is it possible, using Sphinx Search, to have the weight of a result determined by the position of words in a list?
For example, if you have rows with a column containing the following text:
Row #1: "dog, bird, horse, cat"
Row #2: "dog, bird, cat"
and then perform an OR search using "dog | cat", I would like row #2 to rank higher than row #1: both rows contain "dog" and "cat", but in #2 the two words are closer together than in #1.
Hope this makes sense.
Thanks
Michael

You can do this by using field level ranking. Use "SPH_RANK_EXPR" as your ranker and look at the field level factor "min_hit_pos" to tell which word matched first.
All the information can be found at http://sphinxsearch.com/docs/manual-2.0.4.html#weighting
If you look closely at the SPH_RANK_SPH04 ranking algorithm below, it includes min_hit_pos, but only gives credit to rows where the matched word is the first word.
sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
What you can do is use the same algorithm but change "2*(min_hit_pos==1)" to be something like this:-
(101-IF(min_hit_pos<100,min_hit_pos,100))
A row will get an extra 100 weight if matched on the first word, 99 if matched on the second word, and so on down to the 100th word. From there on the term stays at a constant 1, so later positions earn nothing extra.
You can play around with the values and include a multiplier to see if the results are any better.
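To see what the replacement term contributes at different positions, here is a small Python sketch that mirrors the expression outside of Sphinx. The function name position_bonus is made up for illustration; only the arithmetic is taken from the expression above.

```python
def position_bonus(min_hit_pos: int) -> int:
    """Mirror of (101-IF(min_hit_pos<100,min_hit_pos,100)).

    min_hit_pos is the 1-based position of the first matched keyword
    in the field, as in the Sphinx ranking factor of the same name.
    """
    capped = min_hit_pos if min_hit_pos < 100 else 100
    return 101 - capped


# Print the bonus for a few sample positions.
for pos in (1, 2, 50, 99, 100, 250):
    print(pos, position_bonus(pos))
```

Multiplying the result by a constant is one way to experiment with how strongly position should count against the other terms of the formula.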
Hope that helps. Let me know if you have any questions.

Have you tried SPH_RANK_PROXIMITY ranking mode?
Otherwise you could be more explicit and run a query like this with SPH_RANK_WORDCOUNT:
"dog cat"/1 | "dog cat"~10 | "dog cat"~8 | "dog cat"~6 | "dog cat"~4 | "dog cat"~3 | "dog cat"~2 | "dog cat"~1
or similar.

Related

Sphinx search: bug in handling multiple blend_chars in a single term?

I'm using Sphinx 2.2.11 and believe I've found a bug regarding how Sphinx indexes terms that contain more than one instance of a blend character.
For example, I have the hyphen and period set as blend_chars:
blend_chars = ., -
Let's say I have a term in the database as follows:
part1-part2.part3
I would expect that Sphinx would index this term in all possible combinations for each blend_char. For example:
Variant 1: part1-part2.part3
Variant 2: part1 part2.part3
Variant 3: part1-part2 part3
Variant 4: part1 part2 part3
However, that doesn't seem to be the case.
If I search for:
part2.part3
I don't find the record containing the term part1-part2.part3.
However, if I search for:
part2 part3
OR
part1 part2 part3
I do find the record.
This suggests to me that Sphinx does not index all possible combinations of the blend_chars. Instead, it appears to index just two versions:
part1-part2.part3 (with blend_chars intact)
part1 part2 part3 (with blend_chars ignored, treated as whitespace)
If true, I would consider this a bug, as it tends to break searches that use just one of the blend_chars.
Can anyone confirm that they are seeing the same behavior? And can anyone suggest tips on how to fix or work around it?
Thanks very much!
When you have blend_chars = ., - and search for part2.part3 or part1-part2, Sphinx leaves each of those as a single token. It doesn't convert them to part2 AND part3 or part1 AND part2.
BUT when you index part1-part2.part3 it generates 4 tokens: part1-part2.part3, part1, part2 and part3. That's why you can't find the document with either part1-part2 or part2.part3.
The solution is to not use blended chars in your query. If you want to automate it, you can use CALL KEYWORDS before running your search to see how the term is tokenized at indexing time, and then use the results to modify your query, e.g.:
mysql> call keywords('part1-part2.part3', 'blend');
+------+-------------------+-------------------+
| qpos | tokenized         | normalized        |
+------+-------------------+-------------------+
| 1    | part1-part2.part3 | part1-part2.part3 |
| 1    | part1             | part1             |
| 2    | part2             | part2             |
| 3    | part3             | part3             |
+------+-------------------+-------------------+
4 rows in set (0.00 sec)
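As a rough illustration of why the partial query misses, the Python sketch below imitates the token list shown by CALL KEYWORDS and rewrites a query so it contains no blended characters. It is a simplification that assumes the default blend_mode, not Sphinx's actual tokenizer, and the names index_tokens and rewrite_query are hypothetical.

```python
import re

BLEND_CHARS = ".-"  # assumed to correspond to blend_chars = ., -


def index_tokens(term: str) -> list[str]:
    """Approximate the tokens stored for one blended term:
    the whole term plus each part between blend characters."""
    parts = [p for p in re.split("[" + re.escape(BLEND_CHARS) + "]", term) if p]
    if len(parts) > 1:
        return [term] + parts
    return parts


def rewrite_query(query: str) -> str:
    """Replace blend characters with spaces so every query word
    can match one of the split tokens."""
    pattern = "[" + re.escape(BLEND_CHARS) + r"\s]+"
    return " ".join(p for p in re.split(pattern, query) if p)


tokens = index_tokens("part1-part2.part3")
print(tokens)
print("part2.part3" in tokens)          # the partial blended form is not indexed
print(rewrite_query("part2.part3"))     # query with blend characters removed
```

In a real application the token list would come from CALL KEYWORDS itself rather than from a regular expression.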

Sphinx(Search) - documents which matches keyword twice (thrice, etc.)

Is there a way to only output documents which contains n matches of a search term in it?
For example, I want to output all documents containing the search term "Pablo Picasso" | "Picasso Pablo" at least two (three, n) times.
What would such a query look like?
My current query is:
SELECT * FROM myIndex WHERE MATCH('"Pablo Picasso" | "Picasso Pablo"');
You could do it by filtering on weight (i.e. results containing the term multiple times will rank higher).
But a useful trick is the strict order operator...
MATCH('Pablo << Pablo')
would require the word twice (i.e. one occurrence before the other!)
You can also use the proximity operator to simplify your original query. It just wants the words near each other, which is more concise than two phrase operators
MATCH('"Pablo Picasso"~1')
... i.e. within 1 word of each other, so adjacent.
Combine the two..
MATCH('"Pablo Picasso"~1 << "Pablo Picasso"~1')
and for three occurrences
MATCH('"Pablo Picasso"~1 << "Pablo Picasso"~1 << "Pablo Picasso"~1')
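If n varies, the repeated expression can be generated rather than typed by hand. The following is a minimal Python sketch; the helper name at_least_n is made up, and it assumes the phrase contains no characters that need escaping in the Sphinx query syntax.

```python
def at_least_n(phrase: str, n: int) -> str:
    """Build a query text that requires `phrase` (words adjacent, any order)
    to occur at least n times, chained with the strict order operator."""
    if n < 1:
        raise ValueError("n must be at least 1")
    unit = f'"{phrase}"~1'
    return " << ".join([unit] * n)


# Text to place inside MATCH('...') for three occurrences.
print(at_least_n("Pablo Picasso", 3))
```

The returned string goes inside MATCH('...'); single quotes in the phrase would need escaping before that.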

Syncsort Sum Fields=None not removing duplicates

I'm trying to run a SYNCSORT job that will remove duplicate entries and when I run it, I'm still getting duplicates. The following is the SYNCSORT code I'm using:
INCLUDE COND=(((61,1,CH,EQ,C'P'),OR,
(61,1,CH,EQ,C'V')),AND,
(8,2,CH,EQ,C'FL'))
OUTREC FIELDS=(1:12,20,
30:36,20,
55:61,1)
SORT FIELDS=(30,20,CH,A,
01,20,CH,A)
SUM FIELDS=NONE
The input is as follows:
----+----1----+----2----+----3----+----4----+----5----+----6
FL AMELIA CITY
32034 FL NASSAU FERNANDINA BEACH P
32034 FL NASSAU AMELIA CITY V
32034 FL NASSAU AMELIA ISLAND S
32034 FL NASSAU FERNANDINA S
I'm getting most of the expected output, except that I'm still getting duplicates. The output that I have is as follows:
----+----1----+----2----+----3----+----4----+----5----+
MANATEE BRADENTON P
MANATEE BRADENTON P
MANATEE BRADENTON P
MANATEE BRADENTON P
MANATEE BRADENTON P
MANATEE BRADINGTON V
POLK BRADLEY P
HILLSBOROUGH BRANDON P
SUWANNEE BRANFORD P
MIAMI-DADE BRICKELL V
Any help would be appreciated as I'm not able to find my error.
This is what you are sort summing on:
< ------------ Sort Field ----------------------->
----+----1----+----2----+----3----+----4----+----5----+----6
FL AMELIA CITY
32034 FL NASSAU FERNANDINA BEACH P
32034 FL NASSAU AMELIA CITY V
32034 FL NASSAU AMELIA ISLAND S
32034 FL NASSAU FERNANDINA S
The duplicate records differ in the first 11 bytes, which you cannot see in the output. Try removing the OUTREC to check.
Possible changes:
- Change the OUTREC to an INREC.
- Re-code the SORT with fields that correspond to the output records, as in the following:
INCLUDE COND=(((61,1,CH,EQ,C'P'),OR,
(61,1,CH,EQ,C'V')),AND,
(8,2,CH,EQ,C'FL'))
OUTREC FIELDS=(1:12,20,
30:36,20,
55:61,1)
SORT FIELDS=(42,20,CH,A,
12,20,CH,A)
SUM FIELDS=NONE
It does not matter in what order you code the different stages of a "sort"; they will be executed in the order that SORT wants.
In your case this will be INCLUDE, then SORT, then SUM, then OUTREC. You can check that this is the case by entirely inverting the control cards: you will get identical output.
If you want to do something before SORT you use INREC, not just try to locate OUTREC before the SORT statement. Here, since you are SORTing, you only want to include the data you need. You do not want to include the spacing for formatting. Why would you want to load up your file to SORT with extra identical data on each record?
On INREC and OUTREC please don't use FIELDS. On OUTFIL please don't use OUTREC. It should be obvious that FIELDS is "overloaded" (see how many times you used FIELDS, and see how many are "the same") and OUTREC is "overloaded". More than 10 years ago BUILD was introduced to allow things to be much clearer: it describes what it is doing, and every time you see BUILD it only means BUILD.
INCLUDE COND=(((61,1,CH,EQ,C'P'),
OR,
(61,1,CH,EQ,C'V')),
AND,
(8,2,CH,EQ,C'FL'))
INREC BUILD=(36,20,
12,20,
61,1)
SORT FIELDS=(1,40,CH,A)
OUTREC BUILD=(21,10,
10X,
1,20,
5X,
41,1)
The INREC selects only the data you want, and in an order where you need specify only one SORT key.
The OUTREC then formats the data how you want it. For each record in the SORT 15 bytes were saved (the blanks). 10X is 10 blanks, 5X is five blanks.
Note that it is much easier to code and understand, and therefore more maintainable, if you include "explicit" blanks rather than implicit ones using column numbers. Imagine 10 columns of a report, where the spacing between columns one and two is incorrect. Do you want to change all the column references, just to add one extra space, or would you prefer to change 7X to 8X and the rest works itself out? Even if you enjoy tedious changes, remember your colleagues :-)
If your data is already in order don't use SUM FIELDS=NONE. Use OUTFIL reporting features, REMOVECC, NODETAIL and SECTIONS with TRAILER3. NEVER SORT data just to allow you to remove duplicates with SUM FIELDS=NONE.
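The stage ordering described above (INCLUDE, then SORT and SUM, then OUTREC) can be imitated in a few lines of Python to show why the original job keeps "duplicates" while the INREC version does not. This is only a simulation under assumptions: the record layout in make_record is inferred from the control cards, the ZIP codes are made-up sample data, and all function names are hypothetical.

```python
def field(rec: str, start: int, length: int) -> str:
    """Return `length` bytes starting at 1-based column `start`."""
    return rec[start - 1:start - 1 + length]


def make_record(zip_code, state, county, city, kind):
    # Assumed layout: ZIP at col 1, state at 8, county at 12, city at 36, type at 61.
    return f"{zip_code:<7}{state:<4}{county:<24}{city:<25}{kind}"


def include(rec):
    # INCLUDE COND: type is P or V, and state is FL.
    return field(rec, 61, 1) in ("P", "V") and field(rec, 8, 2) == "FL"


def sum_fields_none(sorted_recs, key):
    """Keep one record per sort key, as SUM FIELDS=NONE does."""
    out, seen = [], set()
    for rec in sorted_recs:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            out.append(rec)
    return out


def original_job(records):
    # SORT and SUM see the INPUT layout, so the key includes the ZIP bytes.
    key = lambda r: (field(r, 30, 20), field(r, 1, 20))
    kept = sum_fields_none(sorted(filter(include, records), key=key), key)
    # OUTREC runs last and hides the bytes that made the keys differ.
    return [field(r, 12, 20) + " " * 9 + field(r, 36, 20) + " " * 5 + field(r, 61, 1)
            for r in kept]


def inrec_job(records):
    # INREC reformats BEFORE the sort, so the key is only city + county.
    reformatted = [field(r, 36, 20) + field(r, 12, 20) + field(r, 61, 1)
                   for r in records if include(r)]
    key = lambda r: field(r, 1, 40)
    kept = sum_fields_none(sorted(reformatted, key=key), key)
    return [field(r, 21, 10) + " " * 10 + field(r, 1, 20) + " " * 5 + field(r, 41, 1)
            for r in kept]


records = [
    make_record("34201", "FL", "MANATEE", "BRADENTON", "P"),
    make_record("34202", "FL", "MANATEE", "BRADENTON", "P"),
    make_record("34203", "FL", "MANATEE", "BRADENTON", "P"),
    make_record("34204", "FL", "MANATEE", "BRADENTON", "S"),
]
print(len(original_job(records)), len(inrec_job(records)))
```

With this sample data the original job emits three identical-looking lines, because the records differ only in bytes that OUTREC later drops, while the INREC version emits one.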

Splunk csv to match country code

I'm using splunk but having trouble trying to match first 2 or 3 digits in this:
sample:
messageId=9492947, to=61410428007
My csv looks like this:
to, Country
93, Afghanistan
355, Albania
213, Algeria
61, Australia
I'm trying to match the field against the CSV and have it tell me which country it matched.
I think I need to use a regex or something similar, but the only interesting field I have marked in Splunk is "to".
This is one of those messy ones, but it does not require regular expressions. Let's say you have a CSV file with a header - something like:
code,country
93,Afghanistan
355,Albania
213,Algeria
61,Australia
44,United Kingdom
1,United States
You need the header for this. I created an example file, but the source can come from anywhere just as long as you have the to field extracted properly.
source="/opt/testfiles/test-country-code.log"
| eval lOne=substr(to,1,1)
| eval lTwo=substr(to,1,2)
| eval lThree=substr(to,1,3)
| lookup countries.csv code as lOne OUTPUT country as cOne
| lookup countries.csv code as lTwo OUTPUT country as cTwo
| lookup countries.csv code as lThree OUTPUT country as cThree
| eval country=coalesce(cOne,cTwo,cThree)
| table to,country
The substr calls extract one, two or three characters from the start of the string. The lookups convert each of those variables to the country name using the lookup table. The coalesce will take the first one of those with a value.
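The same prefix logic can be checked outside Splunk. Below is a small Python sketch of it, assuming the lookup table fits in memory; the sample CSV content and the function names load_codes and country_for are illustrative only.

```python
import csv
import io

# Sample lookup table with the same header as the CSV above.
CSV_TEXT = """code,country
93,Afghanistan
355,Albania
213,Algeria
61,Australia
44,United Kingdom
1,United States
"""


def load_codes(text: str) -> dict[str, str]:
    """Read the lookup table into a dict keyed by dialing code."""
    return {row["code"]: row["country"] for row in csv.DictReader(io.StringIO(text))}


def country_for(to: str, codes: dict[str, str]):
    """Try the first 1, 2 and 3 digits in turn, like the three lookups
    followed by coalesce(cOne,cTwo,cThree)."""
    for length in (1, 2, 3):
        name = codes.get(to[:length])
        if name is not None:
            return name
    return None


codes = load_codes(CSV_TEXT)
print(country_for("61410428007", codes))
```

Trying the shortest prefix first matches the order of the coalesce arguments; a number whose prefix is not in the table yields None, just as the Splunk field would stay empty.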

PostgreSQL prevent non-matching tsqueries from matching tsvector

Given the following query:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('cats ate');
This query will return true as a result. Now, what if I don't want "cats" to also match the word "cat", is there any way I can prevent this?
Also, is there any way I can make sure that the tsquery matches the entire string in that particular order (e.g. the "cats ate" is counted as a single token rather than two). At the moment the following query will also match:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('ate cats');
"cat" matching "cats" is due to English stemming, english probably being your default text search configuration. See the result of show default_text_search_config to be sure.
It can be avoided by using the simple configuration. Try the function calls with explicit text configurations:
select to_tsvector('simple', 'fat cat ate rat') @@ plainto_tsquery('simple', 'cats ate');
Or change it with:
set default_text_search_config='simple';