Sphinx search ranking broken? - sphinx

Has anyone ever gotten the Sphinx ranking options to work? I've read the manual and the book, but I cannot get ranking to work at all. From what I understand, ranking simply computes the weights in a different manner; it doesn't do any sorting itself. I have my results sorted by #weight (the internal Sphinx field) and I'm using the extended sort mode, which you need for this, yet I cannot see any difference between the different ranking modes. My config is something like this:
$cl->SetMatchMode( SPH_MATCH_EXTENDED2 );
$cl->SetSortMode ( SPH_SORT_EXTENDED, "mylang DESC, #weight DESC, #id");
Neither of these makes any difference:
$cl->setRankingMode(SPH_RANK_SPH04);
$cl->setRankingMode(SPH_RANK_PROXIMITY_BM25);
And the weights are the same in either mode.
Ultimately, what I'm trying to achieve is to have terms that match exactly be sorted towards the top. So for example, if searching for "Harry Potter" the results should be as follows:
Harry Potter
Harry Potter and the potters
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Deathly Hallows: Part 1
This is just an example, but the first result should be the one that contains the exact search term, then the others would follow. This is not happening. Anyone have any experience with this?

Do you have any records in the index other than ones that start with "Harry Potter"?
If not, the phrase "Harry Potter" will be penalized by the ranking algorithm.
See my article about that: Interesting thing about BM25 in Sphinx Search
All of your records have an exact match for "Harry Potter", so I suppose records with more words will be ranked higher.
A solution could be to use an attribute that stores the record size in bytes:
sql_query = select field, length(field) as f_size from ....
Attribute:
sql_attr_uint = f_size
Sphinx sort mode:
$cl->SetSortMode ( SPH_SORT_ATTR_ASC, 'f_size' );
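If you still want text relevance to come first and only use the record size to break ties, a variant of the same idea (just a sketch; the index name myindex is a placeholder, and f_size is the attribute defined above) is to keep the extended sort mode and list f_size after the weight:
<?php
// Sketch: keep relevance first and use the stored record size only to break ties,
// reusing the f_size attribute from the sql_query/sql_attr_uint lines above.
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);
// Shorter records win only when the weights are equal ("@weight" is the magic
// relevance name in extended sort mode).
$cl->SetSortMode(SPH_SORT_EXTENDED, '@weight DESC, f_size ASC, @id ASC');
$result = $cl->Query('"Harry Potter"', 'myindex');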

It turns out that SPH_RANK_SPH04 is not included in the sphinxapi.php file shipped with version 0.9.9! So even though you're calling it, it's not taken into account, and furthermore it does not produce an error.
This is terrible because it makes it very hard to troubleshoot.
I've posted this as the answer in the hopes that it helps someone else. We lost almost 2 days going crazy over this until we figured it out.
Furthermore, there is a bug in 2.0.1 which means it doesn't really bring some exact matches to the front; for that you need 2.0.2 (which you have to get from SVN) or above, but I'd be very wary of using experimental versions in production.
Hopefully the Sphinx developers will take care of this soon.
PS
Looking back at the developer diaries, it does say:
"As of 1.10-beta, Sphinx has 8 different rankers"
We upgraded from 0.9.9 to 2.0.1 and must have left the old API file behind, and in desperation I never even checked this. It would still be nice for Sphinx to throw an error when the ranking mode doesn't exist (as it does for other modes, such as matching), and the 2.0.1 bug is still there as far as we can tell from our tests.
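If swapping in the newer API file has to wait, a cheap guard against this class of silent failure (just a sketch) is to check for the constant yourself before relying on it:
<?php
// Sketch: detect a stale sphinxapi.php before querying, instead of silently
// falling back to another ranker. SPH_RANK_SPH04 only exists in the API file
// shipped with 1.10-beta and later.
require_once 'sphinxapi.php';

if (!defined('SPH_RANK_SPH04')) {
    die('This sphinxapi.php is too old for SPH_RANK_SPH04 - upgrade the API file.');
}

$cl = new SphinxClient();
$cl->SetRankingMode(SPH_RANK_SPH04);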

Related

Project Academic Knowledge | Query for and list papers by AA.AuId?

I've got a list of author names, but I don't have IDs for any of them.
I'd like to:
Query by author name and store the most probable AuId.
List all papers written by a given AuId.
Is there any way to do this with the current interpret/evaluate APIs? It seems like everything is tied to a paper entity and I want to be sure I am only ever selecting and using one AuId.
Thanks.
I am not aware of such a feature. But indirectly, you could first search for the author name (AA.AuN in the expr field), obtain all the various (unique) author IDs (AA.AuId in the attributes field), and then search for their publications.
(You could even add orderby=logprob:desc, but to be honest, I am not 100% sure what logprob does.)
So, the first step could be to search for the author name (e.g. John Smith) like this and fetch all those AA.AuId where the names (AA.AuN) seem to fit John Smith (let's just add the orderby=logprob:desc):
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?&expr=Composite(AA.AuN=%27john%20smith%27)&count=100&attributes=AA.AuN,AA.AuId&orderby=logprob:desc&subscription-key={YOUR-KEY}
As a second step, if you have an author ID AA.AuId (here, for example, 3038752200), use this to list their papers (ordered by year, descending: orderby=Y:desc):
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?&expr=Composite(AA.AuId=3038752200)&count=100&attributes=AA.AuN,AA.AuId,DOI,Ti,VFN,Y&orderby=Y:desc&subscription-key={YOUR-KEY}
The approach would be more promising if you had an institutional affiliation as well. Then you could change the expr field to Composite(And(AA.AuN='{AUTHOR-NAME}',AA.AfId={AFFILIATION-ID})) so as to search for all {AUTHOR-NAMES} affiliated to {AFFILIATION-ID}.
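For completeness, a rough sketch of how the two steps could be glued together in a script; the endpoint, expr syntax and attribute names are copied from the URLs above, the key and author name are placeholders, and the response parsing assumes the usual evaluate JSON layout (an entities list, each entity carrying an AA list of authors):
<?php
$base = 'https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate';
$key  = 'YOUR-KEY';
$name = 'john smith';

// Step 1: search by author name and collect the distinct AA.AuId values whose
// AA.AuN matches the name (each entity is a paper, so its AA list also holds co-authors).
$url = $base . '?expr=' . urlencode("Composite(AA.AuN='$name')")
     . '&count=100&attributes=AA.AuN,AA.AuId&orderby=logprob:desc'
     . '&subscription-key=' . $key;
$response = json_decode(file_get_contents($url), true);

$authorIds = array();
foreach ($response['entities'] as $entity) {
    foreach ($entity['AA'] as $author) {
        if ($author['AuN'] === $name) {
            $authorIds[$author['AuId']] = true;
        }
    }
}
$authorIds = array_keys($authorIds);

// Step 2: list the papers of one candidate AuId, newest first.
$url = $base . '?expr=' . urlencode('Composite(AA.AuId=' . $authorIds[0] . ')')
     . '&count=100&attributes=AA.AuN,AA.AuId,DOI,Ti,VFN,Y&orderby=Y:desc'
     . '&subscription-key=' . $key;
$papers = json_decode(file_get_contents($url), true);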

Sphinx: Show all results order by previous searches

I use SphinxQL for searching and filtering in a product database, and I store the last x search phrases of each user. I wonder whether it is possible to show all products (all rows) to every user, but weighted by relevance to their previous searches.
Let's say one user searched for mobile phones (iphone, galaxy s7...), i.e. the electronics category. I want to show him all products randomly, but with products from the electronics category appearing more often, and products matching those searched keywords even more often.
Is it even possible with Sphinx?
Thanks, and sorry for my English.
An alternative would perhaps be to attach random numbers to each result: a high and a low number, with an overlapping range.
sql_query = SELECT id, RAND()*100 AS rand_low, (RAND()*100)+50 AS rand_high, ...
sql_attr_uint = rand_low
sql_attr_uint = rand_high
You can then arrange the ranking expression to pick either of these numbers, depending on whether the document matches or not, and sort by the result.
SELECT id FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
OPTION ranker=expr('IF(doc_word_count>1,rand_high,rand_low)');
The results will be mixed up, but results that match one of the words have a greater chance of showing up first (because they use the weighted random number). It's still only a chance, because a rand_high value CAN still be smaller than a rand_low value.
... you can change the size of the 'overlap' between the numbers to tweak the mix of matching/non-matching results.
(Added as a new answer as it's quite a different idea, although it uses the same '_all_' keyword.)
Sphinx doesn't have a 'mode' to do exactly that, but you can get very close...
You can use the MAYBE operator:
MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
The complication is that you need a way to match all products. Depending on your data you may already have a word you can use (e.g. a word like 'the' that appears in every single product), or you can add such a word to every document during indexing.
... using MAYBE allows the matching results to have a higher weight.
But you don't want to sort strictly by weight, so you need a different algorithm, something to shuffle the results a bit (as you don't really want 'random'!)
SELECT id, IDIV(id/10000) AS int,WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
ORDER BY int DESC, w DESC;
This creates banding by ID; as results are in theory spread over the whole ID space, this will mix them up a bit. But the category results will still tend to be shown first within each band.
A different attribute other than ID might be better if you have one, something more spread out. (Or you can add a deliberate random attribute to the results.)
... there are all sorts of variations; your imagination is the only limitation. This basic technique can be used to mix things up quite a bit.
(There are other possibilities: Sphinx's little-known GROUP N BY function can be used to produce a sampled search result. It isn't random, but it might give a similar enough result, i.e. it just mixes up the results.)

Echonest query by country is not working

I'm using Echonest in order to make a query for the top singers in a specific country. The problem is that some 'not popular' artists are returned anyway. For example, in Italy "Armando" is one of the top-100 artists. But this artist is not popular in Italy; he is a totally unknown artist.
http://developer.echonest.com/api/v4/artist/search?api_key=xxx&format=json&start=0&results=100&rank_type=relevance&sort=familiarity-desc&artist_start_year_after=1950&artist_location=Italy
At this point I'm not sure about the accuracy of Echonest.
Can somebody explain if I'm doing something wrong?
No, you are doing nothing wrong. It is just the result of Echonest's algorithm:
http://developer.echonest.com/api/v4/artist/profile?api_key=xxx&id=AR22QZN1187FB4D4C0&bucket=familiarity
Armando's familiarity is currently calculated as 0.563938, so he deserves that rank according to this number :)

haystack/elasticsearch: trying to find "s04e07"

I have a kind of weird problem with haystack/elasticsearch: I'm trying to find TV episodes stored in my database based on a string like 's04e07', which means season 4, episode 7, and is a fairly standard format, but the search index has problems with that.
After trying a few different things, it looks like numbers are not indexed in EdgeNgramFields.
In a CharField I can only find exact word matches like '2013' if it is contained in the title, but I have no luck finding 's04e07'.
How do I get my results out of the index?
How could I possibly change the hardcoded default mapping in haystack to index my stuff correctly?
I actually wrote about haystack a few days ago; I would suggest reading that one first:
Django Haystack Distinct Value for Field
It's not directly on point, but my advice here is the same. Stop using haystack.
Haystack comes with out-of-the-box edgengram and ngram analyzers, which is cool, except these analyzers don't work in nearly all use cases.
They will especially not work in yours because you are mixing numbers and chars.
But my first question is, why can't you index the data like this:
"season":1
"episode":1
And then, at search time, break down the user's search into the above format?
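That query-side breakdown is only a small regex; here is a minimal sketch (written in PHP to match the other snippets on this page, though the same pattern works in a Django view) assuming the season/episode fields suggested above:
<?php
// Sketch: turn a user query like "s04e07" into the structured season/episode
// fields suggested above; anything else falls back to a normal text search.
function parse_episode_query($query) {
    if (preg_match('/^s(\d{1,2})e(\d{1,2})$/i', trim($query), $m)) {
        return array('season' => (int)$m[1], 'episode' => (int)$m[2]);
    }
    return null;
}
// parse_episode_query('s04e07') => array('season' => 4, 'episode' => 7)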
If that isn't possible, you can still PUT a mapping manually without letting haystack do it for you (which I highly recommend anyway, because its mappings are not correct). It's pretty easy to do with elasticutils.
Keep in mind, I don't think edgengram is exactly what you want here in any event, because it only grams from the edge and is most useful for autocompletes, for example if someone is typing s04e and you want to display a list of possible matches.
So, this depends on how users will query the data. Will it always be the whole string above, or parts of the string, or will they sometimes search for e07 and you want to show all seasons with an episode 7?
The last possibility here is to just index it as normal (haystack will choose snowball) and use prefix queries / regex queries to get what you want.

How to form an Endeca query where a field must start with certain letters

Is it possible to form an Endeca query to retrieve records where a field must start with certain letters? Say, get all users whose first letter is A? I tried Range filters, but they support only numerical fields, and also wildcard search, but nothing has worked well so far.
Creating a dimension is one way of approaching the problem, as Paul Lemke mentioned. Wildcard is not an option because of the performance overhead as well as the irrelevant records it returns.
But we solved it using a couple of other alternatives.
Create a new property for the object called "StartWith", store the first letter of the object in it, and make it searchable. We found it easier than creating a dimension.
There is a problem in that single letters like 'A' are usually stop words in Endeca. There are a couple of workarounds to solve this:
Get the ASCII value of the first letter and store that numerical value in the property. One more advantage of this approach is that we can use Range Filters. But you can't satisfy 'starts with AB' kinds of requirements.
Prepend some characters, e.g. store ^^^My name and search for ^^^M. The advantage of this approach is that you can handle conditions like 'starts with AB'.
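A rough sketch of how those two property values could be produced while preparing records for indexing (the function and property names are made up for illustration):
<?php
// Sketch: compute the two "StartWith"-style property values described above.
function add_startswith_properties(array $record) {
    $name = trim($record['name']);
    // Variant 1: ASCII value of the first letter, usable with range filters,
    // but limited to a single leading character.
    $record['StartWithAscii'] = ord(strtoupper(substr($name, 0, 1)));
    // Variant 2: prepend a marker so a wildcard search for ^^^AB* behaves like
    // a "starts with AB" filter without tripping over stop-word handling.
    $record['StartWith'] = '^^^' . $name;
    return $record;
}
// add_startswith_properties(array('name' => 'Abigail'))
//   => StartWithAscii = 65, StartWith = '^^^Abigail'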
Endeca at its current version (6.1) does not have a search filter that works like a "startswith" function in other programming languages.
I do have two options that might possibly get you close:
If you are truly only looking for the first letter, you can set up a dimension value for each letter of the alphabet (A, B, C...). You can then refine on each letter and see only the values that start with A, B, C, etc. The only downside to this is that you can only filter based on however many dimension values you set up. So if you added "A", you couldn't filter anything that started with "AB". You could go down the line and add "AB", "BA", "CA", and so on, but that would get unwieldy very fast.
If you want something closer to a "startswith" function the only other option is to use a wildcard search. Basically you would do a property search like this: N=0&Ntk=Username&Ntt=ab*
The trick with wildcard searching is that it will match across multiple words in that property. So assuming you had a data set with these values:
Smithers Smith
Larry Smith
Jenna-Smith
Doing a search of sm* would actually return all 3 results because "sm" was in their last name. Even the one with the dash would be returned, as Endeca thinks that is a separate word. (It might be possible to turn that off, though; not sure.)
So basically it comes down to this: stick one word in a property, set that property to allow wildcard search, then do a "blah*" search against that property, and you should have the results you're looking for.
Have you tried the First relevance ranking module, which is supposed to rank based on proximity to the beginning of the field?
It sounds similar to what you are looking for, and together with a wildcard it may produce your intended results.