Sphinx frequency ranking


I have read the Sphinx documentation and can't find an answer.
I have a Sphinx index set up on two fields (title, description).
I would like a document to rank higher the more times a search term appears in one or both of the fields. For example, a search for "car":
Product A:
Title: Red car
Desc: good car
Product B:
Title: Green car
Desc: A really good car. The best car.
In the above example I would like product B to rank higher than A because "car" appears twice in its description. Currently the two items are ranked equally.
What settings do I need to adjust in Sphinx to get this to work?
Thanks for the help

The quick and simple way would be to use the SPH_RANK_WORDCOUNT ranker, which ranks by the number of keyword occurrences.
But you could get better results by using a custom ranking expression:
http://sphinxsearch.com/docs/current.html#expression-ranker
e.g. one built on the word_count or maybe the hit_count factor.
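To make the counting idea concrete, here is a small pure-Python sketch. It does not talk to Sphinx at all; it only mimics what a ranker expression such as sum(hit_count) would add up for the two products in the question. The tokenizer and the helper name total_hits are assumptions made for illustration.

```python
import re

# Hypothetical helper: mimics summing per-field keyword occurrences, which is
# what a ranker expression like sum(hit_count) adds up. Not Sphinx's own code.
def total_hits(doc, term):
    term = term.lower()
    return sum(
        re.findall(r"\w+", text.lower()).count(term)
        for text in doc.values()
    )

# The two products from the question.
products = {
    "A": {"title": "Red car", "description": "good car"},
    "B": {"title": "Green car", "description": "A really good car. The best car."},
}

scores = {name: total_hits(doc, "car") for name, doc in products.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

In SphinxQL the corresponding query would be along the lines of SELECT id FROM products WHERE MATCH('car') OPTION ranker=expr('sum(hit_count)'), where products is a placeholder index name. Note that for a single keyword, word_count (unique keywords matched per field) would not separate A from B, while hit_count (occurrences per field) would.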

Related

Plotting frequencies in Seaborn

I'm looking at an SNL dataset and I want to use seaborn to take a look at a couple different things.
I'm using this to learn more about visualizations in jupyter (aka I'm a beginner).
The data set looks like this:
aid: actor
capacity: what their role was
charid: unique character id
impid: unique impersonation id
role: name of role they played
tid: sketch id
voice: were they just a voiceover?
epid: episode id
sid: season id
Some questions:
Who are the top 20 actors who appeared on SNL?
The characters used most frequently?
The impressions most frequently?
Which characters were played by multiple actors?
I tried this but it's so many people, I want to limit it to maybe 20 people. Or if you have suggestions of other visualizations to try I'm all ears.
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(5,5))
sns.countplot(y="aid", data=appearances);
Some example plots of how to answer some of these questions would be amazing!!

Your question is quite broad but in general, for each series, you can do this:
Get the count for each unique element of a specific column and only take the 20 elements with the highest count:
top20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values().tail(20)
bot20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values().head(20)
NB: sort_values sorts in ascending order by default, hence we use tail() to get the values with the highest count. You can sort in descending order instead with .sort_values(ascending=False); in that case the elements with the highest count are selected with .head(), e.g.
top20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values(ascending=False).head(20)
bot20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values(ascending=False).tail(20)
Then simply plot the results in a bar plot (recent seaborn versions require x and y to be passed as keywords):
sns.barplot(x=top20aid.values, y=top20aid.index)
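The same pattern can cover the other questions too. Below is a minimal sketch on a tiny made-up stand-in for the appearances DataFrame (the actor labels and ids are invented); it uses value_counts() as a shorter route to the top-N counts and nunique() to find characters played by more than one actor.

```python
import pandas as pd

# Tiny made-up sample standing in for the real `appearances` DataFrame.
appearances = pd.DataFrame({
    "aid": ["actor_a", "actor_a", "actor_b", "actor_b", "actor_b", "actor_c"],
    "charid": [1, 2, 3, 3, 4, 1],
})

# Top-N actors by number of appearances (N=2 here, 20 on the real data).
top_actors = appearances["aid"].value_counts().head(2)

# Characters played by more than one actor: count distinct actors per charid.
actors_per_char = appearances.groupby("charid")["aid"].nunique()
shared_chars = actors_per_char[actors_per_char > 1]

print(top_actors)
print(shared_chars)
```

top_actors can be passed to a bar plot in the same way as top20aid, and swapping "aid" for "charid" or "impid" gives the most frequent characters and impressions.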

Schema.org ranking vs rating

I would like to use schema.org markup on my site. There are rating options/values to use, but I need ranking. The only thing I can think of is to map a rating of 10 stars to a ranking of 1, a rating of 9 stars to a ranking of 2, etc. I searched on Google and was directed here to post my question. If someone from Google or familiar with schema.org values is reading, please comment. The best option would be to add ranking as a feature of schema.org. Thanks for the help.

There is no ranking property in any of the schema.org types, but a ranking is just an ordered list: it specifies how the things in the list relate to each other.
You can use https://schema.org/position to set the position of each https://schema.org/ListItem in an ordered list, which simulates a ranking.
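As a rough illustration of that idea, the sketch below builds such an ordered list as JSON-LD. ItemList, itemListElement, ListItem, position and itemListOrder are schema.org terms; the function name and the item names are placeholders.

```python
import json

def ranking_to_jsonld(names):
    """Build a schema.org ItemList whose ListItem positions encode the rank."""
    return {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "itemListOrder": "https://schema.org/ItemListOrderAscending",
        "itemListElement": [
            {"@type": "ListItem", "position": rank, "name": name}
            for rank, name in enumerate(names, start=1)
        ],
    }

# Placeholder item names, best-ranked first.
markup = ranking_to_jsonld(["First place", "Second place", "Third place"])
print(json.dumps(markup, indent=2))
```

The printed JSON would go inside a script tag of type application/ld+json on the page.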

Sphinx: Show all results order by previous searches

I use SphinxQL for searching and filtering in a product database, and I store the last x search phrases of each user. I wonder whether it is possible to show all products (all rows) to every user, but ordered by relevance to their previous searches.
Let's say a user searched for mobile phones (iphone, galaxy s7...), i.e. the electronics category. I want to show them all products in random order, but with products from the electronics category appearing more often and products matching the searched keywords even more often.
Is this even possible with Sphinx?
Thanks, and sorry for my English.

An alternative would be to attach random numbers to each result: a high and a low number, with overlapping ranges.
sql_query = SELECT id, RAND()*100 AS rand_low, (RAND()*100)+50 AS rand_high, ...
sql_attr_uint = rand_low
sql_attr_uint = rand_high
You can then write the ranking expression to pick one of these numbers depending on whether the document matches the extra keywords, and sort by the result.
SELECT id FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
OPTION ranker=expr('IF(doc_word_count>1,rand_high,rand_low)');
The results will be mixed up, but results that match one of the words have a greater chance of showing up first (because they use the higher-weighted random number). It is still only a chance, because a rand_high CAN still be smaller than a rand_low.
... you can change the size of the 'overlap' between the two ranges to tweak the mix of matching/non-matching results.
(Added as a new answer as it's quite a different idea, although it uses the same '_all_' keyword.)
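To get a feel for how strong the bias is, here is a quick simulation of the idea in plain Python (nothing Sphinx-specific). It assumes the ranges from the sql_query above, i.e. rand_low in [0,100) and rand_high in [50,150), and counts how often a matching row would sort ahead of a non-matching one.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def sort_key(matches_extra_keyword):
    # Mirrors IF(doc_word_count>1, rand_high, rand_low) with the ranges
    # from the sql_query above: rand_low in [0,100), rand_high in [50,150).
    if matches_extra_keyword:
        return rng.random() * 100 + 50
    return rng.random() * 100

trials = 10_000
wins = sum(sort_key(True) > sort_key(False) for _ in range(trials))
share = wins / trials
print(f"matching row sorts first in {share:.1%} of pairs")
```

With these ranges the share should come out at roughly seven pairs in eight (for two uniform ranges of width 100 offset by 50, the non-matching row only wins with probability 0.5 * 0.5 * 0.5 = 0.125). Shrinking the offset pushes the mix closer to a coin flip.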

Sphinx doesn't have a 'mode' to do just that. But you can get very close...
You can use the MAYBE operator:
MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
The complication is that you need a way to match all products. Depending on your data you may already have a word you can use (e.g. a word like 'the' that appears in every single product), or you can add such a word to every document during indexing.
... using MAYBE allows the matching results to have a higher weight.
But you don't want to sort strictly by weight, so you need a different algorithm, something to shuffle the results a bit (as you don't really want 'random'!):
SELECT id, IDIV(id,10000) AS band, WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
ORDER BY band DESC, w DESC;
This creates banding by ID. As, in theory, results can be spread over the whole ID space, this will mix them up a bit, but the category results will still tend to be shown first within each band.
(If you have a different attribute other than ID, something more spread out, it might work better. Or you can add a deliberately random attribute to the results.)
... there are all sorts of variations, your imagination is the only limitation; this basic technique can be used to mix things up quite a bit.
(There are other possibilities. Sphinx's little-known GROUP N BY feature can be used to produce a sampled search result. This isn't random, but it might give a similar enough result, i.e. just mixing up the results.)
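The banding sort itself is easy to try out away from Sphinx. The sketch below applies the same ordering (integer-divide the id into a band, then sort by band and weight, both descending) to a handful of made-up (id, weight) rows.

```python
# Made-up (id, weight) rows; a higher weight means a MAYBE keyword matched.
rows = [(12345, 1), (12999, 5), (27001, 1), (20500, 5), (3100, 5), (9800, 1)]

BAND_SIZE = 10_000  # same divisor as in the query above

# Sort by band first, then by weight inside each band, both descending.
ordered = sorted(rows, key=lambda r: (r[0] // BAND_SIZE, r[1]), reverse=True)

for doc_id, weight in ordered:
    print(doc_id // BAND_SIZE, doc_id, weight)
```

Within each band the weighted rows come first, but a low-weight row from a higher band still lands ahead of a high-weight row from a lower band, which is the "mixed but biased" effect described above.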

Mahout: Recommending Items for a user in particular product category

What do we have as of now? - We are using Mahout's GenericItemBasedRecommender to get a list of recommended products for a user using TanimotoCoefficientSimilarity as ItemSimilarity.
Where do we want to go from here? - The above works fine when we don't care about product category but what we want to know is the Product Category specific recommendations i.e. Say if a user has been buying, browsing, liking etc. specifically more in Men's and Gadgets category, I would then want to show this user recommendation in that specific category saying Recommended for you in [X] where X would be replaced by Mens or Gadgets in this case. We are thinking about a couple of options below to achieve this and we need some leads/opinion/feedback etc. so as to make sure we are going in the right direction. Options:
Firstly, we'll have to move to a non-Tanimoto measure for calculating item similarity so that we account for users buying, liking, etc., and not only view/browsing data.
Figuring out the product category for a particular user (this is where we need direction) - Our product category hierarchy is basically a tree, and we need to know which top 4 nodes (with the best recommendations) in the tree to show to the user. Also, if node X is a category we are showing to the user and node Y is a parent of node X, we don't want to show the user products in category Y, or in any ancestor for that matter. A couple of ways of achieving this:
First option: For every user, calculate the SUM of the similarity scores of the items for each node at leaf level, and recursively calculate it for each parent node up to the root. At each node we then have A = SUM of similarity scores and B = number of items recommended, so we also have a value V = A/B at each node. We pick the top 4 V values from the tree and recommend those categories to the user. The challenge here is that if we try to calculate this online during the request, it would be tough to keep the entire request under 150 ms. An example:
Root level -               Category12 (A=11, B=4) (category1 + category2)
                                      |
                 _____________________|_____________________
                /                                           \
               /                                             \
Leaf level - category1 (A=6, B=2)                  category2 (A=5, B=2)
Recommended products in Category 1: Item1 (score = 2), Item2 (score = 4)
Recommended products in Category 2: Item3 (score = 1), Item4 (score = 4)
Second option: For every category create a cluster of users based on their behaviour (likes, buying, viewing etc.) and then figure out the top 4 categories to which the user belongs. Not sure if we can achieve this using clustering in Mahout but I think we can do this offline.
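The bottom-up aggregation in the first option can be sketched as a post-order tree walk. This is only an illustration in Python, not Mahout code; the Category class and the aggregate function are hypothetical names, and the numbers are the ones from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    scores: list = field(default_factory=list)    # scores of items recommended directly here
    children: list = field(default_factory=list)

def aggregate(node, out):
    """Post-order walk; records (A, B, V) for every node in `out`."""
    a, b = sum(node.scores), len(node.scores)
    for child in node.children:
        child_a, child_b = aggregate(child, out)
        a, b = a + child_a, b + child_b
    out[node.name] = (a, b, a / b if b else 0.0)
    return a, b

root = Category("Category12", children=[
    Category("category1", scores=[2, 4]),
    Category("category2", scores=[1, 4]),
])
stats = {}
aggregate(root, stats)
print(stats)
```

Since the walk visits each node once, it could be run offline per user and cached, which may be one way around the 150 ms budget.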
Please provide your feedback/suggestions/leads/thoughts.
Thanks in advance!

If you want to model more than one thing in your data, I would suggest using the SVD recommender instead, with the ALSWR factorizer set to implicit feedback. With that done you can have user,item,preference in your data, where the preference value is how strongly associated your user is with the item. You can play with the numbers, for example a purchase is a 20 and a view is just a 2. I'm just throwing numbers out here; I wouldn't know what will work best for your data, because you can also model things proportionally, as in: if a purchase is 30 times less likely to happen than a view, then a purchase should count 30 times more strongly than a view.
Mahout provides a way to influence the recommendations through the IDRescorer. You implement your own logic there and decide how to affect the recommendations. For example, the IDRescorer could check whether a recommendation candidate belongs to the target category and, if it does, boost its score by X. There's an example here (link) from the Mahout in Action book (which you should definitely read), showing a rescorer.
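The rescoring logic is small enough to sketch outside Mahout. Mahout's IDRescorer is a Java interface, so the Python below is not its API; it only mirrors the "boost the score if the candidate is in the wanted category" idea. The lookup table, the boost factor and the scores are all made up.

```python
# Illustrative only: Mahout's IDRescorer is a Java interface; this Python
# function just mirrors the "boost if same category" logic described above.
ITEM_CATEGORY = {101: "gadgets", 102: "mens", 103: "books"}  # made-up lookup

def rescore(item_id, original_score, target_category, boost=1.5):
    """Multiply the score by `boost` when the item is in the target category."""
    if ITEM_CATEGORY.get(item_id) == target_category:
        return original_score * boost
    return original_score

candidates = [(101, 0.60), (102, 0.70), (103, 0.80)]  # (item id, score)
rescored = sorted(
    ((item, rescore(item, score, "gadgets")) for item, score in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
print(rescored)
```

Item 101 starts with the lowest score but moves to the top once the "gadgets" boost is applied, which is the effect you would be after for a "Recommended for you in Gadgets" list.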
Hope this helps

Most efficient way to store nested categories (or hierarchical data) in Mongo?

We have nested categories for several products (e.g., Sports -> Basketball -> Men's, Sports -> Tennis -> Women's ) and are using Mongo instead of MySQL.
We know how to store nested categories in a SQL database like MySQL, but would appreciate any advice on what to do for Mongo. The operation we need to optimize for is quickly finding all products in one category or subcategory, which could be nested several layers below a root category (e.g., all products in the Men's Basketball category or all products in the Women's Tennis category).
This Mongo doc suggests one approach, but it says it doesn't work well when operations are needed for subtrees, which we need (since categories can reach multiple levels).
Any suggestions on the best way to efficiently store and search nested categories of arbitrary depth?

The first thing you want to decide is exactly what kind of tree you will use.
The big thing to consider is your data and access patterns. You have already stated that 90% of all your work will be querying and by the sounds of it (e-commerce) updates will only be run by administrators, most likely rarely.
So you want a schema that gives you the power of querying quickly on child through a path, i.e.: Sports -> Basketball -> Men's, Sports -> Tennis -> Women's, and doesn't really need to truly scale to updates.
As you so rightly pointed out MongoDB does have a good documentation page for this: https://docs.mongodb.com/manual/applications/data-models-tree-structures/ whereby 10gen actually state different models and schema methods for trees and describes the main ups and downs of them.
The one that should catch the eye if you are looking to query easily is materialised paths: https://docs.mongodb.com/manual/tutorial/model-tree-structures-with-materialized-paths/
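This is a very interesting method to build up trees since, to query on the example you gave above into "Womens" in "Tennis", you could simply use a prefix regex (which can use the index: http://docs.mongodb.org/manual/reference/operator/regex/ ) like so:
db.products.find({category: /^Sports,Tennis,Womens[,]/})
to find all products listed under a certain path of your tree. Note that the trailing [,] only matches paths that continue past "Womens", so a product whose path is exactly "Sports,Tennis,Womens" is matched only if the stored paths end with a comma.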
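To see how prefix matching on materialised paths behaves, here is a small stand-alone Python sketch using the re module on made-up product documents (no MongoDB involved). Ending the pattern with "(,|$)" is my own choice so that the category itself matches as well as its descendants; whether that exact pattern stays index-friendly in MongoDB is not something this sketch checks.

```python
import re

# Made-up products, each storing its category as a comma-separated path.
products = [
    {"name": "racket", "category": "Sports,Tennis,Womens"},
    {"name": "skirt", "category": "Sports,Tennis,Womens,Clothing"},
    {"name": "ball", "category": "Sports,Tennis,Mens"},
    {"name": "jersey", "category": "Sports,Basketball,Mens"},
]

def in_subtree(path_prefix, docs):
    """Return names of docs whose path equals the prefix or continues past a comma."""
    pattern = re.compile("^" + re.escape(path_prefix) + "(,|$)")
    return [d["name"] for d in docs if pattern.search(d["category"])]

womens_tennis = in_subtree("Sports,Tennis,Womens", products)
all_tennis = in_subtree("Sports,Tennis", products)
print(womens_tennis)
print(all_tennis)
```

Requiring a comma or end-of-string after the prefix keeps a path like "Sports,TennisTable" (if one existed) out of the "Sports,Tennis" subtree.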
Unfortunately this model is really bad at updating, if you move a category or change its name you have to update all products and there could be thousands of products under one category.
A better method would be to house a cat_id on the product and then separate the categories into a separate collection with the schema:
{
    _id: ObjectId(),
    name: 'Women\'s',
    path: 'Sports,Tennis,Womens',
    normed_name: 'all_special_chars_and_spaces_and_case_sensitive_letters_taken_out_like_this'
}
So now your queries only involve the categories collection which should make them much smaller and more performant. The exception to this is when you delete a category, the products will still need touching.
So an example of changing "Tennis" to "Badmin":
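db.categories.find({path: /^Sports,Tennis(,|$)/}).forEach(function(doc){
    // rewrite only the leading "Sports,Tennis" segment of each matching path
    doc.path = doc.path.replace(/^Sports,Tennis/, "Sports,Badmin");
    db.categories.updateOne({_id: doc._id}, {$set: {path: doc.path}});
});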
Unfortunately MongoDB provides no in-query document reflection at the moment so you do have to pull them out client side which is a little annoying, however hopefully it shouldn't result in too many categories being brought back.
And this is basically how it works really. It is a bit of a pain to update but the power of being able to query instantly on any path using an index is more fitting for your scenario I believe.
Of course the added benefit is that this schema is compatible with nested set models: http://en.wikipedia.org/wiki/Nested_set_model which I have found time and time again are just awesome for e-commerce sites, for example, Tennis might be under both "Sports" and "Leisure" and you want multiple paths depending on where the user came from.
The schema for materialised paths easily supports this by just adding another path, that simple.
Hope it makes sense, quite a long one there.

If all categories are distinct then think of them as tags. The hierarchy doesn't need to be encoded in the items because you don't need it when you query for items. The hierarchy is a presentational thing. Tag each item with all the categories in its path, so "Sport > Baseball > Shoes" could be saved as {..., categories: ["sport", "baseball", "shoes"], ...}. If you want all items in the "Sport" category, search for {categories: "sport"}; if you want just the shoes, search for {categories: "shoes"}.
This doesn't capture the hierarchy, but if you think about it that doesn't matter. If the categories are distinct, the hierarchy doesn't help you when you query for items. There will be no other "baseball", so when you search for that you will only get things below the "baseball" level in the hierarchy.
My suggestion relies on categories being distinct, and I guess they aren't in your current model. However, there's no reason why you can't make them distinct. You've probably chosen to use the strings you display on the page as category names in the database. If you instead use symbolic names like "sport" or "womens_shoes" and use a lookup table to find the string to display on the page (this will also save you hours of work if the name of a category ever changes -- and it will make translating the site easier, if you would ever need to do that), you can easily make sure that they are distinct because they don't have anything to do with what is displayed on the page. So if you have two "Shoes" in the hierarchy (for example "Tennis > Women's > Shoes" and "Tennis > Men's > Shoes") you can just add a qualifier to make them distinct (for example "womens_shoes" and "mens_shoes", or "tennis_womens_shoes"). The symbolic names are arbitrary and can be anything; you could even use numbers and just use the next number in the sequence every time you add a category.
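A minimal sketch of the tag approach with symbolic names, in plain Python on made-up data: the items carry unique category keys, and a separate lookup table maps each key to the string shown on the page. The names DISPLAY_NAMES and items_in are placeholders.

```python
# Symbolic, unique category keys mapped to display strings (made-up data).
DISPLAY_NAMES = {
    "sport": "Sport",
    "tennis": "Tennis",
    "tennis_womens_shoes": "Shoes",
    "tennis_mens_shoes": "Shoes",
}

items = [
    {"sku": "w-1", "categories": ["sport", "tennis", "tennis_womens_shoes"]},
    {"sku": "m-1", "categories": ["sport", "tennis", "tennis_mens_shoes"]},
]

def items_in(category, docs):
    """Equivalent in spirit to a Mongo query {categories: category}."""
    return [d["sku"] for d in docs if category in d["categories"]]

print(items_in("sport", items))
print(items_in("tennis_womens_shoes", items))
print(DISPLAY_NAMES["tennis_womens_shoes"])
```

Both shoe categories display as "Shoes" but stay distinct as keys, so querying one never returns items from the other.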
