How would you identify a user's most active sections? - tags

On a website, everything is tagged with keywords assigned by the staff (it's not a community driven site, due to its nature). I am able to determine which tags a user is most active in (or, what tags they view the most). However, I'm not sure how to choose the list. A few options present themselves, but they don't seem right to me.
Take the top n (or m < n if they have fewer than n viewed tags) tags
Take the top n tags where n is a percentage of the total tags viewed
Take the top n tags with m views where n and m are percentages of total tags viewed and total page views
Take all of the tags, regardless of views
The goal is to identify what is most interesting to the user and show them other things that they might be interested in, with respect to the tags that are assigned to the content.

You could look at machine learning algorithms for ways to evaluate the effectiveness of your choice.
For instance: http://en.wikipedia.org/wiki/Supervised_learning#Approaches_and_algorithms Techniques like nearest neighbour and Bayes could help you improve your suggestions.
This is, however, overkill for just suggesting "Would you like to look at this too?", but it's an interesting approach to providing better tie-ins. It would also require some method to figure out whether or not your users value your suggestions (e.g. "I like this!" links or log analysis based on time spent on links, etc.)

A simple solution is to try several reports and check which one is most informative. The nature of your site and your data may mean that some reports are unexpectedly useful and some are not. If a report gives you a 'flat' area chart, for example, look for something else.
Even better, give the consumers of the reports a choice and the ability to provide feedback. Tune the reports based on what they are really looking for.
P.S. I would go for the "Take the top n tags with m views where n and m are percentages of total tags viewed and total page views" report first.
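As a rough illustration of that last option, here is a minimal Python sketch. The function name, the two thresholds and the sample data are all made up for the example, and it assumes "total page views" can be approximated by summing the per-tag view counts (which overcounts pages that carry several tags).

```python
import math
from collections import Counter

def top_tags(tag_views, tag_fraction=0.25, min_view_share=0.05):
    """Return a user's most-viewed tags.

    tag_fraction   -- keep at most this fraction of the distinct tags viewed (the "n")
    min_view_share -- a tag must account for at least this share of all views (the "m")
    Both defaults are hypothetical and should be tuned against real data.
    """
    total_views = sum(tag_views.values())
    if total_views == 0:
        return []
    limit = max(1, math.ceil(len(tag_views) * tag_fraction))
    min_views = total_views * min_view_share
    ranked = Counter(tag_views).most_common()  # sorted by views, descending
    return [tag for tag, views in ranked[:limit] if views >= min_views]

# Hypothetical view counts for one user
views = {"python": 40, "sql": 25, "css": 20, "java": 10, "perl": 3, "cobol": 2}
print(top_tags(views))
```

Raising tag_fraction widens the list for users who browse many tags, while min_view_share keeps one-off views from being treated as interests.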

Why does my Google Optimize experiment show no clear winner

I did a very simple test: the footer contact form on the left of the website versus the right. The results showed "no clear winner". But the data below shows that one variant has 5 conversions vs 1, which I consider to be significant (albeit with low numbers). It also says there is a 95% probability that this one will be better.
What am I not understanding about this data? Is it that the numbers are too low in volume to give a reading or is it a bug or is there something I've missed?
It's probably because your A/B test did not have a lot of traffic in each variant. With numbers this low, 5 conversions vs 1 is not really a big difference between the two.
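To get a feel for why 5 vs 1 is weak evidence, you can run a classical significance test on the counts. The sketch below is a plain-Python two-sided Fisher exact test; note that this is a frequentist check, not the Bayesian model Google Optimize reports, and the 100 visitors per variant is an assumption, since the question does not give the traffic numbers.

```python
from math import comb

def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher exact p-value for conversion counts in two variants."""
    total_conv = conv_a + conv_b
    total = n_a + n_b

    def weight(k):
        # Unnormalised hypergeometric weight of k conversions landing in variant A
        return comb(total_conv, k) * comb(total - total_conv, n_a - k)

    k_min = max(0, total_conv - n_b)
    k_max = min(total_conv, n_a)
    observed = weight(conv_a)
    # Add up every outcome that is at least as unlikely as the observed one
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / comb(total, n_a)

# Hypothetical traffic: 100 visitors in each variant
p = fisher_exact_two_sided(5, 100, 1, 100)
print(f"p-value: {p:.3f}")
```

Under these assumed visitor counts the p-value comes out well above the usual 0.05 threshold, which fits the "no clear winner" verdict.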

Summary/Cross Tab Using Multiple Variables + Percentage Change Columns

I am trying to use gtsummary to count the number of times someone engaged in an action (a; binary variable, yes/no) in a given year (b; continuous variable, ranging from 2002-2020) by various demographic factors (c-z; i.e. race, income, educational attainment) for complex survey data. Is there any way to do this in gtsummary? Furthermore, is there any way to use gtsummary to generate columns that would provide the percentage change (in absolute and relative terms) between two years for a given demographic factor (i.e. what is the percentage change between 2006 and 2020 in the number of times someone engaged in action "a" for black/white/hispanic/mixed race participants)?
So far, I'm seeing the tbl_cross function can handle up to two variables, and tbl_svysummary seems equipped for more general summary statistics (i.e. counting the number of (black/white/hispanic) people by whether they engaged in action "a" or not) and not this more granular question I was wondering about.
Any guidance you have here would be much appreciated (and totally understand if this is beyond the scope of the package)! Thank you as always for your awesome work with gtsummary.

Sampling search domain

In MiniZinc, is it possible to sample the domain? Let's say my domain has many solutions; running --all-solutions will initially return very similar solutions.
1) Is there a way to sample the domain, perhaps with BFS? The purpose is follow-up analysis of the solutions.
2) Are there any methods to estimate the search domain size in CP?
My domain is a Staff Rostering Problem.
Regards,
H
It is not possible to choose BFS in MiniZinc, but there are search annotations. With search annotations you can choose the order in which the variables are branched on, and also which value is branched on. Unfortunately, MiniZinc does not support random variable selection.
In your case I would branch using dom_w_deg with a random value, but any other variable selection can work; try them.
solve::seq_search([int_search(some_array, dom_w_deg, indomain_random,complete)]) satisfy;
Do note that not all solvers support the usage of search annotations.
Another alternative is to add constraints that rule out similar results.
You can always calculate an upper bound on the number of assignments your solution can have: the product of the domain sizes of all variables (for n variables with d values each, that is d^n). This does not consider any constraints, so the real search space can be much smaller.
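A quick way to compute that unconstrained bound is to multiply the domain sizes of all decision variables. The sketch below does this in Python; the roster dimensions (10 staff, 7 days, 4 shift codes) are invented for the example and not taken from the question.

```python
from math import prod

def search_space_upper_bound(domain_sizes):
    """Number of complete assignments when all constraints are ignored."""
    return prod(domain_sizes)

# Hypothetical roster: 10 staff x 7 days, each cell takes one of 4 shift codes
cells = 10 * 7
bound = search_space_upper_bound([4] * cells)
print(f"unconstrained assignments: roughly 10^{len(str(bound)) - 1}")
```

The constraints of a real rostering model usually cut this down by many orders of magnitude, so treat the figure as a ceiling only.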
You can also inspect the search itself by using Gist or another program that visualizes the search tree.
(source: marco at www.imada.sdu.dk)
You can expand and retract parts of the search tree and see which variables have been branched on.

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest a possible way?
There has been a lot of research in this area over the past 20 years. Yes, computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words that preserves the relationships between words according to human understanding. In short, it is a graph with nodes called 'synsets' and edges representing the relations between them. Any two words that are close to each other in the graph are close in meaning, and words that fall within the same synset might mean exactly the same thing. 'Bag' and 'baggage' are close, which you can find by iteratively exploring node to node in a breadth-first style: start with 'bag' and explore its neighbors in an attempt to find 'baggage'. You'll have to limit this search to a small number of iterations for any practical application. Another style is to start a random walk from one node and try to reach the other node within a given number of tries and a given distance. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
There are many other similar resources online.
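Both graph strategies from Method 1 can be sketched on a toy graph. The graph below is a hand-made stand-in, not real WordNet data, and the function names and limits are only illustrative; with NLTK you would walk the actual synset relations instead.

```python
import random
from collections import deque

# Hypothetical toy graph standing in for WordNet relations
GRAPH = {
    "bag": ["baggage", "sack", "purse"],
    "baggage": ["bag", "luggage"],
    "luggage": ["baggage", "suitcase"],
    "suitcase": ["luggage"],
    "sack": ["bag"],
    "purse": ["bag", "wallet"],
    "wallet": ["purse"],
    "bad": ["poor", "worst"],
    "poor": ["bad"],
    "worst": ["bad"],
}

def bfs_distance(graph, start, goal, max_depth=3):
    """Edges on the shortest path from start to goal, or None beyond max_depth."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return depth
        if depth == max_depth:
            continue  # do not expand past the iteration limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None

def random_walk_hit_rate(graph, start, goal, walks=1000, max_steps=10, seed=0):
    """Fraction of random walks from start that reach goal within max_steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = start
        for _ in range(max_steps):
            neighbours = graph.get(node, [])
            if not neighbours:
                break
            node = rng.choice(neighbours)
            if node == goal:
                hits += 1
                break
    return hits / walks

print(bfs_distance(GRAPH, "bag", "baggage"))
print(bfs_distance(GRAPH, "bag", "bad"))
print(random_walk_hit_rate(GRAPH, "bag", "baggage"))
```

A small BFS distance or a high hit rate suggests the two words belong in the same group; unrelated words such as 'bag' and 'bad' are never reached here because they sit in different parts of the graph.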
Method 2:
Word2Vec. It is hard to explain here, but it works by creating a vector for each word, with a user-specified number of dimensions, based on the word's context in the text. There has been an idea for two decades that words appearing in similar contexts have similar meanings, e.g. "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in text. You can read the paper for an explanation (link at the end).
So you can train a Word2Vec model on a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of this vector. You can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came out two years ago and immediately became the 'thing-to-use' in most NLP applications. The quality of this approach depends on the amount and quality of your data. Generally, a Wikipedia dump is considered good training data, as it contains articles about almost everything. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
The first example gives you the word closest (topn=1) to 'woman' and 'king' while being furthest from 'man'. The answer is 'queen'. The second example picks the odd one out. The third tells you how similar the two words are in your corpus.
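The similarity score in gensim is a cosine similarity between the two word vectors. If you want to see what "vectors close to each other" means without training anything, here is a minimal sketch; the three-dimensional vectors are invented for illustration (real models use hundreds of dimensions) and most_similar here is a local helper, not the gensim method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, made up for this example
vectors = {
    "bag": [0.9, 0.1, 0.2],
    "baggage": [0.8, 0.2, 0.3],
    "lose": [0.1, 0.9, 0.1],
}

def most_similar(word, vectors):
    """Return the other word whose vector is closest to the given word's."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("bag", vectors))
```

To group words, you could compute this for every pair and put words whose similarity exceeds some threshold into the same group.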
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Get best 3 scores in Optaplanner?

Can we get the top 3 best scores using constraints in OptaPlanner?
For example, I have a use case where I need to show the user the 3 results with the highest scores, so that they can select the solution that fits their needs.
Sounds like Pareto optimization (see docs). It is not yet officially supported in OptaPlanner.
But users have hacked it before, by implementing their own BestSolutionRecaller (= that class that holds the best solution(s)) and replacing the DefaultSolver's bestSolutionRecaller with it. This implies "taking the red pill" and "following the rabbit hole down to wonderland". Good luck :)
Important note: Pareto optimization goes much further than just remembering the n best solutions. It's about remembering the n best solutions that aren't dominated by any of the other best solutions. So it entails changing the score comparison (and breaking the transitive aspect of score comparison).
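To make the difference concrete, here is a small sketch in plain Python. It is not OptaPlanner code and does not use any OptaPlanner API; the scores are made-up numbers, and each candidate in the second part is a hypothetical tuple of objectives where higher is better.

```python
import heapq

def top_n(scores, n=3):
    """The n highest values of a single, totally ordered score."""
    return heapq.nlargest(n, scores)

def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Single score: just keep the three largest
print(top_n([5, 1, 9, 7, 3]))

# Two objectives per candidate: (7, 4) and (3, 3) are both dominated by (8, 5)
candidates = [(10, 2), (8, 5), (7, 4), (3, 9), (3, 3)]
print(pareto_front(candidates))
```

With a single score the candidates can simply be sorted, whereas two non-dominated candidates such as (10, 2) and (8, 5) cannot be ranked against each other, which is why the usual score comparison no longer applies.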