Determine sample size for A/B testing, more than 2 variants

What R function should we use to decide the sample size for a test like this:
We have 10 ads and want to run a test to decide which ad has the best click-through rate. We are able to count the traffic and the click-throughs.

I don’t think the number of variant experiences makes a difference. In each, you compare a metric to the same metric in control, so each comparison has its own required sample size: the smaller the difference from the control, the larger the sample size.
A related point has been under active debate in recent years: how to optimize the traffic split between the experiences at run time, so that by the time all the variants are called, most of the traffic has gone through your winning experience. Google (Experiments) uses what it calls a multi-armed bandit algorithm for that. Multi-armed bandits are a long-studied problem in statistics rather than something Google devised, but as far as I know the details of Google's own implementation haven't been published in a peer-reviewed journal.
Good Luck!
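As a rough illustration of "the smaller the difference, the larger the sample size", here is a minimal Python sketch (not the R function asked for) using the standard normal-approximation formula for a two-sided two-proportion z-test. The function name, the example click-through rates, and the optional Bonferroni adjustment for several variant-vs-control comparisons are assumptions added for illustration; they are not part of the answer above.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8, comparisons=1):
    """Approximate visitors needed per arm for a two-sided two-proportion z-test.

    comparisons > 1 divides alpha by the number of variant-vs-control
    comparisons (Bonferroni); this is an assumption added for illustration.
    """
    if p_control == p_variant:
        raise ValueError("the two rates must differ")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - (alpha / comparisons) / 2)
    z_beta = std_normal.inv_cdf(power)
    # Sum of the two binomial variances, one per arm.
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical click-through rates: 10% for control, 12% or 11% for a variant.
print(sample_size_per_arm(0.10, 0.12))
print(sample_size_per_arm(0.10, 0.11))                  # smaller lift, more traffic
print(sample_size_per_arm(0.10, 0.12, comparisons=9))   # 9 variants vs. control
```

Halving the expected lift roughly quadruples the required traffic per arm, which is why the variant closest to control dominates the total test duration.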

Related

Why does my Google Optimize experiment show no clear winner

I ran a very simple test: the footer contact form on the left of the website versus the right of the website. The results showed "no clear winner". But the data below shows that one variant has 5 conversions vs 1, which I consider to be significant (albeit with low numbers). It also says there is a 95% probability that this one will be better.
What am I not understanding about this data? Is it that the numbers are too low in volume to give a reading, is it a bug, or is there something I've missed?
It's probably because your A/B test did not have a lot of traffic in each variant. With counts this low, 5 conversions vs 1 is not really a big difference between the two.
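To see why counts this small are weak evidence, here is a minimal sketch of a two-sided Fisher exact test in plain Python. This is a frequentist check, not the Bayesian model that Optimize reports, and the visitor counts (100 per variant) are hypothetical, since the question does not give them.

```python
from math import comb


def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher exact p-value for conversions in two variants."""
    total_conv = conv_a + conv_b
    total = n_a + n_b

    def prob(k):
        # Hypergeometric probability that variant A gets k of all conversions.
        return comb(total_conv, k) * comb(total - total_conv, n_a - k) / comb(total, n_a)

    lowest = max(0, total_conv - n_b)
    highest = min(total_conv, n_a)
    observed = prob(conv_a)
    # Add up every outcome that is at least as unlikely as the observed one.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-9)
    )


# Hypothetical traffic: 100 visitors per variant, 5 vs. 1 conversions.
print(fisher_exact_two_sided(5, 100, 1, 100))
# Same conversion rates with ten times the (hypothetical) traffic.
print(fisher_exact_two_sided(50, 1000, 10, 1000))
```

Under these assumed visitor counts the first p-value is well above 0.05, while the same rates at ten times the traffic give a clearly significant result.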

Does Google Optimize anti-flicker snippet affect LCP negatively?

FYI:
anti-flicker snippet
LCP
The snippet hides document.documentElement for a certain TIMEOUT value (defaulting to 4 seconds). It seems like LCP would probably be delayed by 4 seconds...
Possibly yes. However, there are more things to consider.
LCP is Largest Contentful Paint. It’s basically the point at which the largest item (image or text block) on the screen was last changed.
So yes, the Optimize snippet may delay that, and it will almost certainly affect First Contentful Paint (FCP). However, if the experiment changes the LCP text or image, that element is going to change anyway when the experiment loads, so LCP would be affected regardless. If the experiment is on something that doesn’t affect the LCP image/text, then yes, LCP is likely being held up needlessly.
It should also be noted that it doesn’t hold it up for 4 seconds - it’s a maximum of 4 seconds. If the experiment loads after 1 second it will display.
Also, even if it does impact LCP, it will also massively reduce CLS (Cumulative Layout Shift), a new metric that is expected to gain importance over the next few years.
Ultimately, experiments rendered on the client side (like Optimize provides) take time, and something’s going to give. The anti-flicker snippet reduces confusing shifts as the experiment kicks in. Is this worth the delay? Depends on the experiment!
On that note, at the end of the day you should think in terms of your users. Would they prefer to see the page drawn as quickly as possible, even if that means it changes as it loads? Or would they prefer a white screen for longer? What makes better sense and UX to them? The metrics (be they LCP, FCP, CLS or whatever other metric you choose) are simple attempts at measuring (or at least proxying) user satisfaction. Don’t lose sight of that when chasing the numbers.

Does OptaPlanner have a "built-in" way to perform multi-unit score normalization?

At the moment, my problem has four metrics. Each of these measures something entirely different (each has different units, a different range, etc.) and each is weighted externally. I am using Drools for scoring.
I have only one score level (SimpleLongScore), and I have to find a way to appropriately combine the individual scores of these metrics into one long value.
The most significant problem at the moment is that the range of values for the metrics can be wildly different.
So if, for example, after a move the score of a metric with a small possible range improves by, say, 10%, that could be completely dwarfed by an alternate move which improves the metric with a larger range's score by only 1% because OptaPlanner only considers the actual score value rather than the possible range of values and how changes affect them proportionally (to my knowledge).
So, is there a way to handle this cleanly which is already part of OptaPlanner that I cannot find?
Is the only feasible solution to implement Pareto scoring? Because that seems like a hack-y nightmare.
So far I have code/math to compute the best-possible and worst-possible scores for a metric, which I access from within Drools, and I can then compute where in that range a move puts us. But this also feels quite hack-y and will cause issues with incremental scoring if we want to scale non-linearly within that range.
I keep coming back to thinking I should just bite the bullet and implement Pareto scoring.
Thanks!
Take a look at @ConstraintConfiguration and @ConstraintWeight in the docs.
Also take a look at the chapter "Explaining the score", which can tell you exactly which constraint had which score impact on the best solution found.
If, however, you need Pareto optimization, meaning you need multiple best solutions that don't dominate each other, then know that OptaPlanner doesn't support that yet, but I know of 2 cases that implemented it in OptaPlanner by hacking BestSolutionRecaller.
That being said, 99% of the people who consider Pareto optimization are 100% happy with @ConstraintWeight instead, because users don't want multiple best solutions (except during simulations); they just want one in production.
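The range-normalization idea described in the question (place each metric within its best/worst range, then apply the external weight) can be sketched independently of OptaPlanner. The following Python is only an illustration of the arithmetic, not OptaPlanner or Drools code; the `Metric` class, the `SCALE` constant, and the example ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    worst: float  # worst-possible raw value for this metric
    best: float   # best-possible raw value for this metric
    weight: int   # externally supplied weight


SCALE = 1_000_000  # fixed-point scale so the result fits a single long


def normalized(metric, raw):
    """Position of raw within [worst, best], clamped to 0..1 (1 is best)."""
    span = metric.best - metric.worst
    if span == 0:
        return 1.0
    return min(1.0, max(0.0, (raw - metric.worst) / span))


def combined_score(metrics, raw_values):
    """Weighted sum of normalized metrics as one integer score."""
    return sum(
        m.weight * round(normalized(m, raw_values[m.name]) * SCALE)
        for m in metrics
    )


# Hypothetical metrics with very different ranges and equal weights.
metrics = [Metric("small", 0, 10, 1), Metric("large", 0, 10_000, 1)]
print(combined_score(metrics, {"small": 5, "large": 5_000}))   # baseline
print(combined_score(metrics, {"small": 6, "large": 5_000}))   # +10% of small range
print(combined_score(metrics, {"small": 5, "large": 5_100}))   # +1% of large range
```

With this linear scaling a 10% gain in the small-range metric outweighs a 1% gain in the large-range one. Because the mapping is linear, the same effect can be folded into the per-constraint weights; a non-linear mapping would not decompose that way, which is the incremental-scoring concern raised in the question.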

How can I set a goal for a recommendation system? (mean average precision, baselineRmse)

I am starting to develop an offline recommendation system using the ALS algorithm,
and I need to set a goal for the system,
so I want to know what criteria are used to evaluate a recommendation system.
I already know MAP (mean average precision) and improvement over baselineRmse, and I would like to know how modern recommendation systems perform on these criteria, so that I can set my goal.
Back in the early days of recommenders, people thought predicting ratings was a good idea. This has since proven to be nearly useless in itself. If you have enough space in a UI to show only a few recommendations, are you going to pick the ones you predict the user will rate highest? That will always result in bad performance. Rating prediction is what RMSE was designed to measure.
MAP@k, on the other hand, is meant to measure the predictiveness of a recommender. It measures how well the training data predicts what is in the test data. It also accounts for the ordering of recommendations. Ranking/ordering of recommendations has more recently been found to have a much greater effect on the effectiveness of recommendations, because if you can only show a limited number, they had better be the ones most likely to cause a user to take action.
MAP@k also takes account of ranking in the sense that if you measure MAP@1 and MAP@10, you will see decreasing MAP scores if your first recommendation was more likely to be in the test data than the 10th. This means you are ordering recommendations roughly correctly.
For these reasons we use MAP@k. Split the "gold standard" dataset you will use in later tests and keep the split static. Something like 80%-20% will work, split by random choice or by time, with the most recent 20% used as the test split. Build your model on the 80%, then for each interaction in the 20% get recommendations and see if the recommendations contain the item actually interacted with in the test set. The aggregate of all these goes into the MAP@k calculation; k is based on how many recommendations you ask for.
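As a minimal sketch of the calculation just described, the following Python computes mean average precision at k using a commonly used definition. It is an illustration written for this answer, not the Kaggle or ActionML code referenced below, and the function names and toy data are assumptions.

```python
def average_precision_at_k(actual, predicted, k=10):
    """Average precision at k for one user.

    actual: items the user really interacted with in the test split.
    predicted: recommended items, best first.
    """
    actual = set(actual)
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, weighted by precision at its rank.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)


def mean_average_precision_at_k(actual_lists, predicted_lists, k=10):
    """Mean of the per-user average precision at k."""
    pairs = list(zip(actual_lists, predicted_lists))
    if not pairs:
        return 0.0
    return sum(average_precision_at_k(a, p, k) for a, p in pairs) / len(pairs)


# Toy test split: two users, one held-out item each (hypothetical data).
held_out = [["a"], ["b"]]
recommended = [["a", "b"], ["a", "b"]]
print(mean_average_precision_at_k(held_out, recommended, k=2))
```

A hit at rank 1 contributes more than a hit at rank 2, which is how the metric rewards putting the most likely items first.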
See these references and some tools we have to do this:
The Kaggle wiki references the Python code that both they and we at ActionML use: https://www.kaggle.com/wiki/MeanAveragePrecision
ActionML analysis tools, Python code to split data sets and run MAP@k, where we use the Kaggle function: https://github.com/actionml/analysis-tools

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach to estimating the processor load is sensible, though: make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
Xcode has Instruments, which can measure the performance of each function/operation; you can simply use that.