Surprising results with Google Optimize AB - test (variants without any differences) - ab-testing

I am running an AB-test to see how much "noise" there is on results between test variants. AB - test is active on all pages and both of test variants (A and B) are similar. So there is no difference visible for customers. That being said, I would assume that differences between variant conversion rates on reports would have been smaller.
If we take a look at the results on attachment (here) , it shows that calculated conversion percent is ~
10% higher on variant A. This difference would be significant in Ecommerce and would be worth actions. However, there is nothing different between variants.
So my question is that how reliable Google Optimize results are as it shows big difference even for identical variants? Or am I missing something?
I have been trying to find an answer to this from Google Optimize's official documentation but without luck.

Related

Google optimize A/B testing weightage not working properly

I am using Google optimize A/B testing on my website, I've kept 3 different variants for the same, I am dividing all the weightage among these new variants & giving 0 weightage to my original variant, still, sometimes I'm getting the original page.
Does anyone have any idea why is this happening?
It could be due the target rules
An example would be if your page is visible as both www.example.io/original.html and example.io/original.html
Also, you need to take about the session if you are managing any,
If you change the weights after the experiment starts (i.e. you set
the Variant 1 to 100%) then all the returning visitors that
were assigned to another variant or the original will remain there and
see that.
Read more at. : https://support.google.com/optimize/thread/7450936/in-weighting-does-100-really-mean-100?hl=en

Why Google Analytics API v3 is triggering ALWAYS sampling at 50%?

I have build a very simple crawler for Google Analytics (v3) and it used to work well until this week that I started to get sampled data in all queries.
I used to overcome sampling by simply reducing the date range of the queries, but now I get 50% of all sessions (aprox.), even for sample spaces of less than 100 sessions.
It seems like that something is triggering sampling, but I cannot realize what can be. Anyone has suffered similar issues?
EDITED
We are also suffering sampling when querying the "Users Overview" standard report from GA web interface (along with others), even when there are only 883 sessions and we are asking for a single day.
A sample query is below, where we are querying several metrics over 3 dimensions, with a sample size of 883 sessions and a sampling or around 50% (query URL is cropped, but parameters are listed on "query" key).
It seems that the reason could be related with querying ga:users metric with several dimensions, including ga:appId.
I have tried different combinations and only ga:users is returning sampled data when queried with more dimensions than ga:date.
In summary, if I query any other metric from the example with the same 3 dimensions it returns full space data.
Two weeks ago this was not happening, so I suppose that Google has changed the way ga:users is computed recently.
Moreover, as a side-effect I realized that querying users on batches is somehow misleading if you plan to compute the total number of users, because you cannot simply sum them. That is, ga:users is similar to ga:1dayUsers when queried with ga:date, and then you cannot aggregate data. Also weird is the fact that you cannot use ga:appId with ga:1dayUsers, but you can with ga:users.
We have also detected another problem after discarding ga:users in crawler. The issue is related with segment parameter, that it is also triggering sampling when used in combination with the remaining metrics and dimensions.
We collect data from several apps in the same view (not recommendable, but it is there for legacy reasons). Therefore we use a segment defined on-the-fly like "sessions::condition::ga:appId=#com.xxx.yyy.zzz".
The fact is that when we filter that way we suffer sampling, but if we use a common filter like "ga:appId=com.xxx.yyy.zzz" we do not get sampled results.
Probably the question is why we use the segment-based filter instead of standard filter, and the reason is because we need it for some specific metrics like ga:7dayUsers and related, which cannot be combined with ga:appId as dimension and so you cannot either use ga:appId in filters. Confusingly, for those metrics, when we use the segment-based filter we do not get sampled results.
Now it seems that all our API calls are returning real data.
Not sure yet however, why a default report in web interface like "Users Overview" is returning sampled data for a single day with less than 1000 sessions.
Hope this information could help someone else if having similar issues with sampling.

How to break up large document into smaller answer units on Retrieve and Rank?

I am still very new to Retrieve and Rank, and Document Conversion services, so I have been playing around with that lately.
I encountered a problem where when I upload a large document (100+ pages) - Retrieve and Rank would help me automatically break it up into answer units, which is great and helpful.
However, some questions only require ONE small line in the big chunks of answer units, is there a way that I can manually break further down the answer units that Retrieve and Rank service has provided me?
I heard that you can do it through JavaScript, but is there a way to do it through the UI?
I am contemplating to manually break up the huge doc into multiple smaller documents, but that could potentially lead to 100s of them - which is probably the last option that I'd resort to.
Any help or suggestions is greatly appreciated!
Thank you all!
First off, one clarification:
Retrieve and Rank does not break up your documents into answer units. That is something that the Document Conversion Service does when your conversion target is ANSWER_UNITS.
Regarding your question:
I don't fully understand exactly what you're trying to do, but if the answer units that are produced by default don't meet your requirements, you can customize different steps of the conversion process to adjust the produced answer units. Take a look at the documentation here.
Specifically, you want to make sure that the heading levels (for Word, PDF or HTML, depending on your document type) are defined in a way that
they detect the start of each answer unit. Then, make sure that the heading levels that you defined (h1, h2, h3, etc.) are included in the selector_tags list within the answer_units section.
Once your custom Document Conversion Service configuration produces the answer units you are looking for, you will be ready to send them to Retrieve and Rank to be indexed.

Tool or technique to compare and group diffs by similarity

I have developed a system that allows visitors to submit typo corrections for my blog. It works by having a small client-side app which then sends unified diffs to a server. Behind that, I have an interface which allows me to see all diffs in a nice graphical way, sort them, etc.
However I am thinking that as time passes, many visitors will submit corrections for the same things before I have time to fix them. So I would need a way to group similar or identical diffs together.
Identical diffs are easy enough. But there might be people who fix errors differently, e.g. using American or British spellings, different rules for punctuation, varying understandings of unclear phrases, that kind of thing. Grouping similar diffs would be tremendously helpful.
Are there techniques, algorithms, or tools that are specifically designed or can be used to compute the similarity of diffs?
I believe that you have two problems to solve: 1. recognizing fixes for the same text (e.g. same typo location), 2. potentially remove those with the same or nearly equal solutions and at least group all the patches that are related to that location.
Problem 1. The unified diff format is somewhat OK as it gives the lines, but a word level or character level diff (for example, counting each word as a line as wdiff does) might be more precise and help you group more precisely the patches.
Problem 2. if the patches are identical, as you noted it is trivial, if they are different, solving the problem 1 already did much of the work. You can of course use a normalization such as "inflected word parts removal" (removing 's', 'ing' and so on at end of words for example) or "lower casing" before the comparison the replacements part in the unified diffs, thus helping group together nearly identical solutions.
The problem 1 is the problem paused by integration or merge of patches. Problem 2 is more relevant to your particular case.
Maybe you could adopt the Damerau-Levenshtein algorithm. It is used to calculate the distance between two strings.

machine learning and code generator from strings

The problem: Given a set of hand categorized strings (or a set of ordered vectors of strings) generate a categorize function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download, install and go kind of things, as opposed to to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at; the input I'm wanting to filter is computer generated status strings pulled from logs. Error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
Document Classification
Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document document classification organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs a 'spam' and 'not spam', you could do other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.