How to debug A/A tests that are returning false positives in Google Optimize

We have run a number of A/A tests to validate that our Optimize setup is correct; however, we are seeing a large number of false positives (i.e. Optimize is reporting clear leaders despite there being no differences between the variants).
Experience Setup
We have now set up and run 6 experiences configured as A/A tests. Each experience is set up as an A/B test with 2 variants that have no differences, and a 50/50 traffic split across the variants. The primary objective is of type "events", and we've used a different event as the primary objective in each experience. Page targeting is set to all pages on the domain (WHEN URL starts with https://www.ourdomain.com), and audience targeting is set to all visitors. For every experience we ran diagnostics, and the Optimize installation was successfully validated.
Experience reporting
All 6 experiences have run to the point where Optimize is reporting either a clear leader or no clear leader found. Given that there are no differences between the variants, we would have expected most of the experiences to report inconclusive results (no clear leader). However, of the 6 experiences, 2 are reporting a clear leader based on the primary objective. That is a false positive rate of 33% - much higher than we would have expected.
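As a rough sanity check (this treats each A/A test as if it independently had about a 5% chance of falsely declaring a leader, which is only an approximation since Optimize's reporting is Bayesian rather than a fixed-alpha significance test), the chance of seeing 2 or more false leaders in 6 such tests is only about 3%:

# Back-of-the-envelope check: probability of >= 2 false "clear leaders" in 6
# independent A/A tests, assuming a ~5% per-test false positive rate.
from math import comb

n, alpha = 6, 0.05
p_two_or_more = sum(comb(n, k) * alpha**k * (1 - alpha)**(n - k) for k in range(2, n + 1))
print(f"P(>=2 false positives out of {n}) = {p_two_or_more:.3f}")  # ~0.033

So either we have been quite unlucky, or something in the setup (or in how the reports are being read) is biasing the results.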
How would we go about debugging what is causing this high rate of false positives?

Related

determine ideal number of workers and EC2 sizing for master

I have a requirement to use locust to simulate 20,000 (and higher) users in a 10 minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to each EC2 instance on AWS. My testing shows that with 20 workers across two EC2 instances, the CPU load is minimal; the master, however, suffers badly. A 4 CPU, 16 GB RAM system as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each locust master process has grown to 12 GB of virtual RAM. OK - so the EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the steps is to download a file from CloudFront, which is randomly selected in one of the tasks. This means that the more open connections to CloudFront are trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile device, and there are a lot of factors affecting the download speed for each device, I decided to switch from a GET request to a HEAD request. This allows me to test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other testing happening, since with bandwidth or system resource exhaustion every other test would be negatively impacted.
Using this approach I successfully executed a 10,000 user test in a ten minute run-time. I used 4 EC2 T2.xlarge instances with 4 workers per T2. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
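For illustration, here is a minimal sketch of the HEAD-request approach described above (the CloudFront host, file paths and task structure are hypothetical placeholders, not the actual test plan):

# Sketch of the HEAD-request approach: measure CloudFront/Lambda@Edge response
# time without downloading the file body. Host and paths are placeholders.
import random
from locust import HttpUser, SequentialTaskSet, task, between

FILES = ["/assets/file-001.bin", "/assets/file-002.bin", "/assets/file-003.bin"]

class ApiFlow(SequentialTaskSet):
    @task
    def check_cloudfront_latency(self):
        # HEAD instead of GET: avoids saturating load-generator bandwidth,
        # which would otherwise skew every other running test.
        self.client.head(random.choice(FILES), name="cloudfront HEAD")

class MobileUser(HttpUser):
    host = "https://dxxxxxxxxxxxx.cloudfront.net"  # placeholder distribution
    wait_time = between(1, 3)
    tasks = [ApiFlow]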
The answer for the question in the title is: "It depends"
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other locust tests).
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html); a minimal sketch follows this list
Monitor your network usage (if your load gens are already maxing out their bandwidth or CPU, then your test is very unrealistic anyway, and adding more users just adds to the noise. In general, start low and work your way up)
Increase the number of loadgens
In general, the number of users is not an issue for locust, but number of requests per second or bandwidth might be.
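As a quick illustration of the FastHttpUser switch from the list above, here is a minimal sketch (the endpoint path is a hypothetical placeholder; FastHttpUser keeps the same task and client API but uses a much lighter HTTP client, per the linked docs):

# Sketch of switching a user class to FastHttpUser. Endpoint is a placeholder.
from locust import FastHttpUser, task, between

class ApiUser(FastHttpUser):  # was: class ApiUser(HttpUser)
    wait_time = between(1, 3)

    @task
    def get_profile(self):
        # Same client calls as with HttpUser (get/post/head/...), but far
        # lower CPU cost per request on the load generator.
        self.client.get("/api/v1/profile", name="profile")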

How to troubleshoot and reduce communication overhead on Rockwell ControlLogix

Need help. We have a PLC whose CPU keeps getting maxed out. We've already upgraded it once; now we need to work on optimizing it.
We have over 50 outgoing MSG instructions, 60 incoming, and 103 Ethernet devices (flow meters, drives, etc.). I've gone through and tried to make sure that everything that can be cached is cached, that only instructions that are currently needed are running, and that communications to the same PLC happen in the same scan, but I haven't made a dent.
I'm having trouble identifying which instructions are significant. It seems the connections get consolidated, so lots of MSGs shouldn't be too big of a problem. I'm considering Produced & Consumed tags, but our team isn't very familiar with them, and I believe you have to do a download to modify them, which is a problem. Our I/O module RPIs are all set to around 200 ms (raised from 5 ms), but that didn't seem to make a difference.
We have a shutdown this weekend and I plan on disabling everything and turning it back on one part at a time to see where the load is really coming from.
Does anyone have any suggestions? The Task Monitor doesn't have a lot of detail that I can understand, i.e. it's either too summarized or too instantaneous for me to make heads or tails of it. Here are a couple of screens from the Task Monitor to shed some light on what I'm seeing.
The first question that comes to mind is: are you using the Continuous Task, or is everything in Periodic tasks?
I had a similar issue many years ago with a CLX. Rockwell suggested increasing the System Overhead Time Slice to around 40 to 50%. The default is 20%.
Some details:
Look at the System Overhead Time Slice (on the Advanced tab under Controller Properties). The default is 20%. This determines the time the controller spends running its background tasks (communications, messaging, ASCII) relative to running your continuous task.
From Rockwell:
For example, at 25%, your continuous task accrues 3 ms of run time. Then the background tasks can accrue up to 1 ms of run time, then the cycle repeats. Note that the allotted time is interrupted, but not reduced, by higher priority tasks (motion, user periodic or event tasks).
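To make the arithmetic explicit (this is just the ratio implied by the quoted example, not an official Rockwell formula):

# Ratio implied by the example above: per 1 ms of background (comms) time,
# the continuous task runs for (100 - SOTS) / SOTS milliseconds.
def continuous_ms_per_1ms_overhead(sots_percent: float) -> float:
    return (100.0 - sots_percent) / sots_percent

print(continuous_ms_per_1ms_overhead(25))  # 3.0 ms, matching the quoted example
print(continuous_ms_per_1ms_overhead(20))  # 4.0 ms at the 20% default
print(continuous_ms_per_1ms_overhead(50))  # 1.0 ms - comms gets as much time as the continuous task

So raising the time slice from the default 20% to 40-50% roughly doubles the share of controller time available for communications.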
Here is a detailed Word Doc from Rockwell:
https://rockwellautomation.custhelp.com/ci/fattach/get/162759/&ved=2ahUKEwiy88qq0IjeAhUO3lQKHf01DYcQFjADegQIAxAB&usg=AOvVaw125pgiSor_bf-BpNSvNVF8
And here is a detailed KB from Rockwell:
https://rockwellautomation.custhelp.com/app/answers/detail/a_id/42964

How many parallel processes?

I am running some code in parallel by using a forking module in Perl called Parallel::ForkManager. I have currently set the maximum number of processes to 30:
my $pm = Parallel::ForkManager->new(30);
What would be an advisable maximum number of processes to create? I am doing this on a commercial grade Solaris server, but I still don't want to overload the system.
In downloading files, this really depends on
how many different hosts you're downloading from, and
how fast they will give you the requested files compared to your maximum bandwidth.
If you're downloading files from a single machine to a single machine on a local network, 2-3 is about max. If you're downloading files from 30 different servers on the internet, all of which are slow, but you have a fat pipe, then 30 might be reasonable.
There is no one universal right answer here. Unless you count "it depends."
The purpose of "downloading files" was mentioned, but only in comments a while ago, so I take the question as stated, and as more general than that.
The only relevant measure is when you start reaching saturation in performance gains with the particular software on that system. The formal limits are huge and meaningless, while rules of thumb are very general.
Let's imagine that we run 10 processes and the time to complete the job drops by a factor of 10. Increase to 20 processes and the time drops by a factor of 20 -- but with 30 processes the gain falls back to a factor of 10. At this point we have loaded the system. Push further and the performance will degrade rapidly, and for everyone. At that point the server is overloaded, even though it allows, say, 1024 processes per user (and really ten or more times that for a server).
With a few processes per core the machine is engaged, and I'd say that is a good rule of thumb. However, it is too general. I doubt that you'd gain much in performance by going to that many processes, given the many other factors that affect it.
Accessing one web server
The server's capability is the gospel. They may have posted how many requests per second they are happy with, or they may have a limit on the number of processes per user, say 10 or 20. If that means that many simultaneous downloads, then that's your limit. But I'd be careful -- if the site is close and fast, a request may complete in as little as 0.1 or 0.2 seconds. Then, with 10 processes, you may be hitting the server 100 times a second. I do not recommend that. If there is no information, I'd say keep it to a few requests per second. The performance and server load also depend on the content -- big downloads are different from pulling many skinny web pages. The I/O on your side may matter, but I'd expect the server to set the limit. If you are going to use their service a lot, why not send an email and ask what they are OK with.
I/O, network (many servers) or disk
With a network, the performance depends on every piece of hardware in the path as well as on software. Nobody can tell without trying it out. Disk I/O is very complex. To add to the trouble, it is unclear whether it'd be your disks or the network that is the bottleneck. I'd expect clear performance gains up to a few tens of processes, and probably fewer.
CPU or memory bound
This may be the easiest case -- processing that can be broken up in parallel across 30 cores can enjoy close to a factor-of-30 speedup (given no other bottlenecks). Going beyond the number of cores clearly leads to reduced performance gains. Concurrent (but not parallel) processing is far more complicated. If your code is memory-intensive, that is yet another matter entirely.
Useful basic tools for assessing the above components are iostat -xzn, netstat -I, and vmstat. But there is a bit of a learning curve to interpreting their output, and hopefully it doesn't come to that.
The conclusion is that you have to time it. Take your real application and time it running in one process. Do this 3 to 5 times and take the average (throwing away obvious outliers). Then repeat with 5 processes, then with 10, etc. I'd expect the trend to start slowing down far sooner than at the 30 processes you mention. Once it gets to that point the system is loaded, and whoever else works on it will notice. Very soon after that, the performance will likely degrade rapidly. Proper benchmarking tools, like the Benchmark module, are far more sophisticated, but this may well settle the issue. If you see strange or inconsistent behavior, you may have to dig into details, starting with the tools mentioned above.
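To illustrate that timing procedure (the question is about Perl's Parallel::ForkManager, but the measurement loop is language-agnostic; this sketch uses Python's multiprocessing purely as an illustration, with a placeholder workload):

# Sketch of the timing approach: run the same workload at increasing process
# counts and watch where the speedup curve flattens. Workload is a placeholder.
import time
from multiprocessing import Pool

def work(item):
    # Stand-in for one unit of the real job (download, parse, compute, ...)
    total = 0
    for i in range(200_000):
        total += i * item
    return total

def timed_run(n_procs, items, repeats=3):
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        with Pool(processes=n_procs) as pool:
            pool.map(work, items)
        runs.append(time.perf_counter() - start)
    return sum(runs) / len(runs)

if __name__ == "__main__":
    items = list(range(600))
    baseline = timed_run(1, items)
    for n in (5, 10, 20, 30):
        t = timed_run(n, items)
        print(f"{n:>2} processes: {t:6.2f}s  speedup x{baseline / t:.1f}")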
What "overloaded" means is a bit unclear. I like to cap my use of resources well before other people are affected. But it may be possible to push it, in particular if you can run when it's quiet. I doubt that you'll keep having a worthy gain all the way to the number of available processors.
So there is no concern about "overloading" the server if you first time things. The performance limit will tell you when to stop. I'd say that your limit of 30 is very reasonable. Unless this is really about downloading files, in which case the web server is likely all that matters.
You should set the maximum number of processes to 60.

AWS EC2 Spot Instances fit for edge services?

Spot instances can randomly get shut down by Amazon. Does this mean that they would not work well as edge services (e.g. REST services)? Using an Elastic Load Balancer (ELB) plus some persistent EC2 nodes (plus the spot instances), would this work well if the client retried a few times upon failure? Or could they get numerous 404s, even with a few retries?
You will see a little bit of an impact if you decide to use spot instances in this scenario. The key will be getting the load balancer to recognize quickly that an instance is out of service. Also, not using sticky sessions will reduce the chance that clients see repeated 504 (Gateway Timeout) errors.
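As an illustration of tightening the health check, here is a sketch using boto3 against a classic ELB (the load balancer name, health check target and thresholds are placeholder choices, not recommendations from the answer above):

# Sketch: tighten a classic ELB health check so a reclaimed spot instance is
# taken out of rotation quickly. Names and thresholds are illustrative only.
import boto3

elb = boto3.client("elb")
elb.configure_health_check(
    LoadBalancerName="edge-rest-elb",    # placeholder load balancer name
    HealthCheck={
        "Target": "HTTP:80/healthz",     # placeholder health endpoint
        "Interval": 10,                  # check every 10 seconds
        "Timeout": 5,
        "UnhealthyThreshold": 2,         # ~20 s to mark an instance down
        "HealthyThreshold": 3,
    },
)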
Spot instances are a bit tricky to grok. On one hand they can give you compute power for a very low price, but on the other hand you might lose these instances with minimal notice.
One thing you can do is give a "max bid" that represents the risk of losing the instances, and not only the price you are willing to pay. Since you are not paying your bid price, but only the market price (for as long as the market price stays below your max bid), most of the time you will pay a lower price than your max bid. For example, if you bid 90% of the on-demand (OD) price, you will most likely pay much less than that (for example, 30% of the on-demand price) on average over a week- or month-long run. You can even consider giving a max bid which is higher than on-demand (up to 4 times the OD price), and still on average pay much less than the OD price.
It is best to analyze the spot prices for the last 3 months, which are provided by the API, and check the behaviour of the market price for the different instance types in the different regions and availability zones.
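For example, a sketch of pulling that history with boto3 (the instance types, region and product description are illustrative choices; the API returns roughly the last 90 days):

# Sketch: pull ~90 days of spot price history for a couple of instance types.
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
paginator = ec2.get_paginator("describe_spot_price_history")

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=90)

for page in paginator.paginate(
    InstanceTypes=["m5.large", "c5.large"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=start,
    EndTime=end,
):
    for price in page["SpotPriceHistory"]:
        print(price["AvailabilityZone"], price["InstanceType"],
              price["SpotPrice"], price["Timestamp"])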
Another option you can consider is running 2 Auto Scaling groups (ASGs): one that scales (or heals) your spot-based instances, and one that works with on-demand instances. The latter will be slower to kick in, and will come into play only when the spot-based group is unavailable due to higher market prices.

StarCounter and CAP

I have been reading about a database named Starcounter. It claims that it can handle loads that otherwise only a "NoSQL" database can handle, without dropping consistency. As far as I understand the CAP theorem, if you keep consistency, you lose availability or partition tolerance. So what trick makes StarCounter work?
I can imagine that StarCounter is fast, but the claim that NoSQL needs to drop consistency to keep up seems a little bit strange to me. Can anyone please explain?
Thanks in advance
Roland
The short answer
The CAP theorem (aka Brewer's theorem) cannot be beaten for a single piece of information (like a consistent database). If you have a horizontally scaled database, you won't get both consistency and performance. This conclusion comes from the laws of physics and can be deduced from Brewer's theorem and Einstein's theories of relativity. You need to scale in/up, not out. Not very "cloudy", but as the enemies of Galileo would probably confess if they were alive today, nature does a poor job at honouring human fashion.
Scaling consistent data
I'm sure there are other approaches, but Starcounter works by hosting the database image in RAM. Instead of moving database data to the application code, parts of the application code are moved to the database. Only data in the final response gets moved from its original place in RAM (where the data was in the first place). This keeps most of the data in place even when there are millions of requests processed every second. The downside is that the database needs to know the programming language of your application logic. The upside, however, is obvious if you have ever tried to serve millions of HTTP requests per second, each requiring extensive database access.
A more thorough answer
The question is a good one. It is no wonder you find it strange, as it was only a few years back that CAP was proven (turned into a theorem). Many developers are as disappointed as a kid would be when a theoretical physicist tells him to stop looking for the perpetual motion machine because it cannot work. We still want the scale-out consistent database, don't we?
The CAP theorem
The CAP theorem says that any piece of information cannot have consistency (C), availability (A) and partition tolerance (P) all at once. It applies to a unit of information (such as a database). You can of course have independent pieces of information that operate differently: one piece could be AP, another could be CA and a third could be CP. You just can't have the same information being CAP.
The impossibility of the 'P' in a consistent and available database comes down to how a scaled-out database MUST do signalling between the nodes. The conclusion must be that, even a hundred years from now, CAP dictates that a single piece of consistent data will have to live on hardware interconnected by hard wires or light beams.
The problem with the P in CAP
The problem lies in performance if you apply horizontal scaling to an available, consistent database. Since good performance was the very reason to scale horizontally in the first place, this is a very bad thing. As every node needs to communicate with the other nodes whenever there is database access in order to achieve consistency, and given that signalling is ultimately limited by the speed of light, you are left with the sad but true fact that database scientists (as well as CPU scientists) are not just being stubborn when they fail to see scale-out as a magical silver bullet. It will not happen because it cannot happen (however, parts of your database could be placed in an AP set, so remember, we are talking about consistent data here). Add Einstein's theories to the CAP theorem, and the small box beats the cloudy data-center for consistent data.
Perpetual machines and CAP
The state of things in the database community is a little bit like the state of perpetual motion machines when horse and carriage was the way to get to work. Without any theoretical evidence against them, the patent offices granted hundreds of patents for impossible perpetual machines. Today we may laugh at this, but we have a similar situation in the database industry with consistent scale-out databases. When you hear somebody claim that they have a scale-out ACID database, be cautious. It was only after the dot-com crash that mathematicians at MIT proved Brewer right and the CAP theorem was officially born, so the hunt for the impossible has unfortunately not died off just yet. You can compare this, if you want, to the way laggards kept trying to invent the perpetual motion machine for years after modern theoretical physics should reasonably have put a stop to it. Old habits die hard (my apologies to anyone on Stack Overflow still making drawings of bearings and arms moving ad infinitum of their own accord - I don't mean to be offensive).
CAP and performance
All is not lost, however. Not all pieces of information need to be consistent. Not all pieces need to scale out. You just have to accept Brewer's theorem and make the best of it.
For applications such as Facebook, consistency is dropped. This is okay, as data is entered once and then manipulated by a single user. Still, we can experience the side effects in everyday Facebook usage, such as things popping in and out of existence for a while.
However, in most business applications, data needs to be correct. The sum of all accounts in your bookkeeping needs to amount to zero. Your stock inventory must equal 8 if you sold 2 out of 10 items, even if there are multiple users buying from the same stock.
The problem with scaling out available data is that you have to make do without partition tolerance. This fancy word simply means that you have to signal between the nodes in your cloud at all times. And as it takes light a few nanoseconds to travel a single meter, this becomes impossible without your scale-out resulting in less performance rather than more. Of course, this is only true for consistent data. The implications of this have been known to the engineers of Intel, AMD, Oracle et al. for a long time. It is not that their scientists haven't heard of scale-out; it is just that they have come to accept the world as Einstein described it.
Some comfort in the gloom
If you do the math, you find that a single PC has instructions to spare for each human being living on Earth for every second it is running (google 'modern CPU' and 'MIPS'). If you do some more math, like taking the total turnover of Amazon.com (you can find it at www.nasdaq.com) divided by the price of an average book, you will find that the total number of sales transactions can fit in the RAM of a single modern PC. The cool thing is that the number of items, customers, orders, products etc. occupies the same amount of space in 2012 as it did in 1950. Images, video and audio have increased in size, but numeric and textual information does not grow per item. Sure, the number of transactions grows, but not at the same pace as computer power grows. So the logical solution is to scale out read-only and AP data and "scale in/up" business data.
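To make the first back-of-the-envelope figure concrete (both numbers below are rough assumptions for illustration, not measurements):

# Rough illustration of the "instructions to spare per human per second" claim.
# Both figures are loose assumptions chosen purely for illustration.
cpu_instructions_per_second = 200_000 * 1_000_000  # ~200,000 MIPS for a modern multi-core CPU
world_population = 7_000_000_000                   # ~7 billion people (the answer's 2012 framing)

per_person = cpu_instructions_per_second / world_population
print(f"~{per_person:.0f} CPU instructions per person per second")  # a few dozen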
"Scale-in" instead of "scale-out"
Database engines and business logic running in a VM (like the Java VM or the .NET CLR) typically use fairly efficient machine code. This means that moving memory is the overshadowing bottleneck of total throughput for a consistent database. This is often referred to as the memory wall (Wikipedia has some useful information).
The trick is to transfer code to the database image instead of transferring data from the database image to the code (if using an MVC or MVVM pattern). This means that the consuming code executes in the same address space as the database image and that data is never moved (the disk merely secures transactions and images). Data can stay in the original database image and does not have to be copied into the memory of the application. Instead of treating the database as a RAM database, the database is treated as primary memory. Everything stays put.
Only data that is part of the final user response is moved out of the database image. For a large-scale application with hundreds of millions of simultaneous users this typically amounts to only a few million requests per second, something that a single PC has no problem handling, given that the HTTP packaging is done on gateway servers. Fortunately, such servers scale out beautifully as they don't need to share data.
As it turns out, the disk is fast at sequential writes, so a RAIDed disk array can persist terabytes of changes every minute.
Horizontal scaling in Starcounter
Normally you do not scale out a Starcounter node; it scales in rather than out. This works well for a few million simultaneous users. To go above that, you need to add more Starcounter nodes. They can be used to partition data (but then you lose consistency, and Starcounter is not designed for partitioning, so it is less elegant than solutions such as VoltDB). So a better alternative is to use the additional Starcounter nodes as gateway servers. These servers simply accumulate all incoming HTTP requests for a millisecond at a time. This might sound like a short amount of time, but it is enough to accumulate thousands of requests if you have decided you need to scale Starcounter. The batches of requests are then sent to the ZLATAN node (Zero LATency Atomicity Node) a thousand times a second. Each such batch can contain thousands of requests. In this way, a few hundred million user sessions can be served by a single ZLATAN node. Although you can have several ZLATAN nodes, only one ZLATAN node is active at a time. This is how the CAP theorem is honoured. To go above that, you need to consider the same tradeoff as Facebook and others.
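As a schematic illustration of that gateway micro-batching idea (this is not Starcounter's actual implementation; every name and number below is hypothetical):

# Toy gateway loop: accumulate incoming requests for ~1 ms, then forward the
# whole batch to the single active writer node. Purely illustrative.
import queue
import threading
import time

incoming = queue.Queue()

def forward_batch(batch):
    # Placeholder for one network round trip to the single writer node.
    print(f"forwarding batch of {len(batch)} requests")

def gateway_loop(batch_window_s=0.001):
    while True:
        deadline = time.perf_counter() + batch_window_s
        batch = []
        while time.perf_counter() < deadline:
            try:
                remaining = max(0.0, deadline - time.perf_counter())
                batch.append(incoming.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            forward_batch(batch)
        else:
            time.sleep(batch_window_s)  # idle: wait one window before checking again

threading.Thread(target=gateway_loop, daemon=True).start()

# Simulate a burst of incoming requests:
for i in range(5000):
    incoming.put({"req_id": i})
time.sleep(0.1)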
Another important note is that the ZLATAN node does not serve applications with data. Instead, the application's controller code is run by the ZLATAN node. The cost of serializing/deserializing data and sending it to an application is far greater than the cost of processing the controller logic cycles. I.e. the code is sent to the database instead of the other way around (the traditional approach is that the application asks for data or sends data).
Making the "shared-everything" node faster by doing less
Using the database as a "heap" for the programming language, instead of as a remote system for serialization and deserialization, is a trick that Starcounter calls VMDBMS. If the database is in RAM, you should not move data from one place in RAM to another place in RAM, which is what happens with most RAM databases.
There is no 'trick'. Starcounter is talking about speed, while CAP/NoSQL are talking about scalability. There is a trade-off between features+scalability vs speed.
Sometimes it's OK to ignore scalability if you can prove there are bottlenecks elsewhere. For instance, a new startup shouldn't worry about their website scaling to a million users; they should worry about getting their first hundred users. (Does anyone remember how often Twitter was down in the early days?) Starcounter can be useful if its transaction rate is much greater than your web page hit rate.
On the other hand, I don't trust anyone who lumps all "NoSQL" databases together. The various NoSQL databases are more different than alike. They have radically different architectures and properties. Some of them scale to thousands of nodes, some of them don't scale beyond one node. Sometimes adding scalability slows you down. Sometimes removing features speeds you up.
http://strata.oreilly.com/2010/12/strata-gems-mysql-handlersocket.html