What do you do if the sample size for an A/B test is larger than the population? - hypothesis-test

I have a list of 7337 customers (selected because they only had one booking from March-August 2018). We are going to contact them and are trying to test the impact of these activities on their sales. The idea is that contacting them will cause them to book more and increase the sales of this largely inactive group.
I have to setup an A/B test and am currently stuck on the sample size calculation.
Here's my sample data:
Data
The first column is their IDs and the second column is the total sales for this group for 2 weeks in January (i took 2 weeks as the customers in this group purchase very infrequently).
The metric I settled on was Revenue per customer (RPC = total revenue/total customer) so I can take into account both the number of orders and the average order value of the group.
The RPC for this group is $149,482.7/7337=$20.4
I'd like to be able to detect at least a 5% increase in this metric at 80% power and 5% significance level. First I calculated the effect size.
Standard Deviation of the data set = 153.9
Effect Size = (1.05*20.4-20.4)/153.9 = 0.0066
I then used the pwr package in R to calculate the sample size.
pwr.t.test(d=0.0066, sig.level=.05, power = .80, type = 'two.sample')
Two-sample t test power calculation
n = 360371.048
d = 0.0066
sig.level = 0.05
power = 0.8
alternative = two.sided
The sample size I am getting however is 360,371. This is larger than the size of my population (7337).
Does this mean I can not run my test at sufficient power? The only way I can determine to lower the sample size without compromising on significance or power is to increase the effect size to determine a minimum increase of 50% which would give me an n=3582.
That sounds like a pretty high impact and I'm not sure that high of an impact is reasonable to expect.
Does this mean I can't run an A/B test here to measure impact?

Related

Why am I getting a lot of 0 predictions in collaborative filtering using Alternating Least Squares?

Here the userCol are sellers and itemCol are products sold by them. The ratingCol is the sales amount for a given seller product combination. The sales amount varies between a few hundreds to millions.
Questions
Should I treat sales as an implicit feedback (implicitPrefs = True)? Currently I am treating it as an explicit feedback and setting it False
When I look at the predictions around 70 percent of the observations get a value of 0. Any reason why this is happening? The records having a non-zero value are pretty accurate though
When I get the predicted values for my test set, I see that the records returned are a lot lesser. Is it because the model is encountering users or items not present in the training set? I had set coldStartStrategy = "drop"
predictions = model.transform(test)
The length of predictions dataframe is significantly lesser than test dataframe

Using RANK function in Tableau on a Combined column. However, display individual columns as well

I am using RANK function in Tableau and I am displaying the Rank of calculated measure (Eg: 1 to 50)
The calculated Measure I have is Total Amount for Combined Periods.
When there is no Period displayed on the dashboard, the Total Amount is the sum of both periods and this is exactly what I want. I am good in this case.
However, When I want to display the Period in the Rows, the Total Amount changes to "Total Amount for Period 1 and Total Amount for Period 2".
How can I add a different axis to show Individual Periods as well as Rank of Total Amount for Combined Period?
I guess this might come down to Dual axis in Tableau and I believe this is not available yet and users are voting for this in Ideas.

Prometheus query quantile of pod memory usage performance

I'd like to get the 0.95 percentile memory usage of my pods from the last x time. However this query start to take too long if I use a 'big' (7 / 10d) range.
The query that i'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
Takes around 100s to complete
I removed extra namespace filters for brevity
What steps could I take to make this query more performant ? (except making the machine bigger)
I thought about calculating the 0.95 percentile every x time (let's say 30min) and label it p95_memory_usage and in the query use p95_memory_usage instead of container_memory_usage_bytes, so that i can reduce the amount of points the query has to go through.
However, would this not distort the values ?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
You could try to build a histogram of memory usage over time using recording rules, looking like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics) although doing it may be tedious. Something like:
- record: container_memory_usage_bytes_bucket
labels:
le: 100000.0
expr: |
container_memory_usage_bytes > bool 100000.0
+
(
container_memory_usage_bytes_bucket{le="100000.0"}
or ignoring(le)
container_memory_usage_bytes * 0
)
Repeat for all bucket sizes you're interested in, add _count and _sum metrics.
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it needs to take into account all the raw samples for all the container_memory_usage_bytes time series on the last 10 days. The number of samples to process can be quite big. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used for building a graph in Grafana (aka range query instead of instant query), then the number of raw samples returned from the sum(count_over_time(...)) must be multiplied by the number of points on Grafana graph, since Prometheus executes the quantile_over_time(...) individually per each point on the displayed graph. Usually Grafana requests around 1000 points for building smooth graph. So the number returned from sum(count_over_time(...)) must be multiplied by 1000 in order to estimate the number of raw samples Prometheus needs to process for building the quantile_over_time(...) graph. See more details in this article.
There are the following solutions for reducing query duration:
To add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process.
To reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
To use recording rules for calculating coarser-grained results.
To try using other Prometheus-compatible systems, which may process heavy queries at faster speed. Try, for example, VictoriaMetrics.

How can a Neural Network learn from testing outputs against external conditions which it can not directly control

In order to simplify the question and hopefully the answer I will provide a somewhat simplified version of what I am trying to do.
Setting up fixed conditions:
Max Oxygen volume permitted in room = 100,000 units
Target Oxygen volume to maintain in room = 100,000 units
Maximum Air processing cycles per sec == 3.0 cycles per second (min is 0.3)
Energy (watts) used per second is this formula : (100w * cycles_per_second)SQUARED
Maximum Oxygen Added to Air per "cycle" = 100 units (minimum 0 units)
1 person consumes 10 units of O2 per second
Max occupancy of room is 100 person (1 person is min)
inputs are processed every cycle and outputs can be changed each cycle - however if an output is fed back in as an input it could only affect the next cycle.
Lets say I have these inputs:
A. current oxygen in room (range: 0 to 1000 units for simplicity - could be normalized)
B. current occupancy in room (0 to 100 people at max capacity) OR/AND could be changed to total O2 used by all people in room per second (0 to 1000 units per second)
C. current cycles per second of air processing (0.3 to 3.0 cycles per second)
D. Current energy used (which is the above current cycles per second * 100 and then squared)
E. Current Oxygen added to air per cycle (0 to 100 units)
(possible outputs fed back in as inputs?):
F. previous change to cycles per second (+ or - 0.0 to 0.1 cycles per second)
G. previous cycles O2 units added per cycle (from 0 to 100 units per cycle)
H. previous change to current occupancy maximum (0 to 100 persons)
Here are the actions (outputs) my program can take:
Change cycles per second by increment/decrement of (0.0 to 0.1 cycles per second)
Change O2 units added per cycle (from 0 to 100 units per cycle)
Change current occupancy maximum (0 to 100 persons) - (basically allowing for forced occupancy reduction and then allowing it to normalize back to maximum)
The GOALS of the program are to maintain a homeostasis of :
as close to 100,000 units of O2 in room
do not allow room to drop to 0 units of O2 ever.
allows for current occupancy of up to 100 people per room for as long as possible without forcibly removing people (as O2 in room is depleted over time and nears 0 units people should be removed from room down to minimum and then allow maximum to recover back up to 100 as more and more 02 is added back to room)
and ideally use the minimum energy (watts) needed to maintain above two conditions. For instance if the room was down to 90,000 units of O2 and there are currently 10 people in the room (using 100 units per second of 02), then instead of running at 3.0 cycles per second (90 kw) and 100 units per second to replenish 300 units per second total (a surplus of 200 units over the 100 being consumed) over 50 seconds to replenish the deficit of 10,000 units for a total of 4500 kw used. - it would be more ideal to run at say 2.0 cycle per second (40 kw) which would produce 200 units per second (a surplus of 100 units over consumed units) for 100 seconds to replenish the deficit of 10,000 units and use a total of 4000 kw used.
NOTE: occupancy may fluctuate from second to second based on external factors that can not be controlled (lets say people are coming and going into the room at liberty). The only control the system has is to forcibly remove people from the room and/or prevent new people from coming into the room by changing the max capacity permitted at that next cycle in time (lets just say the system could do this). We don't want the system to impose a permanent reduction in capacity just because it can only support outputting enough O2 per second for 30 people running at full power. We have a large volume of available O2 and it would take a while before that was depleted to dangerous levels and would require the system to forcibly reduce capacity.
My question:
Can someone explain to me how I might configure this neural network so it can learn from each action (Cycle) it takes by monitoring for the desired results. My challenge here is that most articles I find on the topic assume that you know the correct output answer (ie: I know A, B, C, D, E inputs all are a specific value then Output 1 should be to increase by 0.1 cycles per second).
But what I want is to meet the conditions I laid out in the GOALS above. So each time the program does a cycle and lets say it decides to try increasing the cycles per second and the result is that available O2 is either declining by a lower amount than it was the previous cycle or it is now increasing back towards 100,000, then that output could be considered more correct than reducing cycles per second or maintaining current cycles per second. I am simplifying here since there are multiple variables that would create the "ideal" outcome - but I think I made the point of what I am after.
Code:
For this test exercise I am using a Swift library called Swift-AI (specifically the NeuralNet module of it : https://github.com/Swift-AI/NeuralNet
So if you want to tailor you response in relation to that library it would be helpful but not required. I am more just looking for the logic of how to setup the network and then configure it to do initial and iterative re-training of itself based on those conditions I listed above. I would assume at some point after enough cycles and different conditions it would have the appropriate weightings setup to handle any future condition and re-training would become less and less impactful.
This is a control problem, not a prediction problem, so you cannot just use a supervised learning algorithm. (As you noticed, you have no target values for learning directly via backpropagation.) You can still use a neural network (if you really insist). Have a look at reinforcement learning. But if you already know what happens to the oxygen level when you take an action like forcing people out, why would you learn such a simple facts by millions of evaluations with trial and error, instead of encoding it into a model?
I suggest to look at model predictive control. If nothing else, you should study how the problem is framed there. Or maybe even just plain old PID control. It seems really easy to make a good dynamical model of this process with few state variables.
You may have a few unknown parameters in that model that you need to learn "online". But a simple PID controller can already tolerate and compensate some amount of uncertainty. And it is much easier to fine-tune a few parameters than to learn the general cause-effect structure from scratch. It can be done, but it involves trying all possible actions. For all your algorithm knows, the best action might be to reduce the number of oxygen consumers to zero permanently by killing them, and then get a huge reward for maintaining the oxygen level with little energy. When the algorithm knows nothing about the problem, it will have to try everything out to discover the effect.

Alert in RAM/CPU Usage Detection in e-Commerce Server

Currently I'm building my monitoring services for my e-commerce Server, which mostly focus on CPU/RAM usage. It's likely Anomaly Detection on Timeseries data.
My approach is building LSTM Neural Network to predict next CPU/RAM value on chart trending and compare with STD (standard deviation) value multiply with some number (currently is 10)
But in real life conditions, it depends on many differents conditions, such as:
1- Maintainance Time (in this time "anomaly" is not "anomaly")
2- Sales time in day-off events, holidays, etc., RAM/CPU usages increase is normal, of courses
3- If percentages of CPU/RAM decrement are the same over 3 observations: 5 mins, 10 mins & 15 mins -> Anomaly. But if 5 mins decreased 50%, but 10 mins it didn't decrease too much (-5% ~ +5%) -> Not an "anomaly".
Currently I detect anomaly on formular likes this:
isAlert = (Diff5m >= 10 && Diff10m >= 15 && Diff30m >= 40)
where Diff is Different Percentage in Absolute value.
Unfortunately I don't save my "pure" data for building neural network, for example, when it detects anomaly, I modified that it is not an anomaly anymore.
I would like to add some attributes to my input for model, such as isMaintenance, isPromotion, isHoliday, etc. but sometimes it leads to overfitting.
I also want to my NN can adjust baseline over the time, for example, when my Service is more popular, etc.
There are any hints on these aims?
Thanks
I would say that an anomaly is an unusual outcome, i.e. a outcome that's not expected given the inputs. As you've figured out, there are a few variables that are expected to influence CPU and RAM usage. So why not feed those to the network? That's the whole point of Machine Learning. Your network will make a prediction of CPU usage, taking into account the sales volume, whether there is (or was) a maintenance window, etc.
Note that you probably don't need an isPromotion input if you include actual sales volumes. The former is a discrete input, and only captures a fraction of the information present in the totalSales input
Machine Learning definitely needs data. If you threw that away, you'll have to restart capturing it. As for adjusting the baseline, you can achieve that by overweighting recent input data.