User-written program winsor by Nick Cox: determining p - command

I would like to use the winsor command written by Nick Cox. According to this Statalist post, http://www.stata.com/statalist/archive/2011-09/msg01340.html, the author states that the usual percentiles for winsorizing are 1/99 and 5/95.
Am I correct that p(0.1) corresponds to 1/99 percentile winsorizing? Or is it p(0.01)?
The latter seems more intuitive; however, p(0.5) would then collapse every value to the median, which makes no sense in this case.
Thank you very much.
EDIT: I am sorry, I tried p(0.5) and it does not work. Therefore, I guess p(0.01) corresponds to 1/99 and p(0.05) to 5/95 winsorizing.
EDIT2: I am sorry for the misunderstanding. I had misinterpreted the author's procedure for handling outliers (drawing box plots in order to identify points beyond the 1/99 or 5/95 percentiles).

p(0.1) corresponds to winsorising at the 10th and 90th percentiles:
. sysuse auto
(1978 Automobile Data)

. sum price, detail

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188

. winsor price, p(0.1) gen(wp)

. sum wp, detail

                 price, Winsorized fraction .1
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3895           3895
 5%         3895           3895
10%         3895           3895       Obs                  74
25%         4195           3895       Sum of Wgt.          74

50%       5006.5                      Mean           5997.432
                        Largest       Std. Dev.      2434.708
75%         6342          11385
90%        11385          11385       Variance        5927804
95%        11385          11385       Skewness       1.294202
99%        11385          11385       Kurtosis        3.29362
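
For comparison, here is a minimal Python sketch of the same idea using SciPy's winsorize (my own illustration, not part of the original thread; Stata's winsor may differ in how it handles ties at the cut points):

import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical price data standing in for auto.dta's price variable.
prices = np.array([3291, 3299, 3667, 3748, 3895,
                   5006, 11385, 13466, 14500, 15906])

# limits=(0.1, 0.1) replaces the lowest and highest 10% of values with the
# values at the 10th and 90th percentile positions, like winsor's p(0.1).
wp = winsorize(prices, limits=(0.1, 0.1))
print(wp)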


Locust 95 percentile is higher than max

Sometimes when I run Locust for some scenarios, the 95th percentile value is higher than the max. As far as I understand, the 95th percentile means that 95% of requests took less time than this value. So how can the max be lower than the 95th percentile? Am I doing something wrong here?
I also found that this only happens when there is a very small number of requests, like 15 or fewer.
Percentiles are approximated in Locust.
This is done for performance reasons, as calculating an exact percentile would need to consider every sample (and doing this continuously for large runs would just not work).
Min, max and average (mean) are accurate, though.
And in longer runs (more than those 15 requests) the 95th percentile should not exceed your max.
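
As a toy illustration (my own sketch in Python, not Locust's actual implementation), bucketing response times to two significant figures before computing a percentile can push the reported percentile above the true max when there are only a few samples:

import math

def round_to_sig(value, digits=2):
    # Round a response time to a fixed number of significant figures,
    # mimicking the kind of bucketing an approximate percentile might use.
    if value == 0:
        return 0
    shift = int(math.floor(math.log10(abs(value)))) - (digits - 1)
    return round(value, -shift)

# 15 hypothetical response times in milliseconds; the true max is 115 ms.
samples = [101, 102, 103, 104, 105, 106, 107, 108,
           109, 110, 111, 112, 113, 114, 115]

buckets = sorted(round_to_sig(s) for s in samples)
p95 = buckets[math.ceil(0.95 * len(buckets)) - 1]
print(max(samples), p95)  # 115 vs 120: the approximate p95 exceeds the max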

Does DBI (data bus inversion) conserve Entropy?

I have been reading up on DBI on Wikipedia, which references this research paper: http://www.cs.columbia.edu/~cs4823/handouts/stan-burleson-tvlsi-95.pdf
The paper says:
While the maximum number of transitions is reduced by half, the decrease in the average number of transitions is not as good. For an 8-bit bus, for example, the average number of transitions per time-slot by using the Bus-invert coding becomes 3.27 (instead of 4), or 0.41 (instead of 0.5) transitions per bus-line per time-slot.
However, this would suggest it reduces the entropy of the 8-bit message, no?
The entropy of a random 8-bit message is 8 bits. Adding a DBI bit shifts the transition-probability distribution to the left, but (I thought) it shouldn't reduce the area under the curve: you should still be left with a minimum of 8 bits of entropy, just spread over 9 lines. But they claim the average is now 0.41 instead of 0.5, which suggests the entropy is now -log2(0.59^9) ≈ 6.85 bits. I would have assumed the average would at best become 0.46 (-log2(0.54^9) ≈ 8 bits).
Am I misunderstanding something?
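
For reference, the quoted 3.27 can be reproduced by brute force under a simplified model (my own sketch; it assumes uniformly random data and that the invert line currently sits at 0, so the cost of a transfer is min(k, 9 - k) for k toggling data lines):

from math import comb

# For random 8-bit data, the number of data lines that would toggle is k,
# with k ~ Binomial(8, 1/2). Bus-invert sends the complement whenever that
# is cheaper, at the cost of also toggling the 9th (invert) line, so the
# cost of a transfer is min(k, 9 - k) in this simplified model.
total = sum(comb(8, k) * min(k, 9 - k) for k in range(9))
avg = total / 2**8
print(avg)       # 3.26953125, i.e. the paper's 3.27
print(avg / 8)   # ~0.41 transitions per bus line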

What is the difference between Coefficient of Regression and Elasticity

I am studying elasticity of demand and how to get the optimal price from elasticity using regression. I have read R-bloggers and Medium posts to understand the concepts, but I still have a doubt. Say I have a linear equation as below:
Sales of Eggs = 137.37 - (16.12)Price.Eggs + 4.15(Ad.Type) - (8.71)Price.Cookies
Mean of Price.Eggs = 4.43,
Mean of Price.Cookies = 4.37,
Mean of Sales of Eggs = 30
From the equation we can deduce that a unit increase in the price of cookies decreases egg sales by 8.71, and a unit increase in the price of eggs decreases them by 16.12.
But in the case of elasticity, we calculate the formula and get an elasticity of -2.38 for the price of eggs and -1.27 for the price of cookies, which also tells the change in value with respect to the dependent variable. What is the difference between these two? I know the values are different, but don't they mean the same thing? Please advise and correct me if I am wrong.
Well it depends. I'm going to simplify the model a bit to one product (eggs for example):
Assuming elasticity is not constant and the demand curve is linear:
E = Elasticity
Q = Quantity Demanded
P = Price
t = time
b0 = constant
b1 = coefficient (slope)
See the breakdown for elasticity here
Picture a graph of the Demand Curve with Q on the vertical axis and P on the horizontal axis - because we're assuming Quantity Demanded will change in response to changes in Price.
I can't emphasize this enough - in the case where demand is linear and elasticity is not constant along the entire demand curve:
The coefficient (slope) is the change (difference) in the dependent variable (Q) divided by the change in the independent variable (P) measured in units - it is the derivative of your linear equation. The coefficient is the change in Q units with respect to a change in P units. Using your eggs example, increasing the price of eggs by 1 unit will decrease the quantity demanded of eggs by 16.12 units - regardless of whether the price of eggs increases from 1 to 2 or 7 to 8, the quantity demanded will decrease by 16.12 units.
From the link above, you can see that Elasticity adds a bit more information. That is because elasticity is the percent change in Quantity Demanded divided by the percent change in Price, i.e. the relative difference in Quantity Demanded with respect to the relative difference in Price. Let's use your eggs model but exclude Ad.Type and Price.Cookies:
Sales of Eggs = 137.37 - 16.12 * Price.Eggs
"P" "Qd" "E"
1.00 121.25 -0.13
2.00 105.13 -0.31
3.00 89.01 -0.54
4.00 72.89 -0.88
5.00 56.77 -1.42
6.00 40.65 -2.38
7.00 24.53 -4.60
8.00 8.41 -15.33
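
(If you want to reproduce the table yourself, here is a quick Python sketch using the point-elasticity formula for a linear demand curve, E = b1 * P / Q:)

# Reproduce the table: Q = b0 + b1 * P, point elasticity E = b1 * P / Q.
b0, b1 = 137.37, -16.12
for P in range(1, 9):
    Q = b0 + b1 * P          # quantity demanded at price P
    E = b1 * P / Q           # point elasticity at (P, Q)
    print(f"{P:5.2f} {Q:8.2f} {E:8.2f}")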
See graph of Demand Curve vs Elasticity
In the table you can see that as P increases by 1.00, Qd decreases by 16.12 regardless if it's from 1.00 to 2.00 or 7.00 to 8.00.
Elasticity, however, does change rather significantly relative to changes in price, so even if the change in units for each variable remains the same, the percent change for each variable will change.
A price increase from 1 to 2 is a 100% increase and would result in a change in quantity demanded from 121.25 to 105.13 which is a 13% decrease.
A price change from 7 to 8 is a 14% increase and would result in a quantity demanded change from 24.53 to 8.41 which is a 66% decrease.
If you're interested in learning more about different ways to measure elasticity, I highly recommend these lecture slides, especially slide 6.26.

Trending Percentage over time

When I am trying to calculate a percentage over time with a rolling weighted denominator, what is that called? The calculation is basically
Users With WiFi Activity / Users
The line graph I have is plotted daily on the x axis, but at each day it only uses that single day's data, so the percentage comes out far too low. What I want is, at each day on the line graph, the percentage of WiFi users over the rolling 30 days up to that day, rather than for that day alone.
Is that called a moving average?
Also, how is that calculated?
What the data should look like:
Day   Percentage   SUMTotalWiFiUsersRolling   SumTotalUsersRolling
8/1   85%          1800                       2000
8/1   81%          1700                       2100

What Tableau is doing:
Day   Percentage   SUMTotalWiFiUsersDayOnly   SumTotalUsersRolling
8/1   30%          600                        2000
8/1   35%          735                        2100
You're right, that is a moving average. It's an MA because the day you fix the 30-day period to is moving (each new day, the window slides forward by one day).
Is your data at the day level of detail? If so, this is the table calc you need:
WINDOW_AVG([SUMTotalUsersRolling],-29,0)
In the formula, we start 29 days back from the current day (0), so the 30-day average includes today.
Formula format:
WINDOW_AVG(<field to average>,<number of periods to start from>,<period to end at>)
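
Outside Tableau, the same rolling calculation can be sketched in Python with pandas (made-up column names and numbers; note that the numerator and denominator are summed over the window before dividing, rather than averaging the daily percentages):

import pandas as pd

# Hypothetical daily counts of WiFi users and total users.
df = pd.DataFrame({
    "day": pd.date_range("2023-08-01", periods=60, freq="D"),
    "wifi_users": range(500, 560),
    "total_users": range(900, 960),
}).set_index("day")

# Sum each series over a trailing 30-day window, then divide:
rolling = df.rolling("30D").sum()
df["pct_wifi_30d"] = rolling["wifi_users"] / rolling["total_users"]
print(df.tail())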

Calculate overall accuracy with FAR and FRR

I made a system that detects and counts traffic violations, specifically vehicular obstructions of the pedestrian crossing lane. My inputs are videos. To test the program, I compare the violation count from my manual observation of the video (ground truth) against the violation count from my program.
Example:
Video 1
Ground Truth: 10 violations
Program Count: 8 violations (False accepts: 2, False rejects: 4)
FAR: 2/8 = 25%
FRR: 4/8 = 50%
Overall accuracy: (8 violations - 2 false accepts) / 10 total violations = 60%
Are my computations correct, especially the overall accuracy? Also, what is the formula for the equal error rate (EER)?
FAR and FRR should be computed relative to the total number of observations, not the expected number of positive observations.
EDIT
As an example, imagine there have been 100 observations and your program split them into 8 violations (including 2 false accepts) and 92 non-violations (including 4 false rejects) when it should have been 10 violations and 90 non-violations. Then:
FAR = 2/100 = 2%
FRR = 4/100 = 4%
I think the accuracy is correct, as the program has indeed detected 60% of the violations.
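
To make the bookkeeping explicit, here is a small Python sketch of the counts in that example (my own layout; the 60% figure is the detection rate over the 10 true violations):

# Counts from the 100-observation example above.
true_positives = 6    # violations correctly flagged (8 flagged - 2 false accepts)
false_accepts = 2     # non-violations flagged as violations
false_rejects = 4     # violations the program missed
total = 100           # all observations

far = false_accepts / total    # 2/100 = 2%
frr = false_rejects / total    # 4/100 = 4%
detection_rate = true_positives / (true_positives + false_rejects)  # 6/10 = 60%
print(far, frr, detection_rate)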