Why does calling random() inside of a case statement produce unexpected results? - postgresql

https://gist.github.com/anonymous/2463d5a8ee2849a6e1f5
Query 1 does not produce the expected results. However, queries 2 and 3 do. Why does moving the call to random() outside of the case statement matter?

Consider the first expression:
select (case when round(random()*999999) + 1 between 000001 and 400000 then 1
             when round(random()*999999) + 1 between 400001 and 999998 then 2
             when round(random()*999999) + 1 between 999999 and 999999 then 3
             else 4
        end)
from generate_series(1, 8000000)
Presumably, you are thinking that the value "4" should almost never be selected. But the problem is that random() is called separately for each when clause.
So the chances of failing each clause are independent:
About 60% of the time a random number will not match "1".
About 40% of the time a random number will not match "2".
About 99.9999% of the time a random number will not match "3" (I apologize if the number of nines is off, but the value is practically 1).
That means that about 24% of the time (60% * 40% * 99.9999%), the value "4" will appear. In fact, the first query returns "4" about 23.98% of the time. That is very close to the expected value, though given this size of data it is a bit further off than I would expect; still, it is close enough to explain what is happening.
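Here is a sketch of the fix (presumably the same idea as queries 2 and 3 in the gist): compute the random value once per row in a derived table and let the case branch on that single value. Because random() is volatile, the planner does not flatten the derived table in a way that would duplicate the call, so each row gets one draw compared against every branch and "4" should essentially never appear:
select (case when r between 000001 and 400000 then 1
             when r between 400001 and 999998 then 2
             when r between 999999 and 999999 then 3
             else 4
        end)
from (select round(random()*999999) + 1 as r
      from generate_series(1, 8000000)) t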

Related

TSQL divide one count by another to give a proportion

I would like to calculate the proportion of animals in column BreedTypeID with a value of 1. I think the easiest way is to count the rows where BreedTypeID = 1 and divide by the total count of BreedTypeID. (I also want them to have the same YearDOB and substring in their ID, as shown.) I tried the following:
(COUNT([dbo].[tblBreed].[BreedTypeID])=1 OVER (PARTITION BY Substring([AnimalNo],6,6), YEAR([DOB]))/ COUNT([dbo].[tblBreed].[BreedTypeID]) OVER (PARTITION BY Substring([AnimalNo],6,6), YEAR([DOB]))) As Proportion
But it errored on the COUNT([dbo].[tblBreed].[BreedTypeID])=1 part.
How can I specify to only count [BreedTypeID] when =1?
Many thanks
This will fix your problem, although I would suggest you use table aliases instead of schema.table.column. Much easier to read:
Just replace:
COUNT([dbo].[tblBreed].[BreedTypeID])=1
WITH
SUM( CASE WHEN [dbo].[tblBreed].[BreedTypeID] = 1 THEN 1 ELSE 0 END)
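Putting it together, a sketch of the full expression (using the column names from the question, an alias b for [dbo].[tblBreed], and a 1.0 multiplier because SUM and COUNT both return integers, so plain integer division would truncate the proportion to 0):
SELECT b.[AnimalNo],
       1.0 * SUM(CASE WHEN b.[BreedTypeID] = 1 THEN 1 ELSE 0 END)
                 OVER (PARTITION BY SUBSTRING(b.[AnimalNo],6,6), YEAR(b.[DOB]))
           / COUNT(b.[BreedTypeID])
                 OVER (PARTITION BY SUBSTRING(b.[AnimalNo],6,6), YEAR(b.[DOB])) AS Proportion
FROM [dbo].[tblBreed] AS b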

Efficient method to query percentile in a list

I've come across the requirement to collect the percentiles from a list a few times:
Within what percentile is a certain number?
What is the nth percentile in a list?
I have written these methods to solve the issue:
/for 1:
percentileWithinThreshold:{[threshold;list] (100 * count where list <= threshold) % count list};
/for 2:
thresholdForPercentile:{[percentile;list] (asc list)[-1 + "j"$((percentile % 100) * count list)]};
They work well for both use cases, but I was thinking this is too common a use case, so Q probably already offers something out of the box that does the same. Any idea whether something like that already exists?
'100 xrank' generates percentiles.
q) 100 xrank 1 2 3 4
0 25 50 75
Solution for your second requirement:
q) f:{ y (100 xrank y:asc y) bin x}
Also, note that your second function's result will not always be the same as xrank's. The reason is that 'xrank' uses floor on the fractional index (the usual convention when calculating percentiles), whereas your function rounds the value and subtracts 1, which ensures that the output will always be less than or equal to the input percentile. For example:
q) thresholdForPercentile[63;til 21] / output 12
q) f[63;til 21] / output 13
For the first requirement, there is no built-in function. However, you could improve your function if you keep the input list sorted, because in that case you could use the 'bin' function, which runs faster on big lists.
q) percentileWithinThreshold:{[threshold;list] (100 * 1+list bin threshold) % count list};
Remember that 'bin' will throw a type error if one argument is a float and the other is an integer, so make sure to cast them correctly inside the function.
qtln:{[x;y;z]cf:(0 1;1%2 2;0 0;1 1;1%3 3;3%8 8) z-4;n:count y:asc y;?[hf<1;first y;last y]^y[hf-1]+(h-hf)*y[hf]-y -1+hf:floor h:cf[0]+x*n+1f-sum cf}
qtl:qtln[;;8];

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshape, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
    forvalues price = 0(5)500 {
        gen new_r`risk'_p`price' = `price' * (`risk'/200)* beta - alpha
        gen probnew_r`risk'_p`price' = 0
        replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
        sum probnew_r`risk'_p`price', mean
        gen mnew_r`risk'_p`price' = r(mean)
        drop new_r`risk'_p`price' probnew_r`risk'_p`price'
    }
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
    clear
    use simresults.dta
    reshape long mnew_r`risk'_p, i(simulationid) j(price)
    keep simulation price mnew_r`risk'_p
    rename mnew_r`risk'_p risk`risk'
    save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
    merge m:m price using risk`risk'.dta, nogen
    save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and you would need to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.

postgresql random() bug?

I have the following statement:
select random() * 999 + 111 from generate_series(1,10)
which results in:
690,046183290426
983,732229881454
1091,53674799064
659,380498787854
398,545482470188
775,65887248842
1044,79942638567
173,288027528208
584,690435883589
522,077123570256
As you see, two values are over 999!
This one works fine:
select random() * 9 + 1 from generate_series(1,10)
My system is: PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
Is that a bug?
No, it's not a bug. The expression random() * 999 gives you a number from 0 up to 999.
Adding one to that would give you a number from 1 through 1000; adding 111 instead gives you a number from 111 through 1110. So, in addition to seeing some values over a thousand, you'll also see none under 111.
Your mistake appears to have been assuming that, when you use 999 instead of 9, you have to add 111 instead of 1. That's not the case. To get a number 1 through 1000, you need:
random() * 999 + 1
No, this isn't a bug.
random() generates a number in the range [0.0, 1.0), i.e. strictly less than 1. Multiply this by 999 and you get a number in the range [0.0, 999.0). Add 111 and you have a number in the range [111.0, 1110.0).
In fact, given random()'s even distribution, there's roughly an 11% chance of getting a number larger than 999. The result you're describing isn't only possible, it's expected.
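If the intent was specifically integer values from 1 to 1000 (the expressions above return fractional doubles, not integers), a common PostgreSQL idiom is the sketch below; floor keeps the distribution uniform over the integers and the + 1 shifts the range off zero:
select floor(random() * 1000)::int + 1
from generate_series(1, 10)

-- a quick sanity check of the bounds over a larger sample
select min(v), max(v)
from (select floor(random() * 1000)::int + 1 as v
      from generate_series(1, 1000000)) s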

Problem looking at data between 0 and -1

I'm trying to write a program that cleans data, using Matlab. The program takes in the max and min that the data can be, and throws out data that is less than the min or greater than the max. There seems to be a small issue with the cleaning part. The problem ONLY happens when the minimum of the range being checked is 0. In that case, for one reason or another, the program won't throw away data points that are between 0 and -1. I've been trying to fix this for some time now, and noticed that this is the only case where it happens, and that if you run a SQL query selecting data that is < 0, it also leaves out data between 0 and -1, so effectively the same error as what's happening to me. Wondering if anyone might recognize this and know what it could be.
I would write such a function as:
function data = cleanseData(data, limits)
    % sort so the limits can be passed as [min max] or [max min]
    limits = sort(limits);
    % keep only the values inside the closed interval [min, max]
    data = data( limits(1) <= data & data <= limits(2) );
end
an example usage:
a = rand(100,1)*10;
b = cleanseData(a, [-2 5]);
c = cleanseData(a, [0 -1]);
-1 is less than 0, so 0 should be the max value. And if this is the case it will keep points between -1 and 0 by your definition of the cleaning operation:
and throws out data that is less than the min or greater than the max.
If you want to throw away (using the above definition)
data points that are between 0 and -1
then you need to set 0 as the min value and -1 as the max value --- which does not make sense.
Also, I think you mean
and throws out data that is less than the min AND greater than the max.
It may be that the floats are getting cast to ints before the comparison. I don't know Matlab, but in Python int(-0.5) == 0, which could explain the extra data points getting in. You can test this by setting the min to -1; if you then also get values between -1 and -2, you'll need to make sure casting isn't being done.
If I try to mimic your situation with SQL, and run the following query against a data table that has 1.00, 0.00, -0.20, -0.80, -1.00, -1.20 and -2.00 in the column SomeVal, it correctly returns -0.20 and -0.80, which is as expected.
SELECT SomeVal
FROM SomeTable
WHERE (SomeVal < 0) AND (SomeVal > - 1)
The same is true for MATLAB. Perhaps there's an error in your code. Check the statement above against your own SELECT statement to see if something's amiss.
I can imagine such a bug if you do something like the following, where a minimum of 0 is treated as false and the check is silently skipped:
minimum = 0
if minimum and value < minimum