postgresql random() bug?

I have the following statement select random() * 999 + 111 from generate_series(1,10)
which results in:
690,046183290426
983,732229881454
1091,53674799064
659,380498787854
398,545482470188
775,65887248842
1044,79942638567
173,288027528208
584,690435883589
522,077123570256
As you can see, two values are over 999!
This one works fine:
select random() * 9 + 1 from generate_series(1,10)
My system is: PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
Is that a bug?

No, it's not a bug. The expression random() * 999 gives you a number from 0 up to (just under) 999, since random() returns values in the range [0, 1).
Adding 1 to that gives you a number from 1 up to just under 1000. Adding 111 gives you a number from 111 up to just under 1110. So, in addition to seeing some values over 999, you'll also see none under 111.
Your mistake appears to have been assuming that, because you used 999 instead of 9, you had to add 111 instead of 1. That's not the case. To get a number from 1 up to 1000, you need:
random() * 999 + 1
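To convince yourself of the resulting range, a quick check (just a sketch; the alias names are arbitrary) is to sample the expression many times and look at its bounds:
-- min should land close to 1 and max should stay just below 1000
select min(x) as min_x, max(x) as max_x
from (select random() * 999 + 1 as x
      from generate_series(1, 100000)) as s;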

No, this isn't a bug.
random() generates a number in the range [0.0, 1.0). Multiply this by 999 and you get a number in the range [0.0, 999.0). Add 111 and you have a number in the range [111.0, 1110.0).
In fact, given random()'s uniform distribution, there's roughly an 11% chance of getting a number larger than 999. The result you're describing isn't just possible, it's expected.
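If you want to verify that 11% figure empirically, here is a rough sketch (alias names are arbitrary):
-- fraction of samples of random() * 999 + 111 that exceed 999;
-- with enough rows this comes out near 0.111
select avg((x > 999)::int) as fraction_over_999
from (select random() * 999 + 111 as x
      from generate_series(1, 100000)) as s;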

Efficient method to query percentile in a list

I've come across the requirement to collect the percentiles from a list a few times:
Within what percentile is a certain number?
What is the nth percentile in a list?
I have written these methods to solve the issue:
/for 1:
percentileWithinThreshold:{[threshold;list] (100 * count where list <= threshold) % count list};
/for 2:
thresholdForPercentile:{[percentile;list] (asc list)[-1 + "j"$((percentile % 100) * count list)]};
They work well for both use cases, but I was thinking this is too common a use case, so q probably already offers something out of the box that does the same. Does something like that already exist?
'100 xrank' generates percentiles.
q) 100 xrank 1 2 3 4
0 25 50 75
Solution for your second requirement:
q) f:{ y (100 xrank y:asc y) bin x}
Also, note that your second function will not always give the same result as xrank. The reason is that 'xrank' uses the floor of the fractional index, which is the usual convention when calculating percentiles, whereas your function rounds the value and subtracts 1, which ensures that the output will always be less than or equal to the input percentile. For example:
q) thresholdForPercentile[63;til 21] / output 12
q) f[63;til 21] / output 13
For the first requirement there is no built-in function. However, you could improve your function by keeping the input list sorted, because then you can use the 'bin' function, which runs faster on big lists.
q) percentileWithinThreshold:{[threshold;list] (100 * 1+list bin threshold) % count list};
Remember that 'bin' will throw a type error if one argument is of float type and the other is an integer, so make sure to cast them correctly inside the function.
/ A generic quantile function: x is the probability as a fraction in 0-1, y the list, z selects the estimation type (4-9); qtl fixes the type to 8
qtln:{[x;y;z]cf:(0 1;1%2 2;0 0;1 1;1%3 3;3%8 8) z-4;n:count y:asc y;?[hf<1;first y;last y]^y[hf-1]+(h-hf)*y[hf]-y -1+hf:floor h:cf[0]+x*n+1f-sum cf}
qtl:qtln[;;8];

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshape, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
forvalues price = 0(5)500 {
gen new_r`risk'_p`price' = `price' * (`risk'/200)* beta - alpha
gen probnew_r`risk'_p`price' = 0
replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
sum probnew_r`risk'_p`price', mean
gen mnew_r`risk'_p`price' = r(mean)
drop new_r`risk'_p`price' probnew_r`risk'_p`price'
}
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
clear
use simresults.dta
reshape long mnew_r`risk'_p, i(simulationid) j(price)
keep simulation price mnew_r`risk'_p
rename mnew_r`risk'_p risk`risk'
save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
merge m:m price using risk`risk'.dta, nogen
save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and you would need to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.

Why does calling random() inside of a case statement produce unexpected results?

https://gist.github.com/anonymous/2463d5a8ee2849a6e1f5
Query 1 does not produce the expected results. However, queries 2 and 3 do. Why does moving the call to random() outside of the case statement matter?
Consider the first expression:
select (case when round(random()*999999) + 1 between 000001 and 400000 then 1
when round(random()*999999) + 1 between 400001 and 999998 then 2
when round(random()*999999) + 1 between 999999 and 999999 then 3
else 4
end)
from generate_series(1, 8000000)
Presumably, you are thinking that the value "4" should almost never be selected. But, the problem is that random() is being called separately for each when clause.
So the chance that it fails each clause is independent:
About 60% of the time a random number will not match "1".
About 40% of the time a random number will not match "2".
About 99.9999% of the time a random number will not match "3" (I apologize if the number of nines is off, but the value is practically 1).
That means that about 24% of the time (60% * 40% * 99.9999%), the value "4" will appear. In actual fact, the first query returns "4" 23.98% of the time. That is very close to the expected value, although given this amount of data it is a bit further off than I would expect. Still, it is close enough to explain what is happening.
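One way to get the behavior you presumably expected (and, I assume, roughly what queries 2 and 3 in the gist do) is to evaluate random() once per row, for example in a derived table, and let all the case branches compare against that single value:
select (case when r between 000001 and 400000 then 1
             when r between 400001 and 999998 then 2
             when r between 999999 and 999999 then 3
             else 4
        end)
from (select round(random() * 999999) + 1 as r  -- random() runs once per generated row
      from generate_series(1, 8000000)) as t;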

T-SQL Decimal Multiplication

MSDN says this about the precision and scale of a decimal multiplication result:
The result precision and scale have an absolute maximum of 38. When a result precision is greater than 38, the corresponding scale is reduced to prevent the integral part of a result from being truncated.
So when we execute this:
DECLARE @a DECIMAL(18,9)
DECLARE @b DECIMAL(19,9)
set @a = 1.123456789
set @b = 1
SELECT @a * @b
the result is 1.123456789000000000 (9 trailing zeros) and we see that it is not truncated, because 18 + 19 + 1 = 38 (the upper limit).
When we raise the precision of @a to 27 we lose all the zeros and the result is just 1.123456789. Going further, truncation sets in and the result gets rounded. For example, raising the precision of @a to 28 results in 1.12345679 (8 digits).
The interesting thing is that at some point, with precision equal to 30, we have 1.123457 and this result won't change any further (the truncation stops).
Precisions 31, 32 and up to 38 all give the same result. How can this be explained?
Decimal and numeric operation results have a minimum scale of 6. This is specified in the table of the MSDN documentation for division, but the same behavior applies to multiplication as well when the scale is truncated, as in your example.
This behavior is described in more detail on the sqlprogrammability blog.
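The observed values line up with the documented reduction rule (my reading of it; treat the formula as an assumption): the raw result precision is p1 + p2 + 1 and the raw scale is s1 + s2, and when the precision exceeds 38 the scale is cut back by the excess, but never below 6. A small T-SQL sketch of the arithmetic for the precision-30 case:
-- raw precision = 30 + 19 + 1 = 50, raw scale = 9 + 9 = 18
-- excess = 50 - 38 = 12, so the scale becomes MAX(18 - 12, 6) = 6
DECLARE @a DECIMAL(30,9) = 1.123456789;
DECLARE @b DECIMAL(19,9) = 1;
SELECT @a * @b AS product,                                     -- 1.123457
       SQL_VARIANT_PROPERTY(@a * @b, 'Scale') AS result_scale; -- 6
For any precision of @a from 31 up to 38 the excess keeps growing, but the scale is already pinned at the minimum of 6, which is why the result stops changing.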

How to create a unique integer number from 3 different integer numbers (1 Oracle Long, 1 Date field, 1 Short)

The thing is that the 1st number is already an Oracle LONG,
the second one a Date (SQL DATE, no extra timestamp info), and the last one a Short value in the range 1000-100'000.
How can I create a sort of hash value that will be unique for each combination, optimally?
String concatenation and converting to a long later is something I don't want, because of cases like this, for example:
Day Month
12 1 --> 121
1 12 --> 121
When you have a few numeric values and need a single "unique" (that is, with statistically improbable duplicates) value out of them, you can usually use a formula like:
h = (a*P1 + b)*P2 + c
where P1 and P2 are either well-chosen numbers (e.g. if you know 'a' is always in the 1-31 range, you can use P1=32) or, when you know nothing in particular about the allowable ranges of a, b and c, the best approach is to make P1 and P2 big prime numbers (they have the least chance of generating values that collide).
For an optimal solution the math is a bit more complex than that, but using prime numbers you can usually have a decent solution.
For example, the Java implementation of .hashCode() for an array (or a String) is something like:
h = 0;
for (int i = 0; i < a.length; ++i)
h = h * 31 + a[i];
Personally, though, I would have chosen a prime bigger than 31, as values inside a String can easily collide, since character values 31 apart are quite common, e.g.:
"BB".hashCode() == "Aa".hashCode() == 2112
Your
12 1 --> 121
1 12 --> 121
problem is easily fixed by zero-padding your input numbers to the maximum width expected for each input field.
For example, if the first field can range from 0 to 10000 and the second field can range from 0 to 100, your example becomes:
00012 001 --> 00012001
00001 012 --> 00001012
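Since the data lives in Oracle anyway, the same zero-padding idea can be written as plain arithmetic in SQL. This is only a sketch: the names some_table, long_col, date_col and short_col are made up, and it assumes the date packs into 8 digits as YYYYMMDD and the Short never needs more than 6 digits:
-- pack the three values into disjoint digit bands so distinct combinations cannot collide
select long_col * 100000000000000                          -- shift left 14 digits (8 for the date, 6 for the short)
     + to_number(to_char(date_col, 'YYYYMMDD')) * 1000000  -- shift left 6 digits for the short
     + short_col as combined_key
from some_table;
Note that the packed value can easily exceed the 64-bit integer range, so it is only practical if you can keep it as an Oracle NUMBER (or a string); if it has to fit a long, the prime-multiplier hash above, which trades guaranteed uniqueness for a bounded size, is the pragmatic alternative.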
In python, you can use this:
#pip install pairing
import pairing as pf
n = [12,6,20,19]
print(n)
key = pf.pair(pf.pair(n[0],n[1]),
pf.pair(n[2], n[3]))
print(key)
m = [pf.depair(pf.depair(key)[0]),
pf.depair(pf.depair(key)[1])]
print(m)
Output is:
[12, 6, 20, 19]
477575
[(12, 6), (20, 19)]