arc4random() and arc4random_uniform() not really random? - swift

I have been using arc4random() and arc4random_uniform() and I always had the feeling that they wasn't exactly random, for example, I was randomly choosing values from an Array but often the values that came out were the same when I generated them multiple times in a row, so today I thought that I would use an Xcode playground to see how these functions are behaving, so I first tests arc4random_uniform to generate a number between 0 and 4, so I used this algorithm :
import Cocoa
var number = 0
for i in 1...20 {
number = Int(arc4random_uniform(5))
}
And I ran it several times, and here is how to values are evolving most of the time :
So as you can see the values are increasing and decreasing repeatedly, and once the values are at the maximum/minimum, they often stay at it during a certain time (see the first screenshot at the 5th step, the value stays at 3 during 6 steps, the problem is that it isn't at all unusual, the function actually behaves in that way most of the time in my tests.
Now, if we look at arc4random(), it's basically the same :
So here are my questions :
Why is this function behaving in this way ?
How to make it more random ?
Thank you.
EDIT :
Finally, I made two experiments that were surprising, the first one with a real dice :
What surprised me is that I wouldn't have said that it was random, since I was seeing the same sort of pattern that as described as non-random for arc4random() & arc4random_uniform(), so as Jean-Baptiste Yunès pointed out, humans aren't good to see if a sequence of numbers is really random.
I also wanted to do a more "scientific" experiment, so I made this algorithm :
import Foundation
var appeared = [0,0,0,0,0,0,0,0,0,0,0]
var numberOfGenerations = 1000
for _ in 1...numberOfGenerations {
let randomNumber = Int(arc4random_uniform(11))
appeared[randomNumber]++
}
for (number,numberOfTimes) in enumerate(appeared) {
println("\(number) appeard \(numberOfTimes) times (\(Double(numberOfGenerations)/Double(numberOfTimes))%)")
}
To see how many times each number appeared, and effectively the numbers are randomly generated, for example, here is one output from the console :
0 appeared 99 times.
1 appeared 97 times.
2 appeared 78 times.
3 appeared 80 times.
4 appeared 87 times.
5 appeared 107 times.
6 appeared 86 times.
7 appeared 97 times.
8 appeared 100 times.
9 appeared 91 times.
10 appeared 78 times.
So it's definitely OK 😊
EDIT #2 : I made again the dice experiment with more rolls, and it's still as surprising to me :

A true random sequence of numbers cannot be generated by an algorithm. They can only produce pseudo-random sequence of numbers (something that looks like a random sequence). So depending on the algorithm chosen, the quality of the "randomness" may vary. The quality of arc4random() sequences is generally considered to have a good randomness.
You cannot analyze the randomness of a sequence visually... Humans are very bad to detect randomness! They tend to find some structure where there is no. Nothing really hurts in your diagrams (except the rare subsequence of 6 three in-a-row, but that is randomness, sometimes unusual things happens). You would be surprised if you had used a dice to generate a sequence and draw its graph. Beware that a sample of only 20 numbers cannot be seriously analyzed against its randomness, your need much bigger samples.
If you need some other kind of randomness, you can try to use /dev/random pseudo-file, which generate a random number each time you read in. The sequence is generated by a mix of algorithms and external physical events that ay happens in your computer.

It depends on what you mean when you say random.
As stated in the comments, true randomness is clumpy. Long strings of repeats or close values are expected.
If this doesn't fit your requirement, then you need to better define your requirement.
Other options could include using a shuffle algorithm to dis-order things in an array, or use an low-discrepancy sequence algorithm to give a equal distribution of values.

I don’t really agree with the idea of humans who are very bad to detect randomness.
Would you be satisfied if you obtain 1-1-2-2-3-3-4-4-5-5-6-6 after throwing 6 couples of dices ? however the dices frequencies are perfect…
This is exactly the problem i’m encountering with arc4random or arc4random_uniform functions.
I’m developing a backgammon application since many years which is based on a neural network trained by word champions players. I DO know that it plays much better than any one but many users think it is cheating. I also have doubts sometimes so I’ve decided to throw all dices by myself…
I’m not satisfied at all with arc4random, even if frequencies are OK.
I always throw a couple of dices and results lead to unacceptable situations, for example : getting five consecutive double dices for the same player, waiting 12 turns (24 dices) until the first 6 occurs.
It is easy to test (C code) :
void randomDices ( int * dice1, int * dice2, int player )
{
( * dice1 ) = arc4random_uniform ( 6 ) ;
( * dice2 ) = arc4random_uniform ( 6 ) ;
// Add to your statistics
[self didRandomDice1:( * dice1 ) dice2:( * dice2 ) forPlayer:player] ;
}
Maybe arc4random doesn’t like to be called twice during a short time…
So I’ve tried several solutions and finally choose this code which runs a second level of randomization after arc4random_uniform :
int CFRandomDice ()
{
int __result = -1 ;
BOOL __found = NO ;
while ( ! __found )
{
// random int big enough but not too big
int __bigint = arc4random_uniform ( 10000 ) ;
// Searching for the first character between '1' and '6'
// in the string version of bigint :
NSString * __bigString = #( __bigint ).stringValue ;
NSInteger __nbcar = __bigString.length ;
NSInteger __i = 0 ;
while ( ( __i < __nbcar ) && ( ! __found ) )
{
unichar __ch = [__bigString characterAtIndex:__i] ;
if ( ( __ch >= '1' ) && ( __ch <= '6' ) )
{
__found = YES ;
__result = __ch - '1' + 1 ;
}
else
{
__i++ ;
}
}
}
return ( __result ) ;
}
This code create a random number with arc4random_uniform ( 10000 ), convert it to string and then searches for the first digit between ‘1’ and ‘6’ in the string.
This appeared to me as a very good way to randomize the dices because :
1/ frequencies are OK (see the statistics hereunder) ;
2/ Exceptional dice sequences occur at exceptional times.
10000 dices test:
----------
Game Stats
----------
HIM :
Total 1 = 3297
Total 2 = 3378
Total 3 = 3303
Total 4 = 3365
Total 5 = 3386
Total 6 = 3271
----------
ME :
Total 1 = 3316
Total 2 = 3289
Total 3 = 3282
Total 4 = 3467
Total 5 = 3236
Total 6 = 3410
----------
HIM doubles = 1623
ME doubles = 1648
Now I’m sure that players won’t complain…

Related

Why are my printed outputs different?

I've been slowly building up skillsets in Swift. Drawing with loops is a great way I find, to understand the subtleties of the language.
Here is an interesting puzzle I can't quite figure out:
I've been trying to generate a truncated pyramid like this for a little while.
I finally got a rough one produced using a for loop. screenshot here BUT, as you can see, one of my earlier attempts generated a half truncated pyramid.
The only difference between the two is that on lines 19 and 33, the variable "negativeSpaceThree" is diminished by 2 and 1 respectively.
Can anyone explain why the outputs are so different? I'd really like to understand these nuances. It might simply be my math, but I'm wondering if its a bug.
Many thanks for any input offered.
Code added below:
let space = " "
var negativeSpaceTwo = 22
var xTwo = 3
for circumTwo in 1...11{
xTwo += 2
negativeSpaceTwo -= 2
print(String(repeating: "-", count: negativeSpaceTwo) , String(repeating:"*", count: xTwo ))
}
print(space)
print(space)
var negativeSpaceThree = 11
var xThree = 3
for circumTwo in 1...11{
xThree += 2
negativeSpaceThree -= 1
print(String(repeating: "-", count: negativeSpaceThree) , String(repeating:"*", count: xThree ))
}
It's because of the difference in how many * characters you are printing on each line. If your total line length is total = dashes + stars and you subtract 2 from dashes each time while adding 2 to stars each time, the total line length will remain the same.
In the second pyramid, you reduce the dashes by one, but add 2 to stars, so the total length increases by 1 each line, giving the pyramid effect on the right-hand side of the text.

Is there any reasonable way to access the contents of a CharacterSet?

For a random string generator, I thought it would be nice to use CharacterSet as input type for the alphabet to use, since the pre-defined sets such as CharacterSet.lowercaseLetters are obviously useful (even if they may contain more diverse character sets than you'd expect).
However, apparently you can only query character sets for membership, but not enumerate let alone index them. All we get is _.bitmapRepresentation, a 8kb chunk of data with an indicator bit for every (?) character. But even if you peel out individual bits by index i (which is less than nice, going through byte-oriented Data), Character(UnicodeScalar(i)) does not give the correct letter. Which means that the format is somewhat obscure -- and, of course, it's not documented.
Of course we can iterate over all characters (per plane) but that is a bad idea, cost-wise: a 20-character set may require iterating over tens of thousands of characters. Speaking in CS terms: bit-vectors are a (very) bad implementation for sparse sets. Why they chose to make the trade-off in this way here, I have no idea.
Am I missing something here, or is CharacterSet just another deadend in the Foundation API?
Following the documentation, here is an improvement on Satachito answer to support cases of non-continuous planes, by actually taking into account the plane index:
extension CharacterSet {
func codePoints() -> [Int] {
var result: [Int] = []
var plane = 0
// following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
for (i, w) in bitmapRepresentation.enumerated() {
let k = i % 8193
if k == 8192 {
// plane index byte
plane = Int(w) << 13
continue
}
let base = (plane + k) << 3
for j in 0 ..< 8 where w & 1 << j != 0 {
result.append(base + j)
}
}
return result
}
func printHexValues() {
codePoints().forEach { print(String(format:"%02X", $0)) }
}
}
Usage
print("whitespaces:")
CharacterSet.whitespaces.printHexValues()
print()
print("two characters from different planes:")
CharacterSet(charactersIn: "𝚨󌞑").printHexValues()
Results
whitespaces:
09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000
two characters from different planes:
1D6A8
CC791
Performances
This is effectively 3 to 10 times faster than iterating over all characters: comparison is done with the previous answers at NSArray from NSCharacterset.
bitmapRepresentation has been documented.
https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
So iterate over that Data like below:
var offset = 0
for ( var i, w ) in CharacterSet.whitespaces.bitmapRepresentation.enumerated() {
if i % 8193 == 8192 {
offset += 1
continue
}
i -= offset
if w != 0 {
for j in 0 ..< 8 {
if w & ( 1 << j ) != 0 {
print( String( format:"%02X", i * 8 + j ) )
}
}
}
}
Result:
09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000
By your definition, no, there is no "reasonable" way. That's just how NSCharacterSet stores it. It's optimized for testing membership, not enumerating all members.
Your loop can increment a counter over the codepoints, or it can shift the bits (one per codepoint), but either way you have to loop and test. The highest "Ll" character on my Mac is U+1D7CB (#120,779), so if you want to compute this list of characters at runtime, your code will have to loop at least that many times. See the Objective-C version of the documentation for details on how the bit vector is organized.
The good news is that this is fast. With unoptimized code on my 10-year-old Mac, it takes less than 1/10th of a second to find all 1,841 lowercaseLetters. If that's still not fast enough, it's easy to hide the cost by doing it once, in the background, at startup time.

How do I generate a random number not including one without using a while loop?

Let's say I want to generate a random number between 1 and 100, but I don't want to include 42. How would I do this without repeating the random method until it is not 42.
Updated for Swift 5.1
Excluding 1 value
var nums = [Int](1...100)
nums.remove(at: 42)
let random = Int(arc4random_uniform(UInt32(nums.count)))
print(nums[random])
Excluding multiple values
This extension of Range does provide a solution when you want to exclude more than 1 value.
extension ClosedRange where Element: Hashable {
func random(without excluded:[Element]) -> Element {
let valid = Set(self).subtracting(Set(excluded))
let random = Int(arc4random_uniform(UInt32(valid.count)))
return Array(valid)[random]
}
}
Example
(1...100).random(without: [40,50,60])
I believe the computation complexity of this second solution is O(n) where n is the number of elements included in the range.
The assumption here is the no more than n excluded values are provided by the caller.
appzYourLife has some great general purpose solutions, but I want to tackle the specific problem in a lightweight way.
Both of these approaches work roughly the same way: Narrow the range to the random number generator to remove the impossible answer (99 answers instead of 100), then map the result so it isn't the illegal value.
Neither approach increases the probability of an outcome relative to another outcome. That is, assuming your random number function is perfectly random the result will still be random (and no 2x chance of 43 relative to 5, for instance).
Approach 1: Addition.
Get a random number from 1 to 99. If it's greater than or equal to the number you want to avoid, add one to it.
func approach1()->Int {
var number = Int(arc4random_uniform(99)+1)
if number >= 42 {
number = number + 1
}
return number
}
As an example, trying to generate a random number from 1-5 that's not 3, take a random number from 1 to 4 and add one if it's greater than or equal to 3.
rand(1..4) produces 1, +0, = 1
rand(1..4) produces 2, +0, = 2
rand(1..4) produces 3, +1, = 4
rand(1..4) produces 4, +1, = 5
Approach 2: Avoidance.
Another simple way would be to get a number from 1 to 99. If it's exactly equal to the number you're trying to avoid, make it 100 instead.
func approach2()->Int {
var number = Int(arc4random_uniform(99)+1)
if number == 42 {
number = 100
}
return number
}
Using this algorithm and narrowing the range to 1-5 (while avoiding 3) again, we get these possible outcomes:
rand(1..4) produces 1; allowed, so Result = 1
rand(1..4) produces 2, allowed, so Result = 2
rand(1..4) produces 3; not allowed, so Result = 5
rand(1..4) produces 4, allowed, so Result = 4

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshape, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
forvalues price = 0(5)500 {
gen new_r`risk'_p`price' = `price' * (`risk'/200)* beta - alpha
gen probnew_r`risk'_p`price' = 0
replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
sum probnew_r`risk'_p`price', mean
gen mnew_r`risk'_p`price' = r(mean)
drop new_r`risk'_p`price' probnew_r`risk'_p`price'
}
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
clear
use simresults.dta
reshape long mnew_r`risk'_p, i(simulationid) j(price)
keep simulation price mnew_r`risk'_p
rename mnew_r`risk'_p risk`risk'
save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
merge m:m price using risk`risk'.dta, nogen
save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem and to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.

Generate a hash sum for several integers

I am facing the problem of having several integers, and I have to generate one using them. For example.
Int 1: 14
Int 2: 4
Int 3: 8
Int 4: 4
Hash Sum: 43
I have some restriction in the values, the maximum value that and attribute can have is 30, the addition of all of them is always 30. And the attributes are always positive.
The key is that I want to generate the same hash sum for similar integers, for example if I have the integers, 14, 4, 10, 2 then I want to generate the same hash sum, in the case above 43. But of course if the integers are very different (4, 4, 2, 20) then I should have a different hash sum. Also it needs to be fast.
Ideally I would like that the output of the hash sum is between 0 and 512, and it should evenly distributed. With my restrictions I can have around 5K different possibilities, so what I would like to have is around 10 per bucket.
I am sure there are many algorithms that do this, but I could not find a way of googling this thing. Can anyone please post an algorithm to do this?.
Some more information
The whole thing with this is that those integers are attributes for a function. I want to store the values of the function in a table, but I do not have enough memory to store all the different options. That is why I want to generalize between similar attributes.
The reason why 10, 5, 15 are totally different from 5, 10, 15, it is because if you imagine this in 3d then both points are a totally different point
Some more information 2
Some answers try to solve the problem using hashing. But I do not think this is so complex. Thanks to one of the comments I have realized that this is a clustering algorithm problem. If we have only 3 attributes and we imagine the problem in 3d, what I just need is divide the space in blocks.
In fact this can be solved with rules of this type
if (att[0] < 5 && att[1] < 5 && att[2] < 5 && att[3] < 5)
Block = 21
if ( (5 < att[0] < 10) && (5 < att[1] < 10) && (5 < att[2] < 10) && (5 < att[3] < 10))
Block = 45
The problem is that I need a fast and a general way to generate those ifs I cannot write all the possibilities.
The simple solution:
Convert the integers to strings separated by commas, and hash the resulting string using a common hashing algorithm (md5, sha, etc).
If you really want to roll-your-own, I would do something like:
Generate large prime P
Generate random numbers 0 < a[i] < P (for each dimension you have)
To generate hash, calculate: sum(a[i] * x[i]) mod P
Given the inputs a, b, c, and d, each ranging in value from 0 to 30 (5 bits), the following will produce an number in the range of 0 to 255 (8 bits).
bucket = ((a & 0x18) << 3) | ((b & 0x18) << 1) | ((c & 0x18) >> 1) | ((d & 0x18) >> 3)
Whether the general approach is appropriate depends on how the question is interpreted. The 3 least significant bits are dropped, grouping 0-7 in the same set, 8-15 in the next, and so forth.
0-7,0-7,0-7,0-7 -> bucket 0
0-7,0-7,0-7,8-15 -> bucket 1
0-7,0-7,0-7,16-23 -> bucket 2
...
24-30,24-30,24-30,24-30 -> bucket 255
Trivially tested with:
for (int a = 0; a <= 30; a++)
for (int b = 0; b <= 30; b++)
for (int c = 0; c <= 30; c++)
for (int d = 0; d <= 30; d++) {
int bucket = ((a & 0x18) << 3) |
((b & 0x18) << 1) |
((c & 0x18) >> 1) |
((d & 0x18) >> 3);
printf("%d, %d, %d, %d -> %d\n",
a, b, c, d, bucket);
}
You want a hash function that depends on the order of inputs and where similar sets of numbers will generate the same hash? That is, you want 50 5 5 10 and 5 5 10 50 to generate different values, but you want 52 7 4 12 to generate the same hash as 50 5 5 10? A simple way to do something like this is:
long hash = 13;
for (int i = 0; i < array.length; i++) {
hash = hash * 37 + array[i] / 5;
}
This is imperfect, but should give you an idea of one way to implement what you want. It will treat the values 50 - 54 as the same value, but it will treat 49 and 50 as different values.
If you want the hash to be independent of the order of the inputs (so the hash of 5 10 20 and 20 10 5 are the same) then one way to do this is to sort the array of integers into ascending order before applying the hash. Another way would be to replace
hash = hash * 37 + array[i] / 5;
with
hash += array[i] / 5;
EDIT: Taking into account your comments in response to this answer, it sounds like my attempt above may serve your needs well enough. It won't be ideal, nor perfect. If you need high performance you have some research and experimentation to do.
To summarize, order is important, so 5 10 20 differs from 20 10 5. Also, you would ideally store each "vector" separately in your hash table, but to handle space limitations you want to store some groups of values in one table entry.
An ideal hash function would return a number evenly spread across the possible values based on your table size. Doing this right depends on the expected size of your table and on the number of and expected maximum value of the input vector values. If you can have negative values as "coordinate" values then this may affect how you compute your hash. If, given your range of input values and the hash function chosen, your maximum hash value is less than your hash table size, then you need to change the hash function to generate a larger hash value.
You might want to try using vectors to describe each number set as the hash value.
EDIT:
Since you're not describing why you want to not run the function itself, I'm guessing it's long running. Since you haven't described the breadth of the argument set.
If every value is expected then a full lookup table in a database might be faster.
If you're expecting repeated calls with the same arguments and little overall variation, then you could look at memoizing so only the first run for a argument set is expensive, and each additional request is fast, with less memory usage.
You would need to define what you mean by "similar". Hashes are generally designed to create unique results from unique input.
One approach would be to normalize your input and then generate a hash from the results.
Generating the same hash sum is called a collision, and is a bad thing for a hash to have. It makes it less useful.
If you want similar values to give the same output, you can divide the input by however close you want them to count. If the order makes a difference, use a different divisor for each number. The following function does what you describe:
int SqueezedSum( int a, int b, int c, int d )
{
return (a/11) + (b/7) + (c/5) + (d/3);
}
This is not a hash, but does what you describe.
You want to look into geometric hashing. In "standard" hashing you want
a short key
inverse resistance
collision resistance
With geometric hashing you susbtitute number 3 with something whihch is almost opposite; namely close initial values give close hash values.
Another way to view my problem is using the multidimesional scaling (MS). In MS we start with a matrix of items and what we want is assign a location of each item to an N dimensional space. Reducing in this way the number of dimensions.
http://en.wikipedia.org/wiki/Multidimensional_scaling