How can I use reproducible randomization in Perl? - perl

I have a Perl script that uses rand to generate pseudorandom integers in some range. I want it to be random (i.e. not set the seed by myself to some constant), but also want to be able to reproduce the results of a specific run if needed.
What would you do?

McWafflestix says:
Possibly you want to have a default randomly determined seed, that will give you complete randomness when desired, but which can be set prior to a run manually to give reproducibility.
The obvious way to implement this is to follow your normal seeding process (either manually from a strong random source, or letting perl do it automatically on the first call to rand), then use the first generated random value as the seed, and record it. If you want to reproduce later, just use a recorded value for the seed.
# something like this?
if ( defined $input_rand_seed ) {
srand($input_rand_seed);
} else {
my $seed = rand(); # or something fancier
log_random_seed($seed);
srand($seed);
}

If the purpose is to be able to reproduce simulation paths which incorporate random shocks (say, when you are running an economic model to produce projections, I would give up on the idea of storing the seed, but rather store each sequence alongside the model data.
Note that the built in rand is subject to vagaries of the rand implementation provided by the C runtime. On all Windows machines and across all perl versions I have used, this usually means that rand will only ever produce 32768 unique values.
That is severely limited for any serious purpose. In simulations, a crucial criterion is that random sequences used be independent of each other so that each run can be considered an independent realization.
In fact, if you are going to run a simulation 1,000 times, I would pre-produce 1,000 corresponding random sequences using known-good generators that are consistent across platforms and store them with the model inputs.
You can update the simulations using the same sequences or a new set if parameter estimates change when you get new data.

Log the seed for each run and provide a method to call the script and set the seed?

Why don't you want to set the seed, but at the same time set the seed? As I've said to you before, you need to explain why you don't want to do something so we know what you are actually asking.
You might just set it yourself only in certain conditions:
srand( $ENV{SOME_SEED} ) if defined $ENV{SOME_SEED};
If you don't call srand, rand calls it for you automatically but it doesn't report the seed that it used (at least not until Perl 5.14).
It's really just a simple programming problem. Just turn what you outlined into the code that does what you said.

Your goals are at odds with each other. One one hand, you want a self-seeding, completely random sequence of integers; on the other hand, you want reproducibility. Completely random and reproducibility are at odds with each other.
You can set the seed to something you want. Possibly you want to have a default randomly determined seed, that will give you complete randomness when desired, but which can be set prior to a run manually to give reproducibility.

Related

Matlab random number rng: choosing a seed

I would like to know more precisely what happends when you choose a custom seed in Matlab, e.g.:
rng(101)
From my (limited, nut nevertheless existing) understanding of how pseudo-random number generators work, one can see the seed conceptually as choosing a position in a "very long list of pseudo-random numbers".
Question: lets say, (in my Matlab script), I choose rng(100) for my first computation (a sequence of instructions) and then rng(1e6) for my second. Please, note that each time I do some computations it involves generating up to about 300k random numbers (each time).
-> Does that imply that I make sure there is no overlap between the sequence in the "list" starting at 100 and ending around 300k and the one starting at 1e6 and ending at 1'300'000 ? (the idead of "no overlap" comes from the fact since the rng(100) and rng(1e6) are separated by much more than 300k)
i.e. that these are 2 "independent" sequences, (as far as I remember this 'long list' would be generated by a special PRNG algorithm, most likely involing modular arithmetic..?)
No that is not the case. The mapping between the seed and the "position" in our list of generated numbers is not linear, you could actually interpret it as a hash/one way function. It could actually happen that we get the same sequence of numbers shifted by one position (but it is very unlikely).
By default, MATLAB uses the Mersenne Twister (source).
Not quite. The seed you give to rng is the initiation point for the Mersenne Twister algorithm (by default) that is used to generate the pseudorandom numbers. If you choose two different seeds (no matter their relative non-negative integer values, except for maybe a special case or two), you will have effectively independent pseudorandom number streams.
For "99%" of people, the major uses of seeding the rng are using the 'shuffle' argument (to use a non-default seed based on the time to help ensure independence of numbers generated across multiple sessions), or to give it one particular seed (to be able to reproduce the same pseudorandom stream at a later date). If you try to finesse the seeds further without being extremely careful, you are more likely to cause issues than do anything helpful.
RandStream can be used to break off separate streams of pseudorandom numbers if that really matters for your application (it likely doesn't).

How does random number generation ensure reproducibility?

While reading about Transfer Learning with MATLab I came across a piece of code which says...
rng(2016) % For reproducibility
convnet = trainNetwork(trainDigitData,layers,options);
...before training the network so that the results can be reproduced exactly as given in the example by anyone who tries that code. I would like to know how generating a pseudo-random number using rng(seed_value) function can help with reproduciblity of the entire range of results?
Not random number generation, the random number generator seed.
There is no such things as random numbers, just pseudo-random numbers, numbers that behave almost as random, generally arising from some complex mathematical function, function that usually requires an initial value. Often, computers get this initial value from the time register in the microchip in your PC, thus "ensuring" randomness.
However, if you have an algorithm that is based in random numbers (e.g. a NN), reproducibility may be a problem when you want to share your results. Someone that re-runs your code will be ensured to get different results, as randomness is part of the algorithm. But, you can tell the random number generator to instead of starting from a seed taken randomly, to start from a fixed seed. That will ensure that while the numbers generated are random between themseves, they are the same each time (e.g. [3 84 12 21 43 6] could be the random output, but ti will always be the same).
By setting a seed for your NN, you ensure that for the same data, it will output the same result, thus you can make your code "reproducible", i.e. someone else can run your code and get EXACTLY the same results.
As a test I suggest you try the following:
rand(1,10)
rand(1,10)
and then try
rng(42)
rand(1,10)
rng(42)
rand(1,10)
Wikipedia for Pseudo-random number generator
Because some times is good to use the same random numbers, this is what matlab says about that
Set the seed and generator type together when you want to:
Ensure that the behavior of code you write today returns the same results when you run that code in a future MATLABĀ® release.
Ensure that the behavior of code you wrote in a previous MATLAB release returns the same results using the current release.
Repeat random numbers in your code after running someone else's random number code
this is te point of repating the seed, and generate the same random numbers. matlab points it out in two good articles one for repeating numbers and one for different numbers
You dont want to start with weights all equal zeros, so in the initializing stage you give the weights some random value. There maybe other random values involved in searching for minimum later in the learning process, or in the way you feed your data.
So the real input to all neural network learning process is your data and the random number generator.
If they are the same, than all going to be the same.
And 'rng' command put the random number generator in predefined state so it will generate same sequence of number.
anquegi's answer, pretty much answers your question, so this post is just to elaborate a bit more.
Whenever you ask for a random number, what MATLAB really does, is that it generates a pseudo random number, which has distribution U(0,1) (that is the uniform on [0,1]) This is done via some deterministic formula, typically something like, see Linear congruential generator:
X_{n+1} = (a X_{n} + b) mod M
then a uniform number is obtained by U = X_{n+1}/M.
There is, however, a problem, If you want X_{1}, then you need X_{0}. You need to initialise the generator, this is the seed. This also means that once X_{0} is specified you will draw the same random numbers, every time. Try open a new MATLAB instance, run randn, close MATLAB, open it again and run randn again. It will be the same number. That is because MATLAB always uses the same seed whenever it is opened.
So what you do with rng(2016) is that you "reset" the generator, and put X_{0} = 2016, such that you now know all numbers that you ask for, and thus reproduce the results.

One-time randomization

I have a matrix, ECGsig, with each row containing a 1-second-long ECG signal,
I will classify them later but I want to randomly change the rows like,
idx = randperm(size(ECGsig,1));
ECGsig = ECGsig(idx,:);
However I want this to happen just once and not every time that I run the program,
Or in other words to have the random numbers generated only once,
Because if it changes every time I would have different results for classification,
Is there any way to do this beside doing in a separate m file and saving it in a mat file?
Thanks,
You can set the random generation seed so that every time you run a random result, it will generate the same random result each time. You can do this through rng. This way, even though run the program multiple times, it will still generate the same random sequence regardless. As such, try doing something like:
rng(1234);
The input into rng would be the seed. However, as per Luis Mendo's comment, rng is only available with newer versions of MATLAB. Should rng not be available with your distribution of MATLAB, do this instead:
rand('seed', 1234);
You can also take a look at randstream, but that's a bit too advanced so let's not look at it right now. To reset the seed to what it was before you opened MATLAB, choose a seed of 0. Therefore:
rng(0); %// or
rand('seed', 0);
By calling this, any random results you generate from this point will be based on a pre-determined order. The seed can be any integer you want really, but use something that you'll remember. Place this at the very beginning of your code before you do anything. The main reason why we have control over how random numbers are generated is because this encourages the production of reproducible results and research. This way, other people can generate the results you have created should you decide to do anything with random or randomizing.
Even though you said you only want to run this randomization once, this will save you the headache of saving your results to a different file before you run the program multiple times. By setting the seed, even though you're running the program multiple times, you're guaranteed to generate the same random sequence each time.

Declaring rng('shuffle','twister') many times through the use of functions degrade computation time

I have an optimization program where I have a main program and three subprograms (functions) in MATLAB. I declared rng('shuffle','twister') in my main program but I thought that I needed to declare the same rng('shuffle','twister') under my functions since they also use random sampling. My question is if it is necessary to declare rng('shuffle','twister') in my functions since it greatly degrades the computation time. I seem to be getting the same answers anyway. Is there a way around this?
You do not need to repeatedly run rng(...) in your functions, just once when you start MATLAB if you want to get different numbers each time. The random number functions in MATLAB (i.e. rand, randn, randi, etc.) share a global/system-wide generator, so there is no need to reseed it except when you restart MATLAB.
Since all of these functions access the same underlying stream, a call to one affects the values produced by the others at subsequent calls.
Hence, numbers generated in the different functions and in repeated calls to the functions will be different whether or not you reseed the generator.
More about the 'shuffle' option from this page, which indicates that not only is it not useful to re-seed frequently, but it may actually be undesirable from a statistical standpoint:
'shuffle' is a very easy way to reseed the random number generator. You might think that it's a good idea, or even necessary, to use it to get "true" randomness in MATLAB. For most purposes, though, it is not necessary to use 'shuffle' at all. Choosing a seed based on the current time does not improve the statistical properties of the values you'll get from rand, randi, and randn, and does not make them "more random" in any real sense. While it is perfectly fine to reseed the generator each time you start up MATLAB, or before you run some kind of large calculation involving random numbers, it is actually not a good idea to reseed the generator too frequently within a session, because this can affect the statistical properties of your random numbers.

Uniform Random Number blocks in my simulation model

I've used 2 Uniform Random Number blocks in my simulation model, but every time I run the program they generate last numbers (exactly the same). I need to test the model with new generated numbers. what should I do?
thanks for your helps in advance
The fact that random number generators generate the same random numbers "from the start" is a feature, not a bug. It allows for reproducible testing. You need to initialize your random number generator with a "random seed" in order to give a different result each time - you could use the current time, for example. When you do, it is recommended that you store the seed used - it means you can go back and run exactly the same code again.
For initializing a random seed, you can use the methods given in this earlier answer
In that answer, they are setting the seed to 0 - this is the opposite of what you are trying to do. You will want to generate a non-random number (like the date), and use that. A very useful article can be found here. To quote:
If you look at the output from rand, randi, or randn in a new MATLAB
session, you'll notice that they return the same sequences of numbers
each time you restart MATLAB. It's often useful to be able to reset
the random number generator to that startup state, without actually
restarting MATLAB. For example, you might want to repeat a calculation
that involves random numbers, and get the same result.
They recommend the command
rng shuffle
To generate a new random seed. You can access the seed that was used with
rng.seed
and store that for future use. So if you co
rng shuffle
seedStore = rng.seed;
Then next time you want to reproduce results, you set
rng(seedStore);