Matlab random number rng: choosing a seed

I would like to know more precisely what happens when you choose a custom seed in Matlab, e.g.:
rng(101)
From my (limited, but nevertheless existing) understanding of how pseudo-random number generators work, one can think of the seed conceptually as choosing a position in a "very long list of pseudo-random numbers".
Question: let's say that, in my Matlab script, I choose rng(100) for my first computation (a sequence of instructions) and then rng(1e6) for my second. Note that each computation involves generating up to about 300k random numbers.
Does that imply there is no overlap between the sequence in the "list" starting at 100 and ending around 300k, and the one starting at 1e6 and ending around 1,300,000? (The idea of "no overlap" comes from the fact that rng(100) and rng(1e6) are separated by much more than 300k.)
In other words, are these two "independent" sequences? (As far as I remember, this 'long list' would be generated by a special PRNG algorithm, most likely involving modular arithmetic?)

No, that is not the case. The mapping between the seed and the "position" in our list of generated numbers is not linear; you could actually interpret it as a hash/one-way function. It could even happen that two seeds give the same sequence of numbers shifted by one position (but it is very unlikely).
By default, MATLAB uses the Mersenne Twister (source).

Not quite. The seed you give to rng is the initialization point for the Mersenne Twister algorithm (used by default) to generate the pseudorandom numbers. If you choose two different seeds (no matter their relative non-negative integer values, except for maybe a special case or two), you will get effectively independent pseudorandom number streams.
For "99%" of people, the major uses of seeding the rng are passing the 'shuffle' argument (to use a non-default, time-based seed that helps ensure independence of numbers generated across multiple sessions) or giving it one particular seed (to be able to reproduce the same pseudorandom stream at a later date). If you try to finesse the seeds further without being extremely careful, you are more likely to cause issues than do anything helpful.
RandStream can be used to break off separate streams of pseudorandom numbers if that really matters for your application (it likely doesn't).
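For example, reusing one particular seed reproduces exactly the same stream (a minimal illustration; the specific values depend on the generator in use):
rng(100);  x = rand(1,5);   % draw five numbers from the freshly seeded stream
rng(100);  y = rand(1,5);   % reseeding with the same value restarts the stream
isequal(x, y)               % returns true: the two draws are identical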

Related

matlab: different instances start with the same random seed

Using MATLAB and trying to use a computer cluster to perform 100 repetitions of a certain calculation with an inherent stochastic nature. Each of those repetitions should run the same code, but with a different random seed.
It seems that
rng('shuffle')
recommended by the documentation, may not achieve this if all jobs start running at the same time (on different machines), as the seed used is an integer that appears to be initialized from the time (it is monotonically increasing, seemingly with a precision of a 100th of a second).
The precision seems reasonable, but "collisions" are still very likely when running 100-1000 instances at the same time, thus corrupting the statistical interpretation of the results as independent.
Any way to avoid such collisions without manually giving each instance an "instance id" used as seed?
Whatever you choose for the seed, it can only take on a 32-bit value, even if it will initialize a generator with a bigger state, such as Mersenne Twister ('twister', 19937 bits). There are certain issues with 32-bit seeds, as discussed in "C++ Seeding Surprises" by M. O'Neill. Presumably, the time-based seeds are likewise 32 bits long. A short seed means that only a limited number of pseudorandom sequences can be generated.
It appears that rng doesn't support seeds longer than 32 bits. On the other hand, recent versions of MATLAB support random number streams, which are designed, among other things, for cases where you "want separate sources of randomness in a simulation". For your purposes, choose a generator that supports multiple streams, such as mrg32k3a, and create random number streams as follows (see also "Multiple Streams"):
[stream1, stream2] = RandStream.create('mrg32k3a','NumStreams',2)
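On a cluster, a common pattern (sketched here under the assumption that each job can learn its own task index, represented by a hypothetical taskID) is to create only the stream belonging to that job and make it the global stream:
taskID = 7;   % hypothetical: the job's index, e.g. supplied by your scheduler
stream = RandStream.create('mrg32k3a', 'NumStreams', 100, 'StreamIndices', taskID);
RandStream.setGlobalStream(stream);   % rand/randn/randi now draw from this job's own stream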
I usually try to get some serial numbers from the machine or HDD, e.g.
dos('wmic bios get serialnumber')
or
dos('wmic cpu')
ProcessorId (e.g. "BFEBFBFF000506E3") is another value that could be used and should differ across your cluster. On multicore machines you could additionally use NumberOfCores to split the work and derive a different seed per core.
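One possible way to turn such a string into an rng seed (a rough sketch; it assumes Windows with wmic available, and the string-to-integer hash is ad hoc):
[~, out] = dos('wmic bios get serialnumber');                  % header line plus the value
serial = regexprep(out, 'SerialNumber|\s', '');                % keep only the value's characters
seed = mod(sum(double(serial) .* (1:numel(serial))), 2^32);    % crude string-to-integer hash
rng(seed, 'twister');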

Declaring rng('shuffle','twister') many times through the use of functions degrades computation time

I have an optimization program with a main program and three subprograms (functions) in MATLAB. I declared rng('shuffle','twister') in my main program, but I thought that I needed to declare the same rng('shuffle','twister') in my functions since they also use random sampling. My question is whether it is necessary to declare rng('shuffle','twister') in my functions, since doing so greatly degrades the computation time. I seem to be getting the same answers anyway. Is there a way around this?
You do not need to repeatedly run rng(...) in your functions, just once when you start MATLAB if you want to get different numbers each time. The random number functions in MATLAB (i.e. rand, randn, randi, etc.) share a global/system-wide generator, so there is no need to reseed it except when you restart MATLAB.
Since all of these functions access the same underlying stream, a call to one affects the values produced by the others at subsequent calls.
Hence, numbers generated in the different functions and in repeated calls to the functions will be different whether or not you reseed the generator.
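In other words, a structure like the following is enough (the helper names are hypothetical stand-ins for your subprograms; they simply call the generators internally):
rng('shuffle','twister');   % seed the shared global generator once, in the main program
a = mySubproblem1();        % hypothetical function that calls rand(...) internally
b = mySubproblem2();        % hypothetical function that calls randn(...) internally; no reseeding needed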
More about the 'shuffle' option from this page, which indicates that not only is it not useful to re-seed frequently, but it may actually be undesirable from a statistical standpoint:
'shuffle' is a very easy way to reseed the random number generator. You might think that it's a good idea, or even necessary, to use it to get "true" randomness in MATLAB. For most purposes, though, it is not necessary to use 'shuffle' at all. Choosing a seed based on the current time does not improve the statistical properties of the values you'll get from rand, randi, and randn, and does not make them "more random" in any real sense. While it is perfectly fine to reseed the generator each time you start up MATLAB, or before you run some kind of large calculation involving random numbers, it is actually not a good idea to reseed the generator too frequently within a session, because this can affect the statistical properties of your random numbers.

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function, where the function to call is determined by a high-frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
    case(2)      : func1();
    case(3)      : func2();
    case(8)      : func3();
    case(33-122) : func4();
    ...
    case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time), looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started to suspect that their implementation would require more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try: just add a length byte so that integers are 5-byte strings and ranges are 9-byte strings. There are some formal references on the Wikipedia page. A literature search in the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
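As a baseline to compare any hashing scheme against, here is a minimal MATLAB sketch of that binary-search dispatch over sorted range boundaries (the bounds come from the example above; the handlers are made-up stand-ins):
lows  = [2 3 8 33 10000];                        % lower bounds of the "cases" (singletons have low == high)
highs = [2 3 8 122 10000];                       % upper bounds
funcs = {@() disp('func1'), @() disp('func2'), @() disp('func3'), @() disp('func4'), @() disp('func40')};
x = 57;                                          % next value from the input stream
k = discretize(x, [lows, highs(end) + 1]);       % binary search for the candidate interval
if ~isnan(k) && x <= highs(k)
    funcs{k}();                                  % inside a known range: dispatch (prints 'func4' here)
else
    % uninteresting value: fall through as quickly as possible
end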
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Hash operator in Matlab for linear indices of vectors

I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as in the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element-wise, and comparing the sum of the ID vector is risky (a small ID can be compensated for by a large one); maybe I should compare the sum of squares? Is there a hashing method in Matlab that I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that, if I just check the sum, it could be that cluster d, for example, loses points 11 and 103 and gains 9 and 105 (same sum). Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is, a vector of bits whose length equals the cardinality of the underlying universe of things, in which each bit is set on or off according to that thing's membership of the (sub-)set. You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if the point is a member of the cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB. (A minimal sketch of this option follows after this list.)
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
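A minimal sketch of option (a), assuming the point IDs are positive integers bounded by a known universe size (1024 here is made up for the example):
nPoints = 1024;                          % assumed size of the universe of point IDs
members = false(4, nPoints);             % one logical row per cluster acts as a bit set
d = [11,57,74,83,86,90,92,102,103,104];
members(4, d) = true;                    % record cluster d's current membership
dNew = [9,57,74,83,86,90,92,102,104,105];
newRow = false(1, nPoints);
newRow(dNew) = true;
if isequal(members(4,:), newRow)
    % membership unchanged: skip recomputing this cluster's properties
else
    members(4,:) = newRow;               % update, then recompute the properties
end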
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the IDs I would recommend using the MATLAB Java interface to access the Java hashing algorithms:
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
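For example, one way to turn an ID vector into a hexadecimal key (a sketch; it assumes the vectors are doubles and that comparing their raw bytes is what you want):
a = [2,13,14,18,19,21,23,24,25,27];
md = java.security.MessageDigest.getInstance('SHA');
md.update(typecast(a, 'int8'));             % feed the raw bytes of the vector
digest = typecast(md.digest(), 'uint8');    % the 20-byte SHA-1 digest comes back as int8
key = sprintf('%02x', digest);              % hex string usable e.g. as a containers.Map key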
Hope this helps.
I found the function DataHash on the File Exchange; it is quite fast for vectors, and the strcmp on the keys is a lot faster than I expected.

How can I use reproducible randomization in Perl?

I have a Perl script that uses rand to generate pseudorandom integers in some range. I want it to be random (i.e. not set the seed by myself to some constant), but also want to be able to reproduce the results of a specific run if needed.
What would you do?
McWafflestix says:
Possibly you want to have a default randomly determined seed, that will give you complete randomness when desired, but which can be set prior to a run manually to give reproducibility.
The obvious way to implement this is to follow your normal seeding process (either manually from a strong random source, or letting perl do it automatically on the first call to rand), then derive an integer seed from the first generated random value, record it, and reseed with it. If you want to reproduce a run later, just reuse the recorded seed.
# something like this?
if ( defined $input_rand_seed ) {
    srand($input_rand_seed);
} else {
    my $seed = int(rand(2**31));  # derive an integer seed; srand truncates, so a bare rand() would effectively seed with 0
    log_random_seed($seed);       # record it so the run can be reproduced later
    srand($seed);
}
If the purpose is to be able to reproduce simulation paths that incorporate random shocks (say, when you are running an economic model to produce projections), I would give up on the idea of storing the seed and instead store each sequence alongside the model data.
Note that the built-in rand is subject to the vagaries of the rand implementation provided by the C runtime. On all Windows machines and across all perl versions I have used, this usually means that rand will only ever produce 32768 unique values.
That is severely limited for any serious purpose. In simulations, a crucial criterion is that random sequences used be independent of each other so that each run can be considered an independent realization.
In fact, if you are going to run a simulation 1,000 times, I would pre-produce 1,000 corresponding random sequences using known-good generators that are consistent across platforms and store them with the model inputs.
You can update the simulations using the same sequences or a new set if parameter estimates change when you get new data.
Log the seed for each run and provide a method to call the script and set the seed?
Why don't you want to set the seed, yet at the same time want to be able to set it? As I've said to you before, you need to explain why you don't want to do something so we know what you are actually asking.
You might just set it yourself only in certain conditions:
srand( $ENV{SOME_SEED} ) if defined $ENV{SOME_SEED};
If you don't call srand, rand calls it for you automatically but it doesn't report the seed that it used (at least not until Perl 5.14).
It's really just a simple programming problem. Just turn what you outlined into the code that does what you said.
Your goals are at odds with each other. On the one hand, you want a self-seeding, completely random sequence of integers; on the other hand, you want reproducibility. Complete randomness and reproducibility are at odds with each other.
You can set the seed to something you want. Possibly you want a default, randomly determined seed that gives you complete randomness when desired, but which can be set manually prior to a run to give reproducibility.