Strange rand() behaviour in MATLAB

rand() does not seem to generate truly random numbers. I have a simple program that generates six random numbers by calling:
for i = 1:6
    r = rand(1,1)
end
I ran this 4-5 times yesterday and saved the output. Today I opened MATLAB again, called the same function another 4-5 times, and the same numbers were returned.
Why is this happening?
Should I provide a random seed or any other fix?
Thanks for any help!

To expand on @alexforrence's answer, rand and other related functions produce pseudo-random numbers (PRNs) that require an initial value, the seed, to begin production. These numbers are not truly random: after the initial seed, they are produced by an algorithm, which is deterministic by its very nature.
However, being pseudo-random isn't necessarily a bad thing since models that use PRNs (e.g., Monte Carlo Methods) can generate portable, repeatable results across many users and platforms.
Additionally, the seed can be changed to create sets of random numbers and results that are statistically independent but also produce repeatable results.
For many scientific applications, this is very important.
Also, "true" random numbers (next paragraph) have a tendency to "clump" together and not evenly spread over their range for a small sampling of the space, which will degrade the performance of some methods that rely on stochastic processes.
There are methods to create "true-er" random numbers by the introduction of randomness from various analogue sources (e.g., hardware noise). These types of numbers are extremely important for cryptographically secure PRNs, where non-repeatability is an important feature (in contrast to the scientific usage). True random number generators require special hardware that leverages natural noise (e.g., quantum effects).
It is also important to remember that the total number of distinct random numbers that can be generated and used computationally is limited by the precision of the numbers being used.
You can re-seed MATLAB with a pseudo-random seed using the rng function.
However, "reseeding the generator too frequently within a session is not a good idea because the statistical properties of your random numbers can be adversely affected" [src].

From the Mathworks documentation, you can use
rng('shuffle');
before calling rand to set a "random" seed (based on the current time). Using a fixed seed instead (either by not changing the seed at startup, by resetting it with rng('default'), or by setting it explicitly with rng(number)) allows you to exactly repeat previous behaviour.
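As a quick illustration of both behaviours (the seed value 42 is arbitrary):
rng(42);            % fix the seed
a = rand(1,3);
rng(42);            % reset to the same seed
b = rand(1,3);
isequal(a, b)       % true: the sequences repeat exactly
rng('shuffle');     % seed from the current time instead
c = rand(1,3);      % differs from session to session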

Neural Networks w/ Fixed Point Parameters

Most neural networks are trained with floating point weights/biases.
Quantization methods exist to convert the weights from float to int, for deployment on smaller platforms.
Can you build neural networks from the ground up that constrain all parameters, and their updates, to integer arithmetic?
Could such networks achieve good accuracy?
(I know a bit about fixed-point and have only some rusty NN experience from the 90's so take what I have to say with a pinch of salt!)
The general answer is yes, but it depends on a number of factors.
Bear in mind that floating-point arithmetic is basically the combination of an integer significand with an integer exponent so it's all integer under the hood. The real question is: can you do it efficiently without floats?
Firstly, "good accuracy" is highly dependent on many factors. It's perfectly possible to perform integer operations that have higher granularity than floating-point. For example, 32-bit integers have 31 bits of mantissa while 32-bit floats effectively have only 24. So provided you do not require the added precision that floats give you near zero, it's all about the types that you choose. 16-bit -- or even 8-bit -- values might suffice for much of the processing.
Secondly, accumulating the inputs to a neuron has the issue that unless you know the maximum number of inputs to a node, you cannot be sure what the upper bound is on the values being accumulated. So effectively you must specify this limit at compile time.
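As a rough sketch in MATLAB (the sizes here are arbitrary choices): with n = 64 inputs of 8-bit values, each product needs up to 15 bits and the accumulated sum needs roughly log2(n) bits of headroom on top of that, so a 32-bit accumulator is comfortably safe. Note that MATLAB's integer arithmetic saturates rather than wraps on overflow.
n = 64;                                % the per-node input limit, fixed up front
w = int32(randi([-128, 127], 1, n));   % 8-bit weights, held in a 32-bit type
x = int32(randi([-128, 127], 1, n));   % 8-bit inputs
acc = sum(w .* x);                     % each product fits in ~15 bits; 64 of them need at most ~21 bits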
Thirdly, the most complicated operation during the execution of a trained network is often the activation function. Again, you first have to think about the range of values within which you will be operating. You then need to implement the function without the aid of an FPU and all the advanced mathematical functions it provides. One way to do this is via lookup tables, as sketched below.
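A minimal sketch of the lookup-table idea, using a sigmoid quantised to Q15 (the table size, input range, and Q-format are arbitrary choices; on a real integer-only target the index computation would itself use integer shifts, and doubles appear below only to keep the sketch short):
nEntries = 256;
xs = linspace(-8, 8, nEntries);                     % input range we expect to operate in
lut = int16(round(32767 ./ (1 + exp(-xs))));        % sigmoid table in Q15 (32767 ~ 1.0), built offline

act = 2.5;                                          % example activation value
idx = round((act + 8) / 16 * (nEntries - 1)) + 1;   % map [-8, 8] onto [1, nEntries]
idx = min(max(idx, 1), nEntries);                   % clamp out-of-range activations
y = lut(idx);                                       % approximately sigmoid(act), in Q15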
Finally, training involves measuring the error between values, and that error can often be quite small. Here is where accuracy is a concern. If the differences you are measuring are too small, they will round down to zero, and this may prevent progress. One solution is to increase the resolution of the values by providing more fractional bits.
One advantage that integers have over floating-point here is their even distribution. Where floating-point numbers lose accuracy as they increase in magnitude, integers maintain a constant precision. This means that if you are trying to measure very small differences in values that are close to 1, you should have no more trouble than you would if those values were as close to 0. The same is not true for floats.
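MATLAB's eps function shows this spacing directly:
eps(single(1))      % ~1.19e-07: gap between adjacent singles near 1
eps(single(1024))   % ~1.22e-04: the gap grows with magnitude
% integer types, by contrast, have a constant spacing of exactly 1 across their range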
It's possible to train a network with higher precision types than those used to run the network if training time is not the bottleneck. You might even be able to train the network using floating-point types and run it using lower-precision integers but you need to be aware of differences in behavior that these shortcuts will bring.
In short, the problems involved are by no means insurmountable, but you need to take on some of the mental effort that would normally be saved by using floating-point. However, especially if your hardware is physically constrained, this can be a hugely beneficial approach, as floating-point arithmetic requires as much as 100 times more silicon and power than integer arithmetic.
Hope that helps.

MATLAB: Generating random numbers in parfor or parallel computing

In a single for loop, I use a single random seed to generate all the "random numbers". They are very random, as I take them from the stream one at a time, without any gap.
However, in parfor each worker uses a different random seed, so the numbers obtained may interfere with each other. They are therefore not really random, as they do not come from a single seed.
Also, for my case, I do not know how many random numbers each worker needs beforehand. How can I solve this problem?
In parfor, the workers use different streams from a random number generator that is specifically designed to be used in parallel. Therefore, you can rely on the random numbers generated inside parfor having reasonable statistical qualities. More here: http://www.mathworks.com/help/distcomp/control-random-number-streams.html
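A minimal sketch (requires the Parallel Computing Toolbox; the exact generator the workers use depends on your MATLAB version):
parpool(4);           % start a pool of four workers
r = zeros(1, 8);
parfor k = 1:8
    r(k) = rand();    % each worker draws from its own independent stream
end
disp(r)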

Correct way to generate random numbers

On page 3 of "Lecture 8, White Noise and Power Spectral Density" it is mentioned that rand and randn create pseudo-random numbers. Please correct me if I am wrong: a sequence of random numbers is one for which, even with the same seed, two sequences are never exactly the same.
Pseudo-random numbers, by contrast, are deterministic, i.e., two sequences are the same if generated from the same seed.
How can I create random numbers rather than pseudo-random numbers? I was under the impression that Matlab's rand and randn functions generate independent, identically distributed random numbers, but the slides say they create pseudo-random numbers, and googling how to create random numbers returns the rand and randn functions.
The reason for distinguishing random numbers from pseudo-random numbers is that I need to compare the performance, for cryptography, of (A) a random signal with white-noise characteristics and (B) a pseudo-random signal with white-noise characteristics, so (A) must be different from (B). I shall be grateful for any code and the correct way to generate random numbers and pseudo-random numbers.
Generation of "true" random numbers is a tricky exercise; see Wikipedia on random number generation and tests of randomness (http://en.wikipedia.org/wiki/Random_number_generation). This site offers an RNG based on atmospheric noise (http://www.random.org/).
As mentioned above, it is really difficult (probably impossible) to create real random numbers with computer software alone. There are numerous projects on the internet that provide real random numbers generated by physical processes (for example, the one Kostya mentioned). A particularly interesting one is this from HU Berlin.
That being said, for experiments like the one you want to perform, Matlab's pseudo-RNGs are more than fine. Matlab's algorithms include the Mersenne Twister, one of the best-known pseudo-RNGs (I would suggest you google the Mersenne Twister's properties). See the Matlab rng documentation here.
Since you did not mention which type of system you want to simulate, one simple approach would be to use a good RNG (the Mersenne Twister) for process A and a not-so-good one for process B.
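For example, with RandStream you could draw process A from the Mersenne Twister and process B from MATLAB's legacy multiplicative congruential generator (the seed values are arbitrary):
sA = RandStream('mt19937ar', 'Seed', 12345);   % high-quality generator for process A
sB = RandStream('mcg16807', 'Seed', 12345);    % legacy MATLAB 4 generator, much weaker
a = rand(sA, 1, 1000);                         % samples for process A
b = rand(sB, 1, 1000);                         % samples for process B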

Fitness evaluation and training set in realtime simulation neuro-evolution

I am attempting to train a neural network to control a simple entity in a simulated 2D environment, currently by using a genetic algorithm.
Perhaps due to lack of familiarity with the correct terms, my searches have not yielded much information on how to treat fitness and training in cases where all the following conditions hold:
There is no data available on correct outputs for given inputs.
A performance evaluation can only be made after an extended period of interaction with the environment (with continuous controller input/output invocation).
There is randomness inherent in the system.
Currently my approach is as follows:
The NN inputs are instantaneous sensor readings of the entity and environment state.
The outputs are instantaneous activation levels of its effectors, for example, a level of thrust for an actuator.
I generate a performance value by running the simulation for a given NN controller, either for a preset period of simulation time, or until some system state is reached. The performance value is then assigned as appropriate based on observations of behaviour/final state.
To prevent over-fitting, I repeat the above a number of times with different random generator seeds for the system, and assign a fitness using some metric such as average/lowest performance value.
This is done for every individual at every generation. Within a given generation, for fairness each individual will use the same set of random seeds.
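In MATLAB-flavoured pseudocode, the evaluation loop looks roughly like this (run_simulation and controller are stand-ins for your own simulator and network, not real functions):
seeds = randi(1e6, 1, 5);                   % one seed set shared by all individuals this generation
perf = zeros(size(seeds));
for s = 1:numel(seeds)
    rng(seeds(s));                          % make the environment's randomness repeatable
    perf(s) = run_simulation(controller);   % hypothetical: run to timeout/terminal state, score the outcome
end
fitness = mean(perf);                       % or min(perf) for a worst-case metric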
I have a couple of questions.
Is this a reasonable, standard approach to take for such a problem? Unsurprisingly it all adds up to a very computationally expensive process. I'm wondering if there are any methods to avoid having to rerun a simulation from scratch every time I produce a fitness value.
As stated, the same set of random seeds is used for the simulations for each individual in a generation. From one generation to the next, should this set remain static, or should it be different? My instinct was to use different seeds each generation to further avoid over-fitting, and that doing so would not have an adverse effect on the selective force. However, from my results, I'm unsure about this.
It is a reasonable approach, but genetic algorithms are not known for being very fast or efficient. Try hill climbing and see if that is any faster. There are numerous other optimization methods, but nothing is great if you treat the function as a black box that you can only sample from. Reinforcement learning might also work.
Using different random seeds should prevent overfitting, but it may not be necessary, depending on how representative a static test is of the average case and how easy it is to overfit.

How can MLE Likelihood evaluations be so different if I break up a log likelihood into its sum?

This is something I noticed in Matlab when trying to do MLE. My first estimator used the log likelihood of a pdf and broke the product up as a sum. For example, the log of a Weibull pdf, f(x) = b*a*x^(a-1)*exp(-b*x^a), broken up is:
log_likelihood = log(b) + log(a) + (a-1)*log(x) - b*x^a
Evaluating this gives a wildly different result from:
log_likelihood = log(b*a*x^(a-1)*exp(-b*x^a))
What is the computer doing differently in the two stages? The first one gives a much larger number (by a couple orders of magnitude).
Depending on the numbers you use, this could be a numerical issue: if you combine very large numbers with very small numbers, you can get inaccuracies due to the limits of number precision.
One possibility is that you lose some accuracy in the second case, since you are operating at different scales.
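A concrete MATLAB example of the extreme case (the parameter values are chosen purely to trigger underflow):
a = 2; b = 3; x = 50;
L1 = log(b) + log(a) + (a-1)*log(x) - b*x^a   % about -7494: the sum of logs stays finite
L2 = log(b*a*x^(a-1)*exp(-b*x^a))             % -Inf: exp(-7500) underflows to 0 before the log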
I work on a scientific software project implementing maximum likelihood estimation of phylogenetic trees, and I consistently run into issues of numerical precision. Often the discrepancy shows up:
1. between competing applications with the same values in the model,
2. when calculating the MLE scores by hand,
3. in the order of the operations in the computation.
It really all comes down to number three, and that is the case here too. Multiplication of small and very large numbers can cause weird results when their exponents are scaled during computation. There is a lot about this in the (in)famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic"; what I've mentioned is the short of it, if that's all you are interested in.
Overall, the issues you are seeing are strictly numerical issues in the representation of floating-point / double-precision numbers and the operations used when computing the function. I'm not too familiar with MATLAB, but it may have an arbitrary-precision type that would give you better results.
Aside from that, keep the expressions symbolic as long as possible, and if you have any intuition about the sizes of the variables (as in, a is always very large compared to x), make sure you choose the order of parentheses wisely.
The first equation should be better, since it deals with adding logs and should be much more stable than the second -- although x^a makes me a bit wary, as it will dominate the equation; but it would in practice anyway.