How does List::Util 'shuffle' actually work? - perl

I am currently working on building a classifier using C5.0. I have a dataset of 8000 entries, and each entry has its own ID number (1-8000). When testing the performance of the classifier I had to make 5 sets of 10:90 (training data : test data) splits. Of course no training case may appear again among the test cases, and duplicates cannot occur in either set.
To pick examples at random for the training data, while making sure the same ones cannot be picked for the test data, I have developed a horribly slow method:
fill a file with numbers from 1-8000 on separate lines.
randomly pick a line number (from a range of 1-8000) and use the contents of that line as the ID number of the training example.
write all unpicked numbers to a new file
decrement the range of the random number generator by 1
redo
Then all unpicked numbers are used as test data. It works, but it's slow. To speed things up I could use List::Util 'shuffle' to just 'randomly' shuffle an array of these numbers. But how random is 'shuffle'? It is essential that the same level of accuracy is maintained. Sorry about the essay, but does anyone know how 'shuffle' actually works? Any help at all would be great.
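The plan would be something like this (just a sketch; 800/7200 is the 10:90 split of my 8000 IDs):
use List::Util qw(shuffle);

my @ids      = shuffle(1 .. 8000);   # random permutation of all ID numbers
my @training = @ids[0 .. 799];       # first 10% become training cases
my @test     = @ids[800 .. $#ids];   # remaining 90% become test cases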

Here is the shuffle algorithm used in List::Util::PP:
sub shuffle (@) {
    my @a = \(@_);    # references to each element of the input list
    my $n;
    my $i = @_;       # count of elements not yet placed
    map {
        $n = rand($i--);                    # random index into the unplaced part
        (${$a[$n]}, $a[$n] = $a[$i])[0];    # emit that value, then fill its slot with the last unplaced ref
    } @_;
}
Which looks like a Fisher-Yates shuffle.
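For comparison, a plainer in-place Fisher-Yates looks something like this (just a readability sketch, not the module's code):
sub fisher_yates_shuffle {
    my @list = @_;
    for (my $i = $#list; $i > 0; $i--) {
        my $j = int rand($i + 1);        # pick an index from 0 .. $i
        @list[$i, $j] = @list[$j, $i];   # swap the chosen element into the final position
    }
    return @list;
}
Every element past position $i is already in its final, uniformly random place, which is the same invariant the map-based version above maintains.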

Related

behavior of matlab's rng() function in loop

With rng() before the for loop MATLAB generates one array of random numbers, and with rng() inside the for loop it generates another one. Both results are repeatable, so the rng() seeds work. But I want to know the reason for this behaviour. I was expecting the results to be the same. I think it is actually not because of rng but because of the for loop.
Code example:
for i = 1:2
    rng(1,'philox');
    disp(randn(2,1)); % 1st number is 0.0906, 2nd one is -0.7327
end
rng(1,'philox');
for i = 1:2
    disp(randn(2,1)); % 1st number is 0.7565, 2nd one is -0.7096
end
Shouldn't the results be the same? Isn't rng(1,...) storing the same array of numbers for seed 1?
Your code works as you expect and describe, I believe. You are just not showing all of the outputs of your code example. This is what I get when running it:
>> test
First case:
0.0906
-0.7327
0.0906
-0.7327
Second case:
0.0906
-0.7327
0.7565
-0.7096
In the first case, you reset the random number generator inside the loop, and therefore you get the same results (two numbers) twice. In the second case, you only seed it once, so you get the same numbers as before, and then in the second iteration you get the next 2 random numbers produced by the algorithm. These algorithms, once started, will produce an endless stream of different numbers, and they won't repeat themselves unless you explicitly restart the algorithm with the seed, as you do in the first example.
All this is way less confusing if you call disp(randn(1)) inside the loop, instead of generating 2 numbers each time.
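As a minimal sketch of that simpler variant (same 'philox' generator and seed 1 as in your code):
% Reseeding inside the loop replays the stream, so every iteration prints the same value.
for i = 1:2
    rng(1,'philox');
    disp(randn(1));
end
% Seeding once outside the loop keeps consuming the stream, so each iteration prints a new value.
rng(1,'philox');
for i = 1:2
    disp(randn(1));
end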

How do I compare two weighted regressions in MatLab?

I've been using MatLab as a statistics tool. I like how much I can customise and code myself.
I was delighted to find that it's fairly straightforward to do a weighted linear regression in MatLab. As a slightly silly example, I can load the "carbig" data file and compare horsepower vs mileage for US cars to that of cars from other countries, but decide I only trust 8-cylinder cars.
load carbig
w = (Cylinders==8) + 0.5*(Cylinders~=8); % 1 if 8 cylinders, 0.5 otherwise
for i = 1:length(org)
    o(i,1) = strcmp(org(i,:), org(1,:)); % strcmp only works on one string
end
x1 = Horsepower(o==1);
x2 = Horsepower(o==0);
y1 = MPG(o==1);
y2 = MPG(o==0);
w1 = w(o==1);
w2 = w(o==0);
lm1 = fitlm(x1,y1,'Weights',w1)
lm2 = fitlm(x2,y2,'Weights',w2)
This way, data from 8-cylinder cars will count as one data point each, and data from 3-, 4-, 5- and 6-cylinder cars will count as half a data point each.
Problem is, the obvious way to compare the two regressions is to use ANCOVA, which MatLab has a function for:
aoctool(Horsepower,MPG,o)
This function compares linear regressions on the two groups, but I haven't found an obvious way to include weights.
I suspect I can have a closer look at what the ANCOVA does and include the weights manually. Any easier solution?
I figured that if I give the "trusted" measurements weight 2 and the "untrusted" measurements weight 1, for regression purposes that's the same thing as having one extra identical measurement for each trusted one. Setting the weights to 1 and 0.5 should do the same thing. I can do this with a script.
That also inflates the degrees of freedom quite a bit, so I manually set the degrees of freedom to sum(w)-rank instead of n-rank.
x = [];
y = [];
g = [];
w = (Cylinders==8) + 0.5*(Cylinders~=8);
df = sum(w)
for i = 1:length(w)
    while w(i) > 0
        x = [x; Horsepower(i)];
        y = [y; MPG(i)];
        g = [g; o(i)];
        w(i) = w(i) - 0.5;
    end
end
I then copied the aoctool.m file (edit aoctool) and inserted the value of df somewhere in the new file. It isn't elegant, but it seems to work.
edit aoctool.m
%(insert new df somewhere. Save as aoctool2.m)
aoctool2(x,y,g)

Create different `randperm` numbers in loops

Suppose that we have this structure:
for i = 1:x1
    Out = randperm(40);
    Out_Final = %% divide 'Out' into 10 parts and select these parts for some purposes
    for j = 1:x2
        %% process on `Out_Final`
    end
end
I'm using the outer loop (for i=1:x1) to repeat the main process (the for j=1:x2 loop) and average the outputs to get more robust results. I don't want randperm to produce equal (or nearly equal) outputs; I want its output to be as different as possible on every call inside the (for i=1:x1) loop.
How can I do that in MATLAB R2014a?
The randomness algorithms used by randperm are very good. So, don't worry about that.
However, if you draw 10 random numbers from 1 to 10, you are likely to see some more frequently than others.
If you REALLY don't want this, you should probably not focus on randomly selecting the numbers, but on selecting the numbers in a way that they are nicely spread out throughout their possible range. (That is quite a different problem to solve.)
To address your comment:
The rng function allows you to create reproducible results; make sure to check doc rng for examples.
In your case it seems like you actually don't want to reset the rng each time, as that would lead to correlated random numbers.
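A minimal sketch of that advice (the seed value is arbitrary, and x1, x2 are the loop bounds from your question):
rng(12345);                  % seed once, up front, if you need reproducibility
for i = 1:x1
    Out = randperm(40);      % a fresh, independent permutation on every iteration
    for j = 1:x2
        % split Out into parts and process them here
    end
end
% Calling rng(12345) inside the outer loop instead would make every iteration
% produce the same permutation, which is exactly what you want to avoid.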

Vectorize matlab code to map nearest values in two arrays

I have two lists of timestamps and I'm trying to create a map between them that uses the imu_ts as the true time and tries to find the nearest vicon_ts value to it. The output is a 3xd matrix where the first row is the imu_ts index, the third row is the unix time at that index, and the second row is the index of the closest vicon_ts value above the timestamp in the same column.
Here's my code so far and it works, but it's really slow. I'm not sure how to vectorize it.
function tmap = sync_times(imu_ts, vicon_ts)
tstart = max(vicon_ts(1), imu_ts(1));
tstop = min(vicon_ts(end), imu_ts(end));
% trim imu data to the overlapping time range
tmap(1,:) = find(imu_ts >= tstart & imu_ts <= tstop);
tmap(3,:) = imu_ts(tmap(1,:)); % use imu_ts as ground truth
% find nearest indices in vicon data and map
vic_t = 1;
for i = 1:size(tmap,2)
    % advance to the first vicon timestamp at or above this imu timestamp
    while vicon_ts(vic_t) < tmap(3,i)
        vic_t = vic_t + 1;
    end
    tmap(2,i) = vic_t;
end
The timestamps are already sorted in ascending order, so this is essentially an O(n) operation but because it's looped it runs slowly. Any vectorized ways to do the same thing?
Edit
It appears to be running faster than I expected or first measured, so this is no longer a critical issue. But I would be interested to see if there are any good solutions to this problem.
Have a look at knnsearch in MATLAB. Use cityblock distance and also put an additional constraint that the data point in vicon_ts should be less than its neighbour in imu_ts. If it is not then take the next index. This is required because cityblock takes absolute distance. Another option (and preferred) is to write your custom distance function.
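A rough sketch of that suggestion (assuming knnsearch from the Statistics Toolbox is available; vicon_ts is treated as a column vector here):
vc  = vicon_ts(:);                                          % work with a column copy
idx = knnsearch(vc, tmap(3,:)', 'Distance', 'cityblock');   % nearest vicon timestamp in absolute time
below = vc(idx) < tmap(3,:)';                               % nearest one lies below the imu timestamp
idx(below) = idx(below) + 1;                                % step up to the first one at or above it
tmap(2,:) = idx.';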
I believe that your current method is sound, and I would not try and vectorize any further. Vectorization can actually be harmful when you are trying to optimize some inner loops, especially when you know more about the context of your data (e.g. it is sorted) than the Mathworks engineers can know.
Things that I typically look for when I need to optimize some piece of code like this are:
All arrays are pre-allocated (this is the biggest driver of performance)
Fast inner loops use simple code (Matlab does pretty effective JIT on basic commands, but must interpret others.)
Take advantage of any special features of your data, e.g. use algorithms appropriate for sorted data and early exit conditions in some loops.
You're already doing all this. I recommend no change.
A good start might be to get rid of the while; for each imu timestamp, take the first vicon timestamp at or above it, for example:
for i = 1:size(tmap,2)
    C = vicon_ts - tmap(3,i); % gap from this imu timestamp to every vicon timestamp
    C(C < 0) = Inf;           % ignore vicon timestamps that come before it
    [~, tmap(2,i)] = min(C);  % index of the first vicon timestamp at or above it
end
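If you want to drop the loop entirely, one possibility (a sketch, assuming the vicon_ts values are strictly increasing, i.e. unique) is to let interp1 look up the index of the next vicon timestamp directly:
% map each trimmed imu timestamp to the index of the first vicon timestamp at or above it
tmap(2,:) = interp1(vicon_ts, 1:numel(vicon_ts), tmap(3,:), 'next');
Because the imu timestamps have already been trimmed to lie inside the vicon range, no query falls outside the sample points, so interp1 never returns NaN here.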

How can I work around a round-off error that causes an infinite loop in Perl's Statistics::Descriptive?

I'm using the Statistics::Descriptive library in Perl to calculate frequency distributions and am coming up against a floating point rounding error problem.
I pass two values, 0.205 and 0.205 (taken from other numbers and sprintf'd to those), to the stats module and ask it to calculate the frequency distribution, but it gets stuck in an infinite loop.
Stepping through with a debugger I can see that it's doing:
my $interval = $self->{sample_range}/$partitions;
my $iter = $self->{min};
while (($iter += $interval) < $self->{max}) {
    $bins{$iter} = 0;
    push @k, $iter; ## Keep the "keys" unstringified
}
$self->{sample_range} (the range is max minus min) is returning 2.77555756156289e-17 rather than the 0 I'd expect. This means that the loop (($iter += $interval) < $self->{max}) becomes, for all intents and purposes, an infinite loop.
DB<8> print $self->{max};
0.205
DB<9> print $self->{min};
0.205
DB<10> print $self->{max}-$self->{min};
2.77555756156289e-17
So this looks like a rounding problem. I can't think how to fix this on my side though, and I'm not sure editing the library is a good idea. I'm looking for suggestions of a workaround or alternative.
Cheers,
Neil
I am the Statistics::Descriptive maintainer. Due to its numeric nature, many rounding problems have been reported. I believe this particular one was fixed in a version later than the one you were using, which I released recently, by using multiplication instead of repeated += additions to compute the partitions.
Please use the most up-to-date version from the CPAN, and it should be better.
Not exactly a rounding problem; you can see the more precise values with something like
printf("%.18g %.18g", $self->{max}, $self->{min});
Looks to me like there's a flaw in the module where it assumes the sample range can be divided up into $partitions pieces; because floating point doesn't have infinite precision, this isn't always possible. In your case, the min and max values are exactly adjacent representable values, so there can't be more than one partition. I don't know what exactly the module is using the partitions for, so I'm not sure what the impact of this may be.
Another possible problem in the module is that it is using numbers as hash keys, which implicitly stringifies them, and that slightly rounds the values.
You may have some success in laundering your data through stringification before feeding it to the module:
$data = 0+"$data";
This will at least ensure that two numbers that (with the default printing precision) appear equal are actually equal.
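For illustration, a minimal sketch of that laundering step, using the two adjacent values from your debugger output (the second is the first plus one unit in the last place):
my $min = 0.205;
my $max = 0.205 + 2.77555756156289e-17;   # the adjacent representable double
printf("%.18g\n", $max - $min);           # tiny, but non-zero

# launder both through stringification so values that print the same compare equal
$_ = 0 + "$_" for $min, $max;
printf("%.18g\n", $max - $min);           # now exactly 0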
That shouldn't cause an infinite loop. What would cause that loop to be infinite would be if $self->{sample_range}/$partitions is 0.