I'm currently working on an ANN (feed-forward MLP via neurolab), and have trained my network to do a linear regression of ~40 data points to a single value, e.g.
350, 10, -6.3, ...., -9.78
12 , 5 , -2.5, ...., -8.23
5 , 18, -10 , ...., -8.78
And so on, with the last column being my target value. However, in the process, I scale all of the data to be -1 < x < 1 (keying off of the max and min values for each feature). After running my calculations, I now need to 'un-scale' or 're-scale' my data back to the appropriate range (e.g. from -1 < x < 1 to X < x < Y). The 'obvious' solution is to re-scale the data using something akin to:
I = Imin + (Imax-Imin)*(X-Dmin)/(Dmax-Dmin)
However, won't this 'rescaling' of my data force my predictions to be within the same window (e.g. max and min values) of my training data? It would seem that some of my predictions should be able to fall well outside of the original training data. Is there a way to re-scale the data in a less constrictive manner?
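As a small illustration of the formula above (a sketch only, with made-up values for Imin and Imax): the map is linear and does not clamp anything by itself. A scaled value outside [-1, 1] maps to a value outside [Imin, Imax], so whether predictions can leave the training window depends on whether the network output can leave the scaled range (a bounded output activation such as tanh cannot, a linear output layer can).

```matlab
% Hypothetical ranges: scaled data in [Dmin, Dmax], original target in [Imin, Imax].
Dmin = -1;  Dmax = 1;
Imin = 5;   Imax = 350;

% Inverse of the min-max scaling, written exactly as in the formula above.
unscale = @(X) Imin + (Imax - Imin) .* (X - Dmin) ./ (Dmax - Dmin);

% The last input lies outside the scaled range and maps outside [Imin, Imax].
out = unscale([-1 0 1 1.2]);
disp(out)
```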
Assume that I have the vector shown in the figure below. By inspection, we can see that there are 2 values which suddenly depart from the trend of the vector.
How do I eliminate these sudden changes? That is, how do I automatically detect these noisy values and replace them with the average value of their neighbors?
Define a threshold, compute the average values, then compare the relative error between the values and the averages of their neighbors:
threshold = 5e-2;
averages = [v(1); (v(3:end) + v(1:end-2)) / 2; v(end)];
is_outlier = (v.^2 - averages.^2) > threshold^2 * averages.^2;
Then replace the outliers:
v(is_outlier) = averages(is_outlier);
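A quick way to try the steps above is on a small made-up vector with one spike. The data and the looser threshold below are assumptions for illustration. Note that the snippet expects v to be a column vector, and that the test as written only flags values lying above the average of their neighbors.

```matlab
% Hypothetical data: a linear trend with a single upward spike at index 4.
v = [1; 1.1; 1.2; 5; 1.4; 1.5; 1.6];
threshold = 5e-2;

% Average of the two neighbors; the end points are compared with themselves.
averages = [v(1); (v(3:end) + v(1:end-2)) / 2; v(end)];

% Same test as in the answer: flag values well above the neighbor average.
is_outlier = (v.^2 - averages.^2) > threshold^2 * averages.^2;

% Replace flagged values by the average of their neighbors.
v(is_outlier) = averages(is_outlier);
disp(v.')
```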
I'm trying to generate a 2-by-6 matrix of random numbers based on their density function, for example
f(x)= 2x-4 for 2 < x < 3; 0 otherwise
So from what I understand I have to find the cumulative distribution function first, x^2 - 4x, and then I have to invert it so that I can use the rand function.
This is the part I do not understand: how do I get the inverted function?
Try something similar to this method: https://stackoverflow.com/a/13914141/1011724.
However, your PDF is continuous so you need to adjust it slightly. The cumsum part becomes your CDF and the sum(r >= ... part becomes a definite integral from 0 to rand (which is just your CDF since it evaluates to 0 at x==0) so (ignoring your limits) you get
X = @(x) x.^2 - 4*x
To generate a random matrix go X(rand(2,6))
To account for your limits you can just multiply the entire function by x > 2 & x < 3, but note that if x is greater than 3 then, although the PDF is 0, the CDF should still be 3^2 - 4 = 5
X_limited = @(x) (x.^2 - 4*x).*(x > 2 & x < 3) + (x >= 3)*5
If you plot a graph of (x > 2 & x < 3) you will see it is a rectangular function between 2 and 3, so multiplying by it makes anything outside of that window 0 but leaves anything inside the window unchanged. Similarly, x >= 3 is a step function starting at x == 3, and thus it adds 5 to any values higher than 3. Since the windowing function makes sure the first term is zero when x is greater than 3, this step function ensures a value of 5 for all x greater than 3.
Now you just need to generate random numbers in whatever your range is. Assuming it's between 0 and 5
x = rand(2,6)*5
X_limited(x)
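For comparison, here is a minimal sketch of the inverse-CDF route the question asks about, under the assumption that the density is exactly f(x) = 2x - 4 on 2 < x < 3. Integrating from the lower limit gives F(x) = x^2 - 4x + 4 = (x - 2)^2 on that interval, and solving u = (x - 2)^2 for x gives x = 2 + sqrt(u). The name Finv is made up for this example.

```matlab
% Inverse of F(x) = (x - 2)^2 on [2, 3], assuming f(x) = 2x - 4 there.
Finv = @(u) 2 + sqrt(u);

% Feed uniform random numbers through the inverse CDF to get a 2-by-6 sample.
samples = Finv(rand(2, 6));
disp(samples)
```

All generated values lie between 2 and 3, and values near 3 are more likely than values near 2, as the increasing density suggests.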
I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then populate a matrix with all possible numbers covered by the bins which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection i.e. bincount (1-5) = 2, thus the weight for choosing 1,2,3,4 or 5 = 2 whilst (16-20) = 0 so 16,17,18, 19 or 20 can never be chosen, I simply take the bincounts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl. However, there is significant variation, and data often ends up in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
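If randsample is not available, a cumulative-sum lookup does the same job. The following is only a sketch with made-up values and weights (value 6 is given weight 0 on purpose): each uniform draw is mapped to the first value whose cumulative weight exceeds it, so a zero-weight value corresponds to an empty interval and is never selected.

```matlab
% Hypothetical values and weights; value 6 has weight 0.
values  = 1:8;
weights = [2 2 2 2 2 0 1 1];

% Upper edge of each value's interval on the cumulative scale [0, 1].
edges = cumsum(weights) / sum(weights);

% For each draw, count the edges at or below it; the next interval is the pick.
u      = rand(1000, 1);
idx    = sum(bsxfun(@ge, u, edges), 2) + 1;
sample = values(idx);
```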
So my computer is not too strong.. to say the least..
Yet I want to create a median of all pixels in an entire specific movie.
I was able to do it for a sequence of frames in memory.. but I am not sure on how to do it when reading more frames each time... how do I give median weight?
(like I'll read 100 frames each time but the median has to update according to the current median * 100 * times I read + 100 * current image..)
I have this code:
mov = VideoReader('MVI_3478.MOV');
seq = read(mov, [1 frames]);
% create background
channels = size(seq, 3);
height = size(seq,1);
width = size(seq,2);
BG = zeros(height, width, channels, 'uint8');
for c = 1:channels
    for y = 1:height
        for x = 1:width
            BG(y,x,c) = median(seq(y,x,c,:));
        end
    end
end
and my question is, given that I will add another loop above everything, how to give median weight?
Thanks!
It is not possible to calculate the median this way: the required information is lost.
Example:
median([1,2,3,4,5,6,7]) is 4
median([1,2,3,3,5,6,7]) is 3
median([1,2,3])=2
median([4,5,6,7])=5
median([3,5,6,7])=5
Thus, for both sequences you get the partial results 2 and 5, while the full median is 4 in the first case and 3 in the second.
The only possibility I see is some binary search approach:
smaller=0
larger=0
equal=0
el=numel(s)
while(smaller>=el/2||larger>el/2||equal==0)
    guess=..
    smaller=0
    larger=0
    equal=0
    for c = 1:channels
        for y = 1:height
            for x = 1:width
                s=seq(y,x,c,:)
                smaller=smaller+numel(s(s<guess));
                larger=larger+numel(s(s>guess));
                equal=equal+numel(s(s==guess));
            end
        end
    end
end
This is only a sketch, the code has to be completed. Guess has to be filled with some binary search strategy.
With a large number of frames, calculating the median progressively can be a problem, since the median is a global order statistic and cannot be updated incrementally from partial medians. The classical method is to use the fact that we are working with grayscale 8-bit values (256 levels). Thus for each pixel p(x,y) one maintains a histogram with 256 bins, with the bins together counting n values (as there are n frames).
Thus at each update we will have:
value = double(p(x,y,i)) + 1; % bin index 1..256 for the ith frame (MATLAB indices start at 1)
H(x,y,value) = H(x,y,value) + 1; % updating your histogram
and then, once all frames have been processed, take the cumulative sum of each pixel's histogram and pick the first value whose cumulative count reaches half the number of frames: https://math.stackexchange.com/questions/202302/how-to-calculate-median-and-standard-deviation-from-histogram
The size of each counter can be chosen based on the number of frames n in the video: N = log2(n) bits. The median search is now simplified, since it is a constant-time search within a 256-bin histogram. This also helps when combining many histograms, since the search remains constant-time, independent of the number of frames.
Thus the total size of your histograms would be 256*X*Y*N bits, where X and Y are the dimensions of your image.
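The histogram idea can be sketched as follows, using a small random array in place of a real video (the array frames and its size are assumptions for illustration; in practice each frame would be read from the VideoReader in turn). The sketch assumes an odd number of frames, so that the median is a single sample; with an even number it would return the lower of the two middle values.

```matlab
% Hypothetical stand-in for a grayscale video: height x width x nFrames, uint8.
frames = uint8(randi([0 255], 4, 5, 31));
[height, width, nFrames] = size(frames);

% One 256-bin histogram per pixel, updated one frame at a time.
H = zeros(height, width, 256);
[rows, cols] = ndgrid(1:height, 1:width);
for i = 1:nFrames
    bins = double(frames(:, :, i)) + 1;         % gray level 0..255 -> bin 1..256
    idx  = sub2ind(size(H), rows, cols, bins);  % one distinct entry per pixel
    H(idx) = H(idx) + 1;
end

% Median per pixel: first bin whose cumulative count reaches the middle rank.
C = cumsum(H, 3);
[~, bin] = max(double(C >= (nFrames + 1) / 2), [], 3);
BG = uint8(bin - 1);
```

Only the current frame and the histograms need to be in memory at any time, which is the point of the approach.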
I am using 64 bit matlab with 32g of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results=data*weights= [ 8 11.5 12.666 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.666 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice people have about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast - the millionth point is still weighted at 0.25 compared to the original points weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data=data.';
total=length(data);
x=1:total;
datapad=[zeros(1,total) data];
weights = ([(total+1):-1:2 1:total]).^(-.4);
weights = weights/sum(weights);
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;
results = ifft(Fresults);
results = results(1:total);
plot(x,results)
The only sensible way to do this is with FFT convolution, as underpins the filter function and similar. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!
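As a sanity check, the same padding and kernel layout can be compared against the explicit weight matrix on the 3-element example from the question. This is only a verification sketch; toeplitz is used here to build the small weight matrix, which would not be feasible at full size.

```matlab
% Small example from the question.
data = [2 6 9];
n = numel(data);

% Direct computation with the full weight matrix, W(i,j) = 1/(|i-j|+1).
W = toeplitz(1 ./ (1:n));
direct = data * W;

% FFT convolution with the zero padding and kernel layout used above.
xpad = [zeros(1, n) data];
k    = 1 ./ [(n+1):-1:2 1:n];
xk   = real(ifft(fft(xpad) .* fft(k)));   % real() drops round-off imaginary parts
xk   = xk(1:n);

disp([direct; xk])
```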
This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices, each holding the entries of your original weight matrix that share the value i^(-1) (where i = 1 .. 1.3 million), multiply each of them with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
0 1 0;
0 0 1];
b2 = [0 1 0;
1 0 1;
0 1 0] / 2;
b3 = [0 0 1;
0 0 0;
1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted to have fewer iterations of the inside loop, you could have more than one of the i's in each matrix.
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
I can't use matlab's 'filter' function, because it can only average
things before the current point, right?
That is not correct. You can always pad your data with zeros before filtering and trim samples from the filtered result afterwards. Since filtering with filter (you can also use conv, by the way) is a linear operation, the padding does not change the result: adding zeros contributes nothing, so the order becomes add samples -> filter -> remove samples.
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(weights,1,[data, zeros(1,3)]); % or conv (data, weights)
% removing the delay introduced by the kernel
result = result (3:end-1);
You considered only 2 options:
Multiplying a 1.3M*1.3M matrix with a vector once, or multiplying two 1.3M vectors 1.3M times.
But you can divide your weight matrix into as many sub-matrices as you wish and multiply an n*1.3M matrix with the vector 1.3M/n times.
I assume the fastest option is the one with the smallest number of iterations, where n creates the largest sub-matrix that fits in your memory without making your computer start swapping pages to your hard drive.
With your memory size you should start with n=5000.
You can also make it faster by using parfor (with n divided by the number of processors).
The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (abs(-n:n) + 1).^(-0.1)
For each element in the vector:
Index the relevant portion of the weights vector to consider the current element as the 'centre'.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 additions and n^2 multiplications. With n equal to 1.3 million that's 3.4 trillion operations. A single core of a modern 3GHz CPU can do say 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.
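The loop described above might look like the following on the 3-element example from the question. This is a sketch under stated assumptions: the exponent is -1 so that the numbers match the question (the real problem would use -0.1), the weights vector covers offsets -(n-1)..(n-1), which is all the loop ever indexes, and the results are left as unnormalised weighted sums as in the question, so the final scalar division is omitted.

```matlab
% Small example from the question; p = -1 reproduces its numbers.
data = [2 6 9];
n = numel(data);
p = -1;

% All weights any position can need, computed once; the centre is at index n.
weights = (abs(-(n-1):(n-1)) + 1).^p;

results = zeros(1, n);
for i = 1:n
    % Slice so that w(j) is the weight for distance |j - i| from position i.
    w = weights((n - i + 1):(2*n - i));
    results(i) = sum(w .* data);
end
disp(results)
```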