I would like to randomly choose elements from a finite set that contains both numbers and NaNs while seeding the random number generation procedure.
So far I can make it work without seeding:
data = [0, 1, 2, 3, 4, 5, nan];
sample = datasample(data, 50);
but if I want to seed the number generation:
seed = rng(100);
sample = datasample(seed, data, 50);
I get the following error:
Error using datasample (line 89)
Sample size K must be a non-negative integer.
even though the syntax for datasample is (*):
[y,...] = datasample(s,data,k,...)
I have tried using randsample, too, but I get similar results.
(*) https://it.mathworks.com/help/stats/datasample.html
The documentation isn't very explicit about the first input. You need to pass a RandStream object as the first input argument rather than the struct that rng returns. (As a side note, the output of rng is the previous settings, not the new ones.)
Here is the equivalent of what it seems you were trying to do:
stream = RandStream('mt19937ar', 'Seed', 100);
output = datasample(stream, data, k);
If you instead want to use rng to specify the seed, you can call rng and then use RandStream.getGlobalStream to get the current global random number stream, and pass that to datasample. This is slightly redundant, though, since datasample will use the global random number stream if one isn't provided.
rng(100)
stream = RandStream.getGlobalStream();
output = datasample(stream, data, k);
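Since datasample falls back to the global stream, a minimal equivalent sketch is simply to seed with rng and drop the stream argument (reusing the data vector and the sample size of 50 from the question):
rng(100);                      % seed the global random number stream
sample = datasample(data, 50); % drawn from the (now seeded) global stream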
First of all, I am new to neural networks (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that generates a data set made of
a collection of input vectors (each with 63 elements) and their corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent x, y, z coordinates), and the following 60 numbers are also normalized to unity and correspond to some measurable attributes.
The program continues like this:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); # ADAM optimiser (learning rate, decay rates)
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly; however, I have noticed that as I train my model with an increasingly large data set (Nₜᵣ = 10, 15, 25, etc.), the loss function seems to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activation functions
I have not tried changing the optimization method
Considerations: I need a training data set of nearly 10000 input vectors, so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly? Say, if every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NNs and Flux. How could I build a data set of 10000 I/O vectors differently? Can this be the issue? (I am very inclined to think so.)
Can this behavior be related to the chosen activation functions? (I am not inclined to this)
Can this behavior be related to the optimization algorithm? (I am not inclined to this)
Am I training my model wrong? Are the passes of the loop really iterations, or are they epochs? I am struggling to put (and differentiate) the concepts of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data point doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function, for example the mean squared cost (i.e. divide the summed squared error by the number of samples).
I have a task to complete that requires quasi-random numbers as input, but I notice that the MATLAB function I want to use does not have an option to select any of the quasi-random generators I want (e.g. Halton, Sobol, etc.). MATLAB has them as stand-alone functions, not as options in the ubiquitous 'randn' and 'rng' functions. What MATLAB uses by default is the Mersenne Twister, a pseudo-random generator. So, for instance, copularnd uses 'randn'/'rng', which is based on pseudo-random numbers...
Is there a way to incorporate them into the rand or rng functions embedded in other code (e.g. copularnd)? Any pointers would be much appreciated. Note: 'copularnd' calls 'mvnrnd', which in turn uses 'randn' and then pulls 'rng'...
First, you need to initialize the haltonset using the Leap, Skip, and Scramble properties.
You can check the documentation, but a short description is as follows:
Scramble - is used for shuffling the points
Skip - helps to exclude a range of points from the set
Leap - is the size of jump from the current selected point to the next one. The points in between are ignored.
Now you can build a haltonset object:
p = haltonset(2,'Skip',1e2,'Leap',1e1);
p = scramble(p,'RR2');
This makes a 2D Halton point set, skipping the first 100 points and leaping over 10 points between the ones kept. The scramble method is 'RR2', which is applied in the second line. You can see that many points are generated:
p =
Halton point set in 2 dimensions (818836295885536 points)
Properties:
Skip : 100
Leap : 10
ScrambleMethod : RR2
When you have your haltonset object, p, you can access the values by just selecting them:
x = p(1:10,:)
Notice:
So, you need to create the object first and then use the generated points. To get different results, you can play with the Leap and Scramble properties of the object. Another thing you can do is to use a uniform random index generator such as randi to select points each time from the generated set. That makes sure you are accessing uniformly random parts of the set each time.
For instance, you can generate a random index vector (4 points in this example) and then use it to select points from the Halton set.
>> idx = randi(size(p,1),1,4)
idx =
1.0e+14 *
3.1243 6.2683 6.5114 1.5302
>> p(idx,:)
ans =
0.5723 0.2129
0.8918 0.6338
0.9650 0.1549
0.8020 0.3532
'qrandstream' may be the answer I am looking for... with 'qrand' instead of 'rand'.
E.g., from the MATLAB documentation:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p);
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
X = qrand(q,sampSize);
[h,pval] = kstest(X,[X,X]);
PVALS(test) = pval;
end
I will post my solution once I am done :)
I want to use SVM, KNN, and AdaBoost classifiers on my data features. I built up code where I calculated the frame differences and computed the features (eigenvalues, strain energy, potential energy), building up an array of [number of frames, features]. I try to use SVM as:
Features = data; % Features array [40, 5]
class = ones(numFrames-1, 1); % numFrames=41
class(1:(fix(numFrames/2))) = -1;
SVMstruct = svmtrain(Features, class, 'Kernel_Function', 'rbf');
newclass = svmclassify(SVMstruct, [40 5]); %Test data
I got an error:
The number of columns in TEST and training data must be equal.
%classperf(cp,newclass); % performance of the classifier given by cp
What is the reason for this error? And how do I use further classifiers with this feature set?
I can infer the following from the error you are getting.
There is no error in svmtrain, which means size(Features) = [40 5]. The error is in the last line: see the syntax of svmclassify. You must pass test data that has the same number of features/columns as the training data (in your case 5). Instead, you are passing the literal row vector [40 5], which has only two columns. Pass the actual test set of n rows and 5 columns. The last line should be:
newclass = svmclassify(SVMstruct, testData); % where size(testData) = [n 5]; n indicates how many test samples you have
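For illustration, a minimal sketch of the whole flow, reusing the variables from your snippet; testData below is a made-up random matrix used only to show the required shape, so substitute your real test features there:
numFrames = 41;                                   % as in your snippet
Features = data;                                  % your feature array, size [40 5]
class = ones(numFrames-1, 1);
class(1:fix(numFrames/2)) = -1;
SVMstruct = svmtrain(Features, class, 'Kernel_Function', 'rbf');
testData = rand(10, 5);                           % hypothetical test set: 10 samples, 5 features
newclass = svmclassify(SVMstruct, testData);      % one predicted label per test row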
Is there any way to use a for-loop in MATLAB with a custom step? What I want to do is iterate over all powers of 2 less than a given number. The equivalent loop in C++ (for example) would be:
for (int i = 1; i < 65; i *= 2)
Note 1: This is the kind of iteration that best fits for-loops, so I'd like to not use while-loops.
Note 2: I'm actually using Octave, not MATLAB.
Perhaps you want something along the lines of
for i=2.^[1:6]
disp(i)
end
Except you will need to figure out the range of exponents. This uses the fact that, since a_(i+1) = a_i * 2, the sequence can be written as a_i = 2^i.
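For example (a sketch, assuming the upper bound from your C++ loop is stored in a variable N), the exponent range can be derived with log2:
N = 65;                            % upper bound, as in the C++ example
for i = 2.^(0:floor(log2(N-1)))    % powers of 2 strictly less than N: 1, 2, 4, ..., 64
disp(i)
end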
Otherwise you could do something like the following
i = 1;
while i < 65
disp(i);   % print before doubling, so the values shown are 1, 2, ..., 64 as in the C++ loop
i = i*2;
end
You can iterate over any vector, so you can use vector operations to create your vector of values before you start your loop. A loop over the first 100 square numbers, for example, could be written like so:
values_to_iterate = [1:100].^2;
for i = values_to_iterate
i
end
Or you could loop over each position in the vector values_to_iterate (this gives the same result, but has the benefit that i keeps track of how many iterations you have done - this is useful if you are writing a result from each loop sequentially to an output vector):
values_to_iterate = [1:100].^2;
for i = 1:length(values_to_iterate)
values_to_iterate(i)
results_vector(i) = some_function( values_to_iterate(i) );
end
More concisely, you can write the first example as simply:
for i = [1:100].^2
i
end
Unlike in C, there doesn't have to be a 'rule' to get from one value to the next.
The vector iterated over can be completely arbitrary:
for i = [10, -1000, 23.3, 5, inf]
i
end
I'm trying to cluster some data I have from the KDD Cup 1999 dataset.
The output from the file looks like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
with 48 thousand different records in that format. I have cleaned the data up and removed the text, keeping only the numbers, so the records now contain only numeric values.
I created a comma-delimited file in Excel and saved it as a CSV file, then created a data source from the CSV file in MATLAB. I've tried running it through the FCM toolbox in MATLAB (findcluster outputs 38 data types, which is expected with 38 columns).
The clusters, however, don't look like clusters, or it's not accepting the data and working the way I need it to.
Could anyone help me find the clusters? I'm new to MATLAB, so I don't have any experience with it, and I'm also new to clustering.
The method:
Choose the number of clusters (K)
Initialize centroids (K patterns randomly chosen from the data set)
Assign each pattern to the cluster with the closest centroid
Calculate the mean of each cluster to be its new centroid
Repeat steps 3 and 4 until a stopping criterion is met (no pattern moves to another cluster)
This is what I'm trying to achieve:
This is what I'm getting:
load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000
figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
figure
line(kddcup1(index1, 1), kddcup1(index1, 2), 'linestyle',...
'none','marker', 'o','color','g');
line(kddcup1(index2,1),kddcup1(index2,2),'linestyle',...
'none','marker', 'x','color','r');
hold on
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)
Since you are new to machine-learning/data-mining, you shouldn't tackle such advanced problems. After all, the data you are working with was used in a competition (KDD Cup'99), so don't expect it to be easy!
Besides, the data was intended for a classification task (supervised learning), where the goal is to predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally more difficult.
This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data. Directly applying simple algorithms like K-means will generally yield poor results.
For starters, you need to normalize the attributes so they are on the same scale: when computing the Euclidean distance as part of step 3 of your method, features with values such as 239 and 486 will dominate the features with small values such as 0.05, thus distorting the result; see the sketch below.
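As a rough sketch (assuming kddcup1 is the numeric matrix you already load, and kddcup1_norm is just an illustrative name), one common way to bring all attributes to the same scale is to z-score each column before clustering:
load kddcup1.dat
mu = mean(kddcup1);                    % per-column mean
sigma = std(kddcup1);                  % per-column standard deviation
sigma(sigma == 0) = 1;                 % avoid division by zero for constant columns
kddcup1_norm = (kddcup1 - repmat(mu, size(kddcup1,1), 1)) ./ repmat(sigma, size(kddcup1,1), 1);
[center,U,objFcn] = fcm(kddcup1_norm, 2);   % cluster the normalized data instead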
Another point to remember is that too many attributes can be a bad thing (curse of dimensionality). Thus, you should look into feature selection or dimensionality reduction techniques; a small sketch follows.
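For instance (a sketch, assuming the Statistics Toolbox pca function is available and reusing the hypothetical kddcup1_norm matrix from above), you could keep only the first few principal components and cluster those:
[coeff, score] = pca(kddcup1_norm);    % principal component analysis of the normalized data
reduced = score(:, 1:2);               % keep the 2 strongest components (n-by-2)
[center,U,objFcn] = fcm(reduced, 2);   % cluster in the reduced space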
Finally, I suggest you familiarize yourself with a simpler dataset...