I have a data set similar to the following:
bthd = sort(floor(1+(10-1).*rand(10,1)));
bthd2 = sort(floor(1+(10-1).*rand(10,1)));
bthd3 = sort(floor(1+(10-1).*rand(10,1)));
Depth = [bthd;bthd2;bthd3];
Jday = [repmat(733774,10,1);repmat(733775,10,1);repmat(733776,10,1)];
temp = 10+(30-10).*rand(30,1);
Data = [Jday,Depth,temp];
Here 'Data' is a matrix with Julian date in the first column, depth in the second, and temperature in the third. I would like to find the first and last temperature values for each unique Jday. This can be obtained by:
Data = [Jday,Depth,temp];
[~,~,b] = unique(Data(:,1),'rows');
for j = 1:length(unique(b));
top_temp(j) = temp(find(b == j,1,'first'));
bottom_temp(j) = temp(find(b == j,1,'last'));
end
However, my data set is extremely large, and this loop takes a long time to run. Can anyone suggest a vectorized solution?
Use diff:
% for example
Jday = [1 1 1 2 2 3 3 3 5 5 6 7 7 7];
last = find( [diff(Jday) 1] );
first = [1 last(1:end-1)+1];
top_temp = temp(first) ;
bottom_temp = temp(last);
Note that this solution assumes Jday is sorted. If this is not the case, you may sort Jday prior to the suggested procedure.
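As a minimal sketch of the sorting step mentioned above, assuming Jday and temp are column vectors as in the question (the sample values here are made up for illustration):

```matlab
% Hypothetical sample data: Jday is a column vector and is NOT sorted
Jday = [3; 1; 1; 2; 3; 2; 1];
temp = [30; 10; 11; 20; 31; 21; 12];

% sort is stable, so rows with the same Jday keep their original order
[JdaySorted, order] = sort(Jday);
tempSorted = temp(order);

% A nonzero diff marks the last row of a group; the final row always ends one
lastIdx  = find([diff(JdaySorted); 1]);
% Each group starts right after the previous group ends
firstIdx = [1; lastIdx(1:end-1) + 1];

top_temp    = tempSorted(firstIdx);
bottom_temp = tempSorted(lastIdx);
```

Note the semicolon in [diff(JdaySorted); 1]: for a column vector the extra 1 has to be appended vertically, whereas the row-vector example above appends it horizontally.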
You should be able to accomplish this using the occurrence option of the unique function:
[~, topidx, ~] = unique(Data(:, 1), 'first', 'legacy');
[~, bottomidx, ~] = unique(Data(:, 1), 'last', 'legacy');
top_temp = temp(topidx);
bottom_temp = temp(bottomidx);
The legacy option is needed if you're using MATLAB R2013a. You should be able to remove it if you're running R2012b or earlier.
I have a list of numerical values C that represent hours and minutes: first column hours, second column minutes
C=[19 44;15 57;15 19;0 21;20 21;20 20;0 6;22 0;21 17;17 47;23 51;22 27;21 39;21 36]
I want to split them into ranges:
ranges= {[0 0; 3 59] [4 0; 7 59] [8 0; 11 59] [12 0; 15 59] [16 0; 19 59] [20 0; 23 59]}
Can you help me?
You can use arrayfun to achieve this. Try the following code:
times = randi(20,1,30)+rand(1,30); %% Example data.
s = arrayfun(@(n) times(times>=4*n & times<4*(n+1)), 0:(24/4-1),'UniformOutput', false)'
celldisp(s)
s{1} =
1.2963 2.4468 2.7948 1.5328 1.3507
s{2} =
5.4868 5.6443 4.9390
s{3} =
9.5470 10.6868 10.1835 8.7802 8.7757 9.4359 8.3786 10.5870
s{4} =
12.8176 13.8759 13.6225
s{5} =
16.9294 17.8116
s{6} =
20.5108
If you want your values sorted:
s = arrayfun(@(n) sort(times(times>=4*n & times<4*(n+1))), 0:(24/4-1),'UniformOutput', false)'
celldisp(s)
s{1} =
1.2963 1.3507 1.5328 2.4468 2.7948
s{2} =
4.9390 5.4868 5.6443
s{3} =
8.3786 8.7757 8.7802 9.4359 9.5470 10.1835 10.5870 10.6868
s{4} =
12.8176 13.6225 13.8759
s{5} =
16.9294 17.8116
s{6} =
20.5108
The easiest way would be to use hist() and histcounts().
As mentioned by user4694, those aren't doubles but either durations or timestamps.
Either way, you have to transform them into doubles first, e.g. with minutes() for durations, and create the bins the same way. This is coded for duration:
X=[duration(0,0,0) duration(4,0,0) duration(3,15,0)]; %and so on
bins=[duration(0,0,0) duration(4,0,0) duration(8,0,0)]
% if you just want the histogram
hist(minutes(X),minutes(bins));
% if you want to know which element in X goes to which bin try
[amount_in_bin,Bins,which_bin]=histcounts(minutes(X),minutes(bins));
%or just go for the last one
[~,~,which_bin]=histcounts(minutes(X),minutes(bins));
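For the hours/minutes matrix C from the question, the same idea can be sketched without durations by converting each row to minutes since midnight. This is only a sketch: the variable names mins, edges and groups are mine, and it assumes histcounts is available (R2014b or later).

```matlab
% C is taken from the question: column 1 = hours, column 2 = minutes
C = [19 44;15 57;15 19;0 21;20 21;20 20;0 6;22 0;21 17;17 47;23 51;22 27;21 39;21 36];

mins  = C(:,1)*60 + C(:,2);   % minutes since midnight, as plain doubles
edges = 0:240:1440;           % six 4-hour bins: [0,240), [240,480), ...

% counts(k) is the number of rows in bin k; which_bin(i) is the bin of row i
[counts, ~, which_bin] = histcounts(mins, edges);

% Collect the rows of C that fall in each bin (empty bins give 0-by-2 matrices)
groups = arrayfun(@(k) C(which_bin == k, :), 1:numel(counts), 'UniformOutput', false);
```

groups{k} then holds the [hours minutes] rows of the k-th range, in their original order.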
I have written code that stores data in matrices, but I want to shorten it so that it loops instead of repeating the same block.
The number of matrices created is a known variable. If it were 3, the code would be:
for i = 1:31
if idx(i) == 1
C1 = [C1; Output2(i,:)];
end
if idx(i) == 2
C2 = [C2; Output2(i,:)];
end
if idx(i) == 3
C3 = [C3; Output2(i,:)];
end
end
If I understand correctly, you want to extract rows from Output2 into new variables based on idx values? If so, you can do as follows:
Output2 = rand(5, 10); % example
idx = [1,1,2,2,3];
% get the rows of Output2 whose entry in idx equals the given value
C1 = Output2(find(idx==1),:);
C2 = Output2(find(idx==2),:);
C3 = Output2(find(idx==3),:);
Similar to Marcin, I have another solution. Here I predefine my_C as a cell array. Output2 and idx are randomly generated, and instead of find I just use logical addressing. You have to wrap each result in a cell using {}.
Output2 = round(rand(31,15)*10);
idx = uint8(round(1+rand(1,31)*2));
my_C = cell(1,3);
my_C(1,1) = {Output2(idx==1,:)};
my_C(1,2) = {Output2(idx==2,:)};
my_C(1,3) = {Output2(idx==3,:)};
If you want to get your data back, just use e.g. my_C{1,1} for the first group.
If you have not 3 but n resulting matrices, you can use:
Output2 = round(rand(31,15)*10);
idx = uint8(round(1+rand(1,31)*(n-1)));
my_C = cell(1,n);
for k=1:n
my_C(1,k) = {Output2(idx==k,:)};
end
where n is a positive integer.
I would recommend a slightly different approach. Besides making the rest of the code more maintainable, it may also slightly speed up execution, because MATLAB uses a JIT compiler and code passed to eval must be recompiled every time. Try this:
nMatrices = 3;
C = cell(1, nMatrices); % preallocate one cell per group
for k = 1:nMatrices
C{k} = Output2(idx==k,:);
end
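If you prefer to avoid the explicit loop, the same grouping can be sketched with arrayfun. The small idx and Output2 below are made-up sample data, and taking the number of groups from max(idx) is my assumption:

```matlab
% Hypothetical sample data: row i of Output2 is [i, 10*i]
idx = [2 1 3 2 2 3];
Output2 = (1:6)' * [1 10];

% Assumption: group labels are the integers 1..max(idx)
nMatrices = max(idx);

% One cell per group, each holding the rows whose idx equals k
C = arrayfun(@(k) Output2(idx == k, :), 1:nMatrices, 'UniformOutput', false);
```

C{2} then contains rows 1, 4 and 5 of Output2, which is what the original if-chain would have collected in C2.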
As patrik said in the comments, naming variables like this is poor practice. You would be better off using a cell array with M{1}=C1, or, if all the Ci are the same size, even just a 3D array M where M(:,:,1)=C1.
If you really want to use C1, C2, ... as your variable names, I think you will have to use eval, as arielnmz mentioned. One way to do this in MATLAB is
for i=1:length(idx)
eval(['C' num2str(idx(i)) '=[C' num2str(idx(i)) ';Output2(' num2str(i) ',:)];'])
end
Edited to add test code:
idx=[2 1 3 2 2 3];
Output2=rand(6,4);
C1a=[];
C2a=[];
C3a=[];
for i = 1:length(idx)
if idx(i) == 1
C1a = [C1a; Output2(i,:)];
end
if idx(i) == 2
C2a = [C2a; Output2(i,:)];
end
if idx(i) == 3
C3a = [C3a; Output2(i,:)];
end
end
C1=[];
C2=[];
C3=[];
for i=1:length(idx)
eval(['C' num2str(idx(i)) '=[C' num2str(idx(i)) ';Output2(' num2str(i) ',:)];'])
end
all(C1a(:)==C1(:))
all(C2a(:)==C2(:))
all(C3a(:)==C3(:))
I am trying to implement a decision tree with recursion. So far I have written the following:
1. From a given data set, find the best split and return the branches. To give more detail, let's say I have data with features as the columns of a matrix, and the last column indicates the class of the data (1 or -1).
2. Based on 1, I have the best feature to split on, along with the branches under that split. Let's say that, based on information gain, feature 9 is the best split and the unique values in feature 9, {1,3,5}, are its branches.
3. I have figured out how to get the data related to each branch; I then need to iterate over each branch's data to get the next set of splits. I am having trouble figuring out this recursion.
Here is the code that I have so far. The recursion that I am doing right now doesn't look right. How can I fix this?
function [indeces_of_node, best_split] = split_node(X_train, Y_train)
%cell to save split information
feature_to_split_cell = cell(size(X_train,2)-1,4);
%iterate over features
for feature_idx=1:(size(X_train,2) - 1)
%get current feature
curr_X_feature = X_train(:,feature_idx);
%identify the unique values
unique_values_in_feature = unique(curr_X_feature);
H = get_entropy(Y_train); %This is actually H(X) in slides
%temp entropy holder
%Storage for feature element's class
element_class = zeros(size(unique_values_in_feature,1),2);
%conditional probability H(X|y)
H_cond = zeros(size(unique_values_in_feature,1),1);
for aUnique=1:size(unique_values_in_feature,1)
match = curr_X_feature(:,1)==unique_values_in_feature(aUnique);
mat = Y_train(match);
majority_class = mode(mat);
element_class(aUnique,1) = unique_values_in_feature(aUnique);
element_class(aUnique,2) = majority_class;
H_cond(aUnique,1) = (length(mat)/size((curr_X_feature),1)) * get_entropy(mat);
end
%Getting the information gain
IG = H - sum(H_cond);
%Storing the IG of features
feature_to_split_cell{feature_idx, 1} = feature_idx;
feature_to_split_cell{feature_idx, 2} = max(IG);
feature_to_split_cell{feature_idx, 3} = unique_values_in_feature;
feature_to_split_cell{feature_idx, 4} = element_class;
end
%set feature to split zero for every fold
feature_to_split = 0;
%getting the max IG of the fold
max_IG_of_fold = max([feature_to_split_cell{:,2:2}]);
%vector to store values in the best feature
values_of_best_feature = zeros(size(15,1));
%Iterating over the cell to get the index and the values under the best
%split feature.
for i=1:length(feature_to_split_cell)
if (max_IG_of_fold == feature_to_split_cell{i,2});
feature_to_split = i;
values_of_best_feature = feature_to_split_cell{i,4};
end
end
display(feature_to_split)
display(values_of_best_feature(:,1)')
curr_X_feature = X_train(:,feature_to_split);
best_split = feature_to_split
indeces_of_node = unique(curr_X_feature)
%testing
for k = 1 : length(values_of_best_feature)
% Condition to stop the recursion: if the classes are pure then we are
% done splitting; if both classes have the same number of attributes
% then we are done splitting.
if (sum(values_of_best_feature(:,2) == -1) ~= sum(values_of_best_feature(:,2) == 1))
if((sum(values_of_best_feature(:,2) == -1) ~= 0) || (sum(values_of_best_feature(:,2) == 1) ~= 0))
mat1 = X_train(X_train(:,5)== values_of_best_feature(k),:);
[indeces_of_node, best_split] = split_node(mat1, Y_train);
end
end
end
end
Here is the output of my code. It looks like my recursion only goes down one branch and never comes back to the rest of the branches:
feature_to_split =
5
ans =
1 2 3 4 5 6 7 8 9
feature_to_split =
9
ans =
3 5 7 8 11
feature_to_split =
21
feature_to_split =
21
feature_to_split =
21
feature_to_split =
21
If you are interested in running this code: git
After multiple rounds of debugging, I figured out the answer. I hope someone will benefit from this:
for k = 1 : length(values_of_best_feature)
% Condition to stop the recursion: if the classes are pure then we are
% done splitting; if both classes have the same number of attributes
% then we are done splitting.
if((sum(values_of_best_feature(:,2) == -1) ~= 0) || (sum(values_of_best_feature(:,2) == 1) ~= 0))
X_train(:,feature_to_split) = [];
mat1 = X_train(X_train(:,5)== values_of_best_feature(k),:);
%if(level >= curr_level)
split_node(mat1, Y_train, 1, 2, level-1);
%end
end
end
return;
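For anyone who wants to see the recursion pattern in isolation, here is a minimal sketch, not the code from the repository. It assumes categorical features and labels in {-1, 1}; the names grow_tree_demo, grow_tree, info_gain and entropy_of are hypothetical, and the toy data is made up. The key points are that the recursive call sits inside the loop over all branch values, and that X and Y are filtered with the same row mask before recursing.

```matlab
function nLeaves = grow_tree_demo()
% Toy data (made up): two categorical features, labels in {-1, 1}
X = [1 1; 1 2; 2 1; 2 2; 3 1; 3 2];
Y = [1; 1; -1; -1; 1; -1];
nLeaves = grow_tree(X, Y, 1:size(X,2), 0);
end

function nLeaves = grow_tree(X, Y, features, depth)
indent = repmat(' ', 1, 2*depth);
% Stop when the node is pure or there is nothing left to split on
if numel(unique(Y)) <= 1 || isempty(features)
    disp([indent sprintf('leaf: class %d', mode(Y))]);
    nLeaves = 1;
    return;
end
% Pick the remaining feature with the largest information gain
gains = arrayfun(@(f) info_gain(X(:,f), Y), features);
[~, best] = max(gains);
f = features(best);
remaining = features(features ~= f);
vals = unique(X(:,f));
nLeaves = 0;
for k = 1:numel(vals)
    rows = X(:,f) == vals(k);   % the same mask filters both X and Y
    disp([indent sprintf('feature %d == %g', f, vals(k))]);
    % Recurse into EVERY branch; the loop resumes after each call returns
    nLeaves = nLeaves + grow_tree(X(rows,:), Y(rows), remaining, depth + 1);
end
end

function g = info_gain(x, Y)
% Entropy of Y minus the weighted entropy of Y within each value of x
g = entropy_of(Y);
vals = unique(x);
for k = 1:numel(vals)
    rows = x == vals(k);
    g = g - mean(rows) * entropy_of(Y(rows));
end
end

function H = entropy_of(Y)
[~, ~, c] = unique(Y);
p = accumarray(c(:), 1) / numel(Y);   % class proportions, all > 0
H = -sum(p .* log2(p));
end
```

Save it as grow_tree_demo.m and call grow_tree_demo. Instead of deleting columns from X, the sketch passes the list of still-available feature indices down the recursion, so column numbers keep their meaning at every level.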
I have 4 data sets of different lengths (in rows), and they all have a differing number of columns. I need to apply an equation to each of these columns and then extract the max value from each of them.
The equation I am trying to use is:
averg = mean([interpolate(1:end-2),interpolate(3:end)],2); % this is just getting your average value.
real_num = interpolate(2:end-1);
streaking1 = (abs(real_num-averg)./averg)*100;
An example of one of my data sets is 5448 rows by 13 columns
EDIT
This is the current adaptation of Ben A.'s solution, and it is working.
A = interpolate;
averg = (A(1:end-2,:) + A(3:end,:))/2;
center_A = A(2:end-1,:);
streaking = zeros(size(center_A)); % preallocate: one column per data column
for idx = 1:size(A,2)
streaking(:,idx) = (abs(center_A(:,idx)-averg(:,idx))./averg(:,idx))*100;
end
I'm not entirely sure that I fully follow what you're doing in each step, but here is a stab at it:
A = interpolate;
averg = (A(1:end-2,:) + A(3:end,:))/2;
center_A = A(2:end-1,:);
streaking = zeros(size(center_A)); % preallocate: one column per data column
for idx = 1:size(A,2)
streaking(:,idx) = (abs(center_A(:,idx)-averg(:,idx))./averg(:,idx))*100;
end
averg holds, for each column, the mean of the two neighbouring rows. center_A holds the interior rows and plays the role of the real_num variable that you had before. I'm not clear why you would need to index the way you do, since nothing here is at risk of going out of bounds.
If this helps, great! If not let me know and I'll see if I can revise somewhat.
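Since the question also asks for the max value from each column, here is a hedged sketch of a fully vectorized version. The small matrix A is made-up sample data standing in for interpolate, and max_streaking is a name I introduced:

```matlab
% Hypothetical stand-in for the interpolate matrix (5 rows, 2 columns)
A = [10 100; 12 110; 11 90; 13 105; 12 100];

averg    = (A(1:end-2,:) + A(3:end,:)) / 2;   % mean of the two neighbouring rows
center_A = A(2:end-1,:);                      % the interior rows themselves

% Element-wise operations work on all columns at once, so no loop is needed
streaking = abs(center_A - averg) ./ averg * 100;

% Maximum along dimension 1 gives one value per column
max_streaking = max(streaking, [], 1);
```

The explicit dimension argument in max(streaking, [], 1) keeps the result per-column even if a data set happens to have only one interior row.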
no time scores
1 10 123
2 11 22
3 12 22
4 50 55
5 60 22
6 70 66
. . .
. . .
n n n
Above is the content of my txt file (thousands of lines).
1st column - number of samples
2nd column - time (from beginning to end ->accumulated)
3rd column - scores
I want to create a new file containing the total of the scores of every three samples divided by the time difference across those same samples.
e.g. (123+22+22)/ (12-10) = 167/2 = 83.5
(55+22+66)/(70-50) = 143/20 = 7.15
new txt file
83.5
7.15
.
.
.
n
so far I have this code:
fid=fopen('data.txt')
data = textscan(fid,'%*d %d %d')
time = (data{1})
score= (data{2})
for sample=1:length(score)
% ..... I'm stuck here ..
end
....
If you are feeling adventurous, here's a vectorized one-line solution using ACCUMARRAY (assuming you have already read the file into a matrix variable data, as the others have shown):
NUM = 3;
result = accumarray(reshape(repmat(1:size(data,1)/NUM,NUM,1),[],1),data(:,3)) ...
./ (data(NUM:NUM:end,2)-data(1:NUM:end,2))
Note that here the number of samples NUM=3 is a parameter and can be substituted by any other value.
Also, reading your comment above, if the number of samples is not a multiple of this number (3), then simply discard the remaining samples by doing this beforehand:
data = data(1:fix(size(data,1)/NUM)*NUM,:);
I'm sorry, here's a much simpler one :P
result = sum(reshape(data(:,3), NUM, []))' ./ (data(NUM:NUM:end,2)-data(1:NUM:end,2));
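As a quick check, here is that one-liner applied to the six sample rows from the question, plus one extra made-up row to show the trimming step. The inline data matrix is only for illustration:

```matlab
% Columns: sample number, accumulated time, score (rows 1-6 from the question)
data = [1 10 123; 2 11 22; 3 12 22; 4 50 55; 5 60 22; 6 70 66; 7 80 5];
NUM = 3;

% Discard trailing rows that do not fill a complete group of NUM samples
data = data(1:fix(size(data,1)/NUM)*NUM, :);

% Each column of the reshaped matrix is one group of NUM scores
sums = sum(reshape(data(:,3), NUM, []), 1)';
dt   = data(NUM:NUM:end,2) - data(1:NUM:end,2);   % last time minus first time per group
result = sums ./ dt;
```

This reproduces the values worked out by hand in the question, 83.5 and 7.15.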
%# Easier to load with importdata
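imported = importdata('data.txt',' ',1);
%# With a header line, importdata returns a struct; the numbers are in .data
data = imported.data;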
%# Get the number of rows
n = size(data,1);
%# Column IDs
time = 2;score = 3;
%# The interval size (3 in your example)
interval = 3;
%# Pre-allocate space
new_data = zeros(numel(interval:interval:n),1);
%# For each new element in the new data
index = 1;
%# This will ignore elements past the closest (floor) multiple of 3 as requested
for i = interval:interval:n
%# First and last elements in a batch
a = i-interval+1;
b = i;
%# Compute the new data
new_data(index) = sum( data(a:b,score) )/(data(b,time)-data(a,time));
%# Increment
index = index+1;
end
For what it's worth, here is how you would go about doing that in Python. It is probably adaptable to MATLAB.
import numpy
no, time, scores = numpy.loadtxt('data', skiprows=1).T
# here I assume that your n is a multiple of 3! otherwise you have to adjust
sums = scores[::3]+scores[1::3]+scores[2::3]
dt = time[2::3]-time[::3]
result = sums/dt
I suggest you use the importdata() function to get your data into your variable called data. Something like this:
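data = importdata('data.txt',' ', 1);
data = data.data;
Replace ' ' with the delimiter your file uses; the 1 tells MATLAB to skip one header line. Because of the header line, importdata returns a struct, and the numeric matrix is in its data field. Then, to compute your results, try this statement: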
(data(1:3:end,3)+data(2:3:end,3)+data(3:3:end,3))./(data(3:3:end,2)-data(1:3:end,2))
This worked on your sample data and should work on the real data you have. If you figure it out yourself you'll learn some useful MATLAB.
Then use save() to write the results back to a file.
PS If you find yourself writing loops in Matlab you are probably doing something wrong.