One hot encode column vectors in matrix without iterating - matlab

I am implementing a neural network and am trying to one hot encode a matrix of column vectors based on the max value in each column. Previously, I had been iterating through the matrix vector by vector, but I've been told that this is unnecessary and that I can actually one hot encode every column vector in the matrix at the same time. Unfortunately, after perusing SO, GitHub, and MathWorks, nothing seems to be getting the job done. I've listed my previous code below. Please help! Thanks :)
UPDATE:
This is what I am trying to accomplish...except this only changed the max value in the entire matrix to 1. I want to change the max value in each COLUMN to 1.
one_hots = bsxfun(@eq, mini_batch_activations, max(mini_batch_activations(:)))
UPDATE 2:
This is what I am looking for, but it only works for rows. I need columns.
V = max(mini_batch_activations,[],2);
idx = mini_batch_activations == V;
Iterative code:
% This is the matrix I want to one hot encode
mini_batch_activations = activations{length(layers)};
%For each vector in the mini_batch:
for m = 1:size(mini_batch_activations, 2)
% Isolate column vector for mini_batch
vector = mini_batch_activations(:,m);
% One hot encode vector to compare to target vector
one_hot = zeros(size(mini_batch_activations, 1),1);
[max_val,ind] = max(vector);
one_hot(ind) = 1;
% Isolate corresponding column vector in targets
mini_batch = mini_batch_y{k};
target_vector = mini_batch(:,m);
% Compare one_hot to target vector , and increment result if they match
if isequal(one_hot, target_vector)
num_correct = num_correct + 1;
endif
...
endfor

You’ve got the maxima for each column:
V = max(mini_batch_activations,[],1); % note 1, not 2!
Now all you need is an equality comparison; the output is a logical array that readily converts to 0s and 1s. Note that MATLAB (R2016b and later) and Octave do implicit singleton expansion; on older MATLAB versions, use bsxfun(@eq, mini_batch_activations, V) instead:
one_hot = mini_batch_activations==V;
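As a minimal sketch of how this could replace the whole loop from the question, including the accuracy count: the matrices A and T below are made-up stand-ins for mini_batch_activations and the target batch. One caveat worth showing is that == marks every tied maximum in a column, so when exactly one 1 per column is required, the index returned by max can be used instead.

```matlab
% Made-up 3x4 activations (one column per sample) and targets
A = [0.1 0.7 0.2 0.5;
     0.6 0.2 0.3 0.5;
     0.3 0.1 0.5 0.0];
T = [0 1 0 1;
     1 0 1 0;
     0 0 0 0];

% Column maxima and equality comparison (implicit expansion)
V = max(A, [], 1);
one_hot = (A == V);            % column 4 has a tie, so it gets two ones

% Exactly one 1 per column: use the index of the first maximum
[~, ind] = max(A, [], 1);
one_hot_first = zeros(size(A));
one_hot_first(sub2ind(size(A), ind, 1:size(A, 2))) = 1;

% Count columns whose one-hot vector matches the target column
num_correct = sum(all(one_hot_first == T, 1));
```

With this data, three of the four columns match their targets.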

Related

Columnwise removal of first ones from binary matrix. MATLAB

I have a binary matrix. I want to remove the first one from each column, but keep it if it is the only one in that column. I have some code that produces the correct result, but it looks ugly: I have to iterate through all the columns.
Could you give me some advice on how to improve my code?
Non-vectorised code:
% Dummy matrix for SE
M = 10^3;
N = 10^2;
ExampleMatrix = (rand(M,N)>0.9);
ExampleMatrix1=ExampleMatrix;
% Iterate columns
for iColumn = 1:size(ExampleMatrix,2)
idx = find(ExampleMatrix(:,iColumn)); % all nonzeroes elements
if numel(idx) > 1
% remove the first one (only when the column has more than one)
ExampleMatrix(idx(1),iColumn) = 0;
end
end
I think this does what you want:
ind_col = find(sum(ExampleMatrix, 1)>1); % index of relevant columns
[~, ind_row] = max(ExampleMatrix(:,ind_col), [], 1); % index of first max of each column
ExampleMatrix(ind_row + (ind_col-1)*size(ExampleMatrix,1)) = 0; % linear indexing
The code uses:
the fact that the second output of max gives the index of the first maximum value. In this case max is applied along the first dimension, to find the first maximum of each column;
linear indexing.
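To make the linear-indexing step concrete, here is a small sketch on a made-up 4x4 matrix E (not from the question). It runs the same three lines and also builds the same indices with sub2ind, which may be easier to read.

```matlab
% Made-up binary matrix: columns 1 and 2 have several ones,
% column 3 has none, column 4 has a single one that must be kept
E = logical([0 1 0 0;
             1 0 0 0;
             1 1 0 0;
             0 1 0 1]);
R = E;

ind_col = find(sum(R, 1) > 1);                 % columns with more than one 1
[~, ind_row] = max(R(:, ind_col), [], 1);      % row of the first 1 in each
lin = ind_row + (ind_col - 1) * size(R, 1);    % column-major linear indices
lin_check = sub2ind(size(R), ind_row, ind_col); % same indices via sub2ind
R(lin) = 0;

expected = logical([0 0 0 0;
                    0 0 0 0;
                    1 1 0 0;
                    0 1 0 1]);
```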

Concatenating vectors while adding their elements - creating a valid time vector

I have several vectors I would like to concatenate, where each element is an increasing timestamp. How do I concatenate the vectors while keeping a continuous time scale?
Say I have two vectors, tone1_time and tone2_time, both 1x4801 double. Each element of a vector contains a timestamp, so the elements must be offset when the vectors are concatenated in order to have the correct time. So far I have:
n = 10;
for i = 1:n
time(n,end) = tone1_time + tone2_time;
end
Which generates an error in matlab!
EDIT: More code
I generate two sounds vectors and concatenate them by:
% repeat n times
n = 10;
signal = [ tone1_signal tone2_signal ];
signal = repmat(signal,1,n);
This will return a new vector signal with a length of, e.g., 1x48020 double. The time vector needs to have the same size as this vector, while still being continuous in time.
First, you need to add the last element of tone1_time to all elements of tone2_time to ensure continuity of the time intervals:
tone2_time = tone2_time + tone1_time(end);
Then you can concatenate them:
tone_time = [tone1_time, tone2_time];
Alternatively, you can work with the differences (using the original, unshifted tone2_time):
tone_time = cumsum([diff([0 tone1_time]), diff([0 tone2_time])]);
EDIT:
For replicating the time vector:
tone_time_diff = [diff([0 tone1_time]), diff([0 tone2_time])];
tone_time = cumsum( repmat(tone_time_diff, 1, n) );
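A small worked sketch with made-up integer timestamps shows that the offset and the difference approaches agree. It assumes each time vector starts one step after zero; if a vector starts at 0 instead, the first difference is 0 and the join would repeat a timestamp.

```matlab
% Made-up time vectors with a step of 1, each starting at 1 (not 0)
tone1_time = [1 2 3];
tone2_time = [1 2];
n = 2;   % number of repetitions

% Approach 1: shift the second vector by the end of the first
tone_time_offset = [tone1_time, tone2_time + tone1_time(end)];

% Approach 2: concatenate the step sizes and accumulate them
steps = [diff([0 tone1_time]), diff([0 tone2_time])];
tone_time_diff = cumsum(steps);

% Repeated n times, matching repmat(signal, 1, n)
tone_time_rep = cumsum(repmat(steps, 1, n));
```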

Assign labels based on given examples for a large dataset effectively

I have matrix X (100000 X 10) and vector Y (100000 X 1). X rows are categorical and assume values 1 to 5, and labels are categorical too (11 to 20);
The rows of X are repetitive and only ~25% of them are unique. For each unique row, I want the label to be the statistical mode of all the labels in Y for that row.
And then there comes another dataset P (90000 X 10), I want to predict labels Q based on the previous exercise.
What I tried is finding unique rows of X using unique in MATLAB, and then assign statistical mode of each of these labels for the unique rows. For P, I can use ismember and carry out the same.
The issue is the size of the dataset: it takes 1.5-2 hours to complete the process. Is a vectorized version possible in MATLAB?
Here is my code:
[X_unique,~,ic] = unique(X,'rows','stable');
labels=zeros(length(X_unique),1);
for i=1:length(X_unique)
labels(i)=mode(Y(ic==i));
end
Q=zeros(length(P),1);
for j=1:length(X_unique)
Q(all(repmat(X_unique(j,:),length(P),1)==P,2))=labels(j);
end
You will be able to accelerate your first loop a great deal if you replace it entirely with:
labels = accumarray(ic, Y, [], @(y) mode(y));
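As a quick illustration of what accumarray does here, the following sketch uses a made-up 5-row X and Y. The third output of unique assigns each row a group number, and accumarray applies mode to the Y values in each group.

```matlab
% Made-up data: rows 1, 3, 4 are identical, rows 2 and 5 are identical
X = [1 2; 3 4; 1 2; 1 2; 3 4];
Y = [11; 15; 12; 12; 15];

% ic(k) is the index into X_unique of row k of X
[X_unique, ~, ic] = unique(X, 'rows', 'stable');

% Apply mode to the labels of each group of identical rows
labels = accumarray(ic, Y, [], @(y) mode(y));
```

Here group 1 collects the labels 11, 12, 12 and group 2 collects 15, 15.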
The second loop can be accelerated by using all(bsxfun(@eq, X_unique(i,:), P), 2) inside Q(...). This is a good vectorized approach assuming your arrays are not extremely large w.r.t. the available memory on your machine. In addition, to save more time, you could use the unique trick you did with X on P and run all the comparisons on a much smaller array:
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
EDIT:
Then compute Q_unique in the following way:
Q_unique = zeros(length(P_unique),1);
for i = 1:length(X_unique)
Q_unique(all(bsxfun(@eq, X_unique(i,:), P_unique), 2)) = labels(i);
end
and convert back to Q_full to match the original P input:
Q_full = Q_unique(IC_P);
END EDIT
Finally, if memory is an issue, in addition to everything above, you might want to use a semi-vectorized approach inside your second loop:
for i = 1:length(X_unique)
idx = true(length(P), 1);
for j = 1:size(X_unique,2)
idx = idx & (X_unique(i,j) == P(:,j));
end
Q(idx) = labels(i);
% Q(all(bsxfun(@eq, X_unique(i,:), P), 2)) = labels(i);
end
This would take about 3x longer compared with bsxfun, but if memory is limited then you gotta pay with speed.
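A different technique from the loops above, which the question already hints at: ismember with the 'rows' option returns, for each row of P, the index of the matching row in X_unique, so the lookup needs no loop at all. The sketch below uses made-up data, and filling unmatched rows with NaN is an arbitrary choice for illustration.

```matlab
% Made-up unique rows with their labels, and new rows to classify
X_unique = [1 2; 3 4];
labels   = [12; 15];
P        = [3 4; 5 5; 1 2];

% tf(k) is true if row k of P occurs in X_unique; loc(k) is its row index
[tf, loc] = ismember(P, X_unique, 'rows');

% Rows of P never seen in X keep NaN (arbitrary placeholder)
Q = nan(size(P, 1), 1);
Q(tf) = labels(loc(tf));
```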
ANOTHER EDIT
Depending on your version of Matlab, you could also use containers.Map to your advantage by mapping textual representations of the numeric sequences to the calculated labels. See example below.
% find unique members of X to work with a smaller array
[X_unique, ~, IC_X] = unique(X, 'rows', 'stable');
% compute labels
labels = accumarray(IC_X, Y, [], @(y) mode(y));
% convert X to cellstr -- textual representation of the number sequence
X_cellstr = cellstr(char(X_unique+48)); % 48 is ASCII for 0
% map each X to its label
X_map = containers.Map(X_cellstr, labels);
% find unique members of P to work with a smaller array
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
% convert P to cellstr -- textual representation of the number sequence
P_cellstr = cellstr(char(P_unique+48)); % 48 is ASCII for 0
% --- EDIT --- avoiding error on missing keys in X_map --------------------
% find which P's exist in map
isInMapP = X_map.isKey(P_cellstr);
% pre-allocate Q_unique to the size of P_unique (can be any value you want)
Q_unique = nan(size(P_cellstr)); % NaN is safe to use since not a label
% find the labels for each P_unique that exists in X_map
Q_unique(isInMapP) = cell2mat(X_map.values(P_cellstr(isInMapP)));
% --- END EDIT ------------------------------------------------------------
% convert back to full Q array to match original P
Q_full = Q_unique(IC_P);
This takes about 15 seconds to run on my laptop, most of which is consumed by the computation of mode.
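To show what the key encoding looks like, here is a tiny sketch with made-up rows. Adding 48 turns each digit into its ASCII character, so this encoding only works while every entry is a single digit (0 to 9), which holds for the 1-to-5 values in the question.

```matlab
% Made-up unique rows, their labels, and rows to look up
X_unique = [1 2 3; 4 5 1];
labels   = [12; 15];
P_unique = [4 5 1; 5 5 5; 1 2 3];

% Each row becomes a short string key, e.g. [1 2 3] -> '123'
X_keys = cellstr(char(X_unique + 48));
P_keys = cellstr(char(P_unique + 48));

% Map keys to labels, then look up only the keys that exist
X_map = containers.Map(X_keys, labels);
found = isKey(X_map, P_keys);
Q_unique = nan(size(P_keys));
Q_unique(found) = cell2mat(values(X_map, P_keys(found)));
```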

How to see resampled data after BOOTSTRAP

I was trying to resample (with replacement) my database using 'bootstrap' in Matlab as follows:
D = load('Data.txt');
lead = D(:,1);
depth = D(:,2);
X = D(:,3);
Y = D(:,4);
%Bootstrapping to resample 100 times
[resampling100,bootsam] = bootstrp(100,'corr',lead,depth);
%plotting the bootstrapping result as histogram
hist(resampling100,10);
... ... ...
... ... ...
Though the script written above is correct, I wonder how I would be able to see/load the resampled 100 datasets created through bootstrap? 'bootsam(:)' display the indices of the data/values selected for the bootstrap samples, but not the new sample values!! Isn't it funny that I'm creating fake data from my original data and I can't even see what is created behind the scene?!?
My second question: is it possible to resample the whole matrix (in this case, D) altogether without using any function? However, I know how to create random values from a vector data using 'unidrnd'.
Thanks in advance for your help.
The answer to question 1 is that bootsam provides the indices of the resampled data. Specifically, the nth column of bootsam provides the indices of the nth resampled dataset. In your case, to obtain the nth resampled dataset you would use:
lead_resample_n = lead(bootsam(:, n));
depth_resample_n = depth(bootsam(:, n));
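If you want all resampled datasets at once rather than one column at a time, indexing with the whole index matrix should work too, since indexing a vector with a matrix returns an array shaped like the index. In the sketch below the data are made up, and bootsam is generated with randi as a stand-in for the second output of bootstrp, so that it runs without the Statistics Toolbox.

```matlab
% Made-up data; depth is chosen so that lead == 10 * depth row by row
lead  = [10; 20; 30; 40; 50];
depth = [1; 2; 3; 4; 5];
nboot = 100;

% Stand-in for the second output of bootstrp: one column of indices per sample
bootsam = randi(numel(lead), numel(lead), nboot);

% Column n of each result is the nth resampled dataset
lead_resamples  = lead(bootsam);
depth_resamples = depth(bootsam);
```

Because both variables are indexed with the same matrix, each (lead, depth) pair stays together in every resample.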
Regarding the second question, I'm guessing what you mean is, how would you just get a re-sampled dataset without worrying about applying a function to the resampled data. Personally, I would use randi, but in this situation, it is irrelevant whether you use randi or unidrnd. An example follows that assumes 4 columns of some data matrix D (as in your question):
%# Build an example dataset
T = 10;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, 1);
%# Obtain the resampled data
DResampled = D(Ind, :);
To create multiple re-sampled datasets, you can simply loop over the creation of random indices. Or you could do it in one step by creating a matrix of random indices and using that to index D. With careful use of reshape and permute you can turn this into a T-by-4-by-M array, where indexing m = 1, ..., M along the third dimension yields the mth resampled dataset. Example code follows:
%# Build an example dataset
T = 10;
M = 3;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, M);
%# Obtain the resampled data
DResampled = permute(reshape(D(Ind, :)', 4, T, []), [2 1 3]);
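The reshape/permute line is easy to get wrong, so here is a short check, as a sketch, that each slice along the third dimension really equals the dataset obtained by indexing D with the corresponding column of Ind. The variable all_match is introduced here only for the check.

```matlab
T = 10;
M = 3;
D = randn(T, 4);
Ind = randi(T, T, M);

% One-step version: T-by-4-by-M array of resampled datasets
DResampled = permute(reshape(D(Ind, :)', 4, T, []), [2 1 3]);

% Compare every slice with the straightforward per-sample indexing
all_match = true;
for m = 1:M
    all_match = all_match && isequal(DResampled(:, :, m), D(Ind(:, m), :));
end
```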

N-Dimensional Histogram Counts

I am currently trying to code up a function to assign probabilities to a collection of vectors using a histogram count. This is essentially a counting exercise, but requires some finesse to be able to achieve efficiently. I will illustrate with an example:
Say that I have a matrix X = [x1, x2, ..., xM] with N rows and M columns. Here, X represents a collection of M N-dimensional vectors. In other words, each of the columns of X is an N-dimensional vector.
As an example, we can generate such an X for M = 10000 vectors and N = 5 dimensions using:
X = randint(5,10000)
This will produce a 5 x 10000 matrix of 0s and 1s, where each column represents a 5-dimensional vector of 1s and 0s.
I would like to assign a probability to each of these vectors through a basic histogram count. The steps are simple: first, find the unique columns of X; second, count the number of times each unique column occurs. The probability of a particular occurrence is then the number of times this column was in X / total number of columns in X.
Returning to the example above, I can do the first step using the unique function in MATLAB as follows:
UniqueXs = unique(X','rows')'
The code above will return UniqueXs, a matrix with N rows that only contains the unique columns of X. Note that the transposes are due to weird MATLAB input requirements.
However, I am unable to find a good way to count the number of times each of the columns in UniqueXs is in X. So I'm wondering if anyone has any suggestions?
Broadly speaking, I can think of two ways of achieving the counting step. The first way would be to use the find function, though I think this may be slow since find is an elementwise operation. The second way would be to call unique recursively as it can also provide the index of one of the unique columns in X. This should allow us to remove that column from X and redo unique on the resulting X and keep counting.
Ideally, I think that unique might already be doing some counting so the most efficient way would probably be to work without the built-in functions.
Here are two solutions, one assumes all values are either 0's or 1's (just like the example in your description), the other does not. Both codes should be very fast (more so the one with binary values), even on large data.
1) only zeros and ones
%# random vectors of 0's and 1's
x = randi([0 1], [5 10000]); %# RANDINT is deprecated, use RANDI instead
%# convert each column to a binary string
str = num2str(x', repmat('%d',[1 size(x,1)])); %'
%# convert binary representation to decimal number
num = (str-'0') * (2.^(size(str,2)-1:-1:0))'; %'# num = bin2dec(str);
%# count how many times each number occurs
count = accumarray(num+1,1); %# num+1 since it starts at zero
%# assign probability based on count
prob = count(num+1)./sum(count);
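For 0/1 data, the string round trip can probably be skipped: multiplying each column by a row of powers of two gives the same decimal code directly. The following is a sketch of that variant on a made-up 3x3 matrix, not the code from the answer above.

```matlab
% Made-up binary vectors, one per column: [1;0;1], [0;0;1], [1;0;1]
x = [1 0 1;
     0 0 0;
     1 1 1];

% Weights 4, 2, 1 turn each column into its decimal value
weights = 2.^(size(x, 1)-1:-1:0);
num = (weights * x)';            % column vector of codes

% Count occurrences of each code (num+1 because codes start at zero)
count = accumarray(num + 1, 1);

% Probability of each original column
prob = count(num + 1) ./ sum(count);
```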
2) any non-negative integer
%# random vectors with values 0:MAX_NUM
x = randi([0 999], [5 10000]);
%# format vectors as strings (zero-filled to a constant length)
nDigits = floor(log10( max(1, max(x(:))) )) + 1; %# digit count (ceil(log10(.)) undercounts exact powers of 10)
frmt = repmat(['%0' num2str(nDigits) 'd'], [1 size(x,1)]);
str = cellstr(num2str(x',frmt)); %'
%# find unique strings, and convert them to group indices
[G,GN] = grp2idx(str);
%# count frequency of occurrence
count = accumarray(G,1);
%# assign probability based on count
prob = count(G)./sum(count);
Now we can see for example how many times each "unique vector" occurred:
>> table = sortrows([GN num2cell(count)])
table =
'000064850843749' [1] # original vector is: [0 64 850 843 749]
'000130170550598' [1] # and so on..
'000181606710020' [1]
'000220492735249' [1]
'000275871573376' [1]
'000525617682120' [1]
'000572482660558' [1]
'000601910301952' [1]
...
Note that in my example with random data, the vector space becomes very sparse (as you increase the maximum possible value), thus I wouldn't be surprised if all counts were equal to 1...
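A third option that avoids both the string conversion and grp2idx (which needs the Statistics Toolbox) is to use the third output of unique together with accumarray. This is a sketch of that alternative on a made-up 2x5 matrix; the same lines should work for arbitrary numeric values, not only integers.

```matlab
% Made-up data: 5 column vectors of dimension 2
X = [0 1 0 1 0;
     1 1 1 0 1];

% Unique columns (as rows of X'), plus the group index of every column
[U, ~, ic] = unique(X', 'rows');
UniqueXs = U';

% Number of occurrences of each unique column
count = accumarray(ic(:), 1);

% Probability assigned to each original column of X
prob = count(ic(:)) / size(X, 2);
```

Here the column [0;1] occurs three times out of five, so each of its occurrences gets probability 0.6.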