Huffman encoding using cell array

I have some problems in the final part of Huffman encoding.
Currently I have my coding table in a cell array
code =
{
[1,1] = 000
[1,2] = 001
[1,3] = 010
[1,4] = 011
[1,5] = 100
...
}
where the second index corresponds to the ASCII character stored at the same position in my other cell array
huffman_tree =
{
[1,1] = A
[1,2] = B
[1,3] = C
[1,4] = D
[1,5] = E
...
}
I'm using the following code for encoding input to output:
output= [];
for i=1:length(input)
x = findInArray(huffman_tree, input(i));
output= [output code(x)];
end
function [index] = findInArray(array, searched)
index = -1;
for i=1:length(array)
if array{i} == searched
index = i;
end
end
end
At this point my code is O(n^2) or even worse. I'm having problems with large inputs where
length(input) = 1000000
There must be some faster way to transform input with my coding table to output.

Because you're using cell arrays, this is going to be inherently slow, and you have no choice but to look at each cell. However, I can provide some suggestions to help speed things up. What you can do is use strcmp to compare strings. I'm assuming that each character in your cell array is represented as a one-character string. strcmp can take an individual string and compare it against a cell array of strings. The output is a logical array of the same size as the cell array, with true where the input string matches that position and false otherwise.
Because your Huffman dictionary will contain a unique set of characters, you will only get one possible match per character. Therefore, we can use this logical array output to index the codebook directly to retrieve the corresponding code you want. Logical indexing works by supplying a logical vector that is the same length as the vector of interest and it retrieves those values whose corresponding positions are true. Therefore, if we only had one true value in the logical vector with the rest of the elements being false, this means that we would get just the corresponding element we desire and nothing else.
Therefore, we can change your code to do this. Note that I've changed the loop counter i to idx because it has actually been shown that using i as a loop counter slows down your code by a slight amount. See this post by Shai Bagon for more details: Using i and j as variables in Matlab. Also, I've changed the length call to numel... mainly because I don't like using length.... just a personal choice though.
output = [];
for idx = 1:numel(input)
output = [output code(strcmp(input(idx), huffman_tree))];
end
Give the above a whirl and see if it performs any faster. For one thing, this avoids the explicit inner for loop used to search for a match, because strcmp is very efficiently implemented. The search is still linear in the size of the dictionary for every input character, and output still grows inside the loop, so expect a constant-factor speed-up rather than a change in complexity.
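If the loop is still too slow for inputs of around a million characters, one alternative is to skip the per-character search entirely and use a lookup table indexed by character code. This is a different technique from the strcmp approach above, shown only as a minimal sketch. It assumes that input is a char vector, that every dictionary entry is a single character with a code below 256, and that every input character appears in the dictionary. The example data is hypothetical.

```matlab
% Hypothetical example data (not from the original post)
huffman_tree = {'A', 'B', 'C', 'D', 'E'};
code = {'000', '001', '010', '011', '100'};
input = 'BADCAB';   % same variable name as in the question

% Lookup table: character code -> position in the dictionary
% (assumes single characters with codes below 256)
lut = zeros(1, 256);
lut(double([huffman_tree{:}])) = 1:numel(huffman_tree);

% Translate all input characters with one indexing operation
idx = lut(double(input));

% Pick the code for every character, then join them into one string
encoded = code(idx);
bitstring = [encoded{:}];
```

A character that is missing from the dictionary leaves a 0 in idx, which makes code(idx) fail, so it may be worth checking all(idx > 0) first.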

Related

Printing progress in command window

I'd like to use fprintf to show code execution progress in the command window.
I've got a N x 1 array of structures, let's call it myStructure. Each element has the fields name and data. I'd like to print the name side by side with the number of data points, like such:
name1 number1
name2 number2
name3 number3
name4 number4
...
I can use repmat N times along with fprintf. The problem with that is that all the numbers have to come in between the names in a cell array C.
fprintf(repmat('%s\t%d',N,1),C{:})
I can use cellfun to get the names and number of datapoints.
names = {myStructure.name};
numpoints = cellfun(@numel,{myStructure.data});
However I'm not sure how to get this into a cell array with alternating elements for C to make the fprintf work.
Is there a way to do this? Is there a better way to get fprintf to behave as I desire?

You're very close. What I would do is change your cellfun call so that the output is a cell array instead of a numeric array. Use the 'UniformOutput' flag and set this to 0 or false.
When you're done, make a new cell array where both the name cell array and the size cell array are stacked on top of each other. You can then call fprintf once.
% Save the names in a cell array
A = {myStructure.name};
% Save the sizes in another cell array
B = cellfun(@numel, {myStructure.data}, 'UniformOutput', 0);
% Create a master cell array where the first row are the names
% and the second row are the sizes
out = [A; B];
% Print out the elements side-by-side
fprintf('%s\t%d\n', out{:});
The trick with the third line of code is that when you unroll the cell array using {:}, this creates a comma-separated list unrolled in column-major format, and so doing out{:} actually gives you:
A{1}, B{1}, A{2}, B{2}, ..., A{n}, B{n}
... which provides the interleaving you need. Therefore, providing this order into fprintf coincides with the format specifiers that are specified and thus gives you what you need. That's why it's important to stack the cell arrays so that each column gives the information you need.
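To see the column-major unrolling in isolation, here is a tiny sketch that uses sprintf instead of fprintf so the result can be inspected as a string. The names and counts are made up for illustration.

```matlab
A = {'x', 'y'};     % hypothetical names
B = {1, 2};         % hypothetical counts
out = [A; B];       % 2-by-2 cell: names in row 1, counts in row 2

% out{:} expands column by column: 'x', 1, 'y', 2
s = sprintf('%s=%d;', out{:});   % s is 'x=1;y=2;'
```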
Minor Note
Of course one should never forget that one of the easiest ways to tackle your problem is to just use a simple for loop. Even though for loops are considered bad practice, their performance has come a long way throughout MATLAB's evolution.
Simply put, just do this:
for ii = 1 : numel(myStructure)
fprintf('%s\t%d\n', myStructure(ii).name, numel(myStructure(ii).data));
end
The above code is arguably more readable in comparison to what we did above with cell arrays. You're accessing the structure directly rather than having to create intermediate variables for the purpose of calling fprintf once.
Example Run
Here's an example of this running. Using the data shown below:
clear myStructure;
myStructure(1).name = 'hello';
myStructure(1).data = rand(5,1);
myStructure(2).name = 'hi';
myStructure(2).data = zeros(3,3);
myStructure(3).name = 'huh';
myStructure(3).data = ones(6,4);
I get the following output after running the printing code:
hello 5
hi 9
huh 24
We can see that the sizes are correct as the first element in the structure is simply a random 5 element vector, the second element is a 3 x 3 = 9 zeroes matrix while the last element is a 6 x 4 = 24 ones matrix.

How to perform XOR in a recursive scenario

I have a 1x5 char matrix. I need to perform a bitwise XOR operation on all the elements in the matrix. If T is the char matrix, I need a matrix T' such that
T'= T XOR (T-1)' for all T
T for T=1
Let the char matrix be T
T=['0000000000110111' '0000000001000001' '0000000001001010' '0000000010111000' '0000000000101111']
T'=['0000000000110111' '0000000001110110' '0000000000111100' '0000000010000100' '0000000010101011']
i.e., leaving the first element as it is, I need to XOR each of the other elements with the previous element of the newly formed matrix. I tried the following code but I'm unable to get the correct result.
Yxor1d = [T(1) cellfun(@(a,b) char((a ~= b) + '0'), T(2:end), T'(1:end-1), 'UniformOutput', false)]
I need to perform the XOR operation such that , for obtaining the elements of T'
T' (2)= T(2) XOR T' (1)
T' (3)= T(3) XOR T' (2)
It'll be really helpful to know where I went wrong. Thanks.

cellfun expects cell arrays as input, but you are passing a character array. By placing those 5 strings inside square brackets, you are concatenating them into one long character array rather than keeping 5 separate strings.
You probably don't want that. To fix this, all you have to do is make T a cell array by using curly braces ({}) instead of square brackets ([]) to declare your strings:
T={'0000000000110111' '0000000001000001' '0000000001001010' '0000000010111000' '0000000000101111'};
Because you have edited your post after I provided my answer, my previous answer using cellfun is now incorrect. Because you are using a recurrence relation where you are referring to the previous output rather than input, you can no longer use cellfun. You'll need to use a for loop. There are probably more elegant ways to do it, but this is the easiest if you want to get something working.
As such, initialize an output cell array that is the same size as the input cell array like above, then you'll need to initialize the first cell to be the first cell of the input, then iterate through each pair of input and output elements yourself.
So do something like this:
Yxor1d = cell(1,numel(T));
Yxor1d{1} = T{1};
for idx = 2 : numel(T)
Yxor1d{idx} = char(double(T{idx} ~= Yxor1d{idx-1}) + '0');
end
For each index i of T', we XOR the current input T{i} with the previous output T'{i-1}.
Using the above with your input cell array T, we get:
Yxor1d =
Columns 1 through 3
'0000000000110111' '0000000001110110' '0000000000111100'
Columns 4 through 5
'0000000010000100' '0000000010101011'
This matches with your specifications in your modified post.

Edit: There is a solution without a loop:
T=['0000000000110111';'0000000001000001';'0000000001001010';'0000000010111000' ;'0000000000101111'];
Yxor = dec2bin(bi2de(mod(cumsum(de2bi(bin2dec(T))),2)),16)
Yxor =
0000000000110111
0000000001110110
0000000000111100
0000000010000100
0000000010101011
This uses the fact that you effectively want a cumulative xor operation on the elements of your array.
For N booleans, a chained XOR is true exactly when an odd number of them are true. So if you take the cumulative sum of each bit column, the cumulative XOR is 1 wherever that sum is odd and 0 wherever it is even.
The one-liner above can be decomposed like this:
Y = bin2dec(T) ; %// convert char array T into decimal numbers
Y = de2bi( Y ) ; %// convert decimal array Y into array of "bit"
Y = cumsum(Y) ; %// do the cumulative sum on each bit column
Y = mod(Y,2) ; %// convert all "even" numbers to '0', and 'odd' numbers to '1'
Y = bi2de(Y) ; %// re-assemble the bits into decimal numbers
Yxor = dec2bin(Y,16) ; %// get their string representation
Note that if you are happy to handle arrays of bits (boolean) instead of character arrays, you can shave off a few lines from above ;-)
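As a rough sketch of that shortcut, the same parity idea can be applied directly to the character matrix, without going through decimal numbers. This is a variation on the one-liner above rather than the original code, and it assumes T is a char matrix with one bit string per row containing only '0' and '1'.

```matlab
T = ['0000000000110111'; '0000000001000001'; '0000000001001010'; ...
     '0000000010111000'; '0000000000101111'];

bits = T - '0';                      % numeric 0/1 matrix, one row per string
parity = mod(cumsum(bits, 1), 2);    % running XOR down each bit column
Yxor = char(parity + '0');           % back to a char matrix of '0' and '1'
```

Subtracting '0' turns the characters into numbers, and adding '0' before char turns them back.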
Initial answer (simpler to grasp, but with a loop):
You can use the bitxor function, but you have to convert your char array into numeric values first:
T=['0000000000110111';'0000000001000001';'0000000001001010' ;'0000000010111000' ;'0000000000101111'];
Tbin = bin2dec(T) ; %// convert to numeric values
Ybin = Tbin ; %// pre-assign result, then loop ...
for idx = 2 : numel(Tbin)
Ybin(idx) = bitxor( Ybin(idx) , Ybin(idx-1) ) ;
end
Ychar = dec2bin(Ybin,16) %// convert back to 16bit char array representation if necessary
Ychar =
0000000000110111
0000000001110110
0000000000111100
0000000010000100
0000000010101011
edited answer after you redefined your problem

How to convert char to number in Matlab

I am having trouble converting a character variable to a number in Matlab.
Each cell in the char variable contains one of two possible words. I need to convert word_one (for example) to represent '1', and word_two to represent '2'.
Is there a command that will let me do this?
So far I've tried:
%First I converted 'Word' from cell to char
Word = char(Word);
Word(Word == 'Word_one') = '1';
Word(Word == 'Word_two') = '2';
However, I get the:
Error using ==
Matrix dimensions must agree.
When I try to include the first letter only (ie. 'W'), it only changes the first letter in the full word (ie. 1ord_one).
Is there an easy way to do this?
Thanks for your help - any advice is much appreciated!

Use ismember:
possibleWords = {'Word_one', 'Word_two'}; %// template: words corresponding to 1, 2, ...
words = {'Word_two', 'Word_one', 'Word_two'}; %// data: words you need to convert
[~, result] = ismember(words, possibleWords);
In this example,
result =
2 1 2
If you need more flexibility, you can specify the value corresponding to each word:
possibleWords = {'Word_one', 'Word_two'}; %// template: words corresponding to 1, 2, ...
correspondingValues = [1.1, 2.2]; %// template: value corresponding to each word
words = {'Word_two', 'Word_one', 'Word_two'}; %// data: words you need to convert
[~, ind] = ismember(words, possibleWords);
result = correspondingValues(ind);
which gives
result =
2.2000 1.1000 2.2000

Looks like there are a couple of potential issues here.
Use strcmp() (string compare) in place of your current equivalence statement. Comparing strings using == compares element by element and returns a logical vector (where here you want a single logical value). It also requires both operands to have the same size unless one of them is a single character, which is where the dimension error comes from. String comparison, strcmp(), will compare the entire strings instead and return a single value.
It's also probably not necessary for you to convert your cell array. You can maintain the cell array structure and address each cell individually.
Try something along the lines of the following snippet.
for i = 1:length(Word)
if strcmp(Word{i},'Word_one')
Word{i} = '1';
elseif strcmp(Word{i},'Word_two')
Word{i} = '2';
end
end
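To make the difference concrete, here is a small sketch of what the two comparisons return. The variable names are only for illustration.

```matlab
a = 'Word_one';

r1 = (a == 'Word_one');       % 1-by-8 logical vector, one entry per character
r2 = strcmp(a, 'Word_one');   % a single logical value: true
r3 = strcmp(a, 'W');          % a single logical value: false
r4 = (a == 'W');              % 1-by-8 vector, true only where a holds a 'W'
```

The last line is why assigning through Word == 'W' only changed the first letter of each word.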

There are a number of ways to solve this problem. Here's my approach.
% define your words
words = {'word_one','word_two','word_two','word_one','word_one'};
% define a function to get the indexes of the words of interest
getindex = @(c, y) cellfun(@(x) strcmp(x,y), c);
% replace 'word_one' with '1'
words(getindex(words, 'word_one'))={'1'};
% replace 'word_two' with '2'
words(getindex(words, 'word_two'))={'2'};
words =
'1' '2' '2' '1' '1'

You can use short n simple unique -
input_cellarr = {'Word_two','Word_one','Word_two','Word_two','Word_one','Word_one'}
[~,~,out] = unique(input_cellarr)
Sample run -
input_cellarr =
'Word_two' 'Word_one' 'Word_two' 'Word_two' 'Word_one' 'Word_one'
out =
2
1
2
2
1
1
Explanation: unique works here because it returns its results sorted in ascending order. For numeric arrays that means numeric order, and for cell arrays of strings it means alphabetical order. Thus, unique(input_cellarr) would always give {'Word_one' , 'Word_two'}, because 'Word_one' sorts before 'Word_two'. Therefore the out indices would always have the first unique ID as 1 for 'Word_one' and the second ID as 2 for 'Word_two'.
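A short sketch that shows the sorted unique values next to the indices, using a shortened, made-up input:

```matlab
words = {'Word_two', 'Word_one', 'Word_two'};   % hypothetical data
[u, ~, out] = unique(words);

% u holds the distinct words in alphabetical order: 'Word_one', 'Word_two'
% out holds, for each input word, its position in u: 2, 1, 2
ids = out(:).';   % force a row vector, since the orientation can differ by version
```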

Substrings from a Cell Array in Matlab

I'm new to MATLAB and I'm struggling to comprehend the subtleties between array-wise and element wise operations. I'm working with a large dataset and I've found the simplest methods aren't always the fastest. I have a very large Cell Array of strings, like in this simplified example:
% A vertical array of same-length strings
CellArrayOfStrings = {'aaa123'; 'bbb123'; 'ccc123'; 'ddd123'};
I'm trying to extract an array of substrings, for example:
'a1'
'b1'
'c1'
'd1'
I'm happy enough with an element-wise reference like this:
% Simple element-wise substring operation
MySubString = CellArrayOfStrings{2}(3:4); % Expected result is 'b1'
But I can't work out the notation to reference them all in one go, like this:
% Desired result is 'a1','b1','c1','d1'
MyArrayOfSubStrings = CellArrayOfStrings{:}(3:4); % Incorrect notation!
I know that Matlab is capable of performing very fast array-wise operations, such as strcat, so I was hoping for a technique that works at a similar speed:
% An array-wise operation which works quickly
tic
speedTest = strcat(CellArrayOfStrings,'hello');
toc % About 2 seconds on my machine with >500K array elements
All the for loops and functions which use behind-the-scenes iteration I have tried run too slowly with my dataset. Is there some array-wise notation that would do this? Would somebody be able to correct my understanding of element-wise and array-wise operations?! Many thanks!

I can't work out the notation to reference them all in one go, like this:
MyArrayOfSubStrings = CellArrayOfStrings{:}(3:4); % Incorrect notation!
This is because curly braces ({}) return a comma-separated list, which is equivalent to writing the contents of these cells in the following way:
c{1}, c{2}, and so on....
When the subscript index refers to only one element, MATLAB's syntax allows you to use parentheses (()) after the curly braces to extract a sub-array (a substring in your case). However, this syntax is prohibited when the comma-separated list contains multiple items.
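As a small illustration of the difference, using a shortened, hypothetical cell array c:

```matlab
c = {'aaa123'; 'bbb123'};

s = c{2}(3:4);     % one cell selected, so chained indexing works: s is 'b1'

[x, y] = c{:};     % two cells selected: the list has two items,
                   % and each item goes to its own output variable
```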
So what are the alternatives?
Use a for loop:
MyArrayOfSubStrings = char(zeros(numel(CellArrayOfStrings), 2));
for k = 1:size(MyArrayOfSubStrings, 1)
MyArrayOfSubStrings(k, :) = CellArrayOfStrings{k}(3:4);
end
Use cellfun (a slight variant of Dang Khoa's answer):
MyArrayOfSubStrings = cellfun(@(x){x(3:4)}, CellArrayOfStrings);
MyArrayOfSubStrings = vertcat(MyArrayOfSubStrings{:});
If your original cell array contains strings of a fixed length, you can follow Dan's suggestion and convert the cell array into an array of strings (a matrix of characters), then extract the desired columns:
MyArrayOfSubStrings =vertcat(CellArrayOfStrings{:});
MyArrayOfSubStrings = MyArrayOfSubStrings(:, 3:4);
Employ more complicated methods, such as regular expressions:
MyArrayOfSubStrings = regexprep(CellArrayOfStrings, '^..(..).*', '$1');
MyArrayOfSubStrings = vertcat(MyArrayOfSubStrings{:});
There are plenty of solutions to pick from, so just pick the one that fits you best :) I think that with MATLAB's JIT acceleration, a simple loop would be sufficient in most cases.
Also note that in all my suggestions the resulting cell array of substrings is converted into an array of strings (a matrix). This is just for the sake of the example; obviously you can keep the substrings stored in a cell array, should you decide so.

cellfun operates on every element of a cell array, so you could do something like this:
>> CellArrayOfStrings = {'aaa123'; 'bbb123'; 'ccc123'; 'ddd123'};
>> MyArrayofSubstrings = cellfun(@(str) str(3:4), CellArrayOfStrings, 'UniformOutput', false)
MyArrayofSubstrings =
'a1'
'b1'
'c1'
'd1'
If you want a matrix of characters instead of a cell array whose elements are the strings, use char on MyArrayofSubstrings. The rows line up directly here because every substring has the same length; otherwise char pads the shorter strings with trailing spaces.

You can do this:
C = {'aaa123'; 'bbb123'; 'ccc123'; 'ddd123'}
t = reshape([C{:}], 6, [])'
t(:, 3:4)
But only if your strings are all of equal length I'm afraid.

You can use char to convert them to a character array, do the indexing and convert it back to cell array
A = char(CellArrayOfStrings);
B = cellstr(A(:,3:4));
Note that if strings are of different lengths, char pads them with spaces at the end to create the array. Therefore if you index for a column that is beyond the length of one of the short strings you may receive some space characters.
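A quick sketch of the padding behaviour, using two made-up strings of different lengths:

```matlab
C = {'abcdef'; 'abc'};     % hypothetical strings of different lengths

A = char(C);               % 2-by-6 char matrix; the second row is 'abc   '
sub = A(:, 3:4);           % ['cd'; 'c '], the second row ends in a pad space
B = cellstr(sub);          % cellstr drops trailing spaces: {'cd'; 'c'}
```

The pad space is visible in the char matrix sub, while cellstr removes trailing spaces again when converting back to a cell array.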

Regarding storage issue for mixed-type value matrix

There is a loop in my program, and during each iteration an ID is generated. I want to store these IDs in a two-dimensional array, say A. The first column of A stores the iteration number, i.e., A(1,1) = 1 and A(2,1) = 2. The second column of A stores the ID generated during each iteration, i.e., A(1,2) stores the ID generated during the first iteration. The tricky part is that these IDs can be either a numerical value or a string. For instance, A(1,2) = 12345; A(2,2) = 'abcde'
Which kind of data structure should I use to store this mixed-value matrix?

You have two good options, a cell array or an array of structures.
To use a cell array you need to use braces:
A{1,1} = 1;
A{2,1} = 2;
A{1,2} = 12345;
A{2,2} = 'abcd';
You cannot use most vectorized code with cell arrays, although you can convert numeric subsets to numeric arrays, for example:
col1 = cell2mat(A(:,1));
To use an array of structures, you need to define fields. This has the advantage that you can name your columns of data.
A(1).iteration = 1;
A(2).iteration = 2;
A(1).result = 12345;
A(2).result = 'abcd';
To access a single row of data, use A(1), like this
>> A(1)
ans =
iteration: 1
result: 12345
To access a column of data, use brackets or braces
>> [A.iteration] %This returns a numeric array, or an error if not possible
ans =
1 2
>> {A.result} %This returns a cell array, as discussed above.
ans =
[12345] 'abcd'
Which option you use depends on the nature of your task and what method is more suitable to your style. I usually start with a cell array, and eventually convert to an array of structs to take advantage of the named fields.
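For the loop described in the question, a minimal sketch of the cell array option could look like the following. The ids variable stands in for whatever generates the ID on each iteration and is purely hypothetical.

```matlab
ids = {12345, 'abcde', 678};   % hypothetical IDs, one per iteration
n = numel(ids);

A = cell(n, 2);                % preallocate: column 1 iteration, column 2 ID
for k = 1:n
    A{k, 1} = k;               % iteration number
    A{k, 2} = ids{k};          % ID, numeric or string
end

iterations = cell2mat(A(:, 1));        % numeric column vector
isText = cellfun(@ischar, A(:, 2));    % true where the stored ID is a string
```

Preallocating with cell(n, 2) avoids growing A on every iteration when the number of iterations is known in advance.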
