MATLAB: Alphanumeric character string extraction

As a foreword, I have been searching for solutions to this, and I have tried a myriad of codes but none of them work for the specific case.
I have a variable that is the registration number of different UK firms. The data was originally from Stata, and I had to use a code to import non-numeric data into Matlab. This variable (regno) is numeric up until observation 18000 (approx). From then it becomes registration numbers with both letters and numbers.
I wrote a very crude loop that takes the initial variable (a cell array), strips the double quotation marks, and extracts the characters into another matrix (double). The code is:
regno2 = strrep(regno,'"','');
regno3 = cell2mat(regno2);
regno4 = [];
for i = 1:size(regno3,1);
regno4(i,1) = str2double(regno3(i,1:8));
end
For the entries with both letters and numbers I get NaN. I need the values as doubles in order to use them as dummy indicator variables in MATLAB. Any ideas?
Thanks

Ok I'm not entirely sure about whether you need letters all the time, but here regular expressions would likely perform what you want.
Here is a simple example to help you get started; in this case I use regexp to locate the numbers in your entries.
clear
%// Create dummy entries
Case1 = 'NI000166';
Case2 = '12ABC345';
%// Put them in a cell array, like what you have.
CasesCell = {Case1;Case2};
%// Use regexp to locate the numbers in the expression. This will give the indices of the numbers, i.e. their position within each entry. Note that regexp can operate on cell arrays, which is useful to us here.
NumberIndices = regexp(CasesCell,'\d');
%// Here we use cellfun to fetch the actual values in each entry, based on the indices calculated above.
NumbersCell = cellfun(@(x,y) x(y),CasesCell,NumberIndices,'uni',0)
Now NumbersCell looks like this:
NumbersCell =
'000166'
'12345'
You can convert it to a number with str2num (or str2double) and you're good to go.
Note that for entries such as 00001234 and SC001234, regexp returns the strings '00001234' and '001234', which differ as strings. Once converted to numbers, however, both become 1234, so compare the strings if such entries must stay distinct. If the entries have different lengths and similar digits, you would need to add a bit of code with regexp to take the letters into account.
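If the digits alone are converted with str2double, '00001234' and 'SC001234' both end up as 1234. One possible workaround, which is not part of the answer above, is to give every distinct registration string its own integer code with unique. The sketch below assumes a small, hypothetical regno cell array in place of the imported data.

```matlab
% Hypothetical sample standing in for the imported regno column.
regno = {'"00001234"'; '"SC001234"'; '"NI000166"'};

regnoClean = strrep(regno, '"', '');            % strip the double quotes
digitsOnly = regexprep(regnoClean, '\D', '');   % delete every non-digit character
regnoNum   = str2double(digitsOnly);            % digits as doubles; letters are lost

% Alternative: one integer code per distinct registration string.
[firms, ~, firmId] = unique(regnoClean);        % firmId(i) indexes into firms
```

Here regnoNum contains 1234 twice, while firmId keeps the three firms apart and is already a double column that can serve as a grouping variable.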
Hope that helps! If you need clarifications or if I misunderstood something please tell me!

Related

Reading strings into individual array/matrix elements in Matlab

I have a text file that contains a number of strings, one per line, as below.
Happy
Sad
Disgust
Happy
Sad
Etc...
I want to be able to read these strings and store them into an array or matrix within Matlab. At the moment the code I have works, but it stores all of the strings into a single array element, like this.
HappySadDisgustHappySad...
I want the strings to be stored in their own individual elements.
What changes do I make to this code to make this happen?
emotionFile = fopen('emotion_labels_purged.txt','r');
formatSpec = '%s';
sizeEmotionLabels = [1 Inf];
emotionLabels = fscanf(emotionFile,formatSpec,sizeEmotionLabels)
fclose(emotionFile);
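One way to get one string per element is to read into a cell array with textscan instead of fscanf. This is a minimal sketch; it writes a small hypothetical label file to a temporary location so that it runs on its own.

```matlab
% Write a small hypothetical label file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'Happy', 'Sad', 'Disgust');
fclose(fid);

% Read one string per line into a cell array.
fid = fopen(fname, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
emotionLabels = C{1};   % column cell array, one label per cell
delete(fname);          % remove the temporary file
```

Individual labels are then available as emotionLabels{1}, emotionLabels{2}, and so on.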

Octave / Matlab - Reading fixed width file

I have a fixed width file format (original was input for a Fortran routine). Several lines of the file look like the below:
1078.0711005.481 932.978 861.159 788.103 716.076
How this actually should read:
1078.071 1005.481 932.978 861.159 788.103 716.076
I have tried various methods (textscan, fgetl, fscanf, etc.); however, the problem I have is that, as seen above, the fixed width of the original files sometimes leaves no whitespace between the numbers. I can't seem to find a way to read them directly and I can't change the original format.
The best I have come up with so far is to use fgetl which reads the whole line in, then I reshape the result into an 8,6 array
A=fgetl(fid)   % fid is assumed to be the file identifier returned by fopen
A=reshape(A,8,6)
which generates the following result
11
009877
703681
852186
......
049110
787507
118936
So now I have the above and thought I might be able to concatenate the rows of that array together to form each number, although that is seeming difficult as well having tried strcat, vertcat etc.
All of that seems a long way round so was hoping for some better suggestions.
Thanks.
If you can rely on three decimal numbers you can use a simple regular expression to generate the missing blanks:
s = '1078.0711005.481 932.978 861.159 788.103 716.076';
s = regexprep(s, '(\.\d\d\d)', '$1 ');
c = textscan(s, '%f');
Now c{1} contains your numbers. This will also work if s is in fact the whole file instead of one line.
You haven't mentioned which class of output you need, but I guess you need to read doubles from the file to do some calculations. I assume you are able to read your file, since you already have results from reshape(). However, reshape() will not be efficient in your case, since your values do not all have the same number of digits (e.g. 1078.071 and 932.978).
If I didn't misunderstand your problem:
Your data is squashed together in some places (e.g. 1078.0711005.481 instead of 1078.071 1005.481).
The fractional part of each value has 3 digits.
First of all we need to get rid of spaces from the string array:
A = A(~ismember(A,' '));
Then using the information that fractional parts are 3 digits:
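iter = length(strfind(A, '.'));    % one decimal point per number
B = zeros(1, iter);                % preallocate the output
for k=1:iter
ind = find(A == '.', 1);           % position of the first decimal point
B(k)=str2double(A(1:ind+3));       % digits up to 3 places after the point
A = A(ind+4:end);                  % drop the number just parsed
end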
B will be an array of doubles as a result.
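The reshape attempt from the question can also be completed directly: because every field is 8 characters wide, transposing the reshaped character matrix puts one number on each row. This is a sketch that assumes the 8-character width holds for every line of the file; fieldWidth and fields are names chosen for illustration.

```matlab
% One line of the file, as in the question (6 fields of 8 characters each).
A = '1078.0711005.481 932.978 861.159 788.103 716.076';

fieldWidth = 8;                          % assumed fixed column width
fields = reshape(A, fieldWidth, []).';   % one field per row after transposing
B = str2double(cellstr(fields)).';       % row vector of doubles
```

str2double ignores the leading blanks that pad the shorter fields, so no extra trimming is needed.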

Filter on words in Matlab tables (as in Excel)

In Excel you can use the "filter" function to find certain words in your columns. I want to do this in Matlab over the entire table.
Using the Matlab example-table "patients.dat" as example; my first idea was to use:
patients.Gender=={'Female'}
which does not work.
strcmp(patients.Gender,{'Female'})
works only for one column ("Gender").
My problem: I have a table with different words say 'A','B','bananas','apples',.... spread out in an arbitrary manner in the columns of the table. I only want the rows that contain, say, 'A' and 'B'.
It is strange that I did not find this in the Matlab help, because it seems basic. I searched Stack Overflow but did not find an answer there either.
A table in Matlab can be seen as an extended cell array. For example, it additionally lets you name the columns.
However in your case you want to search in the whole cell array and do not care for any extra functionality of a table. Therefore convert it with table2cell.
Then you want to search for certain words. You could use a regexp but in the examples you mention strcmp also is sufficient. Both work right away on cell arrays.
Finally you only need to find the rows of the logical search matrix.
Here is an example that gets the rows of all patients that are 'Male' and in 'Excellent' condition from the Matlab example data set:
patients = readtable('patients.dat');
patients_as_cellarray = table2cell(patients);
rows_male = any(strcmp(patients_as_cellarray, 'Male'), 2); % is 'Male' on any column for a specific row
rows_excellent = any(strcmp(patients_as_cellarray, 'Excellent'), 2); % is 'Excellent' on any column for a specific row
rows = rows_male & rows_excellent; % logical combination
patients(rows, :)
which indeed prints out only male patients in excellent condition.
Here is a simpler, more elegant syntax for you:
matches = ((patients.Gender =='Female') & (patients.Age > 26));
subtable_of_matches = patients(matches,:);
% Alternatively, select only the columns you want to appear,
% and their order, in the new subtable. The column names below are
% placeholders; substitute the variable names of your own table.
subtable_of_matches = patients(matches,{'Name','Age','Special_Data'});
Please note that in this example, you need to make sure that patients.Gender is a categorical type. You can use categorical(variable) to convert the variable to a categorical, and reassign it to the table variable like so:
patients.Gender = categorical(patients.Gender);
Here is a reference for you: https://www.mathworks.com/matlabcentral/answers/339274-how-to-filter-data-from-table-using-multiple-strings
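The categorical approach can be tried on a small table built by hand. The table T and its variable names below are hypothetical and only serve to make the sketch self-contained.

```matlab
% Hypothetical table; the variable names are invented for illustration.
T = table({'Ann'; 'Bob'; 'Cy'}, {'Female'; 'Male'; 'Female'}, [31; 45; 22], ...
          'VariableNames', {'Name', 'Gender', 'Age'});

T.Gender = categorical(T.Gender);                  % enables == against a string
matches  = (T.Gender == 'Female') & (T.Age > 26);  % logical row mask
subT     = T(matches, {'Name', 'Age'});            % keep only selected columns
```

Only the row for 'Ann' satisfies both conditions, so subT has a single row.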

Matlab indexing within eval

I would like to change values within a dataset using eval. It should work in such a way that every second value is changed to the one before it.
Short example:
A = magic(6)
ds = mat2dataset(A) % original dataset
ds.A1(2:2:end) = ds.A1(1:2:end) % dataset after change
That's the way I would like to do it. Now I need to use the variables letter and number, which are assigned previously in the function.
letter = 'A'
number = '1'
eval([strcat('ds.', letter, number)]) % now gives me all values.
This is exactly the point where I would like to index the (1:2:end) to get just the indexed values.
Does one of you have a good idea how to index within the eval function? I would also prefer other ways of doing it if you have one.
Thanks a lot!
1) Don't use eval to achieve dynamic fieldnames:
h=ds.([letter, number])
2) Double indexing is not possible, you need two lines to achieve it.
h(1:2:end)
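Put together, the two steps can also write the changed values back. The sketch below uses a plain struct s as a stand-in for the dataset, which is an assumption made so that it runs without the Statistics Toolbox; the dynamic name syntax is the same.

```matlab
A = magic(6);
s.A1 = A(:, 1);                 % struct field standing in for ds.A1

letter = 'A';
number = '1';
name = [letter, number];        % dynamic field name 'A1'

h = s.(name);                   % step 1: fetch the column by name
h(2:2:end) = h(1:2:end);        % step 2: copy each odd element to the next one
s.(name) = h;                   % write the result back under the same name
```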

Create a variable of a specific length and populate it with 0's and 1's

I am trying to use MATLAB in order to simulate a communications encoding and decoding mechanism. Hence all of the data will be 0's or 1's.
Initially I created a vector of a specific length and populated with 0's and 1's using
source_data = rand(1,8192)<.7;
For encoding I need to perform XOR operations multiple times which I was able to do without any issue.
For the decoding operation I need to implement the Gaussian Elimination method to solve the set of equations where I realized this vector representation is not very helpful. I tried to use strcat to append multiple 0's and 1's to a variable a using a for loop:
for i=1:8192
if(mod(i,2)==0)
a = strcat(a,'0');
else
a = strcat(a,'1');
end
i = i+1;
disp(i);
end
When I tried length(a) after this I found that the length was 16384, which is twice 8192. I am not sure where I am going wrong or how best to tackle this.
Did you reinitialize a before the example code? Sounds like you ran it twice without clearing a in between, or started with a already 8192 long.
Growing an array in a loop like this in Matlab is inefficient. You can usually find a vectorized way to do stuff like this. In your case, to get an 8192-long array of alternating ones and zeros, you can just do this.
len = 8192;
a = double(mod(1:len,2) == 1);   % 1 at odd positions, 0 at even ones, as in the loop above
And logicals might be more suited to your code, so you could skip the double() call.
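As a small illustration of the logical variant, xor works element-wise on logical vectors and returns a logical result. The second vector b is a hypothetical bit pattern, and the length is shortened to 8 for readability.

```matlab
len = 8;                          % short length for illustration; the question uses 8192
a = mod(1:len, 2) == 1;           % logical vector 1 0 1 0 ...
b = logical([1 1 0 0 1 0 1 1]);   % hypothetical second bit vector
c = xor(a, b);                    % element-wise XOR, result is logical
```

Applying xor(c, b) recovers a, which is the property the encoding step relies on.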
There are probably a few questions here. Firstly, how can one go from an arbitrary vector containing {0,1} elements to a string? One way would be to use cellfun with the converter num2str:
dataDbl = rand(1,8192)<.7; %see the original question
dataStr = cellfun(@num2str, num2cell(dataDbl));
Note that cellfun concatenates uniform outputs.
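A vectorized alternative, not taken from the answer above, is to add the character code of '0' to the bit vector and convert the result with char. The short dataBits vector is hypothetical.

```matlab
dataBits   = logical([1 0 1 1 0]);   % hypothetical short bit vector
dataStr    = char(dataBits + '0');   % '0' has code 48, so true -> '1', false -> '0'
backToBits = (dataStr == '1');       % convert the string back to logicals
```

This avoids the per-element function calls of cellfun, which may matter for long vectors.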