Matlab: how to convert character array or string into a formatted output OR parse a string - matlab

Could someone please tell me how to convert character array into a formatted output using Matlab?
I am expecting data like this:
CHAR (1 x 29) : 0.050822999 3.141592979 ; (1)
OR
CELL (1 x 1) or string: '0.050822999 3.141592979 ; (1)'
I am looking for output like this:
d1 = 0.050822999; %double
d2 = 3.141592979; %double
index = 1; % integer
I tried transposing and then using str2num(Str'); but, it's returning me 0x 0 double.
Any help would be appreciated.
Regards,
DK

you can use regexp to parse the string
c = { '0.050822999 3.141592979 ; (1)' };
p = regexp( c{1}, '^(\d+\.\d+)\s(\d+\.\d+)\s*;\s*\((\d+)\)$', 'tokens', 'once' ); %//parse the input string
numbers = str2mat(p); %// convert extracted strings to numerical values
Example result
ans =
0.050822999
3.141592979
1
Explaining the regexp pattern:
^ - pattern starts at the beginning of the input string
(\d+\.\d+) - parentheses ('()') enclosing this sub-pattern indicates it as a single token
\d+ matches one or more digits, then expecting \. a dot (notice the \, since . alone in regexp acts as a wildcard) and after the dot \d+ one or more digits are expected.
This token should correspond to the first number, e.g., 0.050822999
\s expecting a single space
(\d+\.\d+) - again, expecting another decimal fraction as the second token.
\s* - expecting white space (zero or more).
; - capture the ; in the expression, but not as a token.
\s+ - expecting white space (zero or more).
\( - expecting an open parenthesis, note the \ since parentheses in regexp are used to denote tokens.
(\d+) - expecting one or more digits as the third token, only integer numbers are expected here. no decimal point.
\) - expecting a closing parenthesis.
$ - pattern should reach the end of the input string.

You can use something like this (if I understood you correctly)
function str_dump(var)
info = whos;
disp([info.class ' ' mat2str(info.size) ' : ' var]);
end
This just shows information about the string. If you want to parse it and convert to another Matlab's structure, you have to explain it more carefully.

%// Input
a = [0.050822999 3.141592979];
n = 1;
%// Output
str = [num2str(a,'%0.9f ') ' ; (' num2str(n) ')']
Result:
str =
0.050822999 3.141592979 ; (1)

Related

How to calculate the number of appearance of each letter(A-Z ,a-z as well as '.' , ',' and ' ' ) in a text file in matlab?

How can I go about doing this? So far I've opened the file like this
fileID = fopen('hamlet.txt'.'r');
[A,count] = fscanf(fileID, '%s');
fclose(fileID);
Getting spaces from the file
First, if you want to capture spaces, you'll need to change your format specifier. %s reads only non-whitespace characters.
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%s');
>> fclose(fileID);
>> A
A = Thistexthasspacesinit.
Instead, we can use %c:
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%c');
>> fclose(fileID);
>> A
A = This text has spaces in it.
Mapping between characters and values (array indices)
We could create a character array that contains all of the target characters to look for:
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
That would work, but to map the character to a position in the array you'd have to do something like:
>> char_pos = find(search_chars == 'q')
char_pos = 43
You could also use containters.Map, but that seems like overkill.
Instead, let's use the ASCII value of each character. For convenience, we'll use only values 1:126 (0 is NUL, and 127 is DEL. We should never encounter either of those.) Converting from characters to their ASCII code is easy:
>> c = 'q'
c = s
>> a = uint8(c) % MATLAB actually does this using double(). Seems wasteful to me.
a = 115
>> c2 = char(a)
c2 = s
Note that by doing this, you're counting characters that are not in your desired list like ! and *. If that's a problem, then use search_chars and figure out how you want to map from characters to indices.
Looping solution
The most intuitive way to count each character is a loop. For each character in A, find its ASCII code and increment the counter array at that index.
char_count = zeros(1, 126);
for current_char = A
c = uint8(current_char);
char_count(c) = char_count(c) + 1;
end
Now you've got an array of counts for each character with ASCII codes from 1 to 126. To find out how many instances of 's' there are, we can just use its ASCII code as an index:
>> char_count(115)
ans = 4
We can even use the character itself as an index:
>> char_count('s')
ans = 4
Vectorized solution
As you can see with that last example, MATLAB's weak typing makes characters and their ASCII codes pretty much equivalent. In fact:
>> 's' == 115
ans = 1
That means that we can use implicit broadcasting and == to create a logical 2D array where L(c,a) == 1 if character c in our string A has an ASCII code of a. Then we can get the count for each ASCII code by summing along the columns.
L = (A.' == [1:126]);
char_count = sum(L, 1);
A one-liner
Just for fun, I'll show one more way to do this: histcounts. This is meant to put values into bins, but as we said before, characters can be treated like values.
char_count = histcounts(uint8(A), 1:126);
There are dozens of other possibilities, for instance you could use the search_chars array and ismember(), but this should be a good starting point.
With [A,count] = fscanf(fileID, '%s'); you'll only count all string letters, doesn't matter which one. You can use regexp here which search for each letter you specify and will put it in a cell array. It consists of fields which contains the indices of your occuring letters. In the end you only sum the number of indices and you have the count for each letter:
fileID = fopen('hamlet.txt'.'r');
A = fscanf(fileID, '%s');
indexCellArray = regexp(A,{'A','B','C','D',... %I'm too lazy to add the other letters now^^
'a','b','c','d',...
','.' '};
letterCount = cellfun(#(x) numel(x),indexCellArray);
fclose(fileID);
Maybe you put the cell array in a struct where you can give fieldnames for the letters, otherwise you might loose track which count belongs to which number.
Maybe there's much easier solution, cause this one is kind of exhausting to put all the letters in the regexp but it works.

Regexp equivalent for floating number in matlab

I'd like to know how regexp is used for floating numbers or is there any other function to do this.
For example, the following returns {'2', '5'} rather than {'2.5'}.
nums= regexp('2.5','\d+','match')
Regular expressions are a tool for low-level text parsing and they have no concept of numeric datatypes. If you will want to parse decimals, you need to consider what characters compose a decimal number and design a regexp to explicitly match all of those characters.
Your example only returns the '2' and '5' because your pattern only matches characters that are digits (\d). To handle the decimal numbers, you need to explicitly include the . in your pattern. The following will match any number of digits followed by 0 or 1 radix points and 0 or more numbers after the radix point.
regexp('2.5', '\d+\.?\d*', 'match')
This assumes that you'll always have a leading digit (i.e. not '.5')
Alternately, you may consider using something like textscan or sscanf instead to parse your string which is going to be more robust than a custom regex since they are aware of different numeric datatypes.
C = textscan('2.5', '%f');
C = sscanf('2.5', '%f');
If your string only contains this floating point number, you can just use str2double
val = str2double('2.5');
#Suever answer has already been accepted, anyway here is some more complete one that should accept all sorts of floating points syntaxes (including NaN and +/-Inf by default):
% Regular expression for capturing a double value
function [s] = cdouble(supportPositiveInfinity, supportNegativeInfinity,
supportNotANumber)
%[
if (nargin < 3), supportNotANumber = true; end
if (nargin < 2), supportNegativeInfinity = true; end
if (nargin < 1), supportPositiveInfinity = true; end
% A double
s = '[+\-]?(?:(?:\d+\.\d*)|(?:\.\d+)|(?:\d+))(?:[eE][+\-]?\d+)?'; %% This means a numeric double
% Extra for nan or [+/-]inf
extra = '';
if (supportNotANumber), extra = ['nan|' extra]; end
if (supportNegativeInfinity), extra = ['-inf|' extra]; end
if (supportPositiveInfinity), extra = ['inf|\+inf|' extra]; end
% Adding capture
if (~isempty(extra))
s = ['((?i)(?:', extra, s, '))']; % (?i) => Locally case-insensitive for captured group
else
s = ['(', s, ')'];
end
%]
end
Basically above pattern says:
Eventually start with '+' or '-' sign
Then followed by either
One or more digits followed by a dot and eventually zero to many digits
A dot followed by one or more digits
One or more digits only
Then followed by exponant pattern, that is:
'e' or 'E' (eventually followed with '+' or '-') and one or more digits
Pattern is later completed with support of Inf, +Inf, -Inf and NaN in case insensitive way. Finally everthing is enclosed between ( and ) for capturing purpose.
Here is some online test example: https://regex101.com/r/K87z6e/1
Or you can test in matlab:
>> regexp('Hello 1.235 how .2e-7 are INF you +.7 doing ?', cdouble(), 'match')
ans =
'1.235' '.2e-7' 'INF' '+.7'

How to read a specific number (or word) from an answer

I have an .nc file I'm reading in matlab, and getting info out of the time variable.
the code looks like this
>> ncreadatt(model_list{3},'T','units')
ans =
'months since 1850-01-01'
what I want to do is get just the '1850' out of the answer.
Regular expression is a very powerful tool to parse and manipulate strings.
Matlab has regexp command:
line = 'months since 1850-01-01';
res = regexp( line, '\s(\d+)-', 'tokens', 'once');
year = str2double(res{1})
And the results is:
year =
1850
The regular expression used '\s(\d+)-' means:
\s - look for a single white space character (the space before 1850).
'(\d+)' - look for one or more digit ('\d+'), the parentheses means that all charcters matching here will be saved as a "token".
'-' - look for a single '-' after the digits.
You can play with it on ideone.

Read specific character from cell-array of string

I have an cell-array of dimensions 1x6 like this:
A = {'25_2.mat','25_3.mat','25_4.mat','25_5.mat','25_6.mat','25_7.mat'};
I want to read for example from the A{1} , the number after the '_' i.e 2 for my example
Using cellfun, strfind and str2double
out = cellfun(#(x) str2double(x(strfind(x,'_')+1:strfind(x,'.')-1)),A)
How does it work?
This code simply finds the index of character one number after the occurrence of '_'. Lets call it as start_index. Then finds the character one number lesser than the index of occurrence of '.' character. Lets call it as end_index. Then retrieves all the characters between start_index and end_index. Finally converts those characters to numbers using str2double.
Sample Input:
A = {'2545_23.mat','2_3.mat','250_4.mat','25_51.mat','25_6.mat','25_7.mat'};
Output:
>> out
out =
23 3 4 51 6 7
You can access the contents of the cell by using the curly braces{...}. Once you have access to the contents, you can use indexes to access the elements of the string as you would do with a normal array. For example:
test = {'25_2.mat', '25_3.mat', '25_4.mat', '25_5.mat', '25_6.mat', '25_7.mat'}
character = test{1}(4);
If your string length is variable, you can use strfind to find the index of the character you want.
Assuming the numbers are non-negative integers after the _ sign: use a regular expression with lookbehind, and then convert from string to number:
numbers = cellfun(#(x) str2num(x{1}), regexp(A, '(?<=\_)\d+', 'match'));

Unicode character transformation in SPSS

I have a string variable. I need to convert all non-digit characters to spaces (" "). I have a problem with unicode characters. Unicode characters (the characters outside the basic charset) are converted to some invalid characters. See the code for example.
Is there any other way how to achieve the same result with procedure which would not choke on special unicode characters?
new file.
set unicode = yes.
show unicode.
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
end data.
string Z (a10).
comp Z = T.
loop #k = 1 to char.len(Z).
if ~range(char.sub(Z, #k, 1), "0", "9") sub(Z, #k, 1) = " ".
end loop.
comp Z = normalize(Z).
comp len = char.len(Z).
list var = all.
exe.
The result:
T Z len
1234 1234 4
5678 5678 4
absd 0
12as 12 2
12(a 12 2
12(vi 12 2
12(vī 12 � 6
>Warning # 649
>The first argument to the CHAR.SUBSTR function contains invalid characters.
>Command line: 1939 Current case: 8 Current splitfile group: 1
12āčž 12 �ž 7
Number of cases read: 8 Number of cases listed: 8
The substr function should not be used on the left hand side of an expression in Unicode mode, because the replacement character may not be the same number of bytes as the character(s) being replaced. Instead, use the replace function on the right hand side.
The corrupted characters you are seeing are due to this size mismatch.
How about instead of replacing non-numeric characters, you cycle though and pull out the numeric characters and rebuild Z? (Note my version here is pre CHAR. string functions.)
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
12as23
end data.
STRING Z (a10).
STRING #temp (A1).
COMPUTE #len = LENGTH(RTRIM(T)).
LOOP #i = 1 to #len.
COMPUTE #temp = SUBSTR(T,#i,1).
DO IF INDEX('0123456789',#temp) > 0.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1),#temp).
ELSE.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1)," ").
END IF.
END LOOP.
EXECUTE.