Matlab strsplit at non-keyboard characters - matlab

In this instance I have a cell array of lat/long coordinates that I am reading from file as strings with format:
x = {'27° 57'' 21.4" N', '7° 34'' 11.1" W'}
where the ° is actually a degree symbol (U+00B0).
I want to use strsplit() or some equivalent to get out the numerical components, but I don't know how to specify the degree symbol as a delimiter.
I'm hesitant to simply split at the ',' and index out the number, since as demonstrated above I don't know how many digits to expect.
I found elsewhere on the site the following suggestion:
x = regexp(split{1}, '\D+', 'split')
however this also separates the integer and decimal components of the decimal numbers.
Is there a strsplit() option, or some other equivalent I could use?

You can copy-paste the degree symbol from your data file to your M-file script. MATLAB fully supports Unicode characters in its strings. For example:
strsplit(str, {'°','"',''''})
to split the string at the three symbols.
Alternatively, you could use sscanf (or fscanf if reading directly from file) to parse the string:
str = '27° 57'' 21.4"';
dot( sscanf(str, '%f° %f'' %f"'), [1, 1/60, 1/3600] );
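The sscanf approach can be extended to the full cell array from the question, including the hemisphere letter. This is only a sketch: the variable names `dec` and `parts` and the sign convention (south and west negative) are assumptions, not part of the original answer.

```matlab
x = {'27° 57'' 21.4" N', '7° 34'' 11.1" W'};
dec = zeros(size(x));
for k = 1:numel(x)
    % sscanf stops at the trailing hemisphere letter, leaving [deg; min; sec]
    parts = sscanf(x{k}, '%f° %f'' %f"');
    dec(k) = dot(parts, [1, 1/60, 1/3600]);
    % Assumed convention: southern and western hemispheres are negative
    if any(x{k}(end) == 'SW')
        dec(k) = -dec(k);
    end
end
```

With the two sample strings this should give roughly 27.9559 and -7.5698 decimal degrees.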

The easiest solution is to copy-paste any Unicode character into your MATLAB editor, as suggested by Cris.
You can get these readily from the internet, or from the Windows Character Map.
You can also use unicode2native and native2unicode if you want to use byte values for your native Unicode settings.
% Get the Unicode value for '°'
>> unicode2native('°')
ans = uint8(176)
% Check the symbol for a given Unicode value
>> native2unicode(176)
ans = '°'
So
>> strsplit( 'Water freezes at 0°C', native2unicode(176) )
ans =
1×2 cell array
{'Water freezes at 0'} {'C'}
You can get the Unicode value by using hex2dec on the Hex value which you already knew, if you want to avoid unicode2native:
hex2dec('00B0') % = 176
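One caveat: unicode2native returns bytes in the default encoding, so on a setup where that encoding is UTF-8 it may return two bytes (194 176) rather than 176. A sketch that sidesteps the encoding by building the character directly from its code point with char (the variable names `deg`, `s` and `parts` are illustrative):

```matlab
% Build the degree sign from its Unicode code point (U+00B0 = 176)
deg = char(hex2dec('00B0'));
% Assemble the test string without typing the symbol in the source file
s = ['Water freezes at 0', deg, 'C'];
parts = strsplit(s, deg);
```

Here `parts` should be the same 1×2 cell array as in the example above.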

You can also improve your regular expression in order to catch the decimal part:
x = {'27° 57'' 21.4" N', '7° 34'' 11.1" W'}
x = regexp(x, '\d+\.?\d?', 'match')
x{:}
Result:
ans =
{
[1,1] = 27
[1,2] = 57
[1,3] = 21.4
}
ans =
{
[1,1] = 7
[1,2] = 34
[1,3] = 11.1
}
Where \d+\.?\d? means:
\d+ : one or more digits
%followed by
\.? : zero or one point
%followed by
\d? : zero or one digit (so at most one decimal place is captured; use \d* to allow more)
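If the seconds field can carry more than one decimal place, a slightly stricter pattern may be safer. The sketch below uses an optional group `(\.\d+)?` and converts the matches with str2double. The sample value 21.45 and the names `tokens` and `vals` are made up for illustration.

```matlab
x = {'27° 57'' 21.45" N', '7° 34'' 11.1" W'};
% \d+ followed by an optional fractional part of any length
tokens = regexp(x, '\d+(\.\d+)?', 'match');
% Convert each cell of matched text to a numeric row vector
vals = cellfun(@str2double, tokens, 'UniformOutput', false);
```

`vals{1}` should then hold 27, 57 and 21.45 as doubles.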

Consider using split and double with string:
>> x = {'27° 57'' 21.4" N'; '7° 34'' 11.1" W'};
>> x = string(x)
x =
2×1 string array
"27° 57' 21.4" N"
"7° 34' 11.1" W"
>> x = split(x,["° " "' " '" '])
x =
2×4 string array
"27" "57" "21.4" "N"
"7" "34" "11.1" "W"
>> double(x(:,1:3))
ans =
27.0000 57.0000 21.4000
7.0000 34.0000 11.1000

Related

How to calculate the number of appearance of each letter(A-Z ,a-z as well as '.' , ',' and ' ' ) in a text file in matlab?

How can I go about doing this? So far I've opened the file like this
fileID = fopen('hamlet.txt','r');
[A,count] = fscanf(fileID, '%s');
fclose(fileID);
Getting spaces from the file
First, if you want to capture spaces, you'll need to change your format specifier. %s reads only non-whitespace characters.
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%s');
>> fclose(fileID);
>> A
A = Thistexthasspacesinit.
Instead, we can use %c:
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%c');
>> fclose(fileID);
>> A
A = This text has spaces in it.
Mapping between characters and values (array indices)
We could create a character array that contains all of the target characters to look for:
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
That would work, but to map the character to a position in the array you'd have to do something like:
>> char_pos = find(search_chars == 'q')
char_pos = 43
You could also use containers.Map, but that seems like overkill.
Instead, let's use the ASCII value of each character. For convenience, we'll use only values 1:126 (0 is NUL, and 127 is DEL. We should never encounter either of those.) Converting from characters to their ASCII code is easy:
>> c = 's'
c = s
>> a = uint8(c) % MATLAB actually does this using double(). Seems wasteful to me.
a = 115
>> c2 = char(a)
c2 = s
Note that by doing this, you're counting characters that are not in your desired list like ! and *. If that's a problem, then use search_chars and figure out how you want to map from characters to indices.
Looping solution
The most intuitive way to count each character is a loop. For each character in A, find its ASCII code and increment the counter array at that index.
char_count = zeros(1, 126);
for current_char = A
c = uint8(current_char);
char_count(c) = char_count(c) + 1;
end
Now you've got an array of counts for each character with ASCII codes from 1 to 126. To find out how many instances of 's' there are, we can just use its ASCII code as an index:
>> char_count(115)
ans = 4
We can even use the character itself as an index:
>> char_count('s')
ans = 4
Vectorized solution
As you can see with that last example, MATLAB's weak typing makes characters and their ASCII codes pretty much equivalent. In fact:
>> 's' == 115
ans = 1
That means that we can use implicit expansion (MATLAB's form of broadcasting, available since R2016b) and == to create a logical 2D array where L(c,a) == 1 if character c in our string A has an ASCII code of a. Then we can get the count for each ASCII code by summing along the columns.
L = (A.' == [1:126]);
char_count = sum(L, 1);
A one-liner
Just for fun, I'll show one more way to do this: histcounts. This is meant to put values into bins, but as we said before, characters can be treated like values.
char_count = histcounts(double(A), 0.5:1:126.5); % 127 edges give 126 bins, one per ASCII code 1..126
There are dozens of other possibilities, for instance you could use the search_chars array and ismember(), but this should be a good starting point.
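As one illustration of the search_chars route mentioned above, here is a sketch using ismember together with accumarray. The sample text is the one from the space.txt example. Using accumarray is my own choice here, not something the answer prescribes.

```matlab
A = 'This text has spaces in it.';
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
% idx(k) is the position of A(k) in search_chars, or 0 if it is not listed
[tf, idx] = ismember(A, search_chars);
% Tally the positions; characters outside search_chars are skipped via tf
char_count = accumarray(idx(tf).', 1, [numel(search_chars), 1]).';
n_s = char_count(search_chars == 's');
```

char_count(k) is then the number of occurrences of search_chars(k), so the result stays aligned with the list of wanted characters and ignores everything else.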
With [A,count] = fscanf(fileID, '%s'); you only get the total number of characters, not a count per letter. You can use regexp here instead: it searches for each character you specify and returns a cell array whose cells contain the indices at which that character occurs. Counting the indices in each cell then gives the count for each letter:
fileID = fopen('hamlet.txt','r');
A = fscanf(fileID, '%c'); % '%c' keeps spaces; '%s' would drop them
indexCellArray = regexp(A,{'A','B','C','D',... %I'm too lazy to add the other letters now^^
'a','b','c','d',...
',','\.',' '}); % the dot must be escaped, otherwise it matches any character
letterCount = cellfun(@(x) numel(x),indexCellArray);
fclose(fileID);
Maybe put the counts in a struct where you can give field names for the letters, otherwise you might lose track of which count belongs to which letter.
There may well be a much easier solution, since listing every letter in the regexp call is tedious, but it works.

fprintf is not working as expected in MATLAB

I have the below code in MATLAB. I use version R2019a.
clc;
clear all;
K = 100;
r = 5:1:55;
W = 10;
t = round((2*K.*r.^2+W^2)./60);
%disp("Turn : " + r + " " + t);
str = "Turn : ";
fprintf("%s %d %d",str,r,t)
I want to use fprintf instead of disp. r and t are 1x51 double variables.
When I use fprintf without %s and str, the script prints the 51 values one by one without a problem. But if I use %s and str in fprintf, it only prints the first line, "Turn : 5 6 ", and then prints strange characters.
If I use disp, it works correctly.
I was just reading this blog post by Loren Shure of the MathWorks. It taught me about compose. compose perfectly solves your problem. Use it instead of fprintf to combine your data into strings. Then use fprintf to print the strings to screen:
s = compose("%s %d %d", str, r.', t.');
fprintf("%s\n", s)
compose is much more intuitive than fprintf in how the values for each % element are taken from the input data. It generates one string for each row in the data. str is a scalar, so its value is repeated for each row. r.' and t.' here have the same number of rows, which will also be the number of rows in s.
Note: compose is new in MATLAB R2016b.
You don't actually need any functions. When working with strings, if you use the + operator you get the conversion for free:
>> str + r + " " + t
ans =
1×51 string array
Columns 1 through 5
"Turn : 5 85" "Turn : 6 122" "Turn : 7 165" "Turn : 8 215" "Turn : 9 272"
...
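For completeness, the output in the question is consistent with how fprintf consumes its arguments: it takes them one after another and reuses the format string, so after str, r(1) and r(2) have been used, the next %s receives the number r(3). A sketch of a plain fprintf/sprintf variant that avoids this by moving the fixed text into the format string and stacking r and t as rows (the variable `out` is illustrative):

```matlab
K = 100; W = 10;
r = 5:1:55;
t = round((2*K.*r.^2 + W^2)./60);
% [r; t] is read column by column, giving r(1), t(1), r(2), t(2), ...
out = sprintf('Turn : %d %d\n', [r; t]);
fprintf('%s', out);
```

This prints one "Turn : r t" line per element, starting with "Turn : 5 85".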

Capitalize the first and last letter of three letter words in a string

I am trying to capitalize the first and last letter of only the three letter words in a string. So far, I have tried
spaces = strfind(str, ' ');
spaces = [0 spaces];
lw = diff(spaces);
lw3 = find(lw ==4);
a3 = lw-1;
b3 = spaces(a3+1);
b4 = b3 + 2 ;
str(b3) = upper(str(b3));
str(b4) = upper(str(b4));
We had to find where the 3-letter words are first, which is what the first four lines of code do. The remaining lines try to locate the first and last letters of those words and capitalize them.
I would use regular expressions to identify the 3-letter words and then use regexprep combined with an anonymous function to perform the case-conversion.
str = 'abcd efg hijk lmn';
% Custom function to capitalize the first and last letter of a word
f = @(x)[upper(x(1)), x(2:end-1), upper(x(end))];
% This will match 3-letter words and apply function f to them
out = regexprep(str, '\<\w{3}\>', '${f($0)}')
% abcd EfG hijk LmN
Regular expressions are definitely the way to go. I am going to suggest a slightly different route, and that is to return the indices using the tokenExtents flag for regexpi:
str = 'abcd efg hijk lmn';
% Tokenize the words and return the first and last index of each
idx = regexpi(str, '(\<\w{3}\>)', 'tokenExtents');
% Convert those indices to upper case
str([idx{:}]) = upper(str([idx{:}]));
Using the matlab_ipsum function from the File Exchange, I generated a 1000-paragraph random text string with mean word length 4 +/- 2.
str = lower(matlab_ipsum('WordLength', 4, 'Paragraphs', 1000));
The result was a 177,575 character string with 5,531 3-letter words. I used timeit to check the execution time of using regexprep and regexpi with tokenExtents. Using regexpi is an order of magnitude faster:
regexpi = 0.013979s
regexprep = 0.14401s
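For comparison, the index-based idea from the question can also be made to work without regular expressions. This is a sketch that assumes words are separated by single spaces. The names `bounds`, `lens`, `starts`, `first` and `last` are illustrative.

```matlab
str = 'abcd efg hijk lmn';
% Positions that bound each word: a virtual space before and after the text
bounds = [0, strfind(str, ' '), numel(str) + 1];
lens = diff(bounds) - 1;        % length of each word
starts = bounds(1:end-1) + 1;   % index of each word's first letter
first = starts(lens == 3);      % first letters of the 3-letter words
last = first + 2;               % last letters of the 3-letter words
str([first, last]) = upper(str([first, last]));
```

For the sample input this should produce 'abcd EfG hijk LmN', matching the regexprep result.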

Reading a text file in MATLAB?

I have a text file with the contents shown below. I need to read this file column-wise (i.e., 2 columns here). I have tried many ways, but cannot do it because the file contains "(" , "," , ")" etc. Please guide.
(-2.714141687294326, 0.17700122506478025)
(-2.8889905690592976, 0.1449494260855578)
(-2.74534285564141, 0.3182989792519164)
(-2.728716536554531, -0.3267545129349194)
(-2.280859632844493, -0.7413304490629143)
(-2.8205377507406095, 0.08946138452856946)
(-2.6261449731466335, -0.16338495969832847)
(-2.8863827317805537, 0.5783117541867042)
(-2.6727557978209546, 0.11377424587411682)
(-2.5069470906518565, -0.6450688986485736)
(-2.6127552309087236, -0.01472993916137419)
(-2.7861092661880185, 0.23511200020171835)
(-3.2238037438656533, 0.5113945870063824)
(-2.6447503899420304, -1.1787646364375748)
Try this:
x = importdata('filename.txt');
x = regexp(x,'-?\d+\.?\d*','match'); %// detect numbers as [-]a[.][bcd]
x = cellfun(@str2num, vertcat(x{:}));
If the file can contain numbers in decimal form ("1.234") and in scientific notation ("1.234e-56"):
x = importdata('filename.txt');
x = regexp(x,'-?\d+\.?\d*(e-?\d+)?','match');
x = cellfun(@str2num, vertcat(x{:}));
You can use fscanf and specify the desired format:
fid = fopen('filename.txt');
x = fscanf(fid,'(%f, %f)\n',[2,inf]).';
fclose(fid);
The format spec '(%f, %f)\n' reads two float values inside parentheses, separated by a comma, per line. With [2,inf] you specify that the values go into a 2 x n array, where n is as large as needed. To get one row per line of the file, as in the other answer, you have to transpose the result with .'.
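The same format spec can be tried out on in-memory text with sscanf before pointing fscanf at the real file. In this sketch the variable `txt` simply holds the first two lines of the sample data.

```matlab
% Two sample lines from the question, joined with newlines
txt = sprintf('%s\n', '(-2.714141687294326, 0.17700122506478025)', ...
                      '(-2.8889905690592976, 0.1449494260855578)');
% Fill a 2 x n array column by column, then transpose to n x 2
x = sscanf(txt, '(%f, %f)\n', [2, inf]).';
```

Each row of x then corresponds to one line of the input, with the two values in separate columns.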

issue with sscanf

I'm having issue with getting what I want from sscanf;
e.g. getting varname, year, month, day from a filename;
filename = 'stn2014021412598cjgafe.cnv'
format = '%3s%4d%2d%2d%5d%*10s';
test = sscanf(filename,format);
and I get the result:
test =
115
116
110
2014
2
14
12598
but what I want is the
varname = 'stn'
year = 2014
month = 2
day = 14
and then record or not the 5 digits
num = 12598
and skip everything else.
However, I don't understand why I get those three numbers: 115, 116, 110.
Those first three values are the character codes for 's', 't' and 'n'. The sscanf documentation explains why it comes out this way for your format specifier.
Mixing character and numeric conversion specifications causes the
resulting matrix to be numeric and any characters read to show up
as their numeric values, one character per MATLAB matrix element.
In other words:
>> char(test(1:3))'
ans =
stn
An easier solution is probably textscan since it stores the components in a cell array, allowing different types:
>> C = textscan(filename,format)
C =
{1x1 cell} [2014] [2] [14] [12598]
>> C{1}
ans =
'stn'
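Building on the textscan call above, the individual fields can be pulled out of the cell array into the variables the question asks for. This is a sketch: the conversion to double is an assumption on my part, since textscan's %d fields come back as an integer type.

```matlab
filename = 'stn2014021412598cjgafe.cnv';
format = '%3s%4d%2d%2d%5d%*10s';
C = textscan(filename, format);
varname = C{1}{1};      % the %3s field is a nested cell holding the text
year  = double(C{2});
month = double(C{3});
day   = double(C{4});
num   = double(C{5});   % the five-digit field; skip this line if not needed
```

The trailing %*10s in the format discards the rest of the name, so nothing after the five digits is stored.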