I have a text file that looks like this:
(a (bee (cold down)))
if I load it using
c=textscan(fid,'%s');
I get this:
'(a'
'(bee'
'(cold'
'down)))'
What I would like to get is:
'('
'a'
'('
'bee'
'('
'cold'
'down'
')'
')'
')'
I know I can delimit with '(' and ')' by specifying 'Delimiter' in textscan, but then I will lose those characters, which I want to keep.
Thank you in advance.
The %s specifier indicates that you want strings; what you want is individual chars. Use %c instead.
c=textscan(fid,'%c');
Update: if you want to keep your words intact, then you'll want to load your text using the %s specifier. After the text is loaded, you can either solve this problem with regular expressions (not my forte) or write your own parser that parses each word individually and saves the parentheses and words to a new cell array.
AFAIK, there is no canned routine capable of preserving arbitrary delimiters.
You'd have to do it yourself:
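str = '(a (bee (cold down)))';
bo = str == '(';
bc = str == ')';
sp = str == ' ';
output = cell(numel(str)+1,1);   % upper bound: at most one token per character
j = 1;
for ii = 1:numel(str)
    if bo(ii) || bc(ii)
        % close any word in progress so the parenthesis does not overwrite it
        if ~isempty(output{j}), j = j + 1; end
        output{j} = str(ii);
        j = j + 1;
    elseif sp(ii)
        % a space only ends a word; it never produces a cell of its own
        if ~isempty(output{j}), j = j + 1; end
    else
        output{j} = [output{j} str(ii)];
    end
end
output(cellfun('isempty', output)) = [];   % trim the unused cells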
Which can probably be improved -- the growing character array will prevent the loop from being JIT'ed. The array bc | bo | sp holds all the information to vectorize this thing, I just don't see how at this hour...
Nevertheless, it should give you a place to start.
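For what it's worth, a regular expression can probably do the tokenizing in a single call. This is only a minimal sketch: it assumes the only delimiters are parentheses and whitespace, and the names str and tokens are just illustrative.

```matlab
str = '(a (bee (cold down)))';

% Match either a single parenthesis, or a run of characters that are
% neither parentheses nor whitespace.
tokens = regexp(str, '[()]|[^()\s]+', 'match');
tokens = tokens(:);   % column cell array, one token per row
```

Because the parentheses are matched rather than used as delimiters, they are kept as tokens of their own.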
Matlab has a strtok function similar to C. Its format is:
token = strtok(str)
token = strtok(str, delimiter)
[token, remain] = strtok(str, ...)
There is also a string replace function, strrep:
modifiedStr = strrep(origStr, oldSubstr, newSubstr)
What I would do is modify the original string with strrep to add in delimiters, then use strtok. Assuming the line is stored in c as a character vector (for example, read with fgetl):
c = strrep(c, '(', '( ');     % add a space after each open paren
c = strrep(c, ')', ' ) ');    % add a space before and after each close paren
token = cell(1, 10);          % preallocate for speed
i = 2;
[token{1}, remain] = strtok(c, ' ');
while ~isempty(remain)
    [token{i}, remain] = strtok(remain, ' ');   % continue from what is left
    i = i + 1;
end
token = token(~cellfun('isempty', token));   % drop unused and empty entries
This gives you a cell array holding each of the tokens you requested, in order.
strtok reference: http://www.mathworks.com/help/techdoc/ref/strtok.html
strrep reference: http://www.mathworks.com/help/techdoc/ref/strrep.html
How can I go about doing this? So far I've opened the file like this
fileID = fopen('hamlet.txt','r');
[A,count] = fscanf(fileID, '%s');
fclose(fileID);
Getting spaces from the file
First, if you want to capture spaces, you'll need to change your format specifier. %s reads only non-whitespace characters.
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%s');
>> fclose(fileID);
>> A
A = Thistexthasspacesinit.
Instead, we can use %c:
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%c');
>> fclose(fileID);
>> A
A = This text has spaces in it.
Mapping between characters and values (array indices)
We could create a character array that contains all of the target characters to look for:
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
That would work, but to map the character to a position in the array you'd have to do something like:
>> char_pos = find(search_chars == 'q')
char_pos = 43
You could also use containers.Map, but that seems like overkill.
Instead, let's use the ASCII value of each character. For convenience, we'll use only values 1:126 (0 is NUL, and 127 is DEL. We should never encounter either of those.) Converting from characters to their ASCII code is easy:
>> c = 's'
c = s
>> a = uint8(c) % MATLAB actually does this using double(). Seems wasteful to me.
a = 115
>> c2 = char(a)
c2 = s
Note that by doing this, you're counting characters that are not in your desired list like ! and *. If that's a problem, then use search_chars and figure out how you want to map from characters to indices.
Looping solution
The most intuitive way to count each character is a loop. For each character in A, find its ASCII code and increment the counter array at that index.
char_count = zeros(1, 126);
for current_char = A
c = uint8(current_char);
char_count(c) = char_count(c) + 1;
end
Now you've got an array of counts for each character with ASCII codes from 1 to 126. To find out how many instances of 's' there are, we can just use its ASCII code as an index:
>> char_count(115)
ans = 4
We can even use the character itself as an index:
>> char_count('s')
ans = 4
Vectorized solution
As you can see with that last example, MATLAB's weak typing makes characters and their ASCII codes pretty much equivalent. In fact:
>> 's' == 115
ans = 1
That means that we can use implicit expansion (available in R2016b and later) and == to create a logical 2D array where L(c,a) == 1 if the c-th character of our string A has an ASCII code of a. Then we can get the count for each ASCII code by summing down each column.
L = (A.' == [1:126]);
char_count = sum(L, 1);
A one-liner
Just for fun, I'll show one more way to do this: histcounts. This is meant to put values into bins, but as we said before, characters can be treated like values.
char_count = histcounts(double(A), 0.5:1:126.5); % edges at the half-integers give one bin per ASCII code 1..126
There are dozens of other possibilities, for instance you could use the search_chars array and ismember(), but this should be a good starting point.
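As a rough illustration of the search_chars and ismember() route, here is a minimal sketch. The sample text in A and the use of accumarray for the tally are my own choices, not the only way to do it.

```matlab
A = 'To be, or not to be.';                       % sample text (assumption)
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];

% idx(k) is the position of A(k) in search_chars, or 0 if it is not listed
[tf, idx] = ismember(A, search_chars);

% Tally one hit per listed character; unlisted characters are skipped
char_count = accumarray(idx(tf).', 1, [numel(search_chars), 1]).';

% Look up a count by character
n_o = char_count(search_chars == 'o');
```

Here char_count has one entry per character in search_chars, so characters such as ! and * are never counted.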
With [A,count] = fscanf(fileID, '%s'); the count output is a single total, not a count for each letter. Instead you can use regexp: given a cell array of the letters you care about, it returns a cell array whose entries hold the indices where each letter occurs. Counting the indices in each entry gives the count for each letter:
fileID = fopen('hamlet.txt','r');
A = fscanf(fileID, '%c');   % %c keeps whitespace; %s would drop the spaces
fclose(fileID);
indexCellArray = regexp(A,{'A','B','C','D',... %I'm too lazy to add the other letters now^^
'a','b','c','d',...
',','\.',' '});             % '.' must be escaped, otherwise it matches any character
letterCount = cellfun(@numel,indexCellArray);
You might want to store the counts together with their letters; otherwise you might lose track of which count belongs to which letter.
There may be a much easier solution, since typing all the letters into the regexp call is tedious, but it works.
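One way to keep counts and letters together is a containers.Map keyed by the character itself, since characters like ',' and ' ' are not valid struct field names. This is only a sketch: the sample text and the variable names letters, counts and countMap are made up for illustration, and strfind is swapped in for regexp to avoid escaping.

```matlab
A = 'abracadabra, a b.';                 % sample text (assumption)
letters = {'a', 'b', 'c', ',', '.', ' '};

% strfind does a literal search, so '.' needs no escaping here
counts = cellfun(@(ch) numel(strfind(A, ch)), letters);

% Map each character to its count
countMap = containers.Map(letters, counts);
countMap('a')    % count for the letter 'a'
```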
OK, so I have retrieved this string from the text file, and now I am supposed to shift it by a specified amount. For example, if the string I retrieved was
To be, or not to be
That is the question
and the shift number was 5 then the output should be
stionTo be, or not
to beThat is the que
I was going to use circshift, but the given string wouldn't have matching dimensions. Also, the string I retrieve would come from a .txt file.
So here is the code i used
S = sprintf('To be, or not to be\nThat is the question')
circshift(S,5,2)
but the output is
stionTo be, or not to be
That is the que
but i need
stionTo be, or not
to beThat is the que
By storing the locations of the new lines, removing the new lines and adding them back in later we can achieve this. This code does rely on the insertAfter function which is only available in MATLAB 2016b and later.
S = sprintf('To be, or not to be\nThat is the \n question');
nl = regexp(S,'\n');             % positions of the line breaks
S(nl) = '';                      % remove them before shifting
S = circshift(S,5,2);
for ii = 1:numel(nl)
    % re-insert each line break at its original position
    S = insertAfter(S,nl(ii)-1,sprintf('\n'));
end
You can do this by performing a circular shift on the indices of the non-newline characters. (The code below actually skips all control characters with ASCII code < 32.)
function T = strshift(S, k)
T = S;
c = find(S >= ' '); % shift only printable characters (ascii code >= 32)
T(c) = T(circshift(c, k, 2));
end
Sample run:
>> S = sprintf('To be, or not to be\nThat is the question')
S = To be, or not to be
That is the question
>> r = strshift(S, 5)
r = stionTo be, or not
to beThat is the que
If you want to skip only the newline characters, just change to
c = find(S ~= 10);
I'm trying to write an algorithm in MATLAB that scans a character array from left to right. If it encounters a single space, it should do nothing, but if it encounters 2 consecutive spaces, it should print the rest of the array on the next line. For example,
inpuut='a bc  d';
after applying this algorithm, the final output should have to be:
a bc
d
but this algorithm is giving me the output as:
a bc
d d
Also, if someone has a simpler algorithm for this task, please do help me :)
m=1; t=1;
inpuut='a bc  d';
while(m<=(length(inpuut)))
if((inpuut(m)==' ')&&(inpuut(m+1)==' '))
n=m;
fprintf(inpuut(t:(n-1)));
fprintf('\n');
t=m+2;
end
fprintf(inpuut(t));
if(t<length(inpuut))
t=t+1;
elseif(t==length(inpuut))
t=t-1;
else
end
m=m+1;
end
fprintf('\n');
OK I gave up telling why your code doesn't work. This is a working one.
inpuut='a bc  d ';
% remove trailing space
while (inpuut(end)==' ')
inpuut(end)=[];
end
str = regexp(inpuut, '  ', 'split'); % the pattern is two consecutive spaces
for ii = 1:length(str)
fprintf('%s\n', str{ii});
end
regexp with 'split' option splits the string into a cell array, with delimiter defined in the matching expression.
fprintf is capable of handling complicated strings, much more than printing a single string.
You can remove the trailing space before printing, or do it inside the loop (check if the last cell is empty, but it's more costly).
You can use regexprep to replace two consecutive spaces by a line feed:
result_string = regexprep(inpuut, '  ', '\n'); % the pattern is two consecutive spaces
If you need to remove trailing spaces, do this first:
inpuut = regexprep(inpuut, ' +$', '');
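Putting the two steps in order, a minimal sketch might look like this. The sample input with two trailing spaces is an assumption, and trimmed is just an illustrative name.

```matlab
inpuut = 'a bc  d  ';                              % note the two trailing spaces
trimmed = regexprep(inpuut, ' +$', '');            % strip trailing spaces first
result_string = regexprep(trimmed, ' {2}', '\n');  % each pair of spaces becomes a line feed
fprintf('%s\n', result_string);
```

Trimming first matters here: otherwise the trailing pair of spaces would also turn into a line feed.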
I have a solution without using regex, but I assumed you wanted to print on 2 lines maximum.
Example: with 'a b  c  hello':
a b
c hello
and not:
a b
c
hello
In any case, here is the code:
inpuut = 'a b  c';
while(length(inpuut) > 2)
% Read the next 2 character
first2char = inpuut(1:2);
switch(first2char)
case '  ' % 2 white spaces
% we add a new line and print the rest of the input
fprintf('\n%s', inpuut(3:end));
inpuut = [];
otherwise % not 2 white spaces
% Just print one character
fprintf('%s', inpuut(1))
inpuut(1) = [];
end
end
fprintf('%s\n', inpuut);
If possible, please let me know how I can read several different text files in MATLAB.
There are 33 .txt files, and each one should be processed.
Here is my code, which gives an error. :(
textFilename = cell(1,33);
id = cell(1,33);
for k=1:33;
textFilename{k} = fullfile('C:\Users\Desktop\SentimentCode\textfiles',['file' num2str(k) '.txt']);
id{k} = fopen(textFilename{k},'rt');
str{k} = textscan(id{k},'%s%s');
end
str(str == '.') = '';
str(str == '_') = '';
str(str == '-') = '';
% Remove numbers from text
T =regexprep(str, '[\d]', ' ');
and my error is : ??? Undefined function or method 'eq' for input arguments of type 'cell'.
Error in ==> Untitled9 at 23
str(str == '.') = '';
In your current edit, the error seems to come from the removal of the ., - and _ characters.
The == comparison works on character arrays, whereas textscan returns a cell array.
Instead of
str(str == '.') = '';
str(str == '_') = '';
str(str == '-') = '';
try using
regexprep(str,'(\.|-|_)','')
to replace all at once (the '\.' is needed as '.' is a special character).
This works on cellstrings so depending on how deep your cell structure goes you might need to call it within a for loop, str{k},str{k}{1}, str{k}{i} etc...
An alternative could be to look at cellfun
or/and strjoin... depending on how your data are arranged in the files.
Just by looking at the example code:
extFilename{k} = fullfile(..);
should be
textFilename{k} = fullfile(...);
Also, it is a good idea to close the files after you read them: fclose(id{k})
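To show the cleanup step in isolation, here is a small sketch that runs on a cell array of strings. The sample words are hypothetical and stand in for one column returned by textscan (something like str{k}{1} in the question's code), and the character class '[._-]' is an equivalent way of writing the alternation above.

```matlab
% Hypothetical sample standing in for one column returned by textscan
words = {'hello.world'; 'foo_bar-2baz'; 'abc123'};

words = regexprep(words, '[._-]', '');   % drop '.', '_' and '-'
words = regexprep(words, '\d', ' ');     % replace each digit with a space
```

regexprep accepts a cell array of character vectors directly, so no loop is needed at this level.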
I have the following string in MATLAB, for example
##%%F1_USA(40)_u
and I want
F1_USA_40__u
Is there a function for this?
Your best bet is probably regexprep which allows you to replace parts of a string using regular expressions:
s_new = regexprep(regexprep(s, '[()]', '_'), '[^A-Za-z0-9_]', '')
Update: based on your updated comment, this is probably what you want:
s_new = regexprep(regexprep(s, '^[^A-Za-z0-9_]*', ''), '[^A-Za-z0-9_]', '')
or:
s_new = regexprep(regexprep(s, '[^A-Za-z0-9_]', '_'), '^_*', '')
One way to do this is to use the function ISSTRPROP to find the indices of alphanumeric characters and replace or remove the others accordingly:
>> str = '##%%F1_USA(40)_u'; %# Sample string
>> index = isstrprop(str,'alphanum'); %# Find indices of alphanumeric characters
>> str(~index) = '_'; %# Set non-alphanumeric characters to '_'
>> str = str(find(index,1):end) %# Remove any leading '_'
str =
F1_USA_40__u %# Result
If you want to use regular expressions (which can get a little more complicated) then the last suggestion from Tamas will work. However, it can be greatly simplified to the following:
str = regexprep(str,{'\W','^_*'},{'_',''});