I have some rows in a text file that contain NA and I want to delete them.
When I used isempty(strfind(l,'NA')), this also deletes rows whose strings merely contain NA, such as 'RNASE' and 'GNAS'.
Example:
0.552353744371678 NA
0.0121476193502138 ANG;RNASE
0.189489997218949 GNAS
0.0911820441646675 MYCL1
output:
0.0911820441646675 MYCL1
output expected:
0.0121476193502138 ANG;RNASE
0.189489997218949 GNAS
0.0911820441646675 MYCL1
I do not know how to express the following in a single regexp:
"NA that does not have any alphanumeric character before or after".
It is easy if you know there will be at least one other character before and after:
ind = regexp(str, '[^A-Za-z_]NA[^A-Za-z_]'); % Or something similar, depending on what exactly can and cannot be there.
However, this pattern requires a character before and after, so it will not match a single 'NA' by itself.
That is to say, I am nearly certain a suitable regexp exists, I just don't know it :)
What I would do is the following (assuming strl is a single line of text that you are deciding to keep or remove, and that it might contain multiple NA).
ind = regexp(strl, 'NA'); % Start index of every NA in the line.
removestr = ~isempty(ind); % A line without any NA is kept.
for i = 1 : length(ind)
    k = ind(i);
    % NA stands alone if it is at the start or follows a non-letter,
    % and it is at the end or is followed by a non-letter.
    if (k == 1 || any(regexp(strl(k-1), '[^A-Za-z_]'))) ...
            && (k+1 == length(strl) || any(regexp(strl(k+2), '[^A-Za-z_]')))
        disp('This is maybe the string to remove - if there are no wrong NA''s later')
    else
        removestr = false;
        break; % stop checking in this loop, this string is to keep.
    end
end
if (removestr)
    disp('Remove string')
end
The conditions in the if are a bit overkill and quite slow, but should work. If you don't need to check for multiple NA in a single line, simply omit the for loop.
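For what it's worth, MATLAB's regexp also supports lookaround assertions, which may be the single-expression form this answer was looking for. The following is only a sketch under that assumption: the names rows, pattern, hasStandaloneNA and kept are made up for illustration, and the sample rows are the ones from the question.

```matlab
% Sample rows taken from the question.
rows = {'0.552353744371678 NA', ...
        '0.0121476193502138 ANG;RNASE', ...
        '0.189489997218949 GNAS', ...
        '0.0911820441646675 MYCL1'};

% NA that is neither preceded nor followed by a letter or underscore.
% (?<!...) is a negative lookbehind, (?!...) a negative lookahead.
pattern = '(?<![A-Za-z_])NA(?![A-Za-z_])';

% Hypothetical helper: true if the row contains a standalone NA.
hasStandaloneNA = @(s) ~isempty(regexp(s, pattern, 'once'));

% Keep only the rows without a standalone NA.
kept = rows(~cellfun(hasStandaloneNA, rows));
```

Note that this differs slightly from the loop above: a row that has both a standalone NA and an embedded one (for example 'NA;RNASE') is removed here, while the loop would keep it.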
How can I go about doing this? So far I've opened the file like this:
fileID = fopen('hamlet.txt','r');
[A,count] = fscanf(fileID, '%s');
fclose(fileID);
Getting spaces from the file
First, if you want to capture spaces, you'll need to change your format specifier. %s reads only non-whitespace characters.
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%s');
>> fclose(fileID);
>> A
A = Thistexthasspacesinit.
Instead, we can use %c:
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%c');
>> fclose(fileID);
>> A
A = This text has spaces in it.
Mapping between characters and values (array indices)
We could create a character array that contains all of the target characters to look for:
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
That would work, but to map the character to a position in the array you'd have to do something like:
>> char_pos = find(search_chars == 'q')
char_pos = 43
You could also use containers.Map, but that seems like overkill.
Instead, let's use the ASCII value of each character. For convenience, we'll use only values 1:126 (0 is NUL and 127 is DEL; we should never encounter either of those). Converting from characters to their ASCII code is easy:
>> c = 's'
c = s
>> a = uint8(c) % MATLAB actually does this using double(). Seems wasteful to me.
a = 115
>> c2 = char(a)
c2 = s
Note that by doing this, you're counting characters that are not in your desired list like ! and *. If that's a problem, then use search_chars and figure out how you want to map from characters to indices.
Looping solution
The most intuitive way to count each character is a loop. For each character in A, find its ASCII code and increment the counter array at that index.
char_count = zeros(1, 126);
for current_char = A
c = uint8(current_char);
char_count(c) = char_count(c) + 1;
end
Now you've got an array of counts for each character with ASCII codes from 1 to 126. To find out how many instances of 's' there are, we can just use its ASCII code as an index:
>> char_count(115)
ans = 4
We can even use the character itself as an index:
>> char_count('s')
ans = 4
Vectorized solution
As you can see with that last example, MATLAB's weak typing makes characters and their ASCII codes pretty much equivalent. In fact:
>> 's' == 115
ans = 1
That means that we can use implicit broadcasting and == to create a logical 2D array where L(c,a) == 1 if the c-th character of our string A has an ASCII code of a. Then we can get the count for each ASCII code by summing down each column.
L = (A.' == [1:126]);
char_count = sum(L, 1);
A one-liner
Just for fun, I'll show one more way to do this: histcounts. This is meant to put values into bins, but as we said before, characters can be treated like values. The second argument is a list of bin edges, so 1:127 gives one bin for each of the 126 codes.
char_count = histcounts(uint8(A), 1:127);
There are dozens of other possibilities; for instance, you could use the search_chars array and ismember(). But this should be a good starting point.
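The search_chars and ismember() variant mentioned above could look roughly like this. It is only a sketch: the string A is assumed to be the content of space.txt as displayed earlier, and counts is a name introduced here for illustration.

```matlab
% Assumed file content, copied from the output shown earlier.
A = 'This text has spaces in it.';

% The characters we care about, as defined earlier in this answer.
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];

% tf marks characters of A that are in search_chars,
% idx gives their position in search_chars (0 if absent).
[tf, idx] = ismember(A, search_chars);

% Count how often each position occurs; characters outside the list are ignored.
counts = accumarray(idx(tf).', 1, [numel(search_chars), 1]).';

% Look up a single character's count by its position in search_chars.
s_count = counts(search_chars == 's');
```

Unlike the ASCII-indexed versions, counts here has exactly one entry per character in search_chars, so characters such as ! and * are not counted at all.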
With [A,count] = fscanf(fileID, '%s'); you only get an overall count, not a count per letter. You can use regexp here, which searches for each letter you specify and returns the results in a cell array. Each cell contains the indices at which that letter occurs. In the end you just count the indices in each cell and you have the count for each letter:
fileID = fopen('hamlet.txt','r');
A = fscanf(fileID, '%c'); % %c keeps the spaces; %s would drop them
indexCellArray = regexp(A,{'A','B','C','D',... % I'm too lazy to add the other letters now^^
'a','b','c','d',...
',','\.',' '}); % the period is escaped so it matches a literal '.'
letterCount = cellfun(@(x) numel(x),indexCellArray);
fclose(fileID);
Maybe put the counts in a struct where you can use the letters as field names, otherwise you might lose track of which count belongs to which letter.
Maybe there's a much easier solution, because it is kind of exhausting to put all the letters into the regexp, but it works.
I am trying to separate data from a csv file into "blocks" of data that I then put into 10 different categories. Each block has a set of spaces at the top of it. Each category contains 660 blocks. Currently, my code successfully puts in the first block, but only the first block. It does correctly count the number of blocks though. I do not understand why it only puts in the first block when the block count works correctly, and any help would be appreciated.
The csv file can be downloaded from here.
https://archive.ics.uci.edu/ml/machine-learning-databases/00195/
fid = fopen('Train_Arabic_Digit.txt','rt');
traindata = textscan(fid, '%f%f%f%f%f%f%f%f%f%f%f%f%f', 'MultipleDelimsAsOne',true, 'Delimiter','[;', 'HeaderLines',1);
fclose(fid);
% Each line in Train_Arabic_Digit.txt or Test_Arabic_Digit.txt represents
% 13 MFCCs coefficients in the increasing order separated by spaces. This
% corresponds to one analysis frame. Lines are organized into blocks, which
% are a set of 4-93 lines separated by blank lines and corresponds to a
% single speech utterance of an spoken Arabic digit with 4-93 frames.
% Each spoken digit is a set of consecutive blocks.
% TO DO: how get blocks...split? with /n?
% In Train_Arabic_Digit.txt there are 660 blocks for each spoken digit. The
% first 330 blocks represent male speakers and the second 330 blocks
% represent the female speakers. Blocks 1-660 represent the spoken digit
% "0" (10 utterances of /0/ from 66 speakers), blocks 661-1320 represent
% the spoken digit "1" (10 utterances of /1/ from the same 66 speakers
% 33 males and 33 females), and so on up to digit 9.
content = fileread( 'Train_Arabic_Digit.txt' ) ;
default = regexp(content,'\n','split');
digit0=[];
digit1=[];
digit2=[];
digit3=[];
digit4=[];
digit5=[];
digit6=[];
digit7=[];
digit8=[];
digit9=[];
blockcount=0;
a=0;
for i=1:1:length(default)
if strcmp(default{i},' ')
blockcount=blockcount+1;
else
switch blockcount % currently only works for blockcount=1 even though
%it does pick up the number of blocks...
case blockcount>0 && blockcount<=660 %why does it not recognize 2 as being<660
a=a+1;
digit0=[digit0 newline default{i}];
case blockcount>660 && blockcount<=1320
digit1=[digit1 newline default{i}];
case blockcount<=1980 && blockcount>1320
digit2=[digit2 newline default{i}];
case blockcount<=2640 && blockcount>1980
digit3=[digit3 newline default{i}];
case blockcount<=3300 && blockcount>2640
digit4=[digit4 newline default{i}];
case blockcount<=3960 && blockcount>3300
digit5=[digit5 newline default{i}];
case blockcount<=4620 && blockcount>3960
digit6=[digit6 newline default{i}];
case blockcount<=5280 && blockcount>4620
digit7=[digit7 newline default{i}];
case blockcount<=5940 && blockcount>5280
digit8=[digit8 newline default{i}];
case blockcount<=6600 && blockcount>5940
digit9=[digit9 newline default{i}];
end
end
end
That's because you have confused if-else syntax with switch-case. Note that an expression like blockcount>0 && blockcount<=660 always returns a logical value, i.e. either 0 or 1. When blockcount is equal to 1, the first case expression also evaluates to 1 and the rest evaluate to 0, so 1==1 and the first block runs. But when blockcount becomes 2, the first case expression still evaluates to 1, and 2~=1, so nothing happens!
You can either use if-else or change your case expressions to cell arrays containing ranges of values. According to the docs:
The switch block tests each case until one of the case expressions is
true. A case is true when:
For numbers, case_expression == switch_expression.
For character vectors, strcmp(case_expression,switch_expression) == 1.
For objects that support the eq function, case_expression ==
switch_expression. The output of the overloaded eq function must be
either a logical value or convertible to a logical value.
For a cell array case_expression, at least one of the elements of the
cell array matches switch_expression, as defined above for numbers,
character vectors, and objects.
It should be something like:
switch blockcount
case num2cell(0:660)
digit0 ...
case num2cell(661:1320)
digit1 ...
...
end
However, this block of code will take forever to complete. Avoid a = [a b] in loops: resizing matrices is time-consuming. Preallocate a and assign with a(i) = b instead.
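One way to avoid both the long switch and the growing arrays is to compute the digit index arithmetically from the block number. The following is a rough sketch, not the asker's code: fileLines is a tiny hypothetical stand-in for the split file, and blocksPerDigit and nDigits are shrunk so the example is easy to follow (they would be 660 and 10 for the real data set).

```matlab
% Hypothetical stand-in for the split file: a ' ' line starts a new block.
fileLines = {' ', '1 2', '3 4', ' ', '5 6', ' ', '7 8', '9 0'};
blocksPerDigit = 2;   % 660 in the real data set
nDigits = 2;          % 10 in the real data set

isSep = strcmp(fileLines, ' ');            % separator lines
blockId = cumsum(isSep);                   % block number of every line
digitId = ceil(blockId / blocksPerDigit);  % 1-based digit index of every line

% Preallocate one cell per digit, then fill each cell in one step.
digits = cell(1, nDigits);
for d = 1:nDigits
    digits{d} = strjoin(fileLines(~isSep & digitId == d), newline);
end
```

Here digits{1} plays the role of digit0, digits{2} of digit1, and so on. The loop runs once per digit rather than once per line, so nothing is resized line by line.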
eliminate punctuation
split the words at new lines and spaces, then store them in an array
check the text file for errors with the function in the checkSpelling.m file
sum up the total number of errors in that article
"no suggestion" is assumed to be no error, then return -1
sum of errors > 20, return 1
sum of errors <= 20, return -1
I would like to check a certain paragraph for spelling errors, but I have a problem getting rid of the punctuation. There may be another cause as well; it returns the error below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(@isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W matches every non-alphanumeric character (including spaces), and each match is replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see, you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. At every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
UPDATE
The only problem I see is that the quote character ' gets removed (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or, instead of \W, you can list all the non-alphanumeric characters to be removed in square brackets.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
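To see what this pipeline produces, here is a small run on a made-up sentence. The sample text in CharData is an assumption for illustration only; the three processing lines are the ones from this answer.

```matlab
% Made-up sample text with apostrophes, punctuation and a line break.
CharData = sprintf('I''m here, don''t go!\nNew line.');

% Replace everything except letters, digits, apostrophe and underscore
% with a newline, then split on newlines.
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');
word = regexp(word_punctuation, '\n', 'split');

% Adjacent separators leave empty cells behind; drop them.
word(cellfun(@isempty, word)) = [];
```

After this, word holds the six entries I'm, here, don't, go, New and line, with the apostrophes intact.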
I would like to modify a string so that the first letter of each word is capitalized and all other letters are lower case, with anything else left unchanged.
I tried this:
function new_string=switchCase(str1)
%str1 represents the given string containing word or phrase
str1Lower=lower(str1);
spaces=str1Lower==' ';
caps1=[true spaces];
%we want the first letter and the letters after space to be capital.
strNew1=str1Lower;
strNew1(caps1)=strNew1(caps1)-32;
end
This function works nicely if there is nothing other than a letter after space. If we have anything else for example:
str1='WOW ! my ~Code~ Works !!'
Then it gives
new_string =
'Wow My ^code~ Works !'
However, it has to give (according to the requirement),
new_string =
'Wow! My ~code~ Works !'
I found some code that is similar to this problem. However, it is ambiguous. Here I can ask questions if I don't understand.
Any help will be appreciated! Thanks.
Interesting question +1.
I think the following should fulfil your requirements. I've written it as an example sub-routine and broken down each step so it is obvious what I'm doing. It should be straightforward to condense it into a function from here.
Note, there is probably also a clever way to do this with a single regular expression, but I'm not very good with regular expressions :-) I doubt a regular expression based solution will run much faster than what I've provided (but am happy to be proven wrong).
%# Your example string
Str1 ='WOW ! my ~Code~ Works !!';
%# Convert case to lower
Str1 = lower(Str1);
%# Convert to ascii
Str1 = double(Str1);
%# Find an index of all locations after spaces
I1 = logical([0, (Str1(1:end-1) == 32)]);
%# Eliminate locations that don't contain lower-case characters
I1 = logical(I1 .* ((Str1 >= 97) & (Str1 <= 122)));
%# Check manually if the first location contains a lower-case character
if Str1(1) >= 97 && Str1(1) <= 122; I1(1) = true; end;
%# Adjust all appropriate characters in ascii form
Str1(I1) = Str1(I1) - 32;
%# Convert result back to a string
Str1 = char(Str1);
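Condensed, the same steps might look like the sketch below. It stays on the character array instead of converting to ASCII codes, relying on MATLAB comparing characters by their codes. The variable names are chosen here for illustration, and the input is assumed to be a non-empty row vector.

```matlab
str = 'WOW ! my ~Code~ Works !!';   % example string from the question

s = lower(str);
% True at position 1 and at every position that directly follows a space.
afterSpace = [true, s(1:end-1) == ' '];
% True where the character is a lower-case letter.
isLower = (s >= 'a') & (s <= 'z');
% Capitalize only letters that start the string or follow a space.
mask = afterSpace & isLower;
s(mask) = upper(s(mask));
```

For the example string this leaves '~code~' and the punctuation untouched, because neither '~' nor '!' is a lower-case letter.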
I have a text file that I want to make into a list. I have asked two questions recently about this topic. The problem I keep coming across is that I want to parse the text file but the sections are of different lengths. So I cannot use
textscan(fid,'%s %s %s')
because the length of each gene varies. I have also had trouble using fields, because the code I use to set up the fields only allows for one line in each field. For the "note" field in the first gene below, I would like to be able to store multiple lines in one field and read them in. Currently I am getting errors saying that the index exceeds matrix dimensions.
fieldname = regexp(line{1},'/(.+)=','tokens','once');
value = regexp(line{1},'="?([^"]+)"?$','tokens','once');
Another possible way I see this working is using some sort of isLineEmpty to divide up the genes by the empty line that is between them.
Is there a way to have multiple lines in my field entry so I can get all the information associated with "note"? Or a way to use an isLineEmpty and skip using fields?
gene 218705..219367
/locus_tag="Rv0187"
/db_xref="GeneID:886779"
CDS 218705..219367
/locus_tag="Rv0187"
/EC_number="2.1.1.-"
/function="THOUGHT TO BE INVOLVED IN TRANSFER OF METHYL
GROUP."
/note="Rv0187, (MTCI28.26), len: 220 aa. Probable
O-methyltransferase (EC 2.1.1.-), similar to many e.g.
AB93458.1|AL357591 putative O-methyltransferase from
Streptomyces coelicolor (223 aa); MDMC_STRMY|Q00719
O-methyltransferase from Streptomyces mycarofaciens (221
aa), FASTA scores: opt: 327, E(): 2.4e-17, (35.9% identity
in 192 aa overlap). Also similar to Rv1703c, Rv1220c from
Mycobacterium tuberculosis."
/codon_start=1
/transl_table=11
/product="O-methyltransferase"
/protein_id="NP_214701.1"
/db_xref="GI:15607328"
/db_xref="GeneID:886779"
gene 219486..219917
/locus_tag="Rv0188"
/db_xref="GeneID:886776"
CDS 219486..219917
/locus_tag="Rv0188"
/function="UNKNOWN"
/experiment="experimental evidence, no additional details
recorded"
/codon_start=1
/transl_table=11
/product="transmembrane protein"
/protein_id="NP_214702.1"
/db_xref="GI:15607329"
I would probably consider using some sort of simple wrapper function to collapse the multi-line fields into a single line. Something like:
function l = readlongline( fh )
quotesSeen = 0;
done = false;
l = '';
while ~done
tline = fgetl( fh );
if ~ischar( tline )
% Hit EOF
l = tline;
return
end
quotesSeen = quotesSeen + length( strfind( tline, '"' ) );
% Break if we've seen 0 or 2 quotes
done = any( quotesSeen == [0 2] );
l = [l, tline];
end
end
This is intended to be a replacement for fgetl.
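The same quote-counting idea can be tried without a file handle. The sketch below applies it to a cell array of lines and then runs the two expressions from the question on a collapsed line. The raw lines are a hypothetical excerpt, and trimming each piece and joining with a single space is a choice made here, not part of readlongline.

```matlab
% Hypothetical stand-in for lines read from the file.
raw = {'   /function="UNKNOWN"', ...
       '   /experiment="experimental evidence, no additional details', ...
       '   recorded"', ...
       '   /codon_start=1'};

merged = {};       % collapsed logical lines
buffer = '';       % pieces of the field currently being assembled
quotesSeen = 0;    % quotes seen so far in the current field
for k = 1:numel(raw)
    piece = strtrim(raw{k});
    if isempty(buffer)
        buffer = piece;
    else
        buffer = [buffer, ' ', piece];
    end
    quotesSeen = quotesSeen + numel(strfind(piece, '"'));
    % A field is complete once we have seen 0 or 2 quotes.
    if any(quotesSeen == [0 2])
        merged{end+1} = buffer;
        buffer = '';
        quotesSeen = 0;
    end
end

% The question's two expressions now see the whole multi-line value.
name = regexp(merged{2}, '/(.+)=', 'tokens', 'once');
value = regexp(merged{2}, '="?([^"]+)"?$', 'tokens', 'once');
```

Like readlongline, this assumes a quoted value contains no further quote characters inside it.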