Possible duplicate: Using regexp to find a word
I'm working on an assignment for my CS course.
We're given a plain text file, which, in my case, contains a series of tweets.
What I need to do is create a script that will detect hashtags, and then save each hashtag into a cell array.
So far I know how to write a function that detects the '#' symbol...
strfind(textRead{i},'#');
This runs inside a for loop with i=1:30 (that is, the number of cells of text). Past that, however, I'm at a loss as to how I should write a script that will detect the '#' and return the text between it and the next ' ' (space) character.
Try this:
str = '#someHashtag other tweet text ignore #random';
regexp(str, '#[A-Za-z]+', 'match') % [A-z] would also match [ \ ] ^ _ and the backtick
I think you'll be able to find the rest out yourself :)
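As a minimal sketch of "the rest", the same regexp call can be applied to every cell and the matches collected in one cell array. The variable names tweets and hashtags and the sample text are assumptions for illustration, and \w+ is used here so that digits and underscores inside a hashtag are kept.

```matlab
% Hypothetical sample data standing in for the cells read from the file
tweets = {'#someHashtag other tweet text', ...
          'no tags here', ...
          'two #tags in #one tweet'};

hashtags = {};                                   % collected matches
for i = 1:numel(tweets)
    found = regexp(tweets{i}, '#\w+', 'match');  % cell array, possibly empty
    hashtags = [hashtags, found];                % append this tweet's hashtags
end
disp(hashtags)
```

Growing the cell array inside the loop is fine for 30 tweets; for very large inputs it may be worth collecting per-tweet results first and concatenating once.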
Here is a basic skeleton. Make sure to use the correct regexp to extract the values ;-)
With Dorin's regexp above and 'match', you get the matched text itself. You can also capture tokens, as in this example from MathWorks.
Sample:
str = ['if <code>A </code> == x<sup>2 </sup>, ' ...
'<em>disp(x) </em>']
str = if <code>A </code> == x<sup>2 </sup>, <em>disp(x) </em>
expr = '<(\w+).*?>.*?</\1>';
[tok mat] = regexp(str, expr, 'tokens', 'match');
tok{:}
ans = 'code'
ans = 'sup'
ans = 'em'
In the code above you don't really need a loop: you can process the entire text in bulk as one string, hopefully without hitting any string length limit.
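A sketch of the bulk approach applied to hashtags, assuming the whole file has already been read into one char vector (here a hypothetical allText built with sprintf rather than read from disk). The capture group returns the tag without the leading '#'.

```matlab
% Hypothetical text standing in for the full contents of the tweet file
allText = sprintf('first #alpha tweet\nsecond tweet #beta #gamma');

% 'tokens' returns one 1x1 cell per match, holding the captured group
tokens = regexp(allText, '#(\w+)', 'tokens');

% Unwrap the nested cells into a flat cell array of char vectors
names = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);
disp(names)
```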
But if you want or need to loop, use the following sample with Rody's regexp and 'match' only.
fid = fopen('data.txt');
Cellarray = {};
while ~feof(fid)
    dataText = fgetl(fid);                        % read one line per iteration
    X = regexp(dataText, '#[A-Za-z]+', 'match');  % hashtags on this line
    Cellarray = [Cellarray, X];                   % append to the result
end
disp(Cellarray)
fclose(fid);
I am trying to concatenate three lines (I want to leave the lines as is; 3 rows) from Shakespeare.txt file that shows:
To be,
or not to be:
that is the question.
My code right now is
fid = fopen('Shakespeare.txt')
while ~feof(fid)
a = fgets(fid);
b = fgets(fid);
c = fgets(fid);
end
fprintf('%s', strcat(a, b, c))
I'm supposed to use strcat, and again, I want the lines concatenated but still printed as three rows.
One method of keeping the rows separate is to store the lines of the text file in a string array. Here a 1 by 3 string array is used. It may also be a good idea to use fgetl(), which reads one line of the text file at a time (without the trailing newline). Converting the outputs of fgetl() to strings ensures they are not stored as character (char) arrays. The \n in the format specifier adds a line break after each string in String_Array when printing.
fid = fopen('Shakespeare.txt');
while ~feof(fid)
String_Array(1) = string(fgetl(fid));
String_Array(2) = string(fgetl(fid));
String_Array(3) = string(fgetl(fid));
end
fprintf('%s\n', String_Array);
Ran using MATLAB R2019b
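Since the question asks for strcat specifically, here is a hedged sketch of one way to do that. For char inputs strcat strips trailing whitespace, including newlines, which is why lines read with fgets collapse onto one row; for cell array inputs it leaves whitespace alone. The three lines are hard-coded here instead of being read from Shakespeare.txt, and the names nl and joined are assumptions.

```matlab
% The three lines as fgetl would return them (no trailing newline)
a = 'To be,';
b = 'or not to be:';
c = 'that is the question.';

nl = sprintf('\n');                          % newline character

% Wrapping every input in a cell keeps strcat from trimming the newlines
joined = strcat({a}, {nl}, {b}, {nl}, {c});

fprintf('%s\n', joined{1});                  % prints three rows
```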
This question is about MATLAB:
I'm running a loop, and each iteration produces a new set of data that I want to save in a new file each time. I also overwrite old files by changing the name. It looks like this:
name_each_iter = strrep(some_source,'.string.mat','string_new.(j).mat')
and what I'm struggling with here is the iteration, so that I obtain files:
...string_new.1.mat
...string_new.2.mat
etc.
I tried various combinations of () [] {} as well as 'string_new.'j'.mat' (which gave a syntax error).
How can it be done?
Strings are just vectors of characters. So if you want to iteratively create filenames here's an example of how you would do it:
for j = 1:10,
filename = ['string_new.' num2str(j) '.mat'];
disp(filename)
end
The above code will create the following output:
string_new.1.mat
string_new.2.mat
string_new.3.mat
string_new.4.mat
string_new.5.mat
string_new.6.mat
string_new.7.mat
string_new.8.mat
string_new.9.mat
string_new.10.mat
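To tie this back to the strrep call in the question, a minimal sketch: build the replacement text with sprintf and pass it to strrep. The value of some_source is a hypothetical example, and the leading dot in the replacement is an assumption about the intended file names.

```matlab
some_source = 'data.string.mat';   % hypothetical source file name

names = cell(1, 3);
for j = 1:3
    % Build the per-iteration suffix first, then substitute it
    suffix = sprintf('.string_new.%d.mat', j);
    names{j} = strrep(some_source, '.string.mat', suffix);
end
disp(names)
```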
You could also generate all file names in advance using NUM2STR:
>> filenames = cellstr(num2str((1:10)','string_new.%02d.mat'))
filenames =
'string_new.01.mat'
'string_new.02.mat'
'string_new.03.mat'
'string_new.04.mat'
'string_new.05.mat'
'string_new.06.mat'
'string_new.07.mat'
'string_new.08.mat'
'string_new.09.mat'
'string_new.10.mat'
Now access the cell array contents as filenames{i} in each iteration
sprintf is very useful for this:
for ii=5:12
filename = sprintf('data_%02d.mat',ii)
end
This assigns the following strings to filename:
data_05.mat
data_06.mat
data_07.mat
data_08.mat
data_09.mat
data_10.mat
data_11.mat
data_12.mat
Notice the zero padding. sprintf in general is useful if you want parameterized formatted strings.
For creating a name based on an already existing file, you can use regexp to detect the '_new.(number).mat' and change the string depending on what regexp finds:
original_filename = 'data.string.mat';
im = regexp(original_filename,'_new\.\d+\.mat'); % dots escaped so they match a literal '.'
if isempty(im) % original file, no _new.(j) detected
newname = [original_filename(1:end-4) '_new.1.mat'];
else
num = str2double(original_filename(im(end)+5:end-4));
newname = sprintf('%s_new.%d.mat',original_filename(1:im(end)-1),num+1);
end
This does exactly that, and produces:
data.string_new.1.mat
data.string_new.2.mat
data.string_new.3.mat
...
data.string_new.9.mat
data.string_new.10.mat
data.string_new.11.mat
when iterating the above function, starting with 'data.string.mat'
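The same renaming step can also be sketched with capture groups, which avoids the index arithmetic. This is an alternative under the same naming assumption ('_new.<number>.mat' at the end of the name), not the code from the answer above; the variable seen only records the names for display.

```matlab
name = 'data.string.mat';
seen = cell(1, 3);
for k = 1:3
    % Capture the base name and the current counter, if present
    tok = regexp(name, '^(.*)_new\.(\d+)\.mat$', 'tokens', 'once');
    if isempty(tok)
        % Original file: strip '.mat' and start counting at 1
        name = [name(1:end-4) '_new.1.mat'];
    else
        % Already numbered: increment the captured counter
        name = sprintf('%s_new.%d.mat', tok{1}, str2double(tok{2}) + 1);
    end
    seen{k} = name;
end
disp(seen)
```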
Loading a well formatted and delimited text file in Matlab is relatively simple, but I struggle with a text file that I have to read in. Sadly I can not change the structure of the source file, so I have to deal with what I have.
The basic file structure is:
123 180 (two integers, white space delimited)
1.5674e-8
.
.
(floating point numbers in column 1, column 2 empty)
.
.
100 4501 (another two integers)
5.3456e-4 (followed by even more floating point numbers)
.
.
.
.
45 String (An integer in column 1, string in column 2)
.
.
.
A simple
[data1,data2]=textread('filename.txt','%f %s', ...
'emptyvalue', NaN)
Does not work.
How can I properly filter the input data? All examples I found online and in the Matlab help so far deal with well structured data, so I am a bit lost at where to start.
As I have to read a whole bunch of those files (>100), I'd rather not iterate through every single line in every file. I hope there is a much faster approach.
EDIT:
I made a sample file available here: test.txt (google drive)
I've looked at the text file you supplied and tried to draw a few general conclusions -
When there are two integers on a line, the second integer corresponds to the number of rows following.
You always have (two integers (A, B) followed by "B" floats), repeated twice.
After that you have some free-form text (or at least, I couldn't deduce anything useful about the format after that).
This is a messy format so I doubt there are going to be any nice solutions. Some useful general principles are:
Use fgetl when you need to read a single line (it reads up to the next newline character)
Use textscan when it's possible to read multiple lines at once - it is much faster than reading a line at a time. It has many options for how to parse, which it is worth getting to know (I recommend typing doc textscan and reading the entire thing).
If in doubt, just read the lines in as strings and then analyse them in MATLAB.
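As a sketch of the last principle, lines that have already been read in as char vectors can be classified after the fact. The three sample lines mirror the structure described in the question; the labels 'header', 'float' and 'text' and the variable names are assumptions for illustration.

```matlab
% Hypothetical lines, one of each kind from the file description
sampleLines = {'123 180', '1.5674e-8', '45 String'};

kinds = cell(size(sampleLines));
for i = 1:numel(sampleLines)
    parts = strsplit(strtrim(sampleLines{i}));   % split on whitespace
    vals  = str2double(parts);                   % NaN where not numeric
    if numel(vals) == 2 && ~any(isnan(vals))
        kinds{i} = 'header';                     % two numbers on the line
    elseif numel(vals) == 1 && ~isnan(vals)
        kinds{i} = 'float';                      % a single number
    else
        kinds{i} = 'text';                       % anything else
    end
end
disp(kinds)
```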
With that in mind, here is a simple parser for your files. It will probably need some modifications as you are able to infer more about the structure of the files, but it is reasonably fast on the ~700 line test file you gave.
I've just given the variables dummy names like "a", "b", "floats" etc. You should change them to something more specific to your needs.
function output = readTestFile(filename)
fid = fopen(filename, 'r');
% Read the first line
line = '';
while isempty(line)
line = fgetl(fid);
end
nums = textscan(line, '%d %d', 'CollectOutput', 1);
a = nums{1}(1);
b = nums{1}(2);
% Read 'b' of the next lines:
contents = textscan(fid, '%f', b);
floats1 = contents{1};
% Read the next line:
line = '';
while isempty(line)
line = fgetl(fid);
end
nums = textscan(line, '%d %d', 'CollectOutput', 1);
c = nums{1}(1);
d = nums{1}(2);
% Read 'd' of the next lines:
contents = textscan(fid, '%f', d);
floats2 = contents{1};
% Read the rest:
rest = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid); % release the file handle once everything is read
output.a = a;
output.b = b;
output.c = c;
output.d = d;
output.floats1 = floats1;
output.floats2 = floats2;
output.rest = rest{1};
end
You can read in the file line by line using the lower-level functions, then parse each line manually.
You open the file handle like in C
fid = fopen(filename);
Then you can read a line using fgetl
line = fgetl(fid);
Tokenizing the line on spaces is probably the best first pass, storing each piece in a cell array (because a matrix doesn't support ragged arrays)
remainder = line;                 % text still to be tokenized
colnum = 1;
while ~isempty(remainder)
    [token, remainder] = strtok(remainder, ' ');
    if ~isempty(token)            % skip trailing whitespace
        entries{linenum, colnum} = token;
        colnum = colnum + 1;
    end
end
then you can wrap all of that inside another while loop to iterate over the lines
linenum = 1;
while ~feof(fid)
% fgetl, strtok, index bookkeeping as above
end
It's up to you whether it's best to parse the file as you read it or read it into a cell array first and then go over it afterwards.
Your cell entries are all going to be strings (char arrays), so you will need to convert them to numbers. str2num does a good job of working out the format, so that might be all you need; note that it evaluates the text as MATLAB code, so str2double is the safer choice for single values.
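A small sketch of the conversion step using str2double, which accepts a cell array of char vectors and returns NaN for entries that are not numbers. The sample entries and variable names are assumptions.

```matlab
% Hypothetical tokens as they would come out of strtok
entries = {'123', '1.5674e-8', 'String'};

nums = str2double(entries);   % numeric where possible, NaN otherwise
isNumber = ~isnan(nums);      % mask of entries that converted cleanly

disp(nums)
disp(isNumber)
```

The NaN mask makes it easy to tell the free-form text lines apart from the numeric ones afterwards.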
I have file names stored as follows:
>> allFiles.name
ans =
k-120_knt-500_threshold-0.3_percent-34.57.csv
ans =
k-216_knt-22625_threshold-0.3_percent-33.33.csv
I wish to extract the 4 values from them and store in a cell.
data={};
for k =1:numel(allFiles)
data{k,1}=csvread(allFiles(k).name,1,0);
data{k,2}= %kvalue
data{k,3}= %kntvalue
data{k,4}=%threshold
data{k,5}=%percent
...
end
There's probably a regular expression that can be used to do this, but a simple piece of code would be
data=cell(numel(allFiles),5); % preallocate an N-by-5 cell array
for k =1:numel(allFiles)
data{k,1}=csvread(allFiles(k).name,1,0);
[~,name] = fileparts(allFiles(k).name);
dashIdx = strfind(name,'-'); % find location of dashes
usIdx = strfind(name,'_'); % find location of underscores
data{k,2}= str2double(name(dashIdx(1)+1:usIdx(1)-1)); %kvalue
data{k,3}= str2double(name(dashIdx(2)+1:usIdx(2)-1)); %kntvalue
data{k,4}= str2double(name(dashIdx(3)+1:usIdx(3)-1)); %threshold
data{k,5}= str2double(name(dashIdx(4)+1:end)); %percent
...
end
For efficiency, you might consider using a single matrix to store all the numeric data, and/or a structure (so that you can access the data by name rather than index).
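For completeness, a sketch of the regular-expression route mentioned at the start of this answer, storing the result in a structure so the values can be accessed by name. The pattern assumes the file names always follow the k-…_knt-…_threshold-…_percent-….csv layout shown in the question; the struct name info is an assumption.

```matlab
name = 'k-216_knt-22625_threshold-0.3_percent-33.33.csv';

% One capture group per value; 'once' returns a flat 1x4 cell
pattern = '^k-(\d+)_knt-(\d+)_threshold-([\d.]+)_percent-([\d.]+)\.csv$';
tok = regexp(name, pattern, 'tokens', 'once');

vals = str2double(tok);       % convert all four captures at once
info = struct('k', vals(1), 'knt', vals(2), ...
              'threshold', vals(3), 'percent', vals(4));
disp(info)
```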
You simply need to tokenize using strtok multiple times (there is more than one way to solve this). Someone has a handy MATLAB script somewhere on the web to tokenize strings into a cell array.
(1) Starting with:
filename = 'k-216_knt-22625_threshold-0.3_percent-33.33.csv'
Use strfind to prune out the extension
r = strfind(filename, '.csv')
filenameWithoutExtension = filename(1:r-1)
This leaves us with:
'k-216_knt-22625_threshold-0.3_percent-33.33'
(2) Then tokenize this:
'k-216_knt-22625_threshold-0.3_percent-33.33'
using '_' . You get the tokens:
'k-216'
'knt-22625'
'threshold-0.3'
'percent-33.33'
(3) Lastly, for each string, tokenize using '-'. Each second string will be:
'216'
'22625'
'0.3'
'33.33'
And use str2num to convert.
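Steps (1) to (3) can be sketched as follows, using fileparts and strsplit in place of repeated strtok calls and str2double for the conversion. This assumes none of the values is negative, since a minus sign would also be split on '-'.

```matlab
filename = 'k-216_knt-22625_threshold-0.3_percent-33.33.csv';

[~, base] = fileparts(filename);        % (1) drop the '.csv' extension
pairs = strsplit(base, '_');            % (2) 'k-216', 'knt-22625', ...

values = zeros(1, numel(pairs));
for i = 1:numel(pairs)
    kv = strsplit(pairs{i}, '-');       % (3) split name from value
    values(i) = str2double(kv{2});      % second piece is the number
end
disp(values)
```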
Strategy: strsplit() + str2num().
data={};
for k =1:numel(allFiles)
data{k,1}=csvread(allFiles(k).name,1,0);
words = strsplit( allFiles(k).name(1:(end-4)), '_' );
data{k,2} = str2num(words{1}(3:end));  % skip 'k-'
data{k,3} = str2num(words{2}(5:end));  % skip 'knt-'
data{k,4} = str2num(words{3}(11:end)); % skip 'threshold-'
data{k,5} = str2num(words{4}(9:end));  % skip 'percent-'
end
eliminate punctuation
split the words at newlines and spaces, then store them in an array
check the text file for errors using the function in the checkSpelling.m file
sum up the total number of errors in that article
'no suggestion' is assumed to be no error, then return -1
sum of errors>20, return 1
sum of errors<=20, return -1
I would like to check a paragraph for spelling errors, but I'm having trouble getting rid of the punctuation. There may be another cause as well; it returns the error below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(@isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W matches every character that is not a letter, digit, or underscore (including spaces), and each match is replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see, you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. At every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
UPDATE
The only problem I see is that this removes the quote ' character (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or you can list all the non-alphanumeric characters to be removed in square brackets instead of using \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
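Putting the pieces of this answer together, a minimal sketch of the cleaning step on a hypothetical sample text (CharData is built with sprintf here instead of being read with fread). The spell check itself is left out, since it needs Word via ActiveX.

```matlab
% Hypothetical text standing in for the contents of the data file
CharData = sprintf('I''m here, don''t go!\nNext line.');

% Replace everything except letters, digits, apostrophes and
% underscores with a newline
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');

% Split on newlines and drop the empty cells left by adjacent separators
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];

disp(word)
```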