I have this structure in a text file named my_file.txt.
# Codelength = 3.74556 bits.
1:1:1:1 0.000218593 "v12978"
1:1:1:2 0.000153576 "v1666"
1:1:1:3 0.000149092 "v45"
1:1:1:4 0.000100329 "v4618"
1:1:1:5 5.1005e-005 "v5593"
1:1:1:6 3.53112e-005 "v10214"
1:1:1:7 3.36297e-005 "v10389"
1:1:1:8 2.85852e-005 "v2273"
1:1:1:9 2.63433e-005 "v13253"
1:1:1:10 2.41013e-005 "v10109"
1:1:1:11 2.01778e-005 "v9204"
1:1:1:12 1.73753e-005 "v16508"
1:1:1:13 1.34519e-005 "v335"
This is a small part of the text file. The full file has more than 600,000 lines. I want an array with these properties:
First column : 1 1 1 1 1 1 1 ... (left values in txt file)
Second column : 1 1 1 1 1 1 1 ...
Third column : 1 1 1 1 1 1 1 ...
Fourth column : 1 2 3 4 5 6 ...
Fifth column : 0.000218593 0.000153576 0.000149092 0.000100329 ....
and a string array containing the rightmost items of each line ("v12978", "v1666" ...). How can I do this in MATLAB?
Suppose that textfile.txt is your data file, then
fid = fopen('textfile.txt', 'r');
oC = onCleanup(@() any(fopen('all')==fid) && fclose(fid) );
data = textscan(fid,...
'%d:%d:%d:%d %f %q',...
'Headerlines', 1);
fclose(fid);
will give
data =
[13x1 int32] [13x1 int32] [13x1 int32] [13x1 int32] [13x1 double] {13x1 cell}
That already fits your description of the desired output format.
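Now, you could go on and concatenate the numbers into a single array. Take care here: when MATLAB concatenates integer and double arrays, the result takes the integer class, so the fifth column would be rounded to int32 unless you convert everything to double first:
numbers = cellfun(@double, data(1:end-1), 'UniformOutput', false);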
numbers = [numbers{:}];
but well, that all depends on your specific use case.
You might want to split the reading/processing up in chunks of say, 10,000 lines, because reading 600k lines all at once can eat away your RAM. Read the documentation on textscan how to do this.
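A minimal sketch of the chunked reading mentioned above, assuming the same format string as in the answer. The sample file, the variable names (fname, chunkSize, nRows, names) and the tiny chunk size are hypothetical and only there so the snippet runs on its own; for the real file you would open your own path and use something like 10000.

```matlab
% Write a tiny sample file so the sketch runs on its own (hypothetical data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '# Codelength = 3.74556 bits.\n');
fprintf(fid, '1:1:1:1 0.000218593 "v12978"\n');
fprintf(fid, '1:1:1:2 0.000153576 "v1666"\n');
fprintf(fid, '1:1:1:3 0.000149092 "v45"\n');
fclose(fid);

chunkSize = 2;                      % use e.g. 10000 for the real file
fmt = '%d:%d:%d:%d %f %q';
fid = fopen(fname, 'r');
fgetl(fid);                         % skip the header line once
nRows = 0;
names = {};
while ~feof(fid)
    % Passing a count as third argument applies the format at most chunkSize times.
    chunk = textscan(fid, fmt, chunkSize);
    if isempty(chunk{1})            % nothing parsed: stop instead of looping forever
        break;
    end
    nRows = nRows + numel(chunk{1});
    names = [names; chunk{6}];      %#ok<AGROW> replace with your per-chunk processing
end
fclose(fid);
delete(fname);
```

Whether chunking is actually needed depends on your machine; 600k short lines may well fit in memory in one go.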
I am struggling with a text file that I have to read in. In this file, there are two types of line:
133 0102764447 44 11 54 0.4 0 0.89 0 0 8 0 0 7 Attribute_Name='xyz' Type='string' 02452387764447 884
134 0102256447 44 1 57 0.4 0 0.81 0 0 8 0 0 1 864
What I want to do here is to textscan all the lines and then try to determine the number of 'xyz' (and the total number of lines).
I tried to use:
fileID = fopen('test.txt','r') ;
data=textscan(fileID, '%d %d %d %d %d %d %d %d %d %d %d %d %d %s %s %d %d','\n') ;
And then I would access data{i,16} to count how many entries are equal to Attribute_Name='xyz', but that doesn't seem efficient.
What would be a proper way to read the data? (What interests me is counting how many Attribute_Name='xyz' entries I have.) Thanks
You could simply use the count function.
In your case you could use it in this way:
filetext = fileread("test.txt");
A = count(filetext , "xyz")
fileread will read the whole text file into a single string. Afterwards you can process that string using count which will return the occurrences from the given pattern.
An alternative for older versions of MATLAB is strfind. It will work with R2006a and above; note the single-quoted character vectors, since double-quoted strings are only available from R2017a on.
filetext = fileread('test.txt');
A = length(strfind(filetext, 'xyz'));
strfind returns an array of start indices, one per occurrence of the specified string, so the length of that array is the number of occurrences.
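The question also asks for the total number of lines. A small sketch of how both numbers could be obtained with strfind alone; the sample text is hypothetical and stands in for the result of fileread('test.txt'), and the variable names nXyz and nLines are made up for illustration.

```matlab
% Hypothetical sample text standing in for fileread('test.txt').
filetext = sprintf('133 44 Attribute_Name=''xyz'' Type=''string'' 884\n134 44 864\n');

% Occurrences of the full attribute, which is stricter than searching for 'xyz'.
nXyz = numel(strfind(filetext, 'Attribute_Name=''xyz'''));

% Count newline characters; add one if the last line is not terminated.
nLines = numel(strfind(filetext, sprintf('\n')));
if ~isempty(filetext) && filetext(end) ~= sprintf('\n')
    nLines = nLines + 1;
end
```

Searching for the whole Attribute_Name='xyz' token avoids counting an unrelated xyz elsewhere in a line.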
There is the option of strsplit. You may do something like the following:
count = 0;
fid = fopen('test.txt','r');
while ~feof(fid)
    line = fgetl(fid);
    if ~ischar(line), break; end   % fgetl returns -1 at end of file
    words = strsplit(line);        % split the line at whitespace
    % Count each line at most once, even if the attribute appears several times
    if any(strcmpi(words, 'Attribute_Name=''xyz'''))
        count = count + 1;
    end
end
fclose(fid);
So at the end count will give you the number.
I have a table in MATLAB with attributes in the first three columns and data from the fourth column onwards. I was trying to sort the entire table based on the first three columns. However, one of the columns (Column C) contains months ('January', 'February' ...etc). The sortrows function would only let me choose 'ascend' or 'descend' but not a custom option to sort by month. Any help would be greatly appreciated. Below is the code I used.
sortrows(Table, {'Column A','Column B','Column C'} , {'ascend' , 'ascend' , '???' } )
As @AnonSubmitter85 suggested, the best thing you can do is to convert your month names to numeric values from 1 (January) to 12 (December) as follows:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t.ColumnC = month(datenum(t.ColumnC,'mmmm'));
With numeric months, ColumnC can be sorted with a standard criterion like the other columns (in this example, ascending):
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
If, for any reason that is unknown to us, you are forced to keep your months as literals, you can use a workaround that consists in sorting a clone of the table using the approach described above, and then applying to it the resulting indices:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t_original = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t_clone = t_original;
t_clone.ColumnC = month(datenum(t_clone.ColumnC,'mmmm'));
[~,idx] = sortrows(t_clone,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
t_original = t_original(idx,:);
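A different technique from the one above, shown only as a sketch: if your MATLAB version has categorical arrays, an ordinal categorical with the months listed in calendar order keeps the month names as text and still sorts them by month, because sortrows orders categorical values by category order. The monthNames variable is introduced here for illustration.

```matlab
c = {
    7 1 'February';
    1 0 'April';
    2 1 'December';
    2 1 'January';
    5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});

% Category order defines the sort order (January first, December last).
monthNames = {'January','February','March','April','May','June', ...
              'July','August','September','October','November','December'};
t.ColumnC = categorical(t.ColumnC, monthNames, 'Ordinal', true);

t = sortrows(t, {'ColumnA' 'ColumnB' 'ColumnC'});
```

In the two rows with ColumnA = 2 and ColumnB = 1, January ends up before December, which an alphabetical sort would not give.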
If we have
A=[100 -0.1 0];
B=[30 0.2 -2]; t1='text 1'; t2='text 2';
how can I use fprintf so that the output saved in a file looks like this?
100 -1.000E-0001 0.000E-0000 'text 1'
30 2.000E-0001 -2.000E-0000 'text 2'
I put together a "one-liner" (spread across several lines for better readability) that takes an array, a single number format, and a delimiter and returns the desired string. And while you found the leading blank-space flag, I prefer the + flag, though the function will work with both:
A=[-0.1 0];
B=[0.2 -2];
minLenExp = 4;
extsprintf = @(num,fmt,delim) ...
strjoin(cellfun(...
@(toks)[toks{1},repmat('0',1,max([0,minLenExp-length(toks{2})])),toks{2}],...
regexp(sprintf(fmt,num),'([+-\s][\.\d]+[eE][+-])(\d+)','tokens'),...
'UniformOutput',false),delim);
Astr = extsprintf(A,'%+.4E',' ');
Bstr = extsprintf(B,'%+.4E',' ');
disp([Astr;Bstr]);
Running this yields:
>> foo
-1.0000E-0001 +0.0000E+0000
+2.0000E-0001 -2.0000E+0000
(foo is just what the script file is called.)
Here's a more general approach that searches for the exponential format instead of assuming it:
A=[100 -0.1 0].';
B=[30 0.2 -2];
extsprintf = @(fmt,arr) ...
regexprep(...
sprintf(fmt,arr),...
regexprep(regexp(sprintf(fmt,arr),'([+-\s][\.\d]+[eE][+-]\d+)','match'),'([\+\.])','\\$1'),...
cellfun(@(match)...
cellfun(...
@(toks)[toks{1},repmat('0',1,max([0,minLenExp-length(toks{2})])),toks{2}],...
regexp(match,'([+-\s][\.\d]+[eE][+-])(\d+)','tokens'),...
'UniformOutput',false),...
regexp(sprintf(fmt,arr),'([+-\s][\.\d]+[eE][+-]\d+)','match')));
fmt = '%3d %+.4E %+.4e';
disp(extsprintf(fmt,A));
disp(extsprintf(fmt,B));
Outputs
>> foo
100 -1.0000E-0001 +0.0000e+0000
30 +2.0000E-0001 -2.0000e+0000
I am trying to compute and plot the distribution of bigram frequencies.
First I generate all possible bigrams, which gives 1296 bigrams.
Then I extract the bigrams from a given file and save them in words1.
My question is: how do I compute the frequency of these 1296 bigrams for the file a.txt?
If some bigrams do not appear in the file at all, their frequencies should be zero.
a.txt can be any text file.
clear
clc
%************create bigrams 1296 ***************************************
chars ='1234567890abcdefghijklmonpqrstuvwxyz';
chars1 ='1234567890abcdefghijklmonpqrstuvwxyz';
bigram='';
for i=1:36
for j=1:36
bigram = sprintf('%s%s%s',bigram,chars(i),chars1(j));
end
end
temp1 = regexp(bigram, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp1(1:end-1)', temp1(2:end)','un',0);
bigrams = temp2;
bigrams = unique(bigrams);
bigrams = rot90(bigrams);
bigram = char(bigrams(1:end));
all_bigrams_len = length(bigrams);
clear temp temp1 temp2 i j chars1 chars;
%****** 1. Cleaning Data ******************************
collection = fileread('e:\a.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W','');
collection = strtrim(regexprep(collection,'\s*',''));
%*******************************************************
temp = regexp(collection, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp(1:end-1)', temp(2:end)','un',0);
words1 = rot90(temp2);
%*******************************************************
words1_len = length(words1);
vocab1 = unique(words1);
vocab_len1 = length(vocab1);
[vocab1,void1,index1] = unique(words1);
frequencies1 = hist(index1,vocab_len1);
I. Character counting problem for a string
bsxfun based solution for counting characters -
counts = sum(bsxfun(@eq,[string1-0]',65:90))
Output -
counts =
2 0 0 0 0 2 0 1 0 0 ....
If you would like to get a tabulate output of counts against each letter -
out = [cellstr(['A':'Z']') num2cell(counts)']
Output -
out =
'A' [2]
'B' [0]
'C' [0]
'D' [0]
'E' [0]
'F' [2]
'G' [0]
'H' [1]
'I' [0]
....
Please note that this was a case-sensitive counting for upper-case letters.
For a lower-case letter counting, use this edit to this earlier code -
counts = sum(bsxfun(@eq,[string1-0]',97:122))
For a case insensitive counting, use this -
counts = sum(bsxfun(@eq,[upper(string1)-0]',65:90))
II. Bigram counting case
Let us suppose that you have all the possible bigrams saved in a 1D cell array bigrams1 and the incoming bigrams from the file are saved into another cell array words1. Let us also assume certain values in them for demonstration -
bigrams1 = {
'ar';
'de';
'c3';
'd1';
'ry';
't1';
'p1'}
words1 = {
'de';
'c3';
'd1';
'r9';
'yy';
'de';
'ry';
'de';
'dd';
'd1'}
Now, you can get the counts of the bigrams from words1 that are present in bigrams1 with this code -
[~,~,ind] = unique(vertcat(bigrams1,words1));
bigrams_lb = ind(1:numel(bigrams1)); %// label bigrams1
words1_lb = ind(numel(bigrams1)+1:end); %// label words1
counts = sum(bsxfun(@eq,bigrams_lb,words1_lb'),2)
out = [bigrams1 num2cell(counts)]
The output on code run is -
out =
'ar' [0]
'de' [3]
'c3' [1]
'd1' [2]
'ry' [1]
't1' [0]
'p1' [0]
The result shows that the first element ar from the list of all possible bigrams does not occur in words1, the second element de occurs three times in words1, and so on.
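The bsxfun comparison builds a matrix with one entry per pair of possible and observed bigrams, which can get large for a long file. As a sketch of an alternative technique (not the approach used above), ismember combined with accumarray should give the same counts without that matrix. It reuses the demonstration values from above.

```matlab
bigrams1 = {'ar'; 'de'; 'c3'; 'd1'; 'ry'; 't1'; 'p1'};
words1   = {'de'; 'c3'; 'd1'; 'r9'; 'yy'; 'de'; 'ry'; 'de'; 'dd'; 'd1'};

% loc(k) is the row of bigrams1 that matches words1{k}, or 0 if there is none.
[tf, loc] = ismember(words1, bigrams1);

% Add 1 for every matched word; bigrams that never occur stay at 0.
counts = accumarray(loc(tf), 1, [numel(bigrams1) 1]);

out = [bigrams1 num2cell(counts)];
```

Words such as r9 or yy that are not in bigrams1 are simply ignored.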
Similar to Dennis' solution, you can just use histc():
string1 = 'ASHRAFF'
histc(string1,'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
This counts the entries that fall into the bins defined by the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', which is hopefully the alphabet (I typed it quickly, so no guarantee). The result is:
Columns 1 through 21
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
Columns 22 through 26
0 0 0 0 0
Just a little modification of my solution:
string1 = 'ASHRAFF'
alphabet1='A':'Z'; %%// as stated by Oleg Komarov
data=histc(string1,alphabet1);
results=cell(2,26);
for k=1:26
results{1,k}= alphabet1(k);
results{2,k}= data(k);
end
If you look at results now you can easily check whether it works or not :D
This answer creates all bigrams, loads the file, does a little cleanup, and then uses a combination of unique and histc to count the rows.
Generate all Bigrams
note the order here is important as unique will sort the array so this way it is created presorted so the output matches expectation;
[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];
Read The File
this removes capitalisation and just pulls out any 0-9 or a-z character then creates a column vector of these
fileText = lower(fileread('d:\loremipsum.txt'));
cleanText = regexp(fileText,'([a-z0-9])','tokens');
cleanText = cell2mat(vertcat(cleanText{:}));
create bigrams from file by shifting by one and concatenating
fileBigrams = [cleanText(1:end-1),cleanText(2:end)];
Get Counts
the set of all bigrams is added to our set (so the values are created for all possible). Then a value ∈{1,2,...,1296} is assigned to each unique row using unique's 3rd output. Counts are then created with histc with the bins equal to the set of values from unique's output, 1 is subtracted from each bin to remove the complete set bigrams we added
[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = histc(c,1:1296)-1;
Display
to view counts against text
[allBigrams, counts+'0']
or for something potentially more useful...
[sortedCounts,sortInd] = sort(counts,'descend');
[allBigrams(sortInd,:), sortedCounts+'0']
ans =
or9
at8
re8
in7
ol7
te7
do6 ...
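Since histc is no longer recommended in current MATLAB releases, here is a sketch of the same counting step with accumarray instead. The short cleanText value is a hypothetical stand-in for the cleaned file contents, and bigramIsAB is only introduced to look up one row.

```matlab
% All 1296 bigrams, created presorted as in the answer above.
[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];

% Hypothetical cleaned text as a column vector of characters.
cleanText = ('abab1').';
fileBigrams = [cleanText(1:end-1),cleanText(2:end)];

% Label every row, then count labels; subtract the one occurrence per
% bigram that comes from appending allBigrams.
[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = accumarray(c(:), 1, [size(allBigrams,1) 1]) - 1;

% Look up the count of a single bigram, here 'ab'.
bigramIsAB = ismember(allBigrams, 'ab', 'rows');
```

For this sample, counts(bigramIsAB) is 2, because ab occurs twice in abab1.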
Did not look into the entire code fragment, but from the example at the top of your question, I think you are looking to make a histogram:
string1 = 'ASHRAFF'
nr = histc(string1,'A':'Z')
Will give you:
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
(I got a working solution with hist, but as @The Minion shows, histc is easier to use here.)
Note that this solution only deals with upper case letters.
You may want to do something like so if you want to put lower case letters in their correct bin:
string1 = 'ASHRAFF'
nr = histc(upper(string1),'A':'Z')
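Or if you want upper-case and lower-case letters to be counted separately (histc needs monotonically increasing edges, so 'A':'Z' has to come before 'a':'z', and upper must not be applied):
string1 = 'ASHRaFf'
nr = histc(string1,['A':'Z' 'a':'z'])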
bi_freq1 = zeros(1,all_bigrams_len);
for k=1: vocab_len1
for i=1:all_bigrams_len
if char(vocab1(k)) == char(bigrams(i))
bi_freq1(i) = frequencies1(k);
end
end
end
I have a large number of csv files to be processed. I only want selected columns from each file; I want to load all the files from a certain folder and then output them as one combined file. Here is my code, which runs with errors. Could anyone help me solve this problem?
data_directory = 'C:\Users\...\data';
numfiles = 17;
for n = 1:numfiles
filepath = [data_directory,'data_', num2str(n),'_output.csv'];
fid = fopen (filepath, 'rt');
wanted_columns= [2 3 4 5 10 11 12 13 14 15 16 17 35 36 41 42 44 45 59 61];
format = [];
columns = 109;
for i = 1 : columns;
if any (i == wanted_columns)
format = [format '%s'];
else
format = [format '%*s'];
end
end
data = textscan(fid, format, 'Delimiter',',','HeaderLines',1);
fclose(fid);
end
I think you should check whether the file is opened correctly. The error message seems to indicate that this is not the case. If it is not, check if the filepath is correct.
fid = fopen (filepath, 'rt');
if fid == -1
error('Failed to open file');
end
If the error is thrown here, you know that there was a problem with 'fopen'.
Of course I don't know which files are on your computer, but I assume the '...' in the filename is not in your actual MATLAB file, only in your question on SO.
But could it be that you repeat the word 'data', while the actual filename contains 'data' only once? Your code will now produce filenames like 'C:\Users\...\datadata_1_output.csv'. Maybe 'data' should be removed from data_directory or from filepath = ...?
Here is another way how you can setup the format string in a vectorized manner:
fcell = repmat({'%*s '},1,n_columns);
fcell(wanted_columns) = {'%s '};
formatstr = [fcell{:}];
Notice that format is a built-in function in MATLAB, so it is better not to use it as a variable name.
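Putting the suggestions together, a sketch of the whole loop could look like the following. Everything about the files is hypothetical: the temporary folder, the two tiny CSV files with four columns, and the variable combined are only there so the snippet runs on its own. With the real data you would set data_directory, numfiles, n_columns and wanted_columns to your own values. fullfile is used so that the path separator between folder and file name cannot be forgotten.

```matlab
% Hypothetical setup: two small CSV files in a temporary folder.
data_directory = tempname;
mkdir(data_directory);
numfiles = 2;
for n = 1:numfiles
    fid = fopen(fullfile(data_directory, sprintf('data_%d_output.csv', n)), 'w');
    fprintf(fid, 'a,b,c,d\n');          % header line
    fprintf(fid, '%d,x,y,z\n', n);      % one data line
    fclose(fid);
end

% Build the format string once, outside the loop.
n_columns = 4;
wanted_columns = [2 4];
fcell = repmat({'%*s'}, 1, n_columns);  % skip every column by default
fcell(wanted_columns) = {'%s'};         % keep the wanted ones
formatstr = [fcell{:}];

combined = {};
for n = 1:numfiles
    filepath = fullfile(data_directory, sprintf('data_%d_output.csv', n));
    fid = fopen(filepath, 'rt');
    if fid == -1
        error('Failed to open file: %s', filepath);
    end
    data = textscan(fid, formatstr, 'Delimiter', ',', 'HeaderLines', 1);
    fclose(fid);
    combined = [combined; [data{:}]];   %#ok<AGROW> one row per line, one column per kept field
end
rmdir(data_directory, 's');             % remove the temporary sample files
```

The concatenation [data{:}] assumes that every kept column of a file has the same number of rows; a ragged file would need extra handling before it can be appended.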