Count the number of lines starting with a particular string - MATLAB

I would like to count the number of lines in a text file that start with a particular string.
function count = countLines(fname)
    fh = fopen(fname, 'rt');
    assert(fh ~= -1, 'Could not read: %s', fname);
    x = onCleanup(@() fclose(fh));
    count = 0;
    while ~feof(fh)
        count = count + sum( fread( fh, 16384, 'char' ) == char(10) );
    end
    count = count+1;
end
I am using the above function to count the number of lines in the whole text file. Now I want to find the number of lines that start with a particular string (e.g. all the lines starting with the letter 's').

You can try two importdata based approaches here.
Approach #1
str1 = 's'; %// search string
%// Read in text data into a cell array with each cell holding text data from
%// each line of text file
textdata = importdata(fname,'\n') ;
%// Compare each cell's starting n characters for the search string, where
%// n is no. of chars in search string and sum those matches for the final count
count = sum(cellfun(@(x) strncmp(x,str1,numel(str1)),textdata))
Approach #2
%// ..... Get str1 and textdata as shown in Approach #1
%// Convert cell array data into char array
char_textdata = char(textdata)
%// Select the first n columns from char array and compare against search
%// string for all characters match and sum those matches for the final count.
%// Here n is the number of characters in search string
count = sum(all(bsxfun(@eq,char_textdata(:,1:numel(str1)),str1),2))
With both these approaches, you can even specify a char array as search string.
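As a side note to Approach #1, strncmp also accepts a cell array of strings directly, so the cellfun wrapper is optional. A minimal sketch, assuming textdata is a cell array with one line per cell (the literal below is only a stand-in for what importdata would return):

```matlab
% Stand-in for textdata = importdata(fname,'\n'); the contents are made up
textdata = {'sun'; 'moon'; 'star'; 'sky'};

str1 = 's';                       % single-character search string
count_s = sum(strncmp(textdata, str1, numel(str1)));

str2 = 'st';                      % multi-character search string
count_st = sum(strncmp(textdata, str2, numel(str2)));
```

strncmp returns one logical per cell, so summing the result gives the number of matching lines.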

I don't follow the idea behind your code, so I cannot modify it. I use this approach instead:
fid = fopen(txt);
r = 0;
% Loop through the file until fgetl returns -1, indicating EOF
line = fgetl(fid);
while ischar(line)
    r = r+1;
    line = fgetl(fid);
end
fclose(fid);
with r being the number of lines in the file. You can easily check the value of line before increasing the counter r to meet your condition first.
Also, I don't know how the two methods compare in performance.
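The idea of checking each line before incrementing the counter could look roughly like this. It is only a sketch: the temporary file, its contents, and the names prefix and tline are made up for illustration.

```matlab
% Hypothetical test file, written to a temporary location for the example
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, 'sun\nmoon\nstar\nsky\n');
fclose(fid);

prefix = 's';                  % search string (assumed for illustration)
count = 0;

fid = fopen(fname, 'rt');
tline = fgetl(fid);
while ischar(tline)            % fgetl returns the number -1 at end of file
    if strncmp(tline, prefix, numel(prefix))
        count = count + 1;     % count only lines that start with the prefix
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);                 % clean up the temporary file
```

Because the file is read line by line, this avoids holding the whole file in memory, at the cost of a loop.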

Related

Save cell containing different length vectors to text file in MATLAB

I am trying to save a cell array containing different length column vectors to a text file in MATLAB, but I am a bit stuck.
cell_structure = {[1;2;3;...546] [1;2;3;...800] [1;2;3;...1011] [1;2;3;...1118] [1;2;3;...1678]}
I tried using the following code:
fid = fopen( 'myFile.txt', 'w' ) ;
for cId = 1 : numel( cell_structure )
fprintf( fid, '%f ', cell_structure{cId} ) ;
fprintf( fid, '\r\n' ) ;
end
The problem is when I open the text file the column vectors are saved as row vectors and their length is limited to 545. Any help would be much appreciated.
Your first iteration of the for loop prints the ENTIRE first array in cell_structure. It doesn't matter whether this array is a row or a column vector: fprintf() reuses the format for every element, so it prints each element out, one after another.
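A quick way to see this behaviour is with sprintf, which formats the same way as fprintf but returns the text instead of writing it. The two small vectors here are just illustrative:

```matlab
col = [1; 2; 3];               % column vector
row = [1, 2, 3];               % row vector

% The format '%d ' is reused for every element, in storage order
s_col = sprintf('%d ', col);
s_row = sprintf('%d ', row);

same_output = strcmp(s_col, s_row);   % orientation makes no difference
```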
This is a little trickier than I can manage at work right now... but you will need to pad your shorter vectors to match the length of your largest.
Then:
for k = 1:size_of_largest_vector
    for j = 1:total_number_of_columns
        new_cell{k,j} = cell_structure{j}(k);
    end
end
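This will give you a cell array with all your column vectors side by side.
Then write the columns out space-delimited. Note that csvwrite() always uses commas, so dlmwrite() with a ' ' delimiter is the closer fit once the data is in a numeric matrix.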
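One possible way to do the padding, sketched under the assumption that NaN is an acceptable filler for the missing entries (the original answer does not say what to pad with). The small cell array and the output file name are stand-ins:

```matlab
% Small stand-in for the real cell_structure
cell_structure = {[1;2;3], [1;2], [1;2;3;4]};

n_rows = max(cellfun(@numel, cell_structure));   % length of the longest vector
padded = NaN(n_rows, numel(cell_structure));     % NaN marks the padding (assumption)

for j = 1:numel(cell_structure)
    v = cell_structure{j};
    padded(1:numel(v), j) = v;                   % each vector fills one column
end

% Write the numeric matrix with a space as delimiter
outfile = [tempname '.txt'];                     % hypothetical output location
dlmwrite(outfile, padded, 'delimiter', ' ');
```

The padded cells show up as NaN in the text file; if blank fields are required instead, a cell-aware writer is needed.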
I used Elijah Rocker's idea to pad out the shorter column vectors so they are all the same length. Crowley pointed out that dlmwrite and csvwrite cannot handle empty cells. I found a function that can handle empty cells:
https://github.com/mkoohafkan/UCBcode-Matlab/blob/master/OtherPeoplesFunctions/dlmcell.m
And here is my code to pad out and then save the data.
for k = 1:max(cellfun('length',cell_structure)) % longest column vector length
    for j = 1:length(cell_structure) % number of column vectors
        if k > length(cell_structure{j}) % if the index is past the end of this column vector
            new_cell{k,j} = []; % pad with an empty cell
        else
            new_cell{k,j} = cell_structure{j}(k); % otherwise copy the element
        end
    end
end
dlmcell('test.txt',new_cell,',')
It works like a charm, all of the column vectors in the cell array are now saved in separate columns of the text file.

Read multiple data from a MATLAB file

I am currently trying to read data from a text file written exactly like this:
Height = 10
Length = 10
NodeX = 11
NodeY = 11
K = 10
I've written a small code like this
fileID = fopen('input.dat','r');
[a, b] = fscanf(fileID, '%s %f')
And I get the following answer:
a =
72
101
105
103
104
116
b =
1
It seems quite obvious that I am not managing to write the format specification correctly.
I would like to know how to pick a string along with a float multiple times in the same file.
As the documentation for fscanf states:
If formatSpec contains a combination of numeric and character
specifiers, then fscanf converts each character to its numeric
equivalent. This conversion occurs even when the format explicitly
skips all numeric values (for example, formatSpec is '%*d %s').
MATLAB can be annoyingly bad at reading mixed data types. One possible alternative is to read each line and split up your data using a simple regular expression:
fileID = fopen('input.dat','r');
mydata = {};
ii = 1;
while ~feof(fileID) % While we're not at the end of the file
    tline = fgetl(fileID); % Get next line
    % Put the * inside the group so the token holds the whole name
    mydata(ii,:) = regexp(tline, '([a-zA-Z]*) = (\d*)', 'tokens');
    ii = ii + 1;
end
fclose(fileID);
This returns a 5 x 1 cell array where each cell contains 2 cells (slightly annoying, but you can pull them out) that match your data. In this case, mydata{1}{1} is Height and mydata{1}{2} is 10.
Edit:
And you can flatten your cell array with a reshape call:
mydata = reshape([mydata{:}], 2, [])';
Which turns mydata in this case into a 5x2 cell array.
The fscanf function is a low-level I/O function and is often not the best choice for such rather high-level file input. One alternative would be to use the textscan function, which allows quite advanced format specifications:
fileID = fopen('input.dat','r');
C = textscan(fileID,'%s = %d')
which creates a 1x2 cell array. The first cell C{1} contains another 5x1 cell, where each field contains the name of the field, e.g. 'Height'. The second cell C{2} contains a 5x1 vector containing all integer values from the file.
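Once names and values are separated, it can be handy to collect them in a struct so each value is reachable by name. A sketch of that idea, using a line-by-line loop and dynamic field names; the temporary file and the variable params are made up for the example, and every value is assumed to be numeric:

```matlab
% Recreate the example input in a temporary file
fname = [tempname '.dat'];
fid = fopen(fname, 'wt');
fprintf(fid, 'Height = 10\nLength = 10\nNodeX = 11\nNodeY = 11\nK = 10\n');
fclose(fid);

params = struct();
fid = fopen(fname, 'rt');
tline = fgetl(fid);
while ischar(tline)
    % Token 1: the name, token 2: the value text after the equals sign
    tok = regexp(tline, '^\s*(\w+)\s*=\s*(\S+)', 'tokens', 'once');
    if ~isempty(tok)
        params.(tok{1}) = str2double(tok{2});   % dynamic field name
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);
```

After the loop, params.Height and params.NodeX can be used directly in later calculations.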

Search for a specific digit in an integer

I'm looking for a really quick method in MATLAB of searching for a specific digit within an integer, ideally in a given position. For example:
Simple case...
I want to look through an array of integers and return all those which contain the number 1 eg 1234, 4321, 6515, 847251737 etc
More complex case...
I want to loop through an array of integers and return all those which contain the number 1 in the third digit eg 6218473, 541846, 3115473 BUT 175846 would not be returned.
Any thoughts?
There's a few answers here already, I'll throw my try into the pot.
Conversion to string can be expensive, so if it can be avoided, it should be.
n = 1:100000; % sample numbers
m = 3; % digit to check
x = 1; % number to find
% Length of the numbers in digits
num_length = floor(log10(abs(n)))+1;
% digit (from the left) to check
num_place = num_length-m;
% get the digit
digit_in_place = mod(floor(abs(n)./(10.^num_place)),10);
found_number = n(digit_in_place==x);
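To check the arithmetic against the numbers from the question, here is the same calculation on a small array. The guard for numbers with fewer than m digits (has_digit) is an addition for this sketch, not part of the answer above:

```matlab
n = [6218473, 541846, 3115473, 175846, 21];   % 21 has no third digit
m = 3;                                        % digit position from the left
x = 1;                                        % digit to find

num_length = floor(log10(abs(n))) + 1;        % number of digits in each value
has_digit = num_length >= m;                  % skip values that are too short

digit_in_place = NaN(size(n));
digit_in_place(has_digit) = mod(floor(abs(n(has_digit)) ./ ...
    10.^(num_length(has_digit) - m)), 10);

found_number = n(digit_in_place == x);
```

Here found_number holds 6218473, 541846 and 3115473, while 175846 (third digit 5) and 21 are left out.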
By casting to strings, the trick to vectorising is just to make sure x is a column vector. x(:) guarantees this. Also you need to left-align the strings which is done with the format specifier '%-d' where - is for left-alignment and d is for integers:
s = num2str(x(:), '%-d');
ind = s(:,3)=='1'
and this also allows you to easily solve your first case:
ind = any(s=='1',2)
in either case to recover your original number just go:
x(ind)
One way of getting there is to cast your numbers as strings and then check if the 3rd position of that string is '1'. It works perfectly fine in a loop, but I am confident that there is also a vectorized solution:
numbers = [6218473, 541846, 3115473, 175846]';
returned_numbers = [];
for i = 1:length(numbers)
    number = numbers(i);
    y = sprintf('%d', number); %// cast to string
    %// add number to list, if its third character is 1
    if strcmp(y(3), '1')
        returned_numbers = [returned_numbers, number];
    end
end
% // it returns:
returned_numbers =
6218473 541846 3115473
Code
%// Input array
array1 = [-94341 1234 4321 6515 847251737 6218473 541846 3115473 175846]
N = numel(array1); %// number of elements in input array
digits_sep = num2str(array1(:))-'0'; %//' Separate the digits into a matrix
%// Simple case
output1 = array1(any(digits_sep==1,2))
%// More complex case output
col_num = 3;
%// Get column numbers for each row of the digits matrix and thus
%// the actual linear index corresponding to 3rd digit for each input element
ind1 =sub2ind(size(digits_sep),1:N,...
size(digits_sep,2)-floor(log10(abs(array1))-col_num+1));
%// Select the third digits, check which ones have `1` and use them to logically
%// index into input array to get the output
output2 = array1(digits_sep(ind1)==1)
Code run -
array1 =
-94341 1234 4321 6515 847251737 6218473 541846 3115473 175846
output1 =
-94341 1234 4321 6515 847251737 6218473 541846 3115473 175846
output2 =
6515 6218473 541846 3115473

matlab parse file into cell array

I have a file in the following format in matlab:
user_id_a: (item_1,rating),(item_2,rating),...(item_n,rating)
user_id_b: (item_25,rating),(item_50,rating),...(item_x,rating)
....
....
so each line has values separated by a colon where the value to the left of the colon is a number representing user_id and the values to the right are tuples of item_ids (also numbers) and rating (numbers not floats).
I would like to read this data into a matlab cell array or better yet ultimately convert it into a sparse matrix wherein the user_id represents the row index, and the item_id represents the column index and store the corresponding rating in that array index. (This would work as I know a-priori the number of users and items in my universe so ids cannot be greater than that ).
Any help would be appreciated.
I have thus far tried the textscan function as follows:
c = textscan(f,'%d %s','delimiter',':') %this creates two cells one with all the user_ids
%and another with all the remaining string values.
Now if I try to do something like str2mat(c{2}), it works but it stores the '(' and ')' characters also in the matrix. I would like to store a sparse matrix in the fashion that I described above.
I am fairly new to matlab and would appreciate any help regarding this matter.
f = fopen('data.txt','rt'); %// data file. Open as text ('t')
str = textscan(f,'%s'); %// gives a cell which contains a cell array of strings
str = str{1}; %// cell array of strings
r = str(1:2:end);
r = cellfun(@(s) str2num(s(1:end-1)), r); %// rows; numeric vector
pairs = str(2:2:end);
pairs = regexprep(pairs,'[(,)]',' ');
pairs = cellfun(@(s) str2num(s(1:end-1)), pairs, 'uni', 0);
%// pairs; cell array of numeric vectors
cols = cellfun(@(x) x(1:2:end), pairs, 'uni', 0);
%// columns; cell array of numeric vectors
vals = cellfun(@(x) x(2:2:end), pairs, 'uni', 0);
%// values; cell array of numeric vectors
rows = arrayfun(@(n) repmat(r(n),1,numel(cols{n})), 1:numel(r), 'uni', 0);
%// rows repeated to match cols; cell array of numeric vectors
matrix = sparse([rows{:}], [cols{:}], [vals{:}]);
%// concat rows, cols and vals into vectors and use as inputs to sparse
For the example file
1: (1,3),(2,4),(3,5)
10: (1,1),(2,2)
this gives the following sparse matrix:
matrix =
(1,1) 3
(10,1) 1
(1,2) 4
(10,2) 2
(1,3) 5
I think newer versions of MATLAB have a strsplit function that makes this approach overkill, but the following works, if not quickly. It splits the file into user ids and "other stuff" as you show, initializes a large empty matrix, and then iterates through the other stuff, breaking it apart and placing each value in the correct place in the matrix.
(I didn't see the previous answer when I opened this for some reason. It is more sophisticated than this one, though this may be a little easier to follow at the expense of speed.) I throw the \s* into the regex in case the spacing is inconsistent, but otherwise don't do much data sanity checking. The output is the full array, which you can then turn into a sparse array if desired.
% matlab_test.txt:
% 101: (1,42),(2,65),(5,0)
% 102: (25,78),(50,12),(6,143),(2,123)
% 103: (23,6),(56,3)
clear all;
fclose('all');
% your path will vary, of course
file = '<path>/matlab_test.txt';
f = fopen(file);
c = textscan(f,'%d %s','delimiter',':');
celldisp(c)
uids = c{1}
tuples = c{2}
% These are stated as known
num_users = 3;
num_items = 40;
desired_array = zeros(num_users, num_items);
expression = '\((\d+)\s*,\s*(\d+)\)'
% Assuming length(tuples) == num_users for simplicity
for k = 1:num_users
    uid = uids(k)
    tokens = regexp(tuples{k}, expression, 'tokens');
    for l = 1:length(tokens)
        item_id = str2num(tokens{l}{1})
        rating = str2num(tokens{l}{2})
        desired_array(uid, item_id) = rating;
    end
end
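If the goal is the sparse matrix itself, the same regular expression can feed sparse() directly by collecting row, column and value triplets. A rough sketch, where the two input lines are taken from the example file in the other answer and num_users and num_items are assumed known, as the question states:

```matlab
% Input lines as they would come from the file, one string per user
lines = {'1: (1,3),(2,4),(3,5)', '10: (1,1),(2,2)'};
num_users = 10;    % assumed known in advance
num_items = 5;     % assumed known in advance

rows = []; cols = []; vals = [];
for k = 1:numel(lines)
    % User id is the number before the colon
    id_tok = regexp(lines{k}, '^\s*(\d+)\s*:', 'tokens', 'once');
    uid = str2double(id_tok{1});

    % Every (item,rating) tuple becomes one 1x2 cell of strings
    tok = regexp(lines{k}, '\((\d+)\s*,\s*(\d+)\)', 'tokens');
    pairs = str2double(vertcat(tok{:}));      % n-by-2 numeric matrix

    rows = [rows; repmat(uid, size(pairs,1), 1)];
    cols = [cols; pairs(:,1)];
    vals = [vals; pairs(:,2)];
end

S = sparse(rows, cols, vals, num_users, num_items);
```

Passing the sizes to sparse() fixes the matrix dimensions even when the highest user or item id does not appear in the file.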

3D cell arrays in matlab

I am currently working using matlab, I have uploaded a csv file into a cell array that I have named B. What I now wish to do is to input the information of B into a 3-D cell array, the 3rd dimension of the array being the first column of B which are strings ranging from "chr1" to "chr24". The full length of B is m, and the maximum length of any "chr" is maxlength. I doubt that this is the best way of going about it but here is my code:
for j = 1:m ,
    Ind = findstr(B{1}{j}, 'chr');
    Num = B{1}{j}(Ind+3:end-1);
    cnum = str2num(Num);
    for i = 1:24,
        if cnum == i;
            for k = 2:9 ,
                for l = 1:maxlength ,
                    C{l}{k}{i} = B{k}{j};
                    C{l}{k}{i}
                end
            end
        end
    end
end
The 3-D array that comes out of this does not match the corresponding values in the initial array. I also want to know if this is the right way to create a 3-D array, I can't seem to find anything on the matlab website about them.
Thanks
There are a few possible issues with your approach. First of all, MATLAB indexing is different from C-style indexing into tables. myCell{i}{j} is the j-th element of the cell array that is contained in the i-th element of the cell array myCell. If you want to index into a 2-D cell array, you would get the contents of the element in row i, column j as myCell{i,j}.
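The difference between the two indexing styles, and how a true 3-D cell array is created, can be seen in a few lines. The contents and the sizes used here are arbitrary examples:

```matlab
% Nested cells: a cell array whose elements are themselves cell arrays
nested = {{'a', 'b'}, {'c', 'd'}};
from_nested = nested{2}{1};     % second inner cell, then its first element

% A genuine 2-D cell array is indexed with comma-separated subscripts
flat2d = {'a', 'b'; 'c', 'd'};
from_2d = flat2d{2,1};          % row 2, column 1

% A 3-D cell array is preallocated with cell() and indexed with three subscripts
C = cell(3, 8, 26);             % all elements start out empty
C{1, 2, 5} = 'value';
```

Both from_nested and from_2d end up as 'c', but only the second form scales naturally to three subscripts.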
If the columns 2 through 9 of your .csv file contain all numeric data, it may be a lot more convenient to use either a 1D cell array with an entry for every chromosome, or to use a 2D or 3D numeric array if you get, for each chromosome, a single row, or a table, respectively.
Here's one way to do it
%# convert chromosomes to numbers
chromosomes = B{1};
chromosomes = strrep(chromosomes,'X','25');
chromosomes = strrep(chromosomes,'Y','26');
tmp = regexp(chromosomes,'chr(\d+)','tokens','once');
cnum = cellfun(@(x)str2double(x{1}),tmp);
%# catenate the rest of B into a 2D numeric array
allNumbers = cell2mat(cat(2,B{2:end}));
%# now we can make a table with [chromosomeNumber,allOtherNumbers]
finalTable = [cnum,allNumbers]
%# alternatively, if there are multiple entries for each chromosome, we can
%# group the data in a cell array, so that the i-th entry corresponds to chr.i
%# for readability, use a loop
outputCell = cell(26,1); %# assume 26 chromosomes
for i=1:26
    outputCell{i} = allNumbers(cnum==i,:);
end
I've managed to do this with only two for loops, here is my code:
C = zeros(26,8,maxlength);
next = zeros(1,26);
for j = 1:m ,
    Ind = findstr(B{1}{j}, 'chr');
    Num = B{1}{j}(Ind+3:end-1);
    cnum = str2num(Num);
    % strcmp compares whole strings, so it is safe for any length of Num
    if strcmp(Num, 'X')
        cnum = 25;
    end
    if strcmp(Num, 'Y')
        cnum = 26;
    end
    next(cnum) = next(cnum) + 1;
    for k = 2:9 ,
        D{cnum}{k-1}{next(cnum)} = B{k}{j};
        C(cnum,k-1,next(cnum)) = str2num(B{k}{j});
    end
end