Extract specific column information from table in MATLAB

Extract specific column information from table in MATLAB - matlab

I have several *.txt files with 3 columns information, here just an example of one file:
namecolumn1 namecolumn2 namecolumn3
#----------------------------------------
name1.jpg someinfo1 name
name2.jpg someinfo2 name
name3.jpg someinfo3 name
othername1.bmp info1 othername
othername2.bmp info2 othername
othername3.bmp info3 othername
I would like to extract from "namecolumn1" only the names starting with name but from column 1.
My code look like this:
file1 = fopen('test.txt','rb');
c = textscan(file1,'%s %s %s','Headerlines',2);
tf = strcmp(c{3}, 'name');
info = c{1}{tf};
the problem is that when I do disp(info) I got only the first entry from the table: name1.jpg and I would like to have all of them:
name1.jpg
name2.jpg
name3.jpg

You're pretty much there. What you're seeing is an example of MATLAB's Comma Separated List, so MATLAB is returning each value separately.
You can verify this by entering c{1}{tf} in the command line after running your script, which returns:
>> c{1}{tf}
ans =
name1.jpg
ans =
name2.jpg
ans =
name3.jpg
Though sometimes we'd want to concatenate them, I think in the case of character arrays it is more difficult to work with than retaining the cell arrays:
>> info = [c{1}{tf}]
info =
name1.jpgname2.jpgname3.jpg
versus
>> info = c{1}(tf)
info =
'name1.jpg'
'name2.jpg'
'name3.jpg'
The former would require you to reshape the result (and whitespace pad, if the strings are different lengths), whereas you can index the strings in a cell array directly without having to worry about any of that (e.g. info{1}).

Related

Pull specific cells out of a cell array by comparing the last digit of their filename

I have a cell array of filenames - things like '20160303_144045_4.dat', '20160303_144045_5.dat', which I need to separate into separate arrays by the last digit before the '.dat'; one cell array of '...4.dat's, one of '...5.dat's, etc.
My code is below; it uses regex to split the file around the '.dat', reshapes a bit then regexes again to pull out the last number of the filename, builds a cell to store the filenames in then, and then I get a tad stuck. I have an array produced such as '1,0,1,0,1,0..' of required cell indexes which I thought might be trivial to pull out, but I'm struggling to get it to do what I want.
numFiles = length(sampleFile); %sampleFile is the input cell array
splitFiles = regexp(sampleFile,'.dat','split');
column = vertcat(splitFiles{:});
column = column(:,1);
splitNums = regexp(column,'_','split');
splitNums = splitNums(:,1);
column = vertcat(splitNums{:});
column = column(:,3);
column = cellfun(#str2double,column); %produces column array of values - 3,4,3,4,3,4, etc
uniqueVals = unique(column);
numChannels = length(uniqueVals);
fileNameCell = cell(ceil(numFiles/numChannels),numChannels);
for i = 1:numChannels
column(column ~= uniqueVals(i)) = 0;
column = column / uniqueVals(i); %e.g. 1,0,1,0,1,0
%fileNameCell(i)
end
I feel there should be an easier way than my hodge-podge of code, and I don't want to throw together a ton of messy for-loops if I can avoid it; I definitely believe I've overcomplicated this problem massively.

We can neaten your code quite a bit.
Take some example data:
files = {'abc4.dat';'abc5.dat';'def4.dat';'ghi4.dat';'abc6.dat';'def5.dat';'nonum.dat'};
You can get the final numbers using regexp and matching one or more digits followed by '.dat', then using strrep to remove the '.dat'.
filenums = cellfun(#(r) strrep(regexp(r, '\d+.dat', 'match', 'once'), '.dat', ''), ...
files, 'uniformoutput', false);
Now we can put these in a structure, using the unique numbers (prefixed by a letter because fields can't start with numbers) as field names.
% Get unique file numbers and set up the output struct
ufilenums = unique(filenums);
filestruct = struct;
% Loop over file numbers
for ii = 1:numel(ufilenums)
% Get files which have this number
idx = cellfun(#(r) strcmp(r, ufilenums{ii}), filenums);
% Assign the identified files to their struct field
filestruct.(['x' ufilenums{ii}]) = files(idx);
end
Now you have a neat output
% Files with numbers before .dat given a field in the output struct
filestruct.x4 = {'abc4.dat' 'def4.dat' 'ghi4.dat'}
filestruct.x5 = {'abc5.dat' 'def5.dat'}
filestruct.x6 = {'abc6.dat'}
% Files without numbers before .dat also captured
filestruct.x = {'nonum.dat'}

Slow regexprep with a very long string

I have simulation data in an ascii file with a lot of data points. I'm trying to extract variable names and their values from it. The below is an example of what the file format looks like:
*ESA
*COM on Tue Sep 27 15:23:02 2016
*COM C:\Users\vi813c\Documents\My Matlab\
*COM The pathname to the ESB file was: C:\Users\vi813c\Documents\My Matlab
Case013
*RTITLE
Run Date/Time = 20-SEP-2016 13:29:00
MSC.EASY5 time-history plot with 20001 data points
*EOD
*FLOAT
TIME FDLB(1) FSLB(1) FVLB(1) MXLB(1) \
MYLB(1) MZLB(1) FDLB(2) FSLB(2) FVLB(2) \
MXLB(2) MYLB(2) MZLB(2) FDLB(3) FSLB(3) \
FVLB(3) MXLB(3) MYLB(3) MZLB(3)
0 884.439 -0 53645.8 -972.132
-311780 207.866 5403.68 1981.49 327781
258746 -1.74898E+006 84631.4 5384.25 -1308.47
326538 -97028.6 -1.74013E+006 -61858.1
0.002 882.616 0.008033 53661.1 -972.4
-311702 207.779 5400.42 1982.11 327784
258726 -1.74906E+006 84628.3 5381.01 -1308.44
326541 -97040.1 -1.74021E+006 -61858.8
0.004 876.819 0.031336 53705.6 -973.183
-311683 207.661 5391.19 1983.9 327795
258693 -1.74935E+006 84624 5371.85 -1309.63
326552 -97040.6 -1.74051E+006 -61858.8
0.006 869.491 0.061631 53763.3 -974.213
-311806 207.618 5377.45 1986.76 327813
258659 -1.74995E+006 84621.7 5358.2 -1312.04
326569 -97040.3 -1.7411E+006 -61861
0.008 861.718 0.095625 53828.1 -975.379
-312039 207.648 5360.82 1990.12 327834
A summary of data format characteristics is as follows:
Everything above "*FLOAT" is a header and I need to get rid of it
Stuff between "*FLOAT" and the first numeric value are the variable names
The variable names and the values are delimited by space(s) and '\'
The data are "lumped". Each lump has values for the variables at a given simulation time step. In the example above, there are 19 variables so that there are 19 numeric values in each lump
There can be multiple data sets; each preceded with "*FLOAT" and a variable name section
The following is how I am currently handling this data:
fileread the file --> one big string of characters
regexprep {'\s+,'\','\n'} with ',' --> comma delimited for strsplit
strfind "*FLOAT"
strsplit by ',' --> now becomes a cell
find the first numeric value by isnan(str2double(parse))
Then between the index from 2. and the index from 4 are the variable names and between the index from 4 and the next "*FLOAT" are the numeric data
This scheme is sort of working, but I can't stop thinking that there's gotta be a better way to do this. For one, the step 1. is extremely slow. I guess it's one big string for regexprep to work on with multiple things to replace.
How can I improve my script?

I gave this a shot with the string class which is new in 16b.
str = string(fileread('file.txt'));
fileNewline = [13 newline]; % This data has carriage returns
str = extractAfter(str, ['*FLOAT' fileNewline]);
str = erase(str, ['\' fileNewline]);
str = splitlines(str);
% Get the variable names
varNames = split(str(1))';
% Get the data
data = reshape(str(2:end), 4, [])';
data = strip(data);
data = join(data);
data = split(data);
data = double(data);
I'm not sure about how to load the file faster.
As mentioned in another comment, textscan could probably help. It might end up being the fastest solution. With the correct format specified and using the 'HeaderLines' option, I think you can make it work.

Matlab: str2double works on Windows PC but not on Apple Mac

Say I have the following example file names: file_0250.pdf, file_0251.pdf, file_0252.pdf. I would like to get the following cell array:
'250 251 252'.
On the Windows PC at work I can run the following code with no problems, but on my Macbook at home I can not get the 'str2double' values as it returns a NaN value. It's frustrating:
folder_name = '/User/....';
file_name = 'file_';
extension = '.pdf';
%//' files pattern with absolute paths
filePattern = fullfile(folder_name, [file_name '*' extension] );
old_filename = cellstr(ls(filePattern)) ;
%// Get numbers associated with each file
file_ID = strrep(strrep(old_filename, file_name ,''), extension,'');
file_ID_doublearr = str2double(file_ID);
I tried 'cell2mat', 'str2mat', but they do not go well with the rest of the code:
file_ID_doublearr = file_ID_doublearr - min(file_ID_doublearr)+ start_number;
file_ID = strtrim(cellstr(num2str(file_ID_doublearr)));
%// Get zeros string to be pre-appended to each new_name
str_zeros = arrayfun(#(t) repmat('0',1,t), 4-cellfun(#numel,file_ID),'uni',0) ;
%// Generate new filenames
new_name = get(handles.new_name, 'string');
new_extension = get(handles.new_extension, 'string');
new_filename = strcat(new_name,str_zeros,file_ID,new_extension) ;
%// Finally rename files with the absolute paths
cellfun(#(m1,m2) movefile(m1,m2),fullfile(folder_name,file_name),fullfile(folder_name,new_filename)) ;

Your issue has to do with the different way that *nix (Linux and Mac) and Windows treat ls as mentioned in the documentation. As you've found out, ls returns a 2D character array of filenames. On the PC, these filenames will be returned one per row.
file_001.pdf
file_002.pdf
file_003.pdf
file_004.pdf
When you call cellstr on the result, it will place each filename into it's own cell array element after which you can successfully extract the number portion and convert them to numbers.
On *nix-based systems though, ls will typically yield a multi-column output. For example:
file_001.pdf file_002.pdf file_003.pdf
file_004.pdf
When you call cellstr on this, you will get one cell array element per row, but as you can see the first row actually contains three filenames. Then once you extract the number portion you would get something like this:
'001 002 003'
'004'
When you try to convert to a number, you're trying to convert a string of numbers to a single number and you get a NaN.
str2double({'001 002 003'; '004'})
% NaN 4
The best way to fix this is to not use the OS-dependent ls and use dir instead which is guaranteed to have consistent behavior across operating systems.
files = dir(fullfile(folder_name, [filename, '*', extension]));
numbers = regexp({files.name}, '[0-9]*', 'match');
The other option is to make sure that file_ID does not contain any space-separated numbers.
file_IDs = {'001 002 003'; '004'};
% Break each element up into multiple elements if it contains spaces
file_IDs = cellfun(#(x)strsplit(x), file_IDs, 'UniformOutput', 0);
file_IDs = cat(2, file_IDs{:});
% Now convert to a number
str2double(file_IDs);
% 1 2 3 4

How do I read comma separated values from a .txt file in MATLAB using textscan()?

I have a .txt file with rows consisting of three elements, a word and two numbers, separated by commas.
For example:
a,142,5
aa,3,0
abb,5,0
ability,3,0
about,2,0
I want to read the file and put the words in one variable, the first numbers in another, and the second numbers in another but I am having trouble with textscan.
This is what I have so far:
File = [LOCAL_DIR 'filetoread.txt'];
FID_File = fopen(File,'r');
[words,var1,var2] = textscan(File,'%s %f %f','Delimiter',',');
fclose(FID_File);
I can't seem to figure out how to use a delimiter with textscan.

horchler is indeed correct. You first need to open up the file with fopen which provides a file ID / pointer to the actual file. You'd then use this with textscan. Also, you really only need one output variable because each "column" will be placed as a separate column in a cell array once you use textscan. You also need to specify the delimiter to be the , character because that's what is being used to separate between columns. This is done by using the Delimiter option in textscan and you specify the , character as the delimiter character. You'd then close the file after you're done using fclose.
As such, you just do this:
File = [LOCAL_DIR 'filetoread.txt'];
f = fopen(File, 'r');
C = textscan(f, '%s%f%f', 'Delimiter', ',');
fclose(f);
Take note that the formatting string has no spaces because the delimiter flag will take care of that work. Don't add any spaces. C will contain a cell array of columns. Now if you want to split up the columns into separate variables, just access the right cells:
names = C{1};
num1 = C{2};
num2 = C{3};
These are what the variables look like now by putting the text you provided in your post to a file called filetoread.txt:
>> names
names =
'a'
'aa'
'abb'
'ability'
'about'
>> num1
num1 =
142
3
5
3
2
>> num2
num2 =
5
0
0
0
0
Take note that names is a cell array of names, so accessing the right name is done by simply doing n = names{ii}; where ii is the name you want to access. You'd access the values in the other two variables using the normal indexing notation (i.e. n = num1(ii); or n = num2(ii);).

How to display selected entries of an array of structures in MATLAB

Suppose we have an array of structure. The structure has fields: name, price and cost.
Suppose the array A has size n x 1. If I'd like to display the names of the 1st, 3rd and the 4th structure, I can use the command:
A([1,3,4]).name
The problem is that it prints the following thing on screen:
ans =
name_of_item_1
ans =
name_of_item_3
ans =
name_of_item
How can I remove those ans = things? I tried:
disp(A([1,3,4]).name);
only to get an error/warning.

By doing A([1,3,4]).name, you are returning a comma-separated list. This is equivalent to typing in the following in the MATLAB command prompt:
>> A(1).name, A(3).name, A(4).name
That's why you'll see the MATLAB command prompt give you ans = ... three times.
If you want to display all of the strings together, consider using strjoin to join all of the names together and we can separate the names by a comma. To do this, you'll have to place all of these in a cell array. Let's call this cell array names. As such, if we did this:
names = {A([1,3,4]).name};
This is the same as doing:
names = {A(1).name, A(3).name, A(4).name};
This will create a 1 x 3 cell array of names and we can use these names to join them together by separating them with a comma and a space:
names = {A([1,3,4]).name};
out = strjoin(names, ', ');
You can then show what this final string looks like:
disp(out);

You can use:
[A([1,3,4]).name]
which will, however, concatenate all of the names into a single string.
The better way is to make a cell array using:
{ A([1,3,4]).name }

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Extract specific column information from table in MATLAB - matlab

Related

Pull specific cells out of a cell array by comparing the last digit of their filename

Slow regexprep with a very long string

Matlab: str2double works on Windows PC but not on Apple Mac

How do I read comma separated values from a .txt file in MATLAB using textscan()?

How to display selected entries of an array of structures in MATLAB

Categories

Resources