importing data with a very complex structure into matlab

importing data with a very complex structure into matlab - matlab

i have some data files that i would like to load into matlab. unfortunatly, they have a quite complex structure - at least compared to what i am used to. you should be able to download an old example of this here, https://www.dropbox.com/s/vbh6kl334c5zg1s/fn1_2.out (it opens fine in notepad or wordpad)
it is data files based on synchrotron data where both the raw data, regularized "raw" data and the (indirect) fourier transformed data+fit to data is listed. there are furthermore some statistics from the fourier transformation.
I just need to quote the results from the statistics in my paper, so while it would be nice to plot some of the results, it is not strictly necessary. I need, however, the raw and regularized data together with the fit, and the fourier transformed data.
My problem
in the data file, the results from the statistical analysis is shown before the data i need. but the size of the columns from the statistical analysis varies from data file to data file. this means that i cannot just include the statistics in the header unless i manually change the number of header lines for each file i import. i need to analysis groups of 5 data files together and i would at least need to analyze around 30 files this time so i would like to avoid it if possible. in the future i would again need to load this kind of data files - so even if changing the number of headerlines 30 times does not sound bad it would be nice to be able do it automatically
Possible solution
both the he raw and regularized data together with the fit as well as the fourier transformed data are preceded by a specific line that tells me that after this and a blank/empty line, the data begins
so i though that maybe i could use regular expressions to tell matlab to ignore everything until you see this specific line, ignore this line and one more, and then import data
i googled and found this topic where regular expressions are used: Trying to parse a fairly complex text file
but i am new to regular expressions and the code suggested is a bit complex for me. i can gather that he uses named capture but i am not quite sure i understand how he uses it and if i can adopt it to me need. i have checked the official matlab documentation but their examples are somewhat simpler :) (http://www.mathworks.se/help/matlab/matlab_prog/regular-expressions.html#bqm94nz-1)
Sorry for writing such a long post. any suggestions on how to proceed with this problem will be greatly appreciated
/Martin
EDIT
the code i have used based on the link in the comment:
fileName = 'data.dat';
inputfile = fopen(fileName);
% Ignore all until we see one that just consists of this:
startString = ' R P(R) ERROR';
mydata = [];
while 1
tline = fgetl(inputfile);
% Break if we hit end of file, or the start marker
if ~ischar(tline) || strcmp(tline, startString)
break
end
data = sscanf(tline, '%f', 3 );
mydata(end+1,:) = data;
end
fclose(inputfile);
When i run the code i get the error:
Subscripted assignment dimension mismatch.
mydata(end+1,:) = data;
any suggestions will be greatly appriciated and my apologize for the strange layout/leaving the link in the comment. i am not allowed to include more than two links in a post and i cannot add a new answer yet - both due to me having to low rep :)

Since the blocks are separated by at least two new lines you can use that to separate the text into blocks and analyse them individually. Try this code
fileH = fopen('fn1_2.out');
content = fscanf(fileH, '%c', inf);
fclose(fileH);
splitstring = regexp(content, '\r\n\r\n', 'split');
blocks = regexp(splitstring, '\d\.\d{4}.*\r\n.*\d\.\d{4}','match');
numericBlocksIdx = find(cellfun(#(x) ~isempty(x), blocks));
numericBlocks = splitstring(numericBlocksIdx);
Now the numericBlocks{1}, numericBlocks{2}, ... contain the tables that you are interested in. Note that for some tables the headers are also included because they are not separated by two new lines. From here you can use functions like textscan to read the data into matrices.

Related

Separating Files Based on their names

New to Matlab so this may be more simple than I'm realising.
I'm working with a large number of text files for some data analysis and would like to separate them into a number of categories. They come under a format similar to Tp_angle_RunNum.txt, and Ts_angle_RunNum.text. I would like to have 4 different groups, the Tp and Ts files for angle1, and the same for angle2.
Any help would be appreciated!

A few things here:
Remember that your text files are not necessarily going to be where you process your data, so think about how you want to store the data in memory while you work with it. It may be beneficial to pull all of the data into a MATLAB Table. You can read your text files directly into tables with the function readtable()., with columns to capture Ts, Tp, and angle... or angle1 and angle2. You can use logicals with all 4, i.e., a true or false as to which group the data row belongs. You could also capture the run number so you know exactly what file it came from. Lots of ways to store the data. But, if you get into very large data sets, Tables are way easier to manipulate and work with. They are also very fast, and compact if you use categorical types as applicable.
dir() will likely be a necessary function for you for the dir listing, however there are some alternatives and some considerations based upon your platform. I suggest you take a look at the doc page for dir.
Take advantage of MATLAB Strings and vector processing. Its not the only way to do things, but it is by far the easiest. Strings were introduced in R2016b, and have gotten better since as more and more features and capabilities work with strings. More recently, you can use patterns instead of (or with!) regular expressions for finding what you need to process. The doc page I linked above has some great examples, so no point in me reinventing that wheel.
fileparts() is also your friend in MATLAB when working with many files. You can use it to separate the path, the filename and the extension. You might use it to simplify your processing.
Regarding vector processing, you can take an entire array of strings and pass it to the functions used with pattern. Or, if you take my suggestion of working with tables, you can get rows of your table that match specific characteristics.
Lets look at a few of these concepts together with some sample code. I don't have your files, but I can demonstrate with just the output of the dir command and some dummy files... I'm going to break this into two parts, a dirTable function (which is a dir wrapper I like to use instead of dir, and keep on my path) and a script that uses it. I'd suggest you copy the code and take advantage of code sections to run a section at a time. See doc page on Create and Run Sections in Code if this is new to you
dirTable.m
% Filename: dirTable.m
function dirListing = dirTable( names )
arguments
names (:,1) string {mustBeText} = pwd()
end
dirListing = table();
for nIdx = 1:numel( names )
tempListing = dir( names( nIdx ) );
dirListing = [ dirListing;...
struct2table( tempListing,'AsArray', true ) ]; %#ok<AGROW>
end
if ~isempty( dirListing )
%Adjust some of the types...
dirListing.name = categorical( dirListing.name );
dirListing.folder = categorical( dirListing.folder );
dirListing.date = datetime( dirListing.date );
end
end
Example script
%% Create some dummy files - assuming windows, or I'd use the "touch" cmd.
cmds = "type nul >> Ts_42_" + (1:3) + ".txt";
cmds = [cmds;"type nul >> Tp_42_" + (1:3) + ".txt"];
cmds = [cmds;"type nul >> Ts_21_" + (1:3) + ".txt"];
cmds = [cmds;"type nul >> Tp_21_" + (1:3) + ".txt"];
for idx = 1:numel(cmds)
system(cmds(idx));
end
%% Get the directory listing for all the files
% Note, the filenames come out as categoricals by my design, though that
% doesnt help much for this example - in fact - I'll have to cast to
% convert the categoricals to string a few times. Thats ok, its not a
% heavy lift. If you use dir directly, you'd not only be casting to
% string, but you'd also have to deal with the structure and clunky if/else
% conditions everywhere.
listing = dirTable();
%% Define patterns for each of the 4 groups
% - pretending the first code cell doesnt exist.
Tp_Angle1_pattern = "Tp_21";
Ts_Angle1_pattern = "Ts_21";
Tp_Angle2_pattern = "Tp_42";
Ts_Angle2_pattern = "Ts_42";
%% Cycle a group's data, creating a single table from all the files
% I could be more clever here and loop through the patterns as well and
% create a table of tables; however, I am going to keep this code easier
% to read at the cost of repetitiveness. I will however use a local
% function to gather all the runs from one group into a single table.
Tp_Angle1_matches = string(listing.name).startsWith(Tp_Angle1_pattern);
Tp_Angle1_filenames = string(listing.name(Tp_Angle1_matches));
Tp_Angle1_data = aggregateDataFilesToTable(Tp_Angle1_filenames);
% Repeat for each group... Or loop the above code for a single table
% if you loop for a single table, make sure to add column(s) for the group
% information
%% My local function for reading all the files in a group
function data_table = aggregateDataFilesToTable(filenames)
arguments
filenames (:,1) string
end
% We could assume that since we're using run numbers at the end of the
% filename, that we'll get the filenames pre-sorted for us. If not zero
% padding the numbers, then need to extract the run number to determine the
% sort order of the files to read in. I'm going to be lazy and assume zero
% padded for simplicity.
data_table = table();
for fileIdx = 1:numel(filenames)
% For the following line, two things:
% 1) the [data_table;readtable()] syntax appends the table from
% readtable to the end of data_table.
% 2) The comment at the end acknowledges that this variable is growing
% in a loop, which is usually not the best practice; however, since I
% have no way of knowing the total table dimensions ahead of time, I
% cannot pre-allocate the table before the loop - hence the table()
% call before the for loop. If you have a way of knowing this ahead of
% time, do pre-allocate!
data_table = [data_table;readtable(filenames(fileIdx))]; %#ok<AGROW>
end
end
NOTE 1: using empty parens is not necessary on function calls with no parameters; however, I find it to be easier for others to read when they know they are calling a function and not reading a variable.
NOTE 2: I know the dummy files are empty. That won't matter for this example, as an empty table appended to an empty table is another empty table. And the OP's quesiton was about the file manipulation and grouping, etc.
NOTE 3: In case the syntax is new to you, BOTH functions in my example use function argument blocks, which were introduced in R2019b - and they make for much easier to maintain and read code than NOT validating inputs, or using more complex ways of validating inputs. I was going to leave that out of this example, but they were already in my dirTable function, so I figured I'd just explain it instead.

MATLAB: making a histogram plot from csv files read and put into cells?

Unfortunately I am not too tech proficient and only have a basic MATLAB/programming background...
I have several csv data files in a folder, and would like to make a histogram plot of all of them simultaneously in order to compare them. I am not sure how to go about doing this. Some digging online gave a script:
d=dir('*.csv'); % return the list of csv files
for i=1:length(d)
m{i}=csvread(d(i).name); % put into cell array
end
The problem is I cannot now simply write histogram(m(i)) command, because m(i) is a cell type not a csv file type (I'm not sure I'm using this terminology correctly, but MATLAB definitely isn't accepting the former).
I am not quite sure how to proceed. In fact, I am not sure what exactly is the nature of the elements m(i) and what I can/cannot do with them. The histogram command wants a matrix input, so presumably I would need a 'vector of matrices' and a command which plots each of the vector elements (i.e. matrices) on a separate plot. I would have about 14 altogether, which is quite a lot and would take a long time to load, but I am not sure how to proceed more efficiently.
Generalizing the question:
I will later be writing a script to reduce the noise and smooth out the data in the csv file, and binarise it (the csv files are for noisy images with vague shapes, and I want to distinguish these shapes by setting a cut off for the pixel intensity/value in the csv matrix, such as to create a binary image showing these shapes). Ideally, I would like to apply this to all of the images in my folder at once so I can shift out which images are best for analysis. So my question is, how can I run a script with all of the csv files in my folder so that I can compare them all at once? I presume whatever technique I use for the histogram plots can apply to this too, but I am not sure.
It should probably be better to write a script which:
-makes a histogram plot and/or runs the binarising script for each csv file in the folder
-and puts all of the images into a new, designated folder, so I can sift through these.
I would greatly appreciate pointers on how to do this. As I mentioned, I am quite new to programming and am getting overwhelmed when looking at suggestions, seeing various different commands used to apparently achieve the same thing- reading several files at once.

The function csvread returns natively a matrix. I am not sure but it is possible that if some elements inside the csv file are not numbers, Matlab automatically makes a cell array out of the output. Since I don't know the structure of your csv-files I will recommend you trying out some similar functions(readtable, xlsread):
M = readtable(d(i).name) % Reads table like data, most recommended
M = xlsread(d(i).name) % Excel like structures, but works also on similar data
Try them out and let me know if it worked. If not please upload a file sample.

The function csvread(filename)
always return the matrix M that is numerical matrix and will never give the cell as return.
If you have textual data inside the .csv file, it will give you an error for not having the numerical data only. The only reason I can see for using the cell array when reading the files is if the dimensions of individual matrices read from each file are different, for example first .csv file contains data organised as 3xA, and second .csv file contains data organised as 2xB, so you can place them all into a single structure.
However, it is still possible to use histogram on cell array, by extracting the element as an array instead of extracting it as cell element.
If M is a cell matrix, there are two options for extracting the data:
M(i) and M{i}. M(i) will give you the cell element, and cannot be used for histogram, however M{i} returns element in its initial form which is numerical matrix.
TL;DR use histogram(M{i}) instead of histogram(M(i)).

Faster way to load .csv files in folder and display them using imshow in MATLAB

I have a piece of MATLAB code that works fine, but I wanted to know is there any faster way of performing the same task, where each .csv file is a 768*768 dimension matrix
Current code:
for k = 1:143
matFileName = sprintf('ang_thresholded%d.csv', k);
matData = load(matFileName);
imshow(matData)
end
Any help in this regard will be very helpful. Thank You!

In general, its better to separate the loading, computational and graphical stuff.
If you have enough memory, you should try to change your code to:
n_files=143;
% If you know the size of your images a priori:
matData=zeros( 768, 768,n_files); % prealocate for speed.
for k = 1:n_files
matFileName = sprintf('ang_thresholded%d.csv', k);
matData(:,:,k) = load(matFileName);
end
seconds=0.01;
for k=1:n_Files
%clf; %Not needed in your case, but needed if you want to plot more than one thing (hold on)
imshow(matData(:,:,k));
pause(seconds); % control "framerate"
end
Note the use of pause().

Here is another option using Matlab's data stores which are designed to work with large datasets or lots of smaller sets. The TabularTextDatastore is specifically for this kind of text based data.
Something like the following. However, note that since I don't have any test files it is sort of notional example ...
ttds = tabularTextDatastore('.\yourDirPath\*.csv'); %Create the data store
while ttds.hasdata %This turns false after reading the last file.
temp = read(ttds); %Returns a Matlab table class
imshow(temp.Variables)
end
Since it looks like your filenames' numbering is not zero padded (e.g. 1 instead of 001) then the file order might get messed up so that may need addressed as well. Anyway I thought this might be a good alternative approach worth considering depending on what else you want to do with the data and how much of it there might be.

How can I increase the speed of this xlsread for loop?

I have made a script which contains a for loop selecting columns from 533 different excel files and places them into matrices so that they can be compared, however the process is taking too long (it ran for 3 hours yesterday and wasn't even halfway through!!).
I know xlsread is naturally slow, but does anyone know how I can make my script run faster? The script is below, thanks!!
%Split the data into g's and h's
CRNum = 533; %Number of Carrington Rotation files
A(:,1) = xlsread('CR1643.xlsx','A:A'); % Set harmonic coefficient columns
A(:,2) = xlsread('CR1643.xlsx','B:B');
B(:,1) = xlsread('CR1643.xlsx','A:A');
B(:,2) = xlsread('CR1643.xlsx','B:B');
for k = 1:CRNum
textFileName = ['CR' num2str(k+1642) '.xlsx'];
A(:,k+2) = xlsread(textFileName,'C:C'); %for g
B(:,k+2) = xlsread(textFileName,'D:D'); %for h
end

Don't use xlsread if you want to go through a loop. because it opens excel and then closes excel server each time you call it, which is time consuming. instead before the loop use actxserver to open excel, do what you want and finally close actxserver after your loop. For a good example of using actxserver, search for "Read Spreadsheet Data Using Excel as Automation Server" in MATLAB help.
And also take a look at readtable which works faster than xlsread, but generates a table instead.

The most obvious improvement seems to be to load the files only partially if possible. However, if that is not an option, try whether it helps to only open each file once (read everything you need, and then assign it).
M(:,k+2) = xlsread(textFileName,'C:D');
Also check how much you are reading in each time, if you read in many rows in the first file, you may make the first dimension of A big, and then you will fill it each time you read a file?
As an extra: a small but simple improvment can be found at the start. Don't use 4 load statements, but use 1 and then assign variables based on the result.

As mentioned in this post, the easiest thing to change would be to set 'Basic' to true. This disables things like formulas and macros in Excel and allows you to read a simple table more quickly. For example, you can use:
xlsread('CR1643.xlsx','A:A', 'Basic', true)
This resulted in a decrease in load time from about 22 seconds to about 1 second for me when I tested it on a 11,000 by 7 Excel sheet.

Simplest way to read space delimited text file matlab

Ok, so I'm struggling with the most mundane of things I have a space delimited text file with a header in the first row and a row per observation and I'd like to open that file in matlab. If I do this in R I have no problem at all, it'll create the most basic matrix and voila!
But MATLAB seems to be annoying with this...
Example of the text file:
"picFile" "subjCode" "gender"
"train_1" 504 "m"
etc.
Can I get something like a matrix at all? I would then like to have MATLAB pull out some data by doing data(1,2) for example.
What would be the simplest way to do this?
It seems like having to write a loop using f-type functions is just a waste of time...

If you have a sufficiently new version of Matlab (R2013b+, I believe), you can use readtable, which is very much like how R does it:
T = readtable('data.txt','Delimiter',' ')
There are many functions for manipulating tables and converting back and forth between them and other data types such as cell arrays.
There are some other options in the data import and export section of the Statistics toolbox that should work in older versions of Matlab:
tblread: output in terms of separate variables for strings and numbers
caseread: output in terms of a char array
tdfread: output in terms of a struct
Alternatively, textscan should be able to accomplish what you need and probably will be the fastest:
fid = fopen('data.txt');
header = textscan(fid,'%s',3); % Optionally save header names
C = textscan(fid,'%s%d%s','HeaderLines',1); % Read data skipping header
fclose(fid); % Don't forget to close file
C{:}

Found a way to solve my problem.
Because I don't have the latest version of MATLAB and cannot use readable which would be the preferred option I ended up doing using textread and specifying the format of each column.
Tedious but maybe the "simplest" way I could find:
[picFile subCode gender]=textread('data.txt', '%s %f %s', 'headerlines',1);
T=[picFile(:) subCode(:) gender(:)]
The textscan solution by #horchler seems pretty similar. Thanks!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse