Importing a large amount of data into MATLAB?

I have a text file that is ~80 MB. It has 2 columns and around 6e6 rows. I would like to import the data into MATLAB, but it is too much data for the load function. I have been playing around with the fopen function but can't get anything to work.
Ideally I would like to take the first col of data and import and eventually have it in one large array in MATLAB. If that isn't possible, I would like to split it into arrays of 34,013 in length. I would also like to do the same for the 2nd col of data.

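fileID = fopen('yourfilename.txt');
formatSpec = '%f %f';
blocks = cell(0,2); % one row of cells per block that is read
while ~feof(fileID)
C = textscan(fileID,formatSpec,34013); % 1x2 cell: one cell per column, up to 34013 rows
blocks(end+1,:) = C; % keep every block instead of overwriting the previous one
end
fclose(fileID);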
Hope this helps..
Edit:
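The reason you are getting an error is that C is a 1-by-2 cell array with one cell per column of the file. You need to extract each column with curly braces and handle it individually. Note that the reshape below only works for a block that holds exactly 34,013 values (301 x 113).
For example:
column1data = reshape(C{1},301,113);
column2data = reshape(C{2},301,113);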
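If memory allows, the whole file can also be read with a single textscan call, which gives one large array per column as the question asks. The following is only a sketch: the sample file, the variable names, and the small block length are assumptions made for illustration, and the real file would use 34013 as the block length.

```matlab
% Write a small two-column sample file (stand-in for the real 80 MB file).
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '%f %f\n', [1:6; 11:16]);
fclose(fid);

% Read both columns in one call; textscan returns a 1-by-2 cell array.
fid = fopen(sampleFile, 'r');
C = textscan(fid, '%f %f');
fclose(fid);
delete(sampleFile);

col1 = C{1};   % first column as one numeric vector
col2 = C{2};   % second column as one numeric vector

% Optionally split a column into fixed-length blocks, one block per matrix column.
blockLen = 3;                                  % would be 34013 for the file in the question
nFull = floor(numel(col1) / blockLen);         % number of complete blocks
blocks1 = reshape(col1(1:nFull*blockLen), blockLen, nFull);
```

Any rows left over after the last complete block are simply not included in blocks1; they are still available in col1.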

You may also consider converting your file to a binary format if the data file does not change between loads. It will then load much faster.
Or you may do a "transparent binary conversion" as in the function below. Only the first load of the data is slow. All subsequent loads are fast.
function Data = ReadTextFile(FileName,NColumns)
MatFileName = sprintf('%s.mat',FileName); % binary file name
if exist(MatFileName,'file')==2 % if it exists
S = load(MatFileName,'Data'); % load it instead of
Data = S.Data; % the original text file
return;
end
fh = fopen(FileName); % if the binary file does not exist, load data from the original text file
fh_closer = onCleanup( @() fclose(fh) ); % the file will be closed properly even in case of error
Data = fscanf(fh, repmat('%f ',1,NColumns), [NColumns,inf]);
Data = Data';
save(MatFileName,'Data'); % and make a binary "cache" of the original data for faster subsequent reading
end
Do not forget to remove the MAT file when the original data file is changed.
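Instead of deleting the MAT file by hand, the modification times of the two files could be compared before the cache is trusted. This is a minimal sketch under stated assumptions: the temporary files and the variable names (textFile, matFile, cacheIsFresh) are made up for illustration and are not part of ReadTextFile above.

```matlab
% Create a tiny text file and its MAT cache (temporary stand-ins for real files).
textFile = [tempname '.txt'];
matFile  = [textFile '.mat'];      % same naming rule as ReadTextFile
fid = fopen(textFile, 'w');
fprintf(fid, '%f %f\n', [1 2; 3 4]);
fclose(fid);
Data = [1 3; 2 4];
save(matFile, 'Data');

% dir returns the modification time of each file as a serial date number.
textInfo = dir(textFile);
matInfo  = dir(matFile);

% The cache is usable only if it exists and is not older than the text file.
cacheIsFresh = ~isempty(matInfo) && matInfo.datenum >= textInfo.datenum;

% Remove the temporary files again.
delete(textFile);
delete(matFile);
```

Inside ReadTextFile, a check like this could replace the plain exist test, so that an outdated cache is rebuilt rather than loaded.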

Related

Save the data in a form of three columns in text files

This function reads the data from multiple mat files and saves it in multiple txt files. But each value is saved in its own column. I want to save the data as three columns (coordinates) in the text files, so that each row has three values separated by spaces. Reshaping the data before I save it to a text file doesn't work. I know that the dlmwrite call should be modified to start a new line after every three values, but how?
mat = dir('*.mat');
for q = 1:length(mat)
load(mat(q).name);
[~, testName, ~] = fileparts(mat(q).name);
testVar = eval(testName);
pos(q,:,:) = testVar.Bodies.Positions(1,:,:);
%pos=reshape(pos,2,3,2000);
filename = sprintf('data%d.txt', q);
dlmwrite(filename , pos(q,:,:), 'delimiter','\t','newline','pc')
end
My data structure:
These data should be extracted from each mat file and stored in the corresponding text files like this:
332.68 42.76 42.663 3.0737
332.69 42.746 42.655 3.0739
332.69 42.75 42.665 3.074
A MathWorks trainer once told me that there is almost never a good reason or need to use eval. Here's a snippet of code that should solve your writing problem using writematrix, since dlmwrite is no longer recommended.
It also puts the file handling and loading on a more resilient base. You can access struct fields dynamically with the .(fieldName) notation, which is quite convenient if you know your field names. With who you can list the variables in the workspace, but also those stored in .mat files.
Have a look:
% path to folder
pFldr = pwd;
% get a list of all mat-files (returns an array of structs)
Lst = dir( fullfile(pFldr,'*.mat') );
% loop over files
for Fl = Lst.'
% create path to file
pFl = fullfile( Fl.folder, Fl.name );
% variable to load
[~, var2load, ~] = fileparts(Fl.name);
% get names of variables inside the file
varInfo = who('-file',pFl);
% check if it contains the desired variables
if ~all( ismember(var2load,varInfo) )
% display some kind of warning/info
disp(strcat("the file ",Fl.name," does not contain all required variables and is therefore skipped."))
% skip / continue with loop
continue
end
% load | NO NEED TO USE eval()
Dat = load(pFl, var2load);
% DO WHATEVER YOU WANT TO DO
pos = squeeze( Dat.(var2load)(1,:,1:2000) );
% create file name for text file
pFl2save = fullfile( Fl.folder, strrep(Fl.name,'.mat','.txt') );
writematrix(pos,pFl2save,'Delimiter','\t')
end
To get your 3-D data into a 2-D matrix that you can write nicely to a file, use the function squeeze. It removes singleton dimensions (in your case, the first dimension) and returns the data as a lower-dimensional matrix.
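To see what squeeze does to the shapes involved, here is a small sketch. The 2-by-3-by-2000 random array is only an assumed stand-in for the Positions data, and the transpose is an assumption about the desired layout (one row per sample with three coordinates).

```matlab
positions = rand(2, 3, 2000);      % hypothetical stand-in for the Positions array
firstBody = positions(1, :, :);    % size 1-by-3-by-2000, leading singleton dimension
pos = squeeze(firstBody);          % size 3-by-2000 after removing the singleton
posRows = pos.';                   % size 2000-by-3: one row per sample, three values each

% Write three space-separated values per line to a temporary file.
outFile = [tempname '.txt'];
writematrix(posRows, outFile, 'Delimiter', ' ');
fileWasWritten = exist(outFile, 'file') == 2;
delete(outFile);
```

Without the transpose, writematrix would produce 3 rows of 2000 values each.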
Why don't you use the writematrix() function?
mat = dir('*.mat');
for q = 1:length(mat)
load(mat(q).name);
[~, testName, ~] = fileparts(mat(q).name);
testVar = eval(testName);
pos(q,:,:) = testVar(1,:,1:2000);
filename = sprintf('data%d.txt', q);
writematrix(pos(q,:,:),filename,'Delimiter','space');
end
You can find more details here:
https://www.mathworks.com/help/matlab/ref/writematrix.html

Adding size information of dataset to file name

I have several datasets, called '51.raw' '52.raw'... until '69.raw' and after I run these datasets in my code the size of these datasets changes from 375x91x223 to sizes with varying y-dimensions (i.e. '51.raw' output: 375x45x223; '52.raw' output: 375x50x223, ... different with each dataset).
I want to later save the '.raw' file name with this information (i.e. '51_375x45x223.raw') and also want to use the new dataset size to later reshape the dataset within my code. I have attempted to do this but need help:
for k=51:69
data=reshape(data,[375 91 223]); % from earlier in the code after importing data
% then executes code with dimensions of 'data' changing to 375x45x223, ...
length=size(data); dimensions.([num2str(k)]) = length; %save size in 'dimensions'.
path=['C:\Example\'];
name= sprintf('%d.raw',k);
write([path name], data);
% 'write' is a function to save the data in the specified path and name (value of k). I don't know how to add the size of the dataset to the name.
Also later I want to reshape the dataset 'data' for this iteration and do a reshape with the new y dimensions value.
i.e. data=reshape(data,[375 new y-dimension 223]);
Your help will be appreciated. Thanks.
You can easily convert your dimensions to a string and use it as part of the file name.
% Create a string of the form: dim1xdim2xdim3x...
dims = num2cell(size(data));
dimstr = sprintf('%dx', dims{:});
dimstr = dimstr(1:end-1);
% Append this to your "normal" filename
folder = 'C:\Example\';
filename = fullfile(folder, sprintf('%d_%s.raw', k, dimstr));
write(filename, data);
That being said, it is far better to include this dimension information within the file itself rather than relying on the filename.
As a side note, avoid using the names of built-in functions, such as length and path, as variable names. This can result in strange and unexpected behavior later on.
Update
If you need to parse the filename, you could use textscan to do that:
filename = '1_2x3x4.raw';
numDims = sum(filename == 'x') + 1; % not named ndims, to avoid shadowing the built-in function
fspec = repmat('%dx', [1 numDims]);
parts = textscan(filename, ['%d_', fspec(1:end-1)]);
% Then load your data
% Now reshape it based on the filename
data = reshape(data, parts{2:end});
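As an alternative to textscan, the numbers could be pulled out of the filename with a regular expression. This is a sketch under assumptions: the filename follows the k_AxBxC.raw pattern from the question, and the zeros vector is just a placeholder for data loaded from disk.

```matlab
filename = '51_375x45x223.raw';                 % example name in the format built above
[~, baseName] = fileparts(filename);            % drop the folder and the .raw extension
tokens = regexp(baseName, '\d+', 'match');      % every run of digits, as a cell array
numbers = str2double(tokens);                   % numeric row vector of all numbers found

k = numbers(1);                                 % dataset index
dims = numbers(2:end);                          % stored dimensions

data = zeros(prod(dims), 1);                    % placeholder for the loaded data
data = reshape(data, dims);                     % restore the saved shape
```

Here dims is an ordinary double vector, so it can be passed to reshape directly as a size vector.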

Importing matrix with different row length

I have large data files and I would like to import 12 columns of data for further use. However, the number of rows will be different in each instance. I would import the selected columns only, but below the data I need there are some blank rows followed by extra numbers which aren't necessary, so I'm wondering how to import just the data I need. I don't mind specifying an end row, but this would be different for each case, and I'm not sure if I'm missing anything else obvious! To help, I've attached a screenshot of an example of the data I'm working with:
To summarise I only require the "blue" data above the purple boxes, each file I will use will have the same layout except there may be more/less rows of data.
EDIT
I have updated the code to give you a better understanding of the process:
% An empty array:
importedarray = [];
% Open the data file to read from it:
fid = fopen( 'dummydata.txt', 'r' );
% Check that fid is not -1 (error opening the file).
% read each line of the data in a cell array:
cac = textscan( fid, '%s', 'Delimiter', '\n' );
% size(cac{1},1) must equal the number of rows in your data file.
totalRows = size(cac{1},1);
fprintf('Imported %d rows of data!\n',totalRows)
% Close the file as we don't need it anymore:
fclose( fid );
% for total rows in data
for k=1:totalRows
fprintf('Parsing data on row %d of %d...\n',k,totalRows);
currentRow = cac{1}{k,1};
fprintf('Row contains:\n%s\n',currentRow);
% finish (break from loop) when encounter an empty row:
if isempty(currentRow)
fprintf('Empty row encountered (#%d). Exiting the loop...\n',k);
break;
end
eachRowElement = strsplit(currentRow, ' ');
fprintf('Splitting row to %d elements...\n',length(eachRowElement));
fprintf('Converting row to floats...');
eachRowElement2num = cellfun(@str2num,eachRowElement,'UniformOutput',false);
fprintf('Done!\n');
fprintf('Converting cell to matrix...');
importedarray(k,:) = cell2mat(eachRowElement2num);
fprintf('Done!\n');
end
clearvars cac k fid totalRows currentRow eachRowElement eachRowElement2num;
Given your example image (every column of each row is filled with floats, and you stop at the first empty row), this should do the job and report progress along the way. If it fails, the line where the code stopped will tell you what went wrong. I included code to clear the unnecessary variables after importing. This must be done manually, or you can wrap the import in a function: a function's workspace is separate, and its temporary variables are deleted when the function returns (see http://www.mathworks.com/help/matlab/ref/function.html). Hope this helps.
PS. In your example you keep 12 columns skipping the first two. The above code will import the whole row. You can choose what columns to keep later by using matrix indexing, like:
importedarray = importedarray(:,3:14);
If these columns don't change, you can incorporate this into your function.
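The same idea (stop at the first blank row, then keep selected columns) can be written more compactly. This is only a sketch: the cell array rowText is made-up sample data standing in for cac{1}, it uses sscanf rather than the strsplit and str2num steps above, and it assumes every data row has the same number of values.

```matlab
% Sample rows as they might come out of textscan with '%s' and a newline delimiter.
rowText = {'0 1 1.5 2.5'; '1 2 3.5 4.5'; ''; '99 98'};

% Locate the first blank row; everything from there on is ignored.
firstBlank = find(cellfun(@isempty, strtrim(rowText)), 1);
if isempty(firstBlank)
    lastRow = numel(rowText);
else
    lastRow = firstBlank - 1;
end

% Convert each remaining row to a numeric row vector and stack the rows.
rows = cellfun(@(s) sscanf(s, '%f').', rowText(1:lastRow), 'UniformOutput', false);
importedarray = vertcat(rows{:});

% Keep only the wanted columns (3:14 in the question; 3:4 in this small sample).
kept = importedarray(:, 3:4);
```

Because sscanf skips repeated whitespace, rows separated by several spaces or tabs are handled as well.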

Name each variable differently in a loop

I have created a .dat file of file names. I want to read into MATLAB each file in that list and give the data a different name. Currently, each iteration just overwrites the last one.
I found that a lot of people give this answer:
for i=1:10
A{i} = 1:i;
end
However, it isn't working for my problem. Here's what I am doing
flist = fopen('fnames.dat'); % Open the list of file names
nt = 0; % Counter will go up one for each file loaded
while ~feof(flist) % While end of file has not been reached
for i = 1:6 % Number of filenames in the .dat file
% For each file
fname = fgetl(flist); % Reads next line of list, which is the name of the next data file
disp(fname); % Stores name as string in fname
nt = nt+1; % Time index
% Save data
data{i} = read_mixed_csv(fname, '\t'); % Reads in the CSV file% Open file
data{i} = data(2:end,:); % Replace header row
end
end
The code runs with no errors, but only one data variable is saved.
My fnames.dat contains this:
IA_2007_MDA8_O3.csv
IN_2007_MDA8_O3.csv
MI_2007_MDA8_O3.csv
MN_2007_MDA8_O3.csv
OH_2007_MDA8_O3.csv
WI_2007_MDA8_O3.csv
If possible, I would really like to name data something more intuitive. Like IA for the first file, IN for the second and so on. Is there any way to do this?
The last line of the loop is the problem:
data{i} = data(2:end,:);
I did not run your code, so I don't know exactly what happens, but data(2:end,:) indexes the cell array data itself (its second through last cells), not the second through last rows of the file you just read.
Try:
thisdata = read_mixed_csv(fname, '\t');
data{i} = thisdata(2:end,:);
If you want to keep track of what data came from which file, save out a second cell array with the names:
thisdata = read_mixed_csv(fname, '\t');
data{i} = thisdata(2:end,:);
names{i} = fname(1:2); % presuming you only need first two letters.
If you need a specific part of the filename that's not always the same length look into strtok or fileparts. Then you can use things like strcmp to check the cell array names for where the data labelled IA or whichever is stored.
As mentioned by @Daniel, the simple way to store data of various sizes is a cell array.
data{1} = thisdata(2:end,:)
However, if the names are really important, you could consider using a struct instead. For example:
dataStruct(1).numbers= thisdata(2:end,:);
dataStruct(1).name= theRelevantName
Of course you could also just add them to the cell array:
dataCell{1,1} = thisdata(2:end,:);
dataCell{1,2} = theRelevantName
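If the goal is to refer to each dataset by a name such as IA or IN, a scalar struct with dynamic field names is another option. The sketch below makes assumptions: the state code is taken to be everything before the first underscore, and magic(3) stands in for the output of read_mixed_csv, which is not available here.

```matlab
fnames = {'IA_2007_MDA8_O3.csv', 'IN_2007_MDA8_O3.csv'};   % two names from fnames.dat

byState = struct();
for i = 1:numel(fnames)
    stateCode = strtok(fnames{i}, '_');      % text before the first underscore, e.g. 'IA'
    thisdata = magic(3) * i;                 % placeholder for read_mixed_csv(fnames{i}, '\t')
    byState.(stateCode) = thisdata(2:end,:); % drop the header row, store under the state code
end

iaData = byState.IA;                         % access a dataset by name
```

This only works when the extracted name is a valid MATLAB identifier, which holds for two-letter state codes.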

Vectorizing the following for loop

I have a function that imports data collected in a txt file using the following code:
FILE = fopen(textDataFileName);
FIC = textscan(FILE, '%s');
FICu = FIC{1,1}(:,:);
n = numel(FICu);
for i = 1:n
FICun = str2double(FICu{i,1});
FICa(i,1) = FICun;
end
After the data is imported, my function extracts all the necessary data, and then I have other functions that do the data analysis for me. My issue, though, is that the function as a whole is slowed down by the above for loop. Originally the for loop wasn't a problem as the data set was relatively small; however, new data is appended to the text file every day, so the for loop has to contend with larger and larger data sets (the size of which cannot be predicted before importing). Does anyone have an easy way to vectorize my for loop and accomplish the same thing?
As a quick note, changing the format specifier does not behave as one would expect. In fact, changing the format to %f (floating point) or %d (integer) causes the function to skip most of the data in the file.
Another update:
My code is now as follows:
FILE = fopen(textDataFileName);
FIC = textscan(FILE, '%s');
FICu = FIC{1,1}(:,:);
n = numel(FICu);
FICa = zeros(n,1);
FICa = str2double(FICu(:,1));
This shaved 2 s off the time it takes to complete. Any suggestions? (Also, please note the issue with changing the format specifier: it did not work as one would expect.)