I have a large data set in a file and I want to write a MATLAB script that creates a data structure for it. I have tried reading about structure arrays in MATLAB, but I haven't found a solution for how to do this. I don't have much experience writing MATLAB scripts.
Edited: My data set is a large list of items with, say, 10 characteristics recorded for each item. So for example, 100,000 listings of houses, where the characteristics could be price, county, state, date sold, etc. The file can be .txt, .xls, or any format you like to work with.
I would like to write a MATLAB script that creates a data structure of it say in the format:
house(i).price
house(i).county
house(i).state
house(i).date
etc
Any pointers in the right direction, or examples showing how to do this, would be greatly appreciated.
This seems like a very reasonable question, and one that can be easily addressed.
The format of the file really makes this problem easy or hard. I don't like .xls files for this kind of work myself, but I realize you get what you get. Let's assume it's a tab-delimited text file like:
Price County State Date
100000 Sherlock London 2001-10-01
134000 Holmes Dartmoor 2011-12-30
123456 Watson Boston 2003-04-15
I would just read the whole thing in, parse the field-name row, and use dynamic field names to build the array of structures.
fid = fopen('data.txt','r');
% parse the header row into field names
tline = fgetl(fid);
flds = regexp(tline,'\s+','split');
% initialize the prototype struct with one field per column
data = struct();
for ii = 1:length(flds)
    data.(flds{ii}) = [];
end
ii = 1;
% get the first line of data
tline = fgetl(fid);
while ischar(tline)
    % parse the data
    rowData = regexp(tline,'\s+','split');
    % we're assuming no missing data, etc.
    % populate the structure
    for jj = 1:length(flds)
        data(ii).(flds{jj}) = rowData{jj};
    end
    % since we don't know how many lines we have
    % (we could figure that out, but we won't now),
    % we'll just use the size-extending feature of
    % MATLAB arrays, even though it's slow, just
    % to show how we would do it
    tline = fgetl(fid);
    ii = ii + 1;
end
fclose(fid);
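One caveat: regexp returns everything as text, so numeric columns end up as character arrays. A minimal follow-up sketch, assuming the Price column from the sample header above, that converts a numeric field after reading:
% convert the Price field from text to numbers
% (field name taken from the sample header; adapt to your file)
for ii = 1:numel(data)
    data(ii).Price = str2double(data(ii).Price);
end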
Hope this gets you started!
I have a piece of MATLAB code that works fine, but I wanted to know whether there is any faster way of performing the same task, where each .csv file is a 768*768 matrix.
Current code:
for k = 1:143
    matFileName = sprintf('ang_thresholded%d.csv', k);
    matData = load(matFileName);
    imshow(matData)
end
Any help in this regard would be much appreciated. Thank you!
In general, it's better to separate the loading, computation, and graphics.
If you have enough memory, you should try to change your code to:
n_files = 143;
% If you know the size of your images a priori:
matData = zeros(768, 768, n_files); % preallocate for speed
for k = 1:n_files
    matFileName = sprintf('ang_thresholded%d.csv', k);
    matData(:,:,k) = load(matFileName);
end
seconds = 0.01;
for k = 1:n_files
    % clf; % not needed in your case, but needed if you want to plot more than one thing (hold on)
    imshow(matData(:,:,k));
    pause(seconds); % control the "framerate"
end
Note the use of pause().
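If the redraw itself becomes the bottleneck, a common trick (not part of the answer above, so treat it as a sketch) is to call imshow once and then update the image's CData in the loop instead of redrawing from scratch:
% draw the first frame once and keep the handle
hImg = imshow(matData(:,:,1));
for k = 2:n_files
    set(hImg, 'CData', matData(:,:,k)); % update pixels in place
    drawnow;                            % flush the graphics queue
end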
Here is another option using MATLAB's datastores, which are designed to work with large datasets or lots of smaller files. The TabularTextDatastore is specifically for this kind of text-based data.
Something like the following. However, note that since I don't have any test files, this is a notional example ...
ttds = tabularTextDatastore('.\yourDirPath\*.csv'); %Create the data store
while ttds.hasdata %This turns false after reading the last file.
    temp = read(ttds); %Returns a MATLAB table class
    imshow(temp.Variables)
end
Since it looks like your filenames' numbering is not zero-padded (e.g. 1 instead of 001), the file order might get mixed up, so that may need to be addressed as well. Anyway, I thought this might be a good alternative approach worth considering, depending on what else you want to do with the data and how much of it there might be.
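For the ordering issue, one possible fix (a sketch I'm adding here, assuming the ang_thresholded%d.csv naming pattern from the question) is to sort the file list numerically before building the datastore:
d = dir('.\yourDirPath\*.csv');
names = {d.name};
% pull the numeric suffix out of each name, e.g. 'ang_thresholded12.csv' -> 12
nums = cellfun(@(s) sscanf(s, 'ang_thresholded%d.csv'), names);
[~, order] = sort(nums);
% build the datastore from the numerically sorted file list
ttds = tabularTextDatastore(fullfile({d(order).folder}, names(order)));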
My data is x,y co-ordinates in multiple files
a=dir('*.mat')
b={a(:).name}
to load the filenames in a cell array
How do I use a loop to sequentially load one column of data from each file into consecutive rows of a new/separate array?
I've been doing it individually using e.g.
load('example1.mat')
A(:,1)=AB(:,1)
load('example2.mat')
A(:,2)=AB(:,1)
load('example3.mat')
A(:,3)=AB(:,1)
Obviously very primitive and time consuming!!
My MATLAB skills are weak, so any advice is gratefully received.
Cheers
Many thanks again. I'm still figuring out how to read the code, but I used it like this:
a=dir('*.mat');
b={a(:).name};
test1=zeros(numel(b),1765);
for k=1:numel(b)
    S=load(b{k});
    test1(k,:)=S.AB(:,2);
end
I then used the following code to create a PCA cluster plot:
[wcoeff,score,latent,tsquared,explained] = pca(test1,'VariableWeights','variance');
c3 = wcoeff(:,1:3)
coefforth = inv(diag(std(test1)))*wcoeff;
I = c3'*c3
cscores = zscore(test1)*coefforth;
figure()
plot(score(:,1),score(:,2),'+')
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
I was using 'gname' to label the points on the cluster plot, but found that the points were simply labelled from 1 to the number of rows in the array. I was going to ask you about this, but I found out through trial and error that if I use 'gname(b)', it labels the points with the names listed in b.
However, the cluster plot starts to look very busy/messy once I have labelled quite a few points, so now I am wondering whether it is possible to extract the filenames into a list by dragging around or selecting a few points. I think it is possible, as I have read a few related topics, but any tips/advice around gname or labelling/extracting labels from cluster plots would be greatly appreciated. Apologies again for my formatting; I'm still getting used to this website!
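For reference, the labelling step that worked, per the trial and error described above, looks like this:
plot(score(:,1), score(:,2), '+')
gname(b) % labels each clicked point with the corresponding filename from b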
Here is a way to do it. Hopefully I got what you wanted correctly :)
The code is commented but please ask any questions if something is unclear.
a=dir('*.mat');
b={a(:).name};
%// Initialize the output array. Here SomeNumber depends on the size of your data in AB.
A = zeros(numel(b),SomeNumber);
%// Loop through each 'example.mat' file
for k = 1:numel(b)
    %// ===========
    %// Here you could do either of the following:
    %// Option 1)
    %// Create a name to load with sprintf. It does not require a or b.
    NameToLoad = sprintf('example%i.mat',k);
    %// Load the data
    S = load(NameToLoad);
    %// Option 2)
    %// Load directly from b:
    S = load(b{k});
    %// ===========
    %// Now S is a structure containing every variable from the exampleX.mat file.
    %// You can access the data using dot notation.
    %// Store the data into rows of A
    A(k,:) = S.AB(:,1);
end
Hope that is what you meant!
I have a very large number of large data files. I would like to be able to categorize the data in each file, and then save the filename to a cell array, such that at the end I'll have one cell array of filenames for each category of data, which I could then save to a mat file so that I can then come back later and run analysis on each category. It might look something like this:
MatObj = matfile('listOfCategorizedFilenames.mat');
MatObj.boring = {};
MatObj.interesting = {};
files = dir(directory);
K = numel(files);
for k = 1:K
    load(files(k).name,'data')
    metric = testfunction(data);
    if metric < threshold
        MatObj.boring{end+1} = files(k).name;
    else
        MatObj.interesting{end+1} = files(k).name;
    end
end
Because the list of files is very long, and testfunction can be slow, I'd like to set this to run unattended overnight or over the weekend (this is a stripped down version, metric might return one of several different categories), and in case of crashes or unforeseen errors, I'd like to save the data on the fly rather than populating a cell array in memory and dumping to disk at the end.
The problem is that using matfile will not allow cell indexing, so the save step throws an error. My question is, is there a workaround for this limitation? Is there better way to incrementally write the filenames to a list that would be easy to retrieve later?
I have no experience with matfile, so I cannot help you with that. As a quick and dirty solution, I would just write the filenames to two different text files. Quick testing suggests that the data is flushed to disk straight away, and that the text files are OK even if you close MATLAB without doing an fclose (to simulate a crash). Untested code:
files = dir(directory);
K = numel(files);
boring = fopen('boring.txt', 'w');
interesting = fopen('interesting.txt', 'w');
for k = 1:K
    load(files(k).name,'data')
    metric = testfunction(data);
    if metric < threshold
        fprintf(boring, '%s\n', files(k).name);
    else
        fprintf(interesting, '%s\n', files(k).name);
    end
end
% be nice and close files
fclose(boring);
fclose(interesting);
Processing the boring/interesting text files afterwards should be trivial. If you would also write the directory listing to a separate file before starting the loop, it should be pretty easy (either by hand or automatically) to figure out where to continue in case of a crash.
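A minimal sketch of that last idea (the name filelist.txt is my own choice here, not from the original suggestion):
% write the full directory listing before the loop, so progress
% can be compared against it after a crash
listfid = fopen('filelist.txt', 'w');
for k = 1:K
    fprintf(listfid, '%s\n', files(k).name);
end
fclose(listfid);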
MAT-files are probably the most efficient way to store lists of files, but whenever I've had this problem, I make a cell array and save it using xlswrite or fprintf into a document that I can just reload later.
You said the save step throws an error, so I assume this part is okay, right?
for k = 1:K
    load(files(k).name,'data')
    metric = testfunction(data);
    if metric < threshold
        MatObj.boring{end+1} = files(k).name;
    else
        MatObj.interesting{end+1} = files(k).name;
    end
end
Personally, I then just write:
xlswrite('name.xls', MatObj.interesting, 1, 'A1');
[~, ~, list] = xlsread('name.xls'); % later on
Or if you prefer text,
% I'm assuming here that it's just a single list of text strings.
nrows = numel(MatObj.interesting); % number of filenames to write
fid = fopen('name.txt', 'w');
for row = 1:nrows
    fprintf(fid, '%s\n', MatObj.interesting{row});
end
fclose(fid);
And then later open it with fscanf. I mostly just use xlswrite; I've never had a problem with it, and it's not noticeably slow enough to deter me from using it. I know my answer is just a workaround rather than a real solution, but I hope it helps.
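Reading the text file back in later might look like this (a sketch; textscan is one option, fgetl in a loop is another):
fid = fopen('name.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n'); % one cell per line
fclose(fid);
interestingList = C{1}; % cell array of filenames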
I am reading some data from a CSV or text file (which consists of thousands of rows, with each row consisting of a fixed number of columns, e.g. 20).
I am keeping the above details in matlab with a structure as follows.
initial_var(firs).second_var(sec).third_var(thir).time(end+1, :) = [];
initial_var(firs).second_var(sec).third_var(thir).scan(end+1, :) = [];
initial_var(firs).second_var(sec).third_var(thir).time(end+1, :) = val1;
initial_var(firs).second_var(sec).third_var(thir).scan(end+1, :) = val2;
where firs, sec, thir, val1, val2 are filled from the csv/text file.
There are multiple fields available other than time and scan but I have not included them here.
While running the program, I am getting the warning
The variable initial_var appears to change size on every loop iteration. Consider preallocating for speed.
I know this can be solved by preallocating and initializing.
But my question here is, what is the better way of keeping the above data rather than the above mentioned structure type?
These lines won't do anything:
initial_var(firs).second_var(sec).third_var(thir).time(end+1, :) = [];
initial_var(firs).second_var(sec).third_var(thir).scan(end+1, :) = [];
It means "delete the row after the end of this array".
You might like to look at a multi-dimensional structure:
vars(firs,sec,thr).time(end+1, :) = val1
vars(firs,sec,thr).scan(end+1, :) = val2
Should be easier to initialise too.
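For instance, a minimal sketch of preallocating such a structure (N1, N2, N3 are placeholders for your actual counts):
% preallocate an N1-by-N2-by-N3 struct array with empty fields
vars = repmat(struct('time', [], 'scan', []), [N1, N2, N3]);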
Also, when loading the data, you might like to look at textscan.
Typically, the fastest and most flexible way to read data is with fscanf. (See also csvread for a convenience wrapper for csv files.) For example:
data = randn(1e4, 20);
save data.txt data -ASCII
tic
h = fopen('data.txt');
data_read = fscanf(h, '%f');
fclose(h);
% fscanf returns values in the order they appear in the file (row by row),
% but reshape fills column-major, so reshape to 20 rows and transpose
data_read = reshape(data_read, 20, []).';
toc
Elapsed time is 0.089097 seconds.
If the data are all numeric, then it is fastest to store and operate on simple matrices.
Also, if you post some specific data and reproducible code, we might be able to give more specific answers...
I've written a script that saves its output to a CSV file for later reference, but the second script for importing the data takes an unreasonably long time to read it back in.
The data is in the following format:
Item1,val1,val2,val3
Item2,val4,val5,val6,val7
Item3,val8,val9
where the headers are on the left-most column, and the data values take up the remainder of the row. One major difficulty is that the arrays of data values can be different lengths for each test item. I'd save it as a structure, but I need to be able to edit it outside the MATLAB environment, since sometimes I have to delete rows of bad data on a computer that doesn't have MATLAB installed. So really, part one of my question is: Should I save the data in a different format?
Second part of the question:
I've tried importdata, csvread, and dlmread, but I'm not sure which is best, or if there's a better solution. Right now I'm using my own script with a loop and fgetl, which is horribly slow for large files. Any suggestions?
function [data,headers] = csvreader(filename) %V1_1
fid = fopen(filename,'r');
data = {};
headers = {};
count = 1;
while 1
    textline = fgetl(fid);
    if ~ischar(textline), break, end
    % peel off the header one character at a time until the first comma
    nextchar = textline(1);
    idx = 1;
    while nextchar ~= ','
        headers{count}(idx) = textline(1);
        idx = idx + 1;
        textline(1) = [];
        nextchar = textline(1);
    end
    % drop the comma and convert the rest of the row to numbers
    textline(1) = [];
    data{count} = str2num(textline);
    count = count + 1;
end
fclose(fid);
(I know this is probably terribly written code - I'm an engineer, not a programmer, please don't yell at me - any suggestions for improvement would be welcome, though.)
It would probably make the data easier to read if you could pad the file with NaN values when your first script creates it:
Item1,1,2,3,NaN
Item2,4,5,6,7
Item3,8,9,NaN,NaN
or you could even just print empty fields:
Item1,1,2,3,
Item2,4,5,6,7
Item3,8,9,,
Of course, in order to pad properly you would need to know beforehand what the maximum number of values across all the items is. With either format above, you could then use one of the standard file-reading functions, like TEXTSCAN, for example:
>> fid = fopen('uneven_data.txt','rt');
>> C = textscan(fid,'%s %f %f %f %f','Delimiter',',','CollectOutput',1);
>> fclose(fid);
>> C{1}
ans =
'Item1'
'Item2'
'Item3'
>> C{2}
ans =
1 2 3 NaN %# TEXTSCAN sets empty fields to NaN anyway
4 5 6 7
8 9 NaN NaN
Instead of parsing the string textline one character at a time, you could use strtok to break the string up. For example:
stringParts = {};
tline = fgetl(fid);
if ~ischar(tline), break, end
i = 1;
while 1
    [stringParts{i},r] = strtok(tline,',');
    tline = r;
    i = i+1;
    if isempty(r), break; end
end
% store the header
headers{count} = stringParts{1};
% convert the data into numbers
for j = 2:length(stringParts)
    data{count}(j-1) = str2double(stringParts{j});
end
count = count+1;
I've had the same problem with reading CSV data in MATLAB, and I was surprised by how little support there is for this, but then I just found the Import Data tool. I'm on R2015b.
On the top bar in the "Home" tab, click on "Import Data" and choose the file you'd like to read. An app window will come up like this:
[Import Data tool screenshot]
Under "Import Selection" you have the option to "generate function", which gives you quite a bit of customization options, including how to fill empty cells, and what you'd like the output data structure to be. Plus it's written by MathWorks, so it's probably utilizing the fastest available method to read csv files. It was almost instantaneous on my file.
Q1) If you know the max number of columns, you can fill empty entries with NaN.
Also, if all values are numerical, do you really need the "Item#" column? If yes, you can use just "#", so that all data is numerical.
Q2) The fastest way to read numeric data from a file without MEX-files is csvread.
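For example (a minimal sketch; csvread needs a purely numeric file, with optional row/column offsets to skip headers and labels):
% read everything, skipping the first row (header) and first column (labels)
M = csvread('data.csv', 1, 1);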
I try to avoid using strings in CSV files, but if I have to, I use my csv2cell function:
http://www.mathworks.com/matlabcentral/fileexchange/20135-csv2cell