Matlab: how handle abnormal data files - matlab

I am trying to import a large number of files into Matlab for processing. A typical file would look like this:
mass intensity
350.85777 238
350.89252 3094
350.98688 2762
351.87899 468
352.17712 569
352.28449 426
Some text and numbers here, describing the experimental setup, eg
Scan 3763 # 81.95, contains 1000 points:
The numbers in the two columns are separated by 8 spaces. However, sometimes the experiment will go wrong and the machine will produce a datafile like this one:
mass intensity
Some text and numbers here, describing the experimental setup, eg
Scan 3763 # 81.95, contains 1000 points:
I found that using space-separated files with a single header row, ie
importdata(path_to_file,' ', 1);
works best for the normal files. However, it totally fails on all the abnormal files. What would the easiest way to fix this be? Should I stick with importdata (already tried all possible settings, it just doesn't work) or should I try writing my own parser? Ideally, I would like to get those values in a Nx2 matrix for normal files and [0 0] for abnormal files.
Thanks.

I don't think you need to create your own parser, nor is this all that abnormal. Using textscan is your best option here.
fid = fopen('input.txt', 'rt');
data = textscan(fid, '%f %u', 'Headerlines', 1);
fclose(fid);
mass = data{1};
intensity = data{2};
Yields:
mass =
350.8578
350.8925
350.9869
351.8790
352.1771
352.2845
intensity =
238
3094
2762
468
569
426
For your 1st file and:
mass =
Empty matrix: 0-by-1
intensity =
Empty matrix: 0-by-1
For your empty one.
By default, text scan reads whitespace as a delimiter, and it only reads what you tell it to until it can no longer do so; thus it ignores the final lines in your file. You can also run a second textscan after this one if you want to pick up those additional fields:
fid = fopen('input.txt', 'rt');
data = textscan(fid, '%f %u', 'Headerlines', 1);
mass = data{1};
intensity = data{2};
data = textscan(fid, '%*s %u %*c %f %*c %*s %u %*s', 'Headerlines', 1);
scan = data{1};
level = data{2};
points = data{3};
fclose(fid);
Along with your mass and intensity data gives:
scan =
3763
level =
81.9500
points =
1000

what do you mean 'totally failes on abnormal files'?
you can check if importdata finds any data using e.g.
>> imported = importdata(path_to_file,' ', 1);
>> isfield(imported, 'data')

Related

Using data from txt file to make a plot

I have a file with float numbers( -1.22070E-02 -7.32420E-02 ) on two columns.
I want to read them and make a plot. Something like the examples here: https://www.mathworks.com/help/signal/examples/signal-generation-and-visualization.html.
After searching this is my code:
fid = fopen('sample_b.txt', 'r');
a = fscanf(fid, '%f %f', [2 inf]);
simin = a';
plot(simin(:,1),simin(:,2));
fclose(fid);
But this is the plot I get:
And when I display simin using 'disp(simin)' I see:
81.4724 90.5792
12.6987 91.3376
63.2359 9.7540
27.8498 54.6882
I do not understand why my program uses these 8 numbers. My file contains many numbers, but the first 8 are these:
-1.22070E-02 -7.32420E-02
1.22070E-02 0.197750
4.88280E-03 9.27730E-02
-4.39450E-02 3.41800E-02
If I use 'plot(a(:,1),a(:,2));' the plot is just a line.
Edit: my data represents and EEG signal.

Reading large text files with insuficient RAM in Matlab

I want to read a large text file of about 2GB and perform a series of operations on that data. Following approach
tic
fid=fopen(strcat(Name,'.dat'));
C=textscan(fid, '%d%d%f%f%f%d');
fclose(fid);
%Extract cell values
y=C{1}(1:Subsampling:end)/Subsampling;
x=C{2}(1:Subsampling:end)/Subsampling;
%...
Reflectanse=C{6}(1:Subsampling:end);
Overlap=round(Overlap/Subsampling);
fails immediatly after reading C (C=textscan(fid, '%d%d%f%f%f%d');) with a strange peak in my memory usage:
What would be the best way to import a file of this size? Is there a way to use textscan() to read individual parts of a text file, or are there any other functions better suited for this task?
Edit: Every column in the textscan contains an information field information for 3D-Points:
width hieght X Y Z Grayscale
345 453 3.422 53.435 0.234 200
346 453 3.545 52.345 0.239 200
... % and so on for ~40 millon points
If you can process each row individually then the following code will allow you to do that. I've included start_line and end_line if you want to specify a block of data.
headerSpec = '%s %s %s %s %s %s';
dataSpec = '%f %f %f %f %f %f';
fid=fopen('data.dat');
% Read Header
names = textscan(fid, headerSpec, 1, 'delimiter', '\t');
k = 0;
% specify a start and end line for getting a block of data
start_line = 2;
end_line = 3;
while ~feof(fid)
k=k+1;
if k < start_line
continue;
end
if k > end_line
break;
end
% get data
C = textscan(fid, dataSpec, 1, 'delimiter', '\t');
row = [C{1:6}]; % convert data from cell to vector
% do what you want with the row
end
fclose(fid);
There is the possibility of reading in the entire file, but this will depend on the amount of memory you have available and if matlab has any restrictions in place. This can be seen by typing memory at the command window.

read a txt file to matrix and cellarray Matlab

I have a txt file with those entries and I would like to know how to get the numerical values from the second column until the last column in a matrix and the first column in a cell array.
I've tried with import data and fscanf and I dont understand what's going on.
CP6 7,2 -2,7 6,6
P5 -5,8 -5,9 5,8
P6 5,8 -5,9 5,8
AF7 -5,0 7,2 3,6
AF8 5,0 7,2 3,6
FT7 -7,6 2,8 3,6
This should give you what you want based on the text sample you supplied.
fileID = fopen('x.txt'); %open file x.txt
m=textscan(fileID,'%s %d ,%d %d ,%d %d ,%d');
fclose(fileID); %close file
col1 = m{1,1}; %get first column into cell array col1
colRest = cell2mat(m(1,2:6)); %convert rest of columns into matrix colRest
Lookup textscan for more info on reading specially formatted data
This function should do the trick. It reads your file and scans it according to your pattern. Then, put the first column in a cell array and the others in a matrix.
function [ C1,A ] = scan_your_txt_file( filename )
fid = fopen(filename,'rt');
C = textscan(fid, '%s %d,%d %d,%d %d,%d');
fclose(fid);
C1 = C{1};
A = cell2mat(C(2:size(C,2)));
end
Have you tried xlsread? It makes a numeric array and two non-numeric arrays.
[N,T,R]=xlsread('yourfilename.txt')
but your data is not comma delimited. It also looks like you are using a comma to represent a decimal point. Does this array have 7 columns or 4? Because I'm in the US, I'm going to assume you have paired coordinates and the comma is one kind of delimiter while the space is a second one.
So here is something klugy, but it works. It is a gross ugly hack, but it works.
%housekeeping
clc
%get name of raw file
d=dir('*22202740*.txt')
%translate from comma-as-decimal to period-as-decimal
fid = fopen(d(1).name,'r') %source
fid2= fopen('myout.txt','w+') %sink
while 1
tline = fgetl(fid); %read
if ~ischar(tline), break, end %end loop
fprintf(fid2,'%s\r\n',strrep(tline,',','.')) %write updated line to output
end
fclose(fid)
fclose(fid2)
%open, gulp, parse/store, close
fid3 = fopen('myout.txt','r');
C=textscan(fid3,'%s %f %f %f ');
fclose(fid3);
%measure waist size and height
[n,m]=size(C);
n=length(C{1});
%put in slightly more friendly form
temp=zeros(n,m);
for i=2:m
t0=C{i};
temp(:,i)=t0;
end
%write to excel
xlswrite('myout_22202740.xlsx',temp(:,2:end),['b1:' char(96+m) num2str(n)]);
xlswrite('myout_22202740.xlsx',C{1},['a1:a' num2str(n)])
%read from excel
[N,T,R]=xlsread('myout_22202740.xlsx')
If you want those commas to be decimal points, then that is a different question.

Reading CSV with mixed type data

I need to read the following csv file in MATLAB:
2009-04-29 01:01:42.000;16271.1;16271.1
2009-04-29 02:01:42.000;2.5;16273.6
2009-04-29 03:01:42.000;2.599609;16276.2
2009-04-29 04:01:42.000;2.5;16278.7
...
I'd like to have three columns:
timestamp;value1;value2
I tried the approaches described here:
Reading date and time from CSV file in MATLAB
modified as:
filename = 'prova.csv';
fid = fopen(filename, 'rt');
a = textscan(fid, '%s %f %f', ...
'Delimiter',';', 'CollectOutput',1);
fclose(fid);
But it returs a 1x2 cell, whose first element is a{1}='ÿþ2', the other are empty.
I had also tried to adapt to my case the answers to these questions:
importing data with time in MATLAB
Read data files with specific format in matlab and convert date to matal serial time
but I didn't succeed.
How can I import that csv file?
EDIT After the answer of #macduff i try to copy-paste in a new file the data reported above and use:
a = textscan(fid, '%s %f %f','Delimiter',';');
and it works.
Unfortunately that didn't solve the problem because I have to process csv files generated automatically, which seems to be the cause of the strange MATLAB behavior.
What about trying:
a = textscan(fid, '%s %f %f','Delimiter',';');
For me I get:
a =
{4x1 cell} [4x1 double] [4x1 double]
So each element of a corresponds to a column in your csv file. Is this what you need?
Thanks!
Seems you're going about it the right way. The example you provide poses no problems here, I get the output you desire. What's in the 1x2 cell?
If I were you I'd try again with a smaller subset of the file, say 10 lines, and see if the output changes. If yes, then try 100 lines, etc., until you find where the 4x1 cell + 4x2 array breaks down into the 1x2 cell. It might be that there's an empty line or a single empty field or whatever, which forces textscan to collect data in an additional level of cells.
Note that 'CollectOutput',1 will collect the last two columns into a single array, so you'll end up with 1 cell array of 4x1 containing strings, and 1 array of 4x2 containing doubles. Is that indeed what you want? Otherwise, see #macduff's post.
I've had to parse large files like this, and I found I didn't like textscan for this job. I just use a basic while loop to parse the file, and I use datevec to extract the timestamp components into a 6-element time vector.
%% Optional: initialize for speed if you have large files
n = 1000 %% <# of rows in file - if known>
timestamp = zeros(n,6);
value1 = zeros(n,1);
value2 = zeros(n,1);
fid = fopen(fname, 'rt');
if fid < 0
error('Error opening file %s\n', fname); % exit point
end
cntr = 0
while true
tline = fgetl(fid); %% get one line
if ~ischar(tline), break; end; % break out of loop at end of file
cntr = cntr + 1;
splitLine = strsplit(tline, ';'); %% split the line on ; delimiters
timestamp(cntr,:) = datevec(splitLine{1}, 'yyyy-mm-dd HH:MM:SS.FFF'); %% using datevec to parse time gives you a standard timestamp vector
value1(cntr) = splitLine{2};
value2(cntr) = splitLine{3};
end
%% Concatenate at the end if you like
result = [timestamp value1 value2];

Matlab, read multiple 2d arrays from a csv file

I have a csv file which contains 2d arrays of 4 columns but a varying number of rows. Eg:
2, 354, 23, 101
3, 1023, 43, 454
1, 5463, 45, 7657
4, 543, 543, 654
3, 56, 7654, 344
...
I need to be able to import the data such that I can run operations on each block of data, however csvread, dlmread and textscan all ignore the blank lines.
I can't seem to find a solution anywhere, how can this be done?
PS:
It may be worth noting that the files of the format above are actually the concatenation of many files containing only one block of data (I don't want to have to read from thousands of files every time) therefore the blank line between blocks can be changed to any other delimiter / marker. This is just done with a python script.
EDIT: My Solution - based upon / inspired by petrichor below
I replaced the csvread with textscan which is faster. Then I realised that if I replaced the blank lines with lines of nan instead (modifying my python script) I could remove the need for a second textscan the slow point. My code is:
filename = 'data.csv';
fid = fopen(filename);
allData = cell2mat(textscan(fid,'%f %f %f %f','delimiter',','));
fclose(fid);
nanLines = find(isnan(allData(:,1)))';
iEnd = (nanLines - (1:length(nanLines)));
iStart = [1 (nanLines(1:end-1) - (0:length(nanLines)-2))];
nRows = iEnd - iStart + 1;
allData(nanLines,:)=[];
data = mat2cell(allData, nRows);
Which evaluates in 0.28s (a file of just of 103000 lines). I've accepted petrichor's solution as it indeed best solves my initial problem.
filename = 'data.txt';
%# Read all the data
allData = csvread(filename);
%# Compute the empty line indices
fid = fopen(filename);
lines = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
blankLines = find(cellfun('isempty', lines{1}))';
%# Find the indices to separate data into cells from the whole matrix
iEnd = [blankLines - (1:length(blankLines)) size(allData,1)];
iStart = [1 (blankLines - (0:length(blankLines)-1))];
nRows = iEnd - iStart + 1;
%# Put the data into cells
data = mat2cell(allData, nRows)
That gives the following for your data:
data =
[3x4 double]
[2x4 double]