I need to read an ASCII data file using MATLAB's fscanf command. The data consists of floating-point numbers with fixed field width and precision. Each row of the data file has 10 columns of numeric values, and the number of rows varies from one file to another. Below is an example of the first line:
0.000 0.000 0.005 0.000 0.010 0.000 0.015 0.000 0.020 -0.000
The field width is 7 and precision is 3.
I have tried:
x = fscanf(fid,'%7.3f\r\n');
x = fscanf(fid,[repmat('%7.3f',1,10) '\r\n']);
but they return nothing!
When I do not specify the field width and precision, for example x = fscanf(fid,'%f');, it reads all the data, but since some values occupy exactly 7 characters (for example 158.000), it joins two consecutive numbers, which results in wrong output. Here is an example:
0.999158.000
it reads this as 0.999158 and .000
Any hint or help will be highly appreciated.
If your data might not be separated by a space (0.999158.000 in the example you made in the question), you could try using textscan to read the file.
Notice that with this format you cannot have an input such as -158.000, because it needs 8 characters and does not fit in a field of width 7.
Since textscan returns a cell array, you might need to convert it into a matrix (if you do not like working with cell arrays).
fp=fopen('input_file_5.txt')
x = textscan(fp,repmat('%7.3f',1,10))
fclose(fp)
m=[x{:}]
Input file
0.999130.000 0.005 0.000 0.010 0.000 0.015 0.000 0.020 -0.000
0.369-30.000123.005 0.000 0.040 0.000 0.315 0.000 0.020-10.000
Output
m =
Columns 1 through 8
0.9990 130.0000 0.0050 0 0.0100 0 0.0150 0
0.3690 -30.0000 123.0050 0 0.0400 0 0.3150 0
Columns 9 through 10
0.0200 0
0.0200 -10.0000
Hope this helps.
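If more manual control over the fields is needed, a fixed-width line can also be cut into 7-character pieces by hand. This is only a minimal sketch, assuming every field occupies exactly 7 characters including its leading padding; the sample text and variable names are made up for illustration.

```matlab
% Hypothetical sample line: three fields, each exactly 7 characters wide
txt = '  0.999158.000  0.005';
width = 7;                                  % field width from the question
% reshape fills column by column, so each column holds one field;
% transposing makes each row one field
fields = cellstr(reshape(txt, width, [])');
vals = str2double(fields)';                 % 1-by-3 numeric row vector
```

In a real file this would run once per line returned by fgetl, and it relies on the line length being a multiple of the field width.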
For reading ASCII text files with well defined input as specified in the question you should use the dlmread function.
>> X = dlmread(filename, delimiter);
will read numeric data from filename that is delimited (along the same row) with delimiter into the matrix X. For your case you can use
>> X = dlmread(filename, ' ');
as your data is delimited by a space, ' '.
I need to write a string and a table to one text file. The string contains a description of the data in the table, but is not the headers for the table. I am using R2019a which I guess means the "Append" writetable option does not work? Please see my example data below:
% Load data
load cereal.mat;
table = table(Calories, Carbo, Cups, Fat, Fiber, Mfg, Name, Potass);
string = {'This is a string about the cereal table'};
filename = "dummyoutput.sfc";
% How I tried to do this (which does not work!)
fid = fopen(filename, 'w', 'n');
fprintf(fid, '%s', cell2mat(string))
fclose(fid);
writetable(table, filename, 'FileType', 'text', 'WriteVariableNames', 0, 'Delimiter', 'tab', 'WriteMode', 'Append')
I get this error:
Error using writetable (line 155)
Wrong number of arguments.
Does anyone have a suggestion as to how to proceed?
Thanks!
A bit hacky, but here's an idea.
Convert your existing table to a cell array with table2cell.
Prepend a row of cells which consists of your string, followed by empty cells.
Convert the cell array back to a table with cell2table, and write the new table to the file.
load cereal.mat;
% Use T rather than "table" so the built-in table function is not shadowed
T = table(Calories, Carbo, Cups, Fat, Fiber, Mfg, Name, Potass);
s = {'This is a string about the cereal table'};
filename = "dummyoutput.sfc";
% First row: the description followed by empty cells; then the table contents
new_table = cell2table([[s repmat({''},1,size(T,2)-1)] ; table2cell(T)]);
writetable(new_table,filename,'FileType','text','WriteVariableNames',0,'Delimiter','tab');
>> !head dummyoutput.sfc
This is a string about the cereal table
70 5 0.33 1 10 N 100% Bran 280
120 8 -1 5 2 Q 100% Natural Bran 135
70 7 0.33 1 9 K All-Bran 320
50 8 0.5 0 14 K All-Bran with Extra Fiber 330
110 14 0.75 2 1 R Almond Delight -1
110 10.5 0.75 2 1.5 G Apple Cinnamon Cheerios 70
110 11 1 0 1 K Apple Jacks 30
130 18 0.75 2 2 G Basic 4 100
90 15 0.67 1 4 R Bran Chex 125
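Another option on releases without 'WriteMode' might be to let writetable produce the table text in a temporary file and then write the description and that text into the final file with fprintf. A minimal sketch, assuming a small stand-in table instead of cereal.mat; the variable names are illustrative only.

```matlab
% Small stand-in table (the real data would come from cereal.mat)
T = table([70; 120], {'100% Bran'; '100% Natural Bran'}, ...
    'VariableNames', {'Calories', 'Name'});
desc = 'This is a string about the cereal table';
filename = 'dummyoutput.sfc';

% Let writetable format the table into a temporary text file
tmpfile = [tempname '.txt'];
writetable(T, tmpfile, 'FileType', 'text', ...
    'WriteVariableNames', false, 'Delimiter', 'tab');
body = fileread(tmpfile);
delete(tmpfile);

% Write the description first, then the table text
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', desc);
fprintf(fid, '%s', body);   % '%s' keeps any '%' in the data literal
fclose(fid);
```

This keeps the description on a line of its own without padding it with empty tab-separated cells.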
I'm writing some larger (~500MB - 3GB) pieces of binary data in MATLAB using the fwrite command.
I want the data to be written in a tabular format so I'm using the skip parameter. E.g. I have 2 vectors of uint8 values a = [ 1 2 3 4]; b = [5 6 7 8]. I want the binary file to look like this 1 5 2 6 3 7 4 8
So in my code I do something similar to this (my data is more complex)
fwrite(f,a,'1*uint8',1);
fseek(f,2)
fwrite(f,b,'1*uint8',1);
But the writes are painfully slow ( 2MB/s ).
I ran the following block of code, and when I passed in a skip count of 1 the write was approximately 300x slower.
>> f = fopen('testfile.bin', 'w');
>> d = uint8(1:500e6);
>> tic; fwrite(f,d,'1*uint8',1); toc
Elapsed time is 58.759686 seconds.
>> tic; fwrite(f,d,'1*uint8',0); toc
Elapsed time is 0.200684 seconds.
>> 58.759686/0.200684
ans =
292.7971
I could understand a 2x or 4x slowdown, since you have to traverse twice as many bytes with the skip parameter set to 1, but 300x makes me think I'm doing something wrong.
Has anyone encountered this before? Is there a way to speed up this write?
Thanks!
UPDATE
I wrote the following function to format arbitrary data sets. Write speed is vastly improved (~300MB/s) for large data sets.
%
% data: A cell array of matrices. Matrices can be composed of any
% non-complex numeric data. Each entry in data is considered
% to be an independent column in the data file. Rows are indexed
% by the last dimension of each numeric matrix, hence the count of elements
% in the last dimension of every matrix must match.
%
% e.g.
% size(data{1}) == [1,5]
% size(data{2}) == [4,5]
% size(data{3}) == [3,2,5]
%
% The data variable has 3 columns and 5 rows. Column 1 is made of scalar values
% Column 2 is made of vectors of length 4. And column 3 is made of 3 x 2
% matrices
%
%
% returns buffer: a N x M matrix of bytes where N is the number of bytes
% of each row of data, and M is the number of rows of data.
function [buffer] = makeTabularDataBuffer(data)
dataTypes = {};
dataTypesLengthBytes = [];
rowElementCounts = []; %the number of elements in each "row"
rowCount = [];
%figure out properties of tabular data
for idx = 1:length(data)
cDat = data{idx};
dimSize = size(cDat);
%ensure each column has the same number of rows.
if isempty(rowCount)
rowCount = dimSize(end);
else
if dimSize(end) ~= rowCount
throw(MException('e:e', sprintf('data column %d does not have the required number of rows (%d)\n',idx,rowCount)));
end
end
dataTypes{idx} = class(data{idx});
dataTypesLengthBytes(idx) = numel(typecast(cast(1, dataTypes{idx}),'uint8')); %bytes per element, without eval
rowElementCounts(idx) = prod(dimSize(1:end-1));
end
rowLengthBytes = sum(rowElementCounts .* dataTypesLengthBytes);
buffer = zeros(rowLengthBytes, rowCount,'uint8'); %rows of the dataset map to column in the buffer matrix because fwrite writes columnwise
bufferRowStartIdxs = cumsum([1 dataTypesLengthBytes .* rowElementCounts]);
%load data 1 column at a time into the buffer
for idx = 1:length(data)
cDat = data{idx};
columnWidthBytes = dataTypesLengthBytes(idx)*rowElementCounts(idx);
cRowIdxs = bufferRowStartIdxs(idx):(bufferRowStartIdxs(idx+1)-1);
buffer(cRowIdxs,:) = reshape(typecast(cDat(:),'uint8'),columnWidthBytes,[]);
end
end
I've done some very limited testing of the function but it appears to be working as expected. The returned buffer matrix can then be passed to fwrite without the skip argument and fwrite will write the buffer in column major order.
dat = {};
dat{1} = uint16([1 2 3 4]);
dat{2} = uint16([5 6 7 8]);
dat{3} = double([9 10 ; 11 12; 13 14; 15 16])';
buffer = makeTabularDataBuffer(dat)
buffer =
20×4 uint8 matrix
1 2 3 4
0 0 0 0
5 6 7 8
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
34 38 42 46
64 64 64 64
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
36 40 44 48
64 64 64 64
For best I/O performance, use sequential writes, and avoid skipping.
Reorder the data in the RAM before saving to file.
Reordering the data in RAM is on the order of 100 times faster than reordering data on disk.
I/O operations and storage devices are optimized for sequential writes of large data chunks (optimized both in hardware and in software).
On mechanical drives (HDD), writing data with skipping may take a very long time, because the drive's mechanical head must move (the OS usually mitigates this with a memory buffer, but in principle it takes a long time).
With SSD, there is no mechanical seeking, but sequential writes are still much faster. Read the following post Sequential vs Random I/O on SSDs? for some explanation.
Example for reordering data in RAM:
a = uint8([1 2 3 4]);
b = uint8([5 6 7 8]);
% Allocate memory space for reordered elements (use uint8 type to save RAM).
c = zeros(1, length(a) + length(b), 'uint8');
% Reorder a and b in RAM.
c(1:2:end) = a;
c(2:2:end) = b;
% Open the output file and write array c sequentially
f = fopen('testfile.bin', 'w');
fwrite(f, c, 'uint8');
fclose(f);
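The same interleaving can also be obtained without index arithmetic, because fwrite writes a matrix in column-major order. A minimal sketch, assuming a and b have equal length; the file name is a placeholder.

```matlab
a = uint8([1 2 3 4]);
b = uint8([5 6 7 8]);
% Stack the vectors as rows; column-major order then alternates a and b
c = [a; b];
f = fopen('interleaved.bin', 'w');
fwrite(f, c, 'uint8');      % bytes on disk: 1 5 2 6 3 7 4 8
fclose(f);
```

Stacking more vectors as additional rows extends this to more than two columns of the same type.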
Time measurements in my machine:
Writing file to SSD:
Elapsed time is 56.363397 seconds.
Elapsed time is 0.280049 seconds.
Writing file to HDD:
Elapsed time is 56.063186 seconds.
Elapsed time is 0.522933 seconds.
Reordering d in RAM:
Elapsed time is 0.965358 seconds.
Why 300x slower and not 4x?
I am guessing the software implementation of writing data with skipping is not optimized for best performance.
According to the following post:
fseek() or fflush() require the library to commit buffered operations.
Daniel's guess (in the comment) is probably correct.
"The skip causes MATLAB to flush after each byte."
Skipping is probably implemented using fseek(), and fseek() forces flushing data to disk.
It could explain why writing with skipping is painfully slow.
If we take for example a vector of one line
>>m = linspace(0,100,11)
>>J = exp(m.^0.25)
we get
J =
Columns 1 through 4
1.0000 5.9197 8.2875 10.3848
Columns 5 through 8
12.3650 14.2841 16.1700 18.0385
Columns 9 through 11
19.8996 21.7599 23.6243
We get the right result in the first entry of the output matrix which is e^(0^0.25) = e^0 = 1
But if we take
>> J = exp(m.^2.5)
We get
J =
1.0e+137 *
Columns 1 through 4
0.0000 2.1676 Inf Inf
Columns 5 through 8
Inf Inf Inf Inf
Columns 9 through 11
Inf Inf Inf
But e^(0^2.5) = e^0 = 1
I have not used MATLAB for a long time and I don't have a good idea of how this works. I first thought it could be round-off, truncation, or both. I looked up the operation and some documentation on the formats, and found that the right result is shown within the vector when using format longE:
>>format longE
which returns 1.000000000000000e+00
But then I checked the entry that shows 0 in the first matrix (default format short) by using
>>J(1)
And it returned 1.
So the value in that entry is correct, but it is displayed as 0, with a factor of 1.0e+137 * outside the matrix.
I don't get what is happening. Why does it show a 0?
When displaying a matrix in the default format short (the same applies to format long), MATLAB chooses a common factor that is suitable for all entries. An example for the purpose of explanation:
k=10.^[1:10]
k =
1.0e+10 *
0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0100 0.1000 1.0000
The first entry is 10, but because of the factor of 10 000 000 000 it is displayed as 0.0000. When you instead type k(1), MATLAB will choose a format suitable for that number:
>> k(1)
ans =
10
The standard output is "usually" good when all numbers are of similar magnitude. A workaround is to use mat2str.
>> mat2str([pi,10.^[1:20]])
ans =
'[3.14159265358979 10 100 1000 10000 100000 1000000 10000000 100000000 1000000000 10000000000 100000000000 1000000000000 10000000000000 100000000000000 1e+15 1e+16 1e+17 1e+18 1e+19 1e+20]'
It displays up to 15 digits, which is usually enough, but 17 digits would be required to display a double in full accuracy.
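mat2str accepts an optional second argument giving the number of significant digits, so the 17-digit case can be requested directly. A small sketch with made-up values, using eval only to check the round trip:

```matlab
x = [pi, 10, 1e20];
s = mat2str(x, 17);     % request 17 significant digits
y = eval(s);            % parse the string back into a matrix
same = isequal(x, y);   % true when every value survives the round trip
```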
The numerical output format in MATLAB's Command Window is under user control which you can change from the MATLAB Command Window Preferences.
What you are observing is the default behavior.
>> k=10.^[1:10];
>> k
k =
1.0e+10 *
0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0100 0.1000 1.0000
However, you can change the output format to use the floating point format:
>> format short e
>> k
k =
1.0000e+01 1.0000e+02 1.0000e+03 1.0000e+04 1.0000e+05 1.0000e+06
1.0000e+07 1.0000e+08 1.0000e+09 1.0000e+10
You can also change it to use the "best of fixed or floating point format":
>> format short g
>> k
k =
10 100 1000 10000 1e+05 1e+06 1e+07 1e+08 1e+09 1e+10
See the format command for other available options. Explicitly printing your variables (using the mat2str command etc.) to see their full precision is not necessary, nor is it used by most MATLAB users. If you really want full precision you can use format long e or format long g.
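To check that only the display is affected and not the stored values, the entries can be inspected individually or formatted with sprintf, which does not apply a common scale factor. A minimal sketch using the vector from the question; the variable names first and txt are illustrative.

```matlab
m = linspace(0, 100, 11);
J = exp(m.^2.5);
first = J(1);                    % stored value, independent of the display format
txt = sprintf('%g ', J(1:3));    % each value is formatted on its own scale
disp(txt)
```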
I have two vectors:
The first has many values between 0 and 1.
The second vector defines 100 intervals between 0 and 1: [0 0.01 0.02 ... 1], where 0 to 0.01 is the first interval, 0.01 to 0.02 the second, and so on.
I need to create a vector where each element is the number of elements of the first vector that fall into the corresponding interval of the second.
For example:
first [0.00025 0.0001 0.0011 0.0025 0.009 ...(a lot of values bigger than 0.01) ... 1]
then the first element of the result vector should be 5, and so on.
Any ideas how to implement this in matlab?
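One possible approach is histcounts, which counts how many values fall between consecutive edges. This is a minimal sketch with a shortened, made-up first vector, assuming a release that provides histcounts (older releases offer histc instead).

```matlab
% Shortened stand-in for the first vector
x = [0.00025 0.0001 0.0011 0.0025 0.009 0.55 1];
edges = linspace(0, 1, 101);      % 101 edges define 100 intervals
counts = histcounts(x, edges);    % counts(k) = number of values in interval k
```

Each interval includes its left edge and excludes its right edge, except the last one, which includes both, so a value of exactly 1 is counted in the 100th interval.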
How can I extract the "Mean" and "Depth" data for each month from a file like the following?
MEAN, S.D., NO. OF OBSERVATIONS
January February ...
Depth Mean S.D. #Obs Mean S.D. #Obs ...
0 32.92 0.43 9 32.95 0.32 21
10 32.92 0.43 14 33.06 0.37 48
20 32.88 0.46 10 33.06 0.37 50
30 32.90 0.51 9 33.12 0.35 48
50 33.05 0.54 6 33.20 0.42 41
75 33.70 1.11 7 33.53 0.67 37
100 34.77 1 34.47 0.42 10
150
200
July August
Depth Mean S.D. #Obs Mean S.D. #Obs
0 32.76 0.45 18 32.75 0.80 73
10 32.76 0.40 23 32.65 0.92 130
20 32.98 0.53 24 32.84 0.84 121
30 32.99 0.50 24 32.93 0.59 120
50 33.21 0.48 16 33.05 0.47 109
75 33.70 0.77 10 33.41 0.73 80
100 34.72 0.54 3 34.83 0.62 20
150 34.69 1
200
It has a variable number of spaces between the data, and an introduction line at the beginning.
Thank you!
Here is an example of how to read a file line by line:
fid = fopen('yourfile.txt');
tline = fgetl(fid);
while ischar(tline)
disp(tline)
tline = fgetl(fid);
end
fclose(fid);
Inside the while loop you'll want to use strtok (or something like it) to break up each line into string tokens delimited by spaces.
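As a rough illustration of that splitting step, the sketch below uses strsplit in place of strtok on one hypothetical data line; the variable names are assumptions, and rows with missing values would need extra handling because the fixed stride no longer lines up.

```matlab
% Hypothetical data line with irregular spacing
tline = '  0    32.92  0.43   9   32.95  0.32  21';
tokens = strsplit(strtrim(tline));   % split on runs of whitespace
nums = str2double(tokens);           % non-numeric tokens become NaN
depth = nums(1);                     % first column is the depth
means = nums(2:3:end);               % Mean column of each month
```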
MATLAB's regexp is powerful for pulling data out of less-structured text. It's really worth getting familiar with regular expressions in general: http://www.mathworks.com/help/techdoc/ref/regexp.html
In this case, you would define the pattern to capture each observation group (Mean SD Obs), e.g.: 32.92 0.43 9
Here I see a pattern for each group of data: each group is preceded by 6 spaces (regular expression = \s{6}), and the 3 data points are separated by fewer than 6 spaces (\s+). The data itself consists of two floats (\d+\.\d+) and one integer (\d+):
So, putting this together, your capture pattern would look something like this (the brackets surround the pattern of data to capture):
expr = '\s{6}(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+)';
We can add names for each token (i.e. each data point to capture in the group) by adding '?<name>' inside the brackets:
expr = '\s{6}(?<mean>\d+\.\d+)\s+(?<sd>\d+\.\d+)\s+(?<obs>\d+)';
Then, just read your file into one string variable 'strFile' and extract the data using this defined pattern:
strFile = fileread('mydata.txt'); % read the whole file into one string
[tokens, data] = regexp(strFile, expr, 'tokens', 'names');
The variable 'tokens' will then contain a sequence of observation groups and 'data' is a structure with fields .mean .sd and .obs (because these are the token names in 'expr').
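To see what the named tokens look like, the pattern can be tried on a single line first. A minimal sketch, assuming a made-up line in which each group really is preceded by exactly 6 spaces:

```matlab
expr = '\s{6}(?<mean>\d+\.\d+)\s+(?<sd>\d+\.\d+)\s+(?<obs>\d+)';
% Hypothetical line: depth, then two groups each preceded by 6 spaces
str = '0      32.92 0.43 9      32.95 0.32 21';
data = regexp(str, expr, 'names');   % one struct element per matched group
means = str2double({data.mean});     % convert the captured text to numbers
obs = str2double({data.obs});
```

The captured values are text, so they have to be converted with str2double before doing arithmetic on them.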
If you just want to get, for example, the first two columns, then textscan() is a great choice.
fid = fopen('yourfile.txt');
tline = fgetl(fid);
while ischar(tline)
    oneCell = textscan(tline, '%n'); % read the whole line, put it into a cell
    allTheNums = oneCell{1}; % open up the cell to get at the columns
    if numel(allTheNums) >= 2 % skip header lines and lines with too few numbers
        usefulNums = allTheNums(1:2) % get the first two columns
    end
    tline = fgetl(fid); % read the next line so the loop can terminate
end
fclose(fid);
textscan automatically splits the strings you feed it where there is whitespace, so the undefined number of spaces between columns isn't an issue. A string with no numbers will give an array that you can test as empty to avoid out-of-bounds or bad data errors.
If you need to programmatically figure out which columns to get, you can scan for the words 'Depth' and 'Mean' to find the indices. Regular expressions might be helpful here, but textscan should work fine too.