I have a text file that has 200 rows, and every row contains 200 values. The file consists of integers that are not separated by any delimiter, not even a space. Here is an example:
1111111111111111111111111111111111111111122222222222222222222222222220000111
1111111111111111111111111111100000000003123333333333333333333333333333300002
0000000000022222222222222222222222222222222211111121212222222222222222111111
The file may contain some strings at the beginning, but I want to read only the numbers. I want to count the occurrence of every integer, so I will read all these numbers into a vector or matrix where every element is a number from the file; the vector must then contain 200 * 200 elements. Then I will calculate the occurrence of every element.
I checked available file reading methods like textscan, but I think that textscan with the format C = textscan(fid,'%d %d'); requires specifying %d 200 * 200 times. Is this the case, or is there a way to use textscan here?
I also tried importdata, but when I tried to print the result I didn't get the numeric values. It seems that it only reads the first row, because of this line: 200x1 double. Here is the output:
A =
data: [200x1 double]
textdata: {6x1 cell}
colheaders: {[1x107 char]}
Can you please tell me what method I can use to read the file described above?
importdata imports only the double values and the headers, which is what you see in your output. You could use the readtable function as follows (I assume 1 header line):
datafile='test.txt';
headerlines=1;
%OPTION1
A=readtable(datafile); %from Matlab R2013b
AA=cell2mat(table2array(A(headerlines+1:end,:)));
%OPTION2
A=textread(datafile,'%s'); %from Matlab R2006a
AA=cell2mat(A(headerlines+1:end,:));
%PROCESSING
b=zeros(size(AA));
for k=1:size(AA,1)
b(k,:)=str2double(regexp(AA(k,:),'\d','match'));
end
%COUNTING
[nelements,centers]=hist(b',0:9);
The regular expression does the trick of getting out the numbers to columns:
regexp('01112345640','\d','match')
This should return a 1x11 cell with the numbers in char-format.
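If the digit rows are the only content after the headers, a shorter sketch reads the whole file at once and counts digits directly (assuming the file name 'test.txt' and that the header text contains no digit characters; otherwise strip the header lines first):

```matlab
% Minimal sketch: read everything, keep only digit characters, count each digit.
s = fileread('test.txt');            % whole file as one char array
d = s(s >= '0' & s <= '9') - '0';    % digit characters -> numeric values 0..9
counts = histc(d, 0:9);              % occurrences of each digit 0 through 9
```

This avoids any per-row loop; counts(k+1) then holds the number of times digit k appears in the file.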
A simple approach:
Each integer is a separate number in the desired output, so read the data in line by line as a string, then do a loop:
for j = 1:numel(a_line_of_integers)
x(j) = str2num(a_line_of_integers(j));
end
And repeat for every row you read in. Note in passing that if you switch to R, x = as.numeric(strsplit(a_line_of_integers, '')[[1]]) is much faster and easier.
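Since the line is already a char array in MATLAB, the character-by-character loop can also be replaced by a single vectorized subtraction (a sketch, relying on the fact that the characters '0' through '9' have consecutive character codes):

```matlab
a_line_of_integers = '111222000';   % example line read from the file
x = a_line_of_integers - '0';       % char codes minus '0' -> [1 1 1 2 2 2 0 0 0]
```

This is much faster than calling str2num once per character.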
I have two .csv files which I am trying to read into Matlab as numeric matrices. The first, list_a, simply has two columns of ID numbers and corresponding values (approx. 50000 lines) with a ',' delimiter. list_b has 6 columns with a ';' delimiter. I am only interested in the first two columns containing numbers; the other columns contain text that I don't care about.
I initially tried using the readtable function in Matlab but noticed that these values aren't stored as numeric values, which is a requirement I have. I couldn't figure out how to cast these as integers after reading them either.
For list_a I have used the dlmread function, which I believe reads the file as numeric values.
For list_b I have tried using the dlmread function in which row and column offsets can be specified (https://www.mathworks.com/help/matlab/ref/dlmread.html#d117e329603) - the problem here is however, that the length of the file could change in the future, so I'm not sure what to enter for the row offsets.
I'm also not sure I understand how this function works, considering I tried testing it for the first 1000 rows as follows:
csv_matrix = dlmread(csv_fullpath,';',[1 1 1000 2]);
and subsequently got the following error message - even though "field number 3" shouldn't even be included in the first place:
Error using dlmread (line 147)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 1, field number 3) ==>
RandomTextInFile\n Error in Damage_List_Reader (line 15)
csv_matrix = dlmread(csv_fullpath,';',[1 1 1000 3]);
I get the impression that I'm making this problem a lot harder than it needs to be so if there's an all around better way to do this, I'm all ears.. Thanks!
I would suggest using fopen in combination with textscan (e.g. for list_a) like this:
file = fopen('list_a.csv');
out = textscan(file, '%d%f', 'delimiter', ',');
ID = out{1};
Vals = out{2};
'%d%f' specifies the FormatSpec, i.e. how the data is formatted in the file. With this, you can capture any data from a csv file (and also omit data). I recommend reading the textscan Matlab doc for further formatting options.
P.S.: I think you can put an "end" (without the quotations) instead of one of the row offset values if the number of rows/cols isn't fixed.
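For list_b, the same textscan approach sidesteps the row-count problem entirely, since textscan simply reads until the end of the file. A sketch, assuming one header line, that the first two columns are numeric, and that the remaining four columns are text (skipped with %*s):

```matlab
% Sketch for list_b: ';' delimiter, keep two numeric columns, skip the rest.
file = fopen('list_b.csv');
out = textscan(file, '%d%f%*s%*s%*s%*s', 'delimiter', ';', 'HeaderLines', 1);
fclose(file);
ID = out{1};
Vals = out{2};
```

Because no row range is specified, this keeps working even if the file grows or shrinks in the future.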
I have a fixed width file format (original was input for a Fortran routine). Several lines of the file look like the below:
1078.0711005.481 932.978 861.159 788.103 716.076
How this actually should read:
1078.071 1005.481 932.978 861.159 788.103 716.076
I have tried various methods: textscan, fgetl, fscanf, etc. However, the problem I have is, as seen above, that because of the fixed width of the original files there is sometimes no whitespace between some of the numbers. I can't seem to find a way to read them directly, and I can't change the original format.
The best I have come up with so far is to use fgetl, which reads the whole line in; then I reshape the result into an 8-by-6 array:
A = fgetl(fid);
A = reshape(A, 8, 6)
which generates the following result
11
009877
703681
852186
......
049110
787507
118936
So now I have the above and thought I might be able to concatenate the rows of that array together to form each number, although that is proving difficult as well; I have tried strcat, vertcat, etc.
All of that seems a long way round so was hoping for some better suggestions.
Thanks.
If you can rely on three decimal numbers you can use a simple regular expression to generate the missing blanks:
s = '1078.0711005.481 932.978 861.159 788.103 716.076';
s = regexprep(s, '(\.\d\d\d)', '$1 ');
c = textscan(s, '%f');
Now c{1} contains your numbers. This will also work if s is in fact the whole file instead of one line.
You haven't mentioned which class of output you need, but I guess you need to read doubles from the file to do some calculations. I assume you are able to read your file, since you already have results from the reshape() function. However, using reshape() will not be reliable for your case, since your numbers are not all the same width (e.g. 1078.071 and 932.978).
If I didn't misunderstand your problem:
Your data is squashed in some parts (e.g. 1078.0711005.481 instead of 1078.071 1005.481).
The fractional parts of the numbers have 3 digits.
First of all we need to get rid of spaces from the string array:
A = A(~ismember(A,' '));
Then using the information that fractional parts are 3 digits:
iter = length(strfind(A, '.'));
B = zeros(1, iter); % preallocate the output
for k = 1:iter
[stat, ind] = ismember('.', A);
B(k) = str2double(A(1:ind+3));
A = A(ind+4:end);
end
B will be an array of doubles as a result.
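Alternatively, since the format is fixed width, each field occupies exactly 8 characters, so the reshape idea from the question can be completed directly (a sketch, assuming 8-character fields with narrower numbers padded by spaces):

```matlab
% Sketch: split a fixed-width line into 8-character fields and convert.
A = '1078.0711005.481 932.978 861.159 788.103 716.076';
fields = cellstr(reshape(A, 8, []).');  % 6x1 cell, one 8-char field per row
vals = str2double(fields);              % 6x1 double: 1078.071, 1005.481, ...
```

The transpose after reshape is what turns the column-major character blocks back into whole numbers.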
I have an input file having the following basic structure:
master header line(s)
block 1 header line(s)
... [m' x n] numerical matrix ...
block 2 header line(s)
... [m'' x n] numerical matrix ...
...
block N header line(s)
... [m(N) x n] numerical matrix ...
where n is constant, but m may assume different values (as indicated by the prime marks).
I am wondering if there is a simple way to load data of this organization into a cell array (or another structure of some kind) having the following form: each block of data (as defined by the header) is represented by a cell in a cell array, the contents of which are the numerical data in the form of a double array. To concretize that description, the desired MATLAB representation would appear as follows: cell{1} contains a double array containing the numerical data listed under the block 1 header; cell{2} contains a double array containing the numerical data listed under the block 2 header; etc.
Of course, there are simple alternatives, such as splitting the input file into individual block-specific files and sequentially reading each file into an element of a cell array via a loop statement, but I am interested to know whether there is a solution that does not require such manipulation.
I've had to do something similar. One way, as you say, is to divide into files. But really, since your file has a set structure:
1 - open the file
2 - read the first line (e.g. using fgetl)
3 - read the block header (e.g. using fgetl)
4 - read the next M rows (e.g. using fgetl, fread, etc.) and store them as a matrix
5 - loop back to 3, except at eof.
(apologies for the pseudocode, I don't have access to Matlab on this computer)
Yes, this is still manipulation of the file, but it becomes extendable to when the file isn't as ordered as the example you gave (which is the case I have), and is extremely easy to read and debug. However, it will be slow if your file is hundreds of MBs.
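The pseudocode above might look like the following in MATLAB (a sketch, assuming a file named 'blocks.txt' with one master header line, whitespace-delimited numeric rows, and header lines that do not begin with a number; a line that fails to parse as numbers is treated as the next block header):

```matlab
% Sketch: read header-delimited blocks into a cell array of matrices.
fid = fopen('blocks.txt');
fgetl(fid);                        % step 2: skip the master header line
blocks = {};
current = [];
while true
    line = fgetl(fid);
    if ~ischar(line)               % step 5: fgetl returns -1 at end of file
        if ~isempty(current), blocks{end+1} = current; end
        break
    end
    row = sscanf(line, '%f').';    % step 4: try to parse a numeric row
    if isempty(row)                % step 3: header line -> start a new block
        if ~isempty(current), blocks{end+1} = current; end
        current = [];
    else
        current = [current; row];  % append the row to the current block
    end
end
fclose(fid);
```

Afterwards blocks{k} holds the [m(k) x n] double matrix for block k, which is exactly the cell-array representation described in the question.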
I have a structured data file consisting of header lines interspersed with blocks of data. I am reading each block of data (as defined by the header line) into a separate cell of a cell array. For instance, suppose that after loading the data with textscan, I have a cell array x and an array of indices of header lines and EOF (headerIdx) of the following form:
x={'header line 1';'98.78743';'99.39717';'99.93578';'100.40125';'100.79166';'101.10525';'101.34037';'101.49553';'101.56939';'101.56072';'101.4685';'101.29184';'101.03002';'100.68249';'header line 2';'100.24887';'99.72897';'99.12274';'98.43036';'97.65215';'96.78864';'95.84054';'header line 3';'3.2';'4.31';'2.7';'4.6';'9.3'};
headerIdx=[1;16;24;30];
I then attempt to extract each block of data below a header line into a separate element of a cell array using sscanf and str2mat (as suggested by this post). Initially, this approach failed because the elements within a given block of data were of different length. This can be solved by including a numerical flag for the '%f' argument to help sscanf know where to delimit the input data (as suggested by this post). One can then use a strategy such as the following to effect the conversion of structured data to a cell array of block-specific double arrays:
extract_data = @(n) sscanf(str2mat(x(headerIdx(n)+1:headerIdx(n+1)-1)).',['%' num2str(size(str2mat(x(headerIdx(n)+1:headerIdx(n+1)-1)).',1)) 'f']);
extracted_data = arrayfun(extract_data,1:numel(headerIdx)-1,'UniformOutput',false);
The numerical flag of the format string can either be set to something arbitrarily large to encompass all the data, or can be set on a block-specific basis as I have done in the example above. The latter approach leads to redundant evaluation of str2mat (once for the input to sscanf and once for the input to the '%f' string generator). Can this redundancy be avoided without using loop statements that store the output of the str2mat command in a temporary variable? Note that one cannot simply take the output of the size command applied to the output of str2mat(x).' on the entire data set because the header lines are generally going to be the lines with the greatest number of characters.
Finally, I have constructed the x matrix above to reflect the fact that some blocks of data may have different precision than other blocks. This is the reason to set the format string in a block-specific manner. My testing has shown that despite accurate construction of a block-specific format string (['%' num2str(size(str2mat(x(headerIdx(n)+1:headerIdx(n+1)-1)).',1)) 'f']), the data in all elements of the resulting cell array (extracted_data) are ultimately forced to have the same precision (see below). Why is this the case, and how can it be corrected?
extracted_data{:}
ans =
98.7874
99.3972
99.9358
100.4013
100.7917
101.1052
101.3404
101.4955
101.5694
101.5607
101.4685
101.2918
101.0300
100.6825
ans =
100.2489
99.7290
99.1227
98.4304
97.6522
96.7886
95.8405
ans =
3.2000
4.3100
2.7000
4.6000
9.3000
I'm a Mac user (10.6.8) using MATLAB to process calculation results. I output large tables of numbers to .csv files. I then use the .csv files in EXCEL. This all works fine.
The problem is that each column of numbers needs a label (a string header). I can't figure out how to concatenate labels to the table of numbers. I would very much appreciate any advice. Here is some further information that might be useful:
My labels are contained within a cell array:
columnsHeader = cell(1,15)
that I fill in with calculation results; for example:
columnsHeader{1} = propertyStringOne (where propertyStringOne = 'Liq')
The sequence of labels is different for each calculation. My first attempt was to try and concatenate the labels directly:
labelledNumbersTable=cat(1,columnsHeader,numbersTable)
I received an error that concatenated types need to be the same. So I tried converting the labels/strings using cell2mat:
columnsHeader = cell2mat(columnsHeader);
labelledNumbersTable = cat(1,columnsHeader,numbersTable)
But that took ALL the separate labels and made them into one long string, which leads to:
??? Error using ==> cat
CAT arguments dimensions are not consistent.
Does anyone know of an alternative method that would allow me to keep my original cell array of labels?
You will have to handle writing the column headers and the numeric data to the file in two different ways. Outputting your cell array of strings will have to be done using the FPRINTF function, as described in this documentation for exporting cell arrays to text files. You can then output your numeric data by appending it to the file (which already contains the column headers) using the function DLMWRITE. Here's an example:
fid = fopen('myfile.csv','w'); %# Open the file
fprintf(fid,'%s,',columnsHeader{1:end-1}); %# Write all but the last label
fprintf(fid,'%s\n',columnsHeader{end}); %# Write the last label and a newline
fclose(fid); %# Close the file
dlmwrite('myfile.csv',numbersTable,'-append'); %# Append your numeric data
The solution to the problem is already shown by others. I am sharing a slightly different solution that improves performance especially when trying to export large datasets as CSV files.
Instead of using DLMWRITE to write the numeric data (which internally uses a for-loop over each row of the matrix), you can directly call FPRINTF to write the whole thing at once. You can see a significant improvement if the data has many rows.
Example to illustrate the difference:
%# some random data with column headers
M = rand(100000,5); %# 100K rows, 5 cols
H = strtrim(cellstr( num2str((1:size(M,2))','Col%d') )); %# headers
%# FPRINTF
tic
fid = fopen('a.csv','w');
fprintf(fid,'%s,',H{1:end-1});
fprintf(fid,'%s\n',H{end});
fprintf(fid, [repmat('%.5g,',1,size(M,2)-1) '%.5g\n'], M'); %# default precision = 5
fclose(fid);
toc
%# DLMWRITE
tic
fid = fopen('b.csv','w');
fprintf(fid,'%s,',H{1:end-1});
fprintf(fid,'%s\n',H{end});
fclose(fid);
dlmwrite('b.csv', M, '-append');
toc
The timings on my machine were as follows:
Elapsed time is 0.786070 seconds. %# FPRINTF
Elapsed time is 6.285136 seconds. %# DLMWRITE