Matlab to read in fixed-width text file

I have a text file like below:
TestData
6.84 11.31 17.51 22.62 26.91 31.98 36.47 35.85 28.47 20.57 10.50 6.37 test1
0.24 2.62 4.94 7.17 10.39 15.37 18.73 18.29 12.26 6.46 1.15 -0.33 test2
68.47 95.04156.07218.39304.31320.22311.69269.22203.01135.60 68.18 55.09 test3
68.47 95.04156.07218.39304.31320.22311.69269.22203.01135.60 68.18 55.09 test4
...
As you can see, the first two lines are comments to ignore. Each of the following lines also ends with a comment. Each number is written with the format %6f. There are also blank lines in between.
I want to read all the numbers into a matrix to make plots. I tried textscan, but had trouble ignoring the last column, skipping the blank lines, and reading numbers that run together (e.g., some of the numbers in the line ending in test4).
Here is the code I have by now:
data=dir('*.txt');
formatspecific='%6f%6f%6f%6f%6f%6f%6f%6f%6f%6f%6f%6f';
for i=1:length(data);
TestData1=data(i).name;
tempData=textscan(TestData1,formatspecific,'HeaderLines',2);
end
Can anybody help with sample code that improves the textscan part?

To read a file with textscan, you have to open it before calling textscan and close it afterwards. Use:
fopen to open the input file
fclose to close the input file
textscan returns a cell array with the content read from the input file. Since you are reading more than one file, you should change how you handle that cell array: as your code stands, the data are overwritten at each iteration.
One possibility is to store the data in an array of structs with, for example, two fields: the name of the input file and the data.
Another possibility is to build a struct in which each field contains the data read from one input file; the field names can be generated automatically.
A third possibility is to store them in a matrix.
Below is a script in which these three alternatives are implemented.
Code Updated (following the comment received)
In order to be able to correctly read data such as 95.04156.07 as 95.04 156.07, the format specifier should be modified from %6f to %6.2f
% Get the list of input data
data=dir('input_file*.txt');
% Define the number of data columns
n_data_col=12;
% Define the number of header lines
n_header=2;
% Build the format specifier string
% OLD format specifier
formatspecific=[repmat('%6f',1,n_data_col) '%s']
% NEW format specifier
formatspecific=[repmat('%6.2f',1,n_data_col) '%s']
% Initialize the m_data matrix (if you know in advance the number of rows in
% each input file, you can preallocate the matrix to its final size from the
% start)
m_data=[];
% Loop for input file reading
for i=1:length(data)
% Get the i-th file name
file_name=data(i).name
% Open the i-th input file
fp=fopen(file_name,'rt')
% Read the i-th input file
C=textscan(fp,formatspecific,'headerlines',n_header)
% Close the input file
fclose(fp)
% Store the data read in the "the_data" struct array
the_data(i).f_name=file_name
the_data(i).data=[C{1:end-1}]
% Assign the data to a struct whose fields are named after the input file
data_struct.(file_name(1:end-4))=[C{1:end-1}]
% Append the data to the matrix "m_data"
m_data=[m_data;[C{1:end-1}]]
end
Input file
TestData
6.84 11.31 17.51 22.62 26.91 31.98 36.47 35.85 28.47 20.57 10.50 6.37 test1
0.24 2.62 4.94 7.17 10.39 15.37 18.73 18.29 12.26 6.46 1.15 -0.33 test2
68.47 95.04156.07218.39304.31320.22311.69269.22203.01135.60 68.18 55.09 test3
68.47 95.04156.07218.39304.31320.22311.69269.22203.01135.60 68.18 55.09 test4
Output
m_data =
Columns 1 through 7
6.8400 11.3100 17.5100 22.6200 26.9100 31.9800 36.4700
0.2400 2.6200 4.9400 7.1700 10.3900 15.3700 18.7300
68.4700 95.0400 156.0700 218.3900 304.3100 320.2200 311.6900
68.4700 95.0400 156.0700 218.3900 304.3100 320.2200 311.6900
Columns 8 through 12
35.8500 28.4700 20.5700 10.5000 6.3700
18.2900 12.2600 6.4600 1.1500 -0.3300
269.2200 203.0100 135.6000 68.1800 55.0900
269.2200 203.0100 135.6000 68.1800 55.0900
Hope this helps.
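Fused fields such as 95.04156.07 can also be split purely by character position, which makes the fixed-width assumption explicit. The following is a minimal sketch of that alternative, not the method used in the answer above. It assumes every numeric field is exactly 6 characters wide; the names txt, w, n, fields and nums are made up for the illustration, and only 4 fields are used instead of the 12 in the question.

```matlab
% Build a sample line the way a %6.2f writer would (neighbouring fields may touch)
vals = [68.47 95.04 156.07 218.39];
txt  = [sprintf('%6.2f', vals) ' test3'];   % ' 68.47 95.04156.07218.39 test3'

w = 6;   % field width in characters (assumption)
n = 4;   % number of numeric fields on the line (12 in the question)

% Cut the first w*n characters into n rows of w characters each
fields = reshape(txt(1:w*n), w, n).';

% Convert each row to a number; str2double ignores the padding blanks
nums = str2double(cellstr(fields)).';
```

The trailing comment (test3) is simply never touched, because only the first w*n characters are used. In a real file you would apply this to each non-empty line returned by fgetl.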

Related

How to load a csv file as a datamatrix in matlab?

I am trying to load a csv file in matlab to use a certain column as a vector for an OLS estimation. However, my csv looks like:
Date KCFSI
13 2004-02-01 -0.67
14 2004-03-01 -0.58
15 2004-04-01 -0.57
16 2004-05-01 -0.49
17 2004-06-01 -0.67
...
and I want to have the column KCFSI as a vector.
I tried:
x=fopen('kcfsi.csv');
kcfsi=x(:,2);
But I don't even get a matrix for x. I just get the value "14", for whatever reason. I want to have something like "2x100".
csvread cannot open csv files containing non-Numeric values as stated in the documentation.
The file must contain only numeric values.
So you should use textscan as explained in this answer : https://stackoverflow.com/a/19613301/11756186
Alternatively you can use the readtable built-in function
csvtable = readtable('kcfsi.csv');
kcfsi_array = csvtable.KCFSI; %Column vector with the content of the KCFSI column
fopen returns a fileID, not a matrix.
Use A = readmatrix(filename) or M = csvread(filename) instead.
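As a small illustration of the two points above (fopen only returns a file identifier, and readtable gives access to a column by name), here is a hedged sketch. It writes its own tiny sample file first; the comma-separated layout with a header row and the temporary file name are assumptions, since the exact layout of kcfsi.csv is not shown in the question.

```matlab
% Create a small sample file (hypothetical layout: header row, comma-separated)
fname = [tempname '.csv'];
fid = fopen(fname, 'w');      % fid is a scalar file identifier, not the file content
fprintf(fid, 'Date,KCFSI\n2004-02-01,-0.67\n2004-03-01,-0.58\n');
fclose(fid);

% Read the file as a table and pull out one column by its header name
T = readtable(fname);
kcfsi = T.KCFSI;              % numeric column vector

delete(fname);                % remove the temporary file
```

Indexing the identifier itself, as in x(:,2), cannot return data, because the identifier is just one number.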

Matlab import binary as matrix

I have a .txt file that contains a data like this:
0000000011111000
0000001110001110
0000011000011111
0001110000000001
0011000000000001
0011000000000001
0110000000000001
0100000000000001
1100000000000001
1100000000000001
1000000000000001
1100000000000010
1100000000000110
0100000000001100
0110000000011000
0011111111110000
0
//repeats like this
The 0 at the end is a label that describes the 16x16 matrix of 0's and 1's. As you can see it is actually a binary image of 0.
I need to load this file as a 16x16 matrix. I have tried importdata, textscan and fscanf, but none of them worked for me.
The file continues in this format.
My initial thought was to use '' as a delimiter for importdata, but that did not work.
Is there a way to achieve this?
This is one way to read the file (see here for some documentation):
fid=fopen(textfile);
dat = textscan(fid,'%s',-1); % <-- read into cell array of strings
fclose(fid);
dat=char(dat); % <-- concatenate the strings into one char array
dat = double(dat)- '0'; % <-- convert to numeric 0/1 (48 = '0'+0)
The last row will contain the label ("0") plus superfluous padding, which you can delete with e.g. dat(end,:)=[];
Happy trails!
Edit: Although the posted answer works with the input text file and input method I used, for the OP the code requires modification (probably due to a difference in input format):
for i = 1 : length (dat{1,1})
result(i,:) = double(char(dat{1,1}{i,1})) - '0';
end
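The character-to-number conversion used in both variants can be tried without any file. This is a minimal sketch on a made-up 4x4 image followed by its label; the cell array rows stands in for what textscan returns in dat{1}, and all names and data here are hypothetical.

```matlab
% Toy stand-in for dat{1}: four image rows followed by the label line
rows = {'0011'; '0100'; '1000'; '1111'; '0'};
n = 4;                              % image size (16 in the question)

% char() stacks the n equal-length strings into an n-by-n char matrix;
% subtracting '0' turns the characters '0'/'1' into the numbers 0/1
img = char(rows(1:n)) - '0';

% The line after the image is the label
label = str2double(rows{n+1});
```

Taking only the first n rows before calling char avoids the padded last row that otherwise has to be deleted afterwards.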

Modify the value of a specific position in a text file in Matlab

AIR, ID
AIR.SIT
50 1 1 1 0 0 2 1
43.57 -116.24 1. 857.7
Hi, All,
I have a text file like the one above. In Matlab, I want to create 5000 text files in which the number "2" (the specific number in the 3rd row) is replaced by 1 to 5000, one value per file, while all other contents stay the same. In each iteration the new number equals the loop number, and the output is saved to a new text file with a name like AIR_LoopNumber.SIT.
I've spent some time on this, but it is rather difficult for a newbie. Here is what I have:
% - Read source file.
fid = fopen ('Air.SIT');
n = 1;
textline={};
while (~feof(fid))
textline(n,1)={fgetl(fid)};
end
FileName=Air;
% - Replace characters when relevant.
for i = 1 : 5000
filename = sprintf('%s_%d.SIT','filename',i);
end
Can anybody help me finish the program?
Thanks,
James
If your file is so short you do not have to read it line by line. Just read the full thing in one variable, modify only the necessary part of it before each write, then write the full variable back in one go.
%% // read the full file as a long sequence of 'char'
fid = fopen ('Air.SIT');
fulltext = fread(fid,Inf,'char') ;
fclose(fid) ;
%% // add 3 blank placeholders so the field can hold the 4 digits needed when counting up to 5000
fulltext = [fulltext(1:49) ; 32 ; 32 ; 32 ; fulltext(50:end) ] ;
idx2replace = 50:53 ; %// save the index of the characters which will be modified each loop
%% // Go for it
baseFileName = 'AIR_%d.SIT' ;
for iFile = 1:1000:5000
%// build filename
filename = sprintf(baseFileName,iFile);
%// modify the string to write
fulltext(idx2replace) = num2str(iFile,'%04d').' ; %//'
%// write the file
fidw = fopen( filename , 'w' ) ;
fwrite(fidw,fulltext) ;
fclose(fidw) ;
end
This example works with the text in your example, you may have to adjust slightly the indices of the characters to replace if your real case is different.
Also, I set a step of 1000 for the loop so you can try it and see if it works without writing thousands of files. When you are satisfied with the result, remove the 1000 step in the for loop.
Edit:
The format specifier %04d I gave in the first solution ensures the output takes 4 characters, padding any smaller number with zeros (ex: 23 => 0023). It is sometimes desirable to keep the length constant, and in your particular example it made things easier because the output string would be exactly the same length for all the files.
However, it is not mandatory at all. If you do not want the loop number to be padded with zeros, you can use the simple format %d. This will only use the required number of digits.
The side effect is that the output string will have a different length for different loop numbers, so we cannot use one string for all the iterations; we have to recreate the string at each iteration. The modifications are as follows. Keep the first paragraph of the solution above as is, and replace the last 2 paragraphs with the following:
%% // prepare the block of text before and after the character to change
textBefore = fulltext(1:49) ;
textAfter = fulltext(51:end) ;
%% // Go for it
baseFileName = 'AIR_%d.SIT' ;
for iFile = 1:500:5000
%// build filename
filename = sprintf(baseFileName,iFile);
%// rebuild the string to write
fulltext = [textBefore ; num2str(iFile,'%d').' ; textAfter ]; %//'
%// write the file
fidw = fopen( filename , 'w' ) ;
fwrite(fidw,fulltext) ;
fclose(fidw) ;
end
Note:
A constant number of characters for the number may not be important inside the file, but it can be very useful for your file names to be AIR_0001 ... AIR_0023 ... AIR_0849 ... AIR_4357 etc., because in a list they will appear properly ordered in any explorer window.
If you want your files named with constant-length numbers, then just use:
baseFileName = 'AIR_%04d.SIT' ;
instead of the current line.
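If the character index of the value is hard to pin down, the same replacement can be done by line and token instead. This is a hedged sketch of that alternative, not the answerer's method: it assumes the value is the 7th space-separated token on the 3rd line, and it builds the sample text in memory instead of reading Air.SIT. Note that strsplit collapses runs of spaces on that line to single spaces.

```matlab
% Sample text from the question, built in memory for the illustration
nl  = sprintf('\n');
txt = sprintf('AIR, ID\nAIR.SIT\n50 1 1 1 0 0 2 1\n43.57 -116.24 1. 857.7\n');

lines = strsplit(txt, nl);            % one cell per line
tok   = strsplit(lines{3}, ' ');      % tokens of the 3rd line

iFile  = 4357;                        % loop number (example value)
tok{7} = sprintf('%d', iFile);        % replace the 7th token

lines{3} = strjoin(tok, ' ');
out = strjoin(lines, nl);             % text ready to be written with fwrite
```

Inside the loop from the answer, out would take the place of fulltext in the fwrite call.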

Parse a PC-Axis (.px) file in Matlab

Background: PC-Axis is a file format used for dissemination of statistical information. The format is used by a number of national statistical organisations to disseminate official statistics.
A PC-Axis file looks a little like this, although they're usually a lot longer:
CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
More details about PC-Axis files can be found at the Statistics Sweden website, but the basic gist is that the metadata is positioned at the top of the file and after "DATA=" is the actual data itself. It's also worth noting that the data is organized more like a data-table rather than in columns.
The Problem: I'd like to parse a PC-Axis file using Matlab, but I'm a little stumped as to how to go about doing it. Does anyone know how to parse one of these files in Matlab? Would it be easier to parse this type of file using some other language, like Perl, and then import the data into Matlab, or, would Matlab be a suitable enough tool for the job? Note that the plan would be to analyze the data in Matlab after the text processing stage.
I've tried using Matlab's text processing tools such as fgetl, textscan, fscanf, and a few others, but it's terribly tricky. Does anyone have any pointers on how to go about doing it?
Essentially, I'd like to store each of the keywords (CHARSET, MATRIX, etc.) and their corresponding values (ANSI, BE001, etc.) as metadata in Matlab - as a structure, perhaps. I'd like to have the data stored in Matlab also - as a matrix, for example.
Note: I'm aware of the pxR package (CRAN) in R, which works a treat for reading .px files into the workspace as a data.frame object. There's also a Perl module called Data::PcAxis (CPAN) that is also very good, but I'm specifically wanting to know how to parse a .px file using Matlab.
UPDATE: I should have mentioned that in addition to metadata and data, there are also variables. This is best explained by an example. The example PC-Axis file below is the same as the one above except I've added two variables. They're named VALUES("Month") and VALUES("region") and are positioned after the metadata and before the data.
CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
VALUES("Month")="1976M01","1976M02","1976M03","1976M04",
"1976M05","1976M06","1976M07","1976M08",
"1976M09","1976M10","1976M11","1976M12";
VALUES("region")="Sweden","Germany","France",
"Ireland","Finland";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
Textscan works a treat when reading in each line of the text file as a string (in a cell array). However, the elements after the "=" sign for both of the variables (i.e. VALUES("Month") and VALUES("region")) span more than one line. It seems that using textscan in this case means that some strings would have to be concatenated, say, for example, in order to collect the list of months (1976M01 to 1976M12).
Question: What's the best way to collect the variables data? Read the text file as a single string and then use strtok twice to extract the substring of dates? Perhaps, there's a better (more systematic) way?
Usually textscan and regexp is the way to go when parsing string fields (as shown here):
Read the input lines as strings with textscan:
fid = fopen('input.px', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Parse the header field names and values using regexp. Picking the right regular expression should do the trick!
X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
X = [X{:}]; %// Flatten the cell array
X = reshape([X{:}], 2, []); %// Reshape into name-value pairs
The "VALUE" fields may span over multiple lines, so they need to be concatenated first:
idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
Y = arrayfun(@(m, n){[C{:}{m:m + n - 1}]}, ...
idx_values(idx_values < idx_data), diff([idx_values; idx_data]));
... and then tokenized:
Y = regexp(Y, '"([^,"]+)"', 'tokens'); %// Tokenize values
Y = cellfun(@(x){{x{1}{1}, {[x{2:end}]}}}, Y); %// Group values in one array
Y = reshape([Y{:}], 2, []); %// Reshape into name-value pairs
Make sure the field names are legal (I've decided to convert everything to lowercase and replace hyphens and any whitespace with underscores), and plug them into a struct:
X = [X, Y]; %// Store all fields in one array
X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_'));
S = struct(X{:});
Here's what I get for your input file (only the header fields):
S =
charset: 'ANSI'
matrix: 'BE001'
subject_code: 'BE'
subject_area: 'Population'
title: 'Population by region, time, marital status and sex.'
month: {1x12 cell}
region: {1x5 cell}
As for the data itself, it needs to be handled separately:
Extract data lines after the "Data" field and replace all ".." values with default values (say, NaN):
D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
Obviously this assumes that there are only numerical data after the "Data" field. However, this can be easily modified if this is not the case.
Convert the data to a numerical matrix and add it to the structure:
D = cellfun(@str2num, D, 'UniformOutput', false);
S.data = vertcat(D{:})
And here's S.data for your input file:
S.data =
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN 24.80000 34.20000 52.00000 23.00000
NaN 32.10000 40.30000 50.70000 1.00000
NaN 31.60000 35.00000 49.10000 2.30000
41.20000 43.00000 50.80000 60.10000 0.00000
50.90000 52.00000 53.90000 65.90000 0.00000
Hope this helps!
I'm not personally familiar with PC-Axis files, but here are my thoughts.
Parse the header first. If the header is of fixed size, you can read in that many lines and parse out the values you want. The regexp method may be useful for this.
The data appear to be both string and numeric. I would change the ".." values to NaN (make an original backup first, of course), and then scan the matrix using textscan. Textscan can be tricky, so make sure the file parses completely. If textscan encounters a line that does not match the format string, it will stop parsing. You can check the position of the file handle (using ftell) to see if it matches the end of the file (you can fseek to the end of the file to find what that value should be). The length of the cell arrays returned by textscan should all be the same. If not, the length will tell you what line they failed on - you can check this line with a text editor to see what violated the format.
You can assign and access fields in Matlab structs using string arguments. For example:
foo.('a') = 1;
foo.a
ans =
1
So, the workflow I suggest is to parse the header lines, assigning each attribute/value pair as field/value pairs in struct. Then parse the matrix (after some brief text preprocessing to make sure all the data are numeric).
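The header part of that workflow could look roughly like the sketch below. It is only an illustration under assumptions: the three header lines are copied from the question with straight quotes, single-line keyword="value"; entries are the only case handled (the multi-line VALUES entries are not), and the variable names are made up.

```matlab
% Sample header lines copied from the question
header = {'MATRIX="BE001";'; 'SUBJECT-CODE="BE";'; 'SUBJECT-AREA="Population";'};

S = struct();
for k = 1:numel(header)
    % tok{1} is the keyword, tok{2} is the quoted value
    tok = regexp(header{k}, '^\s*([^="]+?)\s*=\s*"([^"]*)"', 'tokens', 'once');
    if isempty(tok)
        continue;   % line is not a simple keyword="value" pair
    end
    % Field names cannot contain '-' or blanks, so map them to '_'
    name = lower(regexprep(tok{1}, '-+|\s+', '_'));
    % Dynamic field name: assign the value under the cleaned keyword
    S.(name) = tok{2};
end
```

After the loop, S.matrix holds 'BE001' and S.subject_code holds 'BE', so the metadata can be looked up by keyword.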

Import csv file in Matlab

I need your help to import some csv files into matlab. They have the following format
#CONTENT
Class,Category,Level,Form
xxxxx,xxxxx,1.0,1
#DATA_GENERATION
Date,Agency,Version,ScientificAuthority
2010-04-08,INME,1.0,XXX xxx xxxx
#PLATFORM
Type,ID,Name,Country,GAW_ID
STN,308,xxxx,xxx
#INSTRUMENT
Name,Model,Number
ECC,6A,6A23500
#LOCATION
Latitude,Longitude,Height
25,-3,631.0
#TIMESTAMP
UTCOffset,Date,Time
+00:00:00,2010-04-07,10:51:00
* SOFTWARE: SNDPRO 1.321
* TROPOPAUSE IN MB 184
*
#FLIGHT_SUMMARY
IntegratedO3,CorrectionCode,SondeTotalO3,CorrectionFactor,TotalO3,WLCode,ObsType,Instrument,Number
328.4,0,379.9
#AUXILIARY_DATA
MeteoSonde,ib1,ib2,PumpRate,BackgroundCorr,SampleTemperatureType,MinutesGroundO3
RS92-SGPW,,,,Pressure,Pump
#PUMP_CORRECTION
Pressure,Correction
2.0,1.171
3.0,1.131
5.0,1.092
10.0,1.055
20.0,1.032
30.0,1.022
50.0,1.015
100.0,1.011
200.0,1.008
300.0,1.006
500.0,1.004
1000.0,1.000
#PROFILE
Pressure,O3PartialPressure,Temperature,WindSpeed,WindDirection,LevelCode,Duration,GPHeight,RelativeHumidity,SampleTemperature
945.36,4.590,14.6,10.0,30,2,0,631,43,22.8
944.90,4.620,14.3,7.8,20,0,2,635,44,22.8
943.51,4.630,13.9,7.6,17,0,4,647,44,22.8
942.13,4.620,13.4,8.1,16,0,6,660,45,22.8
940.98,4.590,13.0,9.0,16,0,8,670,45,22.8
939.83,4.590,12.6,9.8,17,0,10,680,46,22.8
938.69,4.600,12.1,10.3,18,2,12,691,46,22.8
937.77,4.600,12.2,10.9,18,0,14,699,47,22.9
936.63,4.600,12.1,11.4,19,0,16,709,47,22.9
935.48,4.600,11.8,11.9,19,0,18,719,47,22.9
934.12,4.600,11.7,12.3,19,0,20,731,47,22.9
932.98,4.590,11.6,12.6,19,0,22,742,48,22.9
931.84,4.590,11.6,12.9,18,0,24,752,48,22.9
930.93,4.600,11.6,13.2,18,0,26,760,48,22.9
929.79,4.600,11.4,13.4,17,0,28,770,49,22.9
928.88,4.610,11.5,13.6,16,0,30,778,49,22.9
927.98,4.620,11.4,13.7,15,0,32,787,49,23.0
927.30,4.620,11.3,13.8,14,0,34,793,49,23.0
The first line of the file is empty and the second line contains #CONTENT. I would like to load into a matrix all the data below the line Pressure,O3PartialPressure,Temperature,WindSpeed,WindDirection,LevelCode,Duration,GPHeight,RelativeHumidity,SampleTemperature
Use the csvread() function. From the documentation:
csvread Read a comma separated value file.
M = csvread('FILENAME') reads a comma separated value formatted file
FILENAME. The result is returned in M. The file can only contain
numeric values.
In your case, since you want to exclude all of the content up to the #PROFILE data, you have to know in advance the line number where the data you're interested in starts, then use one of the following forms (again from the documentation):
M = csvread('FILENAME',R,C) reads data from the comma separated value
formatted file starting at row R and column C. R and C are zero-
based so that R=0 and C=0 specifies the first value in the file.
M = csvread('FILENAME',R,C,RNG) reads only the range specified
by RNG = [R1 C1 R2 C2] where (R1,C1) is the upper-left corner of
the data to be read and (R2,C2) is the lower-right corner. RNG
can also be specified using spreadsheet notation as in RNG = 'A1..B7'.
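The row offset does not have to be counted by hand. The sketch below searches for the #PROFILE marker with fgetl and passes the resulting zero-based row to csvread. It is a hedged illustration: it writes a shortened sample file of its own, assumes the numeric block runs to the end of the file, and assumes exactly one line of column names follows the marker.

```matlab
% Write a shortened sample file (hypothetical content in the question's layout)
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '#LOCATION\nLatitude,Longitude\n25,-3\n');
fprintf(fid, '#PROFILE\nPressure,O3PartialPressure\n945.36,4.590\n944.90,4.620\n');
fclose(fid);

% Count the lines that come before the '#PROFILE' marker
fid = fopen(fname, 'r');
row = 0;
tline = fgetl(fid);
while ischar(tline) && ~strncmp(tline, '#PROFILE', 8)
    row = row + 1;
    tline = fgetl(fid);
end
fclose(fid);

% Skip the marker line and the column-name line, then read to the end
M = csvread(fname, row + 2, 0);

delete(fname);   % remove the temporary file
```

Because R is zero-based, row + 2 points at the first numeric line after the marker and the column names.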