AIR, ID
AIR.SIT
50 1 1 1 0 0 2 1
43.57 -116.24 1. 857.7
Hi, All,
I have a text file like above. Now in Matlab, I want to create 5000 text files, changing the number "2" (the specific number in the 3rd row) from 1 to 5000 in each file, while keeping other contents the same. In every loop, the changed number is the same with the loop number. And the output in every loop is saved into a new text file, with the name like AIR_LoopNumber.SIT.
I've spent some time writing on that. But it is kind of difficult for a newby. Here is what I have:
% - Read source file.
fid = fopen ('Air.SIT');
n = 1;
textline={};
while (~feof(fid))
textline(n,1)={fgetl(fid)};
end
FileName=Air;
% - Replace characters when relevant.
for i = 1 : 5000
filename = sprintf('%s_%d.SIT','filename',i);
end
Anybody can help on finishing the program?
Thanks,
James
If your file is so short you do not have to read it line by line. Just read the full thing in one variable, modify only the necessary part of it before each write, then write the full variable back in one go.
%% // read the full file as a long sequence of 'char'
fid = fopen ('Air.SIT');
fulltext = fread(fid,Inf,'char') ;
fclose(fid) ;
%% // add a few blank placeholder (3 exactly) to hold the 4 digits when we'll be counting 5000
fulltext = [fulltext(1:49) ; 32 ; 32 ; 32 ; fulltext(50:end) ] ;
idx2replace = 50:53 ; %// save the index of the characters which will be modified each loop
%% // Go for it
baseFileName = 'AIR_%d.SIT' ;
for iFile = 1:1000:5000
%// build filename
filename = sprintf(baseFileName,iFile);
%// modify the string to write
fulltext(idx2replace) = num2str(iFile,'%04d').' ; %//'
%// write the file
fidw = fopen( filename , 'w' ) ;
fwrite(fidw,fulltext) ;
fclose(fidw) ;
end
This example works with the text in your example, you may have to adjust slightly the indices of the characters to replace if your real case is different.
Also I set a step of 1000 for the loop to let you try and see if it works without writing 1000's of file. When you are satisfied with the result, remove the 1000 step in the for loop.
Edit:
The format specifier %04d I gave in the first solution insure the output will take 4 characters, and it will pad any smaller number with zero (ex: 23 => 0023). It is sometimes desirable to keep the length constant, and in your particular example it made things easier because the output string would be exactly the same length for all the files.
However it is not mandatory at all, if you do not want the loop number to be padded with zero, you can use the simple format %d. This will only use the required number of digits.
The side effect is that the output string will be of different length for different loop number, so we cannot use one string for all the iterations, we have to recreate a string at each iteration. So the simple modifications are as follow. Keep the first paragraph of the solution above as is, and replace the last 2 paragraphs with the following:
%% // prepare the block of text before and after the character to change
textBefore = fulltext(1:49) ;
textAfter = fulltext(51:end) ;
%% // Go for it
baseFileName = 'AIR_%d.SIT' ;
for iFile = 1:500:5000
%// build filename
filename = sprintf(baseFileName,iFile);
%// rebuild the string to write
fulltext = [textBefore ; num2str(iFile,'%d').' ; textAfter ]; %//'
%// write the file
fidw = fopen( filename , 'w' ) ;
fwrite(fidw,fulltext) ;
fclose(fidw) ;
end
Note:
The constant length of character for a number may not be important in the file, but it can be very useful for your file names to be named AIR_0001 ... AIR_0023 ... AIR_849 ... AIR_4357 etc ... because in a list they will appear properly ordered in any explorer windows.
If you want your files named with constant length numbers, the just use:
baseFileName = 'AIR_%04d.SIT' ;
instead of the current line.
Background: PC-Axis is a file format format used for dissemination of statistical information. The format is used by a number of national statistical organisations to disseminate official statistics.
A PC-Axis file looks a little like this, although they're usually a lot longer:
CHARSET=”ANSI”;
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
More details about PC-Axis files can be found at the Statistics Sweden website, but the basic gist is that the metadata is positioned at the top of the file and after "DATA=" is the actual data itself. It's also worth noting that the data is organized more like a data-table rather than in columns.
The Problem: I'd like to parse a PC-Axis file using Matlab, but I'm a little stumped as to how to go about doing it. Does anyone know how to parse one of these files in Matlab? Would it be easier to parse this type of file using some other language, like Perl, and then import the data into Matlab, or, would Matlab be a suitable enough tool for the job? Note that the plan would be to analyze the data in Matlab after the text processing stage.
I've tried using Matlab's text processing tools such as fgetl, textscan, fscanf, and a few others, but it's terribly tricky. Does anyone have any pointers on how to go about doing it?
Essentially, I'd like to store each of the keywords (CHARSET, MATRIX, etc.) and their corresponding values (ANSI, BE001, etc.) as metadata in Matlab - as a structure, perhaps. I'd like to have the data stored in Matlab also - as a matrix, for example.
Note: I'm aware of the pxR package (CRAN) in R, which works a treat for reading .px files into the workspace as a data.frame object. There's also a Perl module called Data::PcAxis (CPAN) that is also very good, but I'm specifically wanting to know how to parse a .px file using Matlab.
UPDATE: I should have mentioned that in addition to metadata and data, there are also variables. This is best explained by an example. The example PC-Axis file below is the same as the one above except I've added two variables. They're named VALUES("Month") and VALUES("region") and are positioned after the metadata and before the data.
CHARSET=”ANSI”;
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
VALUES("Month")="1976M01","1976M02","1976M03","1976M04",
"1976M05","1976M06","1976M07","1976M08",
"1976M09","1976M10","1976M11","1976M12";
VALUES("region")="Sweden","Germany","France",
"Ireland","Finland";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
Textscan works a treat when reading in each line of the text file as a string (in a cell array). However, the elements after the "=" sign for both of the variables (i.e. VALUES("Month") and VALUES("region")) span more than one line. It seems that using textscan in this case means that some strings would have to be concatenated, say, for example, in order to collect the list of months (1976M01 to 1976M12).
Question: What's the best way to collect the variables data? Read the text file as a single string and then use strtok twice to extract the substring of dates? Perhaps, there's a better (more systematic) way?
Usually textscan and regexp is the way to go when parsing string fields (as shown here):
Read the input lines as strings with textscan:
fid = fopen('input.px', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Parse the header field names and values using regexp. Picking the right regular expression should do the trick!
X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
X = [X{:}]; %// Flatten the cell array
X = reshape([X{:}], 2, []); %// Reshape into name-value pairs
The "VALUE" fields may span over multiple lines, so they need to be concatenated first:
idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
Y = arrayfun(#(m, n){[C{:}{m:m + n - 1}]}, ...
idx_values(idx_values < idx_data), diff([idx_values; idx_data]));
... and then tokenized:
Y = regexp(Y, '"([^,"]+)"', 'tokens'); %// Tokenize values
Y = cellfun(#(x){{x{1}{1}, {[x{2:end}]}}}, Y); %// Group values in one array
Y = reshape([Y{:}], 2, []); %// Reshape into name-value pairs
Make sure the field names are legal (I've decided to convert everything to lowercase and replace apostrophes and any whitespace with underscores), and plug them into a struct:
X = [X, Y]; %// Store all fields in one array
X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_'));
S = struct(X{:});
Here's what I get for your input file (only the header fields):
S =
charset: 'ANSI'
matrix: 'BE001'
subject_code: 'BE'
subject_area: 'Population'
title: 'Population by region, time, marital status and sex.'
month: {1x12 cell}
region: {1x5 cell}
As for the data itself, it needs to be handled separately:
Extract data lines after the "Data" field and replace all ".." values with default values (say, NaN):
D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
Obviously this assumes that there are only numerical data after the "Data" field. However, this can be easily modified if this is not case.
Convert the data to a numerical matrix and add it to the structure:
D = cellfun(#str2num, D, 'UniformOutput', false);
S.data = vertcat(D{:})
And here's S.data for your input file:
S.data =
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN 24.80000 34.20000 52.00000 23.00000
NaN 32.10000 40.30000 50.70000 1.00000
NaN 31.60000 35.00000 49.10000 2.30000
41.20000 43.00000 50.80000 60.10000 0.00000
50.90000 52.00000 53.90000 65.90000 0.00000
Hope this helps!
I'm not personally familiar with PC-Axis files, but here are my thoughts.
Parse the header first. If the header is of fixed size, you can read in that many lines and parse out the values you want. The regexp method may be useful for this.
The data appear to be both string and numeric. I would change the ".." values to NaN (make an original backup first, of course), and then scan the matrix using textscan. Textscan can be tricky, so make sure the file parses completely. If textscan encounters a line that does not match the format string, it will stop parsing. You can check the position of the file handle (using ftell) to see if it matches the end of the file (you can fseek to the end of the file to find what that value should be). The length of the cell arrays returned by textscan should all be the same. If not, the length will tell you what line they failed on - you can check this line with a text editor to see what violated the format.
You can assign and access fields in Matlab structs using string arguments. For example:
foo.('a') = 1;
foo.a
ans =
1
So, the workflow I suggest is to parse the header lines, assigning each attribute/value pair as field/value pairs in struct. Then parse the matrix (after some brief text preprocessing to make sure all the data are numeric).
I need your help to import some csv files into matlab. They have the following format
#CONTENT
Class,Category,Level,Form
xxxxx,xxxxx,1.0,1
#DATA_GENERATION
Date,Agency,Version,ScientificAuthority
2010-04-08,INME,1.0,XXX xxx xxxx
#PLATFORM
Type,ID,Name,Country,GAW_ID
STN,308,xxxx,xxx
#INSTRUMENT
Name,Model,Number
ECC,6A,6A23500
#LOCATION
Latitude,Longitude,Height
25,-3,631.0
#TIMESTAMP
UTCOffset,Date,Time
+00:00:00,2010-04-07,10:51:00
* SOFTWARE: SNDPRO 1.321
* TROPOPAUSE IN MB 184
*
#FLIGHT_SUMMARY
IntegratedO3,CorrectionCode,SondeTotalO3,CorrectionFactor,TotalO3,WLCode,ObsType,Instrument,Number
328.4,0,379.9
#AUXILIARY_DATA
MeteoSonde,ib1,ib2,PumpRate,BackgroundCorr,SampleTemperatureType,MinutesGroundO3
RS92-SGPW,,,,Pressure,Pump
#PUMP_CORRECTION
Pressure,Correction
2.0,1.171
3.0,1.131
5.0,1.092
10.0,1.055
20.0,1.032
30.0,1.022
50.0,1.015
100.0,1.011
200.0,1.008
300.0,1.006
500.0,1.004
1000.0,1.000
#PROFILE
Pressure,O3PartialPressure,Temperature,WindSpeed,WindDirection,LevelCode,Duration,GPHeight,RelativeHumidity,SampleTemperature
945.36,4.590,14.6,10.0,30,2,0,631,43,22.8
944.90,4.620,14.3,7.8,20,0,2,635,44,22.8
943.51,4.630,13.9,7.6,17,0,4,647,44,22.8
942.13,4.620,13.4,8.1,16,0,6,660,45,22.8
940.98,4.590,13.0,9.0,16,0,8,670,45,22.8
939.83,4.590,12.6,9.8,17,0,10,680,46,22.8
938.69,4.600,12.1,10.3,18,2,12,691,46,22.8
937.77,4.600,12.2,10.9,18,0,14,699,47,22.9
936.63,4.600,12.1,11.4,19,0,16,709,47,22.9
935.48,4.600,11.8,11.9,19,0,18,719,47,22.9
934.12,4.600,11.7,12.3,19,0,20,731,47,22.9
932.98,4.590,11.6,12.6,19,0,22,742,48,22.9
931.84,4.590,11.6,12.9,18,0,24,752,48,22.9
930.93,4.600,11.6,13.2,18,0,26,760,48,22.9
929.79,4.600,11.4,13.4,17,0,28,770,49,22.9
928.88,4.610,11.5,13.6,16,0,30,778,49,22.9
927.98,4.620,11.4,13.7,15,0,32,787,49,23.0
927.30,4.620,11.3,13.8,14,0,34,793,49,23.0
The first line of the file is empty and second line contains the #CONTENT. I would like to have in a matrix all data that are under the line Pressure,O3PartialPressure,Temperature,WindSpeed,WindDirection,LevelCode,Duration,GPHeight,RelativeHumidity,SampleTemperature
Use the csvread() function. From the documentation:
csvread Read a comma separated value file.
M = csvread('FILENAME') reads a comma separated value formatted file
FILENAME. The result is returned in M. The file can only contain
numeric values.
In your case, since you want to exclude all of the content up until the #PROFILE data, you would have to know the line number of the data you're interested in in advance, then use one of the following uses (again from the documentation):
M = csvread('FILENAME',R,C) reads data from the comma separated value
formatted file starting at row R and column C. R and C are zero-
based so that R=0 and C=0 specifies the first value in the file.
M = csvread('FILENAME',R,C,RNG) reads only the range specified
by RNG = [R1 C1 R2 C2] where (R1,C1) is the upper-left corner of
the data to be read and (R2,C2) is the lower-right corner. RNG
can also be specified using spreadsheet notation as in RNG = 'A1..B7'.