Octave / Matlab - Reading fixed width file

I have a fixed width file format (original was input for a Fortran routine). Several lines of the file look like the below:
1078.0711005.481 932.978 861.159 788.103 716.076
How this actually should read:
1078.071 1005.481 932.978 861.159 788.103 716.076
I have tried various methods (textscan, fgetl, fscanf, etc.), but the problem, as seen above, is that because of the fixed width of the original files there is sometimes no whitespace between the numbers. I can't find a way to read them directly, and I can't change the original format.
The best I have come up with so far is to use fgetl, which reads the whole line in, and then reshape the result into an 8-by-6 array:
A = fgetl(fid);
A = reshape(A, 8, 6)
which generates the following result
11
009877
703681
852186
......
049110
787507
118936
So now I have the above and thought I might be able to concatenate the rows of that array together to form each number, although that is proving difficult as well; I have tried strcat, vertcat, etc.
All of that seems a long way round, so I was hoping for some better suggestions.
Thanks.

If you can rely on every number having exactly three decimal places, you can use a simple regular expression to insert the missing blanks:
s = '1078.0711005.481 932.978 861.159 788.103 716.076';
s = regexprep(s, '(\.\d\d\d)', '$1 ');
c = textscan(s, '%f');
Now c{1} contains your numbers. This will also work if s is in fact the whole file instead of one line.
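If the three-decimal assumption is not safe, the fixed field width itself can be used instead. The following is only a sketch: it assumes every field is exactly 8 characters wide (as the 8-by-6 reshape in the question suggests), and the variable names fieldWidth, fields and vals are made up for illustration.

```matlab
s = '1078.0711005.481 932.978 861.159 788.103 716.076';
fieldWidth = 8;                                  % assumed width, taken from the question's reshape
% reshape fills column by column, so each column holds one field;
% transposing puts one field on each row.
fields = cellstr(reshape(s, fieldWidth, []).');  % 6-by-1 cell array of strings
vals = str2double(fields);                       % 6-by-1 column of doubles
```

str2double ignores the leading blanks that pad the shorter numbers, so no separate trimming step should be needed.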

You haven't mentioned which class of output you need, but I guess you need to read doubles from the file to do some calculations. I assume you are able to read your file, since you already have the result of reshape(). However, using reshape() will not be efficient in your case, since your values do not all have the same number of digits (e.g. 1078.071 and 932.978).
If I didn't misunderstand your problem:
Your data is squashed in some parts (e.g. 1078.0711005.481 instead of 1078.071 1005.481).
The fractional part of each value has 3 digits.
First of all we need to get rid of spaces from the string array:
A = A(~ismember(A,' '));
Then using the information that fractional parts are 3 digits:
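iter = numel(strfind(A, '.'));   % one decimal point per number
B = zeros(1, iter);              % preallocate the output
for k = 1:iter
    ind = find(A == '.', 1);     % position of the first remaining decimal point
    B(k) = str2double(A(1:ind+3));
    A = A(ind+4:end);            % drop the number that was just read
end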
B will be an array of doubles as a result.

Related

Is there a native function in Matlab that export string array to csv and vice versa?

Now that string arrays are a thing since R2016b, is there a native function that exports a string array to a CSV file and vice versa?
A function that fills the same role as csvread and csvwrite did for numeric arrays in the old days, but for string arrays. To relax the requirement, say the string array contains columns of pure strings and columns of pure doubles. Stock prices with timestamp strings would be an example.
Native = not looping with fprintf. But if you are certain MATLAB hasn't included any such function yet, feel free to answer with the best approach thus far, without any restrictions.
Without any native function, pre-R2013a, looping with fprintf is the only way I can think of. And it was awful. Given its past reputation for inefficiency, I still don't trust looping in MATLAB.
Post-R2016b, one can convert a string array to a cell array with num2cell and then to a table with cell2table. The table can be written to a CSV file with writetable. This is actually fast, as writetable is fast. Only num2cell slows down the whole process a little. However, formatting is impossible along the way.
Post-R2019a, cell2table can be skipped with writecell, which is nice, but the (slightly) time-consuming step is num2cell, and formatting should still be impossible. (I don't have R2019a to test it.)
Is there a better way or is it another one of those basic things left to be desired about Matlab?

writematrix and readmatrix are the functions to do that since R2019a.
%If S1 is a string array that you want to write to `foobar.csv` then:
writematrix(S1,'foobar.csv');
%To read this csv file back into MATLAB as the same string array, use:
S2 = readmatrix('foobar.csv','OutputType','string');
%Verifying the result:
isequal(S1,S2)
ans =
logical
1
Loops have been significantly improved since R2015b. Not all loops are slow and not all vectorised versions are faster. The correct approach is to timeit when in doubt.

Octave - Convert number-strings from CSV import file to numerical matrix

I'm writing a code to import data from a CSV file and convert it into an Octave matrix. The imported data can be seen in the following snap:
In the next step I added the following command to delete the commas and "":
meas_raw_temp = strrep(meas_raw_temp,',',' ')
And then I get the data format in the following form:
The problem is that Octave still sees the data as 1 single 1-dimensional array. i.e., when I use the size command I get a single number, i.e. 2647. What I need to have is a matrix output, with each line of the snaps being a row of the matrix, and with each element separated.
Any thoughts?

Here's what's happening.
You have a 1-dimensional (rows only) cell array. Each element (i.e. cell) in the cell array contains a single string.
Your strings contain commas and literal double-quotes in them. You have managed to get rid of them all by replacing them in-place with an 'empty string'. Good. However that doesn't change the fact that you still have a single string per cell.
You need to create a for loop to process each cell separately. For each cell, split the string into its components (i.e. 'tokens') using ' ' (i.e. space) as the delimiter. You can use either strsplit or strtok appropriately to achieve this. Note that the 'tokens' you get out of this process are still of 'string' type. If you want numbers, you'll need to convert them (e.g. using str2double or something equivalent).
For each cell you process, find a way to fill the corresponding row of a preallocated matrix.
As Adriaan has pointed out in the comments, the exact manner in which you follow the steps above programmatically can vary, therefore I'm not going to provide the range of possible ways that you could do so, and I'm limiting the answer to this question to the steps above, which is how you should think about solving your problem.
If you attempt these steps and get stuck on a 'programmatic' aspect of your implementation, feel free to ask another stackoverflow question.
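One possible way to follow the steps above is sketched here. It is not the only way, and it rests on assumptions: the sample contents of meas_raw_temp are invented stand-ins for the imported data, every row is assumed to hold the same number of values, and the names nRows, nCols, tokens and meas are hypothetical.

```matlab
% Hypothetical stand-in for the imported cell array: one string per cell
meas_raw_temp = {'"1.5","2.0","3.25"'; '"4.0","5.5","6.75"'};
meas_raw_temp = strrep(meas_raw_temp, '"', '');    % drop the literal double quotes
meas_raw_temp = strrep(meas_raw_temp, ',', ' ');   % turn commas into spaces

nRows = numel(meas_raw_temp);
nCols = numel(strsplit(strtrim(meas_raw_temp{1}), ' '));  % column count taken from the first row
meas = zeros(nRows, nCols);                        % preallocated numeric matrix
for r = 1:nRows
    tokens = strsplit(strtrim(meas_raw_temp{r}), ' ');    % split one row into string tokens
    meas(r, :) = str2double(tokens);               % convert the tokens and fill row r
end
```

After the loop, size(meas) reports two dimensions (rows by columns) rather than a single element count.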

MATLAB : Alphanumeric character string extraction

As a foreword, I have been searching for solutions to this, and I have tried a myriad of codes but none of them work for the specific case.
I have a variable that is the registration number of different UK firms. The data was originally from Stata, and I had to use a code to import non-numeric data into Matlab. This variable (regno) is numeric up until observation 18000 (approx). From then it becomes registration numbers with both letters and numbers.
I wrote a very crude loop that grabbed the initial variable (cell), took out the double quotation marks, and extracted the characters into another matrix (double). The code is:
regno2 = strrep(regno,'"','');
regno3 = cell2mat(regno2);
regno4 = [];
for i = 1:size(regno3,1);
regno4(i,1) = str2double(regno3(i,1:8));
end
For the variables with both letters and numbers I get NaN. I need the variables as a double in order to use them as dummy indicator variables in MatLab. Any ideas?
Thanks

Ok I'm not entirely sure about whether you need letters all the time, but here regular expressions would likely perform what you want.
Here is a simple example to help you get started; in this case I use regexp to locate the numbers in your entries.
clear
%// Create dummy entries
Case1 = 'NI000166';
Case2 = '12ABC345';
%// Put them in a cell array, like what you have.
CasesCell = {Case1;Case2};
%// Use regexp to locate the numbers in the expression. This will give the indices of the numbers, i.e. their position within each entry. Note that regexp can operate on cell arrays, which is useful to us here.
NumberIndices = regexp(CasesCell,'\d');
%// Here we use cellfun to fetch the actual values in each entry, based on the indices calculated above.
NumbersCell = cellfun(@(x,y) x(y),CasesCell,NumberIndices,'uni',0)
Now NumbersCell looks like this:
NumbersCell =
'000166'
'12345'
You can convert it to a number with str2num (or str2double) and you're good to go.
Note that in the case in which you have 00001234 or SC001234, the values given by regexp would be considered as different, so that would not cause a problem. If the variables are of different lengths and you then have similar numbers, then you would need to add a bit of code with regexp to consider the letters.
Hope that helps! If you need clarifications or if I misunderstood something please tell me!
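As a rough sketch of the same idea in fewer steps, regexprep can strip every non-digit character before the conversion. The entries below are dummy values, and the names digitsOnly, nums and id are made up for illustration. The third entry is there to show the collision mentioned above: two different registration numbers can end up with the same numeric value. If only a distinct code per firm is needed (for dummy variables, say), the third output of unique is a different technique that avoids the collision.

```matlab
regno = {'NI000166'; '12ABC345'; '00012345'};   % dummy entries

% Strip all non-digit characters, then convert what is left to doubles
digitsOnly = regexprep(regno, '\D', '');
nums = str2double(digitsOnly);                  % second and third entries collide

% Alternative: one integer code per distinct registration string
[~, ~, id] = unique(regno);
id = id(:);                                     % force a column vector
```

Here nums treats '12ABC345' and '00012345' as the same firm, while id keeps them apart.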

How do I read the HITRAN2012 database into MATLAB?

The HITRAN database is a listing of molecular rotational-vibrational transitions. It is given in a text file where each line is 160 characters, with fixed width fields defining molecule, isotope, etc. The format is well documented, and there is even a program on the MathWorks File Exchange that will read in the database and simulate a portion of the spectrum. However, I need to read in a specific portion of the spectrum and then use it to do some fitting to a measured spectrum, so I need something much more custom.
As given in the comment section of that function, as well as elsewhere, the following line should read each line in properly:
database = which('HITRAN2012.par');
fid = fopen(database);
hitran = textscan(fid,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6c%12c%1c%7f%7f','delimiter','','whitespace','');
fclose(fid);
The first two fields denote the molecule code, which runs from 1-47, and the isotope code which runs from 1-9.
Unfortunately, molecules 1-9 do not have a leading zero, and no matter what I do, it seems to silently confuse MATLAB. If I load in the entire database and then type
unique(hitran{1})
I do not get the numbers 1-47, but I get 10-92 with a few numbers missing. As far as I can figure, when MATLAB encounters a leading space, it shifts the line over and then pads the end, so that ' 12' becomes '12', but I'm not exactly sure. I have also tried
hitran = textscan(fid,'%160c','delimiter','\n','whitespace','');
and then tried to parse the resulting strings, but that also sometimes gets confused by the first space.
For instance, the first water line looks like
exampleHitranLine = ' 14 0.007002 1.165E-32 2.071E-14.05870.305 818.00670.590.000000 0 0 0 0 0 0 7 5 2 7 5 3 005540 02227 5 2 0 90.0 90.0';
The first bit of code comes across this line and returns '14' instead of ' 1' and '4'. If I just read in a subset that only contains molecule 1 (as in this example), then the second method of reading works fine. If I try to read in the entire database, however, the lines with molecules 1-9 are shifted to the left, which messes up all the other fields.
I should note that I've tried reading the numerical fields both as floats and as integers, and neither gives satisfactory results. The entire database in text form is nearly 700 MB, and so I need something that works as efficiently as possible.
What am I doing wrong?

I have a new file on the FileExchange that will read in HITRAN 2004+ format data. Please try it out and let me know if there are any issues with it.

I don't have an answer as to why this is happening, but I do have a solution. If anyone has an answer as to why, I'd be happy to accept it.
It is the leading space that is screwing things up. MATLAB is being a little too clever, and when textscan encounters a leading space, it decides that it's extra and discards it and moves on to the next two characters. To get it to properly read in the file, I had to go line by line and test whether the first character is a space and then replace it with a leading zero, like this:
database = which('HITRAN2012_First100Lines.par');
fileParams = dir(database);
K = fileParams.bytes/162;
hitran = cell(K,19);
fid = fopen(database);
for k = 1:K
    hitranTemp = fgetl(fid);
    if hitranTemp(1) == ' '      % a leading space means a single-digit molecule code
        hitranTemp(1) = '0';     % replace it with a leading zero
    end
    hitran(k,:) = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6c%12c%1c%7f%7f','delimiter','','whitespace','');
end
fclose(fid);
I'm working in MATLAB 2013a. Should I consider this to be a bug and report it? Is there some reason that the leading space should be gobbled up like this?
Update:
My workaround above was slow, but worked. Then I had to process the HITEMP database, which is several times larger, so I finally did submit a support ticket to MathWorks. The workaround suggested by MathWorks technical support is to read everything in as text and then convert. This saves a lot of disk reads and works.
fileParams = dir(database);
fid = fopen(database);
hitran = textscan(fid,'%2c%1c%12c%10c%10c%5c%5c%10c%4c%8c%15c%15c%15c%15c%6c%12c%1c%7c%7c','delimiter','','whitespace','');
fclose(fid);
moleculeNumber = uint8(str2num(hitran{1}));
isotopologueNumber = uint8(str2num(hitran{2}));
vacuumWavenumber = str2num(hitran{3});
...
etc.
Depending on the application, for larger databases one would probably want to do this in chunks rather than all at once.
He also said he would forward the behavior to the development team for consideration in a future update.
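The read-as-text-then-convert idea can also be illustrated on a single record by slicing on column positions. This is only a sketch: recordStart is an invented 15-character fragment, not a real HITRAN line, and it covers just the first three field widths (2, 1 and 12) from the format string above. The names widths, starts, stops and fields are hypothetical.

```matlab
% Invented start of a record: 2-char molecule, 1-char isotope, 12-char wavenumber
recordStart = ' 14    0.007002';
widths = [2 1 12];                  % first three field widths from the format string
stops  = cumsum(widths);            % last column of each field
starts = stops - widths + 1;        % first column of each field
fields = arrayfun(@(a, b) recordStart(a:b), starts, stops, 'UniformOutput', false);

moleculeNumber     = uint8(str2double(fields{1}));  % ' 1' keeps its leading blank until conversion
isotopologueNumber = uint8(str2double(fields{2}));
vacuumWavenumber   = str2double(fields{3});
```

Because the fields are cut by position before any numeric conversion happens, a leading blank cannot shift the remaining columns.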

How to read text fields into MATLAB and create a single matrix

I have a huge CSV file that has a mix of numerical and text datatypes. I want to read this into a single matrix in Matlab. I'll use a simpler example here to illustrate my problem. Let's say I have this CSV file:
1,foo
2,bar
I am trying to read this into MatLab using:
A=fopen('filename.csv');
B=textscan(A,'%d %d', 'delimiter',',');
C=cell2mat(B);
The first two lines work fine, but the problem is that textscan doesn't create a 2x2 matrix; instead it creates a 1x2 cell array with each element being an array. So I try to use the last line to combine the arrays into one big matrix, but it generates an error because the arrays have different datatypes.
Is there a way to get around this problem? Or a better way to combine the arrays?

I am not sure if combining them is a good idea. It is likely that you would be better off with them separate.
I changed your code, so that it works better:
clear
clc
A=fopen('filename.csv');
B=textscan(A,'%d %s', 'delimiter',',')
fclose(A)
Looking at the results
K>> B{1}
ans =
1
2
K>> B{2}
ans =
'foo'
'bar'
Really, I think this is the format that is most useful. If anything, most people would want to break this cell array into smaller chunks
num = B{1}
txt = B{2}
Why are you trying to combine them? They are already together in a cell array, and that is the most combined you are going to get.
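If a single 2-by-2 container is still wanted, one option is a cell array with one value per cell, since a numeric matrix cannot hold text. A minimal sketch, assuming the file contents shown in the question; the text is passed to textscan as a string here so the example does not need the file, and the names txt and C are made up.

```matlab
txt = sprintf('1,foo\n2,bar');                  % stand-in for the contents of filename.csv
B = textscan(txt, '%d %s', 'Delimiter', ',');   % B{1} is numeric, B{2} is a cell array of strings
C = [num2cell(B{1}), B{2}];                     % 2-by-2 cell array, one value per cell
```

C{1,1} is then the number 1 and C{1,2} is the string 'foo'.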

There is a natural solution to this, but it requires the Statistics toolbox (version 6.0 or higher). Mixed data types can be read into a dataset array. See the Mathworks help page here.

I believe you can't use textscan for this purpose. I'd use fscanf, which always gives you a matrix as specified. If you don't know the layout of the data, it gets kind of tricky, however.
fscanf works as follows:
fscanf(fid, format, size)
where fid is the file identifier returned by fopen,
format is the file format and how you are reading the data (['%d' ',' '%s'] would work for your example file), and
size is the matrix dimensions ([2 2] would work on your example file).
Note that fscanf returns a single numeric array, so any text read with %s is stored as character codes rather than as strings.