How do I read the HITRAN2012 database into MATLAB?

The HITRAN database is a listing of molecular rotational-vibrational transitions. It is given in a text file where each line is 160 characters, with fixed width fields defining molecule, isotope, etc. The format is well documented, and there is even a program on the MathWorks File Exchange that will read in the database and simulate a portion of the spectrum. However, I need to read in a specific portion of the spectrum and then use it to do some fitting to a measured spectrum, so I need something much more custom.
As given in the comment section of that function, as well as elsewhere, the following line should read each line in properly:
database = which('HITRAN2012.par');
fid = fopen(database);
hitran = textscan(fid,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6c%12c%1c%7f%7f','delimiter','','whitespace','');
fclose(fid);
The first two fields denote the molecule code, which runs from 1-47, and the isotope code which runs from 1-9.
Unfortunately, molecules 1-9 do not have a leading zero, and no matter what I do, it seems to silently confuse MATLAB. If I load in the entire database and then type
unique(hitran{1})
I do not get the numbers 1-47, but I get 10-92 with a few numbers missing. As far as I can figure, when MATLAB encounters a leading space, it shifts the line over and then pads the end, so that ' 12' becomes '12', but I'm not exactly sure. I have also tried
hitran = textscan(fid,'%160c','delimiter','\n','whitespace','');
and then tried to parse the resulting strings, but that also sometimes gets confused by the first space.
For instance, the first water line looks like
exampleHitranLine = ' 14 0.007002 1.165E-32 2.071E-14.05870.305 818.00670.590.000000 0 0 0 0 0 0 7 5 2 7 5 3 005540 02227 5 2 0 90.0 90.0';
The first bit of code comes across this line and returns '14' instead of ' 1' and '4'. If I just read in a subset that only contains molecule 1 (as in this example), then the second method of reading works fine. If I try to read in the entire database, however, the lines with molecules 1-9 are shifted to the left, which messes up all the other fields.
I should note that I've tried reading the numerical fields both as floats and as integers, and neither gives satisfactory results. The entire database in text form is nearly 700 MB, and so I need something that works as efficiently as possible.
What am I doing wrong?

I have a new file on the FileExchange that will read in HITRAN 2004+ format data. Please try it out and let me know if there are any issues with it.

I don't have an answer as to why this is happening, but I do have a solution. If anyone has an answer as to why, I'd be happy to accept it.
It is the leading space that is screwing things up. MATLAB is being a little too clever, and when textscan encounters a leading space, it decides that it's extra and discards it and moves on to the next two characters. To get it to properly read in the file, I had to go line by line and test whether the first character is a space and then replace it with a leading zero, like this:
database = which('HITRAN2012_First100Lines.par');
fileParams = dir(database);
K = fileParams.bytes/162;
hitran = cell(K,19);
fid = fopen(database);
for k = 1:K
    hitranTemp = fgetl(fid);
    % Replace a leading space with '0' so the two-character molecule field survives
    if hitranTemp(1) == ' '
        hitranTemp(1) = '0';
    end
    hitran(k,:) = deal(textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6c%12c%1c%7f%7f','delimiter','','whitespace',''));
end
fclose(fid);
I'm working in MATLAB 2013a. Should I consider this to be a bug and report it? Is there some reason that the leading space should be gobbled up like this?
Update:
My workaround above was slow, but worked. Then I had to process the HITEMP database, which is several times larger, so I finally did submit a support ticket to MathWorks. The workaround suggested by MathWorks technical support is to read everything in as text and then convert. This saves a lot of disk reads and works.
fileParams = dir(database);
fid = fopen(database);
hitran = textscan(fid,'%2c%1c%12c%10c%10c%5c%5c%10c%4c%8c%15c%15c%15c%15c%6c%12c%1c%7c%7c','delimiter','','whitespace','');
fclose(fid);
moleculeNumber = uint8(str2num(hitran{1}));
isotopologueNumber = uint8(str2num(hitran{2}));
vacuumWavenumber = str2num(hitran{3});
...
etc.
Depending on the application, for larger databases one would probably want to do this in chunks rather than all at once.
He also said he would forward the behavior to the development team for consideration in a future update.
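For what it's worth, here is a minimal sketch of another way to sidestep the whitespace handling entirely: read the file as raw bytes with fread, reshape to one record per row, and slice the fixed-width columns. This is not the method MathWorks suggested above, just an alternative. It assumes every record is exactly 162 bytes (160 characters plus CR/LF, as the bytes/162 in my earlier loop implies). The three synthetic records and the temporary file are made up for illustration and only fill the first three fields.

```matlab
% Build a tiny synthetic fixed-width file (hypothetical data, first 3 fields only):
% 2-char molecule, 1-char isotopologue, 12-char wavenumber, padded to 160 chars + CR/LF.
records = {' 14    0.007002', '121  612.345678', '471 1234.500000'};
tmpFile = [tempname '.par'];
fid = fopen(tmpFile, 'w');
for k = 1:numel(records)
    fprintf(fid, '%-160s\r\n', records{k});
end
fclose(fid);

% Read the whole file as bytes and reshape so that each row is one record.
recLen = 162;                               % assumed record length in bytes
info = dir(tmpFile);
nRec = info.bytes / recLen;
fid = fopen(tmpFile, 'r');
raw = fread(fid, [recLen, nRec], 'uint8=>char').';
fclose(fid);
delete(tmpFile);

% Slice fields by column position; leading spaces are harmless to str2double.
moleculeNumber     = uint8(str2double(cellstr(raw(:, 1:2))));
isotopologueNumber = uint8(str2double(cellstr(raw(:, 3))));
vacuumWavenumber   = str2double(cellstr(raw(:, 4:15)));
```

For a file the size of HITEMP, the same idea can be applied in chunks by passing a smaller column count to fread and looping until the end of the file.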

Related

How should I alter my code for my "string of Text & Numbers to Morse code" converter for the code to be able to run?

push button callback to convert to Morse
Hi, I have a problem, I'm supposed to create a GUI in MATLAB which converts letter & numbers into Morse code but my code wouldn't run, the attached image link above is for the push button callback. Also it says that the 'Morse' underlined in red needs to be preallocated for speed as it changes size every loop iteration. How should I approach this? Thanks..
Also, should I include anything under my edit1 and edit2 callbacks? Since edit1 is just for entering the input of numbers and letters and edit2 is just to output the Morse code. Thanks again!
edit1 & edit2 callbacks
"Morse" changes size every loop iteration. First of all, let's define 2 variables.
Morse_1 = [];
Morse_2 = zeros(1,100);
(I'm taking the liberty of defining matrices instead of strings, since that makes the concept easier to explain.) You are basically saying that Morse_1 is a blank variable that can be filled, while Morse_2 has fixed dimensions. The dimensions of blank variables like Morse_1 (pardon me if I'm not using the correct names, but I think blank variable explains it quite well) are flexible. This means that doing
Morse_1(1,101) = 1
will work (Morse_1 will be a 1-by-101 vector with 100 zeros and a 1 at the 101st position). Doing
Morse_2(1,101) = 1
will work as well, but you might end up with too many unused elements if you largely overestimate the dimensions (e.g. zeros(1,1000) but your message actually only reaches a few hundred).
In your case, I'd use a blank variable, since you don't really know beforehand how long your coded message is going to be (even if you knew the number of characters in your original string, the coded message would be 5 times longer if it were all '9's than all 'e's). This warning is really useful when dealing with 1000x1000 matrices, but for processing strings I'd ignore it.
To sum it up, I'd use a blank variable if you have no idea how long it'll get, or if your code can't handle a variable length, or if you don't want to worry about calculating exactly how many elements are needed. On the other hand, I'd use fixed dimensions if your code needs a properly dimensioned array, or if you're working with very large arrays. For a lot of cases, though, you really won't notice the speed difference (filling a blank array might take 0.01s, while filling a fixed dimension one might take 0.001s. Unless you're doing this a thousand times (why??), it's literally unnoticeable).
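If you want to see the difference on your own machine, here is a rough timing sketch. The vector length n is an arbitrary choice, and the timings depend on your MATLAB version and hardware, so no numbers are claimed here.

```matlab
n = 1e5;                       % arbitrary length chosen for the comparison

% Grow a blank variable one element at a time.
tic
grown = [];
for k = 1:n
    grown(k) = k;
end
tGrown = toc;

% Fill an array whose size was fixed in advance.
tic
prealloc = zeros(1, n);
for k = 1:n
    prealloc(k) = k;
end
tPrealloc = toc;

fprintf('grown: %.4f s, preallocated: %.4f s\n', tGrown, tPrealloc);
```

Both loops produce the same vector; only the memory handling differs.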
Personally, I'd change the way this loop works using strrep() like this:
for i=1:length(alphabet) %alphabet = 26 letters+10 numbers+space, 37 characters in total
original_message = strrep(original_message,alphabet{i},morse_alphabet{i});
end
strrep(a,b,c) finds every occurrence of the substring b inside a and replaces it with c. In your case, alphabet is the same as the dictionary chars, and morse_alphabet is the same as the dictionary code.
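One thing to watch with the strrep loop: if a Morse string contains a character that is also in alphabet (for example a space used as a separator), a later iteration will rewrite output produced by an earlier one. A minimal sketch that avoids this, and that also preallocates because the number of input characters is known, looks up each character once and joins the pieces at the end. The three-entry dictionary and the message are hypothetical, only there to keep the example short.

```matlab
% Hypothetical three-symbol dictionary; the real one would have 37 entries.
alphabet       = {'s', 'o', 'e'};
morse_alphabet = {'...', '---', '.'};

original_message = 'sos';

% One cell per input character, so the size is known before the loop.
coded = cell(1, numel(original_message));
for k = 1:numel(original_message)
    idx = find(strcmp(alphabet, original_message(k)), 1);
    coded{k} = morse_alphabet{idx};
end

% Join the pieces with a single space between letters.
morse = strjoin(coded, ' ');
```

This sketch assumes every character of the message is present in the dictionary; an unknown character would leave idx empty and raise an error.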
As for the callbacks, I don't really know about it, so I can't help you with that.

Transform a matrix of integers (0 to 30) to a matrix of emojis

I am working on transforming an image into a set of emojis, depending on how many colors are there. The Maths part is done. I have the matrix of numbers from 0 to 30, but I specifically need to convert the numbers into symbols and I was thinking about emojis since they are so used nowadays.
My question is how am I supposed to read a matrix of integers from a file, transform the matrix of integers into a matrix with different emojis (eventually, from a list of my choice) and put the output in another text file, this time containing the emojis? Is that possible? I guess it should be, but how do I do that? Does anyone have any suggestions?
The problem that I face is actually with the emojis unicode, I don't seem to have success when it comes to receiving messages on the console in their case. I just get "? ?" instead of a smiley face. But that thing happens only for them, the ASCII characters seem to work a bit better. The problem with ASCII characters is that I need, again, expressive images instead of numbers or random pipe shapes.
Here is the code:
%make sure you have the "1234567.jpg" in the same location as the .m file
imdata = imread('1234567.jpg');
[X_no_dither,map] = rgb2ind(imdata,30,'nodither');
imshow(X_no_dither,map)
% and there I try to put the output in a text file
dlmwrite('result.txt',X_no_dither,'delimiter',"\t");
Ok, and the output in the text file is:
0 0 0 0 26 26 ... etc.
And I wonder how am I supposed to write the code in such a way that I will get emojis instead of numbers.
🤔 🤔 🤔 🤔 💖 💖 ... etc.
That's how I'd want the output to be like. But, from what I tried yesterday, I cannot print them without getting warnings/errors.
What you need to do is create a table with your 30 emojis (this documentation page might be helpful), then index into that table. I'm using the compose function as indicated in the page above, it should also be possible to copy-paste emojis into your M-file. If you don't see the emojis in MATLAB's console, change the font you're using.
For example:
% named emojiTable so that it does not shadow MATLAB's built-in table function
emojiTable = [compose("\xD83D\xDE0E"),
"B", % find the utf16 encoding for your emojis, or copy-paste them in
"C",
"D",
...
];
output = emojiTable(X_no_dither + 1);
f = fopen('result.txt', 'wt');
for ii = 1:size(output, 1)
fprintf(f, '%s', output(ii, :));
fprintf(f, '\n');
end
fclose(f);
This will write the file out in UTF16 format, which is what MATLAB uses. If you're on Windows this might work well for you. On other platforms you might want to save as UTF8 instead, which can be accomplished by opening the file in UTF8 mode:
f = fopen('result.txt', 'wt', 'native', 'UTF-8');
Note that, even if you don't manage to get the emojis shown in the MATLAB command window, opening the text file in an editor will show the emojis correctly.
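To make the indexing step concrete, here is a small sketch with a 2-by-2 index matrix. The symbols are plain ASCII placeholders standing in for emojis, and X is a made-up stand-in for X_no_dither. Converting the indices to double before adding 1 is a precaution of mine, since uint8 arithmetic saturates at 255.

```matlab
% Hypothetical lookup table; replace the placeholders with your emojis.
symbols = ["A", "B", "C"];

% Zero-based indices, as rgb2ind returns them.
X = uint8([0 1; 2 0]);

% Shift to one-based indexing; the result has the same shape as X.
out = symbols(double(X) + 1);

% Join each row with tabs so that every element of rows is one line of output.
rows = join(out, sprintf('\t'), 2);
```

Each element of rows can then be written to the file as one line.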

Octave / Matlab - Reading fixed width file

I have a fixed width file format (original was input for a Fortran routine). Several lines of the file look like the below:
1078.0711005.481 932.978 861.159 788.103 716.076
How this actually should read:
1078.071 1005.481 932.978 861.159 788.103 716.076
I have tried various methods (textscan, fgetl, fscanf, etc.). However, the problem I have is, as seen above, that because of the fixed width of the original files there is sometimes no whitespace between some of the numbers. I can't seem to find a way to read them directly and I can't change the original format.
The best I have come up with so far is to use fgetl which reads the whole line in, then I reshape the result into an 8,6 array
A=fgetl(fid)
A=reshape(A,8,6)
which generates the following result
11
009877
703681
852186
......
049110
787507
118936
So now I have the above and thought I might be able to concatenate the rows of that array together to form each number, although that is seeming difficult as well having tried strcat, vertcat etc.
All of that seems a long way round so was hoping for some better suggestions.
Thanks.
If you can rely on every number having exactly three decimal places, you can use a simple regular expression to generate the missing blanks:
s = '1078.0711005.481 932.978 861.159 788.103 716.076';
s = regexprep(s, '(\.\d\d\d)', '$1 ');
c = textscan(s, '%f');
Now c{1} contains your numbers. This will also work if s is in fact the whole file instead of one line.
You haven't mentioned which class of output you need, but I guess you need to read doubles from the file to do some calculations. I assume you are able to read your file, since you already have results from the reshape() function. However, using reshape() will not be efficient for your case since your variables are not fixed-size (e.g. 1078.071 and 932.978).
If I haven't misunderstood your problem:
Your data is squashed in some parts (e.g. 1078.0711005.481 instead of 1078.071 1005.481).
The fractional part of each value has 3 digits.
First of all we need to get rid of spaces from the string array:
A = A(~ismember(A,' '));
Then using the information that fractional parts are 3 digits:
iter = length(strfind(A, '.'));
B = zeros(1, iter); % preallocate the output
for k = 1:iter
    ind = find(A == '.', 1); % position of the first decimal point
    B(k) = str2double(A(1:ind+3));
    A = A(ind+4:end);
end
B will be an array of doubles as a result.
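For completeness, the reshape idea from the question can also be finished off directly. This is a minimal sketch, assuming every field is exactly 8 characters wide, as the sample line suggests. The field width and the hard-coded string are assumptions; in practice the string would come from fgetl.

```matlab
% Sample line from the question; each field occupies exactly 8 characters.
s = '1078.0711005.481 932.978 861.159 788.103 716.076';
fieldWidth = 8;                        % assumed fixed field width

% reshape fills column by column, so each column holds one field;
% transpose so that each row holds one field.
fields = reshape(s, fieldWidth, []).';

% Convert every row to a double; leading blanks are ignored by str2double.
vals = str2double(cellstr(fields)).';
```

Unlike the approaches above, this does not depend on the number of decimal places, only on the field width.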

How can I increase the speed of this xlsread for loop?

I have made a script which contains a for loop selecting columns from 533 different excel files and places them into matrices so that they can be compared, however the process is taking too long (it ran for 3 hours yesterday and wasn't even halfway through!!).
I know xlsread is naturally slow, but does anyone know how I can make my script run faster? The script is below, thanks!!
%Split the data into g's and h's
CRNum = 533; %Number of Carrington Rotation files
A(:,1) = xlsread('CR1643.xlsx','A:A'); % Set harmonic coefficient columns
A(:,2) = xlsread('CR1643.xlsx','B:B');
B(:,1) = xlsread('CR1643.xlsx','A:A');
B(:,2) = xlsread('CR1643.xlsx','B:B');
for k = 1:CRNum
textFileName = ['CR' num2str(k+1642) '.xlsx'];
A(:,k+2) = xlsread(textFileName,'C:C'); %for g
B(:,k+2) = xlsread(textFileName,'D:D'); %for h
end
Don't use xlsread inside a loop, because it opens and then closes the Excel server every time you call it, which is time consuming. Instead, use actxserver to open Excel once before the loop, do what you want, and close it after your loop. For a good example of using actxserver, search for "Read Spreadsheet Data Using Excel as Automation Server" in MATLAB help.
And also take a look at readtable which works faster than xlsread, but generates a table instead.
The most obvious improvement seems to be to load the files only partially if possible. However, if that is not an option, try whether it helps to only open each file once (read everything you need, and then assign it).
M = xlsread(textFileName,'C:D'); % read both columns with a single call
A(:,k+2) = M(:,1); B(:,k+2) = M(:,2);
Also check how much you are reading in each time. If you read in many rows from the first file, you may make the first dimension of A big, and then you will have to fill it each time you read a file.
As an extra, a small but simple improvement can be made at the start: don't call xlsread four times on CR1643.xlsx, but read columns A:B once and then assign the variables based on the result.
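Putting these suggestions together, the loop could be sketched as below. To keep the example runnable without any spreadsheets, readPair is a hypothetical stand-in for xlsread(fileName,'C:D') that returns made-up numbers, and nRows is an assumed row count. Only the structure (preallocate, one read per file, then assign both columns) is the point.

```matlab
CRNum = 3;          % 533 in the question; kept small for the sketch
nRows = 4;          % assumed number of coefficient rows per file

% Hypothetical stand-in for xlsread(fileName,'C:D'): returns an nRows-by-2
% matrix whose values depend on the rotation number in the file name.
readPair = @(fileName) [(1:nRows).', -(1:nRows).'] * str2double(fileName(3:6));

% Preallocate both result matrices (columns 1-2 are reserved for the
% harmonic coefficient columns, as in the question).
A = zeros(nRows, CRNum + 2);
B = zeros(nRows, CRNum + 2);

for k = 1:CRNum
    fileName = ['CR' num2str(k + 1642) '.xlsx'];
    M = readPair(fileName);     % one read per file
    A(:, k+2) = M(:, 1);        % for g
    B(:, k+2) = M(:, 2);        % for h
end
```

Swapping readPair for the real xlsread call should be the only change needed, provided every file has the same number of rows.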
As mentioned in this post, the easiest thing to change would be to read in 'basic' mode. This disables things like formulas and macros in Excel and allows you to read a simple table more quickly. In the documented xlsread syntax, 'basic' is passed as a fourth positional flag after the sheet and the range. For example, you can use:
xlsread('CR1643.xlsx', 1, 'A:A', 'basic')
This resulted in a decrease in load time from about 22 seconds to about 1 second for me when I tested it on a 11,000 by 7 Excel sheet.

Is there a way to automatically suppress Matlab from printing big matrices in command window?

Is there an option in matlab or a plugin/app or a trick such that if you are in an interactive command session, every time it would print out a matrix way too big for a human to look through, it redacts the output to either a warning of how big the matrix is or a summary (only a few rows and columns) of the matrix?
There are many times where I want to examine a matrix in the command window, but I didn't realize how big it was, so I accidentally printed the whole thing out. Or some place inside a function I did not code myself, someone missed a semicolon and I handed it a big matrix, and it dumps the whole thing in my command window.
It makes sense that 99.99% of the time, people do not intend to print a million-row matrix in their interactive command window, right? It completely spams their scroll buffer and removes all the useful information that was on screen before.
So it makes much more sense for MATLAB to assume that a user in an interactive session wants a summary of a big matrix, instead of dumping the whole thing into the command window. There should at least be such an option in the settings.
One possibility is to overload the display function, which is called automatically when you enter an expression that is not terminated by ;. For example, if you put the following function in a folder called "@double" inside any folder on your MATLAB path, the default display behavior will be overridden for double arrays (this is based on Mohsen Nosratinia's display.m for displaying matrix dimensions):
% @double/display.m
function display(v)
% DISPLAY Display a variable, limiting the number of elements shown.
name = inputname(1);
if isempty(name)
name = 'ans';
end
maxElementsShown = 500;
newlines = repmat('\n',1,~strcmp(get(0,'FormatSpacing'),'compact'));
if numel(v)>maxElementsShown,
warning('display:varTooLong','Data not displayed because of length.');
% OR show the first N=maxElementsShown elements
% builtin('disp', v(1:maxElementsShown));
elseif numel(v)>0,
fprintf([newlines '%s = \n' newlines], name);
builtin('disp', v);
end
end
For example,
>> xx=1:10
xx =
1 2 3 4 5 6 7 8 9 10
>> xx=1:1e4
Warning: Data not displayed because of length.
> In double.display at 17
EDIT: Updated to respect 'compact' and 'loose' output format preference.
EDIT 2: Prevent displaying an empty array. This makes whos and other commands avoid an unnecessary display.
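If you would rather see a summary than a warning, the branch for large arrays could print only the top-left corner. Here is a minimal standalone sketch of that idea. It assumes a 2-D array, and the 5-by-5 limits and the sample matrix are arbitrary choices for illustration.

```matlab
% Hypothetical large matrix standing in for the variable being displayed.
v = reshape(1:1e4, 100, 100);

maxRows = 5;                    % arbitrary preview limits
maxCols = 5;

% Clamp the preview size to the actual size of the array.
r = min(size(v, 1), maxRows);
c = min(size(v, 2), maxCols);
corner = v(1:r, 1:c);

fprintf('%dx%d %s, showing top-left %dx%d block:\n', ...
    size(v, 1), size(v, 2), class(v), r, c);
disp(corner);
```

Inside the overloaded display function, disp(corner) would need to be builtin('disp', corner), as in the code above.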