I have downloaded a large set of GridFloat (.flt, .hdr) DEM files from USGS NED (1") in order to implement my own elevation service on my website. I would like to be able to look up an elevation from this fileset, given latitude and longitude as inputs. I use Perl for my website development. The files have a conventional naming scheme, and I am able to get the appropriate tile filename using the lat/lng. However, accessing the internals of the file is where I'm having an issue.
I know the file is in a fairly straightforward format (.flt, apparently called "Gridfloat"), but I could use some help figuring out the magic numbers for calculating where in the file I need to seek to for a given lat/lng, and how to handle byte order and so on so that I end up with an elevation. From what I understand, apparently row ordering can be an issue, as well as byte ordering. I am looking for a recipe that does not involve use of any third party libraries such as GDAL, which I think are overly complicated and slow for what I want to do. I think it should be possible to just open the file, seek to a position based on some calculation, read some bytes and then unpack them into the correct byte order. Here is an example .hdr file that accompanies floatn48w097_1.flt, I think it has the necessary info. There are a bunch of other files that come with the .zip, including .prj, but I believe those are for a commercial program like ArcInfo. I think everything I need should be in the following .hdr file.
ncols 3612
nrows 3612
xllcorner -97.00166666667
yllcorner 46.99833333333
cellsize 0.000277777777778
NODATA_value -9999
byteorder LSBFIRST
What I'm really hoping for is a formula for calculating the row and column from the lat/lng, then another formula for translating the row/column into a position for seek, how many bytes to read, and how to convert those raw bytes into an integer (or whatever it is these files contain). I feel that this could be a very fast operation, without all the overhead involved with the larger libraries which seem to be focused on doing a lot of stuff that I don't need.
I don't need Perl code, just pseudocode showing the calculations for row/col offsets etc would be more than enough. I believe the files are binary format, a straightforward grid of 4-byte numbers. The file example that goes with the .hdr file above has a size of 52186176, and when you multiply the ncols by nrows (from the .hdr), you get 13046544. which divides nicely into the file size by 4. So I assume it's just a matter of getting the right formula for row/col based on lat/lng, and then getting the bytes swizzled into the right order. I've just not done this much.
I found some reference to the Gridfloat format here: coolutils.com/formats/flt so apparently the file consists of a grid of 32-bit floating point values (4 bytes per cell, which matches the file size above).
Thanks!
Ok, I think I have an answer. The following is a Perl routine which seems to give back reasonable-looking elevation values when tested with the USGS NED1 .flt files. The script takes latitude and longitude as command line arguments, looks up the file and indexes into the grid.
#!/usr/bin/perl
use strict;
use POSIX;
use Math::Round;
sub get_elevation
{
my ($lat, $lng) = @_;
my $lat_degree = ceil ($lat);
my $lng_degree = floor ($lng);
my $lat_letter = ($lat >= 0) ? 'n' : 's';
my $lng_letter = ($lng >= 0) ? 'e' : 'w';
my $lng_tilenum = abs($lng_degree);
my $lat_tilenum = abs($lat_degree);
my $tilename = $lat_letter . sprintf('%02d', $lat_tilenum) . $lng_letter . sprintf('%03d',$lng_tilenum);
my $path = "/data/elevation/ned1/$tilename/float${tilename}_1.flt";
print "path = $path\n";
die "No such file" if (!-e($path));
my ($lat_fraction, $lat_integral) = modf (abs($lat));
my $row = floor ((1 - $lat_fraction) * 3600);
my ($lng_fraction, $lng_integral) = modf (abs($lng));
my $col = floor ((1 - $lng_fraction) * 3600);
open(FILE, "<$path");
my $pos = (3612 * 4 * 6) + (3612 * 4 * $row) + (4 * 6) + ($col * 4);
seek (FILE, $pos, SEEK_SET);
my $buffer;
read (FILE, $buffer, 4);
close (FILE);
my ($elevation) = unpack('f<', $buffer);  # little-endian 32-bit float, per "byteorder LSBFIRST" (needs Perl 5.10+)
if ($elevation == -9999)
{
return 'undefined';
}
return $elevation;
}
my $lat = $ARGV[0];
my $lng = $ARGV[1];
my $elevation = get_elevation ($lat, $lng);
print "Elevation for ($lat, $lng) = $elevation meters (", $elevation * 3.28084, " feet)\n";
Hope this might be useful to anyone else trying to do the same kind of thing... I've tested this method now and it seems to produce good looking elevation profiles which are smoother than those from the 3" SRTM data.
Neil put me on the right track, but I think there are a few problems with his original answer. I've added some fixes and improvements, including on-the-fly download of the needed tile from the 1/3 arc second (10 meter) dataset, proper parsing of the header file, and what I believe is corrected indexing.
This is still mostly illustrative and should be improved before production use, in particular by hanging on to the header information and the file handle across repeated queries.
https://gist.github.com/biomiker/32fe34e1fa1bb49ae1135ab6652f596d
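For readers who only want the arithmetic, here is a minimal sketch of header-driven indexing, written in Python rather than Perl. It is not the code from the gist. The function names are hypothetical, and it assumes that rows are stored north to south, that xllcorner/yllcorner give the outer corner of the lower-left cell, and that each cell is a 4-byte float.

```python
import math
import struct

def parse_hdr(text):
    """Parse GridFloat .hdr text into a dict with lower-cased keys."""
    hdr = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1]
        if key in ("ncols", "nrows"):
            hdr[key] = int(value)
        elif key == "byteorder":
            hdr[key] = value.upper()
        else:
            try:
                hdr[key] = float(value)
            except ValueError:
                hdr[key] = value  # keep unknown non-numeric entries as text
    return hdr

def cell_offset(hdr, lat, lng):
    """Byte offset of the cell containing (lat, lng)."""
    col = math.floor((lng - hdr["xllcorner"]) / hdr["cellsize"])
    row_from_south = math.floor((lat - hdr["yllcorner"]) / hdr["cellsize"])
    if not (0 <= col < hdr["ncols"] and 0 <= row_from_south < hdr["nrows"]):
        raise ValueError("point lies outside this tile")
    # Assumption: the first row in the file is the northernmost one.
    row = hdr["nrows"] - 1 - row_from_south
    return (row * hdr["ncols"] + col) * 4  # 4 bytes per 32-bit float

def read_elevation(flt_path, hdr, lat, lng):
    """Return the elevation at (lat, lng), or None for a NODATA cell."""
    fmt = "<f" if hdr.get("byteorder", "LSBFIRST") == "LSBFIRST" else ">f"
    with open(flt_path, "rb") as f:
        f.seek(cell_offset(hdr, lat, lng))
        (value,) = struct.unpack(fmt, f.read(4))
    return None if value == hdr.get("nodata_value") else value
```

Because the offsets come from the header rather than from hard-coded constants, the same functions should work for tiles with a different size or overlap, as long as the assumptions above hold.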
I have a series of obj files which were produced by photogrammetry by my coworkers who specialize in dealing with GIS (Geographic Information Systems) data. The first few data points in the files look something like:
v 445077.679 4460688.700 61.371
v 445077.340 4460686.317 61.367
v 445077.296 4460686.024 61.416
I believe the file is valid because I can open the files in an online viewer and I get what I expect to see using the viewer at http://masc.cs.gmu.edu/wiki/ObjViewer:
When I open the same file in Blender, Unity or Unreal Engine, the object is very far from the world origin. I can center it by moving the origin to the center of mass and then resetting the object location, but when I recenter the object I always see something that looks like:
What am I doing wrong, or what could be wrong with my file?
The reason for the problem with these files is the large offset combined with 32-bit float values. In this case the objects all use the same geographic origin, probably at a lat/long of 0.000N/0.000E.
Nearly all 3D graphics programs use 32-bit floating point values to store each point's location, and the combination of the offset and the 32-bit value causes some of the precision to be lost. 32-bit floats have about 7 decimal digits of precision, so the offset of 4460688 in the example file completely dominates, and effectively cuts the model from 1 mm resolution to roughly 1 m resolution. The reason for the long triangles is that more data is lost in one direction than the other, because the offsets on the two horizontal axes differ by an order of magnitude.
The solution is to apply some offset to bring the objects close to the origin BEFORE importing them with the 3D software.
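The precision loss is easy to check. The short sketch below (the helper name is hypothetical) rounds one coordinate from the example file to single precision, before and after subtracting an offset:

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

northing = 4460688.700                  # second coordinate of the first vertex
stored = to_float32(northing)           # what a 32-bit pipeline keeps
shifted = to_float32(northing - 4460680)  # same coordinate moved near the origin

print(stored, abs(stored - northing))
print(shifted, abs(shifted - (northing - 4460680)))
```

Between 2^22 and 2^23, adjacent 32-bit floats are 0.5 apart, so the unshifted coordinate snaps to the nearest half metre, while the shifted one keeps sub-millimetre accuracy.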
I wrote a quick python script that can help with this: https://gitlab.umich.edu/lsa-ts-rsp/xr-shiftOBJ/-/blob/main/shiftOBJ.py
import re  # regex

# Matches a vertex line; compiled once. Only the three coordinates are captured,
# so any vertex colors after them are dropped. "-?" also accepts negative values.
VERTEX_RE = re.compile(r'v (-?\d+\.\d+) (-?\d+\.\d+) (-?\d+\.\d+)')

def shiftFile(inFileName, outFileName, offset):
    with open(inFileName) as myInFile:
        with open(outFileName, 'w') as myOutFile:
            for line in myInFile:
                myOutFile.write(shiftLine(line, offset))

def shiftLine(inLine, offset):
    # if a line is a vertex then apply the shift and drop vertex colors
    m = VERTEX_RE.match(inLine)
    if m:
        x, y, z = (float(v) + o for v, o in zip(m.groups(), offset))
        return 'v {:.3f} {:.3f} {:.3f}\n'.format(x, y, z)
    else:
        return inLine

if __name__ == '__main__':
    inFile = '/Users/crstock/Documents/Unreal Projects/Olynthos Data/B88DW18.obj'
    outFile = '/Users/crstock/Documents/Unreal Projects/Olynthos Data/B88DW18_shifted.obj'
    offset = [-445070, -4460680, -59.0]
    shiftFile(inFile, outFile, offset)
This applies an offset to all vertex lines and leaves the other lines alone. By using the same offset values for multiple input files you can maintain the relative shift so that related objects fit together appropriately.
From a Monte-Carlo simulation I have a range of files, say: file_1.mat, file_2.mat,...,file_n.mat, where n is large. Each file contains one or several (maximum 3 if it matters) large 1D arrays in time of interest, say var1, var2, var3.
I am now as always interested in finding the mean value of these variables. My question is now, how do I do this in the most efficient way? The keyword here is efficiency. Below you will find the MWE which is done the standard way, but this is quite time consuming as the files are large and there are many.
I am programming in Matlab; however, ideas presented in pseudocode are also very welcome.
MWE:(The standard way)
meanVar1 = zeros(1,1e6); %I do not remember the exact size, just use 1e6
meanVar2 = zeros(1,1e6);
meanVar3 = zeros(1,1e6);
for i = 1:n
load(strcat('file_',int2str(i)),'var1','var2','var3')
meanVar1 = meanVar1 + var1;
meanVar2 = meanVar2 + var2;
meanVar3 = meanVar3 + var3;
end
meanVar1 = meanVar1/n;
meanVar2 = meanVar2/n;
meanVar3 = meanVar3/n;
I have a CSV file 1.6 GB large, that I need to feed into matlab. I will have to do this frequently and I need it to run quickly. The file is of the form:
20111205 00:00.2 99.18 6 E
20111205 00:00.2 99.18 5 E
20111205 00:00.2 99.18 1 E
20111205 00:00.2 99.195 5 E
20111205 00:00.2 99.195 5 E
20111205 01:27.0 99.19 5 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
The code I have right now is the following:
tic;
format long g
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
[c] = fscanf(fid, '%d,%d:%d.%d,%f,%d,%c');
c = reshape(c, 7, length(c)/7)
toc;
But this is far too slow. I would appreciate a method of getting this CSV file into matlab in the most efficient manner possible. Thank you!
Consider using a binary file format. Binary files are much smaller and don't need to be converted by MATLAB into the binary format. Hence they are much faster to read and write. They may also be more accurate (precision may be higher).
http://www.mathworks.com.au/help/matlab/ref/fread.html
The recommended syntax is textscan (http://www.mathworks.com/help/matlab/ref/textscan.html)
Your code would look like this:
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = textscan(fid, '%d,%d:%d.%d,%f,%d,%c');
fclose(fid);
You end up with a cell array... whether it's worth converting that to another shape really depends on how you want to access the data afterwards.
It is quite likely that this would be faster if you include a loop that allows you to use a smaller, fixed amount of memory for much of the operation. One problem with reading large files is the fact that you don't know ahead of time how big it will be - and that very likely means that Matlab guesses the amount of memory it needs, and frequently has to rescale. That is a very slow operation - if it happens every 1MB, say, then it copies 1 MB once, next 2 MB, then again 3 MB, etc - as you can see it is quadratic in the size of the array.
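To put a number on that, here is a small sketch of the copy cost under the simplified model just described (the buffer grows by a fixed step and the existing contents are copied each time). How MATLAB actually grows its arrays is not something this sketch claims to know.

```python
def total_copied_mb(final_size_mb, step_mb=1):
    """Total MB copied if a buffer is regrown by step_mb each time it fills."""
    copied = 0
    size = step_mb
    while size < final_size_mb:
        copied += size   # existing contents are copied into the bigger buffer
        size += step_mb
    return copied

small = total_copied_mb(1600)   # a 1.6 GB file, growing 1 MB at a time
large = total_copied_mb(3200)   # a file twice as large
print(small, large, large / small)
```

For the 1.6 GB case this model copies about 1.28 million MB in total, and doubling the file size roughly quadruples that figure, which is the quadratic behaviour in question.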
If instead you allocate a fixed amount of memory for the final result, and process in smaller batches, you avoid all that overhead. I'm pretty sure it will be much faster - but you would have to experiment a bit with the block size. That would look something like this:
block = 1000;
Nlines = 35E6;
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = struct(field1, field2, fieldn, value); %... initialize structure array or other storage for c ...
c_offset = 0;
while ~feof(fid)
temp = textscan(fid, '%d,%d:%d.%d,%f,%d,%c', block);
bt = numel(temp{1}); % rows read in this batch - should be `block`, except for last loop
%... extract, process, store in c(c_offset + (1:bt))...
c_offset = c_offset + bt;
end
fclose(fid);
Inspired by @Axon's answer, I implemented a "fast" C program to convert the file to binary, then read it in using Matlab's fread function. Spoiler alert: reading is then 20x faster... although the initial conversion takes a little bit of time.
To make the job in Matlab easier, and the file size smaller, I am converting each of the number fields into an int16 (short integer). For the first field - which looks like a yyyymmdd field - that involves splitting into two smaller numbers; similarly the decimal numbers are converted to two short integers (given the apparent range I think that is valid). All this is recognizing that "to really optimize, you must really know your problem" - so if assumptions are invalid, the results will be too.
Here is the C code:
#include <stdio.h>

int main(void) {
    FILE *fp, *fo;
    long int ld1;
    int d2, d3, d4, d5, d6, d7;
    short int buf[9];
    char c8;
    short int year, monthday;

    fp = fopen("bigdata.txt", "r");
    fo = fopen("bigdata.bin", "wb");
    if (fp == NULL || fo == NULL) {
        printf("unable to open file\n");
        return 1;
    }
    // stop at EOF or at the first line that does not yield all 8 fields
    while (fscanf(fp, "%ld %d:%d.%d %d.%d %d %c\n",
                  &ld1, &d2, &d3, &d4, &d5, &d6, &d7, &c8) == 8) {
        // split yyyymmdd so that each half fits in a short int
        year = ld1 / 10000;
        monthday = ld1 - 10000L * year;
        // move everything into buffer for single call to fwrite:
        buf[0] = year;
        buf[1] = monthday;
        buf[2] = d2;
        buf[3] = d3;
        buf[4] = d4;
        buf[5] = d5;
        buf[6] = d6;
        buf[7] = d7;
        buf[8] = c8;
        fwrite(buf, sizeof(short int), 9, fo);
    }
    fclose(fp);
    fclose(fo);
    return 0;
}
The resulting file is about half the size of the original - which is encouraging and will speed up access. Note that it would be a good idea if the output file could be written to a different disk than the input file - it really helps keep data streaming without a lot of time wasted in seek operations.
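To inspect the record layout without compiling anything, the same packing can be sketched in Python. This is an illustration rather than a replacement for the C converter: the field names (price, qty, flag) are guesses at what the columns mean, the input is assumed to be whitespace-separated with a decimal point in the third column, and the little-endian "<9h" layout is an assumption that matches a typical x86 build of the C program.

```python
import struct

def pack_record(line):
    """Pack one text line into nine 16-bit integers (18 bytes)."""
    date, clock, price, qty, flag = line.split()
    # 20111205 does not fit in 16 bits, but 2011 and 1205 both do.
    year, monthday = divmod(int(date), 10000)
    minutes, rest = clock.split(":")
    seconds, tenths = rest.split(".")
    whole, frac = price.split(".")
    fields = [year, monthday, int(minutes), int(seconds), int(tenths),
              int(whole), int(frac), int(qty), ord(flag)]
    return struct.pack("<9h", *fields)

record = pack_record("20111205 00:00.2 99.18 6 E")
print(len(record), struct.unpack("<9h", record))
```

Each line becomes a fixed 18-byte record, which is why the binary file can be read with a single fread and reshaped into 9 rows.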
Benchmark: using a file of 2 M lines as input, this ran in about 2 seconds (same disk). The resulting binary file is read in Matlab with the following:
tic
fid = fopen('bigdata.bin');
d = fread(fid, 'int16');
d = reshape(d, 9, []);
toc
Of course, now if you want to recover the numbers as floating point numbers, you will have to do a little bit of work; but I think it's worth it. One possible problem you will have to solve is the situation where the value after the decimal point has a different number of digits: converting (a,b) into float isn't as simple as "a + b/100" when b > 100... "exercise for the student"?
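One way to sidestep that problem is to fix the number of fractional digits at conversion time, so the pair can always be recombined the same way. A minimal sketch in Python, with hypothetical helper names, assuming non-negative values and at most three digits after the decimal point:

```python
def split_decimal(text, digits=3):
    """Split '99.18' into (99, 180): the fraction is padded to a fixed width."""
    whole, _, frac = text.partition(".")
    frac = (frac + "0" * digits)[:digits]   # pad with zeros, drop extra digits
    return int(whole), int(frac)

def join_decimal(whole, frac, digits=3):
    """Inverse of split_decimal: (99, 180) -> 99.18."""
    return whole + frac / 10 ** digits

for text in ("99.18", "99.195", "99.05"):
    whole, frac = split_decimal(text)
    print(text, (whole, frac), join_decimal(whole, frac))
```

With a fixed width, "99.05" and "99.5" are stored as (99, 50) and (99, 500), so they can no longer be confused, and the scaled fraction still fits comfortably in an int16.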
A little benchmarking: The above code took about 0.4 seconds. By comparison, my first suggestion with textscan took about 9 seconds on the same file; and your original code took a little over 11 seconds. The difference may get bigger when the file gets bigger.
If you do this a lot (as you said), it clearly is worth converting your files once to binary format, and using them that way. Especially if the file needs to be converted only once, and read many times, the savings will be considerable.
Update:
I repeated the benchmark with a 13M line file. The conversion took 13 seconds, the binary read < 3 seconds. By contrast each of the other two methods took over a minute (textscan: 61s; fscanf: 77s). It seems that things are scaling linearly (file size 470M text, 240M binary)
I have a NetCDF file, which contains data representing total precipitation across the globe over several months (so it's stored in a three dimensional array). I first ensured that the data was sensible, and the way it was formed, both in XConv and ncdump. All looks sensible - values vary from very small (~10^-10 - this makes sense, as this is model data, and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MatLab. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers, such as 5000. Strangely, looking at a plot of the data in MatLab with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
address = '/Users/levyadmin/Downloads/'
file_base = 'output'
ncid = ncdf_open(address + file_base + '.nc')
MONTHS=['january','february','march','april','may','june','july','august','september','october','november','december']
varid_field = ncdf_varid(ncid, "tp")
varid_lon = ncdf_varid(ncid, "longitude")
varid_lat = ncdf_varid(ncid, "latitude")
varid_time = ncdf_varid(ncid, "time")
ncdf_varget,ncid, varid_field, total_precip
ncdf_varget,ncid, varid_lat, lats
ncdf_varget,ncid, varid_lon, lons
ncdf_varget,ncid, varid_time, time
ncdf_close,ncid
lats = reform(lats)
lons = reform(lons)
time = reform(time)
total_precip = reform(total_precip)
total_precip = total_precip*1000. ;put in mm
noLats=(size(lats))(1)
noLons=(size(lons))(1)
noMonths=(size(time))(1)
; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
av_precip=fltarr(noLons,noLats,12)
for month=0, 11 do begin
year = 0
while ( (year*12) + month lt noMonths ) do begin
av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
year++
endwhile
av_precip(*,*,month) = av_precip(*,*,month)/year
endfor
fname = address + file_base + '.dat'
OPENW,1,fname
PRINTF,1,'longitude'
PRINTF,1,lons
PRINTF,1,'latitude'
PRINTF,1,lats
for month=0,11 do begin
PRINTF,1,MONTHS(month)
PRINTF,1,av_precip(*,*,month)
endfor
CLOSE,1
END
Anyone have any ideas why I'm getting such strange values in MatLab and IDL?!
AH! Found the answer. NetCDF files can store a variable in packed form, with an offset and a scale factor, to keep the size of the file to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump, and hard coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code..?
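The unpacking arithmetic itself is small. Here it is sketched in Python with made-up numbers; the parameter names follow the common NetCDF packing attributes scale_factor and add_offset, the fill_value handling is an assumption, and this does not show how to read the attributes in IDL.

```python
def unpack_values(raw, scale_factor, add_offset, fill_value=None):
    """Convert packed values to physical values; fill values become None."""
    return [None if v == fill_value else add_offset + scale_factor * v
            for v in raw]

# Hypothetical packed data and attributes, chosen so the results are exact.
raw = [-20, 0, 4, -32767]
print(unpack_values(raw, scale_factor=0.5, add_offset=10.0, fill_value=-32767))
```

Any unit conversion, such as the multiplication by 1000 to get mm, has to happen after this step, because the offset is expressed in the variable's original units.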
I've always used printf, and I've never used write/format. Is there any way to reproduce printf("%12.5e", $num) using a format? I'm having trouble digesting the perlform documentation, but I don't see a straightforward way of doing this.
EDIT: based on the answers I got, I'm just gonna keep on using printf.
Short answer, don't use formats.
Unresearched answer, sure, just use sprintf:
#!/usr/bin/perl
use strict;
use warnings;
our $num = .005;
write;
format STDOUT =
@>>>>>>>>>>>>>>>>>
sprintf("%12.5e", $num)
.
Seriously, if you need something like Perl 5 formats, take a look at Perl6::Form (note, this is a Perl 5 module, it just implements the proposed Perl 6 version of formats).
I totally agree with Chas. Owens on formats in general. Format was really slick 15 years ago, but format has not kept up with the advancements of the rest of Perl.
Here is a technique for line-oriented output that I use from time to time. You can use formline, which is one of the public internal functions used by format. Format is page oriented, so it is very hard to do things like span columns or change the format by line depending on the data. With formline you can format a single line using the same text formatting logic used by format and then output that result yourself.
A (messy) example:
use strict; use warnings;

sub print_line {
    my $pic=shift;
    my @args=@_;
    formline($pic,@args);
    print "$^A\n";
    $^A='';
}

my ($wlabel, $wlow, $whigh, $wavg)=(0,0,0,0);
my ($plabel,$plow,$phigh, $pavg);
my ($s_low,$s_high,$s_avg)=qw(%.2f %.2e %.2f);

my @results=( ["Label 1", 3.445, 0.00006678, .025],
              ["Label 2", 12.5555556, 55.112, 1.11],
              ["Wide Label 3", 1231.11, 1555.0, 66.66] );

foreach (@results) {
    my $tmp;
    $tmp=length($_->[0]);
    $wlabel=$tmp if $tmp>$wlabel;
    $tmp=length(sprintf($s_low,$_->[3]));
    $wlow=$tmp if $tmp>$wlow;
    $tmp=length(sprintf($s_high,$_->[2]));
    $whigh=$tmp if $tmp>$whigh;
    $tmp=length(sprintf($s_avg,$_->[1]));
    $wavg=$tmp if $tmp>$wavg;
}

print "\n\n";
my @a1=("Label", "Rate - Operations / sec");
my @a2=("Text", "Average", "High", "Low");
my @a3=("----------", "-------", "----", "---");

my $l1fmt="@".'|' x $wlabel." @".'|'x($whigh+$wavg+$wlow+6);
my $l2fmt="@".'|' x $wlabel." @".'|' x $wavg." @".'|' x $whigh .
          " @".'|' x $wlow;

print_line($l1fmt,@a1);
print_line($l2fmt,@a2);
print_line($l2fmt,@a3);

$plabel="@".'>' x $wlabel;
$phigh="@".'>' x $whigh;
$pavg="@".'>' x $wavg;
$plow="@".'<' x $wlow;

foreach (@results) {
    my $pic="$plabel $pavg $phigh $plow";
    my $mark=$_->[0];
    my $avg=sprintf($s_avg,$_->[1]);
    my $high=sprintf($s_high,$_->[2]);
    my $low=sprintf($s_low,$_->[3]);
    print_line($pic,$mark,$avg,$high,$low);
}
print "\n\n";
Outputs this:
Label Rate - Operations / sec
Text Average High Low
---------- ------- ---- ---
Label 1 3.44 6.68e-05 0.03
Label 2 12.56 5.51e+01 1.11
Wide Label 3 1231.11 1.56e+03 66.66
Notice that the width of the columns is set based on the width of the data as formatted by the sprintf format string. You can then left, center, right justify that result. The "Low" data column is left justified, the rest of the data are right justified. You can change this by the symbol used in the scalar $plow and it is the same as format syntax. The labels at the top are centered and the "Rate - Operations / sec" label spans 3 columns.
This is obviously not "production ready" code, but you get the drift I think. You would need to further check the total width of the columns against desired width, etc. You have to manually do some of the work that format does for you, but you have far more flexibility with this approach. It is very easy to use this method for several sections of a line with sprintf for example.
Cheers.