FORTRAN: Best way to store a large amount of data which is readable in MATLAB

I am working on developing an application in Fortran where I have points defining quadrilateral panels on the surface of an object. I am calculating various parameters on these quadrilateral panels for a number of frequencies.
The output file should look like:
FREQUENCY,PANEL_NUMBER,X1,Y1,Z1,X2,Y2,Z2,X3,Y3,Z3,X4,Y4,Z4,AREA,PRESSURE,....
0.01,1,....
0.01,2,....
0.01,3,....
.
.
.
.
0.01,2000,....
0.02,1,....
0.02,2,....
.
.
.
0.02,2000,...
.
.
I am expecting a maximum of 300,000 rows with 30 columns. The data types are a mix of integer, real, and complex numbers. I want to store this file and later read it in MATLAB to create a 3D geometry, which I will color based on the pressure at each panel.
The problem is, as you can see from the file structure, there is a lot of data. I am currently writing this as a CSV file, and the size is about 26 GB.
I do not want to use a database to handle this. Could anyone suggest what file format I should use to write this data from Fortran?
Thanks for your help,
Amitava

Store the data in the native format of the computer rather than in a human-readable file in which the numbers have been converted to base 10 and characters. This will produce the smallest files and the fastest processing. On the Fortran open statement, use form='unformatted', access='stream'. The first causes the file to be unformatted; the second causes Fortran not to include its usual record-length information, which is Fortran-specific. This omission makes the file more portable to other languages. Someone else can help better with how to read the file in MATLAB; I found this on the web: http://www.mathworks.com/help/matlab/import_export/importing-binary-data-with-low-level-i-o.html
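For the MATLAB side, a minimal sketch of reading such a stream file with low-level I/O might look like the following; it assumes every value was written as an 8-byte real (real64) with a fixed number of values per row, and the file name and column count are illustrative:

    % Minimal sketch: read a Fortran stream-access binary file in MATLAB.
    % Assumes all values were written as real64 and each row has ncols values.
    ncols = 30;                                  % values per row (assumed)
    fid = fopen('panels.bin', 'r');              % open the binary file
    raw = fread(fid, [ncols, Inf], 'float64');   % fread fills column-major
    fclose(fid);
    data = raw.';                                % one row per panel record

Note that complex values would need to be written, and read back, as separate real and imaginary parts.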
UPDATE: This approach has several assumptions. It might not work easily if you wish to transport the file between different types of computers. Your question implies that you have many rows of identical structure, which maps naturally onto a file with that number of identical records. It seems that you want to read the entire file, in which case a sequential file is appropriate. If you wish to read "random" records, a Fortran direct-access file might be useful. With the simplicity of identical records, using a native file format seems easy. If you want self-documentation or portability across computers (with different numeric representations), a file format such as HDF or FITS would be useful.

I second @steabert's mention of NetCDF and there's also HDF5 (on which the NetCDF 4 format is built). However, it does depend on what you mean by "data types": they are best used with regular/rigid data layouts and NetCDF's support for Fortran derived types can be painful at times.
Possible advantages for large datasets like yours are transparent data compression; data checksumming; and possibly more natural random access (that is, no need to compute seek positions based on array indices) compared with Fortran stream access. That's on top of the usual benefits of a self-documenting and portable file format.
MATLAB has built-in support for reading these files, and recent versions also support the OPeNDAP framework, so you wouldn't even need to have the file on the same machine (or on a single machine).
Of course, disadvantages: extra software; extra skills development (especially for HDF5); and increased code complexity on the Fortran side.
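For a taste of the MATLAB side, here is a hedged sketch using MATLAB's built-in HDF5 functions; the file and dataset names are invented for illustration:

    % Illustrative only: write and read a dataset with MATLAB's HDF5 support.
    % 'panels.h5' and '/pressure' are made-up names, not from the post.
    p = rand(2000, 1);                        % e.g. one pressure per panel
    h5create('panels.h5', '/pressure', size(p));
    h5write('panels.h5', '/pressure', p);
    q = h5read('panels.h5', '/pressure');     % read it back
    h5disp('panels.h5');                      % inspect the file's structure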

Related

Memory issues with large amounts of data stored as nested cells in MATLAB

I have large amounts of data stored as nested cells in .mat files. My biggest problem right now is the load times for accessing these files, but I'm wondering if the underlying problem is that I came up with an inefficient way for storing the data and I should restructure it to be smaller.
The full file consists of a cell array:
Hemi{1,h}, where there are 52 versions of h
.{n,p}, where there are 85 versions of n and up to ~100 versions of p
.Variable, where there are 10 variables, each with ~2500 values
This full file ate up all my memory, so I saved it in parts, i.e.:
Hemi1.mat=Hemi{1,1}
Hemi2.mat=Hemi{1,2}
etc.
The next step for this application is to load each file, determine which part of it is an appropriate solution (I need Hemi{1,h}.{n,p}.Var1, Hemi{1,h}.{n,p}.Var2, and Hemi{1,h}.{n,p}.Var3 for this, but I still need to keep track of the other Variables), save the solution, then close the file and move to the next one.
Is there a faster way to load these files?
Is the problem less my dataset and more how I've chosen to store it? Is there a better alternative?
That is quite a lot of data. I have a few suggestions you could look into. The first is to see if you can change the datatypes to something like categorical arrays, which are far more memory-efficient (see the sketch after these suggestions). Also, if you are storing strings as your final data, this can be quite heavy storage-wise.
Second, you could look into HDF5 file storage. I hear it is a nice way to store structured data.
Finally, you could try converting your {n,p} arrays into table structures. I am not sure if tables are better for memory, but they are nice to work with and may help you out. (Depending on your version of Matlab you may not have tables :P)
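A minimal sketch of the categorical idea, with invented variable names; it just compares the memory footprint of a cell array of strings against its categorical equivalent:

    % Hypothetical example: repeated strings stored as categorical use far
    % less memory than the same strings in a cell array.
    labels = repmat({'left'; 'right'}, 5000, 1);  % 10,000 strings in a cell
    c = categorical(labels);                      % compact categorical copy
    whos labels c                                 % compare reported bytes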
I hope this helps!
-Kyle

How do you generate a CAD geometry of randomly oriented objects?

How can one generate CAD geometries of randomly oriented and randomly sized objects (3D)? I need to model randomly sized and randomly oriented rectangles--thousands to millions of them.
I have not yet come across any CAD tools that have =rand() functions that can be inputted into dimensions. Is one way perhaps to have a CAD program import a CSV file of these randomly generated parameter values?
In SolidWorks, you can have model parameters (dimension lengths/angles, constraints, etc.) stored in an Excel spreadsheet called a Design Table. Each row in the spreadsheet represents a different configuration of your model, and each column a different parameter. You can use Excel's built-in capabilities or an export-capable tool of your choosing to generate the configurations according to your desired distribution (a sketch of generating such a table follows below). I don't recall off the top of my head the easiest way to get a large number of instances with different configurations into the same assembly, and you haven't really told us what you're trying to accomplish, so I can't give you specific recommendations anyway.
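For instance, a MATLAB sketch of generating random rectangle parameters and writing them to a CSV file that a design table could import; the column names, ranges, and file name are invented for illustration:

    % Hypothetical: generate n random rectangle configurations and write
    % them to CSV. Ranges and column names are illustrative assumptions.
    n = 1000;                          % number of configurations
    L = 10 + 90*rand(n,1);             % length, uniform in [10, 100]
    W = 5  + 45*rand(n,1);             % width, uniform in [5, 50]
    theta = 360*rand(n,1);             % orientation in degrees
    writetable(table(L, W, theta), 'configs.csv');  % import into the CAD tool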
If you have a specific CAD tool then you can often find documentation on the internal file format. With a little experimentation you can sometimes write a small external program that will generate the header of the CAD file and then loop thousands or millions of times generating each individual object. Finally you generate the lines needed to complete the file. That can sometimes be easier than trying to force a tool to do something the designers never expected. And this might let you use the software of your choice to generate the file.
I would suggest starting small. Use the CAD tool to create a file with two or three of your rectangles. Save and inspect the contents of the file to see that it matches your understanding of the needed format. Then try externally creating what should be the same file and verify your version is correctly accepted.
You might consider that some tool designers never expected anyone to want thousands or millions of anything, so I would suggest sneaking up on the problem: try doubling the number of items, check that this works as expected, and then repeat until either you successfully reach millions or you find the point where the CAD tool can't cope.

Efficient disk access of large number of small .mat files containing objects

I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after), as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine; I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders, but it seems like such a nasty hack (and it won't fix the SSD's dislike of writing many small files). Is there a better way?
Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Five ideas to consider:
Try storing in an HDF5 object - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R).
A variation on your method #2 is to store them in one or more files, but to turn off compression (see the sketch after this list).
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been a while and I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
@SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square one). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the entire set is less than 1 GB, it may be easier to just copy it to a RAM disk and access it through that. Just remember to copy from the RAM disk if anything is saved (or wrap save so that objects are written in both places).
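A minimal sketch of the no-compression idea; note that the '-nocompression' flag only exists for -v7.3 files in relatively recent MATLAB versions, and the variable and file names here are invented:

    % Hypothetical: save one group of objects per file with compression off.
    save('group1.mat', 'objGroup1', '-v7.3', '-nocompression');
    % Or fall back to the legacy uncompressed format:
    save('group1.mat', 'objGroup1', '-v6');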
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Serialization utilities from Matlab Central: http://www.mathworks.com/matlabcentral/fileexchange/29457, which was inspired by http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/
Try storing them as blobs in a database.
I would also try the multiple folders method as well - it might perform better than you think. It might also help with organization of the files if that's something you need.
The solution I have come up with is to save object arrays of around 100 objects each. These files tend to be 5-6 MB, so loading is not prohibitive, and access is just a matter of loading the right array(s) and then subsetting them to the desired entry(ies). This compromise avoids writing too many small files, still allows fast access to single objects, and avoids any extra database or serialization overhead.
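In code, this batching compromise might look roughly like the following; the object array, batch size, and file names are illustrative assumptions:

    % Hypothetical batching sketch: ~100 objects per .mat file.
    batch = objects(1:100);              % slice of an array of custom objects
    save('batch_001.mat', 'batch');      % one medium-sized file per batch
    % Later, load only the batch containing the object you need:
    S = load('batch_001.mat', 'batch');
    obj = S.batch(37);                   % subset to the desired entry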

MATLAB: Differences between .mat versions

The official documentation compares the MAT-file versions in a table (not reproduced here). But I have noticed that there are other important differences besides those stated in that table.
For example, saving a cell array with about 6,000 elements that occupies 176 MB of memory in MATLAB gives me the following results depending on whether I use -v7 or -v7.3:
With -v7: File size = 15 MB, and save & load is fast.
With -v7.3: File size = 400 MB, and save & load is very slow (probably in part because of the large file size).
Has anybody else noticed these differences?
Update 1: As the replies point out, -v7.3 relies on HDF5, and according to MathWorks, "this format has a significant storage overhead", although it's not clear whether this overhead is really due to the format itself or to MATLAB's implementation and handling of HDF5.
Update 2: @Andrew Janke points us to this very helpful PDF (which apparently is not available in HTML format on the web). For more details, see the comments in the answer provided by @Amro.
This all takes me to the next question: Are there any alternatives that combine the best of both worlds (e.g. the efficiency of -v7 and the ability to deal with very large files of -v7.3)?
Version 7.3 of MAT-files uses the HDF5 format, which has a significant storage overhead to describe the contents of the file, especially for complex nested cell arrays and structures. Its main advantage over previous versions of MAT-files is that it allows storing data larger than 2 GB on 64-bit systems.
Note that both v7 and v7.3 are compressed and use Unicode encoding (unlike v6), yet they are two completely different formats...
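A quick, hedged way to see the difference for yourself; the example data is arbitrary, and the exact sizes will vary with content and MATLAB version:

    % Save the same cell array in both formats and compare sizes on disk.
    C = num2cell(rand(1000, 100));             % an arbitrary cell array
    save('c_v7.mat',  'C', '-v7');             % compressed, 2 GB limit
    save('c_v73.mat', 'C', '-v7.3');           % HDF5-based, no 2 GB limit
    d7 = dir('c_v7.mat'); d73 = dir('c_v73.mat');
    fprintf('v7: %d bytes, v7.3: %d bytes\n', d7.bytes, d73.bytes);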
References:
MAT-File Preferences
MAT-File Versions

How can I create a web page that shows aggregate data from Sawtooth surveys?

I'm guessing this won't apply to 99.99% of anyone who sees this. I've been doing some Sawtooth survey programming at work, and I need to create a web page that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about doing it. I only know very basic Perl, and the server I use does not have PHP, so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export to CSV format? There are many Perl parsers for CSV files; just about every language has a CSV parser or two (or twelve). MS Excel can open them directly, and they're still plain text, so you can look at them in any text editor.
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions, or just by taking substrings of each line one at a time, assuming you know the width of the fields (a sketch of the substring idea follows below). Your question is too general to give much better advice, sorry.
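A sketch of the substring idea, shown in MATLAB for concreteness (the same approach works with Perl's substr); the field layout here is invented:

    % Hypothetical fixed-width record: cols 1-4 id, 5-14 name, 15-17 age.
    rec  = '0012Smith     034';
    id   = str2double(rec(1:4));     % numeric id
    name = strtrim(rec(5:14));       % name, trailing spaces removed
    age  = str2double(rec(15:17));   % age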
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly, though, the easiest thing to do if you're posting statistical data on a website is to get the data into SPSS or SAS or another statistics package format and post those files for download directly. Then you don't have to worry about it.