Data management in MATLAB versus other common analysis packages

Background:
I am analyzing large amounts of data using an object-oriented composition structure, for sanity and easy analysis. Often the highest level of my OO hierarchy is an object that is about 2 GB when saved. Loading the data into memory is not always an issue, and populating sub-objects and then the higher-level objects from their contents is much more memory efficient than just loading in a lot of .mat files directly.
The Problem:
Saving these objects that are > 2 GB will often fail. It is a somewhat well-known problem, which I have gotten around by deleting sub-objects until the total size is below 2-3 GB. This happens regardless of how powerful the computer is; a machine with 16 GB of RAM, 8 cores, etc., will still fail to save the objects correctly. Saving to an older MAT-file version does not help either.
Questions:
Is this a problem that others have solved somehow in MATLAB? Is there an alternative I should look into that still offers plenty of high-level analysis tools and will NOT have this problem?
Questions welcome, thanks.

I am not sure this will help, but here: do you make sure to use a recent MAT-file version? Check, for instance, the documentation for save. Quoting from that page:
'-v7.3' 7.3 (R2006b) or later Version 7.0 features plus support for data items greater than or equal to 2 GB on 64-bit systems.
'-v7' 7.0 (R14) or later Version 6 features plus data compression and Unicode character encoding. Unicode encoding enables file sharing between systems that use different default character encoding schemes.
Also, could your object by any chance be, or contain, a graphics handle object? In that case, it is wise to use hgsave.
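For what it's worth, here is a minimal sketch of saving with -v7.3, assuming the large top-level object already sits in a workspace variable (the name results is just a placeholder):

    % 'results' is a placeholder for the large top-level object already in memory.
    save('results.mat', 'results', '-v7.3');   % HDF5-based; supports variables >= 2 GB on 64-bit systems

    % A -v7.3 file can later be opened for partial access, without loading everything:
    m = matfile('results.mat');
    whos(m)                        % list what is stored in the file
    results = m.results;           % pull the variable back in only when needed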


Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have somewhat of a unique problem that looks similar to the problem here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and picking out specific packets from this to save into some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond resolution and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My main requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes.
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and there are many options that appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get it into the MATLAB HDF5 format, and if any of these would make it harder than others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A KDB+ tickerplant is certainly capable of capturing data at that rate; however, there are a lot of things you need to make sure of (whatever solution you pick):
Do the machine(s) capturing the data have enough cores? It is best to taskset a tickerplant, for example, to a core that nothing else will contend with.
Similarly with disk: use an SSD and be sure there is no contention on the bus.
Separate the workload: you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware KDB+ could do the job. However, given that you want HDF5, it is probably even better to have a simple process capturing the data and writing/converting to disk on the fly.
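To illustrate the MATLAB side of the HDF5 requirement, here is a rough sketch of a "pull metric X between time Y and Z" query, assuming a hypothetical layout of one file per day with 1-D datasets /metricX/time (sorted epoch seconds) and /metricX/value:

    % Hypothetical file and dataset names; adjust to whatever the capture process writes.
    fname  = 'capture_day1.h5';
    tStart = 1412164800;                             % example query window (epoch seconds)
    tEnd   = 1412168400;

    t   = h5read(fname, '/metricX/time');            % timestamps are smaller, read them all
    idx = find(t >= tStart & t <= tEnd);             % sorted timestamps => contiguous indices

    % Read only the matching slice of the values, not the whole dataset
    % (assumes the query window is non-empty).
    vals = h5read(fname, '/metricX/value', idx(1), numel(idx));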

Core Data Efficiency

Is it overkill to store 4 types of attributes into a 32-bit int within Core Data?
Or should I simply create a separate attribute for each of them? (I would use bitwise operators to set/get the packed values.)
I plan to add a new entity to an existing object which will contain anywhere from 200-400 items with about 14 attributes (including an 'index' attribute for ordering purposes). Only one set will be operated on or viewed at any time.
I will need to maintain undo support (referencing How do I improve performance of Core Data object insert on iPhone?)
I may be able to whittle this down to about 8 attributes if I store several attributes in a single field. Barring the searchability issue, will I be saving appreciable space?
Also, is it unreasonable to store a set of 400 objects in a list of items that will grow at a rate of about 1-3 per week?
I hear of people storing thousands of items in Core Data, so perhaps I am being paranoid. I suppose in the long run I will need to provide an export archive option, perhaps to iCloud.
So you're saying you have 400 objects with 14 attributes, which you could either store as separate integers, or combine into fewer 32-bit (4-byte) values?
400 x 14 x 4 bytes = 22,400 bytes
That's not very much space at all. If you cut that in half, you're saving 11K, which will probably be dwarfed by the size of the extra code you'll be generating to encode and decode these values.
But don't take my word for it: write a small test program that stuffs values into Core Data, then run it under the profiler to see what happens.
You can also read up on SQLite and how it stores values, since that's what Core Data is using under the hood.
My gut says you're being paranoid and your bit twiddling code won't help much. Worse, it will be a great way to introduce really subtle bugs when you forget that you are shifting with sign extension. (And I love writing bit-twiddling code!)
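For illustration only (written in MATLAB, since that is the working language of most of this document), packing and unpacking four hypothetical 8-bit attributes into one 32-bit value looks roughly like this; the same idea applies with C-style bitwise operators in Objective-C:

    % Pack four small attribute values into one 32-bit word with hypothetical 8-bit fields.
    a = uint32(17);  b = uint32(3);  c = uint32(200);  d = uint32(45);   % each must fit in 8 bits

    packed = bitor(bitshift(a, 24), ...
             bitor(bitshift(b, 16), ...
             bitor(bitshift(c, 8), d)));

    % Unpack: shift right and mask. Using an unsigned type keeps the right shifts
    % logical, which sidesteps the sign-extension surprises warned about above.
    a2 = bitshift(packed, -24);
    b2 = bitand(bitshift(packed, -16), uint32(255));
    c2 = bitand(bitshift(packed, -8),  uint32(255));
    d2 = bitand(packed, uint32(255));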

Efficient disk access of large number of small .mat files containing objects

I'm trying to determine the best way to store a large number of small .mat files: around 9,000 objects with sizes ranging from 2 KB to 100 KB, for a total of around half a gigabyte.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after), as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine: I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders, but it seems like such a nasty hack (and it won't fix the SSD's dislike of writing many small files). Is there a better way?
Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Five ideas to consider:
Try storing in an HDF5 file - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R). A rough sketch of this idea appears after this list.
A variation on your method #2 is to store them in one or more files, but to turn off compression.
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been a while and I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
@SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square one). For this purpose, consider SQLite, which doesn't require a separate server/client framework. There is an interface available on MATLAB Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the entire collection is less than 1 GB, it may be easier to just copy it all to a RAM disk and access it through that. Just remember to copy back from the RAM disk if anything is saved (or wrap save to save objects in two places).
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Two serialization programs from MATLAB Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by: http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/
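Here is the sketch of idea 1 promised above. It assumes each object exposes its double matrix and uint32 vector via hypothetical properties data and ids, and that the objects are held in an array objs; everything goes into a single HDF5 file with one dataset pair per object:

    fname = 'objects.h5';
    for k = 1:numel(objs)                                  % objs: the array of ~9000 objects
        grp = sprintf('/obj%05d', k);                      % one name prefix per object
        h5create(fname, [grp '/data'], size(objs(k).data));
        h5write(fname,  [grp '/data'], objs(k).data);
        h5create(fname, [grp '/ids'],  size(objs(k).ids), 'Datatype', 'uint32');
        h5write(fname,  [grp '/ids'],  objs(k).ids);
    end

    % Later, pull back just the pieces you need:
    d = h5read(fname, '/obj00042/data');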
Try storing them as blobs in a database.
I would also try the multiple-folders method; it might perform better than you think. It might also help with organizing the files, if that's something you need.
The solution I have come up with is to save object arrays of around 100 objects each. These files tend to be 5-6 MB, so loading is not prohibitive, and access is just a matter of loading the right array(s) and then subsetting them to the desired entries. This compromise avoids writing too many small files, still allows fast access to single objects, and avoids any extra database or serialization overhead.
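A rough sketch of that compromise, assuming the objects start out in a single array objs and the chunk files are named chunk_001.mat, chunk_002.mat, and so on (names are illustrative):

    chunkSize = 100;
    nChunks = ceil(numel(objs) / chunkSize);
    for k = 1:nChunks
        idx = (k-1)*chunkSize + 1 : min(k*chunkSize, numel(objs));
        chunk = objs(idx);
        save(sprintf('chunk_%03d.mat', k), 'chunk');
    end

    % To fetch object number n later, load only the chunk that contains it:
    n = 1234;
    k = ceil(n / chunkSize);
    s = load(sprintf('chunk_%03d.mat', k), 'chunk');
    obj = s.chunk(n - (k-1)*chunkSize);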

MATLAB: Differences between .mat versions

The official documentation summarizes the differences between MAT-file versions in a table (see the MAT-File Versions reference below). But I have noticed that there are other important differences besides those stated in that table.
For example, saving a cell array with about 6,000 elements that occupies 176 MB of memory in MATLAB gives me the following results depending on whether I use -v7 or -v7.3:
With -v7: File size = 15 MB, and save & load is fast.
With -v7.3: File size = 400 MB, and save & load is very slow (probably in part because of the large file size).
Has anybody else noticed these differences?
Update 1: As the replies point out, -v7.3 relies on HDF5, and according to MathWorks, "this format has a significant storage overhead", although it's not clear if this overhead is really due to the format itself or to MATLAB's implementation and handling of HDF5.
Update 2: @Andrew Janke points us to this very helpful PDF (which apparently is not available in HTML format on the web). For more details, see the comments in the answer provided by @Amro.
This all takes me to the next question: Are there any alternatives that combine the best of both worlds (e.g., the efficiency of -v7 and the ability of -v7.3 to deal with very large files)?
Version 7.3 of MAT-files uses the HDF5 format. This format has a significant storage overhead to describe the contents of the file, especially for complex nested cell arrays and structures. Its main advantage over previous versions of MAT-files is that it allows storing data larger than 2 GB on 64-bit systems.
Note that both v7 and v7.3 are compressed and use Unicode encoding (unlike v6), yet they are two completely different formats...
References:
MAT-File Preferences
MAT-File Versions
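If you want to reproduce the comparison on your own data, a quick benchmark along these lines will show the time and size difference (the cell array here is only a stand-in; note that random numbers compress poorly, so absolute sizes will not match the figures in the question):

    c = num2cell(rand(6000, 4000), 2);         % ~6,000 cells, roughly 190 MB in memory

    tic; save('test_v7.mat',  'c', '-v7');   tV7  = toc;
    tic; save('test_v73.mat', 'c', '-v7.3'); tV73 = toc;

    d7  = dir('test_v7.mat');
    d73 = dir('test_v73.mat');
    fprintf('-v7:   %6.1f s, %8.1f MB\n', tV7,  d7.bytes  / 1e6);
    fprintf('-v7.3: %6.1f s, %8.1f MB\n', tV73, d73.bytes / 1e6);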

How to efficiently process 300+ files concurrently in Scala

I'm going to work on comparing around 300 binary files using Scala, byte-by-byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing the differences but processing those files in the same sequential order. Let's say I have to look at the i-th byte in every file at the same time, and then move on to the (i+1)-th byte.
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step number 1, you use some of the time spent reading these files doing useful CPU work. You may experiment with lowering the bytes read on step 1 as well.
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see if they are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match.
Here is some Java source to make that happen.
Many large systems that handle data use SHA-1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte comparison. The hashes can also be stored for later to see if the data has been altered.
Here is a talk by Linus Torvalds, specifically about Git; it also mentions why he uses SHA-1.
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.