Efficient disk access for a large number of small .mat files containing objects - MATLAB

I'm trying to determine the best way to store a large number of small .mat files: around 9000 objects with sizes ranging from 2 KB to 100 KB, for a total of around half a gigabyte.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after), as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine: I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders, but it seems like such a nasty hack (and it won't fix the SSD's dislike of writing many small files). Is there a better way?
Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
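For reference, the two approaches look roughly like this (variable names are illustrative):

    % Method 1: one .mat file per object.
    save('obj_0042.mat', 'obj_0042', '-v6');
    tmp = load('obj_0042.mat');                 % later: pull just this object
    obj = tmp.obj_0042;

    % Method 2: everything in one .mat file, then selective loading.
    save('all_objects.mat', '-v6');             % saves every variable in the workspace
    tmp = load('all_objects.mat', 'obj_0042', 'obj_0100');   % load only the named variables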

Five ideas to consider:
Try storing the data in an HDF5 file - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R). (A small sketch appears after this list.)
A variation on your method #2 is to store them in one or more files, but to turn off compression.
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been a while and I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
#SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square 1). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the whole set is less than 1 GB, it may be easier to just copy everything to a RAM disk and access it through that. Just remember to copy back from the RAM disk if anything is saved (or wrap save so that it writes objects to both places).
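A minimal sketch of idea #1, using MATLAB's high-level HDF5 functions (h5create/h5write/h5read); the file layout and names are just an assumption:

    % One group per object; write the numeric payload, read back only what you need.
    M   = rand(1000, 4);                         % stand-in for the double matrix
    ids = uint32(1:1000)';                       % stand-in for the identifier vector
    h5create('objects.h5', '/obj_0042/data', size(M));
    h5create('objects.h5', '/obj_0042/ids',  size(ids), 'Datatype', 'uint32');
    h5write('objects.h5', '/obj_0042/data', M);
    h5write('objects.h5', '/obj_0042/ids',  ids);

    % Later, pull a single object without touching the rest of the file.
    M   = h5read('objects.h5', '/obj_0042/data');
    ids = h5read('objects.h5', '/obj_0042/ids');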
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Serialization functions from MATLAB Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/

Try storing them as blobs in a database.
I would also try the multiple folders method - it might perform better than you think. It might also help with organizing the files, if that's something you need.

The solution I have come up with is to save object arrays of around 100 objects each. These files tend to be 5-6 MB, so loading is not prohibitive, and access is just a matter of loading the right array(s) and then subsetting to the desired entry(ies). This compromise avoids writing too many small files, still allows fast access to single objects, and avoids any extra database or serialization overhead.
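A rough sketch of the scheme (the allObjs array and the chunk naming are illustrative):

    % Save: group roughly 100 objects per file.
    chunkSize = 100;
    nChunks   = ceil(numel(allObjs) / chunkSize);
    for c = 1:nChunks
        idx  = (c-1)*chunkSize + 1 : min(c*chunkSize, numel(allObjs));
        objs = allObjs(idx);                              % object array of ~100 entries
        save(sprintf('chunk_%03d.mat', c), 'objs');
    end

    % Load: work out which chunk holds object k, load it, then subset.
    k     = 2345;
    c     = ceil(k / chunkSize);
    tmp   = load(sprintf('chunk_%03d.mat', c), 'objs');
    myObj = tmp.objs(k - (c-1)*chunkSize);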

Related

Memory issues with large amounts of data stored as nested cells in MATLAB

I have large amounts of data stored as nested cells in .mat files. My biggest problem right now is the load times for accessing these files, but I'm wondering if the underlying problem is that I came up with an inefficient way for storing the data and I should restructure it to be smaller.
The full file consists of a cell array:
Hemi{1,h} where there are 52 versions of h
.{n,p} where there are 85 versions of n and up to ~100 versions of p
.Variable where there are 10 variables, each with ~2500 values
This full file ate up all my memory, so I saved it in parts, i.e.:
Hemi1.mat=Hemi{1,1}
Hemi2.mat=Hemi{1,2}
etc.
The next step for this application is to load each file, determine which part of it is an appropriate solution (I need Hemi{1,h}.{n,p}.Var1, Hemi{1,h}.{n,p}.Var2, and Hemi{1,h}.{n,p}.Var3 for this, but I still need to keep track of the other Variables), save the solution, then close the file and move to the next one.
Is there a faster way to load these files?
Is the problem less my dataset and more how I've chosen to store it? Is there a better alternative?
That is quite a lot of data. I have a few suggestions you could look into. The first is to see if you can change the datatypes to something like categorical arrays; they are far more memory efficient. Also, if you are storing strings as your final data, that can be quite heavy storage-wise.
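For example, a quick way to see what repeated strings cost compared to a categorical array (requires a MATLAB release with categorical; the numbers will vary):

    % Compare memory for 100,000 repeated labels stored as a cellstr vs categorical.
    labels = repmat({'LeftHemisphere'; 'RightHemisphere'}, 50000, 1);
    asCell = labels;
    asCat  = categorical(labels);
    whos asCell asCat          % the categorical version is typically far smaller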
Second you could look into HDF5 file storage. I hear it is a nice way to store structured data.
Finally, you could try converting your {n,p} arrays into table structures. I am not sure if this is better for memory, but tables are nice to work with and it may help you out. (Depending on your version of MATLAB you may not have tables :P).
I hope this helps!
-Kyle

Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have a somewhat unusual problem that looks similar to the problem here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and a C++ program picks out specific packets from this to save in some format. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond resolution and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My main requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into matlab are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write matlab files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and there are many options that appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get it into the MATLAB HDF5 format, and if any of these would make it harder than others.
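To make the MATLAB side concrete, the kind of query I need ("pull metric X between time Y and Z") would look roughly like this against an HDF5 file (the file and dataset names are made up, and the sketch assumes a sorted one-dimensional timestamp dataset):

    % Read only the slice of /metricX whose timestamps fall within [t0, t1].
    t0 = 1.4e9;  t1 = 1.4e9 + 3600;                       % example bounds (seconds)
    ts = h5read('capture.h5', '/timestamps');             % for huge files you'd index this more coarsely
    i0 = find(ts >= t0, 1, 'first');
    i1 = find(ts <= t1, 1, 'last');
    x  = h5read('capture.h5', '/metricX', i0, i1 - i0 + 1);   % start index and count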
If anyone has experience doing something similar, it would be a big help. Thanks!
A KDB+ tickerplant is certainly capable of capturing data at that rate; however, there are lots of things you need to make sure of, whatever solution you pick:
Do the machine(s) that are capturing the data have enough cores? Best to taskset a tickerplant, for example, to a core that nothing else will contend with
Similarly with disk - SSD, be sure there is no contention on the bus
Separate the workload - you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware KDB+ could do the job. However, given that you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting it to disk on the fly.
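If a simple converter process does the writing, the HDF5 side is straightforward. A hedged sketch of appending to an extendable dataset, shown here with MATLAB's h5create/h5write (the same pattern exists in the HDF5 C/C++ API; sizes and names are assumptions):

    % Create an extendable 1-D dataset and append blocks as they arrive.
    h5create('capture.h5', '/metricX', [Inf 1], 'ChunkSize', [65536 1], 'Datatype', 'double');
    nWritten = 0;
    for b = 1:10                                           % stand-in for the capture loop
        block = rand(65536, 1);                            % stand-in for captured samples
        h5write('capture.h5', '/metricX', block, [nWritten+1 1], [numel(block) 1]);
        nWritten = nWritten + numel(block);
    end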

Core Data Efficiency

Is it overkill to store 4 types of attributes into a 32-bit int within Core Data?
Or should I simply create a separate attribute for each of them? (will use logic operators to set/get).
I plan to add a new entity to an existing object which will contain anywhere from 200-400 items with about 14 attributes (including an 'index' attribute for ordering purposes). Only one set will be operated on or viewed at any time.
I will need to maintain undo support (referencing How do I improve performance of Core Data object insert on iPhone?)
I may be able to whittle this down to about 8 attributes if I store several attributes in a single field. Barring the searchability issue, will I be saving appreciable space?
Also, is it unreasonable to store a set of 400 objects in a list that will grow at a rate of about 1-3 items per week?
I hear some people storing thousands of items in core data so perhaps I am being paranoid. I suppose in the long-run I will need to provide an export archive option, perhaps to iCloud.
So you're saying you have 400 objects with 14 attributes, which you could either store as separate integers, or combine into fewer 32-bit (4-byte) values?
400 x 14 x 4 bytes = 22,400 bytes
That's not very much space at all. If you cut that in half, you're saving 11K, which will probably be dwarfed by the size of the extra code you'll be generating to encode and decode these values.
But don't take my word for it: write a small test program that stuffs values into Core Data, then run it under the profiler to see what happens.
You can also read up on SQLite and how it stores values, since that's what Core Data is using under the hood.
My gut says you're being paranoid and your bit twiddling code won't help much. Worse, it will be a great way to introduce really subtle bugs when you forget that you are shifting with sign extension. (And I love writing bit-twiddling code!)

How would you minimize or compress Core Data sqlite file size?

I have a 215MB csv file which I have parsed and stored in core data wrapped in my own custom objects. The problem is my core data sqlite file is around 260MB. The csv file contains about 4.5million lines of data on my city's transit system (bus stop, times, routes etc).
I have tried modifying attributes so that arrays of strings representing stop times are stored instead as NSData files but for some reason the file size still remains at around 260MB.
I can't ship an app this size. I doubt anyone would want to download a 260MB app even if it means they have the whole city's transit schedule on it.
Are there any ways to compress or minimize the storage space used (even if it means not using core data, I am willing to hear suggestions)?
EDIT: I just want to provide an update right now because I have been staring at the file size in disbelief. With some clever manipulation involving strings, indexing and database normalization in general, I have managed to reduce the size down to 6.5MB or 2.6MB when compressed. About 105,000 objects stored in Core Data containing the full details of the city's transit system. I'm almost in tears right now D':
Unless your original CSV is encoded in a really foolish manner, it seems unlikely that the size is going to get below 100 MB, no matter how much you compress it. That's still really large for an app. The solution is to move your data to a web service. You may want to download and cache significant parts, but if you're talking about millions of records, then fetching from a server seems best. Besides, I have to believe that from time to time the transit system changes, and it would be frustrating to have to upgrade a many-tens-of-MB app every time there was a single stop adjustment.
I've said that, but actually there are some things you may consider:
Move booleans into bit fields. You can put 64 booleans into an NSUInteger. (And don't use a full 64-bit integer if you just need 8 bits. Store the smallest thing you can.)
Compress how you store times. There are only 1440 minutes in a day. You can store that in 2 bytes. Transit times are generally not to the second; they don't need a CGFloat.
Days of the week and dates can similarly be compressed.
Obviously you should normalize any strings. Look at the CSV for duplicated string values on many lines.
I generally would recommend raw sqlite rather than core data for this kind of problem. Core Data is more about object persistence than raw data storage. The fact that you're seeing a 20% bloat over CSV (which is not itself highly efficient) is not a good direction for this problem.
If you want to get even tighter, and don't need very good searching capabilities, you can create packed data blobs. I used to do this on phone switches where memory was extremely tight. You create a bit field struct and allocate 5 bits for one variable, and 7 bits for another, etc. With that, and some time shuffling things so they line up correctly on word boundaries, you can get pretty tight.
Since you care most about your initial download size, and may be willing to expand your data later for faster access, you can consider very domain-specific compression. For example, in the above discussion, I mentioned how to get down to 2 bytes for a time. You could probably get down to 1 byte in many cases by storing times as delta minutes since the last time (since most of your times are going to be always increasing by fairly small steps if they're bus and train schedules). Abandoning the database, you could create a very tightly encoded data file that you could extract into a database on first launch.
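To illustrate the arithmetic (the idea is language-agnostic; the sketch below happens to be MATLAB, matching the earlier threads, and the values are made up):

    % Pack a run of increasing departure times (minutes past midnight) as one
    % uint16 start value plus uint8 deltas, then unpack and verify.
    times  = [372 379 391 404 412 430];                    % 6:12, 6:19, 6:31, ...
    packed = [typecast(uint16(times(1)), 'uint8'), uint8(diff(times))];
    start  = double(typecast(packed(1:2), 'uint16'));
    recon  = cumsum([start, double(packed(3:end))]);
    isequal(recon, times)                                  % true: 7 bytes instead of 12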
You also can use domain-specific knowledge to encode your strings into smaller tokens. If I were encoding the NY subway system, I would notice that some strings show up a lot, like "Avenue", "Road", "Street", "East", etc. I'd probably encode those as unprintable ASCII like ^A, ^R, ^S, ^E, etc. I'd probably encode "138 Street" as two bytes (0x8A13). This of course is based on my knowledge that è (0x8a) never shows up in the NY subway stops. It's not a general solution (in Paris it might be a problem), but it can be used to highly compress data that you have special knowledge of. In a city like Washington DC, I believe their highest numbered street is 38th St, and then there's a 4-value direction. So you can encode that in two bytes, first a "numbered street" token, and then a bit field with 2 bits for the quadrant and 6 bits for the street number. This kind of thinking can potentially significantly shrink your data size.
You might be able to perform some database normalization.
Look for anything that might be redundant or the same values being stored in multiple rows. You will probably need to restructure your database so these duplicate values (if any) are stored in separate tables and then referenced from their original row by means of id's.
How big is the sqlite file compressed? If it's satisfactorily small, the simplest thing would be to ship it compressed, then uncompress it to NSCachesDirectory.

Reason for monolithic data files

Primarily this seems to be a technique used by games, where they have all the sounds in one file, textures in another etc. With these files commonly reaching the GB size.
What is the reason for doing this rather than keeping everything as small files in subdirectories - one per texture, say? Many small games do the latter, while the monolithic system seems to be favoured by larger companies.
Is there some file system overhead with lots of small files?
Are they trying to protect their property - although most just seem to be a compressed file with a new extension?
The reasons we use an "archive" system like this where I work (a game development company):
lookup speed: We rarely need to iterate over files in a directory; we're far more often looking them up directly by name. By using a custom "file allocation table" that is essentially just a sequence of hash( normalized_filename ) -> [ offset, size ], we can look up files very quickly. We can also keep this index in RAM, potentially interleave it with other index tables, etc.
(When we do need to iterate, we can either easily iterate over all files in a .arc, or we can store a list of filenames, a list of hash-of-filenames, or just a list of [ offset, size ] pairs somewhere -- maybe even as a file in the archive. This is usually faster than directory-traversal on a FS.)
metadata: It's easy for us to tuck in any file metadata we want. For example, a single bit in the "size" field indicates whether the file is compressed or not (if it is, it has a header with more details about how to decompress it). We can even vary compression on pieces of a file if we know enough about the structure of the file ahead of time (we do this for sprite archives).
size: One of the devices we use has a "file size must be a multiple of X" requirement, where X is large compared to some of our files. For example, some of our lua scripts end up being just a few hundred bytes when compiled; taking extra overhead per .luc file adds up quickly.
alignment: on the other hand, sometimes we want to waste space. To take advantage of faster streaming (e.g. background DMA) from the filesystem, some of our files do want to obey certain alignment/size requirements. We can take care of that right in the tool, and the align/size we're shooting for doesn't necessarily have to line up with the underlying FS, allowing us to waste space only where we need it.
But those are the mundane reasons. The more fun stuff:
Each .arc registers itself in a list, and any attempt to open a file knows to look in the arcs. We search already-in-RAM archives first, then archives on the device FS, then the actual device FS. This gives us a ton of flexibility:
dynamic additions to the filesystem: at any time we can stream a new file or archive to the machine in question (over the network or the like) and have it appear as part of the "logical" filesystem; this is great when the actual FS resides in ROM or on a CD, and allows us to iterate much more quickly than we could otherwise.
(Doom's .wad system is a sort of example of the above, which allows modders to more easily override assets and scripts built into the game.)
possibility of no underlying fs: It's possible to use bin2obj to embed an entire arc directly in the executable (.rodata) at link time, at which point you don't ever need to look at the device FS -- we do this for certain small demo builds and the like. We can also send levels across the network or savegame-sneakernet this way. =)
organization and load/unload: since we can load and unload and override virtual "pieces" of our filesystem at any time, we can do some performance tricks with having the number of files in the FS be very small at any given time. We can additionally specify that an entire archive be loaded into memory, index table and data; our file load code is smart enough to know that if the file is already in memory, it doesn't need to do anything to read it other than move a pointer around. Some of the higher level code can actually detect that the file is in ram and just ask for the probably-already-looks-like-a-struct pointer directly.
portability: we only need to figure out how to get a few files on each new device we use, and then the remainder of the FS code is more or less the same. =) We do change the tool output a bit occasionally (for alignment reasons), but most of the processing remains the same.
de-duplication: with smarter archives, such as our sprite archives, we can (and do) de-duplicate data. If "jump" animation's fifth frame and "kick"'s third frame are the same, we can pull apart the file and only store one copy of that frame. We can do the same for whole files.
We ported a PC game to a system with much slower FS access recently. We didn't change the data format, and it turns out iterating through a dir on the raw device FS to load a hundred small XML files was absolutely killing our load times. The solution we used was to take each dir, make it into its own subdir.arc, and stick it in the master game.arc compressed. When the dir was needed (something like opendir was called) we decompressed the entire subdir.arc into RAM, added it to the filesystem, then iterated through it super-quickly.
It's the ability to throw something like this together in a few hours, and to ease the pain of porting across systems, that makes stuff like this worthwhile.
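To make the index-plus-blob idea concrete outside a game engine, here is a toy version (sketched in MATLAB to match the earlier threads; the file names are illustrative and assumed to exist): concatenate small files into one archive, record name/offset/size, then pull a single entry back with one seek and one read.

    % Pack: concatenate small files into one archive and record name/offset/size.
    names = {'a.xml', 'b.xml', 'c.xml'};
    index = struct('name', names, 'offset', [], 'size', []);
    fid   = fopen('assets.arc', 'w');
    for k = 1:numel(names)
        bytes           = uint8(fileread(names{k}));
        index(k).offset = ftell(fid);
        index(k).size   = numel(bytes);
        fwrite(fid, bytes);
    end
    fclose(fid);
    save('assets_index.mat', 'index');

    % Unpack one entry: a single seek and read, no directory traversal involved.
    k    = find(strcmp({index.name}, 'b.xml'), 1);
    fid  = fopen('assets.arc', 'r');
    fseek(fid, index(k).offset, 'bof');
    data = char(fread(fid, index(k).size, '*uint8'))';
    fclose(fid);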
File systems do have an overhead. Usually, a file takes disk space rounded up to some power of 2 (e.g. up to 4 KB), so many small files would waste space. Some modern file systems try to mitigate that, but AFAIK it's not widespread yet. Additionally, file systems are often quite slow when accessing multiple files. E.g. it is usually considerably faster to copy one 400 MB file than 4000 100 KB files.
File systems come in handy when you have to modify files, because they handle changing file sizes much better than any simple home-grown solution. However, that's certainly not the case for constant game data.
On Apple systems, the most common way is to use directories, as you suggest. They are called bundles and are represented in the Finder as just one file, but if you explore them further they're actually directories. This makes it very easy to write code and conserve memory when loading individual items out of a bundle. :-) It also makes taking incremental backups of gigantic databases easy; for instance, your iPhoto database is just a bundle, so you just back up the changed and new files.
On Windows, however, I believe this is much harder to do, it will look like a directory "no matter what" (I'm sure smart people have found a solution that will make Explorer see certain directories as a single file, but it's not common).
From a game developer's point of view, you're not dealing with files so small that disk space overhead is something you're very concerned with, so I doubt #doublep's suggestion is the real reason, since an archive makes for such a hassle. A single file does, however, make things much easier if users are to copy an entire game over somewhere: then it's easy to check that the entire set is correct.
And, of course, it's harder to read for people who shouldn't have access to it. But it's also harder to modify, which means harder to patch and harder to write extensions for. A game whose community uses extensions a lot prefers the directory structure: The Sims, for example.
Were I the games developer, I'd love to go for individual files. Then again, I'd be using bundles as I'd be writing for the Mac ;-)
Cheers
Nik
I can think of multiple reasons.
As doublep suggested, files occupy more space on the disc than they require, so an archive saves space. With a typical 4 KB allocation unit, each file wastes about 2 KB on average, so 10k files (of any size) should save you roughly 20 MB when packed into an archive. Not exactly a large amount of space nowadays, but still.
The other reason I can think of is disc fragmentation. I suspect a heavily fragmented disc will perform worse when accessing thousands of separate files on a fragmented space. But I'm no expert in this field, so I'd appreciate if someone more experienced verified this.
Finally, I think this may also have something to do with restricting access to separate game files. You can have a bunch of Lua scripts exposed, mess with them and break something. Or you could have the outro cinematic/sound/text/whatever exposed and get spoiled by accessing it. I do that myself as well: I encrypt images with a multipass XOR key, pack text files and config variables into a monolithic file (zipped for extra security) and only leave music freely accessible. This way, the game's secrets will remain undiscovered for a bit longer :).
Or there may be another reason I never thought about :D.
As you know, games, especially from larger companies, try to squeeze out as much performance as they can. One technique is to have all the data in one large file and just DMA it to memory (think of it as a memcpy from CD to RAM). Since all the files are packed into one large one, there will be no disk seeks, and you can have a large number of files (which would otherwise cause a large number of seeks) all loaded quickly thanks to this technique.