What type of entropy encoder does the MATLAB save() function use? I.e. how does that function work?

I am working on a compression project, and I used the default save() function in MATLAB for the lossless (entropy) encoding stage. The transform module is all figured out.
I used save() to encode a 3-D array that contains a lot of zeros. I am sure MATLAB applies some kind of lossless compression inside save(), because that array ends up taking far less disk space than an array of the same size containing no zeros at all. I have had no success finding out what entropy coding scheme is behind the function. Because it is a core part of the algorithm, I think I must at least know what it is doing.
Also, if you know of another entropy encoder that would do a better job of compressing a 3-D array containing many zeros, I would really appreciate you sharing it. Or, if you think I could easily write that code myself, please let me know.

The v7 format uses deflate.
The v7.3 format uses the HDF5 format, which supports gzip (deflate) and szip compression. It also has an option to not compress.
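For illustration, choosing the format (and, on recent MATLAB releases, disabling compression for v7.3) looks roughly like this; a minimal sketch assuming a variable A is already in the workspace:

% Default v7 format: arrays are deflate-compressed inside the MAT file
save('data_v7.mat', 'A', '-v7');
% HDF5-based v7.3 format: gzip (deflate) compression by default
save('data_v73.mat', 'A', '-v7.3');
% v7.3 can also skip compression entirely (flag available in newer releases)
save('data_v73_raw.mat', 'A', '-v7.3', '-nocompression');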

The MATLAB save function supports compression for some of the formats that are available. Specifically, -v7 (default format) and -v7.3 support compression. The details of the compression are not documented.
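You can confirm that compression is happening with a quick experiment along these lines (file and variable names are just illustrative):

Z = zeros(100, 100, 100);   % highly redundant data -> compresses very well
R = rand(100, 100, 100);    % high-entropy doubles -> barely compresses
save('zeros.mat', 'Z', '-v7');
save('rand.mat',  'R', '-v7');
dz = dir('zeros.mat');  dr = dir('rand.mat');
fprintf('zeros.mat: %d bytes, rand.mat: %d bytes\n', dz.bytes, dr.bytes);

Both arrays occupy the same 8 MB in memory, but the file of zeros will be a tiny fraction of the size of the random one.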

Related

Matlab - JPEG Compression. Huffman Encoding

I've been trying to implement a JPEG compression algorithm in MATLAB.
The only part I'm having trouble implementing is the Huffman encoding. I understand the DCT, the quantization, and the zig-zag scan of each 8x8 matrix, and I also understand how Huffman encoding works in general.
What I do not understand is: after I have an output bitstream and a dictionary that maps bit sequences back to their original symbols, what do I do with that output? How can I tell a computer to translate the output bitstream using the dictionary I created for it?
In addition, each 8x8 matrix will have its own output and dictionary. How can all these outputs be combined into one? Because at the end of the day, the result is supposed to be an image.
I might have misunderstood some of the steps, in which case my apologies for any confusion caused by this.
Any help would be extremely appreciated!
EDIT: I'm sorry, my question apparently hasn't been clear enough. Say I use MATLAB's built-in Huffman functions (huffmanenco and huffmandict): what am I supposed to do with the value huffmanenco returns?
What to do with the output string of bits hasn't been clear to me as far as Huffman encoding goes in other IDEs and programming languages as well.
You have two choices with the Huffman coding:
Use a pre-canned Huffman table.
Make two passes over the data, where the first pass generates the Huffman tables and the second pass encodes.
You cannot have a different dictionary for each MCU.
You say you have the run-length-encoded values. You Huffman encode those and write them to the output stream.
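As for what to do with the value huffmanenco returns: it is just a vector of payload bits that you must pack into bytes and write out yourself. A minimal sketch using MATLAB's huffmandict/huffmanenco (Communications Toolbox), with made-up symbols and probabilities:

symbols = 0:4;                          % made-up symbol alphabet
prob    = [0.4 0.25 0.15 0.12 0.08];    % made-up symbol probabilities
dict    = huffmandict(symbols, prob);   % build the code table
sig     = [0 0 1 3 0 2 4 0];            % pretend run-length output
bits    = huffmanenco(sig, dict);       % vector of 0s and 1s

% Pack the bit vector into bytes (MSB first) before writing it out
npad  = mod(8 - mod(numel(bits), 8), 8);
bits  = [bits(:); zeros(npad, 1)];      % pad to a whole number of bytes
bytes = uint8(2.^(7:-1:0) * reshape(bits, 8, []));

% The same dictionary decodes the bit vector back to the symbols
decoded = huffmandeco(huffmanenco(sig, dict), dict);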
EDIT:
You need to be sure that the MATLAB Huffman encoder is JPEG-compatible; there are different ways to do Huffman encoding.
You need to write the bits from the encoder to the JPEG stream. This means you need a bit-level I/O routine. Plus, you need to convert FF bytes in the compressed data into FF 00 in the JPEG stream, as sketched below.
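The byte-stuffing step is mechanical; a rough sketch, assuming coded is a uint8 vector of already-packed entropy-coded bytes:

% Insert a 0x00 after every 0xFF so the decoder does not mistake
% entropy-coded data bytes for JPEG markers.
stuffed = zeros(1, 0, 'uint8');
for b = coded(:)'
    stuffed(end+1) = b;            %#ok<AGROW>
    if b == 255
        stuffed(end+1) = 0;        % stuff a zero byte after 0xFF
    end
end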
I suggest getting a copy of
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=pd_sim_14_1?ie=UTF8&dpID=41XJBED6RCL&dpSrc=sims&preST=_AC_UL160_SR127%2C160_&refRID=1DYN5VCQQP0N88E64P5Q
to show how the encoding is done.

Read and represent mp3 files using memmapfile in matlab

I have to analyze bioacoustic audio files using MATLAB. Eventually I want to be able to find anomalies in the audio. That's why I need a way to represent the audio from which I can extract and compare features. I'm dealing with mp3 files of up to 150 MB. These files are too large for MATLAB to read into its memory, therefore I want to use the memmapfile() function. I used the following code and a small mp3 file to find out how it actually works.
[testR, ~] = audioread('test.mp3');
testM = memmapfile('test.mp3');
disp(testM.Data);
disp(testR);
The actual values of testM.Data and testR are different: audioread() returns a 7483391 x 2 matrix and memmapfile() a 4113874 x 1 matrix.
I'm not really sure how memmapfile() works; I expected the two to be equal. Is there a way to read mp3 files in the same format audioread() does using memmapfile()? And what does memmapfile actually return in the case of an audio file? Could that vector form itself be usable for anomaly detection?
Thanks in advance!
NOTE: The original files were in WAV IMA ADPCM format, with sizes from 1.5 up to 2.5 GB. Since MATLAB can't deal with that format and the size of those files, I converted them to 8-bit mp3 files.
I think the problem is that memmapfile by default reads the data as raw uint8 bytes, while audioread decodes the file. memmapfile simply maps the bytes of the file on disk, so for an mp3 you are looking at the compressed bitstream, not audio samples; that is why the two results will never match.
As you can see here, you can specify the format of the data when you read it with memmapfile, so try to "play" with different values. From the documentation I read that you can read data in double format, so try modifying the memmapfile data format and the audioread data format.
One last thing: memmapfile always organizes the data as an "N x 1" matrix, so if you want the original shape you need to use something like reshape.
Anyway, if you work with big data I suggest trying something other than memmapfile, because it is very, very slow.
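If the underlying goal is just to process a large file without holding it all in memory, note that audioread can read a sample range directly, which sidesteps memmapfile entirely. A sketch (the chunk size is an arbitrary choice):

info  = audioinfo('test.mp3');          % duration and sample count
chunk = 1e6;                            % samples per block, arbitrary
for s = 1:chunk:info.TotalSamples
    e = min(s + chunk - 1, info.TotalSamples);
    block = audioread('test.mp3', [s e]);   % decoded PCM, one block at a time
    % ... extract and compare features on 'block' here ...
end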

Saving an array in matlab efficiently

I want to save an array efficiently in MATLAB. I have an array of size 3000 by 9000. If I save this array in a mat file, it consumes around 214 MB using just the save function. If I use fwrite with the float data type, it comes to around 112 MB. Is there any other way to further reduce the disk space consumed when I save this array in MATLAB?
I would suggest writing the data in binary mode and then compressing it with an algorithm such as bzip2.
There are a few ways to reduce the required memory:
1. Reducing precision
Rather than the double you normally have, consider using a single, or perhaps even a uint8 or logical. Writing with fwrite will also do this, but you may want to compress further afterwards, as it does not create a compressed file.
2. Utilizing a pattern
If your matrix has a certain pattern, this can sometimes be leveraged to store the data more efficiently, or at least the information needed to recreate the data. The most common example is a matrix that can be stored as a few vectors, for example when it is sparse. A sketch of both ideas follows below.
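A minimal sketch of both points, assuming the array is called A (sparse applies here because the 3000 x 9000 array is 2-D; MATLAB sparse matrices must be 2-D):

A32 = single(A);                       % half the size of double per element
save('data_single.mat', 'A32', '-v7'); % v7 adds deflate compression on top

% If most entries are zero, a sparse representation can be far smaller
S = sparse(double(A));                 % sparse requires double (or logical)
save('data_sparse.mat', 'S', '-v7');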

Alternatives to Matlab's Mat File Format

I'm finding that writing and reading the native mat file format becomes very, very slow with larger data structures, around 1 GB in size. In addition, we have other, non-MATLAB software that should be able to read and write these files. So I would like to find an alternative format for serializing MATLAB data structures. Ideally this format would ...
be able to represent an arbitrary matlab structure to a file.
have faster I/O than mat files.
have I/O libraries for other languages like Java, Python and C++.
Simplifying your data structures and using the new v7.3 MAT file format, which is a variant of HDF5, might actually be the best approach. The HDF5 format is open and already has I/O libraries for your other languages. And depending on your data structure, they may be faster than the old binary mat files.
Simplify the data structures you're saving, preferring large arrays of primitives to complex container structures.
Try turning off compression if your data structures are still complex.
Try the v7.3 MAT file format using "-v7.3"
If using a network file system, consider saving and loading to a temporary dir on a fast local drive and copying to/from the network
For large data structures, your MAT file I/O speed may be determined more by the internal structure of the data you're writing out than the size of the resulting MAT file itself. (In my experience, this has usually been the major factor in slow MAT files.) When you say "arbitrary Matlab structure", that suggests you might be using cells, structs, or objects to make complex data structures. That slows down MAT I/O because there is per-array overhead in MAT file I/O, and the members of cell and struct arrays (container types) all count as separate arrays. For example, 5,000 strings stored in a cellstr are much, much slower than the same 5,000 strings stored in a 2-D char array. And objects have even more overhead. As a test, try writing out a 1 GB file that contains just a 1 GB primitive array of random uint8s, and see how long that takes. From there, see if you can simplify your data to reduce the total mxarray count, even if that means reshaping it for serialization. (My experience with this is mostly with the v7 format; the newer HDF5 format may have less per element overhead.)
If your data files live on the network, you could also try doing the save and load operations on temporary files on fast local drives, and separately using copy operations to move them back and forth between the network. At least on Windows networks, I've seen speedups of up to 2x from doing this. Possibly due to optimizations the full-file copy operation can do that the MAT I/O code can't.
It would probably be a substantial effort to come up with an alternate file format that supported fully arbitrary Matlab data structures and was portable to other languages. I'd try making smaller changes around your use of the existing format first.
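As a rough way to run the baseline test mentioned above, something like this contrasts one big primitive array with the same bytes split across many cells (illustrative sizes; needs a few GB of free memory):

raw = randi(255, 1, 2^30, 'uint8');                % ~1 GB of bytes
tic; save('flat.mat', 'raw', '-v7.3'); tFlat = toc;

parts = mat2cell(raw, 1, repmat(2^20, 1, 2^10));   % same bytes, 1024 mxarrays
tic; save('cells.mat', 'parts', '-v7.3'); tCell = toc;
fprintf('flat: %.1f s   cells: %.1f s\n', tFlat, tCell);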
The mat format has changed across MATLAB versions. v7.3 uses the HDF5 format, which has built-in compression and other features, and it can take a long time to read/write. However, you can force MATLAB to use earlier formats, which are faster (but might take more space).
See here:
http://www.mathworks.com/help/matlab/import_export/mat-file-versions.html

Efficient way to fingerprint an image (jpg, png, etc)?

Is there an efficient way to get a fingerprint of an image for duplicate detection?
That is, given an image file, say a jpg or png, I'd like to be able to quickly calculate a value that identifies the image content and is fairly resilient to other aspects of the image (e.g. the image metadata) changing. If it deals with resizing, that's even better.
[Update] Regarding the metadata in jpg files, does anyone know if it's stored in a specific part of the file? I'm looking for an easy way to ignore it - e.g. can I skip the first x bytes of the file, or take x bytes from the end of the file, to ensure I'm not getting metadata?
Stab in the dark, if you are looking to circumvent meta-data and size related things:
Edge Detection and scale-independent comparison
Sampling and statistical analysis of grayscale/RGB values (average lum, averaged color map)
FFT and other transforms (Good article Classification of Fingerprints using FFT)
And numerous others.
Basically:
Convert JPG/PNG/GIF whatever into an RGB byte array which is independent of encoding
Use a fuzzy pattern classification method to generate a 'hash of the pattern' in the image ... not a hash of the RGB array as some suggest
Then you want a distributed method of fast hash comparison based on matching threshold on the encapsulated hash or encoding of the pattern. Erlang would be good for this :)
Advantages are:
Will, if you use any AI/training, spot duplicates regardless of encoding, size, aspect ratio, hue and luminance modification, dynamic range/subsampling differences, and in some cases perspective
Disadvantages:
Can be hard to code .. something like OpenCV might help
Probabilistic ... false positives are likely but can be reduced with neural networks and other AI
Slow unless you can encapsulate pattern qualities and distribute the search (MapReduce style)
Check out image analysis books such as:
Pattern Classification 2ed
Image Processing Fundamentals
Image Processing - Principles and Applications
And others
If you are scaling the image, then things are simpler. If not, then you have to contend with the fact that scaling is lossy in more ways than sample reduction.
Using the byte size of the image for comparison would be suitable for many applications. Another way would be to:
Strip out the metadata.
Calculate the MD5 (or other suitable hashing algorithm) for the image.
Compare that to the MD5 (or whatever) of the potential dupe image (provided you've stripped out the metadata for that one too).
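In MATLAB terms, decoding the image (which discards the metadata) and hashing the pixel array might look like this; a sketch using Java's MessageDigest, which is callable from MATLAB:

img  = imread('photo.jpg');                    % decoded pixels; metadata is dropped
md   = java.security.MessageDigest.getInstance('MD5');
md.update(typecast(img(:), 'int8'));           % Java expects signed bytes
hash = sprintf('%02x', typecast(md.digest(), 'uint8'));
disp(hash)                                     % identical pixels -> identical hash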
You could use an algorithm like SIFT (Scale Invariant Feature Transform) to determine key points in the pictures and match these.
See http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
It is used e.g. when stitching images in a panorama to detect matching points in different images.
You want to perform an image hash. Since you didn't specify a particular language I'm guessing you don't have a preference. At the very least there's a Matlab toolbox (beta) that can do it: http://users.ece.utexas.edu/~bevans/projects/hashing/toolbox/index.html. Most of the google results on this are research results rather than actual libraries or tools.
The problem with MD5ing it is that MD5 is very sensitive to small changes in the input, and it sounds like you want to do something a bit "smarter."
Pretty interesting question. The fastest and easiest approach would be to calculate a CRC32 of the content byte array, but that would work only on 100% identical images. For a more intelligent comparison you would probably need some kind of fuzzy-logic analysis...
I've implemented at least a trivial version of this. I transform and resize all images to a very small, fixed-size black-and-white thumbnail, and then compare those. It detects exact duplicates, resized duplicates, and duplicates converted to black and white. It catches a lot of duplicates without a lot of cost, roughly as sketched below.
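A sketch of that approach (the thumbnail size and the threshold are arbitrary choices; rgb2gray and imresize are from the Image Processing Toolbox):

% tinyThumb.m -- fixed-size grayscale thumbnail as a crude fingerprint
function t = tinyThumb(file)
    img = imread(file);
    if ndims(img) == 3
        img = rgb2gray(img);           % drop color
    end
    t = imresize(img, [16 16]);        % 16x16 is an arbitrary choice
end

% Usage: compare mean absolute pixel difference against a threshold
ta = tinyThumb('a.jpg');  tb = tinyThumb('b.jpg');
d  = mean(abs(double(ta(:)) - double(tb(:))));
isDuplicate = d < 8;                   % threshold found by experiment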
The easiest thing to do is to do a hash (like MD5) of the image data, ignoring all other metadata. You can find many open source libraries that can decode common image formats so it's quite easy to strip metadata.
But that doesn't work when the image itself is manipulated in any way, including scaling and rotation.
To do exactly what you want you would have to use image watermarking, but that is patented and can be expensive.
This is just an idea: possibly the low-frequency components of the JPEG's DCT could be used as a size-invariant identifier, roughly as follows.
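That is essentially what perceptual hashes such as pHash do. A rough sketch (dct2, rgb2gray, and imresize are from the Image Processing Toolbox; the 32x32 and 8x8 sizes are conventional, not required):

img  = imresize(im2double(rgb2gray(imread('photo.jpg'))), [32 32]);
C    = dct2(img);                  % 2-D DCT of the reduced image
low  = C(1:8, 1:8);                % keep only the low-frequency corner
bits = low(:) > median(low(:));    % 64-bit fingerprint, size-invariant
% Compare two images by the Hamming distance between their bit vectors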