Compression algorithm for contiguous numbers - encoding

I'm looking for an efficient encoding for storing simulated coefficients.
The data consists of thousands of curves, each of 512 contiguous single-precision numbers. The data may be stored as fixed point, as long as it preserves about 23-bit precision (relative to unity).
The curves could look like these:
My best approach so far was to convert the numbers to 24-bit fixed point and then repeatedly take adjacent differences for as long as the sum of squares decreased. Compressing the resulting data with LZMA (xz, lzip) gives about 7.5x compression (compared to float32).
Adjacent differences work well at the beginning, but they emphasize the quantization noise at each turn.
I've also tried a cosine transform after subtracting the slope/curve at the boundaries, but the resulting compression was much weaker.
I tried AEC as well, but LZMA compressed much better. The highest compression so far came from bzip3 (after adjacent differences).
I found no function that fits the data to high precision with a limited parameter count.
Is there a way to reduce the penalty of quantization noise when using adjacent differences?
Are there encodings which are better suited for this type of data?
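For reference, here is a minimal Python/NumPy sketch of the approach described above (the scaling and the choice to hand raw 32-bit words to LZMA are simplifications, not my exact code):

import lzma
import numpy as np

def encode_curve(curve, frac_bits=23):
    """Quantize to 24-bit fixed point, difference while it helps, then LZMA."""
    q = np.round(np.asarray(curve, dtype=np.float64) * (1 << frac_bits)).astype(np.int32)

    def energy(a):
        return float(np.sum(a.astype(np.float64) ** 2))

    order = 0
    while True:
        d = np.diff(q)
        if energy(d) >= energy(q):
            break
        # a real codec must also keep the leading samples dropped by each diff
        q, order = d, order + 1
    return order, lzma.compress(q.astype('<i4').tobytes(), preset=9)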

You could try a higher-order predictor. Your "adjacent difference" is a zeroth-order predictor, where the next sample is predicted to be equal to the last sample. You take the differences between the actuals and the predictions, and then compress those differences.
You can try first, second, etc. order predictors. A first-order predictor would look at the last two samples, draw a line between those, and predict that the next sample will fall on the line. A second-order predictor would look at the last three samples, fit those to a parabola, and predict that the next sample will fall on the parabola. And so on.
Assuming that your samples are equally spaced on your x-axis, the predictors for the next sample x[0], up through the cubic, are:
x[-1] (what you're using now)
2*x[-1] - x[-2]
3*x[-1] - 3*x[-2] + x[-3]
4*x[-1] - 6*x[-2] + 4*x[-3] - x[-4]
(Note that the coefficients are alternating-sign binomial coefficients.)
I doubt that the cubic polynomial predictor will be useful for you, but experiment with all of them to see if any help.
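As a quick illustration (Python/NumPy here only because the idea is language-independent): the residual of the order-k predictor above is simply the (k+1)-th adjacent difference, so trying all of them is cheap.

import numpy as np

def residuals(x, order):
    # Predicting x[i] from the previous order+1 samples with the alternating-sign
    # binomial coefficients above is the same as taking the (order+1)-th difference.
    return np.diff(np.asarray(x, dtype=np.int64), n=order + 1)

def best_order(x, max_order=3):
    # pick the predictor whose residuals have the least energy for this curve
    return min(range(max_order + 1),
               key=lambda k: float(np.sum(residuals(x, k).astype(np.float64) ** 2)))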
Assuming that the differences are small, you should use a variable-length integer to represent them. The idea is to use one byte for each difference most of the time. For example, you could code seven bits of difference, say -64 to 63, in one byte with the high bit clear. If the difference doesn't fit in that, set the high bit and add a second byte with another seven bits (14 in total) whose own high bit is clear, and so on for larger differences.
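Here is a sketch of that scheme in Python; I've added a zigzag step (0, -1, 1, -2, 2, ... maps to 0, 1, 2, 3, 4, ...) so that small negative differences also land in the one-byte range, which differs slightly in detail from the two's-complement layout described above:

def encode_varints(values):
    # Little-endian base-128 encoding: 7 payload bits per byte, high bit set
    # means "more bytes follow". Differences in [-64, 63] take a single byte.
    out = bytearray()
    for v in values:
        u = (v << 1) if v >= 0 else ((-v) << 1) - 1   # zigzag: small magnitude -> small code
        while True:
            byte = u & 0x7F
            u >>= 7
            if u:
                out.append(byte | 0x80)               # continuation bit
            else:
                out.append(byte)                      # last byte of this value
                break
    return bytes(out)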


findpeaks - MINPEAKDISTANCE

I'm trying to understand how MINPEAKDISTANCE works. I returned to the documentation, here, but it wasn't very clear how this parameter works.
Can you kindly clarify it a bit?
Thanks.
Minimum peak separation: Specify the minimum peak distance, or minimum separation between peaks, as a positive integer. You can use the 'MINPEAKDISTANCE' option to specify that the algorithm ignore small peaks that occur in the neighborhood of a larger peak. When you specify a value for 'MINPEAKDISTANCE', the algorithm initially identifies all the peaks in the input data and sorts those peaks in descending order. Beginning with the largest peak, the algorithm ignores all identified peaks not separated by more than the value of 'MINPEAKDISTANCE'. Default: 1
So if you consider your peak heights as values in the "y" direction, then the separation this option refers to is in the "x" direction. For example, look at this image (from the Matlab docs; if you have the appropriate toolbox you can get the data too with load noisyecg.mat):
Let's say you just want to identify those 4 big distinct peaks, but not the hundreds of little peaks caused by noise. Setting MINPEAKDISTANCE is a feasible way to accomplish this, because the noisy peaks occur at a much higher frequency, i.e. they are closer to each other in the "x" direction, or have a smaller distance separating them, than the big peaks do. So choosing a large enough MINPEAKDISTANCE, say 100 or 350 for example depending on which peaks you're interested in, would help you avoid detecting these undesired noise peaks.
Try findpeaks on this data with different MINPEAKDISTANCE values and see what you get!
If you've got noisy data, you may find that instead of one solid peak, you get lots of small ones (see the following image).
The important data here is when the signal is high and when it is low - you don't care about small variations in value, you only want to use one of those peaks and not look at all the smaller local ones around it. If you know the frequency of your signal (i.e. how often the peaks should occur), you can tell the function to ensure that the peaks are separated by a certain amount.
In the above example, the peak is every 15 milliseconds and lasts for 5 milliseconds, so you might set your MINPEAKDISTANCE parameter to 15 or so.
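If you ever need the same behaviour outside MATLAB, SciPy's find_peaks exposes the analogous idea through its distance argument; here is a sketch on synthetic stand-in data (the numbers are only illustrative):

import numpy as np
from scipy.signal import find_peaks

# a slow periodic signal plus high-frequency noise, standing in for the ECG
t = np.arange(5000)
signal = np.sin(2 * np.pi * t / 1000) + 0.2 * np.random.randn(t.size)

peaks_all, _ = find_peaks(signal)                 # every local maximum, noise included
peaks_big, _ = find_peaks(signal, distance=350)   # only peaks at least 350 samples apart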

Interpolating a histogram matlab

I have discrete empirical data which forms a histogram with gaps, i.e. no observations were made of certain values. However, in reality those values may well occur.
This is a figure of the scatter graph:
So my question is: SHOULD I interpolate between x-axis values to make bins for the histogram? If so, what would you suggest as best practice?
Regards,
Don't do it.
With that many sample points, the probability (p-value) of getting empty bins if the distribution is smooth is quite low. There's some underlying reason they're empty, which you may want to investigate. I can think of two possibilities:
Your data actually is discrete (perhaps someone rounded off to 1 significant figure during data collection, or quantization error was significant in an ADC) and then unit conversion caused irregular gaps. Even conversion from .12 and .13 to 12, 13 as shown could cause this issue, if .12 is actually represented as .11111111198 inside the computer. But this would tend to double up counts in a neighboring bin and the gaps would tend to be regularly spaced, so I doubt this is the cause. (For example, if 128 trials of a Bernoulli coin-flip experiment were done for each data point, and someone recorded the percentage of heads in each series to the nearest 1%, you could multiply by 1.28/% to try to recover the actual number of heads, but there'd be 28 empty bins.)
Your distribution has real lobes. Because the frequency is significantly reduced following each empty bin, I favor this explanation.
But these are just starting suggestions for your own investigation.
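To put a number on "quite low": if a bin has probability p under a smooth distribution and you have N independent observations, the chance that it stays empty is (1 - p)^N. With made-up but plausible values:

N, p = 5000, 0.01                 # hypothetical sample size and per-bin probability
prob_empty = (1 - p) ** N         # about 1.5e-22, i.e. effectively impossible by chance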

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So given below image of sorted weights, I'd only like to use the features that have weights above the higher or below the lower yellow line.
What I'm looking for is some kind of slope change detection so I can discard all the features until the first/last slope coefficient increase/decrease.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like that, or anything I can use from SciPy?
Edit:
At the moment, I'm using 1.8*positive.std() as positive and 1.8*negative.std() as negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
If the data are (approximately) Gaussian distributed, then just using a multiple of the standard deviation is sensible.
If you are worried about heavier tails, then you may want to base your analysis on order statistics. Since you've plotted it, I'll assume you're willing to sort all of the data. Let N be the number of data points in your sample and let x[i] be the i'th value in the sorted list of values. Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation which is more robust against outliers. This estimate of the std can be used as you indicated above. (The magic numbers are the fractions of the data that lie below [mean + 1 sigma] and [mean - 1 sigma], respectively, for a Gaussian.)
There are also conditions where just keeping the highest 10% and lowest 10% would be sensible as well; these cutoffs are easily computed if you have the sorted data on hand.
These are somewhat ad hoc approaches based on the content of your question. The general sense of what you're trying to do is (a form of) anomaly detection, and you can probably do a better job of it if you're careful in defining/estimating what the shape of the distribution is near the middle, so that you can tell when the features are getting anomalous.
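A short NumPy sketch of that estimate and the resulting cut-offs (the 1.8 multiplier is the one from your question; a single symmetric sigma is used here for simplicity):

import numpy as np

def robust_std(values):
    # half the central ~68% spread: equals sigma for a Gaussian, but far less
    # sensitive to outliers than a plain standard deviation
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    return 0.5 * (x[int(0.8413 * n)] - x[int(0.1587 * n)])

weights = np.random.randn(1000)            # stand-in for your feature weights
sigma = robust_std(weights)
keep = (weights > 1.8 * sigma) | (weights < -1.8 * sigma)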

Histogram computational efficiency

I am trying to plot a 2 GB matrix using MATLAB hist on a computer with 4 GB RAM. The operation is taking hours. Are there ways to increase the performance of the computation, by pre-sorting the data, pre-determining bin sizes, breaking the data into smaller groups, deleting the raw data as the data is added to bins, etc?
Also, after the data is plotted, I need to adjust the binning to ensure the curve is smooth. This requires starting over and re-binning the raw data. I assume the strategy involving the least computation would be to first bin the data using very small bins and then manipulate the bin size of the output, rather than re-binning the raw data. What is the best way to adjust bin sizes post-binning (assuming the bin sizes can only grow and not shrink)?
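(For concreteness, this is roughly what I mean by manipulating the output bin sizes; a NumPy sketch with placeholder sizes, where each coarse bin is just the sum of adjacent fine bins:)

import numpy as np

data = np.random.randn(10_000_000).astype(np.float32)   # stand-in for the real matrix, flattened

counts, edges = np.histogram(data, bins=4096)            # bin the raw data once, very finely

k = 8                                                     # one coarse bin = k adjacent fine bins
coarse_counts = counts.reshape(-1, k).sum(axis=1)         # works because 4096 is divisible by k
coarse_edges = edges[::k]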
I don't like answers to StackOverflow questions of the form "well even though you asked how to do X, you don't really want to do X, you really want to do Y, so here's a solution to Y".
But that's what I'm going to do here. I think such an answer is justified in this rare instance because the answer below is in accord with sound practices in statistical analysis and because it avoids the current problem in front of you, which is crunching 4 GB of data.
If you want to represent the distribution of a population using a non-parametric density estimator, and you wish to avoid poor computational performance, a kernel density estimator (KDE) will do the job far better than a histogram.
To begin with, there's a clear preference for KDEs over histograms among the majority of academic and practicing statisticians. Among the numerous texts on this topic, one that I think is particularly good is An Introduction to Kernel Density Estimation.
Reasons why a KDE is preferred to a histogram:
1. The shape of a histogram is strongly influenced by the choice of the total number of bins, yet there is no authoritative technique for calculating or even estimating a suitable value. (Any doubts about this? Just plot a histogram from some data, then watch the entire shape of the histogram change as you adjust the number of bins.)
2. The shape of the histogram is strongly influenced by the choice of the locations of the bin edges.
3. A histogram gives a density estimate that is not smooth.
A KDE completely eliminates histogram problems 2 and 3. And although a KDE doesn't produce a density estimate with discrete bins, an analogous parameter, the "bandwidth", must still be supplied.
To calculate and plot a KDE, you need to pass in two parameter values along with your data:
kernel function: the most common options (all available in the MATLAB kde function) are uniform, triangular, biweight, triweight, Epanechnikov, and normal. Among these, the Gaussian (normal) kernel is probably the most often used.
bandwidth: the choice of bandwidth will almost certainly have a huge effect on the quality of your KDE. Therefore, sophisticated computation platforms like MATLAB, R, etc. include utility functions (e.g., rusk function or MISE) to estimate the bandwidth given the other parameters.
KDE in MATLAB
kde.m is the function in MATLAB that implements KDE:
[h, fhat, xgrid] = kde(x, 401);
Notice that the bandwidth and kernel are not supplied when calling kde.m. For the bandwidth, kde.m wraps a function for bandwidth selection; for the kernel function, the Gaussian is used.
But will using KDE in place of a histogram solve or substantially eliminate the very slow performance given your 2 GB dataset?
It certainly should.
In your question, you stated that the lagging performance occurred during plotting. A KDE does not require mapping thousands (millions?) of data points to a symbol, color, and specific location on a canvas; instead it plots a single smooth line. And because the entire data set doesn't need to be rendered one point at a time on the canvas, the points don't need to be stored (in memory!) while the plot is created and rendered.
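For what it's worth, the same recipe is only a few lines in Python/SciPy as well; here is a sketch with stand-in data (note that a plain gaussian_kde evaluation costs on the order of samples x grid points, so for a 2 GB matrix you would fit it on a random subsample or use a binned implementation):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = np.random.randn(200_000)             # stand-in; in practice, a subsample of your matrix

kde = gaussian_kde(data)                    # Gaussian kernel, bandwidth chosen automatically
xgrid = np.linspace(data.min(), data.max(), 401)
fhat = kde(xgrid)                           # density evaluated on 401 grid points only

plt.plot(xgrid, fhat)                       # one smooth line instead of millions of markers
plt.show()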

Distance to nearest palindrome

I'd like an algorithm that provides some kind of measure of how symmetrical a string is. In looking through previous questions, I found one on finding the number of letters that need to be added to a string to turn it into a palindrome. This is close to what I'm looking for, but too restrictive in the set of allowable editing operations.
My motivation is that I'd like to make an improved version of a video I put on YouTube called "Numbers are Colorful". The video shows golden ratio bases and a couple of other related systems using irrational bases. Surprisingly, one system is completely symmetrical to begin with, but the others exhibit partial symmetry, which I would like to highlight.
Are you looking for repetition or symmetry? So far I have seen no example that points to symmetry, only repetition. 1001010.0010101 is not symmetrical; the two sides are related by a circular shift, i.e. take the first set of digits [1001010], shift it to the left by 1 [0010101], and now you have the right side.
Unless you make it clear what you are trying to identify, this question is too poorly defined to give a sensible answer. If you really mean symmetrical, show me an example of symmetry. You might as well mean "I can see some interesting pattern here" which is so poorly defined it's difficult to quantify.
That said, digital signal processing is the sort of area you might look into for identifying interesting patterns. For example, if you are looking for repetition then I suggest you attempt to use an algorithm designed for detecting repeating patterns.
Consider the digits in your number to be an input signal. Perform frequency analysis on this signal to detect repeating sections of numbers. If you have a strong repeating component in your series of digits this should relate to a strong frequency component in your analysis. You can measure the strength of this pattern from identifying the fundamental frequency by performing the Fourier transform, and summing all of the harmonics for the most significant frequency bin. Divide this by the total energy of the signal and this will give you a measure between 0 and 1 for how "repetitive" the signal is, and will also identify the periodicity of the signal. You may be better off using time-domain algorithms like Autocorrelation, AMDF, or the YIN estimator. (Particularly AMDF)
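As a rough sketch of that idea on the digits themselves (plain autocorrelation, the simplest of the time-domain options mentioned; the digit string is just the example from above):

import numpy as np

digits = np.array([int(c) for c in "1001010.0010101" if c.isdigit()], dtype=float)
x = digits - digits.mean()                          # remove the mean so periodicity dominates

ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, non-negative lags
ac = ac / ac[0]                                     # normalise so lag 0 == 1

lag = 1 + int(np.argmax(ac[1:]))                    # strongest repeat and its period
repetitiveness = float(ac[lag])                     # <= 1; close to 1 means strongly periodic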
A similar approach can be adopted if you were to consider actual symmetry (i.e. the numbers are still very similar when you reverse them). Take your input number, create a new signal by reversing it, and then measure their "sameness" at each discrete phase. If you have a digit sequence of length N, you could consider padding it with 0's to length 2N before comparing the signal with its reversed self, to account for digits lying outside the length of the number.
The time-domain techniques are more likely to work because they are not affected so much by discontinuities. They literally compare the "sameness" of a signal by either computing the difference of all the points at each phase or multiplying the numbers together at each phase. In the subtraction case you hope to get 0 when they are similar; in the multiplication case you hope to get a peak in the function when the numbers are back in phase. They are, however, more prone to noise (which in this context means digits that aren't quite right).
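And here is a matching sketch for the symmetry case, sliding the reversed sequence over the original and scoring the best AMDF-style match (mapping the difference to a 0-1 score with 1/(1 + difference) is just one arbitrary choice):

import numpy as np

def symmetry_score(s):
    x = np.array([int(c) for c in s if c.isdigit()], dtype=float)
    r = x[::-1]
    n = len(x)
    best = np.inf
    for shift in range(-(n - 1), n):             # slide the reversed copy over the original
        lo, hi = max(0, shift), min(n, n + shift)
        if hi - lo < n // 2:                     # ignore alignments with tiny overlap
            continue
        diff = np.abs(x[lo:hi] - r[lo - shift:hi - shift]).mean()   # AMDF-style difference
        best = min(best, diff)
    return 1.0 / (1.0 + best)                    # 1.0 when some alignment matches exactly

print(symmetry_score("1001010.0010101"))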