This optimization technique works well for optimizing 3D Look Up Tables (LUTs) and minimizing the errors introduced by interpolation.
Using this optimization tool, the nodes become unevenly spaced in order to best fit the input data. For my application, however, I need evenly spaced nodes within my lookup table. This is due to a constraint of the LUT implementation, where the nodes are specified by a min and max and assumed to be evenly spaced between those values. This is an example of such an implementation, although it's common for many LUT formats to do the same thing.
I want to be able to utilize the optimization, but also create a uniformly spaced table. Is there a way to convert the optimized table to a table with evenly spaced nodes without losing the optimization, perhaps using a preceding 1D shaper LUT? Maybe by effectively shaping the data going into the uniformly spaced 3D table such that the results would match those of the optimized 3D table alone.
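To make the idea concrete, here is a rough MATLAB sketch of what I have in mind. The variable names are hypothetical: nodesR, nodesG, nodesB hold the optimized (unevenly spaced) node positions per axis, and lut3d holds the optimized table entries at those nodes.

N = numel(nodesR);                     % nodes per axis (assumed equal for all three)
uniform = linspace(0, 1, N);           % the evenly spaced positions the LUT format expects

% Per-axis 1D shaper: a piecewise-linear curve that maps each optimized
% node position onto its evenly spaced counterpart.
shaperR = @(r) interp1(nodesR, uniform, r, 'linear', 'extrap');
shaperG = @(g) interp1(nodesG, uniform, g, 'linear', 'extrap');
shaperB = @(b) interp1(nodesB, uniform, b, 'linear', 'extrap');

% The uniformly spaced 3D table keeps the optimized entries unchanged:
% lut3d(i,j,k) is simply stored at (uniform(i), uniform(j), uniform(k)),
% and inputs are passed through the shapers before indexing the table.

If that is sound, the shaper plus the uniform 3D table should reproduce the optimized table exactly at the nodes, and the piecewise-linear shaping should keep the in-between interpolation equivalent as well.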
The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (negative, positive, both). Using scipy.stats.normaltest() I found that none of the features are normally-distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data are taken from a normal distribution). But all of the distance measures that I'm aware of assume normally-distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this where one simply has to normalize every feature and hope that that doesn't introduce bias?
Why do you think all distances would assume normal (which btw. is not the same as uniform) data?
Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. These are completely appropriate for non-normally distributed data.
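For instance, a trivial MATLAB illustration of the two metrics (not tied to your data):

a = [0 0];
b = [3 4];
d_euclidean = norm(a - b);          % "as the crow flies": sqrt(3^2 + 4^2) = 5
d_manhattan = sum(abs(a - b));      % travel along the axes only: 3 + 4 = 7

Neither of these makes any assumption about how the features are distributed.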
I have 150 images, 15 each of 10 different people, so basically I know which images should belong together if clustered.
Each image is a 73-dimensional feature vector, and I clustered the images into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points, reduced their dimensionality from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these two data sets (processed and unprocessed) from the same k-means function, and wish to know whether the processing that reduced the data to a lower dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one parameter for comparison, but I am not sure whether I can directly compare and evaluate my results (within-cluster sum of distances, etc.) since the two cases have different dimensionality. Could anyone please suggest a way to compare the k-means results, some way to normalize them, or any other comparison that I can make?
I can think of three options. I am unaware of any well-developed methodology to do this specifically with K-means clustering.
1. Look at the confusion matrices between the two approaches (a rough sketch follows below).
2. Compare the Mahalanobis distances between the clusters, and between items in clusters and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with option 3 is that the distance metrics get skewed: 3-D and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on K-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
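For option 1, assuming you still know which person each image belongs to, a rough MATLAB sketch could look like the following (X73, X3, and trueLabels are hypothetical names for the 150x73 data, the 150x3 reduced data, and the known person IDs):

idx73 = kmeans(X73, 10);           % clustering on the original 73-D features
idx3  = kmeans(X3, 10);            % clustering on the 3-D reduced features

% Cluster IDs from kmeans are arbitrary, so rather than looking for a clean
% diagonal, check whether each cluster is dominated by a single true label.
C73 = confusionmat(trueLabels, idx73);
C3  = confusionmat(trueLabels, idx3);

This comparison is done in label space rather than feature space, so it sidesteps the 3-D vs. 73-D distance problem entirely.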
I'm using the griddata() command in MATLAB to go from a spherical grid with dimensions on the order of 128x256x1500 to a Cartesian cubic grid, centered on the sphere and containing N^3 regularly spaced points (where N is between 128 and 512). I need to do this for dozens or hundreds of checkpoints in my simulation, and several variables per checkpoint. I'm going to need to interpolate from the same spherical grid to the same cubic grid several hundred or several thousand times over, using new data on the spherical grid each time!
Since the most computationally expensive part of this routine is the triangulation and interpolation, I would like to cache some information the first time the routine is run and use that information for subsequent runs.
I think I could probably cache a table of vertex indices and associated interpolation weights for every point on the cubic grid, but I'm not sure how or where to do this.
As far as I can tell, this is not possible using the current implementation of griddata(). Is there any way I could do something like this -- perhaps re-writing the griddata() routine?
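To make the caching idea concrete, here is a rough sketch of what I have in mind using delaunayTriangulation and pointLocation. The variable names are placeholders: P is the M-by-3 list of spherical-grid points converted to Cartesian coordinates, Q is the N^3-by-3 list of cube points, and V is one field of data on the spherical grid (as a column vector).

DT = delaunayTriangulation(P);            % expensive: triangulate once and cache DT
[ti, bc] = pointLocation(DT, Q);          % enclosing tetrahedra + barycentric weights

% Cube points outside the convex hull of the spherical grid come back as NaN.
inside = ~isnan(ti);
vi = DT.ConnectivityList(ti(inside), :);  % cached vertex indices, one row per query point

% For every new checkpoint/variable, interpolation reduces to a weighted sum:
Vq = nan(size(Q, 1), 1);
Vq(inside) = sum(bc(inside, :) .* V(vi), 2);

I'd be happy to hear whether something like this reproduces what griddata() does internally, or whether there is a cleaner way to keep the triangulation around between calls.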
I have a set of 3D points in MATLAB, but the problem is that my data (found here) contains some outliers, and as you can see they are affecting my clustering results. Could anyone please advise how I can remove these outliers from my data?
Having looked at your data, I don't think any clustering algorithm will do what you want. Instead, you will probably need to train a classifier. This is what the Kinect people did: they trained a classifier on millions of real and synthetic postures to have it label limbs, the head, etc.
The reason I don't think density-based clustering will work either is that your data is a single, density-connected, body-with-two-boxes-shaped blob. But without knowing what a "body" and a "box" are, segmentation will be rather arbitrary. Or, in the case of density-based clustering: it will not segment at all, or it will segment e.g. by the rather low resolution of your z axis. Furthermore, your X and Y axes come from a grid-based image scan (I assume), so you have a very uniform density on the X and Y axes - but the arms, for example, are not of lower density than the body or boxes.
You can, however, use DBSCAN with rather broad (and easy to set) parameters to remove the noise.
E.g. in ELKI the following parameters yield reasonable results:
java -jar elki.jar -dbc.in /tmp/XX.csv -algorithm clustering.DBSCAN \
-dbscan.epsilon 0.05 -dbscan.minpts 100
The majority cluster is your data with the outliers removed; even with this blob near the foot removed.
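If you prefer to stay in MATLAB, a roughly equivalent call is available via the dbscan function in the Statistics and Machine Learning Toolbox (R2019a or later); X here is assumed to be your n-by-3 point matrix:

labels = dbscan(X, 0.05, 100);     % epsilon = 0.05, minpts = 100, as above
Xclean = X(labels ~= -1, :);       % dbscan labels noise points as -1

The parameter values would likely need the same kind of tuning as in ELKI.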
To speed up the clustering process, you can add the parameters
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
-pagefile.pagesize 1000 -spatial.bulkstrategy SortTileRecursiveBulkSplit
which yields a runtime of 4.5 seconds here. This obviously is not good enough for realtime operation as on a Kinect; but it is not surprising to see a supervised classification algorithm outperform an unsupervised method - this is in fact to be expected.
Here is the result of clustering the data set with the parameters above:
I am trying to plot a 2 GB matrix using MATLAB hist on a computer with 4 GB RAM. The operation is taking hours. Are there ways to increase the performance of the computation, by pre-sorting the data, pre-determining bin sizes, breaking the data into smaller groups, deleting the raw data as the data is added to bins, etc?
Also, after the data is plotted, I need to adjust the binning to ensure the curve is smooth. This requires starting over and re-binning the raw data. I assume the strategy involving the least computation would be to first bin the data using very small bins and then manipulate the bin size of the output, rather than re-binning the raw data. What is the best way to adjust bin sizes post-binning (assuming the bin sizes can only grow and not shrink)?
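To illustrate the re-binning idea (assuming a reasonably recent MATLAB with histcounts/histogram rather than hist, and assuming the matrix has been flattened to a vector x; the bin counts below are arbitrary placeholders):

edges  = linspace(min(x), max(x), 10001);        % 10,000 very fine, evenly spaced bins
counts = histcounts(x, edges);                   % only these counts need to be kept

% Later, grow the bins by merging every k fine bins, without touching x again.
k = 10;                                          % must divide the number of fine bins
coarseCounts = sum(reshape(counts, k, []), 1);
coarseEdges  = edges(1:k:end);
histogram('BinEdges', coarseEdges, 'BinCounts', coarseCounts);

Would something along these lines be the right approach, or is there a better strategy?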
I don't like answers to StackOverflow questions of the form "well, even though you asked how to do X, you don't really want to do X, you really want to do Y, so here's a solution to Y".
But that's what I am going to do here. I think such an answer is justified in this rare instance because the answer below is in accord with sound practices in statistical analysis and because it avoids the problem currently in front of you, which is crunching 2 GB of data.
If you want to represent the distribution of a population using a non-parametric density estimator, and you wish to avoid poor computational performance, a kernel density estimator (KDE) will do the job far better than a histogram.
To begin with, there's a clear preference for KDEs over histograms among the majority of academic and practicing statisticians. Among the numerous texts on this topic, one that I think is particularly good is An Introduction to Kernel Density Estimation.
Reasons why KDE is preferred to a histogram:
1. The shape of a histogram is strongly influenced by the choice of the total number of bins, yet there is no authoritative technique for calculating or even estimating a suitable value. (Any doubts about this? Just plot a histogram from some data, then watch the entire shape of the histogram change as you adjust the number of bins.)
2. The shape of the histogram is strongly influenced by the choice of the location of the bin edges.
3. A histogram gives a density estimate that is not smooth.
KDE completely eliminates histogram problems 2 and 3. Although KDE doesn't produce a density estimate with discrete bins, an analogous parameter, the "bandwidth", must still be supplied.
To calculate and plot a KDE, you need to pass in two parameter values along with your data:
kernel function: the most common options (all available in the MATLAB kde function) are uniform, triangular, biweight, triweight, Epanechnikov, and normal. Among these, Gaussian (normal) is probably the most often used.
bandwidth: the choice of value for the bandwidth will almost certainly have a huge effect on the quality of your KDE. Therefore, sophisticated computation platforms like MATLAB, R, etc. include utility functions (e.g., rusk function or MISE) to estimate the bandwidth given other parameters.
KDE in MATLAB
kde.m is the function in MATLAB that implements KDE:
[h, fhat, xgrid] = kde(x, 401);
Notice that the bandwidth and kernel are not supplied when calling kde.m. For the bandwidth, kde.m wraps a function for bandwidth selection; for the kernel function, Gaussian is used.
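If that particular kde.m isn't available to you, the Statistics Toolbox ksdensity function gives similar control in recent MATLAB releases; the bandwidth value below is just a placeholder, and by default ksdensity selects one for you:

[f, xi] = ksdensity(x, 'Kernel', 'normal', 'Bandwidth', 0.5);
plot(xi, f);      % a single smooth curve rather than millions of plotted points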
But will using KDE in place of a histogram solve or substantially eliminate the very slow performance given your 2 GB dataset?
It certainly should.
In your question, you stated that the lagging performance occurred during plotting. A KDE does not require mapping thousands (millions?) of data points to a symbol, color, and specific location on a canvas; instead it plots a single smooth line. And because the entire data set doesn't need to be rendered one point at a time on the canvas, the points don't need to be stored (in memory!) while the plot is created and rendered.