Is it possible to apply a visual encoding to a series that uses a separate dataset? Use case as follows:
Dataset encoded as a TypedArray (int16) that contains x, y positions for a scatter plot.
Dataset encoded as a TypedArray (uint8) that contains color values. The user may elect to change dataset 2 to color by a different feature.
The obvious solution would be to merge the two datasets client side as (x,y,color,x,y,color,...), but I would like to avoid duplicating the datasets in memory and the overhead associated with the transform. I see many examples of reusing a dataset across multiple series, but none of applying multiple datasets to a single series.
Thank you!
I am doing a two-channel data acquisition to read out the voltages of two diodes. I am using LabVIEW to do this. Below you can see the part of the code relevant for my question.
The voltages are converted to temperatures by 1D interpolation, as shown in the above code. 0.9755 V corresponds to 2 degrees, 1.68786 V to 100 degrees. For this simple example, I expected the waveform chart to display two constant curves, one at 2 and one at 100. However, it plots only one curve that zigzags between the two values. How can I get two separate plots?
Add "Index Array" at the "yi" output of the "Interpolate 1D VI" function and expand it to two elements. Then place a "Bundle" function with two inputs and connect the "Index Array" outputs to them. Connect the resulting cluster to the chart.
That's it :)
Explanation: to display multiple plots in the chart, you need to feed it a cluster, not an array. Your "Interpolate 1D VI" output gives you a 2-element array (the result of the interpolation for 0.9755 and 1.68786), so you need to convert it to a cluster using the Bundle function.
The general answer to this question is that if you open the context help and hover over the BD terminal of a graph or chart, you will see the various data types it supports:
This will help if you want to get it right every time, as the various charts and graphs can each take different types of data to display different types of graphs and remembering them can be tricky.
This optimization technique works great to optimize 3D Look Up Tables (LUTs) and appropriately minimize errors due to interpolation.
Using this optimization tool, the nodes become unevenly spaced in order to best fit the input data. However, for my application, I need to have evenly spaced nodes within my lookup table. This is due to constraints of the LUT implementation, where the nodes are specified as a min and max and assumed to be evenly spaced between those values. This is an example of such an implementation, although it is common for many LUT formats to do the same thing.
I want to be able to utilize the optimization, but also create a uniformly spaced table. Is there a way to convert the optimized table to a table with evenly spaced nodes without losing the optimization, perhaps using a preceding 1D shaper LUT? Maybe by effectively shaping the data going into the uniformly spaced 3D table such that the results would match those of the optimized 3D table alone.
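To make the shaper idea above concrete, here is a minimal 1D sketch in Python. It assumes linear interpolation and strictly increasing node positions; the names make_shaper, uniform_lut and nonuniform_lut are made up for illustration and do not come from any LUT library. In 1D, a piecewise-linear shaper that sends the i-th optimized node to the i-th uniform position reproduces the unevenly spaced table exactly. In 3D with trilinear interpolation the same trick should carry over only if the optimized nodes still form a per-axis grid (one shaper per axis); if the optimizer moves nodes independently in 3D, a 1D shaper can only approximate the result.

```python
from bisect import bisect_right

def make_shaper(nodes):
    """Piecewise-linear map sending nodes[i] to i / (len(nodes) - 1)."""
    last = len(nodes) - 1
    def shaper(x):
        x = min(max(x, nodes[0]), nodes[-1])           # clamp to table range
        i = min(bisect_right(nodes, x) - 1, last - 1)  # segment containing x
        t = (x - nodes[i]) / (nodes[i + 1] - nodes[i])
        return (i + t) / last
    return shaper

def uniform_lut(values, lo=0.0, hi=1.0):
    """Linear interpolation over values assumed evenly spaced on [lo, hi]."""
    last = len(values) - 1
    def lookup(u):
        pos = (min(max(u, lo), hi) - lo) / (hi - lo) * last
        i = min(int(pos), last - 1)
        t = pos - i
        return values[i] * (1 - t) + values[i + 1] * t
    return lookup

def nonuniform_lut(nodes, values):
    """Reference: linear interpolation directly on the unevenly spaced nodes."""
    def lookup(x):
        x = min(max(x, nodes[0]), nodes[-1])
        i = min(bisect_right(nodes, x) - 1, len(nodes) - 2)
        t = (x - nodes[i]) / (nodes[i + 1] - nodes[i])
        return values[i] * (1 - t) + values[i + 1] * t
    return lookup

# Hypothetical optimized table: uneven node positions, arbitrary values.
nodes = [0.0, 0.1, 0.4, 1.0]
values = [0.0, 0.5, 0.8, 1.0]

shaper = make_shaper(nodes)
uniform = uniform_lut(values)        # same values, stored as evenly spaced
reference = nonuniform_lut(nodes, values)
```

Here uniform(shaper(x)) and reference(x) agree for any x in the table range, because the shaper carries both the segment index and the fractional position into the uniform table.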
I need to test my Gap Statistic algorithm (which should tell me the optimum k for the dataset), and in order to do so I need to generate a big, easily clusterable dataset, so that I know a priori the optimum number of clusters. Do you know any fast way to do it?
It very much depends on what kind of dataset you expect - 1D, 2D, 3D, normal distribution, sparse, etc? And how big is "big"? Thousands, millions, billions of observations?
Anyway, my general approach to creating easy-to-identify clusters is concatenating sequential vectors of random numbers with different offsets and spreads:
DataSet = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)]; % three 1000x1 blocks stacked vertically
Groups = [1*ones(1000,1); 2*ones(1000,1); 3*ones(1000,1)]; % known cluster label for each row
This can be extended to N features by using e.g.
randn(1000,5)
or concatenating horizontally
DataSet1 = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];
DataSet2 = [-100+7*randn(1000,1); 1+0.1*randn(1000,1); 20+3*randn(1000,1)];
DataSet = [DataSet1 DataSet2]; % 3000x2: one column per feature
and so on.
randn can also generate arrays with more than two dimensions, e.g.
randn(1000,10,3);
for looking at higher-dimensional clusters.
If you don't have details on what kind of datasets this is going to be applied to, you should look for these.
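For readers without MATLAB, the same offset-and-spread idea can be sketched in Python using only the standard library. The helper name make_clusters and the seed argument are my own additions for illustration; the offsets and spreads mirror the one-feature example above.

```python
import random

def make_clusters(specs, n_per_cluster, seed=0):
    """Build 1-D test data from (offset, spread) pairs, one pair per cluster.

    Returns the samples and the known cluster label (1, 2, ...) of each sample.
    """
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    data, groups = [], []
    for label, (offset, spread) in enumerate(specs, start=1):
        for _ in range(n_per_cluster):
            data.append(offset + spread * rng.gauss(0.0, 1.0))
            groups.append(label)
    return data, groups

# Three clusters of 1000 points each, same offsets and spreads as above.
data, groups = make_clusters([(0, 5), (20, 3), (120, 25)], 1000)
```

Because groups stores the true label of every point, the k reported by the gap statistic can be compared directly with the number of (offset, spread) pairs used.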
Once I have collected and organized data in a SOM how do I identify clusters?
(Items are aggregated and clustered using many traits - upwards of 10)
Specifically, I want to find the 'center' of the cluster, therefore giving me the 'center' node(s).
You could use a relatively small map and consider each node a cluster, but this is far from optimal. If you want to apply an automated cluster detection method, you should definitely read
Clustering of the Self−Organizing Map
and search similar bibliography.
You could also use more sophisticated versions of SOM algorithm (multi leveled, self growing, etc).
In any case, keep in mind that the problem of finding the "correct" number of clusters doesn't have a definitive solution.
As far as I can tell, SOM is primarily a data-driven dimensionality reduction and data compression method. So it won't cluster the data for you; it may actually tend to spread clusters in the projection (i.e. split them into multiple cells).
However, it may work well for some data sets to either:
Instead of processing the full data set, work only on the SOM nodes (weighted by the number of elements assigned to them), which should be significantly smaller
Instead of working in the original space, work in the lower-dimensional space that the SOM represents
And then run a regular clustering algorithm on the transformed data.
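As a rough illustration of the first option, the sketch below runs a weighted k-means over SOM node weight vectors, with each node weighted by the number of samples mapped to it. This is plain Python under my own assumptions: the function name weighted_kmeans, the toy node vectors and the hit counts are invented for the example, and any regular clustering algorithm that accepts sample weights could be used instead.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sqdist(p, centers[c]))
            for p in points]

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """k-means where each point pulls on its center in proportion to its weight."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct SOM nodes
    dim = len(points[0])
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = list(centers)
        for c in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == c]
            total = sum(w for _, w in members)
            if total > 0:  # keep the old center if a cluster ends up empty
                new_centers[c] = tuple(
                    sum(w * p[d] for p, w in members) / total for d in range(dim))
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical SOM: six node weight vectors and the number of samples on each.
nodes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
hits = [5, 1, 1, 2, 2, 8]

centers, labels = weighted_kmeans(nodes, hits, k=2)
```

The resulting centers are hit-weighted means of node vectors, so they approximate the centroids you would get from clustering the full data set while only touching one vector per SOM node.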
Though this is an old question, I've encountered the same issue and I've had some success implementing Estimating the Number of Clusters in Multivariate Data by Self-Organizing Maps, so I thought I'd share.
The linked algorithm uses the U-matrix to highlight the boundaries of the individual clusters and then uses an image processing algorithm called watershedding to identify the components. For this to work correctly, the regions in the U-matrix are required to be concave within the resolution of your quantization (which, once the U-matrix is converted to a binary image, simply results in using a flood fill to identify the regions).
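The binary-image simplification mentioned above can be sketched as follows. This is not the full watershed from the paper, only the flood-fill step; the function name label_regions, the fixed threshold and the toy U-matrix are assumptions for illustration. Cells whose U-matrix value is at or above the threshold are treated as cluster boundaries, and every 4-connected group of remaining cells becomes one region.

```python
def label_regions(umatrix, threshold):
    """Label 4-connected regions of cells whose U-matrix value is below threshold.

    Returns (labels, count): labels[r][c] is 0 for boundary cells and
    1..count for cells belonging to a region.
    """
    rows, cols = len(umatrix), len(umatrix[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if umatrix[r][c] >= threshold or labels[r][c]:
                continue  # boundary cell, or already part of a region
            count += 1
            labels[r][c] = count
            stack = [(r, c)]
            while stack:  # iterative flood fill from the seed cell
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and umatrix[ny][nx] < threshold):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
    return labels, count

# Toy U-matrix: a ridge of large inter-node distances splits the map in two.
umatrix = [
    [0.1, 0.2, 0.9, 0.1, 0.2],
    [0.2, 0.1, 0.8, 0.2, 0.1],
    [0.1, 0.1, 0.9, 0.1, 0.1],
]
labels, count = label_regions(umatrix, threshold=0.5)
```

The number of regions found is the estimated number of clusters, and each SOM node inherits the label of its cell. The result depends strongly on the threshold, which is the part the watershed approach handles more carefully.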
I have a dataset of n data points, each represented by a set of extracted features. Generally, clustering algorithms require all input data to have the same dimensions (the same number of features); that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely different for each data point (I mean, I have a dataset X where the data points do not all have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require data of the same dimensions?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques. For example Sparse SVD (e.g. Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN, that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as there is no order to the features (so one cannot simply compare the first keypoint with the first keypoint, etc.). A possible approach here is to derive another, uniform, set of features. Typically, bag-of-words features are used in such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
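A minimal sketch of the bag-of-words step in plain Python, assuming the vocabulary has already been obtained by clustering the sub-features. The vocabulary vectors, the toy objects and the helper names bag_of_words and cosine_distance are all invented for the example. Each object may carry a different number of sub-features, but every histogram has one bin per vocabulary word, so the objects become comparable.

```python
import math

def nearest_word(vocab, feature):
    """Index of the vocabulary entry closest to a sub-feature (Euclidean)."""
    return min(range(len(vocab)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vocab[i], feature)))

def bag_of_words(features, vocab):
    """Fixed-length histogram: how many sub-features map to each word."""
    hist = [0] * len(vocab)
    for f in features:
        hist[nearest_word(vocab, f)] += 1
    return hist

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm

# Hypothetical vocabulary of three "visual words" in a 2-D sub-feature space.
vocab = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

# Three objects with 2, 4 and 3 sub-features respectively.
obj_a = [(0.5, 0.2), (9.8, 10.1)]
obj_b = [(0.1, -0.3), (0.4, 0.1), (10.2, 9.9), (9.7, 10.4)]
obj_c = [(0.2, 9.9), (-0.1, 10.3), (0.3, 0.1)]

hist_a = bag_of_words(obj_a, vocab)
hist_b = bag_of_words(obj_b, vocab)
hist_c = bag_of_words(obj_c, vocab)

d_ab = cosine_distance(hist_a, hist_b)
d_ac = cosine_distance(hist_a, hist_c)
```

In this toy data obj_a and obj_b use the same words in the same proportions, so d_ab comes out smaller than d_ac. A function like cosine_distance applied to the histograms is the kind of distance that can then be handed to DBSCAN or OPTICS.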
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.