What are cuboids and how do they work - olap-cube

I'm new to multidimensional modelling. I was reading about OLAP cubes and came across cuboids, and I am confused about what they are and how they work. For example, if we have a 3D OLAP cube with Product, Time and Location as its axes, how many cuboids are there? And what is the difference between a cuboid and the base cuboid?

It might help to forget about the number of dimensions. In non-OLAP terminology a cube refers to a three-dimensional shape of equal size along each dimension. In OLAP terminology a cube can have as many dimensions as the data supports, and the number of elements in each dimension can differ. E.g. a 'sales' cube can be associated with dimensions such as Product, Time, City, Basket Size, Sales Channel, Price Band, etc. It's still called a cube.
The term cuboid refers to the dataset you get when you either ignore (aggregate over) or fix the value of one or more of the dimensions. Given your example, if you were to ignore the Location dimension, the resulting dataset would be a cuboid in which your data varies by Time and Product only (and thus implicitly covers all Locations). Even though this leaves only two dimensions, we still call the whole structure a cube. With n dimensions there are 2^n cuboids, since each dimension is either kept or aggregated away, so your Product/Time/Location cube has 8. The base cuboid is the one that keeps all n dimensions (the least summarized data); the apex cuboid aggregates over all of them into a single total.
As described on this blog, OLAP systems will often store pre-calculated aggregated figures for different cuboids to allow for more efficient processing of aggregate queries.
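If it helps to see the idea concretely, here is a minimal MATLAB sketch (the array sizes and variable names are made up) in which each aggregation of the base cuboid yields another cuboid:

% Hypothetical sales cube: Product (4) x Time (12 months) x Location (5)
sales = randi(100, 4, 12, 5);                   % base cuboid: all three dimensions kept
byProductTime = sum(sales, 3);                  % Product x Time cuboid (all Locations combined)
byProduct     = squeeze(sum(sum(sales, 2), 3)); % Product-only cuboid
grandTotal    = sum(sales(:));                  % apex cuboid: a single overall total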

Related

Principal component analysis and feature reductions

I have a matrix composed of 35 features, and I need to reduce those features because I think many of the variables are dependent. I understood PCA could help me do that, so using MATLAB I calculated:
[coeff,score,latent] = pca(list_of_features)
I notice "coeff" contains matrix which I understood (correct me if I'm wrong) have column with high importance on the left, and second column with less importance and so on. However, it's not clear for me which column on "coeff" relate to which column on my original "list_of_features" so that I could know which variable is more important.
PCA doesn't give you an ordering of your original features (which feature is more 'important' than the others); rather it gives you directions in feature space, ordered by variance, from high variance (the 1st direction, or principal component) to low variance. A direction is generally a linear combination of your original features, so you can't expect to get information about a single feature.
What you can do is throw away a direction (one or more), or in other words project your data onto the sub-space spanned by a subset of the principal components. Usually you want to throw away the directions with low variance, but that is really a choice that depends on your application.
Let's say you want to keep only the first k principal components:
x = score(:,1:k) * coeff(:,1:k)';
Note however that pca centers the data, so you actually get the projection of the centered version of your data.
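Putting the pieces together, a minimal sketch of the whole workflow might look like this (k is a value you choose; list_of_features is assumed to be an n-by-35 numeric matrix):

% keep the first k principal components and reconstruct an approximation of the data
[coeff, score, latent] = pca(list_of_features);
k = 10;                                          % number of components to keep
approxCentered = score(:, 1:k) * coeff(:, 1:k)'; % projection onto the first k PCs
mu = mean(list_of_features, 1);                  % pca removed these column means
approx = bsxfun(@plus, approxCentered, mu);      % add the means back to compare with the original data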

Significance of 99% of variance covered by the first component in PCA

What does it mean/signify when the first component accounts for more than 99% of the total variance in a PCA analysis?
I have a feature matrix of size 500x1000 on which I used MATLAB's pca function, which returns [coeff,score,latent,tsquared,explained]. The variable 'explained' returns the percentage of variance covered by each component.
The 'explained' output tells you how accurately you could represent the data using just that principal component. In your case it means that using the first principal component alone, you can describe the data very accurately (to 99%).
Let's make a 2D example. Imagine you have data that is 100x2 and you do PCA.
The result could be something like this (figure taken from the internet):
This data will give you an explained value for the first principal component (the 1st PCA dimension, the big green arrow in the figure) of around 90%.
What does it mean?
It means that if you project all your data onto that line, you will reconstruct the points with 90% accuracy (of course, you will lose the information in the direction of the 2nd PCA dimension).
In your example, with 99% it visually means that almost all the points in blue are lying on the big green arrow, with very little variation in the direction of the small green arrow.
Of course it is much more difficult to visualize with 1000 dimensions instead of 2, but I hope you get the idea.
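For intuition, here is a small hedged sketch you can run: two nearly identical columns produce an 'explained' value close to 99% for the first component (the data is synthetic, purely for illustration):

rng(0);                                       % reproducible example
x = randn(100, 1);
X = [x, x + 0.05*randn(100, 1)];              % second column is almost a copy of the first
[coeff, score, latent, tsquared, explained] = pca(X);
disp(explained)                               % first entry will be around 99.9
% 'explained' is just the normalized eigenvalues: 100 * latent / sum(latent)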

Sea Ice data - MATLAB 3D matrix

I want to build a matrix from which I can pull out the data at any given lon/lat grid point. The data spans 3 years, so I would also need a 3rd dimension for time.
What I have now is three 1437x159 doubles of lat, lon, and sea ice data. How do I combine them into a 3D matrix that fits the criteria I mentioned above? Basically, I want to be able to say: I want the data at 50°S lat and 50°W lon on day 47, and be able to index into the array and find that answer.
Thanks!
Yes, this can be done without too much difficulty. I have done analogous work analyzing atmospheric data across time.
The fundamental problem is that you have data organized by time step, with grids that may vary over time, and you need to analyze it across time. I would recommend one of two approaches.
Approach 1: Grid Resampling
This involves resampling the grid data onto a uniform, standardized grid. Define the grid using the MATLAB ndgrid() function, then resample each time step using interp2(), and concatenate the results into a uniform 3D matrix. You can then interpolate directly within this resampled data using interp3(). This approach involves minimal programming, with the trade-off of losing some of the original data in the resampling process.
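A rough sketch of Approach 1 (all variable names are placeholders; I'm assuming the native lat/lon grid may be curvilinear, so I use scatteredInterpolant for the spatial step, and interpn rather than interp3 because it matches ndgrid ordering; interp2 works the same way if your grid is a regular plaid grid):

% latGrid, lonGrid : 1437x159 native coordinate matrices
% iceByDay         : 1xN cell array, each cell a 1437x159 sea-ice field for one day
[latU, lonU] = ndgrid(-80:0.25:-50, -180:0.25:180);     % uniform target grid (pick your own bounds)
nDays  = numel(iceByDay);
iceCube = nan([size(latU), nDays]);
F = scatteredInterpolant(latGrid(:), lonGrid(:), iceByDay{1}(:), 'linear', 'none');
for d = 1:nDays
    F.Values = iceByDay{d}(:);                          % reuse the triangulation, swap in each day's values
    iceCube(:, :, d) = F(latU, lonU);
end
% query any (lat, lon, day) directly within the resampled cube
val = interpn(latU(:, 1), lonU(1, :), 1:nDays, iceCube, -50, -50, 47);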
Approach 2: Dynamic Interpolation
Define a custom class wrapper around your data object, say 'SeaIce', and write your own SeaIce.interp3() method. This method would load the grid information for each time step, perform the interpolation first in the spatial dimensions and then in the time dimension. This ensures that no information is lost up front to resampling, with the trade-off of more coding involved.
The second approach is described in detail (for the wind domain) in my publication "Wind Analysis in Aviation Applications". Slides available here.
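For what it's worth, a very rough sketch of what such a wrapper might look like (the class layout, property names and the use of griddata for the spatial step are my assumptions, not the method from the publication; save as SeaIce.m):

classdef SeaIce
    properties
        latGrid    % 1437x159 native latitude grid
        lonGrid    % 1437x159 native longitude grid
        iceByDay   % cell array of 1437x159 sea-ice fields, one per day
    end
    methods
        function obj = SeaIce(latGrid, lonGrid, iceByDay)
            obj.latGrid = latGrid;
            obj.lonGrid = lonGrid;
            obj.iceByDay = iceByDay;
        end
        function v = interp3(obj, lat, lon, t)
            % interpolate spatially on the two bracketing days, then linearly in time
            d0 = floor(t);
            d1 = min(d0 + 1, numel(obj.iceByDay));
            v0 = griddata(obj.latGrid(:), obj.lonGrid(:), obj.iceByDay{d0}(:), lat, lon);
            v1 = griddata(obj.latGrid(:), obj.lonGrid(:), obj.iceByDay{d1}(:), lat, lon);
            w  = t - d0;
            v  = (1 - w) * v0 + w * v1;
        end
    end
end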

Clustering on non-numeric dimensions

I recently started working on clustering and the k-means algorithm, and I was trying to come up with a good use case and solve it.
I have the following data about the items sold in different cities.
Item City
Item1 New York
Item2 Charlotte
Item1 San Francisco
...
I would like to cluster the data based on the variables city and item to find groups of cities that might have similar patterns in the items sold. The problem is that the k-means implementation I use does not accept non-numeric input. Any idea how I should proceed to find a meaningful solution?
Thanks
SV
Clustering requires a distance definition. A cluster is only a cluster if the items are "closer" according to some distance function. The closer they are, the more likely they belong to the same cluster.
In your case, you can try to cluster based on various data related to the cities, like their geographical coordinates or demographic information, and see whether the clusters overlap in the various cases!
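For example, a tiny hedged sketch of the coordinates idea (cityCoords is a hypothetical nCities-by-2 matrix of [lat lon]; kmeans requires the Statistics and Machine Learning Toolbox):

k = 3;                                       % number of groups to look for
[idx, centroids] = kmeans(cityCoords, k);    % idx(i) is the cluster assigned to city i
% build the same kind of clustering from demographic features and compare the two idx vectors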
In order for k-means to produce usable results, the means must be meaningful.
Even if you were to use, e.g., binary vectors, k-means on these would not make a lot of sense IMHO.
Probably the best use case to get started with k-means is color quantization. Take a picture, and use the RGB values of every pixel as 3d vectors. Then run k-means with k as the desired number of colors. The color centers are your final palette, and every pixel will be mapped to the closest center for color reduction.
The reasons why this works well with k-means are twofold:
the mean actually makes sense for finding the mean color of multiple pixels
the axes R, G and B have a similar meaning and scale, so there is no bias
If you want to go a step further, try to do the same in, e.g., HSB space, and you'll run into difficulties if you want it to be really good, because the hue value is cyclic, which is inconsistent with the mean. Assuming the hue is on 0-360 degrees, the "mean" hue of "1" and "359" is not 180 degrees, but 0. So on this data, k-means results will be suboptimal.
See e.g. https://en.wikipedia.org/wiki/Color_quantization for details as well as the two dozen k-means questions here with respect to sparse and binary data.
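A minimal sketch of the color-quantization exercise in MATLAB (peppers.png ships with MATLAB; kmeans requires the Statistics and Machine Learning Toolbox):

img = im2double(imread('peppers.png'));            % RGB image scaled to [0,1]
pixels = reshape(img, [], 3);                      % every pixel as a 3D RGB vector
k = 16;                                            % desired palette size
[idx, palette] = kmeans(pixels, k, 'MaxIter', 200);
quantized = reshape(palette(idx, :), size(img));   % map each pixel to its nearest color center
imshow(quantized)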
You may still need to represent your data abstractly in numerical form. This may help:
http://www.analyticbridge.com/forum/topics/clustering-with-non-numeric?commentId=2004291%3AComment%3A40805
Try to re-analyze the problem and see whether there is any relationship that you can take advantage of and represent in numerical form.
I worked on a project where I had to represent colors by their RGB values. It worked pretty well.
Hope this helps
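To tie this back to the original question, one hedged way to represent the relationship numerically is to pivot the sales records into a city-by-item count matrix and cluster its rows (items and cities are assumed to be cell arrays of strings, one entry per sale):

[cityNames, ~, cityIdx] = unique(cities);
[itemNames, ~, itemIdx] = unique(items);
counts = accumarray([cityIdx, itemIdx], 1);         % rows = cities, columns = items, entries = sales counts
profile = bsxfun(@rdivide, counts, sum(counts, 2)); % normalize rows so the mix of items matters, not total volume
clusterId = kmeans(profile, 3);                     % k = 3 chosen arbitrarily; cities with similar item mixes share a cluster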

Process for comparing two datasets

I have two datasets at a time (in the form of vectors), and I plot them on the same axes to see how they relate to each other. I specifically note and look for places where both graphs have a similar shape (i.e. places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but since the amount of data is so large, plotting it every time I want to check how two sets correlate takes far too much time.
Are there any ideas, scripts or functions that might be useful to automate this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what you want "similar" to mean in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets
difference of the derivatives of the datasets (but I don't think it would be robust enough)
spectral analysis, as mentioned by thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
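A minimal sketch of the first two ideas in that list, assuming A and B are equal-length column vectors:

r = corrcoef(A, B);                       % overall linear correlation
overallCorr = r(1, 2);
dA = diff(A);                             % local slopes ("gradients")
dB = diff(B);
sameDirection = sign(dA) == sign(dB);     % true where both series rise or both fall
fractionSimilar = mean(sameDirection);    % rough measure of how often the shapes agree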
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data, either with a simple averaging filter (MATLAB 'smooth') or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.
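For reference, a rough sketch of steps 1-4 above, assuming A and B are equal-length column vectors ('smooth' comes with the Curve Fitting Toolbox; a moving average via conv would do the same job):

A_s = smooth(A, 11);                   % 1) 11-point moving-average smoothing
B_s = smooth(B, 11);
dA = diff(A_s);                        % 2) differentiate both vectors
dB = diff(B_s);
C = dA + dB;                           % 3) combined velocity
thr = 0.5 * std(C);                    % 4) threshold; eyeball your data to tune this
hot = abs(C) > thr;                    %    true where the combined velocity is large
idx = find(hot);                       % starting point for step 5: scan idx for a strong positive
                                       %    excursion followed by a strong negative one (or vice versa)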