How would we use "tsfresh" for time-series clustering of stock data,
where we do not have a vector of target values?
The select_features function requires a vector of target values.
First calculate a set of features from your stock time series (e.g. take price and volume data). To do that, you will have to convert your stock data into a dataframe in one of the tsfresh input formats (https://tsfresh.readthedocs.io/en/latest/text/data_formats.html).
tsfresh will return a feature matrix that you can then feed to clustering algorithms, e.g. from scikit-learn (http://scikit-learn.org/stable/modules/clustering.html). So, by using tsfresh you move your problem from the time series domain into the feature matrix domain.
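A minimal sketch of that pipeline, assuming a toy long-format frame (the column names, cluster count and toy data are invented; only extract_features, impute and the scikit-learn calls are the actual APIs):

    import numpy as np
    import pandas as pd
    from tsfresh import extract_features
    from tsfresh.utilities.dataframe_functions import impute
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Toy long-format frame: one row per (symbol, day) with price and volume.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "symbol": np.repeat(["AAA", "BBB", "CCC"], 50),
        "day": np.tile(np.arange(50), 3),
        "price": rng.normal(100, 5, 150).cumsum() / 10,
        "volume": rng.integers(1_000, 10_000, 150),
    })

    # One feature vector per symbol; no target vector is needed here.
    features = extract_features(df, column_id="symbol", column_sort="day")
    impute(features)  # replace NaN/inf that some feature calculators produce

    # Feed the feature matrix to any scikit-learn clustering algorithm.
    X = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(features.index, labels)))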
Related
I am given a set of observations, each described by a feature vector containing both numerical and categorical data.
The outcome variable to be predicted is a multidimensional vector with numerical entries of the same type. How can I do that? Random forest?
My application is to predict travel speeds in a city given the time of day, weekday, weather, temperature, etc.
More precisely, I intend to group my observations into clusters such that the overall speeds are roughly similar within each cluster.
I have also asked this question on the Cross Validated forum, but with no answer so far, so I am trying here as well:
I would like to compute a similarity matrix (which I will further use for clustering) from my data (failure data from an automotive company). The data consist of these variables:
START DATE + TIME (dd/mm/yyyy hh/mm/ss), DURATION (in seconds), DAY OF THE WEEK (mon,tue,...), WORKING TEAM (1,2,3), LOCALIZATION (1,2,3,...,20), FAILURE TYPE
From this, it is clear that there are both continuous and categorical data. What method would you suggest to calculate similarities between failure types? I think I cannot use Euclidean distance or Gower's similarity. Thank you in advance.
No, you need an ad hoc function that encodes your knowledge of what the data mean in the real world. Presumably it will mainly apply a weight to each continuous difference and a simple 2D lookup matrix for the discrete categorical variables. But don't rule out censoring of extreme values or fuzzification.
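As a rough illustration of what such an ad hoc function might look like (the weights, the localization lookup table and the field names below are invented for the example, not taken from the question):

    import numpy as np

    # Hypothetical pairwise similarity for two failure records, mixing a
    # weighted continuous difference with lookup tables for categoricals.
    N_LOC = 20
    LOC_SIM = np.full((N_LOC, N_LOC), 0.2)   # similarity between localizations
    np.fill_diagonal(LOC_SIM, 1.0)

    def failure_similarity(a, b,
                           w_duration=0.5, w_team=0.2, w_loc=0.3,
                           duration_cap=3600.0):
        """a, b: dicts with keys 'duration', 'team', 'localization'."""
        # Continuous part: censor extreme durations, then a weighted difference.
        d = abs(min(a["duration"], duration_cap) - min(b["duration"], duration_cap))
        sim_duration = 1.0 - d / duration_cap
        # Categorical parts: exact match for team, lookup matrix for localization.
        sim_team = 1.0 if a["team"] == b["team"] else 0.0
        sim_loc = LOC_SIM[a["localization"] - 1, b["localization"] - 1]
        return w_duration * sim_duration + w_team * sim_team + w_loc * sim_loc

    print(failure_similarity({"duration": 120, "team": 1, "localization": 3},
                             {"duration": 900, "team": 1, "localization": 3}))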
I want to train an SVM classifier in MATLAB for threat detection. The training data is in an Excel file and contains both numeric and text fields/columns. When I export this data to MATLAB, it is either in table or cell format. How do I convert it to matrix format?
P.S.: Using the xlsread function does not import text data.
There are 4 types of attributes in data: numerical, discrete, nominal and ordinal. First run a statistical analysis for each feature in your dataset to get the basic statistics such as mean, median, max, min, the variable type, and, for nominal or ordinal features, the distinct values. You then have a pretty good idea of what you are dealing with. Then, according to the variable type, you can decide which encoding to use: a numerical variable can be binned into classes and feature-scaled, an ordinal variable can be mapped to its logical order, and a nominal variable can be given a distinct numerical code per category. This way you are also checking how much each feature contributes to the final prediction.
My advice: use the Weka GUI tool to visualize the data. Then you can preprocess the data column by column.
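If you step outside MATLAB for a moment, the same per-feature inspection and type-dependent encoding can be sketched in Python/pandas (the column names and the ordinal mapping below are assumptions):

    import pandas as pd

    # Toy frame standing in for the Excel sheet; column names are made up.
    df = pd.DataFrame({
        "sensor_reading": [0.1, 0.4, 0.35, 0.9],
        "severity": ["low", "medium", "low", "high"],   # ordinal
        "site": ["A", "B", "A", "C"],                   # nominal
    })

    # Basic statistics and distinct values per column.
    print(df.describe(include="all"))
    print(df.nunique())

    # Ordinal: map to its logical order; nominal: give each category a code.
    df["severity"] = df["severity"].map({"low": 0, "medium": 1, "high": 2})
    df["site"] = df["site"].astype("category").cat.codes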
You need to transform your text fields into numeric using dummy variables or another technique, or drop them entirely if they actually are id's (e.g. patient name for medical data, record number, respondent uuid for a survey, etc.)
This would actually be easier in R or Python+Pandas, but in Matlab, you will need to perform encoding by yourself, working from the cell array towards a matrix. Or you can try this toolbox.
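For comparison, this is roughly what the dummy-variable route looks like in Python/pandas (column names are invented; in MATLAB you would hand-roll the equivalent from the cell array, as noted above):

    import pandas as pd

    # Stand-in for the Excel training data: one numeric and one text column.
    data = pd.DataFrame({
        "packet_rate": [120, 480, 95, 600],
        "protocol": ["tcp", "udp", "tcp", "icmp"],   # text field
        "threat": [0, 1, 0, 1],                      # label
    })

    # One-hot encode the text column; the result is a purely numeric matrix.
    X = pd.get_dummies(data.drop(columns="threat"), columns=["protocol"])
    y = data["threat"]
    print(X.to_numpy())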
I have a set of data in a vector. If I were to plot a histogram of the data I could see (by clever inspection) that the data is distributed as the sum of three distributions:
One normal distribution centered around x_1 with variance s_1;
One normal distribution centered around x_2 with variance s_2;
One lognormal distribution.
My data is obviously a subset of the 'real' data.
What I would like to do is draw a random subset from my data, ensuring that the resulting subset is a reasonably representative sample of the original data.
I would like to do this as easily as possible in MATLAB, but I am new to both statistics and MATLAB and am unsure where to start.
Thank you for any help :)
If you can identify each of the 3 distributions (in the sense that you can estimate their parameters), one approach could be to select a random subset of your data, estimate the parameters of each distribution on that subset, and see whether they are close enough (according to your own definition of "close") to the parameters estimated on the original data. You should repeat this process several times and look at the average difference for a given subset size.
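A rough sketch of that resampling check, written in Python/NumPy for brevity (the mixture parameters and the summary statistics used as a stand-in for the fitted parameters are invented; the same steps translate directly to MATLAB):

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented stand-in for the full data set: two normals plus a lognormal.
    data = np.concatenate([
        rng.normal(0.0, 1.0, 5000),
        rng.normal(5.0, 0.5, 3000),
        rng.lognormal(1.0, 0.3, 2000),
    ])

    def summary(x):
        # Crude stand-in for "estimate the parameters": moments and quartiles.
        return np.array([x.mean(), x.std(), *np.percentile(x, [25, 50, 75])])

    full = summary(data)
    subset_size = 1000
    diffs = []
    for _ in range(200):                       # repeat several times
        sub = rng.choice(data, size=subset_size, replace=False)
        diffs.append(np.abs(summary(sub) - full))

    print("average absolute difference:", np.mean(diffs, axis=0))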
I have a dataset of n items, where each item is represented by a set of extracted features. Generally, clustering algorithms require that all input data have the same dimensionality (the same number of features), that is, the input X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely different for each item (I mean, I have a dataset X where the data points do not all have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require data of the same dimensionality?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA. Then apply traditional clustering to the dense, lower-dimensional output (see the sketch after this list).
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
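A minimal sketch of the first option with scikit-learn, using TruncatedSVD as the sparse-SVD step (the matrix shape, density, component count and cluster count are arbitrary):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    # Toy sparse feature matrix: 100 items, 500 possible features, ~5% filled.
    X = sparse_random(100, 500, density=0.05, random_state=0, format="csr")

    # Reduce to a dense, low-dimensional representation, then cluster it.
    X_low = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)
    print(np.bincount(labels))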
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (the sparse problem described above).
(*) Note that the exact number of component classifications you choose doesn't really matter, as long as it is high enough, because the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms, such as DBSCAN, that in fact do not care about the representation of your data. All such an algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, as there is no order to the features (you cannot simply compare the first keypoint of one image with the first keypoint of another, etc.). A possible approach here is to derive another - uniform - set of features. Typically, bag-of-words features are used in such a situation; for images, these are also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
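For instance, a small sketch of DBSCAN over bag-of-words-style count vectors with cosine distance (the counts and the eps value are invented):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Invented "bag of words" counts: 6 objects, vocabulary of 5 "words",
    # with 0 for words that do not occur in an object.
    X = np.array([
        [3, 0, 1, 0, 0],
        [2, 0, 2, 0, 0],
        [0, 4, 0, 1, 0],
        [0, 3, 0, 2, 0],
        [0, 0, 0, 0, 5],
        [0, 0, 1, 0, 4],
    ])

    # metric="cosine" makes DBSCAN use cosine distance directly.
    labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(X)
    print(labels)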
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.
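A tiny sketch of both options in Python/pandas (the feature names and the default value of 0 are assumptions):

    import pandas as pd

    # Items with differing sets of extracted features, as dicts.
    items = [
        {"len": 4.0, "width": 1.2, "curvature": 0.3},
        {"len": 2.5, "curvature": 0.1},
        {"len": 3.1, "width": 0.9},
    ]
    df = pd.DataFrame(items)

    # Option 1: keep only the features present for every item.
    common = df.dropna(axis=1)

    # Option 2: fill missing features with a sensible default (here: 0).
    filled = df.fillna(0.0)

    print(common.columns.tolist())
    print(filled)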