Fitting a normal distribution from a generator - scipy

Fitting a distribution with SciPy is straightforward when the data are in a list:
from scipy import stats  # import the submodule explicitly; older SciPy does not load it via "import scipy"
data = stats.norm.rvs(5.0, 2.0, size=100)
mu, std = stats.norm.fit(data)
But my data come from a generator, and norm.fit doesn't work directly with generators. Loading all the data into a list before passing them to norm.fit is not possible. Is there another library that solves my problem?
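For the normal distribution specifically, a list may not be needed at all: the maximum-likelihood estimates that norm.fit returns are the sample mean and the population (ddof=0) standard deviation, and both can be accumulated in one pass. Below is a minimal sketch using Welford's online algorithm. The name fit_normal_streaming is made up for illustration and is not part of SciPy.

```python
import math

def fit_normal_streaming(values):
    """Return (mu, std) for any iterable, consuming it one item at a time.

    Hypothetical helper, not a SciPy function. Uses Welford's online
    update, so only three numbers are kept in memory.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("cannot fit a distribution to an empty stream")
    # Divide by n (not n - 1) to match the maximum-likelihood estimate.
    return mean, math.sqrt(m2 / n)

# Usage with a generator expression standing in for the real data source.
mu, std = fit_normal_streaming(float(i) for i in range(1, 6))
```

This only covers the normal case, where the estimates have a closed form. Distributions that need numerical optimisation would require a different approach.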

Related

Is it possible to import a Keras model into MATLAB without the Deep Learning Toolbox?

I have worked out an LSTM model and would like to incorporate it into a MATLAB framework. With the Deep Learning Toolbox, the functions importKerasLayers and importKerasNetwork can be called.
Is there also a way to implement the model without the Deep Learning Toolbox?
With a new version of MATLAB, it is possible to invoke Python from MATLAB. Check this URL https://in.mathworks.com/help/matlab/matlab_external/create-object-from-python-class.html#mw_c224a09a-f56b-48e5-b9ec-145388506204
I'm not sure what you are trying to achieve by importing Keras into MATLAB. I guess you want to use MATLAB for your data preprocessing. If that is the case, you could do the preprocessing separately and save the data in a format that NumPy or pandas can read from Python.

Advice on Speeding up SciPy Custom Distribution Sampling & Fitting

I am trying to fit a custom distribution to a large (~O(500,000) measurements) dataset using scipy. I have derived a theoretical PDF based on some other factors, but both by hand and using symbolic integration software I cannot find an exact form of the CDF.
Currently, simply drawing 1000 random samples from my custom distribution is expensive, which I believe is due to the need to invert an unknown CDF. If I cannot find an explicit form of the CDF and its inverse, is there anything else I can do to speed up usage of this distribution?
I've used Maple, MATLAB, and SymPy to try to determine a CDF, yet none gives a result. I also tried down-sampling my data while still retaining the tail attributes, but this still required so much data that doing anything with the distribution was slow.
My distribution is a sub-class of SciPy's rv_continuous class.
Thanks for any advice.
This sounds like you want to sample from a kernel density estimate of the probability distribution. While SciPy does offer a Gaussian KDE (scipy.stats.gaussian_kde), for that many measurements you would be much better off using sklearn's implementation. A good resource with code examples can be found on Jake VanderPlas's blog.
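To illustrate why sampling from a Gaussian KDE avoids CDF inversion, here is a rough NumPy sketch. A Gaussian KDE is an equal-weight mixture of normals centred on the observations, so a draw is just a randomly chosen observation plus normal noise. The function name, the stand-in gamma data, and the bandwidth of 0.1 are all assumptions for illustration. Note that this samples from a smoothed version of the data, not from the theoretical PDF.

```python
import numpy as np

def sample_gaussian_kde(data, bandwidth, n_samples, seed=None):
    """Draw samples from a 1-D Gaussian KDE built on `data`.

    Hypothetical helper: pick observations at random (with replacement),
    then add normal noise whose standard deviation is the bandwidth.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    centres = rng.choice(data, size=n_samples, replace=True)
    return centres + rng.normal(0.0, bandwidth, size=n_samples)

# Stand-in for the real measurements (made-up data, arbitrary bandwidth).
measurements = np.random.default_rng(0).gamma(2.0, 1.5, size=500_000)
draws = sample_gaussian_kde(measurements, bandwidth=0.1, n_samples=1000, seed=1)
```

The cost per draw does not depend on the shape of the density, only on indexing into the data. Choosing the bandwidth well is a separate problem that this sketch does not address.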

K-means finds a singleton cluster when I standardize features (Wholesale customers dataset)

I am studying the Wholesale customers dataset. Running the elbow method I find that k=5 seems to be a good number of clusters. Unfortunately, when I standardize my features I get a singleton cluster, even with several inits. This does not happen when I don't standardize.
I know that standardization of the features is an often-asked question, however I still don't understand if that's good practice or not. Here I standardize because the variances of some features are quite different. If it's a bad idea here, can you please explain why?
Here is an example of an MDS visualisation of the K-means result. As you can see, at the bottom left of the picture there is a point that has its own cluster (it has a unique color). Is it because it's an outlier? Should I remove it by hand before running K-means?
Here is a MWE if you want to rerun the experiment yourself. Please don't hesitate to be straightforward if I somehow made a mistake.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
df = pd.read_csv("./wholesale-dataset.csv")
X = StandardScaler().fit_transform(df.values[:,2:])
km = KMeans(5)
km.fit(X)
mds = MDS().fit_transform(X)
fkm = plt.figure()
fkm.gca().scatter(mds[:,0], mds[:,1], c=km.labels_)
There is nothing wrong with k-means producing singleton clusters.
When you have outliers in your data, making such clusters is likely improving the SSE objective of k-means. So this behavior is correct.
But judging from your plot, I'd argue that your correct k is 1. There is one big blob there, some outliers, but not multiple clusters.
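The point about the SSE objective can be checked on a toy example. The sketch below uses made-up one-dimensional data (four nearby points and one outlier) and compares the sum of squared errors for two ways of splitting it into two clusters. The helper sse is written for illustration and is not a scikit-learn function.

```python
def sse(clusters):
    """Sum of squared distances of 1-D points to their own cluster mean."""
    total = 0.0
    for points in clusters:
        centre = sum(points) / len(points)
        total += sum((p - centre) ** 2 for p in points)
    return total

# Toy data (made up): 0, 1, 2, 3 and an outlier at 100, split into k=2.
balanced = [[0, 1], [2, 3, 100]]   # the outlier shares a cluster
singleton = [[0, 1, 2, 3], [100]]  # the outlier gets its own cluster

print(sse(balanced))   # 6338.5
print(sse(singleton))  # 5.0
```

Giving the outlier its own cluster reduces the objective by three orders of magnitude here, which is why k-means tends to spend a centroid on it.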

Choosing the kernel parameter in Kernel PCA

I am new to machine learning and I am trying to do unsupervised learning with k-means clustering (even though I have read that k-means does not work very well with categorical data). I encoded my categorical variables and tried to apply kernel PCA since I have a categorical feature (gender). I noticed that there are several values for the kernel parameter: 'linear', 'poly', 'rbf', 'sigmoid', 'cosine' and 'precomputed'.
I searched the internet but couldn't find a proper explanation of these. I am also not sure whether kernels are used the same way in PCA and SVM. Can anyone explain what they are, when they should be used, and/or how to choose the correct one for a dataset? Since we cannot visualize a dataset with more than 3 dimensions, how do we determine its shape in order to choose the correct parameter? Part of the code is below just to show where the parameter is used:
# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 2, kernel = 'linear')
X = kpca.fit_transform(X)
Thank you in advance.
None of these predefined kernels supports mixed data either. They are vector kernels.
A linear kernel should give the same result as non-kernel PCA, just a lot slower.
There is not much relationship to SVM except the use of kernels. And kernels like rbf make much more sense when you can do hyperparameter optimization in a supervised classification task. Since choosing such parameters is hard, making good use of KernelPCA is difficult except for toy problems.
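The claim that a linear kernel reproduces ordinary PCA can be verified numerically. The sketch below does both computations by hand in NumPy on made-up data rather than calling scikit-learn, so it shows the underlying linear algebra and not the library's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 50 samples, 4 features with different scales.
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

# Ordinary PCA: project centred data onto the top-2 right singular vectors.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = Xc @ Vt[:2].T

# Linear kernel PCA: eigendecompose the n x n Gram matrix of the centred data.
K = Xc @ Xc.T
eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:2]
kpca_scores = eigvecs[:, top] * np.sqrt(eigvals[top])

# Components are only defined up to sign, so compare absolute values.
same = np.allclose(np.abs(pca_scores), np.abs(kpca_scores))
print(same)
```

The kernel route works with an n x n matrix (samples by samples) instead of a d x d one (features by features), which is where the extra cost comes from when there are many more samples than features.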

Selecting model with maximum R-squared when curve fitting in MATLAB

I am modeling a time-series data set (x and y) with multiple methods (cubic, 4th-degree polynomial, and exponential).
Is there a way to program MATLAB such that it selects the model with the maximum R-squared value, and then uses that model to predict a future outcome? I understand this can be done manually with the Curve Fitting Toolbox by looking at the results, but even then I think I would still need to write the equation out and solve for the value of interest manually.
I would like to automate the process a bit more via code to select the optimal model and use it to predict future results. Below is the main part of my code. Any help would be appreciated.
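% Keep a separate goodness-of-fit struct per model so that the R-squared
% values (gof3.rsquare, gof4.rsquare, gofexp.rsquare) are not overwritten.
[f3, gof3] = fit(x,y,'poly3','Normalize','on');
plot(f3,x,y);
[f4, gof4] = fit(x,y,'poly4','Normalize','on');
[fexp, gofexp] = fit(x,y,'exp1');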
See math on p12 of this working paper. You can download the PDF file from the Web site:
http://www.nber.org/papers/w5027
The authors also point at some older literature.