Hey folks, I'm looking to map the parameters of PySpark's and sklearn's gradient boosting regressors. What is the sklearn equivalent of maxIter and minInfoGain?
I read through the documentation and tried using ChatGPT, but haven't found a clear mapping.
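For what it's worth, here is the closest mapping I'm aware of, as a hedged sketch rather than an official equivalence: Spark's maxIter is the number of boosting iterations (trees), which corresponds to sklearn's n_estimators, and minInfoGain (minimum gain required for a split) is closest to sklearn's min_impurity_decrease. The values below are illustrative.

from pyspark.ml.regression import GBTRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Spark: maxIter boosting iterations; a split must gain >= minInfoGain
spark_gbt = GBTRegressor(maxIter=100, minInfoGain=0.01)

# sklearn: n_estimators boosting stages; a split must decrease impurity
# by >= min_impurity_decrease -- the closest analogue, not an exact match
sk_gbt = GradientBoostingRegressor(n_estimators=100, min_impurity_decrease=0.01)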
Can anyone please share evaluation metrics for KMeans clustering in the PySpark ML library, other than the Silhouette and SSE scores, which I have already calculated?
I found a couple of candidate metrics, such as the Calinski-Harabasz Index, but they are only available in Python's scikit-learn library, and I am working in PySpark.
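One option is to compute the Calinski-Harabasz index by hand, since it is just a ratio of between-cluster to within-cluster dispersion. A sketch, assuming a predictions DataFrame with pyspark.ml's default "features" and "prediction" columns and an illustrative k=3:

import numpy as np
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=3, seed=1)                      # k=3 is illustrative
model = kmeans.fit(train_df)                      # train_df has a "features" vector column
preds = model.transform(train_df)

centers = model.clusterCenters()                  # one numpy array per cluster
rows = (preds.select("features", "prediction").rdd
             .map(lambda r: (r.features.toArray(), r.prediction))
             .cache())

n = rows.count()
k = len(centers)
overall_mean = rows.map(lambda t: t[0]).reduce(lambda a, b: a + b) / n

# Within-cluster dispersion W: squared distance of each point to its own center
W = rows.map(lambda t: float(np.sum((t[0] - centers[t[1]]) ** 2))).sum()

# Between-cluster dispersion B: cluster size times squared distance
# from each cluster center to the overall mean
sizes = rows.map(lambda t: (t[1], 1)).reduceByKey(lambda a, b: a + b).collectAsMap()
B = sum(sizes[i] * float(np.sum((centers[i] - overall_mean) ** 2)) for i in range(k))

ch_index = (B / (k - 1)) / (W / (n - k))          # Calinski-Harabasz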
I read this link that explains it: Anomaly detection with PCA in Spark.
But what is the code to extract the PCA features from the training data and project the test data onto them?
From what I understood, we have to apply the same set of features (the same projection fitted on the training data) to the test data.
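That understanding is right: fit the PCA model on the training data only, then reuse the fitted model to transform both sets. A minimal sketch with pyspark.ml, assuming a "features" vector column and an illustrative k=3:

from pyspark.ml.feature import PCA

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(train_df)                # learn the projection from training data only

train_proj = pca_model.transform(train_df)   # both sets are projected onto the
test_proj = pca_model.transform(test_df)     # same principal components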
I would like to create my own estimator in Python for PySpark, and I would like to fit it with the L-BFGS optimization algorithm, or any other optimization algorithm available.
How can I access these optimization algorithms through PySpark?
Is there an example that shows how this is done?
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.classification.LogisticRegressionWithLBFGS.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS
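A minimal usage sketch of that built-in L-BFGS-backed estimator (the RDD-based mllib API), assuming sc is your SparkContext and using toy data:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# toy two-point dataset; real data would be an RDD of LabeledPoint
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

model = LogisticRegressionWithLBFGS.train(data, iterations=100)
print(model.predict([1.0, 0.0]))   # expects class 1 for this toy data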
I am using KMeans from sklearn and from pyspark.ml.
The Spark version is much faster; however, it doesn't seem to have an option that I need. With sklearn's KMeans I can specify initial values for the cluster centers: KMeans(init=centers, ...).
I don't see such an option for PySpark. Am I missing it, or am I out of luck because it doesn't exist?
Thank you
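As far as I know, pyspark.ml's KMeans only exposes an initMode parameter ("k-means||" or "random"), not user-supplied centers. One possible workaround, sketched under the assumption that your Spark version's RDD-based mllib API accepts an initialModel, is to drop down to pyspark.mllib:

from pyspark.mllib.clustering import KMeans, KMeansModel

centers = [[0.0, 0.0], [5.0, 5.0]]   # illustrative starting centers

# Wrap the known centers in a KMeansModel and pass it as the initial model;
# rdd is an RDD of feature vectors (lists or numpy arrays), not a DataFrame
init_model = KMeansModel(centers)
model = KMeans.train(rdd, k=2, maxIterations=20, initialModel=init_model)

The trade-off is giving up the DataFrame-based pyspark.ml speed mentioned above, so verify the initialModel parameter against your Spark version's docs.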
Wondering if there is a runWithValidation feature for Gradient Boosted Trees (GBT) in Spark ML to prevent overfitting. It's there in mllib, which works with RDDs; I am looking for the same for DataFrames.
Found K-Fold cross-validation support in Spark. It can be done using CrossValidator with an Estimator, an Evaluator, a ParamMap, and a number of folds; see the sketch below. This helps in finding the best parameters for the model, i.e. model tuning.
Refer to http://spark.apache.org/docs/latest/ml-tuning.html for more details.
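A minimal sketch with the DataFrame-based GBTRegressor; the grid values and column names are illustrative:

from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbt = GBTRegressor(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(gbt.maxIter, [10, 50])
        .addGrid(gbt.maxDepth, [3, 5])
        .build())
evaluator = RegressionEvaluator(metricName="rmse", labelCol="label")

cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)

cv_model = cv.fit(train_df)      # averages RMSE over the folds per param combo
best_gbt = cv_model.bestModel    # GBT refit on all data with the best params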