I am using the Spark 1.5.0 MLlib Random Forest algorithm (Scala code) for two-class classification. Because the dataset I am using is highly imbalanced, the majority class is down-sampled at a 10% sampling rate.
Is it possible to use the sampling weight (10 in this case) when training the Spark Random Forest? I don't see a weight parameter among the inputs to trainClassifier().
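For context, the down-sampling I am doing looks roughly like this (the label encoding, variable names and the 10% rate are just my setup):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint], label 1.0 = minority class, 0.0 = majority class
def downsampleMajority(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val minority = data.filter(_.label == 1.0)
  // keep ~10% of the majority class, so each kept row "stands for" 10 originals
  val majority = data.filter(_.label == 0.0)
    .sample(withReplacement = false, fraction = 0.1, seed = 42L)
  minority.union(majority)
}
```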
Not at all in Spark 1.5, and only partially (LogisticRegression/LinearRegression) in Spark 1.6:
https://issues.apache.org/jira/browse/SPARK-7685
Here is the umbrella JIRA tracking all the subtasks:
https://issues.apache.org/jira/browse/SPARK-9610
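To illustrate the "partially": in the Spark 1.6 DataFrame-based (ml) API only a few estimators honour a per-sample weight column, e.g. LogisticRegression, while the tree ensembles still do not. A minimal sketch, with all column names assumed:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// trainingDF: DataFrame with "label", "features" and a "weight" column, where e.g.
// the kept majority-class rows carry weight 10.0 to compensate for the down-sampling
def trainWeighted(trainingDF: DataFrame): LogisticRegressionModel =
  new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setWeightCol("weight")   // added in Spark 1.6 (SPARK-7685)
    .fit(trainingDF)
```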
I am using PySpark to implement a churn classification model for a business problem, and the dataset I have is imbalanced. So when I train the model, I randomly select a subset with equal numbers of 1's and 0's.
Then I applied the model to real-time data, and, as one would expect, the numbers of predicted 1's and 0's were roughly equal.
Now I need to calibrate my trained model, but I couldn't find a way to do it in PySpark. Does anyone have an idea how to calibrate a model in PySpark, maybe something like scikit-learn's CalibratedClassifierCV?
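Spark ML has no direct equivalent of CalibratedClassifierCV, but one common workaround is to fit an isotonic regression from the model's predicted probability to the true label on a held-out set that keeps the real class ratio. A rough sketch of that idea (shown in Scala for brevity; pyspark.ml.regression.IsotonicRegression exposes the same interface, and all column names here are assumptions):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.{IsotonicRegression, IsotonicRegressionModel}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// scoredCalibration: the trained model's output (model.transform(...)) on a held-out
// set that preserves the REAL class ratio, with "label" and the "probability" vector column
val probOfOne = udf((v: Vector) => v(1))   // P(label = 1)

def fitCalibrator(scoredCalibration: DataFrame): IsotonicRegressionModel = {
  val withRaw = scoredCalibration.withColumn("rawProb", probOfOne(col("probability")))
  new IsotonicRegression()
    .setFeaturesCol("rawProb")   // learns a monotone map rawProb -> empirical P(label = 1)
    .setLabelCol("label")
    .fit(withRaw)
}
```

At prediction time, applying the fitted calibrator's transform() to the scored real-time data adds a calibrated probability in its "prediction" column.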
I read this link, which explains it: Anomaly detection with PCA in Spark
But what is the code to extract the PCA features from the training data and project the test data onto them?
From what I understood, we have to use the same PCA projection, fitted on the training data, for the test data.
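A minimal sketch of that idea with the DataFrame-based API (the "features" column name and k are assumptions): fit the PCA model on the training data only, then apply the same fitted PCAModel to both sets.

```scala
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.sql.DataFrame

// trainDF / testDF are assumed to already have a "features" Vector column
def pcaProject(trainDF: DataFrame, testDF: DataFrame, k: Int = 10): (DataFrame, DataFrame) = {
  val pcaModel: PCAModel = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(k)
    .fit(trainDF)                      // principal components learned from training data only

  (pcaModel.transform(trainDF),        // project the training data
   pcaModel.transform(testDF))         // project the test data with the SAME components
}
```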
I have been working on clustering a dataset in Scala using Spark 2.2.0. Now that I have built the clusters, I want to test/evaluate their quality. Although I have been able to compute the Within Set Sum of Squared Errors for each value of K, I was hoping to do a silhouette test. Could anyone please share any relevant functions or packages for doing this in Scala?
Silhouette is not scalable: it uses pairwise distances, which will always take O(n^2) time to compute.
Have you considered using the Within Set Sum of Squared Errors that is already implemented in MLlib (http://spark.apache.org/docs/latest/ml-clustering.html#k-means)? It can also help you determine the number of clusters (see Cluster analysis in R: determine the optimal number of clusters).
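A minimal sketch of that approach with the ml API in Spark 2.2 (the "features" column name and the range of K are assumptions): compute the WSSSE for several values of K and look for the elbow.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// df is assumed to have a "features" Vector column
def wssseByK(df: DataFrame, ks: Seq[Int] = 2 to 10): Seq[(Int, Double)] =
  ks.map { k =>
    val model = new KMeans().setK(k).setSeed(1L).setFeaturesCol("features").fit(df)
    // Within Set Sum of Squared Errors for this K
    (k, model.computeCost(df))
  }

// Inspect or plot the (k, WSSSE) pairs and pick the K at the "elbow".
```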
I have a Spark DataFrame representing the power draw (in kW) of a particular device at particular timestamped moments. I would like to calculate the energy consumption in kWh, i.e. integrate this dataset over a given time interval. How can I accomplish this using Spark?
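One way to approximate that integral is the trapezoidal rule over consecutive readings, pairing each row with the previous one via a window function. A sketch, assuming (hypothetical) columns device, ts (timestamp) and kw, and that the DataFrame has already been filtered to the time interval of interest:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// df is assumed to have columns: device (string), ts (timestamp), kw (double)
def energyKWh(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("device").orderBy("ts")

  df.withColumn("prevKw", lag("kw", 1).over(w))
    .withColumn("prevTs", lag("ts", 1).over(w))
    // trapezoid: average power over the step * elapsed hours
    .withColumn("kwh",
      (col("kw") + col("prevKw")) / 2.0 *
      (col("ts").cast("long") - col("prevTs").cast("long")) / 3600.0)
    .groupBy("device")
    .agg(sum("kwh").as("totalKWh"))   // sum() skips the null first row per device
}
```

To restrict the calculation to a given interval, filter the DataFrame on ts before calling the function.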
I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.
I know that Spark supports PCA and also PCA + K-means.
However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.
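For what it's worth, a sketch of how this is usually wired up with the DataFrame-based API (Spark 2.x): a glob or directory path makes Spark read and partition all the files into one distributed DataFrame, and PCA + K-means then run as an ordinary Pipeline. The path, CSV format, column handling and the values of k here are all assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pca-kmeans").getOrCreate()

// A glob/directory path reads every matching file into one distributed DataFrame
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/measurements/*.csv")

val assembler = new VectorAssembler()
  .setInputCols(raw.columns)            // assuming all columns are numeric features
  .setOutputCol("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(20)

val kmeans = new KMeans()
  .setFeaturesCol("pcaFeatures")
  .setK(5)

val model = new Pipeline()
  .setStages(Array(assembler, pca, kmeans))
  .fit(raw)                             // runs distributed across all partitions/files

val clustered = model.transform(raw)    // adds pcaFeatures and a cluster "prediction" column
```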