Calculating integral using Spark and DataFrames - pyspark

I have a Spark DataFrame representing the power draw (in kW) of a particular device at particular moments in time (timestamped). I would like to calculate the energy consumption in kWh, i.e. compute the integral over this dataset for a given time interval. How can I accomplish this using Spark?
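One way to sketch this (not from the original question, just an illustration) is the trapezoidal rule over consecutive samples, using a window function to pair each reading with the previous one; the column names device_id, ts and kw are assumptions about the schema:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data: timestamped power readings in kW for one device.
df = spark.createDataFrame(
    [("dev1", "2018-01-01 00:00:00", 1.0),
     ("dev1", "2018-01-01 00:30:00", 2.0),
     ("dev1", "2018-01-01 01:00:00", 2.0)],
    ["device_id", "ts", "kw"],
).withColumn("ts", F.col("ts").cast("timestamp"))

w = Window.partitionBy("device_id").orderBy("ts")

# Trapezoidal rule: average of consecutive readings times the interval length in hours.
kwh = (df
       .withColumn("prev_kw", F.lag("kw").over(w))
       .withColumn("prev_ts", F.lag("ts").over(w))
       .withColumn("hours",
                   (F.col("ts").cast("long") - F.col("prev_ts").cast("long")) / 3600.0)
       .withColumn("kwh", (F.col("kw") + F.col("prev_kw")) / 2.0 * F.col("hours"))
       .groupBy("device_id")
       .agg(F.sum("kwh").alias("total_kwh")))

kwh.show()   # expected total here: 0.75 + 1.0 = 1.75 kWh

Restricting the calculation to a given time interval is then just an extra filter on ts before the aggregation.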

Related

Splitting Pyspark dataframe into equal parts, record function runtime

I have a PySpark dataframe split into 10 partitions. I'm trying to run a linear regression on each partition and store the runtime for each partition in a variable (to be able to plot it later - ultimately, I want to compare the efficiency of several models and visualize this).
I tried this:
ts = [timeit.timeit(M1.predict(m1features))]
but it doesn't work, because m1features would need to be a vector and because I want to measure Spark's performance. How would I go about measuring the runtime of the prediction on each partition?
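For what it's worth, here is a hedged sketch of one way to do this (not from the question): tag rows with spark_partition_id(), then time model.transform() on each partition's slice separately. model and df are assumed to be a fitted pyspark.ml model and the 10-partition DataFrame.

import time
from pyspark.sql import functions as F

# Tag every row with the partition it lives in and cache, so the timing below
# measures the prediction rather than re-reading the input.
tagged = df.withColumn("pid", F.spark_partition_id()).cache()
tagged.count()

runtimes = []
for pid in range(df.rdd.getNumPartitions()):
    part = tagged.filter(F.col("pid") == pid)
    start = time.time()
    model.transform(part).count()          # count() forces the lazy prediction to run
    runtimes.append(time.time() - start)

print(runtimes)                            # one wall-clock runtime per partition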

Model Probability Calibration in Pyspark

I am using PySpark to implement a Churn classification model for a business problem and the dataset I have is imbalanced. So when I train the model, I randomly select a dataset with equal numbers of 1's and 0's.
Then I applied the model to real-time data and, unsurprisingly, the numbers of predicted 1's and 0's came out roughly equal.
Now I need to calibrate my trained model, but I couldn't find a way to do it in PySpark. Does anyone have an idea how to calibrate a model in PySpark, maybe something like scikit-learn's CalibratedClassifierCV?
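As far as I know there is no built-in equivalent of CalibratedClassifierCV, but one common DIY route is Platt scaling: score a held-out set that keeps the true (imbalanced) class ratio, then fit a logistic regression on the raw score alone. A hedged sketch, where churn_model and holdout_df are assumed names and "probability"/"label" are the default pyspark.ml column names:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Score a held-out set with the real class ratio using the trained model.
scored = churn_model.transform(holdout_df)

# Pull P(label=1) out of the probability vector.
get_p1 = F.udf(lambda v: float(v[1]), DoubleType())
scored = scored.withColumn("raw_p1", get_p1("probability"))

# Platt scaling: a logistic regression with the raw score as its only feature.
assembler = VectorAssembler(inputCols=["raw_p1"], outputCol="calib_features")
calibrator = LogisticRegression(featuresCol="calib_features", labelCol="label")
calib_model = calibrator.fit(assembler.transform(scored))

# At serving time, apply churn_model, then the same assembler and calib_model;
# the calibrated probability ends up in calib_model's output "probability" column.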

Pyspark columnSimilarities() usage for calculation of cosine similarities between products

I have a big dataset and need to calculate cosine similarities between products in the context of item-item collaborative filtering for product recommendations. As the data contains more than 50000 items and 25000 rows, I opted for Spark and found the function columnSimilarities(), which can be used on a DistributedMatrix, specifically on a RowMatrix or IndexedRowMatrix.
But there are two issues I'm wondering about.
1) In the documentation, it's mentioned that:
A RowMatrix is backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.
As I have many products, it seems that RowMatrix is not the best choice for building the similarity matrix from my input, which is a Spark DataFrame. That's why I decided to start by converting the DataFrame to a CoordinateMatrix and then use toRowMatrix(), because columnSimilarities() requires a RowMatrix as input. Meanwhile, I'm not sure about its performance.
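For reference, a minimal sketch of that conversion route (column names user_idx, item_idx and rating are assumptions about the input DataFrame df; items have to end up as the columns of the matrix for columnSimilarities() to compare them):

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# rows = users, columns = items, values = ratings/interactions
entries = df.rdd.map(lambda r: MatrixEntry(r["user_idx"], r["item_idx"], r["rating"]))
coord = CoordinateMatrix(entries)

row_matrix = coord.toRowMatrix()
sims = row_matrix.columnSimilarities()       # CoordinateMatrix, upper triangle only

# Back to a DataFrame of (item_i, item_j, cosine) pairs.
sims_df = sims.entries.map(lambda e: (e.i, e.j, e.value)).toDF(["item_i", "item_j", "cosine"])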
2) I found out that:
the columnSimilarities method only returns the off-diagonal entries of the upper triangular portion of the similarity matrix.
reference
Does this mean I cannot get the similarity vectors of all the products?
So your current strategy is to compute the similarity between every item and every other item. For i items that means computing at least the upper triangle of the distance matrix, i.e. i(i-1)/2 (roughly i^2/2) calculations. Then you have to sort the results for each of those i items.
If you are willing to trade off a little accuracy for runtime, you can use approximate nearest neighbours (ANN). You might not find exactly the top nearest neighbours for an item, but you will find very similar items, and it will be orders of magnitude faster. No one dealing with moderately sized datasets calculates (or has the time to wait for) the full set of distances.
Each ANN search method creates an index that will only generate a small set of candidates and compute distances within that subset (this is the fast part). The way the index is constructed provides different guarantees about the accuracy of the NN retrieval (this is the approximate part).
There are several ANN search libraries out there, e.g. annoy, nmslib, and LSH implementations. An accessible introduction is here: https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html
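For a Spark-native flavour of this, here is a hedged sketch using the LSH implementation in pyspark.ml (available since Spark 2.1). It joins on Euclidean distance, so L2-normalise the item vectors first if you want the ranking to mirror cosine similarity; items_df with columns item_id and features (a Vector) is an assumed input:

from pyspark.ml.feature import BucketedRandomProjectionLSH, Normalizer

# Unit-length vectors: Euclidean distance then orders pairs like cosine distance.
normed = Normalizer(inputCol="features", outputCol="norm_features", p=2.0) \
    .transform(items_df)

lsh = BucketedRandomProjectionLSH(inputCol="norm_features", outputCol="hashes",
                                  bucketLength=1.0, numHashTables=5)
model = lsh.fit(normed)

# Approximate self-join: only candidate pairs below the distance threshold are scored.
pairs = model.approxSimilarityJoin(normed, normed, threshold=1.0, distCol="dist") \
             .filter("datasetA.item_id < datasetB.item_id")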
HTH. Tim

How to find silhouette in k-means clustering while doing it in Scala/Spark 2.2.0

I have been working on clustering a dataset in Scala using Spark 2.2.0. Now that I have built the clusters, I want to test/evaluate their quality. Though I have been able to find the Within Set Sum of Squared Errors for each value of K, I was hoping to do a silhouette test. Could anyone please share any relevant functions/packages for doing so in Scala?
Silhouette is not scalable. It uses pairwise distances, so it will always take O(n^2) time to compute.
Have you considered using the Within Set Sum of Squared Errors already implemented in MLlib (http://spark.apache.org/docs/latest/ml-clustering.html#k-means)? It can also help determine the number of clusters. (Cluster analysis in R: determine the optimal number of clusters)
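A sketch of that WSSSE/elbow approach, shown in PySpark for consistency with the rest of this thread (the Scala API mirrors it); features_df with a "features" vector column is an assumed input, and KMeansModel.computeCost is available in Spark 2.x:

from pyspark.ml.clustering import KMeans

for k in range(2, 11):
    model = KMeans(k=k, seed=1).fit(features_df)
    wssse = model.computeCost(features_df)   # Within Set Sum of Squared Errors
    print(k, wssse)                          # pick k at the "elbow" of this curve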

Can sample weight be used in Spark MLlib Random Forest training?

I am using the Spark 1.5.0 MLlib Random Forest algorithm (Scala code) to do two-class classification. As the dataset I am using is highly imbalanced, the majority class is down-sampled at a 10% sampling rate.
Is it possible to use the sampling weight (10 in this case) in Spark Random Forest training? I don't see weight among the input parameters for trainClassifier() in RandomForest.
Not at all in Spark 1.5, and only partially (LogisticRegression/LinearRegression) in Spark 1.6:
https://issues.apache.org/jira/browse/SPARK-7685
Here is the umbrella JIRA tracking all the subtasks:
https://issues.apache.org/jira/browse/SPARK-9610
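To illustrate that partial support, a hedged sketch of the weightCol parameter that LogisticRegression gained in Spark 1.6 (RandomForest has no equivalent there); train_df is an assumed name, and the majority class is assumed to be label 0, so the 10% of its rows that survived down-sampling get weight 10:

from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression

# Re-weight the down-sampled majority class to restore the original class ratio.
weighted = train_df.withColumn(
    "weight", F.when(F.col("label") == 0, 10.0).otherwise(1.0))

lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight")
lr_model = lr.fit(weighted)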