spark scala avoid shuffling after RowMatrix calculation

scala spark : How to avoid RDD shuffling in join after Distributed Matrix operation
I created an RDD of dense vectors as input to calculate the cosine similarity between columns:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
val rowMatrixIn = sc.textFile("input.csv").map { line =>
  val values = line.split(" ").map(_.toDouble)
  Vectors.dense(values)
}
Then I extracted the set of entries from the resulting coordinate matrix after the cosine calculation:
val coMatrix = new RowMatrix(rowMatrixIn)
val similarRows = coMatrix.columnSimilarities()
// extract entries over a specific threshold
val rowIndices = similarRows.entries
  .filter { case MatrixEntry(_, _, sim) => sim > someThreshold }
  .map { case MatrixEntry(_, col, sim) => (col, sim) }
We have another RDD, rdd2(key, val2).
I just want to join the two RDDs: rowIndices(key, val) with rdd2(key, val2):
val joinedRDD = rowIndices.join(rdd2)
This will result in a shuffle.
What are the best practices to follow in order to avoid the shuffle? Any suggestion on a better approach is much appreciated.
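One pattern worth sketching here (the partition count below is an illustrative assumption, not something from the question) is to hash-partition both RDDs with the same partitioner and cache them, so the shuffle happens once at partitionBy and the join can then reuse that co-partitioning:
import org.apache.spark.HashPartitioner
// A minimal sketch, assuming both RDDs are keyed by the same Long column index.
// The shuffle moves to partitionBy and happens once; the subsequent join can
// reuse the existing co-partitioning instead of reshuffling both sides.
val partitioner = new HashPartitioner(200) // 200 partitions is an arbitrary example value
val indicesPartitioned = rowIndices.partitionBy(partitioner).cache()
val rdd2Partitioned = rdd2.partitionBy(partitioner).cache()
val joinedRDD = indicesPartitioned.join(rdd2Partitioned)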

Related

Apply function to subset of Spark Datasets (Iteratively)

I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid, use a sampling function called "sample()" that takes three inputs on each square of the grid, and then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map-reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map { row =>
  val latMin = row._1
  val latMax = latMin + 0.0001
  val lonMin = row._2
  val lonMax = lonMin + 0.0001
  val filterDF = featuresDS
    .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
    .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
  val sampleDS = filterDF.sample(false, 0.05, 1234)
  sampleDS
}
val output = sampleDataRDD.reduce(_ union _)
I've tried various ways of dealing with this, converting sampleDS to an RDD and to a List, but I still keep getting a NullPointerException when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists
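One workaround to sketch (it assumes boundaryValuesDS is small enough to collect to the driver) is to loop over the grid cells on the driver, so every filter/sample is an ordinary Dataset transformation and no Dataset is ever referenced inside another Dataset's or RDD's closure:
// A minimal sketch, assuming boundaryValuesDS holds (latMin, lonMin) pairs and is small.
val cells = boundaryValuesDS.collect()
val samples = cells.map { case (latMin, lonMin) =>
  featuresDS
    .filter($"Latitude" >= latMin && $"Latitude" < latMin + 0.0001)
    .filter($"Longitude" >= lonMin && $"Longitude" < lonMin + 0.0001)
    .sample(false, 0.05, 1234)
}
val output = samples.reduce(_ union _)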

Spark - how to get top N of rdd as a new rdd (without collecting at the driver)

I am wondering how to filter an RDD down to the elements whose value is among the top N. Usually I would sort the RDD and take the top N items as an array in the driver to find the Nth value, which can be broadcast to filter the RDD like so:
val topNvalues = sc.broadcast(rdd.map(_.fieldToThreshold).distinct.sortBy(identity, ascending = false).take(N))
val threshold = topNvalues.value.last
val rddWithTopNValues = rdd.filter(_.fieldToThreshold >= threshold)
But in this case my N is too large, so how can I do this purely with RDDs, like so?
def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
  val sortedPrices = itemPrices.sortBy(-_._2).map(_._1).distinct
  // How to do this without collecting results to the driver??
  val highPrices = itemPrices.getTopNValuesWithoutCollect(count)
  itemPrices.join(highPrices.keyBy(x => x)).map(_._2._1)
}
Use zipWithIndex on the sorted RDD and then filter by the index up to n items. To illustrate, consider this RDD sorted in descending order:
val rdd = sc.parallelize((1 to 10).map( _ => math.random)).sortBy(-_)
Then
rdd.zipWithIndex.filter(_._2 < 4)
delivers the top four items without collecting the RDD to the driver.
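Applied to the question's getExpensiveItems (the signature is taken from the question; this is just a sketch of the same zipWithIndex idea):
import org.apache.spark.rdd.RDD
def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
  itemPrices
    .sortBy(-_._2)        // sort descending by price
    .zipWithIndex()       // attach a 0-based rank without collecting to the driver
    .filter(_._2 < count) // keep the top `count` entries
    .map(_._1)            // drop the rank
}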

The proper way to compute correlation between two Seq columns into a third column

I have a DataFrame where each row has 3 columns:
ID:Long, ratings1:Seq[Double], ratings2:Seq[Double]
For each row I need to compute the correlation between those two sequences.
I came up with the following solution, which is inefficient (and, as Jarrod Roberson has mentioned, does not actually work), since I have to create RDDs for each Seq:
val similarities = ratingPairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  Similarity(row.getAs[Long]("ID"), corr)
})
Is there a way to compute such correlations properly?
Let's assume you have a correlation function for arrays:
def correlation(arr1: Array[Double], arr2: Array[Double]): Double
(for potential implementations of that function, which is completely independent of Spark, you can ask a separate question or search online; there are some close-enough resources, e.g. this implementation).
Now, all that's left to do is to wrap this function with a UDF and use it:
import org.apache.spark.sql.functions._
import spark.implicits._
val corrUdf = udf {
  (arr1: Seq[Double], arr2: Seq[Double]) => correlation(arr1.toArray, arr2.toArray)
}
val result = df.select($"ID", corrUdf($"ratings1", $"ratings2") as "correlation")
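For completeness, here is a minimal Pearson correlation sketch for the plain-array helper the answer assumes (independent of Spark; treat it as an illustration rather than a vetted implementation):
// A minimal sketch: Pearson correlation of two equal-length, non-empty arrays.
def correlation(arr1: Array[Double], arr2: Array[Double]): Double = {
  require(arr1.length == arr2.length && arr1.nonEmpty, "arrays must be non-empty and of equal length")
  val n = arr1.length
  val mean1 = arr1.sum / n
  val mean2 = arr2.sum / n
  val cov = arr1.zip(arr2).map { case (a, b) => (a - mean1) * (b - mean2) }.sum
  val var1 = arr1.map(a => (a - mean1) * (a - mean1)).sum
  val var2 = arr2.map(b => (b - mean2) * (b - mean2)).sum
  cov / math.sqrt(var1 * var2)
}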

Efficient way of row/column sum of an IndexedRowMatrix in Apache Spark

I have a matrix in CoordinateMatrix format in Scala. The matrix is sparse, and the entries look like this (upon coo_matrix.entries.collect):
Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0),
MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0),
MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0),
MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
MatrixEntry(4,4,-1.0))
This is only a small sample. The matrix is of size N x N (where N = 1 million), though the majority of it is sparse. What is an efficient way of getting the row sums of this matrix in Spark Scala? The goal is to create a new RDD composed of the row sums, i.e. of size N, where the 1st element is the row sum of row 1, and so on.
I can always convert this CoordinateMatrix to an IndexedRowMatrix and run a for loop to compute the row sums one iteration at a time, but it is not the most efficient approach.
Any idea is greatly appreciated.
It will be quite expensive due to shuffling (this is the part you cannot really avoid here), but you can convert the entries to a PairRDD and reduce by key:
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD
val mat: CoordinateMatrix = ???
val rowSums: RDD[(Long, Double)] = mat.entries
  .map { case MatrixEntry(row, _, value) => (row, value) }
  .reduceByKey(_ + _)
Unlike a solution based on IndexedRowMatrix:
import org.apache.spark.mllib.linalg.distributed.IndexedRow
mat.toIndexedRowMatrix.rows.map {
  case IndexedRow(i, values) => (i, values.toArray.sum)
}
it requires no groupBy transformation or intermediate SparseVectors.
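Since the title also mentions column sums, the same pattern applies by keying on the column index instead (a sketch):
// A sketch: column sums, keyed on the column index instead of the row index.
val colSums: RDD[(Long, Double)] = mat.entries
  .map { case MatrixEntry(_, col, value) => (col, value) }
  .reduceByKey(_ + _)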

Spark - scala: shuffle RDD / split RDD into two random parts randomly

How can I take an RDD in Spark and split it into two RDDs randomly, so that each RDD includes some part of the data (let's say 97% and 3%)?
I thought of shuffling the list and then taking shuffledList.take((0.97 * rddList.count).toInt).
But how can I shuffle the RDD?
Or is there a better way to split the list?
I've found a simple and fast way to split the array:
val Array(f1,f2) = data.randomSplit(Array(0.97, 0.03))
It will split the data using the provided weights.
You should use the randomSplit method:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
// Randomly splits this RDD with the provided weights.
// weights for splits, will be normalized if they don't sum to 1
// returns split RDDs in an array
Here is its implementation in Spark 1.0:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
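A small usage sketch (it assumes an existing data: RDD[...]): passing an explicit seed makes the 97/3 split reproducible, and the weights are normalized if they don't sum to 1.
// A minimal usage sketch, assuming `data` is an existing RDD.
val Array(bigPart, smallPart) = data.randomSplit(Array(0.97, 0.03), seed = 1234L)
println(s"bigPart: ${bigPart.count()}, smallPart: ${smallPart.count()}")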