Custom evaluator during cross-validation in Spark (PySpark)

My aim is to add a rank-based evaluator to the CrossValidator (PySpark):
cvExplicit = CrossValidator(estimator=cvSet, numFolds=8, estimatorParamMaps=paramMap, evaluator=rnkEvaluate)
However, I need to pass the predictions dataframe into the evaluator, and I do not know how to do that part.
class rnkEvaluate():
    def __init__(self, user_col="user", rating_col="rating", prediction_col="prediction"):
        self._user_col = user_col
        self._rating_col = rating_col
        self._prediction_col = prediction_col

    def isLargerBetter(self):
        return True

    def evaluate(self, predictions):
        denominator = predictions.groupBy().sum(self._rating_col).collect()[0][0]
        # TODO: rest of the calculation ...
        return numerator / denominator
Somehow I need to pass the predictions dataframe at every fold iteration, but I could not manage it.

I've solved this issue; here is the code:
import numpy as np
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel
from pyspark.sql.functions import rand

result = []

class CrossValidatorVerbose(CrossValidator):

    def writeResult(self, result):
        resfile = open('executions/results.txt', 'a')
        resfile.writelines("\n")
        resfile.writelines(result)
        resfile.close()

    def _fit(self, dataset):
        est = self.getOrDefault(self.estimator)
        epm = self.getOrDefault(self.estimatorParamMaps)
        numModels = len(epm)

        eva = self.getOrDefault(self.evaluator)
        metricName = eva.getMetricName()

        nFolds = self.getOrDefault(self.numFolds)
        seed = self.getOrDefault(self.seed)
        h = 1.0 / nFolds

        randCol = self.uid + "_rand"
        df = dataset.select("*", rand(seed).alias(randCol))
        metrics = [0.0] * numModels

        for i in range(nFolds):
            foldNum = i + 1
            print("Comparing models on fold %d" % foldNum)

            validateLB = i * h
            validateUB = (i + 1) * h
            condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
            validation = df.filter(condition)
            train = df.filter(~condition)

            for j in range(numModels):
                paramMap = epm[j]
                model = est.fit(train, paramMap)
                predictions = model.transform(validation, paramMap)
                # print(predictions.show())
                metric = eva.evaluate(spark=spark, predictions=predictions)
                metrics[j] += metric

                avgSoFar = metrics[j] / foldNum
                res = ("params: %s\t%s: %f\tavg: %f" % (
                    {param.name: val for (param, val) in paramMap.items()},
                    metricName, metric, avgSoFar))
                self.writeResult(res)
                result.append(res)
                print(res)

        if eva.isLargerBetter():
            bestIndex = np.argmax(metrics)
        else:
            bestIndex = np.argmin(metrics)

        bestParams = epm[bestIndex]
        bestModel = est.fit(dataset, bestParams)
        avgMetrics = [m / nFolds for m in metrics]
        bestAvg = avgMetrics[bestIndex]
        print("Best model:\nparams: %s\t%s: %f" % (
            {param.name: val for (param, val) in bestParams.items()},
            metricName, bestAvg))

        return self._copyValues(CrossValidatorModel(bestModel, avgMetrics))

evaluator = RankUserWeighted("user", "rating", "prediction")
cvImplicit = CrossValidatorVerbose(estimator=customImplicit, numFolds=8,
                                   estimatorParamMaps=paramMap, evaluator=evaluator)


Scala Mismatch MonteCarlo

I am trying to implement a version of the Monte Carlo algorithm in Scala, but I have a little problem.
In my first loop, I have a mismatch between Unit and Int, but I don't know how to solve it.
Thanks for your help!
import scala.math._
import scala.util.Random
import scala.collection.mutable.ListBuffer

object Main extends App {
  def MonteCarlo(list: ListBuffer[Int]): List[Int] = {
    for (i <- list) {
      var c = 0.00
      val X = new Random
      val Y = new Random
      for (j <- 0 until i) {
        val x = X.nextDouble // in [0,1]
        val y = Y.nextDouble // in [0,1]
        if (x * x + y * y < 1) {
          c = c + 1
        }
      }
      c = c * 4
      var p = c / i
      var error = abs(Pi - p)
      print("Approximative value of pi : $p \tError: $error")
    }
  }
  var liste = ListBuffer(200, 2000, 4000)
  MonteCarlo(liste)
}
(From someone who usually works with Python.)
First, a for loop does not return anything, so your method body evaluates to Unit, while the declared return type is List[Int] — hence the mismatch.
Second, you have not used Scala string interpolation correctly: it won't print the values of p and error because you forgot the 's' prefix before the string.
Third, if you want to return a list, you first need a list in which to accumulate the values from every iteration.
I am assuming you are trying to return the error of every iteration, so I have created an errorList that stores all the error values. If you want to return something else, you can modify the code accordingly.
def MonteCarlo(list: ListBuffer[Int]) = {
  val errorList = new ListBuffer[Double]()
  for (i <- list) {
    var c = 0.00
    val X = new Random
    val Y = new Random
    for (j <- 0 until i) {
      val x = X.nextDouble // in [0,1]
      val y = Y.nextDouble // in [0,1]
      if (x * x + y * y < 1) {
        c = c + 1
      }
    }
    c = c * 4
    var p = c / i
    var error = abs(Pi - p)
    errorList += error
    println(s"Approximative value of pi : $p \tError: $error")
  }
  errorList
}
scala> MonteCarlo(liste)
Approximative value of pi : 3.26 Error: 0.11840734641020667
Approximative value of pi : 3.12 Error: 0.02159265358979301
Approximative value of pi : 3.142 Error: 4.073464102067881E-4
res9: scala.collection.mutable.ListBuffer[Double] = ListBuffer(0.11840734641020667, 0.02159265358979301, 4.073464102067881E-4)
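For what it's worth, here is a sketch of a more functional variant of the same logic (my own rewrite, not part of the original answer): mapping over the input list removes the mutable buffer, and the result type is inferred as List[Double].
import scala.math._
import scala.util.Random

// Same Monte Carlo estimate, built with map instead of a mutable ListBuffer;
// each element of the returned list is the error for one sample size n.
def monteCarloErrors(samples: List[Int]): List[Double] =
  samples.map { n =>
    val rng = new Random
    val hits = (0 until n).count { _ =>
      val x = rng.nextDouble // in [0,1]
      val y = rng.nextDouble // in [0,1]
      x * x + y * y < 1
    }
    val p = 4.0 * hits / n
    val error = abs(Pi - p)
    println(s"Approximate value of pi: $p \tError: $error")
    error
  }

monteCarloErrors(List(200, 2000, 4000))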

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given two huge lists of values, I am trying to compute the Jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1 (trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2 (using MinHashing):
I have used the approach explained here.
import java.util.zip.CRC32

def getCRC32(s: String): Int = {
  val crc = new CRC32
  crc.update(s.getBytes)
  return crc.getValue.toInt & 0xffffffff
}

val maxShingleID = Math.pow(2, 32) - 1

def pickRandomCoeffs(kIn: Int): Array[Int] = {
  var k = kIn
  val randList = Array.fill(k){0}
  while (k > 0) {
    // Get a random shingle ID.
    var randIndex = (Math.random() * maxShingleID).toInt
    // Ensure that each random number is unique.
    while (randList.contains(randIndex)) {
      randIndex = (Math.random() * maxShingleID).toInt
    }
    // Add the random number to the list.
    k = k - 1
    randList(k) = randIndex
  }
  return randList
}

val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))

val nextPrime = 4294967311L
val numHashes = 10

val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)

var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature1(i) = hashCodeRDD.min.toInt
}

var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature2(i) = hashCodeRDD.min.toInt
}

var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}

val jSimilarity = count / numHashes.toDouble
Approach 1 always seems to outperform Approach 2 in terms of time. When I analyzed the code, the min() call on the RDD in Approach 2 takes significant time, and that function is called many times, depending on how many hash functions are used.
The intersection and union operations used in Approach 1 seem to work faster than the repeated min() calls.
I don't understand why minHashing does not help here. I expected minHashing to be faster than the trivial approach. Is there anything I am doing wrong here?
Sample data can be viewed here.
JaccardSimilarity with MinHash is not giving consistent results:
import java.util.zip.CRC32

object Jaccard {

  def getCRC32(s: String): Int = {
    val crc = new CRC32
    crc.update(s.getBytes)
    return crc.getValue.toInt & 0xffffffff
  }

  def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
    var k = kIn
    val randList = Array.ofDim[Int](k)
    while (k > 0) {
      // Get a random shingle ID.
      var randIndex = (Math.random() * maxShingleID).toInt
      // Ensure that each random number is unique.
      while (randList.contains(randIndex)) {
        randIndex = (Math.random() * maxShingleID).toInt
      }
      // Add the random number to the list.
      k = k - 1
      randList(k) = randIndex
    }
    return randList
  }

  def approach2(list1Values: List[String], list2Values: List[String]) = {
    val maxShingleID = Math.pow(2, 32) - 1

    val colHashed1 = list1Values.map(a => getCRC32(a))
    val colHashed2 = list2Values.map(a => getCRC32(a))

    val nextPrime = 4294967311L
    val numHashes = 10

    val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
    val coeffB = pickRandomCoeffs(numHashes, maxShingleID)

    val signature1 = for (i <- 0 until numHashes) yield {
      val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
      hashCodeRDD.min.toInt // Track the lowest hash code seen.
    }

    val signature2 = for (i <- 0 until numHashes) yield {
      val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
      hashCodeRDD.min.toInt // Track the lowest hash code seen.
    }

    val count = (0 until numHashes)
      .map(k => if (signature1(k) == signature2(k)) 1 else 0)
      .fold(0)(_ + _)

    val jSimilarity = count / numHashes.toDouble
    jSimilarity
  }

  // def approach1(list1Values: List[String], list2Values: List[String]) = {
  //   val colHashed1 = list1Values.toSet
  //   val colHashed2 = list2Values.toSet
  //
  //   val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
  //   jSimilarity
  // }

  def approach1(list1Values: List[String], list2Values: List[String]) = {
    val colHashed1 = list1Values.toSet
    val colHashed2 = list2Values.toSet
    val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
    jSimilarity
  }

  def main(args: Array[String]) {
    val list1Values = List("a", "b", "c")
    val list2Values = List("a", "b", "d")
    for (i <- 0 until 5) {
      println(s"Iteration ${i}")
      println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
      println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
    }
  }
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using MinHash here at all?
It seems to me that the overhead of the MinHashing approach simply outweighs its benefit in Spark, especially as numHashes increases.
Here are some observations I've found in your code:
First, the while (randList.contains(randIndex)) check will surely slow down your process as numHashes increases (numHashes is, by the way, equal to the size of randList); a set-based alternative is sketched just below.
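For example (a sketch of what I mean, not a drop-in from your code), drawing the coefficients into a Set makes the uniqueness check O(1) instead of a linear scan:
// Generate k distinct random coefficients; the Set gives O(1) membership
// tests, unlike Array.contains, which scans the whole array each time.
def pickRandomCoeffs(k: Int, maxShingleID: Double): Array[Int] = {
  val seen = scala.collection.mutable.LinkedHashSet.empty[Int]
  while (seen.size < k) {
    seen += (Math.random() * maxShingleID).toInt
  }
  seen.toArray
}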
Second, you can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature1(i) = hashCodeRDD.min.toInt
}

var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature2(i) = hashCodeRDD.min.toInt
}

var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1) {
  val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))

  val sig1 = hashCodeRDD1.min.toInt
  val sig2 = hashCodeRDD2.min.toInt

  if (sig1 == sig2) { count = count + 1 }
}
This merges the three loops into one; however, I am not sure whether it would give a huge boost in computational time.
One other suggestion I have, assuming that the first approach still turns out to be much faster, is to use the properties of sets to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
With that, instead of computing the union, you can just reuse the value of the intersection.
Actually, in the LSH approach you would calculate the minHash signature only once for each of your documents, and then compare two signatures for each possible pair of documents. In the trivial approach, by contrast, you would perform a full comparison of the documents for each possible pair, which is roughly N^2/2 comparisons. Hence the extra cost of calculating the minHashes is negligible for a large enough number of documents.
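To make that concrete, here is a rough sketch of my own (the names docs, minHashSignature and estimate are illustrative, not from the question's code): every document's signature is computed once, and only the cheap signature comparison runs for each of the roughly N^2/2 pairs.
import scala.util.Random

val nextPrime = 4294967311L
val numHashes = 10
val rnd = new Random(42)
val coeffA = Array.fill(numHashes)(rnd.nextInt(Int.MaxValue))
val coeffB = Array.fill(numHashes)(rnd.nextInt(Int.MaxValue))

// One pass over a document: its minhash signature.
def minHashSignature(doc: Set[Int]): Array[Long] =
  coeffA.zip(coeffB).map { case (a, b) =>
    doc.map(ele => (a.toLong * ele + b) % nextPrime).min
  }

// Comparing two documents only touches numHashes values, not the documents.
def estimate(sig1: Array[Long], sig2: Array[Long]): Double =
  sig1.zip(sig2).count { case (x, y) => x == y }.toDouble / sig1.length

val docs: Seq[(Int, Set[Int])] = Seq(
  1 -> Set(101, 102, 103),
  2 -> Set(101, 102, 104),
  3 -> Set(105, 106, 107))

val sigs = docs.map { case (id, d) => id -> minHashSignature(d) }   // O(N) signature work
val pairwise = for {
  (i, s1) <- sigs
  (j, s2) <- sigs if i < j                                          // ~N^2/2 cheap comparisons
} yield ((i, j), estimate(s1, s2))

pairwise.foreach(println)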
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
and the performance of the Jaccard distance calculation (the last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}

val jSimilarity = count / numHashes.toDouble
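A quick way to run that comparison is to time both expressions on the same data. wallClock below is a hypothetical helper of mine, not a Spark function, and the value names are reused from the question's code:
// Hypothetical timing helper (not part of Spark): evaluate a block and
// print how long it took in seconds.
def wallClock[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

val trivial = wallClock("trivial Jaccard") {
  colHashed1.intersection(colHashed2).distinct.count /
    colHashed1.union(colHashed2).distinct.count.toDouble
}

val fromSignatures = wallClock("signature comparison only") {
  (0 until numHashes).count(k => signature1(k) == signature2(k)) / numHashes.toDouble
}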

Creating an RDD to collect the results of an iterative calculation

I would like to create an RDD that collects the results of an iterative calculation.
How can I use a loop (or any alternative) to replace the following code:
import org.apache.spark.mllib.random.RandomRDDs._
val n = 10
val step1 = normalRDD(sc, n, seed = 1 )
val step2 = normalRDD(sc, n, seed = (step1.max).toLong )
val result1 = step1.zip(step2)
val step3 = normalRDD(sc, n, seed = (step2.max).toLong )
val result2 = result1.zip(step3)
...
val step50 = normalRDD(sc, n, seed = (step49.max).toLong )
val result49 = result48.zip(step50)
(Creating the N step RDDs and zipping them together at the end would also be fine, as long as the 50 RDDs are created iteratively to respect the seed = (step(n-1).max) condition.)
A recursive function would work:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs._

/**
 * The return type is an Option to handle the case of a user specifying
 * a non-positive number of steps. Each element of the result carries the
 * values collected from every step, so the accumulator is an RDD[Seq[Double]].
 */
def createZippedNormal(sc: SparkContext,
                       numPartitions: Int,   // plays the role of n (the RDD size), as in the question
                       numSteps: Int): Option[RDD[Seq[Double]]] = {

  @scala.annotation.tailrec
  def accum(sc: SparkContext,
            numPartitions: Int,
            numSteps: Int,
            currRDD: RDD[Seq[Double]],
            seed: Long): RDD[Seq[Double]] = {
    if (numSteps <= 0) currRDD
    else {
      val newRDD = normalRDD(sc, numPartitions, seed = seed)
      accum(sc, numPartitions, numSteps - 1,
            currRDD.zip(newRDD).map { case (acc, x) => acc :+ x },
            newRDD.max.toLong)
    }
  }

  if (numSteps <= 0) None
  else {
    val first = normalRDD(sc, numPartitions, seed = 1L)
    Some(accum(sc, numPartitions, numSteps - 1, first.map(Seq(_)), first.max.toLong))
  }
}
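Called like this (a usage sketch matching the question's n = 10 and 50 steps), each element of the resulting RDD holds one value per step:
val zipped: Option[RDD[Seq[Double]]] = createZippedNormal(sc, 10, 50)
zipped.foreach(rdd => println(rdd.first().size))   // 50 values per element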

Retrieving not only top one predictions from Multiclass Regression with Spark [duplicate]

I'm running Bernoulli Naive Bayes using this code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernoulli Naive Bayes, here is an example:
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1", "1,1 1 0"))

// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

// labels
val labels = model.labels

// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println

// For one specific vector; here I'm taking the first vector in parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
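One thing to keep in mind (my own addition, just a sketch): model.predict gives hard 0/1 labels, so the ROC curve above is built from a single threshold. If you want an AUC comparable to the SVM example with clearThreshold(), you can use the predicted probability of class 1.0 as the score instead:
// Use the class-1 probability as the ranking score for the ROC curve.
val idxOfOne = model.labels.indexOf(1.0)
val scoreAndLabels = test.map { point =>
  (model.predictProbabilities(point.features)(idxOfOne), point.label)
}
val probMetrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROCFromProbs = probMetrics.areaUnderROC()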
Concerning the inquiry from the chat:
import org.apache.spark.mllib.linalg.Vector

val results = parsedData.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  (for (i <- 0 to (probs.size - 1)) yield (lp.label, labels(i), probs(i)))
}.flatMap(identity)

results.take(10).foreach(println)
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
And if you are only interested in the argmax classes:
val results = training.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  val bestClass = probs.argmax
  (labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for PySpark users)
It seems some people are having trouble getting probabilities using PySpark and MLlib. That's normal: spark-mllib doesn't expose that function for PySpark.
You'll therefore need to use the spark-ml DataFrame-based API:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes

df = spark.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])

nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)

model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
For more information about Naive Bayes in spark-ml, please refer to the official documentation here.

How to do X * diag(Y) in Scala Breeze?

How do I compute X * diag(Y) in Scala Breeze, where X could be, for example, a CSCMatrix and Y a DenseVector?
In MATLAB syntax, this would be:
X * spdiags(Y, 0, N, N)
Or:
X .* repmat(Y', K, 1)
In SciPy syntax, this would be a 'broadcast multiply':
Y * X
In the end I wrote my own sparse diagonal method and a dense/sparse multiplication method.
Use them like this:
val N = 100000
val K = 100
val A = DenseMatrix.rand(N,K)
val b = DenseVector.rand(N)
val c = MatrixHelper.spdiag(b)
val d = MatrixHelper.mul( A.t, c )
Here are the implementations of spdiag and mul:
// Copyright Hugh Perkins 2012
// You can use this under the terms of the Apache Public License 2.0
// http://www.apache.org/licenses/LICENSE-2.0

package root

import breeze.linalg._

object MatrixHelper {

  // It's only efficient to put the sparse matrix on the right hand side, since
  // it is a column-sparse matrix.
  def mul(A: DenseMatrix[Double], B: CSCMatrix[Double]): DenseMatrix[Double] = {
    val resultRows = A.rows
    val resultCols = B.cols
    var row = 0
    val result = DenseMatrix.zeros[Double](resultRows, resultCols)
    while (row < resultRows) {
      var col = 0
      while (col < resultCols) {
        val rightRowStartIndex = B.colPtrs(col)
        val rightRowEndIndex = B.colPtrs(col + 1) - 1
        val numRightRows = rightRowEndIndex - rightRowStartIndex + 1
        var ri = 0
        var sum = 0.0
        while (ri < numRightRows) {
          val inner = B.rowIndices(rightRowStartIndex + ri)
          val rightValue = B.data(rightRowStartIndex + ri)
          sum += A(row, inner) * rightValue
          ri += 1
        }
        result(row, col) = sum
        col += 1
      }
      row += 1
    }
    result
  }

  def spdiag(a: Tensor[Int, Double]): CSCMatrix[Double] = {
    val size = a.size
    val result = CSCMatrix.zeros[Double](size, size)
    result.reserve(a.size)
    var i = 0
    while (i < size) {
      result.rowIndices(i) = i
      result.colPtrs(i) = i
      result.data(i) = a(i)   // store the diagonal value, not the index
      // result(i,i) = a(i)
      i += 1
    }
    // result.activeSize = size
    result.colPtrs(i) = i
    result
  }
}
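As a side note (my own sketch, only sensible when everything is dense and small enough to materialise an N x N matrix): breeze.linalg.diag builds a dense diagonal matrix from a DenseVector, so for the fully dense case you can write the product directly:
import breeze.linalg.{DenseMatrix, DenseVector, diag}

val X = DenseMatrix.rand(5, 3)
val Y = DenseVector.rand(3)
val scaled = X * diag(Y)   // scales column j of X by Y(j)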