Spark Mllib - Frequent Pattern Mining - Association Rules - Not getting the expected results - scala

I've the following dataset:
[A,D]
[C,A,B]
[A]
[A,E,D]
[B,D]
And I am trying to extract some association rules using Frequent Pattern Mining using Spark Mllib. For that I've the following code:
val transactions = sc.textFile("/user/cloudera/teste")
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
val freqItemsets = transactions.repartition(10).map(_.split(",")).flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).filter(_.nonEmpty).map(x => (x.toList, 1L)) ).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)
results.collect().foreach { rule =>
println("[" + rule.antecedent.mkString(",")
+ "=>"
+ rule.consequent.mkString(",") + "]," + rule.confidence)}
But all the rules extracted have confidence equal to 1:
[[C=>A],1.0
[[C=>B]],1.0
[A,B]=>[C],1.0
[E=>D]],1.0
[E=>[A],1.0
[A=>B]],1.0
[A=>[C],1.0
[[C,A=>B]],1.0
[[A=>D]],1.0
[E,D]=>[A],1.0
[[A,E=>D]],1.0
[[C,B]=>A],1.0
[[B=>D]],1.0
[B]=>A],1.0
[B]=>[C],1.0
I really not understanding the issue that I've in my code... Anyone knows what is the error that I have to calculate the confidence?
Many thanks!

Your data set is too tiny. The maximum frequency of any item in your data is 3. So you can have confidences 0, 1/3, 1/2, 2/3, 1. Only 1 is larger than 0.8.
Try setting minimum confidence to 0.6, then you can actually get
[A]=>[D] confidence 0.666

Related

Getting a negative value for global clustering coefficient in Spark GraphX

I'm working on a large graph in GraphX, and I want to calculate the global clustering coefficient. I'm using a function from the book Spark GraphX in Action, which is :
def clusteringCoefficient[VD:ClassTag,ED:ClassTag](g:Graph[VD,ED]) = {
val numTriplets = g.aggregateMessages[Set[VertexId]](
et => { et.sendToSrc(Set(et.dstId));
et.sendToDst(Set(et.srcId)) },
(a,b) => a ++ b) // #A
.map(x => {val s = (x._2 - x._1).size; s*(s-1) / 2})
.reduce((a,b) => a + b)
println(numTriplets)
if (numTriplets == 0) 0.0 else
g.triangleCount.vertices.map(_._2).reduce(_ + _) /
numTriplets.toFloat
}
I canonicalise the graph and partition it before running the algorithm, but for some graphs I get a negative clustering coefficient which is impossible. I put the print statement in the function just for debugging and for those graphs I get a negative number for numTriplets.
I'm not very experienced with scala so I can't see if there is a bug in the implementation.
Any help would be appreciated!

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
#param zeroValue the initial value for the accumulated result of each
partition for the op operator, and also the initial value for the
combine results from different
partitions for the op operator - this will typically be the neutral
element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
zeroValue is added once for each partition and should a neutral element - in case of + it should be 0. The exact result will depend on the number of partitions but it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and a whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus 5 for jobResult.
You know that Spark RDD's perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark that it needs to support 3 partitions in this RDD and that will enable it to run computations using 3 independent executors in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells spark to fold all those partitions using 5 as initial value and then fold all these partition results from 3 executors again with 5 as initial value.
A normal Scala equivalent is can be written as,
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))
val fold = listOfFolds.fold(5)(_ + _)
So... if you are using fold on RDD's you need to provide a zero value.
But then you will ask - why or when someone will use fold instead of reduce?
Your confusion lies in you perception of zero value. The thing is that this zero value for RDD[T] does not entirely depend on our type T but also on the nature of computation. So your zero value does not need to be 0.
Lets consider a simple example where we want to calculate "largest number greater than 15" or "15" in our RDD,
Can we do that using reduce? The answer is NO. But we can do it using fold.
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })

Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I work on ex1data1.txt and ex1data2.txt from coursera practical part ex1. I've made such translation into Julia lang (it went smoothly) and now I've been struggling with Spark...without success.
Problem: Performance of my implementation on Spark is very poor. I cannot even say it works correctly. That's why for ex1data1.txt I added polynomial feature, and I also worked with: theta0 using setIntercept(true) and with extra non-normalized column of 1.0 values(in this case I set Intercept to false). I receive only silly results.
So, then I 've decided to start working with ex1data2.txt. Below you can find the code and the expected result. Of course Spark result is wrong.
Did you have similar experience? I will be grateful for your help.
The Scala code for the exd1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
val conf = new SparkConf()
conf.set("spark.app.name", "coursera ex1data2.txt test")
val sc = new SparkContext(conf)
val input = sc.textFile("hdfs:///ex1data2.txt")
val trainData = input.map { line =>
val parts = line.split(',')
val y = parts(2).toDouble
val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
println(s"x = $features y = $y")
LabeledPoint(y, features)
}.cache()
// Building the model
val numIterations = 1500
val alpha = 0.01
// Scale the features
val scaler = new StandardScaler(withMean = true, withStd = true)
.fit(trainData.map(x => x.features))
val scaledTrainData = trainData.map{ td =>
val normFeatures = scaler.transform(td.features)
println(s"normalized features = $normFeatures")
LabeledPoint(td.label, normFeatures)
}.cache()
val tsize = scaledTrainData.count()
println(s"Training set size is $tsize")
val alg = new LinearRegressionWithSGD().setIntercept(true)
alg.optimizer
.setNumIterations(numIterations)
.setStepSize(alpha)
.setUpdater(new SquaredL2Updater)
.setRegParam(0.0) //regularization - off
val model = alg.run(scaledTrainData)
println(s"Theta is $model.weights")
val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338
// Evaluate model on training examples and compute training error
val valuesAndPreds = scaledTrainData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = ((valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()) / 2)
println("Training Mean Squared Error = " + MSE)
// Save and load model
val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
.flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
.getOrElse(-1)
println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],
intercept=181295.93717032953).weights
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754
dollars
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1
Well, after some digging I believe there is nothing here. First I saved content of the valuesAndPreds to text file:
valuesAndPreds.map{
case {x, y} => s"$x,$y"}.repartition(1).saveAsTextFile("results.txt")'
Rest of the code is written in R.
First lets create a model using closed form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-130582 -43636 -10829 43698 198147
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 340413 9637 35.323 < 2e-16 ***
V1 110631 11758 9.409 4.22e-12 ***
V2 -6650 11758 -0.566 0.575
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
closedFormPrediction, df$V3,
ylab="Actual", xlab="Predicted",
main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
'
Now we can compare above to SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean(sgd$V2 - sgd$V1)**2)
plot(
sgd$V2, sgd$V1, ylab="Actual",
xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally lets compare both:
plot(
sgd$V2, closedFormPrediction,
xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So, result are clearly not perfect but nothing seems to be completely off here.

Iterative algorithms with Spark streaming

So I understand that Spark can perform iterative algorithms on single RDDs for example Logistic regression.
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
val gradient = points.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
The above example is iterative because it maintains a global state w that is updated after each iteration and its updated value is used in the next iteration. Is this functionality possible in Spark streaming? Consider the same example, except now points is a DStream. In this case, you could create a new DStream that calculates the gradient with
val gradient = points.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
But how would you handle the global state w. It seems like w would have to be a DStream too (using updateStateByKey maybe), but then its latest value would somehow need to be passed into the points map function which I don't think is possible. I don't think DStreams can communicate in this way. Am I correct, or is it possible to have iterative computations like this in Spark Streaming?
I just found out that this is quite straightforward with the foreachRDD function. MLlib actually provides models that you can train with DStreams and I found the answer in the streamingLinearAlgorithm code. It looks like you can just keep your global update variable locally in the driver and update it within the .foreachRDD so there is actually no need to transform it into a DStream itself. So you can apply this to the example I provided with something like
points.foreachRDD{(rdd,time) =>
val gradient=rdd.map(p=> (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
)).reduce(_ + _)
w -= gradient
}
Hmm... you can achieve something by parallelizing your iterator and then folding on it to update your gradient.
Also... I think you should keep Spark Streaming out of it as this problem does not look like having any feature which links it to any kind Streaming requirements.
// So, assuming... points is somehow a RDD[ Point ]
val points = sc.textFile(...).map(parsePoint).cache()
var w = Vector.random(D)
// since fold is ( T )( ( T, T) => T ) => T
val temps = sc.parallelize( 1 to ITERATIONS ).map( w )
// now fold over temps.
val gradient = temps.fold( w )( ( acc, v ) => {
val gradient = points.map( p =>
(1 / (1 + exp(-p.y*(acc dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
acc - gradient
}

How can I create a TF-IDF for Text Classification using Spark?

I have a CSV file with the following format :
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is a integer and the product_titleX is a String, example :
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark for Scala so far and using the tutorials I have found on the official page and the Berkley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it in tuples RDD[Array[String]]
val tuples = file.map(line => line.split(",")).cache
and after I'm transforming the tuples into pairs RDD[(Int, String)]
val pairs = tuples.(line => (line(0),line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter
def wc_per_row(row):
cnt = Counter()
for word in row:
cnt[word] += 1
return cnt.items()
tf = corpus.map(lambda (x, y): (x, wc_per_row(y)))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda (x, y): (x, len(y)))
num_documnents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
import math.log10
idf = df.map(lambda (k, v): (k, 1. + log10(num_documents/v))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
(k2, v2) in idf_tuples if k1 == k2]
tfidf = tf.map(lambda (k, v): (k, calc_tfidf(v, idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each uniq token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.