Spark PageRank Tuning - Scala

I am running PageRank with Scala and Spark 2.4 on YARN, but it always fails after running for several hours/jobs.
--driver-memory 60G --driver-cores 4
--num-executors 250 --executor-cores 2 --executor-memory 32g
input data:
weightFile has 1000 .gz files, each 500MB, 500GB total
linkFile has 1000 .gz files, each 500MB, 500GB total
How should I change my code or spark configs?
sc.setCheckpointDir(checkpointFile)
val weightData = sc.textFile(weightFile).repartition(20000)
val weightUrlData = weightData.map{line => val lines = line.split("\t"); (hash(lines(0)) , lines(0), lines(1).toFloat)}
weightUrlData.persist(StorageLevel.DISK_ONLY)
var dataWeight = weightUrlData.map{x => (x._1,x._3)}
dataWeight = dataWeight.reduceByKey{(a,b) => if(a > b) a else b}
val dataUrl = weightUrlData.map{x => (x._1,x._2)}
val totalZ = dataWeight.count.toFloat
val sum1 = dataWeight.map(x => x._2).sum().toFloat
dataWeight = dataWeight.map{x => (x._1,x._2/sum1)}
val linkData = sc.textFile(linkFile).repartition(20000)
val links = linkData.map{line => val lines = line.split("\t");(hash(lines(0)),(hash(lines(1)),lines(2).toFloat))}.groupByKey()
links.persist(StorageLevel.DISK_ONLY)
links.count()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
val contribs = links.join(ranks).values.flatMap{ case (urls, rank) =>
urls.map(url => (url._1, url._2*rank))
}
contribs.persist(StorageLevel.DISK_ONLY)
contribs.count()
val ranksTmp = contribs.reduceByKey(_ + _).mapValues(0.85 * _)
val Zranks = ranksTmp.map(x => x._2).sum()
val Z = totalZ - Zranks
println("total Z: " + totalZ + " Z: " + Z)
val randnZ = dataWeight.map{x => (x._1,x._2*Z)}
val rankResult = ranksTmp.rightOuterJoin(randnZ).map{case(a,(b,c)) => (a,b.getOrElse(0.0) + c) }
ranks = ranks.join(rankResult).map{case(a,(b,c)) => (a,c)}
if(i % 2 == 0) {
ranks.persist(StorageLevel.MEMORY_AND_DISK)
ranks.checkpoint()
ranks.count()
}else{
ranks.count()
}
if(i == iters) {
rankResult.map{case(a,b) => a.toString + "\t" + b.toString}.saveAsTextFile(outputFile)
dataUrl.join(rankResult).values.map{case (a,b) => a + "\t" + b.toString}.saveAsTextFile(outputFile + "UrlAndWeight")
}

It is really hard to guess why your code is not working properly just by looking at it. A few years ago I implemented a PageRank for ranking users in a social graph and it worked without a hitch for me - link. Maybe it'd be helpful for you. Spark's Pregel-based PageRank in GraphX runs until convergence, or you can set a fixed number of iterations.
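For reference, a minimal sketch of the GraphX route (the edge-list path, output path, and iteration count are placeholders; GraphLoader assumes an unweighted "srcId dstId" edge list, so weighted edges would need Graph.apply over your parsed link RDD instead):
import org.apache.spark.graphx.GraphLoader

// Hypothetical whitespace-separated edge list; adjust to your data layout.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges")

// Fixed number of iterations...
val staticRanks = graph.staticPageRank(20).vertices

// ...or iterate until the per-vertex rank change drops below a tolerance.
val convergedRanks = graph.pageRank(0.0001).vertices

convergedRanks.saveAsTextFile("hdfs:///path/to/ranksOutput")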

Related

Spark iterative Kmeans not getting expected results?

I am writing a naive implementation of Kmeans in Spark for my homework:
import breeze.linalg.{ Vector, DenseVector, squaredDistance }
import scala.math
def parse(line: String): Vector[Double] = {
DenseVector(line.split(' ').map(_.toDouble))
}
def closest_assign(p: Vector[Double], centres: Array[Vector[Double]]): Int = {
var bestIndex = 1
var closest = Double.PositiveInfinity
for (i <- 0 until centres.length) {
val tempDist = squaredDistance(p, centres(i))
if (tempDist < closest) {
closest = tempDist
bestIndex = i
}
}
bestIndex
}
val fileroot:String="/FileStore/tables/"
val file=sc.textFile(fileroot+"data.txt")
.map(parse _)
.cache()
val c1=sc.textFile(fileroot+"c1.txt")
.map(parse _)
.collect()
val c2=sc.textFile(fileroot+"c2.txt")
.map(parse _)
.collect()
val K=10
val MAX_ITER=20
var kPoints=c2
for(i<-0 until MAX_ITER){
val closest = file.map(p => (closest_assign(p, kPoints), (p, 1)))
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
val newPoints = pointStats.map { pair =>
(pair._1, pair._2._1 * (1.0 / pair._2._2))
}.collectAsMap()
for (newP <- newPoints) {
kPoints(newP._1) = newP._2
}
val tempDist = closest
.map { x => squaredDistance(x._2._1, newPoints(x._1)) }
.fold(0) { _ + _ }
println(i+" time finished iteration (cost = " + tempDist + ")")
}
In theory tempDist should become smaller and smaller as the program runs, but in reality it goes the other way around. I also found that c1 and c2 change value after the for (i <- 0 until MAX_ITER) loop, even though c1 and c2 are vals! Is the way I load c1 and c2 wrong? c1 and c2 are two different sets of initial clusters for the data.
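Note on the last point: collect() returns a mutable Array, and var kPoints = c2 copies only the reference, so kPoints(newP._1) = newP._2 writes into the very array that c2 refers to; val prevents reassignment, not mutation. A minimal sketch of the aliasing, with a defensive copy that keeps the original intact:
val centres = Array(1.0, 2.0, 3.0)   // stands in for the collected c2
val kPoints = centres                // same underlying array, not a copy
kPoints(0) = 99.0
println(centres(0))                  // 99.0 -- centres "changed" although it is a val

val kPointsCopy = centres.clone()    // independent copy; mutating it leaves centres alone
kPointsCopy(1) = 42.0
println(centres(1))                  // still 2.0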

Set similarity join using Spark

I have two text files; each line is in the form (id, sequence of numbers).
I have a threshold value as well.
File 1 looks like the one below, where, in the first line, 0 is the id and the rest is
a sequence of numbers.
0 1 4 5 6
1 2 3 6
2 4 5 6
Similarly, I have file 2 with the following contents.
0 1 4 6
1 2 5 6
2 3 5
I have to find those lines which have a similarity value greater than or equal to the threshold. The similarity value is calculated as the size of the intersection of two lines divided by the size of their union. For example, line id 0 of file1 has the sequence 1,4,5,6 and line id 0 of file2 has the sequence 1,4,6. Their intersection size is 3 and their union size is 4, so their similarity is 3/4 = 0.75, which is greater than the threshold.
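For reference, the same measure can be written directly on Scala Sets (a small illustration of the formula only, not the distributed join):
val a = Set(1, 4, 5, 6)   // line id 0 of file1
val b = Set(1, 4, 6)      // line id 0 of file2
val similarity = a.intersect(b).size.toDouble / a.union(b).size
println(similarity)       // 0.75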
I have written Python code to do this task and am trying to convert it to Scala.
with open("file1.txt") as f1:
content1 = f1.readlines()
content1 = [x.strip() for x in content1]
with open("file2.txt") as f2:
content2 = f2.readlines()
content2 = [x.strip() for x in content2]
threshold = 0.5
final_index_list_with_similarity = []
for i in range(len(content1)):
for j in range(len(content2)):
index_content1 = content1[i][0]
index_content2 = content2[j][0]
s = set(content1[i][1:])
t = set(content2[j][1:])
intersect = s & t
intersect_size = len(intersect) - 1
union_size = len(s) + len(t) - intersect_size - 2 # subtracting two because I am getting two extra spaces
similarity = intersect_size/union_size
if similarity >= threshold:
final_index_list_with_similarity.append((index_content1, index_content2, similarity))
print(final_index_list_with_similarity)
Output : [('0', '0', 0.75), ('1', '1', 0.5), ('2', '0', 0.5), ('2', '1', 0.5)]
What I have tried so far in Scala looks something like this.
val inputFile1 = args(0)
val inputFile2 = args(1)
val threshold = args(2).toDouble
val ouputFolder = args(3)
val conf = new SparkConf().setAppName("SetSimJoin").setMaster("local")
val sc = new SparkContext(conf)
val lines1 = sc.textFile(inputFile1).flatMap(line => line.split("\n"))
val lines2 = sc.textFile(inputFile2).flatMap(line => line.split("\n"))
val elements1 = lines1.map { x => x.drop(x.split(" ")(0).length.toInt + 1) }.flatMap { x => x.split(" ") }.map { x => (x, 1) }.reduceByKey(_+_)
val elements2 = lines2.map { x => x.drop(x.split(" ")(0).length.toInt + 1) }.flatMap { x => x.split(" ") }.map { x => (x, 1) }.reduceByKey(_+_)
This gives me the frequency of every number in the entire file.
Any help or guidance will be much appreciated.
Both files can be loaded as RDDs and joined by id, and then the formula "intersection size / union size" applied:
val lines1 = sparkContext.textFile("inputFile1")
val lines2 = sparkContext.textFile("inputFile2")
val rdd1 = lines1.map(_.split(" ")).map(v => (v(0), v.tail))
val rdd2 = lines2.map(_.split(" ")).map(v => (v(0), v.tail))
val result = rdd1.join(rdd2).map(r => (
r._1,
r._2._1.intersect(r._2._2).size * 1.0 /
r._2._1.union(r._2._2).distinct.size
)
)
result.foreach(println)
Output is:
(1,0.5)
(0,0.75)
(2,0.25)
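To also enforce the threshold from the question, a filter can be appended to the result above (a minimal addition; threshold is the value parsed from args(2)):
val aboveThreshold = result.filter { case (_, similarity) => similarity >= threshold }
aboveThreshold.foreach(println)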

Counting by range

The following script can be used to count by key:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(nbr => (nbr, 1))
val nbrCountsWithReduce = nbrPairsRDD
.reduceByKey(_ + _)
.collect()
nbrCountsWithReduce.foreach(println)
It returns:
(1,1)
(2,2)
(3,3)
(4,4)
How could it be modified to count by range rather than by absolute value, and give the following output if we had two ranges, 1:2 and 3:4:
(1:2,3)
(3:4,7)
One option is to convert the values to Double and use the histogram function:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(_.toDouble).histogram(2)
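Note that histogram(2) returns a pair of arrays (bucket boundaries and counts) rather than labelled range keys; for this list it should come out as something like:
val (buckets, counts) = sc.parallelize(nbr).map(_.toDouble).histogram(2)
// buckets: Array(1.0, 2.5, 4.0)   counts: Array(3, 7)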
One easy way that I can think of is to map the keys to individual ranges, for example:
val nbrRangePairs = sc.parallelize(nbr)
.map(nbr => (computeRange(nbr), 1))
.reduceByKey(_ + _)
.collect()
// function to compute ranges
def computeRange(num: Int): String = {
  if (num < 3)
    "1:2"
  else if (num < 5)
    "3:4"
  else
    "invalid"
}
Here is the code snippet to compute aggregations by range:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrs = sc.parallelize(nbr)
var lb = 1
var incr = 1
var ub = lb + incr
val nbrsMap = nbrs.map(rec => {
if(rec > ub) {
lb = rec
ub = lb + incr
}
(lb.toString + ":" + ub.toString, 1)
})
nbrsMap.reduceByKey((acc, value) => acc + value).foreach(println)
(1:2,3)
(3:4,7)
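One caveat with the snippet above: lb and ub are driver-side vars captured into the closure, so each task mutates its own copy and the result depends on the data arriving in sorted order. Deriving the bucket purely from the value avoids that; a minimal sketch assuming fixed-width buckets of size 2:
val width = 2
sc.parallelize(nbr)
  .map { n =>
    val lo = ((n - 1) / width) * width + 1   // 1,2 -> 1; 3,4 -> 3
    (s"$lo:${lo + width - 1}", 1)
  }
  .reduceByKey(_ + _)
  .foreach(println)
// (1:2,3)
// (3:4,7)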

Spark Logistic regression and metrics

I want to run logistic regression 100 times with random splitting into test and training sets. I then want to save the performance metrics of the individual runs and later use them to gain insight into the performance.
for (index <- 1 to 100) {
val splits = training_data.randomSplit(Array(0.90, 0.10), seed = index)
val training = splits(0).cache()
val test = splits(1)
val logrmodel = train_LogisticRegression_model(training)
performLogisticRegressionRuns(logrmodel, test, index)
}
spark.stop()
}
def performLogisticRegressionRuns(model: LogisticRegressionModel, test: RDD[LabeledPoint], iterationcount: Int) {
val sb = StringBuilder.newBuilder
// Compute raw scores on the test set.
model.clearThreshold()
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
}
val bcmetrics = new BinaryClassificationMetrics(predictionAndLabels)
// I am showing two sample metrics, but I am collecting more including recall, area under roc, f1 score etc....
val precision = bcmetrics.precisionByThreshold()
precision.foreach { case (t, p) =>
// If threshold is 0.5 as what we want, then get the precision and append it to the string. Idea is if score is <0.5 class 0, else class 1.
if (t == 0.5) {
println(s"Threshold is: $t, Precision is: $p")
sb ++= p.toString() + "\t"
}
}
val auROC = bcmetrics.areaUnderROC
sb ++= iterationcount.toString + "\t" + auROC.toString + "\t"
I want to save the performance results of each iteration in a separate file. I tried this, but it does not work; any help with this would be great.
val data = spark.parallelize(sb)
val filename = "logreg-metrics" + iterationcount.toString() + ".txt"
data.saveAsTextFile(filename)
}
I was able to resolve this. I did the following: I wrapped the string in a List.
val data = spark.parallelize(List(sb))
val filename = "logreg-metrics" + iterationcount.toString() + ".txt"
data.saveAsTextFile(filename)
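A small refinement, in case it is useful (assuming spark here is the SparkContext): passing sb.toString explicitly and asking for a single partition keeps each run's metrics in one part file. Note that saveAsTextFile treats the name as an output directory:
val data = spark.parallelize(List(sb.toString), 1)   // one partition -> one part file
val filename = "logreg-metrics" + iterationcount.toString + ".txt"
data.saveAsTextFile(filename)   // creates a directory with that name containing the part file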

Strange results when using Scala collections

I have some tests with results that I can't quite explain.
The first test does a filter, map and reduce on a list containing 4 elements:
{
val counter = new AtomicInteger(0)
val l = List(1, 2, 3, 4)
val filtered = l.filter{ i =>
counter.incrementAndGet()
true
}
val mapped = filtered.map{ i =>
counter.incrementAndGet()
i*2
}
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(11 == counter.get)
}
The counter is incremented 11 times as I expected: once for each element during filtering, once for each element during mapping and three times to add up the 4 elements.
Using wildcards the result changes:
{
val counter = new AtomicInteger(0)
val l = List(1, 2, 3, 4)
val filtered = l.filter{
counter.incrementAndGet()
_ > 0
}
val mapped = filtered.map{
counter.incrementAndGet()
_*2
}
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(5 == counter.get)
}
I can't work out how to use wildcards in the reduce (the code doesn't compile), but now the counter is only incremented 5 times!
So, question #1: Why do wildcards change the number of times the counter is called and how does that even work?
Then my second, related question. My understanding of views was that they would lazily execute the functions passed to the monadic methods, but the following code doesn't show that.
{
val counter = new AtomicInteger(0)
val l = Seq(1, 2, 3, 4).view
val filtered = l.filter{
counter.incrementAndGet()
_ > 0
}
println("after filter: " + counter.get)
val mapped = filtered.map{
counter.incrementAndGet()
_*2
}
println("after map: " + counter.get)
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("after reduce: " + counter.get)
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(5 == counter.get)
}
The output is:
after filter: 1
after map: 2
after reduce: 5
counted 5 and result is 20
Question #2: How come the functions are being executed immediately?
I'm using Scala 2.10
You're probably thinking that
filter {
println
_ > 0
}
means
filter{ i =>
println
i > 0
}
but Scala has other ideas. The reason is that
{ println; _ > 0 }
is a statement that first prints something, and then returns the > 0 function. So it interprets what you're doing as a funny way to specify the function, equivalent to:
val p = { println; (i: Int) => i > 0 }
filter(p)
which in turn is equivalent to
println
val temp = (i: Int) => i > 0 // Temporary name, forget we did this!
val p = temp
filter(p)
which as you can imagine doesn't quite work out the way you want--you only print (or in your case do the increment) once at the beginning. Both your problems stem from this.
Make sure if you're using underscores to mean "fill in the parameter" that you only have a single expression! If you're using multiple statements, it's best to stick to explicitly named parameters.
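A quick way to see the difference is to compare the block form with an explicitly named parameter (a small sketch using println as the side effect):
val xs = List(1, 2, 3, 4)

// The block is evaluated once to *produce* the predicate; this prints a single line.
xs.filter { println("building predicate"); _ > 0 }

// With an explicit parameter the whole body is the function; this prints four lines.
xs.filter { i => println("checking " + i); i > 0 }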