Spark Stanford parser out of memory - Scala

I'm using StanfordCoreNLP 2.4.1 on Spark 1.5 to parse Chinese sentences, but I ran into a Java heap out-of-memory exception. The code is as follows:
val modelpath = "edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz"
val lp = LexicalizedParser.loadModel(modelpath)

val dataWords = data.map(x => {
  val tokens = x.split("\t")
  val id = tokens(0)
  val word_seg = tokens(2)
  val comm_words = word_seg.split("\1").filter(_.split(":").length == 2).map(y => (y.split(":")(0), y.split(":")(1)))
  (id, comm_words)
}).filter(_._2.nonEmpty)

val dataSenSlice = dataWords.map(x => {
  val id = x._1
  val comm_words = x._2
  val punctuationIndex = Array(0) ++ comm_words.zipWithIndex.filter(_._1._2 == "34").map(_._2) ++ Array(comm_words.length - 1)
  val senIndex = (punctuationIndex zip punctuationIndex.tail).filter(z => z._1 != z._2)
  val senSlice = senIndex.map(z => {
    val begin = if (z._1 > 0) z._1 + 1 else z._1
    val end = if (z._2 == comm_words.length - 1) z._2 + 1 else z._2
    if (comm_words.slice(begin, end).filter(_._2 != "34").nonEmpty) {
      val sen = comm_words.slice(begin, end).filter(_._2 != "34").map(_._1).mkString(" ").trim
      sen
    } else ""
  }).filter(l => l.nonEmpty && l.length < 20)
  (id, senSlice)
}).filter(_._2.nonEmpty)

val dataPoint = dataSenSlice.map(x => {
  val id = x._1
  val senSlice = x._2
  val senParse = senSlice.map(y => {
    StanfordNLPParser.senParse(lp, y) // Java code wrapping the sentence parser
  })
  id + "\t" + senParse.mkString("\1")
})

dataPoint.saveAsTextFile(PARSED_MERGED_POI)
Each sentence I feed into the parser is a string of segmented words joined by spaces.
The exception I ran into is:
17/08/09 10:28:15 WARN TaskSetManager: Lost task 1062.0 in stage 0.0 (TID 1219, rz-data-hdp-dn15004.rz.******.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.union(Pattern.java:5149)
at java.util.regex.Pattern.clazz(Pattern.java:2513)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.util.regex.Pattern.matches(Pattern.java:1128)
at java.lang.String.matches(String.java:2063)
at edu.stanford.nlp.parser.lexparser.ChineseUnknownWordModel.score(ChineseUnknownWordModel.java:97)
at edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel.score(BaseUnknownWordModel.java:124)
at edu.stanford.nlp.parser.lexparser.ChineseLexicon.score(ChineseLexicon.java:54)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1602)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1634)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635)
I'm wondering whether I'm going about sentence parsing the right way, or whether something else is wrong.

Suggestions:
Increase the number of partitions, e.g.
data.repartition(500)
repartition reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them; note that this always shuffles all data over the network.
Increase executor and driver memory, e.g. by adding spark-submit parameters:
--executor-memory 8G
--driver-memory 4G
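For example, a minimal sketch of applying the first suggestion to the code in the question (the partition count of 500 is illustrative, not a tuned value); the memory flags from the second suggestion go on the spark-submit command line as shown above:

// Repartition before the expensive parsing step so each task parses fewer sentences.
val dataPoint = dataSenSlice
  .repartition(500)
  .map { case (id, senSlice) =>
    val senParse = senSlice.map(y => StanfordNLPParser.senParse(lp, y))
    id + "\t" + senParse.mkString("\1")
  }
dataPoint.saveAsTextFile(PARSED_MERGED_POI)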

Related

Spark PageRank Tuning

I am running PageRank with Scala and Spark 2.4 on YARN, but it always fails after running for several hours/jobs.
--driver-memory 60G --driver-cores 4
--num-executors 250 --executor-cores 2 --executor-memory 32g
Input data:
weightFile has 1000 .gz files, each 500 MB, 500 GB in total
linkFile has 1000 .gz files, each 500 MB, 500 GB in total
How should I change my code or Spark configs?
sc.setCheckpointDir(checkpointFile)

val weightData = sc.textFile(weightFile).repartition(20000)
val weightUrlData = weightData.map { line =>
  val lines = line.split("\t")
  (hash(lines(0)), lines(0), lines(1).toFloat)
}
weightUrlData.persist(StorageLevel.DISK_ONLY)

var dataWeight = weightUrlData.map{x => (x._1, x._3)}
dataWeight = dataWeight.reduceByKey{(a, b) => if (a > b) a else b}
val dataUrl = weightUrlData.map{x => (x._1, x._2)}

val totalZ = dataWeight.count.toFloat
val sum1 = dataWeight.map(x => x._2).sum().toFloat
dataWeight = dataWeight.map{x => (x._1, x._2 / sum1)}

val linkData = sc.textFile(linkFile).repartition(20000)
val links = linkData.map { line =>
  val lines = line.split("\t")
  (hash(lines(0)), (hash(lines(1)), lines(2).toFloat))
}.groupByKey()
links.persist(StorageLevel.DISK_ONLY)
links.count()

var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url._1, url._2 * rank))
  }
  contribs.persist(StorageLevel.DISK_ONLY)
  contribs.count()

  val ranksTmp = contribs.reduceByKey(_ + _).mapValues(0.85 * _)
  val Zranks = ranksTmp.map(x => x._2).sum()
  val Z = totalZ - Zranks
  println("total Z: " + totalZ + " Z: " + Z)

  val randnZ = dataWeight.map{x => (x._1, x._2 * Z)}
  val rankResult = ranksTmp.rightOuterJoin(randnZ).map{case (a, (b, c)) => (a, b.getOrElse(0.0) + c)}
  ranks = ranks.join(rankResult).map{case (a, (b, c)) => (a, c)}

  if (i % 2 == 0) {
    ranks.persist(StorageLevel.MEMORY_AND_DISK)
    ranks.checkpoint()
    ranks.count()
  } else {
    ranks.count()
  }

  if (i == iters) {
    rankResult.map{case (a, b) => a.toString + "\t" + b.toString}.saveAsTextFile(outputFile)
    dataUrl.join(rankResult).values.map{case (a, b) => a + "\t" + b.toString}.saveAsTextFile(outputFile + "UrlAndWeight")
  }
}
It is really hard to guess why your code is not working properly just by looking at it. A few years ago I implemented PageRank for ranking users in a social graph and it worked without a hitch for me - link. Maybe it will be helpful for you. Spark's Pregel interface runs PageRank until convergence, or you may set a fixed number of iterations.
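If you want to try the GraphX route, here is a minimal sketch; note that GraphLoader.edgeListFile expects an unweighted "srcId dstId" edge list, so the file paths and iteration count below are placeholders and adapting this to your weighted linkFile would take extra work:

import org.apache.spark.graphx.GraphLoader

// Build a graph from an edge-list file (one "srcId dstId" pair per line).
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges")

// Either run a fixed number of iterations ...
val staticRanks = graph.staticPageRank(numIter = 20).vertices

// ... or iterate until the rank updates fall below a tolerance.
val convergedRanks = graph.pageRank(tol = 0.0001).vertices

convergedRanks.map { case (id, rank) => s"$id\t$rank" }.saveAsTextFile("hdfs:///path/to/ranks")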

Is there any stable method on the SparkSession/SparkContext/RDD that we can call to easily detect when eviction is happening?

For more context see "Disable new Spark behaviour of evicting cached partitions when insufficient memory" or "When was automatic Spark RDD partition cache eviction implemented?".
You can retrieve the RDDInfo array from SparkContext and interrogate its elements for the partition counts of the RDD you're interested in. If some of the partitions were evicted or didn't fit into executor storage, numCachedPartitions will be less than the RDD's total number of partitions, numPartitions.
scala> val rdd = sc.textFile("file:///etc/spark/conf/spark-defaults.conf").repartition(10)
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at repartition at <console>:27
scala> rdd.persist().count()
res14: Long = 34
scala> val rddStorage = rdd.context.getRDDStorageInfo(0)
rddStorage: org.apache.spark.storage.RDDInfo = RDD "MapPartitionsRDD" (9) StorageLevel: StorageLevel(memory, deserialized, 1 replicas); CachedPartitions: 10; TotalPartitions: 10; MemorySize: 5.1 KB; DiskSize: 0.0 B
scala> val fullyCached = (rddStorage.numCachedPartitions == rddStorage.numPartitions)
fullyCached: Boolean = true
Zero in the above, ...getRDDStorageInfo(0), is used for illustration purposes only. In reality, instead of simply using 0, you'd need to get the id of the RDD you're interested in (see RDD.id) and then iterate through the RDDInfo[] array to find the element with rddInfo.id == id. You can probably also use rddInfo.name for the same purpose if you give the RDD a name.
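A minimal sketch of that by-id lookup (the name "myRdd" is just for illustration):

val myRdd = sc.textFile("file:///etc/spark/conf/spark-defaults.conf").repartition(10).setName("myRdd")
myRdd.persist().count()

// Find the storage info for this specific RDD by its id.
val maybeInfo = sc.getRDDStorageInfo.find(_.id == myRdd.id)
val fullyCached = maybeInfo.exists(info => info.numCachedPartitions == info.numPartitions)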
Finally, you could just detect if any RDD has eviction with something like this:
sparkSession.sparkContext.getRDDStorageInfo
  .filter(_.isCached)
  .find(rdd => rdd.numCachedPartitions < rdd.numPartitions)
  .foreach(rdd =>
    throw new IllegalArgumentException(s"RDD is being evicted, please configure cluster with more memory. " +
      s"numCachedPartitions = ${rdd.numCachedPartitions}, " +
      s"numPartitions = ${rdd.numPartitions}, " +
      s"name = ${rdd.name}, " +
      s"id = ${rdd.id}, " +
      s"memSize = ${rdd.memSize}, " +
      s"diskSize = ${rdd.diskSize}, " +
      s"externalBlockStoreSize = ${rdd.externalBlockStoreSize}"
    ))

SparkError: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

Error:
ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
Goal: obtain recommendations for all users using the model, overlap them with each user's test data, and generate an overlap ratio.
I have built a recommendation model using Spark MLlib. I evaluate the overlap ratio of each user's test data and recommended items and generate the mean overlap ratio.
def overlapRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val testData: RDD[(Int, Iterable[Int])] = test_data.map(r => (r.user, r.product)).groupByKey
  val n = testData.count

  val recommendations: RDD[(Int, Array[Int])] = model.recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))

  val overlaps = testData.join(recommendations).map(x => {
    val moviesPerUserInRecs = x._2._2.toSet
    val moviesPerUserInTest = x._2._1.toSet
    val localHitRatio = moviesPerUserInRecs.intersect(moviesPerUserInTest)
    if (localHitRatio.size > 0) 1 else 0
  }).filter(x => x != 0).count

  var r = 0.0
  if (overlaps != 0)
    r = overlaps.toDouble / n // floating-point division so the ratio isn't truncated by Long division
  return r
}
But the problem is that it ends up throwing the maxResultSize error above. In my Spark configuration I did the following to increase maxResultSize:
val conf = new SparkConf()
conf.set("spark.driver.maxResultSize", "6g")
But that didn't solve the problem. I went almost up to the amount of memory I allocated to the driver, yet the issue wasn't resolved. While the code was executing I kept an eye on my Spark job, and what I saw was a bit puzzling:
[Stage 281:==> (47807 + 100) / 1000000]15/12/01 12:27:03 ERROR TaskSetManager: Total size of serialized results of 47809 tasks (6.0 GB) is bigger than spark.driver.maxResultSize (6.0 GB)
At the above stage the code is executing the MatrixFactorization code in spark-mllib, recommendForAll, around line 277 (not exactly sure of the line number).
private def recommendForAll(
    rank: Int,
    srcFeatures: RDD[(Int, Array[Double])],
    dstFeatures: RDD[(Int, Array[Double])],
    num: Int): RDD[(Int, Array[(Int, Double)])] = {
  val srcBlocks = blockify(rank, srcFeatures)
  val dstBlocks = blockify(rank, dstFeatures)
  val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
    case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
      val m = srcIds.length
      val n = dstIds.length
      val ratings = srcFactors.transpose.multiply(dstFactors)
      val output = new Array[(Int, (Int, Double))](m * n)
      var k = 0
      ratings.foreachActive { (i, j, r) =>
        output(k) = (srcIds(i), (dstIds(j), r))
        k += 1
      }
      output.toSeq
  }
  ratings.topByKey(num)(Ordering.by(_._2))
}
The recommendForAll method gets called from the recommendProductsForUsers method.
But it looks like the method is spinning off 1M tasks. The data that gets fed in comes from 2000 part files, so I am confused about how it started to spit out 1M tasks, and I think that might be the problem.
My question is how I can actually resolve this problem. Without this approach it's really hard to calculate the overlap ratio or recall@K. This is on Spark 1.5 (Cloudera 5.5).
The 2 GB problem is not new to the Spark community: https://issues.apache.org/jira/browse/SPARK-6235
Regarding partition sizes greater than 2 GB: try to repartition your RDD (myRdd.repartition(parallelism)) into a greater number of partitions (relative to your current level of parallelism), thus reducing each single partition's size.
Regarding the number of tasks spun up (and hence partitions created): my hypothesis is that it comes out of the srcBlocks.cartesian(dstBlocks) API call, which produces an output RDD made of (srcBlocks's number of partitions * dstBlocks's number of partitions) partitions.
In this case, you might consider using the myRdd.coalesce(parallelism) API instead of repartition to avoid a shuffle (and the partition serialization problems that come with it).
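As a small sketch of that partition arithmetic (the sizes below are illustrative, not your actual data):

// cartesian yields (partitions of a) * (partitions of b) partitions ...
val a = sc.parallelize(1 to 1000, 100)
val b = sc.parallelize(1 to 1000, 100)
val pairs = a.cartesian(b)
println(pairs.partitions.length) // 100 * 100 = 10000

// ... so coalesce afterwards to cap the task count without triggering a shuffle.
val coalesced = pairs.coalesce(1000)
println(coalesced.partitions.length) // 1000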

How to improve join performance when broadcast variable is used?

I am new to Spark. I have two RDDs, one with a size of 9 GB (400 million lines) (RDD1) and the other of 110 KB (4 million lines) (RDD2). I use RDD2 as a broadcast variable to reduce the shuffle. My code works, but the reduceByKey part is still too slow.
I have been playing around with partition numbers. If I set 10,000 partitions for both RDDs it starts to spill, so I increased it to 20K, 30K and 100K. It stopped spilling, but it is extremely slow. On the other hand, I used
set("spark.akka.frameSize", "1000"), but it did not work out. How can I improve this code?
Here is my code:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[*]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4")
.set("spark.executor.memory","128g")
.set("spark.driver.memory","128g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\.txt",30000)...RDD1
val emp_new = sc.textFile("\\.txt",10000)...RDD2
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)

TaskSchedulerImpl: Initial job has not accepted any resources. (Error in Spark)

I'm trying to run the SparkPi example on my standalone-mode cluster.
package org.apache.spark.examples

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkPi")
      .setMaster("spark://192.168.17.129:7077")
      .set("spark.driver.allowMultipleContexts", "true")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
Note: I made a small change to these lines:
val conf = new SparkConf().setAppName("SparkPi")
  .setMaster("spark://192.168.17.129:7077")
  .set("spark.driver.allowMultipleContexts", "true")
Problem: I'm using spark-shell (the Scala interface) to run this code. When I try it, I receive this error repeatedly:
15/02/09 06:39:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Note: I can see my workers in my master's web UI, and I can also see a new job in the Running Applications section. But the application never finishes and I keep seeing the error.
What is the problem?
Thanks
If you want to run this from the spark shell, start the shell with the argument --master spark://192.168.17.129:7077 and enter the following code:
import scala.math.random
import org.apache.spark._

val slices = 10
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = sc.parallelize(1 until n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Otherwise, compile the code into a jar and run it with spark-submit. But remove setMaster from the code and pass it as the --master argument to the spark-submit script; also remove the allowMultipleContexts setting from the code.
You need only one Spark context.
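For instance, a sketch of that spark-submit invocation (the jar name is a placeholder for whatever your build produces; the trailing 10 is the slices argument read by the program):

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://192.168.17.129:7077 \
  sparkpi-example.jar 10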