I have a file that maps documentId to entities, from which I extract document co-occurrences. The entities RDD looks like this:
//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])]
To extract relationships between entities and their frequency within each document, I use the following code:
import com.google.common.base.Charsets
import com.google.common.hash.Hashing

def hashId(str: String): Long = {
  Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}
val docRelTupleRDD = docEntityTupleRDD
  // flatMap at SampleGraph.scala:62
  .flatMap { case (docId, entities) =>
    val entitiesWithId = entities.map { case (name, _, freq) => (hashId(name), freq) }.toList
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
    relationships
  }
import org.apache.spark.graphx.Edge
import scala.collection.immutable

val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
Each edge stores the relationship frequency per document in a Map. When I try to write the edges to a file:
edges.saveAsTextFile(outputFile + "_edges")
I receive the following errors after some time:
15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;#64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]
My Spark configuration looks like this:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")
I have already increased the executor memory and, after some research online, refactored a previous reduceByKey construct into aggregateByKey. The error stays the same. Can someone help me?
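For comparison, an equivalent reduceByKey formulation over immutable maps would look roughly like this (a sketch of the shape, not my exact previous code):

val edgesViaReduce: RDD[Edge[Map[Int, Int]]] = docRelTupleRDD
  // wrap each (docId, freqProduct) value in a single-entry map
  .mapValues { case (docId, freqProduct) => Map(docId -> freqProduct) }
  // merge the per-document maps for each (first, second) edge key
  .reduceByKey(_ ++ _)
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap) }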
Related
I have tried many times to apply a function that makes some modifications to a Spark DataFrame containing text strings. Below is the corresponding code, but it always gives me this error:
An error occurred while calling o699.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure: Lost task 0.0 in stage 27.0 (TID 29, localhost, executor driver):
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

#!hdfs dfs -rm -r nixon_token*

spark = SparkSession.builder \
    .appName("spark-nltk") \
    .getOrCreate()

data = spark.sparkContext.textFile('1970-Nixon.txt')

def word_tokenize(x):
    import nltk
    return str(nltk.word_tokenize(x))

test_tok = udf(lambda x: word_tokenize(x), StringType())

# df_test is a DataFrame with a string column 'spans', created elsewhere
resultDF = df_test.select("spans", test_tok('spans').alias('text_tokens'))
resultDF.show()
I'm running logistic regression with SGD on a large libsvm file. The file is about 10 GB in size with 40 million training examples.
When I run my Scala code with spark-submit, I notice that Spark spends a lot of time logging lines like this:
18/02/07 04:44:50 INFO HadoopRDD: Input split: file:/ebs2/preprocess/xaa:234881024+33554432
18/02/07 04:44:51 INFO Executor: Finished task 6.0 in stage 1.0 (TID 7). 875 bytes result sent to driver
18/02/07 04:44:51 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 9, localhost, executor driver, partition 8, PROCESS_LOCAL, 7872 bytes)
18/02/07 04:44:51 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 7) in 1025 ms on localhost (executor driver) (7/307)
Why is Spark logging so many 'HadoopRDD: Input split' lines? What is the purpose of that, and how do I go about speeding it up or getting rid of it?
Here is the code:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import scala.compat.Platform._

object test {
  def main(args: Array[String]) {
    val nnodes = 1
    val epochs = 3

    val conf = new SparkConf().setAppName("Test Name")
    val sc = new SparkContext(conf)

    val t0 = currentTime
    // Load training and test data in libsvm format (262165 features, 4 partitions)
    val train = MLUtils.loadLibSVMFile(sc, "/ebs2/preprocess/xaa", 262165, 4)
    val test = MLUtils.loadLibSVMFile(sc, "/ebs2/preprocess/xab", 262165, 4)
    val t1 = currentTime

    println("START")

    val lrAlg = new LogisticRegressionWithSGD()
    lrAlg.optimizer.setMiniBatchFraction(10.0 / 40000000.0)
    lrAlg.optimizer.setNumIterations(12000000)
    lrAlg.optimizer.setStepSize(0.01)

    val model = lrAlg.run(train)
    model.clearThreshold()

    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }
}
I fixed the speed issues by running
train = train.coalesce(1)
train.cache()
and by increasing the memory to a total of 64 gigs. Previously Spark might not have been caching properly due to not enough RAM.
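For context, a minimal sketch of where those calls slot into the loading code above (same file path; since train is a val there, the coalesce and cache go right after the load):

val train = MLUtils
  .loadLibSVMFile(sc, "/ebs2/preprocess/xaa", 262165, 4)
  .coalesce(1)   // collapse the many input splits into a single partition
  .cache()       // keep the training set in memory across SGD iterations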
I am trying to do a use case with Kafka and Spark using Scala. I built a consumer and a producer using the Kafka libraries, and now I am building the data processor to count words with Spark. Here is my build.sbt:
name := """scala-akka-stream-kafka"""

version := "1.0"

// scalaVersion := "2.12.4"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.kafka" %% "kafka" % "0.10.2.0",
  "org.apache.kafka" % "kafka-streams" % "0.10.2.0",
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0")

dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5",
  "com.fasterxml.jackson.core" % "jackson-module-scala" % "2.6.5")

resolvers ++= Seq(
  "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
)

resolvers += Resolver.sonatypeRepo("releases")
My word-count data processor fails with an error on the line val wordMap = words.map(word => (word, 1)):
package com.spark.streams

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}

import scala.collection.mutable

object WordCountSparkStream extends App {

  val kafkaParam = new mutable.HashMap[String, String]()
  kafkaParam.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  kafkaParam.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaParam.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaParam.put(ConsumerConfig.GROUP_ID_CONFIG, "group1")
  kafkaParam.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
  kafkaParam.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")

  val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountSparkStream")

  // Read messages in batches of 5 seconds
  val sparkStreamingContext = new StreamingContext(conf, Durations.seconds(5))

  // Configure Spark to listen for messages in the topic "streams-plaintext-input"
  val topicList = List("streams-plaintext-input")

  // Read the value of each message from Kafka and return it
  val messageStream = KafkaUtils.createDirectStream(sparkStreamingContext,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicList, kafkaParam))

  val lines = messageStream.map(consumerRecord => consumerRecord.value().asInstanceOf[String])

  // Break every message into words and return the list of words
  val words = lines.flatMap(_.split(" "))

  // Take every word and return a tuple (word, 1)
  val wordMap = words.map(word => (word, 1))

  // Count occurrences of each word
  val wordCount = wordMap.reduceByKey((first, second) => first + second)

  // Print the word counts
  wordCount.print()

  sparkStreamingContext.start()
  sparkStreamingContext.awaitTermination()

  // "streams-wordcount-output"
}
But this is not a compilation error, and not even a library conflict. It says it cannot deserialize, yet I am using the String deserializer, which is exactly what my producer generates.
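For reference, the producer is configured with string serializers on both key and value, roughly like this (a hypothetical sketch; the real producer lives in a separate module):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

// send plain-text lines to the topic the streaming job subscribes to
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("streams-plaintext-input", "hello spark streaming"))
producer.close()

The streaming job still fails at runtime with the following executor error: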
17/12/12 17:02:50 INFO DAGScheduler: Submitting 8 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountSparkStream.scala:37) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7))
17/12/12 17:02:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
17/12/12 17:02:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4710 bytes)
17/12/12 17:02:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4710 bytes)
17/12/12 17:02:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/12/12 17:02:50 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/12/12 17:02:50 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ClassNotFoundException: scala.None$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1863)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2037)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2282)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2206)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:309)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Try this:
fork := true
It works for me, but I don't know why.
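In build.sbt that is simply (a minimal sketch):

// run the application in a forked JVM with its own classpath,
// instead of inside sbt's own JVM
fork := true

Presumably the forked JVM gets a clean classpath rather than sbt's layered classloaders, which Spark's Java serializer seems to trip over, but that is only a guess.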
Error:
ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
Goal: obtain recommendations for all users from the model, overlap them with each user's test data, and generate an overlap ratio.
I have built a recommendation model using Spark MLlib. I evaluate the overlap ratio between each user's test data and that user's recommended items, and generate a mean overlap ratio.
def overlapRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val testData: RDD[(Int, Iterable[Int])] = test_data.map(r => (r.user, r.product)).groupByKey
  val n = testData.count

  val recommendations: RDD[(Int, Array[Int])] = model.recommendProductsForUsers(20)
    .mapValues(_.map(r => r.product))

  // 1 if the user's recommendations overlap with their test items, 0 otherwise
  val overlaps = testData.join(recommendations).map(x => {
    val moviesPerUserInRecs = x._2._2.toSet
    val moviesPerUserInTest = x._2._1.toSet
    val localHitRatio = moviesPerUserInRecs.intersect(moviesPerUserInTest)
    if (localHitRatio.size > 0)
      1
    else
      0
  }).filter(x => x != 0).count

  var r = 0.0
  if (overlaps != 0)
    r = overlaps.toDouble / n  // use floating-point division, not integer division
  return r
}
But the problem is that it ends up throwing the maxResultSize error above. In my Spark configuration I did the following to increase maxResultSize:
val conf = new SparkConf()
conf.set("spark.driver.maxResultSize", "6g")
But that didn't solve the problem. I went almost up to the amount of memory I allocate to the driver, yet the issue didn't get resolved. While the code was executing I kept an eye on my Spark job, and what I saw was a bit puzzling:
[Stage 281:==> (47807 + 100) / 1000000]15/12/01 12:27:03 ERROR TaskSetManager: Total size of serialized results of 47809 tasks (6.0 GB) is bigger than spark.driver.maxResultSize (6.0 GB)
At the above stage, the code is executing the MatrixFactorization code in spark-mllib, recommendForAll, around line 277 (not exactly sure of the line number):
private def recommendForAll(
    rank: Int,
    srcFeatures: RDD[(Int, Array[Double])],
    dstFeatures: RDD[(Int, Array[Double])],
    num: Int): RDD[(Int, Array[(Int, Double)])] = {
  val srcBlocks = blockify(rank, srcFeatures)
  val dstBlocks = blockify(rank, dstFeatures)
  val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
    case ((srcIds, srcFactors), (dstIds, dstFactors)) =>
      val m = srcIds.length
      val n = dstIds.length
      val ratings = srcFactors.transpose.multiply(dstFactors)
      val output = new Array[(Int, (Int, Double))](m * n)
      var k = 0
      ratings.foreachActive { (i, j, r) =>
        output(k) = (srcIds(i), (dstIds(j), r))
        k += 1
      }
      output.toSeq
  }
  ratings.topByKey(num)(Ordering.by(_._2))
}
The recommendForAll method gets called from the recommendProductsForUsers method.
But it looks like the method is spinning off 1M tasks. The data fed in comes from 2000 part files, so I am confused about how it ends up spitting out 1M tasks, and I think that might be the problem.
My question is: how can I actually resolve this problem? Without this approach it is really hard to calculate the overlap ratio or recall@K. This is on Spark 1.5 (Cloudera 5.5).
The 2 GB problem is not new to the Spark community: https://issues.apache.org/jira/browse/SPARK-6235
Regarding the partition size greater than 2 GB: try to repartition (myRdd.repartition(parallelism)) your RDD into a greater number of partitions (relative to your current level of parallelism), thereby reducing each single partition's size.
Regarding the number of tasks spun up (and hence partitions created): my hypothesis is that it comes out of the srcBlocks.cartesian(dstBlocks) call, which produces an output RDD made of (srcBlocks's number of partitions * dstBlocks's number of partitions) partitions.
In that case, you might consider leveraging the myRdd.coalesce(parallelism) API instead of repartition to avoid a shuffle (and partition-serialization-related problems).
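A minimal sketch of the two options (the RDD and the parallelism level are placeholders; substitute whichever RDD actually carries the oversized partitions, e.g. the factor RDDs feeding the cartesian above):

import org.apache.spark.rdd.RDD

// Full shuffle: spreads the data evenly across `parallelism` partitions.
def withMorePartitions(features: RDD[(Int, Array[Double])], parallelism: Int): RDD[(Int, Array[Double])] =
  features.repartition(parallelism)

// No shuffle: only merges existing partitions, so it can only reduce their number.
def withFewerPartitions(features: RDD[(Int, Array[Double])], parallelism: Int): RDD[(Int, Array[Double])] =
  features.coalesce(parallelism)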
I am new to Spark, so forgive me for asking a basic question. I'm trying to import my TSV file into Spark, but I'm not sure if it's working.
val lines = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val split_lines = lines.map(_.split("\t"))
split_lines.first()
Everything seems to be working fine. Is there a way I can check that the TSV file has loaded properly?
SAMPLE DATA (tab-separated; shown here with spaces):
hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda
So far, your code looks good to me. If you print that first line to the console, do you see the expected data?
To explore the Spark API, the best thing to do is to use the Spark shell, a Scala REPL enriched with Spark specifics that builds a default SparkContext for you.
It will let you explore the data much more easily.
Here's an example loading a CSV file of ~65k lines. Similar use case to what you're doing, I guess:
$><spark_dir>/bin/spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT
/_/
scala> val lines=sc.textFile("/home/user/playground/ts-data.csv")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> val csv=lines.map(_.split(";"))
csv: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
scala> csv.count
(... spark processing ...)
res0: Long = 67538
// let's have a look at the first record
scala> csv.first
14/06/01 12:22:17 INFO SparkContext: Starting job: first at <console>:17
14/06/01 12:22:17 INFO DAGScheduler: Got job 1 (first at <console>:17) with 1 output partitions (allowLocal=true)
14/06/01 12:22:17 INFO DAGScheduler: Final stage: Stage 1(first at <console>:17)
14/06/01 12:22:17 INFO DAGScheduler: Parents of final stage: List()
14/06/01 12:22:17 INFO DAGScheduler: Missing parents: List()
14/06/01 12:22:17 INFO DAGScheduler: Computing the requested partition locally
14/06/01 12:22:17 INFO HadoopRDD: Input split: file:/home/user/playground/ts-data.csv:0+1932934
14/06/01 12:22:17 INFO SparkContext: Job finished: first at <console>:17, took 0.003210457 s
res1: Array[String] = Array(20140127, 0000df, d063b4, ***, ***-Service,app180000m,49)
// groupby id - count unique
scala> csv.groupBy(_(4)).count
(... Spark processing ...)
res2: Long = 37668
// records per day
csv.map(record => record(0)->1).reduceByKey(_+_).collect
(... more Spark processing ...)
res8: Array[(String, Int)] = Array((20140117,1854), (20140120,2028), (20140124,3398), (20140131,6084), (20140122,5076), (20140128,8310), (20140123,8476), (20140127,1932), (20140130,8482), (20140129,8488), (20140118,5100), (20140109,3488), (20140110,4822))
* Edit using data added to the question *
val rawData="""hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda"""
//split lines
val data= rawData.split("\n")
val rdd= sc.parallelize(data)
// Split using space as separator
val byId=rdd.map(_.split(" ")).groupBy(_(1))
byId.count
res11: Long = 3