Cannot apply Spark user-defined functions - pyspark

I've tried repeatedly to apply a function that makes some modifications to a Spark DataFrame containing text strings. Below is the corresponding code, but it always gives me this error:
An error occurred while calling o699.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure: Lost task 0.0 in stage 27.0 (TID 29, localhost, executor driver):
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

#!hdfs dfs -rm -r nixon_token*

spark = SparkSession.builder \
    .appName("spark-nltk") \
    .getOrCreate()

data = spark.sparkContext.textFile('1970-Nixon.txt')

def word_tokenize(x):
    import nltk
    return str(nltk.word_tokenize(x))

# df_test is a DataFrame with a string column named 'spans'
test_tok = udf(lambda x: word_tokenize(x), StringType())
resultDF = df_test.select("spans", test_tok('spans').alias('text_tokens'))
resultDF.show()

Related

Spark spends a long time on HadoopRDD: Input split

I'm running logistic regression with SGD on a large libsvm file. The file is about 10 GB in size with 40 million training examples.
When I run my Scala code with spark-submit, I notice that Spark spends a lot of time logging this:
18/02/07 04:44:50 INFO HadoopRDD: Input split: file:/ebs2/preprocess/xaa:234881024+33554432
18/02/07 04:44:51 INFO Executor: Finished task 6.0 in stage 1.0 (TID 7). 875 bytes result sent to driver
18/02/07 04:44:51 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 9, localhost, executor driver, partition 8, PROCESS_LOCAL, 7872 bytes)
18/02/07 04:44:51 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 7) in 1025 ms on localhost (executor driver) (7/307)
Why is Spark logging so many 'HadoopRDD: Input split' lines? What is their purpose, and how can I speed up or get rid of this step?
Here is the code:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import scala.compat.Platform._
object test {
  def main(args: Array[String]) {
    val nnodes = 1
    val epochs = 3

    val conf = new SparkConf().setAppName("Test Name")
    val sc = new SparkContext(conf)

    val t0 = currentTime
    val train = MLUtils.loadLibSVMFile(sc, "/ebs2/preprocess/xaa", 262165, 4)
    val test = MLUtils.loadLibSVMFile(sc, "/ebs2/preprocess/xab", 262165, 4)
    val t1 = currentTime

    println("START")

    val lrAlg = new LogisticRegressionWithSGD()
    lrAlg.optimizer.setMiniBatchFraction(10.0 / 40000000.0)
    lrAlg.optimizer.setNumIterations(12000000)
    lrAlg.optimizer.setStepSize(0.01)

    val model = lrAlg.run(train)
    model.clearThreshold()

    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }
}
I fixed the speed issues by running
train = train.coalesce(1)
train.cache()
and by increasing the memory to a total of 64 GB. Previously, Spark might not have been caching properly because there was not enough RAM.
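Note that train is declared as a val in the code above, so the reassignment as written would not compile; below is a minimal sketch of the same fix in context (the name trainCached is my own):

// Coalesce to a single partition and cache the training set in memory
val trainCached = train.coalesce(1).cache()
trainCached.count()                 // action to force materialization of the cache
val model = lrAlg.run(trainCached)  // train on the cached RDD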

How do I convert a WrappedArray column in spark dataframe to Strings?

I am trying to convert a column that contains Array[String] to String, but I consistently get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 78.0 failed 4 times, most recent failure: Lost task 0.3 in stage 78.0 (TID 1691, ip-******): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Here's the piece of code:
val mkString = udf((arrayCol: Array[String]) => arrayCol.mkString(","))
val dfWithString = df.select($"arrayCol")
  .withColumn("arrayString", mkString($"arrayCol"))
WrappedArray is not an Array (which is a plain old Java array, not a native Scala collection). You can either change the signature to:
import scala.collection.mutable.WrappedArray
(arrayCol: WrappedArray[String]) => arrayCol.mkString(",")
or use one of its supertypes, like Seq:
(arrayCol: Seq[String]) => arrayCol.mkString(",")
In recent Spark versions you can use concat_ws instead:
import org.apache.spark.sql.functions.concat_ws
df.select(concat_ws(",", $"arrayCol"))
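For reference, here is a minimal self-contained sketch of both approaches; the toy DataFrame, the column name arrayCol, and the local SparkSession are assumptions for illustration only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, udf}

val spark = SparkSession.builder().appName("wrapped-array-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Toy DataFrame with an array<string> column, standing in for the original df
val df = Seq(Seq("a", "b", "c"), Seq("x", "y")).toDF("arrayCol")

// Seq-based UDF: Spark passes array columns to Scala UDFs as Seq, not Array
val mkString = udf((arrayCol: Seq[String]) => arrayCol.mkString(","))
df.withColumn("arrayString", mkString($"arrayCol")).show()

// Built-in alternative, no UDF needed
df.select(concat_ws(",", $"arrayCol")).show()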
This code works for me:
df.select("wifi_ids").rdd.map(row => row.get(0).asInstanceOf[WrappedArray[WrappedArray[String]]].toSeq.map(x => x.toSeq.apply(0)))
In your case, I guess it would be:
val mkString = udf((arrayCol: WrappedArray[String]) => arrayCol.toArray.mkString(","))
val dfWithString = df.select($"arrayCol").withColumn("arrayString", mkString($"arrayCol"))

Co-occurrence graph RpcTimeoutException in Apache Spark

I have a file that maps from documentId to entities and I extract document co-occurrences. The entities RDD looks like this:
//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])]
To extract relationships between entities and their frequency within each document, I use the following code:
def hashId(str: String) = {
  Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}

val docRelTupleRDD = docEntityTupleRDD
  // flatMap at SampleGraph.scala:62
  .flatMap { case (docId, entities) =>
    val entitiesWithId = entities.map { case (name, _, freq) => (hashId(name), freq) }.toList
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
    relationships
  }

val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
Each edge stores the per-document relationship frequency in a Map. When I try to write the edges to a file:
edges.saveAsTextFile(outputFile + "_edges")
I receive the following errors after some time:
15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]
My spark configuration looks like this:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")
I have already increased the executor memory and, after researching online, refactored a previous reduceByKey construct into the aggregateByKey shown above. The error stays the same. Can someone help me?
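One thing worth checking, offered as a sketch rather than a fix: the log shows heartbeats and RPC asks timing out rather than a task-level error, so while the memory pressure is investigated, the timeouts involved can be raised in the same SparkConf (the values below are illustrative only):

conf
  // spark.rpc.askTimeout (seen in the log) falls back to spark.network.timeout when unset
  .set("spark.network.timeout", "600s")
  // how often the executor reports heartbeats to the driver
  .set("spark.executor.heartbeatInterval", "60s")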

Transform RDD into RowMatrix for PCA

The original data I have looks like this:
RDD data:
key -> index
1 -> 2
1 -> 3
1 -> 5
2 -> 1
2 -> 3
2 -> 4
How can I convert the RDD to the following format?
key -> index1, index2, index3, index4, index5
1 -> 0,1,1,0,1
2 -> 1,0,1,1,0
My current method is:
val vectors = filtered_data_by_key.map(x => {
  var temp = Array[AnyVal]()
  x._2.copyToArray(temp)
  (x._1, Vectors.sparse(filtered_key_size, temp.map(_.asInstanceOf[Int]), Array.fill(filtered_key_size)(1)))
})
I got some strange error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 54.0 failed 1 times, most recent failure: Lost task 3.0 in stage 54.0 (TID 75, localhost): java.lang.IllegalArgumentException: requirement failed
When I try to debug this program using the following code:
val vectors = filtered_data_by_key.map(x => {
  val temp = Array[AnyVal]()
  val t = x._2.copyToArray(temp)
  (x._1, temp)
})
I found temp is empty, so the problem is in copyToArray().
I am not sure how to solve this.
I don't understand the question completely. Why are your keys important? And what is the maximum index value? In your code you are using the number of distinct keys as the maximum index value, but I believe that is a mistake.
But I will assume the maximum index value is 5. In that case, I believe this is what you're looking for:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val vectors = data_by_key.map { case (k, it) =>
  Vectors.sparse(5, it.map(x => x - 1).toArray, Array.fill(it.size)(1.0)) }
val rm = new RowMatrix(vectors)
I decreased the index values by one because they should start at 0.
The 'requirement failed' error is because your indices and values arrays do not have the same size.
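Since the end goal in the title is PCA, here is a minimal follow-up sketch on the resulting RowMatrix; choosing 2 principal components is an arbitrary number for illustration:

// Compute the top 2 principal components of the row matrix
val pc = rm.computePrincipalComponents(2)
// Project the original rows into the principal-component space
val projected = rm.multiply(pc)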

How can I run Spark job programmatically

I want to run a Spark job programmatically - submit the SparkPi calculation to a remote cluster directly from IDEA (my laptop):
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
      .setMaster("spark://host-name:7077")
    val spark = new SparkContext(conf)

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices

    // Monte Carlo estimate of Pi: count random points inside the unit circle
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
However, when I run it, I observe the following error:
14/12/08 11:31:20 ERROR security.UserGroupInformation: PriviledgedActionException as:remeniuk (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
... 4 more
When I run the same script with spark-submit from my laptop, I see the same error.
Only when I upload the jar to the remote cluster (the machine where the master is running) does the job complete successfully:
./bin/spark-submit --master spark://host-name:7077 --class com.viaden.crm.spark.experiments.SparkPi ../spark-experiments_2.10-0.1-SNAPSHOT.jar
According to the exception stack, it should be a local firewall issue.
Please refer to this similar case:
Intermittent Timeout Exception using Spark
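If the firewall is indeed blocking the cluster from connecting back to the driver on the laptop, one possible workaround, sketched below with placeholder values, is to pin the driver's address and ports in the SparkConf and open exactly those ports locally:

val conf = new SparkConf().setAppName("Spark Pi")
  .setMaster("spark://host-name:7077")
  // Laptop's address as seen from the cluster (placeholder value)
  .set("spark.driver.host", "192.168.1.100")
  // Fixed ports to open in the local firewall (placeholder values)
  .set("spark.driver.port", "51000")
  .set("spark.blockManager.port", "51001")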