Getting Py4JJavaError Pyspark error on using rdd - pyspark

I am getting the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
on this line:
result = df.select('student_age').rdd.flatMap(lambda x: x).collect()
'student_age' is a column name. It was running fine until last week but now this error.
Does anyone have any insights on that?

Using collect is dangerous for this very reason, It's prone to Out Of Memory errors. I suggest removing it. You also do not need to use a rdd for this you can do this with a data frame:
result = df.select(explode(df['student_age'])) #returns a dataFrame
#write code to use a data frame instead of any array.
If nothing else changed, likely the data did, and finally outgrew the size in memory.
It's also possible that you have new 'bad' data that is throwing an error.
Either way you could likely prove this by find this(OOM) or prove the data is bad by printing it.
def f(row):
print(row.student_age)
result.foreach(f) # used for simple stuff that doesn't require heavy initialization.
IF that works you may want to break your code down to use foreachPartition. This will let you do math on each value in the memory of each executor. The only trick is that within fun below as you are executing this code on the executor you cannot reference anything that uses sparkContext. (Python code only instead of Pyspark).
def f(rows):
#intialize a database connection here
for row in rows:
print(row.student_age) # do stuff with student_age
#close database connection here
result.foreachPartition(f) # used for things that need heavy initialization
Spark foreachPartition vs foreach | what to use?

This issue is solved, here is the answer:
result = [i[0] for i in df.select('student_age').toLocalIterator()]

Related

Exception handling in Apache Spark

I have been researching on the proper way of handling exceptions in Apache Spark jobs. I have read through different questions in Stackoverflow but I still haven't got to a conclusion. From my point of view there are three ways of handling exceptions:
Try catch/block surrounding the lambda function that is going to perform the computation. This is tricky because the block will have to be placed surrounding the code that triggers the lazy computation. If an error happens then I assume there won't be any RDD to work with (Taken from this blog entry)
val lines: RDD[String] = sc.textFile("large_file.txt")
val tokens =
lines.flatMap(_ split " ")
.map(s => s(10))
try {
// This try-catch block catch all the exceptions thrown by the
// preceding transformations.
tokens.saveAsTextFile("/some/output/file.txt")
} catch {
case e : StringIndexOutOfBoundsException =>
// Doing something in response of the exception
}
Try catch/block inside the lambda function: This implies deciding on the correct output of a caught exception inside the lambda function.
rdd.map({
Try(fn) match{
case Success: _
case Failure:<<Record with error flag>>
}).filter(record.errorflag==null)
Let the exception propagate. The task will fail and the Spark framework will relaunch the task again. This works when the error is caused by reasons outside the code scope. e.g. (memory leak, connection to another service lost momentarily.)
What's the correct way of handling exceptions?. I guess it depends on what you want to achieve with the RDD operation. If an error in one of the RDD records means that the output is not valid then option 1 is the way to go. If we expect some of the records to fail, we go for option 2. Option 3 does not even need to make a choice as it is the normal behaviour of the platform.
In the past we did not bother with try/catch approach except for input parameter checking.
For the rest we just relied on checking the return code as in:
spark-submit --master yarn ... bla bla
ret_val=$?
...
Why? As you need to correct something in general and we needed to then start over again. It's hard to dynamically correct certain things. Your scheduling tool can pick this up as well, Rundeck, Airflow...et al.
More advanced restart options are possible but simply get convoluted, but could be done. As you allude to in option 2. Never seen that done though.

Difference between calling sql() and using Spark API call()

I am new to Spark/Scala/Hive. Am just wondering if there are any differences between calling
spark = new SparkSession(...).getHiveContext()
spark.sql("SELECR * FROM table")
and
spark = new SparkSession(...).getHiveContext() // not using
spark.read.table(table).select(from("*"))
??
Particularly, are there any performance difference.
These two snippets have the same run-time performance.
The second API is safer, is you make a typo or try to used some non supported operation it will give you a quick and clear compilation error. It's funny that you wrote SELECR and not SELECT, that a good illustration of this point :)

Source.fromFile not working for HDFS file path

i am trying to read file contents from my hdfs for that i am using Source.fromFile(). It is working fine when my file is in local system but throwing error when i am trying to read file from HDFS.
object CheckFile{
def main(args:Array[String]) {
for (line <- Source.fromFile("/user/cloudera/xxxx/File").getLines()) {
println(line)
}
}
}
Error:
java.io.FileNotFoundException: hdfs:/quickstart.cloudera:8080/user/cloudera/xxxx/File (No such file or directory)
i searched but i am not able to find any solutions to this.
Please help
If you are using Spark you should use SparkContext to load the files. Source.fromFile uses the local file system.
Say you have your SparkContext at sc,
val fromFile = sc.textFile("hdfs://path/to/file.txt")
Should do the trick. You might have to specify the node address, though.
UPDATE:
To add to the comment. You want to read some data from hdfs and store it as a Scala collection. This is bad practice as the file might contain milions of lines and it will crash due to insufficient amount of memory; you should use RDDs and not built-in Scala collections. Nevertheless, if this is what you want, you could do:
val fromFile = sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray
Which would produce a local collection of desired type (Array in this case).
sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray.mkString will give the result as string

How can one spread the result of foreachRDD

let's say I have a code as follows :
var model = initiliazeModel(some_params)
dstream.foreachRDD { rdd =>
model = model.update(rdd)
println(model)
}
println(model) // or doing some thing on the model
My problem is that even if the first println gives the desired result ie. the model up-to-date, the second println displays the initialized model and not the updated one !!!
My question is how can I spread the updated model outside the block foreachRDD ?!
I also think of a synchronization problem because the 2nd println is run before the 1st one !!!
Thanks for help !
You have a common misconception here. In general, when you call map, filter, foreach, and any other transformation, you are not executing anything just yet. You closures are sent to executors and the stages configured, but all things are evaluated lazily. Your main program proceed ahead, either adding more configuration or other things, not waiting for all computations to be done. Thus, when your program reaches your second println (miliseconds after), the model has not changed nor has any other println been called.
In Scala, I have no idea, but in Java, you can enclose your foreach and model variable within a class as a static members and then use model variable after the success of foreach in another class.
Accumulators are global in Spark. you can update the accumulator variable anywhere in the Program and it gets reflects everywhere regardless of wether it is different executor or driver program.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
creating and initilizing Accumulator
val accumulator = sc.accumulator(0)
Initializing the accumulator
accumulator.add(1)
accessing the latest value
accumulator.value
hope this helps

Graphx: I've got NullPointerException inside mapVertices

I want to use graphx. For now I just launchs it locally.
I've got NullPointerException in these few lines. First println works well, and second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
42 // doesn't mean anything
}).vertices.collect
And it does not matter which method of 'graph' object I call. But 'graph' is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.