pyspark job on qubole fails with "Retrying exception reading mapper output"

I have a pyspark job running via qubole which fails with the following error.
Qubole > Shell Command failed, exit code unknown
Qubole > 2016-12-03 17:36:53,097 ERROR shellcli.py:231 - run - Retrying exception reading mapper output: (22, 'The requested URL returned error: 404 Not Found')
Qubole > 2016-12-03 17:36:53,358 ERROR shellcli.py:262 - run - Retrying exception reading mapper logs: (22, 'The requested URL returned error: 404 Not Found')
The job is run with the following configurations:
--num-executors 38 --executor-cores 2 --executor-memory 12288M --driver-memory 4000M --conf spark.storage.memoryFraction=0.3 --conf spark.yarn.executor.memoryOverhead=1024
The cluster has 30 slave nodes; both the master and the slaves are m2.2xlarge instances with 4 cores each.
Any insights on the root cause of the issue will be useful.

In many cases the above error is not really the main reason for the failure. In Qubole the Spark job is submitted via a shellCli (a single-mapper command that invokes the main PySpark job using spark-submit on one of the slave nodes). Since that same shellCli process also hosts the driver in yarn-client mode, if the process goes bad for any reason (e.g. memory issues in the driver) you may hit this error.
A less probable cause is network connectivity: the Qubole tier being unable to connect to the process/slave node where this single-mapper invoker job is running.

Related

Container killed by YARN for exceeding memory limits in Spark Scala

I am facing the below error while running my Spark Scala code using the spark-submit command.
ERROR cluster.YarnClusterScheduler: Lost executor 14 on XXXX: Container killed by YARN for exceeding memory limits. 55.6 GB of 55 GB physical memory used.
The line of code that throws the error is below:
df.write.mode("overwrite").parquet("file")
I am writing a Parquet file. It was working until yesterday; I'm not sure why, but starting with the last run it throws this error with the same input file.
Thanks,
Naveen
By running with the below conf in the spark-submit command, the issue was resolved and the code ran successfully.
--conf spark.dynamicAllocation.enabled=true
Thanks,
Naveen
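For reference, a minimal sketch (not the asker's code; the app name and paths are placeholders) of enabling the same setting when the SparkSession is built, instead of on the spark-submit command line:
import org.apache.spark.sql.SparkSession

// Hypothetical example: enable dynamic allocation programmatically.
// On YARN, dynamic allocation also requires the external shuffle service.
val spark = SparkSession.builder()
  .appName("parquet-write-example")                    // placeholder app name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

val df = spark.read.parquet("input")                   // placeholder input path
df.write.mode("overwrite").parquet("file")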

ERROR HbaseConnector: Can't get the location for replica 0

I'm trying to perform some read/write operations with HBase using Spark. When I run my Spark code using the spark-submit command
bin/spark-submit --master local[*] --class com.test.driver.Driver /home/deb/computation/target/computation-1.0-SNAPSHOT.jar "function=avg" "signals=('.tagname_qwewf')" "startTime=2018-10-10T13:51:47.135Z" "endTime=2018-10-10T14:36:11.073Z"
it's executing without any error.
But when I try to do the same from IntelliJ I get the errors below:
18/12/17 01:51:45 ERROR HbaseConnector: An exception while reading dataframe from HBase
18/12/17 01:51:45 ERROR HbaseConnector: Can't get the location for replica 0
18/12/17 01:51:45 ERROR Driver: No historical data found for signals in the expression.
Any suggestions on how to resolve this issue?

Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException

~/spark/spark-2.1.1-bin-hadoop2.7/bin$ ./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/20 16:41:30 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.42.80:32141.
17/06/20 16:41:31 INFO RestSubmissionClient: Submission successfully created as driver-20170620204130-0005. Polling submission state...
17/06/20 16:41:31 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20170620204130-0005 in spark://192.168.42.80:32141.
17/06/20 16:41:31 INFO RestSubmissionClient: State of driver driver-20170620204130-0005 is now ERROR.
17/06/20 16:41:31 INFO RestSubmissionClient: Driver is running on worker worker-20170620203037-172.17.0.5-45429 at 172.17.0.5:45429.
17/06/20 16:41:31 ERROR RestSubmissionClient: Exception from the cluster:
java.nio.file.NoSuchFileException: /home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
java.nio.file.Files.copy(Files.java:1274)
org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:608)
org.apache.spark.util.Utils$.copyFile(Utils.scala:579)
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:664)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:154)
org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:172)
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91)
17/06/20 16:41:31 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20170620204130-0005",
"serverSparkVersion" : "2.1.1",
"submissionId" : "driver-20170620204130-0005",
"success" : true
}
Log from spark-worker:
2017-06-20T20:41:30.807403232Z 17/06/20 20:41:30 INFO Worker: Asked to launch driver driver-20170620204130-0005
2017-06-20T20:41:30.817248508Z 17/06/20 20:41:30 INFO DriverRunner: Copying user jar file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar to /opt/spark/work/driver-20170620204130-0005/myproj-assembly-0.1.0.jar
2017-06-20T20:41:30.883645747Z 17/06/20 20:41:30 INFO Utils: Copying /home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar to /opt/spark/work/driver-20170620204130-0005/myproj-assembly-0.1.0.jar
2017-06-20T20:41:30.885217508Z 17/06/20 20:41:30 INFO DriverRunner: Killing driver process!
2017-06-20T20:41:30.885694618Z 17/06/20 20:41:30 WARN Worker: Driver driver-20170620204130-0005 failed with unrecoverable exception: java.nio.file.NoSuchFileException: home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
Any idea why? Thanks
UPDATE
Is the following command right?
./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
UPDATE
I think I understand a little more about Spark and why I had this problem (spark-submit error: ClassNotFoundException). The key point is that, although the word REST is used here (REST URL: spark://127.0.1.1:6066 (cluster mode)), the application jar is not uploaded to the cluster after submission, which differs from my understanding. So the Spark cluster cannot find the application jar and cannot load the main class.
I will try to find out how to set up the Spark cluster and use cluster mode to submit the application. I have no idea whether client mode would use more resources for streaming jobs.
Quoting the key point from the question's update: the application jar will not be uploaded to the cluster after submission, so the Spark cluster cannot find the application jar and cannot load the main class.
That's why you have to place the jar file on the master node OR put it into HDFS before the spark-submit.
This is how to do it:
1.) Transferring the file to the master node with scp (e.g. from an Ubuntu machine):
$ scp <file> <username>@<IP address or hostname>:<destination>
For example:
$ scp mytext.txt tom@128.140.133.124:~/
2.) Transferring the file to HDFS:
$ hdfs dfs -put mytext.txt
Hope I could help you.
You are submitting the application in cluster mode. This means a Spark driver process will be launched somewhere in the cluster, and the file must exist there.
That is why, with Spark, it is recommended to use a distributed file system like HDFS or S3.
A standalone-mode cluster wants the jar file passed via HDFS, because the driver may run on any node in the cluster.
hdfs dfs -put xxx.jar /user/
spark-submit --master spark://xxx:7077 \
--deploy-mode cluster \
--supervise \
--driver-memory 512m \
--total-executor-cores 1 \
--executor-memory 512m \
--executor-cores 1 \
--class com.xiyou.bi.streaming.game.common.DmMoGameviewOnlineLogic \
hdfs://xxx:8020/user/hutao/xxx.jar

Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?

I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of 4 steps:
Read a big (1TB) sequence file (corresponding to 1 day of data)
Filter out most of it and get about 1GB of shuffle write
keyBy customer
aggregateByKey() into a custom structure that builds a profile for that customer, corresponding to a HashMap[Long, Float] per customer. The Long keys are unique, and there are never more than 50K distinct entries. (A rough sketch of this pipeline follows the list.)
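A rough Scala sketch of the four steps (the input path, record parsing, and the keep/customerId/featureKey/weight helpers are hypothetical placeholders, not the actual job):
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

// Placeholder extraction logic standing in for the real parsing code.
def keep(v: String): Boolean = v.nonEmpty
def customerId(v: String): String = v.split('\t')(0)
def featureKey(v: String): Long = v.split('\t')(1).toLong
def weight(v: String): Float = 1.0f

val sc = new SparkContext(new SparkConf().setAppName("geo-extract-sketch"))

val raw = sc.sequenceFile[String, String]("hdfs:///data/day1")   // step 1: ~1 TB input
val kept = raw.filter { case (_, v) => keep(v) }                  // step 2: filter most of it out
val byCustomer = kept.keyBy { case (_, v) => customerId(v) }      // step 3: keyBy customer

// step 4: aggregateByKey into a HashMap[Long, Float] profile per customer
val profiles = byCustomer.aggregateByKey(mutable.HashMap.empty[Long, Float])(
  (profile, record) => {
    val k = featureKey(record._2)
    profile(k) = profile.getOrElse(k, 0f) + weight(record._2)
    profile
  },
  (a, b) => { b.foreach { case (k, v) => a(k) = a.getOrElse(k, 0f) + v }; a }
)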
I'm running this with this configuration:
--name geo-extract-$1-askTimeout \
--executor-cores 8 \
--num-executors 100 \
--executor-memory 40g \
--driver-memory 4g \
--driver-cores 8 \
--conf 'spark.storage.memoryFraction=0.25' \
--conf 'spark.shuffle.memoryFraction=0.35' \
--conf 'spark.kryoserializer.buffer.max.mb=1024' \
--conf 'spark.akka.frameSize=1024' \
--conf 'spark.akka.timeout=200' \
--conf 'spark.akka.askTimeout=111' \
--master yarn-cluster \
And getting this error:
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
...
Caused by: org.apache.spark.SparkException: Error sending message [message = GetMapOutputStatuses(0)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113)
... 21 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
The job and the logic have been shown to work with a small test set, and I can even run this job for some dates but not for others. I've googled around and found hints that "Error communicating with MapOutputTracker" is related to internal Spark messages, but I already increased "spark.akka.frameSize", "spark.akka.timeout" and "spark.akka.askTimeout" (this last one does not even appear in the Spark documentation, but was mentioned on the Spark mailing list), to no avail. There is still some timeout going on at 30 seconds that I have no clue how to identify or fix.
I see no reason for this to fail due to data size, as the filtering operation and the fact that aggregateByKey performs local partial aggregations should be enough to address the data size. The number of tasks is 16K (automatic from the original input), much more than the 800 cores that are running this, on 100 executors, so it is not as simple as the usual "increment partitions" tip. Any clues would be greatly appreciated! Thanks!
I had a similar issue: my job would work fine with a smaller dataset but would fail with larger ones.
After a lot of configuration changes, I found that changing the driver memory settings has much more of an impact than changing the executor memory settings.
Using the new G1 garbage collector also helps a lot. I am using the following configuration for a cluster of 3 nodes with 40 cores each. Hope the following config helps:
spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.driver.memory=8g
spark.driver.cores=10
spark.driver.maxResultSize=8g
spark.executor.memory=16g
spark.executor.cores=25
spark.default.parallelism=50
spark.eventLog.dir=hdfs://mars02-db01/opt/spark/logs
spark.eventLog.enabled=true
spark.kryoserializer.buffer=512m
spark.kryoserializer.buffer.max=1536m
spark.rdd.compress=true
spark.storage.memoryFraction=0.15
spark.storage.MemoryStore=12g
What's going on in the driver at the time of this failure? It could be due to memory pressure on the driver causing it to be unresponsive. If I recall correctly, the MapOutputTracker that it's trying to reach when it calls GetMapOutputStatuses is running in the Spark driver process.
If you're facing long GCs or other pauses for some reason in that process, this would cause the exceptions you're seeing above.
One thing to try would be to jstack the driver process when you start seeing these errors and see what it is doing. If jstack doesn't respond, it could be that your driver isn't sufficiently responsive.
16K tasks does sound like a lot for the driver to keep track of. Any chance you can increase the driver memory past 4g?
Try the following property:
spark.shuffle.reduceLocality.enabled=false
Refer to this link: https://issues.apache.org/jira/browse/SPARK-13631
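If you prefer to set it in code rather than on the command line, a minimal sketch (the app name is a placeholder) would be:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: disable reduce-task locality preferences
// (see SPARK-13631) when building the SparkConf.
val conf = new SparkConf()
  .setAppName("geo-extract")                         // placeholder app name
  .set("spark.shuffle.reduceLocality.enabled", "false")
val sc = new SparkContext(conf)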

Spark job using HBase fails

Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples end the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more
Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
Why it works is described here.