I got the error log below while submitting a PySpark Dataproc job that creates recommendations.
18/09/15 06:11:36 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/09/15 06:11:36 INFO org.spark_project.jetty.server.Server: Started @3317ms
18/09/15 06:11:37 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@6322b8bd{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/09/15 06:11:37 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.8-hadoop2
18/09/15 06:11:38 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-d21a-m/10.128.0.4:8032
18/09/15 06:11:41 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1536988234373_0004
18/09/15 06:11:46 WARN org.apache.spark.SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory 'checkpoint/' appears to be on the local filesystem.
Traceback (most recent call last):
  File "/tmp/job-614e830d/train_and_apply.py", line 50, in <module>
    model = ALS.train(dfRates.rdd, 20, 20) # you could tune these numbers, but these are reasonable choices
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 272, in train
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 229, in _prepare
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1364, in first
ValueError: RDD is empty
18/09/15 06:11:53 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@6322b8bd{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
Any suggestions?
I'm getting these logs from the executor (reading from the bottom up):
2021-11-30 18:44:42,911 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)) - Deleting directory /var/data/spark-0646270c-a2d0-47d4-8e6c-0bc735bc255d/spark-a54cf7e4-baaf-4411-9073-0c1fb1e4cc5b
2021-11-30 18:44:42,910 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)) - Shutdown hook called
2021-11-30 18:44:42,902 ERROR [SIGTERM handler] executor.CoarseGrainedExecutorBackend (SignalUtils.scala:$anonfun$registerLogger$2(43)) - RECEIVED SIGNAL TERM
2021-11-30 18:44:42,823 INFO [CoarseGrainedExecutorBackend-stop-executor] storage.BlockManager (Logging.scala:logInfo(57)) - BlockManager stopped
2021-11-30 18:44:42,822 INFO [CoarseGrainedExecutorBackend-stop-executor] memory.MemoryStore (Logging.scala:logInfo(57)) - MemoryStore cleared
2021-11-30 18:44:42,798 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(57)) - Driver commanded a shutdown
How can I enable any kind of logging in the Spark driver to understand what kind of event on the driver triggered the executor shutdown? There is no lack of memory for the driver or the executors; the pod metrics show they have much more than the limit plus overhead. So it looks like the reason for the shutdown signal isn't a lack of resources, but maybe some hidden exception that isn't logged anywhere.
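One knob that is easy to miss: driver-side verbosity can be raised from inside the application with SparkContext.setLogLevel. Below is a minimal sketch, with a placeholder object and app name; the job itself is assumed to be submitted as usual.

import org.apache.spark.sql.SparkSession

object DriverLogLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("driver-log-level").getOrCreate()

    // Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN.
    // This adjusts the driver-side log4j root level at runtime.
    spark.sparkContext.setLogLevel("DEBUG")

    // ... start the actual streaming job here ...

    spark.stop()
  }
}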
Following the advice of @mazaneicha, I have tried setting longer timeouts, but I am still getting the same error:
implicit val spark: SparkSession = SparkSession
  .builder
  .master("local[1]")
  .config(new SparkConf().setIfMissing("spark.master", "local[1]")
    .set("spark.eventLog.dir", "file:///tmp/spark-events")
    .set("spark.dynamicAllocation.executorIdleTimeout", "100s")
    .set("spark.dynamicAllocation.schedulerBacklogTimeout", "100s")
  )
  .getOrCreate()
The reason for the failure was actually in the logs:
2021-12-01 15:05:46,906 WARN [main] streaming.StreamingQueryManager (Logging.scala:logWarning(69)) - Stopping existing streaming query [id=b13a69d7-5a2f-461e-91a7-a9138c4aa716, runId=9cb31852-d276-42d8-ade6-9839fa97f85c], as a new run is being started.
Why were the queries stopped? Because in Scala I was creating streaming queries in a loop over a collection while keeping all the query names and all the checkpoint locations the same. After making them unique (I just used the string values from the collection), the failures went away.
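A minimal sketch of the fixed pattern, with a hypothetical collection and placeholder source/sink; the only point is that queryName and checkpointLocation are derived from each element, so every query gets its own.

import org.apache.spark.sql.SparkSession

object UniqueQueriesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("unique-queries").getOrCreate()

    val topics = Seq("orders", "payments", "shipments") // hypothetical collection

    topics.foreach { topic =>
      val df = spark.readStream
        .format("rate") // placeholder source, the real job reads something else
        .load()

      df.writeStream
        .queryName(s"query-$topic")                               // unique name per element
        .option("checkpointLocation", s"/tmp/checkpoints/$topic") // unique checkpoint per element
        .format("console")                                        // placeholder sink
        .start()
    }

    spark.streams.awaitAnyTermination()
  }
}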
I am trying to run Nutch and Hadoop through Eclipse and followed a couple of tutorials to set it up. I am currently stuck at a NullPointerException that I believe is caused by regex-urlfilter.txt and regex-normalize.xml not being found.
Here is the error trace from the logs:
[LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.conf.Configuration - regex-normalize.xml not found
4473 [LocalJobRunner Map Task Executor #0] WARN org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - Can't load the default rules!
4477 [LocalJobRunner Map Task Executor #0] DEBUG org.apache.nutch.util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/tmp/hadoop-338737067/mapred/local/localRunner/338737067/job_local1524701719_0001/job_local1524701719_0001.xml, instantiating a new object cache
4486 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.conf.Configuration - regex-urlfilter.txt not found
4486 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Starting flush of map output
4516 [LocalJobRunner Map Task Executor #0] DEBUG org.apache.hadoop.util.concurrent.ExecutorHelper - afterExecute in thread: LocalJobRunner Map Task Executor #0, runnable type: java.util.concurrent.FutureTask
4516 [Thread-3] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
4521 [Thread-3] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1524701719_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:551)
Both files exist in the \workspace\apache-nutch-1.16\conf folder and I am not sure what I am doing wrong. I have double-checked that my HADOOP_HOME and HADOOP_BIN environment variables are set correctly and point to the right directories. I am not sure which directory is being searched for regex-urlfilter.txt and regex-normalize.xml. Any help in resolving this issue would be appreciated.
I am using Hadoop 3.0.0 and apache-nutch-1.16.
The conf/ folder needs to be on the Java classpath. This is most easily done by running Nutch using one of the provided scripts, bin/nutch or bin/crawl. If the binary package is used, the script location is apache-nutch-1.16/bin/nutch. With the source package it's apache-nutch-1.16/runtime/local/bin/nutch after ant runtime has been executed. Using the scripts also lets you keep the configuration files in a different directory and point NUTCH_CONF_DIR at it; the scripts just put this location at the front of the classpath.
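A quick way to check whether conf/ is actually visible from an Eclipse launch is to resolve the rule files against the classpath; a small sketch, run with the same classpath as the crawl (file names as in the question):

object NutchConfClasspathCheck {
  def main(args: Array[String]): Unit = {
    val cl = Thread.currentThread().getContextClassLoader
    // If either of these prints null, the conf/ directory (or NUTCH_CONF_DIR)
    // is not on the classpath and Nutch will report the files as not found.
    Seq("regex-urlfilter.txt", "regex-normalize.xml").foreach { name =>
      println(s"$name -> ${cl.getResource(name)}")
    }
  }
}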
I am using boilerpipe to get text out of HTML. However, there is an issue that I have not been able to resolve. I have a list of 50k elements. I am creating an RDD of 1000 elements at a time, processing them, and saving the resulting RDD to HDFS. The error that I have encountered is this:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 265, in <module>
x = get_data(line[:-1],c)
File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 208, in get_data
sc.parallelize(warcrecords).repartition(72).map(lambda s: classify(s)).saveAsTextFile(file_name)
File "/home/hadoopuser/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1552, in saveAsTextFile
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o40.saveAsTextFile
17/09/19 18:11:10 INFO SparkContext: Invoking stop() from shutdown hook
17/09/19 18:11:10 INFO SparkUI: Stopped Spark web UI at http://192.168.0.255:4040
17/09/19 18:11:10 INFO DAGScheduler: Job 0 failed: saveAsTextFile at NativeMethodAccessorImpl.java:0, took 14.746797 s
17/09/19 18:11:10 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:0) failed in 7.906 s due to Stage cancelled because SparkContext was shut down
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@ec3ca3)
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1505824870317,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
17/09/19 18:11:10 INFO StandaloneSchedulerBackend: Shutting down all executors
17/09/19 18:11:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/09/19 18:11:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/09/19 18:11:10 INFO MemoryStore: MemoryStore cleared
17/09/19 18:11:10 INFO BlockManager: BlockManager stopped
17/09/19 18:11:10 INFO BlockManagerMaster: BlockManagerMaster stopped
17/09/19 18:11:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/09/19 18:11:10 INFO SparkContext: Successfully stopped SparkContext
17/09/19 18:11:10 INFO ShutdownHookManager: Shutdown hook called
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763/pyspark-b73e541b-1182-4449-96bc-26eabca1803d
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763
The results of the first 1000 elements are saved to the HDFS file, but from then on it throws the above error. What is the fix for this?
Removing this line from the code did the trick. I still don't know why:
from boilerpipe.extract import Extractor
I am calling this method on an RDD[String], with the destination path in the arguments. (Scala)
Even after deleting the directory before starting, the process gives this error.
I am running this process on an EMR cluster with the output location in AWS S3.
Below is the command used:
spark-submit --deploy-mode cluster --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine --conf spark.yarn.submit.waitAppCompletion=true --num-executors 21 --executor-cores 4 --executor-memory 20g --driver-memory 8g --driver-cores 4 s3://bi-aws-users/sbatheja/hotel-shopper-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d 3 -p 100 --search-bucket s3a://hda-prod-business.hotwire.hotel.search --prd-output-path s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/
Log:
16/07/07 11:27:47 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/07 11:27:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/07 11:27:47 INFO SparkContext: Successfully stopped SparkContext
16/07/07 11:27:47 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput already exists)
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/07 11:27:47 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/07/07 11:27:47 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1467889642439_0001
16/07/07 11:27:47 INFO ShutdownHookManager: Shutdown hook called
16/07/07 11:27:47 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1467889642439_0001/spark-7f836950-a040-4216-9308-2bb4565c5649
It creates a "_temporary" directory at the output location, which contains empty part files.
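For context, deleting the destination from the driver itself, through the same FileSystem the job writes to, would look roughly like the sketch below; the tiny RDD is only a stand-in for the real output, and the path is the one from the spark-submit command above.

import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object OverwriteOutputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("overwrite-output"))
    val output = "s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/"

    // Remove any leftover output (including a stale _temporary directory from a
    // failed attempt) before writing, via the FileSystem that owns the path.
    val fs = FileSystem.get(new URI(output), sc.hadoopConfiguration)
    fs.delete(new Path(output), true)

    sc.parallelize(Seq("a", "b", "c")).saveAsTextFile(output)
    sc.stop()
  }
}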
In short: make sure the Scala versions of spark-core and scala-library are consistent.
I encountered the same problem. When saving the file to HDFS, it throws the exception org.apache.hadoop.mapred.FileAlreadyExistsException.
I then checked the HDFS directory: there is an empty temporary folder, TARGET_DIR/_temporary/0.
You can submit the job with detailed output enabled: ./spark-submit --verbose. Then look at the full context and log; there must be another error that caused this.
While my job was in the RUNNING state, this first error was thrown:
17/04/23 11:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
The job is then retried and re-executed. On re-execution it finds that the directory has already been created, and so it also throws "directory already exists".
This confirmed that the first error is a version-compatibility issue: the Spark version is 2.1.0, the corresponding spark-core is built for Scala 2.11, but the scala-library dependency is at Scala 2.12.xx.
Once the two Scala versions are made consistent (usually by changing the scala-library version), the first exception goes away and the job can finish normally (FINISHED).
pom.xml example:
<!-- Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.7</version>
</dependency>
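If in doubt, the scala-library that actually wins on the runtime classpath can be printed from the job itself; a small sketch:

object ScalaVersionCheck {
  def main(args: Array[String]): Unit = {
    // Should report 2.11.x here, matching the _2.11 suffix of the spark-core artifact.
    println(scala.util.Properties.versionNumberString)
  }
}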
I start the Spark master using ./sbin/start-master.sh, as described at:
http://spark.apache.org/docs/latest/spark-standalone.html
I then submit the Spark job:
sh ./bin/spark-submit \
--class simplespark.Driver \
--master spark://localhost:7077 \
C:\\Users\\Adrian\\workspace\\simplespark\\target\\simplespark-0.0.1-SNAPSHOT.jar
How can I run a simple app which demonstrates a parallel task running?
When I view http://localhost:4040/executors/ and http://localhost:8080/ there are no tasks running:
The .jar I'm running (simplespark-0.0.1-SNAPSHOT.jar) just contains a single Scala object :
package simplespark

import org.apache.spark.SparkContext

object Driver {
  def main(args: Array[String]) {
    val conf = new org.apache.spark.SparkConf()
      .setMaster("local")
      .setAppName("knn")
      .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
      .set("spark.executor.memory", "2g");
    val sc = new SparkContext(conf);
    val l = List(1)
    sc.parallelize(l)
    while (true) {}
  }
}
Update: When I change --master spark://localhost:7077 to --master spark://Adrian-PC:7077, I can see updates on the Spark UI.
I have also updated Driver.scala to use the default context, as I'm not sure I was setting it correctly for submitting Spark jobs:
package simplespark

import org.apache.spark.SparkContext

object Driver {
  def main(args: Array[String]) {
    System.setProperty("spark.executor.memory", "2g")
    val sc = new SparkContext();
    val l = List(1)
    val c = sc.parallelize(List(2, 3, 5, 7)).count()
    println(c)
    sc.stop
  }
}
On the Spark console I receive the same message repeated multiple times:
14/12/26 20:08:32 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
So it appears that the Spark job is not reaching the master?
Update 2: After I start the worker (thanks to Lomig Mégard's comment below) using:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://Adrian-PC:7077
I receive this error:
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Executor app-20141227212351-0003/8 removed: java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor added: app-20141227212351-0003/9 on worker-20141227211411-Adrian-PC-58199 (Adrian-PC:58199) with 4 cores
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141227212351-0003/9 on hostPort Adrian-PC:58199 with 4 cores, 2.0 GB RAM
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor updated: app-20141227212351-0003/9 is now RUNNING
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor updated: app-20141227212351-0003/9 is now FAILED (java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified)
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Executor app-20141227212351-0003/9 removed: java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified
14/12/27 21:23:52 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/12/27 21:23:52 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED
14/12/27 21:23:52 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at Driver.scala:14)
14/12/27 21:23:52 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
I'm running the scripts on Windows using Cygwin. To fix this error I copied the Spark installation to the Cygwin C:\ drive, but then I receive a new error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
You have to start the actual computation to see the job.
val c = sc.parallelize(List(2, 3, 5, 7)).count()
println(c)
Here count is an action; you need at least one of them to begin a job. You can find the list of available actions in the Spark docs.
The other methods are called transformations. They are lazily executed.
Don't forget to stop the context at the end, instead of your infinite loop, with sc.stop().
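To make the laziness concrete, here is a minimal self-contained sketch (local master and arbitrary numbers, just for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object ActionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("action-demo").setMaster("local[*]"))

    val numbers = sc.parallelize(List(2, 3, 5, 7)) // no job yet
    val doubled = numbers.map(_ * 2)               // transformation: still lazy

    val total = doubled.reduce(_ + _)              // action: this triggers the job
    println(total)

    sc.stop()                                      // stop the context instead of looping forever
  }
}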
Edit: For the updated question, you allocate more memory to the executor than there is available in the worker. The defaults should be fine for simple tests.
You also need to have a running worker linked to your master. See this doc to start it.
./sbin/start-master.sh
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
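Putting both pieces of advice together, a minimal driver for the standalone setup above might look like this; the hostname and memory figure are only examples, and leaving spark.executor.memory unset would simply use the default.

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneDriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("knn")
      .setMaster("spark://Adrian-PC:7077")
      .set("spark.executor.memory", "512m") // keep this below what the worker advertises

    val sc = new SparkContext(conf)
    println(sc.parallelize(List(2, 3, 5, 7)).count()) // an action, so a job actually runs
    sc.stop()
  }
}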