I am using spark-1.5.0-cdh5.6.0. tried the sample application (scala)
command is:
> spark-submit --class com.cloudera.spark.simbox.sparksimbox.WordCount --master local /home/hadoop/work/testspark.jar
Got the following error:
ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/user/spark/applicationHistory does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:424)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at com.cloudera.spark.simbox.sparksimbox.WordCount$.main(WordCount.scala:12)
at com.cloudera.spark.simbox.sparksimbox.WordCount.main(WordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark has a feature called "history server" which allows you to browse historical events after the SparkContext dies. This property is set via setting spark.eventLog.enabled to true.
You have two options, either specify a valid directory to store the event log via the spark.eventLog.dir config value, or simply set spark.eventLog.enabled to false if you don't need it.
You can read more on that in the Spark Configuration page.
I got the same error which working with nltk in spark, To fix this I just removed all the nltk related properties from spark-conf.default.
Related
I usually work with Pyspark but I had to deal with a spark streaming job written in Scala. I am running the spark-submit on EMR directly it works but running the same through Airflow throws me the following error. I don't even to where to start debugging the issue. Any ideas would be greatly appreciated.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: com.typesafe.config.ConfigException$IO: available_application.properties -Dlog4j.configuration=log4j-yarn.properties: java.io.FileNotFoundException: available_application.properties -Dlog4j.configuration=log4j-yarn.properties (No such file or directory)
at com.typesafe.config.impl.Parseable.parseValue(Parseable.java:183)
at com.typesafe.config.impl.Parseable.parseValue(Parseable.java:170)
at com.typesafe.config.impl.Parseable.parse(Parseable.java:227)
at com.typesafe.config.ConfigFactory.parseFile(ConfigFactory.java:595)
at com.typesafe.config.ConfigFactory.loadDefaultConfig(ConfigFactory.java:244)
at com.typesafe.config.ConfigFactory.access$000(ConfigFactory.java:38)
at com.typesafe.config.ConfigFactory$1.call(ConfigFactory.java:378)
at com.typesafe.config.ConfigFactory$1.call(ConfigFactory.java:375)
at com.typesafe.config.impl.ConfigImpl$LoaderCache.getOrElseUpdate(ConfigImpl.java:58)
at com.typesafe.config.impl.ConfigImpl.computeCachedConfig(ConfigImpl.java:86)
at com.typesafe.config.ConfigFactory.load(ConfigFactory.java:375)
at com.typesafe.config.ConfigFactory.load(ConfigFactory.java:299)
at com.typesafe.config.ConfigFactory.load(ConfigFactory.java:288)
at com.nike.tdp.AvailabilityKafkaEvents$.main(AvailabilityKafkaEvents.scala:101)
at com.nike.tdp.AvailabilityKafkaEvents.main(AvailabilityKafkaEvents.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.io.FileNotFoundException: available_application.properties -Dlog4j.configuration=log4j-yarn.properties (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at com.typesafe.config.impl.Parseable$ParseableFile.reader(Parseable.java:512)
at com.typesafe.config.impl.Parseable.rawParseValue(Parseable.java:193)
at com.typesafe.config.impl.Parseable.parseValue(Parseable.java:176)
... 19 more
22/10/26 19:14:02 INFO ShutdownHookManager: Shutdown hook called
Caused by: java.io.FileNotFoundException: available_application.properties -Dlog4j.configuration=log4j-yarn.properties
is the main piece of information in the error you've shown.
It looks like you've made a typo in the parameters for running the app and available_application.properties -Dlog4j.configuration=log4j-yarn.properties is interpreted as the configuration file name instead of only available_application.properties (I assume).
Check the parameters used to run your app, maybe quotes in wrong place or missing? Maybe extra whitespace? ...
I have a JAR file generated via a Maven project that works fine when I run it locally via java -jar JARFILENAME.jar. However, when I try to run the same JAR file on Dataproc I get the following error:
22/06/27 13:13:45 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/06/27 13:13:46 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/06/27 13:13:46 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
22/06/27 13:13:49 INFO org.sparkproject.jetty.util.log: Logging initialized #7373ms to org.sparkproject.jetty.util.log.Slf4jLog
22/06/27 13:13:51 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest.getPercentiles([D)Lscala/collection/Seq;
at com.amazon.deequ.analyzers.ApproxQuantile.fromAggregationResult(ApproxQuantile.scala:84)
at com.amazon.deequ.analyzers.ScanShareableAnalyzer.metricFromAggregationResult(Analyzer.scala:192)
at com.amazon.deequ.analyzers.ScanShareableAnalyzer.metricFromAggregationResult$(Analyzer.scala:185)
at com.amazon.deequ.analyzers.ApproxQuantile.metricFromAggregationResult(ApproxQuantile.scala:50)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.successOrFailureMetricFrom(AnalysisRunner.scala:362)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$5(AnalysisRunner.scala:330)
at scala.collection.immutable.List.map(List.scala:297)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.liftedTree1$1(AnalysisRunner.scala:328)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.runScanningAnalyzers(AnalysisRunner.scala:318)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:167)
at com.amazon.deequ.VerificationSuite.doVerificationRun(VerificationSuite.scala:121)
at com.amazon.deequ.VerificationRunBuilder.run(VerificationRunBuilder.scala:173)
at com.amazon.deequ.thesis.GCTestOne$.$anonfun$main$1(GCTestOne.scala:42)
at com.amazon.deequ.thesis.GCTestOne$.$anonfun$main$1$adapted(GCTestOne.scala:11)
at com.amazon.deequ.examples.ExampleUtils$.withSpark(ExampleUtils.scala:32)
at com.amazon.deequ.thesis.GCTestOne$.main(GCTestOne.scala:11)
at com.amazon.deequ.thesis.GCTestOne.main(GCTestOne.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I quite don't get why Dataproc has a NoSuchMethodError when everything runs fine locally.
Someone knows why this is?
Version mismatch with GCP. I had Spark 3.2.1, but the clusters run on 3.1.
I have a DS joined in Spark. I am using scala. I want to partition the DS on a column "date". I am using the following syntax. The code does not throw Compile Error:
joined.repartition(joined.col("date")).show()
The Scala repartition function is defined as follows:
def repartition(partitionExprs : org.apache.spark.sql.Column*) : org.apache.spark.sql.Dataset[T] = { /* compiled code */ }
However when I execute the code in Spark, I get the following error in Run time. What am I doing wrong?
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ScalaRunTime$.wrapRefArray([Ljava/lang/Object;)Lscala/collection/immutable/ArraySeq;
at example.Stocks$.main(Stocks.scala:38)
at example.Stocks.main(Stocks.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
You are probably compiling to the wrong version of Scala (and probably Spark as well). You can verify your running version with spark-submit --version.
If you can isolate then, may be try to run the same on spark-shell. I usually use it for troubleshooting the spark jobs.
I have a simple spark job which replaces spaces with commas in a given input file.
When this job is submitted locally (Using IDE and executing the built jar) it completes successfully and when the master is set to "yarn-client" the job hangs for very long time and throws the following exception.
We have a usecase where we want to submit the job programatically rather than building a jar and submitting it through spark-submit.
Spark version : 1.6.1
Hadoop version : 2.7.1
and i got all the spark, yarn and hadoop dependencies in my pom.
Job failed due to following exception
java.net.ConnectException: Call From spark.node123.com/192.168.2.1 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:129)
at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:129)
at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:62)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:128)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at tardis.platform.TardisContext$.apply(TardisContext.scala:20)
at tardis.common.plugins.Heartbeat.isAbleTocreateContext(Heartbeat.scala:45)
at tardis.common.plugins.Heartbeat.performAction(Heartbeat.scala:33)
at tardis.core.scheduler.jobs.PluginExecutorJob.execute(PluginExecutorJob.scala:40)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
at org.apache.hadoop.ipc.Client.call(Client.java:1446)
... 25 more
I had to add the hadoop and yarn configurations to successfully submit the application in yarn-client mode.
You can not remotely submit your spark job in client mode since your computer have to run the driver program itself which require a lot of connection. If you insist using this method, you have to config your firewall to allow some port to connect to the cluster. Using cluster mode or submit it from master node are much less painful.
I am using spark-1.5.2 with scala-2.11.7, after successfully build with sbt/sbt assembly when i run ./bin/spark-shell i got below error.
16/02/10 19:20:22 ERROR SparkContext: Error initializing SparkContext.
akka.ConfigurationException: Akka JAR version [2.3.4] does not match the provided config version [2.3.11]
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:209)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:245)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:52)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:247)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
at $line3.$read$$iwC$$iwC.<init>(<console>:9)
at $line3.$read$$iwC.<init>(<console>:18)
at $line3.$read.<init>(<console>:20)
at $line3.$read$.<init>(<console>:24)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Spark-shell starts successfully but SparkContext was not created.
Does any one know how to deal with Akka Jar version mismatch?