Scala Spark application is not exiting with the proper status

I am working on a standalone Spark application with the following code structure:
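sealed trait JobStatus
case object Pass extends JobStatus
case object Fail extends JobStatus
def main(args: Array[String]): Unit = {
  var jobStatus: JobStatus = Pass
  try {
    // calls a REST API; based on the response it either succeeds or throws an exception
    callRestApi()
  } catch {
    // catch all exceptions
    case _: Exception => jobStatus = Fail
  } finally {
    jobStatus match {
      //case Pass => System.exit(0)
      case Pass => return(0)
      //case Fail => System.exit(1)
      case Fail => return(1)
    }
  }
}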
If I use System.exit(0) or System.exit(1), the Spark application always exits with a FAILED status (even when there is no exception).
If I use return(0) or return(1), the Spark application always ends with a SUCCEEDED status (even when there is an exception).
Versions used:
Scala version: 2.11.0
Spark version: 2.2.0
Spark log (return(0) or return(1) case):
INFO SparkContext: Successfully stopped SparkContext
INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
INFO ShutdownHookManager: Shutdown hook called
Spark log (System.exit(0) or System.exit(1) case):
INFO SparkContext: Successfully stopped SparkContext
INFO ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Shutdown hook called before final status was reported.)
INFO AMRMClientImpl: Waiting for application to be successfully unregistered
I am looking for suggestions on how to make the application end with the proper status. Let me know if you need any other details.
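A minimal sketch of one way around this, assuming Spark 2.x on YARN in cluster mode. As far as I can tell from Spark's YARN ApplicationMaster, the final status is SUCCEEDED when the user class's main method returns normally and FAILED when it throws, while a System.exit call from user code runs the shutdown hook before the final status is reported (the "exitCode: 16 ... Shutdown hook called before final status was reported" line in the log above). This would also explain why return(1) reports success: main returns Unit, so the 1 is discarded and main simply completes normally. Under that assumption, the job computes its status and throws on failure instead of calling System.exit or returning from finally. The sketch is in Java; callRestApi, runJob and finish are hypothetical names, and the boolean only simulates the REST call's outcome.

```java
// Sketch (assumption: YARN cluster mode, where a normal return from main is reported
// as SUCCEEDED and an exception thrown out of main as FAILED).
final class ExitStatusSketch {

    enum JobStatus { PASS, FAIL }

    // Hypothetical stand-in for the REST call in the question: throws when the call fails.
    static void callRestApi(boolean succeed) {
        if (!succeed) {
            throw new IllegalStateException("REST API reported a failure");
        }
    }

    // Mirrors the question's try/catch: any exception turns the status into FAIL.
    static JobStatus runJob(boolean succeed) {
        try {
            callRestApi(succeed);
            return JobStatus.PASS;
        } catch (RuntimeException e) {
            return JobStatus.FAIL;
        }
    }

    // No System.exit and no return inside finally: success just returns,
    // failure is signalled by throwing out of main.
    static void finish(JobStatus status) {
        if (status == JobStatus.FAIL) {
            throw new RuntimeException("Job failed");
        }
    }

    public static void main(String[] args) {
        // Hypothetical: pass "false" as the first argument to simulate a failed REST call.
        boolean succeed = args.length == 0 || Boolean.parseBoolean(args[0]);
        finish(runJob(succeed));
        // Falling off the end of main is what the ApplicationMaster treats as success
        // (per the assumption above).
    }
}
```

In Scala the same idea is simply to drop the finally block and end main with something like `if (jobStatus == Fail) throw new RuntimeException("Job failed")`.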

Related

Where to find Spark logs in Dataproc when running a job in cluster mode

I am running the following code as a job in Dataproc.
I could not find the logs in the console while running in 'cluster' mode.
import sys
import time
from datetime import datetime
from pyspark.sql import SparkSession
start_time = datetime.utcnow()
spark = SparkSession.builder.appName("check_confs").getOrCreate()
all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")
incoming_args = sys.argv
if len(incoming_args) > 1:
    sleep_time = int(incoming_args[1])
    print("Sleep time is {} seconds".format(sleep_time))
    if sleep_time > 0:
        time.sleep(sleep_time)
end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))
If I trigger the job with the "spark.submit.deployMode": "cluster" property, I cannot see the corresponding logs.
But if the job is triggered in the default mode, which is client mode, I can see the respective logs.
Here is the dictionary used for triggering the job:
{
'placement': {
'cluster_name': dataproc_cluster
},
'pyspark_job': {
'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
'args': trigger_params,
"properties": {
"spark.submit.deployMode": "cluster",
"spark.executor.memory": "3155m",
"spark.scheduler.mode": "FAIR",
}
}
}
Console output while running in client mode (the driver logs appear):
21/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
21/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
21/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
21/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
=====
Executing at 2021-12-07 19:11:35.100277
[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]
======================
Sleep time is 1 seconds
Script execution completed in 9.411261 seconds
21/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
Console output while running in cluster mode (the driver logs do not appear):
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
21/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013
When running jobs in cluster mode, the driver logs are in the Cloud Logging yarn-userlogs log. From the Dataproc documentation:
By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained, below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.
You can access the logs with a query like the following in the Logs Explorer in Google Cloud:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="my_cluster_name"
resource.labels.cluster_uuid="aaaaa-123435-bbbbbb-ccccc"
severity=DEFAULT
jsonPayload.container_logname="stdout"
jsonPayload.message!=""
log_name="projects/my-project_id/logs/yarn-userlogs"
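If you want to generate this query programmatically, for example from the same tool that submits the job, a minimal Java sketch could look like the following. The build method and its parameters are hypothetical; the field names are copied verbatim from the query above, each clause goes on its own line (the Logging query language treats whitespace-separated comparisons as implicitly ANDed), and the values are not escaped, so it assumes they contain no double quotes.

```java
// Hypothetical helper that rebuilds the Logs Explorer query above for a given cluster.
final class YarnUserlogsQuery {

    // projectId, clusterName and clusterUuid are your own values (not escaped).
    static String build(String projectId, String clusterName, String clusterUuid) {
        return String.join("\n",
                "resource.type=\"cloud_dataproc_cluster\"",
                "resource.labels.cluster_name=\"" + clusterName + "\"",
                "resource.labels.cluster_uuid=\"" + clusterUuid + "\"",
                "severity=DEFAULT",
                "jsonPayload.container_logname=\"stdout\"",
                "jsonPayload.message!=\"\"",
                "log_name=\"projects/" + projectId + "/logs/yarn-userlogs\"");
    }

    public static void main(String[] args) {
        // Prints the query for the placeholder values used in the answer.
        System.out.println(build("my-project_id", "my_cluster_name", "aaaaa-123435-bbbbbb-ccccc"));
    }
}
```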

Illegal access error when deleting a Google Pub/Sub subscription upon JVM shutdown

I'm trying to delete a Google Pub/Sub subscription in a JVM shutdown hook, but I'm encountering an illegal access error from the Google Pub/Sub subscription admin client when the shutdown hook runs. I've tried both sys.addShutdownHook and Runtime.getRuntime().addShutdownHook, but I get the same error either way.
val deleteInstanceCacheSubscriptionThread = new Thread {
  override def run(): Unit = {
    cacheUpdateService.deleteInstanceCacheUpdateSubscription()
  }
}
sys.addShutdownHook(deleteInstanceCacheSubscriptionThread.run())
// Runtime.getRuntime().addShutdownHook(deleteInstanceCacheSubscriptionThread)
This is the stack trace:
Exception in thread "shutdownHook1" java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [META-INF/services/com.google.auth.http.HttpTransportFactory]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1385)
at org.apache.catalina.loader.WebappClassLoaderBase.findResources(WebappClassLoaderBase.java:985)
at org.apache.catalina.loader.WebappClassLoaderBase.getResources(WebappClassLoaderBase.java:1086)
at java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:348)
at java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
at java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
at com.google.common.collect.Iterators.getNext(Iterators.java:845)
at com.google.common.collect.Iterables.getFirst(Iterables.java:779)
at com.google.auth.oauth2.OAuth2Credentials.getFromServiceLoader(OAuth2Credentials.java:318)
at com.google.auth.oauth2.ServiceAccountCredentials.<init>(ServiceAccountCredentials.java:145)
at com.google.auth.oauth2.ServiceAccountCredentials.createScoped(ServiceAccountCredentials.java:505)
at com.google.api.gax.core.GoogleCredentialsProvider.getCredentials(GoogleCredentialsProvider.java:92)
at com.google.api.gax.rpc.ClientContext.create(ClientContext.java:142)
at com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub.create(GrpcSubscriberStub.java:263)
at com.google.cloud.pubsub.v1.stub.SubscriberStubSettings.createStub(SubscriberStubSettings.java:242)
at com.google.cloud.pubsub.v1.SubscriptionAdminClient.<init>(SubscriptionAdminClient.java:178)
at com.google.cloud.pubsub.v1.SubscriptionAdminClient.create(SubscriptionAdminClient.java:159)
at com.google.cloud.pubsub.v1.SubscriptionAdminClient.create(SubscriptionAdminClient.java:150)
at com.company.pubsub.services.GooglePubSubService.$anonfun$deleteSubscription$2(GooglePubSubService.scala:384)
at com.company.utils.TryWithResources$.withResources(TryWithResources.scala:21)
at com.company.pubsub.services.GooglePubSubService.$anonfun$deleteSubscription$1(GooglePubSubService.scala:384)
at com.company.scalalogging.Logging.time(Logging.scala:43)
at com.company.scalalogging.Logging.time$(Logging.scala:35)
at com.company.pubsub.services.GooglePubSubService.time(GooglePubSubService.scala:30)
at com.company.pubsub.services.GooglePubSubService.deleteSubscription(GooglePubSubService.scala:382)
at com.company.cache.services.CacheUpdateService.deleteInstanceCacheUpdateSubscription(CacheUpdateService.scala:109)
at com.company.cache.services.CacheUpdateHandlerService$$anon$1.run(CacheUpdateHandlerService.scala:132)
at com.company.cache.services.CacheUpdateHandlerService$.$anonfun$addSubscriptionShutdownHook$1(CacheUpdateHandlerService.scala:135)
at scala.sys.ShutdownHookThread$$anon$1.run(ShutdownHookThread.scala:37)
It seems that by the time the shutdown hook runs, the Pub/Sub library has already shut down, so we can't access the subscription admin client anymore. But I was wondering whether there is any way to delete the subscription before this happens.

How to set up the Narayana ConnectionManager so it doesn't stop after some transactions

I'm using Spring Boot, Spring Session and JTA Narayana (Arjuna). I'm sending select and insert statements in a loop from two different threads.
The application runs correctly for some time, but after a number of transactions the Arjuna ConnectionManager fails to get a connection and generates the following exception:
2019-10-05 22:48:20.724 INFO 27032 --- [o-auto-1-exec-4] c.m.m.db.PrepareStatementExec : START select
2019-10-05 22:49:20.225 WARN 27032 --- [nsaction Reaper] com.arjuna.ats.arjuna : ARJUNA012117: TransactionReaper::check timeout for TX 0:ffffc0a82101:c116:5d989ef0:6e in state RUN
2019-10-05 22:49:20.228 WARN 27032 --- [Reaper Worker 0] com.arjuna.ats.arjuna : ARJUNA012095: Abort of action id 0:ffffc0a82101:c116:5d989ef0:6e invoked while multiple threads active within it.
2019-10-05 22:49:20.234 WARN 27032 --- [Reaper Worker 0] com.arjuna.ats.arjuna : ARJUNA012381: Action id 0:ffffc0a82101:c116:5d989ef0:6e completed with multiple threads - thread http-nio-auto-1-exec-10 was in progress with java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
com.arjuna.ats.internal.jdbc.ConnectionManager.create(ConnectionManager.java:134)
com.arjuna.ats.jdbc.TransactionalDriver.connect(TransactionalDriver.java:89)
java.sql.DriverManager.getConnection(DriverManager.java:664)
java.sql.DriverManager.getConnection(DriverManager.java:208)
com.mono.multidatasourcetest.db.PrepareStatementExec.executeUpdate(PrepareStatementExec.java:51)
The source code is on GitHub: https://github.com/saavedrah/multidataset-test
I'm wondering if the connection should be closed or if I should change some settings in Arjuna to make the ConnectionManager work.
Although what you are showing is a stack trace printed by the Narayana BasicAction class (rather than a thrown exception), the result for you is ultimately the same: you need to close your connections.
You should most likely add the close() call near the place where you make the getConnection calls within https://github.com/saavedrah/multidataset-test/blob/cf910c345db079a4e910a071ac0690af28bd3f81/src/main/java/com/mono/multidatasourcetest/db/PrepareStatementExec.java#L38
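e.g.
// url, user and password stand for your own connection settings
Connection connection = DriverManager.getConnection(url, user, password);
try {
    // do something with it
} finally {
    connection.close();
}
But as Connection is AutoCloseable, you could just use try-with-resources:
try (Connection connection = DriverManager.getConnection(url, user, password)) {
    // do something with the connection; it is closed automatically at the end of the block
}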
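To see why closing matters, here is a small stdlib-only sketch. TrackedConnection is a hypothetical stand-in for a JDBC Connection that counts how many connections are currently checked out. It shows that a close() placed after the work is skipped whenever the work throws, while try-with-resources closes on both the normal and the exceptional path. With a real transactional pool, each leaked connection is one that is never handed back, which would plausibly explain ConnectionManager.create eventually waiting in Object.wait as in the stack trace above.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CloseSketch {

    // Number of hypothetical connections that have been opened but not closed.
    static final AtomicInteger open = new AtomicInteger();

    // Hypothetical stand-in for a pooled JDBC Connection.
    static final class TrackedConnection implements AutoCloseable {
        TrackedConnection() {
            open.incrementAndGet();
        }

        // Simulates a statement that may fail.
        void executeUpdate(boolean fail) {
            if (fail) {
                throw new IllegalStateException("statement failed");
            }
        }

        @Override
        public void close() {
            open.decrementAndGet();
        }
    }

    // Leaky version: close() is never reached when executeUpdate throws.
    static void leaky(boolean fail) {
        TrackedConnection c = new TrackedConnection();
        c.executeUpdate(fail);
        c.close();
    }

    // try-with-resources: close() runs whether or not executeUpdate throws.
    static void safe(boolean fail) {
        try (TrackedConnection c = new TrackedConnection()) {
            c.executeUpdate(fail);
        }
    }
}
```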

Getting an exception when calling block() on a Mono returned by a ReactiveMongoRepository

I have a service that streams data to a second service, which receives the stream of objects and saves them to my MongoDB.
Inside my subscribe function on the Flux that I get from the streaming service, I use the save method from the ReactiveMongoRepository interface.
When I call block() to get the saved data, I get the following error:
2019-10-11 13:30:38.559 INFO 19584 --- [localhost:27017] org.mongodb.driver.connection : Opened connection [connectionId{localValue:1, serverValue:25}] to localhost:27017
2019-10-11 13:30:38.566 INFO 19584 --- [localhost:27017] org.mongodb.driver.cluster : Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 0, 1]}, minWireVersion=0, maxWireVersion=7, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=6218300}
2019-10-11 13:30:39.158 INFO 19584 --- [ctor-http-nio-4] quote-monitor-service : onNext(Quote(id=null, ticker=AAPL, price=164.8, instant=2019-10-11T10:30:38.800Z))
2019-10-11 13:30:39.411 INFO 19584 --- [ctor-http-nio-4] quote-monitor-service : cancel()
2019-10-11 13:30:39.429 INFO 19584 --- [ntLoopGroup-2-2] org.mongodb.driver.connection : Opened connection [connectionId{localValue:3, serverValue:26}] to localhost:27017
2019-10-11 13:30:39.437 WARN 19584 --- [ctor-http-nio-4] io.netty.util.ReferenceCountUtil : Failed to release a message: DefaultHttpContent(data: PooledSlicedByteBuf(freed), decoderResult: success)
io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
at io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:74) ~[netty-common-4.1.39.Final.jar:4.1.39.Final]
at io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:138) ~[netty-common-4.1.39.Final.jar:4.1.39.Final]
at
reactor.core.Exceptions$ErrorCallbackNotImplemented: java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-4
Caused by: java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-4
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:77) ~[reactor-core-3.2.12.RELEASE.jar:3.2.12.RELEASE]
at reactor.core.publisher.Mono.block(Mono.java:1494) ~[reactor-core-3.2.12.RELEASE.jar:3.2.12.RELEASE]
at
My code:
stockQuoteClient.getQuoteStream()
.log("quote-monitor-service")
.subscribe(quote -> {
Mono<Quote> savedQuote = quoteRepository.save(quote);
System.out.println("I saved a quote! Id: " +savedQuote.block().getId());
});
After some digging, I managed to get it to work, but I don't understand why it works now.
The new code:
stockQuoteClient.getQuoteStream()
.log("quote-monitor-service")
.subscribe(quote -> {
Mono<Quote> savedQuote = quoteRepository.insert(quote);
savedQuote.subscribe(result ->
System.out.println("I saved a quote! Id :: " + result.getId()));
});
The definition of block(): Subscribe to this Mono and block indefinitely until a next signal is received.
The definition of subscribe(): Subscribe to this Mono and request unbounded demand.
Can someone help me understand why block() didn't work and subscribe() did?
What am I missing here?
Blocking is bad, since it ties up a thread waiting for a response. It's especially bad in a reactive framework, which has few threads at its disposal and is designed so that none of them is ever unnecessarily blocked.
This is the very thing that reactive frameworks are designed to avoid, so in this case it simply stops you doing it:
block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-4
Your new code, in contrast, works asynchronously. The thread isn't blocked, as nothing actually happens until the repository returns a value (and then the lambda that you passed to savedQuote.subscribe() is executed, printing your result to the console).
However, the new code still isn't optimal or idiomatic from a reactive streams perspective, as you're doing all your logic in your subscribe method. The normal thing to do is to use a series of flatMap/map calls to transform the items in the stream, and use doOnNext() for side effects (such as printing out a value):
stockQuoteClient.getQuoteStream()
.log("quote-monitor-service")
.flatMap(quoteRepository::insert)
.doOnNext(result -> System.out.println("I saved a quote! Id :: " + result.getId()))
.subscribe();
If you're doing any serious amount of work with reactor / reactive streams, it would be worth reading up on them in general. They're very powerful for non-blocking work, but they do require a different way of thinking (and coding) than more "standard" Java.
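Reactor isn't part of the JDK, so the sketch below swaps in java.util.concurrent.CompletableFuture purely as an analogy for the difference between blocking and subscribing. insert is a hypothetical stand-in for quoteRepository.insert(quote) that completes later on another thread; thenAccept plays the role of the callback you pass to subscribe() or doOnNext(), whereas calling join() inside a callback would be the equivalent of block(). The calling thread returns immediately and is never held waiting for the save.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class NonBlockingSaveSketch {

    // Hypothetical stand-in for quoteRepository.insert(quote): completes asynchronously
    // on a thread from the given executor and yields a generated id.
    static CompletableFuture<String> insert(String ticker, Executor executor) {
        return CompletableFuture.supplyAsync(() -> "id-" + ticker, executor);
    }

    // Non-blocking: the side effect is attached as a callback and runs once the save
    // completes; nothing here waits for the result.
    static CompletableFuture<Void> saveAndLog(String ticker, Executor executor, List<String> log) {
        return insert(ticker, executor)
                .thenAccept(id -> log.add("I saved a quote! Id :: " + id));
    }
}
```

The returned future is only there so a caller that is allowed to wait (a test, or a main method) can do so; code running on an event-loop thread would just let the callback fire.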

How to catch an exception that occurred on a Spark worker?

val HTF = new HashingTF(50000)
val Tf = Case.map(row => HTF.transform(row)).cache()
val Idf = new IDF().fit(Tf)
try {
  Idf.transform(Tf).map(x => LabeledPoint(1, x))
} catch {
  case ex: Throwable =>
    println(ex.getMessage)
}
Code like this isn't working.
HashingTF/IDF belong to org.apache.spark.mllib.feature.
I'm still getting an exception that says
org.apache.spark.SparkException: Failed to get broadcast_5_piece0 of broadcast_5
I cannot see any of my files in the error log. How do I debug this?
It seems that the worker ran out of memory.
Quick temporary fix:
Run the application without caching; just remove .cache().
How to debug:
The Spark UI will probably have the complete exception details.
Check the stage details.
Check the logs and thread dump in the Executors tab.
If you find multiple exceptions or errors, try to resolve them in sequence.
Most of the time, resolving the first error will resolve the subsequent ones.
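Separately from the memory issue, a try/catch placed only around a transformation like Idf.transform(Tf).map(...) will not see failures from that transformation: RDD transformations such as map are lazy and only record the computation, so the error surfaces when an action (count, collect, saveAsTextFile, ...) runs the job, and that is where the try/catch has to go. Spark isn't available in a standalone snippet, so this sketch uses java.util.stream, which is lazy in the same way, purely as an analogy; the class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyFailureSketch {

    // Like RDD.map, Stream.map only describes the work; 100 / x is not evaluated yet.
    static Stream<Integer> transform(List<Integer> input) {
        return input.stream().map(x -> 100 / x);
    }

    // try/catch around the transformation only: nothing has run, so nothing is caught.
    static boolean caughtAroundTransformation(List<Integer> input) {
        try {
            transform(input);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // try/catch around the terminal operation (the "action"): the failure surfaces here.
    static String caughtAroundAction(List<Integer> input) {
        Stream<Integer> pending = transform(input);
        try {
            return pending.map(Object::toString).collect(Collectors.joining(","));
        } catch (ArithmeticException e) {
            return "caught: " + e.getMessage();
        }
    }
}
```

With the input [1, 0], caughtAroundTransformation returns false while caughtAroundAction returns "caught: / by zero"; in Spark terms, wrap the action (for example a count() on the transformed RDD) rather than the map call.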