Where to find Spark logs in Dataproc when running a job in cluster mode - PySpark

I am running the following code as a job in Dataproc.
I cannot find the logs in the console while running in 'cluster' mode.
import sys
import time
from datetime import datetime

from pyspark.sql import SparkSession

start_time = datetime.utcnow()
spark = SparkSession.builder.appName("check_confs").getOrCreate()
all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")

incoming_args = sys.argv
if len(incoming_args) > 1:
    sleep_time = int(incoming_args[1])
    print("Sleep time is {} seconds".format(sleep_time))
    if sleep_time > 0:
        time.sleep(sleep_time)

end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))
If I trigger the job with the deployMode property set to cluster, I cannot see the corresponding logs.
But if the job is triggered in the default mode, which is client mode, I am able to see the respective logs.
Below is the dictionary used for triggering the job, which sets "spark.submit.deployMode": "cluster":
{
    'placement': {
        'cluster_name': dataproc_cluster
    },
    'pyspark_job': {
        'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
        'args': trigger_params,
        "properties": {
            "spark.submit.deployMode": "cluster",
            "spark.executor.memory": "3155m",
            "spark.scheduler.mode": "FAIR",
        }
    }
}
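For context, a minimal sketch of how a dictionary like this might be submitted, assuming the google-cloud-dataproc Python client library (v2+); the project ID, region, bucket, and cluster name below are placeholders rather than values from the question:

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},  # placeholder cluster
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/check_confs.py",
        "args": ["1"],  # sleep time in seconds, read by the script via sys.argv
        "properties": {"spark.submit.deployMode": "cluster"},
    },
}

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "{}-dataproc.googleapis.com:443".format(region)}
)
operation = client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
job_result = operation.result()  # blocks until the job finishes
print(job_result.status.state)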
Console output when the job runs in client mode:

21/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
21/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
21/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
21/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
=====
Executing at 2021-12-07 19:11:35.100277
[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]
======================
Sleep time is 1 seconds
Script execution completed in 9.411261 seconds
21/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
Logs not appearing in the console while running in cluster mode:
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
21/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013

When running jobs in cluster mode, the driver logs are in Cloud Logging under yarn-userlogs. See the doc:
By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained, below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.

We can access the logs using a query in the Logs Explorer in Google Cloud:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="my_cluster_name"
resource.labels.cluster_uuid="aaaaa-123435-bbbbbb-ccccc"
severity=DEFAULT
jsonPayload.container_logname="stdout"
jsonPayload.message!=""
log_name="projects/my-project_id/logs/yarn-userlogs"
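The same filter can also be run programmatically. A minimal sketch, assuming the google-cloud-logging Python client library; the project ID and cluster values are the placeholders from the query above:

from google.cloud import logging

client = logging.Client(project="my-project_id")

# Same filter as the Logs Explorer query above (newlines act as implicit AND).
log_filter = '''
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="my_cluster_name"
resource.labels.cluster_uuid="aaaaa-123435-bbbbbb-ccccc"
severity=DEFAULT
jsonPayload.container_logname="stdout"
jsonPayload.message!=""
log_name="projects/my-project_id/logs/yarn-userlogs"
'''

for entry in client.list_entries(filter_=log_filter):
    print(entry.payload)  # each entry's payload holds a line of driver stdout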

Related

Scala Spark Application is not exiting with proper status

I am working on a Spark standalone application which has the below code structure:
var jobStatus = Pass   // Pass / Fail are status markers (e.g. case objects)
try {
  // calls a REST API; based on the status it either throws an exception or succeeds
} catch {
  case e: Exception =>  // catch all exceptions
    jobStatus = Fail
} finally {
  jobStatus match {
    // case Pass => System.exit(0)
    case Pass => return(0)
    // case Fail => System.exit(1)
    case Fail => return(1)
  }
}
So, if I use System.exit(0) or System.exit(1), the Spark application always exits with a Failed status (even though there are no exceptions).
If I use return(0) or return(1), the Spark application always ends with a Success status (even though there is an exception).
Versions used :
Scala Version : 2.11.0
Spark version: 2.2.0
Spark log (return(0) or return(1) case):
INFO SparkContext: Successfully stopped SparkContext
INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
INFO ShutdownHookManager: Shutdown hook called
Spark log (System.exit(0) or System.exit(1) case):
INFO SparkContext: Successfully stopped SparkContext
INFO ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Shutdown hook called before final status was reported.)
INFO AMRMClientImpl: Waiting for application to be successfully unregistered
I am looking for suggestions or inputs to fix this scenario so the application ends with the proper status. Let me know if you need any other details.

Failed to establish connection to Neo4j using bolt scheme even after successfully enabling Bolt

I want to connect to a Neo4j database using my credentials. I am tunnelling into a machine and, once that is done, I open my browser at localhost:7474.
I tried both the bolt and neo4j schemes to connect to the URL:
bolt://<node_ip>:7687 and neo4j://<node_ip>:7687, but the connection times out.
I tried checking the logs but only found that the bolt scheme is enabled:
bash-4.2$ tail -f /logs/debug.log
2021-07-02 21:26:03.323+0000 WARN [o.n.k.a.p.GlobalProcedures] Failed to load `org.apache.commons.logging.impl.LogKitLogger` from plugin jar `/home/sandbox/neo/plugins/apoc-4.2.0.2-all.jar`: org/apache/log/Logger
2021-07-02 21:26:03.946+0000 INFO [c.n.m.g.GlobalMetricsExtension] Sending metrics to CSV file at /home/sandbox/neo/metrics
2021-07-02 21:26:03.973+0000 INFO [o.n.b.BoltServer] Bolt enabled on 0.0.0.0:7687.
2021-07-02 21:26:03.974+0000 INFO [o.n.b.BoltServer] Bolt (Routing) enabled on 0.0.0.0:7688.
2021-07-02 21:26:03.974+0000 INFO [o.n.s.AbstractNeoWebServer$ServerComponentsLifecycleAdapter] Starting web server
2021-07-02 21:26:04.001+0000 INFO [o.n.s.m.ThirdPartyJAXRSModule] Mounted unmanaged extension [n10s.endpoint] at [/rdf]
2021-07-02 21:26:05.341+0000 INFO [c.n.s.e.EnterpriseNeoWebServer] Remote interface available at http://<node_ip>:7474/
2021-07-02 21:26:05.341+0000 INFO [o.n.s.AbstractNeoWebServer$ServerComponentsLifecycleAdapter] Web server started.
2021-07-02 21:35:34.565+0000 INFO [c.n.c.c.c.l.s.Segments] [system/00000000] Pruning SegmentFile{path=raft.log.0, header=SegmentHeader{formatVersion=2, recordOffset=56, prevFileLastIndex=-1, segmentNumber=0, prevIndex=-1, prevTerm=-1}}
2021-07-02 21:35:46.079+0000 INFO [c.n.c.c.c.l.s.Segments] [neo4j/32f6599b] Pruning SegmentFile{path=raft.log.0, header=SegmentHeader{formatVersion=2, recordOffset=56, prevFileLastIndex=-1, segmentNumber=0, prevIndex=-1, prevTerm=-1}}
The query log is empty, as I could not execute any query:
bash-4.2$ tail -f query.log
2021-07-02 21:25:52.510+0000 INFO Query started: id:1 - 1009 ms: 0 B - embedded-session neo4j - - call db.clearQueryCaches() - {} - runtime=pipelined - {}
2021-07-02 21:25:52.580+0000 INFO id:1 - 1080 ms: 112 B - embedded-session neo4j - - call db.clearQueryCaches() - {} - runtime=pipelined - {}
The other articles and answers that I read were mostly about misconfiguration (wrong ports), but I don't think that is the case for me, since I checked in the debug.log file that my ports are alright.
FWIW, I am using 3 replicas for my Neo4j and right now, connecting to just one pod.
I am tunnelling both the ports:
ssh -L 7687:$IP:7687 -L 7474:$IP:7474 domain_name.com -N
Perhaps you've already checked this, but if not, can you ensure that port 7687 is also forwarded? When I tunnelled via the browser, my expectation was that 7474 would be sufficient, but it turned out that forwarding 7687 is also necessary.
So, instead of providing localhost in the connection string, I made the silly mistake of writing down the actual IP, and that was the reason for the connection timeout.
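To make that concrete, here is a minimal sketch of connecting through the tunnel with the Neo4j Python driver; the credentials are placeholders, and this assumes both ports are forwarded as above:

from neo4j import GraphDatabase

# Because of the SSH tunnel, connect to localhost, not the node's actual IP.
uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "my-password"))  # placeholder credentials

with driver.session() as session:
    result = session.run("RETURN 1 AS ok")
    print(result.single()["ok"])  # prints 1 if the connection works

driver.close()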

Rundeck: how to correct the configuration of Rundeck to access it via the browser

I have a problem accessing Rundeck:
[2021-05-03T17:33:33,231] WARN beans.GenericTypeAwarePropertyDescriptor - Invalid JavaBean property 'exceptionMappings' being accessed! Ambiguous write methods found next to actually used [public void grails.plugin.springsecurity.web.authentication.AjaxAwareAuthenticationFailureHandler.setExceptionMappings(java.util.List)]: [public void org.springframework.security.web.authentication.ExceptionMappingAuthenticationFailureHandler.setExceptionMappings(java.util.Map)]
[2021-05-03T17:33:41,756] INFO rundeckapp.BootStrap - Starting Rundeck 3.3.10-20210301 (2021-03-02) ...
[2021-05-03T17:33:41,757] INFO rundeckapp.BootStrap - using rdeck.base config property: /var/lib/rundeck
[2021-05-03T17:33:41,768] INFO rundeckapp.BootStrap - loaded configuration: /etc/rundeck/framework.properties
[2021-05-03T17:33:41,805] INFO rundeckapp.BootStrap - RSS feeds disabled
[2021-05-03T17:33:41,806] INFO rundeckapp.BootStrap - Using jaas authentication
[2021-05-03T17:33:41,811] INFO rundeckapp.BootStrap - Preauthentication is disabled
[2021-05-03T17:33:41,918] INFO rundeckapp.BootStrap - Rundeck is ACTIVE: executions can be run.
[2021-05-03T17:33:42,283] WARN rundeckapp.BootStrap - [Development Mode] Usage of H2 database is recommended only for development and testing
[2021-05-03T17:33:42,590] INFO rundeckapp.BootStrap - Rundeck startup finished in 945ms
[2021-05-03T17:33:42,877] INFO rundeckapp.Application - Started Application in 32.801 seconds (JVM running for 35.608)
Grails application running at http://xxx.xxx.xxx.xxx:4440 in environment: production
Session terminated, killing shell...[2021-05-04T10:20:46,596] INFO rundeckapp.BootStrap - Rundeck Shutdown detected
...killed.
Can you help me please? By the way, I have installed a VM under Red Hat and then installed the Rundeck RPM. From my physical machine, when I go to http://rundecknode_ip:4440, the browser returns error 113 (no route to host); on examination of the logs, I see what I posted above. When I do systemctl status rundeck, it is active (running).

@LocatorApplication starts and then immediately stops

Everything seems to be created fine, but once it finishes initializing everything, it just stops.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.data.gemfire.config.annotation.LocatorApplication;

@SpringBootApplication
@LocatorApplication
public class ServerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ServerApplication.class, args);
    }
}
Log:
2020-08-03 10:59:18.250 INFO 7712 --- [ main] o.a.g.d.i.InternalLocator : Locator started on 10.25.209.139[8081]
2020-08-03 10:59:18.250 INFO 7712 --- [ main] o.a.g.d.i.InternalLocator : Starting server location for Distribution Locator on LB183054.dmn1.fmr.com[8081]
2020-08-03 10:59:18.383 INFO 7712 --- [ main] c.f.g.l.LocatorSpringApplication : Started LocatorSpringApplication in 8.496 seconds (JVM running for 9.318)
2020-08-03 10:59:18.385 INFO 7712 --- [m shutdown hook] o.a.g.d.i.InternalDistributedSystem : VM is exiting - shutting down distributed system
2020-08-03 10:59:18.395 INFO 7712 --- [m shutdown hook] o.a.g.i.c.GemFireCacheImpl : GemFireCache[id = 1329087972; isClosing = true; isShutDownAll = false; created = Mon Aug 03 10:59:15 EDT 2020; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
2020-08-03 10:59:18.416 INFO 7712 --- [m shutdown hook] o.a.g.d.i.ClusterDistributionManager : Shutting down DistributionManager 10.25.209.139(locator1:7712:locator)<ec><v0>:41000.
2020-08-03 10:59:18.517 INFO 7712 --- [m shutdown hook] o.a.g.d.i.ClusterDistributionManager : Now closing distribution for 10.25.209.139(locator1:7712:locator)<ec><v0>:41000
2020-08-03 10:59:18.518 INFO 7712 --- [m shutdown hook] o.a.g.d.i.m.g.Services : Stopping membership services
2020-08-03 10:59:18.518 INFO 7712 --- [ip View Creator] o.a.g.d.i.m.g.Services : View Creator thread is exiting
2020-08-03 10:59:18.520 INFO 7712 --- [Server thread 1] o.a.g.d.i.m.g.Services : GMSHealthMonitor server thread exiting
2020-08-03 10:59:18.536 INFO 7712 --- [m shutdown hook] o.a.g.d.i.ClusterDistributionManager : DistributionManager stopped in 120ms.
2020-08-03 10:59:18.537 INFO 7712 --- [m shutdown hook] o.a.g.d.i.ClusterDistributionManager : Marking DistributionManager 10.25.209.139(locator1:7712:locator)<ec><v0>:41000 as closed.
Yes, this is the expected behavior, OOTB.
Most Apache Geode processes (clients (i.e. ClientCache), Locators, Managers and "peer" Cache nodes/members of a cluster/distributed system) only create daemon Threads (i.e. non-blocking Threads). Therefore, the Apache Geode JVM process will startup, initialize itself and then shutdown immediately.
Only an Apache Geode CacheServer process (a "peer" Cache that has a CacheServer component to listen for client connections), starts and continues to run. That is because the ServerSocket used to listen for client Socket connections is created on a non-daemon Thread (i.e. blocking Thread), which prevents the JVM process from shutting down. Otherwise, a CacheServer would fall straight through as well.
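As an aside, the daemon vs. non-daemon thread distinction is not specific to Geode. A minimal Python sketch of the general idea (purely illustrative, not Geode code):

import threading
import time

def serve_forever():
    # Stand-in for a listener loop, analogous to the ServerSocket accept loop described above.
    while True:
        time.sleep(1)

# With daemon=True, nothing keeps the process alive once the main thread finishes --
# just like a Locator/peer member whose threads are all daemon threads.
# Change daemon to False and the process keeps running, like a CacheServer's listener thread.
threading.Thread(target=serve_forever, daemon=True).start()

print("main finished; only daemon threads remain, so the process exits now")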
You might be thinking, well, how does Gfsh prevent Locators (i.e. using the start locator command) and "servers" (i.e. using the start server command) from shutting down?
NOTE: By default, Gfsh creates a CacheServer instance when starting a GemFire/Geode server using the start server command. The CacheServer component of the "server" can be disabled by specifying the --disable-default-server option to the start server command. In this case, this "server" will not be able to serve clients. Still the peer node/member will continue to run, but not without extra help. See here for more details on the start server Gfsh command.
So, how does Gfsh prevent the processes from falling through?
Under-the-hood, Gfsh uses the LocatorLauncher and ServerLauncher classes to configure and fork the JVM processes to launch Locators and servers, respectively.
By way of example, here is Gfsh's start locator command using the LocatorLauncher class. Technically, it uses the configuration from the LocatorLauncher class instance to construct (and specifically, here) the java command-line used to fork and launch (and specifically, here) a separate JVM process.
However, the key here is the specific "command" passed to the LocatorLauncher class when starting the Locator, which is the START command (here).
In the LocatorLauncher class, we see that the START command does the following, from the main method, to the run method, it starts the Locator, then waitsOnLocator (with implementation).
Without the wait, the Locator would fall straight through as you are experiencing.
You can simulate the same effect (i.e. "falling straight through") using the following code, which uses the Apache Geode API to configure and launch a Locator (in-process).
import org.apache.geode.distributed.LocatorLauncher;

public class ApacheGeodeLocatorApplication {

    public static void main(String[] args) {

        LocatorLauncher locatorLauncher = new LocatorLauncher.Builder()
            .set("jmx-manager", "true")
            .set("jmx-manager-port", "0")
            .set("jmx-manager-start", "true")
            .setMemberName("ApacheGeodeBasedLocator")
            .setPort(0)
            .build();

        locatorLauncher.start();

        //locatorLauncher.waitOnLocator();
    }
}
This simple little program will fall straight through. However, if you uncomment locatorLauncher.waitOnLocator(), then the JVM process will block.
This is not unlike what SDG's LocatorFactoryBean class (see source) is actually doing. It, too, uses the LocatorLauncher class to configure and bootstrap a Locator in-process. The LocatorFactoryBean is the class used to configure and bootstrap a Locator when declaring the SDG @LocatorApplication annotation on your @SpringBootApplication class.
However, I do think there is room for improvement, here. Therefore, I have filed DATAGEODE-361.
In the meantime, and as a workaround, you can achieve the same effect of a blocking Locator by having a look at the Smoke Test for the same in Spring Boot for Apache Geode (SBDG) project. See here.
However, after DATAGEODE-361 is complete, the extra logic preventing the Locator JVM process from shutting down will no longer be necessary.

Kaa data collection doesn't retrieve data from MongoDB

I installed the Kaa IoT server manually on Ubuntu 16.04 and used the data collection sample to test how it works.
The code runs without any errors, but when I run the commands below, nothing happens:
mongo kaa
db.logs_$my_app_token$.find()
I even commented out bind_ip in mongodb.conf and restarted the mongodb, zookeeper and kaa-node services, but nothing changed.
I also regenerated the SDK and rebuilt the project, but that didn't help either.
Finally, this is the Kaa log:
2018-06-05 15:03:53,899 [Thread-3] TRACE o.k.k.s.c.s.l.DynamicLoadManager - DynamicLoadManager recalculate() got 0 redirection rules
2018-06-05 15:03:59,472 [EPS-core-dispatcher-6] DEBUG o.k.k.s.o.s.a.a.c.OperationsServerActor - Received: org.kaaproject.kaa.server.operations.service.akka.messages.core.stats.StatusRequestMessage@30d61bb1
2018-06-05 15:03:59,472 [EPS-core-dispatcher-6] DEBUG o.k.k.s.o.s.a.a.c.OperationsServerActor - [14fc1a87-8b34-47f6-8f39-d91aff7bfff7] Processing status request
2018-06-05 15:03:59,475 [pool-5-thread-1] INFO o.k.k.s.o.s.l.DefaultLoadBalancingService - Updated load info: {"endpointCount": 0, "loadAverage": 0.02}
2018-06-05 15:03:59,477 [Curator-PathChildrenCache-0] INFO o.k.k.s.c.s.l.DynamicLoadManager - Operations server [-1835393002][localhost:9090] updated
2018-06-05 15:03:59,477 [Curator-PathChildrenCache-4] DEBUG o.k.k.s.o.s.c.DefaultClusterService - Update of node [localhost:9090:1528181889050]-[{"endpointCount": 0, "loadAverage": 0.02}] is pushed to resolver org.kaaproject.kaa.server.hash.ConsistentHashResolver@1d0276a4
2018-06-05 15:04:03,899 [Thread-3] INFO o.k.k.s.c.s.l.LoadDistributionService - Load distribution service recalculation started...
2018-06-05 15:04:03,899 [Thread-3] INFO o.k.k.s.c.s.l.DynamicLoadManager - DynamicLoadManager recalculate() started... lastBootstrapServersUpdateFailed false
2018-06-05 15:04:03,899 [Thread-3] DEBUG o.k.k.s.c.s.l.d.EndpointCountRebalancer - No rebalancing in standalone mode
2018-06-05 15:04:03,899 [Thread-3] TRACE o.k.k.s.c.s.l.DynamicLoadManager - DynamicLoadManager recalculate() got 0 redirection rules
Thank you for your help in fixing this problem.
After lots of searching and checking, I finally found it!
There are multiple reasons why this could happen:
If you are using the Kaa Sandbox, make sure that you set your network setting to bridged (not NAT).
Check your iptables rules and find out whether these ports are open: 9888, 9889, 9997, 9999 (see the port-check sketch below).
If you are using a virtual machine as your server, make sure that the host's firewall doesn't block the ports. (This is what happened to me...)
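For reference, a quick way to check those ports from a client machine; a minimal sketch where the server address is a placeholder, not a value from the question:

import socket

KAA_HOST = "192.168.1.10"  # placeholder -- replace with your Kaa server / Sandbox IP
PORTS = [9888, 9889, 9997, 9999]

for port in PORTS:
    # connect_ex returns 0 when the TCP connection succeeds, an error code otherwise.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(3)
        status = "open" if sock.connect_ex((KAA_HOST, port)) == 0 else "blocked/closed"
        print("{}:{} -> {}".format(KAA_HOST, port, status))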