Cannot create dataproc cluster - google-cloud-dataproc

When creating a Dataproc cluster using either the web console or gcloud, I get the same error:
gcloud dataproc --region us-west1 clusters create cluster-02eb --subnet default --zone us-west1-a --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 1.2 --scopes 'https://www.googleapis.com/auth/cloud-platform' --project projectxyz123 --initialization-actions 'gs://dataproc-initialization-actions/jupyter/jupyter.sh'
Waiting on operation [projects/projectxyz123/regions/us-west1/operations/53badbbc-41cc-4241-aedb-490981864bf9].
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/projectxyz123/regions/us-west1/operations/53badbbc-41cc-4241-aedb-490981864bf9] failed: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
Not Found.
What does this error mean?

This is currently a known issue for zone us-west1-a. Using an alternative zone should work (e.g. us-west1-b).
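For example, re-running the create command from the question with only the zone changed (everything else, including the project and cluster name, kept exactly as in the question) should work:
gcloud dataproc --region us-west1 clusters create cluster-02eb \
  --subnet default --zone us-west1-b \
  --master-machine-type n1-standard-4 --master-boot-disk-size 500 \
  --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 \
  --image-version 1.2 --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --project projectxyz123 \
  --initialization-actions 'gs://dataproc-initialization-actions/jupyter/jupyter.sh'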

Spark files not found in cluster deploy mode

I'm trying to run a Spark job in cluster deploy mode by issuing the following on the EMR cluster master node:
spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties
I'm getting the following error, which states that the file cannot be found in the Spark executor working directory:
//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
//This is me trying to list the content for that working directory, which turns out empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
//This is me getting the error:
20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at ccom.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
at com.someOrg.somePackage.someClass.main(someClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
This is the function I use to attempt to read the properties files passed as arguments:
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.Properties

def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties = new Properties()
  properties.load(inputStream)
  properties
}
Invoked as:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried loadPropertiesFromFile(args({1,2}))
The program works as expected when issued with client deploy mode:
spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties
This happens in Spark 2.4.5 / EMR 5.30.1.
Additionally, when I try to configure this job as an EMR step, it does not even work in client mode. Any clue as to how the resource files passed through the --files option are managed/persisted/made available in EMR?
Option 1: Put those files in S3 and pass the S3 paths.
Option 2: Copy those files to each node in a specific location (using a bootstrap action) and pass the absolute path of the files. A sketch of this option follows.
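A hedged sketch of Option 2 on EMR, assuming a hypothetical bootstrap script; the bucket and resource names are taken from the question, and the target directory /home/hadoop/conf is just an illustration:
#!/bin/bash
# Hypothetical EMR bootstrap action: runs on every node at provisioning time
# and copies the resource files from S3 to a fixed local path.
mkdir -p /home/hadoop/conf
aws s3 cp s3://someBucket/resources/truststore.jks /home/hadoop/conf/
aws s3 cp s3://someBucket/resources/kafka.properties /home/hadoop/conf/
aws s3 cp s3://someBucket/resources/program.properties /home/hadoop/conf/
The job would then read the absolute paths (e.g. /home/hadoop/conf/kafka.properties) directly instead of relying on --files.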
Solved with suggestions from the above comments:
spark-submit --master yarn \
--deploy-mode cluster \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files s3://someBucket/resources/truststore.jks,s3://someBucket/resources/kafka.properties,s3://someBucket/resources/program.properties \
--class com.someOrg.someClass.someMain \
s3://someBucket/resources/program.jar kafka.properties program.properties
I was previously assuming that in cluster deploy mode the files passed via --files were also shipped alongside the driver deployed to a worker node (and thereby available in its working directory), as long as they were accessible from the machine where spark-submit is issued.
Bottom line: regardless of where you issue spark-submit from and whether the files are available on that machine, in cluster mode you must ensure that the files are accessible from every worker node.
It now works by pointing the file locations to S3.
Thank you all!

Kafka Failed to create new KafkaAdminClient on Kerberized Cluster

I have a Kerberized cluster with Kafka on it.
I want to use the Confluent Schema Registry with Kafka on the cluster.
Launching the Schema Registry from my local PC, everything works just fine.
But when I uploaded it to a machine in the cluster and tried to run it from there, I get:
Error
ERROR Server died unexpectedly: (io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain:50)
org.apache.kafka.common.KafkaException: Failed to create new KafkaAdminClient
...
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Message stream modified (41)
...
Caused by: javax.security.auth.login.LoginException: Message stream modified (41)
I also tried to run it from another machine on the cluster and I get the same result.
Schema-registry.properties
listeners=http://0.0.0.0:8081
kafkastore.connection.url=master01.domain.ext:2181,master02.domain.ext:2181
kafkastore.bootstrap.servers=SASL_PLAINTEXT://xxx.domain.ext:6667,SASL_PLAINTEXT://xxx.domain.ext:6667
kafkastore.topic=_schemas
debug=true
kafkastore.sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
useKeyTab=true \
storeKey=true \
keyTab="/path/schema-registry/etc/schema-registry/my-keytab.keytab" \
principal="kafka/xxx.domain.ext#DOMAIN.EXT";
kafkastore.sasl.kerberos.service.name=kafka
kafkastore.security.protocol=SASL_PLAINTEXT
kafkastore.sasl.mechanism=GSSAPI
EXECUTION COMMAND:
sudo bash bin/schema-registry-start ./etc/schema-registry/schema-registry.properties
QUESTIONS:
Why does it work on my local PC but not on a machine in the cluster?
What should I change?
P.S. I get the same result even when trying to run the CMAK (Yahoo kafka-manager), using the same jaas.config and the same keytab.

Spark Error : executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I am working with the following Spark config:
maxCores = 5
driverMemory=2g
executorMemory=17g
executorInstances=100
Issue:
Out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. Even after setting the number of executors to 250, only 10 remain active. All I am trying to do is load a multi-partition Hive table and run df.count over it.
Please help me understand what is causing the executors to be killed.
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called
Not sure why YARN is killing my executors.
I faced a similar issue where investigating the NodeManager logs led me to the root cause.
You can access them via the web interface:
nodeManagerAddress:PORT/logs
The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).
My investigation workflow:
1. Collect the logs (yarn logs ... command; see the sketch after this list)
2. Identify the node and container (in those logs) emitting the error
3. Search the NodeManager logs by the timestamp of the error for the root cause
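For step 1, the aggregated logs can typically be pulled with something like this (the application ID below is only a placeholder):
yarn logs -applicationId application_1513000000000_0001 > app_logs.txt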
Btw: you can access the aggregated collection (XML) of all configurations affecting a node at the same port with:
nodeManagerAddress:PORT/conf
I believe this issue has more to do with memory and dynamic allocation timeouts at the executor/container level. Make sure you can change the config params at the executor/container level.
One of the ways you can resolve this issue is by changing this config value, either in your spark-shell or in your Spark job:
spark.dynamicAllocation.executorIdleTimeout
This thread has more detailed information on how to resolve this issue, which worked for me:
https://jira.apache.org/jira/browse/SPARK-21733
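For example, a hedged sketch of passing that setting at launch time (the 300s value is only an illustration, and dynamic allocation has to be enabled for the timeout to matter):
spark-shell \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=300s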
I had the same issue: my Spark job was using only 1 task node and killing the other provisioned nodes. This also happened when switching to EMR Serverless, where my job was being run on only one "thread". Please see below, as this fixed it for me:
spark-submit \
--name KSSH-0.3 \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
--num-executors 8 \
--jars $(echo /opt/software/spark2.1.1/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.yarn.am.memory=1024m" \
--conf "spark.yarn.am.memoryOverhead=1024m" \
--conf "spark.yarn.driver.memoryOverhead=1024m" \
--conf "spark.yarn.executor.memoryOverhead=1024m" \
--conf "spark.yarn.am.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1250" \
--conf "spark.locality.wait=1s" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=360000" \
--conf "spark.network.timeout=420000" \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
/opt/software/spark2.1.1/spark_on_yarn/KSSH-0.3.jar

Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException

~/spark/spark-2.1.1-bin-hadoop2.7/bin$ ./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/20 16:41:30 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.42.80:32141.
17/06/20 16:41:31 INFO RestSubmissionClient: Submission successfully created as driver-20170620204130-0005. Polling submission state...
17/06/20 16:41:31 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20170620204130-0005 in spark://192.168.42.80:32141.
17/06/20 16:41:31 INFO RestSubmissionClient: State of driver driver-20170620204130-0005 is now ERROR.
17/06/20 16:41:31 INFO RestSubmissionClient: Driver is running on worker worker-20170620203037-172.17.0.5-45429 at 172.17.0.5:45429.
17/06/20 16:41:31 ERROR RestSubmissionClient: Exception from the cluster:
java.nio.file.NoSuchFileException: /home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
java.nio.file.Files.copy(Files.java:1274)
org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:608)
org.apache.spark.util.Utils$.copyFile(Utils.scala:579)
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:664)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:463)
org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:154)
org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:172)
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91)
17/06/20 16:41:31 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20170620204130-0005",
"serverSparkVersion" : "2.1.1",
"submissionId" : "driver-20170620204130-0005",
"success" : true
}
Log from spark-worker:
2017-06-20T20:41:30.807403232Z 17/06/20 20:41:30 INFO Worker: Asked to launch driver driver-20170620204130-0005
2017-06-20T20:41:30.817248508Z 17/06/20 20:41:30 INFO DriverRunner: Copying user jar file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar to /opt/spark/work/driver-20170620204130-0005/myproj-assembly-0.1.0.jar
2017-06-20T20:41:30.883645747Z 17/06/20 20:41:30 INFO Utils: Copying /home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar to /opt/spark/work/driver-20170620204130-0005/myproj-assembly-0.1.0.jar
2017-06-20T20:41:30.885217508Z 17/06/20 20:41:30 INFO DriverRunner: Killing driver process!
2017-06-20T20:41:30.885694618Z 17/06/20 20:41:30 WARN Worker: Driver driver-20170620204130-0005 failed with unrecoverable exception: java.nio.file.NoSuchFileException: home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
Any idea why? Thanks
UPDATE
Is the following command right?
./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar
UPDATE
I think I understand a little more about Spark and why I had this problem and the spark-submit error: ClassNotFoundException. The key point is that although the word REST is used here (REST URL: spark://127.0.1.1:6066 (cluster mode)), the application jar will not be uploaded to the cluster after submission, which is different from my understanding. So the Spark cluster cannot find the application jar and cannot load the main class.
I will try to find out how to set up the Spark cluster and use cluster mode to submit the application. No idea whether client mode will use more resources for streaming jobs.
Quoting the UPDATE above: "I think I understand a little more about Spark and why I had this problem and the spark-submit error: ClassNotFoundException. The key point is that although the word REST is used here (REST URL: spark://127.0.1.1:6066 (cluster mode)), the application jar will not be uploaded to the cluster after submission, which is different from my understanding. So the Spark cluster cannot find the application jar and cannot load the main class."
That's why you have to locate the jar file on the master node OR put it into HDFS before the spark-submit.
This is how to do it:
1.) Transferring the file to the master node with the scp command:
$ scp <file> <username>@<IP address or hostname>:<Destination>
For example:
$ scp mytext.txt tom@128.140.133.124:~/
2.) Transferring the file to HDFS:
$ hdfs dfs -put mytext.txt
Hope I could help you.
You are submitting the application in cluster mode. This means a Spark driver will be created somewhere, and the file must exist there.
That is why, with Spark, it is recommended to use a distributed file system like HDFS or S3.
In standalone cluster mode you want to pass the jar file via HDFS, because the driver can run on any node in the cluster.
hdfs dfs -put xxx.jar /user/
spark-submit --master spark://xxx:7077 \
--deploy-mode cluster \
--supervise \
--driver-memory 512m \
--total-executor-cores 1 \
--executor-memory 512m \
--executor-cores 1 \
--class com.xiyou.bi.streaming.game.common.DmMoGameviewOnlineLogic \
hdfs://xxx:8020/user/hutao/xxx.jar

Configuring PostgreSQL HA on RHEL 7.0

We are experiencing a problem configuring PostgreSQL for HA using Corosync and Pacemaker.
crm_mon output is
Last updated: Thu Dec 18 10:24:04 2014
Last change: Thu Dec 18 10:16:30 2014 via crmd on umhtvappdpj05.arqiva.local
Stack: corosync
Current DC: umhtvappdpj06.arqiva.local (1) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
4 Resources configured
Online: [ umhtvappdpj05.arqiva.local umhtvappdpj06.arqiva.local ]
Full list of resources:
Master/Slave Set: msPostgresql [pgsql]
Masters: [ umhtvappdpj06.arqiva.local ]
Slaves: [ umhtvappdpj05.arqiva.local ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
vip-rep (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
Node Attributes:
* Node umhtvappdpj05.arqiva.local:
+ master-pgsql : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : HS:alone
+ pgsql-xlog-loc : 0000000097000168
* Node umhtvappdpj06.arqiva.local:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000094000090
+ pgsql-status : PRI
Migration summary:
* Node umhtvappdpj05.arqiva.local:
* Node umhtvappdpj06.arqiva.local:
Here node 06 (umhtvappdpj06.arqiva.local) started as primary and node 05 (umhtvappdpj05.arqiva.local) acts as standby, but the two are not connected.
recovery.conf on node 05
standby_mode = 'on'
primary_conninfo = 'host=10.52.6.95 port=5432 user=postgres application_name=umhtvappdpj05.arqiva.local keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p'
recovery_target_timeline = 'latest'
Resources created are:
pcs resource create vip-master IPaddr2 \
ip="10.52.6.94" \
nic="ens192" \
cidr_netmask="24" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
pcs resource create vip-rep IPaddr2 \
ip="10.52.6.95" \
nic="ens192" \
cidr_netmask="24" \
meta migration-threshold="0" \
op start timeout="60s" interval="0s" on-fail="stop" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="ignore"
pcs resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/pgsql-9.3/bin/pg_ctl" \
psql="/usr/pgsql-9.3/bin/psql" \
pgdata="/pgdata/data" \
rep_mode="sync" \
node_list="10.52.6.85 10.52.6.92" \
restore_command="scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p" \
master_ip="10.52.6.95" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
restart_on_promote='true' \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail="stop" \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s"
[root@umhtvappdpj05 data]# pcs resource show --all
Master/Slave Set: msPostgresql [pgsql]
Masters: [ umhtvappdpj06.arqiva.local ]
Slaves: [ umhtvappdpj05.arqiva.local ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started
vip-rep (ocf::heartbeat:IPaddr2): Started
[root@umhtvappdpj05 data]#
The only anomaly was that Corosync and Pacemaker were first installed on node 6 when it was on a different subnet from node 5. Subsequently node 6 was moved to the same subnet as node 5. Could this be the cause? Maybe re-install on node 6? That does seem to make sense.
thank you
Sameer
Here is how you can connect your replica to the primary PostgreSQL.
1. Touch the PGSQL.lock file in /var/lib/pgsql/tmp/ on umhtvappdpj05.arqiva.local
2. Stop PostgreSQL on node umhtvappdpj05.arqiva.local using systemctl
3. Do a base backup/rsync of the data directory from the primary server to the replica
4. Remove the PGSQL.lock file from the replica
5. Run pcs resource cleanup pgsql (see the sketch after this list)
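A hedged sketch of those steps as shell commands on the replica, using the paths and addresses from the question; the PostgreSQL service name and the use of pg_basebackup for step 3 are assumptions and may differ in your setup:
# 1. Create the lock file checked by the pgsql resource agent
touch /var/lib/pgsql/tmp/PGSQL.lock
# 2. Stop PostgreSQL on the replica (assumed PGDG 9.3 unit name)
systemctl stop postgresql-9.3
# 3. Re-seed the data directory from the primary (10.52.6.95 is the replication VIP
#    from the question); this assumes the old data directory is being replaced entirely
rm -rf /pgdata/data/*
pg_basebackup -h 10.52.6.95 -U postgres -D /pgdata/data -X stream -P
# 4. Remove the lock file so the resource agent will manage the instance again
rm -f /var/lib/pgsql/tmp/PGSQL.lock
# 5. Clear the failure history so Pacemaker retries the resource
pcs resource cleanup pgsql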
These steps always work for me.
Let me know if they don't.