Cannot connect Apache Spark to MongoDB with SSL

I have successfully installed Apache Spark on Ubuntu 18.04 and added the mongo-spark-connector to my Spark installation. I am trying to connect to a MongoDB cluster I have set up externally. I can connect to this cluster from various other clients with SSL enabled (SSL is required by the MongoDB server), but when I try to connect through Spark, the connection times out. To connect over SSL I normally use a private key (pk.pem) and a CA certificate (ca.crt).
I have done the following setup:

- converted the private key file from PEM format to PKCS12 format
- converted the CA certificate into PEM format
- created a new keystore and added the newly formatted PKCS12 file (using keytool)
- created a new truststore and added the CA certificate in PEM format (using keytool)

I currently start my script with the following command:
spark-submit \
  --driver-java-options "-Djavax.net.ssl.trustStore=/path/to/truststore.ks -Djavax.net.ssl.trustStorePassword=tspassword -Djavax.net.ssl.keyStore=/path/to/keystore.ks -Djavax.net.ssl.keyStorePassword=kspassword" \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/path/to/truststore.ks -Djavax.net.ssl.trustStorePassword=tspassword -Djavax.net.ssl.keyStore=/path/to/keystore.ks -Djavax.net.ssl.keyStorePassword=kspassword" \
  script.py
The following is my PySpark code:
mongo_url = 'mongodb://<user>:<pass>@<host>:<port>/db.collection?replicaSet=replica-set-name&ssl=true&authSource=test&readPreference=nearest'
mongo_df = sqlContext.read.format('com.mongodb.spark.sql.DefaultSource').option('uri', mongo_url).load()
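As an aside, one way to rule Spark out is to try the same key material from a plain Python client first. A minimal sketch, assuming PyMongo 3.x option names and that pk.pem contains both the client certificate and its private key:

from pymongo import MongoClient

# Sanity check of the SSL material outside Spark (PyMongo 3.x).
client = MongoClient(
    'mongodb://<user>:<pass>@<host>:<port>/?replicaSet=replica-set-name&authSource=test',
    ssl=True,
    ssl_certfile='/path/to/pk.pem',    # assumes certificate and key in one PEM file
    ssl_ca_certs='/path/to/ca.pem',    # CA certificate in PEM format
    serverSelectionTimeoutMS=5000,
)
print(client.admin.command('ping'))    # raises on SSL or auth failure

If this also times out, the problem is in the certificates or the network rather than in Spark.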
When I execute the script, the connection times out with the following output:
19/11/14 15:36:47 INFO SparkContext: Created broadcast 0 from broadcast at MongoSpark.scala:542
19/11/14 15:36:47 INFO cluster: Cluster created with settings {hosts=[<host>:<port>], mode=MULTIPLE, requiredClusterType=REPLICA_SET, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500, requiredReplicaSetName='haip-replica-set'}
19/11/14 15:36:47 INFO cluster: Adding discovered server <host>:<port> to client view of cluster
19/11/14 15:36:47 INFO MongoClientCache: Creating MongoClient: [<host>:<port>]
19/11/14 15:36:47 INFO cluster: No server chosen by com.mongodb.client.internal.MongoClientDelegate$1@140f4273 from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=<host>:<port>, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
19/11/14 15:36:47 INFO cluster: Exception in monitor thread while connecting to server mongodb-data-1.haip.io:27017
com.mongodb.MongoSocketReadException: Prematurely reached end of stream
at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:112)
at com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:580)
at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:445)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259)
at com.mongodb.internal.connection.CommandHelper.sendAndReceive(CommandHelper.java:83)
at com.mongodb.internal.connection.CommandHelper.executeCommand(CommandHelper.java:33)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.initializeConnectionDescription(InternalStreamConnectionInitializer.java:105)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:62)
at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:129)
at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:117)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/home/user/script.py", line 16, in <module>
mongo_df = sqlContext.read.format('com.mongodb.spark.sql.DefaultSource').option('uri', mongo_url).load()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches com.mongodb.client.internal.MongoClientDelegate$1@140f4273. Client view of cluster state is {type=REPLICA_SET, servers=[{address=<host>:<port>, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketReadException: Prematurely reached end of stream}}]
at com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:408)
at com.mongodb.internal.connection.BaseCluster.selectServer(BaseCluster.java:123)
at com.mongodb.internal.connection.AbstractMultiServerCluster.selectServer(AbstractMultiServerCluster.java:54)
at com.mongodb.client.internal.MongoClientDelegate.getConnectedClusterDescription(MongoClientDelegate.java:147)
at com.mongodb.client.internal.MongoClientDelegate.createClientSession(MongoClientDelegate.java:100)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.getClientSession(MongoClientDelegate.java:277)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:181)
at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:186)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:155)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:150)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:157)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.hasSampleAggregateOperator(MongoConnector.scala:237)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator$lzycompute(MongoRDD.scala:221)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator(MongoRDD.scala:221)
at com.mongodb.spark.sql.MongoInferSchema$.apply(MongoInferSchema.scala:68)
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:97)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
19/11/14 15:37:17 INFO SparkContext: Invoking stop() from shutdown hook
19/11/14 15:37:17 INFO MongoClientCache: Closing MongoClient: [<host>:<port>]
Spark version: 2.4.4
Scala version: 2.11
Mongo Java Driver version: 3.11.2
Mongo Spark Connector version: 2.11-2.4.1

The keystore and truststore were in an incorrect format (.ks). Once I converted them to the correct JKS format (.jks), it worked.
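For reference, a sketch of rebuilding both stores in JKS format (Python via subprocess; the paths, alias names, and passwords are hypothetical stand-ins matching the placeholders above, and it assumes pk.pem holds both the client certificate and its private key):

import subprocess

# 1. Client certificate + private key (PEM) -> PKCS12.
subprocess.run([
    'openssl', 'pkcs12', '-export',
    '-in', '/path/to/pk.pem',               # cert and key in one PEM file
    '-out', '/path/to/client.p12',
    '-name', 'mongo-client',
    '-password', 'pass:kspassword',
], check=True)

# 2. PKCS12 -> JKS keystore.
subprocess.run([
    'keytool', '-importkeystore',
    '-srckeystore', '/path/to/client.p12', '-srcstoretype', 'PKCS12',
    '-srcstorepass', 'kspassword',
    '-destkeystore', '/path/to/keystore.jks', '-deststoretype', 'JKS',
    '-deststorepass', 'kspassword',
], check=True)

# 3. CA certificate (PEM) -> JKS truststore.
subprocess.run([
    'keytool', '-importcert', '-noprompt', '-trustcacerts',
    '-alias', 'mongo-ca', '-file', '/path/to/ca.pem',
    '-keystore', '/path/to/truststore.jks', '-storepass', 'tspassword',
], check=True)

The -Djavax.net.ssl.keyStore and -Djavax.net.ssl.trustStore options then point at the resulting .jks files.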

Related

Connection ERROR while writing Dataframe (Pyspark 3.x on EMR 6.x) to RDS (MySQL)

I get "connection refused error" when I try to write the results of a Dataframe to an RDS (MySQL). I am using PySpark 3 on EMR cluster v6.x (1 master node, 1 slave node). The table does not exist yet. But the data base exist.
spark-submit --jars s3://{some s3 folder}/mysql-connector-java-8.0.25.jar s3://{some s3 folder}/pyspark_script.py
The part of the script that writes to MySQL is below; after testing, it is the only part of the script that fails. I have changed the names of my database, user, and password:
df.write\
.mode("overwrite")\
.format("jdbc")\
.option("url","jdbc:mysql://localhost:3306/{my database name}?useSSL=false")\
.option("driver","com.mysql.cj.jdbc.Driver")\
.option("dbtable","mydb_table")\
.option("user","myuser")\
.option("password","mypassword")\
.save()
This is the error I get; it is about a refused connection.
I have already given the EMR role access to RDS and its data!
Traceback (most recent call last):
File "/mnt/tmp/spark-93919f38-ea4d-44d6-be7d-0416be972753/pyspark_script.py", line 57, in <module>
.option("password","assignment")\
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1107, in save
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o163.save.
: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:833)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:453)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:246)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:198)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:48)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
at com.mysql.cj.protocol.a.NativeSocketConnection.connect(NativeSocketConnection.java:89)
at com.mysql.cj.NativeSession.connect(NativeSession.java:144)
at com.mysql.cj.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:953)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:823)
... 45 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at com.mysql.cj.protocol.StandardSocketFactory.connect(StandardSocketFactory.java:155)
at com.mysql.cj.protocol.a.NativeSocketConnection.connect(NativeSocketConnection.java:63)
... 48 more
21/12/19 11:40:04 INFO SparkContext: Invoking stop() from shutdown hook
21/12/19 11:40:04 INFO AbstractConnector: Stopped Spark#74d96709{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
21/12/19 11:40:04 INFO SparkUI: Stopped Spark web UI at http://{ip}.eu-central-1.compute.internal:4040
21/12/19 11:40:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/12/19 11:40:04 INFO MemoryStore: MemoryStore cleared
21/12/19 11:40:04 INFO BlockManager: BlockManager stopped
21/12/19 11:40:04 INFO BlockManagerMaster: BlockManagerMaster stopped
21/12/19 11:40:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/12/19 11:40:04 INFO SparkContext: Successfully stopped SparkContext
21/12/19 11:40:04 INFO ShutdownHookManager: Shutdown hook called
21/12/19 11:40:04 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fd1b8e7c-7b4c-424d-a451-743a6e075fbd
21/12/19 11:40:04 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-93919f38-ea4d-44d6-be7d-0416be972753
21/12/19 11:40:04 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fd1b8e7c-7b4c-424d-a451-743a6e075fbd/pyspark-40fbaaf5-2e34-44ba-875f-88308084546d
Try this thread: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
EMR with a localhost RDS? That seems odd; did you miss setting the host correctly?
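A hedged sketch of the likely fix: point the JDBC URL at the RDS endpoint instead of localhost (the hostname below is a hypothetical placeholder; copy the real one from the RDS console), and make sure the RDS security group allows inbound traffic on port 3306 from the EMR nodes:

# Hypothetical RDS endpoint; df as in the question above.
rds_url = "jdbc:mysql://mydb.abc123.eu-central-1.rds.amazonaws.com:3306/{my database name}?useSSL=false"

df.write \
    .mode("overwrite") \
    .format("jdbc") \
    .option("url", rds_url) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "mydb_table") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .save()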

Connection refused to Schema Registry

I have installed the new version of Confluent, i.e. 5.4, and since then I have been unable to connect to it; my Schema Registry also terminates unexpectedly.
Today, when I started Confluent and tried to produce data, I received the following error:
[2020-03-05 12:25:00,453] ERROR Failed to send HTTP request to endpoint: http://localhost:8081/subjects/avro-key/versions (io.confluent.kafka.schemaregistry.client.rest.RestService:245)
java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
at java.base/java.net.Socket.connect(Socket.java:609)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1248)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1337)
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:241)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:322)
at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:422)
at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:414)
at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:400)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.registerAndGetId(CachedSchemaRegistryClient.java:140)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:196)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:172)
at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:71)
at io.confluent.kafka.formatter.AvroMessageReader.readMessage(AvroMessageReader.java:199)
at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:55)
at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
Updated the question with the Schema-registry logs:
INFO Logging initialized @865ms to org.eclipse.jetty.util.log.Slf4jLog (org.eclipse.jetty.util.log:169)
[2020-03-09 12:35:51,851] INFO Adding listener: http://0.0.0.0:8081 (io.confluent.rest.ApplicationServer:316)
[2020-03-09 12:35:52,366] INFO Created schema registry namespace localhost:2181 /schema_registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryConfig:709)
[2020-03-09 12:35:53,329] INFO Initializing KafkaStore with broker endpoints: PLAINTEXT://LAP-LIN-897:9092 (io.confluent.kafka.schemaregistry.storage.KafkaStore:108)
[2020-03-09 12:38:03,215] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:77)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:248)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:75)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:90)
at io.confluent.rest.Application.configureHandler(Application.java:217)
at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:185)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:43)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:177)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:119)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:246)
... 6 more
Caused by: java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:170)
... 8 more
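The producer's "Connection refused" is consistent with these logs: the registry never finished starting (it timed out creating or validating the schemas topic against the broker), so nothing is listening on port 8081. A quick check against the registry's REST API, as a minimal sketch:

import requests

# If the Schema Registry is up, GET /subjects returns a JSON list of subjects.
try:
    resp = requests.get('http://localhost:8081/subjects', timeout=5)
    print(resp.status_code, resp.json())
except requests.ConnectionError:
    # Nothing listening on 8081: fix the registry/broker startup first.
    print('Schema Registry is not reachable on localhost:8081')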

Spark in cluster mode throws error if a SparkContext is not started

I have a Spark job that initializes the SparkContext only if it is really necessary:
val conf = new SparkConf()
val jobs: List[Job] = ??? // get some jobs
if (jobs.nonEmpty) {
  val sc = new SparkContext(conf)
  sc.parallelize(jobs).foreach(....)
} else {
  // do nothing
}
It worked fine on YARN when the deploy mode is 'client':
spark-submit --master yarn --deploy-mode client
Then I switched the deploy mode to 'cluster', and it started to crash whenever jobs.isEmpty was true:
spark-submit --master yarn --deploy-mode cluster
Below is the error text:
INFO yarn.Client: Application report for application_1509613523426_0017 (state: ACCEPTED)
17/11/02 11:37:17 INFO yarn.Client: Application report for application_1509613523426_0017 (state: FAILED)
17/11/02 11:37:17 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1509613523426_0017 failed 2 times due to AM Container for appattempt_1509613523426_0017_000002 exited with exitCode: -1000
For more detailed output, check application tracking page: http://xxxxxx.com:8088/cluster/app/application_1509613523426_0017 Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://xxxxxxx/.sparkStaging/application_1509613523426_0017/__spark_libs__997458388067724499.zip
java.io.FileNotFoundException: File does not exist: hdfs://xxxxxxx/.sparkStaging/application_1509613523426_0017/__spark_libs__997458388067724499.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: dev
start time: 1509622629354
final status: FAILED
tracking URL: http://xxxxxx.com:8088/cluster/app/application_1509613523426_0017
user: xxx
Exception in thread "main" org.apache.spark.SparkException: Application application_1509613523426_0017 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/11/02 11:37:17 INFO util.ShutdownHookManager: Shutdown hook called
17/11/02 11:37:17 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a5b20def-0218-4b0c-b9f8-fdf8a1802e95
Is this a bug in YARN support, or am I missing something?
The SparkContext is responsible for communication with the cluster manager. If the application is submitted to the cluster but a context is never created, YARN cannot determine the state of the application; this is why you get an error.
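The question's code is Scala; a minimal PySpark sketch of the same workaround, creating and stopping the context unconditionally (fetch_jobs and process are hypothetical stand-ins):

from pyspark import SparkConf, SparkContext

def fetch_jobs():
    # Hypothetical stand-in for the real job discovery.
    return []

def process(job):
    # Hypothetical per-job work.
    pass

# Create the context unconditionally so YARN can track the application,
# even when there is nothing to do; then stop it cleanly.
sc = SparkContext(conf=SparkConf().setAppName('maybe-empty-job'))
try:
    jobs = fetch_jobs()
    if jobs:
        sc.parallelize(jobs).foreach(process)
finally:
    sc.stop()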

Can't connect with Mongo-Spark Connector using Mongo in Authentication mode

I'm trying to run a spark-submit job, using a MongoDB instance on a remote machine, via the Mongo-Spark Connector.
When I start the mongod service without the --auth flag and run the spark-submit command like this:
./bin/spark-submit --master spark://10.0.3.155:7077 \
--conf "spark.mongodb.input.uri=mongodb://10.0.3.156/test.coll?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://10.0.3.156/test.coll" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
app1.py
Everything works like a charm.
But when I run the mongod service with the --auth flag and run spark-submit like this:
./bin/spark-submit --master spark://10.0.3.155:7077 \
--conf "spark.mongodb.input.uri=mongodb://admin:pass#10.0.3.156/test.coll?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://admin:pass#10.0.3.156/test.coll" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
app1.py
I get these errors:
py4j.protocol.Py4JJavaError: An error occurred while calling o47.save. : com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=UNKNOWN, servers=[{address=10.0.3.156:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSecurityException: Exception authenticating MongoCredential{mechanism=null, userName='admin', source='test', password=<hidden>, mechanismProperties={}}}, caused by {com.mongodb.MongoCommandException: Command failed with error 18: 'Authentication failed.' on server 10.0.3.156:27017. The full response is { "ok" : 0.0, "code" : 18, "errmsg" : "Authentication failed." }}}]
at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:158)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:133)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:128)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:118)
at com.mongodb.operation.DropCollectionOperation.execute(DropCollectionOperation.java:54)
at com.mongodb.operation.DropCollectionOperation.execute(DropCollectionOperation.java:39)
at com.mongodb.Mongo.execute(Mongo.java:781)
at com.mongodb.Mongo$2.execute(Mongo.java:764)
at com.mongodb.MongoCollectionImpl.drop(MongoCollectionImpl.java:419)
at com.mongodb.spark.sql.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:89)
at com.mongodb.spark.sql.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:89)
at com.mongodb.spark.MongoConnector$$anonfun$withCollectionDo$1.apply(MongoConnector.scala:186)
at com.mongodb.spark.MongoConnector$$anonfun$withCollectionDo$1.apply(MongoConnector.scala:184)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:154)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:171)
at com.mongodb.spark.MongoConnector.withCollectionDo(MongoConnector.scala:184)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:518)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
I've checked the credentials and roles, and all is fine. Can't figure out what I'm doing wrong...
You need to configure the authSource query parameter in your connection string; otherwise the database from the URI path (test) is used as the authentication database:
spark.mongodb.input.uri=mongodb://10.0.3.156/test.coll?authSource=admin&readPreference=primaryPreferred
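Equivalently, as a hedged PySpark sketch setting the same properties in code (host and credentials as in the question):

from pyspark.sql import SparkSession

# authSource=admin tells the driver to authenticate against the admin DB.
spark = (
    SparkSession.builder
    .appName('mongo-auth-example')
    .config('spark.mongodb.input.uri',
            'mongodb://admin:pass@10.0.3.156/test.coll'
            '?authSource=admin&readPreference=primaryPreferred')
    .config('spark.mongodb.output.uri',
            'mongodb://admin:pass@10.0.3.156/test.coll?authSource=admin')
    .getOrCreate()
)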
So it was a credentials issue after all.
The thing is that I tried to authenticate against the test DB, while the credentials I entered were defined on the admin DB.
That's it.

HDFS command line put throwing an exception

I am trying to put a file into HDFS, over SSH from my client PC to the NameNode server. There are two machines: one NameNode and one DataNode. Here is the command I am trying:
$ bin/hadoop fs -fs hdfs://MY_IP:MY_PORT -put example.txt example.txt
But it throws an exception. The logs say:
2013-05-23 09:25:31,808 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:_my_user_name_ cause:java.io.IOException: Unknown protocol to DataNode: org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-05-23 09:25:31,808 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on _my_port_, call getProtocolVersion(org.apache.hadoop.hdfs.protocol.ClientProtocol, 61) from _ip_:_port_: error: java.io.IOException: Unknown protocol to DataNode: org.apache.hadoop.hdfs.protocol.ClientProtocol
java.io.IOException: Unknown protocol to DataNode: org.apache.hadoop.hdfs.protocol.ClientProtocol
at org.apache.hadoop.hdfs.server.datanode.DataNode.getProtocolVersion(DataNode.java:1759)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
What could be the problem? Thanks a lot.
Check your core-site.xml file and verify the HDFS URL; the "Unknown protocol to DataNode" exception suggests the client is connecting to the DataNode's port instead of the NameNode's.
Check the Hadoop versions of your client and of the cluster; my guess is that your client version is older than the cluster version.
java.io.IOException: Unknown protocol to DataNode: org.apache.hadoop.hdfs.protocol.ClientProtocol
at org.apache.hadoop.hdfs.server.datanode.DataNode.getProtocolVersion(DataNode.java:1759)