Oozie Spark2 action: Kerberos Delegation Token error - pyspark

I have a Cloudera 5.12 cluster with Kerberos. I can launch a simple PySpark script with spark2-submit with no error, but when I try to launch the same script with Oozie, I get:
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Delegation Token can be issued only with kerberos or web authentication
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7503)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:548)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getDelegationToken(AuthorizationProviderProxyClientProtocol.java:663)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:981)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
My workflow is:
<workflow-app name="${wfAppName}" xmlns="uri:oozie:workflow:0.5">
    <credentials>
        <credential name="hcatauth" type="hcat">
            <property>
                <name>hcat.metastore.uri</name>
                <value>thrift://XXXXX:9083</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>hive/_HOST@XXXXX</value>
            </property>
        </credential>
    </credentials>
    <start to="spark-launch"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-launch" cred="hcatauth">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>TestMe</name>
            <jar>mini_job.py</jar>
            <spark-opts>
                --files hive-site.xml
                --queue interne_isoprod
                --principal interne_isoprod@XXXX
                --keytab interne_isoprod.keytab
            </spark-opts>
            <file>${projectPath}/scripts/mini_job.py#mini_job.py</file>
            <file>${projectPath}/lib/hive-site.xml#hive-site.xml</file>
            <file>${projectPath}/lib/interne_isoprod.keytab#interne_isoprod.keytab</file>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
The PySpark script is:
from datetime import datetime
from pyspark.sql import SparkSession
import sys

if __name__ == "__main__":
    spark = SparkSession.builder.appName('test_spark2').getOrCreate()
    print(spark.version)
    spark.stop()
I tried with and without spark-opts, with no luck.
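For what it's worth, a commonly reported trigger for this exact message in an Oozie Spark action is the --principal/--keytab pair: the Oozie launcher already runs with delegation tokens rather than a Kerberos TGT, so when Spark then asks the NameNode for fresh tokens the request is rejected. A variant worth trying (assuming the hcat credential above covers the metastore) keeps spark-opts minimal and lets Oozie handle authentication:
<spark-opts>
    --files hive-site.xml
    --queue interne_isoprod
</spark-opts>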

Related

Spark to read Cassandra

Hi, I am trying to extract data from Cassandra using AWS Glue, writing PySpark code. Below is the code, which gave me an error. Please suggest how I can import the classes/drivers.
I want to extract from Cassandra and create files in S3 buckets.
#from awsglue.transforms import sys
import sys
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
sparkSession = glueContext.spark_session

# Use the CData JDBC driver to read Cassandra data from the Customer table into a DataFrame.
# Note the populated JDBC URL and driver class name:
# source_df = sparkSession.read.format("jdbc").option("url", "jdbc:cassandra:RTK=5246...;Database=MyCassandraDB;Port=7000;Server=db-datastax02c-dc2.stage.impello.co.uk;").option("dbtable", "reads_by_received_date").option("driver", "cdata.jdbc.cassandra.CassandraDriver").load()
# df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).load()

glueJob = Job(glueContext)
glueJob.init(args['JOB_NAME'], args)

testdf = sparkSession.read.format("org.apache.spark.sql.cassandra") \
    .option("spark.cassandra.connection.host", "server") \
    .options(table="reads_by_received_date", keyspace="keyspace") \
    .option("spark.cassandra.auth.username", "username") \
    .option("spark.cassandra.auth.password", "password") \
    .load()
    # .select("*") \
    # .where("received_year in (2020)") \
    # .cache()

# Convert the DataFrame to an AWS Glue DynamicFrame.
dynamic_dframe = DynamicFrame.fromDF(testdf, glueContext, "dynamic_df")

# Write the DynamicFrame as CSV files to a folder in an S3 bucket.
datatransfer = glueContext.write_dynamic_frame.from_options(frame=dynamic_dframe,
                                                            connection_type="s3",
                                                            connection_options={"path": "s3://bucket/"},
                                                            format="csv",
                                                            transformation_ctx="datasink4")

glueJob.commit()
Error:
Aug 28, 2020, 4:43:27 PM Pending execution
Traceback (most recent call last): File "/tmp/CassandraToS3", line 27, in <module> .option("spark.cassandra.auth.password", "password") \ File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load return self._df(self._jreader.load()) File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o75.load. : java.io.IOException: Failed to open native connection to Cassandra at {} :: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=4f522a41): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)] at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:181) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:169) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:169) at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32) at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69) at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57) at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:89) at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111) at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$.forSystemLocalPartitioner(TokenFactory.scala:98) at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:680) at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:57) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use 
getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=4f522a41): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)] at com.datastax.oss.driver.api.core.AllNodesFailedException.copy(AllNodesFailedException.java:141) at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149) at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:633) at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createSession(CassandraConnectionFactory.scala:144) at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:175) ... 25 more Suppressed: com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException) at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.fail(ProtocolInitHandler.java:342) at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.writeListener(ChannelHandlerRequest.java:87) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:183) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30) at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:76) at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.send(ProtocolInitHandler.java:183) at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler.onRealConnect(ProtocolInitHandler.java:118) at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.lambda$connect$0(ConnectInitHandler.java:57) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) at 
com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 more Suppressed: com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9042 Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) at com.datastax.oss.driver.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) Caused by: java.nio.channels.ClosedChannelException at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:921) at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:897) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:740) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:726) at com.datastax.oss.driver.shaded.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748) at 
com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:763) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:788) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:756) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:806) at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025) at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294) at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:75) ... 20 more
AWS Glue does not provide native library support for Cassandra. You need to get the Cassandra connector and follow the steps described in "ETL jobs against non-native JDBC data sources".
Once you have the jar downloaded from here, you can pass it to your job and use it in your PySpark script.
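As a sketch of those two steps (the bucket path and jar version below are placeholders, not values from the original job): upload the spark-cassandra-connector assembly jar to S3, point the Glue job's "Dependent jars path" (the --extra-jars argument) at it, and give the connector a reachable contact point instead of the 127.0.0.1 default visible in the traceback:
# Assumes the connector jar was attached to the job, e.g. via
# Dependent jars path = s3://my-bucket/jars/spark-cassandra-connector-assembly_2.11-2.4.3.jar
testdf = sparkSession.read.format("org.apache.spark.sql.cassandra") \
    .option("spark.cassandra.connection.host", "db-datastax02c-dc2.stage.impello.co.uk") \
    .option("spark.cassandra.connection.port", "9042") \
    .option("spark.cassandra.auth.username", "username") \
    .option("spark.cassandra.auth.password", "password") \
    .options(table="reads_by_received_date", keyspace="keyspace") \
    .load()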

PySpark Can't Connect to MongoDB, but Command-line Can

Trying to load a MongoDB collection into a PySpark DataFrame. First of all, I'm able to connect using the command line on the NameNode:
mongo mongodb://USER:PASSWORD@HOST/DB_NAME
MongoDB shell version v3.6.3
connecting to: mongodb://HOST/DB_NAME
MongoDB server version: 3.6.3
>
I run the script on the cluster as follows:
spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 3 \
--num-executors 10 \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 \
load_from_mongo.py
Now I create a SparkSession:
spark = SparkSession.builder \
    .appName("TestMongoLoad") \
    .config("spark.mongodb.input.uri", 'mongodb://USER:PASSWORD@HOST:27017') \
    .config("spark.mongodb.input.database", DB_NAME) \
    .config("spark.mongodb.input.collection", COLLECTION_NAME) \
    .getOrCreate()
Then I attempt to read in the DataFrame:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .load()
df.show(5, truncate=False)
The result is that it fails authentication. Clearly I'm passing something incorrectly...
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-caedf270-dd43-42f2-a39e-e3d1b7134046;1.0
confs: [default]
found org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 in central
found org.mongodb#mongo-java-driver;3.10.2 in central
[3.10.2] org.mongodb#mongo-java-driver;[3.10,3.11)
:: resolution report :: resolve 1129ms :: artifacts dl 4ms
:: modules in use:
org.mongodb#mongo-java-driver;3.10.2 from central in [default]
org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 1 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-caedf270-dd43-42f2-a39e-e3d1b7134046
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/7ms)
20/02/29 21:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/home/ubuntu/server/load_from_mongo.py", line 124, in <module>
main(args)
File "/home/ubuntu/server/load_from_mongo.py", line 102, in main
keyword_df = getKeywordCorpus(args.begin_dt, args.end_dt)
File "/home/ubuntu/server/load_from_mongo.py", line 79, in getKeywordCorpus
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o57.load.
: com.mongodb.MongoSecurityException: Exception authenticating MongoCredential{mechanism=SCRAM-SHA-1, userName='USER', source='admin', password=<hidden>, mechanismProperties={}}
at com.mongodb.internal.connection.SaslAuthenticator.wrapException(SaslAuthenticator.java:173)
at com.mongodb.internal.connection.SaslAuthenticator.access$300(SaslAuthenticator.java:40)
at com.mongodb.internal.connection.SaslAuthenticator$1.run(SaslAuthenticator.java:70)
at com.mongodb.internal.connection.SaslAuthenticator$1.run(SaslAuthenticator.java:47)
at com.mongodb.internal.connection.SaslAuthenticator.doAsSubject(SaslAuthenticator.java:179)
at com.mongodb.internal.connection.SaslAuthenticator.authenticate(SaslAuthenticator.java:47)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.authenticateAll(InternalStreamConnectionInitializer.java:152)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:63)
at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:127)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.open(UsageTrackingInternalConnection.java:50)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.open(DefaultConnectionPool.java:390)
at com.mongodb.internal.connection.DefaultConnectionPool.get(DefaultConnectionPool.java:106)
at com.mongodb.internal.connection.DefaultConnectionPool.get(DefaultConnectionPool.java:92)
at com.mongodb.internal.connection.DefaultServer.getConnection(DefaultServer.java:85)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.getConnection(ClusterBinding.java:115)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:212)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:206)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:116)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:109)
at com.mongodb.operation.CommandReadOperation.execute(CommandReadOperation.java:56)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:179)
at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:184)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:153)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:148)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:157)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.hasSampleAggregateOperator(MongoConnector.scala:237)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator$lzycompute(MongoRDD.scala:221)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator(MongoRDD.scala:221)
at com.mongodb.spark.sql.MongoInferSchema$.apply(MongoInferSchema.scala:68)
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:97)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mongodb.MongoCommandException: Command failed with error 18 (AuthenticationFailed): 'Authentication failed.' on server HOST:27017. The full response is {"ok": 0.0, "errmsg": "Authentication failed.", "code": 18, "codeName": "AuthenticationFailed"}
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:179)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:255)
at com.mongodb.internal.connection.CommandHelper.sendAndReceive(CommandHelper.java:83)
at com.mongodb.internal.connection.CommandHelper.executeCommand(CommandHelper.java:33)
at com.mongodb.internal.connection.SaslAuthenticator.sendSaslStart(SaslAuthenticator.java:130)
at com.mongodb.internal.connection.SaslAuthenticator.access$100(SaslAuthenticator.java:40)
at com.mongodb.internal.connection.SaslAuthenticator$1.run(SaslAuthenticator.java:54)
... 48 more
The answer, as @Lamanus alluded to above, was to change the URI slightly: with no database name in the URI, the driver was authenticating against the default admin database (note source='admin' in the exception above).
MONGO_URL = "mongodb://USER:PASSWORD@HOST:27017/DB_NAME"

spark = SparkSession.builder \
    .appName('TestMongoLoad') \
    .config('spark.mongodb.input.uri', MONGO_URL) \
    .config('spark.mongodb.input.collection', COLLECTION) \
    .getOrCreate()
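If the user was actually created in a database other than DB_NAME, the standard authSource query parameter (shown here with placeholder values) names the authentication database explicitly:
MONGO_URL = "mongodb://USER:PASSWORD@HOST:27017/DB_NAME?authSource=admin"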

Can the same Spring Batch job be invoked with different job parameters from multiple threads at the same time?

Our requirement is to write multiple files at the same time. We are using Spring Batch to write the files, and we are launching Spring Batch from different threads. Each thread has its own application context, so we can be sure that singleton beans are not shared across threads. Below is my code snippet.
Spring Batch config:
<bean id="reportDataReader" class="com.test.ist.batch2.rrm.batch.readers.RRMItmeReader"
scope="step">
<property name="verifyCursorPosition" value="false" />
<property name="dataSource" ref="dataSource" />
<property name="sql" value="#{jobParameters['sqlquery']}" />
<property name="rowMapper" ref="valueMapper" />
<property name="fetchSize" value="5000" />
</bean>
<bean id="valueMapper" class="com.test.ist.batch2.rrm.batch.mappers.DBValueMapper" scope="step"></bean>
<bean id="velocityFileWritter"
class="com.test.ist.batch2.rrm.batch.writers.RRMVelocityFileWriter"
scope="step">
</bean>
<bean id="velocityEngine"
class="org.springframework.ui.velocity.VelocityEngineFactoryBean">
<property name="velocityProperties">
<value>
resource.loader = class
class.resource.loader.class = org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader
class.resource.loader.cache = true
class.resource.loader.modificationCheckInterval = 0
</value>
</property>
</bean>
<batch:job id="rrmReportGenJob">
<batch:step id="rrmReportGenStep">
<batch:tasklet>
<batch:chunk reader="reportDataReader" writer="velocityFileWritter"
commit-interval="${reportData.reader.commit-interval}">
</batch:chunk>
</batch:tasklet>
</batch:step>
</batch:job>
This is how we invoke Spring Batch:
ThreadPoolExecutor tpe = new ThreadPoolExecutor(10, 10, 1000000, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
PetReportGenerator rrg = new PetReportGenerator(null);
ThreadTest tt = new ThreadTest(new PetReportGenerator(null), "161");
ThreadTest tt2 = new ThreadTest(new PetReportGenerator(null), "162");
ThreadTest tt3 = new ThreadTest(new PetReportGenerator(null), "163");
ThreadTest tt4 = new ThreadTest(new PetReportGenerator(null), "165");
tpe.execute(tt);
tpe.execute(tt2);
tpe.execute(tt3);
tpe.execute(tt4);
In the constructor of PetReportGenerator we initialize the bean config. Below is the code snippet:
private ApplicationContext appContext;

public PetReportGenerator(ApplicationContext reportContext) {
    if (null == reportContext) {
        //if (null == appContext) {
        appContext = new ClassPathXmlApplicationContext("spring-batch-jobs.xml");
        //}
    } else {
        setAppContext(reportContext);
    }
}
Below is an extract showing how we launch the job:
Job jobToExecute = (Job) SpringUtils.getBean(jobName);
JobParametersBuilder paramsBuilder = new JobParametersBuilder();
// By default add the date/time. This will help to launch the same job again with the same parameters.
paramsBuilder.addLong("JOB_TIME", System.currentTimeMillis());
if (!jobParams.isEmpty()) {
    // Validate input fields.
    String sqlToUse = validator.validateInput(jobParams);
    for (Map.Entry<String, String> entry : jobParams.entrySet()) {
        paramsBuilder.addString(entry.getKey(), entry.getValue());
    }
} else {
    throw new ReportGenerationException("Job input parameter is Empty");
}
jobexe = jobLauncher.run(jobToExecute, paramsBuilder.toJobParameters());
If it is run in a single thread, it works fine. When it is invoked from multiple threads, we get the error below:
09:09:26,742 ERROR pool-1-thread-3 job.AbstractJob:329 - Encountered fatal error executing job
java.lang.NullPointerException
at org.springframework.batch.core.repository.dao.MapJobExecutionDao.synchronizeStatus(MapJobExecutionDao.java:158)
at org.springframework.batch.core.repository.support.SimpleJobRepository.update(SimpleJobRepository.java:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:98)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:262)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:95)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207)
at com.sun.proxy.$Proxy14.update(Unknown Source)
at org.springframework.batch.core.job.AbstractJob.updateStatus(AbstractJob.java:416)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:299)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:135)
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:128)
Can anyone please help me understand what could be wrong?
The MapJobRepository is NOT intended for production use. It is NOT thread-safe. If you need the performance of an in-memory job repository (losing restartability, etc.), use an in-memory database like HSQLDB.
That note aside, if you are using thread-safe components, there is no reason you can't launch multiple job instances from multiple threads.
Are you sure that the MapJobExecutionDao is thread-safe in all respects? I see that a ConcurrentMap is used inside MapJobExecutionDao, but I'm not sure that is enough. I once got a NullPointerException from a Map that was accessed from different threads: one thread triggered a rehash, and when the second thread accessed the map at that very moment, it got back a null.
Are you sure that the combination of your identifying job parameters is unique? I see that you add a JOB_TIME parameter with System.currentTimeMillis(), but do you know whether that really resolves to a unique timestamp?
Have you tried the table-based versions of JobExecutionDao and so on? A sketch follows.
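As a sketch of the database-backed suggestion (bean names, the in-memory URL, and the transactionManager bean are illustrative, not from the original config), the map-based DAOs can be replaced by pointing the job repository at an HSQLDB DataSource and initializing the Spring Batch schema:
<!-- Create the Spring Batch metadata tables in the in-memory database at startup. -->
<jdbc:initialize-database data-source="dataSource">
    <jdbc:script location="classpath:org/springframework/batch/core/schema-hsqldb.sql" />
</jdbc:initialize-database>

<!-- In-memory HSQLDB instead of the non-thread-safe map-based repository. -->
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
    <property name="driverClassName" value="org.hsqldb.jdbcDriver" />
    <property name="url" value="jdbc:hsqldb:mem:batchdb" />
    <property name="username" value="sa" />
    <property name="password" value="" />
</bean>

<!-- Table-based job repository; assumes a transactionManager bean is defined. -->
<batch:job-repository id="jobRepository" data-source="dataSource"
    transaction-manager="transactionManager" />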

HBase client 1.0 get data failed

I ran into a problem getting data with the HBase v1.0 client API.
The following are my HBase and custom ZooKeeper settings:
HMaster hosts:
172.17.0.2 master
127.0.0.1 localhost
192.168.1.32 master2
hbase-site.xml:
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://master:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/hduser/zookeeper/data</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>master</value>
    </property>
    <property>
        <name>dfs.support.append</name>
        <value>true</value>
    </property>
</configuration>
zoo.cfg:
dataDir=/home/hduser/zookeeper/data
clientPort=2181
server.1=master:2888:3888
After I run my client code to connect to ZooKeeper, the connection looks successful:
2015-04-09 08:51:37,901 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.1.30:56295
2015-04-09 08:51:37,906 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /192.168.1.30:56295
2015-04-09 08:51:37,939 [myid:] - INFO [SyncThread:0:ZooKeeperServer@617] - Established session 0x14c9d46501d000b with negotiated timeout 40000 for client /192.168.1.30:56295
2015-04-09 08:52:04,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@347] - Expiring session 0x14c9d46501d000a, timeout of 40000ms exceeded
2015-04-09 08:52:04,000 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x14c9d46501d000a
However, when I try to put data into a table, I see nothing happening in zookeeper.out.
What is the possible problem under these circumstances?
The following Scala code is what I use to connect to HBase.
main.scala:
object Main {
  def main(args: Array[String]): Unit = {
    val con = new HBaseUtils().getConnection
    try {
      val table = con.getTable(TableName.valueOf("TimeIndexTable"))
      val put = new Put(Bytes.toBytes("r2")).addColumn(Bytes.toBytes("post_info"), Bytes.toBytes("abc"), Bytes.toBytes("value"))
      table.put(put)
      println("Success")
    }
    catch {
      case e: Exception => e.printStackTrace()
    }
    finally {
      con.close()
    }
  }
}
HBaseUtils.scala:
class HBaseUtils {
  private val hbaseURL = "192.168.1.31"

  def getConnection = ConnectionFactory.createConnection(setConf)

  def setConf = {
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", hbaseURL)
    config
  }
}
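One detail worth noting (an observation, not a confirmed fix): the client points hbase.zookeeper.quorum at 192.168.1.31, while the server-side hbase-site.xml above declares master as the quorum host. A sketch of a client configuration aligned with the server settings, assuming the client machine can resolve master:
class HBaseUtils {
  def getConnection = ConnectionFactory.createConnection(setConf)

  def setConf = {
    val config = HBaseConfiguration.create()
    // Match the quorum host and client port declared in hbase-site.xml / zoo.cfg above.
    config.set("hbase.zookeeper.quorum", "master")
    config.set("hbase.zookeeper.property.clientPort", "2181")
    config
  }
}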

Spark program behaves differently based on whether --master is set to local[4] or yarn-client

I am using the MovieLens dataset to load movie information into a Spark program and print it using the following code snippet:
import org.apache.spark.{SparkConf, SparkContext}

object MovieApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("movie-recommender")
    val sc = new SparkContext(conf)
    val movieFile = "/mnt/DATASETS/ml-1m/movies.dat"
    val movieData = sc.textFile(movieFile)
    val movies = movieData.map(_.split("::") match { case Array(movieid, title, genres) =>
      val genreList = genres.split("|")
      (movieid, title, genreList)
    })
    println("Num movies:" + movies.count())
    movies.foreach { case movielist =>
      println("ID:" + movielist._1 + " Title:" + movielist._2)
    }
  }
}
When I run the code using the command
spark-submit --master local[4] --class "MovieApp" movie-recommender.jar
I get the expected output:
[root@philli ml]# /usr/lib/spark/bin/spark-submit --master local[4] --class "MovieApp" movie-recommender_2.10-1.0.jar
14/12/05 00:17:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Num movies:3883
ID:2020 Title:Dangerous Liaisons (1988)
ID:2021 Title:Dune (1984)
ID:2022 Title:Last Temptation of Christ, The (1988)
ID:2023 Title:Godfather: Part III, The (1990)
ID:2024 Title:Rapture, The (1991)
ID:2025 Title:Lolita (1997)
ID:2026 Title:Disturbing Behavior (1998)
ID:2027 Title:Mafia! (1998)
ID:2028 Title:Saving Private Ryan (1998)
ID:2029 Title:Billy's Hollywood Screen Kiss (1997)
...
But when I run the same on a Hadoop cluster using the command
spark-submit --master yarn-client --class "MovieApp" movie-recommender.jar
the output is different, as below (no movie details???):
[root@philli ml]# /usr/lib/spark/bin/spark-submit --master yarn-client --class "MovieApp" movie-recommender_2.10-1.0.jar
14/12/05 00:21:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/05 00:21:07 WARN BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
--args is deprecated. Use --arg instead.
Num movies:3883
[root@philli ml]#
Why should the behavior of the program change between running locally and running on the cluster? I built spark-1.1.1 for Hadoop using the command
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
The cluster I am using is HDP 2.1.
A sample of the movies.dat file is as follows:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
When you run the program on a cluster, the foreach closure is executed on the workers, so the println does happen, but it goes to each worker's stdout rather than to the driver's.
Look in the YARN logs and you will find the expected output.
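If the point is to see the rows on the driver itself, a small sketch (keeping the question's tuple layout) is to pull a sample back to the driver before printing:
// Bring a sample of the RDD back to the driver, then print locally.
movies.take(10).foreach { case (movieid, title, _) =>
  println("ID:" + movieid + " Title:" + title)
}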