PySpark Can't Connect to MongoDB, but Command-line Can

Trying to load a MongoDB collection into a PySpark DataFrame. First of all, I'm able to connect using the command line on the NameNode:
mongo mongodb://USER:PASSWORD@HOST/DB_NAME
MongoDB shell version v3.6.3
connecting to: mongodb://HOST/DB_NAME
MongoDB server version: 3.6.3
>
I run the script on the cluster as follows:
spark-submit \
    --master yarn \
    --deploy-mode client \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 3 \
    --num-executors 10 \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 \
    load_from_mongo.py
Now I create a SparkSession:
spark = SparkSession.builder \
    .appName("TestMongoLoad") \
    .config("spark.mongodb.input.uri", 'mongodb://USER:PASSWORD@HOST:27017') \
    .config("spark.mongodb.input.database", DB_NAME) \
    .config("spark.mongodb.input.collection", COLLECTION_NAME) \
    .getOrCreate()
Then I attempt to read in the DataFrame:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .load()
df.show(5, truncate=False)
The result is an authentication failure. Clearly I'm passing something incorrectly...
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-caedf270-dd43-42f2-a39e-e3d1b7134046;1.0
confs: [default]
found org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 in central
found org.mongodb#mongo-java-driver;3.10.2 in central
[3.10.2] org.mongodb#mongo-java-driver;[3.10,3.11)
:: resolution report :: resolve 1129ms :: artifacts dl 4ms
:: modules in use:
org.mongodb#mongo-java-driver;3.10.2 from central in [default]
org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 1 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-caedf270-dd43-42f2-a39e-e3d1b7134046
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/7ms)
20/02/29 21:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/home/ubuntu/server/load_from_mongo.py", line 124, in <module>
main(args)
File "/home/ubuntu/server/load_from_mongo.py", line 102, in main
keyword_df = getKeywordCorpus(args.begin_dt, args.end_dt)
File "/home/ubuntu/server/load_from_mongo.py", line 79, in getKeywordCorpus
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o57.load.
: com.mongodb.MongoSecurityException: Exception authenticating MongoCredential{mechanism=SCRAM-SHA-1, userName='USER', source='admin', password=<hidden>, mechanismProperties={}}
at com.mongodb.internal.connection.SaslAuthenticator.wrapException(SaslAuthenticator.java:173)
at com.mongodb.internal.connection.SaslAuthenticator.access$300(SaslAuthenticator.java:40)
at com.mongodb.internal.connection.SaslAuthenticator$1.run(SaslAuthenticator.java:70)
at com.mongodb.internal.connection.SaslAuthenticator$1.run(SaslAuthenticator.java:47)
at com.mongodb.internal.connection.SaslAuthenticator.doAsSubject(SaslAuthenticator.java:179)
at com.mongodb.internal.connection.SaslAuthenticator.authenticate(SaslAuthenticator.java:47)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.authenticateAll(InternalStreamConnectionInitializer.java:152)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:63)
at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:127)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.open(UsageTrackingInternalConnection.java:50)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.open(DefaultConnectionPool.java:390)
at com.mongodb.internal.connection.DefaultConnectionPool.get(DefaultConnectionPool.java:106)
at com.mongodb.internal.connection.DefaultConnectionPool.get(DefaultConnectionPool.java:92)
at com.mongodb.internal.connection.DefaultServer.getConnection(DefaultServer.java:85)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.getConnection(ClusterBinding.java:115)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:212)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:206)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:116)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:109)
at com.mongodb.operation.CommandReadOperation.execute(CommandReadOperation.java:56)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:179)
at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:184)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:153)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:148)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:157)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.hasSampleAggregateOperator(MongoConnector.scala:237)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator$lzycompute(MongoRDD.scala:221)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator(MongoRDD.scala:221)
at com.mongodb.spark.sql.MongoInferSchema$.apply(MongoInferSchema.scala:68)
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:97)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mongodb.MongoCommandException: Command failed with error 18 (AuthenticationFailed): 'Authentication failed.' on server HOST:27017. The full response is {"ok": 0.0, "errmsg": "Authentication failed.", "code": 18, "codeName": "AuthenticationFailed"}
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:179)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:255)
at com.mongodb.internal.connection.CommandHelper.sendAndReceive(CommandHelper.java:83)
at com.mongodb.internal.connection.CommandHelper.executeCommand(CommandHelper.java:33)
at com.mongodb.internal.connection.SaslAuthenticator.sendSaslStart(SaslAuthenticator.java:130)
at com.mongodb.internal.connection.SaslAuthenticator.access$100(SaslAuthenticator.java:40)
at com.mongodb.internal.connection.SaslAuthenticator$1.run(SaslAuthenticator.java:54)
... 48 more

The answer, as @Lamanus alluded to above, was to change the URI slightly, including the database name in it:
MONGO_URL = "mongodb://USER:PASSWORD@HOST:27017/DB_NAME"

spark = SparkSession.builder \
    .appName('TestMongoLoad') \
    .config('spark.mongodb.input.uri', MONGO_URL) \
    .config('spark.mongodb.input.collection', COLLECTION) \
    .getOrCreate()
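When no database is given in the connection string, the Java driver authenticates against admin by default (visible as source='admin' in the stack trace above); putting /DB_NAME in the URI makes that the authentication database instead. If the user was actually created in some other database, the authSource query parameter can name it explicitly. A minimal sketch, with placeholder credentials:

MONGO_URL = "mongodb://USER:PASSWORD@HOST:27017/DB_NAME?authSource=admin"

spark = SparkSession.builder \
    .appName('TestMongoLoad') \
    .config('spark.mongodb.input.uri', MONGO_URL) \
    .config('spark.mongodb.input.collection', COLLECTION) \
    .getOrCreate()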

Related

unable to connect to Snowflake from AWS Glue

I'm having issues connecting to Snowflake from AWS Glue.
I'm trying to read a table from Snowflake without any luck; any help would be appreciated.
The error is below:
23/02/14 01:32:55 INFO Utils: Successfully started service 'sparkDriver' on port 38325.
23/02/14 01:32:59 INFO GlueContext: GlueMetrics configured and enabled
23/02/14 01:33:01 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
File "/tmp/TestSFConn.py", line 111, in <module>
.option("dbtable", snowflake_database+"."+snowflake_schema+"."+source_table_name).load()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
return self._df(self._jreader.load())
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o104.load.
: java.lang.NoClassDefFoundError: scala/$less$colon$less
at net.snowflake.spark.snowflake.DefaultSource.shortName(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$2(DataSource.scala:659)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$2$adapted(DataSource.scala:659)
at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:247)
at scala.collection.Iterator.foreach(Iterator.scala:937)
at scala.collection.Iterator.foreach$(Iterator.scala:937)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
at scala.collection.IterableLike.foreach(IterableLike.scala:70)
at scala.collection.IterableLike.foreach$(IterableLike.scala:69)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:246)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:104)
at scala.collection.TraversableLike.filter(TraversableLike.scala:258)
What am I missing? I'm not able to figure out why I'm unable to connect.
I have also added the jar files to the "Dependent JARs path" in the Glue job details.
This is what I added:
s3://aws-glue-poc/snowflake_files/spark-snowflake_2.13-2.11.1-spark_3.3.jar,
s3://aws-glue-poc/snowflake_files/snowflake-jdbc-3.13.27.jar
Code below:
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
sc.setLogLevel("ALL")
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("Spark session created")

try:
    SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
    snowflake_database = "DEV_123"
    snowflake_schema = "schema123"
    source_table_name = "TABLE1"
    snowflake_options = {
        "sfURL": "XXXXXXXXXXXXXXXXXXXX.snowflakecomputing.com",
        "sfUser": "USER1",
        "sfPassword": "1234567",
        "sfDatabase": snowflake_database,
        "sfSchema": snowflake_schema,
        "sfWarehouse": "WAREHOUSE_1234",
        "tracing": "ALL"
    }
    print("12345 - Before Read")
    df = spark.read \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**snowflake_options) \
        .option("dbtable", snowflake_database + "." + snowflake_schema + "." + source_table_name) \
        .load()
    df.show()
    print("12345 - After Read")
    df1 = df.select(df["*"])
    df1.write.format("snowflake") \
        .options(**snowflake_options) \
        .option("dbtable", "TABLE_23").mode("overwrite") \
        .save()
except Exception as glue_exception_error:
    print("##################### -- Error: " + str(glue_exception_error) + " -- ##########################")
    raise
For the Spark connector v2.11.1, you will need to use JDBC driver v3.13.24 rather than 3.13.27.
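Also worth checking: the java.lang.NoClassDefFoundError: scala/$less$colon$less in the trace is the classic symptom of a Scala version mismatch, and the jar listed above (spark-snowflake_2.13-...) is built for Scala 2.13, while Glue's Spark runtime is built against Scala 2.12. A hedged sketch of a matching pair for the "Dependent JARs path" (bucket paths are illustrative):

s3://aws-glue-poc/snowflake_files/spark-snowflake_2.12-2.11.1-spark_3.3.jar,
s3://aws-glue-poc/snowflake_files/snowflake-jdbc-3.13.24.jar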

pyspark issue while connecting

I am new to PySpark.
I was trying to initialize a PySpark session but I'm getting the error below.
I am running the pyspark2 command on my local machine.
When I first tried Scala, the Spark session invocation worked correctly; then, when I tried to invoke PySpark, I got this error. Please let me know how I can get past it.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/08 22:55:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/08 22:55:41 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext should be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
py4j.ClientServerConnection.run(ClientServerConnection.java:106)
java.base/java.lang.Thread.run(Thread.java:833)
C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\bin\..\python\pyspark\shell.py:42: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\bin\..\python\pyspark\shell.py", line 38, in <module>
spark = SparkSession._create_shell_session() # type: ignore
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\sql\session.py", line 553, in _create_shell_session
return SparkSession.builder.getOrCreate()
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\sql\session.py", line 228, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 392, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 146, in __init__
self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 209, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 329, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\java_gateway.py", line 1585, in __call__
return_value = get_return_value(
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\bin>SUCCESS: The process with PID 21928 (child process of PID 14900) has been terminated.
SUCCESS: The process with PID 14900 (child process of PID 31720) has been terminated.
SUCCESS: The process with PID 31720 (child process of PID 10468) has been terminated.

java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource when using spark-submit

The following error is raised when using spark-submit; otherwise it works fine from the pyspark shell:
: java.lang.ClassNotFoundException: Failed to find data source:
com.mongodb.spark.sql.DefaultSource. Please find packages at
http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:679)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:967)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:304)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Here's the code that I'm running:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder.appName("myApp")
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll")
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll")
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
    .getOrCreate()
)
people = spark.createDataFrame(
    [
        ("Bilbo Baggins", 50),
        ("Gandalf", 1000),
        ("Thorin", 195),
        ("Balin", 178),
        ("Kili", 77),
        ("Dwalin", 169),
        ("Oin", 167),
        ("Gloin", 158),
        ("Fili", 82),
        ("Bombur", None),
    ],
    ["name", "age"],
)
people.write.format("mongo").mode("append").save()
Try using --packages while submitting the Spark job with spark-submit.
spark-submit --master <url> --class <className> --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.2
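Note that --class only applies to JVM applications; for a PySpark job, drop it and put the script path last. A minimal sketch (the script name is illustrative):

spark-submit --master <url> \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.2 \
    my_mongo_app.py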

Spark to read Cassandra

Hi, I am trying to extract data from Cassandra using AWS Glue, writing PySpark code. Below is the code, which gave me an error. Please suggest how I can import the classes/drivers.
I want to extract from Cassandra and create files in S3 buckets.
#from awsglue.transforms import sys
import sys
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
sparkSession = glueContext.spark_session

# Use the CData JDBC driver to read Cassandra data from the Customer table into a DataFrame.
# Note the populated JDBC URL and driver class name.
#source_df = sparkSession.read.format("jdbc").option("url", "jdbc:cassandra:RTK=5246...;Database=MyCassandraDB;Port=7000;Server=db-datastax02c-dc2.stage.impello.co.uk;").option("dbtable", "reads_by_received_date").option("driver", "cdata.jdbc.cassandra.CassandraDriver").load()
#df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).load()

glueJob = Job(glueContext)
glueJob.init(args['JOB_NAME'], args)

testdf = sparkSession.read.format("org.apache.spark.sql.cassandra") \
    .option("spark.cassandra.connection.host", "server") \
    .options(table="reads_by_received_date", keyspace="keyspace") \
    .option("spark.cassandra.auth.username", "username") \
    .option("spark.cassandra.auth.password", "username") \
    .load()
    #.select("*") \
    #.where("received_year in (2020)") \
    #.cache()

## Convert DataFrames to AWS Glue's DynamicFrames Object
dynamic_dframe = DynamicFrame.fromDF(testdf, glueContext, "dynamic_df")

## Write the DynamicFrame as a file in CSV format to a folder in an S3 bucket.
datatransfer = glueContext.write_dynamic_frame.from_options(
    frame=dynamic_dframe,
    connection_type="s3",
    connection_options={"path": "s3://bucket/"},
    format="csv",
    transformation_ctx="datasink4"
)

glueJob.commit()
Error:
Aug 28, 2020, 4:43:27 PM Pending execution
Traceback (most recent call last):
  File "/tmp/CassandraToS3", line 27, in <module>
    .option("spark.cassandra.auth.password", "password") \
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.load.
: java.io.IOException: Failed to open native connection to Cassandra at {} :: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=4f522a41): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)]
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:181)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:169)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:169)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:89)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$.forSystemLocalPartitioner(TokenFactory.scala:98)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:680)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:57)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=4f522a41): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)]
at com.datastax.oss.driver.api.core.AllNodesFailedException.copy(AllNodesFailedException.java:141)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:633)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createSession(CassandraConnectionFactory.scala:144)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:175)
... 25 more
Suppressed: com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)
at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.fail(ProtocolInitHandler.java:342)
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.writeListener(ChannelHandlerRequest.java:87)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:183)
at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95)
at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30)
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:76)
at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.send(ProtocolInitHandler.java:183)
at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler.onRealConnect(ProtocolInitHandler.java:118)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.lambda$connect$0(ConnectInitHandler.java:57)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at com.datastax.oss.driver.shaded.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Suppressed: com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9042
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at com.datastax.oss.driver.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:921)
at com.datastax.oss.driver.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:897)
at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:740)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:726)
at com.datastax.oss.driver.shaded.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:763)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:788)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:756)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:806)
at com.datastax.oss.driver.shaded.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025)
at com.datastax.oss.driver.shaded.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294)
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:75)
... 20 more
AWS Glue does not provide native library support for Cassandra. You need to get the Cassandra connector and follow the steps described in ETL jobs against non-native JDBC data sources.
Once you have the jar downloaded from here, you can pass it to your job and use it in your PySpark script.
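Once the jar is attached to the job, a minimal sketch of wiring the connector up; the host, port, and credentials are placeholders, and the spark.cassandra.* keys are the standard spark-cassandra-connector settings. Note that the trace above shows the driver trying 127.0.0.1:9042, suggesting the configured host option was never picked up:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Placeholder contact point and credentials; set them on the SparkConf so
# every read through the connector picks them up.
conf = SparkConf() \
    .set("spark.cassandra.connection.host", "cassandra-host.example.com") \
    .set("spark.cassandra.connection.port", "9042") \
    .set("spark.cassandra.auth.username", "username") \
    .set("spark.cassandra.auth.password", "password")

sparkContext = SparkContext(conf=conf)
glueContext = GlueContext(sparkContext)
sparkSession = glueContext.spark_session

# Read the table through the Cassandra data source registered by the jar.
testdf = sparkSession.read.format("org.apache.spark.sql.cassandra") \
    .options(table="reads_by_received_date", keyspace="keyspace") \
    .load()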

Error while running spark on standalone cluster

I'm trying to run a simple Spark job on a standalone cluster. Below is the code:
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("even-numbers").setMaster("spark://sumit-Inspiron-N5110:7077")
    sc = SparkContext(conf)
    inp = sc.parallelize([1, 2, 3, 4, 5])
    even = inp.filter(lambda x: (x % 2 == 0)).collect()
    for i in even:
        print(i)
but I'm getting an error stating "Could not parse Master URL":
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Could not parse Master URL: '<pyspark.conf.SparkConf object at 0x7fb27e864850>'
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2760)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
18/01/07 16:59:47 INFO ShutdownHookManager: Shutdown hook called
18/01/07 16:59:47 INFO ShutdownHookManager: Deleting directory /tmp/spark-0d71782f-617f-44b1-9593-b9cd9267757e
I also tried setting the master as 'local', but it didn't work. Can someone help?
And yes, the command to run the job is:
./bin/spark-submit even.py
Replace the following line
sc = SparkContext(conf)
with
sc = SparkContext(conf=conf)
and you should have it solved. SparkContext's first positional parameter is master, so passing the SparkConf positionally makes Spark try to parse the conf object's string representation as the master URL, which is exactly what the error message shows.
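For completeness, a minimal sketch of the corrected script, unchanged apart from the keyword argument:

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("even-numbers").setMaster("spark://sumit-Inspiron-N5110:7077")
    sc = SparkContext(conf=conf)  # pass the SparkConf as a keyword argument
    inp = sc.parallelize([1, 2, 3, 4, 5])
    even = inp.filter(lambda x: x % 2 == 0).collect()
    for i in even:
        print(i)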