Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated while reading from BigQuery in JupyterLab - Scala

I have followed this post: pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
and applied the resolution provided, but I am still getting the same error. Please help.
I am trying to run this in JupyterLab on a Dataproc cluster in GCP.
I am using the Python 3 kernel (not PySpark) so that I can configure the SparkSession in the notebook and include the spark-bigquery-connector required to use the BigQuery Storage API.
!scala -version
The Scala version is: Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('1.2. BigQuery Storage & Spark SQL - Python') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .config("viewsEnabled", "true") \
    .getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
df = spark.read \
    .format('bigquery') \
    .option('table', 'table_name') \
    .load()
Below is the error I am getting:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-30-bcc65722cf80> in <module>
1 df = spark.read \
2 .format('bigquery') \
----> 3 .option('table', 'wmt-rdl-stage.dim_tables.store_dim') \
4 .load()
/usr/lib/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
171 else:
--> 172 return self._df(self._jreader.load())
173
174 #since(1.4)
/opt/conda/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/conda/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o411.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.spark.bigquery.BigQueryUtilScala$
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:42)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.Se

Please switch to gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar. The number after the _ is the Scala binary version.
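For reference, a minimal sketch of the corrected setup (the connector jar is the 2.11 build named above; the table name below is a placeholder, not from the question):
from pyspark.sql import SparkSession

# Match the connector's Scala binary version to the cluster's Scala 2.11 runtime.
spark = SparkSession.builder \
    .appName('1.2. BigQuery Storage & Spark SQL - Python') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar') \
    .config('viewsEnabled', 'true') \
    .getOrCreate()

df = spark.read \
    .format('bigquery') \
    .option('table', 'project.dataset.table') \
    .load()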

Related

Is there a pyspark function to give me 2 decimal places on multiple dataframe columns?

I'm new to coding, and new to PySpark and Python (by new I mean I am a student and am learning it).
I keep getting an error in my code and I can't figure out why. What I'm trying to do is get the output rounded to 2 decimal places. Below is a sample of what I want my output to look like:
+------+--------+------+------+
|col_ID| f.name |bal | avg. |
+------+--------+------+------+
|1234 | Henry |350.45|400.32|
|3456 | Sam |75.12 | 50.60|
+------+--------+------+------+
Instead, here's my code and the error I'm getting with it:
My Code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col #import col function for column manipulation
#import pyspark.sql.functions as func
spark=SparkSession.builder.getOrCreate()
df = spark.read.csv("/user/cloudera/Default2_Data.csv", header = True, inferSchema = True) \
.withColumn("income",round(df["income"],2)) \
.withColumn("balance",func.col("balance").cast('Float'))
#df.select(col("income").alias("income")),
#col("balance").alias("balance"),
#func.round(df["income"],2).alias("income1")
df.show(15)
Output:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o707.withColumn.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) income#1101 missing from student#1144,income#1146,default#1143,RecordID#1142,balance#1145 in operator !Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]. Attribute(s) with the same name appear in the operation: income. Please check if the right attribute(s) are used.;;
!Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]
+- Relation[RecordID#1142,default#1143,student#1144,balance#1145,income#1146] csv
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:326)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1334)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2252)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2219)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
AnalysisException Traceback (most recent call last)
<ipython-input-102-13a967925c21> in <module>
1 df = spark.read.csv("/user/cloudera/Default2_Data.csv", header = True, inferSchema = True) \
----> 2 .withColumn("income",round(df["income"],2)) \
3 .withColumn("balance",func.col("balance").cast('Float'))
4 #df.select(col("income").alias("income")),
5 #col("balance").alias("balance"),
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/sql/dataframe.py in withColumn(self, colName, col)
1987 """
1988 assert isinstance(col, Column), "col should be Column"
-> 1989 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
1990
1991 #ignore_unicode_prefix
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'Resolved attribute(s) income#1101 missing from student#1144,income#1146,default#1143,RecordID#1142,balance#1145 in operator !Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]. Attribute(s) with the same name appear in the operation: income. Please check if the right attribute(s) are used.;;\n!Project [RecordID#1142, default#1143, student#1144, balance#1145, round(income#1101, 2) AS income#1152]\n+- Relation[RecordID#1142,default#1143,student#1144,balance#1145,income#1146] csv\n'
Replace the following line:
withColumn("income", round(df["income"], 2))
with:
withColumn("income", round(col('income'), 2))

Cannot read from PSQL using Pyspark hosted in kubernetes

I have deployed PySpark 3.0.1 in Kubernetes.
I am using Koalas in a Jupyter notebook to perform some transformations, and I need to read from and write to Azure Database for PostgreSQL.
I can read the table with pandas using the following code:
from sqlalchemy import create_engine
import psycopg2
import pandas
uri = 'postgres+psycopg2://<postgreuser>:<postgrepassword>@<server>:5432/<database>'
engine_azure = create_engine(uri, echo=False)
df = pandas.read_sql_query("select * from public.<table>", con=engine_azure)
I want to read this table from Pyspark using this code:
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import databricks.koalas as ks
from s3fs import S3FileSystem
import datetime
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3,org.postgresql:postgresql:42.1.1 pyspark-shell pyspark-shell"
os.environ['PYSPARK_SUBMIT_ARGS2'] = "--packages org.postgresql:postgresql:42.1.1 pyspark-shell"
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
# Create Spark config for our Kubernetes based cluster manager
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")
sparkConf.setAppName("spark")
sparkConf.set("spark.kubernetes.container.image", "<image>")
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.driver.memory", "2000m")
sparkConf.set("spark.executor.memory", "2000m")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
sparkConf.set("spark.driver.port", "29414")
sparkConf.set("spark.driver.host", "<deployment>.svc.cluster.local")
sparkConf.set("spark.driver.extraClassPath", sparkClassPath)
# Initialize our Spark cluster, this will actually
# generate the worker nodes.
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext
df3 = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://<host>:5432/<database>") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "select * from public.<table>") \
.option("user", "<user>") \
.option("password", "<password>") \
.load()
But I receive this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-5-a529178ed9a0> in <module>
1 url = 'jdbc:postgresql://psql-mcf-prod1.postgres.database.azure.com:5342/cpke-prod'
2 properties = {'user': 'adminmcfpsql#psql-mcf-prod1.postgres.database.azure.com', 'password': '4vb44B^V8w2D*q!eQZgl',"driver": "org.postgresql.Driver"}
----> 3 df3 = spark.read.jdbc(url=url, table='select * from public.userinput_write_offs where reversed_date is NULL', properties=properties)
/usr/local/spark/python/pyspark/sql/readwriter.py in jdbc(self, url, table, column, lowerBound, upperBound, numPartitions, predicates, properties)
629 jpredicates = utils.toJArray(gateway, gateway.jvm.java.lang.String, predicates)
630 return self._df(self._jreader.jdbc(url, table, jpredicates, jprop))
--> 631 return self._df(self._jreader.jdbc(url, table, jprop))
632
633
/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o89.jdbc.
: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:275)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
at org.postgresql.Driver.makeConnection(Driver.java:450)
at org.postgresql.Driver.connect(Driver.java:252)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:312)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.postgresql.core.PGStream.<init>(PGStream.java:68)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:144)
... 27 more
Your port number is incorrect - it should be 5432, not 5342. Therefore your connection timed out. If you change the line
.option("url", "jdbc:postgresql://<host>:5342/<database>")
to
.option("url", "jdbc:postgresql://<host>:5432/<database>")
maybe it will solve your problem.
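For reference, a minimal sketch of the read with the corrected port (placeholders as in the question; note that dbtable expects a table name, or a parenthesized subquery with an alias, rather than a bare SELECT statement):
df3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://<host>:5432/<database>") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "public.<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()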

I'm getting an error when I tried to load a CSV in PySpark

I have imported mmlspark to use LightGBM; if I do not do this, everything works fine.
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.getOrCreate()
train_df = spark.read.csv('/content/drive/My Drive/BDCproj/train.csv', header=True, inferSchema=True)
test_df = spark.read.csv('/content/drive/My Drive/BDCproj/test.csv', header=True, inferSchema=True)
Then my error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-55-ba0da364400e> in <module>()
----> 1 train_df = spark.read.csv('/content/drive/My Drive/BDCproj/train.csv', header=True, inferSchema=True)
2 test_df = spark.read.csv('/content/drive/My Drive/BDCproj/test.csv', header=True, inferSchema=True)
/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o214.csv.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:581)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:803)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:721)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1394)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:649)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:723)
at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:779)
... 29 more
My Spark version is 3.0.1.
Try this syntax once. If it helps, do give it a green check.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()
train = spark.read.option("header",True).csv("/complete/path/to/train.csv")
test = spark.read.option("header",True).csv("/complete/path/to/test.csv")
Hope this works!

PySpark MongoDB :: java.lang.NoClassDefFoundError: com/mongodb/client/model/Collation

I was trying to connect to MongoDB Atlas from PySpark and I have the following problem:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
sc = SparkContext
spark = SparkSession.builder \
.config("spark.mongodb.input.uri", "mongodb+srv://#USER#:#PASS##test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
.config("spark.mongodb.output.uri", "mongodb+srv://#USER#:#PASS##test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
The error that returns this code is this:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-3-346df2de8d22> in <module>()
----> 1 df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
171 else:
--> 172 return self._df(self._jreader.load())
173
174 #since(1.4)
c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
c:\users\andres\appdata\local\programs\python\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o34.load.
: java.lang.NoClassDefFoundError: com/mongodb/client/model/Collation
at com.mongodb.spark.config.ReadConfig$.<init>(ReadConfig.scala:50)
at com.mongodb.spark.config.ReadConfig$.<clinit>(ReadConfig.scala)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:67)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.mongodb.client.model.Collation
How can I solve this problem?
Is it a problem with the code or with the references?
In the pyspark config file, I have this:
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb+srv://#USER#:#PASS##test00-la3lt.mongodb.net/db.BUSQUEDAS?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb+srv://#USER#:#PASS##test00-la3lt.mongodb.net/db.BUSQUEDAS" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.1.3
The version of Spark is 2.3.1 and Scala 2.11.8
The cause of this error is that the following references need to be added:
https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongodb-driver/3.8.1/
https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongodb-driver-core/3.8.1/
https://oss.sonatype.org/content/repositories/releases/org/mongodb/bson/3.8.1/
When I added these, the problem was solved.
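One way to pull those artifacts in from the notebook itself is sketched below, assuming the Maven coordinates implied by the links above (org.mongodb:mongodb-driver:3.8.1, org.mongodb:mongodb-driver-core:3.8.1, org.mongodb:bson:3.8.1) and placeholder credentials:
from pyspark.sql import SparkSession

# Add the MongoDB driver jars alongside the connector before the session starts.
spark = SparkSession.builder \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.1.3,"
            "org.mongodb:mongodb-driver:3.8.1,"
            "org.mongodb:mongodb-driver-core:3.8.1,"
            "org.mongodb:bson:3.8.1") \
    .config("spark.mongodb.input.uri", "mongodb+srv://<user>:<password>@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
    .config("spark.mongodb.output.uri", "mongodb+srv://<user>:<password>@test00-la3lt.mongodb.net/db.BUSQUEDAS?retryWrites=true") \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()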

How to add jdbc drivers to classpath when using PySpark?

How / where do I install the JDBC drivers for Spark SQL? I'm running the all-spark-notebook Docker image, and am trying to pull some data directly from a SQL database into Spark.
From what I can tell, I need to include the drivers on my classpath; I'm just not sure how to do that from PySpark.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "jdbc:postgresql:dbserver") \
.load()
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-f3b08ff6d117> in <module>()
2 spark = SparkSession .builder .master("local") .appName("Python Spark SQL basic example") .getOrCreate()
3
----> 4 jdbcDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "jdbc:postgresql:dbserver") .load()
/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
163 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
164 else:
--> 165 return self._df(self._jreader.load())
166
167 #since(1.4)
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o36.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
In order to include the driver for PostgreSQL you can do the following:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
conf = SparkConf()  # create the configuration
conf.set("spark.jars", "/path/to/postgresql-connector-java-someversion-bin.jar")  # set the spark.jars
...
# feed it to the session here
spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local") \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
Now, since you are using Docker, I guess you have to mount the folder that has the driver jar and refer to the mounted folder. (e.g.: How to mount a host directory in a Docker container)
Hope this helps, good luck!
Edit: A different way would be to give the --driver-class-path argument when using spark-submit like this:
spark-submit --driver-class-path=path/to/postgresql-connector-java-someversion-bin.jar file_to_run.py
but I'm guessing this is not how you will run this.
Putting the driver on the PySpark path works, but the correct way to do it is to add something like this:
conf = pyspark.SparkConf().setAll([('spark.executor.id', 'driver'),
('spark.app.id', 'local-1631738601802'),
('spark.app.name', 'PySparkShell'),
('spark.driver.port', '32877'),
('spark.sql.warehouse.dir', 'file:/home/data_analysis_tool/spark-warehouse'),
('spark.driver.host', 'localhost'),
('spark.sql.catalogImplementation', 'hive'),
('spark.rdd.compress', 'True'),
('spark.driver.bindAddress', 'localhost'),
('spark.serializer.objectStreamReset', '100'),
('spark.master', 'local[*]'),
('spark.submit.pyFiles', ''),
('spark.app.startTime', '1631738600836'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.driver.extraClassPath','/tmp/postgresql-42.2.23.jar')])
note the line:
('spark.driver.extraClassPath','/tmp/postgresql-42.2.23.jar')
Here is the whole code:
import psycopg2
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
from sqlalchemy import create_engine
import qgrid
#appName = "PySpark PostgreSQL Example - via psycopg2"
#master = "local"
#spark = SparkSession.builder.master(master).appName(appName).getOrCreate()
conf = pyspark.SparkConf().setAll([('spark.executor.id', 'driver'),
('spark.app.id', 'local-1631738601802'),
('spark.app.name', 'PySparkShell'),
('spark.driver.port', '32877'),
('spark.sql.warehouse.dir', 'file:/home/data_analysis_tool/spark-warehouse'),
('spark.driver.host', 'localhost'),
('spark.sql.catalogImplementation', 'hive'),
('spark.rdd.compress', 'True'),
('spark.driver.bindAddress', 'localhost'),
('spark.serializer.objectStreamReset', '100'),
('spark.master', 'local[*]'),
('spark.submit.pyFiles', ''),
('spark.app.startTime', '1631738600836'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.driver.extraClassPath','/tmp/postgresql-42.2.23.jar')])
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sparkSession = SparkSession(sc)
sparkDataFrame = sparkSession.read.format("jdbc") \
.options(
url="jdbc:postgresql://localhost:5432/Database",
dbtable="test_features_3",
user="database_user",
password="Pa$$word").load()
print (sparkDataFrame.count())
sc.stop()