Databricks - read table from Snowflake to Databricks - pyspark

I've seen a few questions on Databricks to Snowflake but my question is how to get a table from Snowflake into Databricks.
What I've done so far:
Created a cluster and attached the cluster to my notebook (I'm using Python)
# Use secrets DBUtil to get Snowflake credentials.
user = dbutils.secrets.get("snowflake-user", "secret-user")
password = dbutils.secrets.get("snowflake-pw", "secret-pw")
sf_url = dbutils.secrets.get("snowflake-url", "secret-sf-url")
# snowflake connection options
options = {
"sfUrl": sf_url,
"sfUser": user,
"sfPassword": password,
"sfDatabase": "DEV",
"sfSchema": "PUBLIC",
"sfWarehouse": "DEV_WH"
}
then I tried to use spark.read to read the FBK_VIDEOS table in Snowflake:
# Read table from Snowflake.
df = spark.read.format("snowflake").options(**options).option("dbtable", "FBK_VIDEOS").load()
I've also tried: option("dbtable", "SELECT * FROM FBK_VIDEOS").load()
but I see the following error for df:
net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation
error:
The Traceback shows this when expanded:
Py4JJavaError Traceback (most recent call last)
<command-3339556253176158> in <module>
1 # Read table from Snowflake.
----> 2 df = spark.read.format("snowflake").options(**options).option("dbtable", "FBK_VIDEOS").load()
3
4 display(df)
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
208 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
209 else:
--> 210 return self._df(self._jreader.load())
211
212 def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)

Answering for completeness and for future users who might have a similar problem.
As answered in the comments: Snowflake uses a role-based access control system, so it is vitally important that the role being used has the necessary privileges. In this case, there is no USE ROLE shown in the code so whatever role was active when the query was run did not have sufficient privileges.

Related

Keep getting the Py4JError when working with the pyspark dataframe

I have a docker container up and running in vs code. With pyspark I connect to a postgres database on my local machine:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/opt/spark/jars/postgresql-42.2.5.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://host.docker.internal:5432/postgres") \
.option("dbtable", "chicago_crime") \
.option("user", "postgres") \
.option("password", "postgres") \
.option("driver", "org.postgresql.Driver") \
.load()
type(df)
Output:
pyspark.sql.dataframe.DataFrame
Example code of what works:
df.printSchema()
df.select('ogc_fid').show() #(Raises a Py4JJavaError sometimes)
Example code of what does not work:
df.show(1) # Py4JJavaError and ConnectionRefusedError: [Errno 111] Connection refused
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
[... skipping hidden 1 frame]
Cell In[2], line 1
----> 1 df.show(1)
File /usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py:606, in DataFrame.show(self, n, truncate, vertical)
605 if isinstance(truncate, bool) and truncate:
--> 606 print(self._jdf.showString(n, 20, vertical))
607 else:
File /usr/local/lib/python3.9/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
File /usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
189 try:
--> 190 return f(*a, **kw)
191 except Py4JJavaError as e:
File /usr/local/lib/python3.9/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
325 if answer[1] == REFERENCE_TYPE:
...
--> 438 self.socket.connect((self.java_address, self.java_port))
439 self.stream = self.socket.makefile("rb")
440 self.is_connected = True
ConnectionRefusedError: [Errno 111] Connection refused
Anyone knows what this Py4JJavaError is? And how to overcome it?
PySpark is just a Wrapper around the actual implementation of Spark, which is written in Scala. Py4J enables you to communicate with the JVM process in Python.
That means the Py4JJavaError is only an abstraction, it tells you that the JVM process threw an Exception.
The real error is ConnectionRefusedError: [Errno 111] Connection refused.
I assume the error is caused while connecting to your Postgres instance.

Create pyspark dataframe from parquet file

I am quite new in pyspark and I am still trying to figure out who things work. What I am trying to do is after loading a parquet file in memory using pyarrow Itry to make it to pyspark dataframe. But I am getting an error.
I should mention that I am not reading directly through pyspark because the file in in s3 which gives me another error about "no filesystem for scheme s3"
so I am trying to work around. Below I have a reproducible example.
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
parquet_file=pq.ParquetDataset('s3filepath.parquet',filesystem=s3)
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
spark.createDataFrame(parquet_file)
------------------------------------------------------------------
TypeError Traceback (most recent
call last)
<ipython-input-20-0cb2dd287606> in <module>
----> 1 spark.createDataFrame(pandas_dataframe)
/usr/local/spark/python/pyspark/sql/session.py in
createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema =
self._createFromRDD(data.map(prepare), schema, samplingRatio)
747 else:
--> 748 rdd, schema =
self._createFromLocal(map(prepare, data), schema)
749 jrdd =
self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf =
self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(),
schema.json())
TypeError: 'ParquetDataset' object is not iterable
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local', "retail")
sqlC = SQLContext(sc)
This is how you should read parquet files to spark df:
df = sqlC.read.parquet('path_to_file_or_dir')
You can read data from S3 via Spark as long as you have the public and secret keys for the S3 bucket ... this would be more efficient compared to going though arrow via pandas and then converting to spark dataframe because you would have to parallelize the serial read.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
df = spark.read.parquet("s3://path/to/parquet/files")
source doc => https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#access-aws-s3-directly

Error reading.writing from phoenix using pyspark

I am trying to put together a data pipeline on HDP 2.6.3 sandbox.(docker) I am using pyspark with phoenix (4.7) and HBase.
I have installed phoenix project from maven and successfully created a table with test records. I can see data in Hbase as well.
Now i am trying to read data from the table using pyspark with the following code:
import phoenix
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(appName="Phoenix test")
sqlContext = SQLContext(sc)
table = sqlContext.read.format("org.apache.phoenix.spark").option("table", "INPUT_TABLE").option("zkUrl", "localhost:2181:/hbase-unsecure").load()
phoenix ddl:
CREATE TABLE INPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
UPSERT INTO INPUT_TABLE (id, col1, col2) VALUES (1, 'test_row_1',111);
UPSERT INTO INPUT_TABLE (id, col1, col2) VALUES (2, 'test_row_2',111 );
call:
spark-submit --class org.apache.phoenix.spark --jars /usr/hdp/current/phoenix-server/phoenix-4.7.0.2.5.0.0-1245-client.jar --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/spark2/conf/hbase-site.xml phoenix_test.py
Traceback (most recent call last):
File "/root/hdp/process_data.py", line 42, in
.format(data_source_format)\
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 593, in save
File "/usr/lib/python2.6/site-packages/py4j-0.10.6-py2.6.egg/py4j/java_gateway.py", line 1160, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/python2.6/site-packages/py4j-0.10.6-py2.6.egg/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.save.
: java.lang.UnsupportedOperationException: empty.tail
thanks,
clairvoyant

pyspark: AnalysisException when joining two data frame

I have two data frame created from sparkSQL:
df1 = sqlContext.sql(""" ...""")
df2 = sqlContext.sql(""" ...""")
I tried to join these two data frame on the column my_id like below:
from pyspark.sql.functions import col
combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner')
Then I got the following error. Any idea what I missed? Thanks!
AnalysisException Traceback (most recent call last)
<ipython-input-11-45f5313387cc> in <module>()
3 from pyspark.sql.functions import col
4
----> 5 combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner')
6 combined_df.take(10)
/usr/local/spark-latest/python/pyspark/sql/dataframe.py in join(self, other, on, how)
770 how = "inner"
771 assert isinstance(how, basestring), "how should be basestring"
--> 772 jdf = self._jdf.join(other._jdf, on, how)
773 return DataFrame(jdf, self.sql_ctx)
774
/usr/local/spark-latest/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/spark-latest/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: "cannot resolve '`df1.my_id`' given input columns: [...
I think issue with your code is , you are trying to give "df1.my_id" as a column name instead of just col('my_id'). That is why the error says cannot resolve df1.my_id given input columns
you can do this without importing col.
combined_df = df1.join(df2, df1.my_id == df2.my_id, 'inner')
Not sure about pyspark but this should work if you have same field name in both dataframe
combineDf = df1.join(df2, 'my_id', 'outer')
Hope this helps!

Write PySpark 2.0.1 DataFrame as PostgreSQL table: UnicodeEncodeError

I've got the following error when trying to write a spark DataFrame as a PostgreSQL table:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-121-159b38b4c333> in <module>()
7 "password":"112211234",
8 "driver":"org.postgresql.Driver",
----> 9 "client_encoding":"utf8"
10 }
11 )
/home/ec2-user/spark-2.0.1-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in jdbc(self, url, table, mode, properties)
760 for k in properties:
761 jprop.setProperty(k, properties[k])
--> 762 self._jwrite.mode(mode).jdbc(url, table, jprop)
763
764
/home/ec2-user/spark-2.0.1-bin-hadoop2.6/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/home/ec2-user/spark-2.0.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/home/ec2-user/spark-2.0.1-bin-hadoop2.6/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
<type 'str'>: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'An error occurred while calling o3418.jdbc.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 105.0 failed 4 times, most recent failure: Lost task 5.3 in stage 105.0 (TID 1937, 10.0.0.52): org.postgresql.util.PSQLException: \u041f\u043e\u0434\u0441\u043e\u0435\u0434\u0438\u043d\u0435\u043d\u0438\u0435 \u043f\u043e \u0430\u0434\u0440\u0435\u0441\u0443 localhost:5432 \u043e\u0442\u043a\u043b\u043e\u043d\u0435\u043d\u043e. \u041f\u0440\u043e\u0432\u0435\u0440\u044c\u0442\u0435 \u0447\u0442\u043e \u0445\u043e\u0441\u0442 \u0438 \u043f\u043e\u0440\u0442 \u0443\u043a\u0430\u0437\u0430\u043d\u044b \u043f\u0440\u0430\u0432\u0438\u043b\u044c\u043d\u043e \u0438 \u0447\u0442\u043e postmaster \u043f\u0440\u0438\u043d\u0438\u043c\u0430\u0435\u0442 TCP/IP-\u043f\u043e\u0434\u0441\u043e\u0435\u0434\u0438\u043d\u0435\u043d\u0438\u044f.\n\tat org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262)\n\tat org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:52)\n\tat org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:216)\n\tat org.postgresql.Driver.makeConnection(Driver.java:404)\n\tat org.postgresql.Driver.connect(Driver.java:272)\n\tat org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
The DataFrame is the following:
from pyspark.sql import SQLContext, Row, DataFrame, SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.appName("test") \
.config("spark.some.config.option", "test") \
.getOrCreate()
fields = [
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True)
]
schema = StructType(fields)
test = spark.createDataFrame([
Row(id=1, name=u"a", age=34),
Row(id=2, name=u"b", age=25)
], schema)
test.show()
i.e. this one
+---+----+---+
| id|name|age|
+---+----+---+
| 1| a| 34|
| 2| b| 25|
+---+----+---+
To write it to PostgreSQL I use the code:
test.write.jdbc(
url="jdbc:postgresql://localhost:5432/db",
table="test",
mode="overwrite",
properties={
"user":"root",
"password":"12345",
"driver":"org.postgresql.Driver",
"client_encoding":"utf8"
}
)
But it generates the error shown above. Cannot find the reason of this exception.
The reading of an existing table created using postres console works fine.
I will be grateful any help.