I am converting existing Oracle code to PySpark. While converting Oracle JSON code to PySpark, I found FOR ORDINALITY. How can we convert this to PySpark?
Thank you
I tried row_number but it is not working.
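For context, FOR ORDINALITY in Oracle's JSON_TABLE assigns a 1-based sequence number to each element of a JSON array. In PySpark, a similar per-element index usually comes from posexplode rather than row_number. A minimal sketch under that assumption (the column name and data are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: each row carries a JSON array in the "items" column
df = spark.createDataFrame([('["a", "b", "c"]',)], ["items"])

# Parse the JSON string into an array, then posexplode it.
# "pos" is a 0-based index per element, analogous to FOR ORDINALITY (which is 1-based).
exploded = (
    df.withColumn("items", F.from_json("items", "array<string>"))
      .select(F.posexplode("items").alias("ordinality", "value"))
      .withColumn("ordinality", F.col("ordinality") + 1)  # shift to 1-based like Oracle
)
exploded.show()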
I am trying the code below, but it throws an error that I am unable to understand:
df.registerTempTable("Temp_table")
spark.sql("Update Temp_table set column_a='1'")
Currently Spark SQL does not support UPDATE statements on plain temporary views. The workaround is to create a Delta Lake / Iceberg table from your Spark dataframe and execute your SQL query directly on that table.
For an Iceberg implementation, refer to:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
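For Delta Lake, a minimal sketch (assuming the Delta Lake library and its SQL extensions are already configured on the cluster; the table and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative dataframe standing in for the one in the question
df = spark.createDataFrame([("0", "x"), ("0", "y")], ["column_a", "column_b"])

# Persist it as a Delta table; plain temp views do not accept UPDATE
df.write.format("delta").mode("overwrite").saveAsTable("temp_table_delta")

# UPDATE is supported against the Delta table
spark.sql("UPDATE temp_table_delta SET column_a = '1'")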
I got an output using cursor.fetchall().
How can I convert the output into a Spark dataframe and create a Parquet file in PySpark?
You should use a JDBC connection to connect Spark to your database.
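For example, a rough sketch that reads the table through Spark's JDBC data source and writes it out as Parquet (the URL, table name, credentials, and output path are placeholders to adjust for your database):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table via JDBC instead of cursor.fetchall()
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/dbname")  # placeholder; use your database's JDBC URL and driver
    .option("dbtable", "schema.table")
    .option("user", "user")
    .option("password", "pass")
    .load()
)

# Write the result out as Parquet
df.write.mode("overwrite").parquet("/path/to/output")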
Is there a way to directly fetch the contents of a table from a PostgreSQL database into a PySpark dataframe using the psycopg2 library?
The solutions online so far only talk about using a pandas dataframe. But that is not feasible for a very large data set in Spark, since it would load all the data onto the driver node.
The code I am using is as follows:
conn = psycopg2.connect(database="databasename", user="user", password="pass",
                        host="postgres.host", port="5432")
cur = conn.cursor()
cur.execute("select * from database.table limit 10")
data = cur.fetchall()
The result is a list of tuples, which is difficult to convert to a dataframe.
Any suggestions would be greatly appreciated
Use Spark's JDBC data source to connect to PostgreSQL and read the data directly; it will return a dataframe.
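A minimal sketch of that, assuming the PostgreSQL JDBC driver jar is on the Spark classpath and reusing the placeholder connection details from the question:

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres.host:5432/databasename")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "schema.table")  # or .option("query", "select * from schema.table limit 10")
    .option("user", "user")
    .option("password", "pass")
    .load()
)
df.show()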
I am working with PySpark in AWS Glue.
I want to execute a stored procedure/function on a PostgreSQL database.
Is it possible?
What is the syntax? Is there any special package needed?
Ankur
You can try using a module like pg8000 to run this function.
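A rough sketch with pg8000's native interface (the connection details and the function name are placeholders):

import pg8000.native

# Placeholder connection details
con = pg8000.native.Connection(
    user="user", password="pass", host="postgres.host", port=5432, database="databasename"
)

# Call a hypothetical stored function
rows = con.run("SELECT my_function()")
print(rows)

con.close()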
You can also try calling the Postgres function like you would select data from a specific table, using the Spark read function with JDBC as the format. Considering Glue uses PySpark on the back end, I would imagine just giving the function name instead of a table name should do the trick. Just remember to add the JDBC driver to your Glue job.
e.g., you can do this in Spark:
jdbcDF = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("driver", "org.postgresql.Driver")
    .option("query", "SELECT * from function()")
    .option("user", "user")
    .option("password", "password")
    .load()
)
I am having trouble casting a date column to a string using sqoop-import from an Oracle database to an HDFS Parquet file. I am using the following:
sqoop-import -Doraoop.oracle.session.initialization.statements="alter session set nls_date_format='YYYYMMDD'"
My understanding is that this should execute the above statement before it begins transferring data. I have also tried
-Duser.nls_date_format="YYYYMMDD"
But this doesn't work either; the resulting Parquet file still contains the original date format as listed in the table. If it matters, I am running these in a bash script, and I am also casting the same date columns to string using --map-column-java "MY_DATE_COL_NAME=String". What am I doing wrong?
Thanks very much.
Source: Sqoop User Guide
Oracle JDBC represents DATE and TIME SQL types as TIMESTAMP values. Any DATE columns in an Oracle database will be imported as a TIMESTAMP in Sqoop, and Sqoop-generated code will store these values in java.sql.Timestamp fields.
You can try casting the date to a string within the import query itself.
For example:
sqoop import --query "SELECT col1, col2, ..., TO_CHAR(MY_DATE_COL_NAME, 'YYYY-MM-DD') AS MY_DATE_COL_NAME FROM TableName WHERE \$CONDITIONS"
Note that because the date format needs single quotes, the whole query is wrapped in double quotes, with $CONDITIONS escaped as \$CONDITIONS.