I got an output using cursor.fetchall().
How can I convert the output into a Spark dataframe and create a parquet file in PySpark?
You should use a JDBC connection to connect Spark to your database, so the data is read directly into a dataframe.
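A minimal sketch, assuming a PostgreSQL source reachable over JDBC (the URL, table name, and credentials are placeholders, and the PostgreSQL JDBC driver must be on Spark's classpath):

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres.host:5432/databasename")
    .option("dbtable", "database.table")
    .option("user", "user")
    .option("password", "pass")
    .option("driver", "org.postgresql.Driver")
    .load()
)
# Spark reads the table into a distributed dataframe (nothing is collected
# through a cursor on the driver), so it can be written straight to parquet.
df.write.parquet("s3://bucket/output/table")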
I am converting existing Oracle code to PySpark. While converting Oracle JSON code to PySpark, I found FOR ORDINALITY. How can I convert this to PySpark?
Thank you
I tried with row_number but it is not working.
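If the FOR ORDINALITY column is only numbering the elements produced from a JSON array, posexplode may be a closer fit than row_number, since it returns each element together with its position. A sketch with a hypothetical toy dataframe (the column names are made up for illustration):

from pyspark.sql import functions as F

# Hypothetical input: "items" holds a JSON array as a string.
df = spark.createDataFrame([("a", '["x", "y", "z"]')], ["id", "items"])

result = (
    df.withColumn("items", F.from_json("items", "array<string>"))
      .select("id", F.posexplode("items").alias("ordinality", "item"))
      # posexplode is 0-based; FOR ORDINALITY starts at 1
      .withColumn("ordinality", F.col("ordinality") + 1)
)
result.show()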
Is there any way in PySpark to write a dataframe to parquet format using a variable name as the directory, where that variable is not part of the dataframe schema?
Code
tables_list = ['abc','def','xyz']
for table_name in tables_list:
    df.write.parquet(os.path.join("s3://bucket/output/"), table_name)
Error
table_name(abc,def,xyz) is not part of the schema.
Looks like there is a syntax error. Your code should work if you move the closing parenthesis of os.path.join to the end of the line, so that table_name becomes part of the path instead of a second argument to parquet():
df.write.parquet(os.path.join("s3://bucket/output/", table_name))
I didn't try for s3 but code below creates an "abc" directory under "/tmp" on hdfs.
import os
tables_list = ['abc']
for table_name in tables_list:
    df.write.parquet(os.path.join("/tmp", table_name))
I am trying the below code but it is throwing an error that I am unable to understand:
df.registerTempTable("Temp_table")
spark.sql("Update Temp_table set column_a='1'")
Currently Spark SQL does not support UPDATE statements. The workaround is to create a Delta Lake / Iceberg table from your Spark dataframe and execute your SQL query directly on that table.
For an Iceberg implementation refer to:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
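A minimal Delta Lake sketch, assuming the cluster already has the Delta extensions configured (the path and table name below are placeholders):

# Write the dataframe out as a Delta table, register it, then run the UPDATE on it.
df.write.format("delta").mode("overwrite").save("/tmp/temp_table")
spark.sql("CREATE TABLE IF NOT EXISTS temp_table USING DELTA LOCATION '/tmp/temp_table'")
spark.sql("UPDATE temp_table SET column_a = '1'")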
Is there a way to directly fetch the contents of a table from a PostgreSQL database into a PySpark dataframe using the psycopg2 library?
The solutions online so far only talk about using a pandas dataframe. But that is not feasible for a very large data set in Spark, since it would load all the data onto the driver node.
The code I am using is as follows:
conn = psycopg2.connect(database="databasename", user='user', password='pass',
                        host='postgres.host', port='5432')
cur = conn.cursor()
cur.execute("select * from database.table limit 10")
data = cur.fetchall()
The resulting data is a list of tuples that is difficult to convert to a dataframe.
Any suggestions would be greatly appreciated.
Directly use Spark's JDBC data source to connect to PostgreSQL and read the data; it will return a dataframe.
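For example, mapping your psycopg2 connection parameters onto Spark's JDBC reader (assuming the PostgreSQL JDBC driver is available to Spark, e.g. via spark.jars):

df = spark.read.jdbc(
    url="jdbc:postgresql://postgres.host:5432/databasename",
    table="database.table",
    properties={"user": "user", "password": "pass", "driver": "org.postgresql.Driver"},
)
# Roughly the same check as your "limit 10" query, but without collecting
# the whole table through the driver.
df.show(10)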
I am trying to create some Spark SQL queries for the different tables which I have collected as a list. I want to create SQL queries for all the tables present in the Hive database. The Hive context has been initialized. Following is my approach:
tables = spark.sql("show tables in survey_db")
# registering dataframe as temp view with 2 columns - tableName and db name
tables.createOrReplaceTempView("table_list")
# collecting my table names in a list
table_array= spark.sql("select collect_list(tableName) from table_list").collect()[0][0]
# array values(list)
table_array= [u'survey',u'market',u'customer']
I want to create spark SQL queries for the table names stored in table_array. for example:
for i in table_array:
    spark.sql("select * from survey_db.'i'")
I can't use shell scripting as I have to write a PySpark script for this. Please advise if spark.sql queries can be created using a loop/map. Thanks everyone.
You can achieve the same as follows:
sql_list = [f"select * from survey_db.{table}" for table in table_array]
for sql in sql_list:
    df = spark.sql(sql)
    df.show()