Exporting Hive Arrays to PostgreSQL

I've been using Sqoop to move data between Hive tables and PostgreSQL, but apparently Sqoop does not support the PostgreSQL array type.
I figured out a workaround when importing data from PostgreSQL to Hive: I use array_to_string() in the Sqoop --query parameter to transform my array into a string, and then create an external Hive table with:
colelction.delim $
field.delim ,
But I can't figure out a way to 'reverse' this process. Any clever workaround?
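For reference, a minimal sketch of the Hive half of that import workaround, run here through spark.sql rather than the Hive CLI; the table, columns, and HDFS location (pg_events_ext, id, tags) are made up for illustration, and the Sqoop --query side would still wrap the array in something like array_to_string(tags, '$'):

# Illustrative only: an external table whose array column is rebuilt from the
# '$'-delimited string produced by array_to_string() on the PostgreSQL side.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS pg_events_ext (
        id   BIGINT,
        tags ARRAY<STRING>
    )
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '$'
    LOCATION '/user/hive/external/pg_events_ext'
""")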

Related

pyspark - insert generated primary key in dataframe

I have a dataframe, and for each row I want to insert that row into a Postgres database and get the generated primary key back into the dataframe. I can't find a good way to do this.
I'm trying with an RDD but it doesn't work (pg8000: get inserted id into dataframe).
I think it is possible with this process:
loop over dataframe.collect() in order to run the SQL inserts
run a SQL select into a second dataframe
join the first dataframe with the second
But I think this is not optimized.
Do you have any ideas?
I'm using PySpark in an AWS Glue job. Thanks.
The only things that you can optimize are the data insertion and the connectivity.
As you mentioned, you have two operations in total: one is inserting the data and the other is collecting the data you inserted. Based on my understanding, neither Spark JDBC nor a Python connector like psycopg2 will return the primary keys of the data that you inserted, so you need to do that separately.
Back to your question:
You don't need a for loop to do the inserting, or .collect() to convert back to Python objects. You can use the Spark JDBC data source for PostgreSQL to write the dataframe directly:
# Append the dataframe to the target PostgreSQL table over JDBC.
df.write \
    .mode('append') \
    .format('jdbc') \
    .option('driver', 'org.postgresql.Driver') \
    .option('url', url) \
    .option('dbtable', table_name) \
    .option('user', user) \
    .option('password', password) \
    .save()
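If you then need the generated keys back in Spark, a rough sketch of doing it separately, reusing the same connection options and assuming the table exposes a stable business key to join back on (external_id and id are illustrative column names, not from the question):

# Read the table back and keep the generated primary key plus the business key.
keys_df = spark.read.format('jdbc') \
    .option('driver', 'org.postgresql.Driver') \
    .option('url', url) \
    .option('dbtable', table_name) \
    .option('user', user) \
    .option('password', password) \
    .load() \
    .select('id', 'external_id')  # 'id' is the generated primary key (assumed names)

# Attach the generated keys to the original dataframe.
df_with_ids = df.join(keys_df, on='external_id', how='left')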

Store JSONB PostgreSQL data type column into Athena

I am creating an Athena external table on a CSV that I generated from my PostgreSQL database.
The CSV contains a column that has a jsonb data type.
If possible, I want to exclude this column from the table created in Athena; otherwise, kindly suggest a way to include this data type.
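Athena has no jsonb type; a common workaround (an assumption here, not something stated in the question) is to declare that column as a plain string in the DDL, so it can be ignored in queries or parsed later with Athena's JSON functions. A hypothetical boto3 sketch with made-up table names and S3 paths:

# Hypothetical sketch: the jsonb column is carried as a plain string column.
import boto3

athena = boto3.client('athena', region_name='us-east-1')

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.pg_export (
    id      bigint,
    payload string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/pg-export/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'}
)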

How to upsert/delete DB2 source table data using PySpark / Spark SQL / DataFrames / Spark RDDs?

I'm trying to upsert/delete some of the values in a DB2 source table, which is an existing table on DB2. Is this possible using PySpark / Spark SQL / DataFrames?
There is no direct way to update/delete in a relational database from a PySpark job, but there are workarounds.
(1) You can create an identical empty table (a secondary table) in the relational database, insert data into the secondary table from your PySpark job, and write a DML trigger that performs the desired DML operations on your primary table.
(2) You can create a dataframe (e.g. a) in Spark that is a copy of your existing relational table, merge that existing-table dataframe with the dataframe of current changes (e.g. b), and build a new dataframe (e.g. c) that holds the latest state. Then truncate the relational database table and reload it from the latest-changes dataframe (c); see the sketch after this list.
These are just workarounds and not an optimal solution for huge amounts of data.
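A rough PySpark sketch of workaround (2), assuming DB2 JDBC connectivity and a dataframe called changes that holds the new/updated rows with the same schema and a primary key column id; the driver class, URL, table name, and credentials are placeholders, not from the question. Deletes would be handled the same way: anti-join the ids to remove and simply don't union those rows back.

# dataframe (a): a copy of the existing DB2 table
existing = spark.read.format('jdbc') \
    .option('driver', 'com.ibm.db2.jcc.DB2Driver') \
    .option('url', 'jdbc:db2://dbhost:50000/MYDB') \
    .option('dbtable', 'MYSCHEMA.MY_TABLE') \
    .option('user', user) \
    .option('password', password) \
    .load()

# dataframe (c): untouched rows plus the new/updated rows from changes (dataframe b)
latest = existing.join(changes, on='id', how='left_anti').unionByName(changes)
latest.cache().count()  # materialize before overwriting the table we just read from

# Truncate the DB2 table and reload it with the latest-changes dataframe
latest.write.mode('overwrite').format('jdbc') \
    .option('driver', 'com.ibm.db2.jcc.DB2Driver') \
    .option('url', 'jdbc:db2://dbhost:50000/MYDB') \
    .option('dbtable', 'MYSCHEMA.MY_TABLE') \
    .option('user', user) \
    .option('password', password) \
    .option('truncate', 'true') \
    .save()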

Show definition of custom database datatype

I'm currently working with an existing database which has a custom data type called geometry (from the PostGIS extension). When I extract data from that column, all I get is a string of 50 characters, and I don't know how that string is built, so I can't figure out what the actual data is.
Is there a way to look at custom data type definitions in Postgres?
I tried psql \dT and psql \dT+ but didn't see the definition.
Unfortunately I don't have pgAdmin.
Thanks!
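As a hedged sketch of ways to poke at this from Python (connection details, table, and column names are placeholders, not from the question): the pg_type catalog holds what \dT displays, and for PostGIS specifically the opaque hex string is typically EWKB, which ST_AsText() renders as readable WKT.

# Hypothetical psycopg2 sketch with placeholder connection details and names.
import psycopg2

conn = psycopg2.connect(host='dbhost', dbname='mydb', user='me', password='secret')
cur = conn.cursor()

# The system-catalog entry for the custom type (this is what \dT lists).
cur.execute("""
    SELECT t.typname, t.typtype, t.typlen, n.nspname
    FROM pg_type t
    JOIN pg_namespace n ON n.oid = t.typnamespace
    WHERE t.typname = 'geometry'
""")
print(cur.fetchall())

# For PostGIS, render the hex EWKB value as human-readable WKT.
cur.execute("SELECT ST_AsText(geom) FROM my_table LIMIT 5")
print(cur.fetchall())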

sqoop export of hive orc table

I have a Hive table in ORC format populated by the PySpark DataFrame writer.
I need to export this table to Oracle. I am having issues exporting the table because Sqoop could not parse the ORC file format.
Are there any special considerations or parameters that need to be specified with the sqoop command for exporting a Hive ORC table?
A simple Google query points to that blog post labeled quite explicitly...
How to Sqoop Export a Hive ORC table to a Oracle Database?
And there is also that SO post labeled...
Reading ORC files and putting into RDBMS?
So it appears that you did not do any research.
By the way, did you consider using Spark to send the data directly into an Oracle staging table, via JDBC, without the intermediate ORC dump?
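For that Spark-to-Oracle route, a minimal sketch, assuming the Oracle JDBC driver (ojdbc) is on the Spark classpath; the Hive table name, JDBC URL, staging table, and credentials are placeholders:

# Read the ORC-backed Hive table and append it to an Oracle staging table over JDBC.
spark.table('my_hive_db.my_orc_table') \
    .write.mode('append').format('jdbc') \
    .option('driver', 'oracle.jdbc.OracleDriver') \
    .option('url', 'jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1') \
    .option('dbtable', 'STAGING.MY_TABLE') \
    .option('user', user) \
    .option('password', password) \
    .save()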
I just worked on the same Sqoop export from ORC to Oracle. Make sure your ORC table is pre-created with the correct data types, as you have them in the dataframe. Keeping the same column order will also ease the Sqoop export. If you tried any command, please post it.