sqoop export of hive orc table - pyspark

I have a Hive table in ORC format populated by a PySpark DataFrameWriter.
I need to export this table to Oracle. I am having issues exporting the table because Sqoop could not parse the ORC file format.
Are there any special considerations or parameters that need to be specified with the sqoop command when exporting a Hive ORC table?

A simple Google query points to that blog post labeled quite explicitly...
How to Sqoop Export a Hive ORC table to a Oracle Database?
And there is also that SO post labeled...
Reading ORC files and putting into RDBMS?
So it appears that you did not do any research.
By the way, did you consider using Spark to send the data directly into an Oracle staging table, via JDBC, without the intermediate ORC dump?
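A minimal sketch of that direct route, assuming the Oracle JDBC driver jar is on the Spark classpath. The helper names, host, service name and table below are hypothetical placeholders, not anything from the question; only the `jdbc` data source options (`url`, `dbtable`, `user`, `password`, `driver`) are standard Spark.

```python
def oracle_jdbc_options(host, port, service_name, table, user, password):
    """Build the option map for Spark's generic JDBC data source (hypothetical helper)."""
    return {
        # Thin-driver URL addressing the database by service name.
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service_name}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
    }


def write_to_oracle(df, options, mode="append"):
    """Write a Spark DataFrame straight to Oracle, skipping the ORC dump."""
    df.write.format("jdbc").options(**options).mode(mode).save()


# Placeholder connection details, for illustration only.
opts = oracle_jdbc_options(
    "dbhost", 1521, "ORCLPDB1", "STAGING.MY_TABLE", "scott", "secret"
)
# write_to_oracle(df, opts)  # df is the DataFrame that was written to the ORC table
```

Writing to a staging table first, as suggested, keeps a failed Spark job from leaving the final table half loaded.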

I just worked on the same Sqoop export from ORC to Oracle. Make sure you have your ORC table pre-created with the correct datatypes, as you have them in the dataframe. Keeping the columns in the same order will also ease the Sqoop export. If you tried any command, please post it.
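For reference, one commonly used way to export an ORC table is Sqoop's HCatalog integration, where the table is read through HCatalog instead of parsing the files under `--export-dir`. The sketch below only assembles such a command. The function name, connection string, table names and password file path are assumptions for illustration, so adapt them to your environment.

```python
import shlex


def sqoop_export_orc_command(jdbc_url, username, password_file,
                             oracle_table, hive_db, hive_table):
    """Assemble a sqoop export command that reads the Hive table via HCatalog."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", oracle_table,          # target table in Oracle
        "--hcatalog-database", hive_db,   # source Hive database
        "--hcatalog-table", hive_table,   # source Hive ORC table
    ]


# Placeholder values, for illustration only.
cmd = sqoop_export_orc_command(
    "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1",
    "scott",
    "/user/scott/.oracle_password",
    "MY_TABLE",
    "default",
    "my_orc_table",
)
print(shlex.join(cmd))  # paste into a shell, or run with subprocess.run(cmd, check=True)
```

With this form the column names and types of the Hive table are matched against the Oracle table, which is why the pre-created datatypes and column order mentioned above matter.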

Related

Create hive table with partitions from a Scala dataframe

I need a way to create a hive table from a Scala dataframe. The hive table should have underlying files in ORC format in S3 location partitioned by date.
Here is what I have got so far:
I write the scala dataframe to S3 in ORC format
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I create a hive table on the top of these ORC files now:
CREATE EXTERNAL TABLE "tableName"(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
But the hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove PARTITIONED BY line:
CREATE EXTERNAL TABLE "tableName"(columnName string, date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that hive does not recognize the partitions created by spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark dataframe and I just need a way to create a Hive table on top of it (the underlying files being in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
After writing data into S3, the folder should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
When creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of the written data unless it sits under a recognized partition.
To check which partitions are in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose because it becomes more and more time consuming as the number of partitions grows. The best way is to make your code add only the newly created partitions to your metastore through a REST API.
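As a rough alternative to a REST call (a swapped-in technique, not what the answer above uses), newly written partitions can also be registered with plain `ALTER TABLE ... ADD PARTITION` statements. The sketch below only builds the statements; the function name is hypothetical, and it assumes the folder layout `<location>/<column>=<value>` described above.

```python
def add_partition_statements(table, partition_column, values, base_location):
    """Build one ALTER TABLE ... ADD PARTITION statement per new partition value."""
    base = base_location.rstrip("/")
    statements = []
    for value in values:
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            # Backticks guard column names such as `date` that clash with keywords.
            f"PARTITION (`{partition_column}`='{value}') "
            f"LOCATION '{base}/{partition_column}={value}'"
        )
    return statements


# Example values taken from the folder layout above.
stmts = add_partition_statements(
    "db.tableName", "transaction_date", ["2020-10-30"],
    "s3://anypathyouwant/mytablefolder/",
)
# for stmt in stmts:
#     spark.sql(stmt)  # assumes a SparkSession with Hive support enabled
```

Because each statement touches only one partition, the cost stays constant as the table grows, unlike a full MSCK REPAIR TABLE scan.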

Airflow export schema only from PostgreSQL to bigquery

In Airflow, we can export databases such as Postgres and MySQL to GCS. The operators have an option called schema file, where the schema of the source table is exported as a JSON file that we can then use to create the table in BigQuery.
Unfortunately, we can only export the schema file by running a query such as select * from table; (or we can reduce the rows with select * from table limit 1). It uploads both the data and the schema files.
Is there a way to export only the schema file without data?
You can use INFORMATION_SCHEMA to pull the schema/metadata/columns from your table.
For example:
SELECT
*
FROM
`bigquery-public-data`.census_bureau_usa.INFORMATION_SCHEMA.COLUMNS
WHERE
table_name="population_by_zip_2010"
See here.
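The same idea can be applied on the PostgreSQL side, since PostgreSQL also exposes `information_schema.columns`. The sketch below turns such rows into a BigQuery-style JSON schema without reading any table data. The type mapping is an assumption and deliberately incomplete, and the function and constant names are hypothetical, so adjust them to your tables.

```python
import json

# Query to run against PostgreSQL; the table name is passed as a parameter.
COLUMNS_QUERY = (
    "SELECT column_name, data_type, is_nullable "
    "FROM information_schema.columns "
    "WHERE table_name = %s ORDER BY ordinal_position"
)

# Assumed PostgreSQL -> BigQuery type mapping; unknown types fall back to STRING.
PG_TO_BQ = {
    "smallint": "INTEGER", "integer": "INTEGER", "bigint": "INTEGER",
    "real": "FLOAT", "double precision": "FLOAT", "numeric": "NUMERIC",
    "boolean": "BOOLEAN", "date": "DATE",
    "timestamp with time zone": "TIMESTAMP",
    "text": "STRING", "character varying": "STRING",
}


def bigquery_schema(rows):
    """Convert (column_name, data_type, is_nullable) rows into BigQuery schema fields."""
    return [
        {
            "name": name,
            "type": PG_TO_BQ.get(data_type, "STRING"),
            "mode": "NULLABLE" if is_nullable == "YES" else "REQUIRED",
        }
        for name, data_type, is_nullable in rows
    ]


# Rows as they would come back from COLUMNS_QUERY (made-up example table).
example_rows = [("id", "integer", "NO"), ("name", "text", "YES")]
schema_json = json.dumps(bigquery_schema(example_rows), indent=2)
```

The resulting JSON can be uploaded to GCS on its own, for instance from a Python task that runs `COLUMNS_QUERY` through a Postgres hook.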

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table to a parquet location). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do it from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of definition of a table being created, so it should precede ...AS SELECT..., see Spark SQL syntax.
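A sketch of the reordered statement, under a few assumptions: it uses Spark's `USING` data source syntax, lists the partition column without a type (in a CTAS the type comes from the query), and drops EXTERNAL because no LOCATION is given. The helper name is hypothetical.

```python
def ctas_partitioned_sql(target_table, source_view, partition_column,
                         file_format="parquet"):
    """Build a CREATE TABLE ... AS SELECT with PARTITIONED BY ahead of AS SELECT."""
    return (
        f"CREATE TABLE IF NOT EXISTS {target_table} "
        f"USING {file_format} "
        f"PARTITIONED BY ({partition_column}) "
        f"AS SELECT * FROM {source_view}"
    )


sql = ctas_partitioned_sql("mydb.result", "spark_tmp_view", "DATE_KEY")
# spark.sql(sql)  # assumes an existing SparkSession named spark
```

For inserting data afterwards, something like `spark.table("spark_tmp_view").write.insertInto("mydb.result")` is one option; note that `insertInto` matches columns by position, not by name.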

PySpark - Saving Hive Table - org.apache.spark.SparkException: Cannot recognize hive type string

I am saving a spark dataframe to a hive table. The spark dataframe is a nested json data structure. I am able to save the dataframe as files but it fails at the point where it creates a hive table on top of it saying
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a hive table schema first and then insert into it since the data frame consists of a couple hundreds of nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to debug what the issue is.
The issue I was having was to do with a few columns which were named as numbers "1","2","3". Removing such columns in the dataframe let me create a hive table without any errors.
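With a couple hundred nested columns, finding such names by eye is tedious. Here is a small sketch that walks the schema and lists every field whose name consists only of digits. It works on the plain dict returned by `df.schema.jsonValue()`, so it needs no Spark imports; the function names and the sample schema are made up for illustration.

```python
def numeric_field_paths(schema_json, prefix=""):
    """Return dotted paths of all fields whose name is made up only of digits."""
    paths = []
    for field in schema_json.get("fields", []):
        path = f"{prefix}{field['name']}"
        if field["name"].isdigit():
            paths.append(path)
        paths.extend(_walk_type(field["type"], path))
    return paths


def _walk_type(dtype, path):
    """Descend into struct, array and map types; primitive types are plain strings."""
    if not isinstance(dtype, dict):
        return []
    kind = dtype.get("type")
    if kind == "struct":
        return numeric_field_paths(dtype, prefix=path + ".")
    if kind == "array":
        return _walk_type(dtype["elementType"], path)
    if kind == "map":
        return _walk_type(dtype["valueType"], path)
    return []


# Hand-written sample in the shape produced by df.schema.jsonValue().
sample = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": True, "metadata": {}},
    {"name": "payload", "nullable": True, "metadata": {}, "type": {
        "type": "struct", "fields": [
            {"name": "1", "type": "string", "nullable": True, "metadata": {}},
            {"name": "ok", "type": "string", "nullable": True, "metadata": {}},
        ]}},
]}
bad = numeric_field_paths(sample)  # on a real dataframe: numeric_field_paths(df.schema.jsonValue())
```

The returned paths show which columns to drop or rename before calling saveAsTable.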

Exporting Hive Arrays to Postgresql

I've been using Sqoop to move data between Hive tables and Postgresql. But apparently Sqoop does not support the Postgresql Array type.
I figured out a workaround when importing data from Postgresql to Hive: I use array_to_string() in the Sqoop --query parameter to transform my array to a string, and then create an external Hive table with:
colelction.delim $
field.delim ,
But I can't figure out a way to 'reverse' this process. Any clever workaround?