Unload Redshift data to S3 in Parquet format

I'm trying to unload Redshift data to S3, but it's unloading in CSV format. How can I unload a Redshift table to an S3 bucket in Parquet format using Java?

Redshift's UNLOAD command supports Parquet output natively; see https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:
unload ('select * from lineitem')
to 's3://mybucket/lineitem/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
PARQUET
PARTITION BY (l_shipdate);
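
Since the question asks for Java: the same UNLOAD statement can simply be submitted through the Redshift JDBC driver. Below is a minimal sketch, assuming the Redshift JDBC driver is on the classpath; the cluster endpoint, database, and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UnloadToParquet {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, database, and credentials; replace with your own.
        String url = "jdbc:redshift://your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev";

        String unloadSql =
              "unload ('select * from lineitem') "
            + "to 's3://mybucket/lineitem/' "
            + "iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' "
            + "PARQUET "
            + "PARTITION BY (l_shipdate)";

        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
             Statement stmt = conn.createStatement()) {
            // UNLOAD runs entirely inside the cluster and writes the Parquet
            // files straight to S3; the client only submits the statement.
            stmt.execute(unloadSql);
        }
    }
}

UNLOAD does all the work server-side, so the Java client just waits for the statement to complete.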

Related

Column mapping option argument is not supported for PARQUET based COPY

I have to insert Parquet file data into a Redshift table. The number of columns in the Parquet file might be less than in the Redshift table. I have used the following command:
COPY table_name
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
FORMAT AS PARQUET
But I get the following error when I run the COPY command:
Column mapping option argument is not supported for PARQUET based COPY
I tried to use column mapping like this:
COPY table_name(column1, column2..)
FROM s3_path
ACCESS_KEY_ID ...
SECRET_ACCESS_KEY ...
But I am getting a Delimiter not found error. If I specify FORMAT AS PARQUET in the above COPY command (which has a column list), then I again get Column mapping option argument is not supported for PARQUET based COPY.
Could you please let me know how to resolve this?
Thanks
The number of columns in the Parquet file MUST match the table's columns. You can't use column mapping with Parquet files.
What you can do: create a staging table whose columns match the Parquet file, COPY the file into it, and then run an insert into your final table, e.g. insert into final_table (select col1, col2 from stg_table). A sketch of the full sequence is below.
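
A minimal sketch of that staging sequence over JDBC, with hypothetical table and column names; the bucket path and IAM role are placeholders, and conn is an open connection to the cluster.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ParquetStagingCopy {
    // conn is an open JDBC connection to the Redshift cluster.
    static void loadViaStaging(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // The staging table's columns must match the Parquet file exactly.
            stmt.execute("create temp table stg_table (col1 int, col2 varchar(100))");
            stmt.execute("copy stg_table from 's3://mybucket/data/' "
                       + "iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' "
                       + "format as parquet");
            // The final table may have extra columns; list only the ones you loaded.
            stmt.execute("insert into final_table (col1, col2) "
                       + "select col1, col2 from stg_table");
        }
    }
}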

Create hive table with partitions from a Scala dataframe

I need a way to create a Hive table from a Scala dataframe. The Hive table should have its underlying files in ORC format in an S3 location, partitioned by date.
Here is what I have got so far:
I write the Scala dataframe to S3 in ORC format:
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I then create a Hive table on top of these ORC files:
CREATE EXTERNAL TABLE "tableName"(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
But the Hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove PARTITIONED BY line:
CREATE EXTERNAL TABLE "tableName"(columnName string, date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark dataframe and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
After writing data into S3, the folder structure should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
When creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of the written data unless it sits under a recognized partition.
So to check what partitions are in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose because it becomes more and more time-consuming as partitions accumulate. The best way is to make your code register only the newly created partitions in the metastore, through its API or with an ALTER TABLE ... ADD PARTITION statement, as sketched below.
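
For example, right after writing a day's data you can register just that partition from the same job. A sketch using Spark's Java API, reusing the placeholder folder and partition value from above and a hypothetical table name mytable:

import org.apache.spark.sql.SparkSession;

public class RegisterPartition {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("register-partition")
                .enableHiveSupport()
                .getOrCreate();

        // Register only the newly written partition instead of rescanning the
        // whole table location with MSCK REPAIR TABLE.
        spark.sql("ALTER TABLE mytable ADD IF NOT EXISTS "
                + "PARTITION (transaction_date='2020-10-30') "
                + "LOCATION 's3://anypathyouwant/mytablefolder/transaction_date=2020-10-30'");

        spark.stop();
    }
}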

Create a table from pyspark code on top of parquet file

I am writing data to a Parquet file using peopleDF.write.parquet("people.parquet") in PySpark code. Now, from the same code, I want to create a table on top of this Parquet file so that I can query it later. How can I do that?
You can use the saveAsTable method:
peopleDF.write.saveAsTable('people_table')
You have to create an external table in Hive like this:
CREATE EXTERNAL TABLE my_table (
col1 INT,
col2 INT
) STORED AS PARQUET
LOCATION '/path/to/';
where /path/to/ is the absolute path to the files in HDFS.
If you want to use partitioning, you can add PARTITIONED BY (col3 INT). In that case, to see the data, you have to run MSCK REPAIR TABLE.

sqoop export of hive orc table

I have a Hive table in ORC format populated by the PySpark dataframe writer.
I need to export this table to Oracle. I am having issues exporting the table because Sqoop could not parse the ORC file format.
Are there any special considerations or parameters that need to be specified with the sqoop command for exporting a Hive ORC table?
A simple Google query points to that blog post labeled quite explicitly...
How to Sqoop Export a Hive ORC table to a Oracle Database?
And there is also that SO post labeled...
Reading ORC files and putting into RDBMS?
So it appears that you did not do any research.
By the way, did you consider using Spark to send the data directly into an Oracle staging table, via JDBC, without the intermediate ORC dump?
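
A sketch of that direct approach, reading the ORC-backed Hive table with Spark and appending it to an Oracle staging table over JDBC. The table names, connection URL, and credentials here are placeholders, and the Oracle JDBC driver is assumed to be available on the Spark classpath.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OrcToOracle {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("orc-to-oracle")
                .enableHiveSupport()
                .getOrCreate();

        // Read the ORC-backed Hive table that the dataframe writer populated.
        Dataset<Row> df = spark.table("mydb.my_orc_table");

        // Append it straight into a pre-created Oracle staging table over JDBC.
        df.write()
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
          .option("dbtable", "STAGING_TABLE")
          .option("user", "scott")
          .option("password", "secret")
          .option("driver", "oracle.jdbc.OracleDriver")
          .mode(SaveMode.Append)
          .save();

        spark.stop();
    }
}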
I just worked on the same Sqoop export from ORC to Oracle. Make sure you have your table pre-created with the correct data types, as you have them in the dataframe. Keeping the same column order will also ease the Sqoop export. If you tried any command, please post it.

AWS Redshift loading data from S3

So I'm trying to load data into my Redshift database from an S3 bucket. I have a table 'Example' which has a field 'timestamp' in the format 'YY-MM-DD HH:MM:SS'.
I'm using the COPY command to load the data, so I'm able to load a specific pattern/prefix, but I want to load only data after a certain timestamp, say greater than '2014-07-09 10:00:00'. How do I approach this?
You have two options:
either process the file before you load it to S3 (and upload only the rows with a timestamp greater than $SOME_TIMESTAMP), or
use the COPY command to load the file into an intermediate table (it can even be a temp table, as long as you stay within the same session) and then run:
insert into YOUR_ORIGINAL_TABLE (select * from YOUR_TEMP_TABLE where timestamp > WHATEVER_YOU_NEED)
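
A sketch of the second option from Java over JDBC, with a hypothetical staging table and placeholder endpoint, credentials, S3 prefix, and IAM role. The temp table only exists for the connection's session, so all three statements must run on the same Connection.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FilteredLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster endpoint and credentials.
        String url = "jdbc:redshift://your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev";

        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute("create temp table example_stage (like example)");
            // Add whatever COPY options (delimiter, region, compression, ...) your files need.
            stmt.execute("copy example_stage from 's3://mybucket/prefix/' "
                       + "iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'");
            stmt.execute("insert into example "
                       + "select * from example_stage "
                       + "where \"timestamp\" > '2014-07-09 10:00:00'");
        }
    }
}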