load orc format to aurora postgres DB - postgresql

We have a ORC file format which are stored in s3 and we want to load the files into AWS Aurora postgres DB .
What we got from internet was :
postgres support csv, txt and other formats not ORC ..
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can any one please help us to find a solution?

This date PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert these tiles in either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History
page, this will automatically generate a CSV file containing the results.
This would not be optimal as you’d pay not only the transform price but also the he storage twice (as original ORC and converted CSV), but it would allow you to convert the data pretty easily.
A better way to do it would instead be to use a service like AWS Glue, that supports S3 as source and that has an Aurora connector. Using this method would give you an actual ETL and even if now you just need the E(xtract) and L(oad), would still leave the door open for any kind of transform you might need in the future.
In this AWS Blog titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.

Related

How to do BULK INSERT into PostgreSQL without specifying physical csv file in Apache NiFi

I need to do BULK INSERT into PostgreSQL table in Apache NiFi without specifying the physical csv file in COPY command. I just cannot store the CSV files on the disk and would like to do BULK INSERT using a flow file that is coming from previous processors and is already in CSV format (or I can change it to json, that's not an issue).
Please advise, what is the best way to do this in Apache NiFi?
PostgreSQL's COPY operation requires a path to a file. I'd recommend looking at PutDatabaseRecord which generates a PreparedStatement based on your CSV data and executes the statement in a single batch (unless Maximum Batch Size is set)

Load Parquet Files from AWS Glue To Redshift

Have an AWS Glue crawler which is creating a data catalog with all the tables from an S3 directory that contains parquet files.
I need to copy the contents of these files/ tables to the Redshift table.
I have a few tables where the Parquet file data size cannot be supported by Redshift. VARCHAR(6635) is not sufficient.
In the ideal scenario, would like to truncate these tables.
How do I use the COPY command to load this data into Redshift?
If I use spectrum, I can only user INSERT INTO from the external table to Redshift table, which I understand is slower than a bulk copy?
You can use string instead of varchar(6635) (Can be edited in the catalog as well ) , if not can you elaborate more on this, Of the files are in parquet then , Most of the Data conversion parameters
that copy provides cannot be used like Escape, null as etc ..
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

ETL with Dataprep - Union Dataset

I'm a newcomer to GCP, and I'm learning every day and I'm loving this platform.
I'm using GCP's dataprep to join several csv files (with the same column structure), treat some data and write to a BigQuery.
I created a storage (butcket) to put all 60 csv files inside. In dataprep can I define a data set to be the union of all these files? Or do you have to create a dataset for each file?
Thank you very much for your time and attention.
If you have all your files inside a directory in GCS you can import that directory as a single dataset. The process is the same as importing single files. You have to make sure though, that the column structure is exactly the same for all the files inside the directory.
If you create a separate dataset for each file you are more flexible on the structure they have when you use the UNION page to concatenate them.
However, if your use case is just to load all the files (~60) to a single table in Bigquery without any transformation, I would suggest to just use a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want. Currently, BigQuery load jobs are free of charge, so it would be a very cost-effective solution compared to the use of Dataprep.

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical from RDS postgresql to S3. I have about a billion rows of dat in my database,
Q1) Is there a way for me to tell an aws glue job what rows to load? For example i want it to load data from a certain date onwards? There is no bookmarking feature for a PostgreSQL data source,
Q2) Once my data is processed, the glue job automatically creates a name for the s3 output objects, I know i can speciofy the path in DynamicFrame write, but can I specify the object name? if so, how? I cannot find an option for this.
Q3) I tried my glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files, how can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in AWS Glue forum as well, here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports pushdown predicates feature, however currently it works with partitioned data on s3 only. There is a feature request to support it for JDBC connections though.
It's not possible to specify name of output files. However, looks like there is an option with renaming files (note that renaming on s3 means copying file from one location into another so it's costly and not atomic operation)
You can't really control the size of output files. There is an option to control min number of files using coalesce though. Also starting from Spark 2.2 there is a possibility to set max number of records per file by setting config spark.sql.files.maxRecordsPerFile

Dynamically create table from csv

I am faced with a situation where we get a lot of CSV files from different clients but there is always some issue with column count and column length that out target table is expecting.
What is the best way to handle frequently changing CSV files. My goal is load these CSV files into Postgres database.
I checked the \COPY command in Postgres but it does have an option to create a table.
You could try creating a pg_dump compatible file instead which has the appropriate "create table" section and use that to load your data instead.
I recommend using an external ETL tool like CloverETL, Talend Studio, or Pentaho Kettle for data loading when you're having to massage different kinds of data.
\copy is really intended for importing well-formed data in a known structure.