How to do BULK INSERT into PostgreSQL without specifying physical csv file in Apache NiFi - postgresql

I need to do BULK INSERT into PostgreSQL table in Apache NiFi without specifying the physical csv file in COPY command. I just cannot store the CSV files on the disk and would like to do BULK INSERT using a flow file that is coming from previous processors and is already in CSV format (or I can change it to json, that's not an issue).
Please advise, what is the best way to do this in Apache NiFi?

PostgreSQL's COPY reads either from a file on the database server or from STDIN, and the STDIN form only works if the client streams the data over the connection. I'd recommend looking at PutDatabaseRecord, which generates a PreparedStatement based on your CSV data and executes the statements in a single batch (unless Maximum Batch Size is set).
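To make the batching idea concrete, here is a minimal Python sketch of the same pattern: parse CSV held in memory, build one parameterized INSERT, and execute it in batches. It is not NiFi code and not how PutDatabaseRecord is implemented. The function name, the batch size, and the use of sqlite3 as a stand-in database are assumptions for illustration; with a PostgreSQL driver such as psycopg2 the placeholder would be %s instead of ?.

```python
import csv
import io
import sqlite3
from itertools import islice


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def insert_csv_in_batches(conn, table, csv_text, batch_size=1000):
    """Insert CSV rows held in memory, one executemany() call per batch.

    Hypothetical helper: the first CSV line is assumed to hold column names
    that already exist in the target table.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(quote_ident(name) for name in header)
    # sqlite3 uses "?" placeholders; psycopg2 would use "%s" (assumption: DB-API driver).
    placeholders = ", ".join("?" for _ in header)
    sql = f"INSERT INTO {quote_ident(table)} ({columns}) VALUES ({placeholders})"

    cur = conn.cursor()
    total = 0
    while True:
        batch = list(islice(reader, batch_size))  # next batch_size rows, no temp file
        if not batch:
            break
        cur.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total
```

Calling insert_csv_in_batches(conn, "people", flow_file_text) with the flow file content as a string returns the number of rows inserted; nothing is written to disk along the way.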

Related

load orc format to aurora postgres DB

We have ORC files stored in S3, and we want to load them into an AWS Aurora PostgreSQL DB.
What we found on the internet was:
PostgreSQL supports CSV, TXT and other formats, but not ORC.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can anyone please help us find a solution?
To date, PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert these files to either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained on the Working with Query Results, Output Files, and Query History page, this will automatically generate a CSV file containing the results.
This would not be optimal, as you’d pay not only the transform price but also for the storage twice (as original ORC and converted CSV), but it would allow you to convert the data pretty easily.
A better way would be to use a service like AWS Glue, which supports S3 as a source and has an Aurora connector. This method would give you an actual ETL pipeline and, even if for now you only need the E(xtract) and L(oad), would still leave the door open for any kind of transform you might need in the future.
In the AWS blog post titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2), they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.

PostgreSQL - mounting csv / other file - type volumes/tablespaces

In a few other DB engines I can easily extract (part of) a table to a single file.
Then, if needed, I can 'mount' this file as a regular table. Querying is obviously slow, but this is very useful.
I wonder if something similar is possible with PostgreSQL?
I know the COPY FROM/TO commands, but for bigger tables I have to wait ages to copy records from CSV.
Yes, you can use file_fdw to access (read) a CSV file on the database server as if it were a table.

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each. How can I specify the batch size in a job?
Thanks in advance
I have also posted this question in the AWS Glue forum; here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the names of output files. However, it looks like renaming the files afterwards is an option (note that renaming on S3 means copying the file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of output files. You can reduce the number of files using coalesce, though. Also, starting from Spark 2.2, you can set the maximum number of records per file with the config spark.sql.files.maxRecordsPerFile.
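As a rough illustration of how these two knobs interact, the sketch below models the file count in plain Python. It is not Spark or Glue API code. It assumes each partition is written by its own task, that the record cap is applied within each task, and that empty partitions produce no data file; the function name is hypothetical.

```python
import math


def estimate_output_files(rows_per_partition, max_records_per_file=0):
    """Estimate how many files a write produces.

    rows_per_partition: one row count per partition (one write task each).
    max_records_per_file: models spark.sql.files.maxRecordsPerFile; 0 means no cap.
    """
    files = 0
    for rows in rows_per_partition:
        if rows == 0:
            continue  # assumption: an empty partition writes no data file
        if max_records_per_file > 0:
            # The cap can only split a partition's output further, never merge it.
            files += math.ceil(rows / max_records_per_file)
        else:
            files += 1
    return files


# 100 rows spread over 20 partitions, as in the question: one file per partition.
twenty_partitions = estimate_output_files([5] * 20)
# After coalescing to a single partition, a cap of 30 records splits it again.
one_partition_capped = estimate_output_files([100], max_records_per_file=30)
```

Under these assumptions, setting only the record cap would leave the 20 small files as they are, because no partition exceeds the cap; the file count drops only after the partitions are coalesced.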

How to use Apache Apex to ingest data in batch from DB2 to Vertica

Use Case: Ingest transaction data (e.g. rows = 10,000) in a single batch from DB2 and insert them to a Vertica database.
Question:
Should I get a single row from database or batch of 10k rows, process and then insert into destination database?
Is there any sample code which reads from one database and writes into another database?
You should always prefer batch execution: it minimizes network round trips and improves your load into Vertica.
You can use the JDBC input and output operators to fetch from the origin database and write to the destination database. They should have configurable batch sizes. In general, batching is faster than going tuple by tuple.
Check https://github.com/apache/incubator-apex-malhar/tree/master/library/src/main/java/com/datatorrent/lib/db/jdbc
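Independent of Apex, the batching pattern itself can be sketched in a few lines. The Python below is not Apex operator code: it uses the generic DB-API calls fetchmany() and executemany(), with sqlite3 standing in for both DB2 and Vertica so the sketch runs anywhere. The function name, the SQL strings passed in, and the default batch size are assumptions for illustration.

```python
import sqlite3


def copy_rows_in_batches(src_conn, dst_conn, select_sql, insert_sql, batch_size=10_000):
    """Read rows from one database and write them to another, batch by batch.

    Hypothetical helper: each batch costs one fetch and one executemany()
    instead of a round trip per row.
    """
    src_cur = src_conn.cursor()
    dst_cur = dst_conn.cursor()
    src_cur.execute(select_sql)
    copied = 0
    while True:
        rows = src_cur.fetchmany(batch_size)  # up to batch_size rows at once
        if not rows:
            break
        dst_cur.executemany(insert_sql, rows)
        dst_conn.commit()  # commit per batch; a single final commit is another option
        copied += len(rows)
    return copied
```

Committing per batch keeps transactions small at the cost of a partially loaded target if a later batch fails; which trade-off fits depends on how the load is expected to recover.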
You can add multiple XML configuration files at src/site/conf in your project and select one of them at launch time.
This is described briefly at http://docs.datatorrent.com/application_packages/ under the section entitled "Adding pre-set configurations"

Dynamically create table from csv

I am faced with a situation where we get a lot of CSV files from different clients, but there is always some issue with the column count and column length that our target table is expecting.
What is the best way to handle frequently changing CSV files? My goal is to load these CSV files into a Postgres database.
I checked the \COPY command in Postgres, but it does not have an option to create a table.
You could try creating a pg_dump compatible file instead which has the appropriate "create table" section and use that to load your data instead.
I recommend using an external ETL tool like CloverETL, Talend Studio, or Pentaho Kettle for data loading when you're having to massage different kinds of data.
\copy is really intended for importing well-formed data in a known structure.
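If a full ETL tool is more than you need, one lightweight option is to generate the CREATE TABLE statement from each file's header and then load the file with \copy. The Python sketch below only builds the SQL text. Declaring every column as TEXT is an assumption chosen to sidestep the column length problem, and the function names are hypothetical.

```python
import csv
import io


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table, csv_text):
    """Build a CREATE TABLE statement from the header line of a CSV document.

    Every column is declared TEXT (assumption), so differing value lengths
    between clients do not break the load; casting can happen later in SQL.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ",\n".join(f"    {quote_ident(name.strip())} TEXT" for name in header)
    return f"CREATE TABLE {quote_ident(table)} (\n{columns}\n);"


sample_sql = create_table_sql("clients", "id,full name\n1,Ann\n")
```

The resulting statement can be run once per incoming file layout, after which \copy loads the rows into the freshly created table.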