How to use Apache Apex to ingest data in batch from DB2 to Vertica

Use Case: Ingest transaction data (e.g. 10,000 rows) in a single batch from DB2 and insert it into a Vertica database.
Question:
Should I fetch a single row at a time from the source database, or a batch of 10k rows, process them, and then insert into the destination database?
Is there any sample code that reads from one database and writes into another?

You should always prefer batch execution: you will minimize network round trips and improve your load into Vertica.
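As a rough illustration of that point with plain JDBC (outside of any particular framework), the sketch below reads from DB2 and writes to Vertica with one round trip per batch rather than per row; the connection strings, table, and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class Db2ToVerticaBatch {
    public static void main(String[] args) throws Exception {
        int batchSize = 1000; // tune to your row size and network

        // Hypothetical connection details for the source (DB2) and destination (Vertica).
        try (Connection src = DriverManager.getConnection(
                 "jdbc:db2://example-host:50000/SAMPLE", "db2user", "password");
             Connection dst = DriverManager.getConnection(
                 "jdbc:vertica://example-host:5433/verticadb", "dbadmin", "password");
             Statement read = src.createStatement();
             ResultSet rs = read.executeQuery("SELECT id, amount FROM transactions");
             PreparedStatement write = dst.prepareStatement(
                 "INSERT INTO transactions (id, amount) VALUES (?, ?)")) {

            dst.setAutoCommit(false);
            int pending = 0;
            while (rs.next()) {
                write.setLong(1, rs.getLong("id"));
                write.setBigDecimal(2, rs.getBigDecimal("amount"));
                write.addBatch();
                if (++pending % batchSize == 0) {
                    write.executeBatch(); // one round trip for the whole batch
                    dst.commit();
                }
            }
            write.executeBatch();         // flush the remainder
            dst.commit();
        }
    }
}
```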

You can use the JDBC input and output operators to read from the source database and write to the destination database. They should have configurable batch sizes. In general, batching is faster than processing tuple by tuple.
Check https://github.com/apache/incubator-apex-malhar/tree/master/library/src/main/java/com/datatorrent/lib/db/jdbc
You can add multiple XML configuration files at src/site/conf in your project and select one of them at launch time.
This is described briefly at http://docs.datatorrent.com/application_packages/ under the section entitled "Adding pre-set configurations"

Related

AWS Redshift: How to run copy command from Apache NiFi without using firehose?

I have flow files with data records in them, and I'm able to place them in an S3 bucket. From there I want to run a COPY command and an UPDATE command with joins to achieve a MERGE / UPSERT operation. Can anyone suggest ways to solve this? Firehose only executes the COPY command, and I can't perform the UPSERT / MERGE operation directly as prescribed by the AWS docs, so I have to copy into a staging table and then update or insert using some conditions.
There are a number of ways to do this, but I usually go with a Lambda function run every 5 minutes or so that takes the data Firehose has put into Redshift and merges it with the existing data. Redshift likes to operate on larger "chunks" of data, and it is most efficient if you build up some size before performing these operations. The best practice is to move the data out of the Firehose target table in an atomic operation like ALTER TABLE APPEND and use this new table as the source for merging. That way Firehose can keep adding data while the merge is in progress.
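A minimal sketch of that staging-and-merge pattern issued over JDBC; the cluster endpoint and the firehose_target, merge_staging, target, and id names are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftMerge {
    public static void main(String[] args) throws Exception {
        // Hypothetical Redshift endpoint and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:redshift://example-cluster:5439/dev", "user", "password");
             Statement st = con.createStatement()) {

            // Atomically move rows out of the Firehose target so Firehose can keep loading.
            // ALTER TABLE APPEND cannot run inside a transaction block, so run it with autocommit on.
            st.execute("ALTER TABLE merge_staging APPEND FROM firehose_target");

            con.setAutoCommit(false);
            // Replace rows that already exist in the target, then insert everything staged.
            st.executeUpdate("DELETE FROM target USING merge_staging s WHERE target.id = s.id");
            st.executeUpdate("INSERT INTO target SELECT * FROM merge_staging");
            st.executeUpdate("DELETE FROM merge_staging");
            con.commit();
        }
    }
}
```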

How to do BULK INSERT into PostgreSQL without specifying physical csv file in Apache NiFi

I need to do a BULK INSERT into a PostgreSQL table in Apache NiFi without specifying a physical CSV file in the COPY command. I just cannot store the CSV files on disk, and would like to do the BULK INSERT using a flow file that comes from previous processors and is already in CSV format (or I can change it to JSON; that's not an issue).
Please advise, what is the best way to do this in Apache NiFi?
PostgreSQL's COPY operation requires a path to a file. I'd recommend looking at PutDatabaseRecord, which generates a PreparedStatement based on your CSV data and executes the statement in a single batch (unless Maximum Batch Size is set).

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job what rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
I have also posted this question in the AWS Glue forum; here is a link to that thread: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown-predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the names of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to control the number of output files using coalesce, though. Also, starting from Spark 2.2 it is possible to set a maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
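Glue job scripts are written in Python or Scala, but since Glue runs on Spark the same knobs can be sketched at the Spark level. The Java snippet below (with hypothetical connection options, table name, and S3 path) shows the coalesce and spark.sql.files.maxRecordsPerFile controls, plus a JDBC subquery as one way to approximate the date filter from Q1:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GlueStyleExport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("rds-to-s3").getOrCreate();

        // Cap the number of records written per output file (Spark 2.2+).
        spark.conf().set("spark.sql.files.maxRecordsPerFile", "500000");

        // Push the date filter down to PostgreSQL by wrapping it in a subquery
        // (hypothetical table and column names).
        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://example-host:5432/mydb")
                .option("dbtable", "(SELECT * FROM history WHERE created_at >= DATE '2018-01-01') AS t")
                .option("user", "username")
                .option("password", "password")
                .load();

        // coalesce() reduces the number of partitions, and hence the number of output files.
        df.coalesce(10)
          .write()
          .mode("overwrite")
          .parquet("s3://example-bucket/history/");
    }
}
```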

Redshift insert bottleneck

I am trying to migrate a huge table from Postgres into Redshift.
The size of the table is about 5,697,213,832.
Tool: Pentaho Kettle, Table input (from Postgres) -> Table output (Redshift), connecting with the Redshift JDBC4 driver.
By observation I found that inserting into Redshift is the bottleneck: only about 500 rows/second.
Are there any ways to accelerate insertion into Redshift in single-machine mode, such as a JDBC parameter?
Have you considered using S3 as a middle layer?
Dump your data to CSV files, apply gzip compression, upload the files to S3, and then use the COPY command to load the data.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
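For example, a minimal sketch of issuing that COPY over JDBC once the gzipped CSV files are in S3; the table name, bucket, and IAM role are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftCopyLoad {
    public static void main(String[] args) throws Exception {
        // Hypothetical Redshift endpoint and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:redshift://example-cluster:5439/dev", "user", "password");
             Statement st = con.createStatement()) {

            // Load all gzipped CSV files under the prefix in parallel across the cluster.
            st.execute(
                "COPY big_table " +
                "FROM 's3://example-bucket/exports/big_table/' " +
                "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load' " +
                "CSV GZIP");
        }
    }
}
```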
The main reason for the Redshift performance bottleneck, as I see it, is that Redshift treats each hit to the cluster as a single query. It executes each query on the cluster and then proceeds to the next one. Now, when you send across multiple rows (in this case 10), each row of data is treated as a separate query. Redshift executes the queries one by one, and loading of the data is complete only once all of them have executed. That means if you have 100 million rows, there will be 100 million queries running against your Redshift cluster, and performance goes down the drain.
Using the S3 File Output step in PDI will load your data to an S3 bucket; then apply the COPY command on the Redshift cluster to read the same data from S3 into Redshift. This will solve your performance problem.
You may also read the blog posts linked below:
Loading data to AWS S3 using PDI
Reading Data from S3 to Redshift
Hope this helps :)
It's better to export the data to S3, then use the COPY command to import it into Redshift. That way the import process is fast, and you don't need to vacuum the table.
Export your data to an S3 bucket and use the COPY command in Redshift. The COPY command is the fastest way to insert data into Redshift.

Loading DB2 table rows as Marklogic documents

Is there any tool to quickly convert DB2 table rows into a collection of XML documents that we can load into MarkLogic?
DB2 supports the SQL/XML publishing extensions that were introduced in SQL:2003. These functions include XMLSERIALIZE, XMLELEMENT, XMLATTRIBUTES, and XMLFOREST, and are easily added to a SQL SELECT statement to produce a simple, well-formed XML document for each row in the result set. By writing queries that retrieve the table names and column layouts from DB2's catalog views, it is possible to automate the creation of the XML-publishing SELECT statements for a large number of tables.
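For illustration, a small JDBC sketch of such a publishing query against the EMPLOYEE table from DB2's SAMPLE database; the connection details and element names are assumptions to adapt to your own schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Db2XmlPublish {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for a DB2 SAMPLE database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://example-host:50000/SAMPLE", "db2inst1", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT XMLSERIALIZE(" +
                 "  XMLELEMENT(NAME \"employee\"," +
                 "    XMLATTRIBUTES(e.EMPNO AS \"id\")," +
                 "    XMLFOREST(e.FIRSTNME AS \"firstName\"," +
                 "              e.LASTNAME AS \"lastName\"," +
                 "              e.SALARY AS \"salary\")) AS CLOB(1M))" +
                 " FROM EMPLOYEE e")) {

            while (rs.next()) {
                // One well-formed XML document per row, ready to be written out or loaded.
                String xmlDoc = rs.getString(1);
                System.out.println(xmlDoc);
            }
        }
    }
}
```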
One way of doing this would be to use the MLSQL toolkit ( http://developer.marklogic.com/code/mlsql ). It allows accessing relational databases from within your XQuery code in MarkLogic. I'm not sure what the returned data actually looks like, but it should be easy to process it within XQuery and insert your data as XML into MarkLogic.
Just make sure not to try to load a million records in one statement; instead, spawn batches of, say, 1000 records at a time. Spawning also allows the work to be handled by multiple threads, so it should be faster for that reason too.
HTH!
Do you need to stream from DB2 to MarkLogic? Or can you temporarily dump all the documents to an intermediary filesystem and then read them in? If you can dump, then simply use some DB2 tooling (like @Fred's answer above) to export the rows to a bunch of XML documents in a filesystem, and use one of the many methods for reading a directory full of XML files into MarkLogic (like Information Studio (UI or APIs), RecordLoader, and so on).
If you don't want to store them in the filesystem as an intermediary, then you could write an Information Studio plugin for MarkLogic that will pull out each row and insert a document into MarkLogic. You'd likely need some web service or REST endpoint that the plugin could call to extract the document data from DB2.
Alternatively, I suspect you could use the DB2 tooling (described by @Fred) that lets you execute some code per row of your table. If you can do that in Java (or .NET), then pull in the MarkLogic XCC APIs, which give you the ability to write documents into MarkLogic.
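A minimal XCC sketch of that last idea, assuming a hypothetical XDBC server URI and document URI; in practice the XML string would come from the per-row DB2 query:

```java
import com.marklogic.xcc.Content;
import com.marklogic.xcc.ContentCreateOptions;
import com.marklogic.xcc.ContentFactory;
import com.marklogic.xcc.ContentSource;
import com.marklogic.xcc.ContentSourceFactory;
import com.marklogic.xcc.Session;

import java.net.URI;

public class XccRowLoader {
    public static void main(String[] args) throws Exception {
        // Hypothetical XDBC server connection.
        ContentSource source = ContentSourceFactory.newContentSource(
                new URI("xcc://admin:password@example-host:8010/Documents"));
        Session session = source.newSession();
        try {
            // In practice this string would be produced per row by the DB2 SQL/XML query.
            String xmlDoc = "<employee id=\"000010\"><firstName>CHRISTINE</firstName></employee>";

            Content content = ContentFactory.newContent(
                    "/db2/employee/000010.xml", xmlDoc, ContentCreateOptions.newXmlInstance());
            session.insertContent(content);
        } finally {
            session.close();
        }
    }
}
```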