Tableau Data store migration to Redshift - amazon-redshift

Currently we have a workbook developed in Tableau using Oracle server as the data store where we have all our tables and views. Now we are migrating to Redshift fora better performance. We have the same table structure as in the Oracle with the same table names and the field names in the Redshift. We already have the Tableau workbook developed and we need to point to Redshift tables and views now. How do we point the developed workbook to Redshift now, kindly help.
Also let me know any other inputs in this regard.
Thanks,
Raj

Use the Replace Data Source functionality of Tableau Desktop

You can bypass Replace Data Source and move data directly from Oracle to Redshift using bulk loaders.
Simple combo of SQL*Plus + Python + boto + psycopg2 will do the job.
It should:
Open read pipe from Oracle SQL*Plus
Compress data stream
Upload compressed stream to S3
Bulk append data from S3 to Redshift table.
You can check example of how to extract table or query data from Oracle and then load it to Redshift using COPY command from S3.

Related

load orc format to aurora postgres DB

We have a ORC file format which are stored in s3 and we want to load the files into AWS Aurora postgres DB .
What we got from internet was :
postgres support csv, txt and other formats not ORC ..
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE SELECT * FROM default.foo;
Can any one please help us to find a solution?
This date PostgreSQL on Aurora supports ingestion of data from S3 through the COPY command only from TXT and CSV files.
Since your files are in ORC format, you could convert these tiles in either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table for your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History
page, this will automatically generate a CSV file containing the results.
This would not be optimal as you’d pay not only the transform price but also the he storage twice (as original ORC and converted CSV), but it would allow you to convert the data pretty easily.
A better way to do it would instead be to use a service like AWS Glue, that supports S3 as source and that has an Aurora connector. Using this method would give you an actual ETL and even if now you just need the E(xtract) and L(oad), would still leave the door open for any kind of transform you might need in the future.
In this AWS Blog titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2) they show the opposite flow (Aurora->S3 via Glue), but it should still give you an idea of the process.

Data Migration from one DB to another

I have to create an app which transfer data from snowflake to postgres everyday. Some tables in postgres are truncated before migration and all data from corresponding snowflake table is copied. While for other tables, data after last timestamp in postgres is copied from snowflake.
This job has to run at night sometime and not when customers are using the service at daytime.
What is the best way to do this ?
Do you have constraints, limiting your choices in:
ETL or bulk data tooling
Development languages?
According to this site, you can create a foreign data wrapper on Postgresql for snowflake

Redshift insert bottleneck

I am trying to migrate a huge table from postgres into Redshift.
The size of the table is about 5,697,213,832
tool: pentaho Kettle Table input(from postgres) -> Table output(Redshift)
Connecting with Redshift JDBC4
By observation I found the inserting into Redshift is the bottleneck. only about 500 rows/second.
Is there any ways to accelerate the insertion into Redshift in single machine mode ? like using JDBC parameter?
Have you consider using S3 as mid-layer?
Dump your data to csv files and apply gzip compression. Upload files to the S3 and then use copy command to load the data.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
The main reason for bottleneck of redshift performance, which i considered is that Redshift treats each and every hit to the cluster as one single query. It executes each query on its cluster and then proceeds to the next stage. Now when i am sending across multiple rows (in this case 10), each row of data is treated a separate query. Redshift executes each query one by one and loading of the data is completed once all the queries are executed. It means if you are having 100 million rows, there would be 100 million queries running on your Redshift cluster. Well the performance goes to dump !!!
Using S3 File Output step in PDI will load your data to S3 Bucket and then apply the COPY command on the redshift cluster to read the same data from S3 to Redshift. This will solve your problem of performance.
You may also read the below blog links :
Loading data to AWS S3 using PDI
Reading Data from S3 to Redshift
Hope this helps :)
Better to export data to S3, then use COPY command to import data into Redshift. In this way, the import process is fast while you don't need to vacuum it.
Export your data to S3 bucket and use the COPY command in Redshift . COPY command is the fastest way to insert data in Redshift .

DB2-clob data to PostgreSQL

We are planning to move data in DB2(28) to PostgreSQL(9.2).
We have already created database schema and tables in PostgreSQL. I am able to do data export from DB2 to csv format.
For importing data in PostgreSQL, the docs say "Copy command" which copies data from a file to the table.
In DB2, if data is of CLOB type, then separate file is created where CLOB data is kept. The main (data.csv) file contains references to CLOBs. How to import CLOB data in such cases?
I searched on net but could not find any opensource tool from PostgreSQL.
This is not a ready-to-use solution but may be a starting point. As far as I remember, DB2 is offering a ODBC interface. On the other hand on PostgreSQL you are able to "import" ODBC databases via Foreign data wrapper.The first step can be found at documentation. May be worth a try.

How to transfer or copy tables of DB2 to oracle database

I want to transfer some tables of DB2 to oracle daily for accessing them from web page,
But I don't know commands of DB2. How to do this?
I want this action should perform on database daily on particular time, so is there any tool is available to do this operation. And for writing the program for operating above query which programming language should I use? I am using windows XP.
I think Change Data Capture is used to replicate DML from one database to other databases continuously.
However, what you need is to transfer some data at a particular time each day, thus CDC could be too heavy for that.
You could do a simply "db2 export", and then you could import the generated file from Oracle.
There should be an option to create an adapter in Oracle that permits to query DB2 tables. The opposite is called federation in DB2 (InfoSphere Information Server) that permits to query Oracle tables.
Export http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0008303.html
CMD examples http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.dm.doc/doc/r0004567.html
Check this link
http://blogs.oracle.com/warehousebuilder/entry/simple_change_data_capture_from_db2_table_to_oracle_table
In 11.2 releases, Change Data Capture (CDC) can be done by code template mapping. This allows users to capture the data changes from heterogeneous data source, and load into the target across different platforms.