How to improve import speed on SQL Workbench/J - amazon-redshift

Tried like below, but it imports terribly slow, with speed 3 rows/sec
WbImport -file=c:/temp/_Cco_.txt
-table=myschema.table1
-filecolumns=warehouse_id,bin_id,cluster_name
---deleteTarget
-batchSize=10000
-commitBatch

WbInsert can use the COPY API of the Postgres JDBC driver.
To use it, use
WbImport -file=c:/temp/_Cco_.txt
-usePgCopy
-table=myschema.table1
-filecolumns=warehouse_id,bin_id,cluster_name
The options -batchSize and -commitBatch are ignored in that case, so you should remove them.
SQL Workbench/J will then essentially use the equivalent of a COPY ... FROM STDIN. That should be massively faster than regular INSERT statements.
This requires that the input file is formatted according to the requirements of the COPY command.

WbImport uses INSERT to load data. This is the worst way to load data into Redshift.
You should be using the COPY command for this as noted in the Redshift documentation:
"We strongly recommend using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow."

Related

Importing many columns from a CSV into Postgres

I am trying to import to pgAdmin a big table with more than 100 columns. Is there any way to import the table without creating those 100 columns in a table within the pgAdmin? That would be a considerably time-consuming task.
You are not importing data into pgAdmin, you are importing it into Postgres, and using pgAdmin to help you in that task. Graphical tools like pgAdmin are, at heart, just convenience wrappers around the actual functionality of the database, and everything they do can be done in other ways.
In the case of a simple task like creating a table, the relevant SQL syntax is well worth learning. It will work in any database tool, even (with some minor changes) on other SQL databases (e.g. MySQL), can be saved in version control, and manipulated with an editor of your choice.
You could even go so far as to write a script in the language of your choice that generates the SQL for you based on some other data (e.g. the headings of the CSV file) - although make sure you don't run that with third-party data without checking the result or taking extreme care with code injection and other security concerns!
The Postgres manual has an introduction to tables and creating them which would be a good place to start.

Export CSV from Mainframe DB2 in batch mode

how can I export in a CSV file the result of a SELECT query from Mainframe DB2 in Batch mode?
I have tried the FILE MANAGER online mode and it works but I need to use the batch mode for a better performance.
I can also use ISQL but I don't know which parameters I have to use to create a CSV file.
Thanks
If all else fails and you don't mind a little programming then coding your own program that runs the query and writes CSV is EXTREMELY easy.
I mention this because this might be better for you than relying on some tool.
As you're looking for improved performance I'd suggest you CALL the DSNUTILU stored procedure with the UNLOAD utility using DELIMITED COLDEL ',' and SHRLEVEL CHANGE ISOLATION UR parameters for CSV and to maximise concurrency on your DB2 for z/OS table. There are many other option depending on your requirements.
For reference refer to DSNUTILU stored procedure and Syntax and options of the UNLOAD control statement
On iserie you have the CPYTOIMPF command, may be on zos too

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo but we found Redshift offered very large performance improvements for query. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift first you need to create a SQL DDL to store the schema in Redshift i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use MongoExport to generate CSV or JSON from Mongo and then run it through DDL generator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's Copy from JSON feature. You need to define a JSONpath to do this.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster query than the other approaches such as using JSON EXTRACT PATH TEXT.
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources although in the end I did not use them:
Spotify has a open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Redit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service(DMS) Adds Support for MongoDB and Amazon DynamoDB.So I think now onward best option to migrate from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture(CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions that would be helpful to know would be:
Is this an add-only always increasing incremental sync i.e. data is only being added and not being updated / removed or rather your redshift instance is interested only in additions?
Is the data inconsistency due to delete / updates happening at source and not being fed to redshift instance ok?
Does it need to be daily-incremental batch or can it be realtime as it is happening as well?
Depending on your situation may be mongoexport works for you, but you have to understand the shortcoming of it, which can be found at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
as ask mentioned, You can use mongoexport in order to export the data, but keep in mind that redshift doesn't support arrays, so in case your collections data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport into a small utility program I wrote that transforms the mongoexport json rows into my desired csv output.
piping the output also allows you to make the process parallel.
Mongoexport allows you to add a mongodb query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe it's results into the other program and decrease the total runtime of the migration process.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
Stitch Data is the best tool ever I've ever seen to replicate incrementally from MongoDB to Redshift within a few clicks and minutes.
Automatically and dynamically Detect DML, DDL for tables for replication.

Can COPY FROM tolerantly consume bad CSV?

I am trying to load text data into a postgresql database via COPY FROM. Data is definitely not clean CSV.
The input data isn't always consistent: sometimes there are excess fields (separator is part of a field's content) or there are nulls instead of 0's in integer fields.
The result is that PostgreSQL throws an error and stops loading.
Currently I am trying to massage the data into consistency via perl.
Is there a better strategy?
Can PostgreSQL be asked to be as tolerant as mysql or sqlite in that respect?
Thanks
PostgreSQL's COPY FROM isn't designed to handle bodgy data and is quite strict. There's little support for tolerance of dodgy data.
I thought there was little interest in adding any until I saw this proposed patch posted just a few days ago for possible inclusion in PostgreSQL 9.3. The patch has been resoundingly rejected, but shows that there's some interest in the idea; read the thread.
It's sometimes possible to COPY FROM into a staging TEMPORARY table that has all text fields with no constraints. Then you can massage the data using SQL from there. That'll only work if the SQL is at least well-formed and regular, though, and it doesn't sound like yours is.
If the data isn't clean, you need to pre-process it with a script in a suitable scripting language.
Have that script:
Connect to PostgreSQL and INSERT rows;
Connect to PostgreSQL and use the scripting language's Pg APIs to COPY rows in; or
Write out clean CSV that you can COPY FROM
Python's csv module can be handy for this. You can use any language you like; perl, python, php, Java, C, whatever.
If you were enthusiastic you could write it in PL/Perlu or PL/Pythonu, inserting the data as you read it and clean it up. I wouldn't bother.

is there any PostgreSQL loader like Oracle has?

how to use copy statement in postgresql to load data from a text file where the file has an escape character as a delimiter into a postgresql table?
Is there any otherway of loading data from textfile into a PostgreSQL table?
pg loader emulates oracles sql loader:
http://pgfoundry.org/projects/pgloader/
pg bulkload is used to load lots of data in an otherwise offline db. Useful for large data warehouses, fast, and somewhat dangerous and quirky:
http://pgbulkload.projects.postgresql.org/
You should use COPY with the DELIMITER 'xx' option. You probably need to play around a little bit to get it right, but the docs give a pretty good information about what to do with each option available to the command.