When I upload a table from S3 to AWS Redshift using Glue, the table that shows up in Redshift includes single quotes ('') in the data.
I think it is the whitespace in the original table. Please help me solve this problem. Thank you very much.
I've been dealing with moving data between S3 and Aurora for a while. Right now I don't have much trouble truncating the table and importing the whole dataset from scratch. But I want to improve the scalability of my pipelines, and looking into the AWS examples for "Export and import data from Amazon S3 to Amazon Aurora PostgreSQL", I haven't identified any alternative that lets me update my tables without truncating them or using staging tables.
I want to leave staging tables as a last resort, because I'm updating more than 10 tables each time, many times a day, and the burden on the DB would be the same or even greater than just truncating and moving all the data from zero. So I was wondering if there is a way to replicate Redshift's COPY command using just pgsql, or any other similar option.
Right now my PROCEDURE is more or less like this:
TRUNCATE table_name;
SELECT aws_s3.table_import_from_s3(
    'table_name', '',                       -- target table, all columns
    '(format csv)',                         -- COPY-style options
    '(bucket,folder/file.csv,s3-region)'    -- bucket, file path, region
);
And on Redshift a mere COPY is enough to update and insert, like this:
COPY table_name FROM 's3://bucket/file.csv' iam_role 'arn:aws:iam::identifier:role/role_name' csv IGNOREHEADER 1;
I've already considered using another language and letting a Lambda function do the job, but I want to ask if there's another native way that I'm not taking into account.
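For reference, a minimal sketch of what that Lambda option could look like, assuming the aws_s3 extension is installed on the Aurora cluster and using psycopg2; the table names, bucket, and connection details below are placeholders, not a tested implementation:

import psycopg2

# Placeholder mapping of target tables to their CSV keys in the bucket.
TABLES = {
    "table_name": "folder/file.csv",
    # ... the other ~10 tables
}

def handler(event, context):
    # Placeholder connection details for the Aurora PostgreSQL cluster.
    conn = psycopg2.connect(host="aurora-endpoint", dbname="mydb",
                            user="loader", password="***")
    with conn, conn.cursor() as cur:
        for table, key in TABLES.items():
            cur.execute(f"TRUNCATE {table};")
            cur.execute(
                "SELECT aws_s3.table_import_from_s3(%s, '', '(format csv)', "
                "aws_commons.create_s3_uri(%s, %s, %s));",
                (table, "bucket", key, "s3-region"),
            )
    conn.close()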
How is it possible to get table watermarks on AWS Redshift? I've searched the internet but I didn't find any command to extract them.
There is no concept of table watermarks for AWS Redshift.
I am using PySpark in AWS Glue to load data from S3 files into a Redshift table. The code uses mode("overwrite") and I got an error stating "can't drop table because other objects depend on the table"; it turned out there is a view created on top of that table. It seems the "overwrite" mode actually drops and re-creates the Redshift table and then loads the data. Is there any option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks spark-redshift connector (it's not documented anywhere, but I verified that empirically). The spark-redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
There is a related discussion in line with your question, where they used truncate instead of overwrite; it is also a combination of Lambda & Glue. Please refer to it for the detailed discussion and code samples. Hope this helps.
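As a rough illustration of the truncate-instead-of-drop idea (not the code from that discussion), Glue's Redshift connection options accept a "preactions" statement that runs before the load; the catalog database, connection, table, and bucket names below are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder source: a table already registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="my_s3_table")

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",   # placeholder Glue connection
    connection_options={
        "database": "mydb",
        "dbtable": "public.my_table",
        # Runs before the COPY, so the table (and the view on top of it) is kept:
        "preactions": "TRUNCATE TABLE public.my_table;",
    },
    redshift_tmp_dir="s3://my-temp-bucket/redshift-tmp/",
)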
regards
I am trying to migrate a huge table from Postgres into Redshift.
The size of the table is about 5,697,213,832.
Tool: Pentaho Kettle, Table input (from Postgres) -> Table output (Redshift), connecting with the Redshift JDBC4 driver.
By observation I found that inserting into Redshift is the bottleneck: only about 500 rows/second.
Are there any ways to accelerate insertion into Redshift in single-machine mode, like using a JDBC parameter?
Have you considered using S3 as a middle layer?
Dump your data to CSV files and apply gzip compression. Upload the files to S3 and then use the COPY command to load the data.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
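A small sketch of that flow with boto3 and psycopg2 (the bucket, table, and IAM role names are placeholders):

import boto3
import psycopg2

# Upload the gzipped CSV dump to S3.
boto3.client("s3").upload_file("table_dump.csv.gz", "my-bucket", "dumps/table_dump.csv.gz")

# Let Redshift do the heavy lifting with a single COPY instead of row-by-row inserts.
conn = psycopg2.connect(host="redshift-endpoint", port=5439,
                        dbname="mydb", user="loader", password="***")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/dumps/table_dump.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV GZIP IGNOREHEADER 1;
    """)
conn.close()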
The main reason for the Redshift performance bottleneck, as I see it, is that Redshift treats each and every hit to the cluster as a single query. It executes each query on the cluster and then proceeds to the next one. So when I send across multiple rows (in this case 10), each row of data is treated as a separate query. Redshift executes the queries one by one, and the load only completes once all of them have been executed. That means if you have 100 million rows, there would be 100 million queries running on your Redshift cluster, and the performance goes down the drain!
Using the S3 File Output step in PDI will load your data to an S3 bucket, and then you can apply the COPY command on the Redshift cluster to read that same data from S3 into Redshift. This will solve your performance problem.
You may also read the blog posts linked below:
Loading data to AWS S3 using PDI
Reading Data from S3 to Redshift
Hope this helps :)
It is better to export the data to S3, then use the COPY command to import it into Redshift. This way the import process is fast and you don't need to vacuum the table.
Export your data to an S3 bucket and use the COPY command in Redshift. The COPY command is the fastest way to insert data into Redshift.
Currently we have a workbook developed in Tableau using an Oracle server as the data store, where we have all our tables and views. Now we are migrating to Redshift for better performance. We have the same table structure, table names, and field names in Redshift as in Oracle. We already have the Tableau workbook developed, and we now need to point it to the Redshift tables and views. How do we point the developed workbook to Redshift? Kindly help.
Also let me know if you have any other input in this regard.
Thanks,
Raj
Use the Replace Data Source functionality of Tableau Desktop
You can bypass Replace Data Source and move data directly from Oracle to Redshift using bulk loaders.
A simple combo of SQL*Plus + Python + boto + psycopg2 will do the job.
It should:
Open a read pipe from Oracle SQL*Plus
Compress the data stream
Upload the compressed stream to S3
Bulk append the data from S3 to the Redshift table.
You can check an example of how to extract table or query data from Oracle and then load it into Redshift using the COPY command from S3.
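A rough sketch of that pipeline (not the referenced example), using boto3 here instead of the older boto; the SQL*Plus script, connection strings, bucket, and role names are placeholders:

import gzip
import shutil
import subprocess
import boto3
import psycopg2

# 1. Read from Oracle via SQL*Plus and compress the stream on the fly.
#    export_table.sql is a placeholder script that spools the table as CSV to stdout.
sqlplus = subprocess.Popen(
    ["sqlplus", "-S", "user/password@ORCL", "@export_table.sql"],
    stdout=subprocess.PIPE,
)
with gzip.open("table_dump.csv.gz", "wb") as gz:
    shutil.copyfileobj(sqlplus.stdout, gz)
sqlplus.wait()

# 2. Upload the compressed file to S3.
boto3.client("s3").upload_file("table_dump.csv.gz", "my-bucket", "oracle/table_dump.csv.gz")

# 3. Bulk append the data from S3 into the Redshift table with COPY.
conn = psycopg2.connect(host="redshift-endpoint", port=5439,
                        dbname="mydb", user="loader", password="***")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/oracle/table_dump.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV GZIP;
    """)
conn.close()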