Glue Data write to Redshift too slow - pyspark

I am running a PySpark Glue job with 10 DPUs. The data in S3 is about 45 GB, split across 6 .csv files.
First question:
It's taking a lot of time to write the data to Redshift from Glue, even though I am running 10 DPUs.
Second:
How can I make it faster and more efficient? Should I write the data back to S3 in Parquet format after the transformation, and then maybe use a COPY command to load the data directly into Redshift?
Please suggest the best ideas and approaches.

Related

What's the best way to read/write from/to Redshift with Scala Spark, since the spark-redshift lib is no longer publicly supported by Databricks?

I have my Spark project in Scala and I want to use Redshift as my data warehouse. I found that the spark-redshift repo exists, but Databricks made it private a couple of years ago and no longer supports it publicly.
What's the best option right now for working with Amazon Redshift from Spark (Scala)?
This is a partial answer, as I have only used Spark->Redshift in a real-world use case and have never benchmarked Spark reads from Redshift.
When it comes to writing from Spark to Redshift, by far the most performant way I could find was to write Parquet to S3 and then use the Redshift COPY command to load the data. Writing to Redshift through JDBC also works, but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet, since Redshift internally stores data in columnar format. Another columnar format supported by both Spark and Redshift is ORC.
I never came across a use case of reading large amounts of data from Redshift using Spark, as it feels more natural to load all the data into Redshift and do the joins and aggregations there. It is probably not cost-efficient to use Redshift just as bulk storage and another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is the UNLOAD command and S3.
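As a minimal sketch of the write path described above (the bucket name, table name, and IAM role below are hypothetical placeholders): the Spark job writes Parquet to a staging prefix on S3, and a COPY statement then loads it into Redshift. Here the COPY statement is just built as a string; in practice you would execute it through any JDBC or psycopg2 connection to the cluster.

```python
# Sketch: Spark writes Parquet to S3, then a Redshift COPY loads it.
# All names (bucket, table, IAM role) are hypothetical placeholders.

def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

# In the Glue/Spark job itself (not executed here):
#   df.write.mode("overwrite").parquet("s3://my-bucket/staging/orders/")
# Then run the COPY via a JDBC or psycopg2 connection to Redshift:
copy_sql = build_copy_statement(
    table="analytics.orders",
    s3_prefix="s3://my-bucket/staging/orders/",
    iam_role="arn:aws:iam::123456789012:role/redshift-copy-role",
)
print(copy_sql)
```

Note that COPY reads all files under the prefix in parallel across the cluster's slices, which is where the speedup over row-by-row JDBC inserts comes from.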

Move data from PostgreSQL to AWS S3 and analyze with RedShift Spectrum

I have a large number of PostgreSQL tables with different schemas and a massive amount of data inside them.
I'm unable to do the data analytics right now because the data volume is quite large - a few TB - and PostgreSQL is not able to process queries in a reasonable amount of time.
I'm thinking about the following approach: I'll process all of my PostgreSQL tables with Apache Spark, load the DataFrames, and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored inside these Parquet files.
First of all, I'd like to ask: will this solution work at all?
And second: will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types unsupported by AWS Redshift)?
Redshift Spectrum supports pretty much the same data types as Redshift itself.
Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of nodes in your actual Redshift cluster, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.
As you noted in the comments, you can use Athena to query the data, and it will be a better option in your case than Spectrum. But Athena has several limitations, like a 30-minute run time, memory consumption limits, etc. So if you plan to run complicated queries with several joins, it may simply not work.
Redshift Spectrum can't create external tables without a provided structure.
The best solution in your case would be to use Spark (on EMR, or Glue) to transform the data, use Athena to query it, and if Athena can't handle a specific query, use Spark SQL on the same data. You can use Glue, but running jobs on EMR with Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
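Since Spectrum needs the structure up front, the external table has to be declared explicitly. A rough sketch of what that DDL looks like, built here as a string (the schema, table, columns, and S3 location are hypothetical placeholders):

```python
# Sketch: Redshift Spectrum external table over Parquet files on S3.
# Schema name, column list, and S3 location are hypothetical placeholders.

def build_external_table_ddl(schema: str, table: str,
                             columns: dict, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet data on S3."""
    cols = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"    {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )

ddl = build_external_table_ddl(
    schema="spectrum",
    table="events",
    columns={"event_id": "BIGINT",
             "event_time": "TIMESTAMP",
             "payload": "VARCHAR(65535)"},
    location="s3://my-bucket/parquet/events/",
)
print(ddl)
```

The external schema itself must also be created first (pointing at a Glue/Athena catalog), which is the part a Glue crawler can automate for you.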
AWS Glue might be an interesting option for you. It is both a hosted version of Spark with some AWS-specific add-ons, and a data crawler plus data catalogue.
It can crawl unstructured data such as Parquet files and figure out the structure, which then allows you to export it to AWS Redshift in structured form if needed.
See this blog post on how to connect it to a Postgres database using JDBC to move data from Postgres to S3.

save PostgreSQL data in Parquet format

I'm working on a project that needs to generate Parquet files from a huge PostgreSQL database. The data size can be gigantic (e.g. 10 TB). I'm very new to this topic and have done some research online, but I did not find a direct way to convert the data to Parquet files. Here are my questions:
The only feasible solution I saw is to load a Postgres table into Apache Spark via JDBC and save it as a Parquet file. But I assume this will be very slow when transferring 10 TB of data.
Is it possible to generate a single huge Parquet file of 10 TB, or is it better to create multiple Parquet files?
Hope my question is clear, and I really appreciate any helpful feedback. Thanks in advance!
Use the ORC format instead of the Parquet format for this volume.
I assume the data is partitioned, so it's a good idea to extract in parallel, taking advantage of the data partitioning.
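The parallel extraction suggested above maps onto Spark's partitioned JDBC read (the partitionColumn / lowerBound / upperBound / numPartitions options). As a rough sketch of how the table gets split, this helper approximates the range predicates Spark generates, one per concurrent SELECT; the column name and bounds are made up for illustration:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the range predicates a partitioned Spark JDBC read
    generates: each predicate becomes one concurrent SELECT against
    the source database."""
    stride = (upper - lower) // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        lo, hi = bound, bound + stride
        if i == 0:
            # First partition also picks up NULL keys.
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended to cover stragglers.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
        bound = hi
    return predicates

# The equivalent Spark read (not executed here; URL and table are placeholders):
#   df = (spark.read.format("jdbc")
#         .option("url", "jdbc:postgresql://host:5432/db")
#         .option("dbtable", "big_table")
#         .option("partitionColumn", "id")
#         .option("lowerBound", 0).option("upperBound", 4_000_000)
#         .option("numPartitions", 4)
#         .load())
#   df.write.orc("s3://my-bucket/orc/big_table/")

for p in jdbc_partition_predicates("id", 0, 4_000_000, 4):
    print(p)
```

The key point is that without a partitionColumn, Spark pulls the whole table through a single JDBC connection, which is exactly the slowness the question worries about.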

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each. How can I specify the batch size in a job?
Thanks in advance.
I have also posted this question in the AWS Glue forum; here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the names of output files. However, it looks like there is an option to rename files afterwards (note that renaming on S3 means copying the file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of output files. There is an option to control the minimum number of files using coalesce, though. Also, starting from Spark 2.2 it is possible to set a maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
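To make the interaction between coalesce and maxRecordsPerFile concrete: each Spark partition is split independently into ceil(rows / maxRecordsPerFile) files, so coalesce sets a floor on the file count while maxRecordsPerFile caps how big each file can grow. A small sketch (the row counts are made up):

```python
import math

def estimate_output_files(rows_per_partition, max_records_per_file):
    """Estimate how many files a Spark write produces when
    spark.sql.files.maxRecordsPerFile is set: each partition is
    split independently into ceil(rows / max_records_per_file) files."""
    return sum(math.ceil(rows / max_records_per_file)
               for rows in rows_per_partition)

# In the Spark job itself (not executed here; the path is a placeholder):
#   df.coalesce(4).write \
#     .option("maxRecordsPerFile", 1_000_000) \
#     .parquet("s3://my-bucket/out/")

# 4 partitions of 2.5M rows each, capped at 1M records per file:
print(estimate_output_files([2_500_000] * 4, 1_000_000))  # -> 12
```

So for the 100-rows-into-20-files case in the question, coalescing to 1 partition before the write would yield a single output file.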

Redshift insert bottleneck

I am trying to migrate a huge table from Postgres into Redshift.
The table has about 5,697,213,832 rows.
Tool: Pentaho Kettle, Table input (from Postgres) -> Table output (Redshift), connecting with the Redshift JDBC4 driver.
By observation I found that inserting into Redshift is the bottleneck: only about 500 rows/second.
Are there any ways to accelerate the insertion into Redshift in single-machine mode, e.g. by using JDBC parameters?
Have you considered using S3 as a mid-layer?
Dump your data to CSV files and apply gzip compression. Upload the files to S3 and then use the COPY command to load the data.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
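A sketch of that CSV-plus-gzip path (table name, bucket, and IAM role are hypothetical placeholders): write a gzip-compressed CSV with the standard library, upload it to S3 by whatever means you prefer, then run a COPY with the CSV and GZIP options.

```python
import csv
import gzip
import os
import tempfile

def write_gzipped_csv(path, rows, header):
    """Write rows as a gzip-compressed CSV file, the format suggested
    above for staging on S3 before running COPY."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def build_csv_copy(table, s3_prefix, iam_role):
    """COPY statement for gzipped CSVs; IGNOREHEADER 1 skips the header row."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"CSV GZIP IGNOREHEADER 1;"
    )

out = os.path.join(tempfile.gettempdir(), "chunk_000.csv.gz")
write_gzipped_csv(out, rows=[(1, "a"), (2, "b")], header=("id", "val"))
print(build_csv_copy("public.big_table",
                     "s3://my-bucket/exports/big_table/",
                     "arn:aws:iam::123456789012:role/redshift-copy-role"))
```

Splitting the export into multiple gzipped chunks under one prefix lets COPY load them in parallel across the cluster's slices.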
The main reason for the Redshift performance bottleneck, as I see it, is that Redshift treats each and every hit to the cluster as a single query. It executes each query on its cluster and then proceeds to the next stage. So when you send multiple rows (in this case 10), each row of data is treated as a separate query. Redshift executes each query one by one, and loading of the data is complete only once all the queries have executed. That means if you have 100 million rows, there will be 100 million queries running on your Redshift cluster, and performance goes down the drain!
Using the S3 File Output step in PDI will load your data to an S3 bucket; then apply the COPY command on the Redshift cluster to read that same data from S3 into Redshift. This will solve your performance problem.
You may also read the blog posts below:
Loading data to AWS S3 using PDI
Reading data from S3 to Redshift
Hope this helps :)
It's better to export the data to S3 and then use the COPY command to import it into Redshift. This way the import process is fast, and you don't need to vacuum afterwards.
Export your data to an S3 bucket and use the COPY command in Redshift. The COPY command is the fastest way to insert data into Redshift.