Save PostgreSQL data in Parquet format

I'm working on a project which needs to generate Parquet files from a huge PostgreSQL database. The data size can be gigantic (e.g. 10 TB). I'm very new to this topic and have done some research online, but did not find a direct way to convert the data to Parquet files. Here are my questions:
The only feasible solution I have seen is to load the Postgres table into Apache Spark via JDBC and save it as a Parquet file, but I assume this will be very slow when transferring 10 TB of data.
Is it possible to generate a single huge Parquet file of 10 TB, or is it better to create multiple Parquet files?
I hope my question is clear, and I really appreciate any helpful feedback. Thanks in advance!

Use the ORC format instead of the Parquet format for this volume.
I assume the data is partitioned, so I think it's a good idea to extract it in parallel, taking advantage of that partitioning.
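A rough sketch of that parallel JDBC extract in PySpark (all connection details, table names, and bounds below are made up, and the partition column is assumed to be numeric):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-export").getOrCreate()

# Spark opens numPartitions JDBC connections, each reading one slice of the
# partition column's range, so the extract runs in parallel.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-db-host:5432/mydb")  # hypothetical
    .option("dbtable", "public.big_table")                    # hypothetical
    .option("user", "export_user")
    .option("password", "***")
    .option("partitionColumn", "id")   # numeric column used to split the read
    .option("lowerBound", "1")
    .option("upperBound", "1000000000")
    .option("numPartitions", "200")
    .load()
)

# Write many files rather than one 10 TB file; ORC as suggested above,
# or swap in .parquet(...) if Parquet is required.
df.write.mode("overwrite").orc("hdfs:///warehouse/big_table_orc/")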

Related

SQL vs PySpark/Spark SQL

Could someone please help me understand why we need to use PySpark or Spark SQL etc. if the source and target of my data are the same DB?
For example, let's say I need to load data into table X in a Postgres DB from tables X and Y. Would it not be simpler and faster to just do it in Postgres instead of using Spark SQL or PySpark?
I understand the need for these solutions if the data comes from multiple sources, but if it is from the same source, do I need to use PySpark?
You can use Spark when you want to do heavy data transformations; its distributed processing makes large data easier to load and process.
It totally depends on how large the data is and how you want to transform it.
Using Postgres alone is a good idea if the data is relatively small and no heavy transformation is required.
It is not necessary to use PySpark. Both PySpark and Spark SQL have their value in managing/manipulating large volumes of data (a few hundred GBs, TBs, or PBs) in a distributed computing setup. If this is your case, use PySpark; it will be more efficient at loading, manipulating, and processing/shaping the data before inserting it into another table.
Thank you all for the feedback. I think I will use Glue PySpark if the source and destination are different. Otherwise I will use a Glue Python shell job with a JDBC connection and have one session do the tasks without bringing the data into DataFrames.
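A rough sketch of that second approach, assuming a Glue Python shell job with the psycopg2 driver packaged with it (connection details and table names are hypothetical):

import psycopg2  # assumed to be packaged with the Glue Python shell job

# Run the transformation entirely inside Postgres, so no data is pulled
# out into DataFrames at all.
conn = psycopg2.connect(
    host="my-db-host", dbname="mydb",   # hypothetical connection details
    user="etl_user", password="***",
)
with conn, conn.cursor() as cur:
    # One in-database statement instead of extract -> transform -> load
    cur.execute("""
        INSERT INTO x (id, value)
        SELECT x.id, y.value
        FROM x
        JOIN y ON y.x_id = x.id
    """)
conn.close()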

Glue Data write to Redshift too slow

I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB, split into 6 .csv files.
First question:
It's taking a lot of time to write data to Redshift from Glue, even though I am running 10 DPUs.
Second:
How can I make it faster and more efficient? Should I write the transformed data back to S3 in Parquet format and then maybe use a COPY command to load the data directly into Redshift?
Please suggest the best ideas and approaches.
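If you do go the Parquet-plus-COPY route the second question describes, a minimal sketch could look like this (the DataFrame name, bucket, table, and IAM role are all hypothetical):

# 1) Write the transformed DataFrame to S3 as Parquet
transformed_df.write.mode("overwrite").parquet("s3://my-bucket/staging/orders/")

# 2) Load it into Redshift with a single COPY; run this statement through
#    any Redshift SQL connection (execution not shown here)
copy_sql = """
    COPY analytics.orders
    FROM 's3://my-bucket/staging/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS PARQUET;
"""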

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically split the output into 20 files with 5 rows in each. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in the AWS Glue forum; here is a link to it: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown-predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
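A minimal sketch of the S3 case, assuming a hypothetical Glue Data Catalog table partitioned by a load_date column:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are read from S3; the database,
# table, and partition column names here are hypothetical.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="historical_rows",
    push_down_predicate="load_date >= '2019-01-01'",
)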
It's not possible to specify the names of the output files. However, it looks like there is an option of renaming the files after the write (note that a rename on S3 means copying the file from one location to another, so it is a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to control the number of output files using coalesce, though. Also, starting from Spark 2.2, you can set the maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
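For example (the path and numbers below are hypothetical):

# Cap the number of records per output file (Spark 2.2+) and reduce the
# number of output partitions before writing.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/output/")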

Spark dataframe CSV vs Parquet

I am a beginner in Spark and trying to understand the mechanics of Spark DataFrames.
I am comparing the performance of SQL queries on a Spark SQL DataFrame when loading data from CSV versus Parquet. My understanding is that once the data is loaded into a Spark DataFrame, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.
dataframe_csv = sqlContext.read.format("csv").load("<path to csv>")
dataframe_parquet = sqlContext.read.parquet("<path to parquet>")
Please explain the reason for the difference.
The reason you see different performance between CSV and Parquet is the formats themselves: Parquet uses columnar storage, while CSV is plain text. Columnar storage gives a smaller storage size and lets Spark read only the columns a query needs, whereas a CSV file has to be parsed line by line every time it is scanned.
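One way to see this yourself (the paths and column name are hypothetical): run the same query against both sources and compare the plans; the Parquet scan prunes down to the columns the query touches, while the CSV scan still parses every line.

df_csv = spark.read.option("header", "true").csv("/data/events.csv")
df_parquet = spark.read.parquet("/data/events.parquet")

# Same query against both sources; explain() shows the different scans.
df_csv.select("event_type").groupBy("event_type").count().explain()
df_parquet.select("event_type").groupBy("event_type").count().explain()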

Storing & reading custom metadata in parquet files using Spark / Scala

I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file, using Scala (preferably) with Spark?
The idea is that I store many similarly structured Parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the Parquet file). However, I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the Parquet file.
I really don't want to put this info in a filename, so my best option now is just to read the first row of each Parquet file and use the source column as a String field.
It works, but I was just wondering if there is a better way.
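A sketch of that first-row workaround, written in PySpark for consistency with the other snippets here (the Scala API is analogous); the path is hypothetical, and the "source" column name comes from the question:

# Parquet is columnar, so selecting only "source" avoids reading the rest
# of the file's data.
first_row = (
    spark.read.parquet("hdfs:///data/events/part-0000.parquet")  # hypothetical path
         .select("source")
         .first()
)
source_name = first_row["source"]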