Joining multiple S3 files in Scala programming in AWS Glue

How do I join multiple S3 files in Scala? Based on the joined data, I need to insert or update rows in a MySQL database.
Please let me know if there is a sample script for this.

Ramesh, though I do not have a Scala script that joins files and imports them into MySQL, this AWS link may give you an idea for creating separate DataFrames for the 3 files from S3 and then joining them as needed, before loading the result into MySQL/Redshift tables.
Create a Glue crawler and point it to the 3 files to generate the database/table catalogs for the S3 files (Ref: Setting up Glue catalog/crawlers).
In your Scala script, create DataFrames for the 3 tables, and then join them as needed.
URL: AWS examples for Join and Relationalize using Scala.
Thanks
Yuva
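
The join step described above can be sketched as follows. This is a minimal PySpark-style sketch, not the AWS sample; the equivalent Scala DataFrame calls have the same names. The helper name `join_three`, the key columns, and the database/table names in the comments are hypothetical.

```python
def join_three(orders, customers, products,
               customer_key="customer_id", product_key="product_id"):
    """Join three Spark DataFrames that were read from the Glue catalog.

    Each argument is expected to be a pyspark.sql.DataFrame; only
    DataFrame.join(other, on=..., how=...) is used here.
    """
    return (
        orders
        .join(customers, on=customer_key, how="inner")  # keep matching rows only
        .join(products, on=product_key, how="left")     # keep rows without a product match
    )

# Inside a Glue job the inputs would typically come from the catalog
# (hypothetical names), for example:
#   orders = glueContext.create_dynamic_frame.from_catalog(
#       database="my_db", table_name="orders").toDF()
# and the joined result could be appended to MySQL with the Spark JDBC writer:
#   joined.write.jdbc(url=jdbc_url, table="target_table", mode="append",
#                     properties={"user": db_user, "password": db_password})
```

Note that the Spark JDBC writer only appends or overwrites. For the "insert or update" requirement, one common approach is to write to a staging table and then run a MySQL `INSERT ... ON DUPLICATE KEY UPDATE` statement from it.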

Related

Is there a way to use Spark SQL to query partition information in AWS Glue Data Catalog (similar to in Athena)?

I'm currently developing a Glue ETL script in PySpark that needs to query my Glue Data Catalog's partitions and join that information with other Glue tables programmatically.
At the moment, I'm able to do this with Athena using SELECT * FROM db_name.table_name$partitions JOIN table_name2 ON ..., but it looks like this doesn't work with Spark SQL. The closest thing I've been able to find is SHOW PARTITIONS db_name.table_name, which doesn't seem to cut it.
Does anyone know an easy way I can leverage Glue ETL / Boto3 (Glue API) / PySpark to query my partition information in a SQL-like manner?
For the time being, the only possible workaround seems like the get_partitions() method in Boto3, but this looks like a lot more complex work to deal with from my end. I already have my Athena queries to get the information I need, so if there's ideally a way to replicate getting my tables' partitions in a similar way using SQL, that'd be amazing. Please let me know, thank you!

For those interested, an alternative workaround I've been able to find but still need to test out is the Athena API with the Boto3 client. I may also possibly use the AWS Wrangler integrated with Athena to retrieve a dataframe.
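
For the get_partitions() route mentioned in the question, a minimal sketch could flatten the Glue API response into one record per partition so it can be queried like a table. The function name `partition_rows` and the `location` field are hypothetical; the client is assumed to be `boto3.client("glue")` and is passed in as an argument.

```python
def partition_rows(glue_client, database, table):
    """Return one dict per partition of a Glue Data Catalog table.

    glue_client is expected to be boto3.client("glue"); passing it in keeps
    this function free of an import-time dependency on boto3.
    """
    # get_table exposes the partition key names (for example year, month).
    table_info = glue_client.get_table(DatabaseName=database, Name=table)
    keys = [k["Name"] for k in table_info["Table"]["PartitionKeys"]]

    rows = []
    # get_partitions is paginated, so walk every page.
    paginator = glue_client.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for part in page["Partitions"]:
            row = dict(zip(keys, part["Values"]))  # key name -> partition value
            row["location"] = part["StorageDescriptor"]["Location"]
            rows.append(row)
    return rows
```

A list like this can be handed to `spark.createDataFrame(rows)` and registered with `createOrReplaceTempView`, after which it can be joined to other tables in Spark SQL.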

ETL with Dataprep - Union Dataset

I'm a newcomer to GCP, and I'm learning every day and I'm loving this platform.
I'm using GCP's Dataprep to join several CSV files (with the same column structure), process some data, and write to BigQuery.
I created a storage bucket to hold all 60 CSV files. In Dataprep, can I define a dataset to be the union of all these files? Or do I have to create a dataset for each file?
Thank you very much for your time and attention.

If you have all your files inside a directory in GCS you can import that directory as a single dataset. The process is the same as importing single files. You have to make sure though, that the column structure is exactly the same for all the files inside the directory.
If you create a separate dataset for each file you are more flexible on the structure they have when you use the UNION page to concatenate them.
However, if your use case is just to load all the files (~60) into a single table in BigQuery without any transformation, I would suggest simply using a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want. Currently, BigQuery load jobs are free of charge, so this would be a very cost-effective solution compared to using Dataprep.
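
As a rough illustration of the wildcard load job, the sketch below builds a job body in the shape used by the BigQuery REST API (`configuration.load`). The function name and all bucket, project, dataset, and table names are hypothetical, and it assumes each CSV file has one header row.

```python
def csv_load_job(bucket, prefix, project, dataset, table):
    """Build a BigQuery load-job body covering every CSV under a GCS prefix."""
    return {
        "configuration": {
            "load": {
                # A single '*' wildcard matches all objects with this prefix.
                "sourceUris": [f"gs://{bucket}/{prefix}*.csv"],
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "sourceFormat": "CSV",
                "skipLeadingRows": 1,   # assumption: every file has a header row
                "autodetect": True,     # let BigQuery infer the schema
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }
```

The same settings are available in the `google-cloud-bigquery` Python client through `LoadJobConfig` and `Client.load_table_from_uri`, and on the command line through `bq load`.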

GCP Dataproc spark consuming BigQuery

I'm very new to Google Cloud Platform (GCP), so I hope my question does not look too silly.
Background:
The main goal is to gather a few large tables from BigQuery and apply a few transformations. Because of the size of the tables, I'm planning to use Dataproc and deploy a PySpark script; ideally I would be able to use sqlContext to run a few SQL queries against the DataFrames (tables pulled from BQ). Finally, I could easily dump the result into a file in a storage bucket.
Questions :
Can I use import google.datalab.bigquery as bq within my Pyspark script?
Is the proposed design the most efficient, or should I evaluate another? Keep in mind that I need to create many temporary queries, which is why I thought of Spark.
I expect to use pandas and bq to read the query results as pandas DataFrames, following this example. Later, I might use sc.parallelize from Spark to transform the pandas DataFrame into a Spark DataFrame. Is this the right approach?
my script
Update:
After some back and forth with @Tanvee, who kindly attended to this question, we concluded that GCP requires an intermediate staging step when you need to read data from DataStorage into Dataproc. Briefly, your Spark or Hadoop script may need a temporary bucket in which to store the data from the table before bringing it into Spark.
References:
BigQuery Connector
Deployment
thanks so much

You will need to use the BigQuery connector for Spark. There are some examples in the GCP documentation here and here. It will create an RDD, which you can convert to a DataFrame, and then you will be able to perform all the typical transformations. Hope that helps.

You can use the following options to connect to a BigQuery table directly from Spark.
You can also use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to run your queries directly on Dataproc using Spark.
https://github.com/GoogleCloudPlatform/spark-bigquery-connector This is a new connector, currently in beta. It is a Spark data source API for BigQuery and is easy to use.
Please refer following link:
Dataproc + BigQuery examples - any available?
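
A minimal sketch of reading a table through the GoogleCloudPlatform spark-bigquery-connector and then querying it with Spark SQL, assuming the connector jar is available on the Dataproc cluster. The function name, view name, and table name are hypothetical; `spark` is assumed to be an existing SparkSession.

```python
def query_bigquery_table(spark, table, sql, view_name="bq_table"):
    """Load a BigQuery table via the spark-bigquery connector and run SQL on it.

    spark is expected to be a pyspark.sql.SparkSession; table is a BigQuery
    table reference such as "dataset.table".
    """
    # The connector registers itself as the "bigquery" data source.
    df = spark.read.format("bigquery").option("table", table).load()
    # Expose the DataFrame to Spark SQL under a temporary view name.
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)

# Example call (hypothetical names):
#   result = query_bigquery_table(
#       spark, "my_dataset.events",
#       "SELECT user_id, COUNT(*) AS n FROM bq_table GROUP BY user_id")
```

When writing results back to BigQuery with this connector, data is staged in Cloud Storage first; the `temporaryGcsBucket` option names the bucket to use, which matches the temporary-bucket step described in the question's update.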

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically split the output into 20 files with 5 rows each. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in AWS Glue forum as well, here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743

Glue supports a pushdown predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the name of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying the file from one location to another, so it is a costly and non-atomic operation).
You can't really control the size of the output files. You can, however, limit the number of files using coalesce. Also, starting from Spark 2.2 it is possible to set the maximum number of records per file with the spark.sql.files.maxRecordsPerFile config.
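
The three points above can be sketched as small helpers. These are assumptions for illustration, not Glue features: the date filter goes through Spark's own JDBC reader (a subquery passed as `dbtable`) rather than a Glue DynamicFrame, the rename uses a boto3 S3 client passed in as an argument, and all function, table, and column names are hypothetical.

```python
def jdbc_read_options(url, table, date_column, since):
    """Options for spark.read.format("jdbc") that filter rows in the database.

    Spark accepts a parenthesised subquery with an alias as "dbtable", so the
    WHERE clause runs in PostgreSQL. Values are assumed to be trusted constants.
    """
    subquery = f"(SELECT * FROM {table} WHERE {date_column} >= '{since}') AS filtered"
    return {"url": url, "dbtable": subquery}


def write_capped(df, path, num_partitions, max_records):
    """Write Parquet with fewer partitions and a cap on records per file.

    Each partition produces at least one file; maxRecordsPerFile (Spark 2.2+)
    splits a partition into more files once the cap is reached.
    """
    (df.coalesce(num_partitions)
       .write.option("maxRecordsPerFile", max_records)
       .mode("append")
       .parquet(path))


def rename_s3_object(s3_client, bucket, old_key, new_key):
    """'Rename' an S3 object: copy to the new key, then delete the old one.

    Not atomic; s3_client is expected to be boto3.client("s3").
    """
    s3_client.copy_object(Bucket=bucket, Key=new_key,
                          CopySource={"Bucket": bucket, "Key": old_key})
    s3_client.delete_object(Bucket=bucket, Key=old_key)

# Usage with a SparkSession (hypothetical values):
#   opts = jdbc_read_options(jdbc_url, "events", "created_at", "2019-01-01")
#   df = spark.read.format("jdbc").options(**opts).load()
```

With 100 rows and 20 input partitions, for example, `write_capped(df, path, 1, 1000)` would be expected to produce a single output file instead of 20.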

Is Hive on Tez with ORC really faster than Spark SQL for ETL?

I have little experience with Hive and am currently learning Spark with Scala. I am curious whether Hive on Tez is really faster than Spark SQL. I searched many forums for test results, but they compare older versions of Spark and most were written in 2015. The main points are summarized below:
ORC will do the same as Parquet in Spark
The Tez engine gives performance as good as the Spark engine
Joins are better/faster in Hive than in Spark
I feel like Hortonworks supports Hive more than Spark, and Cloudera vice versa.
Initially I thought Spark would be faster than anything because of its in-memory execution. After reading some articles, I gather that Hive is also being improved with new concepts like Tez, ORC, LLAP, etc.
We currently run PL/SQL on Oracle and are migrating to big data because volumes are increasing. My requirements are ETL batch processing; the data involved in every weekly batch run is described below. Data will grow considerably soon.
Input/lookup data is in csv/text format and is updated into tables
Two input tables with 5 million rows and 30 columns
30 lookup tables used to generate each column of the output table, which contains around 10 million rows and 220 columns
Multiple joins are involved, both inner and left outer, since many lookup tables are used
Please advise which of the methods below I should choose for better performance, readability, and ease of making minor column updates in future production deployments.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
SparkSQL with Parquet format which is converting from text/csv
Scala for UDF
Hope we can perform multiple inner and left outer join in Spark

The best way to implement a solution to your problem is as follows.
For loading the data into the table, Spark looks like a good option to me. You can read the tables from the Hive metastore, perform the incremental updates using some kind of windowing function, and register the results in Hive. Because the data is populated from various lookup tables during ingestion, you can write that logic programmatically in Scala.
But at the end of the day, there needs to be a query engine that is very easy to use. Since your Spark program registers the table with Hive, you can use Hive.
Hive supports three execution engines:
Spark
Tez
MapReduce
Tez is mature; Spark is evolving, with various commits from Facebook and the community.
Business users can understand Hive very easily as a query engine, as it is much more mature in the industry.
In short, use Spark for the daily data processing and register the results with Hive.
Create business users in Hive.
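
One way to read "incremental updates using some kind of windowing function" is to keep only the newest row per key with `ROW_NUMBER()`. The sketch below builds that Spark SQL statement; the function name, view name, and column names are hypothetical, and this is only one possible interpretation of the advice above.

```python
def latest_rows_sql(source_view, key_columns, order_column):
    """Build a Spark SQL query that keeps the newest row for each key.

    Rows are ranked per key by order_column (descending) and only rank 1
    is kept, so a later batch replaces earlier versions of the same key.
    """
    keys = ", ".join(key_columns)
    return (
        "SELECT * FROM ("
        f"SELECT *, ROW_NUMBER() OVER (PARTITION BY {keys} "
        f"ORDER BY {order_column} DESC) AS rn "
        f"FROM {source_view}) ranked WHERE rn = 1"
    )

# Usage with a Hive-enabled SparkSession (hypothetical names):
#   latest = spark.sql(latest_rows_sql("staged_rows", ["id"], "updated_at")).drop("rn")
#   latest.write.mode("overwrite").format("orc").saveAsTable("warehouse.output_table")
```

Writing with `saveAsTable` registers the result in the Hive metastore, so it can then be queried from Hive on Tez.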
