Loading a dataframe from DynamoDB - AWS Glue PySpark

I'm trying to read records from a DynamoDB table. I tried using a dynamic frame, but since the table has 8 million records, filtering takes too long. I don't actually need to load all 8 million records into a dataframe. Instead of applying a filter on the dynamic frame, is there an option to load a dataframe by passing a query, so that only the few records I need are loaded and the job runs faster?

You can load a dataframe by passing a query to spark.sql(), but first you will have to run an AWS Glue crawler on the DynamoDB table. The crawler creates a corresponding table in the AWS Glue Data Catalog, which you can then query directly with a Spark dataframe.
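As a rough sketch (the database, table, and column names here are placeholders, and this assumes the crawler has already registered the catalog table), the query-based load might look like:

```python
def build_filter_query(database, table, column, value):
    """Build the SQL passed to spark.sql(); only matching rows end up
    in the resulting dataframe instead of all 8 million records."""
    return f"SELECT * FROM {database}.{table} WHERE {column} = '{value}'"

def load_filtered(database, table, column, value):
    """Meant to run inside an AWS Glue job, where the awsglue module and
    the Glue Data Catalog are available."""
    from awsglue.context import GlueContext   # Glue-only import
    from pyspark.context import SparkContext

    spark = GlueContext(SparkContext.getOrCreate()).spark_session
    return spark.sql(build_filter_query(database, table, column, value))
```

Note that Spark may still need to scan the DynamoDB-backed table to evaluate the filter; the gain is that only the filtered rows are materialized in the resulting dataframe.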

Related

Create Dataframe in PySpark with a UDF GROUPED_MAP per group

I need to train an ML model per client_id. I have around 100,000 clients, so the same number of models. I use Spark with the GROUPED_MAP UDF to train a model per client in parallel.
df.groupBy('client_id').apply(train_ml_model)
This works very well and splits every training job per client_id onto a worker node.
This UDF only works on a single dataframe with a column to do the groupBy on. The problem is that my data must first be queried from a warehouse per client_id, so there is no single dataframe yet.
In the current setup the main dataframe must be created first, but this takes many hours. A query like SELECT data FROM datawarehouse WHERE ID IN (client_IDs) collects the data for all client IDs; I use spark.read.load() for this.
There are several options to overcome this long loading time.
Option 1:
Is there a way to use this Spark SQL loading inside the GROUPED_MAP functionality, so that each worker node runs SELECT data FROM datawarehouse per client_id and retrieves only the data for that specific client_id?
If I try that, I receive an error that the JDBC functionality cannot run on a worker node.
Option 2:
First load the main dataframe using the parameters lowerBound, upperBound, and numPartitions to speed up the query. But this will take several hours as well.
Are there other options possible in Spark?
Can you please share the full load command? Are you using .option("dbtable", "(SELECT data FROM datawarehouse WHERE ID IN (client_IDs)) AS t") while loading? This filters at the source and sends only the required rows to Spark.
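A sketch of that pushdown read (the warehouse table and column names are placeholders): Spark's JDBC source accepts a parenthesized subquery with an alias as the dbtable value, so the warehouse executes the filter and only the matching rows travel to Spark.

```python
def pushdown_subquery(client_ids):
    """Build a dbtable value that the JDBC source treats as a subquery;
    the IN filter runs inside the warehouse, not in Spark."""
    ids = ", ".join(str(i) for i in client_ids)
    return f"(SELECT * FROM datawarehouse WHERE ID IN ({ids})) AS filtered"

def read_clients(spark, jdbc_url, client_ids):
    """Driver-side JDBC read; this cannot be called from inside the
    GROUPED_MAP UDF, which is why the per-worker attempt failed."""
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", pushdown_subquery(client_ids))
            .load())
```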

Incrementally loading into a Synapse table using Spark

I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using spark.write.saveAsTable('orders'). However, I am running into issues with incremental loads after the initial load. In particular, I have not found a way to reliably insert/update information in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using spark.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
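The read-modify-overwrite cycle described above can be sketched as follows (the table and key names are placeholders); the final saveAsTable fails because the plan for the merged dataframe still reads from the table being overwritten:

```python
def keys_to_replace(update_rows, key):
    """Distinct key values present in the incoming updates (pure helper)."""
    return {row[key] for row in update_rows}

def upsert_and_overwrite(spark, updates_df):
    """Sketch only: the last line raises the
    'Cannot overwrite table orders that is also being read from' error
    described above, because `merged` is lazily built on top of a read
    of the very table being overwritten."""
    current = spark.read.table("orders")
    keys = keys_to_replace(updates_df.select("order_id").collect(), "order_id")
    merged = (current.filter(~current["order_id"].isin(list(keys)))
              .unionByName(updates_df))
    merged.write.saveAsTable("orders", mode="overwrite", format="parquet")
```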
A solution indicated by this suggests creating a temporary table and then inserting using that, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.

Spark data pipeline initial load impact on production DB

I want to write a Spark pipeline that performs aggregation on my production DB data and then writes the results back to the DB. The goal is to run the aggregation without impacting the production DB: users should not experience lag, and the DB should not see heavy IOPS while the aggregation runs. For example, an equivalent aggregation run directly as a SQL query would take a long time and also consume the RDS IOPS, leaving users unable to get data, which is what I am trying to avoid. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default Glue reads the whole table into a single partition, but you can do parallel reads using this; make sure to choose a partition column that will not affect DB performance.
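A hedged sketch of such a parallel read (the URL, table, and column names are placeholders): with partitionColumn, lowerBound, upperBound, and numPartitions set, Spark issues one bounded range query per partition instead of a single full-table scan.

```python
def parallel_read_options(column, lower, upper, num_partitions):
    """JDBC options that split the read into num_partitions range queries
    on `column`; a roughly uniformly distributed, indexed column keeps
    the load on the DB predictable."""
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

def read_table(spark, url, table):
    reader = (spark.read.format("jdbc")
              .option("url", url)
              .option("dbtable", table))
    # Example bounds for an 8M-row table split across 8 range queries.
    for key, value in parallel_read_options("id", 1, 8_000_000, 8).items():
        reader = reader.option(key, value)
    return reader.load()
```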
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?
Yes. When you pass a query instead of a table name, you read only the query's result from the DB, reducing the network and IO transfer; you are delegating the computation to the DB engine. Refer to this for how to do it.
For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
Yes. Depending on the table size and query complexity this might affect DB performance, and if you have a read replica you can simply use that instead.
When writing data back to DB, does that incur load on DB as well?
Yes, though it depends on how you write the result back to the DB. A moderate number of partitions is best: not too many and not too few.
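As a sketch of a write-back that keeps the load bounded (the URL and table are placeholders): coalescing limits the number of concurrent DB connections, and Spark's JDBC batchsize option controls how many rows are sent per insert batch.

```python
def write_options(batch_size):
    """JDBC write options; batchsize is the number of rows Spark sends
    per round trip (Spark's default is 1000)."""
    return {"batchsize": str(batch_size)}

def write_back(df, url, table, partitions=8):
    # One DB connection per partition, so keep the count moderate.
    writer = (df.coalesce(partitions)
                .write.format("jdbc")
                .option("url", url)
                .option("dbtable", table)
                .mode("append"))
    for key, value in write_options(10_000).items():
        writer = writer.option(key, value)
    writer.save()
```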

Why are new columns added to parquet tables not available from glue pyspark ETL jobs?

We've been exploring using Glue to transform some JSON data to parquet. One scenario we tried was adding a column to the parquet table. So partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs to aggregate the parquet table but the new column was not available. Using glue_context.create_dynamic_frame.from_catalog to load the dynamic frame our new column was never in the schema.
We tried several configurations for our table crawler. Using a single schema for all partitions, single schema for s3 path, schema per partition. We could always see the new column in the Glue table data but it was always null if we queried it from a Glue job using pyspark. The column was in the parquet when we downloaded some samples and available for querying via Athena.
Why are the new columns not available to pyspark?
This turned out to be a spark configuration issue. From the spark docs:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
We could enable schema merging in two ways.
set the option on the spark session spark.conf.set("spark.sql.parquet.mergeSchema", "true")
set mergeSchema to true in the additional_options when loading the dynamic frame.
source = glueContext.create_dynamic_frame.from_catalog(
database="db",
table_name="table",
additional_options={"mergeSchema": "true"}
)
After that the new column was available in the frame's schema.

What is the efficient way of doing Date Partitioning in BigQuery of 2 TB data?

Currently I have exported the data as sharded files to Google Cloud Storage, downloaded them to a server, and I am streaming them into the partitioned table, but this takes a long time: roughly 1 GB every 40 minutes. Please help me make it faster. My machine has 12 cores and 20 GB of RAM.
You can load data from Google Cloud Storage directly into your partition using a load job (via the bq command-line tool or the API), instead of streaming.
To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
$YYYYMMDD
For example, the following command replaces the data in the entire partition for the date January 1, 2016 (20160101) in a partitioned table named mydataset.table1 with content loaded from a Cloud Storage bucket:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON 'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json
Note: Because partitions in a partitioned table share the table schema, replacing data in a partition will not replace the schema of the table. Instead, the schema of the new data must be compatible with the table schema. To update the schema of the table with the load job, use configuration.load.schemaUpdateOptions.
Read more https://cloud.google.com/bigquery/docs/creating-partitioned-tables
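The decorator pattern above can be scripted for a backfill. In this sketch the dataset, bucket, and path layout are assumptions; each generated command is a load job from Cloud Storage rather than a streaming insert, which is the faster path for bulk data.

```python
from datetime import date, timedelta

def partition_decorator(table, day):
    """Build the table$YYYYMMDD target for one daily partition."""
    return f"{table}${day.strftime('%Y%m%d')}"

def bq_load_commands(table, bucket, start, days):
    """One `bq load` per day; --replace swaps the whole partition.
    Assumes GCS objects are laid out under gs://<bucket>/<YYYYMMDD>/."""
    commands = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        commands.append(
            "bq load --replace --source_format=NEWLINE_DELIMITED_JSON "
            f"'{partition_decorator(table, day)}' "
            f"gs://{bucket}/{day.strftime('%Y%m%d')}/*.json"
        )
    return commands
```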