Why is a temporary GCS bucket needed to write a dataframe to BigQuery with pyspark?

Recently I faced an issue while writing dataframe data into BigQuery using pyspark. Here it is:
pyspark.sql.utils.IllegalArgumentException: u'Temporary or persistent GCS bucket must be informed
After researching the issue I found that a temporary GCS bucket has to be set in spark.conf:
bucket = "temp_bucket"
spark.conf.set('temporaryGcsBucket', bucket)
I understand that BigQuery, unlike Hive, has no concept of a file backing a table.
I would like to know more about this: why do we need a temporary GCS bucket to write data into BigQuery?
I searched for the reason behind this but couldn't find it.
Please clarify.

The Spark BigQuery connector has two write modes (writeMethod) for writing data into BigQuery: 1. Direct, 2. Indirect. This is an optional parameter and the default is Indirect.
Indirect
You can specify the indirect option like this: option("writeMethod", "indirect"). It is optional, and Indirect is the default. This mode requires you to specify a temporary GCS bucket; if you don't, you will get the error above.
The need for a temporary bucket is explained as follows:
The connector writes the data to BigQuery by first buffering all the
data into a Cloud Storage temporary table. Then it copies all data
from Cloud Storage into BigQuery in one operation.
Taken from the GCS Spark example docs here
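For illustration, a minimal PySpark sketch of an indirect write; the bucket, dataset, and table names are placeholders, and setting temporaryGcsBucket via spark.conf, as shown above, works as well:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-indirect-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Indirect (default) write: rows are first staged as files in the temporary
# GCS bucket, then loaded into BigQuery in one load job.
(df.write
    .format("bigquery")
    .option("writeMethod", "indirect")            # optional, indirect is the default
    .option("temporaryGcsBucket", "temp_bucket")  # placeholder bucket name
    .mode("append")
    .save("my_dataset.my_table"))                 # placeholder dataset.table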
Direct
In this method the data is written directly to BigQuery using the BigQuery Storage Write API. In Scala you can specify it like this: option("writeMethod", "direct"), which eliminates the need for a temporary bucket.
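Since the question is about pyspark, here is a minimal PySpark sketch of a direct write, assuming a connector version that supports the Storage Write API; the dataset and table names are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-direct-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Direct write: rows are streamed straight into BigQuery via the
# BigQuery Storage Write API, so no temporaryGcsBucket is needed.
(df.write
    .format("bigquery")
    .option("writeMethod", "direct")
    .mode("append")
    .save("my_dataset.my_table"))  # placeholder dataset.table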
You can read more about the BigQuery connector here

Related

How to export dataset from PostgreSQL to CSV on AWS so that users can download it?

I have an API where users can query some time-series data. But now I want to make the entire dataset available for users to download for their own uses. How would I go about doing something like this? I have RDS and an EC2 instance set up. What would my next steps be?
In this scenario, without any other data or restrictions given, I would put an S3 bucket at the center of this process.
Create an S3 Bucket to save the database/dataset dump.
Dump the database/dataset to S3 (examples: Docker, Lambda).
Manually transform the dataset to CSV, or use a Lambda triggered on every dataset dump (pg_dump won't give you CSV out of the box, but PostgreSQL's COPY ... TO ... WITH CSV will).
Host those datasets in a bucket accessible to your users and allow access as appropriate:
You can create a publicly available bucket and share its HTTP URL.
You can create a pre-signed URL to allow limited, time-bound access to your dataset (see the sketch below).
S3 is proposed since it's cheap and there is a lot of readily available tooling to work with it.
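For the pre-signed URL step, a minimal boto3 sketch; the bucket name, object key, and expiry are placeholder assumptions:
import boto3

s3 = boto3.client("s3")

# Generate a time-limited download link for one exported CSV dump.
# Bucket name, object key, and expiry below are placeholders.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-dataset-dumps", "Key": "exports/timeseries.csv"},
    ExpiresIn=3600,  # link valid for one hour
)
print(url)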

How can I parameterize a BigQuery table in Dataprep?

I'm used to using Dataprep to build recipes for JSON and CSV files from Cloud Storage, but today I tried to ingest a table from BigQuery and could not parameterize it.
Is it possible to do that?
Here are some screenshots to illustrate my question (captions only):
The prefix that I need
The standard parameterization does not work
Parameterizing from Cloud Storage works
In order to ingest a table from BigQuery, you can directly create a dataset with SQL. I am not sure what you would like to achieve with the 'Search' input, but it does not accept regular expressions. So the '*' is not needed; just type 'event_' and the interface will filter the matching entries.

How can we handle data validations in Snowpipe in Snowflake?

My scenario is that I have data in flat files in AWS S3.
I am using SNS to trigger the Snowpipe when a new file arrives in S3.
To load the data from the flat files in S3 into a Snowflake table I am using Snowpipe.
While loading data from the flat files into the Snowflake table via Snowpipe,
can I handle data validation and a couple of calculations on the source data?
Please let me know if there is any way to do this.
Thanks in advance.
The VALIDATION_MODE copy option is not yet supported by Snowpipe. However, Snowpipe does support simple transformations such as column reordering, casts, etc. The best way to perform calculations and transform your data would be to load the data into a staging table and process it downstream into the target tables.
Reference:
https://docs.snowflake.net/manuals/sql-reference/sql/create-pipe.html#usage-notes
https://docs.snowflake.net/manuals/user-guide/data-load-transform.html
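As a rough illustration of the staging-table approach, here is a sketch using the Snowflake Python connector; the pipe, stage, table, and column names and the connection parameters are all hypothetical, and only simple reordering/casting is done in the pipe itself:
import snowflake.connector

# All names and credentials below are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
conn.cursor().execute("""
    CREATE OR REPLACE PIPE staging_events_pipe AUTO_INGEST = TRUE AS
    COPY INTO staging_events (event_id, event_ts, amount)
    FROM (
        -- only simple reordering and casts here; heavier validation and
        -- calculations happen downstream, from the staging table into targets
        SELECT $1, $2, $3::NUMBER(12,2)
        FROM @my_s3_stage
    )
    FILE_FORMAT = (TYPE = 'CSV')
""")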

AWS Glue Scala Upsert

I am trying to upsert data into an existing S3 bucket from another one using AWS Glue in Scala. Is there a standard way to do this? One of the methods I found was to use SQL's MERGE. What are the advantages and disadvantages of using that?
Thanks
You can't really implement the SQL MERGE method on S3 since it's not possible to update existing data objects in place.
A workaround is to load the existing rows in a Glue job, merge them with the incoming dataset, drop obsolete records, and overwrite all objects on S3. If you have a lot of data it would be more efficient to partition it by some columns and then overwrite only those partitions that should contain new data.
If your goal is preventing duplicates then you can do something similar: load the existing data, drop from the incoming dataset those records that already exist in S3 (loaded in the previous step), and then write only the new records to S3.
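A minimal PySpark sketch of the overwrite-based upsert pattern; the S3 paths and the key column "id" are placeholders, and the same logic carries over to a Scala Glue job:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-upsert-sketch").getOrCreate()

# Placeholder paths and key column, for illustration only.
existing = spark.read.parquet("s3://target-bucket/table/")
incoming = spark.read.parquet("s3://source-bucket/new-data/")

# Keep all incoming rows plus the existing rows whose key does not
# appear in the incoming dataset, i.e. an upsert by "id".
merged = incoming.unionByName(
    existing.join(incoming.select("id"), on="id", how="left_anti")
)

# Write to a new prefix rather than overwriting the path being read,
# then swap prefixes (or repoint consumers) once the job succeeds.
merged.write.mode("overwrite").parquet("s3://target-bucket/table_new/")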

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in the AWS Glue forum; here is a link to that thread: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections though.
It's not possible to specify the names of output files. However, it looks like there is an option of renaming files after they are written (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of output files exactly. There is an option to control the number of output files using coalesce though. Also, starting from Spark 2.2 it is possible to set the maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
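On that last point, a minimal PySpark sketch; the paths, file count, and record limit are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-size-sketch").getOrCreate()

# Spark 2.2+: cap how many records go into each output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 500000)  # placeholder limit

df = spark.read.parquet("s3://my-bucket/input/")  # placeholder path

# coalesce() lowers the number of output partitions (and therefore files)
# without a full shuffle; here at most 10 files are produced.
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/output/")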