Query related to createExternalTable in PySpark

In PySpark,
is createExternalTable() used to create a permanent replica of a dataset/data warehouse, or just a temporary view for the particular action triggered?
If it is a temporary view, then what's the difference between createExternalTable() and createOrReplaceTempView()?
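For reference, a minimal sketch of how each call is typically invoked (the table, path, and view names here are placeholders, not from the question):

# spark.catalog.createExternalTable registers a metastore table backed by files at an external location
spark.catalog.createExternalTable("my_external_table", path="/path/to/data", source="parquet")

# createOrReplaceTempView only registers the DataFrame as a view scoped to the current SparkSession
df.createOrReplaceTempView("my_temp_view")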

Related

Difference Between df.write and CREATE TABLE USING

I have always been under the impression that the following code creates a Delta table:
data.write.format("delta").save("/path/to/delta-table")
This creates the files, sure. However, I noticed today that when I look at the Data section of Databricks, under the hive_metastore, this table does not show up.
In order for this table to show up there, I have to do something like:
CREATE TABLE some_table USING DELTA LOCATION "/path/to/delta-table"
What exactly is going on here? Was I wrong in my understanding that the .write operation creates a table? What is the difference between these commands?
DataFrameWriter has the following methods:
def save(path: String): Unit
Saves the content of the DataFrame at the specified path.
def saveAsTable(tableName: String): Unit
Saves the content of the DataFrame as the specified table.
What you did with .save("/path/to/delta-table") was save the data in Delta format to the filesystem. In order for the table to be visible in the data catalog (a.k.a. the metastore), you need to run CREATE TABLE providing the location.
Alternatively, you can write the data using .saveAsTable("delta-table") - that writes the data under a path managed by the metastore and registers the table in one step.
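A minimal PySpark sketch putting the two options side by side (data is the DataFrame from the question, spark the active SparkSession):

# Option 1: write files only - nothing gets registered in the metastore
data.write.format("delta").save("/path/to/delta-table")
# ...so the table has to be registered separately before it shows up in the catalog
spark.sql('CREATE TABLE some_table USING DELTA LOCATION "/path/to/delta-table"')

# Option 2: write the data and register the table in one step (metastore-managed location)
data.write.format("delta").saveAsTable("some_table")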

How to configure an AWS Glue crawler to read a CSV file having a comma in the dataset?

I have data as follows in a CSV file in an S3 bucket:
"Name"|"Address"|"Age"
----------------------
"John"|"LA,USA"|"27"
I have created the crawler, which created the table, but when I query the data in Athena, the quoted value containing a comma is not parsed correctly.
How do I configure the AWS Glue crawler to create a catalog table that reads the above data correctly?
You must have figured it out already, but I thought this answer would benefit anyone who visits this question.
This can be resolved either by using a crawler classifier or by modifying the table properties after the table is created.
Using a classifier:
Create a classifier with a "Quote symbol".
Add the classifier to the crawler you create.
Or you can modify the table SerDe properties by editing the table after the crawler creates it.
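For example, a minimal boto3 sketch of that SerDe edit (the database and table names are hypothetical; the separator and quote character match the sample data above):

import boto3

glue = boto3.client("glue")
db, table = "my_db", "my_csv_table"  # hypothetical names - use the ones your crawler created

# Fetch the crawler-created table and point it at OpenCSVSerde with the right separator/quote
tbl = glue.get_table(DatabaseName=db, Name=table)["Table"]
tbl["StorageDescriptor"]["SerdeInfo"] = {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {"separatorChar": "|", "quoteChar": '"', "escapeChar": "\\"},
}

# update_table accepts only TableInput fields, so keep just the relevant keys
glue.update_table(
    DatabaseName=db,
    TableInput={k: v for k, v in tbl.items()
                if k in ("Name", "StorageDescriptor", "PartitionKeys", "TableType", "Parameters")},
)

The classifier route can be scripted the same way with glue.create_classifier(CsvClassifier={...}), setting Delimiter and QuoteSymbol before running the crawler.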

How to create a BQ external table on top of a delta table in GCS and show only the latest snapshot

I am trying to create a BQ external table on top of a delta table which uses Google Cloud Storage as the storage layer. On the delta table, we perform DML which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the delta files exist. However, it pulls in even the deleted records, since a BQ external table cannot read the delta transaction log, which says which parquet files to consider and which ones to remove.
Is there a way we can expose the latest snapshot of the delta table (GCS location) in BQ as an external table, apart from programmatically copying the data from delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1: As Oliver suggested, generate the symlink_format_manifest files; you can either trigger this every time you update the table, or add a table property as stated here to automatically create those files when the delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2: Create an external table that points to the delta table location:
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3: Create another external table pointing to the symlink_format_manifest/manifest file:
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4: Create a view with the following query:
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot from the test.external_delta_snapshot view whenever your delta table is refreshed, without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add the new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. For those who are curious about how this solution works, please continue reading.
How this works:
The symlink manifest file contains a list of parquet files, in newline-delimited format, pointing to the current delta version's partitions:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we define another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which holds the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage or Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot, and the ability to filter the files we want to read using the _FILE_NAME column. The view simply encodes this procedure to get the latest snapshot. Whenever our delta table gets updated, the manifest and delta log tables pick up the newest data, so we always get the latest snapshot without any additional loading or data duplication.
A last word: it's a known fact that querying external tables is more expensive (execution cost) than querying BQ managed tables, so it's better to experiment with both dual writes, as Oliver suggested, and the external table solution you asked about. Storage is cheaper than execution, so there may be cases where keeping the data in both GCS and BQ costs less than keeping an external table like this.
I'm also developing this kind of pipeline, where we dump our delta lake files in GCS and present them in BigQuery. Generating the manifest file from your GCS delta files will give you the latest snapshot, based on which version is currently set on your delta files. Then you need a custom script that parses the manifest file to get the list of files and runs a bq load over those files (a sketch of that step follows the snippet below):
val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
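The manifest-parsing and load step could look like the following Python sketch, using the google-cloud-storage and google-cloud-bigquery clients instead of the bq CLI (the bucket, manifest path, and target table are assumptions):

from google.cloud import bigquery, storage

bucket_name = "test-delta"                           # assumption: bucket holding the delta table
manifest_blob = "_symlink_format_manifest/manifest"  # assumption: manifest of an unpartitioned table
target_table = "project_id.test.delta_snapshot"      # assumption: managed BQ table to load into

# Parse the manifest to get the parquet files that make up the latest snapshot
manifest = storage.Client().bucket(bucket_name).blob(manifest_blob).download_as_text()
uris = [line.strip() for line in manifest.splitlines() if line.strip()]

# Overwrite the managed table with exactly those files
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
bq.load_table_from_uri(uris, target_table, job_config=job_config).result()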
The workaround below might work for small datasets:
Have a separate BQ table.
Read the delta lake files into a DataFrame and then overwrite the BigQuery table with it, as sketched below.
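A minimal PySpark sketch of that workaround, assuming the spark-bigquery connector is available on the cluster and using hypothetical bucket and table names:

# Read the latest delta snapshot into a DataFrame
df = spark.read.format("delta").load("gs://my-bucket/path/to/delta-table")

# Overwrite a managed BigQuery table with it (indirect write via a staging bucket)
(df.write.format("bigquery")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save("project_id.dataset.table"))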

AWS DMS CDC task does not detect column name and type changes

I have created a CDC task that captures changes in a source PostgreSQL schema and writes them in Parquet format into a target S3 bucket. The task captures the inserts, updates and deletes correctly but fails to capture column name and type changes in the source.
When I change a column name or type of a table in the source and insert new rows to the table, the resulting Parquet file uses the old column name and type.
Is there a specific configuration I am missing, or is it not possible to achieve the desired outcome with this task in DMS?
If you change a column at the source, DMS should pick it up automatically and update it at the destination; check your DMS settings. You should not need to add the column manually at the destination.
Make sure you have the HandleSourceTableAltered parameter set to true in the task settings.[1] (The setting applies when the target metadata parameter BatchApplyEnabled is set to either true or false.)
Same goes for HandleSourceTableDropped or HandleSourceTableTruncated if this is relevant in your case.
Obviously, previously replicated Parquet files on S3 will not change to reflect this DDL change on the source.
[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.DDLHandling.html
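For reference, a minimal boto3 sketch of setting those DDL handling flags (the task ARN is a placeholder; in practice you would likely fetch the current task settings first and merge them rather than sending only this fragment):

import json
import boto3

dms = boto3.client("dms")

# DDL handling flags from the task settings documented at [1]
settings = {
    "ChangeProcessingDdlHandlingPolicy": {
        "HandleSourceTableAltered": True,
        "HandleSourceTableDropped": True,
        "HandleSourceTableTruncated": True,
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:region:account:task:EXAMPLE",  # placeholder ARN
    ReplicationTaskSettings=json.dumps(settings),
)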

create_dynamic_frame_from_catalog returning zero results

I'm trying to create a Glue DynamicFrame from an Athena table, but I keep getting an empty frame.
The Athena table is part of my Glue Data Catalog.
The create_dynamic_frame.from_catalog call doesn't raise any error. As a sanity check, I tried loading a random (non-existent) table and it did complain.
I know the Athena table has data, since querying the exact same table using Athena returns results.
The table is an external, partitioned JSON table on S3.
I'm using PySpark as shown below:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'raw_data' table
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***",
)

# Print out information about this data - I'm getting zero here
print("Count: ", raw_data_df.count())

# Also getting nothing here
raw_data_df.printSchema()
Is anyone facing the same issue? Could this be a permissions issue or a Glue bug, since no errors are raised?
There are several poorly documented features/gotchas in Glue, which is sometimes frustrating.
I would suggest investigating the following configurations of your Glue job:
Does the S3 bucket name have the aws-glue-* prefix?
Put the files in an S3 folder and make sure the crawler table definition points to the folder rather than the actual file.
I have also written a blog post on LinkedIn about other Glue gotchas, if that helps.
Do you have subfolders under the path your Athena table points to? glueContext.create_dynamic_frame.from_catalog does not read the data recursively. Either put the data in the root of where the table points, or add additional_options = {"recurse": True} to your from_catalog call, as in the sketch below.
credit: https://stackoverflow.com/a/56873939/5112418
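Applied to the code from the question, that would look something like this (the database/table placeholders are kept as-is):

raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***",
    additional_options={"recurse": True},  # read data in subfolders recursively
)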