How to read AWS Glue Data Catalog table schemas programmatically - amazon-redshift

I have a set of daily CSV files of uniform structure which I will upload to S3. A downstream job loads the CSV data into a Redshift database table. The number of columns in the CSV may increase, and from that point onwards the new files will arrive with the new columns in them. When this happens, I would like to detect the change and add the column to the target Redshift table automatically.
My plan is to run a Glue Crawler on the source CSV files. Any change in schema would generate a new version of the table in the Glue Data Catalog. I would then like to programmatically read the table structure (columns and their datatypes) of the latest version of the table in the Glue Data Catalog using Java, .NET or other languages, compare it with the schema of the Redshift table and, if new columns are found, generate a DDL statement to alter the Redshift table to add the columns.
Can someone point me to any examples of reading Glue Data Catalog tables using Java, .NET or other languages? Are there any better ideas to automatically add new columns to Redshift tables?

If you want to use Java, add this dependency:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-glue</artifactId>
    <version>{VERSION}</version>
</dependency>
And here's a code snippet to get your table versions and the list of columns:
import java.util.List;
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.Column;
import com.amazonaws.services.glue.model.GetTableVersionsRequest;
import com.amazonaws.services.glue.model.GetTableVersionsResult;
import com.amazonaws.services.glue.model.TableVersion;

AWSGlue client = AWSGlueClientBuilder.defaultClient();
GetTableVersionsRequest tableVersionsRequest = new GetTableVersionsRequest()
        .withDatabaseName("glue_catalog_database_name")
        .withTableName("table_name_generated_by_crawler"); // the table name belongs in withTableName, not withCatalogId
GetTableVersionsResult results = client.getTableVersions(tableVersionsRequest);
// Here you have all the table versions, at this point you can check for new ones
List<TableVersion> versions = results.getTableVersions();
// Here's how to get to the table columns of a given version
List<Column> tableColumns = versions.get(0).getTable().getStorageDescriptor().getColumns();
See the AWS documentation for the TableVersion and StorageDescriptor objects for the full object model.
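To go from those catalog columns to the ALTER statements the question describes, here is a minimal sketch; the Glue-to-Redshift type mapping and the idea of reading the existing Redshift columns from information_schema.columns are my own simplifications, not part of the original answer:
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import com.amazonaws.services.glue.model.Column;

public class RedshiftDdlGenerator {

    // existingRedshiftColumns: lower-cased column names already in the target table,
    // e.g. fetched over JDBC from information_schema.columns.
    static List<String> buildAlterStatements(List<Column> glueColumns,
                                             Set<String> existingRedshiftColumns,
                                             String targetTable) {
        return glueColumns.stream()
                .filter(c -> !existingRedshiftColumns.contains(c.getName().toLowerCase()))
                .map(c -> String.format("ALTER TABLE %s ADD COLUMN %s %s;",
                        targetTable, c.getName(), toRedshiftType(c.getType())))
                .collect(Collectors.toList());
    }

    // Deliberately rough type mapping; extend it for your own data.
    static String toRedshiftType(String glueType) {
        switch (glueType.toLowerCase()) {
            case "bigint":  return "BIGINT";
            case "int":     return "INTEGER";
            case "double":  return "DOUBLE PRECISION";
            case "boolean": return "BOOLEAN";
            default:        return "VARCHAR(65535)"; // crawlers usually infer 'string' for CSV columns
        }
    }
}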
You could also use the boto3 library for Python.
Hope this helps.

Related

Even after setting "orc.force.positional.evolution" to false, Hive is still matching columns by position

I have an external Hive table to which I have added a few new columns. I want the ORC data written from a Spark DataFrame to this table to be matched by column name rather than by position, so I set "orc.force.positional.evolution"="false" in TBLPROPERTIES, but the data is still written based on position, which is incorrect.
Please suggest what I am missing here. I used the question below as a reference:
Hive external table with ORC format- how to map the column names in the orc file to the hive table columns?
I have a workaround of using a select on the Spark DataFrame (reordering the columns by name, as sketched below), but I am looking for better options that do not require code changes.
The Hive version I am using is 3.1.
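For reference, a rough sketch of that select-based workaround in Spark's Java API; the database and table names are hypothetical, and it simply re-projects the DataFrame into the target table's column order before writing:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class OrcWriteByName {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("orc-write-by-name")   // hypothetical app name
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> df = spark.table("staging_db.source_data");  // hypothetical source

        // Read the target table's column order from the metastore and re-project
        // the DataFrame in that order, so a positional write still lines up by name.
        String[] targetColumns = spark.table("target_db.orc_external_table").columns();
        Column[] projection = new Column[targetColumns.length];
        for (int i = 0; i < targetColumns.length; i++) {
            projection[i] = col(targetColumns[i]);
        }

        df.select(projection)
          .write()
          .mode("append")
          .insertInto("target_db.orc_external_table");  // insertInto matches by position
    }
}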

How to configure an AWS Glue crawler to read a CSV file that has commas inside the data?

I have data as follows in a CSV file in an S3 bucket:
"Name"|"Address"|"Age"
----------------------
"John"|"LA,USA"|"27"
I have created the crawler, which created the table, but when I query the data in Athena the values are not parsed correctly.
How do I configure the AWS Glue crawler to create a catalog table that reads the above data correctly?
You must have figured it out already, but I thought this answer would benefit anyone who visits this question.
This can be resolved either by using a crawler classifier or by modifying the table properties after the table is created.
Using a classifier:
Create a classifier with the appropriate "Quote symbol"
Add the classifier to the crawler you create.
Alternatively, you can modify the table's SerDe properties by editing the table after the crawler has created it.
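If you prefer to set up the classifier route in code rather than the console, here is a hedged sketch using the same Java SDK as the first answer; the classifier name is hypothetical, and the delimiter and quote values match the sample data above:
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.CreateClassifierRequest;
import com.amazonaws.services.glue.model.CreateCsvClassifierRequest;

public class CreateQuotedCsvClassifier {
    public static void main(String[] args) {
        AWSGlue glue = AWSGlueClientBuilder.defaultClient();

        // A CSV classifier with an explicit quote symbol, so a value like "LA,USA"
        // stays in a single column; attach this classifier to the crawler.
        CreateCsvClassifierRequest csvClassifier = new CreateCsvClassifierRequest()
                .withName("pipe_delimited_quoted_csv")  // hypothetical name
                .withDelimiter("|")
                .withQuoteSymbol("\"");

        glue.createClassifier(new CreateClassifierRequest()
                .withCsvClassifier(csvClassifier));
    }
}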

How to create a BQ external table on top of a Delta table in GCS and show only the latest snapshot

I am trying to create a BQ external table on top of a Delta table which uses Google Cloud Storage as its storage layer. On the Delta table, we perform DML which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the Delta files exist. However, it pulls in even the deleted records, since the BQ external table cannot read the Delta transaction logs, which say which parquet files to consider and which ones to ignore.
Is there a way we can expose the latest snapshot of the Delta table (GCS location) in BQ as an external table, apart from programmatically copying data from Delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1: As Oliver suggested, generate the symlink_format_manifest files; you can either trigger this every time you update, or add a table property as stated in the Delta docs to automatically create those files whenever the Delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
Step 2: Create an external table that points to the Delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3: Create another external table pointing to the _symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4: Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
  *
FROM
  `project_id.test.external_delta_logs`
WHERE
  _FILE_NAME IN (SELECT * FROM `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot from the test.external_delta_snapshot view whenever your Delta table is refreshed, without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add the new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. For those who are curious about how this solution works, please continue reading.
How this works:
The symlink manifest file contains a newline-delimited list of parquet files pointing to the current Delta version's partitions:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our Delta location, we define another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned in the BQ docs, which points to the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage and Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot and the ability to filter the files we want to read using the _FILE_NAME column. The view we have defined just encodes this procedure to get the latest snapshot. Whenever our Delta table gets updated, the manifest and Delta log tables will point at the newest data, so we will always get the latest snapshot without any additional loading or data duplication.
One last word: it's a known fact that queries on external tables are more expensive (execution cost) than on BQ managed tables, so it's better to experiment with both dual writes as Oliver suggested and an external-table solution like the one you asked about. Storage is cheaper than execution, so there may be cases where keeping data in both GCS and BQ costs less than keeping an external table like this.
I'm also developing this kind of pipeline, where we dump our Delta Lake files into GCS and present them in BigQuery. Generating a manifest file from your GCS Delta files will give you the latest snapshot, based on whichever version is currently set on your Delta files. Then you need a custom script to parse that manifest file to get the list of files, and then run a bq load naming those files.
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath("<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
The workaround below might work for small datasets.
Have a separate BQ table.
Read the Delta Lake files into a DataFrame and then overwrite the BigQuery table with it.
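A rough sketch of that overwrite approach with Spark's Java API and the spark-bigquery connector (assuming the Delta and BigQuery connector packages are on the classpath; the bucket, dataset and table names are hypothetical):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaSnapshotToBigQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("delta-snapshot-to-bq")      // hypothetical app name
                .getOrCreate();

        // The Delta reader resolves the transaction log, so only the live
        // snapshot (with deletes applied) ends up in the DataFrame.
        Dataset<Row> snapshot = spark.read()
                .format("delta")
                .load("gs://test-delta/");

        // Overwrite a managed BigQuery table via the spark-bigquery connector.
        snapshot.write()
                .format("bigquery")
                .option("temporaryGcsBucket", "my-staging-bucket")  // hypothetical bucket
                .mode("overwrite")
                .save("test.delta_snapshot");                       // hypothetical dataset.table
    }
}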

XML file produced by the SqlPivotScriptProducer does not contain additional indexes

We are using the following producers:
SQL Server Producer
Template Producer
SqlPivotScriptProducer
With the Template Producer we create additional indexes.
The XML file produced by the SqlPivotScriptProducer does not contain these additional indexes.
Does anybody have an idea how to fix this?
The Pivot Script Producer generates the pivot file with information from the model and from the SQL Server database. In short, it uses the model to get the list of objects that should be in the pivot file, and uses the database to get the real definition of each object. For instance, if your template replaces a stored procedure defined in the model, the pivot script will contain the definition of the stored procedure as in the template. So if your template creates new database objects that are not in the model, they won't be in the pivot file.
You can customize the PivotRunner using the Action event
PivotRunner pivotRunner = new PivotRunner("Pivot\\Model1.pivot.xml");
pivotRunner.ConnectionString = CodeFluentContext.Get(Constants.Model1StoreName).Configuration.ConnectionString;
// OnAction is an event handler you write to customize what the runner does.
pivotRunner.Action += OnAction;
pivotRunner.Run();

Redshift - Adding a column, do we have to change our previous CSVs to include it?

I currently have a Redshift table in our database that has 10 columns, and I want to add another. It's trivial to do this with an ALTER TABLE.
My question: when I do this, will all my old CSV files fail to load into Redshift (via COPY from S3), given that they won't have this new column?
I was hoping the missing column would just be NULL instead of the import failing, but I haven't seen any documentation on this.
Ideally I would like to specify the actual column names in the header row of the CSV, but I haven't seen whether that is possible anywhere.
The FILLRECORD option of the COPY command does exactly that: 'Allows data files to be loaded when contiguous columns are missing at the end of some of the records'.
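For illustration, a minimal sketch of issuing such a COPY from Java over JDBC; the cluster endpoint, credentials, table, S3 path and IAM role are hypothetical, and the Redshift (or PostgreSQL) JDBC driver is assumed to be on the classpath:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyWithFillRecord {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev",
                "my_user", "my_password");
             Statement stmt = conn.createStatement()) {
            // FILLRECORD fills missing trailing columns with NULLs or zero-length
            // strings, as appropriate for the column types, so older CSVs that
            // lack the newly added column still load.
            stmt.execute(
                "COPY my_table " +
                "FROM 's3://my-bucket/daily/' " +
                "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role' " +
                "CSV IGNOREHEADER 1 " +
                "FILLRECORD");
        }
    }
}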