Databricks SQL AddColumn While Creating Delta Table - pyspark

I am trying to create a Delta table in the DBSQL metastore from a Delta bucket, with one added column. I do not want to pass in the schema, since it may change in the bucket over time, but I do want to add a column in the metastore only: a generatedAlways column, so that it is populated with new values as the Delta bucket receives new data. This is my code, based on the Databricks documentation:
from delta.tables import DeltaTable
from pyspark.sql.types import DateType

DeltaTable.createIfNotExists(spark) \
    .tableName("golddata.table") \
    .addColumn("date", DateType(), generatedAlwaysAs="CAST(concat(year,month,'-01') AS DATE)") \
    .location("cloudBucket://golddata/table") \
    .execute()
This code gives a schema mismatch error. Is there a way to add a column in the Databricks SQL metastore on top of the existing schema that is loaded from my Delta bucket? And will the generatedAlwaysAs column be updated as the data in the bucket is updated?

If you want to add columns to an existing Delta table, you have to specify two options:
the write or writeStream is set with .option("mergeSchema", "true")
spark.databricks.delta.schema.autoMerge.enabled is set to true
If these two are provided, then Delta should merge your extra column into the existing schema.
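For example, a minimal sketch of a batch append that carries the new column into the existing table (the DataFrame name is a placeholder and the path is the one from the question):
# Enable automatic schema evolution for this session (also covers MERGE operations).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Append a DataFrame that already contains the extra column; mergeSchema tells
# Delta to add that column to the table schema instead of raising a mismatch error.
(df_with_extra_column.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("cloudBucket://golddata/table"))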
Sources:
https://learn.microsoft.com/en-us/azure/databricks/delta/update-schema#add-columns-with-automatic-schema-update

Related

Databricks - How to change a partition of an existing Delta table via table path?

I want to use the syntax that was posted in a similar thread to update the partitions of a Delta table I have. The challenge is that the Delta table does not exist in Databricks; we use Databricks to update the Delta table via Azure Data Factory.
How can I adjust the syntax below to update the partitions and overwrite the table via the table path?
Scala:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
SQL:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
I tried to adjust the code above but could not get it to use the Delta table path.
If you have a path, then you need to use the correct functions:
for Python, you need to use .load for reading and .save for writing
if you're using SQL, then you specify the table as follows:
delta.`path`
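For instance, a rough sketch of the Python variant (the path is a placeholder; overwriting the same Delta path that was read works because Delta reads from a snapshot):
# Read the Delta table by path instead of by table name
input_df = spark.read.format("delta").load("/mnt/delta/mytable")  # placeholder path

# Overwrite it in place, repartitioned by the new column
(input_df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("colB")
    .save("/mnt/delta/mytable"))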

Cannot create a table having a column whose name contains commas in Hive metastore

I tried to create a Delta table from a Spark DataFrame using the command below:
destination_path = "/dbfs/mnt/kidneycaredevstore/delta/df_corr_feats_spark_4"
df_corr_feats_spark.write.format("delta").option("delta.columnMapping.mode", "name").option("path",destination_path).saveAsTable("CKD_Features_4")
I get the error below:
AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: default.abc_features_4; Column: Adverse, abc initial encounter
Please note that there are around 6k columns in this DataFrame, generated by the data scientists' feature engineering, so we cannot rename the columns.
How can I fix this error? Can any configuration change in the metastore solve this issue?
The column mapping feature requires writer version 5 and reader version 2 of the Delta protocol, therefore you need to specify this when saving:
(df_corr_feats_spark.write.format("delta")
    .option("delta.columnMapping.mode", "name")
    .option("delta.minReaderVersion", "2")
    .option("delta.minWriterVersion", "5")
    .option("path", destination_path)
    .saveAsTable("CKD_Features_4"))
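To confirm afterwards that the mapping mode and protocol versions were actually applied, one optional check from the same session is:
# Inspect the properties recorded on the new Delta table
spark.sql("SHOW TBLPROPERTIES CKD_Features_4").show(truncate=False)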

Why is Pyspark/Snowflake insensitive to underscores in column names?

Working in Python 3.7 and pyspark 3.2.0, we're writing out a PySpark dataframe to a Snowflake table using the following statement, where mode usually equals 'append'.
df.write \
.format('snowflake') \
.option('dbtable', table_name) \
.options(**sf_configs) \
.mode(mode)\
.save()
We've made the surprising discovery that this write can be insensitive to underscores in column names -- specifically, a dataframe with the column "RUN_ID" is successfully written out to a table with the column "RUNID" in Snowflake, with the column mapping accordingly. We're curious why this is so (I'm in particular wondering if the pathway runs through a LIKE statement somewhere, or if there's something interesting in the Snowflake table definition) and looking for documentation of this behavior (assuming that it's a feature, not a bug.)
According to the docs, the Snowflake connector defaults to mapping by column order instead of by name; see the column_mapping parameter:
The connector must map columns from the Spark data frame to the Snowflake table. This can be done based on column names (regardless of order), or based on column order (i.e. the first column in the data frame is mapped to the first column in the table, regardless of column name).
By default, the mapping is done based on order. You can override that by setting this parameter to name, which tells the connector to map columns based on column names. (The name mapping is case-insensitive.)
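So if you want mapping by name rather than by position, the parameter can be set explicitly on the writer; a sketch based on the write from the question (sf_configs, table_name and mode as before):
(df.write
    .format('snowflake')
    .option('dbtable', table_name)
    .option('column_mapping', 'name')  # map by column name instead of by position
    .options(**sf_configs)
    .mode(mode)
    .save())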

Glue load job not retaining default column value in redshift

I have a Glue job that loads a CSV from S3 into a redshift table. There is 1 column (updated_date) which is not mapped. Default value for that column is set to current_timestamp in UTC. But each time the Glue job runs, this updated_date column is null.
I tried removing updated_dt from the Glue metadata table, and I tried removing updated_dt from SelectFields.apply() in the Glue script.
When I run a normal INSERT statement in Redshift without the updated_dt column, the default current_timestamp value is inserted for those rows.
Thanks
Well I had the same problem. AWS support told me to convert the Glue DynamicFrame into a Spark DataFrame and use the Spark writer to load data into Redshift:
SparkDF = GlueDynFrame.toDF()
SparkDF.write.format('jdbc').options(
    url='<JDBC url>',
    dbtable='<schema>.<table>',
    user='<username>',
    password='<password>').mode('append').save()
I, on the other hand, handled the problem by either dropping and recreating the target table using preactions, or by running an UPDATE in postactions to set the default values, as sketched below.
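A sketch of the postactions variant, using Glue's Redshift writer (the connection, database, table and temp-dir names are placeholders; updated_date is the column from the question):
# Write the DynamicFrame to Redshift, then backfill the default timestamp
# for the freshly loaded rows via a postaction.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=GlueDynFrame,
    catalog_connection="my_redshift_connection",   # placeholder connection name
    connection_options={
        "dbtable": "my_schema.my_table",           # placeholder target table
        "database": "my_database",                 # placeholder database
        "postactions": "UPDATE my_schema.my_table "
                       "SET updated_date = GETDATE() "
                       "WHERE updated_date IS NULL;",
    },
    redshift_tmp_dir="s3://my-temp-bucket/glue-tmp/",  # placeholder temp dir
)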

Create hive table with partitions from a Scala dataframe

I need a way to create a Hive table from a Scala DataFrame. The Hive table should have its underlying files in ORC format in an S3 location, partitioned by date.
Here is what I have got so far:
I write the Scala DataFrame to S3 in ORC format:
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I now create a Hive table on top of these ORC files:
CREATE EXTERNAL TABLE "tableName"(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
But the hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove PARTITIONED BY line:
CREATE EXTERNAL TABLE "tableName"(columnName string, date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark DataFrame and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
after writing data into S3, the partition folders should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
when creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of the written data unless it sits under a recognized partition.
So, to check what partitions are currently in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose, because it gets more and more time consuming as the table grows. The best way is to make your code add only the newly created partitions to your metastore, for example through the metastore API or an explicit ALTER TABLE ... ADD PARTITION, as sketched below.
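A sketch of registering a single newly written partition from Spark (the table name and path are placeholders consistent with the examples above; this is one programmatic option, not necessarily the only one):
# Register one new partition explicitly instead of rescanning the whole location
spark.sql("""
    ALTER TABLE db.tableName ADD IF NOT EXISTS
    PARTITION (`date` = '2020-10-30')
    LOCATION 's3://anypathyouwant/mytablefolder/date=2020-10-30'
""")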