'DataFrame' object has no attribute 'to_delta' - pyspark

My code used to work. Why doesn't it work anymore? I updated to the newer Databricks Runtime 10.2, so I had to change some earlier code to use the pandas API on Spark.
# Drop customer ID for AutoML
automlDF = churn_features_df.drop(key_id)
# Write out silver-level data to autoML Delta lake
automlDF.to_delta(mode='overwrite', path=automl_silver_tbl_path)
The error I am getting is 'DataFrame' object has no attribute 'to_delta'

I was able to get it to work as expected using to_pandas_on_spark(). My working code looks like this:
# Drop customer ID for AutoML
automlDF = churn_features_df.drop(key_id).to_pandas_on_spark()
# Write out silver-level data to autoML Delta lake
automlDF.to_delta(mode='overwrite', path=automl_silver_tbl_path)
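For comparison, the same write can also be done with the plain Spark DataFrame writer, without converting to pandas on Spark; a minimal sketch reusing the same variables from the snippets above:
# DataFrame.to_delta only exists on the pandas-on-Spark API, so the plain
# Spark DataFrame writer is the non-converting alternative.
churn_features_df.drop(key_id) \
    .write.format("delta") \
    .mode("overwrite") \
    .save(automl_silver_tbl_path)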

Related

Saved Delta file reads as a DataFrame - is it still part of Delta Lake?

I have problems understanding the concept of Delta Lake. Example:
I read a parquet file:
taxi_df = (spark.read.format("parquet").option("header", "true").load("dbfs:/mnt/randomcontainer/taxirides.parquet"))
Then I save it using saveAsTable:
taxi_df.write.format("delta").mode("overwrite").saveAsTable("taxi_managed_table")
I read the just stored managed table:
taxi_read_from_managed_table = (spark.read.format("delta").option("header", "true").load("dbfs:/user/hive/warehouse/taxi_managed_table/"))
... and when I check the type, it shows "pyspark.sql.dataframe.DataFrame", not DeltaTable:
type(taxi_read_from_managed_table) # returns pyspark.sql.dataframe.DataFrame
Only after I transform it explicitly using the following command do I get the type DeltaTable:
taxi_delta_table = DeltaTable.convertToDelta(spark,"parquet.dbfs:/user/hive/warehouse/taxismallmanagedtable/")
type(taxi_delta_table) #returns delta.tables.DeltaTable
Does that mean that the table in stage 4 is not a Delta table and won't provide the automatic optimizations provided by Delta Lake?
How do you establish whether something is part of Delta Lake or not?
I understand that Delta Live Tables only work with delta.tables.DeltaTables, is that correct?
When you use spark.read...load(), it returns Spark's DataFrame object that you can use to process the data. Under the hood, this DataFrame reads the Delta Lake table. The DataFrame abstracts the data source, so you can work with different sources and apply the same operations.
On the other hand, DeltaTable is a specific object that lets you apply only Delta-specific operations. You don't even need to perform convertToDelta to get it; just use the DeltaTable.forPath or DeltaTable.forName functions to obtain an instance.
P.S. If you saved the data with .saveAsTable(my_name), then you don't need to use .load; just use spark.read.table(my_name).
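A minimal sketch of that, assuming the managed table name and warehouse path from the question (the history() call is just one example of a Delta-specific operation):
from delta.tables import DeltaTable

# Plain Spark DataFrame for general processing of the managed table
taxi_df = spark.read.table("taxi_managed_table")

# DeltaTable handle for Delta-specific operations; no convertToDelta needed
dt_by_name = DeltaTable.forName(spark, "taxi_managed_table")
dt_by_path = DeltaTable.forPath(spark, "dbfs:/user/hive/warehouse/taxi_managed_table/")

# Example Delta-specific operation: inspect the table history
dt_by_name.history().show()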

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server to Delta Lake tables. Recently I had to repoint the source to another table (same columns), but a data type is different in the new table. This causes an error while loading data to the Delta table. I am getting the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema would replace the existing target schema without any notification or alert.
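For reference, that overwriteSchema write would look roughly like this (just a sketch, reusing the placeholder path from above):
# Sketch only: overwriteSchema replaces the existing target schema with
# DF's schema on every overwrite, which is exactly my concern.
DF.write.mode("overwrite") \
    .format("delta") \
    .option("overwriteSchema", "true") \
    .save("s3 path")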
Also, I can't explicitly cast these columns, because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!

Cloning Power BI DirectQuery into a regular query

I have a Power BI dataset that someone shared with me. I would like to import it into Power BI Desktop and transform its data.
I used DirectQuery to import the dataset and I managed to create a calculated table:
My_V_Products = CALCULATETABLE(V_Products)
However, when I try using Transform Data, I do not see this table. I guess this is because it is not actually a table created from a query but from a DAX expression.
Is there a way to import the entire table using a query or convert the data to transformable data?
Only if the dataset is on a Premium capacity. If it is, you can connect to the workspace's XMLA endpoint using the Analysis Services connector and create an Import table using a custom DAX query, like evaluate V_Products.

How to write the data back to BigQuery using Databricks?

I would like to upload my DataFrame to a BigQuery table using Databricks. I used the code below and got the following errors.
bucket = "databricks-ci"
table = "custom-bidder.ciupdate.myTable"
df.write.format("bigquery").mode("overwrite").option("temporaryGcsBucket", bucket).option("table", table).save()
[Error message screenshot not included]
I created a new bucket called "databricks-ci" and also created a dataset called "ciupdate" and just gave my table name here "myTable". My project is "custom-bidder"
I am not sure why it's not loading. Can anyone advise?
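Since the error text itself isn't shown above, this is only a hedged sketch of a fuller write configuration with the spark-bigquery connector; the parentProject option is an addition that is not in the original snippet, and whether it (or missing GCS/BigQuery credentials on the cluster) is the actual problem is an assumption:
# Sketch, assuming the spark-bigquery connector is installed on the cluster
# and the service account can reach both the bucket and the dataset.
bucket = "databricks-ci"
table = "custom-bidder.ciupdate.myTable"   # project.dataset.table

(df.write.format("bigquery")
    .mode("overwrite")
    .option("parentProject", "custom-bidder")  # project billed for the job (assumption)
    .option("temporaryGcsBucket", bucket)
    .option("table", table)
    .save())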

create_dynamic_frame_from_catalog returning zero results

I'm trying to create a Glue DynamicFrame from an Athena table, but I keep getting an empty frame.
The Athena table is part of my Glue Data Catalog.
The create_dynamic_frame.from_catalog call doesn't raise any error. As a sanity check, I tried loading a random table and it did complain.
I know the Athena table has data, since querying the exact same table using Athena returns results.
The table is an external, partitioned JSON table on S3.
I'm using pyspark as shown below:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'raw_data' table
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***"
)

# Print out information about this data; I'm getting zero here
print("Count:", raw_data_df.count())

# Also getting nothing here
raw_data_df.printSchema()
Is anyone facing the same issue? Could this be a permissions issue or a Glue bug, since no errors are raised?
There are several poorly documented features/gotchas in Glue, which is sometimes frustrating.
I would suggest investigating the following configurations of your Glue job:
Does the S3 bucket name have an aws-glue-* prefix?
Put the files in an S3 folder and make sure the crawler's table definition points to the folder rather than to an individual file.
I have also written a blog on LinkedIn about other Glue gotchas if that helps.
Do you have subfolders under the path where your Athena table points to? glueContext.create_dynamic_frame.from_catalog does not read the data recursively. Either put the data in the root of where the table is pointing to, or add additional_options = {"recurse": True} to your from_catalog call, as in the sketch below.
credit: https://stackoverflow.com/a/56873939/5112418
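A minimal sketch of that call with the recurse option, keeping the masked database and table names from the question:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Same call as in the question, but telling Glue to read the partition
# subfolders under the table's S3 location recursively.
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",                      # masked database name from the question
    table_name="raw_***",                # masked table name from the question
    additional_options={"recurse": True}
)

print("Count:", raw_data_df.count())
raw_data_df.printSchema()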