Saved Delta file reads as a DataFrame - is it still part of Delta Lake? - pyspark

I have problems understanding the concept of delta lake. Example:
I read a parquet file:
taxi_df = (spark.read.format("parquet").option("header", "true").load("dbfs:/mnt/randomcontainer/taxirides.parquet"))
Then I save it using saveAsTable:
taxi_df.write.format("delta").mode("overwrite").saveAsTable("taxi_managed_table")
I read the just stored managed table:
taxi_read_from_managed_table = (spark.read.format("delta").option("header", "true").load("dbfs:/user/hive/warehouse/taxi_managed_table/"))
... and when I check the type it shows "pyspark.sql.dataframe.DataFrame", not deltaTable:
type(taxi_read_from_managed_table) # returns pyspark.sql.dataframe.DataFrame
Only after I convert it explicitly with the following command do I get the type DeltaTable:
taxi_delta_table = DeltaTable.convertToDelta(spark, "parquet.`dbfs:/user/hive/warehouse/taxismallmanagedtable/`")
type(taxi_delta_table) #returns delta.tables.DeltaTable
/////////////////////////////
Does that mean that the table in stage 4 is not a Delta table and won’t get the automatic optimizations provided by Delta Lake?
How do you establish whether something is part of the Delta Lake or not?
I understand that Delta Live Tables only work with delta.tables.DeltaTable objects; is that correct?

When you use spark.read...load(), it returns a Spark DataFrame object that you can use to process the data. Under the hood, this DataFrame uses the Delta Lake table. The DataFrame abstracts the data source, so you can work with different sources and apply the same operations.
On the other hand, DeltaTable is a specific object that lets you apply only Delta-specific operations. You don't even need to run convertToDelta to get it: just use the DeltaTable.forPath or DeltaTable.forName functions to obtain an instance.
P.S. If you saved the data with .saveAsTable(my_name), then you don't need to use .load; just use spark.read.table(my_name).
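A quick way to tell whether a directory holds a Delta table, independent of the Python type you get back: Delta Lake keeps its transaction log in a _delta_log subdirectory of the table directory. The sketch below checks for that on a local filesystem only. The helper name looks_like_delta_table and the temporary directories are hypothetical; for dbfs: or cloud paths you would need the matching filesystem API, or DeltaTable.isDeltaTable(spark, path) from the delta package, instead.

```python
import os
import tempfile


def looks_like_delta_table(table_dir: str) -> bool:
    """Return True if table_dir contains a Delta transaction log directory."""
    return os.path.isdir(os.path.join(table_dir, "_delta_log"))


# Demo on throwaway local directories (hypothetical layout, not real data).
with tempfile.TemporaryDirectory() as root:
    delta_dir = os.path.join(root, "taxi_managed_table")
    plain_dir = os.path.join(root, "plain_parquet")
    os.makedirs(os.path.join(delta_dir, "_delta_log"))
    os.makedirs(plain_dir)

    delta_result = looks_like_delta_table(delta_dir)
    plain_result = looks_like_delta_table(plain_dir)

print(delta_result, plain_result)
```

The point is that "being a Delta table" is a property of the stored files, not of the DataFrame or DeltaTable object used to access them.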

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data into the Delta table. I get the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema will replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parameterized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!

Difference Between df.write and CREATE TABLE USING

I have always been under the impression that the following code creates a Delta table:
data.write.format("delta").save("/path/to/delta-table")
This creates the files, sure, however, I noticed today that when I look at the Data section of Databricks, under the hive_metastore, this table does not show up.
In order for this table to show up there, I have to do something like,
CREATE TABLE some_table USING DELTA LOCATION "/path/to/delta-table"
What exactly is going on here? Was I wrong in my understanding that the .write operation creates a table? What is the difference between these commands?
DataFrameWriter has the following methods:
def save(path: String): Unit
Saves the content of the DataFrame at the specified path.
def saveAsTable(tableName: String): Unit
Saves the content of the DataFrame as the specified table.
What you did with .save("/path/to/delta-table") was save the data in Delta format to the filesystem. For the table to be visible in the data catalog (a.k.a. the metastore), you need to run CREATE TABLE and provide the location.
You can instead write the data using .saveAsTable("delta_table"); that writes the data under a path managed by the metastore and registers the table in one step.
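If the data was already written with .save(path), registering it is a single SQL statement. Here is a minimal sketch that builds that statement. The helper name register_delta_sql is hypothetical, the table name and path are the placeholders from the question, and the resulting string is meant to be run with spark.sql(...) in a session where Delta Lake is available.

```python
def register_delta_sql(table_name: str, location: str) -> str:
    """Build a CREATE TABLE statement pointing the metastore at existing Delta files."""
    # Backticks quote the identifier; single quotes delimit the path literal.
    return f"CREATE TABLE IF NOT EXISTS `{table_name}` USING DELTA LOCATION '{location}'"


stmt = register_delta_sql("some_table", "/path/to/delta-table")
print(stmt)
# With a live session (assumed): spark.sql(stmt)
```

The statement only adds a catalog entry that points at the existing files; it does not copy or rewrite the data.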

How to load data from a Delta table into a variable in order to apply Drools rules to it

I am using Spark with Scala. I receive streaming data from Event Hubs and store it in a Delta table. To apply Drools rules to the data, I need to pass it in through variables, and I am stuck on how to get the data from the Delta table into a variable.
It really depends on what data you need to pass to the Drools rules and what you need to return. You can either:
Use a user-defined function: you define a function that receives one or more parameters (column values of a specific row).
Use the map function of the Dataset/DataFrame class to process the whole Row.
Delta tables can be read into DataFrames, and a variable can be assigned to point to the DataFrame:
df = spark.read.format("delta").load("some/delta/path")
Once the Delta Table is read, you can apply your custom transformations:
transformed_df = df.transform(first_transform).transform(second_transform)
Hope this helps point you in the right direction.

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields can differ in each monthly run. Based on this data in S3, we load the data into a table and manually run a SQL query for a few metrics (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations and transforms on this data, but as a starting point I'm presenting a simpler version of the use case.
Approach:
Given the schema-less nature of the data (the number of fields in the S3 file can differ in each run, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/DataFrame? The S3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from S3; the DataFrame takes care of that. The issue is how to generate the DataFrame API or Spark SQL code to handle them.
I can read the S3 file into a DataFrame and register it with createOrReplaceTempView to write SQL against it, but I don't think that helps, because I would still have to change the Spark SQL manually when a new field is added in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
Usecase-1:
First-run
dataframe: customer,month_1_count (here dataframe directly points to s3, which has only required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer,month_1_count,month_2_count (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala, so it would be helpful if you could point me in a direction that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that may help as you learn:
DataFrames have data attributes stored in a list: dataframe.columns
Functions can be applied to lists to create new lists as in "column_search"
The agg function accepts a dictionary mapping column names to aggregate function names, which is what I pass in as "columns"
Spark is lazy, so it doesn't change data state or perform operations until you call an action such as show(). This means that building a temporary DataFrame just to use one of its attributes, such as columns, as I do here, is not costly, even though it may seem inefficient if you're used to SQL.
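If you would rather stay with Spark SQL text, as in the question, the same idea applies: derive the query string from the column list. Below is a minimal sketch in plain Python. The helper name build_sum_query and its parameters are hypothetical, and it assumes the DataFrame was registered with createOrReplaceTempView("dataframe") and that the metric columns all contain "month" in their names.

```python
def build_sum_query(columns, view_name="dataframe", group_col="customer", marker="month"):
    """Build a GROUP BY query that sums every column whose name contains marker."""
    metric_cols = [c for c in columns if marker in c and c != group_col]
    if not metric_cols:
        raise ValueError(f"no columns containing {marker!r} found")
    # Backticks keep column names that start with a digit or contain odd characters valid.
    sums = ", ".join(f"sum(`{c}`) AS `sum_{c}`" for c in metric_cols)
    return f"SELECT `{group_col}`, {sums} FROM `{view_name}` GROUP BY `{group_col}`"


first_run = build_sum_query(["customer", "month_1_count"])
second_run = build_sum_query(["customer", "month_1_count", "month_2_count"])
print(first_run)
print(second_run)
# With a live session (assumed): spark.sql(build_sum_query(df.columns))
```

Because the query is rebuilt from df.columns on every run, a new month_N_count column is picked up without editing the SQL by hand.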

How can I resolve table names to Parquet on the fly?

I need to run Spark SQL queries with my own custom correspondence from table names to Parquet data. Reading Parquet data to DataFrames with sqlContext.read.parquet and registering the DataFrames with df.registerTempTable isn't cutting it for my use case, because those calls have to be run before the SQL query, when I might not even know what tables are needed.
Rather than using registerTempTable, I'm trying to write an Analyzer that resolves table names using my own logic. However, I need to be able to resolve an UnresolvedRelation to a LogicalPlan representing Parquet data, but sqlContext.read.parquet gives a DataFrame, not a LogicalPlan.
A DataFrame seems to have a logicalPlan attribute, but that's marked protected[sql]. There's also a ParquetRelation class, but that's private[sql]. That's all I found for ways to get a LogicalPlan.
How can I resolve table names to Parquet with my own logic? Am I even on the right track with Analyzer?
You can actually retrieve the logicalPlan of your DataFrame with
val myLogicalPlan: LogicalPlan = myDF.queryExecution.logical