I'm working on the delta merge logic and want to delete a row from the Delta table when that row is no longer present in the latest DataFrame read.
My sample DataFrame is shown below:
df = spark.createDataFrame(
    [
        ('Java', "20000"),   # create your data here, be consistent in the types.
        ('PHP', '40000'),
        ('Scala', '50000'),
        ('Python', '10000')
    ],
    ["language", "users_count"]  # add your column names here
)
Insert the data into a delta table:
df.write.format("delta").mode("append").saveAsTable("xx.delta_merge_check")
On the next read, I've removed the row that shows ('Python', '10000'), and now I want to delete this row from the Delta table using the Delta merge API.
df_latest = spark.createDataFrame(
    [
        ('Java', "20000"),   # create your data here, be consistent in the types.
        ('PHP', '40000'),
        ('Scala', '50000')
    ],
    ["language", "users_count"]  # add your column names here
)
I'm using the code below for the Delta merge API.
Read the existing delta table:
from delta.tables import *
test_delta = DeltaTable.forPath(spark,
    "wasbs://storageaccount#xx.blob.core.windows.net/hive/warehouse/xx/delta_merge_check")
Merge the changes:
test_delta.alias("t").merge(df_latest.alias("s"), "s.language = t.language") \
    .whenMatchedDelete(condition="s.language = true") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
But unfortunately this doesn't delete the row ('Python', '10000') from the Delta table. Is there any other way to achieve this? Any help would be much appreciated.
This won't work the way you think it should - the basic problem is that your 2nd dataset doesn't carry any information about what was deleted, so you somehow need to add that information to it. There are different approaches, depending on the specific details:
Instead of just removing the row, keep it but add another column that shows whether the data is deleted or not, something like this (a fuller sketch, including the source DataFrame with the flag, follows after these options):
test_delta.alias("t").merge(df_latest.alias("s"), "s.language = t.language") \
    .whenMatchedDelete(condition="s.is_deleted = true") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
Use some other method to find the diff between your destination table and the input data - but this really depends on your logic. If you are able to calculate the diff, then you can use the approach described in the previous item.
If your input data is always a full set of data, you can just overwrite everything using overwrite mode - this can be even more performant than a merge, because you don't need to read and join against the existing data.
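For the first approach, here is a minimal sketch (is_deleted is an assumed column name, and DeltaTable.forName is used on the assumption that the table created with saveAsTable is reachable by name; the explicit column lists in the update/insert clauses keep the flag out of the target table):

from delta.tables import DeltaTable

# Source read: the kept rows, plus the removed row explicitly flagged as deleted
df_latest = spark.createDataFrame(
    [
        ("Java", "20000", False),
        ("PHP", "40000", False),
        ("Scala", "50000", False),
        ("Python", "10000", True),  # this row should disappear from the target
    ],
    ["language", "users_count", "is_deleted"],
)

test_delta = DeltaTable.forName(spark, "xx.delta_merge_check")

(test_delta.alias("t")
    .merge(df_latest.alias("s"), "s.language = t.language")
    .whenMatchedDelete(condition="s.is_deleted = true")
    .whenMatchedUpdate(set={"users_count": "s.users_count"})
    .whenNotMatchedInsert(values={"language": "s.language", "users_count": "s.users_count"})
    .execute())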
Related question:
I am using Spark with Scala, receiving streaming data from Event Hubs and storing it in a Delta table. In order to apply Drools rules on the data, I need to pass it through variables... I am stuck at the point where I have to get the data from the Delta table into a variable.
It really depends on what data you need to pass to those Drools rules, and what you need to return. You can either:
Use a user-defined function (UDF) - you define a function that receives one or more parameters (column values of specific rows).
Use the map function of the Dataset / DataFrame class to process the whole Row.
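For example, a minimal UDF sketch (shown in PySpark for brevity; the same idea applies with Scala UDFs). The rule logic, column name, and path below are placeholders, not the actual Drools call:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Placeholder for whatever the Drools rule would compute from a column value
def apply_rule(value):
    return "matched" if value is not None and value.startswith("A") else "not_matched"

rule_udf = F.udf(apply_rule, StringType())

df = spark.read.format("delta").load("some/delta/path")                 # assumed path
result = df.withColumn("rule_result", rule_udf(F.col("some_column")))   # assumed column name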
Delta Tables can be read into DataFrames. A variable can be assigned to point to the DataFrame.
df = spark.read.format("delta").load("some/delta/path")
Once the Delta Table is read, you can apply your custom transformations:
transformed_df = df.transform(first_transform).transform(second_transform)
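For example, each transformation is just a plain function that takes and returns a DataFrame (the logic and column names below are purely illustrative; DataFrame.transform requires PySpark 3.0+):

from pyspark.sql import DataFrame, functions as F

def first_transform(df: DataFrame) -> DataFrame:
    # e.g. keep only the rows you care about (column name assumed)
    return df.filter(F.col("value").isNotNull())

def second_transform(df: DataFrame) -> DataFrame:
    # e.g. add a derived column
    return df.withColumn("load_date", F.current_date())

transformed_df = df.transform(first_transform).transform(second_transform)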
Hope this helps point you in the right direction.
I get an error when I execute the following line of code:
deltaTarget.alias('target').merge(df.alias('source'), mergeStatement).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
The error is the following:
AnalysisException: cannot resolve new_column in UPDATE clause given columns {List of target columns}
The 'new_column' is indeed not in the schema of the target Delta table, but according to the documentation, this should just update the existing schema of the Delta table and add the column.
I also enable autoMerge with this command:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled ","true")
I am not sure what exactly causes this error because in the past I was able to evolve the schema of delta tables automatically with these exact pieces of code.
Is there something that I am overlooking?
You have to remove the space after ..autoMerge.enabled in the spark.conf.set - it's
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled","true")
, not
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled ","true")
I have the same problem as you, but I found in the Delta Lake docs that it may not support part of the columns with updateAll() and insertAll();
so I chose updateExpr() and insertExpr() with a big map containing all the columns.
See the Delta Lake merge docs: Schema validation.
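In PySpark, the equivalent of Scala's updateExpr()/insertExpr() is passing an explicit column map to whenMatchedUpdate()/whenNotMatchedInsert(). A rough sketch of that approach, where the target table name and the join condition are assumptions and df is the source DataFrame from the question:

from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")  # no trailing space

deltaTarget = DeltaTable.forName(spark, "target_table")   # assumed table name
mergeStatement = "target.id = source.id"                  # assumed join condition

# Explicit map covering every source column, including the newly added one
col_map = {c: f"source.{c}" for c in df.columns}

(deltaTarget.alias("target")
    .merge(df.alias("source"), mergeStatement)
    .whenMatchedUpdate(set=col_map)         # counterpart of Scala's updateExpr(...)
    .whenNotMatchedInsert(values=col_map)   # counterpart of Scala's insertExpr(...)
    .execute())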
If I'm not mistaken, you need to use the insertAll or updateAll options on the MERGE operation.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled","true")
Ensure there is no space after "enabled" in the line above.
Then you can run a Spark SQL statement:
spark.sql(f"""
MERGE INTO {data_path} delta USING global_temp.src source
ON delta.col1 = source.key1
AND delta.col2 = source.key2
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
""")
As per my analysis, append will re-add the data even though it is already available in the table, whereas the overwrite SaveMode will update existing data, if any, and add the additional rows from the data frame.
val secondCompaniesDF = Seq((100, "comp1"), (101, "comp2"),(103,"comp2"))
.toDF("companyid","name")
secondCompaniesDF.write.mode(SaveMode.Overwrite)
.option("createTableColumnTypes","companyid int , name varchar(100)")
.jdbc(url, "Company", connectionProperties)
If SaveMode is Append and this program is re-executed, Company will get the 3 rows added again, whereas in the case of Overwrite, if re-executed with any changes or additional rows, existing records will be updated and new rows will be added.
Note: Overwrite drops the table and re-creates it. Is there any way for existing records to get updated and new records to get inserted, something like an upsert?
For upsert and merge you can use Delta Lake (by Databricks) or Apache Hudi.
Here are the links
https://github.com/apache/hudi
https://docs.databricks.com/delta/delta-intro.html
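As a rough sketch of such an upsert with Delta Lake's merge (shown in PySpark; the Scala API is analogous), assuming the Company data has already been written out as a Delta table registered under that name:

from delta.tables import DeltaTable

# Incoming batch with new/changed companies (same illustrative values as above)
updates = spark.createDataFrame(
    [(100, "comp1"), (101, "comp2"), (103, "comp2")],
    ["companyid", "name"],
)

company = DeltaTable.forName(spark, "Company")   # assumed to exist as a Delta table

(company.alias("t")
    .merge(updates.alias("s"), "t.companyid = s.companyid")
    .whenMatchedUpdateAll()      # update rows whose companyid already exists
    .whenNotMatchedInsertAll()   # insert brand-new companyids
    .execute())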
I am working in the Microsoft Azure Databricks environment using Spark SQL and PySpark.
So I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which can either contain NULL (if everything looks good on that specific record) or a non-null value (say, if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing non-null values). From these files, on a daily basis (hence the partition by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records and waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also try to see if a mapping is found for the errored records and, if present, take them down further as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The best way would be to first directly update the delta table/lake with the correct mapping and update the status column to say "available_for_reprocessing", and have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there only shows a simple update with constants, not updates from multiple tables.
The other, but most inefficient, way is to pull ALL the data (both processed and errored) for the last, say, 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at the most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I can use PySpark or Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
  SELECT *
  FROM your_db.table_name AS b
  JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
  JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
  -- ... add further lookups if needed
  WHERE
    b.status = 'incorrect' AND
    a.lookup_column_a = b.lookup_column_a AND
    a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
Our cluster has Spark 1.3 and Hive
There is a large Hive table that I need to add randomly selected rows to.
There is a smaller table that I read to check a condition; if that condition is true, I grab the variables I need to then query for the random rows to fill in. What I did was query on that condition, table.where(value < number), then turn the result into an array using take(num rows). Then, since all of these rows contain the information I need about which random rows are needed from the large Hive table, I iterate through the array.
When I do the query I use ORDER BY RAND() (using sqlContext). I created a var for the Hive table (to be mutable), adding a column from the larger table. In the loop, I do a unionAll: newHiveTable = newHiveTable.unionAll(random_rows)
I have tried many different ways to do this, but am not sure what is the best way to avoid CPU and temp disk use. I know that Dataframes aren't intended for incremental adds.
One thing I have thought of trying now is to create a csv file, write the random rows to that file incrementally in the loop, then when the loop is finished, load the csv file as a table, and do one unionAll to get my final table.
Any feedback would be great. Thanks
I would recommend that you create an external table with Hive, defining the location, and then let Spark write the output as CSV to that directory:
in Hive:
create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'
And then from Spark, with the aid of https://github.com/databricks/spark-csv, write the DataFrame to CSV files, appending to the existing ones:
df.write.format("com.databricks.spark.csv").option("delimiter", ";") // match the field terminator declared in the Hive DDL
  .mode(SaveMode.Append).save("/SOME/HDFS/LOCATION/")