Spark write data by SaveMode as Append or Overwrite - Scala

As per my analysis, the Append SaveMode will re-add the data even though it is already available in the table, whereas the Overwrite SaveMode will update existing data if any and add the additional rows from the DataFrame.
val secondCompaniesDF = Seq((100, "comp1"), (101, "comp2"), (103, "comp2"))
  .toDF("companyid", "name")

secondCompaniesDF.write.mode(SaveMode.Overwrite)
  .option("createTableColumnTypes", "companyid int, name varchar(100)")
  .jdbc(url, "Company", connectionProperties)
If SaveMode is Append and this program is re-executed, Company will have the 3 rows appended again, whereas in the case of Overwrite, if it is re-executed with any changed or additional rows, existing records are updated and new rows are added.
Note: Overwrite drops the table and re-creates it. Is there any way for existing records to get updated and new records to get inserted, something like an upsert?

For upsert and merge you can use Delta Lake by Databricks or Apache Hudi.
Here are the links:
https://github.com/apache/hudi
https://docs.databricks.com/delta/delta-intro.html
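For example, with Delta Lake the upsert can be expressed as a merge. Below is a minimal, illustrative Scala sketch, assuming a SparkSession named spark is in scope (as in spark-shell) and that the Company data has already been written once as a Delta table named company; the table and column names are just placeholders for this example:
import io.delta.tables.DeltaTable
import spark.implicits._

// New batch: companyid 101 gets an updated name, companyid 104 is brand new.
val updatesDF = Seq((101, "comp2-renamed"), (104, "comp4"))
  .toDF("companyid", "name")

// Target Delta table, assumed to have been created earlier with the same schema.
val companyTable = DeltaTable.forName(spark, "company")

companyTable.as("t")
  .merge(updatesDF.as("s"), "t.companyid = s.companyid")
  .whenMatched().updateAll()    // update rows whose key already exists
  .whenNotMatched().insertAll() // insert rows with new keys
  .execute()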

Related

Update column in a dataset only if matching record exists in another dataset in Tableau Prep Builder

Any way to do this? Basically I'm trying to do the equivalent of a SQL UPDATE ... SET when a matching record for one or more key fields exists in another dataset.
I tried using Joins and Merge. Joins seem to require more steps, and Merge appends records instead of updating the corresponding rows.

Apache Hudi - How to understand the hudi write operation vs spark savemode?

How should I understand the Hudi write operation being upsert while the DataFrame SaveMode is Append? Since this will upsert the records, why Append instead of Overwrite? What's the difference?
As shown in the example below:
Example: Upsert a DataFrame, specifying the necessary field names for recordKey => _row_key, partitionPath => partition, and precombineKey => timestamp
inputDF.write()
  .format("org.apache.hudi")
  .options(clientOpts) // Where clientOpts is of type Map[String, String]. clientOpts can include any other options necessary.
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .mode(SaveMode.Append)
  .save(basePath);
Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
When you use Overwrite mode, you tell Spark to delete the table and recreate it (or just the partitions that exist in your new DataFrame, if you use a dynamic partitionOverwriteMode).
But when we use Append mode, Spark appends the new data to the existing data on disk/cloud storage. With Hudi we can provide an additional operation to merge the two versions of data: update old records whose key is present in the new data, keep old records whose key is not present in the new data, and add new records with new keys. This is totally different from overwriting data.
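To make that concrete, here is a hedged sketch reusing the identifiers from the quickstart snippet above: spark.sql.sources.partitionOverwriteMode is a standard Spark SQL setting, and hoodie.datasource.write.operation is the Hudi option that requests upsert semantics while the Spark-level SaveMode stays Append.
// Plain Spark: with this set to "dynamic", Overwrite replaces only the partitions present in the new DataFrame.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Hudi: SaveMode is Append at the Spark level, but the write operation performs an upsert on the record key.
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option("hoodie.datasource.write.operation", "upsert"). // merge on the key instead of blindly appending
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)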

Delta merge logic whenMatchedDelete case

I'm working on the Delta merge logic and want to delete a row in the Delta table when that row has been removed from the latest DataFrame read.
My sample DF is shown below:
df = spark.createDataFrame(
    [
        ('Java', '20000'),   # create your data here, be consistent in the types.
        ('PHP', '40000'),
        ('Scala', '50000'),
        ('Python', '10000')
    ],
    ["language", "users_count"]  # add your column names here
)
Insert the data into a Delta table:
df.write.format("delta").mode("append").saveAsTable("xx.delta_merge_check")
On the next read I've removed the row ('Python', '10000'), and now I want to delete this row from the Delta table using the Delta merge API.
df_latest = spark.createDataFrame(
    [
        ('Java', '20000'),   # create your data here, be consistent in the types.
        ('PHP', '40000'),
        ('Scala', '50000')
    ],
    ["language", "users_count"]  # add your column names here
)
I'm using the code below for the Delta merge API.
Read the existing Delta table:
from delta.tables import *

test_delta = DeltaTable.forPath(
    spark, "wasbs://storageaccount#xx.blob.core.windows.net/hive/warehouse/xx/delta_merge_check")
Merge the changes:
test_delta.alias("t").merge(df_latest.alias("s"), "s.language = t.language") \
    .whenMatchedDelete(condition="s.language = true") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
But unfortunately this doesn't delete the row ('Python', '10000') from the Delta table. Is there any other way to achieve this? Any help would be much appreciated.
This won't work the way you think it should. The basic problem is that your second dataset doesn't carry any information that data was deleted, so you somehow need to add that information to it. There are different approaches, depending on the specific details:
Instead of just removing the row, keep it, but add another column that shows whether the data is deleted or not, something like this:
test_delta.alias("t").merge(df_latest.alias("s"),
"s.language = t.language").whenMatchedDelete(condition = "s.is_deleted =
true").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
Use some other method to find the diff between your destination table and the input data, but this really depends on your logic. If you are able to calculate the diff, you can then use the approach described in the previous item.
If your input data is always a full set of data, you can simply overwrite everything using Overwrite mode. This can even be more performant than a merge, because you don't have to read and join against the existing data.
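For completeness, a minimal Scala sketch of the first approach; the table name xx.delta_merge_check and the column names mirror the question, while the is_deleted flag is an assumption that would have to be populated upstream:
import io.delta.tables.DeltaTable
import spark.implicits._

// The incoming batch keeps the removed row but flags it instead of dropping it.
val dfLatestFlagged = Seq(
  ("Java",   "20000", false),
  ("PHP",    "40000", false),
  ("Scala",  "50000", false),
  ("Python", "10000", true)  // marked as deleted upstream
).toDF("language", "users_count", "is_deleted")

val target = DeltaTable.forName(spark, "xx.delta_merge_check")

target.as("t")
  .merge(dfLatestFlagged.as("s"), "s.language = t.language")
  .whenMatched("s.is_deleted = true").delete() // drop rows flagged as deleted
  .whenMatched().updateAll()                   // refresh rows that still exist
  .whenNotMatched().insertAll()                // add brand-new rows
  .execute()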

How to add partitioning to existing Iceberg table

How can I add partitioning to an existing Iceberg table that is not partitioned? The table is already loaded with data.
The table was created like this:
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.catalog._
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.PartitionSpec
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.lit

val df1 = spark
  .range(1000)
  .toDF
  .withColumn("level", lit("something"))

val catalog = new HiveCatalog(spark.sessionState.newHadoopConf())

val icebergSchema = SparkSchemaUtil.convert(df1.schema)
val icebergTableName = TableIdentifier.of("default", "icebergTab")
val icebergTable = catalog
  .createTable(icebergTableName, icebergSchema, PartitionSpec.unpartitioned)
Any suggestions?
Right now, the way to add partitioning is to update the partition spec manually.
val table = catalog.loadTable(tableName)
val ops = table.asInstanceOf[BaseTable].operations
val spec = PartitionSpec.builderFor(table.schema).identity("level").build
val base = ops.current
val newMeta = base.updatePartitionSpec(spec)
ops.commit(base, newMeta)
There is a pull request to add an operation to make changes, like addField("level"), but that isn't quite finished yet. I think it will be in the 0.11.0 release.
Keep in mind:
After you change the partition spec, the existing data files will have null values in metadata tables for the partition fields. That doesn't mean that the values would have been null if the data were written with the new spec, just that the metadata doesn't have the values for existing data files.
Dynamic partition replacement will have a different behavior in the new spec because the granularity of a partition is different. Without a spec, INSERT OVERWRITE will replace the whole table. With a spec, just the partitions with new rows will be replaced. To avoid this, we recommend using the DataFrameWriterV2 interface in Spark, where you can be more explicit about what data values are overwritten.
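For illustration, a sketch of what those DataFrameWriterV2 calls could look like in Spark 3.x, reusing df1 and the default.icebergTab name from the question; the explicit overwrite condition here is only an example:
import org.apache.spark.sql.functions.col

// Replace only the partitions that appear in df1 (dynamic partition overwrite).
df1.writeTo("default.icebergTab").overwritePartitions()

// Or be fully explicit about which rows get replaced.
df1.writeTo("default.icebergTab").overwrite(col("level") === "something")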
For Spark 3.x, you can use the ALTER TABLE SQL extensions to add a partition field to an existing table:
Iceberg supports adding new partition fields to a spec using ADD PARTITION FIELD:
spark.sql("ALTER TABLE default.icebergTab ADD PARTITION FIELD level")
Adding a partition field is a metadata operation and does not change any of the existing table data. New data will be written with the new partitioning, but existing data will remain in the old partition layout. Old data files will have null values for the new partition fields in metadata tables.
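Note that the ALTER TABLE ... ADD PARTITION FIELD syntax requires the Iceberg SQL extensions to be enabled on the Spark session. A hedged configuration sketch (catalog names and settings may differ in your environment):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("IcebergPartitionEvolution")
  // Enables Iceberg DDL extensions such as ADD PARTITION FIELD.
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Route the session catalog through Iceberg so existing Hive tables keep working.
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("ALTER TABLE default.icebergTab ADD PARTITION FIELD level")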

Issue: inserting data into Hive creates small part files

I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values (the JSON has a mixed structure that is not fixed, so I parse it and generate the required JSON elements), build a JSON string similar to the json_string variable, and push it to a Hive table. The data is stored properly, but the Hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new part file (.1 to .20 KB). Because of that, even a simple query on Hive takes more than 30 minutes. Below is sample code of my logic; it iterates multiple times to insert new records into Hive.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._ // needed for .toDS on Seq

var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
// df.write.format("orc").saveAsTable("bds_data1.newversion")
df.write.mode("append").format("orc").insertInto("bds_data1.newversion")
I have also tried adding the Hive property to merge the files, but it won't work.
I have also tried creating a table from the existing table to combine the small part files into 256 MB files.
Please share sample code to insert multiple records and append records to an existing part file.
I think each of those individual inserts is creating a new part file.
You could create a Dataset/DataFrame of all of these JSON strings and then save it to the Hive table in a single write, as sketched below.
You could merge the existing small files using the Hive DDL ALTER TABLE table_name CONCATENATE;
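As an illustration of the batched approach, a minimal sketch that collects the generated JSON strings and writes them in one go, assuming the extracted strings share the same fields; the second record and the coalesce factor are purely illustrative, and bds_data1.newversion is the table from the question:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BatchedJsonInsert")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Collect the whole batch of generated JSON strings instead of inserting them one by one.
val jsonStrings: Seq[String] = Seq(
  """{"name":"yogesh_wagh","education":"phd"}""",
  """{"name":"another_user","education":"msc"}"""
)

val df = spark.read.json(jsonStrings.toDS)

// One write for the whole batch; coalesce keeps the number of output part files small.
df.coalesce(1)
  .write
  .mode("append")
  .format("orc")
  .insertInto("bds_data1.newversion")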