Apache Hudi - How to understand the hudi write operation vs spark savemode? - pyspark

How should I understand a Hudi write that uses the upsert operation but sets the DataFrame save mode to append? Since this upserts the records, why append instead of overwrite? What's the difference?
As shown in the picture:

Example: Upsert a DataFrame, specifying the necessary field names for recordKey => _row_key, partitionPath => partition, and precombineKey => timestamp
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) //Where clientOpts is of type Map[String, String]. clientOpts can include any other options necessary.
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)

When you use overwrite mode, you tell Spark to delete the table and recreate it (or only the partitions present in your new DataFrame if you use dynamic partitionOverwriteMode).
When you use append mode, Spark appends the new data to the existing data on disk or cloud storage. With Hudi you can additionally specify a write operation that merges the two versions of the data: it updates old records whose key is present in the new data, keeps old records whose key is not present in the new data, and adds new records that have new keys. This is completely different from overwriting the data.
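To make the three behaviours concrete, here is a toy model in plain Python (no Spark or Hudi needed). It is only a sketch of the semantics described above, not how Hudi implements them. The field names uuid, ts and fare are borrowed from the quickstart-style example and the sample rows are made up; it also assumes, as a simplification, that on a key collision the row with the larger ts wins.

```python
from typing import Dict, List

Record = Dict[str, object]


def overwrite(existing: List[Record], incoming: List[Record]) -> List[Record]:
    """Overwrite: the old data is dropped, only the incoming rows remain."""
    return list(incoming)


def append(existing: List[Record], incoming: List[Record]) -> List[Record]:
    """Plain append: incoming rows are added, duplicates of a key are kept."""
    return list(existing) + list(incoming)


def upsert(existing: List[Record], incoming: List[Record],
           record_key: str = "uuid", precombine: str = "ts") -> List[Record]:
    """Upsert: update rows whose key already exists, insert the rest.

    Assumption for this sketch: the row with the larger precombine value wins.
    """
    merged = {row[record_key]: row for row in existing}
    for row in incoming:
        current = merged.get(row[record_key])
        if current is None or row[precombine] >= current[precombine]:
            merged[row[record_key]] = row
    return sorted(merged.values(), key=lambda r: r[record_key])


# Made-up sample data: key "b" exists in both versions, "c" is new.
old_rows = [{"uuid": "a", "ts": 1, "fare": 10.0},
            {"uuid": "b", "ts": 1, "fare": 20.0}]
new_rows = [{"uuid": "b", "ts": 2, "fare": 25.0},
            {"uuid": "c", "ts": 2, "fare": 30.0}]

after_overwrite = overwrite(old_rows, new_rows)  # only b and c remain
after_append = append(old_rows, new_rows)        # a, b, b, c
after_upsert = upsert(old_rows, new_rows)        # a, b (updated), c
```

In this model the save mode decides whether the old data survives at all, while the upsert operation decides how surviving old rows are reconciled with new ones. That is why the Hudi example combines Append with upsert.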

Related

Spark write data by SaveMode as Append or overwrite

As per my analysis, Append will re-add the data even though it's already in the table, whereas the Overwrite SaveMode will update existing data, if any, and will add any additional rows in the DataFrame.
val secondCompaniesDF = Seq((100, "comp1"), (101, "comp2"),(103,"comp2"))
.toDF("companyid","name")
secondCompaniesDF.write.mode(SaveMode.Overwrite)
.option("createTableColumnTypes","companyid int , name varchar(100)")
.jdbc(url, "Company", connectionProperties)
If the SaveMode is Append and this program is re-executed, Company gets the same 3 rows added again, whereas with Overwrite, if it is re-executed with any changes or additional rows, existing records will be updated and new rows will be added.
Note: Overwrite drops the table and re-creates it. Is there any way for existing records to get updated and new records to get inserted, something like an upsert?
For upsert and merge you can use Delta Lake by Databricks or Apache Hudi.
Here are the links:
https://github.com/apache/hudi
https://docs.databricks.com/delta/delta-intro.html

spark-shell load existing hive table by partition?

In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way so it only loads one particular partition of this table.
Thanks!
There is no direct way to do this in spark.read.format, but you can use a where condition:
val df = spark.read.format("orc").load("mytable").where("<partition_col_name> = <expected_value>")
Nothing is loaded until you perform an action. load (pointing to your ORC file location) is just a function in DataFrameReader, as shown below, and it doesn't read any data until an action is triggered.
See DataFrameReader:
def load(paths: String*): DataFrame = {
...
}
In the code above (spark.read ... where), the where is just a filter condition; when you specify it, the data still won't be loaded immediately :-)
When you call df.count, the filter on your partition column is applied to the ORC data path.
There is no function in the Spark API to load only a partition directory. However, a partition directory is nothing but a column you can use in a where clause, so you can write a simple SQL query with the partition column in the where clause, and it will read data only from that partition directory. See if that works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")
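The idea that "a partition directory is just a column in the where clause" can be sketched in plain Python. This is only a model of Hive-style column=value directory layout, not Spark's actual pruning code. The column name dt, the dates and the part-00000.orc file name are made up, and the files are empty placeholders rather than real ORC data.

```python
import os
import tempfile
from typing import List


def make_demo_table() -> str:
    """Create a throwaway directory laid out like a Hive-partitioned table."""
    root = tempfile.mkdtemp()
    for day in ("2020-01-01", "2020-01-02", "2020-01-03"):
        part = os.path.join(root, f"dt={day}")
        os.makedirs(part)
        # Placeholder file standing in for a real ORC data file.
        with open(os.path.join(part, "part-00000.orc"), "w") as handle:
            handle.write("placeholder")
    return root


def partition_dirs(table_path: str, column: str, value: str) -> List[str]:
    """Return only the directories whose name encodes column=value."""
    wanted = f"{column}={value}"
    return sorted(
        os.path.join(table_path, name)
        for name in os.listdir(table_path)
        if name == wanted
    )


table = make_demo_table()
# A filter like dt = '2020-01-02' only needs to touch one of three directories.
selected = partition_dirs(table, "dt", "2020-01-02")
```

Because the partition value is encoded in the path, a filter on that column can be answered by choosing directories before any file is opened.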

How to add partitioning to existing Iceberg table

How to add partitioning to existing Iceberg table which is not partitioned? Table is loaded with data already.
Table was created:
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.catalog._
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.PartitionSpec
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.functions.lit // needed for lit(...) below
val df1 = spark
.range(1000)
.toDF
.withColumn("level",lit("something"))
val catalog = new HiveCatalog(spark.sessionState.newHadoopConf())
val icebergSchema = SparkSchemaUtil.convert(df1.schema)
val icebergTableName = TableIdentifier.of("default", "icebergTab")
val icebergTable = catalog
.createTable(icebergTableName, icebergSchema, PartitionSpec.unpartitioned)
Any suggestions?
Right now, the way to add partitioning is to update the partition spec manually.
val table = catalog.loadTable(tableName)
val ops = table.asInstanceOf[BaseTable].operations
val spec = PartitionSpec.builderFor(table.schema).identity("level").build
val base = ops.current
val newMeta = base.updatePartitionSpec(spec)
ops.commit(base, newMeta)
There is a pull request to add an operation to make changes, like addField("level"), but that isn't quite finished yet. I think it will be in the 0.11.0 release.
Keep in mind:
After you change the partition spec, the existing data files will have null values in metadata tables for the partition fields. That doesn't mean that the values would have been null if the data were written with the new spec, just that the metadata doesn't have the values for existing data files.
Dynamic partition replacement will have a different behavior in the new spec because the granularity of a partition is different. Without a spec, INSERT OVERWRITE will replace the whole table. With a spec, just the partitions with new rows will be replaced. To avoid this, we recommend using the DataFrameWriterV2 interface in Spark, where you can be more explicit about what data values are overwritten.
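The change in overwrite granularity can be illustrated with a small plain-Python model. It is a sketch of the behaviour described above, not Iceberg code: a spec is modelled as a tuple of column names, and the sample rows are made up.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def partition_key(row: Row, spec: Tuple[str, ...]) -> Tuple[object, ...]:
    """Partition a row belongs to; an empty spec puts every row in ()."""
    return tuple(row[field] for field in spec)


def dynamic_overwrite(table_rows: List[Row], new_rows: List[Row],
                      spec: Tuple[str, ...]) -> List[Row]:
    """Replace only the partitions that receive at least one new row."""
    replaced = {partition_key(row, spec) for row in new_rows}
    kept = [row for row in table_rows
            if partition_key(row, spec) not in replaced]
    return kept + list(new_rows)


# Made-up sample data.
table_rows = [{"id": 1, "level": "a"}, {"id": 2, "level": "b"}]
new_rows = [{"id": 3, "level": "b"}]

# Unpartitioned: everything is one partition, so the whole table is replaced.
unpartitioned_result = dynamic_overwrite(table_rows, new_rows, spec=())

# Partitioned by level: only the "b" partition is replaced, "a" survives.
partitioned_result = dynamic_overwrite(table_rows, new_rows, spec=("level",))
```

The same write therefore keeps or drops row 1 depending only on the spec, which is the surprise the answer warns about.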
For Spark 3.x, you can use the ALTER TABLE SQL extensions to add a partition field to an existing table:
Iceberg supports adding new partition fields to a spec using ADD
PARTITION FIELD:
spark.sql("ALTER TABLE default.icebergTab ADD PARTITION FIELD level")
Adding a partition field is a metadata operation and does not change
any of the existing table data. New data will be written with the new
partitioning, but existing data will remain in the old partition
layout. Old data files will have null values for the new partition
fields in metadata tables.

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HttpResponse, and I need to store this data in Hive tables. Every HttpGet request returns a JSON document which will be a single row in the table. Because of this, I have to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can repeatedly add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this will also reduce the runtime of my Spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track. What you want to do is obtain the single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
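The same collect-then-reduce pattern, sketched in plain Python so it can be run without Spark. Lists of dicts stand in for DataFrames, list concatenation stands in for union, and fetch_record is a hypothetical stand-in for the HTTP call (much like obtainRecord above). The batch size of 4 is arbitrary.

```python
from functools import reduce
from typing import Dict, Iterable, Iterator, List

Frame = List[Dict[str, int]]


def fetch_record(i: int) -> Frame:
    """Hypothetical stand-in for one HTTP request returning a one-row frame."""
    return [{"id": i}]


def batches(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group items into lists of at most `size` elements."""
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def union(left: Frame, right: Frame) -> Frame:
    """Stand-in for DataFrame.union: rows of both inputs, in order."""
    return left + right


# Each element of `writes` models one write to the table: 3 writes for 10
# records instead of 10 single-row writes.
writes: List[Frame] = []
for batch in batches(range(10), 4):
    frames = [fetch_record(i) for i in batch]
    writes.append(reduce(union, frames))
```

Here ten records end up in three writes of 4, 4 and 2 rows, which is the small-files saving the question is after.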
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit to using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of Spark's features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, since Spark can load an entire directory at once), you can then load that file (or files) into Spark all at once to do your processing. I would also check whether the web API has any endpoints that fetch all the data you need instead of just one record at a time.

How to load data into hive external table using spark?

I want to load data into a Hive external table using Spark.
Please help me with this: how do I load data into Hive using Scala or Java code?
Thanks in advance
Assuming that the Hive external table has already been created using something like:
CREATE EXTERNAL TABLE external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP)
STORED AS PARQUET LOCATION '/user/etl/destination'; -- location is some directory on HDFS
And you have an existing dataFrame / RDD in Spark, that you want to write.
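import java.sql.Timestamp // java.util.Date has no Spark SQL encoder; Timestamp matches the TIMESTAMP column
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._
def now = new Timestamp(System.currentTimeMillis())
val rdd = sc.parallelize(List((1, "a", now), (2, "b", now), (3, "c", now)))
val df = rdd.toDF("c1", "c2", "c3") //column names for your data frame
df.write.mode(SaveMode.Overwrite).parquet("/user/etl/destination") // If you want to overwrite existing dataset (full reimport from some source)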
If you don't want to overwrite existing data from your dataset...
df.write.mode(SaveMode.Append).parquet("/user/etl/destination") // If you want to append to existing dataset (incremental imports)
I have tried a similar scenario and had satisfactory results. I worked with Avro data with a schema in JSON. I streamed a Kafka topic with Spark Streaming and persisted the data into HDFS, at the location of an external table. So every 2 seconds (the streaming duration) the data is stored in HDFS in a separate file, and the Hive external table is appended to as well.
Here is a simple code snippet:
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val myEvent = dataframe.toDF()
import org.apache.spark.sql.SaveMode
myEvent.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("maprfs:///location/of/hive/external/table")
})
Don't forget to stop the StreamingContext (ssc) at the end of the application. Stopping it gracefully is preferable.
P.S.:
When creating the external table, make sure its schema is identical to the DataFrame schema. When the JSON is converted into a DataFrame (which is nothing but a table), the columns are arranged in alphabetical order.