What is wrong in this config? Partition keys are not working in Hudi, and all the records get updated in the Hudi dataset while doing the upsert, so I couldn't extract the delta from the tables.
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'hash_value',
    'hoodie.datasource.write.recordkey.field': 'hash_value',
    'hoodie.datasource.hive_sync.partition_fields': 'year,month,day',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.ComplexKeyGenerator',
    'hoodie.table.name': 'hudi_account',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': 'hudi_db',
    'hoodie.datasource.hive_sync.table': 'hudi_account',
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://' + args['curated_bucket'] + '/stage_e/hudi_db/hudi_account'
}
My use case is to implement the upsert logic using Hudi and to partition using Hudi. The upsert is only partially working: if I have 10k records in the raw bucket and do an upsert for 1k records, it updates the Hudi commit time for all 10k records.
Did your partition keys change?
By default Hudi doesn't use a global index, but indexes per partition. I was having problems similar to yours; when I enabled the global index it worked.
Try adding these settings:
"hoodie.index.type": "GLOBAL_BLOOM", # This is required if we want to ensure we upsert a record, even if the partition changes
"hoodie.bloom.index.update.partition.path": "true", # This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)
I found the answer on this blog: https://dacort.dev/posts/updating-partition-values-with-apache-hudi/
Here you can see more information about hudi indexes: https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
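Merged into the commonConfig from your question, it could look something like this (just a rough sketch; only the two index settings are new):

# Sketch: the question's commonConfig plus the global-index settings suggested above
hudiIndexConfig = {
    'hoodie.index.type': 'GLOBAL_BLOOM',                 # upsert a record even if its partition changed
    'hoodie.bloom.index.update.partition.path': 'true',  # rewrite the record under its new partition path
}

combinedConfig = {**commonConfig, **hudiIndexConfig}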
I want to use the MongoDB to BigQuery Dataflow template and I have 2 questions:
Can I somehow configure partitioning for the destination table? For example, if I want to dump my database every day?
Can I map nested fields in MongoDB to records in BQ instead of columns with string values?
I see the User option with values FLATTEN and NONE, but FLATTEN will flatten documents for one level only.
Might either of these two approaches help?
Create a destination table with a structure definition before running the Dataflow job (see the sketch below)
Using a UDF
I tried the MongoDB to BigQuery Dataflow template with the User option set to FLATTEN.
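For the first approach, I was thinking of something like this (a rough sketch with the BigQuery Python client; the project, dataset, table and field names are just placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder schema: one nested RECORD column plus a timestamp to partition on
schema = [
    bigquery.SchemaField("_id", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField("address", "RECORD", fields=[
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("zip", "STRING"),
    ]),
]

table = bigquery.Table("my-project.my_dataset.mongo_dump", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",  # each daily dump lands in its own partition
)
client.create_table(table)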
How can I write to Mongo using Spark, considering the following scenarios:
If the document is present, just update the matching fields with the newer value, and if a field is absent, add the new field. (The replaceDocument parameter, if false, will update the matching records but not add the new unmatched fields, while if set to true, my old fields can get lost.)
I want to keep a data field as READ-ONLY. For example, there are two fields, first_load_date and updated_on. first_load_date should never change; it is the day the record was created in Mongo, while updated_on is when new fields are added or older ones are replaced.
If the document is absent, insert it.
The main problem is that replaceDocument = true will lead to the loss of older fields not present in the newer row, while false will take care of the matched fields but not add the newer incoming fields.
I am using Mongo-Spark-Connector 2.4.1
df.write.format("mongo").mode("append").option("replaceDocument","true").option("database","db1").option("collection","my_collection").save()
I understand what you are trying to achieve here.
You can use something like:
(df
.write
.format("mongo")
.mode("append")
.option("ordered", "false")
.option("replaceDocument", "false")
.option("database", "db1")
.option("collection", "my_collection")
.save()
)
Setting replaceDocument to false will help you preserve your old fields and update the matched ones. You may get a BulkWriteException, for which setting the ordered parameter to false will help.
I am creating a DynamoDB table. I am using customerId as the partition key and versionNumber as the sort key. Suppose there are 1000 versions for a particular customerId. For my use case I always want to find the last version of a customerId. Will there be any difference in performance between fetching the first versionNumber and fetching the last versionNumber, or will both take the same time?
No, both will take the same time. There is a parameter called ScanIndexForward (true/false); based on it, DynamoDB reads the items for a partition key in ascending or descending order of the sort key.
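For illustration, a minimal boto3 sketch that fetches only the latest version (the table and key values are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("CustomerVersions")  # hypothetical table name

response = table.query(
    KeyConditionExpression=Key("customerId").eq("cust-123"),
    ScanIndexForward=False,  # read versionNumber (the sort key) in descending order
    Limit=1,                 # only the newest item
)
latest = response["Items"][0] if response["Items"] else None

Either direction is a single-item query against the same partition, so reading the first or the last version costs the same.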
I am using the Scala API of Apache Spark Streaming to read from a Kafka server with a window size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. Each of the values is supposed to be reduced by key and window and saved to Mongo.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double , y: Double) => (x + y),
Seconds(60), Seconds(60))
reduced.foreachRDD({ rdd =>
import spark.implicits._
val aggregatedPower = rdd.map({x => MyJsonObj(x._2, x._1)}).toDF()
aggregatedPower.write.mode("append").mongo()
})
This works so far. However, it is possible that some messages arrive with a delay of up to a minute, which leads to two JSON objects with the same timestamp in the database:
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp in the current window is already in the database and if so update this value?
It seems that what you're looking for is a MongoDB operation called upsert, where an update operation inserts a new document if the criteria has no match and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark DataFrame contains an _id field, the data will be upserted. This means any existing documents with the same _id value will be updated, and new documents without an existing _id value in the collection will be inserted.
Now you could try to create an RDD using MongoDB Spark Aggregation, specifying a $match filter to query where timestamp matches the current window:
val aggregatedRdd = rdd.withPipeline(Seq(
  Document.parse("{ $match: { timestamp : '2018-03-14 15:00' } }")
))
Modify the value of the power field, and then write with mode('append').
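The question uses the Scala API, but purely to illustrate the _id idea above, a rough PySpark sketch of the write step could look like this (database and collection names are hypothetical, and the power value is assumed to already include any existing value fetched for the window):

from pyspark.sql import functions as F

def write_window_aggregates(df):
    # Same window timestamp -> same _id, so the connector replaces the existing
    # document instead of appending a duplicate (the upsert behaviour described above).
    (df.withColumn("_id", F.col("timestamp"))
       .write
       .format("mongo")
       .mode("append")
       .option("database", "metrics")            # hypothetical database
       .option("collection", "power_by_minute")  # hypothetical collection
       .save())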
You may also find the blog Data Streaming MongoDB/Kafka useful if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic using the MongoDB Java Driver.
I have just modified a MongoDB schema in MeteorJS and am trying to access the recently created columns. As there is no data in those newly created columns, I can't retrieve the data from the DB. Please help me with seeding some blank values into those columns in MeteorJS.
In Meteor.startup you can run yourcollection.insert({column: ""}) to insert a document. Be sure that the fake data is consistent with the schema.
Let's say you've added a new key called myNewKey
You can quickly seed data with:
MyCollection.update({ myNewKey: { $exists: false } }, { $set: { myNewKey: "foo" } }, { multi: true });
Using $set keeps the rest of each document intact, and multi: true updates every document that is missing the key.