Validating a Delta Lake schema? - Scala

We recently had a bug caused by a Delta Lake table's schema being updated to add a new field without the writer being properly synced up. This caused Delta Lake's schema evolution to silently add a null value for that field, as designed, but the downstream consumer failed because that field was null.
Besides better communication when something like this occurs, is it possible to programmatically load and validate that the schema of the table we'll be writing to is as expected, to prevent such an error from occurring again?
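One option (a minimal sketch, assuming Spark with the Delta Lake Scala API on the classpath; the table path and the expected fields below are placeholders) is to load the table's current schema and compare it against the schema the writer expects before writing:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().getOrCreate()

// The schema the writer is built against (placeholder fields).
val expectedSchema: StructType = new StructType()
  .add("id", "long", nullable = false)
  .add("payload", "string", nullable = true)

// The schema currently registered on the target table (placeholder path).
val actualSchema: StructType =
  DeltaTable.forPath(spark, "/mnt/delta/my_table").toDF.schema

// Fail fast before the write instead of letting the missing field turn into nulls.
// StructType equality is strict about field order, nullability and metadata, so
// relax the comparison (e.g. compare name/type pairs) if that is too strong.
require(actualSchema == expectedSchema,
  s"Unexpected schema for my_table.\nExpected:\n${expectedSchema.treeString}\nActual:\n${actualSchema.treeString}")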

Related

How to rename the id header of a debezium mongodb connector outbox message

I am trying to use the outbox event router of debezium for mongodb. The consumer is a spring cloud stream application. I cannot deserialize the message because spring cloud expects the message id header to be UUID, but it receives byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this spring cloud check, or remove it altogether. I have tried the ReplaceField SMT but it does not seem to modify the header fields.
Also, is there a way to overcome this in Spring?
The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by Debezium.
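For reference, the relevant fragment of the connector configuration could look roughly like this (the transform alias is arbitrary):
"transforms":"dropIdHeader",
"transforms.dropIdHeader.type":"org.apache.kafka.connect.transforms.DropHeaders",
"transforms.dropIdHeader.headers":"id"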
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream without @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and requires it to be of type UUID. By using the new functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value. This means that the value populated by Debezium (the id column from the outbox table) is ignored. I guess if you need to check for duplicate delivery, it may be better to create your own header instead of using id; I do not know whether spring-cloud-stream generates the same id for the same message if it is redelivered.
Also keep in mind that even in newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.

How to deserialize an Avro schema and then abandon the schema before writing to the ES Sink Connector using an SMT

Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but after that I don't need any sort of schema at all. Each record has a dynamic set of fields, so I don't want any makeUpdatedSchema step (e.g., this code) at all. This both keeps the code simpler and, I would assume, improves performance, since I don't have to recreate schemas for each record.
What I tried
I tried doing something like the applySchemaless code shown here, even when the record has a schema, by returning something like this, with null for the schema:
return newRecord(record, null, updatedValue);
However, at runtime it errors out, saying I have an incompatible schema.
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works, and if so that would be helpful to know as well. But is there some way to write a custom SMT like this?
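I'm not sure this is appropriate for every setup (whether the Elasticsearch sink accepts a schemaless value depends on the converter and connector settings, e.g. schema.ignore in the Confluent ES sink), but one way to drop the schema entirely in a custom SMT is to copy the Struct's fields into a plain java.util.Map and emit the record with a null value schema. A rough Scala sketch, with a made-up class name:

import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.ConnectRecord
import org.apache.kafka.connect.data.Struct
import org.apache.kafka.connect.transforms.Transformation

class DropValueSchema[R <: ConnectRecord[R]] extends Transformation[R] {

  override def apply(record: R): R = record.value() match {
    case struct: Struct =>
      // Copy every top-level field of the Struct into a plain java.util.Map.
      // (Nested Structs would need the same treatment if they occur.)
      val value = new java.util.HashMap[String, AnyRef]()
      val fields = struct.schema().fields().iterator()
      while (fields.hasNext) {
        val f = fields.next()
        value.put(f.name(), struct.get(f))
      }
      // Re-emit the record with a null value schema, i.e. schemaless.
      record.newRecord(record.topic(), record.kafkaPartition(),
        record.keySchema(), record.key(),
        null, value, record.timestamp())
    case _ =>
      record // already schemaless (or a null value): pass through unchanged
  }

  override def config(): ConfigDef = new ConfigDef()
  override def configure(configs: java.util.Map[String, _]): Unit = ()
  override def close(): Unit = ()
}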

Possible option for PrimaryKey in Table creation with KSQL?

I've started working with KSQL and am quite enjoying the experience. I'm trying to work with a Table and Stream join, and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my Kafka topic-1. This is a static data set loaded into a Table and might get updated once a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in Kafka topic-2. This is a stream of data being loaded into a Stream.
Since a Table can't be created without specifying a key, and my data doesn't have a unique column to use as one, what exactly should my key be while loading data from topic-1 into the Table? Remember that my Table might get populated/updated once a month or so, with both the same data and new data; with new data being loaded I can replace existing rows by the key.
I tried to find whether there's something like an auto-incrementing value, like a PRIMARY KEY in SQL, but didn't find any.
Can someone help me correct my approach to the implementation, or suggest a query to create a primary key if one exists? Thanks
No, KSQL doesn't have the concept of an auto-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect, you can use a Single Message Transform (SMT):
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.
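As a rough illustration only (classic KSQL syntax; the column names and value format are placeholders, and newer ksqlDB versions declare the key column with PRIMARY KEY instead of the KEY property), once the record key is populated by such a transform, the table can be declared over that topic with the matching field as its key:
CREATE TABLE my_table (id VARCHAR, name VARCHAR)
  WITH (KAFKA_TOPIC='topic-1', VALUE_FORMAT='JSON', KEY='id');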

How to get the logged-in user using the Apex Data Loader?

I am inserting some 350 person account records through the Apex Data Loader. I have a trigger on Account which uses the logged-in user's details. While inserting the records I am getting this error:
"LeadAssignment: execution of AfterInsert caused by: System.QueryException: List has no rows for assignment to SObject Trigger.LeadAssignment: line 20, column 1"
Can anyone tell me how to get the logged-in user's details while using the Apex Data Loader?
Thank you.
It would be worth looking at https://developer.salesforce.com/forums/ForumsMain?id=906F00000008yJvIAI, which deals with a similar situation, albeit some time ago, to see if the problem is caused by NULL values.
If so, I would suggest using the check-for-NULL approach they originally tried in the trigger, and only adding the try/catch workaround if that is still required.

Can Primary-Keys be re-used once deleted?

0x80040237 Cannot insert duplicate key.
I'm trying to write an import routine for MSCRM4.0 through the CrmService.
This has been successful up until this point. Initially I was just letting CRM generate the primary keys of the records, but my client wanted the ability to set the key of our custom entity to predefined values. Potentially this enables us to know what data was created by our installer, and what data was created post-install.
I tested to ensure that the Guids can be set when calling the CrmService.Update() method, and the results indicated that records were created with our desired values. I ran my import and everything seemed successful. While modifying my validation code for the import files, I deleted the data (through the CRM browser interface) and tried to re-import. Unfortunately, it now throws a duplicate key error.
Why is this error being thrown? Does the CRM interface delete the record, or does it still exist but hidden from the user's eyes? Is there a way to ensure that a deleted record is permanently deleted and the Guid becomes free? In a live environment, these Guids would never have existed, but during my development I need these imports to be successful.
By the way, considering I'm having this issue, does this imply that statically setting Guids is not a recommended practice?
As far as I can tell, entities are soft-deleted, so it would not be possible to reuse that Guid unless you (or the deletion service) deleted the entity from the database.
For example, in the LeadBase table you will find a field called DeletionStateCode: a value of 0 implies the record has not been deleted, and a value of 2 marks the record for deletion. There's a deletion service that runs every 2(?) hours to physically delete those records from the table.
I think Zahir is right, try running the deletion service and try again. There's some info here: http://blogs.msdn.com/crm/archive/2006/10/24/purging-old-instances-of-workflow-in-microsoft-crm.aspx
Zahir is correct.
After you import and delete the records, you can kick off the deletion service at a time you choose with this tool. That will make it easier to test imports and reimports.