I'm using Spring Cloud Edgware and Spring Cloud DataFlow 1.2.3.
I'm having problems with contentType and originalContentType and, although I have a workaround, I don't understand why it's needed.
I have various Data Flow streams all sinking data to a rabbit sink, spring-cloud-starter-stream-sink-rabbit 1.3.1.RELEASE (call it datasink for the purposes of this explanation). The rabbit source in play is spring-cloud-starter-stream-source-rabbit 1.3.1.RELEASE.
The data in the rabbit sink is correctly application/json. The streams producing the data have Processors which munge stuff and explicitly set the output contentType to application/json in the code. This has all been working correctly for over a year and still does.
There is now a need to introduce a bridge between the datasink and another rabbit sink. The new bridge stream is simply:
rabbit-source | rabbit-sink
where rabbit-source reads from the aforementioned datasink.
The sinked data in the bridge stubbornly has a contentType of application/octet-stream.
I have tried the following setting:
app.rabbit-source.spring.cloud.stream.bindings.output.content-type=application/json
This results in a sinked contentType of application/json but the payload is base64 encoded! Why is base64 encoding happening?
My workaround is to hack the rabbit sink and to programmatically overwrite the contentType header with the originalContentType header if present. I don't like this at all and would welcome a better solution and more understanding.
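Conceptually the override amounts to something like the following, shown as a simplified sketch rather than the actual modified sink code (the class name and wiring are illustrative; the header names are the ones I see on the wire):
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;

public class ContentTypeRestorer {

    // If an originalContentType header is present, copy it back over contentType
    // before the message reaches the RabbitMQ outbound adapter.
    public Message<?> restore(Message<?> message) {
        Object original = message.getHeaders().get("originalContentType");
        if (original == null) {
            return message; // nothing to restore
        }
        return MessageBuilder.fromMessage(message)
                .setHeader("contentType", original)
                .build();
    }
}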
Sorry you're having an issue. Content-type negotiation has improved dramatically in the later versions of spring-cloud-stream and has been completely revamped in the 2.0 line.
So, before going any further, I would recommend upgrading to the latest GA version of spring-cloud-stream, which is Ditmars.SR3, and seeing if that alone addresses your issue, or, even better, to the latest 2.0 snapshot (we are very close to RC, so it is pretty stable).
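If you are building with Maven, switching the release train is typically just a matter of importing the corresponding BOM; a sketch, assuming you manage spring-cloud-stream versions through dependencyManagement:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-stream-dependencies</artifactId>
            <version>Ditmars.SR3</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>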
I am seeking to programmatically answer the question of whether two arbitrary message schemas are compatible, ideally for each of JSON, Avro and Protobuf. I am aware that the Kafka schema registry has such logic internally. Yet I want to ask programmatically, in a deployment pipeline, whether an old topic reader that I am promoting between environments will be able to read the latest (or historic) messages on the topic.
I am aware that:
The Maven plugin can do this at consumer compile time, but that option isn't open to me as I am not using Java for my "exotic" consumer. I can, however, generate the JSON schema it expects.
I can invoke the schema registry API to ask whether a new schema is compatible with an old one, but what I want to ask is whether a reader that knows its own expected schema is compatible with the latest registered one, and that isn't supported.
The topic can be set to FORWARD_TRANSITIVE or FULL_TRANSITIVE compatibility, and knowing that I could assume my reader will always work. Yet I do not control the many topics owned by many other teams in a large organization, so I cannot enforce that every team with existing topics sets a correct policy.
With careful testing and change management we can manually verify compatibility. The real world is messy, though; we will be doing this at scale with many inexperienced teams, so anything that can go wrong will most certainly go wrong at some point.
Other folks must have wanted to do this, so if I searched hard enough the answer should be out there; yet I have read all the Q&As I could find and couldn't find an actual working answer to any of the past times this question has been asked.
What I want is a gate in the deployment pipeline: before promoting a reader into production, pull the latest registered schema at that point in time and check that it is compatible with the (generated) schema expected by the (exotic) reader being deployed.
I am aware I could just pull the source code that the Kafka schema registry / Maven plugin uses and roll my own solution, but I feel there must be a convenient tool out there that I can easily script into a deployment pipeline to check "is this (generated) schema a subset of that (published/downloaded) one?".
Okay, I cracked open the Confluent Kafka client code and wrote a simple Java CLI that uses its logic and the generic ParsedSchema class, which can be a concrete JSON, Avro or Protobuf schema. The dependencies are:
<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-schema-registry-client</artifactId>
    <version>7.3.1</version>
</dependency>
<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-json-schema-provider</artifactId>
    <version>7.3.1</version>
</dependency>
To handle Avro and Protobuf it would also need the corresponding schema provider dependencies. The code then parses a JSON schema from a string with:
final var readSchemaString = ... // load from file
final var jsonSchemaRead = new JsonSchema(readSchemaString);
final var writeSchemaString = ... // load from file
final var jsonSchemaWrite = new JsonSchema(writeSchemaString);
Then we can validate that using my helper class:
final var validator = new Validator();
final var errors = validator.validate(jsonSchemaRead, jsonSchemaWrite);
for (var e : errors) {
    logger.error(e);
}
// the exit code is the number of incompatibilities, so 0 means "safe to deploy"
System.exit(errors.size());
The Validator helper class is very basic, just using "can be read" semantics:
public class Validator {
    // "can be read" semantics: check that data written with the writer's schema
    // can be read by a consumer using the reader's schema
    SchemaValidator validator = new SchemaValidatorBuilder().canBeReadStrategy().validateLatest();

    public List<String> validate(ParsedSchema reader, ParsedSchema writer) {
        return validator.validate(reader, Collections.singleton(writer));
    }
}
The full code is over on github at /simbo1905/msg-schema-read-validator
At deployment time I can simply curl the latest schema (or any historic schema) that writers should be using from the registry, generate the schema for my reader on disk, and run that simple tool to check that the two are compatible.
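As an illustration, fetching the latest registered writer schema is a single call to the registry's REST API (the registry host and subject name here are made up; the response wraps the schema as an escaped string, hence the jq step):
curl -s http://my-registry:8081/subjects/my-topic-value/versions/latest | jq -r '.schema' > writer-schema.json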
Running it in a debugger, I can see at least 56 different schema compatibility rules being checked for JSON Schema alone, so it would not be feasible to code that up yourself.
It should be relatively simple to extend the code with the Avro and Protobuf providers to get a ParsedSchema for those formats if anyone wants to be fully generic.
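A rough sketch of what that extension might look like, assuming the corresponding provider classes are on the classpath (ProtobufSchema comes from the io.confluent:kafka-protobuf-provider artifact; AvroSchema ships with the registry client itself, if I recall correctly):
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.protobuf.ProtobufSchema;

// Avro: wrap the raw schema strings in the provider's ParsedSchema implementation
final var avroRead = new AvroSchema(avroReadSchemaString);   // strings loaded from file as before
final var avroWrite = new AvroSchema(avroWriteSchemaString);
final var avroErrors = validator.validate(avroRead, avroWrite);

// Protobuf: the same pattern with the protobuf provider
final var protoRead = new ProtobufSchema(protoReadSchemaString);
final var protoWrite = new ProtobufSchema(protoWriteSchemaString);
final var protoErrors = validator.validate(protoRead, protoWrite);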
I am trying to use Debezium's outbox event router for MongoDB. The consumer is a Spring Cloud Stream application. I cannot deserialize the message because Spring Cloud expects the message id header to be a UUID, but it receives a byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this Spring Cloud check, or removing it altogether. I have tried the ReplaceField SMT but it does not seem to modify the header fields.
Also, is there a way to overcome this in Spring?
The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by Debezium.
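For reference, in the connector configuration this looks roughly like the following (the transform alias dropDebeziumId is arbitrary):
transforms=dropDebeziumId
transforms.dropDebeziumId.type=org.apache.kafka.connect.transforms.DropHeaders
transforms.dropDebeziumId.headers=id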
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream without @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and demands that it be of type UUID. With the new functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value, which means the value populated by Debezium (the id column from the outbox table) is ignored. If you need to check for duplicate delivery, it is probably better to create your own header instead of using id; I do not know whether spring-cloud-stream generates the same id for the same message if it is redelivered.
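To illustrate the functional style, here is a minimal sketch (the payload type and the custom header name my_event_id are just examples, not something Debezium or Spring defines):
import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class OutboxConsumerConfig {

    // bound via spring.cloud.function.definition=outboxEvents
    @Bean
    public Consumer<Message<String>> outboxEvents() {
        return message -> {
            // read a custom header for de-duplication instead of the framework-managed id
            Object eventId = message.getHeaders().get("my_event_id");
            // ... process the payload
        };
    }
}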
Also keep in mind that even in the newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.
I can't find it clearly stated in the docs what the advantages are of using AvroKafkaSerializer (which has schema support) versus serializing the object "manually" in code and sending it as bytes/a string.
Maybe a schema check when producing a new message? What are the others?
A message schema is a contract between a group of client applications producing and consuming messages. Schema validation is required when you have many independent applications that need to agree on a specific format, in order to exchange messages reliably.
If you also add a Schema Registry to the picture, then you don't need to include the schema in every service or in every single message; you get it from the common registry, with the additional support of schema evolution and validation rules (e.g. backward compatibility, versioning, syntax validation). It is one of the fundamental components in event-driven architectures (EDA).
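As a rough illustration of the producer side, here is a sketch using the Confluent KafkaAvroSerializer with placeholder URLs and topic names; the AvroKafkaSerializer you mention is configured along the same lines, just with its own registry properties:
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");            // placeholder
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081");   // placeholder

// The serializer registers (or looks up) the record's schema in the registry, and
// the send fails if the schema breaks the subject's compatibility rules, instead
// of shipping opaque bytes that consumers cannot verify.
Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
GenericRecord order = new GenericData.Record(schema);
order.put("id", "42");

try (var producer = new KafkaProducer<String, GenericRecord>(props)) {
    producer.send(new ProducerRecord<>("orders", "key-1", order));
}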
I'm currently looking at using Citrus for our integration testing; however, our integration software uses, amongst others, file messages: files are written to an inbound folder, picked up and processed, which results in a new file being written to an outbound folder or data being written to SQL.
I was wondering if Citrus can write a file with a certain payload to an inbound folder and then monitor for a file to appear in a certain outbound folder and/or for rows to appear in a SQL table.
Example Test Case:
file()
    .folder(todoInboundFolder)
    .write()
    .payload(new ClassPathResource("templates/todo.xml"));

file()
    .folder(todoOutboundFolder)
    .read()
    .validate("/t:todo/t:correlationId", "${todocorrelationId}")
    .validate("/t:todo/t:title", "${todoName}");

query(todoDataSource)
    .statement("select count(*) as cnt from todo_entries where correlationid = '${todocorrelationId}'")
    .validate("cnt", "1");
Additionally, is there a way to specify the timeout to wait for the file/SQL entries to appear?
There is no direct implementation of a file endpoint yet in Citrus. There was a feature request, but it was closed due to inactivity: https://github.com/citrusframework/citrus/issues/151
You can solve this problem, though, by using a simple Apache Camel route to do the file transfer; Citrus is able to call the Camel route and use its outcome very easily. Read more about this here: https://citrusframework.org/citrus/reference/2.8.0/html/index.html#apache-camel
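As a rough sketch (folder paths are placeholders), such a Camel route can be as simple as:
import org.apache.camel.builder.RouteBuilder;

public class FileTransferRoute extends RouteBuilder {
    @Override
    public void configure() {
        // pick up files the test writes into the inbound folder and hand them
        // to the folder the system under test watches (placeholder paths)
        from("file:/tmp/citrus/inbound")
            .to("file:/tmp/sut/inbound");
    }
}

The test then triggers or references this route through the Citrus Camel support described in the link above and validates the outcome.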
This would be the workaround that can help right now. Other than that you can reopen or contribute to the issue.
I'm using NiFi to get data from an Oracle database and put some of this data in Kafka (using the processor PutKafka).
Example: only send the data to Kafka if the attribute "id" contains "aaabb".
Is that possible in Apache NiFi? How can I do it?
This should definitely be possible; the flow might be something like this:
1) ExecuteSQL or QueryDatabaseTable to get the data from the database; these produce Avro
2) ConvertAvroToJSON to convert the Avro to JSON
3) EvaluateJsonPath to extract the id field into an attribute
4) RouteOnAttribute to route flow files where the id attribute contains "aaabb" (see the expression below)
5) PutKafka to deliver any of the matching results from RouteOnAttribute
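For step 4, RouteOnAttribute just needs a dynamic property (the property name matched is arbitrary) holding a NiFi Expression Language condition, for example:
matched = ${id:contains('aaabb')}

Flow files satisfying the expression are routed to the matched relationship, which you then connect to PutKafka.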
To add on to Bryan's example flow, I wanted to point you to some great documentation that should help introduce you to Apache NiFi.
Firstly, I would suggest checking out the NiFi documentation. It is very good and should help a lot. In addition to providing details on each of the processors Bryan mentioned, it also has general documentation for every type of user.
For a basic introduction to building a NiFi flow, check out this video.
For example templates, check out this repo. It has an Excel file at its root level with a description and a list of processors for each template.