For how long does Dataflow remember attribute id in PubsubIO - apache-beam

The PubsubIO allows deduplicating messages based on the id attribute:
PubsubIO.readStrings().fromSubscription(pubSubSubscription).withIdAttribute("message_id")
For how long does Dataflow remember this id? Is it documented anywhere?

It is documented; however, it has not yet been migrated to the V2+ version of the docs. The information can still be found in the V1 docs:
https://cloud.google.com/dataflow/model/pubsub-io#using-record-ids
"If you've set a record ID label when using PubsubIO.Read, when Dataflow receives multiple messages with the same ID (which will be read from the attribute with the name of the string you passed to idLabel), Dataflow will discard all but one of the messages. However, Dataflow does not perform this de-duplication for messages with the same record ID value that are published to Cloud Pub/Sub more than 10 minutes apart."

Related

Can I update Apache Atlas metadata by adding a message directly into the Kafka topic?

I am trying to add a message to Entities_Topic to update the kafka_topic type metadata in Apache Atlas. I wrote the data according to the JSON message format, but it didn't work.
The application.log shows the following:
graph rollback due to exception AtlasBaseException: Instance kafka_topic with unique attribute {qualifiedName=atlas_test00#primary # clusterName to use in qualified name of entities. Default: primary} does not exist (GraphTransactionInterceptor:202)
graph rollback due to exception AtlasBaseException: Instance __AtlasUserProfile with unique attribute {name=admin} does not exist (GraphTransactionInterceptor:202)
And here is the message I passed into Kafka Topic earlier:
{"version":{"version":"1.0.0","versionParts":[1]},"msgCompressionKind":"NONE","msgSplitIdx":1,"msgSplitCount":1,"msgSourceIP":"192.168.1.110","msgCreatedBy":"","msgCreationTime":1664440029000,"spooled":false,"message":{"type":"ENTITY_NOTIFICATION_V2","entity":{"typeName":"kafka_topic","attributes":{"qualifiedName":"atlas_test_k1#primary # clusterName to use in qualifiedName of entities. Default: primary","name":"atlas_test01","description":"atlas_test_k1"},"displayText":"atlas_test_k1","isIncomplete":false},"operationType":"ENTITY_CREATE","eventTime":1664440028000}}
It is worth noting that there is no GUID in the message and I do not know how to create one manually. Also, I set the timestamps to the current time. The JSON is passed into the topic through the Kafka tool Offset Explorer.
My team leader wants to update the metadata by sending messages directly into Kafka, and I'm just trying to see whether that's possible.
How can I implement this idea, or can you tell me what's wrong?

How to rename the id header of a debezium mongodb connector outbox message

I am trying to use the Debezium outbox event router for MongoDB. The consumer is a Spring Cloud Stream application. I cannot deserialize the message because Spring Cloud Stream expects the message id header to be a UUID, but it receives a byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this Spring Cloud check, or removing it altogether. I have tried the ReplaceField SMT, but it does not seem to modify the header fields.
Also, is there a way to overcome this in Spring?
The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by Debezium.
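A connector configuration fragment for this might look as follows (the transform alias dropIdHeader is just an illustrative name):

transforms=dropIdHeader
transforms.dropIdHeader.type=org.apache.kafka.connect.transforms.DropHeaders
transforms.dropIdHeader.headers=id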
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream and dropping @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and demands that it be of type UUID. With the new functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value. This means that the value populated by Debezium (the id column from the outbox table) is ignored. I guess if you need to check for duplicate delivery, it may be better to create your own header instead of using id. I do not know whether spring-cloud-stream generates the same id for the same message if it is redelivered.
Also keep in mind that even in the newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.
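For illustration, a minimal functional-style consumer (the bean name outboxEvents is made up; its binding would be configured via spring.cloud.stream.bindings.outboxEvents-in-0.destination):

import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class OutboxConsumerConfig {

    @Bean
    public Consumer<Message<String>> outboxEvents() {
        // No @StreamListener, so there is no UUID check on the id header; the framework
        // assigns its own generated id to the incoming Message anyway.
        return message -> System.out.println(
            "headers=" + message.getHeaders() + " payload=" + message.getPayload());
    }
}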

Apache NiFi: Is there a way to publish messages to Kafka with a message key as a combination of multiple attributes?

I have a requirement where I need to read a CSV file and publish it to a Kafka topic in Avro format. When publishing, I need to set the message key to the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in an Apache NiFi flow? Setting the message key to a single attribute works fine for me.
Yes, in the PublishKafka_2_0 processor (or whatever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will produce it (e.g. id=myId & group=MyGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so PublishKafkaRecord will publish multiple messages to Kafka, each corresponding to a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record, then referencing this field in the publishing processor property.
Notice the (?) next to each property, which indicates what is or isn't allowed.
When a property doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need, then use that combined value downstream.
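As a rough sketch of the UpdateRecord approach suggested above (the processor properties below are illustrative, and the concat RecordPath function is assumed to be available in your NiFi version):

UpdateRecord
    Record Reader              = CSVReader
    Record Writer              = AvroRecordSetWriter
    Replacement Value Strategy = Record Path Value
    /messageKey                = concat(/id, '-', /group)

PublishKafkaRecord_2_0
    Message Key Field          = messageKey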
Thank you for your input. I had to change my initial design from producing with a combined key to partitioning the file on a specific field using the PartitionRecord processor. I have a date field in my CSV file and there can be multiple records per date. I partition based on this date field and produce to the Kafka topics using the id field as the key per partition. The Kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka Streams to read data from these topics, this is a much better design than the initial one.

Why in the API rework was StoreName not specified in the table method of Kafka StreamsBuilder?

In the Kafka StreamsBuilder the signature for table is only:
table(java.lang.String topic)
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/StreamsBuilder.html
Whereas before you were able to provide a store name:
table(java.lang.String topic, java.lang.String queryableStoreName)
https://kafka.apache.org/0110/javadoc/org/apache/kafka/streams/kstream/KStreamBuilder.html
Why was this removed?
It was not removed, but the API was reworked. Please read the upgrade notes for API changes: https://kafka.apache.org/11/documentation/streams/upgrade-guide
For this change in particular, the full details are documented via KIP-182: https://cwiki.apache.org/confluence/display/KAFKA/KIP-182%3A+Reduce+Streams+DSL+overloads+and+allow+easier+use+of+custom+storage+engines
You can now specify the store name via the Materialized parameter:
table(String topic, Materialized materialized);
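For example, a minimal sketch of the equivalent of the old two-argument overload (the topic and store names are placeholders):

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class TableWithStoreName {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Equivalent of the old table(topic, queryableStoreName):
        KTable<String, String> table = builder.table(
            "my-topic",
            Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("my-store-name"));
        // ... build the Topology from the builder and start KafkaStreams as usual.
    }
}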

Remove read data for authenticated user?

In DDS, my requirement is as follows: I have many subscribers but only a single publisher. Each subscriber reads data from DDS and checks whether the message is intended for that particular subscriber. Only if the check succeeds does it take the data and remove it from DDS. The message must remain in DDS until the authenticated subscriber has taken its data. How can I achieve this using DDS (in a Java environment)?
First of all, you should be aware that with DDS, a Subscriber is never able to remove data from the global data space. Every Subscriber has its own cached copy of the distributed data and can only act on that copy. If one Subscriber takes data, then other Subscribers for the same Topic will not be influenced by that in any way. Only Publishers can remove data globally for every Subscriber. From your question, it is not clear whether you know this.
Independent of that, it seems like the use of a ContentFilteredTopic (CFT) is suitable here. According to the description, the Subscriber knows the file name that it is looking for. With a CFT, the Subscriber can indicate that it is only interested in samples that have a particular value for the file_name attribute. The infrastructure will take care of the filtering process and will ensure that the Subscriber will not receive any data with a different value for the attribute file_name. As a consequence, any take() action done on the DataReader will contain relevant information and there is no need to check the data first and then take it.
The API documentation should contain more detailed information about how to use a ContentFilteredTopic.
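As a rough sketch, assuming an RTI Connext-style Java API (the package names, method signatures and the file_name field are assumptions and differ per DDS vendor; participant, subscriber and topic creation is elided):

import com.rti.dds.domain.DomainParticipant;
import com.rti.dds.infrastructure.StatusKind;
import com.rti.dds.infrastructure.StringSeq;
import com.rti.dds.subscription.DataReader;
import com.rti.dds.subscription.Subscriber;
import com.rti.dds.topic.ContentFilteredTopic;
import com.rti.dds.topic.Topic;

public class FilteredReaderExample {

    // Creates a DataReader that only ever receives samples whose file_name matches.
    static DataReader createFilteredReader(DomainParticipant participant,
            Subscriber subscriber, Topic fileTopic, String fileName) {
        StringSeq parameters = new StringSeq();
        parameters.add("'" + fileName + "'");
        ContentFilteredTopic filteredTopic = participant.create_contentfilteredtopic(
                "FilteredFileTopic",   // name of the filtered topic
                fileTopic,             // the related (unfiltered) Topic
                "file_name = %0",      // SQL-like filter expression
                parameters);
        // take() on this reader returns only matching samples, so there is no need
        // to inspect the data before taking it.
        return subscriber.create_datareader(filteredTopic,
                Subscriber.DATAREADER_QOS_DEFAULT, null, StatusKind.STATUS_MASK_NONE);
    }
}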