How can I add a blocklist at either the Kafka topic level or as a processor in NiFi? - apache-kafka

I have log message data being pushed to a Kafka topic, with a NiFi Kafka consumer pulling in the message data and routing it to various drops. There are a number of records I would like to scrub based on a set of internal user IDs and IP addresses. I have a list of about 20 IP addresses and 10 user IDs to scrub.
Is there a way to set a blocklist either in front of the topic, so the data is filtered before it lands and is consumed by NiFi, or to add this as a processor so NiFi filters the data before sinking to the various destinations?
Thanks

Using NiFi, you could do something like this:
Consume messages with ConsumeKafkaRecord, then use a QueryRecord processor to filter messages with a SQL query.
QueryRecord config would be:
A dynamic property named filtered with the value SELECT * FROM FLOWFILE WHERE userid IN ('user1','user2','user3') OR ipaddr IN ('ip1','ip2','ip3') (note that the string literals need to be quoted).
This will give you an unmatched relationship for messages that did not match, and a filtered relationship for messages that did match. You can then do whatever you want with the two sets of messages.
If you didn't want to hard-code the list of users/IPs in the SQL, you could build it into your flow to pull those lists from an external source and then reference them dynamically.

Related

Get output record partition within Kafka Streams

I have a KStream which branches and writes output records into different topics based on some internal logic. Is there any way I can know the partition of the output message from inside the KStream? The output topics have a different number of partitions from the input ones.
When using the high-level DSL, you don't have access to the record metadata (which holds details such as the partition a record came from), so you won't be able to achieve the goal using the KStream implementation.
You could use the low-level Processor API, which does allow access to the metadata. Otherwise, you can add an implementation of ConsumerInterceptor and attach the partition value to the message before it reaches the consumer.
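A minimal sketch of that interceptor idea, assuming String keys and values and that copying the partition into a record header (the header name "partition" is just illustrative) is an acceptable way to attach it to the message:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Hypothetical interceptor that tags each record with the partition it was
// read from, so downstream code can see it without touching record metadata.
public class PartitionTaggingInterceptor implements ConsumerInterceptor<String, String> {

    @Override
    public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            // Headers are mutable, so the record can be tagged in place.
            record.headers().add("partition",
                    Integer.toString(record.partition()).getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

The interceptor is registered on the consumer with the interceptor.classes property.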

Retrieve info from Kafka that has a field matching one value of a very long list

I am kind of new to Kafka.
I have a conceptual question.
Let's assume that there is a Kafka topic (publish subscribe) which has messages (formatted in JSON). Each message has a field called "username".
There are multiple applications consuming this topic.
Assume that we have one application that handles messages for 100,000 users. This application has the list of 100,000 usernames, so it needs to watch the topic and process only the messages whose username field matches one of those 100,000 usernames.
One way of doing this is to read each published message, extract the username from it, and iterate through our list of 100,000 usernames; if a name in the list matches, we process the message, otherwise we ignore it.
Is there any other, more elegant way to do this? For example, is there any feature in Kafka Streams or the consumer API for this?
Thanks
You must consume, deserialize, and inspect every record. You can't get around the consumer API basics using any higher-level library, but yes, ksqlDB or Kafka Streams make such code easier to write; they just won't be any more performant.
If you want to check whether a field is in a list, use a HashSet rather than iterating the list, so each check is a constant-time lookup instead of a scan over 100,000 names.
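A minimal Kafka Streams sketch of that idea, assuming JSON string values; the topic names are hypothetical and the regex-based username extraction is only a stand-in for a real JSON parser:

```java
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UsernameFilterApp {

    // The 100,000 usernames, loaded once into a HashSet for constant-time lookups.
    private static final Set<String> USERNAMES = new HashSet<>(); // populate from your source

    // Placeholder extraction; real code would use a proper JSON parser.
    private static final Pattern USERNAME_FIELD =
            Pattern.compile("\"username\"\\s*:\\s*\"([^\"]+)\"");

    static String extractUsername(String json) {
        Matcher m = USERNAME_FIELD.matcher(json);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "username-filter");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> USERNAMES.contains(extractUsername(value)))
               .to("matched-users", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The filter still runs on every record, but the HashSet keeps the per-record cost constant no matter how many usernames are on the list.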

Consume only topic messages for a given account

I have a service calculating reputation scores for accounts. It puts the calculation results in a Kafka topic called "ReputationScores". Each message looks something like this:
{ "account" : 12345, "repScore" : 98765}
I'd like my consumer to be able to consume only those messages for a specific account.
For example, I'd like to have a single consumer instance consume only messages from the "ReputationScores" topic for account 12345. That instance should probably be the only member of its consumer group.
Can Kafka filter based on message contents? What's the best way to do this?
Thanks for your help.
Can Kafka filter based on message contents?
Since Kafka itself doesn't know what's in your data, it cannot index it, and therefore it's not readily searchable. You would need to process the full topic and add an explicit check for which deserialized records you want to process. For example, this is exactly what a stream processing application with a simple filter operation would give you.
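A minimal sketch of that explicit check using the plain consumer API (a Kafka Streams filter() would look much the same), assuming the JSON string values from the question and a hypothetical group id; the substring match stands in for real JSON deserialization:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReputationScoreFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "account-12345-consumer");  // hypothetical
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("ReputationScores"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Explicit check; real code would deserialize the JSON and
                    // compare the "account" field rather than match a substring.
                    if (record.value().contains("\"account\" : 12345")) {
                        process(record.value());
                    } // every other record is still read, then ignored
                }
            }
        }
    }

    private static void process(String message) {
        System.out.println(message);
    }
}
```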
If you want to preserve the ability to do lookups by a particular item, you will either need a custom partitioner that segments the data you're interested in, or a topic per item (which really only works for certain use cases, not for things like individual user accounts).
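A minimal sketch of such a partitioner, assuming the account id is used as the record key and that pinning one account of interest to a known partition is the goal; the class name and partition choice are purely illustrative:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical partitioner that pins one account of interest to a known
// partition, so a consumer can be assigned just that partition.
public class AccountPartitioner implements Partitioner {

    private static final int PINNED_PARTITION = 0; // illustrative choice

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (numPartitions <= 1) {
            return 0;
        }
        if ("12345".equals(key)) {          // assumes the account id is the key
            return PINNED_PARTITION;
        }
        // Spread every other key over the remaining partitions.
        int hash = key == null ? 0 : key.hashCode();
        return 1 + Math.floorMod(hash, numPartitions - 1);
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

The producer would pick this up via its partitioner.class property.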
Alternatively, you could look at inserting all events into an in-memory database and then running queries against that.

Filter Stream (Kafka?) to Dynamic Number of Clients

We are designing a big data processing chain. We are currently looking at NiFi/MiNiFi to ingest data, do some normalization, and then export to a DB. After the normalization we are planning to fork the data so that we can have a real-time feed that can be consumed by clients using different filters.
We are looking at NiFi and/or Kafka to send data to the clients, but are running into design problems with both of them.
With NiFi we are considering adding a WebSocket server which listens for new connections and adds their filter to a custom stateful processor block. That block would filter the data, tag it with the appropriate socket ID if it matched a user's filter, and then generate X number of flow files to be sent to the matching clients. That part seems like it would work, but we would also like to be able to queue data in the event that a client connection drops for a short period of time.
As an alternative we are looking at Kafka, but it doesn't appear to support the idea of a dynamic number of clients connecting. If we used Kafka Streams to filter the data, it appears we would need one topic per client, which would likely eventually overrun our ZooKeeper instance. Or we could use NiFi to filter and then insert into different partitions, sort of like here: Kafka work queue with a dynamic number of parallel consumers. But there is still a limit to the number of partitions that are supported, correct? Not to mention we would have to juggle the producer and consumers to read from the right partition as we scaled up and down.
Am I missing anything for either NiFi or Kafka? Or is there another project out there I should consider for this filtered data sending?

Kafka connect message ordering

How does a Kafka sink connector ensure message ordering while fetching messages from partitions? I have multiple partitions, and I have ensured message ordering at publish time by hashing keys so that related messages go to the same partition.
Now, when more than one sink task (and their workers) is scaled out across multiple JVMs with the responsibility to fetch messages from the same partitions and to notify a destination system via HTTP, how can I guarantee that the destination system will receive the messages in order?
Each sink task will receive events in order from its assigned partitions, but as soon as a record leaves the Kafka protocol handling and is sent to a remote destination, whether that is a file or an HTTP endpoint, order can only be guaranteed by that system's own ordering semantics.
For example, if you're writing to Elasticsearch, you can "order" events (in Kibana) by specifying the timestamp field to index by. The same goes for any (No)SQL database.
A filesystem, on the other hand, would order files by modification time, but events within any given file aren't guaranteed to be ordered (unless they all come from one partition).
I find it unlikely that an HTTP REST endpoint will understand what order events need to be collected in; that logic would need to be implemented inside that server endpoint. One option would be to post events to an endpoint that also accepts the partition number and the offset the record came from.
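For example, a minimal sketch of a custom sink task's put() method that forwards the partition and offset along with each record; the task class and the postToEndpoint helper are hypothetical:

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical sink task that forwards each record together with the
// partition and offset it came from, so the receiving endpoint can
// re-establish per-partition ordering on its side if it wants to.
public class OrderAwareHttpSinkTask extends SinkTask {

    @Override
    public void start(Map<String, String> props) { }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // kafkaPartition() and kafkaOffset() refer to the source topic.
            String payload = String.format(
                    "{\"partition\": %d, \"offset\": %d, \"value\": %s}",
                    record.kafkaPartition(), record.kafkaOffset(), record.value());
            postToEndpoint(payload); // hypothetical HTTP call
        }
    }

    private void postToEndpoint(String payload) {
        // Placeholder; real code would use an HTTP client here.
        System.out.println(payload);
    }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }
}
```

With the partition and offset attached, the destination system at least has enough information to order or de-duplicate events per partition itself.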