How to merge several flowfiles into one - merge

I need one flowfile at the end of the queue.
My config now:

Minimum Number of Entries set to 1 simply means that if the number of incoming entries reaches 1, the flowfiles are merged together.
For example, if you set it to 2 (regardless of the rest of the settings), this processor merges two flowfiles together. So if you have 4 incoming flowfiles, you will have 2 flowfiles after the merge.

Related

How can I add a blocklist either at the Kafka topic level or as a processor in NiFi?

I have log message data that is being pushed to a Kafka topic, with a NiFi Kafka consumer pulling in the message data and routing it to various drops. There are a number of records I would like to scrub based on a set of internal user IDs and IP addresses. I have a list of about 20 IP addresses and 10 user IDs to scrub.
Is there a way to set a blocklist either in front of the topic, filtering the data before it lands and is consumed by NiFi, or a way to add this as a processor that would filter the data in NiFi before sinking it to the various sources?
Thanks
Using NiFi, you could do something like this:
Consume messages with ConsumeKafkaRecord, then use a QueryRecord to filter messages with a SQL query.
The QueryRecord config would be:
A dynamic property named filtered with the value SELECT * FROM FLOWFILE WHERE userid IN ('user1','user2','user3') OR ipaddr IN ('ip1','ip2','ip3')
This will give you an unmatched relationship for messages that did not match and a filtered relationship for messages that did. You can then do whatever you want with the two sets of messages.
If you didn't want to hard-code the list of users/IPs in the SQL, you could build it into your flow to pull those lists from an external source and then reference them dynamically.

Mark the end of a logical section in Kafka when multiple partitions are used

I want to share a problem and a solution I used, as I think it may be beneficial for others. If you have any other solutions, please share.
I have a table with 1,000,000 rows, which I want to send to Kafka, spreading the data across 20 partitions.
I want to notify the consumer when the producer has reached the end of the data, without having a direct connection between producer and consumer.
I know Kafka is designed as a logically endless stream of data, but I still need to mark the end of this specific table.
There was a suggestion to count the number of items per logical section and send this count (to a metadata topic), so the consumer will be able to count items and know when the logical section has ended.
There are several disadvantages to this approach:
As the data is spread across partitions, I can say there are x items in total in my logical section; however, if there are multiple consumers (one per partition), they'll need to share a counter of consumed messages per logical section. I want to avoid this complexity. Also, when a consumer is stopped and resumed, it will need to know how many items were already consumed and keep that context.
A regular producer session guarantees at-least-once delivery, which means I may have duplicated messages. Counting the messages needs to take this into account (and avoid counting duplicates).
There is also the case where I don't know in advance the number of items per logical section (I'm also a kind of consumer, consuming a stream of events and signaled when the data has ended), so in this case the producer also needs to keep a counter, persist it across stops and resumes, etc. Having several producers means they need to share the counter, and so on. So it adds a lot of complexity to the process.
Solution 1:
I actually want the last message in each partition to indicate that it is the last message.
I can do some work in advance: create some random message keys, send messages partitioned by key, and test which partition each message is directed to. As partitioning by key is deterministic (for a given number of partitions), I can prepare a map of keys to target partitions. For example, key 'xyz' is directed to partition #0, key 'hjk' is directed to partition #1, etc. Finally, I build the reversed map, so for partition 0 I use key 'xyz', for partition 1 I use key 'hjk', etc.
Now I can send the entire table (except for the last 20 rows) with a random partition strategy, so the data is spread across partitions for almost the whole table.
When I get to the last 20 rows, I'll send them using partition keys, setting for each message a key that hashes it to a different partition. This way, each of the 20 partitions gets one of the last 20 messages. For each of the last 20 messages, I'll set a header stating that it is the last one.
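A minimal sketch of the preparation step for Solution 1, assuming the producer uses the default partitioner (so a key's target partition can be computed with the same murmur2 hash instead of sending test messages); the class and method names here are illustrative, not from the original post:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import org.apache.kafka.common.utils.Utils;

public class PartitionKeyMap {

    // Returns a map from partition number to a key that the default partitioner routes to that partition.
    public static Map<Integer, String> buildReverseMap(int numPartitions) {
        Map<Integer, String> keyForPartition = new HashMap<>();
        while (keyForPartition.size() < numPartitions) {
            String candidate = UUID.randomUUID().toString();
            // same hash the default Kafka partitioner applies to keyed records
            int partition = Utils.toPositive(
                    Utils.murmur2(candidate.getBytes(StandardCharsets.UTF_8))) % numPartitions;
            keyForPartition.putIfAbsent(partition, candidate);
        }
        return keyForPartition;
    }
}

Each of the last 20 rows can then be sent with the key mapped to its target partition (plus a header such as is_last=true), so every partition receives exactly one end marker.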
Solution 2:
Similar to Solution 1, but send the entire table spread across random partitions. Then send 20 metadata messages, directed to the 20 partitions using the partition-by-key strategy (by setting the appropriate keys).
Solution 3:
Have an additional control topic. After the table has been sent entirely to the data topic, send a message to the control topic saying the table is complete. The consumer will need to check the control topic from time to time; when it gets the 'end of data' message, it knows that once it reaches the end of a partition, it has actually reached the end of the data for that partition. This solution is less flexible and less recommended, but I wrote it down as well.
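A minimal sketch of the consumer-side check for Solution 3; the topic name "table-data-control", the marker value "end-of-data", the single control partition, and the group id are all assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ControlTopicCheck {

    // Polls the control topic once and reports whether the 'end of data' marker has been published.
    public static boolean tableCompleted(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "control-topic-check");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> control = new KafkaConsumer<>(props)) {
            control.assign(List.of(new TopicPartition("table-data-control", 0)));
            for (ConsumerRecord<String, String> record : control.poll(Duration.ofSeconds(1))) {
                if ("end-of-data".equals(record.value())) {
                    // once this is seen, reaching the end of a data partition means the table is done
                    return true;
                }
            }
        }
        return false;
    }
}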
Another solution is to use an open-source analog of S3 (e.g. minio.io). Producers can upload the data and send a message with a link to the object in storage. Consumers remove the data from object storage after collecting it.

Timeout for aggregated records in Kafka table?

I use Kafka for processing messages. A message can be divided into a few parts (a composite message). So in the stream I can have, for example, one composite message that is divided into three parts; in other words, there will be three records in the Kafka stream, but it is one big message. I want to use a Kafka table to merge the parts of a composite message into one Kafka record. After the merge, the single message will be inserted into a database (Postgres). Every part has its number and the total number of parts. For example, if I have three parts (three Kafka records) of one message in the stream, every part has the field 'total number of parts' with the value 3.
As I understand it, the task is simple in the positive scenario: aggregate the parts in a table, create a stream from the table, filter records whose aggregated part count equals the total number of parts, map the filtered records into one merged message, and insert it into the database (Postgres).
But a negative scenario is also possible. In rare cases, one of the parts may not be inserted into Kafka at all (or it may be inserted much later, after the timeout). So, for example, only two of the three parts of a composite message may be present in the stream. In that case I must insert into the database (Postgres) a message that is not fully assembled (it will consist of only two parts, not three). How can I implement this negative scenario in Kafka?
I would recommend checking out punctuations: https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor
Also note that you can mix and match the Processor API and the DSL: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
If you provide a store name for the KTable aggregation, you can connect the store to a custom processor that registers a punctuation. Overall, though, it might be better to use the Processor API for the whole application instead of the DSL.
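A minimal sketch of such a processor, assuming the Processor API of recent Kafka Streams versions; the store name "parts-store", the Parts value type, and the wiring of the processor into the topology are assumptions, not part of the original answer:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;
import org.apache.kafka.streams.state.ValueAndTimestamp;

public class PartTimeoutProcessor implements Processor<String, PartTimeoutProcessor.Parts, String, PartTimeoutProcessor.Parts> {

    private static final long TIMEOUT_MS = Duration.ofMinutes(5).toMillis();

    private TimestampedKeyValueStore<String, Parts> store;

    @Override
    public void init(ProcessorContext<String, Parts> context) {
        // "parts-store" is the store name assumed to be given to the KTable aggregation
        this.store = context.getStateStore("parts-store");
        // every 30 seconds of wall-clock time, look for aggregates that timed out
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, ValueAndTimestamp<Parts>> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, ValueAndTimestamp<Parts>> entry = it.next();
                    Parts parts = entry.value.value();
                    // forward incomplete aggregates that have not been updated within the timeout
                    if (!parts.isComplete() && now - entry.value.timestamp() > TIMEOUT_MS) {
                        context.forward(new Record<>(entry.key, parts, now));
                    }
                }
            }
        });
    }

    @Override
    public void process(Record<String, Parts> record) {
        // no per-record work here; the KTable aggregation fills the store
    }

    // Parts is an assumed value type produced by the KTable aggregation
    public record Parts(int receivedParts, int totalParts) {
        boolean isComplete() { return receivedParts == totalParts; }
    }
}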

Kafka: Can a single producer produce 2 different records to 2 different topics?

I have two types of records; let's call them X and Y. I want record X to go to TopicX and record Y to go to TopicY.
1) Do I need two different producers?
2) Is it better to have 2 partitions instead of 2 different topics?
3) How can I avoid having two different producers, for better network usage?
Thank you!
If you are using the same key/value serializers (and other producer properties), you can use the same producer. The ProducerRecord contains the information about which topic to send to.
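For illustration, a minimal sketch with a single producer writing both record types; the bootstrap server, the String serializers, and the payloads are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TwoTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the topic is chosen per record, so one producer instance covers both record types
            producer.send(new ProducerRecord<>("TopicX", "some-key", "record of type X"));
            producer.send(new ProducerRecord<>("TopicY", "some-key", "record of type Y"));
        }
    }
}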
The common practice is to have one topic per message type. For partitioning, some IDs are used (clientId, sessionId, ...). So if the records you want to send have different logic, it is better to use different topics.

Processing records in order in Storm

I'm new to Storm and I'm having problems figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users who have followed the path I specified (for example, users that went from location A to location B to location C).
I'm using a Kafka producer and reading these records from a file to simulate live data. The data is sorted by date.
So, to check whether my pattern is fulfilled, I need to process the records in order. The thing is, due to parallelization (bolt replication) I don't get a user's check-ins in order, and because of that the pattern detection won't work.
How to overcome this problem? How to process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing, like Apache Flink (disclaimer: I am a committer at Flink), or you need to take care of it yourself in your bolt code.
The only support Storm provides is via Trident. You can put the tuples of a certain time period (for example, one minute) into a single batch and thus process all tuples of that minute at once. However, this only works if your use case allows for it, because you cannot relate tuples from different batches to each other. In your case, this would only work if you know that there are points in time at which all users have reached their destination (and no other user has started a new interaction); i.e., you need points in time at which no two users' interactions overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system, i.e., customized user-code-based solution, there are two approaches:
You could, for example, buffer up tuples and sort them by timestamp within a bolt before processing (a minimal sketch follows after these two approaches). To make this work properly, you need to inject punctuations/watermarks that ensure that no tuple with a larger timestamp than the punctuation arrives after it. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
Another way would be to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits tuples in order. Furthermore, to avoid blocking (i.e., if no input is available for a substream), punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
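A minimal sketch of the first (sort-based) approach, simplified to a single input substream; the field names and the assumption that punctuation tuples carry a field named "is_punctuation" are illustrative, not part of Storm:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class SortAndFlushBolt extends BaseRichBolt {
    private OutputCollector collector;
    // buffered tuple values, ordered by the check-in timestamp (third field)
    private PriorityQueue<List<Object>> buffer;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new PriorityQueue<>(Comparator.comparingLong(v -> (Long) v.get(2)));
    }

    @Override
    public void execute(Tuple input) {
        if (input.contains("is_punctuation")) {
            // the punctuation guarantees no later tuple carries a smaller timestamp,
            // so everything buffered so far can be emitted in timestamp order
            List<Object> values;
            while ((values = buffer.poll()) != null) {
                collector.emit(values);
            }
        } else {
            buffer.add(input.select(new Fields("user_id", "location_id", "time_of_checking")));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user_id", "location_id", "time_of_checking"));
    }
}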
Storm supports this use case. You just have to ensure that order is maintained throughout your flow in all the involved components. So, as the first step, in the Kafka producer all messages for a particular user id should go to the same partition in Kafka. For this you can implement a custom Partitioner in your KafkaProducer. Please refer to the link here for implementation details.
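For example, a custom partitioner could look like the following sketch, assuming the record value is the raw CSV line "user_id,location_id,time_of_checking" (the class name and that layout are assumptions):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class UserIdPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // hash only the user_id portion of the message, so all check-ins of a user land in one partition
        String userId = value.toString().split(",", 2)[0];
        return Utils.toPositive(Utils.murmur2(userId.getBytes(StandardCharsets.UTF_8))) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

The partitioner is then registered on the producer via the partitioner.class property.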
Since a partition in Kafka can be read by one and only one kafkaSpout instance in Storm, the messages in that partition arrive in order at that spout instance, thereby ensuring that all messages of the same user id arrive at the same spout.
Now comes the tricky part: to maintain order in the bolt, you want to use fields grouping on the bolt, based on the "user_id" field emitted from the Kafka spout. The provided kafkaSpout does not break the message up into fields to emit, so you would have to override the kafkaSpout to read the message and emit a "user_id" field from the spout. One way of doing so is to have an intermediate bolt which reads the message from the Kafkaspout and emits a stream with a "user_id" field.
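A minimal sketch of such an intermediate bolt, assuming the spout emits the raw message in a string field named "value" and the message is the CSV line from the question (both are assumptions):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FieldsEmitterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // split the raw "user_id,location_id,time_of_checking" message into named fields
        String[] parts = input.getStringByField("value").split(",");
        collector.emit(input, new Values(parts[0], parts[1], parts[2]));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user_id", "location_id", "time_of_checking"));
    }
}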
When you finally specify a bolt with fields grouping on "user_id", all messages with a particular user_id value will go to the same instance of that bolt, whatever the degree of parallelism of the bolt.
A sample topology that would work for your case could be as follows:
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("FieldsEmitterBolt", FieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", CalculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); //user_id field emitted by Bolt2
-- Beware: there could be a case where all the user_id values end up in the same CalculatorBolt instance if you have a limited number of user_ids. This in turn would decrease the effective 'parallelism'!