How does Storm handle fields grouping when you add more nodes?

Just reading more details on Storm and came across its ability to do fields grouping. For example, if you were counting tweets per user and you had two tasks with a fields grouping on user-id, the same user-ids would always get sent to the same task.
So task 1 could have the following counts in memory
bob: 10
alice: 5
task 2 could have the following counts in memory
jill:10
joe: 4
If I added a new machine to the cluster to increase capacity and ran a rebalance, what would happen to my counts in memory? Would the same user start being counted in different tasks?

Using fields grouping we can guide tuples with a specific field value to a particular task.
Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
These tasks are static over a topology's life cycle; what you can alter with rebalance is the number of executors (threads). Adding a new node to the cluster lets you reconfigure the number of executors without shutting down the topology, but the number of tasks remains the same no matter what. Adding a new node simply gives you the opportunity to increase performance by tuning Storm's parallelism.

In order to send the message to the same task every time, Storm mods the hash code of the grouping values by the number of tasks (hashcode(values) % #tasks). If you were to increase the number of tasks, your counts would no longer be accurate, as the same values may not go to the same task/worker after the rebalance.
https://groups.google.com/forum/#!msg/storm-user/lCKnl8AmSVE/rVCH3uuUAcMJ

To fully understand it, you have to see the code:
Fields grouping is dependent on the field string and not on which spout emitted it. So a rebalance won't affect it. This is the function: https://github.com/apache/storm/blob/3b1ab3d8a7da7ed35adc448d24f1f1ccb6c5ff27/storm-core/src/jvm/org/apache/storm/daemon/GrouperFactory.java#L157-L161
@Override
public List<Integer> chooseTasks(int taskId, List<Object> values) {
    int targetTaskIndex = Math.abs(TupleUtils.listHashCode(outFields.select(groupFields, values))) % numTasks;
    return Collections.singletonList(targetTasks.get(targetTaskIndex));
}
TupleUtils.listHashCode leads to
public static <T> int listHashCode(List<T> alist) {
    if (alist == null) {
        return 1;
    } else {
        return Arrays.deepHashCode(alist.toArray());
    }
}
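To see why counts would drift if the number of tasks ever changed, here is a minimal, hypothetical illustration of the same hash-mod routing in plain Java (not Storm API):
import java.util.Arrays;
import java.util.List;

public class FieldsGroupingDemo {
    // Same idea as GrouperFactory: hash the grouping values, then mod by the task count.
    static int targetTask(List<?> groupValues, int numTasks) {
        return Math.abs(Arrays.deepHashCode(groupValues.toArray())) % numTasks;
    }

    public static void main(String[] args) {
        List<String> bob = Arrays.asList("bob");
        // With 2 tasks, "bob" always lands on the same index...
        System.out.println(targetTask(bob, 2));
        // ...but with 3 tasks the index may differ, which is why in-memory
        // counts would no longer line up if the task count could change.
        System.out.println(targetTask(bob, 3));
    }
}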

Fields grouping lets you control how tuples are routed to bolts based on one or more fields of the tuple.
It ensures that a given set of values for a combination of fields is always sent to the same bolt task.
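For completeness, wiring a fields grouping in a topology looks roughly like this; the component names and the spout/bolt classes are hypothetical placeholders, not from the question:
// Every tuple with the same "user-id" value is routed to the same one of the two bolt tasks.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweet-spout", new TweetSpout());           // TweetSpout is a placeholder
builder.setBolt("user-count-bolt", new UserCountBolt(), 2)   // UserCountBolt is a placeholder
       .fieldsGrouping("tweet-spout", new Fields("user-id"));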

Related

Kafka Streams / How to get the partition an iterator is iterating over?

In my Kafka Streams application, I have a task that sets up a punctuator scheduled by wall-clock time. The punctuator iterates over the entries of a store and does something with them, like this:
var store = context().getStateStore("MyStore");
var iter = store.all();
while (iter.hasNext()) {
    var entry = iter.next();
    // ... do something with the entry
}
// Print a summary (now): N entries processed
// Print a summary (wish): N entries processed in partition P
Since I'm working with a single store here (which might be partitioned), I assume that every single execution of the punctuator is bound to a single partition of that store.
Is it possible to find out which partition the punctuator operates on? The Javadoc for ProcessorContext.partition() states that this method returns -1 within punctuators.
I've read Kafka Streams: Punctuate vs Process and the answers there. I can understand that a task is, in general, not tied to a particular partition. But an iterator should be tied IMO.
How can I find out the partition?
Or is my assumption wrong that a particular instance of a store iterator is tied to a partition?
What I need it for: I'd like to include the partition number in some log messages. For now, I have several nearly identical log messages stating that the punctuator does this and that. In order to make those messages "unique", I'd like to include the partition number in them.
Just to post here the answer that was provided in https://issues.apache.org/jira/browse/KAFKA-12328:
I just used context.taskId(). It contains the partition number at the end of the value, after the underscore. This was sufficient for me.
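For reference, a minimal sketch of pulling the partition number out of the task id inside the punctuator; the assumption that the task id renders as "<subtopology>_<partition>" comes from the answer above:
// Assumes the task id renders as "<subtopology>_<partition>" (e.g. "0_3"), per the answer above.
String taskId = context().taskId().toString();
int partition = Integer.parseInt(taskId.substring(taskId.lastIndexOf('_') + 1));

// ... run the iterator loop from the question, counting entries into n ...
System.out.println(n + " entries processed in partition " + partition);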

Kafka Connect S3 Sink Connector Partitioning large topics by id field

We've been working on adding Kafka Connect to our data platform for the last few weeks and think it would be a useful way of extracting data from Kafka into an S3 data lake. We've played around with the FieldPartitioner and the TimeBasedPartitioner and seen some pretty decent results.
We also need to partition by user id, but having tried the FieldPartitioner on a user id field, the connector is extremely slow, especially compared to partitioning by date etc. I understand that partitioning by an id will create a lot of output partitions and thus won't be as fast, which is fine, but it needs to be able to keep up with the producers.
So far we've tried increasing memory and heap, but we don't usually see any memory issues unless we bump flush.size to a large number. We've also tried small flush sizes, and very small and large rotate.schedule.interval.ms configurations. We've also looked at networking, but that seems to be fine; with other partitioners the network keeps up fine.
Before potentially wasting a lot of time on this, has anyone attempted or succeeded in partitioning by an id field, especially on larger topics, using the S3 Sink Connector? Or does anyone have any suggestions in terms of configuration or setup that might be a good place to look?
I'm not that familiar with Kafka Connect, but I will at least try to help.
I am not aware of whether you can configure the connector down to the Kafka topic's partition level; I am assuming there's some way to do that here.
One possible approach would focus on the step where your clients produce to the Kafka brokers. My suggestion is to implement your own Partitioner, in order to have further control over where the data is sent on Kafka's side.
Here is an example/simplification of such a custom partitioner. Suppose the key your producers send has the format id_name_date. The custom partitioner extracts the first element (the id) and then chooses the desired partition.
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class IdPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] kb,
                         Object v, byte[] vb, Cluster cl) {
        try {
            String pKey = (String) key;
            int id = Integer.parseInt(pKey.substring(0, pKey.indexOf("_")));
            /* getPartitionForId decides which partition number corresponds
               to the received id. You could also implement the logic directly here. */
            return getPartitionForId(id);
        } catch (Exception e) {
            return 0;
        }
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // nothing to configure here, but the Partitioner interface requires this method
    }

    @Override
    public void close() {
        // maybe some work here if needed
    }
}
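To actually use such a partitioner, it would be registered on the producer side, roughly like this (the property names are standard producer configs; the class name is the one sketched above):
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Route records through the custom partitioner sketched above.
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, IdPartitioner.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);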
Even if you may need some more tuning on the Kafka Connect side, I believe this option may be helpful. Assume you have a topic with 5 partitions, and that getPartitionForId just checks the first digit of the id in order to decide the partition (for simplification purposes, the minimum id is 100 and the maximum is 599).
So if the received key is, for example, 123_tempdata_20201203, the partition method would return 0, that is, the 1st partition.
(The image shows P1 instead of P0 because I believe the example looks more natural this way, but be aware that the 1st partition is in fact defined as partition 0. OK, to be honest, I forgot about P0 while painting this and didn't save the template, so I had to search for an excuse, like: it looks more natural.)
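A hypothetical getPartitionForId matching that simplification (the first digit of a three-digit id in the range 100-599 picks one of the 5 partitions) could be as small as:
// Hypothetical helper for the simplified example above:
// ids 100-199 -> partition 0, 200-299 -> partition 1, ..., 500-599 -> partition 4.
private int getPartitionForId(int id) {
    return (id / 100) - 1;
}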
Basically this would be a pre-adjustment, or accommodation, before the S3 upload.
I am aware this may not be the ideal answer, as I don't know the exact specifications of your system. My guess is that there may be some way to map topic partitions directly to S3 locations.
If there's no possibility to do so, at least I hope this could give you some further ideas. Cheers!

Kafka Stream Topology on multiple instances

We have a streams topology that will work on multiple machines. We are storing time-windowed aggregation results into state stores.
Since state stores hold local data, I think the overall aggregation should be done on another topic.
But it seems like I am missing something, because none of the examples do the overall aggregation on another KStream or Processor.
Do we need to use the groupBy logic for storing the overall aggregation, use a GlobalKTable, or just implement our own merger code somewhere?
What is the correct architecture for this?
In the code below, I have tried to group all the messages coming to the processor under a constant key, to store the overall aggregation on just one machine, but I think it loses the parallelism that Kafka supplies.
dashboardItemProcessor = streamsBuilder.stream("Topic25", Consumed.with(Serdes.String(), eventSerde))
        .filter((key, event) -> event != null && event.getClientCreationDate() != null);
dashboardItemProcessor.map((key, event) -> KeyValue.pair(key, event.getClientCreationDate().toInstant().toEpochMilli()))
        .groupBy((key, event) -> "count", Serialized.with(Serdes.String(), Serdes.Long()))
        .windowedBy(timeWindow)
        .count(Materialized.as(dashboardItemUtil.getStoreName(itemId, timeWindow)));
"In the code below, I have tried to group all the messages coming to the processor under a constant key, to store the overall aggregation on just one machine, but I think it loses the parallelism that Kafka supplies."
This seems to be the right approach. And yes, you lose parallelism, but that is how a global aggregation works. In the end, one machine must compute it...
What you could improve, though, is to use a two-step approach: i.e., first aggregate by "random" keys in parallel, and then use a second step with only one key to "merge" the partial aggregates into a single one. This way, parts of the computation are parallelized and only the final step (on a hopefully reduced data volume) is non-parallel. With Kafka Streams, you need to implement this approach "manually".
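A rough DSL sketch of that two-step idea follows; the topic name is taken from the question, while the serdes, store names and the number of sub-keys are illustrative assumptions, and the windowing from the question is omitted for brevity:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> events = builder.stream("Topic25", Consumed.with(Serdes.String(), Serdes.String()));

// Step 1: count in parallel under a small, bounded set of "random" sub-keys.
KTable<String, Long> partialCounts = events
        .groupBy((key, value) -> "part-" + ThreadLocalRandom.current().nextInt(10),
                 Grouped.with(Serdes.String(), Serdes.String()))
        .count(Materialized.as("partial-counts"));

// Step 2: funnel all partial counts to a single constant key and merge them.
KTable<String, Long> totalCount = partialCounts
        .groupBy((subKey, count) -> KeyValue.pair("total", count),
                 Grouped.with(Serdes.String(), Serdes.Long()))
        .reduce(Long::sum,                              // adder: a partial count was updated upwards
                (total, oldCount) -> total - oldCount,  // subtractor: retract the previous partial value
                Materialized.as("total-count"));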

Kafka Streams: Partial reprocessing by key

Scenario:
In a KafkaStreams web sessioning scenario,
with unlimited (or years-long) retention,
with interactive queries (this can be reviewed if necessary),
with many clients, which have many users each (each user particular to each client),
and where partitioning goes like this:
Partition by a function of (clientId, userId) % numberOfPartitions, setting numberOfPartitions beforehand depending on the cluster size. This would allow sessioning to be performed on (clientId, userId) data, and should provide an even data distribution among the nodes (no hotspotting in partition size or write load).
However, when querying, I'd query by client (and time range). So I'd build an aggregated KTable from that Sessions table, where the key is the client, and Sessions are queried by (client, timeStart, timeEnd). That would force the data for a given client to go to one node, which could pose scalability issues (too big a client), but since the data is already aggregated, I guess that would be manageable.
Question:
In this scenario (variants appreciated), I'd like to be able to reprocess only for one client.
But data from one client would be scattered among (potentially all of) the partitions.
How can a partial reprocess be achieved in Kafka Streams with minor impact, and keep (old) state queryable in the meantime?
I think you already know the answer to your question in general: with a partitioning scheme like the one you have described, you will have to read all partitions if you want to reprocess a client, as its messages will be spread throughout all of them.
The only thing that I can come up with to limit the amount of overhead when reprocessing an entire client is to implement a partitioning scheme that groups several partitions for a client and then distributes users over those partitions, to avoid overloading one partition with a particularly "large" client. The picture should hopefully clarify what I am probably failing to explain in words.
A custom partitioner to achieve this distribution could look somewhat like the following code. Please take this with a grain of salt; it is purely theoretical so far and has never been tested (or even run, for that matter), but it should illustrate the principle.
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.config.ConfigException;
import org.apache.kafka.common.utils.Utils;

public class ClientUserPartitioner implements Partitioner {
    int partitionGroupSize = 10;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // For this we expect the key to be of the format "client/user"
        String[] splitValues = ((String) key).split("/");
        String client = splitValues[0];
        String user = splitValues[1];

        // Check that the partition count is divisible by the group size
        int partitionCount = cluster.availablePartitionsForTopic(topic).size();
        if (partitionCount % partitionGroupSize != 0) {
            throw new ConfigException("Partition count must be divisible by " + partitionGroupSize
                    + " for this partitioner but is " + partitionCount + " for topic " + topic);
        }

        // Calculate the partition group from the client and the specific partition within that group from the user
        // (Utils.toPositive avoids negative results from murmur2)
        int groupCount = partitionCount / partitionGroupSize;
        int clientPartitionOffset = Utils.toPositive(Utils.murmur2(client.getBytes())) % groupCount * partitionGroupSize;
        int userPartition = Utils.toPositive(Utils.murmur2(user.getBytes())) % partitionGroupSize;

        // Combine group offset and user-specific value to get the final partition
        return clientPartitionOffset + userPartition;
    }

    @Override
    public void configure(Map<String, ?> configs) {
        if (configs.containsKey("partition.group.size")) {
            this.partitionGroupSize = Integer.parseInt((String) configs.get("partition.group.size"));
        }
    }

    @Override
    public void close() {
    }
}
Doing this will of course affect your distribution; it might be worth your while to run a few simulations with different values for partitionGroupSize and a representative sample of your data, to estimate how even the distribution is and how much overhead you'd save when reprocessing an entire client.
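One way to run such a simulation, under the assumption of a 50-partition topic, a group size of 10, and a file of representative "client/user" keys (the file name is made up), is a small harness that applies the same hashing and tallies the result:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.kafka.common.utils.Utils;

public class PartitionDistributionSimulation {
    public static void main(String[] args) throws Exception {
        int numPartitions = 50;       // assumed topic size
        int partitionGroupSize = 10;  // value under test
        int groupCount = numPartitions / partitionGroupSize;

        int[] tupleCounts = new int[numPartitions];
        // One "client/user" key per line, e.g. "acme/jane"
        List<String> keys = Files.readAllLines(Paths.get("sample-keys.txt"));

        for (String key : keys) {
            String[] parts = key.split("/");
            int offset = Utils.toPositive(Utils.murmur2(parts[0].getBytes())) % groupCount * partitionGroupSize;
            int partition = offset + Utils.toPositive(Utils.murmur2(parts[1].getBytes())) % partitionGroupSize;
            tupleCounts[partition]++;
        }

        for (int p = 0; p < numPartitions; p++) {
            System.out.printf("partition %2d: %d records%n", p, tupleCounts[p]);
        }
    }
}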

Max number of tuple replays on Storm Kafka Spout

We’re using Storm with the Kafka Spout. When we fail messages, we’d like to replay them, but in some cases bad data or code errors will cause messages to always fail a Bolt, so we’ll get into an infinite replay cycle. Obviously we’re fixing errors when we find them, but would like our topology to be generally fault tolerant. How can we ack() a tuple after it’s been replayed more than N times?
Looking through the code for the Kafka Spout, I see that it was designed to retry with an exponential backoff timer and the comments on the PR state:
"The spout does not terminate the retry cycle (it is my conviction that it should not do so, because it cannot report context about the failure that happened to abort the reqeust), it only handles delaying the retries. A bolt in the topology is still expected to eventually call ack() instead of fail() to stop the cycle."
I've seen StackOverflow responses that recommend writing a custom spout, but I'd rather not be stuck maintaining a custom patch of the internals of the Kafka Spout if there's a recommended way to do this in a Bolt.
What’s the right way to do this in a Bolt? I don’t see any state in the tuple that exposes how many times it’s been replayed.
Storm itself does not provide any support for your problem. Thus, a customized solution is the only way to go. Even if you do not want to patch KafkaSpout, I think introducing a counter and breaking the replay cycle in it would be the best approach. As an alternative, you could also inherit from KafkaSpout and put a counter in your subclass. This is of course somewhat similar to a patch, but might be less intrusive and easier to implement.
If you want to use a Bolt, you could do the following (which also requires some changes to the KafkaSpout or a subclass of it).
Assign a unique ID as an additional attribute to each tuple (maybe there is already a unique ID available; otherwise, you could introduce a counter ID or just use the whole tuple, i.e., all attributes, to identify each tuple).
Insert a bolt after KafkaSpout via fieldsGrouping on the ID (to ensure that a replayed tuple is streamed to the same bolt instance).
Within your bolt, use a HashMap<ID, Counter> that buffers all tuples and counts the number of (re-)tries. If the counter is smaller than your threshold value, forward the input tuple so it gets processed by the actual topology that follows (of course, you need to anchor the tuple appropriately). If the count is larger than your threshold, ack the tuple to break the cycle and remove its entry from the HashMap (you might also want to log all failed tuples; a sketch of such a bolt appears after this answer).
In order to remove successfully processed tuples from the HashMap, each time a tuple is acked in KafkaSpout you need to forward the tuple ID to the bolt so that it can remove the tuple from the HashMap. Just declare a second output stream for your KafkaSpout subclass and override Spout.ack(...) (of course, you need to call super.ack(...) to ensure KafkaSpout gets the ack, too).
This approach might consume a lot of memory, though. As an alternative to keeping an entry for each tuple in the HashMap, you could also use a third stream (connected to the bolt like the other two) and forward a tuple ID if a tuple fails (i.e., in Spout.fail(...)). Each time the bolt receives a "fail" message from this third stream, the counter is increased. As long as no entry is in the HashMap (or the threshold is not reached), the bolt simply forwards the tuple for processing. This should reduce the memory used, but requires some more logic to be implemented in your spout and bolt.
Both approaches have the disadvantage that each acked tuple results in an additional message to your newly introduced bolt (thus increasing network traffic). For the second approach, it might seem that you only need to send an "ack" message to the bolt for tuples that failed before. However, you do not know which tuples failed and which did not. If you want to get rid of this network overhead, you could introduce a second HashMap in KafkaSpout that buffers the IDs of failed messages. Thus, you only send an "ack" message if a failed tuple was replayed successfully. Of course, this third approach makes the logic to be implemented even more complex.
Without modifying KafkaSpout to some extent, I see no solution for your problem. I personally would patch KafkaSpout or use the third approach with a HashMap in a KafkaSpout subclass and the bolt (because it consumes little memory and does not put a lot of additional load on the network compared to the first two solutions).
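A minimal sketch of such a counting bolt, assuming a "tuple-id" field, an extra "ack-stream" coming from the spout subclass, and an arbitrary threshold (none of these names come from Storm itself):
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class ReplayLimitBolt extends BaseRichBolt {
    private static final int MAX_RETRIES = 5;             // assumed threshold
    private final Map<Object, Integer> retries = new HashMap<>();
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        Object id = input.getValueByField("tuple-id");     // assumed ID attribute

        if ("ack-stream".equals(input.getSourceStreamId())) {
            retries.remove(id);                             // the spout reported success, forget the tuple
            collector.ack(input);
            return;
        }

        int count = retries.merge(id, 1, Integer::sum);
        if (count > MAX_RETRIES) {
            retries.remove(id);
            collector.ack(input);                           // break the replay cycle
        } else {
            collector.emit(input, input.getValues());       // anchor and forward for normal processing
            collector.ack(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tuple-id", "payload")); // assumed downstream fields
    }
}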
Basically it works like this:
If you deploy topologies, they should be production grade (that is, a certain level of quality is expected, and the number of failing tuples should be low).
If a tuple fails, check whether the tuple is actually valid.
If a tuple is valid (for example, it failed to be inserted because it's not possible to connect to an external database, or something like this), replay it.
If a tuple is malformed and can never be handled (for example, a database id which is text while the database is expecting an integer), it should be acked; you will never be able to fix such a thing or insert it into the database.
New kinds of exceptions should be logged (along with the tuple contents themselves). You should check these logs, generate rules to validate tuples in the future, and eventually add code to correctly process (ETL) them.
Don't log everything, otherwise your log files will be huge; be very selective about what you log. The contents of the log files should be useful and not a pile of rubbish.
Keep doing this, and eventually you will cover all cases.
We also faced a similar situation where bad data coming in caused a bolt to fail infinitely.
In order to resolve this at runtime, we introduced one more bolt, naming it "DebugBolt" for reference. The spout sends the message to this bolt first, the bolt applies the required data fixes for the bad messages, and then it emits them to the required bolt. This way one can fix data errors on the fly.
Also, if you need to drop some messages, you can pass an ignoreFlag from your DebugBolt to your original bolt, and the original bolt should just send an ack to the spout without processing if the ignoreFlag is true (see the sketch below).
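A minimal sketch of how the original bolt could honor such a flag; the "ignoreFlag" field name is from the answer above, everything else is illustrative:
// Inside the original bolt, assuming the DebugBolt emits each tuple with an extra boolean "ignoreFlag" field.
@Override
public void execute(Tuple input) {
    if (input.getBooleanByField("ignoreFlag")) {
        collector.ack(input);   // drop the message without further processing
        return;
    }
    // ... normal processing ...
    collector.ack(input);
}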
We simply had our bolt emit the bad tuple on an error stream and acked it. Another bolt handled the error by writing it back to a Kafka topic specifically for errors. This allows us to easily direct normal vs. error data flow through the topology.
The only case where we fail a tuple is because some required resource is offline, such as a network connection, DB, ... These are retriable errors. Anything else is directed to the error stream to be fixed or handled as is appropriate.
This all assumes of course, that you don't want to incur any data loss. If you only want to attempt a best effort and ignore after a few retries, then I would look at other options.
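For reference, the error-stream pattern can be sketched roughly like this (the stream and field names are made up; a downstream bolt subscribed to the "error" stream then writes the record to the dedicated Kafka error topic):
// In declareOutputFields:
declarer.declareStream("error", new Fields("original-payload", "error-message"));

// In execute(), when the tuple cannot be processed:
collector.emit("error", input, new Values(input.getString(0), e.getMessage()));
collector.ack(input);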
To my knowledge, Storm doesn't provide built-in support for this.
I have applied the implementation below:
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class AuditMessageWriter extends BaseRichBolt {

    private static final long serialVersionUID = 1L;
    private static final int MESSAGE_REPROCESS_LIMIT = 3; // replay threshold, choose to taste

    private OutputCollector collector;
    private Map<Object, Integer> failedTuple = new HashMap<>();

    public AuditMessageWriter() {
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // any initialization if you want
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void execute(Tuple input) {
        try {
            // Write your processing logic
            collector.ack(input);
        } catch (Exception e2) {
            // In case of an exception, save the tuple in the failedTuple map with a count of 1;
            // if it is already present, increase the count and fail the tuple.
            // If the failure count reaches the limit (message reprocess limit),
            // log it, remove it from the map and acknowledge the tuple.
            // The tuple's values are used as the key so a replayed tuple maps to the same entry.
            int count = failedTuple.merge(input.getValues(), 1, Integer::sum);
            if (count < MESSAGE_REPROCESS_LIMIT) {
                collector.fail(input);
            } else {
                failedTuple.remove(input.getValues());
                log(input);
                collector.ack(input);
            }
            ExceptionHandler.LogError(e2, "Message IO Exception");
        }
    }

    void log(Tuple input) {
        try {
            // Here you can pass the tuple to a dead-letter queue or log it;
            // the caller above acks the tuple afterwards
        } catch (Exception e) {
            ExceptionHandler.LogError(e, "Exception while logging");
        }
    }

    @Override
    public void cleanup() {
        // Nothing to clean up in this bolt.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output fields to declare in this bolt.
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}