In the https://gobblin.readthedocs.io/en/latest/case-studies/Kafka-HDFS-Ingestion/#grouping-workunits section of the Gobblin documentation, we can read about Single-level packing with the following description:
The single-level packer uses a worst-fit-decreasing approach for assigning workunits to mappers: each workunit goes to the mapper that currently has the lightest load. This approach balances the mappers well. However, multiple partitions of the same topic are usually assigned to different mappers. This may cause two issues: (1) many small output files: if multiple partitions of a topic are assigned to different mappers, they cannot share output files. (2) task overhead: when multiple partitions of a topic are assigned to different mappers, a task is created for each partition, which may lead to a large number of tasks and large overhead.
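For intuition, here is a hedged sketch of what worst-fit-decreasing packing looks like; it is illustrative only and assumes nothing about Gobblin's actual classes:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class WorstFitDecreasing {
        // Assigns each workunit (largest first) to the mapper with the lightest load.
        public static List<List<Long>> pack(List<Long> workunitSizes, int numMappers) {
            List<Long> sorted = new ArrayList<>(workunitSizes);
            sorted.sort(Comparator.reverseOrder()); // "decreasing": biggest workunits first

            // Min-heap of {currentLoad, mapperIndex}; the head is the lightest mapper.
            PriorityQueue<long[]> mappers =
                    new PriorityQueue<>(Comparator.comparingLong((long[] m) -> m[0]));
            List<List<Long>> assignment = new ArrayList<>();
            for (int i = 0; i < numMappers; i++) {
                mappers.add(new long[] {0L, i});
                assignment.add(new ArrayList<>());
            }

            for (long size : sorted) {
                long[] lightest = mappers.poll(); // "worst fit": pick the emptiest mapper
                assignment.get((int) lightest[1]).add(size);
                lightest[0] += size;
                mappers.add(lightest);
            }
            return assignment;
        }
    }

Note how two partitions of the same topic would usually land on different mappers here, which is exactly the behavior the documentation describes.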
The second issue seems to contradict what we can read elsewhere in the documentation.
One paragraph higher, we can read:
For each partition, after the first and last offsets are determined, a workunit is created.
and here, in point 3 of https://gobblin.readthedocs.io/en/latest/Gobblin-Architecture/#gobblin-job-flow:
From the set of WorkUnits given by the Source, the job creates a set of tasks. A task is a runtime counterpart of a WorkUnit, which represents a logic unit of work. Normally, a task is created per WorkUnit
So, as far as I understand, there is always a task associated with a Kafka partition, unless WorkUnits are grouped together (in which case we have one task for many WorkUnits, and thus many partitions).
Am I misunderstanding something here, or does the second overhead in single-level packing make no sense?
Related
I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, such as exactly-once processing and at-least-once processing, and the settings are scattered here and there. There doesn't seem to be a single place that documents, even roughly, the properties that need to be configured for exactly-once and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I was mentioning before, what are the settings to be configured, at a minimum, to provide exactly-once, at-most-once, and at-least-once processing?
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e., on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.

There are two approaches to getting exactly once semantics during data production:

Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded

Include a primary key (UUID or something) in the message and deduplicate on the consumer.

If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
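As a rough illustration of the FAQ's last alternative, here is a hedged sketch using the standard kafka-clients consumer API; the topic name, the config values, and the in-memory `seen` set standing in for a durable store are all assumptions:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DedupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption
            props.put("group.id", "dedup-demo");              // assumption
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            // Stand-in for a durable store; in practice this would live in the
            // same storage system as the data itself.
            Set<String> seen = new HashSet<>();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // illustrative topic
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        // Dedup key from the topic/partition/offset combination.
                        String dedupKey = rec.topic() + "/" + rec.partition() + "/" + rec.offset();
                        if (seen.add(dedupKey)) { // false means this record was already loaded
                            // write rec.value() together with dedupKey to the target store
                        }
                    }
                }
            }
        }
    }

The point of the technique is that the dedup key and the payload are written to the same storage in one atomic operation, so no separate transaction is needed.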
I am trying to understand how Kafka Streams works under the hood (to get to know it a little better), and came across a Confluent link, which is really wonderful.
It mentions two terms, viz. StreamThreads and StreamTasks.
I am not able to understand what exactly a StreamTask is.
Is it executed by a StreamThread?
As per the doc, a StreamThread can have multiple StreamTasks, so won't there be data sharing, and won't this thread run slower? How does a StreamThread "run" multiple StreamTasks?
Any explanation in simple words would be of great help.
"Tasks" are a logical abstractions of work than can be done in parallel (ie, stuff that can be processed independent from each other). Kafka Streams basically creates a task for each input topic partition, because data in different partitions can processed independent from each other (it's a simplification, but holds if you have a single input topic; for joins it's a little bit different).
A StreamThread is basically a JVM thread. Tasks are assigned to StreamThreads for execution. In the current implementation, a StreamThread basically loops over all of its tasks and processes some amount of input data for each task. In between, the StreamThread (which uses a KafkaConsumer) polls the broker for new data for all of its assigned tasks.
Because tasks are independent of each other, you can run as many threads as there are tasks. In that case, each thread executes only a single task.
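A minimal sketch of how this is configured, using the standard Kafka Streams API; the application id, broker address, and topic name are illustrative:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class ThreadsAndTasks {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "threads-demo");      // illustrative
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // With an input topic of, say, 4 partitions, Kafka Streams creates 4 tasks;
            // the 2 StreamThreads below are each assigned 2 of those tasks and loop over them.
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("input-topic") // illustrative topic
                   .foreach((key, value) -> System.out.println(key + " -> " + value));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }

With 4 input partitions, setting num.stream.threads to 4 instead would give each thread exactly one task.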
To a large extent, getting throughput in a system built on Kafka rests on these degrees of freedom:
(highly recommended) messages should be share-nothing. If they are share-nothing, they can be randomly assigned to different partitions within a topic and processed independently of other messages
(highly recommended) the partition count per topic should be sized appropriately: more partitions per topic means greater possible levels of parallelism
(highly recommended) to avoid hotspots within a topic partition, the Kafka key may need to include time or some other varying data point so that a single partition does not unintentionally get the majority of the work (see the sketch after this list)
(helpful) the processing time per message should be small when possible
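To make the hotspot point concrete, here is a hedged sketch of salting a hot key with a time bucket, using the plain kafka-clients producer API; the topic, client id, and bucket size are assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SaltedKeyProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String clientId = "client-42";                     // illustrative hot key
                long bucket = System.currentTimeMillis() / 60_000; // one-minute time bucket
                // Salting the key with a varying time bucket spreads one client's burst
                // across partitions instead of hammering a single one.
                String saltedKey = clientId + "-" + bucket;
                producer.send(new ProducerRecord<>("events", saltedKey, "payload"));
            }
        }
    }

The trade-off, relevant to the ordering discussion below, is that salting sacrifices strict per-client ordering.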
https://dzone.com/articles/20-best-practices-for-working-with-apache-kafka-at mentions other items for fine-tuning these principles.
Now suppose that an otherwise OK system gets a lot of new work. For example, a new and large client may be added mid-day, or an existing client may need to onboard a new account, adding zillions of new events. How do we scale horizontally, adding new capacity for this work?
If the messages are truly share-nothing throughout the entire system --- I have a data pipeline of services where A gets a message, processes it, publishes a new message to another service B, and so on --- adding new capacity to the system could be as easy as sending a message on a separate administration topic telling the consumer task(s) to spin up new threads. Then, so long as the number of partitions in the topic(s) is not a bottleneck, one would indeed have added new processing capacity.
This approach is certainly doable, but it is still suboptimal in these respects:
Work on different clientIds is definitely share-nothing. Merely adding new threads gets the work done faster, but any new work would interleave behind and within the existing client work. Had a new topic been available, with a new pub/sub process pair(s), the new work could be done in parallel if the cluster has spare capacity on the new topic(s)
In general, share-nothing work may not always be possible at every step in a data pipeline. If ordering is ever required, adding new subscriber threads could get messages out of order for a given topic/partition. This happens when there are M partitions in a topic but more than M subscriber threads. I have one such order-sensitive case. It's worth noting that ordering effectively means at most 1 subscriber thread per partition, so sizing partitions may be even more important
The sysadmin may not allow tasks to add topics at runtime
Even if adding topics at runtime is possible, system orchestration is required to tell the various producers that a clientID is no longer associated with the old topic T, but rather with T'. Work in progress on T should be flushed before using T'
How does the Kafka community deal with adding capacity at runtime, or is this daydreaming? Dynamic, elastic horizontal capacity seems to broadly center on these principles:
have spare capacity on your cluster
have extra unused topics for greater parallelism; create them at runtime, or pre-create them and leave them unused if sysadmins don't allow dynamic creation
equip the system so that events for a given clientID can be intercepted before they enter the pipeline and deferred to a special queue; know when existing events for that clientID have flushed through the system; then update the config(s) to send the held/deferred events, and any new events for new clients, to the new topic
tell consumers to spin up more listeners
dynamically add more partitions? (I doubt that's possible or practical; see the sketch after this list)
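On that last point: adding partitions at runtime is in fact supported by Kafka's AdminClient, though it changes which partition a given key hashes to. A minimal sketch; the broker address and topic name are assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class GrowTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption

            try (AdminClient admin = AdminClient.create(props)) {
                // Raise the partition count of "events" to 12. Existing keys may
                // hash to different partitions afterwards, so per-key ordering is
                // disturbed across the change.
                admin.createPartitions(
                        Collections.singletonMap("events", NewPartitions.increaseTo(12)))
                     .all().get();
            }
        }
    }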
I have the following code.
My goal is to group messages by a given key and a 10 second window. I would like to count the total amount accumulated for a particular key in the particular window.
I read that I need to have caching enabled and also have a cache size declared. I am also advancing the wall clock to force the windowing to kick in and group the elements into two separate groups. You can see my expectations for the given code in the two assertions.
Unfortunately this code fails them and it does so in two ways:
it emits a result of the reduction operation each time it is executed, as opposed to using the cache on the store and sending a single total value
windows are not respected as can be seen by the output
Can you please explain what I am misunderstanding about the mechanics of Kafka Streams in this case?
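Since the code itself is not shown here, what follows is only a hedged reconstruction of the kind of topology being described; the topic names, serdes, and cache settings are all assumptions:

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.WindowedSerdes;

    public class WindowedTotals {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "window-demo");       // assumption
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            // The record cache rate-limits downstream updates; it is flushed on commit.
            props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024);
            props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5_000);

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input", Consumed.with(Serdes.String(), Serdes.Long()))
                   .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                   .windowedBy(TimeWindows.of(Duration.ofSeconds(10))) // 10-second tumbling windows
                   .reduce(Long::sum)                                  // running total per key+window
                   .toStream()
                   .to("output", Produced.with(
                           WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }

Two facts about the mechanics are worth keeping in mind here: the cache only reduces how often intermediate window updates are emitted (it does not guarantee a single final value per window), and windows are assigned by record (event-time) timestamps, so advancing the wall clock in a test does not by itself close a window.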
I want a clear picture of this.
I have 2000 records, but I limit it to 1000 records in the master for partitioning, using rownum with gridSize=250, and partition across 5 slaves running on 10 machines.
I assume 1000/250 = 4 steps will be created.
Will the data be sent to 4 slaves, leaving 1 slave idle? If the number of steps is greater than the number of slave Java processes, I assume the data would eventually be distributed across all slaves.
Once all steps are completed, will the slave Java process memory be freed (all objects freed from memory as the step exits)?
If all 1000/250 = 4 steps are completed, how can I start a new job instance to process the remaining 1000 records without the scheduler triggering the job?
Since you have not shown your Partitioner code, I will try to answer based only on assumptions.
You don't have to assume the number of steps ("I assume 1000/250 = 4 steps will be created"); it will be the number of entries you create in the java.util.Map<java.lang.String,ExecutionContext> that you return from the partition method of the Partitioner interface.
The partition method takes gridSize as an argument, and it's up to you whether to make use of this parameter, so if you decide to do partitioning based on some other parameter (instead of evenly distributing by count), you can do that. Ultimately, the number of partitions will be the number of entries in the returned map, and the values stored in each ExecutionContext can be used for fetching data in readers and so on.
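For illustration, here is a hedged sketch of such a Partitioner, treating gridSize as rows-per-partition to match the question; all names and context keys are made up:

    import java.util.HashMap;
    import java.util.Map;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    public class RowRangePartitioner implements Partitioner {
        private final int totalRows = 1000;

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            // With gridSize = 250 and 1000 rows, this creates 4 entries,
            // hence 4 worker steps.
            Map<String, ExecutionContext> partitions = new HashMap<>();
            int index = 0;
            for (int start = 1; start <= totalRows; start += gridSize) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putInt("minRow", start); // the reader queries using these bounds
                ctx.putInt("maxRow", Math.min(start + gridSize - 1, totalRows));
                partitions.put("partition" + index++, ctx);
            }
            return partitions;
        }
    }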
Next, you can choose how many steps are started in parallel by setting appropriate TaskExecutor and concurrencyLimit values; i.e., you might create 100 steps in the partitioner but want to start only 4 steps in parallel, and that can very well be achieved by configuration settings on top of the partitioner.
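A hedged configuration sketch of that idea, using the Spring Batch step builder API; the bean and step names are illustrative, and RowRangePartitioner is the sketch above:

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.context.annotation.Bean;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    public class PartitionConfig {
        @Bean
        public Step masterStep(StepBuilderFactory steps,
                               RowRangePartitioner partitioner,
                               Step workerStep) {
            SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor();
            executor.setConcurrencyLimit(4); // at most 4 partitions in flight at once
            return steps.get("masterStep")
                    .partitioner("workerStep", partitioner) // gridSize is passed to partition()
                    .step(workerStep)
                    .gridSize(250)                          // gridSize as used in the question
                    .taskExecutor(executor)
                    .build();
        }
    }

However many entries the partitioner returns, only concurrencyLimit worker steps run at the same time.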
Answer #1: As already pointed out, data distribution has to be coded by you in your reader, using the ExecutionContext information you created in the partitioner. It doesn't happen automatically.
Answer #2: I'm not sure exactly what you mean, but yes, everything gets freed after completion, and the information is saved in the metadata.
Answer #3: As already pointed out, all steps are created in one go for all the data. Which steps run on which data, and how many run in parallel, can be controlled by readers and configuration.
Hope it helps!!