I'm a beginner Kafka and Flink enthusiast.
I noticed something troubling. When I increase the parallelism of a Kafka job to anything more than 1, none of the windows ever execute. I want to use parallelism to increase analysis speed.
Look at the image examples from the Apache Flink Web Dashboard, which visualize the issue.
This is the exact same code and the exact same ingested data set; the only difference is the parallelism. In the first example the ingested data flows through the window functions, but when the parallelism is increased the data just piles up in front of the first window function, which never executes. It stays like this forever and never produces any error.
The source used in the code is KafkaSource; FlinkKafkaConsumer seems to work fine with the same setup, but it is deprecated, so I'd rather not use it.
Thanks for any ideas!
The issue is almost certainly that the Kafka topic being consumed has fewer partitions than the configured parallelism. The new KafkaSource handles this situation differently than FlinkKafkaConsumer did.
An event-time window waits for the arrival of a watermark indicating that the stream is now complete up through the end-time of the window. When your KafkaSource operator has 10 instances, some of which aren't receiving any data, those idle instances are holding back the watermark. Basically, Flink doesn't know that those instances aren't expected to ever produce data -- instead it's waiting for them to be assigned work to do.
You can fix this by doing one of the following:
Reduce Flink's parallelism to be less than or equal to the number of Kafka partitions.
Configure your WatermarkStrategy to use withIdleness(duration) so that the idle instances will recognize that they aren't doing anything, and (temporarily) remove themselves from being involved with watermarking. (And if those instances are ever assigned splits/partitions to consume, they'll resume doing watermarking.)
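Here is a minimal sketch of the second option with the new KafkaSource. The servers, topic, group id, and durations are placeholders, and the out-of-orderness bound is an assumption; plug in your own job's keyBy/window logic where indicated.
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IdleSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")          // placeholder
                .setTopics("events")                            // placeholder
                .setGroupId("my-group")                         // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // withIdleness lets source instances that receive no data mark themselves
        // idle, so they stop holding back the watermark until they get a split.
        WatermarkStrategy<String> watermarks = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(10))  // assumed bound
                .withIdleness(Duration.ofMinutes(1));

        env.fromSource(source, watermarks, "Kafka Source")
           // ... keyBy / window / process as in your job ...
           .print();

        env.execute("idle-source-example");
    }
}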
Related
I would like to write a simple Flink application that reads from a Kafka queue and processes the message and stores the output to an external system, with at least once semantics and without using checkpoints. I would like to avoid checkpoints because if the Kafka offsets are checkpointed, then all intermediate state will have to be checkpointed as well. In other words, I want the application to be as stateless as possible.
The way I envision at least once to work is the following:
a source reads from kafka
processing happens
the output is stored to the external system
the message is acknowledged to kafka
Note that:
If 2. or 3. fail, and the app restarts, the same message will be processed again (good)
If 2. and 3. succeed, 4. fails and the app restarts, we will have stored the result twice (acceptable)
Based on https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration, the only way to get at least once (or the stronger exactly once) guarantees is by using checkpoints.
It seems that the core of the issue is that 4. needs to communicate back to 1. to ack to Kafka, which cannot happen in standard Flink, but should be possible using stateful functions.
To put it all together, the question is:
Is it possible to achieve at-least-once semantics using Kafka in Flink without using checkpoints?
The documentation you already linked says:
"Checkpointing disabled: if checkpointing is disabled, the Flink Kafka Consumer relies on the automatic periodic offset committing capability of the internally used Kafka clients. Therefore, to disable or enable offset committing, simply set the enable.auto.commit / auto.commit.interval.ms keys to appropriate values in the provided Properties configuration."
As your goal is to disable checkpointing, you could set
enable.auto.commit=true
auto.commit.interval.ms=??? // use a time high enough such that your steps 2. and 3. are covered.
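A sketch of how those properties might be wired into the Flink Kafka consumer described by that documentation. The topic, group id, servers, and interval value are placeholders; FlinkKafkaConsumer is used here because it is the connector the linked docs describe.
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AtLeastOnceWithoutCheckpoints {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder
        props.setProperty("group.id", "my-at-least-once-app");     // placeholder
        // No checkpointing: rely on the Kafka client's periodic auto-commit.
        props.setProperty("enable.auto.commit", "true");
        // Pick an interval comfortably larger than the time steps 2. and 3. take.
        props.setProperty("auto.commit.interval.ms", "60000");     // placeholder value

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
           // ... step 2: processing, step 3: write to the external system ...
           .print();

        env.execute("at-least-once-without-checkpoints");
    }
}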
The performance tuning documentation provided by Storm states that, for the absolute best performance, scaling out with multiple parallel topologies can yield better results than simply scaling workers.
I am trying to benchmark this theory against scaling workers.
However, using version 1.2.1, the Storm Kafka spout is not behaving as I would have expected across multiple different topologies.
Even when setting a common client.id and group.id for the Kafka spout consumer across all topologies for a single topic, each topology still subscribes to all available partitions and duplicates tuples, with errors being thrown as already committed tuples are recommitted.
I am surprised by this behaviour as I assumed that the consumer API would support this fairly simple use case.
I would be really grateful if somebody could explain:
What is the implementation logic behind this behaviour of the Kafka spout?
Is there any way around this problem?
The default behavior for the spout is to assign all partitions for a topic to workers in the topology, using the KafkaConsumer.assign API. This is the behavior you are seeing. With this behavior, you shouldn't be sharing group ids between topologies.
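To illustrate why a shared group.id does not help here, consider the plain Kafka consumer API (this is an illustration, not the spout's internal code; servers, topic, and partitions are placeholders): assign() bypasses group management entirely, so the coordinator never splits partitions between consumers, even if they share a group.id.
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignVsSubscribe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "shared-group");             // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // assign(): each consumer reads exactly the partitions it was handed,
        // regardless of group.id -- two topologies doing this both read everything.
        consumer.assign(Arrays.asList(
                new TopicPartition("my-topic", 0),
                new TopicPartition("my-topic", 1)));

        // In contrast, subscribe() uses group management, so partitions would be
        // split across all consumers sharing the same group.id:
        // consumer.subscribe(java.util.Collections.singletonList("my-topic"));

        consumer.close();
    }
}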
If you want finer control over which partitions are assigned to which workers or topologies, you can implement the TopicFilter interface, and pass it to your KafkaSpoutConfig. This should let you do what you want.
Regarding running multiple topologies being faster, I'm assuming you're referring to this section from the docs: "In multiworker mode, messages often cross worker process boundaries. For performance sensitive cases, if it is possible to configure a topology to run as many single-worker instances [...] it may yield significantly better throughput and latency." The objective here is to avoid sending messages between workers, and instead keep each partition's processing internal in one worker. If you want to avoid running many topologies, you could look at customizing the Storm scheduler to make it allocate e.g. one full copy of your pipeline in each worker. That way, if you use localOrShuffleGrouping, there will always be a local bolt to send to, so you don't have to go over the network to another worker.
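As a sketch of that last idea (the bootstrap servers, topic, bolt, and parallelism values are placeholders): run the spout and bolt with matching parallelism and wire them with localOrShuffleGrouping, so each tuple stays inside its worker whenever a local downstream task exists.
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class LocalGroupingTopology {
    // Trivial stand-in bolt; replace with your real processing bolt.
    public static class NoOpBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // process the tuple here
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no downstream output in this sketch
        }
    }

    public static void main(String[] args) {
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("localhost:9092", "my-topic").build(); // placeholders

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 4);
        // localOrShuffleGrouping prefers a bolt task inside the same worker,
        // so tuples avoid crossing worker boundaries when a local task exists.
        builder.setBolt("process-bolt", new NoOpBolt(), 4)
               .localOrShuffleGrouping("kafka-spout");
        // builder.createTopology() would then be submitted, e.g. with one full
        // copy of the pipeline per worker as described above.
    }
}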
While working to adapt Java's KafkaIOIT to work with a large dataset I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness and at the same time check the performance of KafkaIO.Write and KafkaIO.Read.
To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam repo (here).
The expected result would be that first the records are generated in a deterministic way, next they are written to Kafka - this concludes the write pipeline.
As for reading and correctness checking - first, the data is read from the topic and after being decoded into String representations, a hashcode of the whole PCollection is calculated (For details, check KafkaIOIT.java).
During the testing I ran into several problems:
When the predetermined number of records is read from the Kafka topic, the hash is different each time.
Sometimes not all the records are read and the Dataflow task waits for the input indefinitely, occasionally throwing exceptions.
I believe there are two possible causes of this behavior:
either there is something wrong with the Kafka cluster configuration
or KafkaIO behaves erratically on high data volumes, duplicating and/or dropping records.
I found a Stack answer that I believe might explain the first behavior:
link - if messages are delivered more than once, it's obvious that the hash of the whole collection would change.
In this case, I don't really know how to configure KafkaIO.Write in Beam to produce exactly once.
This leaves the issue of messages being dropped unsolved. Can you help?
As mentioned in the comments, a practical approach would be to start small and see if this is a problem that only appears when scaling up.
E.g. start with 10 messages and keep multiplying the number until you see something strange.
Furthermore, one thing that stands out is that you send data to a topic and check the hash after reading from the topic. However, you do not mention partitions: is it possible that you are in fact seeing different results because there are multiple partitions?
Kafka guarantees order within a partition.
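If the goal is only to verify correctness, one way to make the check insensitive to read order and partition interleaving is to hash each record individually and combine the hashes with a commutative operation. This is a sketch, not the KafkaIOIT code; the class name and the Create stand-in for the records read back from Kafka are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class OrderInsensitiveFingerprint {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Stand-in for the records read back from Kafka.
        PCollection<String> records = p.apply(Create.of("record-1", "record-2", "record-3"));

        // Per-element hash, then a commutative/associative combine: the result is
        // the same no matter how partitions are interleaved during the read.
        PCollection<Long> fingerprint = records
                .apply(MapElements.into(TypeDescriptors.longs())
                        .via((String s) -> (long) s.hashCode()))
                .apply(Sum.longsGlobally());

        p.run().waitUntilFinish();
    }
}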
I have a simple dataflow pipeline (job id 2018-05-15_06_17_40-8591349846083543299) with 1 min worker and 7 max workers that does the following:
Consume from 4 Kafka topics using KafkaIO. Each topic is represented differently and is a separate PCollection
Perform transformation on each PCollection to output a standard representation PCollection.
Merge the 4 PCollection using Flatten.pCollections
Window into hourly windows with the following trigger (see the wiring sketch after this list):
Repeatedly
    .forever(
        AfterFirst.of(
            AfterPane.elementCountAtLeast(40000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))
        )
    )
    .orFinally(AfterWatermark.pastEndOfWindow())
Write these events to GCS using AvroIO windowed writes with 14 shards.
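For reference, a sketch of how a trigger like this might be attached to the hourly window. The allowed-lateness and accumulation settings are assumptions, not taken from the actual job, and the method is generic over the merged element type.
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class HourlyWindowing {
    // Applies the hourly window and trigger to the merged PCollection.
    static <T> PCollection<T> windowHourly(PCollection<T> merged) {
        return merged.apply("HourlyWindow",
                Window.<T>into(FixedWindows.of(Duration.standardHours(1)))
                        .triggering(Repeatedly
                                .forever(AfterFirst.of(
                                        AfterPane.elementCountAtLeast(40000),
                                        AfterProcessingTime.pastFirstElementInPane()
                                                .plusDelayOf(Duration.standardMinutes(5))))
                                .orFinally(AfterWatermark.pastEndOfWindow()))
                        .withAllowedLateness(Duration.standardHours(1))  // assumption, not from the job
                        .discardingFiredPanes());                        // assumption, not from the job
    }
}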
When the pipeline is launched, initially everything is fine, but several hours later the System Lag increases dramatically in the AvroIO:GroupIntoShards step.
Upon further investigation, one of the topics is lagging behind by many hours (this topic has the most events per second compared to the other 3). Looking at the logs I see Closing idle reader for S12-000000000000000a, which is understandable. However, the topic's consumer group offsets for the 36 partitions are in a state where for some partitions the offset is very low, while for others it is very high. The log-end-offset is more or less evenly distributed, and the records we are producing are around the same size.
Questions:
If the System Lag is high in a certain step, does that prevent the Kafka consumers from consuming?
Any possible reason for the uneven distribution in Kafka offsets?
The PCollections that are merged have different traffic patterns, some low and some high. Would adding the AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)) trigger effectively start writing to GCS for each (window, shard) 5 minutes after an event is first seen in a window?
Updating the pipeline using the same code / configuration brings it back into a normal state where the consumed rate is much higher (due to the lag before the restart) than the produced rate.
Addressing 3 questions raised (I left a comment about the specific job):
No, system lag does not prevent Kafka from consuming.
In general, if there is a lot of work ready to be processed by downstream stages, that can delay upstream work from starting. But that is not KafkaIO specific.
Does not seem to be the case here. In general, assuming there is no skew among the Kafka partitions themselves, heavy skew in Beam processing can cause readers to be assigned to workers that are doing more work than others.
I think yes. I think firstElementInPane() applies to the first element from any of the sources.
It works fine when the source topic partition count = 1. If I bump up the partitions to any value > 1, I see the error below. This applies to both the low-level API and the DSL API. Any pointers? What could be missing?
org.apache.kafka.streams.errors.StreamsException: stream-thread [StreamThread-1] Failed to rebalance
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:410)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:242)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [0_1] Store in-memory-avg-store's change log (cpu-streamz-in-memory-avg-store-changelog) does not contain partition 1
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.register(ProcessorStateManager.java:185)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.register(ProcessorContextImpl.java:123)
at org.apache.kafka.streams.state.internals.InMemoryKeyValueStoreSupplier$MemoryStore.init(InMemoryKeyValueStoreSupplier.java:102)
at org.apache.kafka.streams.state.internals.InMemoryKeyValueLoggedStore.init(InMemoryKeyValueLoggedStore.java:56)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.init(MeteredKeyValueStore.java:85)
at org.apache.kafka.streams.processor.internals.AbstractTask.initializeStateStores(AbstractTask.java:81)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:119)
at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:633)
at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:660)
at org.apache.kafka.streams.processor.internals.StreamThread.access$100(StreamThread.java:69)
at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:124)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:407)
It's an operational issue. Kafka Streams does not allow changing the number of input topic partitions during its "lifetime".
If you stop a running Kafka Streams application, change the number of input topic partitions, and restart your app, it will break (with the error you see above). It is tricky to fix this for production use cases, and it is highly recommended not to change the number of input topic partitions (cf. the comment below). For POCs/demos it's not difficult to fix, though.
In order to fix this, you should reset your application using Kafka's application reset tool:
http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
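Note that the reset tool handles the broker-side cleanup (internal topics, committed offsets); the application's local state directory must be wiped separately, e.g. via KafkaStreams#cleanUp(). Here is a sketch using the current StreamsBuilder API, where the application id, servers, topics, and serdes are placeholders.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ResetAndRestart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");  // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // The reset tool cleans up the broker side; cleanUp() wipes the local
        // state directory. Run it once, before the first start after the reset.
        streams.cleanUp();
        streams.start();
    }
}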
Using the application reset tool has the disadvantage that you wipe out your whole application state. Thus, in order to get your application into the same state as before, you need to reprocess the whole input topic from the beginning. This is of course only possible if all input data is still available and nothing was deleted by brokers applying the topic retention time/size policy.
Furthermore, you should note that adding partitions to input topics changes the topic's partitioning scheme (by default, hash-based partitioning by key). Because Kafka Streams assumes that input topics are correctly partitioned by key, if you use the reset tool and reprocess all data, you might get wrong results, as "old" data is partitioned differently than "new" data (i.e., data written after adding the new partitions). For production use cases, you would need to read all data from your original topic and write it into a new topic (with an increased number of partitions) to get your data partitioned correctly (of course, this step might change the ordering of records with different keys -- which should not usually be an issue, but is worth mentioning). Afterwards you can use the new topic as the input topic for your Streams app.
This repartitioning step can also be done easily within your Streams application by using the through("new_topic_with_more_partitions") operator directly after reading the original topic and before doing any actual processing.
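A sketch of that repartitioning step (topic names and the trailing aggregation are placeholders; through() matches the API this answer refers to and has since been superseded by repartition() in newer releases):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class RepartitionBeforeProcessing {
    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> input = builder.stream("original-topic",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Write to (and read back from) a pre-created topic that already has
        // the higher partition count, before any stateful processing runs.
        KStream<String, String> repartitioned =
                input.through("new_topic_with_more_partitions");

        // Stand-in for the actual processing (e.g. the in-memory aggregation).
        repartitioned.groupByKey().count();

        return builder.build();
    }
}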
In general, however, it is recommended to over-partition your topics for production use cases, so that you never need to change the number of partitions later on. The overhead of over-partitioning is rather small and saves you a lot of hassle later on. This is a general recommendation if you work with Kafka -- it's not limited to Streams use cases.
One more remark:
Some people might suggest increasing the number of partitions of Kafka Streams' internal topics manually. First, this would be a hack and is not recommended, for the following reasons:
It might be tricky to figure out what the right number is, as it depends on various factors (it's a Streams-internal implementation detail).
You also face the problem of breaking the partitioning scheme, as described in the paragraph above. Thus, your application would most likely end up in an inconsistent state.
In order to avoid an inconsistent application state, Streams does not delete any internal topics or change the number of partitions of internal topics automatically, but instead fails with the error message you reported. This ensures that the user is aware of all implications of doing the "cleanup" manually.
Btw: for the upcoming Kafka 0.10.2 release, this error message has been improved: https://github.com/apache/kafka/blob/0.10.2/streams/src/main/java/org/apache/kafka/streams/processor/internals/InternalTopicManager.java#L100-L103