Will Kafka operator semantics change when I rename an Apache Apex application? - apache-kafka

Am I correct in assuming that the moment we rename an application, the semantics of the Kafka operator will completely change and it might end up reading from the "initialOffset" set by the application code?
How are the semantics maintained with respect to the definition of "application name"?
Does every deployment of the application code result in a new application, or is the #ApplicationAnnotation(name="") instance simply used to define this?

You can always launch the application using -originalAppId; the operator should then continue from where the original app left off. If you are using the Kafka 0.9 operator and you launch the application with the same name, you can set initialOffset to "application_or_latest" or "application_or_earliest", so the operator will continue from the offsets that were processed in the last run. The difference is that with -originalAppId the offsets are restored from the checkpoint, while the other approach stores the offsets in Kafka itself.
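For illustration, here is a rough sketch of the setup the question and answer describe: an Apex application named via #ApplicationAnnotation whose Kafka 0.9 input operator has initialOffset set to "application_or_latest". The operator and setter names are based on the Apache Apex Malhar Kafka module, while the application name, broker list, and topic below are placeholder assumptions.

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

// Placeholder application name; the Kafka 0.9 operator keys its stored offsets on it.
@ApplicationAnnotation(name = "foo-app")
public class FooApplication implements StreamingApplication {

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    KafkaSinglePortInputOperator kafkaIn =
        dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
    kafkaIn.setClusters("localhost:9092");   // placeholder broker list
    kafkaIn.setTopics("logs");               // placeholder topic
    // Resume from the offsets processed in the last run of the same application name,
    // falling back to the latest offsets if no previous run exists.
    kafkaIn.setInitialOffset("application_or_latest");
    // ... connect downstream operators and streams here
  }
}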

You can launch an application from its previous state by using the -originalAppId parameter and providing the YARN application id from its previous run's checkpointed state; this applies to all operators within the DAG, including the Kafka input operator. You can also provide a new name for the application using the attribute dt.attr.APPLICATION_NAME.
e.g.:
launch pi-demo-3.4.0-incubating-SNAPSHOT.apa -originalAppId application_1459879799578_8727 -Ddt.attr.APPLICATION_NAME="pidemo v201"

Related

Pause message consumption and resume after time interval

We are using spring-cloud-stream to work with Kafka, and I need to add some interval between the consumer's fetches from a single topic.
batch-mode is already set to true, and fetch-max-wait, fetch-min-size, and max-poll-records are already tuned.
Should I do something with idleEventInterval, or is that the wrong way to go?
You can pause/resume the container as needed (avoiding a rebalance).
See https://docs.spring.io/spring-cloud-stream/docs/3.2.1/reference/html/spring-cloud-stream.html#binding_visualization_control
Since version 3.1 we expose org.springframework.cloud.stream.binding.BindingsLifecycleController, which is registered as a bean and, once injected, can be used to control the lifecycle of individual bindings.
If you only want to delay for a short time, you can set the container's idleBetweenPolls property, using a ListenerContainerCustomizer bean.
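A minimal sketch of both options, assuming a functional consumer binding named "myFunction-in-0" and a 10-second delay (both are placeholder values): pausing/resuming the binding through BindingsLifecycleController, and setting idleBetweenPolls via a ListenerContainerCustomizer bean.

import org.springframework.cloud.stream.binding.BindingsLifecycleController;
import org.springframework.cloud.stream.binding.BindingsLifecycleController.State;
import org.springframework.cloud.stream.config.ListenerContainerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;
import org.springframework.stereotype.Component;

@Configuration
class ConsumerPacingConfig {

  // Option 2: add a fixed delay between polls for every Kafka consumer binding.
  @Bean
  public ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> containerCustomizer() {
    return (container, destinationName, group) ->
        container.getContainerProperties().setIdleBetweenPolls(10_000L); // 10 s between polls
  }
}

// Option 1: pause/resume an individual binding on demand (no rebalance).
// "myFunction-in-0" is a placeholder binding name.
@Component
class ConsumerPacer {

  private final BindingsLifecycleController controller;

  ConsumerPacer(BindingsLifecycleController controller) {
    this.controller = controller;
  }

  void pause()  { controller.changeState("myFunction-in-0", State.PAUSED); }
  void resume() { controller.changeState("myFunction-in-0", State.RESUMED); }
}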

Flink StreamingEnvironment does not terminate when using Kafka as source

I am using Flink for a streaming application. When creating the stream from a Collection or a List, the application terminates and everything after "env.execute" gets executed normally.
I need to use a different source for the stream; more precisely, I use Kafka as a source (env.addSource(...)). In this case, the program just blocks on reaching the end of the stream.
I created an appropriate deserialization schema for my stream, with an extra event that signals the end of the stream.
I know that the isEndOfStream() condition is satisfied at that point (I have an appropriate message printed to the screen in this case).
At this point the program just stops and does nothing, so the commands that follow the "execute" line are not at my disposal.
I am using Flink 1.7.2 and the flink-connector-kafka_2.11, with Scala 2.11.12. I am executing using the IntelliJ environment and Maven.
While researching, I found a suggestion to throw an exception on reaching the end of the stream (using the schema's capabilities). That does not serve my goal because I also have more operators/commands within the execution of the environment that need to be executed (and they do get executed correctly at this point). If I chose to disrupt the program by throwing an exception, I would lose everything else.
After the execution line I use the getNetRuntime() function to measure the running time of my operators within the stream.
I need the StreamingEnvironment to end the way it does when using a List as a source. Is there a way, for example, to remove Kafka as a source at this point?
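For reference, a minimal sketch (in Java) of the kind of schema described above, using the generic DeserializationSchema interface and a hypothetical END_OF_STREAM sentinel record; note that the Kafka source only finishes once every consumed partition has returned true from isEndOfStream().

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

public class TerminatingStringSchema implements DeserializationSchema<String> {

  // Hypothetical sentinel value produced as the very last record.
  private static final String END_MARKER = "END_OF_STREAM";

  @Override
  public String deserialize(byte[] message) {
    return new String(message, StandardCharsets.UTF_8);
  }

  @Override
  public boolean isEndOfStream(String nextElement) {
    // Returning true tells the Kafka consumer to stop reading this partition.
    return END_MARKER.equals(nextElement);
  }

  @Override
  public TypeInformation<String> getProducedType() {
    return Types.STRING;
  }
}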

Is it possible to dynamically adjust the num.stream.threads configuration of kafka stream while the program is running?

I am running multiple instances of Kafka Streams in a service. I want to dynamically adjust the num.stream.threads configuration to control the priority of each instance while the program is running.
I didn't find a related method on the KafkaStreams class.
I wonder if there is any other way?
It's not possible to update the configuration of a KafkaStreams instance at runtime once you have created it (this applies not only to num.stream.threads, but to the other properties as well).
As a workaround, you could recreate a specific KafkaStreams instance by stopping the existing one and creating and starting a new one, without stopping other streams and without restarting your application. Whether that fits your needs depends on your specific use case.
This could be achieved in several ways. One of them: keep configs (like num.stream.threads) in a database per specific Kafka Streams flow, have each instance of your application fetch them from the database (e.g. every 10 minutes via a cron expression), and if any updates are found, stop the existing KafkaStreams and start a new one with the desired updated configs. If you have a single instance of the application, this can be achieved much more easily via REST.
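A minimal sketch of that workaround; the topology, base properties, and the trigger (cron job or REST call) are placeholders:

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class StreamsRestarter {

  private volatile KafkaStreams streams;
  private final Topology topology;      // built once, reused for every restart
  private final Properties baseConfig;  // application.id, bootstrap.servers, ...

  public StreamsRestarter(Topology topology, Properties baseConfig) {
    this.topology = topology;
    this.baseConfig = baseConfig;
  }

  // Called e.g. from a cron job or REST endpoint when a new thread count is fetched.
  public synchronized void restartWithThreads(int numStreamThreads) {
    if (streams != null) {
      streams.close();  // stop the old instance
    }
    Properties props = new Properties();
    props.putAll(baseConfig);
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, numStreamThreads);
    streams = new KafkaStreams(topology, props);
    streams.start();
  }
}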
Update: since kafka-streams 2.8.0
Since kafka-streams 2.8.0 you have the ability to add and remove stream threads at runtime, without recreating the stream (API to Start and Shut Down Stream Threads):
kafkaStreams.addStreamThread();
kafkaStreams.removeStreamThread();
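Both calls return an Optional<String> with the name of the thread that was started or shut down; a small usage sketch:

import java.util.Optional;

import org.apache.kafka.streams.KafkaStreams;

public class ThreadScaler {

  // Add one processing thread to a running KafkaStreams instance.
  public static void scaleUp(KafkaStreams kafkaStreams) {
    Optional<String> added = kafkaStreams.addStreamThread();
    added.ifPresent(name -> System.out.println("started stream thread " + name));
  }

  // Remove one processing thread from a running KafkaStreams instance.
  public static void scaleDown(KafkaStreams kafkaStreams) {
    Optional<String> removed = kafkaStreams.removeStreamThread();
    removed.ifPresent(name -> System.out.println("stopped stream thread " + name));
  }
}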
That is currently not possible.
If you want to change the number of threads, you need to stop the program with KafkaStreams#close(), create a new KafkaStreams instance with the updated configuration, and start the new instance with KafkaStreams#start().

Kafka - Topology change on redundant apps

Let's say I have two applications with the same applicationId "foo-processor" and the following setup:
streamsBuilder.table(fooTopic)
.groupBy(...)
.reduce(...)
Assume I now have some cases I don't want to handle, so I add a filter like this:
streamsBuilder.table(fooTopic)
.filter(...)
.groupBy(...)
.reduce(...)
During deployment, not all instances of the app are shut down and restarted at the same time. Therefore, instance #1 of foo-processor is restarted while instance #2 is still using the previous topology. What happens is that instance #1 gets this error:
java.lang.IllegalArgumentException: Assigned partition foo-processor-KTABLE-REDUCE-STATE-STORE-0000000006-repartition-2 for non-subscribed topic regex pattern; subscription pattern is foo-processor-KTABLE-REDUCE-STATE-STORE-0000000007-repartition|<topic>
I assume this is the expected behaviour, because the repartition topic might not contain the same events due to the different topology. That being said, I am wondering how I should handle a change in topology.
Does that mean the application is now different, so the applicationId should also change? If not, how should I handle topology changes when many instances of the same app are running?
Thanks!
If you want to change the topology, you need to use a new application.id -- running both in parallel with the same application.id is not supported.

Kafka consume message in reverse order

I use Kafka 0.10. I have a topic logs where my IoT devices post their logs; the key of my messages is the device-id, so all the logs of the same device are in the same partition.
I have an API /devices/{id}/tail-logs that needs to display the N last logs of one device at the moment the call is made.
Currently I have it implemented in a very inefficient way (but working): I start from the beginning (i.e. the oldest logs) of the partition containing the device's logs and read until I reach the current timestamp.
A more efficient way would be if I could get the current latest offset and then consume the messages backwards (I would need to filter out some messages to keep only those of the device I'm looking for).
Is it possible to do this with Kafka? If not, how can one solve this problem? (A heavier solution I can see would be to have Kafka Connect linked to an Elasticsearch cluster and then query Elasticsearch, but adding two more components for this seems a bit overkill...)
As you are on 0.10.2, I would recommend writing a Kafka Streams application. The application will be stateful, and the state will hold the last N records/logs per device-id; if new data is written to the input topic, the Kafka Streams application will just update its state (without the need to re-read the whole topic).
Furthermore, the application can also serve your requests (the /devices/{id}/tail-logs API) using the Interactive Queries feature.
Thus, I would not build a stateless application that has to recompute the answer for each request, but a stateful application that eagerly computes the result (and updates it automatically all the time) for all possible requests (i.e., for all device-ids) and just returns the already computed result when a request comes in.
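A rough sketch of such a stateful application, written against a newer Kafka Streams DSL than the 0.10.2 API mentioned above; the topic name, store name, application id, and the newline-delimited accumulator are illustrative assumptions:

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class TailLogsApp {

  static final int N = 100; // how many recent log lines to keep per device

  // Append the new line and keep only the last N lines (newline-delimited).
  static String appendAndTrim(String tail, String logLine) {
    String[] lines = (tail.isEmpty() ? logLine : tail + "\n" + logLine).split("\n");
    int from = Math.max(0, lines.length - N);
    return String.join("\n", Arrays.copyOfRange(lines, from, lines.length));
  }

  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // "logs" is keyed by device-id, so grouping by key needs no repartitioning.
    KStream<String, String> logs =
        builder.stream("logs", Consumed.with(Serdes.String(), Serdes.String()));

    logs.groupByKey()
        .aggregate(
            () -> "",
            (deviceId, logLine, tail) -> appendAndTrim(tail, logLine),
            Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("tail-logs-store")
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tail-logs");          // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

    new KafkaStreams(builder.build(), props).start();
    // The "tail-logs-store" state store can then be queried per device-id with
    // Interactive Queries to serve the /devices/{id}/tail-logs endpoint.
  }
}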