Kafka Streams: Program architecture for handling multiple streams

I am running multiple individual Kafka Streams instances. To this end I created a StreamManager that holds these streams. Below is the StreamManager class in essence.
public class StreamManager {

    public Map<String, BaseStream> streamMap = new HashMap<>();
    public Map<String, ReadOnlyKeyValueStore<Long, BaseModel>> storeMap = new HashMap<>();

    public StreamManager(String bootstrapServer) {
        initialize(bootstrapServer);
    }

    /**
     * Initialize streams. Right now hard coding creation of streams here.
     * @param bootstrapServer
     */
    public void initialize(String bootstrapServer) {
        Properties s1Props = this.GetStreamingProperties(bootstrapServer, "STREAM_1");
        Properties s2Props = this.GetStreamingProperties(bootstrapServer, "STREAM_2");
        BaseStream s1Stream = new CompositeInfoStream(new KStreamBuilder(), s1Props);
        BaseStream s2Stream = new ImcInfoStream(new KStreamBuilder(), s2Props);
        streamMap.put(s1Stream.storeName, s1Stream);
        streamMap.put(s2Stream.storeName, s2Stream);
    }

    /**
     * Start all streams.
     */
    public void startStreams() {
        for (BaseStream stream : streamMap.values()) {
            stream.start();
        }
    }

    public static void main(String[] args) throws Exception {
        StreamManager mgr = new StreamManager(StreamingConfig.instance().MESSAGE_SERVER);
        mgr.startStreams();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                mgr.closeStreams();
            }
        });
        int i = 0;
        while (true) {
            if (i++ == 0) mgr.logAllStreamStates();
            Thread.sleep(60000);
            if (i == 60) i = 0;
        }
    }
}
I initialize and start the streams and then let the process run in a loop. What I want now is more control over the individual streams, so I can start and kill them if need be (for some odd reason my streams often go into REBALANCING mode and don't come back). Currently, if one of the streams goes into REBALANCING, I have to kill the entire StreamManager (all streams) and restart it. What I would like to do is restart only the individual stream.
I would like to get a sense of how my architecture should look. Does Kafka Streams provide a mechanism to manage a cluster of streams? Can I use multiprocessing to accomplish this, and if so, could you point me to some resources, keeping in mind that we use Windows for development and Linux for deployment?
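One possible direction for per-stream control (a sketch, not from the original post): every KafkaStreams instance exposes setStateListener, so each managed stream can be watched independently and only the stuck one rebuilt. A closed KafkaStreams instance cannot be restarted, so "restart" here means closing it and constructing a fresh instance from the same topology. The StreamWatchdog class and its method names below are hypothetical.
import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.KafkaStreams;

// Hypothetical per-stream watchdog; assumes each managed stream can expose its
// KafkaStreams instance and can be rebuilt from scratch when it gets stuck.
public class StreamWatchdog {

    private volatile Instant rebalancingSince; // null while the stream is healthy

    public void attach(KafkaStreams streams) {
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.REBALANCING) {
                rebalancingSince = Instant.now();   // remember when rebalancing began
            } else if (newState == KafkaStreams.State.RUNNING) {
                rebalancingSince = null;            // recovered on its own
            }
        });
    }

    /** Called periodically by the manager loop; true means "rebuild just this stream". */
    public boolean stuckLongerThan(Duration limit) {
        Instant since = rebalancingSince;
        return since != null && Duration.between(since, Instant.now()).compareTo(limit) > 0;
    }
}
The StreamManager loop could then check stuckLongerThan(...) for each entry in streamMap and close and recreate only the offending instance, instead of killing the whole process.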

Related

Deleting element from Store after ExpirePolicy

Environment: I am running Apache Ignite v2.13.0 for the cache and the cache store is persisting to a Mongo DB v3.6.0. I am also utilizing Spring Boot (Java).
Question: When I have an expiration policy set, how do I remove the corresponding data from my persistent database?
What I have attempted: I tried to use CacheEntryExpiredListener, but my print statement never gets triggered. Is this the proper way to solve the problem?
Here is a sample bit of code:
@Service
public class CacheRemovalListener implements CacheEntryExpiredListener<Long, Metrics> {

    @Override
    public void onExpired(Iterable<CacheEntryEvent<? extends Long, ? extends Metrics>> events)
            throws CacheEntryListenerException {
        for (CacheEntryEvent<? extends Long, ? extends Metrics> event : events) {
            System.out.println("Received a " + event);
        }
    }
}
Use Continuous Query to get notifications about Ignite data changes.
ExecutorService mongoUpdateExecutor = Executors.newSingleThreadExecutor();

CacheEntryUpdatedListener<Integer, Integer> lsnr = new CacheEntryUpdatedListener<Integer, Integer>() {
    @Override
    public void onUpdated(Iterable<CacheEntryEvent<? extends Integer, ? extends Integer>> evts) {
        for (CacheEntryEvent<?, ?> e : evts) {
            if (e.getEventType() == EventType.EXPIRED) {
                // Use a separate executor to avoid blocking Ignite threads
                mongoUpdateExecutor.submit(() -> removeFromMongo(e.getKey()));
            }
        }
    }
};

var qry = new ContinuousQuery<Integer, Integer>()
        .setLocalListener(lsnr)
        .setIncludeExpired(true);

// Start receiving updates.
var cursor = cache.query(qry);

// Stop receiving updates.
cursor.close();
Note 1: EXPIRED events should be enabled explicitly with ContinuousQuery#setIncludeExpired.
Note 2: Query listeners should not perform any heavy/blocking operations. Offload that work to a separate thread/executor.
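The removeFromMongo call above is left undefined in the answer; a minimal sketch of what it might look like with the synchronous MongoDB Java driver follows. The connection string, database, collection, and field names are placeholders, and Spring Data could be used instead.
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

// Hypothetical helper class; assumes the expired cache key is stored in the
// document's "_id" field. Connection string and names are illustrative only.
class MongoRemover {

    private final MongoCollection<Document> metrics =
            MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("app")
                    .getCollection("metrics");

    void removeFromMongo(Object key) {
        metrics.deleteOne(Filters.eq("_id", key)); // delete the document backing the expired entry
    }
}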

Joining streams in Flink doesn't work with Kafka consumer

I'm trying to join two streams, one from a data collection, one consumed from Kafka.
code snippet
public static void main(String[] args) {
    KafkaSource<JsonNode> kafkaSource = ...
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Kafka messages: {"name": "John"}
    final DataStream<JsonNode> dataStream1 = env.fromSource(kafkaSource, waterMark(), "Kafka").rebalance()
            .assignTimestampsAndWatermarks(waterMark());
    final DataStream<String> dataStream2 = env.fromElements("John", "Zbe", "Abe")
            .assignTimestampsAndWatermarks(waterMark());

    dataStream1
            .join(dataStream2)
            .where(new KeySelector<JsonNode, String>() {
                @Override
                public String getKey(JsonNode value) throws Exception {
                    return value.get("name").asText();
                }
            })
            .equalTo(new KeySelector<String, String>() {
                @Override
                public String getKey(String value) throws Exception {
                    return value;
                }
            })
            .window(SlidingEventTimeWindows.of(Time.minutes(50) /* size */, Time.minutes(10) /* slide */))
            .apply(new JoinFunction<JsonNode, String, String>() {
                @Override
                public String join(JsonNode first, String second) throws Exception {
                    return first + " " + second;
                }
            })
            .print();

    env.execute();
}
watermark
private static <T> WatermarkStrategy<T> waterMark() {
    return new WatermarkStrategy<T>() {
        @Override
        public WatermarkGenerator<T> createWatermarkGenerator(
                org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier.Context context) {
            return new AscendingTimestampsWatermarks<>();
        }

        @Override
        public TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
            return (event, timestamp) -> System.currentTimeMillis();
        }
    };
}
After running the snippet, there is no joined data in the output. Am I going wrong somewhere?
Apache Flink version: 1.13.2
The problem is probably related to watermarking. Since you're not using event-time-based timestamps, try changing SlidingEventTimeWindows to SlidingProcessingTimeWindows and see if it then produces results.
The underlying problem is probably a lack of data. The rebalance() on the Kafka stream guarantees that idle partitions won't stall the watermarks unless all partitions are idle. But if this is an unbounded streaming job, unless you have some data that falls after the first window, the watermark won't advance far enough to trigger the first window.
Options:
Send some data with larger timestamps
Configure the Kafka source as a bounded stream by using the .setBounded(...) option on the KafkaSource builder (see the sketch below)
Stop the job using the --drain option (docs)
The fact that dataStream2 is bounded is also a problem, but I'm not sure how much of one. At best this will prevent any windows after the first one from producing any results (since datastream joins are inner joins).
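For the second option, a rough sketch of a bounded KafkaSource, assuming Flink's KafkaSource builder; the topic, group id, bootstrap servers, and deserializer below are placeholders rather than values from the question.
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Sketch only: a KafkaSource that stops at the offsets present when the job starts,
// so the job becomes bounded and the final windows can fire.
KafkaSource<String> boundedSource = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("input-topic")
        .setGroupId("join-job")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setBounded(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();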

How to deploy a verticle on a Web server/Application server?

I'm just beginning to learn Vert.x and how to code Verticles. I wonder if it makes any sense to deploy a Verticle from within an Application server or Web server like Tomcat. For example:
public class HelloVerticle extends AbstractVerticle {

    private final Logger logger = LoggerFactory.getLogger(HelloVerticle.class);
    private long counter = 1;

    @Override
    public void start() {
        vertx.setPeriodic(5000, id -> {
            logger.info("tick");
        });

        vertx.createHttpServer()
                .requestHandler(req -> {
                    logger.info("Request #{} from {}", counter++, req.remoteAddress().host());
                    req.response().end("Hello!");
                })
                .listen(9080);

        logger.info("Open http://localhost:9080/");
    }

    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        vertx.deployVerticle(new HelloVerticle());
    }
}
Obviously the main method would need to be replaced by some ContextListener or other trigger provided by the application server. Does it make any sense, or is Vert.x not supposed to be used in this context?
Thanks
Using Vert.x as a Verticle inside a Tomcat app doesn't make much sense from my POV, because it defeats the whole point of componentization.
On the other hand, you might want to simply connect to the Event Bus to send/publish/receive messages, and that is fairly easy to achieve.
I did it for a Grails (Spring Boot-based) project and put the Vert.x stuff inside a service like:
class VertxService {
    Vertx vertx

    @PostConstruct
    void init() {
        def options = [:]
        Vertx.clusteredVertx(options) { res ->
            if (res.succeeded())
                vertx = res.result()
            else
                System.exit(-1)
        }
    }

    void publish(addr, msg) { vertx.eventBus().publish(addr, msg) }
    //...
}
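If the verticle really has to be started by the container itself, a rough sketch (not from the answer above) of a ServletContextListener replacing the question's main method could look like this; the listener class name is made up.
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;
import io.vertx.core.Vertx;

// Hypothetical bootstrap listener: the web container creates and closes the
// Vert.x instance instead of a main() method.
@WebListener
public class VertxBootstrapListener implements ServletContextListener {

    private Vertx vertx;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        vertx = Vertx.vertx();
        vertx.deployVerticle(new HelloVerticle()); // HelloVerticle from the question
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        if (vertx != null) {
            vertx.close(); // release event-loop threads when the webapp is undeployed
        }
    }
}
Note that the verticle's own HTTP server on port 9080 would then run alongside Tomcat's connectors, which is part of why the answer above suggests limiting the integration to the Event Bus.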

Invoking Kafka Interactive Queries from inside a Stream

I have a particular requirement for invoking an Interactive Query from inside a stream. This is because I need to create a new stream which should contain the data held in the State Store. Truncated code below:
tempModifiedDataStream.to(topic.getTransformedTopic(), Produced.with(Serdes.String(), Serdes.String()));

GlobalKTable<String, String> myMetricsTable = builder.globalTable(
    topic.getTransformedTopic(),
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
        topic.getTransformedStoreName() /* table/store name */)
        .withKeySerde(Serdes.String()) /* key serde */
        .withValueSerde(Serdes.String()) /* value serde */
);

KafkaStreams streams = new KafkaStreams(builder.build(), kStreamsConfigs());

KStream<String, String> tempAggrDataStream = tempModifiedDataStream
    .flatMap((key, value) -> {
        try {
            List<KeyValue<String, String>> result = new ArrayList<>();
            ReadOnlyKeyValueStore<String, String> keyValueStore =
                streams.store(
                    topic.getTransformedStoreName(),
                    QueryableStoreTypes.keyValueStore());
In the last line, to access the State Store I need the KafkaStreams object, and the Topology is finalized when I create the KafkaStreams object. The problem with this approach is that 'tempAggrDataStream' is therefore not part of the Topology, so that part of the code does not get executed. And I can't move the KafkaStreams definition below it, because otherwise I can't call the Interactive Query.
I am a bit new to Kafka Streams, so is this something silly on my side?
If you want to send the whole content of the topic after each data modification, I think you should rather use the Processor API.
You could create an org.apache.kafka.streams.kstream.Transformer with a state store.
For each processed message it will update the state store and forward the entire content downstream.
It is not very efficient, because for each processed message it forwards the whole content of the topic/state store (which can be thousands or millions of records).
If you only need the latest value, it is enough to set your topic's cleanup.policy to compact and, on the other side, use a KTable, which gives you a table abstraction (a snapshot of the stream).
Sample Transformer code for forwarding the whole content of the state store is as follows. All the work is done in the transform(String key, String value) method.
public class SampleTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private String stateStoreName;
    private KeyValueStore<String, String> stateStore;
    private ProcessorContext context;

    public SampleTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        stateStore = (KeyValueStore) context.getStateStore(stateStoreName);
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        stateStore.put(key, value);
        stateStore.all().forEachRemaining(keyValue -> context.forward(keyValue.key, keyValue.value));
        return null;
    }

    @Override
    public void close() {
    }
}
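A rough sketch of how such a transformer can be wired into the Streams DSL; the store and topic names below are placeholders, not taken from the question.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// Register the state store with the builder, then attach the transformer to a stream.
StreamsBuilder builder = new StreamsBuilder();

StoreBuilder<KeyValueStore<String, String>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("my-store"),
                Serdes.String(),
                Serdes.String());
builder.addStateStore(storeBuilder);

builder.<String, String>stream("input-topic")
        .transform(() -> new SampleTransformer("my-store"), "my-store")
        .to("output-topic");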
More information about the Processor API can be found here:
https://docs.confluent.io/current/streams/developer-guide/processor-api.html
https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
How to combine the Processor API with the Streams DSL is described here:
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration

Creating an Esper long running process or service

I'd like to create a long-running Esper engine process, but I'm not sure of Esper's threading model nor the model I should implement to do this. Naively, I tried the following:
public class EsperTest {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        //EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider();
        EPServiceProvider epService = EPServiceProviderManager.getProvider("CoreEngine");

        epService.addServiceStateListener(new EPServiceStateListener() {
            @Override
            public void onEPServiceDestroyRequested(EPServiceProvider epsp) {
                System.out.println("Service destroyed");
            }

            @Override
            public void onEPServiceInitialized(EPServiceProvider epsp) {
                System.out.println("System initialised");
            }
        });

        epService.initialize();
    }
}
But the code appears to execute to the end of the main() method and the JVM ends.
Referring to the Esper documentation, section 14.7 p456:
In the default configuration, each engine instance maintains a single timer thread (internal timer) providing for time or schedule-based processing within the engine. The default resolution at which the internal timer operates is 100 milliseconds. The internal timer thread can be disabled and applications can instead send external time events to an engine instance to perform timer or scheduled processing at the resolution required by an application.
Consequently I thought that by creating an engine instance ("CoreEngine") at least one (timer) thread would be created and, assuming this is not a daemon thread, the main() method would not complete, but this appears not to be the case.
Do I have to implement my own infinite loop in main(), or is there a configuration that can be provided to Esper which will allow it to run 'forever'?
The timer thread is a daemon thread.
Instead of a loop, use a latch like this:
final CountDownLatch shutdownLatch = new CountDownLatch(1);

Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() {
        shutdownLatch.countDown();
    }
});

shutdownLatch.await();
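Folded back into the question's EsperTest, a minimal sketch of the main method might look like this; the engine name is kept from the question, and calling destroy() on shutdown is an assumption about the desired cleanup.
import java.util.concurrent.CountDownLatch;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;

public class EsperTest {

    public static void main(String[] args) throws InterruptedException {
        EPServiceProvider epService = EPServiceProviderManager.getProvider("CoreEngine");
        epService.initialize();

        // The engine's internal timer thread is a daemon, so main must block itself.
        final CountDownLatch shutdownLatch = new CountDownLatch(1);
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                shutdownLatch.countDown(); // released on Ctrl-C / SIGTERM
            }
        });

        shutdownLatch.await();  // keep the JVM alive until a shutdown signal arrives
        epService.destroy();    // tear down the engine on the way out
    }
}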