What is the default Window and Trigger for Apache Storm when using ReduceByKey without specifying a window or trigger?

I implemented a topology in Storm where I read data from Kafka:
KafkaSpoutConfig<String, String> conf = KafkaSpoutConfig.builder("kafka:9092", "test_input")
        .setRecordTranslator(new Translator(), new Fields("timestamp", "value"))
        .build();
StreamBuilder builder = new StreamBuilder();
Stream<Tuple> stream = builder.newStream(new KafkaSpout<String, String>(conf));
Then I parse my data, call a custom reduceByKey function without specifying a window or a trigger, and send the result back to a Kafka topic:
PairStream<Integer, Game> gameStream = stream.mapToPair(row -> mapGame(row));
gameStream.reduceByKey(new GameReducer<Object>()).to(bolt);
I'm wondering how Storm triggers the aggregation on the unbounded stream. How does Apache Storm window and trigger the reduce function? Is the aggregate simply updated whenever new data arrives, i.e. effectively an element-count(1) trigger? I can't find any relevant information about this on Google or in the docs.
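For reference, this is roughly how a window could be specified explicitly before the reduce, reusing the identifiers from the snippet above (a sketch only; the ten-tuple tumbling count window is an arbitrary illustration, not the documented default):
import org.apache.storm.streams.windowing.TumblingWindows;
import org.apache.storm.topology.base.BaseWindowedBolt.Count;

PairStream<Integer, Game> gameStream = stream.mapToPair(row -> mapGame(row));
gameStream
        // explicit tumbling count window; the question is what happens without this call
        .window(TumblingWindows.of(new Count(10)))
        .reduceByKey(new GameReducer<Object>())
        .to(bolt);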

Related

Write to multiple topics in Kafka from Flink dynamically: exception handling

I am currently reading data from a single Kafka topic and writing to a dynamic topic based on the data itself. I have implemented the following code, which dynamically selects the topic based on the accountId of the data, and it is working just fine:
class KeyedEnrichableEventSerializationSchema(schemaRegistryUrl: String)
    extends KafkaSerializationSchema[KeyedEnrichableEvent]
    with KafkaContextAware[KeyedEnrichableEvent] {

  private val enrichableEventClass = classOf[EnrichableEvent]
  private val enrichableEventSerialization: AvroSerializationSchema[EnrichableEvent] =
    ConfluentRegistryAvroSerializationSchema.forSpecific(enrichableEventClass, enrichableEventClass.getCanonicalName, schemaRegistryUrl)

  override def serialize(element: KeyedEnrichableEvent, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] =
    new ProducerRecord("trackingevents." + element.value.getEventMetadata.getAccountId, element.key, enrichableEventSerialization.serialize(element.value))

  override def getTargetTopic(element: KeyedEnrichableEvent): String =
    "trackingevents." + element.value.getEventMetadata.getAccountId
}
The problem/concern is: if the topic does not exist, I get an exception in the JobManager UI about the topic not being present, and the whole processing is halted until I create the topic. Maybe this is the recommended behaviour, but is there an alternative, e.g. putting the data into a different topic and resuming processing instead of stopping everything? Or, at least, is there any notification mechanism in Flink that immediately reports that processing has halted?
I think the simplest solution in this case is to enable auto-creation of topics in Kafka, which solves the problem entirely.
However, if for some reason that's impossible in your case, the simplest solution IMHO would be to create a ProcessFunction that keeps a connection to Kafka using KafkaConsumer or AdminClient and periodically checks whether the topic that would be used for a given message exists. You could then either push the data to a side output or try to create the missing topics, as in the sketch below.
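A minimal sketch of that idea (in Java rather than the original Scala; it assumes it runs on the EnrichableEvent stream before the sink, and the class name, broker address, topic prefix and refresh interval are illustrative assumptions, not part of the answer above):
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import org.apache.kafka.clients.admin.AdminClient;

// Routes events whose target topic is missing to a side output instead of failing the job.
public class TopicExistenceCheckFunction extends ProcessFunction<EnrichableEvent, EnrichableEvent> {

    public static final OutputTag<EnrichableEvent> MISSING_TOPIC =
            new OutputTag<EnrichableEvent>("missing-topic") {};

    private transient AdminClient adminClient;
    private transient Set<String> knownTopics;
    private transient long lastRefreshMillis;

    @Override
    public void open(Configuration parameters) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // assumption: same brokers as the sink
        adminClient = AdminClient.create(props);
        refreshTopics();
    }

    @Override
    public void processElement(EnrichableEvent value, Context ctx, Collector<EnrichableEvent> out) throws Exception {
        // refresh the cached topic list at most once per minute (arbitrary interval)
        if (System.currentTimeMillis() - lastRefreshMillis > TimeUnit.MINUTES.toMillis(1)) {
            refreshTopics();
        }
        String targetTopic = "trackingevents." + value.getEventMetadata().getAccountId();
        if (knownTopics.contains(targetTopic)) {
            out.collect(value);                 // topic exists, continue to the Kafka sink
        } else {
            ctx.output(MISSING_TOPIC, value);   // divert to a side output for later handling
        }
    }

    private void refreshTopics() throws Exception {
        knownTopics = adminClient.listTopics().names().get();
        lastRefreshMillis = System.currentTimeMillis();
    }

    @Override
    public void close() throws Exception {
        if (adminClient != null) {
            adminClient.close();
        }
    }
}
The main output of process(...) would continue to the Kafka sink, while getSideOutput(TopicExistenceCheckFunction.MISSING_TOPIC) could be written to a fallback topic or used for alerting.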

Kafka Streams Processor API: clear state store

I am using the Kafka Streams Processor API to do some custom calculations. Because of some complex processing, the DSL was not the best fit. The stream code looks like the one below.
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("storeName");
StoreBuilder<KeyValueStore<String, StoreObject>> storeBuilder = Stores.keyValueStoreBuilder(storeSupplier,
        Serdes.String(), storeObjectSerde);

topology.addSource("SourceReadername", stringDeserializer, sourceSerde.deserializer(), "sourceTopic")
        .addProcessor("processor", () -> new CustomProcessor("store"), "SourceReadername")
        .addStateStore(storeBuilder, "processor") // attach the store to the processor
        .addSink("sinkName", "outputTopic", stringSerializer, resultSerde.serializer(), "processor");
I need to clear some items from the state store based on events coming in on a separate topic. I cannot find the right way to join with another stream using the Processor API, or some other way to listen to events on another topic, so that I can trigger the cleanup code in the CustomProcessor class.
Is there a way to consume events from another topic with the Processor API? Or can I mix the DSL with the Processor API so that I can join the two and route events from either topic to the process method, running the cleanup code whenever an event arrives on the cleanup topic?
Thanks
You just need to add another input topic (addSource) and a processor that handles messages from that topic and, based on them, removes entries from the state store; a sketch is below. One note: both topics should use the same keys (because of partitioning).
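A minimal sketch of that wiring, building on the topology above (the cleanup topic name, its serde, and the CleanupProcessor class are hypothetical, and the classic Processor<K, V> interface is assumed):
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// second source + processor that share the existing state store
topology.addSource("CleanupSource", stringDeserializer, cleanupSerde.deserializer(), "cleanupTopic")
        .addProcessor("cleanupProcessor", () -> new CleanupProcessor("storeName"), "CleanupSource")
        // give the cleanup processor access to the store attached to "processor" above
        .connectProcessorAndStateStores("cleanupProcessor", "storeName");

// deletes entries from the shared store whenever a cleanup event arrives
public class CleanupProcessor implements Processor<String, CleanupEvent> {
    private final String storeName;
    private KeyValueStore<String, StoreObject> store;

    public CleanupProcessor(String storeName) {
        this.storeName = storeName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, StoreObject>) context.getStateStore(storeName);
    }

    @Override
    public void process(String key, CleanupEvent value) {
        store.delete(key); // same key space as the main topic, so the delete hits the right task
    }

    @Override
    public void close() {}
}
Because both topics use the same keys (and the same partitioning), a cleanup event for a given key is processed by the same task instance that owns that key's store entries.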

How to send multiple (different) tuples from one KafkaSpout at once to the bolt?

I am a novice in Apache Storm.
I am trying to develop a real-time stream processing system using Apache Kafka, Storm and the Esper CEP engine.
For that, I have one KafkaSpout that emits streams to bolts (which contain my CEP queries) to filter the stream.
I have already created a topology and I am trying to run it on a local cluster.
The problem is that the CEP query running in my bolts requires batches of tuples to perform window operations on the streams, while in my topology the KafkaSpout sends only one tuple at a time to the bolts, so my CEP query is not working as expected.
I am using the default KafkaSpout in Storm. Is there any way to send multiple different tuples at once to the bolts? Can this be done with some configuration tuning, or do I need to write a custom KafkaSpout?
Please help!
My topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("KafkaSpout", new KafkaSpout<>(KafkaSpoutConfig.builder("localhost:" + 9092, "weatherdata")
        .setProp(ConsumerConfig.GROUP_ID_CONFIG, "weather-consumer-group").build()), 4);
builder.setBolt("A", new FeatureSelectionBolt(), 2).globalGrouping("KafkaSpout");
builder.setBolt("B", new TrendDetectionBolt(), 2).shuffleGrouping("A");
I am using two bolts and one spout.
My Esper query running in bolt A is:
select first(e), last(e) from weatherEvent.win:length(3) as e
Here I am trying to get the first and last event from a window of length three over the event stream, but I get the same first and last event because the KafkaSpout sends only one tuple at a time.
The spout can't do it, but you can use either Storm's windowing support (https://storm.apache.org/releases/2.0.0-SNAPSHOT/Windowing.html) or write an aggregation bolt and put it between the spout and the rest of the topology.
So your topology would be spout -> aggregator -> feature selection -> trend detection.
I'd recommend you try out the built-in windowing support, but if you would rather write your own aggregation, your bolt really just needs to receive some number of tuples (e.g. 3) and emit a new tuple containing all their values.
The aggregator bolt should do something like
private final List<Tuple> buffered = new ArrayList<>();

@Override
public void execute(Tuple input) {
    if (buffered.size() != 2) {
        buffered.add(input);
        return;
    }
    Tuple first = buffered.get(0);
    Tuple second = buffered.get(1);
    // one outgoing tuple carrying the values of all three buffered tuples
    Values aggregate = new Values(first.getValues(), second.getValues(), input.getValues());
    List<Tuple> anchors = List.of(first, second, input);
    collector.emit(anchors, aggregate);
    anchors.forEach(collector::ack);
    buffered.clear();
}
This way you end up with one tuple containing the contents of the 3 input tuples.
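For comparison, a rough sketch of the built-in windowing route mentioned above (the class name is illustrative, and the tumbling count window of three matches the Esper window length; Storm handles acking of windowed tuples once they expire from the window):
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.windowing.TupleWindow;

public class AggregatorWindowBolt extends BaseWindowedBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(TupleWindow window) {
        List<Tuple> tuples = window.get();   // the three tuples in the current window
        // pack the values of all window tuples into a single outgoing tuple, anchored on them
        collector.emit(tuples, new Values(
                tuples.stream().map(Tuple::getValues).collect(Collectors.toList())));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("aggregate"));
    }
}
It would be wired in with something like builder.setBolt("aggregator", new AggregatorWindowBolt().withTumblingWindow(new BaseWindowedBolt.Count(3))).globalGrouping("KafkaSpout"); and the downstream bolts would then consume from "aggregator" instead of the spout.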

Spark - Get earliest and latest offset of Kafka without opening stream

I am currently using spark-streaming-kafka-0-10_2.11 to connect my Spark application to a Kafka queue. For streams everything works fine. For one specific scenario, however, I just need the whole content of the Kafka queue exactly once; for this I got the suggestion to use KafkaUtils.createRDD instead (SparkStreaming: Read Kafka Stream and provide it as RDD for further processing).
However, for spark-streaming-kafka-0-10_2.11 I cannot figure out how to get the earliest and latest offsets for my Kafka topic, which I need to build the OffsetRanges I have to hand off to the createRDD method.
What is the recommended way to get those offsets without opening a stream? Any help would be greatly appreciated.
After reading several discussions, I am able to get the earliest or latest offset of a specific partition with:
val consumer = new SimpleConsumer(host,port,timeout,bufferSize,"offsetfetcher");
val topicAndPartition = new TopicAndPartition(topic, initialPartition)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime,1)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
return offsets.head
But I still do not know how to replicate the behaviour of the consumer CLI's "from beginning" option with the KafkaUtils.createRDD approach.
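For what it's worth, a sketch of how the same information could be obtained with the 0.10 consumer API instead of the old SimpleConsumer (topic name, group id and bootstrap servers are placeholders; beginningOffsets/endOffsets do not consume any records):
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "offset-lookup");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    List<TopicPartition> partitions = consumer.partitionsFor("myTopic").stream()
            .map(p -> new TopicPartition(p.topic(), p.partition()))
            .collect(Collectors.toList());
    Map<TopicPartition, Long> earliest = consumer.beginningOffsets(partitions);
    Map<TopicPartition, Long> latest = consumer.endOffsets(partitions);
    // OffsetRange.create(tp.topic(), tp.partition(), earliest.get(tp), latest.get(tp))
    // for each partition would then cover the whole topic for KafkaUtils.createRDD.
}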

Multiple Streams support in Apache Flink Job

My question is regarding the Apache Flink framework.
Is there any way to support more than one streaming source, such as Kafka and Twitter, in a single Flink job? Is there any workaround? Can we process more than one streaming source at a time in a single Flink job?
I am currently working with Spark Streaming, and this is a limitation there.
Is this achievable with other streaming frameworks like Apache Samza, Storm or NiFi?
Response is much awaited.
Yes, this is possible in Flink and Storm (no clue about Samza or NiFi, though).
You can add as many source operators as you want and each can consume from a different source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = ... // see the Flink documentation for more details
DataStream<String> stream1 = env.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
DataStream<String> stream2 = env.readTextFile("/tmp/myFile.txt");

DataStream<String> allStreams = stream1.union(stream2);
For Storm, using the low-level API, the pattern is similar; a minimal sketch follows. See also: An Apache Storm bolt receive multiple input tuples from different spout/bolt.
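A minimal sketch of that pattern in Storm (the spout and bolt classes are placeholders for whatever sources and processing you actually use):
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();

// two independent sources...
builder.setSpout("kafkaSpout", new KafkaSpout<>(kafkaSpoutConfig));
builder.setSpout("twitterSpout", new TwitterSampleSpout());   // hypothetical second source

// ...feeding the same downstream bolt
builder.setBolt("merge", new MergeBolt())
        .shuffleGrouping("kafkaSpout")
        .shuffleGrouping("twitterSpout");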
Some solutions have already been covered; I just want to add that in a NiFi flow you can ingest many different sources and process them either separately or together.
It is also possible to ingest a source once and have multiple teams build flows on it, without needing to ingest the data multiple times.