Apache Beam: how to discard messages? - apache-beam

I have a pipeline that:
Reads messages from pubsub
Converts them to a domain object
Applies fixed window
Sends data back to a pubsub topic
I would like to process only specific messages - for example having a specific attribute and discard all other messages. How can this be done in beam?
Can I simply skip c.outputWithTimestamp(...); for the messages that should be discarded?
My code:
pipeline.apply("Read PubSub messages",
PubsubIO.
readStrings().
fromSubscription(pubsubSub))
.apply("Convert to DeviceData",
ParDo.of(new DoFn<String, KV<String, DeviceData>>() {
#Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
#ProcessElement
public void processElement(ProcessContext c) {
String message = c.element();
DeviceData data = new Gson().fromJson(message, DeviceData.class);
String sourceId = data.getSensorId() != null ? data.getSensorId() : data.getFormulaId();
// use timestamp from payload
Long timeInNanoSeconds = data.getTimeInNanoSeconds();
Instant timestamp = ClockUtil.fromNanos(timeInNanoSeconds);
long millis = timestamp.toEpochMilli();
c.outputWithTimestamp(KV.of(sourceId, data), new org.joda.time.Instant(millis));
}
}))
.apply("Apply fixed window", window)
.apply("Group by inputId", GroupByKey.create())
.apply("Collect created buckets", ParDo.of(new GatherBuckets(options.getWindowSize())))
.apply("Send to Pub/sub", PubsubIO.writeStrings().to(topic));

Can I simply skip c.outputWithTimestamp(...); for the messages that should be discarded?
Yes, a DoFn can emit any number of output messages per input message, including zero.

Related

Beam Pub/Sub pipeline window group by value

I have setup a basic playground beam pipeline that uses a fixed window on incoming sensor data via a pub/sub topic.
Code:
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))))
.apply("Write Files to GCS", new WriteOneFilePerWindow(options.getOutput(), numShards));
My incoming data, that is stored on the bucket has this layout.
{"deviceId":"e97c6cce-5341-429b-a49d-506e9af1c845","value":79206.34,"sensorId":"b1185c92-4f07-4ef0-86ee-2fe38bdaaf33","timeInNanoSeconds":1671485053899000000,"receivedTimeInNanoSeconds":16714850539867940
{"deviceId":"d2fa82a7-d70e-4057-8d4f-769c21815195","value":28.83,"sensorId":"8a308426-bfc7-417b-a278-ae9c6353f080","timeInNanoSeconds":1671485065774000000,"receivedTimeInNanoSeconds":1671485065796548000}
I need to group the window on the sensorId - means there should be a fixed window for each sensorId. In the end, I want to persist a window per sensor.
From what I understand in the docs, I should go with GroupByKey
Can someone push me in the right direction?
That is correct, you would want to use a GroupByKey here. To use a GroupByKey, you need to transform your data into KVs. E.g.
pipeline
.apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))))
.apply(ParDo
.of(new DoFn<String, KV<String, String>>() {
public void processElement(ProcessContext c) {
String json = c.element();
[parse the json to get deviceId out]
c.output(KV.of(deviceId, json));
}
}))
.apply(GroupByKey.of()) // Groups by key and window.
.apply(ParDo
.of(new DoFn<KV<String, Itarable<String>>, String>() {
public void processElement(ProcessContext c) {
KV<String, Itarable<String>> grouped = c.element();
String deviceId = grouped.key();
Iterable<String> messages = grouped.value();
[loop over your messages to compute the result]
c.output(result);
}
}))
...

How to trigger window if one of multiple Kafka topics are idle

I'm consuming multiple Kafka topics, windowing them hourly and writing them into separate parquet files for each topic. However, if one of the topics are idle, the window does not get triggered and nothing is written to the FS. For this example, I'm consuming 2 topics with a single partition. taskmanager.numberOfTaskSlots: 2 and parallelism.default: 1. What is the proper way of solving this problem in Apache Beam with Flink Runner?
pipeline
.apply(
"ReadKafka",
KafkaIO
.read[String, String]
.withBootstrapServers(bootstrapServers)
.withTopics(topics)
.withCreateTime(Duration.standardSeconds(0))
.withReadCommitted
.withKeyDeserializer(classOf[StringDeserializer])
.withValueDeserializer(classOf[StringDeserializer])
.withoutMetadata()
)
.apply("ConvertToMyEvent", MapElements.via(new KVToMyEvent()))
.apply(
"WindowHourly",
Window.into[MyEvent](FixedWindows.of(Duration.standardHours(1)))
)
.apply(
"WriteParquet",
FileIO
.writeDynamic[String, MyEvent]()
.by(new BucketByEventName())
//...
)
TimeWindow needs data. If the topic is idle, it means , there is no data to close the Window and the window is open until the data arrives. If you want to window data based on Processing time instead of actual event time , try using a simple process function
public class MyProcessFunction extends
KeyedProcessFunction<KeyDataType,InputDataType,OutputDataType>{
// The data type can be primitive like String or your custom class
private transient ValueState<Long> windowDesc;
#Override
public void open(final Configuration conf) {
final ValueStateDescriptor<Long> windowDesc = new ValueStateDescriptor("windowDesc", Long.class);
this.windowTime = this.getRuntimeContext().getState(windowDesc); // normal variable declaration does not work. Declare variables like this and use it inside the functions
}
#Override
public void processElement(InputType input, Context context, Collector<OutPutType> collector)
throws IOException {
this.windowTime.update( <window interval> ); // milliseconds are recommended
context.timerService().registerProcessingTimeTimer(this.windowTime.value());//register a timer. Timer runs for windowTime from the current time.
.
.
.
if( this.windowTime.value() != null ){
context.timerService().deleteProcessingTimeTimer(this.windowTime.value());
// delete any existing time if you want to reset timer
}
}
#Override
public void onTimer(long timestamp, KeyedProcessFunction<KeyDataType,InputDataType,OutputDataType>.OnTimerContext context,
Collector<OutputType> collector) throws IOException {
//This method is executed when the timer reached
collector.collect( < whatever you want to stream out> );// this data will be available in the pipeline
}
}
```

How to process a KStream in a batch of max size or fallback to a time window?

I would like to create a Kafka stream-based application that processes a topic and takes messages in batches of size X (i.e. 50) but if the stream has low flow, to give me whatever the stream has within Y seconds (i.e. 5).
So, instead of processing messages one by one, I process a List[Record] where the size of the list is 50 (or maybe less).
This is to make some I/O bound processing more efficient.
I know that this can be implemented with the classic Kafka API but was looking for a stream-based implementation that can also handle offset committing natively, taking errors/failures into account.
I couldn't find anything related int he docs or by searching around and was wondering if anyone has a solution to this problem.
#Matthias J. Sax answer is nice, I just want to add an example for this, I think it might be useful for someone.
let's say we want to combine incoming values into the following type:
public class MultipleValues { private List<String> values; }
To collect messages into batches with max size, we need to create transformer:
public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
private ProcessorContext processorContext;
private String stateStoreName;
private KeyValueStore<String, MultipleValues> keyValueStore;
private Cancellable scheduledPunctuator;
public MultipleValuesTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
#Override
public void init(ProcessorContext processorContext) {
this.processorContext = processorContext;
this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
}
#Override
public KeyValue<String, MultipleValues> transform(String key, String value) {
MultipleValues itemValueFromStore = keyValueStore.get(key);
if (isNull(itemValueFromStore)) {
itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
} else {
List<String> values = new ArrayList<>(itemValueFromStore.getValues());
values.add(value);
itemValueFromStore = itemValueFromStore.toBuilder()
.values(values)
.build();
}
if (itemValueFromStore.getValues().size() >= 50) {
processorContext.forward(key, itemValueFromStore);
keyValueStore.put(key, null);
} else {
keyValueStore.put(key, itemValueFromStore);
}
return null;
}
private void doPunctuate(long timestamp) {
KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all();
while (valuesIterator.hasNext()) {
KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
if (nonNull(keyValue.value)) {
processorContext.forward(keyValue.key, keyValue.value);
keyValueStore.put(keyValue.key, null);
}
}
}
#Override
public void close() {
scheduledPunctuator.cancel();
}
}
and we need to create key-value store, add it to StreamsBuilder, and build KStream flow using transform method
Properties props = new Properties();
...
Serde<MultipleValues> multipleValuesSerge = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
StreamsBuilder builder = new StreamsBuilder();
String storeName = "multipleValuesStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerge);
builder.addStateStore(storeBuilder);
builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
.transform(() -> new MultipleValuesTransformer(storeName), storeName)
.print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
with such approach we used the incoming key for which we did aggregation. if you need to collect messages not by key, but by some message's fields, you need the following flow to trigger rebalancing on KStream (by using intermediate topic):
.selectKey(..)
.through(intermediateTopicName)
.transform( ..)
The simplest way might be, to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.
To enforce processing if you don't read the limit in a certain amount of time, you can register a wall-clock punctuation.
It seems that there is no need to use Processors or Transformers and transform() to batch events by count. Regular groupBy() and reduce()/aggregate() should do the trick:
KeyValueSerde keyValueSerde = new KeyValueSerde(); // simple custom Serde
final AtomicLong batchCount = new AtomicLong(0L);
myKStream
.groupBy((k,v) -> KeyValue.pair(k, batchCount.getAndIncrement() / batchSize),
Grouped.keySerde(keyValueSerde))
.reduce(this::windowReducer) // <-- how you want to aggregate values in batch
.toStream()
.filter((k,v) -> /* pass through full batches only */)
.selectKey((k,v) -> k.key)
...
You'd also need to add straightforward Serde for the standard KeyValue<String, Long>.
This option is obviously only helpful when you don't need a "punctuator" to emit incomplete batches on timeout. It also doesn't guarantee the order of elements in the batch in case of distributed processing.
You can also concatenate count to the key string to form the new key (instead of using KeyValue). That would simplify example even further (to using Serdes.String()).

Kafka listener, get all messages

Good day collegues.
I have Kafka project using Spring Kafka what listen a definite topic.
I need one time in a day listen all messages, put them into a collection and find specific message there.
I couldn't understand how to read all messages in one #KafkaListener method.
My class is:
#Component
public class KafkaIntervalListener {
public CountDownLatch intervalLatch = new CountDownLatch(1);
private final SCDFRunnerService scdfRunnerService;
public KafkaIntervalListener(SCDFRunnerService scdfRunnerService) {
this.scdfRunnerService = scdfRunnerService;
}
#KafkaListener(topics = "${kafka.interval-topic}", containerFactory = "intervalEventKafkaListenerContainerFactory")
public void intervalListener(IntervalEvent event) throws UnsupportedEncodingException, JSONException {
System.out.println("Recieved interval message: " + event);
IntervalType type = event.getType();
Instant instant = event.getInterval();
List<IntervalEvent> events = new ArrayList<>();
events.add(event);
events.size();
this.intervalLatch.countDown();
}
}
My events collection always has size = 1;
I tried to use different loops, but then, my collection become filed 530 000 000 times the same message.
UPDATE:
I have found a way to do it with factory.setBatchListener(true); But i need to find launch it with #Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow"). Right now this method is always is listening. Now iam trying something like this:
#Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow")
public void run() throws Exception {
kafkaIntervalListener.intervalLatch.await();
}
It doesn't work, in debug mode my breakpoint never works on this site.
The listener container is, by design, message-driven.
For fetching messages on-demand, it's better to use the Kafka Consumer API directly and fetch messages using the poll() method.

KTable Reduce function does not honor windowing

Requirement :- We need to consolidate all the messages having same orderid and perform subsequent operation for the consolidated Message.
Explanation :- Below snippet of code tries to capture all order messages received from a particular tenant and tries to consolidate to a single order message after waiting for a specific period of time
It does the following stuff
Repartition message based on OrderId. So each order message will be having tenantId and groupId as its key
Perform a groupby key operation followed by windowed operation for 2 minutes
Reduce operation is performed once windowing is completed.
Ktable is converted again to stream back and then its output is send to another kafka topic
Expected Output :- If there are 5 messages having same order id being sent with in window period. It was expected that the final kafka topic should have only one message and it would be the last reduce operation message.
Actual Output :- All the 5 messages are seen indicating windowing is not happening before invoking reduce operation. All the messages seen in kafka have proper reduce operation being done as each and every message is received.
Queries :- In kafka stream library version 0.11.0.0 reduce function used to accept timewindow as its argument. I see that this is deprecated in kafka stream version 1.0.0. Windowing which is done in the below piece of code, is it correct ? Is windowing supported in newer version of kafka stream library 1.0.0 ? If so, then is there something can be improved in below snippet of code ?
String orderMsgTopic = "sampleordertopic";
JsonSerializer<OrderMsg> orderMsgJSONSerialiser = new JsonSerializer<>();
JsonDeserializer<OrderMsg> orderMsgJSONDeSerialiser = new JsonDeserializer<>(OrderMsg.class);
Serde<OrderMsg> orderMsgSerde = Serdes.serdeFrom(orderMsgJSONSerialiser,orderMsgJSONDeSerialiser);
KStream<String, OrderMsg> orderMsgStream = this.builder.stream(orderMsgTopic, Consumed.with(Serdes.ByteArray(), orderMsgSerde))
.map(new KeyValueMapper<byte[], OrderMsg, KeyValue<? extends String, ? extends OrderMsg>>() {
#Override
public KeyValue<? extends String, ? extends OrderMsg> apply(byte[] byteArr, OrderMsg value) {
TenantIdMessageTypeDeserializer deserializer = new TenantIdMessageTypeDeserializer();
TenantIdMessageType tenantIdMessageType = deserializer.deserialize(orderMsgTopic, byteArr);
String newTenantOrderKey = null;
if ((tenantIdMessageType != null) && (tenantIdMessageType.getMessageType() == 1)) {
Long tenantId = tenantIdMessageType.getTenantId();
newTenantOrderKey = tenantId.toString() + value.getOrderKey();
} else {
newTenantOrderKey = value.getOrderKey();
}
return new KeyValue<String, OrderMsg>(newTenantOrderKey, value);
}
});
final KTable<Windowed<String>, OrderMsg> orderGrouping = orderMsgStream.groupByKey(Serialized.with(Serdes.String(), orderMsgSerde))
.windowedBy(TimeWindows.of(windowTime).advanceBy(windowTime))
.reduce(new OrderMsgReducer());
orderGrouping.toStream().map(new KeyValueMapper<Windowed<String>, OrderMsg, KeyValue<String, OrderMsg>>() {
#Override
public KeyValue<String, OrderMsg> apply(Windowed<String> key, OrderMsg value) {
return new KeyValue<String, OrderMsg>(key.key(), value);
}
}).to("newone11", Produced.with(Serdes.String(), orderMsgSerde));
I realised that I had set StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG to 0 and also set the default commit interval of 1000ms. Changing this value helps me to some extent get the windowing working