Beam Pub/Sub pipeline window group by value - apache-beam

I have setup a basic playground beam pipeline that uses a fixed window on incoming sensor data via a pub/sub topic.
Code:
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))))
.apply("Write Files to GCS", new WriteOneFilePerWindow(options.getOutput(), numShards));
My incoming data, that is stored on the bucket has this layout.
{"deviceId":"e97c6cce-5341-429b-a49d-506e9af1c845","value":79206.34,"sensorId":"b1185c92-4f07-4ef0-86ee-2fe38bdaaf33","timeInNanoSeconds":1671485053899000000,"receivedTimeInNanoSeconds":16714850539867940
{"deviceId":"d2fa82a7-d70e-4057-8d4f-769c21815195","value":28.83,"sensorId":"8a308426-bfc7-417b-a278-ae9c6353f080","timeInNanoSeconds":1671485065774000000,"receivedTimeInNanoSeconds":1671485065796548000}
I need to group the window on the sensorId - means there should be a fixed window for each sensorId. In the end, I want to persist a window per sensor.
From what I understand in the docs, I should go with GroupByKey
Can someone push me in the right direction?

That is correct, you would want to use a GroupByKey here. To use a GroupByKey, you need to transform your data into KVs. E.g.
pipeline
.apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))))
.apply(ParDo
.of(new DoFn<String, KV<String, String>>() {
public void processElement(ProcessContext c) {
String json = c.element();
[parse the json to get deviceId out]
c.output(KV.of(deviceId, json));
}
}))
.apply(GroupByKey.of()) // Groups by key and window.
.apply(ParDo
.of(new DoFn<KV<String, Itarable<String>>, String>() {
public void processElement(ProcessContext c) {
KV<String, Itarable<String>> grouped = c.element();
String deviceId = grouped.key();
Iterable<String> messages = grouped.value();
[loop over your messages to compute the result]
c.output(result);
}
}))
...

Related

Apache Beam: how to discard messages?

I have a pipeline that:
Reads messages from pubsub
Converts them to a domain object
Applies fixed window
Sends data back to a pubsub topic
I would like to process only specific messages - for example having a specific attribute and discard all other messages. How can this be done in beam?
Can I simply skip c.outputWithTimestamp(...); for the messages that should be discarded?
My code:
pipeline.apply("Read PubSub messages",
PubsubIO.
readStrings().
fromSubscription(pubsubSub))
.apply("Convert to DeviceData",
ParDo.of(new DoFn<String, KV<String, DeviceData>>() {
#Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
#ProcessElement
public void processElement(ProcessContext c) {
String message = c.element();
DeviceData data = new Gson().fromJson(message, DeviceData.class);
String sourceId = data.getSensorId() != null ? data.getSensorId() : data.getFormulaId();
// use timestamp from payload
Long timeInNanoSeconds = data.getTimeInNanoSeconds();
Instant timestamp = ClockUtil.fromNanos(timeInNanoSeconds);
long millis = timestamp.toEpochMilli();
c.outputWithTimestamp(KV.of(sourceId, data), new org.joda.time.Instant(millis));
}
}))
.apply("Apply fixed window", window)
.apply("Group by inputId", GroupByKey.create())
.apply("Collect created buckets", ParDo.of(new GatherBuckets(options.getWindowSize())))
.apply("Send to Pub/sub", PubsubIO.writeStrings().to(topic));
Can I simply skip c.outputWithTimestamp(...); for the messages that should be discarded?
Yes, a DoFn can emit any number of output messages per input message, including zero.

How to stop sending to kafka topic when control goes to catch block Functional kafka spring

could you please advise , how can I stop sending to my 3rd kafka topic, when the control reaches the catch block, currently the message is sent to both error topic as well as the topic to which it should send in case of normal processing. A snippet of code is as below:
#Component
public class Abc {
private final StreamBridge streamBridge;
public Abc (StreamBridge streamBridge)
this.streamBridge = streamBridge;
#Bean
public Function<KStream<String, KafkaClass>, KStream<String,KafkaClass>> hiProcess() {
return input -> input.map((key,value) -> {
try{
KafkaClass stream = processFunction();
}
catch(Exception e) {
Message<KakfaClass> mess = MessageBuilder.withPayload(value).build();
streamBridge.send("errProcess-out-0". mess);
}
return new KeyValue<>(key, stream);
})
}
}
This can be implemented using the following pattern:
KafkaClass stream;
return input -> input
.branch((k, v) -> {
try {
stream = processFunction();
return true;
}
catch (Exception e) {
Message<KakfaClass> mess = MessageBuilder.withPayload(value).build();
streamBridge.send("errProcess-out-0". mess);
return false;
}
},
(k, v) -> true)[0]
.map((k, v) -> new KeyValue<>(k, stream));
Here, we are using the branching feature (API) of KStream to split your input into two paths - normal flow and the one causing the errors. This is accomplished by providing two filters to the branch method call. The first filter is the normal flow in which you call the processFunction method and get a response back. If we don't get an exception, the filter returns true, and the result of the branch operation is captured in the first element of the output array [0] which is processed downstream in the map operation in which it sends the final result to the outbound topic.
On the other hand, if it throws an exception, it sends whatever is necessary to the error topic using StreamBridge and the filter returns false. Since the downstream map operation is only performed on the first element of the array from branching [0], nothing will be sent outbound. When the first filter returns false, it goes to the second filter which always returns true. This is a no-op filter where the results are completely ignored.
One downside of this particular implementation is that you need to store the response from processFunction in an instance field and then mutate on each incoming KStream record so that you can access its value in the final map method where you send the output. However, for this particular use case, this may not be an issue.

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as event source, a map function and a sink as image shown below:
I want to enrich the incoming click stream data with user's IP location based on userIp field in raw event ingested from Kafka.
a simplified slice of the CSV file as shown below
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have made some researches and found a couple of potential solutions:
1. Solution: Broadcast the enrichment data and connect with event stream with some IP matching logic.
1. Result: It worked well for a couple sample IP location data but not with whole CSV data. JVM heap has reached to 3.5 GB and due to the broadcast state, there is no way to put the broadcast state into disk (for RocksDb)
2. Solution: Load CSV data in open() method in RichFlatMapFunction into the state(ValueState) before start of the event processing and enrich event data in flatMap method.
2. Result: Due to the enrichment data is so big to store in JVM heap, it's impossible to load into ValueState. And also de/serializing through ValueState is bad practice for data in key-value nature.
3. Solution: To avoid to deal with JVM heap constraint, I have tried to put the enrichment data into RocksDB(uses disk) as state with MapState.
3. Result: Trying to load the CSV file into MapState in open() method, gave me error that tells me you cannot put into MapState in open() method because I was not in keyed context in open() method like this question: Flink keyed stream key is null
4. Solution: Because of need of the keyed context for MapState(to put RocksDB), I tried to load whole CSV file into local RocksDB instance(disk) in the process function after making the DataStream into KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {
var ipMapState: MapState[String, String] = _
var csvFinishedFlag: ValueState[Boolean] = _
override def processElement(event: Event,
ctx: KeyedProcessFunction[Long, Event, Event]#Context,
out: Collector[Event]): Unit = {
val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])
ipMapState = getRuntimeContext.getMapState(ipDescriptor)
csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)
if (!csvFinishedFlag.value()) {
val csv = new CSVParser(defaultCSVFormat)
val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
for (row <- fileSource.getLines()) {
val Some(List(start, end, country)) = csv.parseLine(row)
ipMapState.put(start, country)
}
fileSource.close()
csvFinishedFlag.update(true)
}
out.collect {
if (ipMapState.contains(event.userIp)) {
val details = ipMapState.get(event.userIp)
event.copy(data =
event.data.copy(
ipLocation = Some(details.country)
))
} else {
event
}
}
}
}
4. Result: It's too hacky and prevents event processing due to blocking file read operation.
Could you tell me what can I do for this situation?
Thanks
What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
.partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
.flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
#Override
public int partition(final Long sensorMeasurement, final int numPartitions) {
return Math.toIntExact(sensorMeasurement % numPartitions);
}
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {
private Map<Long, SensorReferenceData> referenceData;
#Override
public void open(final Configuration parameters) throws Exception {
super.open(parameters);
referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
}
#Override
public void flatMap(
final SensorMeasurement sensorMeasurement,
final Collector<EnrichedMeasurements> collector) throws Exception {
SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
}
private Map<Long, SensorReferenceData> loadReferenceData(
final int partition,
final int numPartitions) {
SensorReferenceDataClient client = new SensorReferenceDataClient();
return client.getSensorReferenceDataForPartition(partition, numPartitions);
}
}
Note that the enrichment is not being done on a keyed stream, so you can not use keyed state or timers in the enrichment function.

How to process a KStream in a batch of max size or fallback to a time window?

I would like to create a Kafka stream-based application that processes a topic and takes messages in batches of size X (i.e. 50) but if the stream has low flow, to give me whatever the stream has within Y seconds (i.e. 5).
So, instead of processing messages one by one, I process a List[Record] where the size of the list is 50 (or maybe less).
This is to make some I/O bound processing more efficient.
I know that this can be implemented with the classic Kafka API but was looking for a stream-based implementation that can also handle offset committing natively, taking errors/failures into account.
I couldn't find anything related int he docs or by searching around and was wondering if anyone has a solution to this problem.
#Matthias J. Sax answer is nice, I just want to add an example for this, I think it might be useful for someone.
let's say we want to combine incoming values into the following type:
public class MultipleValues { private List<String> values; }
To collect messages into batches with max size, we need to create transformer:
public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
private ProcessorContext processorContext;
private String stateStoreName;
private KeyValueStore<String, MultipleValues> keyValueStore;
private Cancellable scheduledPunctuator;
public MultipleValuesTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
#Override
public void init(ProcessorContext processorContext) {
this.processorContext = processorContext;
this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
}
#Override
public KeyValue<String, MultipleValues> transform(String key, String value) {
MultipleValues itemValueFromStore = keyValueStore.get(key);
if (isNull(itemValueFromStore)) {
itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
} else {
List<String> values = new ArrayList<>(itemValueFromStore.getValues());
values.add(value);
itemValueFromStore = itemValueFromStore.toBuilder()
.values(values)
.build();
}
if (itemValueFromStore.getValues().size() >= 50) {
processorContext.forward(key, itemValueFromStore);
keyValueStore.put(key, null);
} else {
keyValueStore.put(key, itemValueFromStore);
}
return null;
}
private void doPunctuate(long timestamp) {
KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all();
while (valuesIterator.hasNext()) {
KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
if (nonNull(keyValue.value)) {
processorContext.forward(keyValue.key, keyValue.value);
keyValueStore.put(keyValue.key, null);
}
}
}
#Override
public void close() {
scheduledPunctuator.cancel();
}
}
and we need to create key-value store, add it to StreamsBuilder, and build KStream flow using transform method
Properties props = new Properties();
...
Serde<MultipleValues> multipleValuesSerge = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
StreamsBuilder builder = new StreamsBuilder();
String storeName = "multipleValuesStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerge);
builder.addStateStore(storeBuilder);
builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
.transform(() -> new MultipleValuesTransformer(storeName), storeName)
.print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
with such approach we used the incoming key for which we did aggregation. if you need to collect messages not by key, but by some message's fields, you need the following flow to trigger rebalancing on KStream (by using intermediate topic):
.selectKey(..)
.through(intermediateTopicName)
.transform( ..)
The simplest way might be, to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.
To enforce processing if you don't read the limit in a certain amount of time, you can register a wall-clock punctuation.
It seems that there is no need to use Processors or Transformers and transform() to batch events by count. Regular groupBy() and reduce()/aggregate() should do the trick:
KeyValueSerde keyValueSerde = new KeyValueSerde(); // simple custom Serde
final AtomicLong batchCount = new AtomicLong(0L);
myKStream
.groupBy((k,v) -> KeyValue.pair(k, batchCount.getAndIncrement() / batchSize),
Grouped.keySerde(keyValueSerde))
.reduce(this::windowReducer) // <-- how you want to aggregate values in batch
.toStream()
.filter((k,v) -> /* pass through full batches only */)
.selectKey((k,v) -> k.key)
...
You'd also need to add straightforward Serde for the standard KeyValue<String, Long>.
This option is obviously only helpful when you don't need a "punctuator" to emit incomplete batches on timeout. It also doesn't guarantee the order of elements in the batch in case of distributed processing.
You can also concatenate count to the key string to form the new key (instead of using KeyValue). That would simplify example even further (to using Serdes.String()).

KTable Reduce function does not honor windowing

Requirement :- We need to consolidate all the messages having same orderid and perform subsequent operation for the consolidated Message.
Explanation :- Below snippet of code tries to capture all order messages received from a particular tenant and tries to consolidate to a single order message after waiting for a specific period of time
It does the following stuff
Repartition message based on OrderId. So each order message will be having tenantId and groupId as its key
Perform a groupby key operation followed by windowed operation for 2 minutes
Reduce operation is performed once windowing is completed.
Ktable is converted again to stream back and then its output is send to another kafka topic
Expected Output :- If there are 5 messages having same order id being sent with in window period. It was expected that the final kafka topic should have only one message and it would be the last reduce operation message.
Actual Output :- All the 5 messages are seen indicating windowing is not happening before invoking reduce operation. All the messages seen in kafka have proper reduce operation being done as each and every message is received.
Queries :- In kafka stream library version 0.11.0.0 reduce function used to accept timewindow as its argument. I see that this is deprecated in kafka stream version 1.0.0. Windowing which is done in the below piece of code, is it correct ? Is windowing supported in newer version of kafka stream library 1.0.0 ? If so, then is there something can be improved in below snippet of code ?
String orderMsgTopic = "sampleordertopic";
JsonSerializer<OrderMsg> orderMsgJSONSerialiser = new JsonSerializer<>();
JsonDeserializer<OrderMsg> orderMsgJSONDeSerialiser = new JsonDeserializer<>(OrderMsg.class);
Serde<OrderMsg> orderMsgSerde = Serdes.serdeFrom(orderMsgJSONSerialiser,orderMsgJSONDeSerialiser);
KStream<String, OrderMsg> orderMsgStream = this.builder.stream(orderMsgTopic, Consumed.with(Serdes.ByteArray(), orderMsgSerde))
.map(new KeyValueMapper<byte[], OrderMsg, KeyValue<? extends String, ? extends OrderMsg>>() {
#Override
public KeyValue<? extends String, ? extends OrderMsg> apply(byte[] byteArr, OrderMsg value) {
TenantIdMessageTypeDeserializer deserializer = new TenantIdMessageTypeDeserializer();
TenantIdMessageType tenantIdMessageType = deserializer.deserialize(orderMsgTopic, byteArr);
String newTenantOrderKey = null;
if ((tenantIdMessageType != null) && (tenantIdMessageType.getMessageType() == 1)) {
Long tenantId = tenantIdMessageType.getTenantId();
newTenantOrderKey = tenantId.toString() + value.getOrderKey();
} else {
newTenantOrderKey = value.getOrderKey();
}
return new KeyValue<String, OrderMsg>(newTenantOrderKey, value);
}
});
final KTable<Windowed<String>, OrderMsg> orderGrouping = orderMsgStream.groupByKey(Serialized.with(Serdes.String(), orderMsgSerde))
.windowedBy(TimeWindows.of(windowTime).advanceBy(windowTime))
.reduce(new OrderMsgReducer());
orderGrouping.toStream().map(new KeyValueMapper<Windowed<String>, OrderMsg, KeyValue<String, OrderMsg>>() {
#Override
public KeyValue<String, OrderMsg> apply(Windowed<String> key, OrderMsg value) {
return new KeyValue<String, OrderMsg>(key.key(), value);
}
}).to("newone11", Produced.with(Serdes.String(), orderMsgSerde));
I realised that I had set StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG to 0 and also set the default commit interval of 1000ms. Changing this value helps me to some extent get the windowing working