I'm trying to write a ParDo that uses both a Timer and a Side Input, but it crashes when I run it with beam-runners-direct-java. An IllegalArgumentException is thrown at https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/QuiescenceDriver.java#L167, because the ParDo actually has two inputs (the main PCollection and the side input) while only one is expected.
Is there a way to work around this? Is this a bug in Beam?
Here's a code snippet that reproduces the behaviour:
public class TestCrashesForTimerAndSideInput {
  @Rule
  public final transient TestPipeline p = TestPipeline.create();

  @RequiredArgsConstructor
  private static class DoFnWithTimer extends DoFn<KV<String, String>, String> {
    private final PCollectionView<Map<String, String>> sideInput;

    @TimerId("t")
    private final TimerSpec tSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @ProcessElement
    public void processElement(ProcessContext c, @TimerId("t") Timer t) {
      KV<String, String> element = c.element();
      // look up the side-input map by the element's key
      c.output(c.sideInput(sideInput).get(element.getKey()));
      t.offset(Duration.standardSeconds(1)).setRelative();
    }

    @OnTimer("t")
    public void onTimerFire(OnTimerContext x) {
      x.output("Timer fired");
    }
  }

  @Test
  public void testCrashesForTimerAndSideInput() {
    ImmutableMap<String, String> sideData = ImmutableMap.<String, String>builder()
        .put("x", "X")
        .put("y", "Y")
        .build();
    PCollectionView<Map<String, String>> sideInput =
        p.apply(Create.of(sideData)).apply(View.asMap());
    TestStream<String> testStream = TestStream.create(StringUtf8Coder.of())
        .addElements("x")
        .advanceProcessingTime(Duration.standardSeconds(1))
        .addElements("y")
        .advanceProcessingTime(Duration.standardSeconds(1))
        .advanceWatermarkToInfinity();
    PCollection<String> result = p
        .apply(testStream)
        .apply(MapElements.into(kvs(strings(), strings())).via(v -> KV.of(v, v)))
        .apply(ParDo.of(new DoFnWithTimer(sideInput)).withSideInputs(sideInput));
    PAssert.that(result).containsInAnyOrder("X", "Y", "Timer fired");
    p.run();
  }
}
and the exception:
java.lang.IllegalArgumentException: expected one element but was: <ParDo(DoFnWithTimer)/ParMultiDo(DoFnWithTimer)/To KeyedWorkItem/ParMultiDo(ToKeyedWorkItem).output [PCollection], View.AsMap/View.VoidKeyToMultimapMaterialization/ParDo(VoidKeyToMultimapMaterialization)/ParMultiDo(VoidKeyToMultimapMaterialization).output [PCollection]>
at org.apache.beam.repackaged.beam_runners_direct_java.com.google.common.collect.Iterators.getOnlyElement(Iterators.java:322)
at org.apache.beam.repackaged.beam_runners_direct_java.com.google.common.collect.Iterables.getOnlyElement(Iterables.java:294)
at org.apache.beam.runners.direct.QuiescenceDriver.fireTimers(QuiescenceDriver.java:167)
at org.apache.beam.runners.direct.QuiescenceDriver.drive(QuiescenceDriver.java:110)
at org.apache.beam.runners.direct.ExecutorServiceParallelExecutor$2.run(ExecutorServiceParallelExecutor.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Related
I'm building a Flink CEP application that reads data from Kafka. When I try to catch a pattern, the sink operation does not happen unless more data arrives after the match. For example, I expect A -> B -> C as a pattern, and Kafka delivers the data A, B, C. However, for the sink I added in the pattern-process function to run, the data coming from Kafka has to look like A, B, C, X. How do I fix this problem? Please help.
READ KAFKA
DataStream<String> dataStream = env.addSource(KAFKA).assignTimestampsAndWatermarks(WatermarkStrategy
.forBoundedOutOfOrderness(Duration.ofSeconds(0)));
dataStream.print("DS:"); //to see every incoming data
PATTERN
Pattern<Event, ?> pattern = Pattern.<Event>begin("start").where(
        new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) {
                return event.actionId.equals("2.24");
            }
        }
).next("middle").where(
        new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) {
                return event.actionId.equals("2.24");
            }
        }
).within(Time.seconds(5));
CEP And Sink
PatternStream<Event> patternStream = CEP.pattern(eventStringKeyedStream, pattern);
patternStream.process(new PatternProcessFunction<Event, Event>() {
    @Override
    public void processMatch(Map<String, List<Event>> map, Context context, Collector<Event> collector) throws Exception {
        collector.collect(map.get("start").get(0));
    }
}).print(); // or sink function
My Program RESULT
DS::2> {"ActionID":"2.24"}
DS::2> {"ActionID":"2.24"}
DS::2> {"ActionID":"2.25"}
4> {ActionID='2.24'}
I was expecting
DS::2> {"ActionID":"2.24"}
DS::2> {"ActionID":"2.24"}
4> {ActionID='2.24'}
So why does it only produce results once one more element arrives after the pattern's conditions are met, instead of at the moment the conditions are met? Please help.
EDIT
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.functions.PatternProcessFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import java.time.Duration;
import java.util.List;
import java.util.Map;
public class EventTimePattern {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> input = env.socketTextStream("localhost",9999)
.map(new MapFunction<String, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map (String value) throws Exception {
String[] fields = value.split(",");
if (fields.length == 2) {
return new Tuple2<String, Long>(
fields[0] ,
Long.parseLong(fields[1]));
}
return null;
}
})
/* env.fromElements(
Tuple2.of("A", 5L),
Tuple2.of("A", 10L)
)*/
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMillis(0))
.withTimestampAssigner((event, timestamp) -> event.f1))
.map(event -> event.f0);
Pattern<String, ?> pattern =
Pattern.<String>begin("start")
.where(
new SimpleCondition<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.equals("A");
}
})
.next("end")
.where(
new SimpleCondition<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.equals("A");
}
})
.within(Time.seconds(5));
input.print("I");
DataStream<String> result =
CEP.pattern(input, pattern)
.process(new PatternProcessFunction<String, String>() {
@Override
public void processMatch(
Map<String, List<String>> map,
Context context,
Collector<String> out) throws Exception {
StringBuilder builder = new StringBuilder();
builder.append(map.get("start").get(0))
.append(",")
.append(map.get("end").get(0));
out.collect(builder.toString());
}
});
result.print();
env.execute();
}
}
I failed to reproduce your problem. Here's a similar example that works fine (I used Flink 1.12.2):
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.functions.PatternProcessFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import java.time.Duration;
import java.util.List;
import java.util.Map;
public class EventTimePattern {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> input =
env.fromElements(
Tuple2.of("A", 5L),
Tuple2.of("A", 10L)
)
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMillis(0))
.withTimestampAssigner((event, timestamp) -> event.f1))
.map(event -> event.f0);
Pattern<String, ?> pattern =
Pattern.<String>begin("start")
.where(
new SimpleCondition<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.equals("A");
}
})
.next("end")
.where(
new SimpleCondition<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.equals("A");
}
})
.within(Time.seconds(5));
DataStream<String> result =
CEP.pattern(input, pattern)
.process(new PatternProcessFunction<String, String>() {
@Override
public void processMatch(
Map<String, List<String>> map,
Context context,
Collector<String> out) throws Exception {
StringBuilder builder = new StringBuilder();
builder.append(map.get("start").get(0))
.append(",")
.append(map.get("end").get(0));
out.collect(builder.toString());
}
});
result.print();
env.execute();
}
}
Please share a simple, complete, reproducible example that illustrates the problem you're having.
I am running a simple word count program that uses one Kafka topic (producer) as the input source and then applies a ParDo to calculate the word counts. Now I need help writing the words to different topics based on their frequency: say, all words with an even count go to topic 1 and the rest go to topic 2.
Can anyone help me with an example?
This can be done with KafkaIO's writeRecords() method, which takes a PCollection of ProducerRecord<key, value>; you choose the destination topic per element by constructing new ProducerRecord<>("topic_name", key, value).
Below is the code:
static class ExtractWordsFn extends DoFn<String, String> {
    private final Counter emptyLines = Metrics.counter(ExtractWordsFn.class, "emptyLines");
    private final Distribution lineLenDist =
        Metrics.distribution(ExtractWordsFn.class, "lineLenDistro");

    @ProcessElement
    public void processElement(@Element String element, OutputReceiver<String> receiver) {
        lineLenDist.update(element.length());
        if (element.trim().isEmpty()) {
            emptyLines.inc();
        }
        String[] words = element.split(ExampleUtils.TOKENIZER_PATTERN, -1);
        for (String word : words) {
            if (!word.isEmpty()) {
                receiver.output(word);
            }
        }
    }
}
public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, ProducerRecord<String, String>> {
    @Override
    public ProducerRecord<String, String> apply(KV<String, Long> input) {
        // route even counts to topic "test" and odd counts to topic "copy"
        if (input.getValue() % 2 == 0)
            return new ProducerRecord<>("test", input.getKey(), input.getKey() + " " + input.getValue().toString());
        else
            return new ProducerRecord<>("copy", input.getKey(), input.getKey() + " " + input.getValue().toString());
    }
}
public static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
        PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
        PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());
        return wordCounts;
    }
}
p.apply("ReadLines", KafkaIO.<Long, String>read()
.withBootstrapServers("localhost:9092")
.withTopic("copy")// use withTopics(List<String>) to read from multiple topics.
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))
.updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
.withLogAppendTime()
.withReadCommitted()
.commitOffsetsInFinalize()
.withProcessingTime()
.withoutMetadata()
)
.apply(Values.create())
.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
.apply(new CountWords())
.apply(MapElements.via(new FormatAsTextFn())) //PCollection<ProducerRecord<string,string>>
.setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
.apply("WriteCounts", (KafkaIO.<String, String>writeRecords()
.withBootstrapServers("localhost:9092")
//.withTopic("test")
.withKeySerializer(StringSerializer.class)
.withValueSerializer(StringSerializer.class)
))
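Note that no topic is set on writeRecords() itself (the .withTopic("test") line is commented out): each ProducerRecord produced by FormatAsTextFn carries its own topic name, so records with even counts go to "test" and the rest go to "copy".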
I have a simple pipeline that reads from Kafka with the KafkaIO reader and applies a few transforms; at the end it writes Avro files to GCP. When I run the pipeline on Dataflow it works perfectly, but when the runner is the DirectRunner it reads all the data from the topics and then throws this exception:
java.lang.IllegalArgumentException: Forbidden IOException when reading from InputStream
at org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream(CoderUtils.java:118)
at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:98)
at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:92)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.direct.CloningBundleFactory$CloningBundle.add(CloningBundleFactory.java:84)
at org.apache.beam.runners.direct.GroupAlsoByWindowEvaluatorFactory$OutputWindowedValueToBundle.outputWindowedValue(GroupAlsoByWindowEvaluatorFactory.java:251)
at org.apache.beam.runners.direct.GroupAlsoByWindowEvaluatorFactory$OutputWindowedValueToBundle.outputWindowedValue(GroupAlsoByWindowEvaluatorFactory.java:237)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1057)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:438)
at org.apache.beam.repackaged.direct_java.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:125)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1060)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:768)
at org.apache.beam.runners.direct.GroupAlsoByWindowEvaluatorFactory$GroupAlsoByWindowEvaluator.processElement(GroupAlsoByWindowEvaluatorFactory.java:185)
at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:160)
at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:124)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
at org.apache.beam.sdk.util.VarInt.decodeLong(VarInt.java:73)
at org.apache.beam.sdk.coders.IterableLikeCoder.decode(IterableLikeCoder.java:136)
at org.apache.beam.sdk.coders.IterableLikeCoder.decode(IterableLikeCoder.java:60)
at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82)
at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36)
at org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream(CoderUtils.java:115)
... 19 more
I use a custom serializer and deserializer for reading the Avro records and extracting the payload.
Kafka Reader
private PTransform<PBegin, PCollection<KV<String, AvroGenericRecord>>> createKafkaRead(Map<String, Object> configUpdates) {
return KafkaIO.<String, AvroGenericRecord>read()
.withBootstrapServers(bootstrapServers)
.withConsumerConfigUpdates(configUpdates)
.withTopics(inputTopics)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializerAndCoder(BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
.withMaxNumRecords(maxNumRecords)
.commitOffsetsInFinalize()
.withoutMetadata();
}
AvroGenericCoder
public class AvroGenericCoder extends CustomCoder<AvroGenericRecord> {
private final Map<String, Object> config;
private transient BeamKafkaAvroGenericDeserializer deserializer;
private transient BeamKafkaAvroGenericSerializer serializer;
public static AvroGenericCoder of(Map<String, Object> config) {
return new AvroGenericCoder(config);
}
protected AvroGenericCoder(Map<String, Object> config) {
this.config = config;
}
private BeamKafkaAvroGenericDeserializer getDeserializer() {
if (deserializer == null) {
BeamKafkaAvroGenericDeserializer d = new BeamKafkaAvroGenericDeserializer();
d.configure(config, false);
deserializer = d;
}
return deserializer;
}
private BeamKafkaAvroGenericSerializer getSerializer() {
if (serializer == null) {
serializer = new BeamKafkaAvroGenericSerializer();
}
return serializer;
}
@Override
public void encode(AvroGenericRecord record, OutputStream outStream) {
getSerializer().serialize(record, outStream);
}
@Override
public AvroGenericRecord decode(InputStream inStream) {
try {
return getDeserializer().deserialize(null, IOUtils.toByteArray(inStream));
} catch (IOException e) {
throw new RuntimeException("Error translating into bytes ", e);
}
}
@Override
public void verifyDeterministic() {
}
@Override
public Object structuralValue(AvroGenericRecord value) {
return super.structuralValue(value);
}
@Override
public int hashCode() {
return HashCodeBuilder.reflectionHashCode(this);
}
@Override
public boolean equals(Object obj) {
return EqualsBuilder.reflectionEquals(this, obj);
}
}
This is the main pipeline:
PCollection<AvroGenericRecord> records = p.apply(readKafkaTr)
.apply(Window.<AvroGenericRecord>into(FixedWindows.of(Duration.standardMinutes(options.getWindowInMinutes())))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(options.getWindowInMinutes())))
.withLateFirings(AfterPane.elementCountAtLeast(options.getElementsCountToWaitAfterWatermark())))
.withAllowedLateness(Duration.standardDays(options.getAfterWatermarkInDays()))
.discardingFiredPanes()
);
records.apply(Filter.by((ProcessFunction<AvroGenericRecord, Boolean>) Objects::nonNull))
.apply(new WriteAvroFilesTr(options.getBasePath(), options.getNumberOfShards()));
Yes, I think @RyanSkraba is right - DirectRunner does many things that not all other runners do (the initial goal of DirectRunner was to be used for testing, so it performs many additional checks compared to other runners).
Btw, why not use Beam's AvroCoder in this case? A simple example of how to use it with KafkaIO:
https://github.com/aromanenko-dev/beam-issues/blob/master/kafka-io/src/main/java/KafkaAvro.java
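To illustrate that suggestion, here is a minimal sketch (not the linked example verbatim: the schema string, the GenericRecordDeserializer class, and the readAvro helper are hypothetical names). The idea is to keep a plain Kafka Deserializer for the value and hand Beam the built-in AvroCoder as the element coder, instead of a hand-written CustomCoder:

import java.io.IOException;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AvroKafkaReadSketch {

    // Placeholder schema: replace with the real schema of your records.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[{\"name\":\"payload\",\"type\":\"string\"}]}");

    // Hypothetical deserializer that turns Avro-encoded bytes into a GenericRecord.
    public static class GenericRecordDeserializer implements Deserializer<GenericRecord> {
        private final DatumReader<GenericRecord> reader = new GenericDatumReader<>(SCHEMA);

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {}

        @Override
        public GenericRecord deserialize(String topic, byte[] data) {
            try {
                return reader.read(null, DecoderFactory.get().binaryDecoder(data, null));
            } catch (IOException e) {
                throw new RuntimeException("Failed to decode Avro record", e);
            }
        }

        @Override
        public void close() {}
    }

    // Read transform that uses Beam's AvroCoder instead of a hand-written coder.
    static PTransform<PBegin, PCollection<KV<String, GenericRecord>>> readAvro(
            String bootstrapServers, String topic) {
        return KafkaIO.<String, GenericRecord>read()
            .withBootstrapServers(bootstrapServers)
            .withTopic(topic)
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializerAndCoder(GenericRecordDeserializer.class, AvroCoder.of(SCHEMA))
            .withoutMetadata();
    }
}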
I have two Kafka streams that I want to merge by some key, and on top of the merged stream I want to perform a stateful operation so that I can sum up the counts from both streams.
This is what I tried, but it didn't work:
PCollection<String> stream1 = .. read from kafka
PCollection<String> stream2 = .. read from kafka
PCollection<KV<String, Long>> wordCount1 = stream1.apply(...)
PCollection<KV<String, Long>> wordCount2 = stream2.apply(...)
PCollection<KV<String, Long>> merged = merge wordCount1 and wordCount2 using CoGroupByKey
PCollection<KV<String, Long>> finalStream = merged.apply(...)
for finalStream, apply state
public class KafkaWordCount implements Serializable {
private String kafkaBrokers =null;
private String topic =null;
public KafkaWordCount(String brokers, String topic){
this.kafkaBrokers =brokers;
this.topic =topic;
}
public PCollection<KV<String,Long>> build(Pipeline p){
final String myState="HELLO";
PCollection<KV<String,Long>> res =
p.apply(KafkaIO.<Long, String>read()
.withBootstrapServers(this.kafkaBrokers )
.withTopic(this.topic)
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class))
.apply(ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
@ProcessElement
public void processElement(ProcessContext processContext) {
KafkaRecord<Long, String> record = processContext.element();
processContext.output(record.getKV().getValue());
}
}))
.apply("ExtractWords",
ParDo.of(new DoFn<String, KV<String, Long>>() {
@ProcessElement
public void processElement(ProcessContext c) {
for (String word : c.element().split("[^\\p{L}]+")) {
if (!word.isEmpty()) {
c.output(KV.of(word,1L));
}
}
}
}));
return res;
}
}
public class DataPipe {
public static void main(String[] args) {
final String stateId = "myMapState";
final String myState = "myState";
PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class).setRunner(FlinkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<KV<String,Long>> stream1 =
new KafkaWordCount("localhost:9092","idm")
.build(p)
.apply(
Window
.<KV<String,Long>>into(
FixedWindows.of(Duration.millis(3600000)))
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes());
PCollection<KV<String,Long>> stream2 =
new KafkaWordCount("localhost:9092","assist")
.build(p)
.apply(
Window
.<KV<String,Long>>into(
FixedWindows.of(Duration.millis(3600000)))
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes());
final TupleTag<Long> web = new TupleTag<Long>();
final TupleTag<Long> assist = new TupleTag<Long>();
PCollection<KV<String, CoGbkResult>> joinedStream =
KeyedPCollectionTuple.of(web, stream1)
.and(assist, stream2)
.apply(CoGroupByKey.<String>create());
PCollection<KV<String,Long>> finalCountStream =
joinedStream
.apply(ParDo.of(
new DoFn<KV<String, CoGbkResult>, KV<String,Long>>() {
@StateId(stateId)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext processContext,
@StateId(stateId) MapState<String, Long> state) {
KV<String,CoGbkResult> element = processContext.element();
Iterable<Long> count1 = element.getValue().getAll(web);
Iterable<Long> count2 = element.getValue().getAll(assist);
Long sumAmount =
StreamSupport
.stream(
Iterables
.concat(count1, count2)
.spliterator(),
false)
.collect(Collectors.summingLong(n -> n));
System.out.println(element.getKey()+"::"+sumAmount);
// processContext.output(element.getKey()+"::"+sumAmount);
Long currCount = state.get(element.getKey()).read() == null ? 0L : state.get(element.getKey()).read();
Long newCount = currCount + sumAmount;
state.put(element.getKey(), newCount); // store the running total, not just this pane's sum
processContext.output(KV.of(element.getKey(), sumAmount));
}
}));
finalCountStream
.apply(ParDo.of(new DoFn<KV<String,Long>, KV<String,Long>>() {
@ProcessElement
public void processElement(ProcessContext processContext) {
processContext.output(processContext.element());
}
}))
.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
@StateId(myState)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext c,
@StateId(myState) MapState<String, Long> state) {
KV<String,Long> e = c.element();
System.out.println("Thread ID :"
+ Thread.currentThread().getId());
Long currCount =
state.get(e.getKey()).read()==null
? 0L
: state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
}))
.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values());
/* finalCountStream.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values()
);*/
//finalCountStream.apply(TextIO.write().to("wordcounts"));
p.run().waitUntilFinish();
}
}
This Beam pipeline reads text from two Kafka topics, splits it into words, merges both streams by word, and finally emits the combined word counts to another Kafka topic.
For reference: if I understand the problem correctly, you can simplify the first part of your pipeline by using KafkaIO's withTopics(List<String>) to read from two (or more) topics in one step, so there is no need to join data from different topics afterwards.
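A minimal sketch of that suggestion, assuming the broker address localhost:9092 and the topic names "idm" and "assist" from the question (the MultiTopicRead class and readBothTopics helper are just illustrative names); the word-count transforms can then be applied once to the resulting single stream:

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiTopicRead {
    static PCollection<KV<Long, String>> readBothTopics(Pipeline p) {
        // One read over both topics; downstream transforms then see a single merged stream.
        return p.apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("localhost:9092")
            .withTopics(Arrays.asList("idm", "assist"))
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());
    }
}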
I am facing the following error when I use side inputs.
With the following model code:
PCollectionView<Map<String, String>> view1 = information
    .apply(View.<String, String>asMap());

PCollection<KV<String, Position>> FileData;
FileData.apply("populate",
    ParDo.of(new DoFn<KV<String, Position>, KV<String, Position>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
        }
    }).withSideInputs(view1));
The error occurs when the withSideInputs method is called; the side input does not seem to accept a KV-typed value. Could you please tell me what I am missing?
Error Message:
java.lang.ClassCastException: org.apache.beam.sdk.values.KV cannot be cast to java.lang.Iterable
at org.apache.beam.runners.core.SideInputHandler.addSideInputValue(SideInputHandler.java:142)
at org.apache.beam.runners.apex.translation.operators.ApexParDoOperator$2.process(ApexParDoOperator.java:225)
at org.apache.beam.runners.apex.translation.operators.ApexParDoOperator$2.process(ApexParDoOperator.java:207)
at com.datatorrent.api.DefaultInputPort.put(DefaultInputPort.java:79)
at com.datatorrent.stram.engine.AbstractReservoir$SpscArrayBlockingQueueReservoir.sweep(AbstractReservoir.java:413)
at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:269)
at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1428)
Sample Code to reproduce the issue:
public void testMapAsEntrySetSideInput() {
final PCollectionView<Map<String, Integer>> view =
pipeline.apply("CreateSideInput", Create.of(KV.of("a", 1), KV.of("b", 3)))
.apply(View.<String, Integer>asMap());
PCollection<KV<String, Integer>> output =
pipeline.apply("CreateMainInput", Create.of(2 /* size */))
.apply(
"OutputSideInputs",
ParDo.of(new DoFn<Integer, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) {
assertEquals((int) c.element(), c.sideInput(view).size());
assertEquals((int) c.element(), c.sideInput(view).entrySet().size());
for (Entry<String, Integer> entry : c.sideInput(view).entrySet()) {
c.output(KV.of(entry.getKey(), entry.getValue()));
}
}
}).withSideInputs(view));
PAssert.that(output).containsInAnyOrder(
KV.of("a", 1), KV.of("b", 3));
pipeline.run();
}
This was a bug in the Apex runner and it has been resolved for Beam release 2.3.0.