Apache Beam IOException in decoder - apache-beam

I have a simple pipeline that reads from Kafka with the KafkaIO reader and then applies further transforms in the pipeline. In the end, it writes the output to GCP in Avro format. When I run the pipeline on Dataflow it works perfectly, but when the runner is DirectRunner it reads all the data from the topics and then throws the following exception:
java.lang.IllegalArgumentException: Forbidden IOException when reading from InputStream
at org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream(CoderUtils.java:118)
at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:98)
at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:92)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.direct.CloningBundleFactory$CloningBundle.add(CloningBundleFactory.java:84)
at org.apache.beam.runners.direct.GroupAlsoByWindowEvaluatorFactory$OutputWindowedValueToBundle.outputWindowedValue(GroupAlsoByWindowEvaluatorFactory.java:251)
at org.apache.beam.runners.direct.GroupAlsoByWindowEvaluatorFactory$OutputWindowedValueToBundle.outputWindowedValue(GroupAlsoByWindowEvaluatorFactory.java:237)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1057)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:438)
at org.apache.beam.repackaged.direct_java.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:125)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1060)
at org.apache.beam.repackaged.direct_java.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:768)
at org.apache.beam.runners.direct.GroupAlsoByWindowEvaluatorFactory$GroupAlsoByWindowEvaluator.processElement(GroupAlsoByWindowEvaluatorFactory.java:185)
at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:160)
at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:124)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
at org.apache.beam.sdk.util.VarInt.decodeLong(VarInt.java:73)
at org.apache.beam.sdk.coders.IterableLikeCoder.decode(IterableLikeCoder.java:136)
at org.apache.beam.sdk.coders.IterableLikeCoder.decode(IterableLikeCoder.java:60)
at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82)
at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36)
at org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream(CoderUtils.java:115)
... 19 more
I use a custom serializer and deserializer for reading Avro and extracting the payload.
Kafka Reader
private PTransform<PBegin, PCollection<KV<String, AvroGenericRecord>>> createKafkaRead(Map<String, Object> configUpdates) {
    return KafkaIO.<String, AvroGenericRecord>read()
            .withBootstrapServers(bootstrapServers)
            .withConsumerConfigUpdates(configUpdates)
            .withTopics(inputTopics)
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializerAndCoder(BeamKafkaAvroGenericDeserializer.class, AvroGenericCoder.of(serDeConfig()))
            .withMaxNumRecords(maxNumRecords)
            .commitOffsetsInFinalize()
            .withoutMetadata();
}
AvroGenericCoder
public class AvroGenericCoder extends CustomCoder<AvroGenericRecord> {

    private final Map<String, Object> config;
    private transient BeamKafkaAvroGenericDeserializer deserializer;
    private transient BeamKafkaAvroGenericSerializer serializer;

    public static AvroGenericCoder of(Map<String, Object> config) {
        return new AvroGenericCoder(config);
    }

    protected AvroGenericCoder(Map<String, Object> config) {
        this.config = config;
    }

    private BeamKafkaAvroGenericDeserializer getDeserializer() {
        if (deserializer == null) {
            BeamKafkaAvroGenericDeserializer d = new BeamKafkaAvroGenericDeserializer();
            d.configure(config, false);
            deserializer = d;
        }
        return deserializer;
    }

    private BeamKafkaAvroGenericSerializer getSerializer() {
        if (serializer == null) {
            serializer = new BeamKafkaAvroGenericSerializer();
        }
        return serializer;
    }

    @Override
    public void encode(AvroGenericRecord record, OutputStream outStream) {
        getSerializer().serialize(record, outStream);
    }

    @Override
    public AvroGenericRecord decode(InputStream inStream) {
        try {
            return getDeserializer().deserialize(null, IOUtils.toByteArray(inStream));
        } catch (IOException e) {
            throw new RuntimeException("Error translating into bytes ", e);
        }
    }

    @Override
    public void verifyDeterministic() {
    }

    @Override
    public Object structuralValue(AvroGenericRecord value) {
        return super.structuralValue(value);
    }

    @Override
    public int hashCode() {
        return HashCodeBuilder.reflectionHashCode(this);
    }

    @Override
    public boolean equals(Object obj) {
        return EqualsBuilder.reflectionEquals(this, obj);
    }
}
This is the main pipeline:
PCollection<AvroGenericRecord> records = p.apply(readKafkaTr)
        .apply(Window.<AvroGenericRecord>into(FixedWindows.of(Duration.standardMinutes(options.getWindowInMinutes())))
                .triggering(AfterWatermark.pastEndOfWindow()
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(options.getWindowInMinutes())))
                        .withLateFirings(AfterPane.elementCountAtLeast(options.getElementsCountToWaitAfterWatermark())))
                .withAllowedLateness(Duration.standardDays(options.getAfterWatermarkInDays()))
                .discardingFiredPanes()
        );

records.apply(Filter.by((ProcessFunction<AvroGenericRecord, Boolean>) Objects::nonNull))
        .apply(new WriteAvroFilesTr(options.getBasePath(), options.getNumberOfShards()));

Yes, I think @RyanSkraba is right - the DirectRunner does many things that other runners don't (because the initial goal of the DirectRunner was to be used for testing, so it performs many additional checks compared to other runners).
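If you want to reproduce that coder round-trip check outside the runner, Beam ships a CoderProperties test utility. A minimal sketch (serDeConfig() is the config helper from the question; sampleRecord() is a hypothetical test fixture):
public class AvroGenericCoderTest {

    @Test
    public void coderRoundTripsSingleRecord() throws Exception {
        AvroGenericCoder coder = AvroGenericCoder.of(serDeConfig()); // config helper from the question
        AvroGenericRecord record = sampleRecord();                   // hypothetical test fixture
        // Encodes and then decodes the record through the coder and checks equality,
        // which is essentially what the DirectRunner relies on when cloning elements.
        org.apache.beam.sdk.testing.CoderProperties.coderDecodeEncodeEqual(coder, record);
    }
}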
Btw, why not use Beam's AvroCoder in this case? Here is a simple example of how to use it with KafkaIO:
https://github.com/aromanenko-dev/beam-issues/blob/master/kafka-io/src/main/java/KafkaAvro.java
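For reference, a minimal sketch of that approach (assuming the values are plain Avro GenericRecords, the writer Schema is available as schema at pipeline construction time, and GenericRecordDeserializer is a hypothetical Deserializer<GenericRecord> implementation):
// Read Avro GenericRecords from Kafka using Beam's built-in AvroCoder
// instead of a hand-written CustomCoder.
PCollection<KV<String, GenericRecord>> records = p.apply(
        KafkaIO.<String, GenericRecord>read()
                .withBootstrapServers(bootstrapServers)
                .withTopics(inputTopics)
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializerAndCoder(
                        GenericRecordDeserializer.class,            // hypothetical Deserializer<GenericRecord>
                        AvroCoder.of(GenericRecord.class, schema))  // schema: the Avro writer schema
                .withoutMetadata());
AvroCoder handles the byte-level encoding for you, so the element-cloning checks in the DirectRunner should round-trip cleanly.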

Related

Mixing Kafka Streams DSL with Processor API to get offset

I am trying to find a way to log the offset when an exception occurs.
Here is what I am trying to achieve:
void createTopology(StreamsBuilder builder) {
    builder.stream(topic, Consumed.with(Serdes.String(), new JsonSerde()))
            .filter(...)
            .mapValues(value -> {
                Map<String, Object> output;
                try {
                    output = decode(value.get("data"));
                } catch (DecodingException e) {
                    LOGGER.error(e.getMessage());
                    // TODO: LOG OFFSET FOR FAILED DECODE HERE
                    return new ArrayList<>();
                }
                ...
                return output;
            })
            .filter((k, v) -> !(v instanceof List && ((List<?>) v).isEmpty()))
            .to(sink_topic);
}
I found this: https://docs.confluent.io/platform/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-transformations-stateful
and it is my understanding that I need to use the Processor API, but I still haven't found a solution for my issue.
A ValueTransformer can also access the offset via the ProcessorContext passed to init, and I believe it's much easier.
Here is the solution, as suggested by IUSR: https://stackoverflow.com/a/73465691/14945779 (thank you):
static class InjectOffsetTransformer implements ValueTransformer<JsonObject, JsonObject> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public JsonObject transform(JsonObject value) {
        value.addProperty("offset", context.offset());
        return value;
    }

    @Override
    public void close() {
    }
}
void createTopology(StreamsBuilder builder) {
    builder.stream(topic, Consumed.with(Serdes.String(), new JsonSerde()))
            .filter(...)
            .transformValues(InjectOffsetTransformer::new)
            .mapValues(value -> {
                Map<String, Object> output;
                try {
                    output = decode(value.get("data"));
                } catch (DecodingException e) {
                    LOGGER.warn(String.format("Error reading from topic %s. Last read offset %s:", topic, lastReadOffset), e);
                    return new ArrayList<>();
                }
                lastReadOffset = value.get("offset").getAsLong();
                return output;
            })
            .filter((k, v) -> !(v instanceof List && ((List<?>) v).isEmpty()))
            .to(sink_topic);
}

Usage of timer and side input on ParDo in Apache Beam

I'm trying to write a ParDo that uses both a Timer and a Side Input, but it crashes when I run it with beam-runners-direct-java, throwing an IllegalArgumentException at https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/QuiescenceDriver.java#L167, because there are actually two inputs to the ParDo (the main PCollection and the side input), while only one is expected.
Is there some way to workaround this? Is this a bug in Beam?
Here's the code snippet that reproduces that behaviour:
public class TestCrashesForTimerAndSideInput {

    @Rule
    public final transient TestPipeline p = TestPipeline.create();

    @RequiredArgsConstructor
    private static class DoFnWithTimer extends DoFn<KV<String, String>, String> {

        private final PCollectionView<Map<String, String>> sideInput;

        @TimerId("t")
        private final TimerSpec tSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

        @ProcessElement
        public void processElement(ProcessContext c, @TimerId("t") Timer t) {
            KV<String, String> element = c.element();
            c.output(element.getKey() + c.sideInput(sideInput).get(element));
            t.offset(Duration.standardSeconds(1)).setRelative();
        }

        @OnTimer("t")
        public void onTimerFire(OnTimerContext x) {
            x.output("Timer fired");
        }
    }

    @Test
    public void testCrashesForTimerAndSideInput() {
        ImmutableMap<String, String> sideData = ImmutableMap.<String, String>builder().
                put("x", "X").
                put("y", "Y").
                build();
        PCollectionView<Map<String, String>> sideInput =
                p.apply(Create.of(sideData)).apply(View.asMap());
        TestStream<String> testStream = TestStream.create(StringUtf8Coder.of()).
                addElements("x").
                advanceProcessingTime(Duration.standardSeconds(1)).
                addElements("y").
                advanceProcessingTime(Duration.standardSeconds(1)).
                advanceWatermarkToInfinity();
        PCollection<String> result = p.
                apply(testStream).
                apply(MapElements.into(kvs(strings(), strings())).via(v -> KV.of(v, v))).
                apply(ParDo.of(new DoFnWithTimer(sideInput)).withSideInputs(sideInput));
        PAssert.that(result).containsInAnyOrder("X", "Y", "Timer fired");
        p.run();
    }
}
and the exception:
java.lang.IllegalArgumentException: expected one element but was: <ParDo(DoFnWithTimer)/ParMultiDo(DoFnWithTimer)/To KeyedWorkItem/ParMultiDo(ToKeyedWorkItem).output [PCollection], View.AsMap/View.VoidKeyToMultimapMaterialization/ParDo(VoidKeyToMultimapMaterialization)/ParMultiDo(VoidKeyToMultimapMaterialization).output [PCollection]>
at org.apache.beam.repackaged.beam_runners_direct_java.com.google.common.collect.Iterators.getOnlyElement(Iterators.java:322)
at org.apache.beam.repackaged.beam_runners_direct_java.com.google.common.collect.Iterables.getOnlyElement(Iterables.java:294)
at org.apache.beam.runners.direct.QuiescenceDriver.fireTimers(QuiescenceDriver.java:167)
at org.apache.beam.runners.direct.QuiescenceDriver.drive(QuiescenceDriver.java:110)
at org.apache.beam.runners.direct.ExecutorServiceParallelExecutor$2.run(ExecutorServiceParallelExecutor.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Apache Flink - how to send and consume POJOs using AWS Kinesis

I want to consume POJOs arriving from Kinesis with Flink.
Is there any standard for how to correctly send and deserialize the messages?
Thanks
I resolved it with:
DataStream<SamplePojo> kinesis = see.addSource(new FlinkKinesisConsumer<>(
        "my-stream",
        new POJODeserializationSchema(),
        kinesisConsumerConfig));
and
public class POJODeserializationSchema extends AbstractDeserializationSchema<SamplePojo> {

    private ObjectMapper mapper;

    @Override
    public SamplePojo deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        SamplePojo retVal = mapper.readValue(message, SamplePojo.class);
        return retVal;
    }

    @Override
    public boolean isEndOfStream(SamplePojo nextElement) {
        return false;
    }
}
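Since the question also asks about the sending side, here is a possible counterpart (a sketch, assuming the producer writes the same Jackson-serialized JSON bytes that POJODeserializationSchema reads back):
public class POJOSerializationSchema implements SerializationSchema<SamplePojo> {

    private transient ObjectMapper mapper;

    @Override
    public byte[] serialize(SamplePojo element) {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        try {
            // Same Jackson mapping as on the consuming side, so the bytes round-trip.
            return mapper.writeValueAsBytes(element);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Could not serialize " + element, e);
        }
    }
}
Such a schema could then be handed to a FlinkKinesisProducer<SamplePojo> in the sending application.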

Flink Kafka Consumer throws Null Pointer Exception when using DataStream key by

I am using this Flink CEP example, but I have separated it into two applications: one that sends data to Kafka and another that reads from Kafka. I created a producer for the class TemperatureWarning, i.e. I was sending TemperatureWarning data to Kafka. The following is my code which consumes data from Kafka:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.enableCheckpointing(5000);

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "PUBLICDNS:9092");
properties.setProperty("zookeeper.connect", "PUBLICDNS:2181");
properties.setProperty("group.id", "test");

DataStream<TemperatureWarning> dstream = env.addSource(new FlinkKafkaConsumer09<TemperatureWarning>("MonitoringEvent", new MonitoringEventSchema(), properties));

Pattern<TemperatureWarning, ?> alertPattern = Pattern.<TemperatureWarning>begin("first")
        .next("second")
        .within(Time.seconds(20));

PatternStream<TemperatureWarning> alertPatternStream = CEP.pattern(
        dstream.keyBy("rackID"),
        alertPattern);

DataStream<TemperatureAlert> alerts = alertPatternStream.flatSelect(
        (Map<String, TemperatureWarning> pattern, Collector<TemperatureAlert> out) -> {
            TemperatureWarning first = pattern.get("first");
            TemperatureWarning second = pattern.get("second");
            if (first.getAverageTemperature() < second.getAverageTemperature()) {
                out.collect(new TemperatureAlert(second.getRackID(), second.getAverageTemperature(), second.getTimeStamp()));
            }
        });

dstream.print();
alerts.print();
env.execute("Flink Kafka Consumer");
But when I execute this application, it throws the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.flink.api.common.operators.Keys$ExpressionKeys.<init>(Keys.java:329)
at org.apache.flink.streaming.api.datastream.DataStream.keyBy(DataStream.java:274)
at com.yash.consumer.KafkaFlinkConsumer.main(KafkaFlinkConsumer.java:49)
Following is my class TemperatureWarning:
public class TemperatureWarning {

    private int rackID;
    private double averageTemperature;
    private long timeStamp;

    public TemperatureWarning(int rackID, double averageTemperature, long timeStamp) {
        this.rackID = rackID;
        this.averageTemperature = averageTemperature;
        this.timeStamp = timeStamp;
    }

    public TemperatureWarning() {
        this(-1, -1, -1);
    }

    public int getRackID() {
        return rackID;
    }

    public void setRackID(int rackID) {
        this.rackID = rackID;
    }

    public double getAverageTemperature() {
        return averageTemperature;
    }

    public void setAverageTemperature(double averageTemperature) {
        this.averageTemperature = averageTemperature;
    }

    public long getTimeStamp() {
        return timeStamp;
    }

    public void setTimeStamp(long timeStamp) {
        this.timeStamp = timeStamp;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof TemperatureWarning) {
            TemperatureWarning other = (TemperatureWarning) obj;
            return rackID == other.rackID && averageTemperature == other.averageTemperature;
        } else {
            return false;
        }
    }

    @Override
    public int hashCode() {
        return 41 * rackID + Double.hashCode(averageTemperature);
    }

    @Override
    public String toString() {
        //return "TemperatureWarning(" + getRackID() + ", " + averageTemperature + ")";
        return "TemperatureWarning(" + getRackID() + "," + averageTemperature + ") " + "," + getTimeStamp();
    }
}
Following is my class MonitoringEventSchema:
public class MonitoringEventSchema implements DeserializationSchema<TemperatureWarning>, SerializationSchema<TemperatureWarning> {

    @Override
    public TypeInformation<TemperatureWarning> getProducedType() {
        // TODO Auto-generated method stub
        return null;
    }

    @Override
    public byte[] serialize(TemperatureWarning element) {
        // TODO Auto-generated method stub
        return element.toString().getBytes();
    }

    @Override
    public TemperatureWarning deserialize(byte[] message) throws IOException {
        // TODO Auto-generated method stub
        if (message != null) {
            String str = new String(message, "UTF-8");
            String[] val = str.split(",");
            TemperatureWarning warning = new TemperatureWarning(Integer.parseInt(val[0]), Double.parseDouble(val[1]), Long.parseLong(val[2]));
            return warning;
        }
        return null;
    }

    @Override
    public boolean isEndOfStream(TemperatureWarning nextElement) {
        // TODO Auto-generated method stub
        return false;
    }
}
Now, what is required for the keyBy operation, given that I have already specified the key on which the stream should be partitioned? What needs to be done here to solve this error?
The problem is in this function:
@Override
public TypeInformation<TemperatureWarning> getProducedType() {
    // TODO Auto-generated method stub
    return null;
}
You cannot return null here.
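For example, one possible fix (a sketch, relying on Flink deriving the type information for the POJO):
@Override
public TypeInformation<TemperatureWarning> getProducedType() {
    // Return real type information instead of null so that keyBy("rackID")
    // can build its expression keys for the TemperatureWarning POJO.
    return TypeInformation.of(TemperatureWarning.class);
}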

CAS consumer not working as expected

I have a CAS consumer AE which is expected to iterate over the CAS objects in a pipeline, serialize them, and add the serialized CASes to an XML file.
public class DataWriter extends JCasConsumer_ImplBase {

    private File outputDirectory;

    public static final String PARAM_OUTPUT_DIRECTORY = "outputDir";

    @ConfigurationParameter(name = PARAM_OUTPUT_DIRECTORY, defaultValue = ".")
    private String outputDir;

    CasToInlineXml cas2xml;

    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        ConfigurationParameterInitializer.initialize(this, context);
        outputDirectory = new File(outputDir);
        if (!outputDirectory.exists()) {
            outputDirectory.mkdirs();
        }
    }

    @Override
    public void process(JCas jCas) throws AnalysisEngineProcessException {
        String file = fileCollectionReader.fileName;
        File outFile = new File(outputDirectory, file + ".xmi");
        FileOutputStream out = null;
        try {
            out = new FileOutputStream(outFile);
            String xmlAnnotations = cas2xml.generateXML(jCas.getCas());
            out.write(xmlAnnotations.getBytes("UTF-8"));
            /* XmiCasSerializer ser = new XmiCasSerializer(jCas.getCas().getTypeSystem());
            XMLSerializer xmlSer = new XMLSerializer(out, false);
            ser.serialize(jCas.getCas(), xmlSer.getContentHandler()); */
            if (out != null) {
                out.close();
            }
        } catch (IOException e) {
            throw new AnalysisEngineProcessException(e);
        } catch (CASException e) {
            throw new AnalysisEngineProcessException(e);
        }
    }
}
I am using it inside a pipeline after all my annotators, but it couldn't read the CAS objects (I am getting a NullPointerException at jCas.getCas()). It looks like I don't understand the proper usage of a CAS consumer. I appreciate any suggestions.