Reading data from a file and creating a POJO using DeserializationSchema - apache-kafka

I am JUnit testing my Flink code, where I read messages from Kafka and use a class X implementing DeserializationSchema to convert them into a POJO. For the JUnit test I read from a file, which gives me a DataStream<String>, but now I want to convert that into a DataStream<Y> by using the same class X to deserialize the messages. Can someone help me with how this can be done?
class X implements DeserializationSchema<Y> {
    ObjectMapper om = ......

    @Override
    public void open(InitializationContext context) {
        // some Flink metric initialization
    }

    @Override
    public Y deserialize(byte[] inputData) {
        // some custom logic that I am really interested in testing
    }
}
KafkaSourceBuilder<Y> sourceBuilder = KafkaSource.<Y>builder()
.setTopics(inputTopic)
.setGroupId(kafkaInputGroupId)
.setProperties(kafkaInputParameters)
.setStartingOffsets(getOffset(jobConfig, inputTopic))
.setValueOnlyDeserializer(new X());
I tried the following:
DataStream<String> text =
        env.readTextFile("src/test/resources/mobile_integration_test.txt");

text.map(new MapFunction<String, Y>() {
    @Override
    public Y map(String value) throws Exception {
        return new X().deserialize(value.getBytes());
    }
}).print();
but this gives me an NPE because of the metric being initialized in the open() method (explained above).
Any better ways to do this?
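One approach that might avoid the NPE (a sketch only, not from the thread, assuming the usual Flink imports and a recent Flink 1.x where DeserializationSchema.InitializationContext exposes getMetricGroup()): wrap X in a RichMapFunction and call its open() with a context backed by the task's real metric group before deserializing.

DataStream<Y> typed = text.map(new RichMapFunction<String, Y>() {
    private transient X schema;

    @Override
    public void open(Configuration parameters) throws Exception {
        schema = new X();
        // Hand the schema a real metric group so its open() does not NPE.
        schema.open(new DeserializationSchema.InitializationContext() {
            @Override
            public MetricGroup getMetricGroup() {
                return getRuntimeContext().getMetricGroup();
            }

            @Override
            public UserCodeClassLoader getUserCodeClassLoader() {
                // Not needed for the metric initialization in this sketch.
                throw new UnsupportedOperationException();
            }
        });
    }

    @Override
    public Y map(String value) throws Exception {
        return schema.deserialize(value.getBytes(StandardCharsets.UTF_8));
    }
});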

Related

How to add a header in Spring Cloud Streams

The Spring Cloud Streams documentation doesn't have any examples of adding or manipulating a header, only of accessing headers.
There are examples online that show usage of the ProcessorContext. However, using this results in inconsistent header application to messages.
This is the current implementation:
public class EventHeaderTransformer implements Transformer<String, RequestEvent, KeyValue<String, RequestEvent>> {

    private static final String EVENT_HEADER_NAME = "event";

    ProcessorContext context;

    public EventHeaderTransformer() { }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, RequestEvent> transform(String key, RequestEvent value) {
        context.headers().add(EVENT_HEADER_NAME, value.getEventName().getBytes());
        return new KeyValue<>(key, value);
    }

    @Override
    public void close() {
        // nothing here
    }
}
public Function<KStream<String, Request>, KStream<String, RequestEvent>> streamRequests() {
    return input -> input
        .transform(() -> unrelatedTransformer)
        .filter(unrelatedFilter)
        // The transformer in question
        .transform(() -> eventHeaderTransformer);
        // Debug output after this transformer shows inconsistencies
}
streamRequests-in-0:
  destination: queue.unmanaged.requests
  group: streamRequests
  consumer:
    partitioned: true
    concurrency: 3
streamRequests-out-0:
  destination: queue.core.requests
For example, the code above results in the following message layout across 9 messages:
(p = partition)
([N] = offset)
p0[0] = message without header
p1[0] = message without header
p2[0] = message with header
p0[0] = message without header
p1[0] = message without header
p2[0] = message without header
p0[0] = message without header
p1[0] = message without header
p2[0] = message with header
Printing out debug messages shows unexpected results, where sometimes a header won't list as added, or headers may be empty, etc.
How does one simply add or manipulate a header in a message passing through a transformer in Spring Cloud Streams?
.transform(() -> eventHeaderTransformer);
Transformers are stateful; you must return a new instance from the supplier each time, certainly with concurrency. With the newer org.apache.kafka.streams.processor.api.ContextualProcessor (which replaces Transformer in 3.3), this is enforced regardless of concurrency.
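A minimal sketch of what the answer suggests, against the pipeline from the question (the UnrelatedTransformer no-arg constructor is hypothetical):

public Function<KStream<String, Request>, KStream<String, RequestEvent>> streamRequests() {
    return input -> input
        .transform(UnrelatedTransformer::new)       // a fresh instance per call
        .filter(unrelatedFilter)
        .transform(EventHeaderTransformer::new);    // likewise for the transformer in question
}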

Apache Flink typed KafkaSource

I implemented a connection to a Kafka stream as described here. Now I attempt to write the data into a Postgres database using a JDBC sink.
The Kafka source, however, seems to have no type, so when writing the SQL statements everything looks like type Nothing.
How can I use fromSource so that I actually have a typed source for Kafka?
What I have tried so far is the following:
object Main {
def main(args: Array[String]) {
val builder = KafkaSource.builder
builder.setBootstrapServers("localhost:29092")
builder.setProperty("partition.discovery.interval.ms", "10000")
builder.setTopics("created")
builder.setBounded(OffsetsInitializer.latest)
builder.setStartingOffsets(OffsetsInitializer.earliest)
builder.setDeserializer(KafkaRecordDeserializationSchema.of(new CreatedEventSchema))
val source = builder.build()
val env = StreamExecutionEnvironment.getExecutionEnvironment
val streamSource = env
.fromSource(source, WatermarkStrategy.noWatermarks, "Kafka Source")
streamSource.addSink(JdbcSink.sink(
"INSERT INTO conversations (timestamp, active_conversations, total_conversations) VALUES (?,?,?)",
(statement, event) => {
statement.setTime(1, event.date)
statement.setInt(2, event.a)
statement.setInt(3, event.b)
},JdbcExecutionOptions.builder()
.withBatchSize(1000)
.withBatchIntervalMs(200)
.withMaxRetries(5)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:postgresql://localhost:5432/reporting")
.withDriverName("org.postgresql.Driver")
.withUsername("postgres")
.withPassword("veryverysecret:-)")
.build()
))
env.execute()
}
}
This does not compile because event is of type Nothing. But I think it should not be, because with CreatedEventSchema Flink should be able to deserialise it.
Maybe it is also important to note that I actually just want to process the values of the Kafka messages.
In Java you might do something like this:
KafkaSource<Event> source =
KafkaSource.<Event>builder()
.setBootstrapServers("localhost:9092")
.setTopics(TOPIC)
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(new EventDeserializationSchema())
.build();
with a value deserializer along these lines:
public class EventDeserializationSchema extends AbstractDeserializationSchema<Event> {

    private static final long serialVersionUID = 1L;

    private transient ObjectMapper objectMapper;

    @Override
    public void open(InitializationContext context) {
        objectMapper = JsonMapper.builder().build().registerModule(new JavaTimeModule());
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        return objectMapper.readValue(message, Event.class);
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}
Sorry I don't have a Scala example handy, but hopefully this will point you in the right direction.
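For completeness, a small sketch (in Java, under the same assumptions as the answer's snippet) of how the typed source then yields a typed stream, so a sink lambda sees Event instead of Nothing:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Because the source is KafkaSource<Event>, fromSource produces a DataStream<Event>.
DataStream<Event> events =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");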

Joining streams in Flink doesn't work with Kafka consumer

I'm trying to join two streams, one built from a data collection and one consumed from Kafka.
Code snippet:
public static void main(String[] args) {
    KafkaSource<JsonNode> kafkaSource = ...
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Kafka messages : {"name": "John"}
    final DataStream<JsonNode> dataStream1 = env.fromSource(kafkaSource, waterMark(), "Kafka").rebalance()
            .assignTimestampsAndWatermarks(waterMark());
    final DataStream<String> dataStream2 = env.fromElements("John", "Zbe", "Abe")
            .assignTimestampsAndWatermarks(waterMark());

    dataStream1
            .join(dataStream2)
            .where(new KeySelector<JsonNode, String>() {
                @Override
                public String getKey(JsonNode value) throws Exception {
                    return value.get("name").asText();
                }
            })
            .equalTo(new KeySelector<String, String>() {
                @Override
                public String getKey(String value) throws Exception {
                    return value;
                }
            })
            .window(SlidingEventTimeWindows.of(Time.minutes(50) /* size */, Time.minutes(10) /* slide */))
            .apply(new JoinFunction<JsonNode, String, String>() {
                @Override
                public String join(JsonNode first, String second) throws Exception {
                    return first + " " + second;
                }
            }).print();

    env.execute();
}
Watermark:
private static <T> WatermarkStrategy<T> waterMark() {
    return new WatermarkStrategy<T>() {
        @Override
        public WatermarkGenerator<T> createWatermarkGenerator(
                org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier.Context context) {
            return new AscendingTimestampsWatermarks<>();
        }

        @Override
        public TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
            return (event, timestamp) -> System.currentTimeMillis();
        }
    };
}
After running the snippet, there is no joined data in the output. Am I going wrong somewhere?
Apache Flink version: 1.13.2
The problem is probably related to watermarking. Since you're not using event-time-based timestamps, try changing SlidingEventTimeWindows to SlidingProcessingTimeWindows and see if it then produces results.
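A sketch of that change, keeping everything else in the join as in the question (only the window assigner differs):

// Processing-time windows fire on wall-clock time, so they do not depend on watermarks.
.window(SlidingProcessingTimeWindows.of(Time.minutes(50) /* size */, Time.minutes(10) /* slide */))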
The underlying problem is probably a lack of data. The rebalance() on the Kafka stream guarantees that idle partitions won't stall the watermarks unless all partitions are idle. But if this is an unbounded streaming job, unless you have some data that falls after the first window, the watermark won't advance far enough to trigger the first window.
Options:
Send some data with larger timestamps
Configure the Kafka source as a bounded stream by using the .setBounded(...) option on the KafkaSource builder (see the sketch after this list)
Stop the job using the --drain option (docs)
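A sketch of the bounded-source option; the builder settings elided in the question stay as they are, only setBounded() is added:

KafkaSource<JsonNode> kafkaSource = KafkaSource.<JsonNode>builder()
        // ... existing settings (topic, bootstrap servers, deserializer) ...
        .setBounded(OffsetsInitializer.latest())   // job finishes and emits a final watermark
        .build();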
The fact that dataStream2 is bounded is also a problem, but I'm not sure how much of one. At best this will prevent any windows after the first one from producing any results (since datastream joins are inner joins).

Purpose and behaviour of init() in Vertx class

I have the following verticle for testing purposes:
public class UserVerticle extends AbstractVerticle {

    private static final Logger log = LoggerFactory.getLogger(UserVerticle.class);

    @Override
    public void start(Future<Void> sf) {
        log.info("start()");
        JsonObject cnf = config();
        log.info("start.config={}", cnf.toString());
        sf.complete();
    }

    @Override
    public void stop(Future<Void> sf) {
        log.info("stop()");
        sf.complete();
    }

    private void onMessage(Message<JsonObject> message) {
        log.info("onMessage(message={})", message);
    }
}
It is deployed from the main verticle with:
vertx.deployVerticle("org.buguigny.cluster.UserVerticle",
    new DeploymentOptions()
        .setInstances(1)
        .setConfig(new JsonObject()
            .put(some_key, some_data)
        ),
    ar -> {
        if (ar.succeeded()) {
            log.info("UserVerticle(uname={}, addr={}) deployed", uname, addr);
            // continue when OK
        }
        else {
            log.error("Could not deploy UserVerticle(uname={}). Cause: {}", uname, ar.cause());
            // continue when KO
        }
    });
This code works fine.
I had a look at the Verticle documentation and discovered an init() callback method I didn't see before. As the documentation doesn't say much about what it really does, I defined it to see where in the life cycle of a verticle it gets called.
@Override
public void init(Vertx vertx, Context context) {
    log.info("init()");
    JsonObject cnf = context.config();
    log.info("init.config={}", cnf.toString());
}
However, when init() is defined I get a java.lang.NullPointerException on the line where I call JsonObject cnf = config(); in start():
java.lang.NullPointerException: null
at io.vertx.core.AbstractVerticle.config(AbstractVerticle.java:85)
at org.buguigny.cluster.UserVerticle.start(UserVerticle.java:30)
at io.vertx.core.impl.DeploymentManager.lambda$doDeploy$8(DeploymentManager.java:494)
at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:320)
at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
My questions are:
Q1 : any clue why NullPointerException is thrown?
Q2 : what is the purpose of init()? Is it internal to Vert.x, or can it be implemented by client code to, for example, define some fields in the verticle objects passed in deployment config?
The init method is for internal usage and documented as such in the Javadoc. Here's the source code:
/**
 * Initialise the verticle.<p>
 * This is called by Vert.x when the verticle instance is deployed. Don't call it yourself.
 * @param vertx the deploying Vert.x instance
 * @param context the context of the verticle
 */
@Override
public void init(Vertx vertx, Context context) {
    this.vertx = vertx;
    this.context = context;
}
If init is documented in any user documentation it's a mistake, please report it.
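In other words, the NPE happens because the override in the question never assigns vertx and context, so config() has nothing to read. If you really do override init(), a hedged sketch of a safe version is to delegate to super first:

@Override
public void init(Vertx vertx, Context context) {
    // Keep the base class behaviour (assigning this.vertx and this.context),
    // so config() in start() keeps working.
    super.init(vertx, context);
    log.info("init.config={}", context.config());
}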

unable to add counter in Flink 1.3.2

I am trying to add a counter in Flink as mentioned here, but the issue is that counter.inc() returns void instead of Integer. The code for my metric is given below:
private static class myMetric extends RichMapFunction<String, Integer> {

    private Counter counter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.getRuntimeContext().
                getMetricGroup().
                counter("countit");
    }

    @Override
    public Integer map(String s) throws Exception {
        return this.counter.inc();
    }
}
It should work better if you assign a value to your counter:
this.counter = getRuntimeContext()
.getMetricGroup()
.counter("countit");
You may find the documentation helpful.
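Putting both points together, a sketch of a working version (the class name and the value returned from map() are illustrative; the question does not say what map() should return):

private static class MyCounterMapper extends RichMapFunction<String, Integer> {

    private transient Counter counter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // Assign the counter so it can actually be used in map().
        this.counter = getRuntimeContext()
                .getMetricGroup()
                .counter("countit");
    }

    @Override
    public Integer map(String s) throws Exception {
        this.counter.inc();      // inc() returns void, so return something else
        return s.length();
    }
}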