Flink outputs to Kafka, proper way to use KeyedSerializationSchema - apache-kafka

I have a simple Flink word count that reads from a Kafka topic and writes its result to another Kafka topic:
DataStream<String> input = env.addSource(new FlinkKafkaConsumer<>(inputTopic, new SimpleStringSchema(), props));
DataStream<Tuple2<String, Long>> counts = ......;
counts.addSink(new FlinkKafkaProducer<>(outputTopic, new WordCountSerializer(), props));
//counts.print();
env.execute("foobar");
The problem is that I see nothing in the output topic via the kafka-console-consumer.sh command line.
To narrow down the issue, I tried printing the result instead, and that works fine: I can see the correct word count results in the log file.
So my guess is that something is wrong in WordCountSerializer, which looks like this:
class WordCountSerializer implements KeyedSerializationSchema<Tuple2<String, Long>>, java.io.Serializable {

    public byte[] serializeKey(Tuple2<String, Long> element) {
        return new StringSerializer().serialize(null, element.getField(0));
    }

    public byte[] serializeValue(Tuple2<String, Long> element) {
        return new LongSerializer().serialize(null, element.getField(1));
    }

    public String getTargetTopic(Tuple2<String, Long> element) {
        return null;
    }
}
After changing the serializeValue to
public byte[] serializeValue(Tuple2<String, Long> element) {
    return new StringSerializer().serialize(null, element.getField(1).toString());
}
I can see the counts being output to Kafka (the word part of the tuple is still missing), like
1
3
...
My questions are:
I've seen several examples on the internet using the WordCountSerializer mentioned above, but it doesn't work for me. Am I doing anything wrong here?
After changing the serializeValue method as above, it partially works, but what I actually want is output like the example below. What is the proper way to achieve this?
foo,1
bar,3
...
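For illustration only, a minimal sketch of one way to get the comma-separated word,count output shown above (an assumption-laden example, not an authoritative answer: it simply serializes the whole tuple as a UTF-8 string, which kafka-console-consumer.sh prints as-is since it only shows record values by default):
public byte[] serializeValue(Tuple2<String, Long> element) {
    // Emit "word,count" (e.g. "foo,1") as a UTF-8 string so both fields show up
    // in the console consumer's default value-only output.
    String line = element.f0 + "," + element.f1;
    return line.getBytes(java.nio.charset.StandardCharsets.UTF_8);
}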

Related

How should I define Flink's Schema to read Protocol Buffer data from Pulsar

I am using Pulsar-Flink to read data from Pulsar in Flink. I am having difficulty when the data format is Protocol Buffers.
On the project's GitHub page, Pulsar-Flink uses SimpleStringSchema. However, it apparently does not officially support Protocol Buffers. Does anyone know how to deal with this data format? How should I define the schema?
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
props.setProperty("topic", "test-source-topic")
FlinkPulsarSource<String> source = new FlinkPulsarSource<>(serviceUrl, adminUrl, new SimpleStringSchema(), props);
DataStream<String> stream = see.addSource(source);
// chain operations on dataStream of String and sink the output
// end method chaining
see.execute();
FYI, I am writing Scala code, so an explanation in Scala (rather than Java) would be especially helpful. That said, any kind of advice is welcome, including Java.
You should implement your own DeserializationSchema. Let's assume that you have a protobuf message Address and have generated the respective Java class. Then the schema should look like the following:
public class ProtoDeserializer implements DeserializationSchema<Address> {

    @Override
    public TypeInformation<Address> getProducedType() {
        return TypeInformation.of(Address.class);
    }

    @Override
    public Address deserialize(byte[] message) throws IOException {
        return Address.parseFrom(message);
    }

    @Override
    public boolean isEndOfStream(Address nextElement) {
        return false;
    }
}
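To wire this schema into the source from the question, a minimal sketch might look like the following (it reuses the question's serviceUrl, adminUrl, and props variables and only swaps the schema argument; the generated Address class is assumed to be on the classpath):
FlinkPulsarSource<Address> source = new FlinkPulsarSource<>(serviceUrl, adminUrl, new ProtoDeserializer(), props);
DataStream<Address> stream = see.addSource(source);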

Flink Data Stream Conversion and Expose to REST end point

I have a Spring Boot application integrated with Apache Flink. I want to read data from a Kafka system and expose it at a REST endpoint.
Below is my simple code,
@GetMapping("/details/{personName}")
public String getPersonDetails() throws Exception {
    StreamExecutionEnvironment env = LocalStreamEnvironment.getExecutionEnvironment();
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "localhost:9092");
    properties.setProperty("group.id", "group_id");
    FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>("test-topic-1",
            new SimpleStringSchema(), properties);
    consumer.setStartFromEarliest();
    DataStream<String> stream = env.addSource(consumer);
    stream.map(new MapFunction<String, String>() {
        private static final long serialVersionUID = 1L;

        @Override
        public String map(String value) throws Exception {
            logger.info(value);
            return value;
        }
    }).print();
    env.execute();
    return "hello world";
}
My problems are:
My Kafka topic returns String values like the ones below,
{"id":"1","PersonName":"John","address":"Bristol","weight":"34", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"2","PersonName":"Mann","address":"Bristol","weight":"88", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"3","PersonName":"Chris","address":"Leeds","weight":"12", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"4","PersonName":"John","address":"Bristol","weight":"44", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"5","PersonName":"John","address":"NewPort","weight":"26", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"6","PersonName":"Mann","address":"Bristol","weight":"89", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
How can I apply filters and return the result as JSON? For example, if the input from the REST call is "John", I want to group those records, sum their weight values, and return the result as JSON (only the name and the weight).
Second problem,
I can't stop the execution environment. Are there any alternatives? I checked the Flink documentation but didn't find anything for my situation.
Third problem,
I wanted to keep the environment eagerly loaded; I tried using a static block, but that also takes more time.
NFRs:
I have massive amounts of data in Kafka, so I need scalable and fast processing.
It sounds like you might need to spend more time reviewing the Flink documentation. But in a nutshell...
Add a MapFunction that parses the string into JSON, extracts the name and weight, and outputs that as a Tuple2<String, Integer> or some custom Java class (a sketch follows after these steps).
Do a keyBy(name field), followed by a ProcessFunction that sums the weight and saves it in state.
Use QueryableState to expose the state (the summed weights) to code that's running as part of your program's main() method.
In your main method, implement a REST handler that uses the QueryableStateClient to get the weight for a given name.
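A minimal sketch of the first step, assuming Jackson is on the classpath and the field names PersonName and weight from the sample records; the class name NameWeightMapper is illustrative only:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

public class NameWeightMapper implements MapFunction<String, Tuple2<String, Integer>> {
    private static final long serialVersionUID = 1L;

    // ObjectMapper is not Serializable, so create it lazily on the worker
    private transient ObjectMapper mapper;

    @Override
    public Tuple2<String, Integer> map(String value) throws Exception {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        JsonNode node = mapper.readTree(value);
        String name = node.get("PersonName").asText();
        int weight = Integer.parseInt(node.get("weight").asText());
        return Tuple2.of(name, weight);
    }
}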

Stream output of one flux to another

I am getting data from messageHandler (Kafka) and processing it via serviceHandler to produce a result:
serviceHandler.process(message).getMessage().toString()
Now I need to output this stream of results so that whenever a new record comes through messageHandler, it gets processed and the output is pushed to the front end (Angular).
@Autowired
MessageHandler messageHandler;

@Autowired
ServiceHandler serviceHandler;

public void runHandler() {
    Flux<Message> messages = messageHandler.flux();
    messages.subscribeOn(Schedulers.parallel())
            .doOnNext(message -> serviceHandler.process(message).getMessage().toString())
            .subscribe();
}

public Flux pushresult(){ ???? }
Does anyone know of a way to do what I need?
Take a look at .map() and .flatMap(). In your case it looks like map(), but if the next service you are calling also returns a Mono/Flux, you want .flatMap(), or you'll end up with something like:
Flux<Flux<String>>
instead of
Flux<String>
.doOnNext is meant for side effects like logging.
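A minimal sketch of how pushresult() could be written with map(), reusing the question's messageHandler and serviceHandler fields (assumed to be in scope) and assuming process(...) is synchronous; the commented variant shows flatMap() for the case where it returns a Mono:
public Flux<String> pushResult() {
    // process(...) is assumed synchronous here, so map() is sufficient
    return messageHandler.flux()
            .subscribeOn(Schedulers.parallel())
            .map(message -> serviceHandler.process(message).getMessage().toString());

    // If process(...) returned a Mono, flatMap() would avoid ending up with Flux<Flux<String>>:
    // return messageHandler.flux()
    //         .flatMap(message -> serviceHandler.process(message)
    //                 .map(result -> result.getMessage().toString()));
}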

Kafka Streams dynamic routing (ProducerInterceptor might be a solution?)

I'm working with Apache Kafka and I've been experimenting with the Kafka Streams functionality.
What I'm trying to achieve is very simple, at least in words, and it could be achieved easily with the regular plain Consumer/Producer approach:
Read from a dynamic list of topics
Do some processing on the message
Push the message to another topic whose name is computed based on the message content
Initially I thought I could create a custom Sink or inject some kind of endpoint resolver in order to programmatically define the topic name for each single message, although ultimately I couldn't find any way to do that.
So I dug into the code and found the ProducerInterceptor class, which is (quoting from the JavaDoc):
A plugin interface that allows you to intercept (and possibly mutate)
the records received by the producer before they are published to the
Kafka cluster.
And its onSend method:
This is called from KafkaProducer.send(ProducerRecord) and
KafkaProducer.send(ProducerRecord, Callback) methods, before key and
value get serialized and partition is assigned (if partition is not
specified in ProducerRecord).
It seemed like the perfect solution for me, since I could effectively return a new ProducerRecord with the topic name I want.
However, apparently there's a bug (I've opened an issue on their JIRA: KAFKA-4691): that method is called after the key and value have already been serialized.
Bummer, as I don't think doing an additional deserialization at this point is acceptable.
My question to you more experienced and knowledgeable users: what would be an efficient and elegant way of achieving this? Any input, ideas, or suggestions are welcome.
Thanks in advance for your help/comments/suggestions/ideas.
Below are some code snippets of what I've tried:
public static void main(String[] args) throws Exception {
    StreamsConfig streamingConfig = new StreamsConfig(getProperties());
    StringDeserializer stringDeserializer = new StringDeserializer();
    StringSerializer stringSerializer = new StringSerializer();
    MyObjectSerializer myObjectSerializer = new MyObjectSerializer();

    TopologyBuilder topologyBuilder = new TopologyBuilder();
    topologyBuilder.addSource("SOURCE", stringDeserializer, myObjectSerializer, Pattern.compile("input-.*"))
            .addProcessor("PROCESS", MyCustomProcessor::new, "SOURCE");

    System.out.println("Starting PurchaseProcessor Example");
    KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
    streaming.start();
    System.out.println("Now started PurchaseProcessor Example");
}
private static Properties getProperties() {
    Properties props = new Properties();
    .....
    .....
    props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), "com.test.kafka.streams.OutputTopicRouterInterceptor");
    return props;
}
OutputTopicRouterInterceptor onSend implementation:
@Override
public ProducerRecord<String, MyObject> onSend(ProducerRecord<String, MyObject> record) {
    MyObject obj = record.value();
    String topic = computeTopicName(obj);
    ProducerRecord<String, MyObject> newRecord = new ProducerRecord<String, MyObject>(topic, record.partition(), record.timestamp(), record.key(), obj);
    return newRecord;
}

Kafka Streams: Appropriate way to find min value in a stream

I'm using Kafka Streams version 0.10.0.1, and trying to find the min value in a stream.
The incoming messages come from a topic called kafka-streams-topic; each has a key, and the value is a JSON payload that looks like this:
{"value":2334}
This is a simple payload but I want to find the min value of this JSON.
The outgoing message is just a number:
2334
and the key is also part of the message.
So if the incoming topic got:
key=1, value={"value":1000}
outgoing topic, named min-topic, would get
key=1,value=1000
Another message comes through:
key=1, value={"value":100}
Because this is the same key, I would like to now produce a message with key=1, value=100, since this is smaller than the first message.
Now let's say we got:
key=2 value=99
A new message would be produced where:
key=2 and value=99, but the key=1 entry and its associated value shouldn't change.
Additionally, if we got the message:
key=1 value=2000
No message should be produced, since this message is larger than the current value of 100.
This works but I'm wondering if this adheres to the intent of the API:
public class MinProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, Long> kvStore;
    private Gson gson = new Gson();

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        this.context.schedule(1000);
        kvStore = (KeyValueStore) context.getStateStore("Counts");
    }

    @Override
    public void process(String key, String value) {
        Long incomingPotentialMin = ((Double) gson.fromJson(value, Map.class).get("value")).longValue();
        Long minForKey = kvStore.get(key);
        System.out.printf("key: %s incomingPotentialMin: %s minForKey: %s \n", key, incomingPotentialMin, minForKey);
        if (minForKey == null || incomingPotentialMin < minForKey) {
            kvStore.put(key, incomingPotentialMin);
            context.forward(key, incomingPotentialMin.toString());
            context.commit();
        }
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {
        kvStore.close();
    }
}
Here is the code that actually runs the processor:
public class MinLauncher {

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        StateStoreSupplier countStore = Stores.create("Counts")
                .withKeys(Serdes.String())
                .withValues(Serdes.Long())
                .persistent()
                .build();

        builder.addSource("source", "kafka-streams-topic")
                .addProcessor("process", () -> new MinProcessor(), "source")
                .addStateStore(countStore, "process")
                .addSink("sink", "min-topic", "process");

        KafkaStreams streams = new KafkaStreams(builder, KafkaStreamsProperties.properties("kafka-streams-min-poc"));
        streams.cleanUp();
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
I'm not sure what your exact input data and result are (maybe you can update your question with this information: what are your input records? what is your output? what "EXTRA messages [] are produced [] that [you] don't expect"?).
However, here are a few general clarifications (I can refine this answer later on if required).
You do your computation based on keys, so you should expect a result for each key (not sure if you have multiple different keys in your input).
You emit data in punctuate(), which is called periodically (based on the internally tracked stream time, i.e., based on the timestamp values extracted from your input records via the TimestampExtractor). Hence, the current min value of each key will be written to the topic each time punctuate() gets called, and therefore you can have multiple updates per key that are all appended to your result topic. (Topics are append-only; if you write two messages with the same key, you see both -- there is no overwrite.)
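For context only: if the intent were to emit from punctuate() as described here, rather than on every update in process() as the question's MinProcessor does, a rough sketch against the 0.10-era Processor API might look like this (it assumes the same "Counts" store and closes the store iterator after use):
@Override
public void punctuate(long timestamp) {
    // Forward the current min for every key in the store; the result topic is
    // append-only, so each punctuation appends fresh updates per key.
    KeyValueIterator<String, Long> iter = kvStore.all();
    while (iter.hasNext()) {
        KeyValue<String, Long> entry = iter.next();
        context.forward(entry.key, entry.value.toString());
    }
    iter.close();
    context.commit();
}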