Apache Flink with Kafka: InvalidTypesException

I have the following code:
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, MyCustomClassDeserializer.class.getName());
FlinkKafkaConsumer<MyCustomClass> kafkaConsumer = new FlinkKafkaConsumer(
    "test-kafka-topic",
    new SimpleStringSchema(),
    properties);
final StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<MyCustomClass> kafkaInputStream = streamEnv.addSource(kafkaConsumer);
DataStream<String> stringStream = kafkaInputStream
    .map(new MapFunction<MyCustomClass,String>() {
        @Override
        public String map(MyCustomClass message) {
            logger.info("--- Received message : " + message.toString());
            return message.toString();
        }
    });
streamEnv.execute("Published messages");
MyCustomClassDeserializer converts a byte array to a Java object.
When I run this program locally, I get this error:
Caused by: org.apache.flink.api.common.functions.InvalidTypesException: Input mismatch: Basic type expected.
The error points at this line:
.map(new MapFunction<MyCustomClass,String>() {
I am not sure why I get this.

So, you have a deserializer that returns a POJO, yet you are telling Flink to deserialize each record from byte[] to String by using SimpleStringSchema.
See the problem now? :)
In general, I don't think you should use custom Kafka deserializers with FlinkKafkaConsumer: it reads raw bytes from Kafka and hands them to the schema you pass in, so the value.deserializer property has no effect. Aim instead for a custom class that implements Flink's DeserializationSchema. It should be much better in terms of type safety and testability.

Related

How should I define Flink's Schema to read Protocol Buffer data from Pulsar

I am using Pulsar-Flink to read data from Pulsar in Flink, and I am having difficulty when the data is in Protocol Buffer format.
The example on the project's GitHub front page uses SimpleStringSchema, but that does not seem to work with Protocol Buffer data. Does anyone know how to deal with this format? How should I define the schema?
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
props.setProperty("topic", "test-source-topic");
FlinkPulsarSource<String> source = new FlinkPulsarSource<>(serviceUrl, adminUrl, new SimpleStringSchema(), props);
DataStream<String> stream = see.addSource(source);
// chain operations on dataStream of String and sink the output
// end method chaining
see.execute();
FYI, I am writing Scala code, so an explanation for Scala (rather than Java) would be really helpful. Any kind of advice is welcome, though, including Java.
You should implement your own DeserializationSchema. Let's assume that you have a protobuf message Address and have generated the respective Java class. Then the schema should look like the following:
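import java.io.IOException;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
public class ProtoDeserializer implements DeserializationSchema<Address> {
    // Tells Flink which type this schema produces
    @Override
    public TypeInformation<Address> getProducedType() {
        return TypeInformation.of(Address.class);
    }
    // Parses the raw message bytes with the generated protobuf class
    @Override
    public Address deserialize(byte[] message) throws IOException {
        return Address.parseFrom(message);
    }
    // The stream is unbounded, so no element marks its end
    @Override
    public boolean isEndOfStream(Address nextElement) {
        return false;
    }
}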

Flink Data Stream Conversion and Expose to REST end point

I have a Spring Boot application that I am integrating with Apache Flink. I want to read data from Kafka and expose it through a REST endpoint.
Below is my simple code:
@GetMapping("/details/{personName}")
public String getPersonDetails() throws Exception {
    StreamExecutionEnvironment env = LocalStreamEnvironment.getExecutionEnvironment();
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "localhost:9092");
    properties.setProperty("group.id", "group_id");
    FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>("test-topic-1",
        new SimpleStringSchema(), properties);
    consumer.setStartFromEarliest();
    DataStream<String> stream = env.addSource(consumer);
    stream.map(new MapFunction<String, String>() {
        private static final long serialVersionUID = 1L;
        @Override
        public String map(String value) throws Exception {
            logger.info(value);
            return value;
        }
    }).print();
    env.execute();
    return "hello world";
}
My first problem:
Kafka returns String values like the following:
{"id":"1","PersonName":"John","address":"Bristol","weight":"34", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"2","PersonName":"Mann","address":"Bristol","weight":"88", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"3","PersonName":"Chris","address":"Leeds","weight":"12", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"4","PersonName":"John","address":"Bristol","weight":"44", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"5","PersonName":"John","address":"NewPort","weight":"26", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
{"id":"6","PersonName":"Mann","address":"Bristol","weight":"89", "country":"UK","timeStamp":"2020-08-08T10:25:42"}
How can I filter these records and return the result as JSON? For example, if the input from the REST call is "John", I want to group the matching records, sum their weight values, and return JSON containing only the name and the weight.
Second problem:
I can't stop the execution environment. Are there any alternatives? I checked the Flink documentation but found nothing that fits my situation.
Third problem:
I want the environment to be loaded eagerly. I tried using a static block, but that also takes a long time.
NFRs:
I have a massive amount of data in Kafka, so the solution needs to scale and process quickly.
It sounds like you might need to spend more time reviewing the Flink documentation. But in a nutshell...
Add a MapFunction that parses the string into JSON, extracts the name and weight, and outputs that as a Tuple2<String, Integer> or some custom Java class.
Do a keyBy on the name field, followed by a ProcessFunction that sums the weight and saves it in state.
Use QueryableState to expose the state (the summed weights) to code that's running as part of your program's main() method.
In your main method, implement a REST handler that uses the QueryableStateClient to get the weight for a given name.
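The parse, group, and sum logic from the steps above can be sketched in plain Java, independent of Flink. This is only an illustration under stated assumptions: the class and method names (WeightByName, sumWeights, toJson) are made up for this example, the regular expressions are a naive stand-in for a real JSON library, and in an actual job the summing would live in keyed state rather than in a local map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: shows the per-name weight sum only.
class WeightByName {
    // Naive field extraction; a real job would parse with a JSON library instead.
    private static final Pattern NAME = Pattern.compile("\"PersonName\"\\s*:\\s*\"([^\"]*)\"");
    private static final Pattern WEIGHT = Pattern.compile("\"weight\"\\s*:\\s*\"(\\d+)\"");

    // Sums the weight per PersonName; lines lacking either field are skipped.
    static Map<String, Integer> sumWeights(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher name = NAME.matcher(line);
            Matcher weight = WEIGHT.matcher(line);
            if (name.find() && weight.find()) {
                totals.merge(name.group(1), Integer.parseInt(weight.group(1)), Integer::sum);
            }
        }
        return totals;
    }

    // Renders one result with only the name and the summed weight (no escaping of the name).
    static String toJson(String name, int weight) {
        return "{\"PersonName\":\"" + name + "\",\"weight\":" + weight + "}";
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "{\"id\":\"1\",\"PersonName\":\"John\",\"weight\":\"34\"}",
            "{\"id\":\"2\",\"PersonName\":\"Mann\",\"weight\":\"88\"}",
            "{\"id\":\"4\",\"PersonName\":\"John\",\"weight\":\"44\"}",
            "{\"id\":\"5\",\"PersonName\":\"John\",\"weight\":\"26\"}");
        Map<String, Integer> totals = sumWeights(lines);
        // prints {"PersonName":"John","weight":104}
        System.out.println(toJson("John", totals.get("John")));
    }
}
```

In the Flink version, the map step would emit the (name, weight) pair, and the merge call corresponds to updating the per-key state that the REST handler later queries.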

Kafka Streams (Scala): Invalid topology: StateStore is not added yet

I have a topology where I have a stream A.
From that stream A, I create a WindowedStore S.
A --> [S]
Then I want to transform the objects in A depending on the data in S, and have these transformed objects feed the WindowStore logic (via transformValues).
To do that, I create a Transformer that produces a stream A', and build the windowing on top of it (i.e. S is now built from A', not from A).
A -> A' --> [S]
^__read__|
But I cannot do that, because when I create the Topology, an exception is thrown:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore storeName is not added yet.
Is there a way to work around this? Is this a limitation?
Code example:
// A
val sessionElementsStream: KStream[K, SessionElement] = ...
// A'
val sessionElementsTransformed : KStream[K, SessionElementTransformed] = {
// Here we use the sessionStoreName - but it is not added yet to the Topology
sessionElementsStream.
transformValues(sessionElementTransformerSupplier, sessionStoreName)
}
val sessionElementsWindowedStream: SessionWindowedKStream[K, SessionElementTransformed] = {
sessionElementsTransformed.
groupByKey(sessionElementTransformedGroupedBy).
windowedBy(sessionWindows)
}
val sessionStore : KTable[Windowed[K], List[WindowedSession]] =
sessionElementsWindowedStream.aggregate(
initializer = List.empty[WindowedSession])(
aggregator = anAggregator, merger = aMerger)(materialized = getMaterializedMUPKSessionStore(sessionStoreName))
The original problem is that, depending on the values of previous sessions, I would like to change the sessions that come after them. If I do this in a transformer after the sessioning, the transformed sessions can be changed and sent downstream, but their new state is not reflected in S, so later requests to the store return the old values.
Kafka Streams 2.1, Scala 2.12.4.
Co-partitioned topics.
UPDATE
There is a way to do this within the DSL, using an extra topic:
Send A' to this topic.
Create builder.stream from this topic and build the store from it.
Define the store before you define the transformation (so that the transformation step can use the store, because it is already defined).
However, it seems cumbersome to have to use an extra topic here. Is there no other, simpler way to solve it?
But I cannot do that, because when I create the Topology, an exception is thrown:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore storeName is not added yet.
It looks like you simply forgot to literally "add" the state store to your processing topology, and then attach ("make available") the state store to your Transformer.
Here's a code snippet that demonstrates this (sorry, in Java).
Adding the state store to your topology:
final StreamsBuilder builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<String, Long>> myStateStore =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("my-state-store-name"),
        Serdes.String(),
        Serdes.Long())
    .withCachingEnabled();
builder.addStateStore(myStateStore);
Attaching the state store to your Transformer:
final KStream<String, Double> stream = builder.stream("your-input-topic", Consumed.with(Serdes.String(), Serdes.Double()));
final KStream<String, Long> transformedStream =
    stream.transform(new MyTransformer(myStateStore.name()), myStateStore.name());
And of course your Transformer must integrate the state store, with code like the following (this Transformer reads <String, Double> and writes <String, Long>).
class MyTransformer implements TransformerSupplier<String, Double, KeyValue<String, Long>> {
    private final String myStateStoreName;
    MyTransformer(final String myStateStoreName) {
        this.myStateStoreName = myStateStoreName;
    }
    @Override
    public Transformer<String, Double, KeyValue<String, Long>> get() {
        return new Transformer<String, Double, KeyValue<String, Long>>() {
            private KeyValueStore<String, Long> myStateStore;
            private ProcessorContext context;
            @Override
            public void init(final ProcessorContext context) {
                this.context = context;
                // Look up the store that was attached by name in stream.transform(...)
                myStateStore = (KeyValueStore<String, Long>) context.getStateStore(myStateStoreName);
            }
            // ...
        };
    }
}

Flink outputs to Kafka, proper way to use KeyedSerializationSchema

I have a simple Flink word count that reads from a Kafka topic and writes its result to another Kafka topic:
DataStream<String> input = env.addSource(new FlinkKafkaConsumer<>(inputTopic, new SimpleStringSchema(), props));
DataStream<Tuple2<String, Long>> counts = ......;
counts.addSink(new FlinkKafkaProducer<>(outputTopic, new WordCountSerializer(), props));
//counts.print();
env.execute("foobar");
The problem is that I see nothing in the output topic via the kafka-console-consumer.sh command line.
To debug the issue, I printed the result, and that works fine: I can see the correct word count result in the log file.
So my guess is that something is wrong in WordCountSerializer, which looks like this:
class WordCountSerializer implements KeyedSerializationSchema<Tuple2<String, Long>>, java.io.Serializable {
    public byte[] serializeKey(Tuple2<String, Long> element) {
        return new StringSerializer().serialize(null, element.getField(0));
    }
    public byte[] serializeValue(Tuple2<String, Long> element) {
        return new LongSerializer().serialize(null, element.getField(1));
    }
    public String getTargetTopic(Tuple2<String, Long> element) {
        return null;
    }
}
After changing the serializeValue to
public byte[] serializeValue(Tuple2<String, Long> element) {
    return new StringSerializer().serialize(null, element.getField(1).toString());
}
I can see the count being written to Kafka (the word part of the tuple is still missing), like
1
3
...
My questions are:
I've seen several examples on the internet using the WordCountSerializer shown above, but it doesn't work for me. Am I doing anything wrong here?
After changing the serializeValue method as above, it partially works, but what I actually want is the output below. What is the proper way to achieve this?
foo,1
bar,3
...
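A likely explanation, offered as a sketch rather than a confirmed diagnosis: Kafka's LongSerializer writes the count as 8 raw big-endian bytes, which a console consumer printing values as text does not show as digits, and the console consumer only prints the record key when started with --property print.key=true. The plain-Java example below contrasts the two encodings and shows one way to get "foo,1" as the value. WordCountBytes and its method names are hypothetical; in the real job the second method's body would go into serializeValue.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Flink or Kafka.
class WordCountBytes {
    // Same wire layout as an 8-byte big-endian long: not readable as text.
    static byte[] longAsBytes(long count) {
        return ByteBuffer.allocate(Long.BYTES).putLong(count).array();
    }

    // Encodes word and count together as readable UTF-8 text, e.g. "foo,1".
    static byte[] wordAndCountAsText(String word, long count) {
        return (word + "," + count).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints [0, 0, 0, 0, 0, 0, 0, 1]
        System.out.println(Arrays.toString(longAsBytes(1L)));
        // prints foo,1
        System.out.println(new String(wordAndCountAsText("foo", 1L), StandardCharsets.UTF_8));
    }
}
```

Putting both fields into the value keeps the output readable with the default console consumer settings; keeping the word as the key as well still allows partitioning by word.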

Kafka Streams dynamic routing (ProducerInterceptor might be a solution?)

I'm working with Apache Kafka and I've been experimenting with the Kafka Streams functionality.
What I'm trying to achieve is very simple, at least in words, and it can be achieved easily with the regular plain Consumer/Producer approach:
Read from a dynamic list of topics
Do some processing on the message
Push the message to another topic whose name is computed based on the message content
Initially I thought I could create a custom Sink or inject some kind of endpoint resolver to programmatically define the topic name for each message, but ultimately I couldn't find any way to do that.
So I dug into the code and found the ProducerInterceptor class that is (quoting from the JavaDoc):
A plugin interface that allows you to intercept (and possibly mutate)
the records received by the producer before they are published to the
Kafka cluster.
And its onSend method:
This is called from KafkaProducer.send(ProducerRecord) and
KafkaProducer.send(ProducerRecord, Callback) methods, before key and
value get serialized and partition is assigned (if partition is not
specified in ProducerRecord).
It seemed like the perfect solution for me, as I can effectively return a new ProducerRecord with the topic name I want.
However, there is apparently a bug (I've opened an issue on their JIRA: KAFKA-4691): the method is called when the key and value have already been serialized.
Bummer, as I don't think doing an additional deserialization at this point is acceptable.
My question to you more experienced and knowledgeable users: what would be an efficient and elegant way of achieving this? Any input, ideas, or suggestions are welcome.
Thanks in advance for your help/comments/suggestions/ideas.
Below are some code snippets of what I've tried:
public static void main(String[] args) throws Exception {
    StreamsConfig streamingConfig = new StreamsConfig(getProperties());
    StringDeserializer stringDeserializer = new StringDeserializer();
    StringSerializer stringSerializer = new StringSerializer();
    MyObjectSerializer myObjectSerializer = new MyObjectSerializer();
    TopologyBuilder topologyBuilder = new TopologyBuilder();
    topologyBuilder.addSource("SOURCE", stringDeserializer, myObjectSerializer, Pattern.compile("input-.*"))
        .addProcessor("PROCESS", MyCustomProcessor::new, "SOURCE");
    System.out.println("Starting PurchaseProcessor Example");
    KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
    streaming.start();
    System.out.println("Now started PurchaseProcessor Example");
}
private static Properties getProperties() {
    Properties props = new Properties();
    .....
    .....
    props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), "com.test.kafka.streams.OutputTopicRouterInterceptor");
    return props;
}
OutputTopicRouterInterceptor onSend implementation:
@Override
public ProducerRecord<String, MyObject> onSend(ProducerRecord<String, MyObject> record) {
    MyObject obj = record.value();
    String topic = computeTopicName(obj);
    ProducerRecord<String, MyObject> newRecord = new ProducerRecord<String, MyObject>(topic, record.partition(), record.timestamp(), record.key(), obj);
    return newRecord;
}
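Whatever hook ends up calling it, the per-record topic computation itself can be kept as a small pure function. The sketch below is plain Java and rests on assumptions: TopicRouter, the category argument, and the "output-" prefix are all hypothetical, since the question does not show computeTopicName or the fields of MyObject. For what it is worth, Kafka Streams 2.0 and newer accept a TopicNameExtractor in KStream#to, which can call such a function with the still-deserialized key and value, so no interceptor is needed there.

```java
import java.util.Locale;

// Hypothetical routing helper; the naming scheme is an assumption for illustration.
class TopicRouter {
    // Kafka topic names may only contain ASCII letters, digits, '.', '_' and '-';
    // every other character is replaced with '_'.
    static String sanitize(String raw) {
        return raw.replaceAll("[^a-zA-Z0-9._-]", "_");
    }

    // Derives the destination topic from a value taken from the message content.
    static String computeTopicName(String category) {
        return "output-" + sanitize(category.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // prints output-purchases_eu
        System.out.println(computeTopicName("Purchases/EU"));
    }
}
```

Keeping the function free of Kafka types makes it easy to unit test and to reuse from an interceptor, a processor, or a topic name extractor.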