Flink getting past bad messages in Kafka: "poison message" - apache-kafka

First time I'm trying to get this to work so bear with me. I'm trying to
learn checkpointing with Kafka and handling "bad" messages, restarting
without losing state.
Use Case:
Use checkpointing.
Read a stream of integers from Kafka, keep a running sum.
If a "bad" Kafka message read, restart app, skip the "bad" message, keep
state. My stream would look something look like this:
set1,5
set1,7
set1,foobar
set1,6
I want my app to keep a running sum of the integers it has seen, and restart
if it crashes without losing state, so app behavior/running sum would be:
5,
12,
app crashes and restarts, reads checkpoint
18
etc.
However, I'm finding when my app restarts, it keeps reading the bad "foobar"
message and doesnt get past it. Source code below. The mapper bombs when I
try to parse "foobar" as an Integer.
How can I modify app to get past "poison" message?
env.enableCheckpointing(1000L);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500L);
env.getCheckpointConfig().setCheckpointTimeout(10000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.setStateBackend(new
FsStateBackend("hdfs://mymachine:9000/flink/checkpoints"));
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", BROKERS);
properties.setProperty("zookeeper.connect", ZOOKEEPER_HOST);
properties.setProperty("group.id", "consumerGroup1");
FlinkKafkaConsumer08 kafkaConsumer = new FlinkKafkaConsumer08<>(topicName,
new SimpleStringSchema(), properties);
DataStream<String> messageStream = env.addSource(kafkaConsumer);
DataStream<Tuple2<String,Integer>> sums = messageStream
.map(new NumberMapper())
.keyBy(0)
.sum(1);
sums.print();
private static class NumberMapper implements
MapFunction<String,Tuple2<String,Integer>> {
public Tuple2<String,Integer> map(String input) throws Exception {
return parseData(input);
}
private Tuple2<String,Integer> parseData(String record) {
String[] tokens = record.toLowerCase().split(",");
// Get Key
String key = tokens[0];
// Get Integer Value
String integerValue = tokens[1];
System.out.println("Trying to Parse=" + integerValue);
Integer value = Integer.parseInt(integerValue);
// Build Tuple
return new Tuple2<String,Integer>(key, value);
}
}

You could change the NumberMapper into a FlatMap and filter out invalid elements:
private static class NumberMapper implements FlatMapFunction<String, Tuple2<String, Integer>> {
public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) throws Exception {
Optional<Tuple2<String, Integer>> optionalResult = parseData(input);
optionalResult.ifPresent(collector::collect);
}
private Optional<Tuple2<String, Integer>> parseData(String record) {
String[] tokens = record.toLowerCase().split(",");
// Get Key
String key = tokens[0];
// Get Integer Value
String integerValue = tokens[1];
try {
Integer value = Integer.parseInt(integerValue);
// Build Tuple
return Optional.of(Tuple2.of(key, value));
} catch (NumberFormatException e) {
return Optional.empty();
}
}
}

Related

How to make Serdes work with multi-step kafka streams

I am new to Kafka and I'm building a starter project using the Twitter API as a data source. I have create a Producer which can query the Twitter API and sends the data to my kafka topic with string serializer for both key and value. My Kafka Stream Application reads this data and does a word count, but also grouping by the date of the tweet. This part is done through a KTable called wordCounts to make use of its upsert functionality. The structure of this KTable is:
Key: {word: exampleWord, date: exampleDate}, Value: numberOfOccurences
I then attempt to restructure the data in the KTable stream to a flat structure so I can later send it to a database. You can see this in the wordCountsStructured KStream object. This restructures the data to look like the structure below. The value is initially a JsonObject but i convert it to a string to match the Serdes which i set.
Key: null, Value: {word: exampleWord, date: exampleDate, Counts: numberOfOccurences}
However, when I try to send this to my second kafka topic, I get the error below.
A serializer (key:
org.apache.kafka.common.serialization.StringSerializer / value:
org.apache.kafka.common.serialization.StringSerializer) is not
compatible to the actual key or value type (key type:
com.google.gson.JsonObject / value type: com.google.gson.JsonObject).
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters.
I'm confused by this since the KStream I am sending to the topic is of type <String, String>. Does anyone know how I might fix this?
public class TwitterWordCounter {
private final JsonParser jsonParser = new JsonParser();
public Topology createTopology(){
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("test-topic2");
KTable<JsonObject, Long> wordCounts = textLines
//parse each tweet as a tweet object
.mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
//map each tweet object to a list of json objects, each of which containing a word from the tweet and the date of the tweet
.flatMapValues(TwitterWordCounter::tweetWordDateMapper)
//update the key so it matches the word-date combination so we can do a groupBy and count instances
.selectKey((key, wordDate) -> wordDate)
.groupByKey()
.count(Materialized.as("Counts"));
/*
In order to structure the data so that it can be ingested into SQL, the value of each item in the stream must be straightforward: property, value
so we have to:
1. take the columns which include the dimensional data and put this into the value of the stream.
2. lable the count with 'count' as the column name
*/
KStream<String, String> wordCountsStructured = wordCounts.toStream()
.map((key, value) -> new KeyValue<>(null, MapValuesToIncludeColumnData(key, value).toString()));
KStream<String, String> wordCountsPeek = wordCountsStructured.peek(
(key, value) -> System.out.println("key: " + key + "value:" + value)
);
wordCountsStructured.to("test-output2", Produced.with(Serdes.String(), Serdes.String()));
return builder.build();
}
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application1111");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myIPAddress");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
TwitterWordCounter wordCountApp = new TwitterWordCounter();
KafkaStreams streams = new KafkaStreams(wordCountApp.createTopology(), config);
streams.start();
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
//this method is used for taking a tweet and transforming it to a representation of the words in it plus the date
public static List<JsonObject> tweetWordDateMapper(Tweet tweet) {
try{
List<String> words = Arrays.asList(tweet.tweetText.split("\\W+"));
List<JsonObject> tweetsJson = new ArrayList<JsonObject>();
for(String word: words) {
JsonObject tweetJson = new JsonObject();
tweetJson.add("date", new JsonPrimitive(tweet.formattedDate().toString()));
tweetJson.add("word", new JsonPrimitive(word));
tweetsJson.add(tweetJson);
}
return tweetsJson;
}
catch (Exception e) {
System.out.println(e);
System.out.println(tweet.serialize().toString());
return new ArrayList<JsonObject>();
}
}
public JsonObject MapValuesToIncludeColumnData(JsonObject key, Long countOfWord) {
key.addProperty("count", countOfWord); //new JsonPrimitive(count));
return key;
}
Because you are performing a key changing operation before the groupBy(), it will create a repartition topic and for that topic, it will rely on the default key, value serdes, which you have set to String Serde.
You can modify the groupBy() call to groupBy(Grouped.with(StringSerde,JsonSerde) and this should help.

How to generate output files for each input in Apache Flink

I'm using Flink to process my streaming data.
The streaming is coming from some other middleware, such as Kafka, Pravega, etc.
Saying that Pravega is sending some word stream, hello world my name is....
What I need is three steps of process:
Map each word to my custom class object MyJson.
Map the object MyJson to String.
Write Strings to files: one String is written to one file.
For example, for the stream hello world my name is, I should get five files.
Here is my code:
// init Pravega connector
PravegaDeserializationSchema<String> adapter = new PravegaDeserializationSchema<>(String.class, new JavaSerializer<>());
FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
.withPravegaConfig(pravegaConfig)
.forStream(stream)
.withDeserializationSchema(adapter)
.build();
// map stream to MyJson
DataStream<MyJson> jsonStream = env.addSource(source).name("Pravega Stream")
.map(new MapFunction<String, MyJson>() {
#Override
public MyJson map(String s) throws Exception {
MyJson myJson = JSON.parseObject(s, MyJson.class);
return myJson;
}
});
// map MyJson to String
DataStream<String> valueInJson = jsonStream
.map(new MapFunction<MyJson, String>() {
#Override
public String map(MyJson myJson) throws Exception {
return myJson.toString();
}
});
// output
valueInJson.print();
This code will output all of results to Flink log files.
My question is how to write one word to one output file?
I think the easiest way to do this would be with a custom sink.
stream.addSink(new WordFileSink)
public static class WordFileSink implements SinkFunction<String> {
#Override
public void invoke(String value, Context context) {
// generate a unique name for the new file and open it
// write the word to the file
// close the file
}
}
Note that this implementation won't necessarily provide exactly once behavior. You might want to take care that the file naming scheme is both unique and deterministic (rather than depending on processing time), and be prepared for the case that the file may already exist.

How to obtain the previous message from a topic when processing it with KafkaStreams

I've been implementing Kafka producers/consumers and streams using the data that the covid19api holds.
I'm trying to extract the daily cases for each day from, for instance, the endpoint https://api.covid19api.com/all. However, this service - and the rest of services from this API - has all the data since the begining of the disease (confirmed, death and recovered cases) but accumulated and not the daily cases, which is, in the end, what I'm trying to achieve.
Using transformValues and StoreBuilder (as It was recommended here) didin't work for me either since the scenaries are diferent. I've implemented something different using the transformValue feature but everytime the previous value retrieved is the firts of the topic and not the actual previous:
#Override
public String transform(Long key, String value) {
String prevValue = state.get(key);
log.info("{} => {}", key, value) ;
if (prevValue != null) {
Covid19StatDto prevDto = new Gson().fromJson(prevValue, Covid19StatDto.class);
Covid19StatDto dto = new Gson().fromJson(value, Covid19StatDto.class);
log.info("Current value {} previous {} ", dto.toString(), prevDto.toString());
dto.setConfirmed(dto.getConfirmed() - prevDto.getConfirmed());
String newDto = new Gson().toJson(dto);
log.info("New value {}", newDto);
return newDto;
} else {
state.put(key, value);
}
return value;
}
¿How do I obtain the previous message from a topic when I'm processing it with a stream? Any help or suggestion will be highly appreciated.
Regards.
Is the problem not simply that you're only ever storing the first value you get for each key in the state store? If on each subsequent message you always want the previous message, then you need to always store the current message in the state store as the last step, for exmaple:
#Override
public String transform(Long key, String value) {
String prevValue = state.get(key);
log.info("{} => {}", key, value) ;
if (prevValue != null) {
Covid19StatDto prevDto = new Gson().fromJson(prevValue, Covid19StatDto.class);
Covid19StatDto dto = new Gson().fromJson(value, Covid19StatDto.class);
log.info("Current value {} previous {} ", dto.toString(), prevDto.toString());
dto.setConfirmed(dto.getConfirmed() - prevDto.getConfirmed());
String newDto = new Gson().toJson(dto);
log.info("New value {}", newDto);
return newDto;
}
// Always update the state store:
state.put(key, value);
return value;
}

How to process a KStream in a batch of max size or fallback to a time window?

I would like to create a Kafka stream-based application that processes a topic and takes messages in batches of size X (i.e. 50) but if the stream has low flow, to give me whatever the stream has within Y seconds (i.e. 5).
So, instead of processing messages one by one, I process a List[Record] where the size of the list is 50 (or maybe less).
This is to make some I/O bound processing more efficient.
I know that this can be implemented with the classic Kafka API but was looking for a stream-based implementation that can also handle offset committing natively, taking errors/failures into account.
I couldn't find anything related int he docs or by searching around and was wondering if anyone has a solution to this problem.
#Matthias J. Sax answer is nice, I just want to add an example for this, I think it might be useful for someone.
let's say we want to combine incoming values into the following type:
public class MultipleValues { private List<String> values; }
To collect messages into batches with max size, we need to create transformer:
public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
private ProcessorContext processorContext;
private String stateStoreName;
private KeyValueStore<String, MultipleValues> keyValueStore;
private Cancellable scheduledPunctuator;
public MultipleValuesTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
#Override
public void init(ProcessorContext processorContext) {
this.processorContext = processorContext;
this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
}
#Override
public KeyValue<String, MultipleValues> transform(String key, String value) {
MultipleValues itemValueFromStore = keyValueStore.get(key);
if (isNull(itemValueFromStore)) {
itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
} else {
List<String> values = new ArrayList<>(itemValueFromStore.getValues());
values.add(value);
itemValueFromStore = itemValueFromStore.toBuilder()
.values(values)
.build();
}
if (itemValueFromStore.getValues().size() >= 50) {
processorContext.forward(key, itemValueFromStore);
keyValueStore.put(key, null);
} else {
keyValueStore.put(key, itemValueFromStore);
}
return null;
}
private void doPunctuate(long timestamp) {
KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all();
while (valuesIterator.hasNext()) {
KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
if (nonNull(keyValue.value)) {
processorContext.forward(keyValue.key, keyValue.value);
keyValueStore.put(keyValue.key, null);
}
}
}
#Override
public void close() {
scheduledPunctuator.cancel();
}
}
and we need to create key-value store, add it to StreamsBuilder, and build KStream flow using transform method
Properties props = new Properties();
...
Serde<MultipleValues> multipleValuesSerge = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
StreamsBuilder builder = new StreamsBuilder();
String storeName = "multipleValuesStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerge);
builder.addStateStore(storeBuilder);
builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
.transform(() -> new MultipleValuesTransformer(storeName), storeName)
.print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
with such approach we used the incoming key for which we did aggregation. if you need to collect messages not by key, but by some message's fields, you need the following flow to trigger rebalancing on KStream (by using intermediate topic):
.selectKey(..)
.through(intermediateTopicName)
.transform( ..)
The simplest way might be, to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.
To enforce processing if you don't read the limit in a certain amount of time, you can register a wall-clock punctuation.
It seems that there is no need to use Processors or Transformers and transform() to batch events by count. Regular groupBy() and reduce()/aggregate() should do the trick:
KeyValueSerde keyValueSerde = new KeyValueSerde(); // simple custom Serde
final AtomicLong batchCount = new AtomicLong(0L);
myKStream
.groupBy((k,v) -> KeyValue.pair(k, batchCount.getAndIncrement() / batchSize),
Grouped.keySerde(keyValueSerde))
.reduce(this::windowReducer) // <-- how you want to aggregate values in batch
.toStream()
.filter((k,v) -> /* pass through full batches only */)
.selectKey((k,v) -> k.key)
...
You'd also need to add straightforward Serde for the standard KeyValue<String, Long>.
This option is obviously only helpful when you don't need a "punctuator" to emit incomplete batches on timeout. It also doesn't guarantee the order of elements in the batch in case of distributed processing.
You can also concatenate count to the key string to form the new key (instead of using KeyValue). That would simplify example even further (to using Serdes.String()).

Kafka Stream giving weird output

I'm playing around with Kafka Streams trying to do basic aggregations (for the purpose of this question, just incrementing by 1 on each message). On the output topic that receives the changes done to the KTable, I get really weird output:
#B�
#C
#C�
#D
#D�
#E
#E�
#F
#F�
I recognize that the "�" means that it's printing out some kind of character that doesn't exist in the character set, but I'm not sure why. Here's my code for reference:
public class KafkaMetricsAggregator {
public static void main(final String[] args) throws Exception {
final String bootstrapServers = args.length > 0 ? args[0] : "my-kafka-ip:9092";
final Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-aggregator");
// Where to find Kafka broker(s).
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
// Specify default (de)serializers for record keys and for record values.
streamsConfig.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfig.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
// Records should be flushed every 10 seconds. This is less than the default
// in order to keep this example interactive.
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
// For illustrative purposes we disable record caches
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
// Class to extract the timestamp from the event object
streamsConfig.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, "my.package.EventTimestampExtractor");
// Set up serializers and deserializers, which we will use for overriding the default serdes
// specified above.
final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());
final Serde<String> stringSerde = Serdes.String();
final Serde<Double> doubleSerde = Serdes.Double();
final KStreamBuilder builder = new KStreamBuilder();
final KTable<String, Double> aggregatedMetrics = builder.stream(jsonSerde, jsonSerde, "test2")
.groupBy(KafkaMetricsAggregator::generateKey, stringSerde, jsonSerde)
.aggregate(
() -> 0d,
(key, value, agg) -> agg + 1,
doubleSerde,
"metrics-table2");
aggregatedMetrics.to(stringSerde, doubleSerde, "metrics");
final KafkaStreams streams = new KafkaStreams(builder, streamsConfig);
// Only clean up in development
streams.cleanUp();
streams.start();
// Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
}
EDIT: Using aggregatedMetrics.print(); does print out the correct output to the console:
[KSTREAM-AGGREGATE-0000000002]: my-generated-key , (43.0<-null)
Any ideas about what's going on?
You're using Serdes.Double() for your values, that uses a binary efficient encoding [1] for the serialised values and that's what you're seeing on your topic. To get human-readable numbers on the console, you'd need to instruct the consumer to use the DoubleDeserializer too.
[1] https://github.com/apache/kafka/blob/e31c0c9bdbad432bc21b583bd3c084f05323f642/clients/src/main/java/org/apache/kafka/common/serialization/DoubleSerializer.java#L29-L44
Specify DoubleDeserializer as value deserializer at consumer's command line as shown below
--property value.deserializer=org.apache.kafka.common.serialization.DoubleDeserializer