Is it possible to deserialize an Avro message (consumed from Kafka) without giving the reader schema in ConfluentRegistryAvroDeserializationSchema? - apache-kafka

I am using Kafka Connector in Apache Flink for access to streams served by Confluent Kafka.
Apart from the schema registry URL, ConfluentRegistryAvroDeserializationSchema.forGeneric(...) expects a 'reader' schema.
Instead of providing a reader schema, I want to use the writer's schema (looked up in the registry) for reading the message too, because the consumer will not have the latest schema.
FlinkKafkaConsumer010<GenericRecord> myConsumer =
    new FlinkKafkaConsumer010<>(
        "topic-name",
        ConfluentRegistryAvroDeserializationSchema.forGeneric(<reader schema goes here>, "http://host:port"),
        properties);
myConsumer.setStartFromLatest();
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/connectors/kafka.html
"Using these deserialization schema record will be read with the schema that was retrieved from Schema Registry and transformed to a statically provided"
Since I do not want to keep the schema definition on the consumer side, how do I deserialize the Avro message from Kafka using the writer's schema?
Appreciate your help!

I don't think it is possible to use ConfluentRegistryAvroDeserializationSchema.forGeneric directly. It is intended to be used with a reader schema, and it has preconditions checking for this.
You have to implement your own. Two important things:
Set specific.avro.reader to false (otherwise you'll get specific records)
The KafkaAvroDeserializer has to be lazily initialized (because it isn't serializable itself, as it holds a reference to the schema registry client)
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

public class KafkaGenericAvroDeserializationSchema
        implements KeyedDeserializationSchema<GenericRecord> {

    private final String registryUrl;
    // KafkaAvroDeserializer is not serializable, so it is created lazily (see checkInitialized)
    private transient KafkaAvroDeserializer inner;

    public KafkaGenericAvroDeserializationSchema(String registryUrl) {
        this.registryUrl = registryUrl;
    }

    @Override
    public GenericRecord deserialize(
            byte[] messageKey, byte[] message, String topic, int partition, long offset) {
        checkInitialized();
        return (GenericRecord) inner.deserialize(topic, message);
    }

    @Override
    public boolean isEndOfStream(GenericRecord nextElement) {
        return false;
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeExtractor.getForClass(GenericRecord.class);
    }

    private void checkInitialized() {
        if (inner == null) {
            Map<String, Object> props = new HashMap<>();
            props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, registryUrl);
            // false => deserialize into GenericRecord instead of generated SpecificRecord classes
            props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, false);
            SchemaRegistryClient client =
                    new CachedSchemaRegistryClient(
                            registryUrl, AbstractKafkaAvroSerDeConfig.MAX_SCHEMAS_PER_SUBJECT_DEFAULT);
            inner = new KafkaAvroDeserializer(client, props);
        }
    }
}
env.addSource(
    new FlinkKafkaConsumer<>(
        topic,
        new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
        kafkaProperties));
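Note that on newer Flink versions KeyedDeserializationSchema is deprecated in favour of KafkaDeserializationSchema, which receives the whole ConsumerRecord. A minimal sketch of the same idea against that interface (assuming Flink 1.9+; the class name is illustrative) could look like this:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class GenericAvroKafkaDeserializationSchema
        implements KafkaDeserializationSchema<GenericRecord> {

    private final String registryUrl;
    private transient KafkaAvroDeserializer inner;

    public GenericAvroKafkaDeserializationSchema(String registryUrl) {
        this.registryUrl = registryUrl;
    }

    @Override
    public GenericRecord deserialize(ConsumerRecord<byte[], byte[]> record) {
        checkInitialized();
        // The Confluent deserializer looks up the writer's schema in the registry
        // using the schema id embedded in the message, so no reader schema is needed
        return (GenericRecord) inner.deserialize(record.topic(), record.value());
    }

    @Override
    public boolean isEndOfStream(GenericRecord nextElement) {
        return false;
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeExtractor.getForClass(GenericRecord.class);
    }

    private void checkInitialized() {
        // Same lazy initialization as above: KafkaAvroDeserializer is not serializable
        if (inner == null) {
            Map<String, Object> props = new HashMap<>();
            props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, registryUrl);
            props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, false);
            SchemaRegistryClient client =
                    new CachedSchemaRegistryClient(
                            registryUrl, AbstractKafkaAvroSerDeConfig.MAX_SCHEMAS_PER_SUBJECT_DEFAULT);
            inner = new KafkaAvroDeserializer(client, props);
        }
    }
}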

Related

Apache Flink typed KafkaSource

I implemented a connection to a Kafka stream as described here. Now I am attempting to write the data into a Postgres database using a JDBC sink.
The source built from Kafka seems to have no type, so when writing the SQL statements everything looks like type Nothing.
How can I use fromSource so that I actually have a typed source for Kafka?
What I have tried so far is the following:
object Main {
  def main(args: Array[String]) {
    val builder = KafkaSource.builder
    builder.setBootstrapServers("localhost:29092")
    builder.setProperty("partition.discovery.interval.ms", "10000")
    builder.setTopics("created")
    builder.setBounded(OffsetsInitializer.latest)
    builder.setStartingOffsets(OffsetsInitializer.earliest)
    builder.setDeserializer(KafkaRecordDeserializationSchema.of(new CreatedEventSchema))
    val source = builder.build()

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val streamSource = env
      .fromSource(source, WatermarkStrategy.noWatermarks, "Kafka Source")

    streamSource.addSink(JdbcSink.sink(
      "INSERT INTO conversations (timestamp, active_conversations, total_conversations) VALUES (?,?,?)",
      (statement, event) => {
        statement.setTime(1, event.date)
        statement.setInt(2, event.a)
        statement.setInt(3, event.b)
      },
      JdbcExecutionOptions.builder()
        .withBatchSize(1000)
        .withBatchIntervalMs(200)
        .withMaxRetries(5)
        .build(),
      new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        .withUrl("jdbc:postgresql://localhost:5432/reporting")
        .withDriverName("org.postgresql.Driver")
        .withUsername("postgres")
        .withPassword("veryverysecret:-)")
        .build()
    ))
    env.execute()
  }
}
This does not compile because event is of type Nothing. But I don't think it should be, because with CreatedEventSchema Flink should be able to deserialise it.
It is maybe also important to note that I really only want to process the values of the Kafka messages.
In Java you might do something like this:
KafkaSource<Event> source =
    KafkaSource.<Event>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics(TOPIC)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new EventDeserializationSchema())
        .build();
with a value deserializer along these lines:
public class EventDeserializationSchema extends AbstractDeserializationSchema<Event> {

    private static final long serialVersionUID = 1L;

    private transient ObjectMapper objectMapper;

    @Override
    public void open(InitializationContext context) {
        objectMapper = JsonMapper.builder().build().registerModule(new JavaTimeModule());
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        return objectMapper.readValue(message, Event.class);
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}
Sorry I don't have a Scala example handy, but hopefully this will point you in the right direction.
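Translated into the Java style above, the typed source then wires into the JDBC sink roughly like this (a sketch only; the Event getters are assumptions chosen to match the three SQL parameters):

DataStream<Event> events =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

events.addSink(
        JdbcSink.sink(
                "INSERT INTO conversations (timestamp, active_conversations, total_conversations) VALUES (?,?,?)",
                (statement, event) -> {
                    // statement is a java.sql.PreparedStatement, event is a typed Event
                    statement.setTime(1, event.getDate());
                    statement.setInt(2, event.getActiveConversations());
                    statement.setInt(3, event.getTotalConversations());
                },
                JdbcExecutionOptions.builder()
                        .withBatchSize(1000)
                        .withBatchIntervalMs(200)
                        .withMaxRetries(5)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://localhost:5432/reporting")
                        .withDriverName("org.postgresql.Driver")
                        .withUsername("postgres")
                        .withPassword("...")
                        .build()));

Because the deserialization schema declares the produced type, the DataStream is typed as Event and the statement builder lambda sees typed fields instead of Nothing.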

Best practice for implementing Micronaut/Kafka-Streams with more than one KStream/KTable?

There are several details about the example Micronaut/Kafka Streams application which I don't understand. Here is the example class from the documentation (original link: https://micronaut-projects.github.io/micronaut-kafka/latest/guide/#kafkaStreams).
My questions are:
Why are we returning only the source stream?
If we have multiple source KStream objects, e.g. to do a join, do we need to make them Beans as well?
Do we also need to make each source KTable a Bean?
What happens if we don't make a source KStream or KTable a Bean? We currently have at least one project that does this, with no apparent problems.
import io.micronaut.configuration.kafka.streams.ConfiguredStreamBuilder;
import io.micronaut.context.annotation.Factory;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import javax.inject.Named;
import javax.inject.Singleton;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;

@Factory
public class WordCountStream {

    public static final String STREAM_WORD_COUNT = "word-count";
    public static final String INPUT = "streams-plaintext-input";
    public static final String OUTPUT = "streams-wordcount-output";
    public static final String WORD_COUNT_STORE = "word-count-store";

    @Singleton
    @Named(STREAM_WORD_COUNT)
    KStream<String, String> wordCountStream(ConfiguredStreamBuilder builder) {
        // set default serdes
        Properties props = builder.getConfiguration();
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        KStream<String, String> source = builder
                .stream(INPUT);

        KTable<String, Long> groupedByWord = source
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
                // Store the result in a store for lookup later
                .count(Materialized.as(WORD_COUNT_STORE));

        groupedByWord
                // convert to stream
                .toStream()
                // send to output using specific serdes
                .to(OUTPUT, Produced.with(Serdes.String(), Serdes.Long()));

        return source;
    }
}
Edit: Here's a version of our service with multiple streams, edited to remove identifying info.
@Factory
public class TopologyCopy {

    private static class DataOut {}
    private static class DataInOne {}
    private static class DataInTwo {}
    private static class DataInThree {}

    @Singleton
    @Named("data")
    KStream<Integer, DataOut> dataStream(ConfiguredStreamBuilder builder) {
        Properties props = builder.getConfiguration();
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);

        KStream<Integer, DataInOne> dataOneStream = builder.stream("data-one",
                Consumed.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInOne.class)));
        KStream<Integer, DataInTwo> dataTwoStream = builder.stream("data-two",
                Consumed.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInTwo.class)));
        GlobalKTable<Integer, DataInThree> signalTable = builder.globalTable("data-three",
                Consumed.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInThree.class)),
                Materialized.as("data-three-store"));

        KTable<Integer, DataInTwo> dataTwoTable = dataTwoStream
                .groupByKey()
                .aggregate(() -> null, (key, device, storedDevice) -> device,
                        Materialized.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInTwo.class)));

        dataOneStream
                .transformValues(() -> /* MAGIC */)
                .join(dataTwoTable, (data1, data2) -> /* MAGIC */)
                .selectKey((something, msg) -> /* MAGIC */)
                .to("topic-out", Produced.with(Serdes.UUID(), new JsonSerde<>(OutMessage.class)));

        return dataOneStream;
    }
}

Apache beam get kafka data execute SQL error:Cannot call getSchema when there is no schema

I will feed data from multiple tables into Kafka, and Beam will execute SQL after getting the data, but I now get the following error:
Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema
    at org.apache.beam.sdk.values.PCollection.getSchema(PCollection.java:328)
    at org.apache.beam.sdk.extensions.sql.impl.schema.BeamPCollectionTable.<init>(BeamPCollectionTable.java:34)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.toTableMap(SqlTransform.java:141)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:102)
    at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:82)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:539)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:473)
    at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:248)
    at BeamSqlTest.main(BeamSqlTest.java:65)
Is there a feasible solution? Please help me!
I think you need to set a schema for your input collection PCollection<Row> with setRowSchema() or setSchema(). The problem is that your schema is dynamic and defined at runtime (I'm not sure Beam supports this). Could you use a static schema, defined before you start processing the input data?
Also, since your input source is unbounded, you need to define windows in order to apply SqlTransform afterwards.
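For example, a minimal sketch of that suggestion, with a fixed schema for the {id, name, age, score} payload shown in the code below and a fixed window before the SQL (the ToRowFn name and the one-minute window are illustrative assumptions; Window, FixedWindows and org.joda.time.Duration imports are needed):

// Static schema declared up front instead of being derived per element
Schema schema = Schema.builder()
        .addInt32Field("id")
        .addStringField("name")
        .addInt32Field("age")
        .addInt32Field("score")
        .build();

PCollection<Row> rows = lines
        .apply(ParDo.of(new ToRowFn(schema)))   // a DoFn like the one below, but building Rows against the fixed schema
        .setRowSchema(schema);                  // gives the collection a schema so SqlTransform can be applied

PCollection<Row> windowed = rows.apply(
        Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1))));

PCollection<Row> result = PCollectionTuple.of(new TupleTag<>("USER_TABLE"), windowed)
        .apply(SqlTransform.query(
                "SELECT COUNT(id) total_count, SUM(score) total_score FROM USER_TABLE GROUP BY id"));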
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.beam.repackaged.sql.com.google.common.collect.ImmutableMap;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.ArrayList;
import java.util.List;

class BeamSqlTest {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(PipelineOptions.class);
        options.setRunner(DirectRunner.class);
        Pipeline p = Pipeline.create(options);

        PCollection<KafkaRecord<String, String>> lines = p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("192.168.8.16")
                .withTopic("tmp_table.reuslt")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withConsumerConfigUpdates(ImmutableMap.of("group.id", "beam_app"))
                .withReadCommitted()
                .commitOffsetsInFinalize());

        PCollection<Row> apply = lines.apply(ParDo.of(new DoFn<KafkaRecord<String, String>, Row>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String jsonData = c.element().getKV().getValue(); // data: {id:0001#int,name:test01#string,age:29#int,score:99#int}
                if (!"data_increment_heartbeat".equals(jsonData)) { // Filter out heartbeat information
                    JSONObject jsonObject = JSON.parseObject(jsonData);
                    Schema.Builder builder = Schema.builder();
                    // A data pipeline may have data from multiple tables so the Schema is obtained dynamically
                    // This assumes data from a single table
                    List<Object> list = new ArrayList<Object>();
                    for (String s : jsonObject.keySet()) {
                        String[] dataType = jsonObject.get(s).toString().split("#"); // data#field type
                        if (dataType[1].equals("int")) {
                            builder.addInt32Field(s);
                        } else if (dataType[1].equals("string")) {
                            builder.addStringField(s);
                        }
                        list.add(dataType[0]);
                    }
                    Schema schema = builder.build();
                    Row row = Row.withSchema(schema).addValues(list).build();
                    System.out.println(row);
                    c.output(row);
                }
            }
        }));

        PCollection<Row> result = PCollectionTuple.of(new TupleTag<>("USER_TABLE"), apply)
                .apply(SqlTransform.query("SELECT COUNT(id) total_count, SUM(score) total_score FROM USER_TABLE GROUP BY id"));

        result.apply("log_result", MapElements.via(new SimpleFunction<Row, Row>() {
            @Override
            public Row apply(Row input) {
                System.out.println("USER_TABLE result: " + input.getValues());
                return input;
            }
        }));
    }
}

Invoking Kafka Interactive Queries from inside a Stream

I have a particular requirement for invoking an Interactive Query from inside a Stream. This is because I need to create a new Stream which should contain the data held in the State Store. Truncated code below:
tempModifiedDataStream.to(topic.getTransformedTopic(), Produced.with(Serdes.String(), Serdes.String()));

GlobalKTable<String, String> myMetricsTable = builder.globalTable(
    topic.getTransformedTopic(),
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
            topic.getTransformedStoreName() /* table/store name */)
        .withKeySerde(Serdes.String()) /* key serde */
        .withValueSerde(Serdes.String()) /* value serde */
);

KafkaStreams streams = new KafkaStreams(builder.build(), kStreamsConfigs());

KStream<String, String> tempAggrDataStream = tempModifiedDataStream
    .flatMap((key, value) -> {
        try {
            List<KeyValue<String, String>> result = new ArrayList<>();
            ReadOnlyKeyValueStore<String, String> keyValueStore =
                streams.store(
                    topic.getTransformedStoreName(),
                    QueryableStoreTypes.keyValueStore());
In the last line, to access the State Store I need the KafkaStreams object, and the topology is finalized when I create the KafkaStreams object. The problem with this approach is that tempAggrDataStream is therefore not part of the topology, and that part of the code does not get executed. And I can't move the KafkaStreams definition below it, because otherwise I can't call the Interactive Query.
I am a bit new to Kafka Streams; is this something silly on my side?
If you want to send the whole content of the topic downstream after each data modification, I think you should rather use the Processor API.
You could create an org.apache.kafka.streams.kstream.Transformer with a state store.
For each processed message it will update the state store and send the whole content downstream.
It is not very efficient, because for every processed message it forwards the whole content of the topic/state store (which can be thousands or millions of records).
If you only need the latest value, it is enough to set your topic's cleanup.policy to compact, and on the other side use a KTable, which gives the abstraction of a table (a snapshot of the stream).
Sample Transformer code for forwarding the whole content of the state store follows. The whole work is done in the transform(String key, String value) method.
public class SampleTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private String stateStoreName;
    private KeyValueStore<String, String> stateStore;
    private ProcessorContext context;

    public SampleTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        stateStore = (KeyValueStore) context.getStateStore(stateStoreName);
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        stateStore.put(key, value);
        stateStore.all().forEachRemaining(keyValue -> context.forward(keyValue.key, keyValue.value));
        return null;
    }

    @Override
    public void close() {
    }
}
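To plug such a Transformer into the DSL, the state store has to be registered on the builder and its name passed to transform(). A rough sketch (store and topic names are illustrative):

StreamsBuilder builder = new StreamsBuilder();
String storeName = "content-store";

// Register the state store the transformer asks for in init()
builder.addStateStore(
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(storeName),
                Serdes.String(),
                Serdes.String()));

builder.<String, String>stream("input-topic")
        .transform(() -> new SampleTransformer(storeName), storeName)
        .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));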
More information about the Processor API can be found here:
https://docs.confluent.io/current/streams/developer-guide/processor-api.html
https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
How to combine the Processor API with the Streams DSL is described here:
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration

eclipselink + @convert(json) + postgres + list property

I'm using EclipseLink 2.6 as the persistence provider of Spring Data JPA, which, to my understanding, now allows you to serialize a subtree of an entity as JSON using the internal MOXy serializer.
So I'm trying to use this to migrate from embedded element collections to serialized JSON using the json datatype of Postgres.
I have an entity named Product, and this entity has the following mapped property:
@Convert(Convert.JSON)
private List<MetadataIndex> indexes = new ArrayList<MetadataIndex>();
MetadataIndex is a simple class with a few String properties.
I would like to convert this list of objects into JSON and store it in a column of json data type in Postgres.
I thought that the above code should suffice, but it does not. The application crashes on boot (can't create the entity manager factory - an NPE somewhere inside EclipseLink).
If I change the converter to @Convert(Convert.SERIALIZED) it works: it creates a field on the Products table named indexes of type bytea and stores the serialized list in it.
Is this an EclipseLink bug or am I missing something?
Thank you.
Well, I've used a custom EclipseLink converter to convert my classes into JSON objects and then store them in the DB using the Postgres driver directly. This is the converter:
import fr.gael.dhus.database.jpa.domain.MetadataIndex;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;
import org.eclipse.persistence.mappings.DatabaseMapping;
import org.eclipse.persistence.sessions.Session;
import org.postgresql.util.PGobject;
import javax.persistence.AttributeConverter;
import javax.persistence.Converter;
import java.io.IOException;
import java.sql.SQLException;
import java.util.Collection;
import java.util.List;

/**
 * Created by fmarino on 20/03/2015.
 */
@Converter
public class JsonConverter implements org.eclipse.persistence.mappings.converters.Converter {

    private static ObjectMapper mapper = new ObjectMapper();

    @Override
    public Object convertObjectValueToDataValue(Object objectValue, Session session) {
        try {
            PGobject out = new PGobject();
            out.setType("jsonb");
            out.setValue(mapper.writerWithType(new TypeReference<Collection<MetadataIndex>>() {})
                    .writeValueAsString(objectValue));
            return out;
        } catch (IOException e) {
            throw new IllegalArgumentException("Unable to serialize to json field ", e);
        } catch (SQLException e) {
            throw new IllegalArgumentException("Unable to serialize to json field ", e);
        }
    }

    @Override
    public Object convertDataValueToObjectValue(Object dataValue, Session session) {
        try {
            if (dataValue instanceof PGobject && ((PGobject) dataValue).getType().equals("jsonb"))
                return mapper.reader(new TypeReference<Collection<MetadataIndex>>() {}).readValue(((PGobject) dataValue).getValue());
            return "-";
        } catch (IOException e) {
            throw new IllegalArgumentException("Unable to deserialize to json field ", e);
        }
    }

    @Override
    public boolean isMutable() {
        return false;
    }

    @Override
    public void initialize(DatabaseMapping mapping, Session session) {
    }
}
As you can see, I use Jackson for serialization and specify the data type as a Collection of MetadataIndex. You can use whatever type you want here.
Inside my classes, I've mapped my field with this:
@Convert(converter = JsonConverter.class)
@Column(nullable = true, columnDefinition = "jsonb")
also adding this annotation to the class:
@Converter(converterClass = JsonConverter.class, name = "jsonConverter")
To make things work properly with Jackson, I've also added this annotation to my MetadataIndex class, on the class element:
@JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@class")
I personally like using the Postgres driver directly to store these kinds of special data types. I didn't manage to achieve the same with Hibernate.
As for the converter, I would have preferred a more general solution, but Jackson forced me to state the object type I want to convert. If you find a better way to do it, let me know.
With a similar approach, I've also managed to use the hstore datatype of Postgres.
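For the hstore case the idea is the same: hand the driver a PGobject of type hstore whose value is written in hstore text syntax. A deliberately minimal sketch (helper name is illustrative; it does not escape quotes or backslashes inside keys/values, which real code should):

// Sketch only: convert a Map<String, String> into a Postgres hstore PGobject.
private PGobject toHstore(Map<String, String> map) throws SQLException {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> entry : map.entrySet()) {
        if (sb.length() > 0) {
            sb.append(", ");
        }
        // hstore text syntax: "key"=>"value"
        sb.append('"').append(entry.getKey()).append("\"=>\"").append(entry.getValue()).append('"');
    }
    PGobject out = new PGobject();
    out.setType("hstore");
    out.setValue(sb.toString());
    return out;
}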