I am trying to perform a count operation on a KStream and running into some difficulty understanding how serialization works here. I have a stream that is pushing people information, e.g. name and age. After consuming this stream, I am trying to create a KTable with a count of people by age.
Input:
{"name" : "abc","age" : "15"}
Output:
30, 10
20, 4
10, 8
35, 22
...
Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "person_processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
Processor
KStream<Object, Person> people = builder.stream("people");
people.print(Printed.<Object, Person>toSysOut().withLabel("consumer-1"));
Output
[consumer-1]: null, [B@7e37bab6
Question-1
I understand that data in the topic is in bytes. I am not setting any Serdes for Key or Value to start with. Is KStream converting the input from bytes to Person and printing the address of Person here?
Question-2
When I add the below value Serdes, I get a more meaningful output. Is the byte information here getting converted to String and then to Person? Why is the value now printed correctly?
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
[consumer-1]: null, {"name" : "abc","age" : "15"}
Question-3
Now, when performing the count on the age, I get a runtime error on converting a String to Person. If groupBy is setting the age as the Key and the count as Long, why is the String to Person conversion happening?
KTable<Integer, Long> integerLongKTable = people.groupBy((key, value) -> value.getAge())
.count();
Exception in thread "person_processor-9ff96b38-4beb-4594-b2fe-ae191bf6b9ff-StreamThread-1" java.lang.ClassCastException: java.lang.String cannot be cast to com.example.kafkastreams.KafkaStreamsApplication$Person
at org.apache.kafka.streams.kstream.internals.KStreamImpl$1.apply(KStreamImpl.java:152)
at org.apache.kafka.streams.kstream.internals.KStreamImpl$1.apply(KStreamImpl.java:149)
Edit-1
After reading through the response from @Matthias J. Sax, I created a PersonSerde using the Serializer and Deserializer from this location, but I now get the SerializationException below:
https://github.com/apache/kafka/tree/1.0/streams/examples/src/main/java/org/apache/kafka/streams/examples/pageview
static class Person {
String name;
String age;
public Person(String name, String age) {
this.name = name;
this.age = age;
}
void setName(String name) {
this.name = name;
}
String getName() {
return name;
}
void setAge(String age) {
this.age = age;
}
String getAge() {
return age;
}
@Override
public String toString() {
return "Person {name:" + this.getName() + ",age:" + this.getAge() + "}";
}
}
public class PersonSerde implements Serde {
@Override
public void configure(Map map, boolean b) {
}
@Override
public void close() {
}
@Override
public Serializer serializer() {
Map<String, Object> serdeProps = new HashMap<>();
final Serializer<Person> personSerializer = new JsonPOJOSerializer<>();
serdeProps.put("JsonPOJOClass", Person.class);
personSerializer.configure(serdeProps, false);
return personSerializer;
}
@Override
public Deserializer deserializer() {
Map<String, Object> serdeProps = new HashMap<>();
final Deserializer<Person> personDeserializer = new JsonPOJODeserializer<>();
serdeProps.put("JsonPOJOClass", Person.class);
personDeserializer.configure(serdeProps, false);
return personDeserializer;
}
}
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, personSerde.getClass());
KTable<String, Long> count = people.selectKey((key, value) -> value.getAge()).groupByKey(Serialized.with(Serdes.String(), personSerde))
.count();
Error
Caused by: org.apache.kafka.common.errors.SerializationException: Error serializing JSON message
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer found for class com.example.kafkastreams.KafkaStreamsApplication$Person and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS)
at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77)
at com.fasterxml.jackson.databind.SerializerProvider.reportBadDefinition(SerializerProvider.java:1191)
at com.fasterxml.jackson.databind.DatabindContext.reportBadDefinition(DatabindContext.java:313)
Edit 5
So it appears that when I mapValues to a String, the count works correctly, but when I use it on a custom object, it fails:
KStream<String, Person> people = builder.stream("person-topic", Consumed.with(Serdes.String(), personSerde));
people.print(Printed.<String, Person>toSysOut().withLabel("person-source"));
KStream<String, Person> agePersonKStream = people.selectKey((key, value) -> value.getAge());
agePersonKStream.print(Printed.<String, Person>toSysOut().withLabel("age-person"));
KStream<String, String> stringStringKStream = agePersonKStream.mapValues((person -> person.name));
stringStringKStream.print(Printed.<String, String>toSysOut().withLabel("age-name"));
KTable<String, Long> stringLongKTable = stringStringKStream.groupByKey(Serialized.with(Serdes.String(), Serdes.String())).count();
stringLongKTable.toStream().print(Printed.<String, Long>toSysOut().withLabel("age-count"));
Without the third step (mapValues to name), the fourth step (groupByKey().count()) fails.
Question-1 I understand that data in the topic is in bytes. I am not setting any Serdes for Key or Value to start with. Is KStream converting the input from bytes to Person and printing the address of Person here?
If you don't specify any Serde in StreamsConfig or in builder.stream(..., Consumed.with(/*serdes*/)), the bytes won't be converted into a Person object; the object will be of type byte[]. Thus, print() calls byte[].toString(), which results in the cryptic output ([B@7e37bab6) you see.
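To make the difference concrete, here is a hedged sketch (personSerde stands for a hypothetical Serde&lt;Person&gt; implementation, as discussed in the final remarks below):
// without explicit serdes the default (ByteArraySerde) applies and values stay byte[]
KStream<byte[], byte[]> raw = builder.stream("people");

// with explicit serdes the bytes are deserialized into the declared types
KStream<String, Person> typed = builder.stream(
        "people",
        Consumed.with(Serdes.String(), personSerde));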
Question-2 When I add the below value Serdes, I get a more meaningful output. Is the byte information here getting converted to String and then to Person? Why is the value now printed correctly?
Because you specify Serdes.String() in StreamsConfig, the bytes are converted into a String. That StringSerde can deserialize the bytes in a meaningful way only works because your data is apparently serialized as JSON, i.e. plain text, which explains why StringSerde can turn the bytes into a readable String.
Question-3 Now, when performing the count on the age, I get a runtime error on converting a String to Person. If groupBy is setting the age as the Key and the count as Long, why is the String to Person conversion happening?
That is expected. Because the bytes are converted into a String object (as you specified Serdes.String()), the cast to Person cannot be performed.
Final remarks:
You don't get a ClassCastException if you only use print(), because in this case no cast operation is performed; Java only inserts a cast operation where one is required.
For groupBy() you use value.getAge() and thus Java inserts a cast here (it knows that the expected type is Person, because it's specified via KStream<Object, Person> people = ...). For print() only toString() is called, which is defined on Object, and thus no cast is required.
Generics in Java are type hints for the compiler and are erased to Object (with casts inserted at compile time where required). Thus, for print() an Object variable can point to a byte[] without a problem and toString() is called successfully. For the groupBy() case the compiler casts Object to Person to be able to call getAge() -- however, this fails because the actual type is String.
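A tiny standalone illustration of the difference (the JSON literal just mirrors the input record):
Object value = "{\"name\" : \"abc\",\"age\" : \"15\"}";   // the actual runtime type is String
System.out.println(value.toString());                     // print() case: no cast needed
Person person = (Person) value;                            // groupBy() case: compiler-inserted cast
// -> java.lang.ClassCastException: java.lang.String cannot be cast to ...Person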
To get your code working, you need to create a PersonSerde class that implements Serde<Person> and specify it as the value serde.
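A minimal sketch of such a serde, assuming the data is JSON and Jackson is on the classpath (not the only possible implementation). Note that Person then needs public getters/setters, or public fields, for Jackson to discover its properties -- the package-private accessors shown in Edit-1 are what trigger the InvalidDefinitionException there:
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

public class PersonSerde implements Serde<Person> {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override public void configure(Map<String, ?> configs, boolean isKey) { }
    @Override public void close() { }

    @Override
    public Serializer<Person> serializer() {
        return new Serializer<Person>() {
            @Override public void configure(Map<String, ?> configs, boolean isKey) { }
            @Override public void close() { }
            @Override
            public byte[] serialize(String topic, Person person) {
                try {
                    return person == null ? null : MAPPER.writeValueAsBytes(person);
                } catch (Exception e) {
                    throw new SerializationException("Error serializing Person", e);
                }
            }
        };
    }

    @Override
    public Deserializer<Person> deserializer() {
        return new Deserializer<Person>() {
            @Override public void configure(Map<String, ?> configs, boolean isKey) { }
            @Override public void close() { }
            @Override
            public Person deserialize(String topic, byte[] bytes) {
                try {
                    return bytes == null ? null : MAPPER.readValue(bytes, Person.class);
                } catch (Exception e) {
                    throw new SerializationException("Error deserializing Person", e);
                }
            }
        };
    }
}
With that in place, the stream can be consumed via Consumed.with(Serdes.String(), new PersonSerde()) and grouped via Serialized.with(Serdes.String(), new PersonSerde()).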
Related
I have a particular requirement for invoking an Interactive Query from inside a stream. This is because I need to create a new stream which should contain the data held in the state store. Truncated code below:
tempModifiedDataStream.to(topic.getTransformedTopic(), Produced.with(Serdes.String(), Serdes.String()));
GlobalKTable<String, String> myMetricsTable = builder.globalTable(
topic.getTransformedTopic(),
Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
topic.getTransformedStoreName() /* table/store name */)
.withKeySerde(Serdes.String()) /* key serde */
.withValueSerde(Serdes.String()) /* value serde */
);
KafkaStreams streams = new KafkaStreams(builder.build(), kStreamsConfigs());
KStream<String, String> tempAggrDataStream = tempModifiedDataStream
.flatMap((key, value) -> {
try {
List<KeyValue<String, String>> result = new ArrayList<>();
ReadOnlyKeyValueStore<String, String> keyValueStore =
streams.store(
topic.getTransformedStoreName(),
QueryableStoreTypes.keyValueStore());
In the last line, to access the state store I need to have the KafkaStreams object, and the topology is finalized when I create the KafkaStreams object. The problem with this approach is that 'tempAggrDataStream' is therefore not part of the topology, so that part of the code does not get executed. And I can't move the KafkaStreams definition further down, as otherwise I can't call the Interactive Query.
I am a bit new to Kafka Streams, so is this something silly on my side?
If you want to send the whole content of the topic after each data modification, I think you should rather use the Processor API.
You could create an org.apache.kafka.streams.kstream.Transformer with a state store.
For each processed message it will update the state store and send the whole content downstream.
It is not very efficient, because for each processed message it forwards the whole content of the topic/state store (which can be thousands or millions of records).
If you only need the latest value per key, it is enough to set your topic's cleanup.policy to compact and, on the consuming side, use a KTable, which gives you a table abstraction (a snapshot of the stream); see the sketch below.
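A hedged sketch of that KTable alternative, assuming String keys and values (the topic name is a placeholder):
// the table always reflects the latest value per key; with cleanup.policy=compact
// the underlying topic also retains only the latest value per key
KTable<String, String> latest = builder.table(
        "my-compacted-topic",
        Consumed.with(Serdes.String(), Serdes.String()));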
Sample Transformer code for forwarding the whole content of the state store follows; all the work is done in the transform(String key, String value) method.
public class SampleTransformer
implements Transformer<String, String, KeyValue<String, String>> {
private String stateStoreName;
private KeyValueStore<String, String> stateStore;
private ProcessorContext context;
public SampleTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
@Override
@SuppressWarnings("unchecked")
public void init(ProcessorContext context) {
this.context = context;
stateStore = (KeyValueStore) context.getStateStore(stateStoreName);
}
@Override
public KeyValue<String, String> transform(String key, String value) {
stateStore.put(key, value);
stateStore.all().forEachRemaining(keyValue -> context.forward(keyValue.key, keyValue.value));
return null;
}
@Override
public void close() {
}
}
More information about the Processor API can be found here:
https://docs.confluent.io/current/streams/developer-guide/processor-api.html
https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
How to combine the Processor API with the Streams DSL is described here:
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
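For completeness, a rough sketch of wiring such a Transformer into the DSL (store and topic names are placeholders):
// register the state store the transformer will read from and write to
StoreBuilder<KeyValueStore<String, String>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("my-store"),
                Serdes.String(),
                Serdes.String());
builder.addStateStore(storeBuilder);

// attach the transformer and name the store it accesses
builder.<String, String>stream("input-topic")
        .transform(() -> new SampleTransformer("my-store"), "my-store")
        .to("output-topic");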
To reduce database hits when reading data via the query, I am planning to keep the result in a cache. To do this I am using Guava caching.
studentController.java
public Map<String, Object> getSomeMethodName(Number departmentId, String departmentType){
ArrayList<Student> studentList = studentManager.getStudentListByDepartmentType(departmentId, departmentType);
----------
----------
}
StudentHibernateDao.java (criteria query)
@Override
public ArrayList<Student> getStudentListByDepartmentType(Number departmentId, String departmentType) {
Criteria criteria =sessionFactory.getCurrentSession().createCriteria(Student.class);
criteria.add(Restrictions.eq("departmentId", departmentId));
criteria.add(Restrictions.eq("departmentType", departmentType));
ArrayList<Student> studentList = (ArrayList)criteria.list();
return studentList;
}
To cache the criteria query result, I started off by building a CacheBuilder, like below.
private static LoadingCache<Number departmentId, String departmentType, ArrayList<Student>> studentListCache = CacheBuilder
.newBuilder().expireAfterAccess(1, TimeUnit.MINUTES)
.maximumSize(1000)
.build(new CacheLoader<Number departmentId, String departmentType, ArrayList<Student>>() {
public ArrayList<Student> load(String key) throws Exception {
return getStudentListByDepartmentType(departmentId, departmentType);
}
});
Here I don't know where to put the CacheBuilder and how to pass multiple key parameters (i.e. departmentId and departmentType) to the CacheLoader and call it.
Is this the correct way of caching using Guava? Am I missing anything?
Guava's cache only accepts two type parameters, a key and a value type. If you want your key to be a compound key then you need to build a new compound type to encapsulate it. Effectively it would need to look like this (I apologize for my syntax, I don't use Java that often):
// Compound Key type
class CompoundDepartmentId {
Long departmentId;
String departmentType;
public CompoundDepartmentId(Long departmentId, String departmentType) {
this.departmentId = departmentId;
this.departmentType = departmentType;
}
// remember to also override equals() and hashCode(), otherwise the cache
// cannot recognize two keys built from the same departmentId/departmentType
}
private static LoadingCache<CompoundDepartmentId, ArrayList<Student>> studentListCache =
CacheBuilder
.newBuilder().expireAfterAccess(1, TimeUnit.MINUTES)
.maximumSize(1000)
.build(new CacheLoader<CompoundDepartmentId, ArrayList<Student>>() {
public ArrayList<Student> load(CompoundDepartmentId key) throws Exception {
return getStudentListByDepartmentType(key.departmentId, key.departmentType);
}
});
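Looking the result up then goes through the cache instead of the DAO directly; a hedged sketch (getUnchecked avoids the checked ExecutionException that get would throw):
ArrayList<Student> studentList =
        studentListCache.getUnchecked(
                new CompoundDepartmentId(departmentId.longValue(), departmentType));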
I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it is reading multiple topics and writing them out as gs://bucket/topic/...
When reading only a single topic I used TextIO in the last step of my pipeline:
TextIO.write()
.to(
new DateNamedFiles(
String.format("gs://bucket/data%s/", suffix), currentMillisString))
.withWindowedWrites()
.withTempDirectory(
FileBasedSink.convertToFileResourceIfPossible(
String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
.withNumShards(1));
This is a similar question, whose code I tried to adapt.
FileIO.<EventType, Event>writeDynamic()
.by(
new SerializableFunction<Event, EventType>() {
@Override
public EventType apply(Event input) {
return EventType.TRANSFER; // should return real type here, just a dummy
}
})
.via(
Contextful.fn(
new SerializableFunction<Event, String>() {
@Override
public String apply(Event input) {
return "Dummy"; // should return the Event converted to a String
}
}),
TextIO.sink())
.to(DynamicFileDestinations.constant(new DateNamedFiles("gs://bucket/tmp%s/%s/",
currentMillisString),
new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
return null; // Not sure what this should do exactly, but it needs to
// include the EventType into the path
}
}))
.withTempDirectory(
FileBasedSink.convertToFileResourceIfPossible(
String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
.withNumShards(1))
The official JavaDoc contains example code which seems to have outdated method signatures (the .via method seems to have switched the order of its arguments). I furthermore stumbled across the example in FileIO which confused me - shouldn't TransactionType and Transaction in this line change places?
After a night of sleep and a fresh start I figured out the solution. I used the functional Java 8 style as it makes the code shorter (and more readable):
.apply(
FileIO.<String, Event>writeDynamic()
.by((SerializableFunction<Event, String>) input -> input.getTopic())
.via(
Contextful.fn(
(SerializableFunction<Event, String>) input -> input.getPayload()),
TextIO.sink())
.to(String.format("gs://bucket/data%s/", suffix))
.withNaming(type -> FileNaming.getNaming(type, "", currentMillisString))
.withDestinationCoder(StringUtf8Coder.of())
.withTempDirectory(
String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString))
.withNumShards(1));
Explanation:
Event is a Java POJO containing the payload of the Kafka message and the topic it belongs to; it is parsed in a ParDo after the KafkaIO step.
suffix is either dev or empty and is set by environment variables.
currentMillisString contains the timestamp when the whole pipeline was launched, so that new files don't overwrite old files on GCS when a pipeline gets restarted.
FileNaming implements a custom naming scheme and receives the type of the event (the topic) in its constructor; it uses a custom formatter to write to daily partitioned "sub-folders" on GCS:
class FileNaming implements FileIO.Write.FileNaming {
static FileNaming getNaming(String topic, String suffix, String currentMillisString) {
return new FileNaming(topic, suffix, currentMillisString);
}
private static final DateTimeFormatter FORMATTER = DateTimeFormat
.forPattern("yyyy-MM-dd").withZone(DateTimeZone.forTimeZone(TimeZone.getTimeZone("Europe/Zurich")));
private final String topic;
private final String suffix;
private final String currentMillisString;
private String filenamePrefixForWindow(IntervalWindow window) {
return String.format(
"%s/%s/%s_", topic, FORMATTER.print(window.start()), currentMillisString);
}
private FileNaming(String topic, String suffix, String currentMillisString) {
this.topic = topic;
this.suffix = suffix;
this.currentMillisString = currentMillisString;
}
@Override
public String getFilename(
BoundedWindow window,
PaneInfo pane,
int numShards,
int shardIndex,
Compression compression) {
IntervalWindow intervalWindow = (IntervalWindow) window;
String filenamePrefix = filenamePrefixForWindow(intervalWindow);
String filename =
String.format(
"pane-%d-%s-%05d-of-%05d%s",
pane.getIndex(),
pane.getTiming().toString().toLowerCase(),
shardIndex,
numShards,
suffix);
String fullName = filenamePrefix + filename;
return fullName;
}
}
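Putting the two format strings together, a file written for e.g. a transfer topic would end up under a path roughly like gs://bucket/data/transfer/2018-09-14/1536912000000_pane-0-on_time-00000-of-00001 (date, timestamp and pane details are illustrative).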
I am receiving JSON data which I simply want to map to a CarClass and create a new stream from, but the map method doesn't allow me to map to a custom data type:
The method map(KeyValueMapper>) in the type KStream is not applicable for the arguments (new KeyValueMapper>(){})?
From http://docs.confluent.io/current/streams/developer-guide.html#stateless-transformations:
The example changes the value type from byte[] to Integer. For String to CarClass it would be the same.
KStream<byte[], String> stream = ...;
// Java 8+ example, using lambda expressions
// Note how we change the key and the key type (similar to `selectKey`)
// as well as the value and the value type.
KStream<String, Integer> transformed = stream.map(
(key, value) -> KeyValue.pair(value.toLowerCase(), value.length()));
// Java 7 example
KStream<String, Integer> transformed = stream.map(
new KeyValueMapper<byte[], String, KeyValue<String, Integer>>() {
@Override
public KeyValue<String, Integer> apply(byte[] key, String value) {
return new KeyValue<>(value.toLowerCase(), value.length());
}
});
However, if you only want to modify the value, I would recommend using mapValues() instead of map().
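For example, a hedged sketch of mapValues() that parses the JSON String value into a CarClass with Jackson (CarClass itself, the topic name, and the ObjectMapper setup are assumptions):
final ObjectMapper mapper = new ObjectMapper();

KStream<String, String> jsonStream = builder.stream("car-topic",
        Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, CarClass> cars = jsonStream.mapValues(value -> {
    try {
        return mapper.readValue(value, CarClass.class);   // JSON String -> CarClass
    } catch (IOException e) {
        throw new RuntimeException("Could not parse CarClass from: " + value, e);
    }
});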
I have a class that looks as follows
class Person {
Long id;
String firstName;
int age;
}
and my input either looks like this:
{ "id": null, "firstName": "John", "age": 10 }
or like this:
{ "id": 123 }
The first variant represents a "new" (non-persisted) person and the second refers to a person by its database id.
If id is non-null, I would like to load the object from database during deserialization, otherwise fallback on regular parsing and deserialize it as a new object.
What I've tried: I currently have a JsonDeserializer for database-deserialization, but as I understand it, there is no way to "fall back" on regular parsing. According to this answer I should use a TypeAdapterFactory and the getDelegateAdapter. My problem with this approach is that I'm given a JsonReader (and not for instance a JsonElement) so I can't determine if the input contains a valid id without consuming input.
Any suggestions on how to solve this?
I don't know if I understand your question correctly, but if you already have a JsonDeserializer, you should have a method like this one in there:
public Person deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context) { ... }
In this method you have the object context of type JsonDeserializationContext, which allows you to invoke default deserialization on a specified object.
So, you could do something like this inside your custom deserializer:
//If id is null...
Person person = context.deserialize(json, Person.class);
See JsonDeserializationContext documentation.
I think I managed to figure this out with the help of the answer over here.
Here is a working type adapter factory:
new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(Gson gson, TypeToken<T> type) {
final TypeAdapter<T> delegate = gson.getDelegateAdapter(this, type);
final TypeAdapter<JsonElement> elementAdapter =
gson.getAdapter(JsonElement.class);
// Are we asked to parse a person?
if (!type.getType().equals(Person.class))
return null;
return new TypeAdapter<T>() {
@Override
public T read(JsonReader reader) throws IOException {
JsonElement tree = elementAdapter.read(reader);
JsonElement id = tree.getAsJsonObject().get("id");
if (id == null || id.isJsonNull())   // "id" absent or explicitly null -> regular parsing
return delegate.fromJsonTree(tree);
return (T) findObj(id.getAsLong());
}
@Override
public void write(JsonWriter writer, T obj) throws IOException {
delegate.write(writer, obj);
}
};
}
}
I haven't fully tested it yet and I'll get back and revise it if needed. (Posting it now to open up for feedback on the approach.)
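For reference, such a factory would be registered with a GsonBuilder like this (the variable name personAdapterFactory and the inline JSON simply mirror the question):
Gson gson = new GsonBuilder()
        .registerTypeAdapterFactory(personAdapterFactory)  // the factory shown above
        .create();

// "id" present -> the object is loaded via findObj(...)
Person fromDb = gson.fromJson("{ \"id\": 123 }", Person.class);

// "id" missing or null -> falls back to the delegate (regular) parsing
Person fresh = gson.fromJson("{ \"id\": null, \"firstName\": \"John\", \"age\": 10 }", Person.class);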