How to map input KStream <String,String> into <String, CarClass> using KeyValueMapper? - apache-kafka

I am receiving JSON data which I simply want to map to CarClass so I can create a new stream, but the map method doesn't allow me to map to a custom datatype. The compiler complains:
The method map(KeyValueMapper>) in the type KStream is not applicable for the arguments (new KeyValueMapper>(){})?

From http://docs.confluent.io/current/streams/developer-guide.html#stateless-transformations:
The example changes the value type from byte[] to Integer. For String to CarClass it would be the same.
KStream<byte[], String> stream = ...;

// Java 8+ example, using lambda expressions
// Note how we change the key and the key type (similar to `selectKey`)
// as well as the value and the value type.
KStream<String, Integer> transformed = stream.map(
    (key, value) -> KeyValue.pair(value.toLowerCase(), value.length()));

// Java 7 example
KStream<String, Integer> transformed = stream.map(
    new KeyValueMapper<byte[], String, KeyValue<String, Integer>>() {
      @Override
      public KeyValue<String, Integer> apply(byte[] key, String value) {
        return new KeyValue<>(value.toLowerCase(), value.length());
      }
    });
However, if you only want to modify the value, I would recommend using mapValues() instead of map().
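For your concrete case that could look like the following sketch. CarClass is assumed to be a plain POJO matching your JSON, and using Jackson's ObjectMapper for parsing is my assumption, not something your code requires:

ObjectMapper objectMapper = new ObjectMapper(); // com.fasterxml.jackson.databind
KStream<String, String> source = ...;           // your <String, String> input stream

// mapValues() keeps the key and only changes the value type
KStream<String, CarClass> cars = source.mapValues(json -> {
    try {
        return objectMapper.readValue(json, CarClass.class);
    } catch (IOException e) {
        throw new RuntimeException("Cannot deserialize value: " + json, e);
    }
});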

Related

Spring batch ItemReader locale, import a double with comma

I want to import the following file with Spring Batch
key;value
A;9,5
I model it with the bean
class CsvModel {
    String key
    Double value
}
The code shown here is Groovy, but the language is irrelevant to the problem.
@Bean
@StepScope
FlatFileItemReader<CsvModel> reader2() {
    // set the locale for the tokenizer, but this doesn't solve the problem
    def locale = Locale.getDefault()
    def fieldSetFactory = new DefaultFieldSetFactory()
    fieldSetFactory.setNumberFormat(NumberFormat.getInstance(locale))

    def tokenizer = new DelimitedLineTokenizer(';')
    tokenizer.setNames([ 'key', 'value' ].toArray() as String[])
    // and assign the fieldSetFactory to the tokenizer
    tokenizer.setFieldSetFactory(fieldSetFactory)

    def fieldMapper = new BeanWrapperFieldSetMapper<CsvModel>()
    fieldMapper.setTargetType(CsvModel.class)

    def lineMapper = new DefaultLineMapper<CsvModel>()
    lineMapper.setLineTokenizer(tokenizer)
    lineMapper.setFieldSetMapper(fieldMapper)

    def reader = new FlatFileItemReader<CsvModel>()
    reader.setResource(new FileSystemResource('output/export.csv'))
    reader.setLinesToSkip(1)
    reader.setLineMapper(lineMapper)
    return reader
}
Setting up a reader is well known; what was new for me was the first code block: setting up a NumberFormat / locale / fieldSetFactory and assigning it to the tokenizer. However, this doesn't work, I still receive the exception:
Field error in object 'target' on field 'value': rejected value [5,0]; codes [typeMismatch.target.value,typeMismatch.value,typeMismatch.float,typeMismatch]; arguments [org.springframework.context.support.DefaultMessageSourceResolvable: codes [target.value,value]; arguments []; default message [value]]; default message [Failed to convert property value of type 'java.lang.String' to required type 'float' for property 'value'; nested exception is java.lang.NumberFormatException: For input string: "9,5"]
at org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper.mapFieldSet(BeanWrapperFieldSetMapper.java:200) ~[spring-batch-infrastructure-4.1.2.RELEASE.jar:4.1.2.RELEASE]
at org.springframework.batch.item.file.mapping.DefaultLineMapper.mapLine(DefaultLineMapper.java:43) ~[spring-batch-infrastructure-4.1.2.RELEASE.jar:4.1.2.RELEASE]
at org.springframework.batch.item.file.FlatFileItemReader.doRead(FlatFileItemReader.java:180) ~[spring-batch-infrastructure-4.1.2.RELEASE.jar:4.1.2.RELEASE]
So the question is: how do I import floats in the locale de_AT (we write our decimals with a comma, like this: 3,141592)? I could avoid this problem with a FieldSetMapper, but I want to understand what's going on here and avoid the unnecessary mapper class.
And even the FieldSetMapper solution doesn't obey locales out of the box; I have to read a string and convert it to a double myself:
class PnwExportFieldSetMapper implements FieldSetMapper<CsvModel> {

    private nf = NumberFormat.getInstance(Locale.getDefault())

    @Override
    CsvModel mapFieldSet(FieldSet fieldSet) throws BindException {
        def model = new CsvModel()
        model.key = fieldSet.readString(0)
        model.value = nf.parse(fieldSet.readString(1)).doubleValue()
        return model
    }
}
The class DefaultFieldSet has a function setNumberFormat, but when and where do I call this function?
This unfortunately seems to be a bug. I have the same problem and debugged into the code.
The BeanWrapperFieldSetMapper does not use the methods of DefaultFieldSetFactory, which would do the right conversion; instead it just uses FieldSet.getProperties and does the conversion by itself.
So I see the following options: provide the BeanWrapperFieldSetMapper either with PropertyEditors or with a ConversionService, or use a different mapper.
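For the PropertyEditors route, here is a minimal sketch using Spring's CustomNumberEditor and the CsvModel type from the question; whether this covers your exact binding setup is something to verify:

Map<Class<?>, PropertyEditor> editors = new HashMap<>();
editors.put(Double.class, new CustomNumberEditor(Double.class, NumberFormat.getInstance(Locale.getDefault()), true));
editors.put(double.class, new CustomNumberEditor(Double.class, NumberFormat.getInstance(Locale.getDefault()), true));

BeanWrapperFieldSetMapper<CsvModel> fieldMapper = new BeanWrapperFieldSetMapper<>();
fieldMapper.setTargetType(CsvModel.class);
// register locale-aware editors so "9,5" binds to 9.5
fieldMapper.setCustomEditors(editors);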
Here is a sketch of a ConversionService:
private static class CS implements ConversionService {

    @Override
    public boolean canConvert(Class<?> sourceType, Class<?> targetType) {
        return sourceType == String.class && targetType == double.class;
    }

    @Override
    public boolean canConvert(TypeDescriptor sourceType, TypeDescriptor targetType) {
        return sourceType.equals(TypeDescriptor.valueOf(String.class))
            && targetType.equals(TypeDescriptor.valueOf(double.class));
    }

    @Override
    public <T> T convert(Object source, Class<T> targetType) {
        return (T) Double.valueOf(source.toString().replace(',', '.'));
    }

    @Override
    public Object convert(Object source, TypeDescriptor sourceType, TypeDescriptor targetType) {
        return Double.valueOf(source.toString().replace(',', '.'));
    }
}
and use it:
final BeanWrapperFieldSetMapper<IBISRecord> mapper = new BeanWrapperFieldSetMapper<>();
mapper.setTargetType(YourClass.class);
mapper.setConversionService(new CS());
...
new FlatFileItemReaderBuilder<IBISRecord>()
    .name("YourReader")
    .delimited()
    .delimiter(";")
    .includedFields(fields)
    .names(names)
    .fieldSetMapper(mapper)
    .saveState(false)
    .resource(resource)
    .build();
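One remark on the sketch above: the hard-coded replace(',', '.') works for simple values like 9,5 but ignores grouping separators. If you need full locale handling, the convert() methods could delegate to NumberFormat instead; again only a sketch:

@Override
public <T> T convert(Object source, Class<T> targetType) {
    try {
        // parse according to the default locale, e.g. de_AT: "1.234,5" -> 1234.5
        Number parsed = NumberFormat.getInstance(Locale.getDefault()).parse(source.toString());
        return (T) Double.valueOf(parsed.doubleValue());
    } catch (ParseException e) {
        throw new IllegalArgumentException("Cannot parse number: " + source, e);
    }
}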

Invoking Kafka Interactive Queries from inside a Stream

I have a particular requirement for invoking an Interactive Query from inside a Stream. This is because I need to create a new Stream which should have the data contained inside the State Store. Truncated code below:
tempModifiedDataStream.to(topic.getTransformedTopic(), Produced.with(Serdes.String(), Serdes.String()));

GlobalKTable<String, String> myMetricsTable = builder.globalTable(
    topic.getTransformedTopic(),
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
        topic.getTransformedStoreName() /* table/store name */)
      .withKeySerde(Serdes.String()) /* key serde */
      .withValueSerde(Serdes.String()) /* value serde */
);

KafkaStreams streams = new KafkaStreams(builder.build(), kStreamsConfigs());

KStream<String, String> tempAggrDataStream = tempModifiedDataStream
    .flatMap((key, value) -> {
        try {
            List<KeyValue<String, String>> result = new ArrayList<>();
            ReadOnlyKeyValueStore<String, String> keyValueStore =
                streams.store(
                    topic.getTransformedStoreName(),
                    QueryableStoreTypes.keyValueStore());
In the last line, to access the State Store I need the KafkaStreams object, and the Topology is finalized when I create the KafkaStreams object. The problem with this approach is that 'tempAggrDataStream' is therefore not part of the Topology, so that part of the code never gets executed. And I can't move the KafkaStreams definition below it, because otherwise I can't call the Interactive Query.
I am a bit new to Kafka Streams, so is this something silly on my side?
If you want to send the entire content of the topic downstream after each data modification, I think you should rather use the Processor API.
You could create an org.apache.kafka.streams.kstream.Transformer with a state store.
For each processed message it would update the state store and send the whole content downstream.
It is not very efficient, because for every processed message it forwards the entire content of the topic/state store (which can be thousands or millions of records).
If you only need the latest value per key, it is enough to set your topic's cleanup.policy to compact and, on the consuming side, use a KTable, which gives you a table abstraction (a snapshot of the stream).
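A minimal sketch of that compacted-topic / KTable alternative (the topic name is a placeholder):

// The table always holds the latest value per key; the compacted topic acts as its changelog
KTable<String, String> latest = builder.table(
        "your-compacted-topic",
        Consumed.with(Serdes.String(), Serdes.String()));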
Sample Transformer code for forwarding the whole content of the state store follows. All the work is done in the transform(String key, String value) method.
public class SampleTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private String stateStoreName;
    private KeyValueStore<String, String> stateStore;
    private ProcessorContext context;

    public SampleTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        stateStore = (KeyValueStore) context.getStateStore(stateStoreName);
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        stateStore.put(key, value);
        stateStore.all().forEachRemaining(keyValue -> context.forward(keyValue.key, keyValue.value));
        return null;
    }

    @Override
    public void close() {
    }
}
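To wire such a Transformer into the DSL, you register the state store on the builder and pass its name to transform(). A minimal sketch, with placeholder store and topic names:

StoreBuilder<KeyValueStore<String, String>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("content-store"),
                Serdes.String(),
                Serdes.String());
builder.addStateStore(storeBuilder);

builder.<String, String>stream("input-topic")
        .transform(() -> new SampleTransformer("content-store"), "content-store")
        .to("output-topic");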
More information about the Processor API can be found here:
https://docs.confluent.io/current/streams/developer-guide/processor-api.html
https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
How to combine the Processor API with the Streams DSL is described here:
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration

Serialization with KStream groupBy operation

I am trying to perform a count operation on a KStream and am running into some difficulty understanding how serialization works here. I have a stream that is pushing people information, e.g. name and age. After consuming this stream, I am trying to create a KTable with a count of people per age.
Input:
{"name" : "abc","age" : "15"}
Output:
30, 10
20, 4
10, 8
35, 22
...
Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "person_processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
Processor
KStream<Object, Person> people = builder.stream("people");
people.print(Printed.<Object, Person>toSysOut().withLabel("consumer-1"));
Output
[consumer-1]: null, [B@7e37bab6
Question-1
I understand that data in the topic is in bytes. I am not setting any Serdes for Key or Value to start with. Is KStream converting the input from bytes to Person and printing the address of Person here?
Question-2
When I add the below value Serdes, I get a more meaningful output. Is the byte information here getting converted to String and then to Person? Why is the value now printed correctly?
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
[consumer-1]: null, {"name" : "abc","age" : "15"}
Question-3
Now, when performing the count on the age, I get a runtime error on converting a String to Person. If groupBy is setting the age as the Key and the count as Long, why is the String to Person conversion happening?
KTable<Integer, Long> integerLongKTable = people.groupBy((key, value) -> value.getAge())
        .count();
Exception in thread "person_processor-9ff96b38-4beb-4594-b2fe-ae191bf6b9ff-StreamThread-1" java.lang.ClassCastException: java.lang.String cannot be cast to com.example.kafkastreams.KafkaStreamsApplication$Person
at org.apache.kafka.streams.kstream.internals.KStreamImpl$1.apply(KStreamImpl.java:152)
at org.apache.kafka.streams.kstream.internals.KStreamImpl$1.apply(KStreamImpl.java:149)
Edit-1
After reading through the response from @Matthias J. Sax, I created a PersonSerde using the Serializer and Deserializer from this location, but I now get a SerializationException...
https://github.com/apache/kafka/tree/1.0/streams/examples/src/main/java/org/apache/kafka/streams/examples/pageview
static class Person {
    String name;
    String age;

    public Person(String name, String age) {
        this.name = name;
        this.age = age;
    }

    void setName(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    void setAge(String age) {
        this.age = age;
    }

    String getAge() {
        return age;
    }

    @Override
    public String toString() {
        return "Person {name:" + this.getName() + ",age:" + this.getAge() + "}";
    }
}
public class PersonSerde implements Serde {

    @Override
    public void configure(Map map, boolean b) {
    }

    @Override
    public void close() {
    }

    @Override
    public Serializer serializer() {
        Map<String, Object> serdeProps = new HashMap<>();
        final Serializer<Person> personSerializer = new JsonPOJOSerializer<>();
        serdeProps.put("JsonPOJOClass", Person.class);
        personSerializer.configure(serdeProps, false);
        return personSerializer;
    }

    @Override
    public Deserializer deserializer() {
        Map<String, Object> serdeProps = new HashMap<>();
        final Deserializer<Person> personDeserializer = new JsonPOJODeserializer<>();
        serdeProps.put("JsonPOJOClass", Person.class);
        personDeserializer.configure(serdeProps, false);
        return personDeserializer;
    }
}
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, personSerde.getClass());

KTable<String, Long> count = people
        .selectKey((key, value) -> value.getAge())
        .groupByKey(Serialized.with(Serdes.String(), personSerde))
        .count();
Error
Caused by: org.apache.kafka.common.errors.SerializationException: Error serializing JSON message
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer found for class com.example.kafkastreams.KafkaStreamsApplication$Person and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS)
at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77)
at com.fasterxml.jackson.databind.SerializerProvider.reportBadDefinition(SerializerProvider.java:1191)
at com.fasterxml.jackson.databind.DatabindContext.reportBadDefinition(DatabindContext.java:313)
Edit 5
So it appears that when I mapValues() to a String, count() works correctly. But when I use it on a custom object, it fails:
KStream<String, Person> people = builder.stream("person-topic", Consumed.with(Serdes.String(), personSerde));
people.print(Printed.<String, Person>toSysOut().withLabel("person-source"));
KStream<String, Person> agePersonKStream = people.selectKey((key, value) -> value.getAge());
agePersonKStream.print(Printed.<String, Person>toSysOut().withLabel("age-person"));
KStream<String, String> stringStringKStream = agePersonKStream.mapValues((person -> person.name));
stringStringKStream.print(Printed.<String, String>toSysOut().withLabel("age-name"));
KTable<String, Long> stringLongKTable = stringStringKStream.groupByKey(Serialized.with(Serdes.String(), Serdes.String())).count();
stringLongKTable.toStream().print(Printed.<String, Long>toSysOut().withLabel("age-count"));
Without the third step (mapValues() to name), the fourth step fails.
Question-1 I understand that data in the topic is in bytes. I am not setting any Serdes for Key or Value to start with. Is KStream converting the input from bytes to Person and printing the address of Person here?
If you don't specify any Serde in StreamsConfig or in builder.stream(..., Consumed.with(/*serdes*/)), the bytes won't be converted into a Person object; the value will be of type byte[]. Thus, print() calls byte[].toString(), which results in the cryptic output ([B@7e37bab6) you see.
Question-2 When I add the below value Serdes, I get a more meaningful output. Is the byte information here getting converted to String and then to Person? Why is the value now printed correctly?
As you specify Serdes.String() in StreamsConfig, the bytes are converted to String type. It seems that StringSerde is able to deserialize the bytes in a meaningful way -- but that it works at all is something of a coincidence: your data is apparently serialized as JSON, which explains why StringSerde can convert the bytes into a readable String.
Question-3 Now, when performing the count on the age, I get a runtime error on converting a String to Person. If groupBy is setting the age as the Key and the count as Long, why is the String to Person conversion happening?
That is expected. Because the bytes are converted into a String object (as you specified Serdes.String()), the cast cannot be performed.
Final remarks:
You don't get a class cast exception if you only use print(), because for this case, no cast operation is performed. Java only inserts a cast operation if required.
For groupBy() you use value.getAge() and thus Java inserts a cast here (it knows that the expected type is Person, because it's specified via KStream<Object, Person> people = ...). For print() only toString() is called, which is defined on Object, and thus no cast is required.
Generics in Java are type hints for the compiler and are erased to Object (with casts inserted at compile time where required). Thus, for print() an Object variable can point to a byte[] without a problem and toString() is called successfully. For the groupBy() case, the compiler casts Object to Person to be able to call getAge() -- however, this fails because the actual type is String.
To get your code working, you need to create a PersonSerde class that implements Serde<Person> and specify it as the value serde.
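For completeness, a minimal sketch of wiring such a serde in (the topic name is a placeholder, and PersonSerde is assumed to implement Serde<Person> as recommended above). One additional hint on the InvalidDefinitionException from your edit: Jackson by default only auto-detects public getters or public fields, so Person's getters likely need to be made public (or the fields annotated) for the JSON serializer to work.

PersonSerde personSerde = new PersonSerde();

KStream<String, Person> people = builder.stream(
        "people",
        Consumed.with(Serdes.String(), personSerde));

KTable<String, Long> countsByAge = people
        .groupBy((key, value) -> value.getAge(), Serialized.with(Serdes.String(), personSerde))
        .count();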

How to use FileIO.writeDynamic() in Apache Beam 2.6 to write to multiple output paths?

I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it reads multiple topics and writes them out as gs://bucket/topic/...
When reading only a single topic I used TextIO in the last step of my pipeline:
TextIO.write()
    .to(
        new DateNamedFiles(
            String.format("gs://bucket/data%s/", suffix), currentMillisString))
    .withWindowedWrites()
    .withTempDirectory(
        FileBasedSink.convertToFileResourceIfPossible(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
    .withNumShards(1));
This is a similar question, whose code I tried to adapt:
FileIO.<EventType, Event>writeDynamic()
    .by(
        new SerializableFunction<Event, EventType>() {
          @Override
          public EventType apply(Event input) {
            return EventType.TRANSFER; // should return real type here, just a dummy
          }
        })
    .via(
        Contextful.fn(
            new SerializableFunction<Event, String>() {
              @Override
              public String apply(Event input) {
                return "Dummy"; // should return the Event converted to a String
              }
            }),
        TextIO.sink())
    .to(DynamicFileDestinations.constant(
        new DateNamedFiles("gs://bucket/tmp%s/%s/", currentMillisString),
        new SerializableFunction<String, String>() {
          @Override
          public String apply(String input) {
            return null; // Not sure what this should do exactly, but it needs to
                         // include the EventType in the path
          }
        }))
    .withTempDirectory(
        FileBasedSink.convertToFileResourceIfPossible(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
    .withNumShards(1))
The official JavaDoc contains example code which seems to have outdated method signatures (the .via method seems to have switched the order of its arguments). Furthermore, I stumbled across the example in FileIO, which confused me - shouldn't TransactionType and Transaction in this line change places?
After a night of sleep and a fresh start I figured out the solution. I used the functional Java 8 style as it makes the code shorter (and more readable):
.apply(
    FileIO.<String, Event>writeDynamic()
        .by((SerializableFunction<Event, String>) input -> input.getTopic())
        .via(
            Contextful.fn(
                (SerializableFunction<Event, String>) input -> input.getPayload()),
            TextIO.sink())
        .to(String.format("gs://bucket/data%s/", suffix))
        .withNaming(type -> FileNaming.getNaming(type, "", currentMillisString))
        .withDestinationCoder(StringUtf8Coder.of())
        .withTempDirectory(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString))
        .withNumShards(1));
Explanation:
Event is a Java POJO containing the payload of the Kafka message and the topic it belongs to; it is parsed in a ParDo after the KafkaIO step
suffix is either dev or empty and is set by environment variables
currentMillisString contains the timestamp when the whole pipeline was launched, so that new files don't overwrite old files on GCS when a pipeline gets restarted
FileNaming implements a custom naming scheme and receives the type of the event (the topic) in its constructor; it uses a custom formatter to write to daily partitioned "sub-folders" on GCS:
class FileNaming implements FileIO.Write.FileNaming {

    static FileNaming getNaming(String topic, String suffix, String currentMillisString) {
        return new FileNaming(topic, suffix, currentMillisString);
    }

    private static final DateTimeFormatter FORMATTER = DateTimeFormat
        .forPattern("yyyy-MM-dd")
        .withZone(DateTimeZone.forTimeZone(TimeZone.getTimeZone("Europe/Zurich")));

    private final String topic;
    private final String suffix;
    private final String currentMillisString;

    private String filenamePrefixForWindow(IntervalWindow window) {
        return String.format(
            "%s/%s/%s_", topic, FORMATTER.print(window.start()), currentMillisString);
    }

    private FileNaming(String topic, String suffix, String currentMillisString) {
        this.topic = topic;
        this.suffix = suffix;
        this.currentMillisString = currentMillisString;
    }

    @Override
    public String getFilename(
            BoundedWindow window,
            PaneInfo pane,
            int numShards,
            int shardIndex,
            Compression compression) {
        IntervalWindow intervalWindow = (IntervalWindow) window;
        String filenamePrefix = filenamePrefixForWindow(intervalWindow);
        String filename =
            String.format(
                "pane-%d-%s-%05d-of-%05d%s",
                pane.getIndex(),
                pane.getTiming().toString().toLowerCase(),
                shardIndex,
                numShards,
                suffix);
        String fullName = filenamePrefix + filename;
        return fullName;
    }
}

custom item writer to write to database using list of list which contains hashmap

I am new to Spring Batch and my requirement is to read a dynamic Excel sheet and insert it into a database. I am able to read the Excel file and pass it to the writer, but only the last record in the Excel sheet gets inserted into the database. Here is my code for the item writer:
@Bean
public ItemWriter<List<LinkedHashMap<String, String>>> tempwrite() {
    JdbcBatchItemWriter<List<LinkedHashMap<String, String>>> databaseItemWriter = new JdbcBatchItemWriter<>();
    databaseItemWriter.setDataSource(dataSource);
    databaseItemWriter.setSql("insert into table values(next value for seq_table,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
    ItemPreparedStatementSetter<List<LinkedHashMap<String, String>>> valueSetter =
        new databasePsSetter();
    databaseItemWriter.setItemPreparedStatementSetter(valueSetter);
    return databaseItemWriter;
}
and below is my prepared statement setter class
public class databasePsSetter implements ItemPreparedStatementSetter<List<LinkedHashMap<String, String>>> {

    @Override
    public void setValues(List<LinkedHashMap<String, String>> item, PreparedStatement ps) throws SQLException {
        int columnNumber = 1;
        for (LinkedHashMap<String, String> row : item) {
            columnNumber = 1;
            for (Map.Entry<String, String> entry : row.entrySet()) {
                ps.setString(columnNumber, entry.getValue());
                columnNumber++;
            }
        }
    }
}
I have seen many examples, but all of them use a DTO class, and I am not sure whether this is the correct way to implement it for a list of lists that contains hashmaps.
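A possible explanation for the only-the-last-record symptom (a sketch only, not from this thread): with the item type being the whole list, a single PreparedStatement gets its columns overwritten once per row and is executed only once. A common alternative, assuming the reader can be changed to emit one LinkedHashMap per row, is to make the row the item type so JdbcBatchItemWriter batches one statement per row:

@Bean
public ItemWriter<LinkedHashMap<String, String>> tempwrite() {
    JdbcBatchItemWriter<LinkedHashMap<String, String>> writer = new JdbcBatchItemWriter<>();
    writer.setDataSource(dataSource);
    writer.setSql("insert into table values(next value for seq_table,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
    // one statement per row; the writer adds each item of the chunk to the JDBC batch
    writer.setItemPreparedStatementSetter((row, ps) -> {
        int columnNumber = 1;
        for (Map.Entry<String, String> entry : row.entrySet()) {
            ps.setString(columnNumber++, entry.getValue());
        }
    });
    return writer;
}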