How can I achieve fields grouping between KafkaSpout and the bolts? - apache-kafka

I have been sending messages to a Kafka topic; the messages are JSON, and I am using KafkaSpout to fetch them from Kafka and send them to a bolt using shuffle grouping. Now I want to implement fields grouping between the KafkaSpout and the bolt. Can anyone please help me with this? How can I achieve fields grouping between KafkaSpout and the bolts?

You need to implement the backtype.storm.spout.Scheme interface; basically it looks something like this:
public class FooScheme implements Scheme {
    @Override
    public Values deserialize(final byte[] _line) {
        try {
            Values values = new Values();
            JSONObject msg = (JSONObject) JSONValue.parseWithException(new String(_line));
            values.add((String) msg.get("a"));
            values.add((String) msg.get("b"));
            values.add(msg);
            return values;
        } catch (ParseException e) {
            // handle the exception
            return null;
        }
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("a", "b", "json");
    }
}
and you use it with your spout like this:
SpoutConfig spoutConfig = new SpoutConfig( ... your config here ...);
spoutConfig.scheme = new SchemeAsMultiScheme(new FooScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
topology.setSpout("kafka-spout", 1).setNumTasks(1);
Now you are ready to use fields grouping by "a", "b", or both.
FooBolt bolt = new FooBolt();
topology.setBolt("foo-bolt", bolt, 1).setNumTasks(1)
        .fieldsGrouping("kafka-spout", new Fields("a", "b"));
Enjoy

Related

Reading header from kafka using apache storm

We have a use case where we need to read message headers from Kafka using Apache Storm and pass them to a downstream bolt. In the Storm documentation, there is no mention of how to pass Kafka headers to a Storm topology. Has anyone figured out how to do this?
One way is to use setRecordTranslator while building the spout.
Example:
return KafkaSpoutConfig.builder("<bootstrap-servers>", "topic-to-consume")
        .setRecordTranslator(
                (r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value(), r.headers()),
                new Fields("topic", "partition", "offset", "key", "value", "headers"))
        .build();
Now, in your bolt's execute method you will be able to access all the fields above, including the headers, something like this:
public class MyNewBolt extends BaseRichBolt {
    // prepare() and declareOutputFields() omitted for brevity
    @Override
    public void execute(Tuple tuple) {
        RecordHeaders messageHeaders = (RecordHeaders) tuple.getValueByField("headers");
        Iterator<Header> headerIterator = messageHeaders.iterator();
        while (headerIterator.hasNext()) {
            Header header = headerIterator.next();
            String headerKey = header.key();
            byte[] headerBytes = header.value();
            String headerValue = new String(headerBytes);
            System.out.println("Header Key: " + headerKey);
            System.out.println("Header Value: " + headerValue);
        }
    }
}
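For context, a rough sketch of wiring this together with the storm-kafka-client spout; the component ids and topology name here are made up, and spoutConfig is the KafkaSpoutConfig<String, String> built with setRecordTranslator above:
TopologyBuilder builder = new TopologyBuilder();
// spoutConfig is the KafkaSpoutConfig<String, String> returned by the builder shown earlier
builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 1);
builder.setBolt("my-new-bolt", new MyNewBolt(), 1).shuffleGrouping("kafka-spout");
StormSubmitter.submitTopology("headers-topology", new Config(), builder.createTopology());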

How to make Serdes work with multi-step kafka streams

I am new to Kafka and I'm building a starter project using the Twitter API as a data source. I have created a producer which queries the Twitter API and sends the data to my Kafka topic with a string serializer for both key and value. My Kafka Streams application reads this data and does a word count, but also groups by the date of the tweet. This part is done through a KTable called wordCounts to make use of its upsert functionality. The structure of this KTable is:
Key: {word: exampleWord, date: exampleDate}, Value: numberOfOccurences
I then attempt to restructure the data in the KTable stream to a flat structure so I can later send it to a database. You can see this in the wordCountsStructured KStream object. This restructures the data to look like the structure below. The value is initially a JsonObject, but I convert it to a string to match the Serdes which I set.
Key: null, Value: {word: exampleWord, date: exampleDate, Counts: numberOfOccurences}
However, when I try to send this to my second kafka topic, I get the error below.
A serializer (key:
org.apache.kafka.common.serialization.StringSerializer / value:
org.apache.kafka.common.serialization.StringSerializer) is not
compatible to the actual key or value type (key type:
com.google.gson.JsonObject / value type: com.google.gson.JsonObject).
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters.
I'm confused by this since the KStream I am sending to the topic is of type <String, String>. Does anyone know how I might fix this?
public class TwitterWordCounter {
private final JsonParser jsonParser = new JsonParser();
public Topology createTopology(){
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("test-topic2");
KTable<JsonObject, Long> wordCounts = textLines
//parse each tweet as a tweet object
.mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
//map each tweet object to a list of json objects, each of which containing a word from the tweet and the date of the tweet
.flatMapValues(TwitterWordCounter::tweetWordDateMapper)
//update the key so it matches the word-date combination so we can do a groupBy and count instances
.selectKey((key, wordDate) -> wordDate)
.groupByKey()
.count(Materialized.as("Counts"));
/*
In order to structure the data so that it can be ingested into SQL, the value of each item in the stream must be straightforward: property, value
so we have to:
1. take the columns which include the dimensional data and put this into the value of the stream.
2. label the count with 'count' as the column name
*/
KStream<String, String> wordCountsStructured = wordCounts.toStream()
.map((key, value) -> new KeyValue<>(null, MapValuesToIncludeColumnData(key, value).toString()));
KStream<String, String> wordCountsPeek = wordCountsStructured.peek(
(key, value) -> System.out.println("key: " + key + "value:" + value)
);
wordCountsStructured.to("test-output2", Produced.with(Serdes.String(), Serdes.String()));
return builder.build();
}
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application1111");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myIPAddress");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
TwitterWordCounter wordCountApp = new TwitterWordCounter();
KafkaStreams streams = new KafkaStreams(wordCountApp.createTopology(), config);
streams.start();
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
//this method is used for taking a tweet and transforming it to a representation of the words in it plus the date
public static List<JsonObject> tweetWordDateMapper(Tweet tweet) {
try{
List<String> words = Arrays.asList(tweet.tweetText.split("\\W+"));
List<JsonObject> tweetsJson = new ArrayList<JsonObject>();
for(String word: words) {
JsonObject tweetJson = new JsonObject();
tweetJson.add("date", new JsonPrimitive(tweet.formattedDate().toString()));
tweetJson.add("word", new JsonPrimitive(word));
tweetsJson.add(tweetJson);
}
return tweetsJson;
}
catch (Exception e) {
System.out.println(e);
System.out.println(tweet.serialize().toString());
return new ArrayList<JsonObject>();
}
}
public JsonObject MapValuesToIncludeColumnData(JsonObject key, Long countOfWord) {
key.addProperty("count", countOfWord); //new JsonPrimitive(count));
return key;
}
Because you are performing a key-changing operation (selectKey) before the groupByKey(), Kafka Streams creates a repartition topic, and for that topic it relies on the default key and value serdes, which you have set to the String serde.
You can pass serdes for that repartition topic explicitly, e.g. by changing groupByKey() to groupByKey(Grouped.with(keySerde, valueSerde)) with serdes that can handle your JsonObject key and value, and this should help.
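A minimal sketch of that fix, assuming a quick Gson-based serde built with Serdes.serdeFrom (the serde name here is made up; the rest mirrors the code in the question):
// Hypothetical JsonObject serde: serializes the object as its JSON string
Serde<JsonObject> jsonObjectSerde = Serdes.serdeFrom(
        (topic, jsonObject) -> jsonObject.toString().getBytes(StandardCharsets.UTF_8),
        (topic, bytes) -> new JsonParser().parse(new String(bytes, StandardCharsets.UTF_8)).getAsJsonObject());

KTable<JsonObject, Long> wordCounts = textLines
        .mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
        .flatMapValues(TwitterWordCounter::tweetWordDateMapper)
        .selectKey((key, wordDate) -> wordDate)
        // provide serdes for the repartition topic instead of falling back to the String defaults
        .groupByKey(Grouped.with(jsonObjectSerde, jsonObjectSerde))
        .count(Materialized.as("Counts"));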

How to process a KStream in a batch of max size or fallback to a time window?

I would like to create a Kafka Streams based application that processes a topic and takes messages in batches of size X (e.g. 50), but if the stream has low traffic, gives me whatever the stream has within Y seconds (e.g. 5).
So, instead of processing messages one by one, I process a List[Record] where the size of the list is 50 (or maybe less).
This is to make some I/O bound processing more efficient.
I know that this can be implemented with the classic Kafka API, but I was looking for a stream-based implementation that can also handle offset committing natively, taking errors/failures into account.
I couldn't find anything related in the docs or by searching around, and was wondering if anyone has a solution to this problem.
@Matthias J. Sax's answer is nice; I just want to add an example for this, as I think it might be useful for someone.
Let's say we want to combine incoming values into the following type:
public class MultipleValues { private List<String> values; }
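The transformer below calls MultipleValues.builder(), toBuilder() and getValues(), so the original presumably relies on Lombok; a minimal sketch of that assumption:
// Assumed shape of MultipleValues (the Lombok annotations are a guess; only the values field appears in the original)
@Data
@Builder(toBuilder = true)
public class MultipleValues {
    private List<String> values;
}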
To collect messages into batches with a maximum size, we need to create a Transformer:
public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
private ProcessorContext processorContext;
private String stateStoreName;
private KeyValueStore<String, MultipleValues> keyValueStore;
private Cancellable scheduledPunctuator;
public MultipleValuesTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
@Override
public void init(ProcessorContext processorContext) {
this.processorContext = processorContext;
this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
}
@Override
public KeyValue<String, MultipleValues> transform(String key, String value) {
MultipleValues itemValueFromStore = keyValueStore.get(key);
if (isNull(itemValueFromStore)) {
itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
} else {
List<String> values = new ArrayList<>(itemValueFromStore.getValues());
values.add(value);
itemValueFromStore = itemValueFromStore.toBuilder()
.values(values)
.build();
}
if (itemValueFromStore.getValues().size() >= 50) {
processorContext.forward(key, itemValueFromStore);
keyValueStore.put(key, null);
} else {
keyValueStore.put(key, itemValueFromStore);
}
return null;
}
private void doPunctuate(long timestamp) {
KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all();
while (valuesIterator.hasNext()) {
KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
if (nonNull(keyValue.value)) {
processorContext.forward(keyValue.key, keyValue.value);
keyValueStore.put(keyValue.key, null);
}
}
}
@Override
public void close() {
scheduledPunctuator.cancel();
}
}
Then we need to create a key-value store, add it to the StreamsBuilder, and build the KStream flow using the transform method:
Properties props = new Properties();
...
Serde<MultipleValues> multipleValuesSerde = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
StreamsBuilder builder = new StreamsBuilder();
String storeName = "multipleValuesStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerde);
builder.addStateStore(storeBuilder);
builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
.transform(() -> new MultipleValuesTransformer(storeName), storeName)
.print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
With this approach we aggregate by the incoming key. If you need to collect messages not by key but by some of the message's fields, you need the following flow to trigger repartitioning of the KStream (using an intermediate topic):
.selectKey(..)
.through(intermediateTopicName)
.transform( ..)
The simplest way might be to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.
To enforce processing if you don't reach the limit in a certain amount of time, you can register a wall-clock punctuation.
It seems that there is no need to use Processors or Transformers and transform() to batch events by count. Regular groupBy() and reduce()/aggregate() should do the trick:
KeyValueSerde keyValueSerde = new KeyValueSerde(); // simple custom Serde
final AtomicLong batchCount = new AtomicLong(0L);
myKStream
.groupBy((k,v) -> KeyValue.pair(k, batchCount.getAndIncrement() / batchSize),
Grouped.keySerde(keyValueSerde))
.reduce(this::windowReducer) // <-- how you want to aggregate values in batch
.toStream()
.filter((k,v) -> /* pass through full batches only */)
.selectKey((k,v) -> k.key)
...
You'd also need to add a straightforward Serde for the standard KeyValue<String, Long>.
This option is obviously only helpful when you don't need a "punctuator" to emit incomplete batches on timeout. It also doesn't guarantee the order of elements in the batch in case of distributed processing.
You can also concatenate the count to the key string to form the new key (instead of using KeyValue). That would simplify the example even further (to using Serdes.String()), as sketched below.
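A small sketch of that String-key variant, purely as an illustration (batchSize, batchCount, and windowReducer are assumed to exist as in the snippet above; the separator character is arbitrary):
// Hypothetical: same pipeline as above, but the batch number is appended to the String key
KStream<String, String> batched = myKStream
        .groupBy((k, v) -> k + "#" + (batchCount.getAndIncrement() / batchSize),
                 Grouped.with(Serdes.String(), Serdes.String()))
        .reduce(this::windowReducer)          // aggregate values within a batch, as above
        .toStream()
        .filter((k, v) -> true)               // placeholder: pass through full batches only
        .selectKey((k, v) -> k.substring(0, k.lastIndexOf('#')));  // strip the batch counter again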

Creating kafka stream API for JSON's

I am trying to write Kafka Streams code for converting a JSON array into individual JSON elements. Since I am new to Kafka Streams, can anyone help me write the code, i.e. what should go in the KStream and KTable?
My input stream will be in the following format:
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
]
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
]
and my output must be in the form
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
can anyone help me out in writing the code??
If you want to use Kafka Streams, you can use a flatMap(). Something like
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic").flatMap(...).to("output-topic");
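For illustration, a rough sketch of what that step could look like using flatMapValues (the keys are untouched), assuming string serdes and Gson for parsing; the topic names are placeholders:
// Split each incoming JSON array into one output record per element
StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic", Consumed.with(Serdes.String(), Serdes.String()))
    .flatMapValues(value -> {
        List<String> elements = new ArrayList<>();
        for (JsonElement element : new JsonParser().parse(value).getAsJsonArray()) {
            elements.add(element.toString());
        }
        return elements;
    })
    .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));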
Check out the examples and docs for more details:
https://docs.confluent.io/current/streams/developer-guide/index.html
https://github.com/confluentinc/kafka-streams-examples
in Python...
from kafka import KafkaConsumer
consumer = KafkaConsumer('topicName')
for message in consumer:
print(message)
specify bootstrap_servers parameter in KafkaConsumer.
For Java, take a look at CloudKarafka; it is really good:
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topic));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("msg = %s\n", record.value());
}

KafkaSpout not receiving messages

I'm trying to integrate Kafka (kafka_2.10, version 0.8.2.1) with Storm (version 0.9.3) in a Cloudera environment, and have written some code for producers/consumers. I'm able to run the producer code separately with Kafka and see that it works with my consumer code (on the console). I then wrote some code using KafkaSpout and HdfsBolt to write data into HDFS. With this code, I am able to create a topology (and see it in the UI), but the KafkaSpout is not receiving any messages from the producer.
My code snippet is shown below:
public class LoadingData {
public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException {
String kafkaTopic = "test";
SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts("localhost:2181"),
kafkaTopic, "/kafkastorm", "KafkaSpout");
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("KafkaSpout", new KafkaSpout(spoutConfig),4);
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter(",");
SyncPolicy syncPolicy = new CountSyncPolicy(10);
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);
FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/stormstuff");
builder.setBolt("stormbolt", new HdfsBolt()
.withFsUrl("hdfs://localhost:8020")
.withSyncPolicy(syncPolicy)
.withRecordFormat(format)
.withRotationPolicy(rotationPolicy)
.withFileNameFormat(fileNameFormat),1
).shuffleGrouping("KafkaSpout");
String topologyName = "EmployeeTopology";
Config config = new Config();
config.setNumWorkers(1);
StormSubmitter.submitTopology(topologyName, config, builder.createTopology());
}
}
Any ideas/suggestions on what I might be doing wrong? I really appreciate your help! Please let me know if you need any more details.