Creating kafka stream API for JSON's - apache-kafka

i am trying to write a kafka stream code for converting JSON array to JSON elements...since i am new to kafka stream can any one help me out writing the code.. like what should be there in kstream and ktable..
and my stream of input ll be in the following format
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
]
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
]
and my output must be in the form
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
can anyone help me out in writing the code??

If you want to use Kafka Streams, you can use a flatMap(). Something like
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
builer.stream("topic").flatMap(...).to("output-topic");
Check out the examples and docs for more details:
https://docs.confluent.io/current/streams/developer-guide/index.html
https://github.com/confluentinc/kafka-streams-examples

in Python...
from kafka import KafkaConsumer
consumer = KafkaConsumer('topicName')
for message in consumer:
print(message)
specify bootstrap_servers parameter in KafkaConsumer.
For Java look cloudkarafka, really good:
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topic));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records)
System.out.printf("msg = %s\n", record.value());
}
}

Related

java KafkaConsumer never get results

I'm new to kafka, I have the following sample code :
KafkaConsumer<String,String> kc = new KafkaConsumer<String, String>(props);
while(true) {
List<String> topicNames = Arrays.asList(topics.split(","));
if (!kc.assignment().isEmpty()) {
kc.unsubscribe();
}
kc.subscribe(topicNames);
ConsumerRecords<String, String> recv = kc.poll(1000L);
if (!recv.isEmpty()) {
System.out.println("NOT EMPTY");
}
}
The recv is always empty but if I try to increment the pool timeout the records are returned, also if I cut off the unsubscribe part.
I've taken this piece of code from an integration proprietary software and I cannot modify it.
So my question is: Is this only a timing problem or there is more?
There is a lot that happens when a consumer (re)subscribes to a topic.
Very roughly and as far as I remember the consumer will:
request cluster information
request consumer group metadata
make a JOIN_GROUP request
be assigned certain partitions
The underlying mechanisms are even more complicated if there are more consumers within the same group. That's because the partitions should be reassigned between all the consumers within the group.
That is why:
1000 millis might not be enough for all this and you didn't poll anything in time
you polled something when you increased the timeout because Kafka managed to perform all of these bootstrapping operations
you polled something when you removed the unsubscription to the topics because most likely your consumer was already subscribed
So there is a timing issue. And I think that there is something more - un/subscribing to a topic within an infinite loop makes no sense to me (see the other answer).
You should subscribe to your topics only once at the beginning. Like this:
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
final ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records)
System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}

How to aggregate data hourly?

Whenever a user favorites some content on our site we collect the events and what we were planning to do is to hourly commit the aggregated favorites of a content and update the total favorite count in the DB.
We were evaluating Kafka Streams. Followed the word count example. Our topology is simple, produce to a topic A and read and commit aggregated data to another topic B. Then consume events from Topic B every hour and commit in the DB.
#Bean(name = KafkaStreamsDefaultConfiguration.DEFAULT_STREAMS_CONFIG_BEAN_NAME)
public StreamsConfig kStreamsConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "favorite-streams");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class.getName());
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerAddress);
return new StreamsConfig(props);
}
#Bean
public KStream<String, String> kStream(StreamsBuilder kStreamBuilder) {
StreamsBuilder builder = streamBuilder();
KStream<String, String> source = builder.stream(topic);
source.flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
.groupBy((key, value) -> value)
.count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>> as("counts-store")).toStream()
.to(topic + "-grouped", Produced.with(Serdes.String(), Serdes.Long()));
Topology topology = builder.build();
KafkaStreams streams = new KafkaStreams(topology, kStreamsConfigs());
streams.start();
return source;
}
#Bean
public StreamsBuilder streamBuilder() {
return new StreamsBuilder();
}
However when I consume this Topic B it gives me aggregated data from the beginning. My question is that can we have some provision wherein I can consume the previous hours grouped data and then commit to DB and then Kakfa forgets about the previous hours data and gives new data each hour rather than cumulative sum. Is the design topology correct or can we do something better?
If you want to get one aggregation result per hour, you can use a windowed aggregation with a window size of 1 hour.
stream.groupBy(...)
.windowedBy(TimeWindow.of(1 *3600 * 1000))
.count(...)
Check the docs for more details: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
The output type is Windowed<String> for the key (not String). You need to provide a custom Window<String> Serde, or convert the key type. Consult SessionWindowsExample.

KafkaSpout multithread or not

kafka 0.8.x doc shows how to multithread in kafka consumer:
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(a_numThreads));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
// now launch all the threads
//
executor = Executors.newFixedThreadPool(a_numThreads);
// now create an object to consume the messages
//
int threadNumber = 0;
for (final KafkaStream stream : streams) {
executor.execute(new ConsumerTest(stream, threadNumber));
threadNumber++;
}
But KafkaSpout in storm seems to not multithread.
Maybe use multi task instead of multithread in KafkaSpout :
builder.setSpout(SqlCollectorTopologyDef.KAFKA_SPOUT_NAME, new KafkaSpout(spoutConfig), nThread);
Which one is better? Thanks
Since you mentioned Kafka 0.8.x, I am assuming the KafkaSpout you use is from storm-kafka other than storm-kafka-client.
The first code snippet is high-level consumer's API which could employ multiple threads to consume multiple partitions.
As for the kafka spout, it's probably the same, but Storm is using the low-level consumer, namely SimpleConsumer. However, there will be one SimpleConsumer instance created for each spout executor(task).

Using Kafka through Observable(RxJava)

I have a producer (using Kafka), and more than one consumer. So I publish a message in a topic and then my consumers receive and process the message.
I need to receive a response in the producer from at least one consumer (better if it be the first). I'm trying to use RxJava to do it (observables).
Is it possible to do in that way? Anyone have an example?
Here is how I am using rxjava '2.2.6' without any additional dependencies to process Kafka events:
import io.reactivex.Observable;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
...
// Load consumer props
Properties props = new Properties();
props.load(KafkaUtils.class.getClassLoader().getResourceAsStream("kafka-client.properties"));
// Create a consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// Subscribe to topics
consumer.subscribe(Arrays.asList(props.getProperty("kafkaTopics").split("\\s*,\\s*")));
// Create an Observable for topic events
Observable<ConsumerRecords<String, String>> observable = Observable.fromCallable(() -> {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSecond(10);
return records;
});
// Process Observable events
observable.subscribe(records -> {
if ((records != null) && (!records.isEmpty())) {
for (ConsumerRecord<String, String> record : records) {
System.out.println(record.offset() + ": " + record.value());
}
}
});
You can use it as follows:
val consumer = new RxConsumer("zookeeper:2181", "consumer-group")
consumer.getRecordStream("cool-topic-(x|y|z)")
.map(deserialize)
.take(42 seconds)
.foreach(println)
consumer.shutdown()
For more information see:
https://github.com/cjdev/kafka-rx
Would be better that you'll share your solution first of all...
Since Spring Cloud Stream is a, mh, stream solution, not request/reply, there is no an example to share with you.
You can consider to make your consumers as producers as well. And in the original producer have a consumer to read from the replying topic. Finally you will have to correlate the reply data with the request one.
The RxJava or any other implementation details aren't related.

Can single Kafka producer produce messages to multiple topics and how?

I am just exploring Kafka, currently i am using One producer and One topic to produce messages and it is consumed by one Consumer. very simple.
I was reading the Kafka page, the new Producer API is thread-safe and sharing single instance will improve the performance.
Does it mean i can use single Producer to publish messages to multiple topics?
Never tried it myself, but I guess you can. Since the code for producer and sending the record is (from here https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html):
Producer<String, String> producer = new KafkaProducer<>(props);
for(int i = 0; i < 100; i++)
producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));
So, I guess, if you just write different topics in the ProducerRecord, than it should be possible.
Also, here http://kafka.apache.org/081/documentation.html#producerapi it explicitly says that you can use a method send(List<KeyedMessage<K,V>> messages) to write into multiple topics.
If I understand you correctly, you are more looking using the same producer instance to send the same/multiple messages on multiple topics.
Not sure about java, but here you can do in C#(.NET) using the Kafka .NET Client DependentProducerBuilder
using (var producer = new ProducerBuilder<string, string>(config).Build())
using (var producer2 = new DependentProducerBuilder<Null, int>(producer.Handle).Build())
{
producer.ProduceAsync("first-topic", new Message<string, string> { Key = "my-key-value", Value = "my-value" });
producer2.ProduceAsync("second-topic", new Message<Null, int> { Value = 42 });
producer2.ProduceAsync("first-topic", new Message<Null, int> { Value = 107 });
producer.Flush(TimeSpan.FromSeconds(10));
}