In my client application I create several consumers, but they don't process the queue concurrently. Only a single consumer ever processes the queue messages, and I don't know why.
Hashtable<String, String> env = new Hashtable<String, String>();
env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
env.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
env.put(Context.PROVIDER_URL, "192.168.1.111:1099");
InitialContext ctx = new InitialContext(env);
ConnectionFactory cf = (ConnectionFactory) ctx.lookup("/ConnectionFactory");
Queue downQueue = (Queue) ctx.lookup("queue/DownQueue");
Session consumerSession = conn.createSession(false, Session.CLIENT_ACKNOWLEDGE);
MessageConsumer consumer;
for (int i = 0; i < 2; i++) {
    consumer = consumerSession.createConsumer(downQueue, "proxyId=0");
    consumer.setMessageListener(listener);
}
How do I process the queue with multiple concurrent consumers?
Think of it as a one-to-one relationship between threads and sessions. (Connections are thread-safe; everything "below" them, such as sessions and consumers, is not.) In short: create multiple threads and have each thread create its own session and consumer. Each thread will then consume independently.
Looking at your code, the consumer variable is reassigned to a new consumer object on each iteration of the for loop, which may cause the references to the earlier consumer objects to be lost and those objects to be garbage collected. Only one consumer object would then remain alive, the one created last in the loop, whose reference is held by the consumer variable, and it would consume all incoming messages.
Related
Currently I have 3 kafka brokers with 150 partitions.
I also have 3 consumers that each one is assigned to a group of partitions.
Each consumer has its own local state store backed by RocksDB. This in-memory key-value store is queried during gRPC calls. During rebalancing (for example, if a consumer disappears), the data is written to the local stores of the other consumers.
After the consumers have been running for around 2 weeks, the services seem to run out of memory.
Is there a solution to the local storage growing too much? Can we remove data of partitions that are not needed anymore? Or is there a way to remove the stored data after the consumer is restored?
You can call the cleanUp() method when starting or shutting down a Kafka Streams application to clean up its local state storage.
cleanUp()
Do a clean up of the local StateStore by deleting all data with regard
to the application ID. May only be called either before this
KafkaStreams instance is started in with calling start() method or
after the instance is closed by calling close() method.
KafkaStreams app = new KafkaStreams(builder.build(), props);
// Delete the application's local state.
// Note: In real application you'd call `cleanUp()` only under
// certain conditions. See tip on `cleanUp()` below.
app.cleanUp();
app.start();
Note: To avoid the corresponding recovery overhead, you should not call
cleanUp() by default but only if you really need to. Otherwise, you wipe out local state and trigger an expensive state restoration. You
won't lose data and the program will still be correct, but you may
slow down startup significantly (depending on the size of your state)
If you want to delete entries from the state store while the Kafka Streams application is running, you can remove them directly; after all, the store is just a key-value map backed by RocksDB.
Assuming you are using the Kafka Streams Processor API:
// Look up the store by the name it was registered under.
KeyValueStore<String, String> dsStore = (KeyValueStore<String, String>) context.getStateStore("localstorename");
// Close the iterator when done; an open store iterator holds on to resources.
try (KeyValueIterator<String, String> iter = dsStore.all()) {
    while (iter.hasNext()) {
        KeyValue<String, String> entry = iter.next();
        dsStore.delete(entry.key);
    }
}
I am using Kafka version 2.0 and the Java consumer API to consume messages from a topic. We are using a single-node Kafka server with one consumer per partition. I have observed that the consumer is losing some of the messages.
The scenario is:
The consumer polls the topic.
I have created one consumer per thread.
It fetches the messages and passes them to a handler.
It then commits the offsets, following "at-least-once" Kafka consumer semantics.
In parallel, I have another consumer running with a different group-id. In this consumer, I'm simply increasing the message counter and committing the offset. There's no message loss in this consumer.
try {
    //kafkaConsumer.registerTopic();
    consumerThread = new Thread(() -> {
        final String topicName1 = "topic-0";
        final String topicName2 = "topic-1";
        final String topicName3 = "topic-2";
        final String topicName4 = "topic-3";
        String groupId = "group-0";
        final Properties consumerProperties = new Properties();
        consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
        consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
        try {
            consumer = new KafkaConsumer<>(consumerProperties);
            consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
        } catch (KafkaException ke) {
            logTrace(MODULE, ke);
        }
        while (service.isServiceStateRunning()) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (TopicPartition partition : records.partitions()) {
                List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
                for (ConsumerRecord<String, byte[]> record : partitionRecords) {
                    processMessage(simpleMessage);
                }
            }
            consumer.commitSync();
        }
        kafkaConsumer.closeResource();
    }, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with the usage of subscribe() here.
subscribe() is used to subscribe to topics, not to partitions. To consume specific partitions you need to use assign(). Here is the relevant extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned
partitions. Topic subscriptions are not incremental. This list will
replace the current assignment (if there is one). It is not possible
to combine topic subscription with group management with manual
partition assignment through assign(Collection). If the given list of
topics is empty, it is treated the same as unsubscribe(). This is a
short-hand for subscribe(Collection, ConsumerRebalanceListener), which
uses a noop listener. If you need the ability to seek to particular
offsets, you should prefer subscribe(Collection,
ConsumerRebalanceListener), since group rebalances will cause
partition offsets to be reset. You should also provide your own
listener if you are doing your own offset management since the
listener gives you an opportunity to commit offsets before a rebalance
finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface
does not allow for incremental assignment and will replace the
previous assignment (if there is one). If the given list of topic
partitions is empty, it is treated the same as unsubscribe(). Manual
topic assignment through this method does not use the consumer's group
management functionality. As such, there will be no rebalance
operation triggered when group membership or cluster and topic
metadata change. Note that it is not possible to use both manual
partition assignment with assign(Collection) and group assignment with
subscribe(Collection, ConsumerRebalanceListener).
You probably shouldn't do what you're doing. You should use subscribe(), with multiple partitions per topic and multiple consumers in the group for high availability, and let the consumer handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way. It's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
If you're looking for more performance or to isolate records more carefully to avoid "head of line blocking" consider something like Parallel Consumer (PC).
It also tracks per record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
Previously I was using the 0.8 API. When you pass a list of topics to it, it returns a map of streams (one entry per topic). This lets me spawn a separate thread for each topic and assign that topic's stream to it. Since each topic carries a lot of data, a separate thread per topic helps with parallel processing.
//0.8 code sample
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
consumer.createMessageStreams(topicCountMap);
I want to upgrade to 0.10. I checked the KafkaStreams and KafkaConsumer classes. A KafkaConsumer object takes config properties and provides a subscribe method that takes a list of topics and returns void. I cannot find a way to get a handle to each topic.
KafkaConsumer consumer = new KafkaConsumer(props);
consumer.subscribe(topicsList);
consumer.poll(long ms)
KafkaStreams on the other hand seems to have the same problem.
KStreamBuilder builder = new KStreamBuilder();
String [] topics = new String[] {"topic1", "topic2"};
KStream<byte[], byte[]> source = builder.stream(stringSerde, stringSerde, topics);
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
There is a source.foreach() method available, but it operates on a single stream of all topics. Any ideas?
First, using a multi-threaded consumer is tricky, so hopefully the pattern you employed in 0.8 was well designed :)
Best practice is to use a single-threaded consumer, so there is "no need" to separate different topics if a single consumer subscribes to a list of topics at once. Nevertheless, each consumed record carries metadata about the topic it originates from. Thus, you could theoretically dispatch each record to a different thread for the actual processing according to its topic (even if this is not recommended!).
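The dispatch-by-topic idea can be sketched without any Kafka dependency. In the sketch below, TopicRecord and TopicDispatcher are hypothetical names introduced for illustration; TopicRecord stands in for ConsumerRecord, whose topic() method returns the originating topic. Each topic gets its own single-threaded executor, so records of one topic keep their order while different topics are processed in parallel. This is only a model under those assumptions, not a recommended production design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for ConsumerRecord: only the topic and value matter here.
final class TopicRecord {
    final String topic;
    final String value;

    TopicRecord(String topic, String value) {
        this.topic = topic;
        this.value = value;
    }
}

// Routes each record to a single-threaded executor dedicated to its topic.
final class TopicDispatcher {
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();
    private final Map<String, List<String>> processed = new ConcurrentHashMap<>();

    // Called from the (single) polling thread for every record it receives.
    void dispatch(TopicRecord record) {
        ExecutorService worker =
                workers.computeIfAbsent(record.topic, t -> Executors.newSingleThreadExecutor());
        worker.submit(() -> handle(record));
    }

    // Placeholder for the real per-topic processing; here it only remembers the value.
    private void handle(TopicRecord record) {
        processed
                .computeIfAbsent(record.topic, t -> Collections.synchronizedList(new ArrayList<>()))
                .add(record.value);
    }

    // Stops accepting work and waits (up to 5 seconds per topic) for queued records to finish.
    void shutdown() {
        for (ExecutorService worker : workers.values()) {
            worker.shutdown();
        }
        try {
            for (ExecutorService worker : workers.values()) {
                worker.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> processedFor(String topic) {
        return processed.getOrDefault(topic, Collections.emptyList());
    }
}
```

In a real consumer loop you would call dispatch(...) for every record returned by poll(). Note that once processing is handed to other threads, a commit issued right after poll() can cover records that have not been processed yet, which is one reason this approach is not recommended.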
Kafka scales out via partitions, thus, if a single-threaded consumer is not able to handle the load, you should start multiple consumers (as a consumer group) to scale out your consumer processing capacity.
A more general question: if you want to process data per topic, why not use multiple consumers, each subscribing to a single topic?
Last but not least, in Apache Kafka 0.10+ the Kafka Streams API is a newly introduced stream processing library. It must not be confused with the 0.8 KafkaStream class (hint: there is no "s"). The two are completely unrelated.
I am using Kafka 0.9.0.1.
The first time I start up my application, it takes 20-30 seconds to retrieve the "latest" message from the topic.
I've used different Kafka brokers (with different configs), yet I still see this behaviour. There is usually no slowness for subsequent messages.
Is this expected behaviour? You can see it clearly by running the sample application below, changing the broker and topic name to your own settings.
public class KafkaProducerConsumerTest {
    public static final String KAFKA_BROKERS = "...";
    public static final String TOPIC = "...";
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        new KafkaProducerConsumerTest().run();
    }
    public void run() throws ExecutionException, InterruptedException {
        Properties consumerProperties = new Properties();
        consumerProperties.setProperty("bootstrap.servers", KAFKA_BROKERS);
        consumerProperties.setProperty("group.id", "Test");
        consumerProperties.setProperty("auto.offset.reset", "latest");
        consumerProperties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProperties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        MyKafkaConsumer kafkaConsumer = new MyKafkaConsumer(consumerProperties, TOPIC);
        Executors.newFixedThreadPool(1).submit(() -> kafkaConsumer.consume());
        Properties producerProperties = new Properties();
        producerProperties.setProperty("bootstrap.servers", KAFKA_BROKERS);
        producerProperties.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProperties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        MyKafkaProducer kafkaProducer = new MyKafkaProducer(producerProperties, TOPIC);
        kafkaProducer.publish("Test Message");
    }
}
class MyKafkaConsumer {
    private final Logger logger = LoggerFactory.getLogger(MyKafkaConsumer.class);
    private KafkaConsumer<String, Object> kafkaConsumer;
    public MyKafkaConsumer(Properties properties, String topic) {
        kafkaConsumer = new KafkaConsumer<String, Object>(properties);
        kafkaConsumer.subscribe(Lists.newArrayList(topic));
    }
    public void consume() {
        while (true) {
            logger.info("Started listening...");
            ConsumerRecords<String, Object> consumerRecords = kafkaConsumer.poll(Long.MAX_VALUE);
            logger.info("Received records {}", consumerRecords.iterator().next().value());
        }
    }
}
class MyKafkaProducer {
    private KafkaProducer<String, Object> kafkaProducer;
    private String topic;
    public MyKafkaProducer(Properties properties, String topic) {
        this.kafkaProducer = new KafkaProducer<String, Object>(properties);
        this.topic = topic;
    }
    public void publish(Object object) throws ExecutionException, InterruptedException {
        ProducerRecord<String, Object> producerRecord = new ProducerRecord<>(topic, "key", object);
        Future<RecordMetadata> response = kafkaProducer.send(producerRecord);
        response.get();
    }
}
The first message is expected to take longer than the rest. When you start a new consumer in the consumer group specified by the statement consumerProperties.setProperty("group.id", "Test");, Kafka rebalances the group: it distributes the topic's partitions across the consumer processes in the group such that each partition is consumed by at most one consumer.
Also, with Kafka 0.9, there is a separate __consumer_offsets topic which Kafka uses to manage the offsets for each consumer in a consumer group. It is likely that when you start the consumer for the first time, it looks at this topic to fetch the latest offset (an earlier consumer of this topic might have been killed, so it is necessary to resume from the correct offset).
These two factors cause a higher latency for the first set of messages. I can't comment on the exact latency of 20-30 seconds, but I guess this is the default behaviour.
PS: The exact number might also depend upon other secondary factors like whether you are running the broker & the consumers on the same machine (where there would be no network latency) or on different ones where they would be communicating using TCP.
According to this link:
Try setting group_id=None in your consumer, or call consumer.close()
before ending script, or use assign() not subscribe (). Otherwise you are
rejoining an existing group that has known but unresponsive members. The
group coordinator will wait until those members checkin/leave/timeout.
Since the consumers no longer exist (it's your prior script runs) they have
to timeout.
And consumer.poll() blocks during group rebalance.
So this is correct behavior if you join a group with unresponsive members (maybe you terminated the application ungracefully).
Please confirm you call "consumer.close()" before exiting your application.
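A minimal sketch of closing the consumer on the way out, so that the group is not left with an unresponsive member. FakeConsumer and GracefulLoop are hypothetical names; FakeConsumer stands in for KafkaConsumer so that the example runs on its own. The polling thread owns the consumer and closes it in a finally block, while another thread (for example a JVM shutdown hook) only asks the loop to stop and waits for it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a consumer; in real code this would be a KafkaConsumer.
final class FakeConsumer implements AutoCloseable {
    final AtomicBoolean closed = new AtomicBoolean(false);

    // Simulates a short blocking poll.
    void poll() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void close() {
        closed.set(true);
    }
}

// Poll loop that always closes its consumer when it exits.
final class GracefulLoop implements Runnable {
    private final FakeConsumer consumer;
    private final AtomicBoolean running = new AtomicBoolean(true);

    GracefulLoop(FakeConsumer consumer) {
        this.consumer = consumer;
    }

    @Override
    public void run() {
        try {
            while (running.get() && !Thread.currentThread().isInterrupted()) {
                consumer.poll();
            }
        } finally {
            // Closing from the polling thread keeps all consumer calls on one thread.
            consumer.close();
        }
    }

    void stop() {
        running.set(false);
    }

    // Requests a stop and waits up to timeoutMs for the polling thread to finish.
    static void stopAndWait(GracefulLoop loop, Thread worker, long timeoutMs) {
        loop.stop();
        try {
            worker.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A shutdown hook could then be registered with Runtime.getRuntime().addShutdownHook(new Thread(() -> GracefulLoop.stopAndWait(loop, worker, 5000))). With a real KafkaConsumer, whose poll() can block for a long time, the hook would typically also call consumer.wakeup() to interrupt the poll.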
I have now run your code many times with minimal logging additions. Here is a typical log output:
2016-07-24 15:12:51,417 Start polling...|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,604 producer has send message|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,619 producer got response, exiting|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,679 Received records [Test Message]|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,679 Start polling...|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:54,680 returning on empty poll result|INFO|KafkaProducerConsumerTest
The sequence of events is as expected and timely. The consumer starts polling, the producer sends the message and receives a result, and the consumer receives the message, all within 300 ms. Then the consumer starts polling again and returns 3 seconds later because I changed the poll timeout accordingly.
I am using Kafka 0.9.0.1 for broker and client libraries. The connection is on localhost and it is a test environment with no load at all.
For completeness, here is the log from the server that was triggered by the exchange above.
[2016-07-24 15:12:51,599] INFO [GroupCoordinator 0]: Preparing to restabilize group Test with old generation 0 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:12:51,599] INFO [GroupCoordinator 0]: Stabilized group Test generation 1 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:12:51,617] INFO [GroupCoordinator 0]: Assignment received from leader for group Test for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:13:24,635] INFO [GroupCoordinator 0]: Preparing to restabilize group Test with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:13:24,637] INFO [GroupCoordinator 0]: Group Test generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
You may want to compare with your server logs for the same exchange.
I am very new to Kafka. I am creating two topics and publishing to them from two producers. I have one consumer that consumes the messages from both topics, because I want to process them according to priority.
I am getting a stream for each topic, but as soon as I start iterating on the ConsumerIterator of either stream, it blocks there. As the documentation says, it blocks until it gets a new message.
Does anyone know how to read from two topics and two streams with a single Kafka consumer?
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(KafkaConstants.HIGH_TEST_TOPIC, new Integer(1));
topicCountMap.put(KafkaConstants.LOW_TEST_TOPIC, new Integer(1));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumerConnector.createMessageStreams(topicCountMap);
KafkaStream<byte[], byte[]> highPriorityStream = consumerMap.get(KafkaConstants.HIGH_TEST_TOPIC).get(0);
ConsumerIterator<byte[], byte[]> highPrioerityIterator = highPriorityStream.iterator();
while (highPriorityStream.nonEmpty() && highPrioerityIterator.hasNext())
{
    byte[] bytes = highPrioerityIterator.next().message();
    Object obj = null;
    CLoudDataObject thunderDataObject = null;
    try
    {
        obj = SerializationUtils.deserialize(bytes);
        if (obj instanceof CLoudDataObject)
        {
            thunderDataObject = (CLoudDataObject) obj;
            System.out.println(thunderDataObject);
            // TODO Got the Thunder object here, now write code to send it to Thunder service.
        }
    }
    catch (Exception e)
    {
    }
}
If you don't want to process lower-priority messages before high-priority ones, how about setting the consumer.timeout.ms property and catching ConsumerTimeoutException to detect that the high-priority flow has reached the last available message? By default it is set to -1, which blocks until a new message arrives. (http://kafka.apache.org/07/configuration.html)
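One way to picture the timeout approach: keep reading the high-priority stream until it times out, then handle a single low-priority message and go back to the high-priority stream. The sketch below models this control flow with two BlockingQueues standing in for the Kafka streams (an assumption made so that the example runs on its own). PriorityDrain and drain are hypothetical names, and the null returned by poll(timeout) plays the role of ConsumerTimeoutException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class PriorityDrain {
    // Handles messages from `high` first. Only when `high` has been quiet for
    // timeoutMs is one message taken from `low`. Returns once both are empty.
    static List<String> drain(BlockingQueue<String> high, BlockingQueue<String> low, long timeoutMs) {
        List<String> handled = new ArrayList<>();
        try {
            while (true) {
                String msg = high.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (msg != null) {
                    handled.add(msg);
                    continue; // keep serving the high-priority flow
                }
                // Timed out: the high-priority flow has nothing available right now.
                String lowMsg = low.poll(); // non-blocking
                if (lowMsg == null) {
                    return handled; // nothing left in either flow
                }
                handled.add(lowMsg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return handled;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> high = new LinkedBlockingQueue<>();
        BlockingQueue<String> low = new LinkedBlockingQueue<>();
        high.add("h1");
        high.add("h2");
        low.add("l1");
        low.add("l2");
        System.out.println(drain(high, low, 50)); // prints [h1, h2, l1, l2]
    }
}
```

A real consumer would keep looping instead of returning when both flows are empty; the early return here only keeps the example finite.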
The below explains a way to process multiple flows concurrently with different priorities.
Consuming several Kafka streams calls for multi-threaded programming. In your case, the streams of the two topics each need to be processed by their own thread. Because each thread runs independently to process messages, one blocking flow (thread) won't affect the other flows.
Java's thread pool implementation can help in building a multi-threaded application. You can find an example implementation here:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
Regarding execution priority, you can call Thread.currentThread().setPriority(...) in each thread to give it a priority that matches the Kafka topic it serves.
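As a rough sketch of one thread per flow with different thread priorities: below, each BlockingQueue stands in for one Kafka stream, and PrioritizedWorkers, STOP and awaitDone are hypothetical names used only for this example. The priority is set on the thread before it starts, which has the same effect as calling Thread.currentThread().setPriority(...) first thing inside the thread. Keep in mind that thread priorities are only hints to the scheduler and do not guarantee that high-priority messages are handled first.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

final class PrioritizedWorkers {
    static final String STOP = "__stop__"; // hypothetical end-of-stream marker

    // Starts a thread that drains `stream` into `sink` at the given thread priority.
    static Thread start(String name, int priority, BlockingQueue<String> stream, List<String> sink) {
        Thread t = new Thread(() -> {
            try {
                // take() blocks this thread only; the other flow keeps running.
                for (String msg = stream.take(); !STOP.equals(msg); msg = stream.take()) {
                    sink.add(name + ":" + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setPriority(priority);
        t.start();
        return t;
    }

    // Waits up to 5 seconds for a worker to finish.
    static void awaitDone(Thread t) {
        try {
            t.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> high = new LinkedBlockingQueue<>();
        BlockingQueue<String> low = new LinkedBlockingQueue<>();
        List<String> sink = new CopyOnWriteArrayList<>();
        Thread highWorker = start("high", Thread.MAX_PRIORITY, high, sink);
        Thread lowWorker = start("low", Thread.MIN_PRIORITY, low, sink);
        high.add("h1");
        low.add("l1");
        high.add(STOP);
        low.add(STOP);
        awaitDone(highWorker);
        awaitDone(lowWorker);
        System.out.println(sink); // order across the two threads is not deterministic
    }
}
```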