I have a Kafka producer initialized as follows:
var config = new Dictionary<string, object> { { "bootstrap.servers", BOOTSTRAP_SERVERS } };
Producer _producer = new Producer(config);
After initialization, when I am about to produce a message to a particular topic, I want to add compression to the producer (only for that topic). I need to add the compression config { "compression.codec", "gzip" }.
How can I do this?
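A minimal sketch of one workaround, assuming a second producer instance is acceptable: in the .NET client, compression.codec is a producer/topic configuration (from librdkafka), not something you can switch per message, so you can build a dedicated producer with gzip enabled and route only that topic's messages through it.
// Hedged sketch: a second producer that carries the compression setting.
// BOOTSTRAP_SERVERS is the same constant used above.
var compressedConfig = new Dictionary<string, object>
{
    { "bootstrap.servers", BOOTSTRAP_SERVERS },
    { "compression.codec", "gzip" }   // gzip applies to everything this producer sends
};
Producer _compressedProducer = new Producer(compressedConfig);
// Produce messages for the topic that needs compression with _compressedProducer,
// and keep using _producer for all other topics.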
I use some code to determine the difference between the read offsets and the maximum offset values. It is needed for internal diagnostics: for example, if the difference is below a "baseline" value, the read speed is good.
The code sample below works fine, but I need to know the partition count for the topic, and later get the maximum offset value for each partition in the topic.
How can I get topic metadata without subscribing to the topic?
Something like: topic "TestTopic" has 4 partitions.
public async Task MaxOffsetValues()
{
    // just for tests
    // ToDo: use config from settings
    while (true)
    {
        var topicName = "testTopic";
        var config = new ConsumerConfig
        {
            BootstrapServers = "192.168.1.1:9092,192.168.1.2:9092,192.168.1.3:9092",
            GroupId = Guid.NewGuid().ToString(),
            ClientId = Dns.GetHostName(),
            EnableAutoCommit = false,
        };

        using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
        {
            var offsetBorders = consumer.QueryWatermarkOffsets(new TopicPartition(topicName, 0), TimeSpan.FromSeconds(10));
            _log.Debug($"[Diagnostic] Topic: ({topicName}), Partition: ({0}) Minimal offset: ({offsetBorders.Low}) Maximum offset: ({offsetBorders.High})");
        }

        await Task.Delay(TimeSpan.FromSeconds(60));
    }
}
If you are looking for programmatic and non-programmatic ways to get metadata, there are three ways to get information about a topic without subscribing to it:
Admin client (https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/admin/AdminClient.html)
If you have access to Confluent Control Center, it shows metadata about a topic.
You can also use the kafka-topics.sh CLI tool.
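For example (a sketch only: the exact flags vary by Kafka version, older releases take --zookeeper instead of --bootstrap-server, and broker:9092 is a placeholder for one of your brokers):
kafka-topics.sh --describe --topic TestTopic --bootstrap-server broker:9092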
For my goal this was enough (code reduced for clarity):
// ...
var brokerServers = _diagnosticSettings.Value.BrokerHosts;
var brokerDelayInSeconds = TimeSpan.FromSeconds(_diagnosticSettings.Value.BrokerDelayInSeconds);

var adminClientConfig = new AdminClientConfig
{
    BootstrapServers = brokerServers,
    ClientId = Dns.GetHostName(),
};

adminClient = new AdminClientBuilder(adminClientConfig).Build();

Metadata topicMetadata = adminClient.GetMetadata(topicExternalName, brokerDelayInSeconds);
partitionCount = topicMetadata.Topics[0].Partitions.Count;
// ...
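Building on that, a rough sketch of the full loop (it reuses config, adminClientConfig, topicName, and _log from the snippets above; the 10-second timeouts are arbitrary) that takes the partition list from GetMetadata and queries the watermark offsets of every partition:
// Hedged sketch: enumerate the topic's partitions from the AdminClient metadata
// and query the low/high watermark of each one.
using (var adminClient = new AdminClientBuilder(adminClientConfig).Build())
using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
{
    var topicMetadata = adminClient.GetMetadata(topicName, TimeSpan.FromSeconds(10));
    foreach (var partition in topicMetadata.Topics[0].Partitions)
    {
        var watermarks = consumer.QueryWatermarkOffsets(
            new TopicPartition(topicName, partition.PartitionId),
            TimeSpan.FromSeconds(10));
        _log.Debug($"[Diagnostic] Topic: ({topicName}), Partition: ({partition.PartitionId}) Minimal offset: ({watermarks.Low}) Maximum offset: ({watermarks.High})");
    }
}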
I have a Spark application that needs to read two streams from two Kafka clusters (Kafka A and Kafka B) using Structured Streaming, and do some joins and filtering on the two streams. Is it possible to have a Spark job that reads the stream from A, and also runs a thread (called Consumer) on each worker that reads from Kafka B and puts the data into a map? Later, when we are filtering, we could do something like stream.filter(row => consumer.idNotInMap(row.id)).
I have some questions regarding this approach:
If this approach works, will it cause any problems when the application is run on a cluster?
Will all consumer instances on each worker receive the same data in cluster mode? Or can we have each consumer listen only to the Kafka partitions for that worker node (which is probably controlled by Spark)?
How will the consumer instance get serialized and passed to the workers?
Currently it is initialized on the driver node, but is there a way to initialize it once per worker node?
I feel like in my case I should use stream joining instead. I've already tried that and it didn't work, which is why I am taking this approach. It didn't work because the stream from Kafka A is append-only, while stream B needs state that can be updated, which makes it update-only, and joining streams in append and update mode is not supported in Spark.
Here is some pseudo-code:
// SparkJob.scala
val consumer = new Consumer()
val getMetadata = udf((id: Int) => consumer.get(id))
val enrichedDataSet = stream.withColumn("metadata", getMetadata(stream("id")))
// Consumer.java
class Consumer implements Serializable {
    private final ConcurrentHashMap<Integer, String> metadata;

    public Consumer() {
        metadata = new ConcurrentHashMap<>();
        // start reading the stream on a background thread
        listen();
    }

    // process Kafka data inside this loop
    private void listen() {
        Thread t = new Thread(() -> {
            KafkaConsumer<Integer, String> consumer = ...; // create and subscribe to the Kafka B topic
            while (true) {
                for (ConsumerRecord<Integer, String> message : consumer.poll(Duration.ofMillis(100))) {
                    // update existing metadata or put in new metadata
                    metadata.put(message.key(), message.value());
                }
            }
        });
        t.start();
    }

    public String get(Integer key) {
        return metadata.get(key);
    }
}
I am trying to write Kafka Streams code for converting a JSON array into individual JSON elements. Since I am new to Kafka Streams, can anyone help me write the code, i.e. what should go in the KStream and the KTable?
My input stream will be in the following format:
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
]
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
]
and my output must be in the form:
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
Can anyone help me write the code?
If you want to use Kafka Streams, you can use a flatMap(). Something like
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic").flatMap(...).to("output-topic");
Check out the examples and docs for more details:
https://docs.confluent.io/current/streams/developer-guide/index.html
https://github.com/confluentinc/kafka-streams-examples
In Python:
from kafka import KafkaConsumer

consumer = KafkaConsumer('topicName')
for message in consumer:
    print(message)
Specify the bootstrap_servers parameter in KafkaConsumer to point at your brokers.
For Java, have a look at CloudKarafka, which has good examples:
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topic));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("msg = %s\n", record.value());
}
I am using the Confluent 3.0.1 platform and building a Kafka-Elasticsearch connector. For this I am extending SinkConnector and SinkTask (Kafka Connect APIs) to get data from Kafka.
As part of this code I am overriding the taskConfigs method of SinkConnector to return "max.poll.records" so that only 100 records are fetched at a time. But it's not working: I am getting all records at the same time and failing to commit offsets within the stipulated time. Can anyone help me configure "max.poll.records"?
public List<Map<String, String>> taskConfigs(int maxTasks) {
    ArrayList<Map<String, String>> configs = new ArrayList<Map<String, String>>();
    for (int i = 0; i < maxTasks; i++) {
        Map<String, String> config = new HashMap<String, String>();
        config.put(ConfigurationConstants.CLUSTER_NAME, clusterName);
        config.put(ConfigurationConstants.HOSTS, hosts);
        config.put(ConfigurationConstants.BULK_SIZE, bulkSize);
        config.put(ConfigurationConstants.IDS, elasticSearchIds);
        config.put(ConfigurationConstants.TOPICS_SATELLITE_DATA, topics);
        config.put(ConfigurationConstants.PUBLISH_TOPIC, topicTopublish);
        config.put(ConfigurationConstants.TYPES, elasticSearchTypes);
        config.put("max.poll.records", "100");
        configs.add(config);
    }
    return configs;
}
You can't override most Kafka consumer configs like max.poll.records in the connector configuration. You can do so in the Connect worker configuration though, with a consumer. prefix.
It was solved. I added the configuration below in connect-avro-standalone.properties:
group.id=mygroup
consumer.max.poll.records=1000
and ran the command below to run my connector:
sh ./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./etc/kafka-connect-elasticsearch/connect-elasticsearch-sink.properties
I am just exploring Kafka. Currently I am using one producer and one topic to produce messages, and they are consumed by one consumer. Very simple.
I was reading the Kafka documentation: the new producer API is thread-safe and sharing a single instance will improve performance.
Does that mean I can use a single producer to publish messages to multiple topics?
Never tried it myself, but I guess you can, since the code for creating a producer and sending records is (from https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html):
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));
So, I guess, if you just write different topics in the ProducerRecord, then it should be possible.
Also, http://kafka.apache.org/081/documentation.html#producerapi explicitly says that you can use the method send(List<KeyedMessage<K,V>> messages) to write to multiple topics.
If I understand you correctly, you are looking to use the same producer instance to send the same/multiple messages to multiple topics.
Not sure about Java, but in C# (.NET) you can do it with the Kafka .NET client's DependentProducerBuilder:
using (var producer = new ProducerBuilder<string, string>(config).Build())
using (var producer2 = new DependentProducerBuilder<Null, int>(producer.Handle).Build())
{
    producer.ProduceAsync("first-topic", new Message<string, string> { Key = "my-key-value", Value = "my-value" });
    producer2.ProduceAsync("second-topic", new Message<Null, int> { Value = 42 });
    producer2.ProduceAsync("first-topic", new Message<Null, int> { Value = 107 });
    producer.Flush(TimeSpan.FromSeconds(10));
}