Creating a topic with a specific retention time using the Confluent.Kafka .NET API - apache-kafka

I'm developing a Kafka producer on .NET 6 using Confluent.Kafka (v1.9.0) and I'm looking for a way to create a topic, specifying a "retention.ms" value different from the one configured in my broker configuration.
Here is the code I'm using:
TopicSpecification newTopic = new TopicSpecification
{
    Name = TOPIC_NAME,
    NumPartitions = 2,
    ReplicationFactor = 1,
};
Dictionary<string, string> config = new Dictionary<string, string>();
config.Add("retention.ms", "3000");
newTopic.Configs = config;
try
{
    await adminClient.CreateTopicsAsync(new List<TopicSpecification> { newTopic });
}
catch (CreateTopicsException e)
{
    Console.WriteLine("An exception occurred while creating the topic " + e.Results[0].Topic + ": " + e.Results[0].Error.Reason);
}
When I use that code, the topic is created but the retention.ms value I would like to use is not taken into account.
Could someone tell me what I'm doing wrong?
Thanks in advance for your help!

Related

Reading headers from Kafka using Apache Storm

We have a use case where we need to read message headers from Kafka using Apache Storm and pass them to a downstream bolt. The Storm documentation makes no mention of how to pass Kafka headers to a Storm topology. Has anyone figured out how to do this?
One way is to use setRecordTranslator while building the spout.
Example:
return KafkaSpoutConfig.builder("<bootstrap-servers>", "topic-to-consume")
    .setRecordTranslator(
        (r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value(), r.headers()),
        new Fields("topic", "partition", "offset", "key", "value", "headers"))
    .build();
Now, in your bolt's execute method you will be able to access all the fields above, including the headers, something like this:
public class MyNewBolt extends BaseRichBolt {
    @Override
    public void execute(Tuple tuple) {
        RecordHeaders messageHeaders = (RecordHeaders) tuple.getValueByField("headers");
        Iterator<Header> headerIterator = messageHeaders.iterator();
        while (headerIterator.hasNext()) {
            Header header = headerIterator.next();
            String headerKey = header.key();
            byte[] headerBytes = header.value();
            String headerValue = new String(headerBytes);
            System.out.println("Header Key: " + headerKey);
            System.out.println("Header Value: " + headerValue);
        }
    }
}

Confluent Kafka: how to get specific topic metadata without subscribing

I use some code to determine the difference between the read offsets and the maximum offsets. It is needed for internal diagnostics: for example, if the difference is below a baseline value, the read speed is good.
The provided code sample works well, but I need to know the partition count for a topic and, later, get the maximum offset value for each partition in the topic.
How can I get topic metadata without subscribing to the topic?
Something like: topic "TestTopic" has 4 partitions.
public async Task MaxOffsetValues()
{
    // just for tests
    // ToDo: use config from settings
    while (true)
    {
        var topicName = "testTopic";
        var config = new ConsumerConfig
        {
            BootstrapServers = "192.168.1.1:9092,192.168.1.2:9092,192.168.1.3:9092",
            GroupId = Guid.NewGuid().ToString(),
            ClientId = Dns.GetHostName(),
            EnableAutoCommit = false,
        };
        using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
        {
            var offsetBorders = consumer.QueryWatermarkOffsets(new TopicPartition(topicName, 0), TimeSpan.FromSeconds(10));
            _log.Debug($"[Diagnostic] Topic: ({topicName}), Partition: ({0}) Minimal offset: ({offsetBorders.Low}) Maximum offset: ({offsetBorders.High})");
        }
        await Task.Delay(TimeSpan.FromSeconds(60));
    }
}
If you are looking for programmatic and non-programmatic ways to get metadata, there are three ways you can get information about a topic without subscribing to it:
Admin client (https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/admin/AdminClient.html) - a minimal sketch of this option is shown below.
If you have access to Confluent Control Center, it shows metadata about a topic.
You can also use the kafka-topics.sh CLI tool.
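For illustration, here is a minimal Java sketch of the Admin client option, assuming a broker reachable at localhost:9092 and a topic named testTopic (both placeholders):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicMetadataCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Describe the topic without subscribing to it
            TopicDescription description = admin.describeTopics(Collections.singletonList("testTopic"))
                    .all().get()
                    .get("testTopic");
            System.out.println("Topic testTopic has " + description.partitions().size() + " partitions");
        }
    }
}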
For my goal, the following was enough (code reduced for clarity):
// ...
var brokerServers = _diagnosticSettings.Value.BrokerHosts;
var brokerDelayInSeconds = TimeSpan.FromSeconds(_diagnosticSettings.Value.BrokerDelayInSeconds);
var adminClientConfig = new AdminClientConfig
{
BootstrapServers = brokerServers,
ClientId = Dns.GetHostName(),
};
adminClient = new AdminClientBuilder(adminClientConfig).Build();
Metadata topicMetadata = null;
topicMetadata = adminClient.GetMetadata(topicExternalName, brokerDelayInSeconds);
partitionCount = topicMetadata.Topics[0].Partitions.Count;
// ...

Is there a retention policy for a custom state store (RocksDB) with Kafka Streams?

I am setting up a new Kafka Streams application and want to use a custom state store using RocksDB. This is working fine for putting data into the state store, getting a queryable state store from it, and iterating over the data. However, after about 72 hours I observe data to be missing from the store. Is there a default retention time on data for state stores in Kafka Streams or in RocksDB?
We are using a custom state store backed by RocksDB so that we can utilize the column family feature, which we can't use with the embedded RocksDB implementation in Kafka Streams. I have implemented the custom store using the KeyValueStore interface, and have my own StoreSupplier, StoreBuilder, StoreType and StoreWrapper as well.
A changelog topic is created for the application but no data is going to it yet (haven't looked into that problem yet).
Putting data into this custom state store and getting queryable state store from it is working fine. However, I am seeing that data is missing after about 72 hours from the store. I checked by getting the size of the state store directory as well as by exporting the data into files and checking the number of entries.
Using SNAPPY compression and UNIVERSAL compaction
Simple topology:
final StreamsBuilder builder = new StreamsBuilder();
String storeName = "store-name";
List<String> cfNames = new ArrayList<>();
// Hybrid custom store
final StoreBuilder customStore = new RocksDBColumnFamilyStoreBuilder(storeName, cfNames);
builder.addStateStore(customStore);
KStream<String, String> inputstream = builder.stream(
        inputTopicName,
        Consumed.with(Serdes.String(), Serdes.String()));
inputstream
    .transform(() -> new CurrentTransformer(storeName), storeName);
Topology tp = builder.build();
Snippet from custom store implementation:
RocksDBColumnFamilyStore(final String name, final String parentDir, List<String> columnFamilyNames) {
    .....
    ......
    final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockCache(cache)
            .setBlockSize(BLOCK_SIZE)
            .setCacheIndexAndFilterBlocks(true)
            .setPinL0FilterAndIndexBlocksInCache(true)
            .setFilterPolicy(filter)
            .setCacheIndexAndFilterBlocksWithHighPriority(true)
            .setPinTopLevelIndexAndFilter(true);
    cfOptions = new ColumnFamilyOptions()
            .setCompressionType(CompressionType.SNAPPY_COMPRESSION)
            .setCompactionStyle(CompactionStyle.UNIVERSAL)
            .setMaxWriteBufferNumber(MAX_WRITE_BUFFERS)
            .setOptimizeFiltersForHits(true)
            .setLevelCompactionDynamicLevelBytes(true)
            .setTableFormatConfig(tableConfig);
    columnFamilyDescriptors.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOptions));
    columnFamilyNames.stream().forEach((cfName) ->
            columnFamilyDescriptors.add(new ColumnFamilyDescriptor(cfName.getBytes(), cfOptions)));
}
@SuppressWarnings("unchecked")
public void openDB(final ProcessorContext context) {
    Options opts = new Options().prepareForBulkLoad();
    options = new DBOptions(opts)
            .setCreateIfMissing(true)
            .setErrorIfExists(false)
            .setInfoLogLevel(InfoLogLevel.INFO_LEVEL)
            .setMaxOpenFiles(-1)
            .setWriteBufferManager(writeBufferManager)
            .setIncreaseParallelism(Math.max(Runtime.getRuntime().availableProcessors(), 2))
            .setCreateMissingColumnFamilies(true);
    fOptions = new FlushOptions();
    fOptions.setWaitForFlush(true);
    dbDir = new File(new File(context.stateDir(), parentDir), name);
    try {
        Files.createDirectories(dbDir.getParentFile().toPath());
        db = RocksDB.open(options, dbDir.getAbsolutePath(), columnFamilyDescriptors, columnFamilyHandles);
        columnFamilyHandles.stream().forEach((handle) -> {
            try {
                columnFamilyMap.put(new String(handle.getName()), handle);
            } catch (RocksDBException e) {
                throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
            }
        });
    } catch (RocksDBException e) {
        throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
    }
    open = true;
}
The expectation is that the state store (RocksDB) will retain the data indefinitely until it is manually deleted or the storage disk goes down. I am not aware that Kafka Streams has introduced TTL for state stores yet.

Kafka KStream to GlobalKTable join does not work with same key used

I have a very frustrating problem trying to join a KStream, which is populated by a Java driver program using KafkaProducer, to a GlobalKTable that is populated from a topic that, in turn, is populated by the JDBC connector pulling data from a MySQL table. No matter what I do, the join between the KStream and the GlobalKTable, which are both keyed on the same value, will not work; the ValueJoiner is never called. I'll try to explain by showing the relevant config and code below. I appreciate any help.
I am using the latest version of the confluent platform.
The topic that the GlobalKTable is populated from is pulled from a single MySQL table:
Column Name/Type:
pk/bigint(20)
org_name/varchar(255)
orgId/varchar(10)
The JDBCConnector configuration for this is:
name=my-demo
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/reporting?user=root&password=XXX
table.whitelist=organisation
mode=incrementing
incrementing.column.name=pk
topic.prefix=my-
transforms=keyaddition
transforms.keyaddition.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.keyaddition.fields=orgId
I am running the JDBC connector using the command line:
connect-standalone /home/jim/platform/confluent/etc/schema-registry/connect-avro-standalone.properties /home/jim/prg/kafka/config/my.mysql.properties
This gives me a topic called my-organisation that is keyed on orgId ..... so far so good!
(Note: the namespace does not seem to be set by the JDBC connector, but I don't think this is an issue; I don't know for sure.)
Now, the code. Here is how I initialise and create the GlobalKTable (relevant code shown):
final Map<String, String> serdeConfig =
        Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
final StreamsBuilder builder = new StreamsBuilder();
final SpecificAvroSerde<Organisation> orgSerde = new SpecificAvroSerde<>();
orgSerde.configure(serdeConfig, false);
// Create the GlobalKTable from the topic that was populated using the connect-standalone command line
final GlobalKTable<String, Organisation> orgs =
        builder.globalTable(ORG_TOPIC, Materialized.<String, Organisation, KeyValueStore<Bytes, byte[]>>as(ORG_STORE)
                .withKeySerde(Serdes.String())
                .withValueSerde(orgSerde));
The Avro schema, from which the Organisation class is generated, is defined as:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"Organisation",
"fields":[
{"name": "pk", "type":"long"},
{"name": "org_name", "type":"string"},
{"name": "orgId", "type":"string"}
]
}
Note: as described above the orgId is set as the key on the topic using the single message transform (SMT) operation.
So, that is the GlobalKTable setup.
Now for the KStream setup (the stream side of the join). This has the same key (orgId) as the GlobalKTable. I use a simple driver program for this:
(The use case is that this topic will contain events associated with each organisation)
public class UploadGenerator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                io.confluent.kafka.serializers.KafkaAvroSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                io.confluent.kafka.serializers.KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");
        KafkaProducer producer = new KafkaProducer(props);
        // This schema is also used in the consumer application or, more specifically, a class generated from it.
        String mySchema = "{\"namespace\": \"io.confluent.examples.streams.avro\"," +
                "\"type\":\"record\"," +
                "\"name\":\"DocumentUpload\"," +
                "\"fields\":[{\"name\":\"orgId\",\"type\":\"string\"}," +
                "{\"name\":\"date\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}]}";
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(mySchema);
        // Just using three fictional organisations with the following orgIds/keys
        String[] ORG_ARRAY = {"002", "003", "004"};
        long count = 0;
        String key = ""; // key is the realm
        while (true) {
            count++;
            try {
                TimeUnit.SECONDS.sleep(5);
            } catch (InterruptedException e) {
            }
            GenericRecord avroRecord = new GenericData.Record(schema);
            int orgId = ThreadLocalRandom.current().nextInt(0, 2 + 1);
            avroRecord.put("orgId", ORG_ARRAY[orgId]);
            avroRecord.put("date", new Date().getTime());
            key = ORG_ARRAY[orgId];
            ProducerRecord<Object, Object> record = new ProducerRecord<>("topic_uploads", key, avroRecord);
            try {
                producer.send(record);
                producer.flush();
            } catch (SerializationException e) {
                System.out.println("Exception was generated! " + e.getMessage());
            } catch (Exception el) {
                System.out.println("Exception: " + el.getMessage());
            }
        }
    }
}
So, this generates a new event representing an upload for an organisation identified by the orgId, which is also explicitly set as the key in the ProducerRecord.
Here is the code that sets up the KStream for these events:
final SpecificAvroSerde<DocumentUpload> uploadSerde = new SpecificAvroSerde<>();
uploadSerde.configure(serdeConfig, false);
// Get the stream of uploads
final KStream<String, DocumentUpload> uploadStream = builder.stream(UPLOADS_TOPIC, Consumed.with(Serdes.String(), uploadSerde));
// Debug output to see the contents of the stream
uploadStream.foreach((k, v) -> System.out.println("uploadStream: Key: " + k + ", Value: " + v));
// Note: I tried re-keying the stream with the orgId field (even though it was already set as the key in the driver); same problem
final KStream<String, DocumentUpload> keyedUploadStream = uploadStream.selectKey((key, value) -> value.getOrgId());
keyedUploadStream.foreach((k, v) -> System.out.println("keyedUploadStream: Key: " + k + ", Value: " + v));
// Java 7 form used as it was easier to put in debug statements
// OrgPk is just a helper class defined in the same class
KStream<String, OrgPk> joined = keyedUploadStream.leftJoin(orgs,
    new KeyValueMapper<String, DocumentUpload, String>() {
        /* derive a (potentially) new key by which to look up against the table */
        @Override
        public String apply(String key, DocumentUpload value) {
            System.out.println("1. The key passed in is: " + key);
            System.out.println("2. The upload orgId passed in is: " + value.getOrgId());
            return value.getOrgId();
        }
    },
    // THIS IS NEVER CALLED WITH A join() AND WHEN CALLED WITH A leftJoin() HAS A NULL ORGANISATION
    new ValueJoiner<DocumentUpload, Organisation, OrgPk>() {
        @Override
        public OrgPk apply(DocumentUpload leftValue, Organisation rightValue) {
            System.out.println("3. Value joiner has been called...");
            if (null == rightValue) {
                // THIS IS ALWAYS CALLED, SO THERE IS NEVER A "MATCH"
                System.out.println(" 3.1. Organisation is NULL");
                return new OrgPk(leftValue.getOrgId(), 1L);
            }
            System.out.println(" 3.1. Org is OK");
            // Never reaches here - this is the issue, i.e. there is never a match
            return new OrgPk(leftValue.getOrgId(), rightValue.getPk());
        }
    });
So, the above join (or leftJoin) never matches, even though the two keys are the same! This is the main issue.
Finally, the Avro schema for DocumentUpload is:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"DocumentUpload",
"fields":[
{"name": "orgId", "type":"string"},
{"name":"date", "type":"long", "logicalType":"timestamp-millis"}
]
}
So, in summary:
I have a KStream on a topic with a String key of OrgId
I have a GlobalKTable on a topic with a String key of OrgId also.
The join never works, even though the keys are in the GlobalKTable (at least they are in the topic underlying the GlobalKTable)
Can someone help me? I am pulling my hair out trying to figure this out.
I was able to solve this issue on Windows/IntelliJ by providing a state directory config:
StreamsConfig.STATE_DIR_CONFIG
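For illustration, a minimal sketch of how that property can be set, reusing the builder from the code above; the application id, bootstrap servers, and state directory path are placeholders:
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "upload-join-app");   // placeholder
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// Point the state directory at a writable location (placeholder path);
// providing this explicitly is what resolved the issue on Windows/IntelliJ.
streamsProps.put(StreamsConfig.STATE_DIR_CONFIG, "C:/tmp/kafka-streams-state");

KafkaStreams streams = new KafkaStreams(builder.build(), streamsProps);
streams.start();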

How to read data using key in Kafka Consumer API?

I'm constructing messages using the code below...
Producer<String, String> producer = new kafka.javaapi.producer.Producer<String, String>(producerConfig);
KeyedMessage<String, String> keyedMsg = new KeyedMessage<String, String>(topic, "device-420", "{message:'hello world'}");
producer.send(keyedMsg);
And consuming them using the following code block...
//Key = topic name, Value = No. of threads for topic
Map<String, Integer> topicCount = new HashMap<String, Integer>();
topicCount.put(topic, 1);
//ConsumerConnector creates the message stream for each topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = consumerConnector.createMessageStreams(topicCount);
// Get Kafka stream for topic
List<KafkaStream<byte[], byte[]>> kStreamList = consumerStreams.get(topic);
// Iterate stream using ConsumerIterator
for (final KafkaStream<byte[], byte[]> kStreams : kStreamList) {
    ConsumerIterator<byte[], byte[]> consumerIte = kStreams.iterator();
    while (consumerIte.hasNext()) {
        MessageAndMetadata<byte[], byte[]> msg = consumerIte.next();
        System.out.println(topic.toUpperCase() + ">"
                + " Partition:" + msg.partition()
                + " | Key:" + new String(msg.key())
                + " | Offset:" + msg.offset()
                + " | Message:" + new String(msg.message()));
    }
}
Everything is working fine because I'm reading data topic-wise. So I want to know: is there any way to consume data using the message key, i.e. device-420 in this example?
Short answer: no.
The smallest granularity in Kafka is a partition. You can write a client that reads only from a single partition. However, a partition can contain multiple keys and you need to consume all the keys contained in this partition.
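For illustration, a minimal sketch with the newer Java consumer: assign a single partition explicitly (no subscription, no consumer group) and filter records by key on the client side. The broker address, topic name, and partition number are placeholders:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KeyFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no offsets committed in this sketch

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Read only one partition; assign() does not use group management
            TopicPartition partition = new TopicPartition("my-topic", 0); // placeholder topic/partition
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // All keys in the partition are delivered; filtering happens client-side
                    if ("device-420".equals(record.key())) {
                        System.out.println("Offset " + record.offset() + ": " + record.value());
                    }
                }
            }
        }
    }
}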