Kafka Topic Partition - apache-kafka

Does a Kafka topic partition's offset always start at 0, or at a random value? And how can I make sure that a consumed record is the first record in the partition? Is there any way to find out? If so, please let me know. Thanks.

Yes and no.
When you start a new topic, the offsets start at zero. Depending on the Kafka version you are using, the offsets are either
logical, that is, incremented message by message (since 0.8.0: https://issues.apache.org/jira/browse/KAFKA-506), or
physical, that is, the offset is increased by the number of bytes of each message.
Furthermore, old log entries are cleared by configurable conditions:
retention time: e.g., keep only the messages of the last week
retention size: e.g., use at most 10 GB of storage; delete the oldest messages once that limit is exceeded
log compaction (since 0.8.1): only the latest value for each key is preserved (see https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction)
Thus, the first offset might not be zero if old messages got deleted. Furthermore, if you turn on log-compaction, some offsets might be missing.
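In any case, you can always seek to any offset safely, as Kafka can figure out whether the offset is valid or not. For an offset that falls into a gap left by compaction, the fetch automatically advances to the next valid offset. For an offset that is out of range (for example, offset zero after the oldest messages were deleted), the consumer's auto.offset.reset policy decides where reading continues; with the earliest policy (smallest in the old consumer), seeking to offset zero gives you the oldest message that is stored.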
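The bookkeeping described above can be sketched with a toy in-memory model. This is only an illustration: ToyPartition and its methods are hypothetical names, not part of any Kafka client, and the "next valid offset" lookup is modelled as a plain search. It shows that, after retention and compaction, the first retained offset need not be zero, some offsets can be missing, and new messages still continue from the old counter.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ToyPartition:
    """In-memory stand-in for one partition (hypothetical, not the Kafka API)."""
    next_offset: int = 0  # offset the next appended message will receive
    records: Dict[int, Tuple[str, str]] = field(default_factory=dict)

    def append(self, key: str, value: str) -> int:
        offset = self.next_offset
        self.records[offset] = (key, value)
        self.next_offset += 1  # logical offsets: +1 per message
        return offset

    def delete_before(self, offset: int) -> None:
        """Retention: drop every record with an offset below `offset`."""
        self.records = {o: r for o, r in self.records.items() if o >= offset}

    def compact(self) -> None:
        """Log compaction: keep only the latest record for each key."""
        latest: Dict[str, int] = {}
        for offset in sorted(self.records):
            latest[self.records[offset][0]] = offset
        keep = set(latest.values())
        self.records = {o: r for o, r in self.records.items() if o in keep}

    def first_valid_offset(self, requested: int) -> Optional[int]:
        """Smallest retained offset >= requested, or None if nothing is left."""
        candidates = [o for o in self.records if o >= requested]
        return min(candidates) if candidates else None


log = ToyPartition()
for key, value in [("a", "1"), ("b", "1"), ("c", "1"), ("c", "2"), ("a", "2")]:
    log.append(key, value)      # offsets 0..4

log.delete_before(1)            # retention removes offset 0
log.compact()                   # compaction removes the older "c" at offset 2

print(sorted(log.records))          # retained offsets
print(log.first_valid_offset(0))    # oldest retained offset
print(log.first_valid_offset(2))    # offset 2 is gone, so the next one is used
print(log.next_offset)              # the counter never went back to zero
```

In this run the retained offsets are 1, 3 and 4, so the oldest stored record sits at offset 1 rather than 0, and the next append would still receive offset 5.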

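Yes, the Kafka offset starts from 0. In versions before 0.8 the offset was a byte position, so a record ended at its starting offset plus its byte length, and the next record picked up the offset from there. Since 0.8 the offset simply increases by one per record.
Because a topic is distributed over several partitions, ordering is only guaranteed within a single partition; a consumer reading the whole topic is not assured of getting the data in order across partitions.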

Related

What values do Kafka offsets have and what do they mean?

I understand that the offset is used to determine which messages a consumer wants. But is the offset a hash? Is it a timestamp? Is it simply an integer, where 3 could mean the last 3 messages?
An offset is "a sequential id number [..] that uniquely identifies each record within the partition" (source: Kafka documentation).
It starts at 0, which is the first record ever published in a given partition. It increases monotonically with each record added to the partition.

Kafka topics beyond retention period

What happens to topics that are beyond their retention period? The messages will get wiped out but will the topic still exist and if so, will it write to offset 0 if there is only one partition on a topic?
Each offset within a partition is always assigned to a single message, and it won't be reassigned. From Log Compaction Basics documentation:
Note that the messages in the tail of the log retain the original offset assigned when they were first written—that never changes. Note also that all offsets remain valid positions in the log, even if the message with that offset has been compacted away ...
The brokers will hold no data for those topics, but the offsets will be set at their "high water mark" until new messages are produced.
The topic metadata will still exist, and the offsets always increase, never reset.

When the amount of messages reaches the maximum size set by retention.bytes and Kafka deletes messages, will the offset be reset to zero?

I am new to Kafka. When we use Kafka, we can set retention.bytes. Say we set it to 1 GB: once the amount of messages reaches 1 GB, Kafka will delete messages. I want to ask whether the offset will be reset to zero.
Second, if the consumer sets auto.offset.reset to largest, at what offset will the consumer start after Kafka deletes the messages?
For your question #1: in honoring both the size-based and the time-based policies, the log might be rolled over to a new, empty log segment. The new log segment file's starting offset will be the offset of the next message that will be appended to the log, so the offset is not reset to zero.
For your question #2: it depends. If the offset tracked by the consumer is out of range because of the message deletion, it will be reset to the largest offset. Otherwise the consumer continues from its tracked offset.
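The "it depends" in the second answer can be written down as a small decision function. This is a sketch under assumptions, not Kafka client code: resolve_position and its parameters are hypothetical names, log_start stands for the oldest retained offset, and log_end for the offset the next appended message will get. The policy names largest and smallest are the old consumer's values for auto.offset.reset.

```python
def resolve_position(tracked: int, log_start: int, log_end: int,
                     reset: str = "largest") -> int:
    """Return the offset a consumer would continue from (toy model).

    tracked   -- offset the consumer last remembered
    log_start -- oldest offset still stored in the partition
    log_end   -- offset the next appended message will receive
    reset     -- policy applied only when `tracked` is out of range
    """
    if log_start <= tracked <= log_end:
        return tracked          # still valid: continue where we left off
    if reset == "largest":
        return log_end          # skip to the end, read only new messages
    if reset == "smallest":
        return log_start        # fall back to the oldest stored message
    raise ValueError(f"unknown reset policy: {reset!r}")


# Retention has deleted everything below offset 500; the log ends at 800.
print(resolve_position(650, 500, 800))               # tracked offset still valid
print(resolve_position(120, 500, 800))               # out of range, largest
print(resolve_position(120, 500, 800, "smallest"))   # out of range, smallest
```

Note that neither branch ever returns to zero unless zero is still the oldest stored offset.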

Max number of messages that can be stored in a Kafka topic partition?

I have a retention policy set for 48 hours, so old logs are eventually deleted. But the topic's offset number keeps growing. When does this number get reset? What happens when the maximum offset number is reached? Also, new segments are rolled with the base offset at the time of creation as their filename. What will the filenames of the .log and .index files be when this limit is reached?
The following is the current base offset for the log segment:
The offset is never reset because the max offset value is so big (int64) that you won't ever reach it.
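A quick back-of-the-envelope calculation makes "so big" concrete. The sustained rate of one million messages per second into a single partition is an assumption chosen for illustration, and the helper names are hypothetical. The file name helper reflects the 20-digit, zero-padded base offset that segment file names use, which is wide enough for the largest signed 64-bit value.

```python
MAX_OFFSET = 2**63 - 1              # largest signed 64-bit integer
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_exhaust(messages_per_second: float) -> float:
    """Years until one partition would run out of offsets at a constant rate."""
    return MAX_OFFSET / messages_per_second / SECONDS_PER_YEAR


def segment_file_name(base_offset: int, suffix: str = ".log") -> str:
    """Zero-pad the base offset to 20 digits, as segment file names do."""
    return f"{base_offset:020d}{suffix}"


print(round(years_to_exhaust(1_000_000)))        # assumed rate: 1M msgs/s
print(segment_file_name(0))
print(segment_file_name(MAX_OFFSET, ".index"))
```

At the assumed rate the partition would last roughly 292,000 years, and even the largest possible base offset still fits the 20-digit file name.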

Message created time in Kafka

When I consume a message from Kafka, how can I know when this message was produced/created?
There is no created method in MessageAndMetadata.
And how can I set the offset in Kafka when necessary? I prefer setting it by code or command line.
Thanks.
There is no per-message timestamping in Apache Kafka 0.7 or 0.8, though it would be handy (record timestamps were only added later, in 0.10.0).
What exists is a time-based offset-lookup in the Offset API:
Time - Used to ask for all messages before a certain time (ms). There are two special values. Specify -1 to receive the latest offset (i.e. the offset of the next coming message) and -2 to receive the earliest available offset. Note that because offsets are pulled in descending order, asking for the earliest offset will always return you a single element.
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-OffsetAPI
It relies on the filesystem ctime of the topic+partition segment files on the broker so it is a bit crude (depending on file rotation intervals, etc) and in the case of replication may be outright incorrect (timestamp of replication rather than when the original segment was created).
The solution is to embed the message creation time in the message itself.
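One way to embed the creation time is a small envelope around the payload. The sketch below is only an illustration: the JSON layout, the created_ms field and the wrap/unwrap helpers are assumptions, not a Kafka convention, and the resulting bytes would be handed to whatever producer you already use.

```python
import json
import time
from typing import Any, Optional, Tuple


def wrap(payload: Any, now_ms: Optional[int] = None) -> bytes:
    """Producer side: put the payload and its creation time into one message."""
    created_ms = int(time.time() * 1000) if now_ms is None else now_ms
    envelope = {"created_ms": created_ms, "payload": payload}
    return json.dumps(envelope).encode("utf-8")


def unwrap(message: bytes) -> Tuple[int, Any]:
    """Consumer side: recover the creation time (ms since epoch) and payload."""
    envelope = json.loads(message.decode("utf-8"))
    return envelope["created_ms"], envelope["payload"]


# Fixed timestamp so the example is reproducible; omit now_ms in real use.
message = wrap({"user": 42}, now_ms=1_700_000_000_000)
created_ms, payload = unwrap(message)
print(created_ms, payload)
```

Because the timestamp travels inside the message, it survives replication and segment rotation, unlike the file-time-based lookup described above.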