How to access the latest offset of a topic in Confluent Kafka REST Proxy to calculate lag - apache-kafka

In the Confluent Kafka REST Proxy we can get the last committed offset of a particular consumer group, but how can we get the latest offset of the topic to calculate the lag?

You can use the Kafka REST Proxy to fetch the latest committed offset for a particular partition. According to the Confluent docs:
GET /consumers/(string: group_name)/instances/(string: instance)/offsets
Get the last committed offsets for the given partitions (whether the
commit happened by this process or another).
Note that this request must be made to the specific REST proxy
instance holding the consumer instance.
Parameters:
group_name (string) -- The name of the consumer group
instance (string) -- The ID of the consumer instance
Request JSON Array of Objects:
partitions -- A list of partitions to find the last committed offsets for
partitions[i].topic (string) -- Name of the topic
partitions[i].partition (int) -- Partition ID
Response JSON Array of Objects:
offsets -- A list of committed offsets
offsets[i].topic (string) -- Name of the topic for which an offset was committed
offsets[i].partition (int) -- Partition ID for which an offset was committed
offsets[i].offset (int) -- Committed offset
offsets[i].metadata (string) -- Metadata for the committed offset
Status Codes:
404 Not Found --
Error code 40402 -- Partition not found
Error code 40403 -- Consumer instance not found
Example Request:
GET /consumers/testgroup/instances/my_consumer/offsets HTTP/1.1
Host: proxy-instance.kafkaproxy.example.com
Accept: application/vnd.kafka.v2+json, application/vnd.kafka+json, application/json
{
  "partitions": [
    {
      "topic": "test",
      "partition": 0
    },
    {
      "topic": "test",
      "partition": 1
    }
  ]
}
Example Response:
HTTP/1.1 200 OK
Content-Type: application/vnd.kafka.v2+json
{"offsets":
[
{
"topic": "test",
"partition": 0,
"offset": 21,
"metadata":""
},
{
"topic": "test",
"partition": 1,
"offset": 31,
"metadata":""
}
]
}

Looks like there is an early access feature for this: https://docs.confluent.io/platform/current/kafka-rest/api.html#get--clusters-cluster_id-consumer-groups-consumer_group_id-lags
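If that lag API isn't available to you, the lag can also be computed client-side: take the committed offsets returned by the endpoint above and subtract them from the partitions' end offsets. A rough Python sketch (not part of the original answer; it assumes kafka-python and requests are installed, that the brokers are reachable at localhost:9092, and it reuses the example host, group and topic names):

import requests
from kafka import KafkaConsumer, TopicPartition

REST_PROXY = "http://proxy-instance.kafkaproxy.example.com"   # example host from the docs above
HEADERS = {"Accept": "application/vnd.kafka.v2+json",
           "Content-Type": "application/vnd.kafka.v2+json"}
BODY = {"partitions": [{"topic": "test", "partition": 0},
                       {"topic": "test", "partition": 1}]}

# 1) Last committed offsets, via the REST proxy endpoint quoted above.
resp = requests.get(
    f"{REST_PROXY}/consumers/testgroup/instances/my_consumer/offsets",
    headers=HEADERS, json=BODY)
committed = {(o["topic"], o["partition"]): o["offset"]
             for o in resp.json()["offsets"]}

# 2) Latest (end) offsets, straight from the brokers.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")  # assumed broker address
partitions = [TopicPartition(t, p) for (t, p) in committed]
end_offsets = consumer.end_offsets(partitions)

# 3) Lag = end offset - committed offset, per partition.
for tp in partitions:
    print(tp.topic, tp.partition, "lag:", end_offsets[tp] - committed[(tp.topic, tp.partition)])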

Related

What does the `port` mean in the Kafka ZooKeeper path `/brokers/ids/$id`?

I have two Kafka listeners with this config:
listeners=PUBLIC_SASL://0.0.0.0:5011,PUBLIC_PLAIN://0.0.0.0:5010
advertised.listeners=PUBLIC_SASL://192.168.181.2:5011,PUBLIC_PLAIN://192.168.181.2:5010
listener.security.protocol.map=PUBLIC_SASL:SASL_PLAINTEXT,PUBLIC_PLAIN:PLAINTEXT
inter.broker.listener.name=PUBLIC_SASL
Port 5010 is plaintext, 5011 is sasl_plaintext.
After startup, I found this information in ZooKeeper (/brokers/ids/$id):
{
  "listener_security_protocol_map": {
    "PUBLIC_SASL": "SASL_PLAINTEXT",
    "PUBLIC_PLAIN": "PLAINTEXT"
  },
  "endpoints": [
    "PUBLIC_SASL://192.168.181.2:5011",
    "PUBLIC_PLAIN://192.168.181.2:5010"
  ],
  "jmx_port": -1,
  "features": {},
  "host": "192.168.181.2",
  "timestamp": "1658485899402",
  "port": 5010,
  "version": 5
}
What does the port field mean? Why is the port 5010? Could I change it to 5011?
What you're seeing are the advertised.port and advertised.host Kafka settings, which may be parsed from the advertised.listeners list for backward compatibility. Both of these are deprecated, however; the Kafka protocol now uses the protocol map and the corresponding endpoints list instead.
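If you want to check what a broker actually registered, you can read the znode directly and look at endpoints and listener_security_protocol_map rather than the legacy host/port fields. A small sketch using the kazoo client (the ZooKeeper address and broker id 0 are assumptions):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="192.168.181.2:2181")   # assumed ZooKeeper address
zk.start()
data, _stat = zk.get("/brokers/ids/0")         # assumed broker id
broker = json.loads(data)

# Clients use these two fields today; host/port are the deprecated single-listener values.
print(broker["endpoints"])
print(broker["listener_security_protocol_map"])
zk.stop()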

Kafka Connect sink to Mongo: only the last result per key, with a delay

I have an aggregation query of page views grouped by country, whose results are pushed to an output topic
and written to MongoDB by a Kafka sink connector:
{
  "connector.class": "MongoDbAtlasSink",
  "name": "confluent-mongodb-sink",
  "input.data.format": "JSON",
  "connection.host": "ip",
  "topics": "viewPageCountByUsers",
  "max.num.retries": "3",
  "retries.defer.timeout": "5000",
  "max.batch.size": "0",
  "database": "test",
  "collection": "ViewPagesCountByUsers",
  "tasks.max": "1"
}
The problem is that this data arrives very frequently and puts a heavy load on MongoDB. How can I configure the connector to send only the last value per key, as a batch, for example with a 5 second delay?
Example: it's pointless to update the database 5 times with
{countryID:7, viewCount: 111}
{countryID:7, viewCount: 112}
{countryID:7, viewCount: 113}
{countryID:7, viewCount: 114}
{countryID:7, viewCount: 115}
If it were possible to send only the last result per key with a 5 second delay, I could update just once:
// collect a batch for 5 sec and flush:
{countryID:7, viewCount: 115}
{countryID:8, viewCount: 573}
How can I do this?
Sink connectors just take whatever is in the topic, generally without batching.
You'd need to use a stream processor such as Kafka Streams / ksqlDB to run a windowed aggregation, then output to a new topic, which you'd read from the sink connector.
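If adding Kafka Streams is too heavy, the same idea can be illustrated with a plain consumer/producer pair: buffer records, keep only the newest value per key, and flush every few seconds to a new topic that the sink connector reads instead. This is only a sketch of the approach, not the Kafka Streams API; it assumes kafka-python, JSON values with a countryID field as in the question, and a made-up output topic name:

import json
import time
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("viewPageCountByUsers",
                         bootstrap_servers="localhost:9092",          # assumed broker address
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

latest = {}                 # countryID -> last value seen in the current window
last_flush = time.time()

while True:
    for records in consumer.poll(timeout_ms=500).values():
        for record in records:
            latest[record.value["countryID"]] = record.value
    if latest and time.time() - last_flush >= 5:
        for country_id, value in latest.items():
            # hypothetical output topic; point the MongoDB sink connector at it
            producer.send("viewPageCountByUsers.compacted",
                          key=str(country_id).encode(), value=value)
        producer.flush()
        latest.clear()
        last_flush = time.time()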

Why do I receive a lot of duplicates with Debezium?

I'm testing the Debezium platform in a local deployment with docker-compose. Here's my test case:
run Postgres, Kafka, ZooKeeper and 3 replicas of debezium/connect:1.3
configure a connector on one of the replicas with the following config:
{
  "name": "database-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "wal2json",
    "slot.name": "database",
    "database.hostname": "debezium_postgis_1",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "database",
    "database.server.name": "database",
    "heartbeat.interval.ms": 5000,
    "table.whitelist": "public.outbox",
    "transforms.outbox.table.field.event.id": "event_uuid",
    "transforms.outbox.table.field.event.key": "event_name",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.table.field.event.payload.id": "event_uuid",
    "transforms.outbox.route.topic.replacement": "${routedByValue}",
    "transforms.outbox.route.by.field": "topic",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "max.batch.size": 1,
    "offset.commit.policy": "io.debezium.engine.spi.OffsetCommitPolicy.AlwaysCommitOffsetPolicy",
    "binary.handling.mode": "bytes"
  }
}
run a script that executes 2000 inserts in the outbox table by calling this method from another class:
@Transactional
public void write(String eventName, String topic, byte[] payload) {
    Outbox newRecord = new Outbox(eventName, topic, payload);
    repository.save(newRecord);
    repository.delete(newRecord);
}
After some seconds (when I see the first messages on Kafka), I kill the replica that's handling the stream. Let's say it successfully delivered 200 messages to the right topic.
From the topic where Debezium stores its offsets, I read the last offset message:
{
  "transaction_id": null,
  "lsn_proc": 24360992,
  "lsn": 24495808,
  "txId": 560,
  "ts_usec": 1595337502556806
}
then I open a db shell and run the following:
SELECT slot_name, restart_lsn - pg_lsn('0/0') as restart_lsn, confirmed_flush_lsn - pg_lsn('0/0') as confirmed_flush_lsn FROM pg_replication_slots;
and Postgres replies:
[
  {
    "slot_name": "database",
    "restart_lsn": 24360856,
    "confirmed_flush_lsn": 24360992
  }
]
Five minutes after I killed the replica, Kafka rebalances the connectors and deploys a new running task on one of the living replicas.
The new connector starts handling the stream, but it seems to start from the beginning, because after it finishes I find 2200 messages on Kafka.
With that configuration (max.batch.size: 1 and AlwaysCommitOffsetPolicy) I expected to see at most 2001 messages.
Where am I wrong?
I found the problem in my configuration:
"offset.commit.policy": "io.debezium.engine.spi.OffsetCommitPolicy.AlwaysCommitOffsetPolicy" works only with the Embedded API.
Moreover, the debezium/connect:1.3 Docker image has a default OFFSET_FLUSH_INTERVAL_MS of 1 minute, so if I stop the container within its first minute, no offsets will be stored on Kafka.
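To check whether anything was flushed before the container died, you can dump the Connect offsets topic directly. A quick sketch with kafka-python (the topic name is whatever OFFSET_STORAGE_TOPIC was set to on the debezium/connect container; "my_connect_offsets" below is an assumption):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer("my_connect_offsets",               # assumed offsets topic name
                         bootstrap_servers="localhost:9092", # assumed broker address
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for msg in consumer:
    key = json.loads(msg.key) if msg.key else None
    value = json.loads(msg.value) if msg.value else None     # None means a tombstone
    print(key, value)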

How can I acquire the JSON data from Kafka using Spark Streaming?

I use Kafka to monitor changes to a local file and Spark Streaming to analyse them, but I can't extract the data from Kafka because the data is in JSON format.
When I run the command bin/kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --topic kafka-streaming --from-beginning,
the format of the data is:
{
"schema": {
"type": "string",
"optional": false
},
"payload": "{\"like_count\": 594, \"view_count\": 49613, \"user_name\": \" w\", \"play_url\": \"http://upic/2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-0-gjcfcmzmef-954d5652f100c12e\", \"description\": \"ţ ų ඣ 9 9 9 9\", \"cover\": \"http://uhead/AB/2016/03/09/18/BMjAxNjAzMDkxODI1MzNfMjA3ODc2NTY2XzJfaGQ5OQ==.jpg\", \"video_id\": 5235997527237673952, \"comment_count\": 39, \"download_url\": \"http://2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-1-zdpjkouqke-5862405191e4c1e4\", \"user_id\": 207876566, \"video_create_time\": \"2019-04-08 12:48:21\", \"user_sex\": \"F\"}"
}
The Spark version is 2.3.0 and the Kafka version is 1.1.0. The spark-streaming-kafka version is 0-10_2.11-2.3.0.
The JSON data in the payload field is what I want to process and analyse. How can I change the code to get at that JSON data?
Use org.apache.kafka.common.serialization.StringDeserializer and org.apache.kafka.common.serialization.StringSerializer for consuming data from and sending data to the Kafka topic, respectively.
This way you will get a String on consumption, which can very easily be converted to a JSON object using a JSON parser.
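Concretely, the record value shown above is a Kafka Connect JSON envelope whose payload field holds a stringified JSON document, so it has to be parsed twice. A small Python sketch of the idea (the Scala/Spark code would do the same two parses on the string value); the shortened sample record is for illustration only:

import json

def extract_payload(record_value: str) -> dict:
    envelope = json.loads(record_value)       # first parse: the Connect {"schema": ..., "payload": ...} envelope
    return json.loads(envelope["payload"])    # second parse: the stringified payload itself

value = '{"schema": {"type": "string", "optional": false}, "payload": "{\\"like_count\\": 594, \\"view_count\\": 49613}"}'
doc = extract_payload(value)
print(doc["like_count"], doc["view_count"])   # 594 49613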

MQTT Kafka source connector: funny byte characters

I am following https://github.com/kaiwaehner/kafka-connect-iot-mqtt-connector-example for connecting Mosquitto and Kafka with the MQTT source connector. The data sent by the Mosquitto publisher arrives at both the Mosquitto subscriber and the Kafka consumer, but the key and value fields in the ConsumerRecord objects of my Kafka consumer have some bytes prepended.
Below are the code snippets and the outputs I'm getting.
mqttPublisher.py
# imports added for completeness; `client` is assumed to be an already connected paho-mqtt client
import datetime
import json
import time

v3 = 0
while v3 < 3:
    data3 = {
        "time": str(datetime.datetime.now().time()),
        "val": v3
    }
    client.publish("sensor/dist", json.dumps(data3), qos=2)
    v3 += 1
    time.sleep(2)
mqttSubscriber.py
from paho.mqtt import subscribe   # import added for completeness

def on_message_print(client, userdata, message):
    print(message.topic, message.payload)

subscribe.callback(on_message_print, "sensor/#", hostname="localhost")
kafkaConsumer.py
from kafka import KafkaConsumer   # import added for completeness (kafka-python)

consumer = KafkaConsumer('mqtt.',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    print(message)
Output: mqttSubscriber.py
sensor/dist b'{"time": "12:44:30.817462", "val": 0}'
sensor/dist b'{"time": "12:44:32.820040", "val": 1}'
sensor/dist b'{"time": "12:44:34.822657", "val": 2}'
Output: kafkaConsumer.py
ConsumerRecord(topic='mqtt.', partition=0, offset=225, timestamp=1545117270870, timestamp_type=0, key=b'\x00\x00\x00\x00\x01\x16sensor/dist', value=b'\x00\x00\x00\x00\x02J{"time": "12:44:30.817462", "val": 0}', headers=[('mqtt.message.id', b'0'), ('mqtt.qos', b'0'), ('mqtt.retained', b'false'), ('mqtt.duplicate', b'false')], checksum=None, serialized_key_size=17, serialized_value_size=43, serialized_header_size=62)
ConsumerRecord(topic='mqtt.', partition=0, offset=226, timestamp=1545117272821, timestamp_type=0, key=b'\x00\x00\x00\x00\x01\x16sensor/dist', value=b'\x00\x00\x00\x00\x02J{"time": "12:44:32.820040", "val": 1}', headers=[('mqtt.message.id', b'0'), ('mqtt.qos', b'0'), ('mqtt.retained', b'false'), ('mqtt.duplicate', b'false')], checksum=None, serialized_key_size=17, serialized_value_size=43, serialized_header_size=62)
ConsumerRecord(topic='mqtt.', partition=0, offset=227, timestamp=1545117274824, timestamp_type=0, key=b'\x00\x00\x00\x00\x01\x16sensor/dist', value=b'\x00\x00\x00\x00\x02J{"time": "12:44:34.822657", "val": 2}', headers=[('mqtt.message.id', b'0'), ('mqtt.qos', b'0'), ('mqtt.retained', b'false'), ('mqtt.duplicate', b'false')], checksum=None, serialized_key_size=17, serialized_value_size=43, serialized_header_size=62)
What is causing the above prepending of extra bytes in the Kafka Consumer?
Thanks in advance.
As part of the demo, you're starting a Schema Registry
Start Kafka Connect and dependencies (Kafka, Zookeeper, Schema Registry):
confluent start connect
If you look at the first 5 bytes, you'll see they start with 0, then four more bytes representing an integer.
See the Schema Registry Wire Format and try doing a curl localhost:8081/subjects to see if it lists your topic name for mqtt-key and mqtt-value.
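For example, decoding the leading bytes of the records above shows the magic byte 0 followed by a 4-byte big-endian schema ID (1 for the key schema, 2 for the value schema):

import struct

value = b'\x00\x00\x00\x00\x02J{"time": "12:44:30.817462", "val": 0}'
magic, schema_id = struct.unpack(">bI", value[:5])
print(magic, schema_id)   # 0 2 -- look the ID up at http://localhost:8081/schemas/ids/2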
If you didn't want Avro, you would need to configure and edit your Kafka Connect property file to use different converters, and not use confluent start for anything other than getting Kafka and Zookeeper running.
Or if you want Python to deserialize the Avro, you can refer to the confluent-kafka-python repo on GitHub.
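A minimal sketch of that Python route, assuming confluent-kafka-python with its (legacy) Avro helper is installed and the demo's Schema Registry is at localhost:8081:

from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",          # assumed broker address
    "group.id": "mqtt-debug",                       # hypothetical group id
    "schema.registry.url": "http://localhost:8081",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mqtt."])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), msg.value())   # decoded objects, no wire-format bytes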