VictoriaMetrics / Grafana not showing data from a period in the middle of the range, but data shows up when zoomed in

We have been attempting to backfill 6 months of data into VictoriaMetrics. VictoriaMetrics is set up as a cluster with 2 vmstorage nodes, 1 vminsert instance and 1 vmselect instance. Everything went smoothly until data stopped showing up in Grafana after a specific date. Our import script was still running without errors, and the logs of vminsert and both vmstorage nodes show no errors either. Then data started appearing again at the end of the range.
So a big chunk of the data is missing in Grafana, but sometimes zooming in reveals data that does not show up on a wider time range. After double-checking the vmstorage logs, it seems the partitions are not the same on each node, and one of the nodes is missing the latest partition.
Logs from Storage Node 1
2022-12-02T00:21:22.521Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_06" with smallPartsPath="/storage/data/small/2022_06", bigPartsPath="/storage/data/big/2022_06"
2022-12-02T00:21:24.259Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_06" has been created
2022-12-02T13:56:23.722Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_07" with smallPartsPath="/storage/data/small/2022_07", bigPartsPath="/storage/data/big/2022_07"
2022-12-02T13:56:26.533Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_07" has been created
2022-12-03T01:24:45.721Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_08" with smallPartsPath="/storage/data/small/2022_08", bigPartsPath="/storage/data/big/2022_08"
2022-12-03T01:24:46.900Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_08" has been created
2022-12-03T17:57:02.525Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_09" with smallPartsPath="/storage/data/small/2022_09", bigPartsPath="/storage/data/big/2022_09"
2022-12-03T17:57:03.713Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_09" has been created
2022-12-04T07:08:51.722Z info VictoriaMetrics/lib/storage/partition.go:1305 merged 18251 rows across 18251 blocks in 37.328 seconds at 488 rows/sec to "/storage/data/small/2022_09/18251_18251_20220928180000.000_20220928180000.000_172D5A35BF48B011"; sizeBytes: 250837
2022-12-04T08:41:28.530Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_10" with smallPartsPath="/storage/data/small/2022_10", bigPartsPath="/storage/data/big/2022_10"
2022-12-04T08:41:31.022Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_10" has been created
2022-12-04T21:22:29.569Z info VictoriaMetrics/lib/mergeset/table.go:1027 merged 24870725 items across 28659 blocks in 37.205 seconds at 668480 items/sec to "/storage/indexdb/172CAC7059FB5EB3/24860954_28638_172CAC71E7BD8BD0"; sizeBytes: 331467849
2022-12-04T22:37:55.719Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_11" with smallPartsPath="/storage/data/small/2022_11", bigPartsPath="/storage/data/big/2022_11"
2022-12-04T22:37:56.406Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_11" has been created
2022-12-05T00:39:45.097Z info VictoriaMetrics/lib/storage/partition.go:1305 merged 199154 rows across 199154 blocks in 30.426 seconds at 6545 rows/sec to "/storage/data/small/2022_11/199154_113352_20221104180000.000_20221106180000.000_172DB81E235772F4"; sizeBytes: 1909359
2022-12-05T03:29:28.254Z info VictoriaMetrics/lib/storage/partition.go:1305 merged 211814 rows across 211814 blocks in 30.355 seconds at 6977 rows/sec to "/storage/data/small/2022_11/211814_114004_20221108180000.000_20221110180000.000_172DB81E235773BF"; sizeBytes: 1891824
2022-12-05T16:34:53.329Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_12" with smallPartsPath="/storage/data/small/2022_12", bigPartsPath="/storage/data/big/2022_12"
2022-12-05T16:34:54.324Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_12" has been created
Logs from Storage Node 2
2022-12-02T00:21:22.523Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_06" with smallPartsPath="/storage/data/small/2022_06", bigPartsPath="/storage/data/big/2022_06"
2022-12-02T00:21:23.985Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_06" has been created
2022-12-02T13:56:23.727Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_07" with smallPartsPath="/storage/data/small/2022_07", bigPartsPath="/storage/data/big/2022_07"
2022-12-02T13:56:26.533Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_07" has been created
2022-12-03T01:24:45.724Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_08" with smallPartsPath="/storage/data/small/2022_08", bigPartsPath="/storage/data/big/2022_08"
2022-12-03T01:24:46.900Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_08" has been created
2022-12-03T17:57:02.517Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_09" with smallPartsPath="/storage/data/small/2022_09", bigPartsPath="/storage/data/big/2022_09"
2022-12-03T17:57:03.713Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_09" has been created
2022-12-04T07:08:23.316Z info VictoriaMetrics/lib/storage/partition.go:1305 merged 23345 rows across 23345 blocks in 33.611 seconds at 694 rows/sec to "/storage/data/small/2022_09/23345_23345_20220928180000.000_20220928180000.000_172D5A35BF445C06"; sizeBytes: 338817
2022-12-04T08:41:28.524Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_10" with smallPartsPath="/storage/data/small/2022_10", bigPartsPath="/storage/data/big/2022_10"
2022-12-04T08:41:31.022Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_10" has been created
2022-12-04T22:37:55.725Z info VictoriaMetrics/lib/storage/partition.go:200 creating a partition "2022_11" with smallPartsPath="/storage/data/small/2022_11", bigPartsPath="/storage/data/big/2022_11"
2022-12-04T22:37:56.406Z info VictoriaMetrics/lib/storage/partition.go:216 partition "2022_11" has been created
What is the issue here?
Why does the data sometimes show up when zoomed in?
Is there a data mismatch between the nodes that sometimes shows up depending on which node the query is routed to?

It is likely you need to reset the response cache on every vmselect node after the backfilling process is complete. See these docs for details.
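For example, a rough sketch of that reset, assuming your vmselect build exposes the /internal/resetRollupResultCache handler; the host name and port below are placeholders, so adjust them to your own deployment:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Sketch: ask each vmselect node to drop its rollup result cache after backfilling.
public class ResetVmselectCache {
    public static void main(String[] args) throws Exception {
        // Placeholder address; add more entries if you run several vmselect instances.
        List<String> vmselectNodes = List.of("http://vmselect-1:8481");
        HttpClient client = HttpClient.newHttpClient();
        for (String node : vmselectNodes) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(node + "/internal/resetRollupResultCache"))
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(node + " -> HTTP " + response.statusCode());
        }
    }
}

Restarting the vmselect process, or running vmselect with -search.disableCache while the backfill is in progress, should have the same effect.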

Related

KSQL | Consumer lag | Confluent Cloud

I am using Kafka on Confluent Cloud as a message queue in the ecosystem. There are 2 topics, A and B.
Messages in B arrive a little later than the corresponding messages in A (with a delay of about 30 seconds).
I am joining these 2 topics using KSQL. The KSQL server is deployed on-premises and is connected to Confluent Cloud. In KSQL I join these 2 topics as streams on a common identifier, say requestId, and create a new stream C, which is the joined stream.
At times, stream C shows that it has built up lag and has not processed messages from A and B.
This lag is visible in the Confluent Cloud UI. When I log in to the KSQL server I see the following error, and after a restart of the KSQL server everything works fine again. This happens intermittently every 2-3 days.
Here is my configuration for the KSQL server, which is deployed on-premises.
# A comma separated list of the Confluent Cloud broker endpoints
bootstrap.servers=${bootstrap_servers}
ksql.internal.topic.replicas=3
ksql.streams.replication.factor=3
ksql.logging.processing.topic.replication.factor=3
listeners=http://0.0.0.0:8088
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="${bootstrap_auth_key}" password="${bootstrap_secret_key}";
# Schema Registry specific settings
ksql.schema.registry.basic.auth.credentials.source=USER_INFO
ksql.schema.registry.basic.auth.user.info=${schema_registry_auth_key}:${schema_registry_secret_key}
ksql.schema.registry.url=${schema_registry_url}
# Additional settings
ksql.streams.producer.delivery.timeout.ms=2147483647
ksql.streams.producer.max.block.ms=9223372036854775807
ksql.query.pull.enable.standby.reads=false
#ksql.streams.num.standby.replicas=3 // TODO if we need HA 1+1
#num.standby.replicas=3
# Automatically create the processing log topic if it does not already exist:
ksql.logging.processing.topic.auto.create=true
# Automatically create a stream within KSQL for the processing log:
ksql.logging.processing.stream.auto.create=true
compression.type=snappy
ksql.streams.state.dir=${base_storage_directory}/kafka-streams
Error message in the KSQL server logs:
[2020-11-25 14:08:49,785] INFO stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-2] State transition from RUNNING to PARTITIONS_ASSIGNED (org.apache.kafka.streams.processor.internals.StreamThread:220)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] WARN stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3] Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group. (org.apache.kafka.streams.processor.internals.StreamThread:572)
org.apache.kafka.streams.errors.TaskMigratedException: Consumer committing offsets failed, indicating the corresponding thread is no longer part of the group; it means all tasks belonging to this thread should be migrated.
at org.apache.kafka.streams.processor.internals.TaskManager.commitOffsetsOrTransaction(TaskManager.java:1009)
at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:962)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:851)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:714)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1251)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1158)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1132)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1107)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
Edit:
During this exception, I verified that the KSQL server has enough RAM and CPU.

Apache Flink & Kafka FETCH_SESSION_ID_NOT_FOUND info logs

Our Flink application has a Kafka datasource.
The application runs with a parallelism of 32.
When I look at the logs, I see a lot of statements about FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:47,753 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-81, groupId=sampleGroup]
Node 26 was unable to process the fetch request with (sessionId=439766827, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:48,230 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-78, groupId=sampleGroup]
Node 28 was unable to process the fetch request with (sessionId=281654250, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
What do these log statements mean?
What are the possible negative effects?
Note: I have no experience with Apache Kafka.
Thanks.
This can happen for a few reasons but the most common one is the FetchSession cache being full on the brokers.
By default, brokers cache up to 1000 FetchSessions (configured via max.incremental.fetch.session.cache.slots). When this fills up, brokers can evict cache entries. If your client's cache entry is gone, it will receive the FETCH_SESSION_ID_NOT_FOUND error.
This error is not fatal and consumers should send a new full FetchRequest automatically and keep working.
You can check the size of the FetchSession cache using the kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions metric.
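If it helps, here is a rough Java sketch of reading that metric over JMX; it assumes the brokers expose remote JMX, and the host broker-1 and port 9999 are placeholders:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: read the current number of incremental FetchSessions cached by a broker.
public class FetchSessionCacheCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi"); // placeholder host/port
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName metric = new ObjectName(
                    "kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions");
            Object value = mbs.getAttribute(metric, "Value");
            System.out.println("NumIncrementalFetchSessions = " + value);
        }
    }
}

If the value keeps sitting at the configured maximum, raising max.incremental.fetch.session.cache.slots on the brokers is the usual remedy.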

Kafka Producer cannot validate record without PK and returns InvalidRecordException

I have an error with my Kafka producer. I use the Debezium Kafka connectors v1.1.0.Final and Kafka 2.4.1. Tables with a PK are all flushed cleanly, but unfortunately tables with no PK give me this error:
[2020-04-14 10:00:00,096] INFO Exporting data from table 'public.table_0' (io.debezium.relational.RelationalSnapshotChangeEventSource:280)
[2020-04-14 10:00:00,097] INFO For table 'public.table_0' using select statement: 'SELECT * FROM "public"."table_0"' (io.debezium.relational.RelationalSnapshotChangeEventSource:287)
[2020-04-14 10:00:00,519] INFO Finished exporting 296 records for table 'public.table_0'; total duration '00:00:00.421' (io.debezium.relational.RelationalSnapshotChangeEventSource:330)
[2020-04-14 10:00:00,522] INFO Snapshot - Final stage (io.debezium.pipeline.source.AbstractSnapshotChangeEventSource:79)
[2020-04-14 10:00:00,523] INFO Snapshot ended with SnapshotResult [status=COMPLETED, offset=PostgresOffsetContext [sourceInfo=source_info[server='postgres'db='xxx, lsn=38/C74913C0, txId=4511542, timestamp=2020-04-14T02:00:00.517Z, snapshot=FALSE, schema=public, table=table_0], partition={server=postgres}, lastSnapshotRecord=true]] (io.debezium.pipeline.ChangeEventSourceCoordinator:90)
[2020-04-14 10:00:00,524] INFO Connected metrics set to 'true' (io.debezium.pipeline.metrics.StreamingChangeEventSourceMetrics:59)
[2020-04-14 10:00:00,526] INFO Starting streaming (io.debezium.pipeline.ChangeEventSourceCoordinator:100)
[2020-04-14 10:00:00,550] ERROR WorkerSourceTask{id=pg_dev_pinjammodal-0} failed to send record to table_0: (org.apache.kafka.connect.runtime.WorkerSourceTask:347)
org.apache.kafka.common.InvalidRecordException: This record has failed the validation on broker and hence be rejected.
I have checked the tables and the records seem valid. I set producer.ack=1 in my config. Does this config trigger the invalidity here?
The problem was creating the Kafka topics with log compaction for non-PK tables; compacted topics need keyed messages, but these messages have no keys because the tables have no PKs. As a result the brokers cannot validate the Kafka messages.
The solution is to not enable log compaction on those topics and/or to not pre-create them. Another option would be to add PKs to the tables.
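For illustration only, a sketch of pre-creating such a topic with cleanup.policy=delete via the Java AdminClient; the topic name, partition count and replication factor below are placeholders rather than values from the original setup:

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

// Sketch: create the topic for a keyless (non-PK) table with delete cleanup policy
// instead of compaction, so the broker does not reject unkeyed messages.
public class CreateNonCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("postgres.public.table_0", 3, (short) 1) // placeholders
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_DELETE));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}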

Kafka Streams with state stores - Reprocessing of messages on app restart

We have the following topology with two transformers, where each transformer uses a persistent state store:
kStreamBuilder.stream(inboundTopicName)
    // stateful step backed by the first persistent store
    .transform(() -> new FirstTransformer(FIRST_STATE_STORE), FIRST_STATE_STORE)
    .map((key, value) -> ...)
    // stateful step backed by the second persistent store
    .transform(() -> new SecondTransformer(SECOND_STATE_STORE), SECOND_STATE_STORE)
    .to(outboundTopicName);
and the Kafka settings have auto.offset.reset: latest. After the app was launched, I saw that two internal compacted topics were created (which is expected): appId_inbound_firstStateStore-changelog and appId_inbound_secondStateStore-changelog
Our app was down for two days, and after we started the app again, messages were reprocessed from the beginning for a specific partition (we have multiple partitions).
I know that committed offsets are stored for about 1 day on Kafka brokers prior to version 2, so our offsets should have been cleaned up by retention. But why were messages reprocessed from the beginning if we use auto.offset.reset: latest? Maybe it is somehow related to the stateful operations or the internal changelog topics.
I see the following logs (most of them are duplicated multiple times):
StoreChangelogReader Restoring task 0_55's state store firstStateStore from beginning of the changelog
Fetcher [Consumer clientId=xxx-restore-consumer, groupId=] Resetting offset for partition xxx-55 to offset 0
ConsumerCoordinator Setting newly assigned partitions
ConsumerCoordinator Revoking previously assigned partitions
StreamsPartitionAssignor Assigned tasks to clients
AbstractCoordinator Successfully joined group with generation
StreamThread partition revocation took xxx ms
Unsubscribed all topics or patterns and assigned partitions
AbstractCoordinator (Re-)joining group
Attempt to heartbeat failed since group is rebalancing
AbstractCoordinator Group coordinator xxx:9092 (id: xxx rack: null) is unavailable or invalid, will attempt rediscovery
FetchSessionHandler - [Consumer clientId=xxx-restore-consumer, groupId=] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: org.apache.kafka.common.errors.DisconnectException
Kafka broker version 0.11.0.2; Kafka Streams version 2.1.0

Kafka Consumer left consumer group

I've run into a problem using Kafka. Any help is much appreciated!
I have ZooKeeper and Kafka clusters, 3 nodes each, in Docker Swarm. The Kafka broker configuration is shown below.
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_NUM_PARTITIONS: 8
KAFKA_REPLICA_SOCKET_TIMEOUT_MS: 30000
KAFKA_REQUEST_TIMEOUT_MS: 30000
KAFKA_COMPRESSION_TYPE: "gzip"
KAFKA_JVM_PERFORMANCE_OPTS: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80"
KAFKA_HEAP_OPTS: "-Xmx768m -Xms768m -XX:MetaspaceSize=96m"
My case:
20 producers constantly producing messages to a Kafka topic
1 consumer that reads and logs the messages
I kill a Kafka node (docker container stop), so the cluster now has 2 broker nodes (the 3rd will start and rejoin the cluster automatically)
The consumer then no longer consumes messages, because it left the consumer group due to rebalancing
Is there any mechanism to tell the consumer to rejoin the group after rebalancing?
Logs:
INFO 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] Attempt to heartbeat failed since group is rebalancing
WARN 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
@Rostyslav Whenever the consumer makes a call to read a message, it performs 2 major calls:
Poll
Commit
Poll basically fetches records from the Kafka topic, and commit tells Kafka to save them as read messages so they are not read again. While polling, a few parameters play a major role:
max_poll_records
max_poll_interval_ms
FYI: the variable names above are per the Python API.
Hence, whenever we read messages from the consumer, the next poll call is made only after the previously fetched records (up to max_poll_records) have been processed, and it must happen within max_poll_interval_ms. So whenever max_poll_records are not processed within max_poll_interval_ms, we get this error.
To overcome this issue, we need to alter one of the two variables. Altering max_poll_interval_ms can be tricky, since some records take longer to process than others. I always advise tuning max_poll_records as the fix for this issue; that works for me.
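For reference, a minimal Java sketch of the same idea (the Java property names are max.poll.records and max.poll.interval.ms); the bootstrap address, topic name and the chosen values are placeholder assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: fetch fewer records per poll so each batch is processed well within
// max.poll.interval.ms and the consumer is not kicked out of the group.
public class LoggingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "loggingGroup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Reduce the batch size per poll (default 500) ...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // ... and/or give each batch more time (default 300000 ms = 5 minutes).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("logging-topic")); // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}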