Druid Kafka Ingestion Process Data Drop

My Druid server is running in single-server mode and I'm ingesting 30M records into one datasource from Kafka. The machine has 16 GB of RAM and 100 GB of swap, and the Java heap size is 15.62 GB.
Druid also contains another datasource with 2.6M records, but that supervisor is suspended.
4.5M records have been stored successfully, but when the count increases to around 6M there is a problem stating "Unable to reconnect to Zookeeper service, Session expired event received", after which the data drops back to 4.5M and reinstates. The same cycle repeats: the record count climbs to some point like 6.2M, the same error occurs, and the data drops back to 4.5M. Then, after 4-5 hours, the Druid service restarts and the record count in the datasource starts again from 4.5M.
Segment granularity is set to HOUR.
Following are the memory usage statistics for the system:
total used free shared buff/cache available
Mem: 15G 15G 165M 36K 160M 83M
Swap: 99G 45G 54G
What should I do? Is this a memory problem?
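To put the numbers above side by side, here is a quick arithmetic sketch using only the figures from this post (nothing below is a measured or recommended value):

public class DruidHeapHeadroomSketch {
    public static void main(String[] args) {
        double physicalRamGb = 16.0;   // RAM allocated to the server
        double javaHeapGb = 15.62;     // configured Java heap
        double swapUsedGb = 45.0;      // "Swap: 99G 45G 54G" from the output above

        // Everything that is not JVM heap -- the OS, page cache, direct memory,
        // ZooKeeper, and the other Druid services -- has to fit in the remainder.
        double headroomGb = physicalRamGb - javaHeapGb;
        System.out.printf("non-heap headroom: %.2f GB, swap in use: %.0f GB%n",
                headroomGb, swapUsedGb);
    }
}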

Related

High CPU utilization KSQL db

We are running a KSQL db server on a Kubernetes cluster.
Node config:
AWS EKS fargate
No of nodes: 1
CPU: 2 vCPU (Request), 4 vCPU (Limit)
RAM: 4 GB (Request), 8 GB (Limit)
Java heap: 3 GB (Default)
Data size:
We have ~11 source topics with 1 partition each; some of them have 10k records and a few have more than 100k records. There are ~7 sink topics, but building those 7 sink topics involves ~60 ksql tables, ~38 ksql streams, and ~64 persistent queries because of joins and aggregations, so the computation is heavy.
KSQLdb version: 0.23.1, and we are using the official Confluent KSQL docker image.
The problem:
When running our KSQL script we see the CPU spike to 350-360% and memory to 20-30%. When that happens, Kubernetes restarts the server instance, which causes the ksql-migration to fail.
Error:
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection
refused:
<deployment-name>.<namespace>.svc.cluster.local/172.20.73.150:8088
Error: io.vertx.core.VertxException: Connection was closed
We have 30 migration files, and each file contains multiple table and stream creations.
And it always fails on v27.
What we have tried so far:
Running it alone; in that case it passes with no errors.
Increased the initial CPU to 4 vCPU, but there was no change in CPU utilization.
Had 2 nodes with 2 partitions in Kafka, but that had the same issue, with the addition of a few data columns having no data.
So something is not right in our configuration or resource allocation.
What's the standard way of deploying KSQL on Kubernetes? Maybe it's not meant for Kubernetes.

When does Kafka throw BufferExhaustedException?

org.apache.kafka.clients.producer.BufferExhaustedException: Failed to allocate memory within the configured max blocking time 5 ms.
This says the exception is thrown when the producer is unable to allocate memory for the record within the configured max blocking time of 5 ms.
This is what it says when I was trying to add Kafka s3-sink connectors. There are 11 topics on two Kafka brokers, and there were already consumers consuming from these topics. I was spinning up a 2-node Kafka Connect cluster with 11 connectors trying to consume from these topics, but there was a huge spike in errors when I started these s3-sink connectors. Once I stopped the connectors, the errors dropped and things seemed fine. Then I started the consumers again with a smaller number of tasks, and this time the errors spiked when there was a sudden surge in traffic and went back to normal when the traffic returned to normal. There was a max retry of 5, and messages failed to write even after 5 attempts.
From what I have read, it might be due to the producer batch size, or the producer rate being higher than the consumer rate. I guess each consumer will occupy up to 64 MB when there is bursty traffic. Could that be the reason? Should I try increasing the blocking time?
Producer Config:
lingerTime: 0
maxBlockTime: 5
bufferMemory: 1024000
batchSize: 102400
ack: "1"
maxRequestSize: 102400
retries: 1
maxInFlightRequestsPerConn: 1000
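For reference, the config above maps onto the Java producer client roughly as follows. This is only a sketch: the bootstrap servers and serializers are placeholders, and the mapping from the names above to the standard producer keys is my assumption.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker addresses -- not from the original post.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Settings from the post, assuming the names map to the standard producer keys.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);              // lingerTime
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5);           // maxBlockTime: only 5 ms to allocate buffer space
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 1_024_000L); // bufferMemory: ~1 MB total buffer
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 102_400);       // batchSize: 100 KB per batch
        props.put(ProducerConfig.ACKS_CONFIG, "1");                 // ack
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 102_400); // maxRequestSize
        props.put(ProducerConfig.RETRIES_CONFIG, 1);                // retries
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1000);

        // With buffer.memory equal to only ~10 batches and max.block.ms at 5 ms,
        // send() has almost no room to wait for buffer space, so
        // BufferExhaustedException surfaces quickly whenever the brokers fall
        // behind during a traffic burst.
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // producer.send(...) calls would go here.
        }
    }
}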
It turned out to be an increase in IOPS on the EC2 instances that the Kafka brokers couldn't handle. Increasing the number of bytes fetched per poll and decreasing the frequency of polls fixed it.

Druid Too much memory consumption

I have deployed a Druid single server using the command (./bin/start-micro-quickstart).
My server specification is 8 vCPU and 32 GB RAM (EC2 t2.2xlarge), and I have also added 100 GB of swap.
I'm trying to ingest 27M records from Kafka into Druid.
At this point I have 4M records shown in the Druid datasource with a total size of 506 MB and 5400 segments (average segment size is 92.99 KB).
And my memory usage is:
total used free shared buff/cache available
Mem: 31G 30G 248M 24K 207M 96M
Swap: 99G 78G 21G
My datasource size is 506 MB, so why is RAM consumption 108 GB?
And are all those segments in memory?
Which Druid service uses CPU and which Druid service uses memory?
How many peon tasks are you running? Since you are using Kafka to ingest, I am assuming you are using a supervisor spec. If you have too many topics, and a supervisor spec for each topic, it will take memory. Check the direct memory requirements: https://druid.apache.org/docs/latest/configuration/index.html
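As a rough illustration of the direct memory requirement linked above: per the Druid configuration docs, a processing service needs at least druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1) of direct (off-heap) memory, and that is on top of its heap. The values below are placeholders, not the micro-quickstart defaults, so check your own runtime.properties.

public class DruidDirectMemorySketch {
    public static void main(String[] args) {
        // Placeholder values -- substitute the ones from your runtime.properties.
        long bufferSizeBytes = 500L * 1024 * 1024; // druid.processing.buffer.sizeBytes
        int numThreads = 7;                        // druid.processing.numThreads
        int numMergeBuffers = 2;                   // druid.processing.numMergeBuffers

        // Minimum -XX:MaxDirectMemorySize for the service, per the Druid docs formula.
        long minDirectMemory = bufferSizeBytes * (numThreads + numMergeBuffers + 1);
        System.out.printf("min MaxDirectMemorySize ~= %.1f GB%n",
                minDirectMemory / (1024.0 * 1024 * 1024));
    }
}

Each ingestion peon is its own JVM with its own heap plus its own direct-memory allowance, which is how a 506 MB datasource can still push a 32 GB box deep into swap.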

Monitoring Mirror Maker 2

I am trying to add a template to my monitoring system to monitor MirrorMaker 2.0.
From the documentation I know these metrics are supplied via JMX:
# MBean: kafka.connect.mirror:type=MirrorSourceConnector,target=([-.w]+),topic=([-.w]+),partition=([0-9]+)
record-count # number of records replicated source -> target
record-age-ms # age of records when they are replicated
record-age-ms-min
record-age-ms-max
record-age-ms-avg
replication-latency-ms # time it takes records to propagate source->target
replication-latency-ms-min
replication-latency-ms-max
replication-latency-ms-avg
byte-rate # average number of bytes/sec in replicated records
If I wanted to monitor the lag of the replication between clusters, is it supposed to be inferred from record-age-ms? (i.e., if that age continues to grow, then the delay continues to grow?)
Thanks
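For what it's worth, here is a minimal JMX polling sketch that reads those MBeans; the host and port in the service URL are placeholders, and it assumes the Connect worker running MirrorMaker 2 has remote JMX enabled.

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class Mm2MetricsSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port -- point this at a Connect worker with JMX exposed.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://connect-worker:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Match every replicated partition tracked by MirrorSourceConnector.
            ObjectName pattern = new ObjectName(
                    "kafka.connect.mirror:type=MirrorSourceConnector,target=*,topic=*,partition=*");
            Set<ObjectName> names = mbs.queryNames(pattern, null);

            for (ObjectName name : names) {
                Object ageMs = mbs.getAttribute(name, "record-age-ms");
                Object latencyMs = mbs.getAttribute(name, "replication-latency-ms-avg");
                System.out.printf("%s record-age-ms=%s replication-latency-ms-avg=%s%n",
                        name, ageMs, latencyMs);
            }
        }
    }
}

The wildcard pattern in the ObjectName lets one query cover every target/topic/partition combination, so the same sketch keeps working as topics are added.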

Kafka Producer and Broker Throughput Limitations

I have configured a two-node, six-partition Kafka cluster with a replication factor of 2 on AWS. Each Kafka node runs on an m4.2xlarge EC2 instance backed by EBS.
I understand that the rate of data flow from a Kafka producer to a Kafka broker is limited by the network bandwidth of the producer.
Say the network bandwidth between the Kafka producer and the broker is 1 Gbps (approx. 125 MB/s) and the bandwidth between the Kafka broker and storage (between the EC2 instance and the EBS volume) is 1 Gbps.
I used the org.apache.kafka.tools.ProducerPerformance tool for profiling the performance.
I observed that a single producer can write at around 90 MB/s to the broker when the message size is 100 bytes (hence the network is not saturated).
I also observed that the disk write rate to the EBS volume is around 120 MB/s.
Is this 90 MB/s due to some network bottleneck, or is it a limitation of Kafka? (forgetting batch size, compression, etc. for simplicity)
Could this be due to the bandwidth limitation between the broker and the EBS volume?
I also observed that when two producers (from two separate machines) produce data, the throughput of one producer dropped to around 60 MB/s.
What could be the reason for this? Why doesn't that value reach 90 MB/s? Could this be due to a network bottleneck between the broker and the EBS volume?
What confuses me is that in both cases (single producer and two producers) the disk write rate to EBS stays around 120 MB/s (close to its upper limit).
Thank you
I ran into the same issue. As per my understanding, in the first case one producer is sending data to two brokers (there is nothing else on the network), so you got 90 MB/s, with each broker receiving roughly 45 MB/s. In the second case two producers are sending data to two brokers, so from each producer's perspective it can only send at 60 MB/s, but from each broker's perspective it is receiving data at 60 MB/s, so you are actually able to push more data through Kafka.
There are a couple things to consider:
There are separate disk and network limits that apply to both the instance and the volume.
You have to account for replication. If you have RF=2, the amount of write traffic taken by a single node is 2*(PRODUCER_TRAFFIC)/(PARTITION_COUNT), assuming an even distribution of writes across partitions.
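As a back-of-the-envelope illustration of that replication accounting, here is the arithmetic for the setup in the question (6 partitions, RF=2, 2 brokers, ~90 MB/s of producer traffic), under the same even-distribution assumption stated above:

public class ReplicationTrafficSketch {
    public static void main(String[] args) {
        // Numbers from the question; even distribution of writes across partitions
        // and of replicas across brokers is assumed.
        double producerMBps = 90.0;
        int replicationFactor = 2;
        int partitionCount = 6;
        int brokerCount = 2;

        double perPartitionLeaderMBps = producerMBps / partitionCount;  // 15 MB/s per partition leader
        double clusterWriteMBps = replicationFactor * producerMBps;     // every byte is written once per replica
        double perBrokerWriteMBps = clusterWriteMBps / brokerCount;     // replicas spread evenly across brokers

        System.out.printf("per-partition leader traffic ~= %.0f MB/s%n", perPartitionLeaderMBps);
        System.out.printf("total disk writes across cluster ~= %.0f MB/s%n", clusterWriteMBps);
        System.out.printf("disk writes per broker ~= %.0f MB/s%n", perBrokerWriteMBps);
    }
}

Under those assumptions each broker ends up writing roughly the full producer rate to disk, which is in the same neighbourhood as the observed ~120 MB/s EBS write rate.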