OutOfMemory Error on DruidDB index_kafka_histogram tasks

I am new to Druid and am trying to ingest data into it. Ingestion worked fine at first, but after some time the tasks started failing with the error shown below the config.
Sample Config:
"metricsSpec": [
  {
    "type": "longMin",
    "name": "min",
    "fieldName": "min",
    "expression": null
  },
  {
    "type": "longMax",
    "name": "max",
    "fieldName": "max",
    "expression": null
  },
  {
    "type": "longSum",
    "name": "count",
    "fieldName": "count",
    "expression": null
  },
  {
    "type": "longSum",
    "name": "sum",
    "fieldName": "sum",
    "expression": null
  },
  {
    "type": "quantilesDoublesSketch",
    "name": "quantilesDoubleSketch",
    "fieldName": "sketch",
    "k": 128
  }
],
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "MINUTE",
  "rollup": true,
  "intervals": null
},
"tuningConfig": {
  "type": "kafka",
  "maxRowsInMemory": 10000,
  "maxBytesInMemory": 200000,
  "maxRowsPerSegment": 5000000,
  "intermediatePersistPeriod": "PT10M",
  "basePersistDirectory": "/tmp/druid-realtime-persist15059426147899962275",
  "maxPendingPersists": 0,
  "indexSpec": {
    "bitmap": {
      "type": "roaring",
      "compressRunOnSerialization": true
    },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs"
  },
  "indexSpecForIntermediatePersists": {
    "bitmap": {
      "type": "roaring",
      "compressRunOnSerialization": true
    },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs"
  },
  "buildV9Directly": true,
  "reportParseExceptions": false,
  "handoffConditionTimeout": 0,
  "resetOffsetAutomatically": false,
  "chatRetries": 8,
  "httpTimeout": "PT10S",
  "shutdownTimeout": "PT80S",
  "offsetFetchPeriod": "PT30S",
  "intermediateHandoffPeriod": "P2147483647D",
  "logParseExceptions": true,
  "maxParseExceptions": 2147483647,
  "maxSavedParseExceptions": 0,
  "skipSequenceNumberAvailabilityCheck": false,
  "repartitionTransitionDuration": "PT120S"
}
09:55:49.134 [task-runner-0-priority-0] ERROR org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Uncaught Throwable while running task[AbstractTask{id='index_kafka_histogram_25c6328c09f15d7_nofamgdk', groupId='index_kafka_histogram', taskResource=TaskResource{availabilityGroup='index_kafka_histogram_25c6328c09f15d7', requiredCapacity=1}, dataSource='histogram', context={checkpoints={"0":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":0,"25":0,"26":0,"27":0,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":0,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0}}, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}}]
java.lang.OutOfMemoryError: Java heap space
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at org.apache.druid.indexing.worker.executor.ExecutorLifecycle.join(ExecutorLifecycle.java:215)
at org.apache.druid.cli.CliPeon.run(CliPeon.java:295)
at org.apache.druid.cli.Main.main(Main.java:113)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.apache.druid.indexing.worker.executor.ExecutorLifecycle.join(ExecutorLifecycle.java:212)
... 2 more
Caused by: java.lang.OutOfMemoryError: Java heap space
I have tried the hard reset option for the tasks, as well as the answers given in the following question.
Link: druid indexing task fails with OutOfMemory Exception

A few things you should check.
Your ingestion is configured with maxRowsInMemory of 10000 and maxBytesInMemory of 200000. These seem small, and they will cause each task to persist to disk often. All of those intermediate persists must be merged when it comes time to publish segments. Depending on your message size and throughput, there could be a large number of small files to merge, and the merge operation may be running out of memory because of this. You can set maxColumnsToMerge in the tuningConfig of your ingestion spec to limit how many columns are merged at once.
From one of the Druid committers: “the way the setting works is, let's say you have 30 columns per segment, and for each segment you have 110 intermediate persists. during segment generation, we merge all of those intermediate persists. that's 110 * 30 = 3,300 total to merge. each column has memory requirements of about 5KB on heap and 100KB direct (off heap). so that'll require about 16.5MB on heap and 500MB off heap. if you want to limit it, you can set maxColumnsToMerge = 1500 and it will use only about half that.”
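To make the arithmetic in that quote easier to replay with your own numbers, here is a rough sketch. The function name and its parameters are made up for illustration, and the per-column costs (5KB heap, 100KB direct) are simply the approximate figures from the quote, not measured values. Note that 3,300 columns at 100KB each works out to about 330MB of direct memory, which is lower than the 500MB mentioned in the quote, so treat both numbers as ballpark estimates.

```python
def merge_memory_estimate(columns_per_segment, intermediate_persists,
                          heap_kb_per_column=5, direct_kb_per_column=100,
                          max_columns_to_merge=None):
    """Rough estimate of memory used while merging intermediate persists.

    The per-column costs default to the approximate figures quoted above;
    they are assumptions, not values read from Druid.
    """
    total_columns = columns_per_segment * intermediate_persists
    # With a cap, only up to max_columns_to_merge columns are merged per phase.
    if max_columns_to_merge is None:
        merged_at_once = total_columns
    else:
        merged_at_once = min(total_columns, max_columns_to_merge)
    return {
        "total_columns": total_columns,
        "heap_mb": merged_at_once * heap_kb_per_column / 1000,
        "direct_mb": merged_at_once * direct_kb_per_column / 1000,
    }


# The example from the quote: 30 columns, 110 intermediate persists.
uncapped = merge_memory_estimate(30, 110)
capped = merge_memory_estimate(30, 110, max_columns_to_merge=1500)
print(uncapped)
print(capped)
```

With the cap at 1500 the estimate drops to a bit under half of the uncapped one, which matches the "about half" in the quote.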
Additionally, verify that you are not overcommitting memory. The MiddleManager node needs enough memory for worker.capacity * (heap + direct memory) for the tasks, in addition to what the MiddleManager process itself uses. This also assumes that nothing else is running on the node.
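As a quick sanity check of that rule, something like the following can be used. It is only a sketch: the function names are hypothetical, and the sizes in the example (4 task slots, 1GB heap and 1.5GB direct memory per task, 0.25GB for the MiddleManager itself, a 16GB node) are made-up values, so substitute the ones from your own runtime properties and JVM options.

```python
def required_memory_gb(worker_capacity, task_heap_gb, task_direct_gb,
                       middle_manager_gb):
    """Memory needed on a MiddleManager node when every task slot is in use."""
    return worker_capacity * (task_heap_gb + task_direct_gb) + middle_manager_gb


def is_overcommitted(node_memory_gb, worker_capacity, task_heap_gb,
                     task_direct_gb, middle_manager_gb):
    """True if the configured tasks could ask for more memory than the node has."""
    needed = required_memory_gb(worker_capacity, task_heap_gb,
                                task_direct_gb, middle_manager_gb)
    return needed > node_memory_gb


# Hypothetical example values, not taken from the question.
needed = required_memory_gb(worker_capacity=4, task_heap_gb=1.0,
                            task_direct_gb=1.5, middle_manager_gb=0.25)
print(needed, is_overcommitted(16, 4, 1.0, 1.5, 0.25))
```

This leaves no room for the OS or page cache, so in practice you would want some headroom on top of the computed figure.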


Query uses correct index still takes long execution time

I am working on optimizing the mongo query. One of the queries is taking too long to execute on the index. Sharing the code snippet below:
Here is the command copied from Atlas:
"command": {
"getMore": 5992505034453534,
"collection": "data",
"$db": "prod",
"$clusterTime": {
"clusterTime": {
"$timestamp": {
"t": 1670439680,
"i": 1071
"originatingCommand": {
"find": "data",
"filter": {
"accountId": "QQQAAQAQAQAQA",
"custId": "62a7b11fy883bhedge73",
"state": {
"$in": [
"startTime": {
"$lte": {
"$date": "2022-12-07T17:39:28.573Z"
"maxTimeMS": 300000,
"planSummary": [
"accountId": 1,
"custId": 1,
"state": 1,
"startTime": 1
"cursorid": 5992505034144062000,
"keysExamined": 2520,
"docsExamined": 2519,
"cursorExhausted": 1,
"numYields": 130,
"nreturned": 2519,
"reslen": 4898837,
I have the below Index on Mongo:
Index Name: accountId_custId_state_startTime
accountId:1 custId:1 state:1 startTime:1
Atlas Stats:
Index Size: 776.5MB
Usage: 73.58/min
I do not understand why the execution time is so high. Why does the query take 1672 ms?
From an indexing perspective, the operation is perfectly efficient:
"keysExamined": 2520,
"docsExamined": 2519,
"nreturned": 2519,
It only scanned the relevant portion of the index, pulling only documents that were sent back to the client as part of the result set. There is nothing that can be improved from an indexing perspective here. Therefore any observed slowness is likely being caused by "something else".
In general it shouldn't take the database 1.6 seconds to process 2,519 documents (~5MB). But without knowing more about your environment we can't really say anything more specific. Is there a meaningful amount of concurrent workload that may be competing for resources here? Is the cluster itself undersized for the workload? It is notable that the ratio of yields to documents returned seems higher than usual, which could be an indicator of problems like these.
I would recommend looking at the overall health of the cluster and at the other operations that are running at the same time. My impression is that running this operation in isolation would probably result in it executing faster, further suggesting that the problem (and therefore the resolution as well) is somewhere other than the index used by this operation.
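For reference, the ratios discussed above can be computed directly from the numbers in the posted log entry. This is just a small sketch; the `summarize` helper is hypothetical and only divides the counters that appear in the question.

```python
# Counters copied from the log entry in the question.
stats = {
    "keysExamined": 2520,
    "docsExamined": 2519,
    "nreturned": 2519,
    "numYields": 130,
    "reslen": 4898837,
}


def summarize(stats):
    """Derive a few efficiency ratios from slow-query log counters."""
    returned = stats["nreturned"]
    return {
        # Close to 1.0 means the index scan touched almost nothing extra.
        "keys_per_returned_doc": stats["keysExamined"] / returned,
        "docs_per_returned_doc": stats["docsExamined"] / returned,
        # How often the operation yielded, per 100 returned documents.
        "yields_per_100_docs": 100 * stats["numYields"] / returned,
        # Average size of a returned document in bytes.
        "avg_doc_bytes": stats["reslen"] / returned,
    }


summary = summarize(stats)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The first two ratios being essentially 1 is what "perfectly efficient" means here; the yield count is the number worth comparing against other operations on the same cluster.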

What thresholds should be set in Service Fabric Placement / Load balancing config for Cluster with large number of guest executable applications?

I am having trouble with Service Fabric trying to place too many services onto a single node too fast.
To give an example of cluster size, there are 2-4 worker node types, there are 3-6 worker nodes per node type, each node type may run 200 guest executable applications, and each application will have at least 2 replicas. The nodes are more than capable of running the services while running, it is just startup time where CPU is too high.
The problem seems to be the thresholds or defaults for placement and load balancing rules set in the cluster config. As examples of what I have tried: I have turned on InBuildThrottlingEnabled and set InBuildThrottlingGlobalMaxValue to 100, and I have set the Global Movement Throttle settings to various percentages of the total application count.
At this point there are two distinct scenarios I am trying to solve for. In both cases, the nodes go to 100% for an amount of time such that service fabric declares the node as down.
1st: Starting an entire cluster from all nodes being off without overwhelming nodes.
2nd: A single node being overwhelmed by too many services starting after a host comes back online
Here are my current parameters on the cluster:
{
  "Name": "PlacementAndLoadBalancing",
  "Parameters": [
    {
      "Name": "UseMoveCostReports",
      "Value": "true"
    },
    {
      "Name": "PLBRefreshGap",
      "Value": "1"
    },
    {
      "Name": "MinPlacementInterval",
      "Value": "30.0"
    },
    {
      "Name": "MinLoadBalancingInterval",
      "Value": "30.0"
    },
    {
      "Name": "MinConstraintCheckInterval",
      "Value": "30.0"
    },
    {
      "Name": "GlobalMovementThrottleThresholdForPlacement",
      "Value": "25"
    },
    {
      "Name": "GlobalMovementThrottleThresholdForBalancing",
      "Value": "25"
    },
    {
      "Name": "GlobalMovementThrottleThreshold",
      "Value": "25"
    },
    {
      "Name": "GlobalMovementThrottleCountingInterval",
      "Value": "450"
    },
    {
      "Name": "InBuildThrottlingEnabled",
      "Value": "false"
    },
    {
      "Name": "InBuildThrottlingGlobalMaxValue",
      "Value": "100"
    }
  ]
}
Based on discussion in answer below, wanted to leave a graph-image: if a node goes down, the act of shuffling services on to the remaining nodes will cause a second node to go down, as noted here. Green node goes down, then purple goes down due to too many resources being shuffled onto it.
From SF's perspective, 1 & 2 are the same problem. Also as a note, SF doesn't evict a node just because CPU consumption is high. So: "The nodes go to 100% for an amount of time such that service fabric declares the node as down." needs some more explanation. The machines might be failing for other reasons, or I guess could be so loaded that the kernel level failure detectors can't ping other machines, but that isn't very common.
For config changes: I would remove all of these to go with the defaults:
{
  "Name": "PLBRefreshGap",
  "Value": "1"
},
{
  "Name": "MinPlacementInterval",
  "Value": "30.0"
},
{
  "Name": "MinLoadBalancingInterval",
  "Value": "30.0"
},
{
  "Name": "MinConstraintCheckInterval",
  "Value": "30.0"
}
For the inbuild throttle to work, this needs to flip to true:
{
  "Name": "InBuildThrottlingEnabled",
  "Value": "false"
}
Also, since these are likely constraint violations and placement (not proactive rebalancing), we need to explicitly instruct SF to throttle those operations as well. There is config for this in SF. It is not documented or publicly supported at this time, but you can see it in the settings. By default only balancing is throttled, but you should be able to turn on throttling for all phases and set appropriate limits via something like the below.
These first two settings are also within PlacementAndLoadBalancing, like the ones above.
{
  "Name": "ThrottlePlacementPhase",
  "Value": "true"
},
{
  "Name": "ThrottleConstraintCheckPhase",
  "Value": "true"
}
These next settings, which set the limits, are in their own sections. Each is a map from node type name to the limit you want to throttle at for that node type.
{
  "name": "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
},
{
  "name": "MaximumInBuildReplicasPerNodePlacementThrottle",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
},
{
  "name": "MaximumInBuildReplicasPerNodeBalancingThrottle",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
},
{
  "name": "MaximumInBuildReplicasPerNode",
  "parameters": [
    {
      "name": "YourNodeTypeNameHere",
      "value": "100"
    },
    {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
    }
  ]
}
I would make these changes and then try again. Additional information like what is actually causing the nodes to be down (confirmed via events and SF health info) would help identify the source of the problem. It would probably also be good to verify that starting 100 instances of the apps on the node actually works and whether that's an appropriate threshold.
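Since the four per-node-type throttle sections repeat the same shape for every node type, it may be easier to generate them than to edit them by hand. The following is only a sketch: the helper name is hypothetical, the output simply mirrors the name/parameters layout shown above, and the limit of 100 is the placeholder from this answer, not a recommended value.

```python
import json

# Section names as given in the answer above.
THROTTLE_SECTIONS = [
    "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
    "MaximumInBuildReplicasPerNodePlacementThrottle",
    "MaximumInBuildReplicasPerNodeBalancingThrottle",
    "MaximumInBuildReplicasPerNode",
]


def build_throttle_sections(limits_by_node_type):
    """Build one settings section per throttle, mapping node type to limit."""
    return [
        {
            "name": section,
            "parameters": [
                {"name": node_type, "value": str(limit)}
                for node_type, limit in limits_by_node_type.items()
            ],
        }
        for section in THROTTLE_SECTIONS
    ]


sections = build_throttle_sections({
    "YourNodeTypeNameHere": 100,
    "YourOtherNodeTypeNameHere": 100,
})
print(json.dumps(sections, indent=2))
```

The printed sections can then be pasted into the cluster settings next to PlacementAndLoadBalancing, after checking the limit against what a single node can actually start at once.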

Druid GroupBy query gives different response when changing the order by fields

I have a question regarding an Apache Druid incubating query.
I have a simple group by to select the number of calls per operator. See here my query:
{
  "queryType": "groupBy",
  "dataSource": "ivr-calls",
  "intervals": [ ... ],
  "dimensions": [
    {
      "type": "lookup",
      "dimension": "operator_id",
      "outputName": "value",
      "name": "ivr_operator",
      "replaceMissingValueWith": "Unknown"
    },
    {
      "type": "default",
      "dimension": "operator_id",
      "outputType": "long",
      "outputName": "id"
    }
  ],
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "calls",
      "fieldName": "calls"
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 999999,
    "columns": [
      {
        "dimension": "value",
        "direction": "ascending",
        "dimensionOrder": "numeric"
      }
    ]
  }
}
In this query I order the result by the "value" dimension, and I receive 218 results.
I noticed that some of the records are duplicates (I see some operators twice in my result set). This is strange because, in my experience, all selected dimensions are also used for grouping, so each combination should be unique.
If I add an order by to the "id" dimension, I receive 183 results (which is expected):
"columns": [
  {
    "dimension": "value",
    "direction": "ascending",
    "dimensionOrder": "numeric"
  },
  {
    "dimension": "id",
    "direction": "ascending",
    "dimensionOrder": "numeric"
  }
]
The documentation tells me nothing about this strange behavior (https://druid.apache.org/docs/latest/querying/limitspec.html).
My previous experience with druid is that the order by is just "ordering".
I am running druid version 0.15.0-incubating-iap9.
Can anybody tell me why there is a difference in the result set based on the column sorting?
I resolved this problem for now by specifying all columns in my order by.
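The workaround can be automated so that the order by always covers every dimension in the query. This is a minimal sketch with a hypothetical helper name; it only builds the "columns" list of the limitSpec from the query's dimension specs and reuses the direction and dimensionOrder values from the query above.

```python
def order_by_all_dimensions(dimensions, direction="ascending",
                            dimension_order="numeric"):
    """Build limitSpec columns that order by every dimension in the query."""
    columns = []
    for dim in dimensions:
        # A dimension spec is either a plain column name or an object
        # whose output column is given by "outputName".
        if isinstance(dim, str):
            name = dim
        else:
            name = dim.get("outputName", dim.get("dimension"))
        columns.append({
            "dimension": name,
            "direction": direction,
            "dimensionOrder": dimension_order,
        })
    return columns


dimensions = [
    {"type": "lookup", "dimension": "operator_id", "outputName": "value",
     "name": "ivr_operator", "replaceMissingValueWith": "Unknown"},
    {"type": "default", "dimension": "operator_id", "outputType": "long",
     "outputName": "id"},
]
columns = order_by_all_dimensions(dimensions)
print(columns)
```

The result can be assigned to `limitSpec["columns"]` before the query is sent.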
The issue seems to be related to a bug in Druid. See: https://github.com/apache/incubator-druid/issues/9000

How to measure per user bandwidth usage on google cloud storage?

We want to charge users based on the amount of traffic their data generates, specifically the downstream bandwidth their data consumes.
I have exported google cloud storage access_logs. From the logs, I can count the number of times a file is accessed. (filesize * count will be the bandwidth usage)
But the problem is that this doesn't work well with cached content. My calculated value is much more than the actual usage.
I went with this method because our traffic will be new and won't use the cache, which means that the difference won't matter. But in reality, it seems like it is a real problem.
This is a common use case and I think there should be a better way to solve this problem with google cloud storage.
{
  "insertId": "-tohip8e1vmvw",
  "logName": "projects/bucket/logs/cloudaudit.googleapis.com%2Fdata_access",
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      "principalEmail": "firebase-storage@system.gserviceaccount.com"
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "storage.objects.get",
        "resource": "projects/_/bucket/bucket.appspot.com/objects/users/2y7aPImLYeTsCt6X0dwNMlW9K5h1/somefile",
        "resourceAttributes": {}
      },
      {
        "granted": true,
        "permission": "storage.objects.getIamPolicy",
        "resource": "projects/_/bucket/bucket.appspot.com/objects/users/2y7aPImLYeTsCt6X0dwNMlW9K5h1/somefile",
        "resourceAttributes": {}
      }
    ],
    "methodName": "storage.objects.get",
    "requestMetadata": {
      "destinationAttributes": {},
      "requestAttributes": {
        "auth": {},
        "time": "2019-07-02T11:58:36.068Z"
      }
    },
    "resourceLocation": {
      "currentLocations": [ ... ]
    },
    "resourceName": "projects/_/bucket/bucket.appspot.com/objects/users/2y7aPImLYeTsCt6X0dwNMlW9K5h1/somefile",
    "serviceName": "storage.googleapis.com",
    "status": {}
  },
  "receiveTimestamp": "2019-07-02T11:58:36.412798307Z",
  "resource": {
    "labels": {
      "bucket_name": "bucket.appspot.com",
      "location": "eu",
      "project_id": "project-id"
    },
    "type": "gcs_bucket"
  },
  "severity": "INFO",
  "timestamp": "2019-07-02T11:58:36.062Z"
}
The above is a sample entry from the log.
We are using a single bucket for now. Can also use multiple if it helps.
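For context, the filesize * count calculation described above looks roughly like the sketch below. The helper name and the `object_sizes` lookup are hypothetical: log entries like the sample do not carry a byte count, so the object size has to come from somewhere else. The user id is assumed to be the path segment after `objects/users/`, as in the sample entry. This has the same weakness described above, since it counts every logged read at full size whether or not the bytes were actually transferred.

```python
import re
from collections import defaultdict

# Assumes object paths of the form .../objects/users/<user id>/<file>.
USER_PATH = re.compile(r"/objects/users/([^/]+)/")


def bytes_per_user(log_entries, object_sizes):
    """Sum object sizes per user over all storage.objects.get log entries."""
    totals = defaultdict(int)
    for entry in log_entries:
        payload = entry.get("protoPayload", {})
        if payload.get("methodName") != "storage.objects.get":
            continue
        resource_name = payload.get("resourceName", "")
        match = USER_PATH.search(resource_name)
        if match is None:
            continue
        # Unknown objects count as 0 bytes instead of raising.
        totals[match.group(1)] += object_sizes.get(resource_name, 0)
    return dict(totals)


name = ("projects/_/bucket/bucket.appspot.com/objects/users/"
        "2y7aPImLYeTsCt6X0dwNMlW9K5h1/somefile")
entry = {"protoPayload": {"methodName": "storage.objects.get",
                          "resourceName": name}}
usage = bytes_per_user([entry, entry], {name: 1000})
print(usage)
```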
One possibility is to have a separate bucket for each user and get the bucket's bandwidth usage through timeseries api.
The endpoint for this purpose is:
The following parameters return the bytes sent per hour (any alignment period of 60s or more can be specified); summing the values gives the total bytes sent from the bucket.
{
  "dataSets": [
    {
      "timeSeriesFilter": {
        "filter": "metric.type=\"storage.googleapis.com/network/sent_bytes_count\" resource.type=\"gcs_bucket\" resource.label.\"project_id\"=\"<<<< project id here >>>>\" resource.label.\"bucket_name\"=\"<<<< bucket name here >>>>\"",
        "perSeriesAligner": "ALIGN_SUM",
        "crossSeriesReducer": "REDUCE_SUM",
        "secondaryCrossSeriesReducer": "REDUCE_SUM",
        "minAlignmentPeriod": "3600s",
        "groupByFields": [ ... ],
        "unitOverride": "By"
      },
      "targetAxis": "Y1",
      "plotType": "LINE",
      "legendTemplate": "${resource.labels.bucket_name}"
    }
  ],
  "options": {
    "mode": "COLOR"
  },
  "constantLines": [],
  "timeshiftDuration": "0s",
  "y1Axis": {
    "label": "y1Axis",
    "scale": "LINEAR"
  }
}

What is the DNS Record logged by kubedns?

I'm using Google Container Engine and I'm noticing entries like the following in my logs:
{
  "insertId": "1qfzyonf2z1q0m",
  "internalId": {
    "projectNumber": "1009253435077"
  },
  "labels": {
    "compute.googleapis.com/resource_id": "710923338689591312",
    "compute.googleapis.com/resource_name": "fluentd-cloud-logging-gke-gas2016-4fe456307445d52d-worker-pool-",
    "compute.googleapis.com/resource_type": "instance",
    "container.googleapis.com/cluster_name": "gas2016-4fe456307445d52d",
    "container.googleapis.com/container_name": "kubedns",
    "container.googleapis.com/instance_id": "710923338689591312",
    "container.googleapis.com/namespace_name": "kube-system",
    "container.googleapis.com/pod_name": "kube-dns-v17-e4rr2",
    "container.googleapis.com/stream": "stderr"
  },
  "logName": "projects/cml-236417448818/logs/kubedns",
  "resource": {
    "labels": {
      "cluster_name": "gas2016-4fe456307445d52d",
      "container_name": "kubedns",
      "instance_id": "710923338689591312",
      "namespace_id": "kube-system",
      "pod_id": "kube-dns-v17-e4rr2",
      "zone": "us-central1-f"
    },
    "type": "container"
  },
  "severity": "ERROR",
  "textPayload": "I0718 17:05:20.552572 1 dns.go:660] DNS Record:&{worker-7.default.svc.cluster.local. 6000 10 10 false 30 0 }, hash:f97f8525\n",
  "timestamp": "2016-07-18T17:05:20.000Z"
}
Is this an actual error or is the severity incorrect? Where can I find the definition for the struct that is being printed?
The severity is incorrect. This is tracing/debugging output that shouldn't have been left in the binary. It was removed from the code after the 1.3 release was cut, so it will be gone in a future release.
See also: Google container engine cluster showing large number of dns errors in logs