Druid indexing task fails with OutOfMemory exception

I have created a Druid cluster and submitted an indexing task. It looks like there is reducer skew: the indexing task gets stuck at reduce 99% and then fails with the error below.
2018-03-27T21:14:30,349 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 96%
2018-03-27T21:14:33,353 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 97%
2018-03-27T21:15:18,418 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 98%
2018-03-27T21:26:05,358 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 99%
2018-03-27T21:37:04,261 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 100%
2018-03-27T21:42:34,690 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1522166154803_0010_r_000001_3, Status : FAILED
Container [pid=111411,containerID=container_1522166154803_0010_01_000388] is running beyond physical memory limits. Current usage: 7.9 GB of 7.4 GB physical memory used; 10.8 GB of 36.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1522166154803_0010_01_000388 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 111411 111408 111411 111411 (bash) 1 2 115810304 696 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx6042m -Ddruid.storage.bucket=dish-Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1522166154803_0010/container_1522166154803_0010_01_000388/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1522166154803_0010/container_1522166154803_0010_01_000388 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0 org.apache.hadoop.mapred.YarnChild 10.176.225.139 35084 attempt_1522166154803_0010_r_000001_3 388 1>/var/log/hadoop-yarn/containers/application_1522166154803_0010/container_1522166154803_0010_01_000388/stdout 2>/var/log/hadoop-yarn/containers/application_1522166154803_0010/container_1522166154803_0010_01_000388/stderr
|- 111591 111411 111411 111411 (java) 323692 28249 11526840320 2058251 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1522166154803_0010/container_1522166154803_0010_01_000388/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1522166154803_0010/container_1522166154803_0010_01_000388 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0 org.apache.hadoop.mapred.YarnChild 10.176.225.139 35084 attempt_1522166154803_0010_r_000001_3 388
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I have checked my yarn-site.xml; below is my configuration.
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>241664</value>
</property>
Below is my index configuration. The data I am trying to load is only for the day 2018-04-04.
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "viewership",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "event_date",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["network_group","show_name","time_of_day","viewing_type","core_latino","dma_name","legacy_unit","presence_of_kids","head_of_hhold_age","prin","sys","tenure_years","vip_w_dvr","vip_wo_dvr","network_rank","needs_based_segment","hopper","core_english","star_status","day_of_week"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "longSum",
          "name" : "time_watched",
          "fieldName" : "time_watched"
        },
        {
          "type" : "cardinality",
          "name" : "distinct_accounts",
          "fields" : [ "account_id" ]
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2017-04-03/2017-04-16" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/user/hadoop/"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 4000000,
        "assumeGrouped": true
      },
      "useCombiner": true,
      "buildV9Directly": true,
      "numBackgroundPersistThreads": 1
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.19"]
}

I also faced this issue in my early days with Druid MapReduce jobs.
The property yarn.scheduler.maximum-allocation-mb: 241664 only sets the maximum container size YARN is allowed to allocate. The problem here is the container size actually allocated to each map/reduce task. Check the defaults for mapreduce.map.memory.mb and mapreduce.reduce.memory.mb. You should also tweak the split size to control how much data each container processes.
I have used the following "jobProperties" in the Druid index job JSON:
"jobProperties":{
"mapreduce.map.memory.mb" : "8192",
"mapreduce.reduce.memory.mb" : "18288",
"mapreduce.input.fileinputformat.split.minsize" : "125829120",
"mapreduce.input.fileinputformat.split.maxsize" : "268435456"
}
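For reference, these jobProperties go inside the tuningConfig of the Hadoop indexing spec. A minimal sketch under assumptions, not the poster's exact spec: the java.opts heap sizes are my own addition, set to roughly 80% of the container sizes, and should be tuned to your cluster.

"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "mapreduce.map.memory.mb" : "8192",
    "mapreduce.map.java.opts" : "-Xmx6553m",
    "mapreduce.reduce.memory.mb" : "18288",
    "mapreduce.reduce.java.opts" : "-Xmx14630m",
    "mapreduce.input.fileinputformat.split.minsize" : "125829120",
    "mapreduce.input.fileinputformat.split.maxsize" : "268435456"
  }
}

Keeping the JVM heap below the container limit matters because YARN kills the container (exit code 143) as soon as the process tree exceeds the configured physical memory, as in the log above.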

Either you need to increase the memory (or give it more virtual memory), or, a better approach: spawn multiple ingestion tasks, each covering a smaller interval, e.g. a single day:
"intervals" : [ "2017-04-03/2017-04-04" ]
and so on.
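A minimal sketch of that approach, assuming the rest of the spec stays unchanged: each task carries a single-day interval in its granularitySpec, and the next task covers the following day.

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "intervals" : [ "2017-04-03/2017-04-04" ]
}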

Related

Poor performance on bulk deleting a large collection in MongoDB

I have a single standalone mongo installation on a Linux machine.
The database contains a collection with 181 million documents. This collection is by far the largest collection in the database (approx. 90%).
The size of the collection is currently 3.5 TB.
I'm running Mongo version 4.0.10 (WiredTiger).
The collection has 2 indexes:
One on id
One on 2 fields, used when deleting documents (see the snippet below).
When benchmarking bulk deletion on this collection we used the following snippet:
db.getCollection('Image').deleteMany({
    $and: [
        { "CameraId": 1 },
        { "SequenceNumber": { $lt: 153000000 } }
    ]
})
To see the state of the deletion operation I ran a simple test of deleting 1000 documents while watching the operation with currentOp(). It shows the following:
"command" : {
"q" : {
"$and" : [
{
"CameraId" : 1.0
},
{
"SequenceNumber" : {
"$lt" : 153040000.0
}
}
]
},
"limit" : 0
},
"planSummary" : "IXSCAN { CameraId: 1, SequenceNumber: 1 }",
"numYields" : 876,
"locks" : {
"Global" : "w",
"Database" : "w",
"Collection" : "w"
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(877),
"w" : NumberLong(877)
}
},
"Database" : {
"acquireCount" : {
"w" : NumberLong(877)
}
},
"Collection" : {
"acquireCount" : {
"w" : NumberLong(877)
}
}
}
It seems to be using the correct index, but the number and type of locks worry me. As I interpret this, it acquires 1 global lock for each deleted document from a single collection.
Using this approach, it has taken over a week to delete 40 million documents. This cannot be the expected performance.
I realise other designs exist, such as bundling documents into larger chunks and storing them using GridFS, but the current design is what it is, and I want to make sure that what I see is expected before changing my design, restructuring the data, or even considering clustering, etc.
Any suggestions on how to increase performance of bulk deletions, or is this expected?

Service unavailable error while using MongoDB, ElasticSearch and transporter

I am trying to use the transporter plugin to create a pipeline to sync a MongoDB database with Elasticsearch. I am using a Linux virtual machine (Ubuntu) for this.
I have created a MongoDB collection my_application with the following data in it:
db.users.find().pretty();
{
"_id" : ObjectId("6008153cf979ac0f18681765"),
"firstName" : "Sammy",
"lastName" : "Shark"
}
{
"_id" : ObjectId("60081544f979ac0f18681766"),
"firstName" : "Gilly",
"lastName" : "Glowfish"
}
I configured ElasticSearch and the transporter pipeline and now exported MongoDB_URI and Elastic_URI.
I then ran my transporter pipeline.js to obtain this:
INFO[0005] metrics source records: 2 path=source ts=1611154492641006368
INFO[0005] metrics source/sink records: 2 path="source/sink" ts=1611154492641013556
I then try to view my ElasticSearch but get this error:
curl $ELASTICSEARCH_URI/_search?pretty=true
{
"error" : {
"root_cause" : [
{
"type" : "cluster_block_exception",
"reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
}
],
"type" : "cluster_block_exception",
"reason" : "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"
},
"status" : 503
}
Here is my elasticsearch.yml:
# Use a descriptive name for the node:
node.name: node-1
path.data: /var/lib/elasticsearch
# Path to log files:
path.logs: /var/log/elasticsearch
# Set the bind address to a specific IP (IPv4 or IPv6):
network.host: 0.0.0.0
# Set a custom port for HTTP:
http.port: 9200
# Bootstrap the cluster using an initial set of master-eligible nodes:
cluster.initial_master_nodes: ["node-1", "node-2"]
Here is my elasticsearch node:
{
"name" : "node-1",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "_na_",
"version" : {
"number" : "7.7.1",
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
"build_date" : "2020-05-28T16:30:01.040088Z",
"build_snapshot" : false,
"lucene_version" : "8.5.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
I have tried deleting indices and restarting the server, but the error repeats. I would like to know the solution to this. I am using Elasticsearch 7.10.
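The error blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized] generally means the node has not recovered its cluster state, typically because no master has been elected; with cluster.initial_master_nodes listing a node-2 that may not exist, a single node can fail to bootstrap. A quick diagnostic sketch, assuming the same $ELASTICSEARCH_URI as above:

# Check whether the cluster state has recovered and a master has been elected
curl "$ELASTICSEARCH_URI/_cluster/health?pretty"
curl "$ELASTICSEARCH_URI/_cat/master?v"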

Getting MongoDB error on Write - quota exceeded

I have a Java app which writes to a replica set. I am using MongoDB server version 3.0.7. The MongoDB Java driver is 3.0.4.
It was working just fine, but now I am getting the following error on ALL writes:
com.mongodb.MongoWriteException: quota exceeded
at com.mongodb.MongoCollectionImpl.executeSingleWriteRequest(MongoCollectionImpl.java:487)
at com.mongodb.MongoCollectionImpl.update(MongoCollectionImpl.java:474)
at com.mongodb.MongoCollectionImpl.updateOne(MongoCollectionImpl.java:325)
I have looked at the MongoDB config documentation and I have not set any quota limits in mongod.conf. I am not using smallFiles either.
But I think I am running into file size limits. The file sizes for my DB are as follows:
total 6240548
-rw------- 1 mongod mongod 67108864 Dec 8 16:55 2015.0
-rw------- 1 mongod mongod 134217728 Dec 8 12:15 2015.1
-rw------- 1 mongod mongod 268435456 Dec 8 12:15 2015.2
-rw------- 1 mongod mongod 536870912 Dec 8 12:15 2015.3
-rw------- 1 mongod mongod 1073741824 Dec 8 12:15 2015.4
-rw------- 1 mongod mongod 2146435072 Dec 8 16:06 2015.5
-rw------- 1 mongod mongod 2146435072 Dec 8 16:06 2015.6
-rw------- 1 mongod mongod 16777216 Dec 8 16:55 2015.ns
drwxr-xr-x 2 mongod mongod 4096 Dec 7 09:14 _tmp
The /etc/mongod.conf is as follows:
storage:
  dbPath: /data/mongoDB
  indexBuildRetry: true
  repairPath: /data/mongoDB/repair
  journal:
    enabled: true
  directoryPerDB: true
  syncPeriodSecs: 60
  engine: mmapv1
  mmapv1:
    preallocDataFiles: false
    nsSize: 16
    quota:
      enforced: false
      maxFilesPerDB: 8
    smallFiles: false
    journal:
      debugFlags: 0
      commitIntervalMs: 100
What can be going wrong?
PS: /etc/mongod.conf is being used.
mongod 10864 1 0 Nov16 ? 03:10:34 /usr/bin/mongod -f /etc/mongod.conf
Update 1: Tried the same update after changing the collection name to a new collection. That worked, but it doesn't explain the issue yet. 2. Changed the Java driver to 3.0.3; that didn't help.
Update 2 (12/9): Adding collection stats, since it seems to be something to do with the collection itself and the Java driver. Let me know if something seems awry, please.
{
"ns" : "2015.events",
"count" : 827054,
"size" : 3814018,
"avgObjSize" : 4722,
"numExtents" : 22,
"extents" : [
{
"len" : 8192,
"loc: " : {
"file" : 0,
"offset" : 20480
}
},
{
"len" : 32768,
"loc: " : {
"file" : 0,
"offset" : 2134016
}
},
{
"len" : 131072,
"loc: " : {
"file" : 0,
"offset" : 2166784
}
},
{
"len" : 524288,
"loc: " : {
"file" : 0,
"offset" : 2297856
}
},
{
"len" : 2097152,
"loc: " : {
"file" : 0,
"offset" : 2822144
}
},
{
"len" : 8388608,
"loc: " : {
"file" : 0,
"offset" : 4919296
}
},
{
"len" : 11325440,
"loc: " : {
"file" : 0,
"offset" : 14356480
}
},
{
"len" : 15290368,
"loc: " : {
"file" : 0,
"offset" : 28827648
}
},
{
"len" : 20643840,
"loc: " : {
"file" : 0,
"offset" : 44118016
}
},
{
"len" : 27869184,
"loc: " : {
"file" : 1,
"offset" : 8192
}
},
{
"len" : 37625856,
"loc: " : {
"file" : 1,
"offset" : 36265984
}
},
{
"len" : 50798592,
"loc: " : {
"file" : 1,
"offset" : 78086144
}
},
{
"len" : 68579328,
"loc: " : {
"file" : 2,
"offset" : 8396800
}
},
{
"len" : 92585984,
"loc: " : {
"file" : 2,
"offset" : 93753344
}
},
{
"len" : 124993536,
"loc: " : {
"file" : 3,
"offset" : 8192
}
},
{
"len" : 168742912,
"loc: " : {
"file" : 3,
"offset" : 125001728
}
},
{
"len" : 227803136,
"loc: " : {
"file" : 4,
"offset" : 8192
}
},
{
"len" : 307535872,
"loc: " : {
"file" : 4,
"offset" : 239136768
}
},
{
"len" : 415174656,
"loc: " : {
"file" : 4,
"offset" : 595939328
}
},
{
"len" : 560488448,
"loc: " : {
"file" : 5,
"offset" : 8192
}
},
{
"len" : 756662272,
"loc: " : {
"file" : 5,
"offset" : 607756288
}
},
{
"len" : 1021497344,
"loc: " : {
"file" : 6,
"offset" : 8192
}
}
],
"storageSize" : 3826952,
"lastExtentSize" : 997556,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 8,
"indexDetails" : {
},
"totalIndexSize" : 348270,
"indexSizes" : {
"_id_" : 47123,
"remoteRequest.uri_1" : 76969,
"transactionId_1" : 59611,
"startTime_1" : 39027,
"endTime_1" : 35235,
"remoteRequest.queryParams.q_1" : 20328,
"remoteRequest.queryParams.fq_1" : 38285,
"elapsedTimeInNanos_1" : 31689
},
"ok" : 1
}
Your mongod.conf has a quota enabled for each database. Based on that mongod.conf file, you will be unable to create more than 8 database files, which limits you to a maximum of about 6.4 GB of storage. You mention that you are able to get around this issue by using a new collection, so I am interested in what your data directory looks like now. I would not expect you to be able to bypass this hard limit; however, due to internal data structures, it may be possible to "bypass" it for a short time.
You can verify how much actual data is being stored by running the dbStats command:
use 2015
db.stats(1024*1024)
This output will tell you how much data you actually have in the db versus the amount of allocated storage. These numbers will not match; this is expected, as documents include empty space for padding.
My next question would be: is there a reason you are artificially limiting the amount of storage space your mongod can allocate? Perhaps a capped collection would better suit your needs? If you could expand on your use case, I can perhaps give you a better answer.
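If the goal is simply to bound how much space the data can take, a capped collection is one alternative to per-database file quotas. A minimal sketch; the collection name and size are placeholders, not from the original post:

// Create a capped collection limited to ~6 GB; once the size limit is
// reached, the oldest documents are overwritten automatically.
db.createCollection("events_capped", { capped: true, size: 6 * 1024 * 1024 * 1024 })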

mongodb single node performance

I use MongoDB for an internal ADMIN type of application used by my team.
Mongo is installed on one box, with no replica sets.
The ADMIN application inserts 70K to 100K documents per day and we maintain 4 months of data. The DB has ~100 million documents at any given time.
When the application was deployed, everything ran fine for a few days. As the data accumulated toward the 4-month limit, I started seeing severe performance issues with MongoDB.
I installed MongoDB 3.0.4 as-is on a Linux box and did not fine-tune any optimization settings.
Are there any optimization settings I need to adjust?
The ADMIN application has schedulers which run every half hour to insert and purge outdated data. For the collection below, with indexes defined on createdDate, env, messageId, and sourceSystem, I see a few queries taking 30 minutes to respond.
Sample query: count of documents with a given env and sourceSystem, within a given range of dates (see the shell sketch after the sample document below). The ADMIN app uses Grails and this query is created using GORM. It worked fine in the beginning, but over time performance degraded. I tried restarting the application as well; it didn't help. I believe using MongoDB as-is (like a dev mode) might be causing the performance issue. Any suggestions on what to tweak in the settings (perhaps CPU/memory limits etc.)?
{
"_id" : ObjectId("5575e388e4b001976b5e570f"),
"createdDate" : ISODate("2015-06-07T05:00:34.040Z"),
"env" : "prod",
"messageId" : "f684b34d-a480-42a0-a7b8-69d6d18f39e5",
"payload" : "JSON or XML DATA",
"sourceSystem" : "sourceModule"
}
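For reference, the sample query described above written out in the mongo shell. This is only a sketch: the exact date range and the explain() call are my additions, not from the original post.

// Count of documents for a given env and sourceSystem within a date range
db.Message.find({
    env: "prod",
    sourceSystem: "sourceModule",
    createdDate: { $gte: ISODate("2015-06-01T00:00:00Z"),
                   $lt:  ISODate("2015-06-08T00:00:00Z") }
}).count()

// The same predicate with explain() shows which of the single-field
// indexes listed below the planner actually chooses
db.Message.find({
    env: "prod",
    sourceSystem: "sourceModule",
    createdDate: { $gte: ISODate("2015-06-01T00:00:00Z"),
                   $lt:  ISODate("2015-06-08T00:00:00Z") }
}).explain("executionStats")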
Update:
Indices:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"messageId" : 1
},
"name" : "messageId_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"createdDate" : 1
},
"name" : "createdDate_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"sourceSystem" : 1
},
"name" : "sourceSystem_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"env" : 1
},
"name" : "env_1",
"ns" : "admin.Message"
}
]

MongoDB CPU pegged; profiler returns "profile line too large"

Every 6-12 hours, my MongoDB CPU is pegged (100% CPU usage).
I've enabled profiling. The last time it returned this:
PRIMARY> db.system.profile.find().sort({$natural:-1});
{ "ts" : ISODate("2012-11-08T05:31:09.042Z"), "client" : "10.188.14.195", "user" : "", "err" : "profile line too large (max is 100KB)" }
Not very helpful, unfortunately.
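For context, the profiler level and slow-op threshold can be tuned so that slow operations are still captured in system.profile; a sketch of the usual commands (the 100 ms threshold is an arbitrary example, not from the original post):

// Level 1 profiles only operations slower than the given threshold (ms)
db.setProfilingLevel(1, 100)

// Most recent slow operations, newest first
db.system.profile.find({ millis: { $gt: 100 } }).sort({ $natural: -1 }).limit(10)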
I tried running db.currentOp() while it was pegged and got this:
{
"opid" : 18256845,
"active" : true,
"lockType" : "write",
"waitingForLock" : false,
"secs_running" : 803653,
"op" : "none",
"ns" : "streamified.credentials",
"query" : {
},
"client" : "",
"desc" : "rsSync",
"threadId" : "0x7f3b865f7700",
"numYields" : 1
},
This indicates that the operation had been alive for over 800,000 seconds (far before the CPU was pegged). The operation remained even after the CPU returned to normal.
What is the best way to determine exactly which query (or, at the very least, which collection) is causing the CPU to become pegged?