Not able to load data in Druid cluster setup

I'm getting this error while loading data into my Druid cluster. I'm doing a streaming (Tranquility) load, and I'm not able to load batch data either. The machines have the memory listed below. Can someone suggest a solution?
These are the machines I'm using:
1.) 4 vCPUs, 15 GB RAM, 80 GB SSD storage
2.) 8 vCPUs, 61 GB RAM, 160 GB SSD storage
3.) 8 vCPUs, 61 GB RAM, 160 GB SSD storage
{
  "dataSource" : "metrics",
  "task" : "index_realtime_metrics_2019-01-15T09:00:00.000Z_0_0",
  "status" : "TaskFailed"
}
2019-01-15 09:14:39,520 [Hashed wheel timer #1] INFO c.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2019-01-15T09:14:39.520Z","service":"tranquility","host":"localhost","severity":"anomaly","description":"Loss of Druid redundancy: metrics","data":{"dataSource":"metrics","task":"index_realtime_metrics_2019-01-15T09:00:00.000Z_0_0","status":"TaskFailed"}}]
2019-01-15 09:14:39,521 [Hashed wheel timer #1] WARN c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Beam defunct: druid:overlord/metrics
{
  "eventCount" : 1,
  "timestamp" : "2019-01-15T09:00:00.000Z",
  "beam" : "MergingPartitioningBeam(DruidBeam(interval = 2019-01-15T09:00:00.000Z/2019-01-15T10:00:00.000Z, partition = 0, tasks = [index_realtime_metrics_2019-01-15T09:00:00.000Z_0_0/metrics-009-0000-0000]))"
}
com.metamx.tranquility.beam.DefunctBeamException: Tasks are all gone: index_realtime_metrics_2019-01-15T09:00:00.000Z_0_0
at com.metamx.tranquility.druid.DruidBeam$$anonfun$sendAll$2$$anonfun$6$$anonfun$apply$6.apply(DruidBeam.scala:115) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.druid.DruidBeam$$anonfun$sendAll$2$$anonfun$6$$anonfun$apply$6.apply(DruidBeam.scala:115) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at scala.Option.getOrElse(Option.scala:121) ~[org.scala-lang.scala-library-2.11.7.jar:na]
at com.metamx.tranquility.druid.DruidBeam$$anonfun$sendAll$2$$anonfun$6.apply(DruidBeam.scala:112) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.druid.DruidBeam$$anonfun$sendAll$2$$anonfun$6.apply(DruidBeam.scala:109) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.twitter.util.Future$$anonfun$map$1$$anonfun$apply$6.apply(Future.scala:950) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Try$.apply(Try.scala:13) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Future$.apply(Future.scala:97) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Future$$anonfun$map$1.apply(Future.scala:950) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Future$$anonfun$map$1.apply(Future.scala:949) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise$Transformer.liftedTree1$1(Promise.scala:112) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise$Transformer.k(Promise.scala:112) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise$Transformer.apply(Promise.scala:122) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise$Transformer.apply(Promise.scala:103) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise$$anon$1.run(Promise.scala:366) ~[com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.concurrent.LocalScheduler$Activation.run(Scheduler.scala:178) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.concurrent.LocalScheduler$Activation.submit(Scheduler.scala:136) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.concurrent.LocalScheduler.submit(Scheduler.scala:207) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.concurrent.Scheduler$.submit(Scheduler.scala:92) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise.runq(Promise.scala:350) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise.updateIfEmpty(Promise.scala:726) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise.link(Promise.scala:793) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
at com.twitter.util.Promise.become(Promise.scala:658) [com.twitter.util-core_2.11-6.30.0.jar:6.30.0]
:

Can you share the indexing task logs?
According to https://groups.google.com/forum/#!topic/druid-user/yZEnAk0iKr4 and https://groups.google.com/forum/#!topic/druid-user/ZBTU08VNp2o, it might be a lack of memory on the MiddleManager.
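If the MiddleManager's peons are running out of memory, the place to look is the MiddleManager's runtime.properties and the task log itself, which normally shows the OutOfMemoryError that made the task disappear. A minimal sketch of the kind of settings to review, with illustrative values that are assumptions, not recommendations:
# middleManager/runtime.properties (illustrative values, adjust to your hardware)
# Number of tasks (peons) this MiddleManager may run at once
druid.worker.capacity=3
# Heap handed to each peon task
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8
# Per-peon processing buffer; capacity x (heap + direct memory) has to fit in the machine's RAM
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=256000000
On the 15 GB machine in particular, a high worker capacity combined with a large per-task heap can exhaust memory, which then surfaces in Tranquility as the DefunctBeamException above.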

Related

How to obtain a specific WMI metric out of a given list

I would like to monitor two specific metrics from an NVIDIA card (Encoder and Decoder Usage). From the NVIDIA manifest I've copied the following lines into a PowerShell script:
$gpus = Get-WmiObject -Namespace "root\cimv2\nv" -Class Gpu
foreach ($object in $gpus) # obtain an instance
{
    $object.InvokeMethod("info", $null)
}
But that iterates over a number of metrics and produces a long list:
PS C:\Users\Administrator\Scripts> .\check.ps1
class: Gpu
class version: 2.4.0
object name: Quadro P2200
object ID: 1 (0x1)
GPU handle: 0xD800
GPU type: Quadro
GPU memory type: GDDR5X
Virtual memory size: 53488 MB
Physical memory size: 5120 MB
Available memory size: 1242 MB
Memory bus width: 160
Number of cores: 1280
Current GPU clock: 1754 MHz
Current Memory clock: 5005 MHz
Power consumed over sampling period: 29.369 Watt
Power sampling period: 1 ms
Number of power measurement samples: 1
The percentage of time where the GPU is considered busy: 21
The percentage of GPU memory utilization: 75
Video BIOS version: 86.6.77.0.5
Device Info: PCI\VEN_10DE&DEV_1C31&SUBSYS_131B1028&REV_A1
coolers: Cooler.id=1
thermal probes: ThermalProbe.id=1
ECC: Ecc.id=1
PCI-E current bus protocol generation: 3
PCI-E current width: 16 lanes
PCI-E current speed: 8000 Mbps
PCI-E maximum bus protocol generation: 3
PCI-E maximum width: 16 lanes
PCI-E maximum speed: 8000 Mbps
PCI-E downstream width: 16 lanes
VideoEngine Encoder usage: 76%
VideoEngine Decoder usage: 6%
VideoEngine Encoder sampling period: 167000 ms
VideoEngine Decoder sampling period: 167000 ms
VideoEngine Encoder sessions: 11
VideoEngine average FPS: 50
VideoEngine average latency: 1264 ms
How can I formulate the WMI command, or pipe it through something like grep, so that I get the single result for VideoEngine Decoder Usage and VideoEngine Encoder Usage?
Those Encoder/Decoder Usage metrics seem to be part of a subclass called 'videoCodec' and can be requested via: Get-WmiObject -Class Gpu -ComputerName localhost -Namespace ROOT\cimv2\NV | Select-Object *
Which results in a list, of which I've only pasted the bottom part:
productName : Quadro P2200
productType : 2
thermalProbes : {ThermalProbe.id=1}
uname : Quadro P2200
ver : System.Management.ManagementBaseObject
verVBIOS : System.Management.ManagementBaseObject
videoCodec : System.Management.ManagementBaseObject
Scope : System.Management.ManagementScope
Path : \\DHC-AMPP-NODE13\ROOT\cimv2\NV:Gpu.id=1,uname="Quadro P2200"
Options : System.Management.ObjectGetOptions
ClassPath : \\DHC-AMPP-NODE13\ROOT\cimv2\NV:Gpu
Properties : {archId, archName, coolers, coreCount...}
SystemProperties : {__GENUS, __CLASS, __SUPERCLASS, __DYNASTY...}
Qualifiers : {dynamic}
Site :
Container :
The idea is that I collect those two metrics into a monitoring system called Zabbix, using PowerShell scripting.
It all exists in the videoCodec property, so just filter the results:
$gpus = (Get-WmiObject -namespace "root\cimv2\nv" -class gpu).videoCodec |
Select percentEncoderUsage,percentDecoderUsage,encoderSamplingPeriod,
decoderSamplingPeriod,encoderSessionsCount,averageFps,averageLatency
Or a shorter version:
$gpus = (Get-WmiObject -namespace "root\cimv2\nv" -class gpu).videoCodec |
Select *coder*,*average*
Which will result in:
percentEncoderUsage : 0
percentDecoderUsage : 0
encoderSamplingPeriod : 167000
decoderSamplingPeriod : 167000
encoderSessionsCount : 0
averageFps : 0
averageLatency : 0
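If the goal is to hand Zabbix a single numeric value per item (for example through a UserParameter or zabbix_sender), you can return just one property from that result. A minimal sketch, assuming the same root\cimv2\nv namespace and the property names shown above:
# Emit only the encoder usage as a bare number, e.g. for a Zabbix UserParameter script
$codec = (Get-WmiObject -Namespace "root\cimv2\nv" -Class Gpu).videoCodec
Write-Output $codec.percentEncoderUsage
# ...and the decoder usage
Write-Output $codec.percentDecoderUsage
With more than one GPU, $codec becomes an array, so you would index it (e.g. $codec[0]) or loop over it.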

PostgreSQL 9.5 Replication Lag running on EC2

I have a series of PostgreSQL 9.5 servers running on r4.16xlarge instances and Amazon Linux 1 that started experiencing replication lag of several seconds this week. The configurations were changed, but the old configs weren't saved, so I'm not sure what the previous settings were. Here are the custom values:
max_connections = 1500
shared_buffers = 128GB
effective_cache_size = 132GB
maintenance_work_mem = 128MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
#effective_io_concurrency = 10
work_mem = 128MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 64
synchronous_commit = off
The drive layout is as follows: 4 disks for the xlog drive and 10 for the regular partition, all gp2 disk type.
Personalities : [raid0]
md126 : active raid0 xvdo[3] xvdn[2] xvdm[1] xvdl[0]
419428352 blocks super 1.2 512k chunks
md127 : active raid0 xvdk[9] xvdj[8] xvdi[7] xvdh[6] xvdg[5] xvdf[4] xvde[3] xvdd[2] xvdc[1] xvdb[0]
2097146880 blocks super 1.2 512k chunks
The master server is a smaller c4.8xlarge instance with this setup:
max_connections = 1500
shared_buffers = 15GB
effective_cache_size = 45GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 16
work_mem = 26MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 36
With this drive layout:
Personalities : [raid0]
md126 : active raid0 xvdd[2] xvdc[1] xvdb[0] xvde[3]
419428352 blocks super 1.2 512k chunks
md127 : active raid0 xvdr[12] xvdg[1] xvdo[9] xvdl[6] xvdh[2] xvdf[0] xvdp[10] xvdu[15] xvdm[7] xvdj[4] xvdn[8] xvdk[5] xvdi[3] xvds[13] xvdt[14] xvdq[11]
3355435008 blocks super 1.2 512k chunks
I guess I'm looking for optimal settings for these two instance types so I can eliminate the replication lag. None of the servers are what I would call heavily loaded.
With further digging I found that the following setting fixed the replication lag:
hot_standby_feedback = on
This may cause some WAL bloat on the master, but the backlog is now gone.
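To confirm whether the lag is really gone (or creeping back), the standard 9.5 views are enough; a minimal sketch, run on the standby and on the master respectively:
-- On the standby: how far behind replay is, as wall-clock time
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
-- On the master: sent vs. replayed WAL positions per standby
SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;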

Why is the OrientDB oetl import giving me this error

I am trying to import a CSV file into OrientDB 3.0. I have created and tested the JSON config, and it works with a smaller dataset, but the dataset that I want to import is around a billion rows (six columns).
The following is the user.json file I am using for the import with oetl:
{
  "source": { "file": { "path": "d1.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "User" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/databases/magriwebdoc",
      "dbType": "graph",
      "classes": [
        { "name": "User", "extends": "V" }
      ],
      "indexes": [
        { "class": "User", "fields": ["id:string"], "type": "UNIQUE" }
      ]
    }
  }
}
This is the console output from the oetl command:
2019-05-22 14:31:15:484 INFO Windows OS is detected, 262144 limit of open files will be set for the disk cache. [ONative]
2019-05-22 14:31:15:647 INFO 8261029888 B/7878 MB/7 GB of physical memory were detected on machine [ONative]
2019-05-22 14:31:15:647 INFO Detected memory limit for current process is 8261029888 B/7878 MB/7 GB [ONative]
2019-05-22 14:31:15:649 INFO JVM can use maximum 455MB of heap memory [OMemoryAndLocalPaginatedEnginesInitializer]
2019-05-22 14:31:15:649 INFO Because OrientDB is running outside a container 12% of memory will be left unallocated according to the setting 'memory.leftToOS' not taking into account heap memory [OMemoryAndLocalPaginatedEnginesInitializer]
2019-05-22 14:31:15:650 INFO OrientDB auto-config DISKCACHE=6,477MB (heap=455MB os=7,878MB) [orientechnologies]
2019-05-22 14:31:15:652 INFO System is started under an effective user : `lenovo` [OEngineLocalPaginated]
2019-05-22 14:31:15:670 INFO WAL maximum segment size is set to 6,144 MB [OrientDBEmbedded]
2019-05-22 14:31:15:701 INFO BEGIN ETL PROCESSOR [OETLProcessor]
2019-05-22 14:31:15:703 INFO [file] Reading from file d1.csv with encoding UTF-8 [OETLFileSource]
2019-05-22 14:31:15:703 INFO Started execution with 1 worker threads [OETLProcessor]
2019-05-22 14:31:16:008 INFO Page size for WAL located in D:\databases\magriwebdoc is set to 4096 bytes. [OCASDiskWriteAheadLog]
2019-05-22 14:31:16:703 INFO + extracted 0 rows (0 rows/sec) - 0 rows -> loaded 0 vertices (0 vertices/sec) Total time: 1001ms [0 warnings, 0 errors] [OETLProcessor]
2019-05-22 14:31:16:770 INFO Storage 'plocal:D:\databases/magriwebdoc' is opened under OrientDB distribution : 3.0.18 - Veloce (build 747595e790a081371496f3bb9c57cec395644d82, branch 3.0.x) [OLocalPaginatedStorage]
2019-05-22 14:31:17:703 INFO + extracted 0 rows (0 rows/sec) - 0 rows -> loaded 0 vertices (0 vertices/sec) Total time: 2001ms [0 warnings, 0 errors] [OETLProcessor]
2019-05-22 14:31:17:954 SEVER ETL process has problem: [OETLProcessor]
2019-05-22 14:31:17:956 INFO END ETL PROCESSOR [OETLProcessor]
2019-05-22 14:31:17:957 INFO + extracted 0 rows (0 rows/sec) - 0 rows -> loaded 0 vertices (0 vertices/sec) Total time: 2255ms [0 warnings, 0 errors] [OETLProcessor]
D:\orientserver\bin>
I know the config is right, but I am assuming it's more of a memory issue!
Please advise on what I should do.
Have you tried increasing your memory settings according to the size of the data that you want to process?
From the documentation, you can customize these properties:
Configuration - Environment Variables (see the ORIENTDB_OPTS_MEMORY parameter)
Performance Tuning - Memory Settings
Maybe that could help you.
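For example, since the log shows the JVM capped at 455 MB of heap, you could raise it for the ETL run. A minimal sketch, assuming the Windows console from the output above and that your oetl.bat honors ORIENTDB_OPTS_MEMORY (check the script shipped with your version):
REM Give the ETL process a bigger JVM heap before starting the import
set ORIENTDB_OPTS_MEMORY=-Xms2G -Xmx4G
oetl.bat user.json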
Your JSON script looks fine, but you can try deleting the indexes part. I have encountered the same problem because of wrong indexes, too; it may be caused by the UNIQUE index constraint. You can try:
Delete the indexes part of the JSON script (the index can be recreated afterwards, see the sketch below).
If you need this index, make sure to clear your database before you import your dataset.
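If you drop the index from the ETL config, you can still enforce uniqueness afterwards from the OrientDB console. A hedged sketch, assuming the User class and id property from the config above (it will fail if the imported data actually contains duplicate ids, which is a likely cause of the original failure):
CREATE INDEX User.id ON User (id) UNIQUE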

OrientDB 2.1.11 Java process consuming too much memory

We're running OrientDB 2.1.11 (Community Edition) along with JDK 1.8.0.74.
We're noticing memory consumption by the 'orientdb' java process slowly creeping up, and within a few days the database becomes unresponsive (we have to stop/start OrientDB in order to release the memory).
We also notice this kind of behavior within a few hours when we index the database.
The total size of the database is only 60 GB, with not more than 200 million records!
As you can see below, it already consumes 11.44 GB VIRT and 8.62 GB RES.
We're running CentOS 7.1.x.
We even changed the heap from 512 MB to 256 MB and modified diskCache.bufferSize to 8 GB:
MAXHEAP=-Xmx256m
# ORIENTDB MAXIMUM DISKCACHE IN MB, EXAMPLE, ENTER -Dstorage.diskCache.bufferSize=8192 FOR 8GB
MAXDISKCACHE="-Dstorage.diskCache.bufferSize=8192"
top output:
Tasks: 155 total, 1 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.1 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16269052 total, 229492 free, 9510740 used, 6528820 buff/cache
KiB Swap: 8257532 total, 8155244 free, 102288 used. 6463744 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2367 nmx 20 0 11.774g 8.620g 14648 S 0.3 55.6 81:26.82 java
ps aux output:
nmx 2367 4.3 55.5 12345680 9038260 ? Sl May02 81:28 /bin/java
-server -Xmx256m -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError
-Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9
-Dprofiler.enabled=true -Dstorage.diskCache.bufferSize=8192
How do I control memory usage?
Is there a DB memory leak?
Could you set the following settings for the JVM: -XX:+UseLargePages -XX:LargePageSizeInBytes=2m.
This should solve your issue.
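Where to put those flags depends on how the server is launched; a minimal sketch, assuming the stock bin/server.sh of the 2.1.x distribution, where the extra options shown in the ps output come from the JAVA_OPTS_SCRIPT variable:
# bin/server.sh (after the existing JAVA_OPTS_SCRIPT definition), illustrative only
JAVA_OPTS_SCRIPT="$JAVA_OPTS_SCRIPT -XX:+UseLargePages -XX:LargePageSizeInBytes=2m"
Note that -XX:+UseLargePages only takes effect if huge pages are also configured at the OS level on CentOS.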
This page solved my issue.
In a nutshell:
Add this configuration to your application code:
// Requires com.orientechnologies.orient.core.config.OGlobalConfiguration and sun.misc.VM (JDK 8)
static {
    OGlobalConfiguration.NON_TX_RECORD_UPDATE_SYNCH.setValue(true); // Executes a sync against the file system at every record operation. This slows down record updates but guarantees reliability on unreliable drives
    OGlobalConfiguration.STORAGE_LOCK_TIMEOUT.setValue(300000);
    OGlobalConfiguration.RID_BAG_EMBEDDED_TO_SBTREEBONSAI_THRESHOLD.setValue(-1);
    OGlobalConfiguration.FILE_LOCK.setValue(false);
    OGlobalConfiguration.SBTREEBONSAI_LINKBAG_CACHE_SIZE.setValue(5000);
    OGlobalConfiguration.INDEX_IGNORE_NULL_VALUES_DEFAULT.setValue(true);
    OGlobalConfiguration.MEMORY_CHUNK_SIZE.setValue(32);

    long maxMemory = Runtime.getRuntime().maxMemory();
    long maxMemoryGB = (maxMemory / 1024L / 1024L / 1024L);
    maxMemoryGB = maxMemoryGB < 1 ? 1 : maxMemoryGB;
    long cacheSizeMb = 2 * 65536 * maxMemoryGB / 1024;
    long maxDirectMemoryMb = VM.maxDirectMemory() / 1024L / 1024L;

    String cacheProp = System.getProperty("storage.diskCache.bufferSize");
    if (cacheProp == null) {
        long maxDirectMemoryOrientMb = maxDirectMemoryMb / 3L;
        long cachSizeMb = cacheSizeMb > maxDirectMemoryOrientMb ? maxDirectMemoryOrientMb : cacheSizeMb;
        cacheSizeMb = (long) Math.pow(2, Math.ceil(Math.log(cachSizeMb) / Math.log(2)));
        System.setProperty("storage.diskCache.bufferSize", Long.toString(cacheSizeMb));
        // the command below generates a NullPointerException in Orient 2.2.15-snapshot
        // OGlobalConfiguration.DISK_CACHE_SIZE.setValue(cacheSizeMb);
    }
}

Write performance decreases as the collection grows

My problem: MongoDB poor write performance on a large collection.
At the start, performance is 70,000 docs/sec (empty collection).
Over time, the collection gets larger.
At the end, performance is < 10,000 docs/sec (0.4 billion docs in the collection).
My records are in the charts below:
Write Performance with Time
Write Performance with Collection
Then I made a new collection in the same database and wrote to both collections at the same time.
The large collection (ns: "rf1.case1") is still slow,
but the new collection (ns: "rf1.case2") is pretty fast.
I have already read "MongoDB poor write performance on large collections with 50.000.000 documents plus", but I still have no idea!
Is my configuration wrong?
The 3 servers have the same hardware specification:
CPU: 8 core
Memory: 32GB
Disk: 2TB HDD (7200rpm)
My scenario:
There are 3 shards (replica sets):
Server1 : primary(sh01) + secondary(sh03) + arbiter(sh02) + configsvr + mongos
Server2 : primary(sh02) + secondary(sh01) + arbiter(sh03) + configsvr + mongos
Server3 : primary(sh03) + secondary(sh02) + arbiter(sh01) + configsvr + mongos
Sample mongod invocation:
/usr/local/mongodb/bin/mongod --quiet --port 20001 --dbpath $s1 --logpath $s1/s1.log --replSet sh01 --shardsvr --directoryperdb --fork --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor snappy --wiredTigerCacheSizeGB 8
The two collections (chunk size = 64 MB):
rf1.case1
shard key: { "_id" : "hashed" }
chunks:
host1 385
host2 401
host3 367
too many chunks to print, use verbose if you want to force print
rf1.case2
shard key: { "_id" : "hashed" }
chunks:
host1 11
host2 10
host3 10