OrientDB ETL performance issue when importing edges to plocal on SSD

My goal is to import 25M edges into a graph that has about 50M vertices. Target time:
The current import speed is ~150 edges/sec; over a remote connection it was about 100 edges/sec.
extracted 20,694,336 rows (171 rows/sec) - 20,694,336 rows -> loaded 20,691,830 vertices (171 vertices/sec) Total time: 35989762ms [0 warnings, 4 errors]
extracted 20,694,558 rows (156 rows/sec) - 20,694,558 rows -> loaded 20,692,053 vertices (156 vertices/sec) Total time: 35991185ms [0 warnings, 4 errors]
extracted 20,694,745 rows (147 rows/sec) - 20,694,746 rows -> loaded 20,692,240 vertices (147 vertices/sec) Total time: 35992453ms [0 warnings, 4 errors]
extracted 20,694,973 rows (163 rows/sec) - 20,694,973 rows -> loaded 20,692,467 vertices (162 vertices/sec) Total time: 35993851ms [0 warnings, 4 errors]
extracted 20,695,179 rows (145 rows/sec) - 20,695,179 rows -> loaded 20,692,673 vertices (145 vertices/sec) Total time: 35995262ms [0 warnings, 4 errors]
I tried enabling parallel in the ETL config, but it appears to be completely broken in OrientDB 2.2.12 (an inconsistency with the multi-threading changes in 2.1?) and gives me nothing but the 4 errors in the log above. Dumb parallel mode (running 2+ ETL processes) is also impossible with a plocal connection.
My config:
{
  "config": {
    "log": "info",
    "parallel": true
  },
  "source": {
    "input": {}
  },
  "extractor": {
    "row": {
      "multiLine": false
    }
  },
  "transformers": [
    {
      "code": {
        "language": "Javascript",
        "code": "(new com.orientechnologies.orient.core.record.impl.ODocument()).fromJSON(input);"
      }
    },
    {
      "merge": {
        "joinFieldName": "_ref",
        "lookup": "Company._ref"
      }
    },
    {
      "vertex": {
        "class": "Company",
        "skipDuplicates": true
      }
    },
    {
      "edge": {
        "joinFieldName": "with_id",
        "lookup": "Person._ref",
        "direction": "in",
        "class": "Stakeholder",
        "edgeFields": {
          "_ref": "${input._ref}",
          "value_of_share": "${input.value_of_share}"
        },
        "skipDuplicates": true,
        "unresolvedLinkAction": "ERROR"
      }
    },
    {
      "field": {
        "fieldNames": [
          "with_id",
          "with_to",
          "_type",
          "value_of_share"
        ],
        "operation": "remove"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/mnt/disks/orientdb/orientdb-2.2.12/databases/df",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoDropIfExists": false,
      "dbAutoCreate": false,
      "standardElementConstraints": false,
      "tx": false,
      "wal": false,
      "batchCommit": 1000,
      "dbType": "graph",
      "classes": [
        {
          "name": "Company",
          "extends": "V"
        },
        {
          "name": "Person",
          "extends": "V"
        },
        {
          "name": "Stakeholder",
          "extends": "E"
        }
      ]
    }
  }
}
Data sample:
{"_ref":"1072308006473","with_to":"person","with_id":"010703814320","_type":"is.stakeholder","value_of_share":10000.0} {"_ref":"1075837000095","with_to":"person","with_id":"583600656732","_type":"is.stakeholder","value_of_share":15925.0} {"_ref":"1075837000095","with_to":"person","with_id":"583600851010","_type":"is.stakeholder","value_of_share":33150.0}
Server specs: a Google Cloud instance with PD-SSD, 6 CPUs, and 18 GB RAM.
By the way, on the same server I managed to get ~3k/sec when importing vertices over a remote connection (still too slow, but acceptable for my current dataset).
And the question: is there any reliable way to increase the import speed to, say, 10k inserts per second, or at least 5k? I would rather not turn off indexes; this is still millions of records, not billions.
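For reference, the per-row lookups in the merge and edge transformers (Company._ref and Person._ref) are the index-dependent part I don't want to give up; in ETL loader terms they correspond to declarations like the following (a minimal sketch only, assuming _ref is a string; the exact index types and definitions in my database may differ):
"indexes": [
  { "class": "Company", "fields": ["_ref:string"], "type": "UNIQUE" },
  { "class": "Person", "fields": ["_ref:string"], "type": "UNIQUE" }
]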
UPDATE
After a few hours the performance continues to deteriorate.
extracted 23,146,912 rows (56 rows/sec) - 23,146,912 rows -> loaded 23,144,406 vertices (56 vertices/sec) Total time: 60886967ms [0 warnings, 4 errors]
extracted 23,146,981 rows (69 rows/sec) - 23,146,981 rows -> loaded 23,144,475 vertices (69 vertices/sec) Total time: 60887967ms [0 warnings, 4 errors]
extracted 23,147,075 rows (39 rows/sec) - 23,147,075 rows -> loaded 23,144,570 vertices (39 vertices/sec) Total time: 60890356ms [0 warnings, 4 errors]

Related

OutOfMemory Error on DruidDB index_kafka_histogram tasks

I am new to the DruidDB setup. I am trying to ingest data into DruidDB. Initially it was working fine, but after some time I started getting the following error.
Sample Config:
...
"metricsSpec": [
{
"type": "longMin",
"name": "min",
"fieldName": "min",
"expression": null
},
{
"type": "longMax",
"name": "max",
"fieldName": "max",
"expression": null
},
{
"type": "longSum",
"name": "count",
"fieldName": "count",
"expression": null
},
{
"type": "longSum",
"name": "sum",
"fieldName": "sum",
"expression": null
},
{
"type": "quantilesDoublesSketch",
"name": "quantilesDoubleSketch",
"fieldName": "sketch",
"k": 128
}
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "MINUTE",
"rollup": true,
"intervals": null
},
...
"tuningConfig": {
"type": "kafka",
"maxRowsInMemory": 10000,
"maxBytesInMemory": 200000,
"maxRowsPerSegment": 5000000,
"intermediatePersistPeriod": "PT10M",
"basePersistDirectory": "/tmp/druid-realtime-persist15059426147899962275",
"maxPendingPersists": 0,
"indexSpec": {
"bitmap": {
"type": "roaring",
"compressRunOnSerialization": true
},
"dimensionCompression": "lz4",
"metricCompression": "lz4",
"longEncoding": "longs"
},
"indexSpecForIntermediatePersists": {
"bitmap": {
"type": "roaring",
"compressRunOnSerialization": true
},
"dimensionCompression": "lz4",
"metricCompression": "lz4",
"longEncoding": "longs"
},
"buildV9Directly": true,
"reportParseExceptions": false,
"handoffConditionTimeout": 0,
"resetOffsetAutomatically": false,
"chatRetries": 8,
"httpTimeout": "PT10S",
"shutdownTimeout": "PT80S",
"offsetFetchPeriod": "PT30S",
"intermediateHandoffPeriod": "P2147483647D",
"logParseExceptions": true,
"maxParseExceptions": 2147483647,
"maxSavedParseExceptions": 0,
"skipSequenceNumberAvailabilityCheck": false,
"repartitionTransitionDuration": "PT120S"
}
...
Error
09:55:49.134 [task-runner-0-priority-0] ERROR org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Uncaught Throwable while running task[AbstractTask{id='index_kafka_histogram_25c6328c09f15d7_nofamgdk', groupId='index_kafka_histogram', taskResource=TaskResource{availabilityGroup='index_kafka_histogram_25c6328c09f15d7', requiredCapacity=1}, dataSource='histogram', context={checkpoints={"0":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0,"10":0,"11":0,"12":0,"13":0,"14":0,"15":0,"16":0,"17":0,"18":0,"19":0,"20":0,"21":0,"22":0,"23":0,"24":0,"25":0,"26":0,"27":0,"28":0,"29":0,"30":0,"31":0,"32":0,"33":0,"34":0,"35":0,"36":0,"37":0,"38":0,"39":0,"40":0,"41":0,"42":0,"43":0,"44":0,"45":0,"46":0,"47":0,"48":0,"49":0}}, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}}]
java.lang.OutOfMemoryError: Java heap space
Error!
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at org.apache.druid.indexing.worker.executor.ExecutorLifecycle.join(ExecutorLifecycle.java:215)
at org.apache.druid.cli.CliPeon.run(CliPeon.java:295)
at org.apache.druid.cli.Main.main(Main.java:113)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.apache.druid.indexing.worker.executor.ExecutorLifecycle.join(ExecutorLifecycle.java:212)
... 2 more
Caused by: java.lang.OutOfMemoryError: Java heap space
I have tried the hard reset option for the tasks and the answers mentioned in the following link.
Link: druid indexing task fails with OutOfMemory Exception
A few things you should check.
Your ingestion is configured with maxRowsInMemory of 10000 and maxBytesInMemory of 200000. These seem small and will cause each task to write to disk often. All writes to disk must be merged when it comes time to publish segments. Depending on your message size and throughput, there could be a large number of small files to be merged, and the merge operation may be running into trouble because of this. You can set maxColumnsToMerge in your ingestion task to keep this under control.
From one of the Druid committers: “the way the setting works is, let's say you have 30 columns per segment, and for each segment you have 110 intermediate persists. during segment generation, we merge all of those intermediate persists. that's 110 * 30 = 3,300 total to merge. each column has memory requirements of about 5KB on heap and 100KB direct (off heap). so that'll require about 16.5MB on heap and 500MB off heap. if you want to limit it, you can set maxColumnsToMerge = 1500 and it will use only about half that.”
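As a rough illustration of that first point, the relevant knobs all sit in the tuningConfig; a hedged sketch of the shape (the numbers here are placeholders to show where the settings go, not tuned recommendations for your workload):
"tuningConfig": {
  "type": "kafka",
  "maxRowsInMemory": 150000,
  "maxBytesInMemory": 100000000,
  "maxColumnsToMerge": 1500,
  ...
}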
Additionally, you should verify that you are not overcommitting memory: you should have enough memory for worker.capacity * (heap + direct memory) on the MiddleManager, in addition to what the MiddleManager process itself uses, assuming nothing else is running on the node.

PGSync synchronisation does not happen for new data

I want to synchronise 3 tables from a PostgreSQL database to a self-hosted Elasticsearch instance, and to do so I use PGSync.
To build this stack, I followed this tutorial.
When I start the Docker containers everything works well (except some errors in PGSync, but that's normal, the tables don't exist yet). After that, I restore my database from a dump (the tables have approximately 30,000, 9,000,000 and 13,000,000 rows). After the restore, PGSync detects the new rows in the database and syncs them into Elasticsearch.
My problem is that after that first synchronisation, PGSync detects new lines:
Polling db cardpricetrackerprod: 61 item(s)
Polling db cardpricetrackerprod: 61 item(s)
but the synchronisation never happens.
Here is what my schema looks like:
[
  {
    "database": "mydb",
    "index": "elastic-index-first-table",
    "nodes": {
      "table": "first_table",
      "schema": "public",
      "columns": [
        "id",
        ...
      ]
    }
  },
  {
    "database": "mydb",
    "index": "elastic-index-second-table",
    "nodes": {
      "table": "second_table",
      "schema": "public",
      "columns": [
        "id",
        ...
      ]
    }
  },
  {
    "database": "mydb",
    "index": "elastic-index-third-table",
    "nodes": {
      "table": "third_table",
      "schema": "public",
      "columns": [
        "id",
        ...
      ]
    }
  }
]
Have I missed a configuration step?

Only data from node 1 visible in a 2-node OrientDB cluster

I created a 2-node OrientDB cluster by following the steps below, but when running it distributed, only the data present on one of the nodes is accessible. Can you please help me debug this issue? The OrientDB version is 2.2.6.
Steps involved:
Utilized plocal mode in the ETL tool and stored part of the data on node 1 and the other part on node 2. The stored data actually belongs to just one vertex class. (On checking from the console, the data has been ingested properly.)
Then executed both nodes in distributed mode; data from only one machine is accessible.
The default-distributed-db-config.json file is specified below:
{
  "autoDeploy": true,
  "readQuorum": 1,
  "writeQuorum": 1,
  "executionMode": "undefined",
  "readYourWrites": true,
  "servers": {
    "*": "master"
  },
  "clusters": {
    "internal": {
    },
    "address": {
      "servers": ["orientmaster"]
    },
    "address_1": {
      "servers": ["orientslave1"]
    },
    "*": {
      "servers": ["<NEW_NODE>"]
    }
  }
}
There are two clusters created for the vertex class address, namely address and address_1. The data on machine orientslave1 is stored via the ETL tool into cluster address_1; similarly, the data on machine orientmaster is stored into cluster address. (I've ensured that both cluster IDs are different at creation time.)
However, when these two machines are connected in distributed mode, only the data in cluster address_1 is visible.
The ETL JSON is attached below:
{
  "source": { "file": { "path": "/home/ubuntu/labvolume1/DataStorage/geo1_5lacs.csv" } },
  "extractor": { "csv": { "columnsOnFirstLine": false, "columns": ["place:string"] } },
  "transformers": [
    { "vertex": { "class": "ADDRESS", "skipDuplicates": true } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/ETL_Test1",
      "dbType": "graph",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoCreate": true,
      "wal": false,
      "tx": false,
      "classes": [
        { "name": "ADDRESS", "extends": "V", "clusters": 1 }
      ],
      "indexes": [
        { "class": "ADDRESS", "fields": ["place:string"], "type": "UNIQUE" }
      ]
    }
  }
}
Please let me know if I'm doing anything wrong.

Using OrientDB ETL to create 2 vertices and a connecting edge for every line of a CSV

I'm using the OrientDB ETL tool to import a large amount of data (GBs). The CSV format is as follows (I'm using OrientDB 2.2):
"101.186.130.130","527225725","233 djfnsdkj","0.119836317542"
"125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983"
"103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658"
"103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"
I'm required to create two vertices: one with the value in column 1 (the key being the value itself), and another vertex holding the values in columns 2 and 3 (its key is the concatenation of both values, and both are present as attributes on the second vertex type). The 4th column will be a property of the edge connecting these two vertices.
I used the configuration below and it works OK with some errors. One problem is that all values in each CSV row are stored as properties on the IpAddress vertex; is there any way to store only the IP address in it? Secondly, please can you tell me how to concatenate two values read from the CSV? (A rough sketch of what I'm considering follows the config below.)
{
  "source": { "file": { "path": "/home/abcd/OrientDB/examples/ip_address.csv" } },
  "extractor": { "csv": { "columnsOnFirstLine": false, "columns": ["ip:string", "dpcb:string", "address:string", "prob:string"] } },
  "transformers": [
    { "merge": { "joinFieldName": "ip", "lookup": "IpAddress.ip" } },
    { "edge": {
        "class": "Located",
        "joinFieldName": "address",
        "lookup": "PhyLocation.loc",
        "direction": "out",
        "targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}" },
        "edgeFields": { "confidence": "${input.prob}" },
        "unresolvedLinkAction": "CREATE"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:/localhost/Bulk_Transfer_Test",
      "dbType": "graph",
      "dbUser": "root",
      "dbPassword": "tiger",
      "serverUser": "root",
      "serverPassword": "tiger",
      "classes": [
        { "name": "IpAddress", "extends": "V" },
        { "name": "PhyLocation", "extends": "V" },
        { "name": "Located", "extends": "E" }
      ],
      "indexes": [
        { "class": "IpAddress", "fields": ["ip:string"], "type": "UNIQUE" },
        { "class": "PhyLocation", "fields": ["loc:string"], "type": "UNIQUE" }
      ]
    }
  }
}
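One direction I'm considering for both issues, sketched only and not verified: build the concatenated key with a Javascript code transformer before the merge, point the edge lookup at that new field (here called loc, my own placeholder name), and drop the extra columns with a field transformer after the edge step so they don't end up on the IpAddress vertex:
"transformers": [
  { "code": { "language": "Javascript",
              "code": "input.field('loc', input.field('dpcb') + '_' + input.field('address')); input;" } },
  { "merge": { "joinFieldName": "ip", "lookup": "IpAddress.ip" } },
  { "edge": { "class": "Located",
              "joinFieldName": "loc",
              "lookup": "PhyLocation.loc",
              "direction": "out",
              "targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}" },
              "edgeFields": { "confidence": "${input.prob}" },
              "unresolvedLinkAction": "CREATE" } },
  { "field": { "fieldNames": ["dpcb", "address", "prob", "loc"], "operation": "remove" } }
]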

Import from RDBMS DB2 To Document DB OrientDB

In my OrientDB 2.0.7 database CU242176 there is a table M_PERM:
PERM_DESC:string;
PERM_ID:integer not null;
PERM_NAME:string.
In my DB2 9.1 database CU242176 there is a table M_PERM with the same structure, containing 14 rows. With the OrientDB-ETL module I imported the data. There were no errors, but there is no data in the target table, although the index on PERM_ID was created.
Here is my config:
{
  "config": {
    "log": "debug"
  },
  "extractor": {
    "jdbc": {
      "driver": "com.ibm.db2.jcc.DB2Driver",
      "url": "jdbc:db2://ITS-C:50000/CU242176",
      "userName": "metr",
      "userPassword": "metr1",
      "query": "select PERM_DESC,PERM_ID,PERM_NAME from METR.M_PERM"
    }
  },
  "transformers": [
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:c:/Program Files/orientdb-community-2.0.7/databases/CU242176",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoCreate": false,
      "standardElementConstraints": false,
      "tx": true,
      "wal": false,
      "batchCommit": 1000,
      "dbType": "document",
      "classes": [{ "name": "M_PERM" }],
      "indexes": [{ "class": "M_PERM", "fields": ["PERM_ID:integer"], "type": "UNIQUE" }]
    }
  }
}
Log of the executed command (oetl config_Import_M_PERM_JDBC.json):
OrientDB etl v.2.0.7 (build #BUILD#) www.orientechnologies.com
[orientdb] DEBUG Opening database 'plocal:c:/Program Files/orientdb-community-2.0.7/databases/CU242176'...
2015-04-29 14:39:34:562 WARNING {db=CU242176} segment file 'database.ocf' was not closed correctly last time [OSingleFileSegment]
BEGIN ETL PROCESSOR
[orientdb] DEBUG orientdb: found 0 documents in class 'null'
END ETL PROCESSOR
extracted 29 records (0 records/sec) - 29 records -> loaded 14 documents (0 documents/sec) Total time: 159ms [0 warnings, 0 errors]
How do I resolve this issue so that the 14 rows are loaded into my table?
Instead of:
"classes": [{"name": "M_PERM"}],
use:
"class": "M_PERM"
I can't see this documented anywhere but it worked for me.
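For completeness, a minimal sketch of the adjusted loader section (the remaining options stay as in the question's config and are omitted here):
"loader": {
  "orientdb": {
    "dbURL": "plocal:c:/Program Files/orientdb-community-2.0.7/databases/CU242176",
    "dbUser": "admin",
    "dbPassword": "admin",
    "dbType": "document",
    "class": "M_PERM",
    "indexes": [{ "class": "M_PERM", "fields": ["PERM_ID:integer"], "type": "UNIQUE" }]
  }
}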