Populating only vertex from CSV file - orientdb

Need help to know how should I populate my vertex class in orientdb with the csv file. The format in csv file is
name,type,status
xxxxx,ABC,3
yyyyy,ABC,1
zzzzz,123,5
--
I have a vertex and edges extended in OrientDB, where the vertex have 3 property name,type and status. I only want the vertex to get populated from csv, the edges will be created dynamically via API
I tried to create ETL file as below :
{
"source":{"file": { "path": "/tmp/ientdb-community-2.2.18/config/data.csv" } },
"extractor": { "csv": {} },
"transformers": [
{ "vertex": { "class": "MyObject" } }
],
"loader": {
"orientdb": {
"dbURL": "remote:localhost/mydb",
"dbUser": "root",
"dbPassword": "root",
"dbType": "graph",
"classes": [
{"name": "MyObject", "extends": "V"},
], "indexes": [
{"class":"MyObject", "fields":["name:string"], "type":"UNIQUE" }
]
}
}
}
I find that, if I use plocal the root/root credential is not working. And the classes are not as same as when logged in with remote (after starting server)

I tried your code and it works for me, this is what I get:
the only changes that I made to your code are: credential, and dbUrl plocal instead of remote:
{
"source":{"file": { "path": "mypath/config/data.csv" } },
"extractor": { "csv": {} },
"transformers": [
{ "vertex": { "class": "MyObject" } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:mypath/databases/mydb",
"dbType": "graph",
"dbUser": "<user name>",
"dbPassword": "<user password>",
**BEGIN UPDATE**
"serverUser": "<server administrator user name, usually root>",
"serverPassword": "<server administrator user password that is provided at server startup>",
**END UPDATE**
"classes": [
{"name": "MyObject", "extends": "V"},
], "indexes": [
{"class":"MyObject", "fields":["name:string"], "type":"UNIQUE" }
]
}
}
}
By the way I noticed that your path is called: ientdb-community-2.2.18 is that correct?
Hope it helps.
Regards.

Related

Only data from node 1 visible in a 2 node OrientDB cluster

I created a 2-node OrientDB cluster by following the below steps. But while distributing it, the data present in only one of the node is accessible. Please can you help me debug this issue. The OrientDB version is 2.2.6
Steps involved :
Utilized plocal mode in ETL tool and stored part of the data in node 1 and the other part in node2. The data stored actually belongs to just one class of vertex alone. ( On checking the data from console, the data has got injested properly ).
Then executed both the nodes in distributed mode, data from only one machineis accessible.
The default-distributed-db-config.json file is specified below :
{
"autoDeploy": true,
"readQuorum": 1,
"writeQuorum": 1,
"executionMode": "undefined",
"readYourWrites": true,
"servers": {
"*": "master"
},
"clusters": {
"internal": {
},
"address": {
"servers" : [ "orientmaster" ]
},
"address_1": {
"servers" : [ "orientslave1" ]
},
"*": {
"servers": ["<NEW_NODE>"]
}
}
}
There are two clusters created for the vertex named address namely address and address_1. The data in machine orientslave1 is stored using ETL tool into cluster address_1 , similarly the data in machine orientmaster is stored into the cluster address. ( I've ensured that both of these cluster ids are different at time of creation )
However when these two machines are connected together in distributed mode, the data in cluster address_1 is only visible
The ETL json is attached below :
{
"source": { "file": { "path": "/home/ubuntu/labvolume1/DataStorage/geo1_5lacs.csv" } },
"extractor": { "csv": {"columnsOnFirstLine": false, "columns":["place:string"] } },
"transformers": [
{ "vertex": { "class": "ADDRESS", "skipDuplicates":true } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/ETL_Test1",
"dbType": "graph",
"dbUser": "admin",
"dbPassword": "admin",
"dbAutoCreate": true,
"wal": false,
"tx":false,
"classes": [
{"name": "ADDRESS", "extends": "V", "clusters":1}
], "indexes": [
{"class":"ADDRESS", "fields":["place:string"], "type":"UNIQUE" }
]
}
}
}
Please let me know, if there is anything i'm doing wrongly

Utilizing OrientDB ETL to create 2 vertices and a connected edge at every line of CSV

I'm utilizing OrientDB ETL tool to import a large amount of data in GBs. The format of the CSV is such that ( I'm using orientDB 2.2 ) :
"101.186.130.130","527225725","233 djfnsdkj","0.119836317542"
"125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983"
"103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658"
"103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"
I'm required to create Two vertices one with the value in Column1(key being the value itself) and another Vertex having values in column 2 & 3 ( Its key concatenated with both values and both present as attributes in the second vertex type, the 4th column will be the property of the edge connecting both of these vertices.
I used the below code and it works ok with some errors, one problem is all values in each csv row is stored as properties within the IpAddress vertex, Is there any way to store only the IpAddress in it. Secondly please can you let me know the method to concatenate two values read from the csv.
{
"source": { "file": { "path": "/home/abcd/OrientDB/examples/ip_address.csv" } },
"extractor": { "csv": {"columnsOnFirstLine": false, "columns": ["ip:string", "dpcb:string", "address:string", "prob:string"] } },
"transformers": [
{ "merge": { "joinFieldName":"ip", "lookup":"IpAddress.ip" } },
{ "edge": { "class": "Located",
"joinFieldName": "address",
"lookup": "PhyLocation.loc",
"direction": "out",
"targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}"},
"edgeFields": { "confidence": "${input.prob}" },
"unresolvedLinkAction": "CREATE"
}
}
],
"loader": {
"orientdb": {
"dbURL": "remote:/localhost/Bulk_Transfer_Test",
"dbType": "graph",
"dbUser": "root",
"dbPassword": "tiger",
"serverUser": "root",
"serverPassword": "tiger",
"classes": [
{"name": "IpAddress", "extends": "V"},
{"name": "PhyLocation", "extends": "V"},
{"name": "Located", "extends": "E"}
], "indexes": [
{"class":"IpAddress", "fields":["ip:string"], "type":"UNIQUE" },
{"class":"PhyLocation", "fields":["loc:string"], "type":"UNIQUE" }
]
}
}
}

etl and many to many relation

In documentation the etl from csv use one to many feature, I would like to extends it to many to many. So I made 3 configs, one for post, one for comment and one for relation. Post and Comment are ok but when I launch relation I have got this error, what I'm doing wrong ?
commentId,postId
0,10
1,10
21,10
41,20
82,20
{
"source": { "file": { "path": "/tmp/relation.csv" } },
"extractor": { "csv": {} },
"transformers": [
{ "edge":
{ "class": "HasComments", "joinFieldName": "postId", "lookup": "Post.id", "direction": "out"},
{ "class": "HasComments", "joinFieldName": "commentId", "lookup": "Comment.id", "direction": "in"}
}
],
"loader": {
"orientdb": {
"dbURL": "plocal:/tmp/test",
"dbType": "graph",
"classes": [
{"name": "Post", "extends": "V"},
{"name": "Comment", "extends": "V"},
{"name": "HasComments", "extends": "E"}
],
"indexes": [
{"class":"Post", "fields":["id:integer"], "type":"UNIQUE" },
{"class":"Comment", "fields":["id:integer"], "type":"UNIQUE" }
]
}
}
}
OrientDB etl v.2.1.9-SNAPSHOT (build 2.1.x#r; 2016-01-07 10:51:24+0000) www.orientdb.com
BEGIN ETL PROCESSOR
[file] INFO Reading from file /tmp/relation.csv with encoding UTF-8
Error in Pipeline execution: com.orientechnologies.orient.etl.transformer.OTransformException: edge: input type 'com.orientechnologies.orient.core.record.impl.ODocument$1$1#72ade7e3' is not supported
ETL process halted: com.orientechnologies.orient.etl.OETLProcessHaltedException: Halt
Exception in thread "main" com.orientechnologies.orient.etl.OETLProcessHaltedException: Halt
at com.orientechnologies.orient.etl.OETLPipeline.execute(OETLPipeline.java:149)
at com.orientechnologies.orient.etl.OETLProcessor.executeSequentially(OETLProcessor.java:448)
at com.orientechnologies.orient.etl.OETLProcessor.execute(OETLProcessor.java:255)
at com.orientechnologies.orient.etl.OETLProcessor.main(OETLProcessor.java:109)
Caused by: com.orientechnologies.orient.etl.transformer.OTransformException: edge: input type 'com.orientechnologies.orient.core.record.impl.ODocument$1$1#72ade7e3' is not supported
at com.orientechnologies.orient.etl.transformer.OEdgeTransformer.executeTransform(OEdgeTransformer.java:107)
at com.orientechnologies.orient.etl.transformer.OAbstractTransformer.transform(OAbstractTransformer.java:37)
at com.orientechnologies.orient.etl.OETLPipeline.execute(OETLPipeline.java:115)
... 3 more
One possible solution might be, after importing the Post and Comment classes through their json file, you can use another json file and import the class Relation
{
"source": { "file": { "path": "/tmp/relation.csv" } },
"extractor": { "row": {} },
"transformers": [
{ "csv": { "separator": ","}
},
{ "vertex": { "class": "Relation" } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:/tmp/test",
"dbType": "graph",
"classes": [
{"name": "Post", "extends": "V"},
{"name": "Comment", "extends": "V"},
{"name": "Relation", "extends": "V"},
{"name": "HasComments", "extends": "E"}
],
"indexes": [
{"class":"Post", "fields":["id:integer"], "type":"UNIQUE" },
{"class":"Comment", "fields":["id:integer"], "type":"UNIQUE" }
]
}
}
}
You'll get these records.
Using the following javascript function
var g=orient.getGraphNoTx();
var relation = g.command("sql","select from Relation");
for(i=0;i<relation.length;i++){
var relationMM=g.command("sql","select postId , commentId from "+ relation[i].getId());
var idPost=relationMM[0].getProperty("postId");
var idComment=relationMM[0].getProperty("commentId");
var post=g.command("sql","select from Post where id = " + idPost);
var comment=g.command("sql","select from Comment where id = " + idComment);
g.command("sql","create edge HasComments from " + post[0].getId() + " to " + comment[0].getId());
}
g.command("sql","drop class Relation unsafe");
You'll get the following structure.
This will be your graph.
UPDATE
You can use this code to check if the edge already exists
var counter=g.command("sql","select count(*) from HasComments where out=" + post[0].getId() + " and in=" + comment[0].getId());
if(counter[0].getProperty("count")==0){
g.command("sql","create edge HasComments from " + post[0].getId() + " to " + comment[0].getId());
}

"No nodes configured for partition" after creating a database via ETL

I've just created a custom database using the following ETL config,
{
"source": { "file": { "path": "./mydata.csv" } },
"extractor": { "row": {} },
"transformers": [
{ "csv": {} },
{ "vertex": { "class": "MyClass" } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:/opt/orientdb/databases/MyData",
"dbUser": "root",
"dbPassword": "qrefhiuqwriouhwqv",
"dbType": "graph",
"classes": [
{"name": "MyClass", "extends": "V"},
]
}
}
}
Now, when I go to the web console, I can see I have 433k records of type MyClass created at database MyData.
When I try to query it with "select from MyClass", I get the error
2015-04-06 23:56:25:541 SEVERE Internal server error:
com.orientechnologies.orient.server.distributed.ODistributedException:
No nodes configured for partition 'MyClass.[]' request:
id=-1 from=node1428362873334 task=command_sql(select from MyClass) userName= [ONetworkProtocolHttpDb]
What am I doing wrong?

Chef: Trying to get build-essential to install on our node before Postgres

Here's our node configuration:
{
"run_list": [
"recipe[apt]",
"recipe[build-essential]",
[
"rackbox"
]
],
"rackbox": {
"jenkins": {
"job": "job1",
"git_repo": "https://github.com/hayesmp/railsgirls-app.git",
"command": "bundle exec rake",
"ip_address": "192.237.181.154",
"host": "subocean-southerner"
},
"ruby": {
"versions": [
"2.0.0-p247"
],
"global_version": "2.0.0-p247"
},
"apps": {
"unicorn": [
{
"appname": "app1",
"hostname": "app1"
}
]
},
"db_root_password": "iloverandompasswordsbutthiswilldo",
"databases": {
"postgresql": [
{
"database_name": "app1_production",
"username": "app1",
"password": "app1_pass"
}
]
}
}
}
I'm just not sure where to insert the build essential compiletime = true attribute for my configuration.
This is the sample code for this stack overflow post: Chef: Why are resources in an "include_recipe" step being skipped?
name "myapp"
run_list(
"recipe[build-essential]",
"recipe[myapp]"
)
default_attributes(
"build_essential" => {
"compiletime" => true
}
)
Paste this into your node configuration:
"build_essential": {
"compiletime": true
}
BTW: you should use recipe[rackbox] instead of [rackbox] in your run_list