OrientDB import from CSV, nullValue property

I'm trying to import a fake CSV file into OrientDB Server 2.1.2.
The ETL tool looks amazing, allowing many options, but it seems to me that the csv transformer does not interpret the "nullValue" option correctly (when I tried to use the CSV extractor instead, I got an Extractor 'csv' not found error).
I used the following JSON to load a simple file. When using "NULL" as the null value, both in the data and in the JSON, I could import the file correctly; when using "?", I couldn't.
{
  "source": { "file": { "path": "Z:/test.tsv" } },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": {
        "separator": " ",
        "nullValue": "?",
        "columnsOnFirstLine": true,
        "columns": [
          "a:STRING",
          "b:STRING",
          "c:String",
          "n:Integer"
        ],
        "dateFormat": "dd.mm.yyyy"
      }
    },
    { "vertex": { "class": "Test", "skipDuplicates": true } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:C:/Users/taatoal1/tmp/orientdb/databases/test",
      "dbType": "graph",
      "classes": [
        { "name": "Test" }
      ]
    }
  }
}
Here is the data:
a b c 1
a0 b0 c0 2
a1 b1 c1 ?
Am I doing something wrong?

My suggestion is to try the just-released latest version, 2.1.4: Orient Download.
In 2.1.4 we added support for the CSV extractor, which internally uses Commons CSV from Apache.
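With the extractor available, the nullValue option moves from the csv transformer into the csv extractor itself. A minimal sketch of the equivalent 2.1.4 configuration, assuming the same file and options as in the question:
{
  "source": { "file": { "path": "Z:/test.tsv" } },
  "extractor": {
    "csv": {
      "separator": " ",
      "nullValue": "?",
      "columnsOnFirstLine": true,
      "columns": ["a:string", "b:string", "c:string", "n:integer"]
    }
  },
  "transformers": [
    { "vertex": { "class": "Test", "skipDuplicates": true } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:C:/Users/taatoal1/tmp/orientdb/databases/test",
      "dbType": "graph",
      "classes": [ { "name": "Test" } ]
    }
  }
}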

Related

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        }
      },
      "dimensionSpec": {
        "dimensions": [
          "event",
          "id",
          "campaigns"
        ]
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your ingestion spec's flattenSpec. When this flag is set to true (the default), it interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link to use flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
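For orientation, the flag sits at the top level of the flattenSpec, next to the fields array; with discovery turned off, root-level fields such as event and id then need to be listed explicitly. A sketch based on the question's spec:
"flattenSpec": {
  "useFieldDiscovery": false,
  "fields": [
    { "type": "root", "name": "event" },
    { "type": "root", "name": "id" },
    { "type": "jq", "name": "campaigns", "expr": ".parameters.campaigns" }
  ]
}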
It looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using the expression string_to_array should do the trick!
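A sketch of a transformSpec along those lines, assuming the flattened campaigns string from the question and a comma-plus-space separator as in the sample data (the transformSpec goes under dataSchema, alongside the parser):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "campaigns",
      "expression": "string_to_array(campaigns, ', ')"
    }
  ]
}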

Use ETL to load CSV data into OrientDB containing a SPATIAL index

I'm interested in loading some data into an OrientDB from some CSV files that contain spatial coordinates in WGS84 Lat/Long.
I'm using OrientDB 2.2.8 and have the lucene spatial module added to my $ORIENTDB_HOME/lib directory.
I'm loading my data into a database using ETL and would like to add the spatial index but I'm not sure how to do this.
Say my CSV file has the following columns:
Label (string)
Latitude (float)
Longitude (float)
I've tried this in my ETL:
"loader": {
"orientdb": {
"dbURL": "plocal:myDatabase.orientdb",
"dbType": "graph",
"batchCommit": 1000,
"classes": [ { "name": "vertex", "extends", "V" } ],
"indexes": [ { "class": "vertex", "fields":["Label:string"], "type":"UNIQUE" },
{ "class": "Label", "fields":["Latitude:float","Longitude:float"], "type":"SPATIAL" }
]
}
}
but it's not working. I get the following error message:
ETL process has problem: com.orientechnologies.orient.core.index.OIndexException: Index with type SPATIAL and algorithm null does not exist.
Has anyone looked into creating spatial indices via ETL? Most of the material I'm seeing on this uses either Java or direct queries.
Thanks in advance for any advice.
I was able to get it to load using the legacy spatial capabilities.
I put together a cheesy dataset with coordinates for a few of the Nazca Lines geoglyphs:
Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.7067274,-75.1475391
Condor,-14.6983457,-75.1283374
Spider,-14.694363,-75.1235815
Spiral,-14.688309,-75.122757
Hands,-14.694459,-75.113881
Tree,-14.693897,-75.114467
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788
I used a script to create my GeoGlyph class, createVertexGeoGlyph.osql:
set echo true
connect PLOCAL:./nazca.orientdb admin admin
CREATE CLASS GeoGlyph EXTENDS V CLUSTERS 1
CREATE PROPERTY GeoGlyph.Name STRING
CREATE PROPERTY GeoGlyph.Latitude FLOAT
CREATE PROPERTY GeoGlyph.Longitude FLOAT
CREATE PROPERTY GeoGlyph.Tag EMBEDDEDSET STRING
CREATE INDEX GeoGlyph.index.Location ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE
which I load into my database using
$ console.sh createVertexGeoGlyph.osql
I do it this way because it seems to work more consistently for me. I've had some difficulty getting the ETL engine to create defined properties from CSV imports; sometimes it cooperates and creates my properties, and other times it has trouble.
So, the next step to get the data in is to create my .json files for the ETL process. I like to make two: one that is file-specific and another that is common, since I often have datasets that span multiple files.
First, I have my nazca_lines.json file:
{
  "config": {
    "log": "info",
    "fileDirectory": "./",
    "fileName": "nazca_lines.csv"
  }
}
Next is the commonGeoGlyph.json file:
{
  "begin": [
    { "let": { "name": "$filePath", "expression": "$fileDirectory.append($fileName)" } }
  ],
  "config": { "log": "debug" },
  "source": { "file": { "path": "$filePath" } },
  "extractor": {
    "csv": {
      "ignoreEmptyLines": true,
      "nullValue": "N/A",
      "separator": ",",
      "columnsOnFirstLine": true,
      "dateFormat": "yyyy-MM-dd"
    }
  },
  "transformers": [
    { "vertex": { "class": "GeoGlyph" } },
    { "code": { "language": "Javascript",
                "code": "print('>>> Current record: ' + record); record;" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:nazca.orientdb",
      "dbType": "graph",
      "batchCommit": 1000,
      "classes": [],
      "indexes": []
    }
  }
}
There's more stuff in the file than is necessary; I use it as a template for a lot of things. In this case, I don't have to create the index in the ETL file itself because I already created it in the createVertexGeoGlyph.osql file.
To load the data I just use the oetl.sh script:
$ oetl.sh commonGeoGlyph.json nazca_lines.json
This is what's working for me... I'm sure there are better ways to do it, but this works. I'm posting this here to tie off the question. Hopefully someone will find this to be useful.
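As a quick sanity check that the legacy spatial index is usable, a query sketch against the schema above (NEAR and maxDistance, in kilometers, come from the legacy spatial syntax; the point is just somewhere near the geoglyphs):
SELECT Name FROM GeoGlyph
WHERE [Latitude,Longitude] NEAR [-14.69,-75.13,{"maxDistance":5}]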

Utilizing OrientDB ETL to create 2 vertices and a connected edge at every line of CSV

I'm using the OrientDB ETL tool to import a large amount of data, in the GBs. The CSV format is as follows (I'm using OrientDB 2.2):
"101.186.130.130","527225725","233 djfnsdkj","0.119836317542"
"125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983"
"103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658"
"103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"
For every line I need to create two vertices: one with the value in column 1 (the key being the value itself), and another with the values in columns 2 and 3 (its key is the concatenation of both values, and both are present as attributes on the second vertex type). The fourth column becomes a property of the edge connecting these two vertices.
I used the code below and it works with some errors. One problem is that all values in each CSV row are stored as properties within the IpAddress vertex; is there any way to store only the IP address in it? Secondly, what is the method to concatenate two values read from the CSV?
{
  "source": { "file": { "path": "/home/abcd/OrientDB/examples/ip_address.csv" } },
  "extractor": { "csv": { "columnsOnFirstLine": false, "columns": ["ip:string", "dpcb:string", "address:string", "prob:string"] } },
  "transformers": [
    { "merge": { "joinFieldName": "ip", "lookup": "IpAddress.ip" } },
    { "edge": { "class": "Located",
        "joinFieldName": "address",
        "lookup": "PhyLocation.loc",
        "direction": "out",
        "targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}" },
        "edgeFields": { "confidence": "${input.prob}" },
        "unresolvedLinkAction": "CREATE"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:/localhost/Bulk_Transfer_Test",
      "dbType": "graph",
      "dbUser": "root",
      "dbPassword": "tiger",
      "serverUser": "root",
      "serverPassword": "tiger",
      "classes": [
        { "name": "IpAddress", "extends": "V" },
        { "name": "PhyLocation", "extends": "V" },
        { "name": "Located", "extends": "E" }
      ],
      "indexes": [
        { "class": "IpAddress", "fields": ["ip:string"], "type": "UNIQUE" },
        { "class": "PhyLocation", "fields": ["loc:string"], "type": "UNIQUE" }
      ]
    }
  }
}
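One possible direction for both issues, sketched with the ETL field transformer (untested; the loc field name matches the lookup above): an expression can build the concatenated key before the edge transformer runs, and operation: remove can strip the columns that should not remain on the IpAddress vertex.
"transformers": [
  { "merge": { "joinFieldName": "ip", "lookup": "IpAddress.ip" } },
  { "field": { "fieldName": "loc", "expression": "dpcb.append(address)" } },
  { "edge": { "class": "Located",
      "joinFieldName": "loc",
      "lookup": "PhyLocation.loc",
      "direction": "out",
      "targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}" },
      "edgeFields": { "confidence": "${input.prob}" },
      "unresolvedLinkAction": "CREATE"
    }
  },
  { "field": { "fieldNames": ["dpcb", "address", "prob", "loc"], "operation": "remove" } }
]
Since the remove runs after the edge transformer, it should only affect what is left on the IpAddress vertex, not the edge or the target vertex.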

OrientDB: missing "half edges"

I'm still playing with OrientDB.
Now I'm trying the schema functionalities, which look awesome :-)
I have two data files: joinA.txt and joinB.txt, which I used to populate a database with the following schema (the content of the two files is at the end of the post):
CREATE CLASS Employee EXTENDS V;
CREATE PROPERTY Employee.eid Integer;
CREATE PROPERTY Employee.name String;
CREATE PROPERTY Employee.eage Short;
CREATE INDEX Employee.eid unique_hash_index;
CREATE CLASS ExtendedProfile EXTENDS V;
CREATE CLASS XYZProfile EXTENDS ExtendedProfile;
CREATE PROPERTY XYZProfile.textual String;
-- SameAs can only connect Employees to ExtendedProfile
CREATE CLASS SameAs EXTENDS E; -- same employee across many tables
CREATE PROPERTY SameAs.out LINK ExtendedProfile;
CREATE PROPERTY SameAs.In LINK Employee;
The JSONs I gave to the ETL tool are, for JoinA:
{
  "source": { "file": { "path": "the_path" } },
  "extractor": {
    "csv": {
      "separator": " ",
      "columns": [
        "eid:Integer",
        "name:String",
        "eage:Short"
      ]
    }
  },
  "transformers": [
    { "vertex": { "class": "Employee", "skipDuplicates": true } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:thepath",
      "dbType": "graph",
      "useLightweightEdges": false
    }
  }
}
and for JoinB:
{
  "source": { "file": { "path": "thepath" } },
  "extractor": {
    "csv": {
      "separator": " ",
      "columnsOnFirstLine": false,
      "quote": "\"",
      "columns": [
        "id:String",
        "textual:String"
      ]
    }
  },
  "transformers": [
    { "vertex": { "class": "XYZProfile", "skipDuplicates": true } },
    { "edge": { "class": "SameAs",
        "direction": "out",
        "joinFieldName": "id",
        "lookup": "Employee.eid",
        "unresolvedLinkAction": "ERROR" } },
  ],
  "loader": {
    "orientdb": {
      "dbURL": "path",
      "dbUser": "root",
      "dbPassword": "pwd",
      "dbType": "graph",
      "useLightweightEdges": false }
  }
}
Now, the problem is that when I run select expand(both()) from Employee I get the edges in the out_SameAs column, while when I run select expand(both()) from XYZProfile I get nothing.
This is weird, since the first query told me that the #CLASS pointed to by the edges is XYZProfile.
Does anybody know what's wrong with my example?
Cheers,
Alberto
JoinA:
1 A 10
2 B 14
3 C 22
JoinB:
1 i0
1 i1
2 i2
Check your JSON file; I think there is an error in it. You forgot to put [] at the beginning and end of the JSON file.
It was actually my fault.
The line CREATE PROPERTY SameAs.In LINK Employee; was the problem: In should have been all lowercase, as pointed out here.
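For completeness, a sketch of the fix in the console (OrientDB property names are case-sensitive, so the mis-cased property has to be dropped and recreated):
DROP PROPERTY SameAs.In
CREATE PROPERTY SameAs.in LINK Employee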

PapaParse Errors explanation

I'm using PapaParse to parse a CSV file into JSON for further use. Upon parsing it returns:
"errors": [ { "type": "FieldMismatch", "code": "TooFewFields", "message": "Too few fields: expected 21 fields but parsed 1", "row": 14 } ], "meta": { "delimiter": ";", "linebreak": "\r\n", "aborted": false, "truncated": false, "fields": [ "Output in top 10 percentiles (%)", "Overall", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014" ] } }
Can somebody please explain what this means? I read through the documentation on their webpage but still don't understand what is wrong.
The CSV file I'm working with is this one: http://www.topdeckandwreck.com/excel%20graphs/Sheet10.csv
In your config, add
skipEmptyLines: true
Reference: http://papaparse.com/docs#config-details
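A minimal sketch of where the option goes (the csvFile handle is hypothetical, e.g. a File from a file input; the option names are from the PapaParse docs):
// Parse the CSV, treating the first row as field names and ignoring empty lines.
Papa.parse(csvFile, {
  header: true,          // first row becomes the field names, as in the meta output above
  skipEmptyLines: true,  // drop empty lines such as a trailing newline-only row
  complete: function (results) {
    console.log(results.data);    // parsed rows as JSON objects
    console.log(results.errors);  // the TooFewFields mismatch should be gone
  }
});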
The solution was posted by Lasse V Karlsen in the comments: removing the last empty line in Notepad, so the CSV file contains only data, removes the error.