How to import Edges from CSV with ETL into OrientDB graph?

I'm trying to import edges from a CSV-file into OrientDB. The vertices are stored in a separate file and already imported via ETL into OrientDB.
So my situation is similar to OrientDB import edges only using ETL tool and OrientDB ETL loading CSV with vertices in one file and edges in another.
Update
Friend.csv
"id","client_id","first_name","last_name"
"0","0","John-0","Doe"
"1","1","John-1","Doe"
"2","2","John-2","Doe"
...
The "id" field is removed by the Friend-Importer, but the "client_id" is stored. The idea is to have a known client-side generated id for searching etc.
PendingFriendship.csv
"friendship_id","client_id","from","to"
"0","0-1","1","0"
"2","0-15","15","0"
"3","0-16","16","0"
...
The "friendship_id" and "client_id" should be imported as attributes of the "PendingFriendship" edge. "from" is a "client_id" of a Friend. "to" is a "client_id" of another Friend.
A unique index on "client_id" exists on both Friend and PendingFriendship.
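For reference, a unique index like that can be created with OrientDB SQL roughly as follows (the property and index names here are assumptions based on the description above):
CREATE PROPERTY Friend.client_id STRING
CREATE INDEX Friend.client_id ON Friend(client_id) UNIQUE
CREATE PROPERTY PendingFriendship.client_id STRING
CREATE INDEX PendingFriendship.client_id ON PendingFriendship(client_id) UNIQUE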
My ETL configuration looks like this:
...
"extractor": {
    "csv": {}
},
"transformers": [
    {
        "command": {
            "command": "CREATE EDGE PendingFriendship FROM (SELECT FROM Friend WHERE client_id = '${input.from}') TO (SELECT FROM Friend WHERE client_id = '${input.to}') SET client_id = '${input.client_id}'",
            "output": "edge"
        }
    },
    {
        "field": {
            "fieldName": "from",
            "operation": "remove"
        }
    },
    {
        "field": {
            "fieldName": "to",
            "operation": "remove"
        }
    },
    {
        "field": {
            "fieldName": "friendship_id",
            "operation": "remove"
        }
    },
    {
        "field": {
            "fieldName": "client_id",
            "operation": "remove"
        }
    },
    {
        "field": {
            "fieldName": "#class",
            "value": "PendingFriendship"
        }
    }
],
...
The issue with this configuration is that it creates two edge entries. One is the expected "PendingFriendship" edge. The second is an empty "PendingFriendship" edge whose attributes are the fields I removed, all with empty values.
The import fails at the second row/document, because another empty "PendingFriendship" cannot be inserted without violating the uniqueness constraint.
How can I avoid the creation of the unnecessary empty "PendingFriendship" edges?
What is the best way to import edges into OrientDB? All the examples in the documentation use CSV files where vertices and edges are in one file, but that is not the case for me.
I also had a look at the Edge transformer, but it returns a Vertex, not an Edge!
(Screenshot: created PendingFriendships)

After some time I found a way (a workaround) to import the above data into OrientDB. Instead of using the ETL tool, I wrote simple Ruby scripts that call the HTTP API of OrientDB using the Batch endpoint.
Steps:
1. Import the Friends.
2. Use the response to create a mapping of client_ids to #rids.
3. Parse the PendingFriendship.csv and build batch requests.
4. Each friendship is created by its own command.
5. The mapping from step 2 is used to insert the #rids into the command from step 4.
6. Send the batch requests in chunks of 1000 commands.
Example batch request body:
{
    "transaction" : true,
    "operations" : [
        {
            "type" : "cmd",
            "language" : "sql",
            "command" : "create edge PendingFriendship from #27:178 to #27:179 set client_id='4711'"
        }
    ]
}
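A chunk simply carries several commands in the same "operations" array. A sketch of a two-command chunk (the #rids and client_ids are illustrative values, not real ones):
{
    "transaction" : true,
    "operations" : [
        {
            "type" : "cmd",
            "language" : "sql",
            "command" : "create edge PendingFriendship from #27:178 to #27:179 set client_id='0-1'"
        },
        {
            "type" : "cmd",
            "language" : "sql",
            "command" : "create edge PendingFriendship from #27:180 to #27:179 set client_id='0-15'"
        }
    ]
}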
This isn't an answer to the question I asked, but it achieves the higher-level goal of importing the data into OrientDB for me. I therefore leave it to the community to decide whether this question should be marked as solved or not.

Related

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this; it's a list of tables and their data types:
{
    "Tables": [
        {
            "name": "account",
            "description": "account",
            "$type": "LocalEntity",
            "attributes": [
                { "dataType": "guid", "maxLength": "-1", "name": "Id" },
                { "dataType": "string", "maxLength": "250", "name": "name" }
            ]
        },
        {
            "name": "customer",
            "description": "account",
            "$type": "LocalEntity",
            "attributes": [
                { "dataType": "guid", "maxLength": "-1", "name": "Id" },
                { "dataType": "string", "maxLength": "100", "name": "name" }
            ]
        }
    ]
}
What we need to do is loop through this JSON and, via an ADF data flow, create the required tables in the destination database.
We initially designed the pipeline with a Lookup activity that loads the JSON file and passes the output to a ForEach loop. This worked really well while we only had a small JSON file, but once we started using real data the JSON file exceeded the Lookup activity's 4 MB limit and the activity threw an error.
We then tried a mapping data flow, loading the JSON as a source, setting the sink to a cache, and writing the result to an output variable that we then loop through. Again this works with smaller datasets, but as soon as the dataset is large enough it can't be parsed into an output.
I am sure this should be easy to do but just can't get my head around it!
Here is a sample procedure for looping through a large JSON file in a data flow:
1. Create a linked service and a dataset pointing to the JSON file path.
2. Use that dataset as the source of the data flow.
3. Add a Flatten transformation; it takes the input columns from the source and unrolls the nested array chosen in the "Unroll by" option.
4. Create a linked service and a dataset for the sink path.
5. Attach the data flow to a Data Flow activity in the pipeline.
The result then lands in the SQL DB as expected.

Mapping multiple csv files in Data Factory

Does anyone know if there is a way to pass a schema mapping to multiple CSVs without doing it manually? I have 30 CSVs passed through a data flow in a ForEach activity, so I can't detect or set the fields' types (because I could only do that for the first one).
Thanks for your help! :)
A Copy Activity mapping can be parameterized and changed at runtime if explicit mapping is required. The parameter is just a json object that you'd pass in for each of the files you are processing. It looks something like this:
{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "Id" },
            "sink": { "name": "CustomerID" }
        },
        {
            "source": { "name": "Name" },
            "sink": { "name": "LastName" }
        },
        {
            "source": { "name": "LastModifiedDate" },
            "sink": { "name": "ModifiedDate" }
        }
    ]
}
You can read more about it here: Schema and data type mapping in copy activity
So you can either pre-generate these mappings and fetch them via a Lookup in a previous step of the pipeline, or, if they need to be dynamic, you can create them at runtime with code (e.g. an Azure Function that looks up the current schema of the CSV and returns a properly formatted translator object).
Once you have the object as a parameter, you can pass it to the Copy activity: on the mapping properties of the Copy activity, just Add Dynamic Content and select the appropriate parameter.
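As a rough sketch (the parameter name "mapping" and the source/sink types are assumptions), the translator property of the Copy activity in the pipeline JSON then ends up looking something like this:
"typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" },
    "translator": {
        "value": "@pipeline().parameters.mapping",
        "type": "Expression"
    }
}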

Kafka Connect JSON Schema does not appear to support "$ref" tags

I am using Kafka Connect with JSONSchema and am in a situation where I need to convert the JSON schema manually (to "Schema") within a Kafka Connect plugin. I can successfully retrieve the JSON Schema from the Schema Registry and convert simple JSON Schemas, but I am having difficulties with complex ones that contain valid "$ref" tags referencing components within a single JSON Schema definition.
I have several questions:
The JsonConverter.java does not appear to handle "$ref". Am I correct, or does it handle it in another way elsewhere?
Does the Schema Registry handle the referencing of sub-definitions? If yes, is there code that shows how the dereferencing is handled?
Should the JSON Schema be resolved to a string without references (i.e. inline the references) before submitting to the Schema Registry, thereby removing the "$ref" issue?
I am looking at the Kafka Source code module JsonConverter.java below:
https://github.com/apache/kafka/blob/trunk/connect/json/src/main/java/org/apache/kafka/connect/json/JsonConverter.java#L428
An example of the complex schema (taken from the JSON Schema site) is shown below (notice the "$ref": "#/$defs/veggie" tag that references a later sub-definition):
{
    "$id": "https://example.com/arrays.schema.json",
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "description": "A representation of a person, company, organization, or place",
    "title": "complex-schema",
    "type": "object",
    "properties": {
        "fruits": {
            "type": "array",
            "items": { "type": "string" }
        },
        "vegetables": {
            "type": "array",
            "items": { "$ref": "#/$defs/veggie" }
        }
    },
    "$defs": {
        "veggie": {
            "type": "object",
            "required": [ "veggieName", "veggieLike" ],
            "properties": {
                "veggieName": {
                    "type": "string",
                    "description": "The name of the vegetable."
                },
                "veggieLike": {
                    "type": "boolean",
                    "description": "Do I like this vegetable?"
                }
            }
        }
    }
}
Below is the actual schema returned from the Schema Registry after the schema was successfully registered:
[
    {
        "subject": "complex-schema",
        "version": 1,
        "id": 1,
        "schemaType": "JSON",
        "schema": "{\"$id\":\"https://example.com/arrays.schema.json\",\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"description\":\"A representation of a person, company, organization, or place\",\"title\":\"complex-schema\",\"type\":\"object\",\"properties\":{\"fruits\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"vegetables\":{\"type\":\"array\",\"items\":{\"$ref\":\"#/$defs/veggie\"}}},\"$defs\":{\"veggie\":{\"type\":\"object\",\"required\":[\"veggieName\",\"veggieLike\"],\"properties\":{\"veggieName\":{\"type\":\"string\",\"description\":\"The name of the vegetable.\"},\"veggieLike\":{\"type\":\"boolean\",\"description\":\"Do I like this vegetable?\"}}}}}"
    }
]
The actual schema is embedded in the above returned string (the contents of the "schema" field) and contains the $ref references:
{\"$id\":\"https://example.com/arrays.schema.json\",\"$schema\":\"https://json-schema.org/draft/2020-12/schema\",\"description\":\"A representation of a person, company, organization, or place\",\"title\":\"complex-schema\",\"type\":\"object\",\"properties\":{\"fruits\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"vegetables\":{\"type\":\"array\",\"items\":{\"$ref\":\"#/$defs/veggie\"}}},\"$defs\":{\"veggie\":{\"type\":\"object\",\"required\":[\"veggieName\",\"veggieLike\"],\"properties\":{\"veggieName\":{\"type\":\"string\",\"description\":\"The name of the vegetable.\"},\"veggieLike\":{\"type\":\"boolean\",\"description\":\"Do I like this vegetable?\"}}}}}
Again, the JsonConverter in the Apache Kafka source code has no notion of JSON Schema; therefore, no, $ref doesn't work, and it also doesn't integrate with the Registry.
You seem to be looking for the io.confluent.connect.json.JsonSchemaConverter class and its logic.
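For example, a sink connector configuration that uses that converter against the Schema Registry looks roughly like this (the connector class, topic, file path, and Registry URL are placeholders for illustration):
{
    "name": "my-json-schema-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "topics": "my-topic",
        "file": "/tmp/out.txt",
        "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081"
    }
}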

Use ETL to load CSV data into OrientDB containing a SPATIAL index

I'm interested in loading some data into an OrientDB from some CSV files that contain spatial coordinates in WGS84 Lat/Long.
I'm using OrientDB 2.2.8 and have the lucene spatial module added to my $ORIENTDB_HOME/lib directory.
I'm loading my data into a database using ETL and would like to add the spatial index but I'm not sure how to do this.
Say my CSV file has the following columns:
Label (string)
Latitude (float)
Longitude (float)
I've tried this in my ETL:
"loader": {
    "orientdb": {
        "dbURL": "plocal:myDatabase.orientdb",
        "dbType": "graph",
        "batchCommit": 1000,
        "classes": [ { "name": "vertex", "extends": "V" } ],
        "indexes": [
            { "class": "vertex", "fields": ["Label:string"], "type": "UNIQUE" },
            { "class": "Label", "fields": ["Latitude:float", "Longitude:float"], "type": "SPATIAL" }
        ]
    }
}
but it's not working. I get the following error message:
ETL process has problem: com.orientechnologies.orient.core.index.OIndexException: Index with type SPATIAL and algorithm null does not exist.
Has anyone looked into creating spatial indices via ETL? Most of the material I'm seeing on this uses either Java or direct queries.
Thanks in advance for any advice.
I was able to get it to load using the legacy spatial capabilities.
I put together a cheezy dataset that has some coordinates for a few of the Nazca line geoglyphs:
Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.7067274,-75.1475391
Condor,-14.6983457,-75.1283374
Spider,-14.694363,-75.1235815
Spiral,-14.688309,-75.122757
Hands,-14.694459,-75.113881
Tree,-14.693897,-75.114467
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788
I used a script to create my GeoGlyph class, createVertexGeoGlyph.osql:
set echo true
connect PLOCAL:./nazca.orientdb admin admin
CREATE CLASS GeoGlyph EXTENDS V CLUSTERS 1
CREATE PROPERTY GeoGlyph.Name STRING
CREATE PROPERTY GeoGlyph.Latitude FLOAT
CREATE PROPERTY GeoGlyph.Longitude FLOAT
CREATE PROPERTY GeoGlyph.Tag EMBEDDEDSET STRING
CREATE INDEX GeoGlyph.index.Location ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE
which I load into my database using
$ console.sh createVertexGeoGlyph.osql
I do it this way because it seems to work more consistently for me. I've had some difficulties getting the ETL engine to create defined properties from CSV imports when I've wanted it to; sometimes it cooperates and creates my properties, and other times it has trouble.
So, the next step to get the data in is to create my .json files for the ETL process. I like to make two: one that is file-specific and another that is common, since I often have datasets that span multiple files.
First, I have my nazca_lines.json file:
{
    "config": {
        "log": "info",
        "fileDirectory": "./",
        "fileName": "nazca_lines.csv"
    }
}
Next is the commonGeoGlyph.json file:
{
    "begin": [
        { "let": { "name": "$filePath", "expression": "$fileDirectory.append($fileName)" } }
    ],
    "config": { "log": "debug" },
    "source": { "file": { "path": "$filePath" } },
    "extractor": {
        "csv": {
            "ignoreEmptyLines": true,
            "nullValue": "N/A",
            "separator": ",",
            "columnsOnFirstLine": true,
            "dateFormat": "yyyy-MM-dd"
        }
    },
    "transformers": [
        { "vertex": { "class": "GeoGlyph" } },
        { "code": { "language": "Javascript",
                    "code": "print('>>> Current record: ' + record); record;" } }
    ],
    "loader": {
        "orientdb": {
            "dbURL": "plocal:nazca.orientdb",
            "dbType": "graph",
            "batchCommit": 1000,
            "classes": [],
            "indexes": []
        }
    }
}
There's more in this file than is strictly necessary; I use it as a template for a lot of things. In this case I don't have to create the index in the ETL file itself, because I already created it in the createVertexGeoGlyph.osql file.
To load the data I just use the oetl.sh script:
$ oetl.sh commonGeoGlyph.json nazca_lines.json
This is what's working for me... I'm sure there are better ways to do it, but this works. I'm posting this here to tie off the question. Hopefully someone will find this to be useful.
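As a quick sanity check after the load, a NEAR query against the legacy spatial index should return the glyphs close to a given point; something along these lines (maxDistance is in kilometers; treat this as a sketch):
SELECT Name FROM GeoGlyph
WHERE [Latitude,Longitude,$spatial] NEAR [-14.692131,-75.148892,{"maxDistance": 5}]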

Orientdb ETL update Vertex or just add edge

I have 3 CSV files to load in an OrientDB graph.
People
Product
Purchases
People.csv is like
person_id;name
1;francesco
2;luca
Product.csv is like
product_id;product_name
101;apple
102;banana
Purchases.csv is like
person_id;product_id;avg_price
1;101;$1.10
2;101;$1.08
1;102;$5.34
I first load all the people and the products with two different ETL jobs.
Each job loads vertices.
How can I periodically load just the edges using OrientDB ETL, as people buy new products?
All the transformers, and particularly EDGE, output an OrientVertex, which can only be INSERTed by the LOADER step.
(The EDGE transformer adds edge properties to the vertex, but the actual action is an INSERT of the vertex.) Is there a way to update a vertex using the ETL?
Rgds,
Francesco
An ETL JSON with these transformers should import the "Purchase" edges from Purchases.csv and update the avg_price of each purchased product.
"transformers": [
    { "merge": { "joinFieldName": "product_id", "lookup": "Product.id" } },
    { "vertex": { "class": "Product", "skipDuplicates": true } },
    { "edge": {
        "class": "Purchase",
        "joinFieldName": "person_id",
        "lookup": "Person.id",
        "direction": "in"
      }
    },
    { "field": { "fieldNames": ["person_id", "product_id"], "operation": "remove" } }
]
Class and attribute names ("Product.id", "Person", etc.) may differ based on your DB schema.
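For completeness, a full purchases ETL file wrapping those transformers might look roughly like this (the file path, database URL, and the separator/columnsOnFirstLine settings are assumptions based on the CSVs above):
{
    "source": { "file": { "path": "./Purchases.csv" } },
    "extractor": {
        "csv": { "separator": ";", "columnsOnFirstLine": true }
    },
    "transformers": [
        { "merge": { "joinFieldName": "product_id", "lookup": "Product.id" } },
        { "vertex": { "class": "Product", "skipDuplicates": true } },
        { "edge": { "class": "Purchase", "joinFieldName": "person_id", "lookup": "Person.id", "direction": "in" } },
        { "field": { "fieldNames": ["person_id", "product_id"], "operation": "remove" } }
    ],
    "loader": {
        "orientdb": {
            "dbURL": "plocal:./shop.orientdb",
            "dbType": "graph",
            "batchCommit": 1000
        }
    }
}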