OrientDB ETL Edge transformer with two joinFieldName(s)

With one joinFieldName and lookup, the Edge transformer works perfectly. However, two keys are now required, i.e. a compound index in the lookup. How can two joinFieldNames be specified?
This is the scripted (post-processing) version:
CREATE EDGE Expands FROM (SELECT FROM MC WHERE sample=1 AND mkey=6) TO (SELECT FROM Event WHERE sample=1 AND mcl=6)
This works, but is not suitable for production.
Can anyone help?

You can simply add two joinFieldName/lookup pairs, like this:
{ "edge": { "class": "Conn",
"joinFieldName": "b1",
"lookup": "A.a1",
"joinFieldName": "b2",
"lookup": "A.a2",
"direction": "out"
}}
See my test data below:
json1.json
{
  "source": { "file": { "path": "/home/ivan/Scrivania/cose/etl/stak39517796/data1.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "A" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/ivan/OrientDB/db_installati/enterprise/orientdb-enterprise-2.2.10/databases/stack39517796",
      "dbType": "graph",
      "dbAutoCreate": true,
      "classes": [
        { "name": "A", "extends": "V" },
        { "name": "B", "extends": "V" },
        { "name": "Conn", "extends": "E" }
      ]
    }
  }
}
json2.json
{
  "source": { "file": { "path": "/home/ivan/Scrivania/cose/etl/stak39517796/data2.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "B" } },
    { "edge": {
        "class": "Conn",
        "joinFieldName": "b1",
        "lookup": "A.a1",
        "joinFieldName": "b2",
        "lookup": "A.a2",
        "direction": "out"
    } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/ivan/OrientDB/db_installati/enterprise/orientdb-enterprise-2.2.10/databases/stack39517796",
      "dbType": "graph",
      "dbAutoCreate": true,
      "classes": [
        { "name": "A", "extends": "V" },
        { "name": "B", "extends": "V" },
        { "name": "Conn", "extends": "E" }
      ]
    }
  }
}
data1.csv
a1,a2
1,1
1,2
2,3
data2.csv
b1,b2
1,1
2,3
1,2
Execution order:
json1.json
json2.json
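Each configuration can be run with the standard ETL launcher, for example:
$ oetl.sh json1.json
$ oetl.sh json2.json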
And here is the final result:
orientdb {db=stack39517796}> select from v
+----+-----+------+----+----+-------+----+----+--------+
|# |#RID |#CLASS|a1 |a2 |in_Conn|b2 |b1 |out_Conn|
+----+-----+------+----+----+-------+----+----+--------+
|0 |#17:0|A |1 |1 |[#25:0]| | | |
|1 |#18:0|A |1 |2 |[#27:0]| | | |
|2 |#19:0|A |2 |3 |[#26:0]| | | |
|3 |#21:0|B | | | |1 |1 |[#25:0] |
|4 |#22:0|B | | | |3 |2 |[#26:0] |
|5 |#23:0|B | | | |2 |1 |[#27:0] |
+----+-----+------+----+----+-------+----+----+--------+

Related

Druid: Using multiple dimensions for a Dimension Extraction Function

Is it possible to use multiple dimensions for a dimension extraction function?
Something like:
{
  "type": "extraction",
  "dimension": ["dimension_1", "dimension_2"],
  "outputName": "new_dimension",
  "outputType": "STRING",
  "extractionFn": {
    "type": "javascript",
    "function": "function(x, y){ /* do sth with both x and y to return the result */ }"
  }
}
I do not think this is possible. However, you can achieve something like it by first "merging" the two dimensions into one with a virtualColumn, then applying an extraction function to the combined value. Inside the function you can split the values again.
Example query (using https://github.com/level23/druid-client)
// Imports from the level23/druid-client package (namespaces assumed).
use Level23\Druid\DruidClient;
use Level23\Druid\Extractions\ExtractionBuilder;

$client = new DruidClient([
    "router_url" => "https://your.druid"
]);

// Build a groupBy query.
$builder = $client->query("hits")
    ->interval("now - 1 hour/now")
    ->select("os_name")
    ->select("browser")
    ->virtualColumn("concat(os_name, ';', browser)", "combined")
    ->sum("hits")
    ->select("combined", "coolBrowser", function (ExtractionBuilder $extractionBuilder) {
        $extractionBuilder->javascript("function(t) { parts = t.split(';'); return parts[0] + ' with cool ' + parts[1]; }");
    })
    ->where("os_name", "!=", "")
    ->where("browser", "!=", "")
    ->orderBy("hits", "desc");
// Execute the query.
$response = $builder->groupBy();
Example result:
+--------+--------------------------------------------------+--------------------------+---------------------------+
| hits | coolBrowser | browser | os_name |
+--------+--------------------------------------------------+--------------------------+---------------------------+
| 418145 | Android with cool Chrome Mobile | Chrome Mobile | Android |
| 62937 | Windows 10 with cool Edge | Edge | Windows 10 |
| 27956 | Android with cool Samsung Browser | Samsung Browser | Android |
| 9460 | iOS with cool Safari | Safari | iOS |
+--------+--------------------------------------------------+--------------------------+---------------------------+
Raw native druid json query:
{
  "queryType": "groupBy",
  "dataSource": "hits",
  "intervals": [
    "2021-10-15T11:25:23.000Z/2021-10-15T12:25:23.000Z"
  ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "os_name",
      "outputType": "string",
      "outputName": "os_name"
    },
    {
      "type": "default",
      "dimension": "browser",
      "outputType": "string",
      "outputName": "browser"
    },
    {
      "type": "extraction",
      "dimension": "combined",
      "outputType": "string",
      "outputName": "coolBrowser",
      "extractionFn": {
        "type": "javascript",
        "function": "function(t) { parts = t.split(\";\"); return parts[0] + \" with cool \" + parts[1]; }",
        "injective": false
      }
    }
  ],
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "not",
        "field": {
          "type": "selector",
          "dimension": "os_name",
          "value": ""
        }
      },
      {
        "type": "not",
        "field": {
          "type": "selector",
          "dimension": "browser",
          "value": ""
        }
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "hits",
      "fieldName": "hits"
    }
  ],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "combined",
      "expression": "concat(os_name, ';', browser)",
      "outputType": "string"
    }
  ],
  "context": {
    "groupByStrategy": "v2"
  },
  "limitSpec": {
    "type": "default",
    "columns": [
      {
        "dimension": "hits",
        "direction": "descending",
        "dimensionOrder": "lexicographic"
      }
    ]
  }
}
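Note that the virtual column is evaluated at query time, so this approach requires no change to ingestion and no reindexing of the datasource.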

Split big JSON file by type

Consider a big JSON file in this format (e.g. all.json):
[
  {
    "Name": "abc",
    "Type": "movie"
  },
  {
    "Name": "bcd",
    "Type": "series"
  },
  {
    "Name": "asd",
    "Type": "movie"
  },
  {
    "Name": "sdf",
    "Type": "series"
  }
]
I want to split this file into two files by type:
series.json
[
  {
    "Name": "bcd",
    "Type": "series"
  },
  {
    "Name": "sdf",
    "Type": "series"
  }
]
movie.json
[
  {
    "Name": "abc",
    "Type": "movie"
  },
  {
    "Name": "asd",
    "Type": "movie"
  }
]
What is the best approach to do this split using PowerShell? Can someone help?
Try this:
# import data from the file
$Array = Get-Content "C:\temp\test.json" | ConvertFrom-Json
# group data by type and export each group to its own file
$Array | Group-Object Type | ForEach-Object {
    $File = "C:\temp\{0}.json" -f $_.Name
    $_.Group | ConvertTo-Json | Out-File $File
}
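One caveat worth adding (not from the original answer): when a group contains a single object, piping it to ConvertTo-Json emits a bare object rather than a one-element array. Passing the group as an explicit array via -InputObject keeps the output an array:
$_.Group | ConvertTo-Json                  # one item -> bare JSON object
ConvertTo-Json -InputObject @($_.Group)    # always a JSON array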

KSQL streams - Get data from Array of Struct

My JSON looks like:
{
  "Obj1": {
    "a": "abc",
    "b": "def",
    "c": "ghi"
  },
  "ArrayObj": [
    {
      "key1": "1",
      "Key2": "2",
      "Key3": "3",
    },
    {
      "key1": "4",
      "Key2": "5",
      "Key3": "6",
    },
    {
      "key1": "7",
      "Key2": "8",
      "Key3": "9",
    }
  ]
}
I have written KSQL streams to convert it to Avro and save it to a topic, so that I can push it to a JDBC sink connector:
CREATE STREAM Example1(ArrayObj ARRAY<STRUCT<key1 VARCHAR, Key2 VARCHAR>>,Obj1 STRUCT<a VARCHAR>)WITH(kafka_topic='sample_topic', value_format='JSON');
CREATE STREAM Example_Avro WITH(VALUE_FORMAT='avro') AS SELECT e.ArrayObj[0] FROM Example1 e;
In Example_Avro I can get only the first object in the array.
How can I get the data shown below when I run select * from Example_Avro in KSQL?
a    b    key1  key2  key3
abc  def  1     2     3
abc  def  4     5     6
abc  def  7     8     9
Test data (I removed the invalid trailing commas after the Key3 values):
ksql> PRINT test4;
Format:JSON
1/9/20 7:45:18 PM UTC , NULL , { "Obj1": { "a": "abc", "b": "def", "c": "ghi" }, "ArrayObj": [ { "key1": "1", "Key2": "2", "Key3": "3" }, { "key1": "4", "Key2": "5", "Key3": "6" }, { "key1": "7", "Key2": "8", "Key3": "9" } ] }
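For completeness, a source stream over that topic can be declared with the same syntax the question uses (a sketch; the stream name TEST4 and the full field list are assumed from the test data above):
CREATE STREAM TEST4 (Obj1 STRUCT<a VARCHAR, b VARCHAR, c VARCHAR>, ArrayObj ARRAY<STRUCT<key1 VARCHAR, Key2 VARCHAR, Key3 VARCHAR>>) WITH (KAFKA_TOPIC='test4', VALUE_FORMAT='JSON');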
Query:
SELECT OBJ1->A AS A,
OBJ1->B AS B,
EXPLODE(ARRAYOBJ)->KEY1 AS KEY1,
EXPLODE(ARRAYOBJ)->KEY2 AS KEY2,
EXPLODE(ARRAYOBJ)->KEY3 AS KEY3
FROM TEST4
EMIT CHANGES;
Result:
+-------+-------+------+-------+-------+
|A |B |KEY1 |KEY2 |KEY3 |
+-------+-------+------+-------+-------+
|abc |def |1 |2 |3 |
|abc |def |4 |5 |6 |
|abc |def |7 |8 |9 |
Tested on ksqlDB 0.6, in which the EXPLODE function was added.

Create new Dataframe from Json element inside XML using Pyspark

Hi, I'm dealing with a rather difficult XML file which I'm trying to reformat and clean for some processing. I've been using PySpark to process the data into a dataframe, and I am using com.databricks.spark.xml to read the file.
My dataframe looks like this; each field is JSON-formatted:
+----------------+---------------------------------+
| Identifier| Info|
+----------------+---------------------------------+
| JSON | Json |
| | |
| | |
+----------------+---------------------------------+
This is a sample value from the Identifier column:
{
  "Other": [
    {
      "_Type": "A",
      "_VALUE": "999"
    },
    {
      "_Type": "B",
      "_VALUE": "31086"
    },
    {
      "_Type": "C",
      "_VALUE": "13123"
    },
    {
      "_Type": "D",
      "_VALUE": "32323"
    },
    {
      "_Type": "E",
      "_VALUE": "2223"
    },
    {
      "_Type": "F",
      "_VALUE": "100"
    }
  ]
}
And this is what the Info column looks like:
{
  "Demo": {
    "BirthDate": "2009-09-13",
    "BirthPlace": { "_VALUE": null, "_nil": true },
    "Rel": { "_VALUE": null, "_nil": true }
  },
  "EmailList": { "_VALUE": null, "_nil": true },
  "Name": {
    "LastName": "Marwan",
    "FullName": { "_VALUE": null, "_nil": true },
    "GivenName": "Saad",
    "MiddleName": null,
    "PreferredFamilyName": { "_VALUE": null, "_nil": true }
  },
  "OtherNames": { "_VALUE": null, "_nil": true }
}
I am trying to create a dataframe that looks like the following:
+-------+--------+-----------+------------+------------+
| F| E| LastName| GivenName | BirthDate|
+-------+--------+-----------+------------+------------+
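No answer is preserved for this question, but here is a minimal PySpark sketch of one possible approach, assuming the two columns arrive as JSON strings (df, the schema fields, and the column names are taken from the samples above; if spark-xml already yields structs, the from_json steps can be dropped):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Schemas matching the JSON samples shown above (only the fields we need).
identifier_schema = StructType([
    StructField("Other", ArrayType(StructType([
        StructField("_Type", StringType()),
        StructField("_VALUE", StringType()),
    ])))
])
info_schema = StructType([
    StructField("Demo", StructType([StructField("BirthDate", StringType())])),
    StructField("Name", StructType([
        StructField("LastName", StringType()),
        StructField("GivenName", StringType()),
    ])),
])

# Parse both JSON columns into structs.
parsed = (df
    .withColumn("ident", F.from_json("Identifier", identifier_schema))
    .withColumn("info", F.from_json("Info", info_schema)))

# Turn the "Other" array into a _Type -> _VALUE map, so each type becomes a column.
result = (parsed
    .withColumn("other", F.map_from_entries("ident.Other"))
    .select(
        F.col("other")["F"].alias("F"),
        F.col("other")["E"].alias("E"),
        F.col("info.Name.LastName").alias("LastName"),
        F.col("info.Name.GivenName").alias("GivenName"),
        F.col("info.Demo.BirthDate").alias("BirthDate"),
    ))
result.show()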

OrientDB ETL loading CSV with vertices in one file and edges in another

I have some data in two CSV files: one contains the vertices and the other contains the edges. I'm working out how to set this up using ETL and am close, but not quite there yet. It mostly works, but my edges have properties and I'm not sure they're loading right. This question was helpful but I'm still missing something...
Here's my data:
vertices.csv:
label,data,date
v01,0.1234,2015-01-01
v02,0.5678,2015-01-02
v03,0.9012,2015-01-03
edges.csv:
u,v,weight,date
v01,v02,12.4,2015-06-17
v02,v03,17.9,2015-09-14
I import my vertices using this:
commonVertices.json:
{
  "begin": [
    { "let": { "name": "$filePath",
               "expression": "$fileDirectory.append($fileName)" } }
  ],
  "config": { "log": "info" },
  "source": { "file": { "path": "$filePath" } },
  "extractor": { "csv": { "ignoreEmptyLines": true,
                          "nullValue": "N/A",
                          "dateFormat": "yyyy-mm-dd" } },
  "transformers": [
    { "vertex": { "class": "myVertex" } },
    { "code": { "language": "Javascript",
                "code": "print(' Current record: ' + record); record;" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:my_orientdb",
      "dbType": "graph",
      "batchCommit": 1000,
      "classes": [ { "name": "myVertex", "extends": "V" } ],
      "indexes": []
    }
  }
}
vertices.json:
{ "config": { "log": "info",
"fileDirectory": "./",
"fileName": "vertices.csv"
}
}
commonEdges.json:
{
  "begin": [
    { "let": { "name": "$filePath",
               "expression": "$fileDirectory.append($fileName)" } }
  ],
  "config": { "log": "info" },
  "source": { "file": { "path": "$filePath" } },
  "extractor": { "csv": { "ignoreEmptyLines": true,
                          "nullValue": "N/A",
                          "dateFormat": "yyyy-mm-dd" } },
  "transformers": [
    { "merge": { "joinFieldName": "u", "lookup": "myVertex.label" } },
    { "edge": { "class": "myEdge",
                "joinFieldName": "v",
                "lookup": "myVertex.label",
                "direction": "out",
                "unresolvedLinkAction": "NOTHING" } },
    { "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:my_orientdb",
      "dbType": "graph",
      "batchCommit": 1000,
      "useLightweightEdges": false,
      "classes": [
        { "name": "myEdge", "extends": "E" }
      ],
      "indexes": []
    }
  }
}
edges.json:
{
  "config": {
    "log": "info",
    "fileDirectory": "./",
    "fileName": "edges.csv"
  }
}
I am running it with oetl.sh like this:
$ oetl.sh vertices.json commonVertices.json
$ oetl.sh edges.json commonEdges.json
Everything runs, but when I query the edges I don't see the weight and date fields. I'm new to OrientDB, so maybe the properties are in my edges and I'm just not seeing them:
orientdb {db=my_orientdb}> SELECT FROM myEdge
+----+-----+------+-----+-----+
|# |#RID |#CLASS|out |in |
+----+-----+------+-----+-----+
|0 |#33:0|myEdge|#25:0|#26:0|
|1 |#34:0|myEdge|#26:0|#27:0|
+----+-----+------+-----+-----+
The vertex table contains the [weight] field from my edges.csv, and the [date] field is getting clobbered in a weird way: the day of the month is getting overwritten with the day from edges.csv, which is undesirable, but it's odd to me that the month itself isn't also getting changed:
orientdb {db=my_orientdb}> SELECT FROM myVertex
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|# |#RID |#CLASS |data |date |label|weight|out_myEdge|in_myEdge|
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|0 |#25:0|myVertex|0.1234|2015-01-17 00:06:00|v01 |12.4 |[#33:0] | |
|1 |#26:0|myVertex|0.5678|2015-01-14 00:09:00|v02 |17.9 |[#34:0] |[#33:0] |
|2 |#27:0|myVertex|0.9012|2015-01-03 00:01:00|v03 | | |[#34:0] |
+----+-----+--------+------+-------------------+-----+------+----------+---------+
I'm sure it's probably a simple tweak, any help would be great!
In the edge transformer, use edgeFields to bind properties to the edges. Example:
"transformers": [
{ "merge": { "joinFieldName": "u", "lookup": "myVertex.label" } },
{ "edge": { "class": "myEdge",
"joinFieldName": "v",
"lookup": "myVertex.label",
"edgeFields": { "weight": "${input.weight}", "date": "${input.date}" },
"direction": "out",
"unresolvedLinkAction": "NOTHING"
}
},
{ "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
],
Hope it helps.
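A side note on the clobbered dates (my observation, not part of the original answer): in Java date patterns, lowercase mm means minutes, not months, so "dateFormat": "yyyy-mm-dd" parses 2015-06-17 as year 2015, minute 06, day 17, which is exactly the 2015-01-17 00:06:00 shown above. Using MM for the month should fix it:
"dateFormat": "yyyy-MM-dd"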