I have 3 CSV files to load in an OrientDB graph.
People
Product
Purchases
People.csv is like
person_id;name
1;francesco
2;luca
Product.csv is like
product_id;product_name
101;apple
102;banana
Purchases.csv is like
person_id;product_id;avg_price
1;101;$1.10
2;101;$1.08
1;102;$5.34
I load first all the people and the products with 2 different ETL jobs.
Each job loads vertices.
How can I periodically load just the edges using OrientDB ETL, as people buy new products?
All the Transformers, and the EDGE transformer in particular, output an OrientVertex, which can only be INSERTed by the LOADER step.
(The EDGE transformer adds edge properties to the Vertex, but the actual action is an INSERT of the Vertex.) Is there a way to update a Vertex using the ETL?
Regards,
Francesco
An ETL JSON with these transformers should import the "Purchase" edges from Purchases.csv and update the avg_price of each purchased product.
"transformers": [
{ "merge": { "joinFieldName": "product_id", "lookup": "Product.id" } },
{ "vertex": {"class": "Product", "skipDuplicates": true} },
{ "edge": { "class": "Purchase",
"joinFieldName": "person_id",
"lookup": "Person.id",
"direction": "in"
}
},
{ "field": { "fieldNames": ["person_id", "product_id"], "operation": "remove" } }
]
Class and attribute names ("Product.id", "Person", etc.) may differ based on your DB schema.
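For context, a minimal complete ETL file around those transformers could look like the sketch below (the source path, separator and dbURL are assumptions to adapt to your setup):
{
  "source": { "file": { "path": "/tmp/Purchases.csv" } },
  "extractor": { "csv": { "separator": ";" } },
  "transformers": [
    { "merge": { "joinFieldName": "product_id", "lookup": "Product.id" } },
    { "vertex": { "class": "Product", "skipDuplicates": true } },
    { "edge": { "class": "Purchase",
                "joinFieldName": "person_id",
                "lookup": "Person.id",
                "direction": "in" } },
    { "field": { "fieldNames": ["person_id", "product_id"], "operation": "remove" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/shop",
      "dbType": "graph",
      "classes": [ { "name": "Purchase", "extends": "E" } ]
    }
  }
}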
I have been using Rasa NLU to classify intents and entities for my chatbot. Everything works as expected (with extensive training), but with entities it seems to predict the value based on the exact position and length of the word. This is fine for a scenario where the entities are limited, but when the bot needs to identify a word that has a different length and is not trained yet (for example, a new name), it fails to detect it. Is there a way to make Rasa identify entities based on the relative position of the word, or better yet, to supply a list of domain-specific words for the entity to match against (like the phrase list in LUIS)?
{"q":"i want to buy a Casio SX56"}
{
"project": "default",
"entities": [
{
"extractor": "ner_crf",
"confidence": 0.7043648832678735,
"end": 26,
"value": "Casio SX56",
"entity": "watch",
"start": 16
}
],
"intent": {
"confidence": 0.8835646513829762,
"name": "buy_watch"
},
"text": "i want to buy a Casio SX56",
"model": "model_20180522-165141",
"intent_ranking": [
{
"confidence": 0.8835646513829762,
"name": "buy_watch"
},
{
"confidence": 0.07072182459497935,
"name": "greet"
}
]
}
But if Casio SX56 gets replaced with Citizen M1, no entity is extracted:
{"q":"i want to buy a Citizen M1"}
{
"project": "default",
"intent": {
"confidence": 0.8710909096729019,
"name": "buy_watch"
},
"text": "i want to buy a Citizen M1",
"model": "model_20180522-165141",
"intent_ranking": [
{
"confidence": 0.8710909096729019,
"name": "buy_watch"
},
{
"confidence": 0.07355588750895545,
"name": "greet"
}
]
}
Thank you!
Make sure you actually added training examples for each entity value before training with rasa_nlu.
For successful entity extraction you need to create at least two or more contextual training examples per entity value.
If it's not extracting properly, add an example like the sketch below to your rasa_nlu training data.
"text": "i want to buy a Citizen M1",
"model": "model_20180522-165141",
"intent_ranking": [
{
"confidence": 0.8710909096729019,
"name": "buy_watch"
},
{
"confidence": 0.07355588750895545,
"name": "greet"
}
]
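A minimal sketch in the rasa_nlu JSON training format (the "watch" entity and character offsets follow the example above; offsets must match the text exactly):
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "i want to buy a Citizen M1",
        "intent": "buy_watch",
        "entities": [
          { "start": 16, "end": 26, "value": "Citizen M1", "entity": "watch" }
        ]
      }
    ]
  }
}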
Entity extraction with phrase matching does work in rasa_nlu; try it with the spacy_sklearn backend pipeline.
The feature I was looking for is the phrase matcher, which would allow me to add a list of possible entities to the training model. This way, if a new name pops up, we can simply add it to the phrase list and the model will be able to identify it in all possible utterances. This is still in development and should be added to master soon: https://github.com/RasaHQ/rasa_nlu/pull/822
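For illustration, here is the general idea behind a phrase list, shown standalone with spaCy's PhraseMatcher (this is not rasa_nlu's actual implementation; the "watch" label and names come from the example above):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")             # a bare tokenizer is enough for exact phrase matching
matcher = PhraseMatcher(nlp.vocab)

# Domain-specific phrase list; add new watch names here as they appear.
watch_names = ["Casio SX56", "Citizen M1"]
# spaCy v3 signature; in v2 use matcher.add("watch", None, *patterns)
matcher.add("watch", [nlp(name) for name in watch_names])

doc = nlp("i want to buy a Citizen M1")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)  # -> watch Citizen M1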
I'm trying to import edges from a CSV file into OrientDB. The vertices are stored in a separate file and were already imported via ETL into OrientDB.
So my situation is similar to "OrientDB import edges only using ETL tool" and "OrientDB ETL loading CSV with vertices in one file and edges in another".
Update
Friend.csv
"id","client_id","first_name","last_name"
"0","0","John-0","Doe"
"1","1","John-1","Doe"
"2","2","John-2","Doe"
...
The "id" field is removed by the Friend-Importer, but the "client_id" is stored. The idea is to have a known client-side generated id for searching etc.
PendingFriendship.csv
"friendship_id","client_id","from","to"
"0","0-1","1","0"
"2","0-15","15","0"
"3","0-16","16","0"
...
The "friendship_id" and "client_id" should be imported as attributes of the "PendingFriendship" edge. "from" is a "client_id" of a Friend. "to" is a "client_id" of another Friend.
For "client_id" exists a unique Index on both Friend and PendingFriendship.
My ETL configuration looks like this:
...
"extractor": {
"csv": {
}
},
"transformers": [
{
"command": {
"command": "CREATE EDGE PendingFriendship FROM (SELECT FROM Friend WHERE client_id = '${input.from}') TO (SELECT FROM Friend WHERE client_id = '${input.to}') SET client_id = '${input.client_id}'",
"output": "edge"
}
},
{
"field": {
"fieldName": "from",
"expression": "remove"
}
},
{
"field": {
"fieldName": "to",
"operation": "remove"
}
},
{
"field": {
"fieldName": "friendship_id",
"expression": "remove"
}
},
{
"field": {
"fieldName": "client_id",
"operation": "remove"
}
},
{
"field": {
"fieldName": "#class",
"value": "PendingFriendship"
}
}
],
...
The issue with this configuration is that it creates two edge entries: one is the expected "PendingFriendship" edge, and the second is an empty "PendingFriendship" edge that has all the fields I removed as attributes with empty values.
The import fails at the second row/document, because another empty "PendingFriendship" cannot be inserted without violating the uniqueness constraint.
How can I avoid the creation of the unnecessary empty "PendingFriendship" edges?
What is the best way to import edges into OrientDB? All the examples in the documentation use CSV files where vertices and edges are in one file, but this is not the case for me.
I also had a look into the Edge transformer, but it returns a Vertex, not an Edge!
(Screenshot: the created PendingFriendship edges)
After some time I found a way (a workaround) to import the above data into OrientDB. Instead of using the ETL tool I wrote simple Ruby scripts that call the HTTP API of OrientDB using the Batch endpoint.
Steps:
1. Import the Friends.
2. Use the response to create a mapping of client_ids to #rids.
3. Parse the PendingFriendship.csv and build batch requests.
4. Each friendship is created by its own command.
5. The mapping from step 2 is used to insert the #rids into the commands from step 4.
6. Send the batch requests in chunks of 1000 commands.
Example Batch-Request body:
{
"transaction" : true,
"operations" : [
{
"type" : "cmd",
"language" : "sql",
"command" : "create edge PendingFriendship from #27:178 to #27:179 set client_id='4711'"
}
]
}
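For reference, a minimal sketch of the same batch call in Python (the original scripts were Ruby; host, port, credentials and database name are assumptions):

import requests

def send_batch(commands, db="mydb", base="http://localhost:2480",
               auth=("admin", "admin")):
    # OrientDB HTTP Batch endpoint: POST /batch/<database>
    payload = {
        "transaction": True,
        "operations": [
            {"type": "cmd", "language": "sql", "command": c} for c in commands
        ],
    }
    resp = requests.post("%s/batch/%s" % (base, db), json=payload, auth=auth)
    resp.raise_for_status()
    return resp.json()

# Send the CREATE EDGE commands in chunks of 1000, as described above.
commands = ["create edge PendingFriendship from #27:178 to #27:179 set client_id='4711'"]
for i in range(0, len(commands), 1000):
    send_batch(commands[i:i + 1000])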
This isn't the answer to the question I asked, but it solves the higher-level goal of importing the data into OrientDB for me. Therefore I leave it to the community to mark this question as solved or not.
I am using IBM Graph in Bluemix and am new to this.
I created a graph named 'test' using the GUI provided by Bluemix and uploaded the sample data 'Music Festival' provided by IBM into that graph.
Now I am trying to query all the vertices having label 'attendee' using the query below.
def gt = graph.traversal();
gt.V().hasLabel("attendee");
But I am getting this error:
Error: Error encountered evaluating script def gt = graph.traversal();gt.V().hasLabel("attendee"); with reason com.thinkaurelius.titan.core.TitanException: Could not find a suitable index to answer graph query and graph scans are disabled: [(~label = attendee)]:VERTEX
Not sure what I am doing wrong.
Can somebody tell me where I am going wrong?
How can I get rid of this error and get the expected output?
Thanks
@Radhika, your Gremlin query is a valid Gremlin query. However, some vendors (such as IBM Graph and Titan) chose to only allow users to start their traversals with a step that is indexed. This is to make sure your queries perform well. Calling hasLabel() by itself will give you the "Could not find a suitable index..." error, as you can't create indexes for labels. What you need to do is follow this step with one that uses an indexed property, as in this query:
def gt = graph.traversal(); gt.V().hasLabel("band").has("genre","pop");
An index for genre has been created in the schema for the sample music festival data, as you can see below:
{
  "propertyKeys": [
    { "name": "name", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "gender", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "age", "dataType": "Integer", "cardinality": "SINGLE" },
    { "name": "genre", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "monthly_listeners", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "date", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "time", "dataType": "String", "cardinality": "SINGLE" }
  ],
  "vertexLabels": [
    { "name": "attendee" },
    { "name": "band" },
    { "name": "venue" }
  ],
  "edgeLabels": [
    { "name": "bought_ticket", "multiplicity": "MULTI" },
    { "name": "advertised_to", "multiplicity": "MULTI" },
    { "name": "performing_at", "multiplicity": "MULTI" }
  ],
  "vertexIndexes": [
    { "name": "vByName", "propertyKeys": ["name"], "composite": true, "unique": false },
    { "name": "vByGender", "propertyKeys": ["gender"], "composite": true, "unique": false },
    { "name": "vByGenre", "propertyKeys": ["genre"], "composite": true, "unique": false }
  ],
  "edgeIndexes": [
    { "name": "eByBoughtTicket", "propertyKeys": ["time"], "composite": true, "unique": false }
  ]
}
That's why the query above works; you need to do the same.
1. If you don't have a schema, create one. You can model it after the one above or follow the API doc.
2. Create a vertex/label index for the properties that you'll start your traversals from. In this example, name, gender and genre for the vertex properties, and time for the edge properties.
3. Call the schema endpoint to add your schema to your graph.
It's recommended to create your schema before adding any data to your graph so that you don't have to reindex later. That'll save you a lot of time.
Once you create your schema, you can't modify what you created already, but you can add new properties/indexes later on.
Look at the code samples for Java and Node.js for the exact code to use.
I hope that helps
I'm interested in loading some data into an OrientDB database from CSV files that contain spatial coordinates in WGS84 lat/long.
I'm using OrientDB 2.2.8 and have the lucene spatial module added to my $ORIENTDB_HOME/lib directory.
I'm loading my data into a database using ETL and would like to add the spatial index but I'm not sure how to do this.
Say my CSV file has the following columns:
Label (string)
Latitude (float)
Longitude (float)
I've tried this in my ETL:
"loader": {
"orientdb": {
"dbURL": "plocal:myDatabase.orientdb",
"dbType": "graph",
"batchCommit": 1000,
"classes": [ { "name": "vertex", "extends", "V" } ],
"indexes": [ { "class": "vertex", "fields":["Label:string"], "type":"UNIQUE" },
{ "class": "Label", "fields":["Latitude:float","Longitude:float"], "type":"SPATIAL" }
]
}
}
but it's not working. I get the following error message:
ETL process has problem: com.orientechnologies.orient.core.index.OIndexException: Index with type SPATIAL and algorithm null does not exist.
Has anyone looked into creating spatial indices via ETL? Most of what I'm seeing on this uses either Java or direct queries.
Thanks in advance for any advice.
I was able to get it to load using the legacy spatial capabilities.
I put together a cheesy dataset that has some coordinates for a few of the Nazca line geoglyphs:
Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.7067274,-75.1475391
Condor,-14.6983457,-75.1283374
Spider,-14.694363,-75.1235815
Spiral,-14.688309,-75.122757
Hands,-14.694459,-75.113881
Tree,-14.693897,-75.114467
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788
I used a script to create my GeoGlyph class, createVertexGeoGlyph.osql:
set echo true
connect PLOCAL:./nazca.orientdb admin admin
CREATE CLASS GeoGlyph EXTENDS V CLUSTERS 1
CREATE PROPERTY GeoGlyph.Name STRING
CREATE PROPERTY GeoGlyph.Latitude FLOAT
CREATE PROPERTY GeoGlyph.Longitude FLOAT
CREATE PROPERTY GeoGlyph.Tag EMBEDDEDSET STRING
CREATE INDEX GeoGlyph.index.Location ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE
which I load into my database using
$ console.sh createVertexGeoGlyph.osql
I do it this way because it seems to work more consistently for me. I've had some difficulties getting the ETL engine to create defined properties from CSV imports; sometimes it cooperates and creates my properties, and other times it has trouble.
So, the next step to get the data in is to create my .json files for the ETL process. I like to make two: one that is file-specific and another that is common, since I often have datasets that span multiple files.
First, I have my nazca_lines.json file:
{
  "config": {
    "log": "info",
    "fileDirectory": "./",
    "fileName": "nazca_lines.csv"
  }
}
Next is the commonGeoGlyph.json file:
{
  "begin": [
    { "let": { "name": "$filePath", "expression": "$fileDirectory.append($fileName)" } }
  ],
  "config": { "log": "debug" },
  "source": { "file": { "path": "$filePath" } },
  "extractor": {
    "csv": {
      "ignoreEmptyLines": true,
      "nullValue": "N/A",
      "separator": ",",
      "columnsOnFirstLine": true,
      "dateFormat": "yyyy-MM-dd"
    }
  },
  "transformers": [
    { "vertex": { "class": "GeoGlyph" } },
    { "code": { "language": "Javascript",
                "code": "print('>>> Current record: ' + record); record;" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:nazca.orientdb",
      "dbType": "graph",
      "batchCommit": 1000,
      "classes": [],
      "indexes": []
    }
  }
}
There's more stuff in this file than is necessary; I use it as a template for a lot of things. In this case, I don't have to create my index in the ETL file itself because I already created it in the createVertexGeoGlyph.osql file.
To load the data I just use the oetl.sh script:
$ oetl.sh commonGeoGlyph.json nazca_lines.json
This is what's working for me. I'm sure there are better ways to do it, but this works. I'm posting this here to tie off the question; hopefully someone will find it useful.
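As a quick check that the index is usable, a query with OrientDB's legacy spatial operator can look like this (coordinates and distance are illustrative; maxDistance is in kilometers):
SELECT Name, $distance FROM GeoGlyph
WHERE [Latitude,Longitude,$spatial] NEAR [-14.692131,-75.148892,{"maxDistance":5}]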
I have a simple tree structure in a MySQL table (id, parentId) with about 3 million vertices and wanted to import this into an OrientDB graph database. The ETL importer imports the vertices smoothly, but can't create the edges (NullPointerException). The ETL does not even work on a plain database with the given examples in the documentation (http://orientdb.com/docs/last/Import-a-tree-structure.html throws the same exception), so I just imported the vertices and wanted to create the edges manually.
I have a Vertex class (Address) with two properties (id, parentId), and I want to create the Edges between these Vertices (parentId -> id). Is there a simple way to do this instead of inserting the edges in a loop? Something like in SQL:
INSERT INTO E (out, in) VALUES (SELECT parentId, id FROM Address)
Since edges shall only be created with CREATE EDGE, I guess OrientDB does not support such an operation by default. But maybe there is a workaround to create these 3 million edges?
I found it is easy to create a link between the two records:
CREATE LINK parentLink TYPE LINK FROM Address.parentId TO Address.Id
However, I cannot create Edges in such a way. I tried working with variables:
CREATE EDGE isParentOf FROM (SELECT FROM Address) TO (SELECT FROM Address WHERE id = $current.parentId)
But that does not work.
Have you tried this ETL JSON?
{
  "config": { "log": "debug", "parallel": true },
  "extractor": {
    "jdbc": {
      "driver": "oracle.jdbc.driver.OracleDriver",
      "url": "jdbc:oracle:thin:hostname/db",
      "userName": "username",
      "userPassword": "password",
      "query": "select a.id, a.parentId from Address a where rownum<2"
    }
  },
  "transformers": [
    { "vertex": { "class": "Address" } },
    { "edge": {
        "class": "isParentOf",
        "joinFieldName": "parentId",
        "lookup": "Address.Id",
        "direction": "in",
        "skipDuplicates": true
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:server/db",
      "dbUser": "user",
      "dbPassword": "passwd!",
      "dbType": "graph",
      "classes": [
        { "name": "Address", "extends": "V" },
        { "name": "isParentOf", "extends": "E" }
      ],
      "indexes": [
        { "class": "Address", "fields": ["ID:string"], "type": "UNIQUE" }
      ]
    }
  }
}
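If this configuration works for you, it can be run with the standard launcher (the config filename here is hypothetical):
$ oetl.sh tree_edges.json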