Use ETL to load CSV data into OrientDB containing a SPATIAL index - orientdb

I'm interested in loading some data into an OrientDB from some CSV files that contain spatial coordinates in WGS84 Lat/Long.
I'm using OrientDB 2.2.8 and have the lucene spatial module added to my $ORIENTDB_HOME/lib directory.
I'm loading my data into a database using ETL and would like to add the spatial index but I'm not sure how to do this.
Say my CSV file has the following columns:
Label (string)
Latitude (float)
Longitude (float)
I've tried this in my ETL:
"loader": {
"orientdb": {
"dbURL": "plocal:myDatabase.orientdb",
"dbType": "graph",
"batchCommit": 1000,
"classes": [ { "name": "vertex", "extends", "V" } ],
"indexes": [ { "class": "vertex", "fields":["Label:string"], "type":"UNIQUE" },
{ "class": "Label", "fields":["Latitude:float","Longitude:float"], "type":"SPATIAL" }
]
}
}
but it's not working. I get the following error message:
ETL process has problem: com.orientechnologies.orient.core.index.OIndexException: Index with type SPATIAL and algorithm null does not exist.
Has anyone looked into creating spatial indices via ETL? Most of what I'm seeing on this uses either Java or direct queries.
Thanks in advance for any advice.

I was able to get it to load using the legacy spatial capabilities.
I put together a cheezy dataset that has some coordinates for a few of the Nazca line geoglyphs:
Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.7067274,-75.1475391
Condor,-14.6983457,-75.1283374
Spider,-14.694363,-75.1235815
Spiral,-14.688309,-75.122757
Hands,-14.694459,-75.113881
Tree,-14.693897,-75.114467
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788
I used a script to create my GeoGlyph class, createVertexGeoGlyph.osql:
set echo true
connect PLOCAL:./nazca.orientdb admin admin
CREATE CLASS GeoGlyph EXTENDS V CLUSTERS 1
CREATE PROPERTY GeoGlyph.Name STRING
CREATE PROPERTY GeoGlyph.Latitude FLOAT
CREATE PROPERTY GeoGlyph.Longitude FLOAT
CREATE PROPERTY GeoGlyph.Tag EMBEDDEDSET STRING
CREATE INDEX GeoGlyph.index.Location ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE
which I load into my database using
$ console.sh createVertexGeoGlyph.osql
I do it this way because it seems to work more consistently for me. I've had some difficulty getting the ETL engine to create defined properties from CSV imports; sometimes it cooperates and creates my properties, and other times it has trouble.
So, the next step to get the data in is to create my .json files for the ETL process. I like to make two: one that is file-specific and another that is common, since I often have datasets that span multiple files.
First, I have my nazca_lines.json file:
{
"config": {
"log": "info",
"fileDirectory": "./",
"fileName": "nazca_lines.csv"
}
}
Next is the commonGeoGlyph.json file:
{
"begin": [
{ "let": { "name": "$filePath", "expression": "$fileDirectory.append($fileName )" } },
],
"config": { "log": "debug" },
"source": { "file": { "path": "$filePath" } },
"extractor":
{
"csv": { "ignoreEmptyLines": true,
"nullValue": "N/A",
"separator": ",",
"columnsOnFirstLine": true,
"dateFormat": "yyyy-MM-dd"
}
},
"transformers": [
{ "vertex": { "class": "GeoGlyph" } },
{ "code": { "language":"Javascript",
"code": "print('>>> Current record: ' + record); record;" }
}
],
"loader": {
"orientdb": {
"dbURL": "plocal:nazca.orientdb",
"dbType": "graph",
"batchCommit": 1000,
"classes": [],
"indexes": []
}
}
}
There's more in this file than is strictly necessary; I use it as a template for a lot of imports. In this case, I don't have to create the index in the ETL file itself because I already created it in the createVertexGeoGlyph.osql script.
To load the data I just use the oetl.sh script:
$ oetl.sh commonGeoGlyph.json nazca_lines.json
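Once it's loaded, a quick sanity check is a NEAR query against the legacy spatial index from the console. This is only a sketch of how I understand the 2.2 legacy Lucene spatial syntax (maxDistance is in kilometers), using the Hummingbird's coordinates from the CSV above:
SELECT Name, $distance FROM GeoGlyph WHERE [Latitude,Longitude,$spatial] NEAR [-14.692131,-75.148892,{"maxDistance":5}]
If the index is being used, that should come back with the nearby glyphs and a populated $distance.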
This is what's working for me... I'm sure there are better ways to do it, but this works. I'm posting this here to tie off the question. Hopefully someone will find this to be useful.

Related

Can not create new layer (featuretype) in GeoServer using REST API

So I just spent 2 working days trying to figure this out. We are automating a rendering process for maps. All the data is stored in an SQL database, and my job is to write a "wrapper" so we can implement this in our in-house framework. I managed all but one of the needed requests.
That request is POST featuretype, since this is the way to create a layer that can later be rendered.
I have all the requests saved in Postman for pre-testing against the example data provided by GeoServer itself. I can't even get a response with status code 201 and always get a 500 Internal Server Error. This status is described as a possible syntax error, but I actually just copied and pasted the example and used GeoServer-provided data.
This is the request: http://127.0.0.1:8080/geoserver/rest/workspaces/tiger/datastores/nyc/featuretypes
and its body:
{
"name": "poi",
"nativeName": "poi",
"namespace": {
"name": "tiger",
"href": "http://localhost:8080/geoserver/rest/namespaces/tiger.json"
},
"title": "Manhattan (NY) points of interest",
"abstract": "Points of interest in New York, New York (on Manhattan). One of the attributes contains the name of a file with a picture of the point of interest.",
"keywords": {
"string": [
"poi",
"Manhattan",
"DS_poi",
"points_of_interest",
"sampleKeyword\\#language=ab\\;",
"area of effect\\#language=bg\\;\\#vocabulary=technical\\;",
"Привет\\#language=ru\\;\\#vocabulary=friendly\\;"
]
},
"metadataLinks": {
"metadataLink": [
{
"type": "text/plain",
"metadataType": "FGDC",
"content": "www.google.com"
}
]
},
"dataLinks": {
"org.geoserver.catalog.impl.DataLinkInfoImpl": [
{
"type": "text/plain",
"content": "http://www.google.com"
}
]
},
"nativeCRS": "GEOGCS[\"WGS 84\", \n DATUM[\"World Geodetic System 1984\", \n SPHEROID[\"WGS 84\", 6378137.0, 298.257223563, AUTHORITY[\"EPSG\",\"7030\"]], \n AUTHORITY[\"EPSG\",\"6326\"]], \n PRIMEM[\"Greenwich\", 0.0, AUTHORITY[\"EPSG\",\"8901\"]], \n UNIT[\"degree\", 0.017453292519943295], \n AXIS[\"Geodetic longitude\", EAST], \n AXIS[\"Geodetic latitude\", NORTH], \n AUTHORITY[\"EPSG\",\"4326\"]]",
"srs": "EPSG:4326",
"nativeBoundingBox": {
"minx": -74.0118315772888,
"maxx": -74.00153046439813,
"miny": 40.70754683896324,
"maxy": 40.719885123828675,
"crs": "EPSG:4326"
},
"latLonBoundingBox": {
"minx": -74.0118315772888,
"maxx": -74.00857344353275,
"miny": 40.70754683896324,
"maxy": 40.711945649065406,
"crs": "EPSG:4326"
},
"projectionPolicy": "REPROJECT_TO_DECLARED",
"enabled": true,
"metadata": {
"entry": [
{
"#key": "kml.regionateStrategy",
"$": "external-sorting"
},
{
"#key": "kml.regionateFeatureLimit",
"$": "15"
},
{
"#key": "cacheAgeMax",
"$": "3000"
},
{
"#key": "cachingEnabled",
"$": "true"
},
{
"#key": "kml.regionateAttribute",
"$": "NAME"
},
{
"#key": "indexingEnabled",
"$": "false"
},
{
"#key": "dirName",
"$": "DS_poi_poi"
}
]
},
"store": {
"#class": "dataStore",
"name": "tiger:nyc",
"href": "http://localhost:8080/geoserver/rest/workspaces/tiger/datastores/nyc.json"
},
"cqlFilter": "INCLUDE",
"maxFeatures": 100,
"numDecimals": 6,
"responseSRS": {
"string": [
4326
]
},
"overridingServiceSRS": true,
"skipNumberMatched": true,
"circularArcPresent": true,
"linearizationTolerance": 10,
"attributes": {
"attribute": [
{
"name": "the_geom",
"minOccurs": 0,
"maxOccurs": 1,
"nillable": true,
"binding": "com.vividsolutions.jts.geom.Point"
},
{},
{},
{}
]
}
}
So this is the example case, and I can't get any useful response from the server. I get code 500 with the body "name" (the first item in the JSON). Similarly, I get the same code with the body "FeatureTypeInfo" when trying with an XML body (the first tag).
I already tried the request against a new instance of GeoServer in Docker (with the port changed) and still had no success.
I checked that the datastore and workspace are available and that the layer "poi" doesn't exist yet.
Here are also some logs of the request (similar for the XML body):
2018-08-03 07:35:02,198 ERROR [geoserver.rest] -
com.thoughtworks.xstream.mapper.CannotResolveClassException: name at
com.thoughtworks.xstream.mapper.DefaultMapper.realClass(DefaultMapper.java:79)
at .....
Does anyone know the solution to this and has gotten it working? I am using GeoServer 2.13.1.
So I kept looking for the answer and, using this post (https://gis.stackexchange.com/questions/12970/create-a-layer-in-geoserver-using-rest), got to the right content to POST a featureType and hence create a layer in GeoServer.
The REST API docs are off here.
Using the above link I found out that, when using JSON, there is a missing wrapper object. For the API to work we need to add:
{
"featureType": {
"name": "...",
"nativeName": "...",
...
}
}
so that the body doesn't start with the "name" attribute, but instead everything is contained in "featureType".
I didn't try this with XML, but I guess it would be similar.
Hope this helps someone out there struggling like I did.
Blaz is correct here: you need an outer "featureType" object and then an inner object with your config. So:
{
"featureType": {
"name": "layer",
"nativeName": "poi",
"your config": "stuff"
}
}
I find, though, that the POST request returns very little (if any) response, and it's not obvious whether the layer creation worked. But you can call http://IP:8080/geoserver/rest/layers.json to check whether your new layer is there.
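For completeness, this is roughly what the call looks like with curl, assuming the default admin:geoserver credentials and the wrapped JSON body saved as poi_featuretype.json (a sketch, not something I have verified against 2.13.1):
# create the feature type / layer
curl -v -u admin:geoserver -X POST -H "Content-Type: application/json" -d @poi_featuretype.json "http://localhost:8080/geoserver/rest/workspaces/tiger/datastores/nyc/featuretypes"
# then list the layers to confirm it exists
curl -u admin:geoserver "http://localhost:8080/geoserver/rest/layers.json"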
It cost me a lot of time to create FeatureTypes using the REST API. Using JSON like this really works:
{
"featureType": {
"name": "layer",
"nativeName": "poi",
"otherProperties...": "values..."
}
}
And use the JSON below to create a workspace:
{
"workspace": {
"name": "test_workspace"
}
}
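The same curl pattern works for the workspace, assuming default credentials (again just a sketch):
curl -u admin:geoserver -X POST -H "Content-Type: application/json" -d '{ "workspace": { "name": "test_workspace" } }' "http://localhost:8080/geoserver/rest/workspaces"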
The REST API docs are out of date now. That's disappointing. Does anyone know how to get the latest REST API documentation?

Using ADF Copy Activity with dynamic schema mapping

I'm trying to drive the columnMapping property from a database configuration table. My first activity in the pipeline pulls in the rows from the config table. My copy activity source is a Json file in Azure blob storage and my sink is an Azure SQL database.
In copy activity I'm setting the mapping using the dynamic content window. The code looks like this:
"translator": {
"value": "#json(activity('Lookup1').output.value[0].ColumnMapping)",
"type": "Expression"
}
My question is, what should the value of activity('Lookup1').output.value[0].ColumnMapping look like?
I've tried several different json formats but the copy activity always seems to ignore it.
For example, I've tried:
{
"type": "TabularTranslator",
"columnMappings": {
"view.url": "url"
}
}
and:
"columnMappings": {
"view.url": "url"
}
and:
{
"view.url": "url"
}
In this example, view.url is the name of the column in the JSON source, and url is the name of the column in my destination table in Azure SQL database.
The issue is due to the dot (.) in your column name.
To use column mapping, you should also specify the structure in your source and sink datasets.
For your source dataset, you need to specify the format correctly, and since your column name contains a dot, you need to specify the JSON path for it (see the sketch below).
You could use the ADF UI to set up a copy for a single file first, to get the related format, structure and column mapping format, and then move that into the lookup.
As I understand it, your first format should be the right one. If the value is already in JSON format, then you may not need the "json" function in your expression.
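For example, a source dataset along these lines flattens the nested view.url node into a plain column that the mapping can reference. This is only a sketch from memory of the legacy JsonFormat settings; the dataset, linked service and folder names are placeholders:
{
    "name": "SourceJsonDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": { "referenceName": "MyBlobLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "folderPath": "mycontainer/input",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects",
                "jsonPathDefinition": { "view_url": "$.view.url" }
            }
        },
        "structure": [
            { "name": "view_url", "type": "String" }
        ]
    }
}
With the nested path flattened to view_url, the column mapping no longer needs a dotted source name.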
There seems to be a disconnect between the question and the answer, so I'll hopefully provide a more straightforward answer.
When setting this up, you should have a source dataset with dynamic mapping. The sink doesn't require one, as we're going to specify it in the mapping.
Within the copy activity, format the dynamic json like the following:
{
"structure": [
{
"name": "Address Number"
},
{
"name": "Payment ID"
},
{
"name": "Document Number"
},
...
...
]
}
You would then specify your dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Address Number",
"type": "Int32"
},
"sink": {
"name": "address_number"
}
},
{
"source": {
"name": "Payment ID",
"type": "Int64"
},
"sink": {
"name": "payment_id"
}
},
{
"source": {
"name": "Document Number",
"type": "Int32"
},
"sink": {
"name": "document_number"
}
},
...
...
]
}
}
Assuming these were set in separate variables, you would want to send the source as a string, and the mapping as json:
source: #string(json(variables('str_dyn_structure')).structure)
mapping: #json(variables('str_dyn_translator')).translator
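To tie this back to the original question: the ColumnMapping value returned by the lookup would then hold that same translator object as text. A sketch using the question's column names (whether the dotted view.url source name resolves directly still depends on the source dataset's format settings, as the earlier answer notes):
{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "view.url", "type": "String" },
            "sink": { "name": "url" }
        }
    ]
}
The copy activity's dynamic content can then stay as in the question, i.e. #json(activity('Lookup1').output.value[0].ColumnMapping).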
VladDrak - You could skip the source dynamic definition by building dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"type": "String",
"ordinal": "1"
},
"sink": {
"name": "dateOfActivity",
"type": "String"
}
},
{
"source": {
"type": "String",
"ordinal": "2"
},
"sink": {
"name": "CampaignID",
"type": "String"
}
}
]
}
}

Azure Data Factory: Parameterized the Folder and file path

Environments
Azure Data Factory
Scenario
I have an ADF pipeline which reads data from an on-premises server and writes it to Azure Data Lake.
For this, I have provided the folder structure in the ADF (dataset) as follows:
Folder Path: DBName/RawTables/Transactional
File Path: TableName.csv
Problem
Is it possible to parameterize the folder name or file path? Basically, if tomorrow I want to change the folder path (without a deployment), we should only need to update the metadata or table structure.
So the short answer here is no. You won't be able to achieve this level of dynamic flexibility with ADF on its own.
You'll need to add new defined datasets to your pipeline as inputs for folder changes. In Data Lake you could probably get away with a single stored procedure that accepts a parameter for the file path which could be reused. But this would still require tweaks to the ADF JSON when calling the proc.
Of course, the catch-all option here is to use an ADF custom activity and write a C# class with methods to do whatever you need. That's maybe overkill, though, and a lot of effort to set up the authentication to Data Lake Store.
Hope this gives you a steer.
Mangesh, why don't you try the .NET custom activity in ADF? This custom activity would be your first activity: it would check for the processed folder and, if the processed folder is present, move it to a History folder (say). Since ADF is a platform for data movement and data transformation, it doesn't deal with I/O activity itself. You can learn more about the .NET custom activity at:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities
What you want to do is possible with the new Lookup activity in Azure Data Factory V2. Documentation is here: Lookup Activity.
A JSON example would be something like this:
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "LookupActivity",
"type": "Lookup",
"typeProperties": {
"dataset": {
"referenceName": "LookupDataset",
"type": "DatasetReference"
}
}
},
{
"name": "CopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from #{activity('LookupActivity').output.tableName}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupActivity",
"dependencyConditions": [ "Succeeded" ]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
}
]
}
}
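Building on that, ADF V2 also lets the dataset itself take parameters, so the folder path and file name can come from the lookup instead of being hard-coded. A rough sketch; the dataset type, parameter names, and the FolderPath/FileName columns returned by the lookup are assumptions of mine:
{
    "name": "SinkDataset",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": { "referenceName": "AzureDataLakeStoreLS", "type": "LinkedServiceReference" },
        "parameters": {
            "folderPath": { "type": "String" },
            "fileName": { "type": "String" }
        },
        "typeProperties": {
            "folderPath": { "value": "@dataset().folderPath", "type": "Expression" },
            "fileName": { "value": "@dataset().fileName", "type": "Expression" }
        }
    }
}
The copy activity would then pass the values through in its dataset reference:
"outputs": [
    {
        "referenceName": "SinkDataset",
        "type": "DatasetReference",
        "parameters": {
            "folderPath": { "value": "@activity('LookupActivity').output.firstRow.FolderPath", "type": "Expression" },
            "fileName": { "value": "@activity('LookupActivity').output.firstRow.FileName", "type": "Expression" }
        }
    }
]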

Get all vertices having a labelname

I am using IBM Graph in Bluemix and am new to this.
I created a graph named 'test' using the GUI provided by Bluemix and uploaded the sample 'Music Festival' data provided by IBM into that graph.
Now I am trying to query all the vertices having the label 'attendee' using the query below.
def gt = graph.traversal();
gt.V().hasLabel("attendee");
But I am getting this error:
Error: Error encountered evaluating script def gt = graph.traversal();gt.V().hasLabel("attendee"); with reason com.thinkaurelius.titan.core.TitanException: Could not find a suitable index to answer graph query and graph scans are disabled: [(~label = attendee)]:VERTEX
Not sure what I am doing wrong.
Can somebody tell me where I am going wrong?
How can I get rid of this error and get the expected output?
Thanks
@Radhika, your Gremlin query is a valid Gremlin query. However, some vendors (such as IBM Graph and Titan) chose to only allow users to start their traversals with a step that is indexed. This is to make sure your queries perform well. Calling hasLabel() by itself gives you the "Could not find a suitable index..." error, as you can't create indexes for labels. What you need to do is follow this step with a step that uses an indexed property, as in this query:
def gt = graph.traversal(); gt.V().hasLabel("band").has("genre","pop");
An index for genre has been created in the schema for the sample music festival data as you can see below
{
"propertyKeys": [
{ "name": "name", "dataType": "String", "cardinality": "SINGLE" },
{ "name": "gender", "dataType": "String", "cardinality": "SINGLE" },
{ "name": "age", "dataType": "Integer", "cardinality": "SINGLE" },
{ "name": "genre", "dataType": "String", "cardinality": "SINGLE" },
{ "name": "monthly_listeners", "dataType": "String", "cardinality": "SINGLE" },
{ "name":"date","dataType":"String","cardinality":"SINGLE" },
{ "name":"time","dataType":"String","cardinality":"SINGLE" }
],
"vertexLabels": [
{ "name": "attendee" },
{ "name": "band" },
{ "name": "venue" }
],
"edgeLabels": [
{ "name": "bought_ticket", "multiplicity": "MULTI" },
{ "name":"advertised_to","multiplicity":"MULTI" },
{ "name":"performing_at","multiplicity":"MULTI" }
],
"vertexIndexes": [
{ "name": "vByName", "propertyKeys": ["name"], "composite": true, "unique": false },
{ "name": "vByGender", "propertyKeys": ["gender"], "composite": true, "unique": false },
{ "name": "vByGenre", "propertyKeys": ["genre"], "composite": true, "unique": false}
],
"edgeIndexes" :[
{ "name": "eByBoughtTicket", "propertyKeys": ["time"], "composite": true, "unique": false }
]
That's why the above query works and you need to do the same.
1. If you don't have a schema, create one. You can model it after the one above or follow the API doc.
2. Create a vertex/label index for the properties that you'll start your traversals from. In this example that means name, gender and genre for the vertex properties, and time for the edge index.
3. Call the schema endpoint to add your schema to your graph (see the sketch after this list).
4. It's recommended to create your schema before adding any data to your graph so that you don't have to reindex later. That'll save you a lot of time.
5. Once you create your schema, you can't modify what you already created, but you can add new properties/indexes later on.
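For reference, posting the schema is roughly a single HTTP call. This is a sketch only: the API URL and the Authorization header value come from your service credentials and the session mechanism described in the API doc.
# API_URL and AUTH are taken from your IBM Graph service credentials / session, per the API doc
curl -X POST "$API_URL/schema" -H "Content-Type: application/json" -H "Authorization: $AUTH" -d @schema.json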
Look at the Java and Node.js code samples for the exact code to use.
I hope that helps

Only data from node 1 visible in a 2 node OrientDB cluster

I created a 2-node OrientDB cluster by following the steps below. But when running distributed, the data present in only one of the nodes is accessible. Please can you help me debug this issue? The OrientDB version is 2.2.6.
Steps involved :
Utilized plocal mode in the ETL tool and stored part of the data in node 1 and the other part in node 2. The data stored actually belongs to just one vertex class. (On checking the data from the console, the data has been ingested properly.)
Then executed both nodes in distributed mode; data from only one machine is accessible.
The default-distributed-db-config.json file is specified below :
{
"autoDeploy": true,
"readQuorum": 1,
"writeQuorum": 1,
"executionMode": "undefined",
"readYourWrites": true,
"servers": {
"*": "master"
},
"clusters": {
"internal": {
},
"address": {
"servers" : [ "orientmaster" ]
},
"address_1": {
"servers" : [ "orientslave1" ]
},
"*": {
"servers": ["<NEW_NODE>"]
}
}
}
There are two clusters created for the ADDRESS vertex class, namely address and address_1. The data on machine orientslave1 is stored using the ETL tool into cluster address_1; similarly, the data on machine orientmaster is stored into cluster address. (I've ensured that both of these cluster IDs are different at the time of creation.)
However, when these two machines are connected together in distributed mode, only the data in cluster address_1 is visible.
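For reference, the per-cluster record counts can be checked from the console on each node like this (using the class and cluster names above):
SELECT count(*) FROM ADDRESS
SELECT count(*) FROM cluster:address
SELECT count(*) FROM cluster:address_1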
The ETL JSON is attached below:
{
"source": { "file": { "path": "/home/ubuntu/labvolume1/DataStorage/geo1_5lacs.csv" } },
"extractor": { "csv": {"columnsOnFirstLine": false, "columns":["place:string"] } },
"transformers": [
{ "vertex": { "class": "ADDRESS", "skipDuplicates":true } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/ETL_Test1",
"dbType": "graph",
"dbUser": "admin",
"dbPassword": "admin",
"dbAutoCreate": true,
"wal": false,
"tx":false,
"classes": [
{"name": "ADDRESS", "extends": "V", "clusters":1}
], "indexes": [
{"class":"ADDRESS", "fields":["place:string"], "type":"UNIQUE" }
]
}
}
}
Please let me know if there is anything I'm doing wrong.