I am learning Atlas and trying to find a way to import metadata from an RDBMS such as SQL Server or PostgreSQL.
Could somebody provide references or steps for doing this?
I am using Atlas in Docker with the built-in HBase and Solr. The intention is to import metadata from AWS RDS.
Update 1
To rephrase my question: can we import metadata directly from RDS SQL Server or PostgreSQL without importing the actual data into Hive (Hadoop)?
Any comments or answers are appreciated. Thank you!
AFAIK, Atlas works on the Hive metastore.
Below is the AWS documentation on how to do it in AWS EMR while creating the cluster itself: ... Metadata classification, lineage, and discovery using Apache Atlas on Amazon EMR
Here is a Cloudera source from the Sqoop standpoint: Populate metadata repository from RDBMS in Apache Atlas (a question from the Cloudera community).
1) Create the new types in Atlas. For example, in the case of Oracle: an Oracle table type, column type, etc.
2) Create a script or process that pulls the metadata from the source metadata store.
3) Once you have the metadata you want to store in Atlas, have your process create the associated Atlas entities, based on the new types, using the Java API or JSON representations through the REST API directly (see the sketch after this list). If you wanted to, you could also add lineage as you store the new entities.
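As a rough illustration of steps 2) and 3), here is a minimal Python sketch that reads table names from a PostgreSQL catalog and registers them through the Atlas v2 REST API. The host names, credentials, and the rdbms_table type name are assumptions for the example, not something prescribed by the answer.

import psycopg2
import requests

ATLAS = "http://localhost:21000"               # assumed Atlas endpoint
AUTH = ("admin", "admin")                      # assumed Atlas credentials

# Step 2: pull metadata from the source metadata store (PostgreSQL catalog here).
conn = psycopg2.connect(host="my-rds-host", dbname="sales", user="reader", password="secret")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_schema NOT IN ('pg_catalog', 'information_schema')"
    )
    tables = cur.fetchall()

# Step 3: create the associated Atlas entities, based on the custom type from step 1.
entities = {
    "entities": [
        {
            "typeName": "rdbms_table",         # custom type created in step 1 (assumption)
            "attributes": {
                "qualifiedName": f"{schema}.{table}@aws-rds-sales",
                "name": table,
            },
        }
        for schema, table in tables
    ]
}
resp = requests.post(f"{ATLAS}/api/atlas/v2/entity/bulk", json=entities, auth=AUTH)
resp.raise_for_status()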
The documentation below has step-by-step details on how to use Sqoop to move data from any RDBMS to Hive.
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-access/content/using_sqoop_to_move_...
You can refer to this as well: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal
To get the metadata of all this Sqoop-imported data into Atlas, make sure the configurations below are set properly:
http://atlas.incubator.apache.org/Bridge-Sqoop.html
Please note the above configuration step is not needed if your cluster configuration is managed by Ambari.
Using the REST API is a good way to publish MySQL metadata to the Atlas catalog.
Another way is to use Spark with Hive support (enableHiveSupport()): read from MySQL using JDBC and write into Hive, or use Sqoop. A PySpark sketch of the JDBC route follows.
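A minimal PySpark sketch of that JDBC-to-Hive route; the connection URL, table names, and credentials are placeholders, and it assumes Spark built with Hive support plus a MySQL JDBC driver on the classpath.

from pyspark.sql import SparkSession

# Spark session with Hive support so saveAsTable writes through the Hive metastore.
spark = (
    SparkSession.builder
    .appName("mysql-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a table from MySQL over JDBC (URL, table and credentials are placeholders).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/sales")
    .option("dbtable", "customers")
    .option("user", "etl_user")
    .option("password", "secret")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Write it as a Hive table; the Hive hook (if configured) then pushes metadata to Atlas.
df.write.mode("overwrite").saveAsTable("sales.customers")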
To help create RDBMS-related instances (DB, tables, columns), I have created a GitHub repository. It contains a template that can help you understand how to add RDBMS or MySQL entities to Atlas:
https://github.com/vettrivikas/Apche-Atlas-for-RDBMS
We can use the REST API to create a type and then send data to it. For example, let's say I have a dashboard and a visualization on it. I can create a type definition and then push data to it:
{
"entityDefs": [
{
"superTypes": [
"DataSet"
],
"name": "Dashboard",
"description": "The definition of a Dashboard",
"attributeDefs": [
{
"name": "name",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": -1,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"searchWeight": -1
},
{
"name": "childDataset",
"typeName": "array<Visualization>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"searchWeight": -1
}
]
},
{
"superTypes": [
"DataSet"
],
"name": "Visualization",
"description": "The definition of a Dashboard",
"attributeDefs": [
{
"name": "name",
"typeName": "string",
"isOptional": true,
"cardinality": "SINGLE",
"valuesMinCount": -1,
"valuesMaxCount": 1,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"searchWeight": -1
},
{
"name": "parentDataset",
"typeName": "array<Dashboard>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false,
"searchWeight": -1
}
]
}
],
"relationshipDefs": [
{
"category": "RELATIONSHIP",
"name": "dashboards_visualization_assignment",
"description": "The relationship between a Dashboard and a Visualization",
"relationshipCategory": "ASSOCIATION",
"attributeDefs": [],
"propagateTags": "NONE",
"endDef1": {
"type": "Dashboard",
"name": "childDataset",
"isContainer": false,
"cardinality": "SET",
"isLegacyAttribute": false
},
"endDef2": {
"type": "Visualization",
"name": "parentDataset",
"isContainer": false,
"cardinality": "SET",
"isLegacyAttribute": false
}
}
]
}
Then, you can simply add data using a REST call to {servername}:{port}/api/atlas/v2/entity/bulk:
{
"entities": [
{
"typeName": "Dashboard",
"guid": -1000,
"createdBy": "admin",
"attributes": {
"name": "sample dashboard",
"childDataset": [
{
"guid": "-200",
"typeName": "Visualization"
}
]
}
}
],
"referredEntities": {
"-200": {
"guid": "-200",
"typeName": "Visualization",
"attributes": {
"qualifiedName": "bar-chart"
}
}
}
}
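A hedged Python sketch of both calls, assuming a local Atlas instance with default admin credentials and that the two JSON payloads above are saved as typedefs.json and entities.json (the type definitions go to /api/atlas/v2/types/typedefs, the entities to /api/atlas/v2/entity/bulk):

import json
import requests

ATLAS = "http://localhost:21000"   # assumed Atlas host and port
AUTH = ("admin", "admin")          # assumed credentials

# 1) Register the Dashboard/Visualization type definitions (the first JSON above).
with open("typedefs.json") as f:
    typedefs = json.load(f)
r = requests.post(f"{ATLAS}/api/atlas/v2/types/typedefs", json=typedefs, auth=AUTH)
r.raise_for_status()

# 2) Create the entities through the bulk endpoint (the second JSON above).
with open("entities.json") as f:
    entities = json.load(f)
r = requests.post(f"{ATLAS}/api/atlas/v2/entity/bulk", json=entities, auth=AUTH)
r.raise_for_status()
print(r.json())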
Now, look for the entities in Atlas.
(Screenshot: Dashboard entity in Atlas)
I configured a Kafka JDBC Source connector in order to push to a Kafka topic the records changed (inserted or updated) in a PostgreSQL database.
I use "timestamp+incrementing" mode. It seems to work fine.
I didn't configure the JDBC Sink connector because I'm using a Kafka consumer that listens on the topic.
The message on the topic is JSON. This is an example:
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int64",
"optional": false,
"field": "id"
},
{
"type": "int64",
"optional": true,
"name": "org.apache.kafka.connect.data.Timestamp",
"version": 1,
"field": "entity_create_date"
},
{
"type": "int64",
"optional": true,
"name": "org.apache.kafka.connect.data.Timestamp",
"version": 1,
"field": "entity_modify_date"
},
{
"type": "int32",
"optional": true,
"field": "entity_version"
},
{
"type": "string",
"optional": true,
"field": "firstname"
},
{
"type": "string",
"optional": true,
"field": "lastname"
}
],
"optional": false,
"name": "author"
},
"payload": {
"id": 1,
"entity_create_date": 1600287236682,
"entity_modify_date": 1600287236682,
"entity_version": 1,
"firstname": "George",
"lastname": "Orwell"
}
}
As you can see, there is no information about whether this change was captured by the Source connector because of an insert or an update.
I need this information. How can I solve this?
You can't get that information using the JDBC Source connector, unless you do something bespoke in the source schema and triggers.
This is one of the reasons why log-based CDC is generally a better way to get events from the source database; other reasons include:
capturing deletes
capturing the type of operation
capturing all events, not just what's there at the time when the connector polls.
For more details on the nuances of this, see this blog or a talk based on the same.
Using a CDC-based approach, as suggested by @Robin Moffatt, may be the proper way to handle your requirement. Check out https://debezium.io/
However, looking at your table data, you could use "entity_create_date" and "entity_modify_date" in your consumer to determine whether the message is an insert or an update: if "entity_create_date" = "entity_modify_date" then it's an insert, else it's an update. A minimal consumer sketch follows.
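A minimal sketch of that check using kafka-python; the broker address and topic name are assumptions, and the JSON envelope is the one from the question.

import json
from kafka import KafkaConsumer

# Assumed broker address and topic name.
consumer = KafkaConsumer(
    "postgres.author",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    payload = message.value["payload"]
    # Heuristic from the answer: equal timestamps => insert, otherwise update.
    if payload["entity_create_date"] == payload["entity_modify_date"]:
        operation = "insert"
    else:
        operation = "update"
    print(operation, payload["id"])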
I have the following definition in my pipeline that I am running on Azure DevOps Server Version Dev17.M153.3
After saving the change, I can see that the following content has been added to the pipeline definition:
"approvals": [
{
"rank": 1,
"isAutomated": false,
"isNotificationOn": false,
"approver": {
"displayName": "Aouslender, Alexey",
"url": "http://tdc1tfsapp01:8080/tfs/DefaultCollection/_apis/Identities/2d86d86b-fe02-4e22-aa53-4315cdb3821c",
"_links": {
"avatar": {
"href": "http://tdc1tfsapp01:8080/tfs/DefaultCollection/_apis/GraphProfile/MemberAvatars/win.Uy0xLTUtMjEtMzMwNDk4NzQ2Ni0xODkxMDA3NDIzLTI5MjUxNTc3OTctNDU4NDA1"
}
},
"id": "2d86d86b-fe02-4e22-aa53-4315cdb3821c",
"uniqueName": "DOMAIN\\PXXXXXX",
"imageUrl": "http://tdc1tfsapp01:8080/tfs/DefaultCollection/_apis/GraphProfile/MemberAvatars/win.Uy0xLTUtMjEtMzMwNDk4NzQ2Ni0xODkxMDA3NDIzLTI5MjUxNTc3OTctNDU4NDA1",
"descriptor": "win.Uy0xLTUtMjEtMzMwNDk4NzQ2Ni0xODkxMDA3NDIzLTI5MjUxNTc3OTctNDU4NDA1"
},
"id": 3546
}
]
Now I am exporting the pipeline using the export option. Then I am deleting the pipeline and importing it using the exported JSON file.
The imported pipeline is missing the approvers definition, even though I can see the definition in the exported JSON:
"preDeployApprovals": {
"approvals": [
{
"rank": 1,
"isAutomated": false,
"isNotificationOn": false,
"approver": {
"displayName": "Aouslender, Alexey",
"url": "http://tdc1tfsapp01:8080/tfs/DefaultCollection/_apis/Identities/2d86d86b-fe02-4e22-aa53-4315cdb3821c",
"_links": {
"avatar": {
"href": "http://tdc1tfsapp01:8080/tfs/DefaultCollection/_apis/GraphProfile/MemberAvatars/win.Uy0xLTUtMjEtMzMwNDk4NzQ2Ni0xODkxMDA3NDIzLTI5MjUxNTc3OTctNDU4NDA1"
}
},
"id": "2d86d86b-fe02-4e22-aa53-4315cdb3821c",
"uniqueName": "DOMAIN\\PXXXXXX",
"imageUrl": "http://tdc1tfsapp01:8080/tfs/DefaultCollection/_apis/GraphProfile/MemberAvatars/win.Uy0xLTUtMjEtMzMwNDk4NzQ2Ni0xODkxMDA3NDIzLTI5MjUxNTc3OTctNDU4NDA1",
"descriptor": "win.Uy0xLTUtMjEtMzMwNDk4NzQ2Ni0xODkxMDA3NDIzLTI5MjUxNTc3OTctNDU4NDA1"
},
"id": 3535
}
],
"approvalOptions": {
"requiredApproverCount": null,
"releaseCreatorCanBeApprover": true,
"autoTriggeredAndPreviousEnvironmentApprovedCanBeSkipped": false,
"enforceIdentityRevalidation": false,
"timeoutInMinutes": 0,
"executionOrder": 1
}
}
Am I missing something here or is it actually a Microsoft bug?
It's by design. The following properties in the release definition are not imported: Agent Queues, Deployment Groups, Deployment Group Tags, Approvals, Variable Groups, and values of secret variables.
Generally, if you stay within the same team project, you can clone the release definition directly. Otherwise, you can re-apply the approvals from the exported JSON through the Release REST API after importing, as sketched below.
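The following Python sketch is only an illustration under assumptions (server URL, project, new definition id, credentials, and api-version are placeholders): it copies preDeployApprovals from the exported JSON into the freshly imported definition and updates it with a PUT to _apis/release/definitions.

import json
import requests

BASE = "http://tdc1tfsapp01:8080/tfs/DefaultCollection/MyProject"  # placeholder project URL
API = "api-version=5.0"
AUTH = ("", "personal-access-token")   # PAT auth (placeholder)

# The definition exported before deletion, containing the approvals.
with open("exported-definition.json") as f:
    exported = json.load(f)

# Fetch the freshly imported definition (placeholder id).
new_id = 42
definition = requests.get(
    f"{BASE}/_apis/release/definitions/{new_id}?{API}", auth=AUTH
).json()

# Copy the approval settings per environment by matching environment names.
exported_envs = {env["name"]: env for env in exported["environments"]}
for env in definition["environments"]:
    source = exported_envs.get(env["name"])
    if source:
        env["preDeployApprovals"] = source["preDeployApprovals"]

# Update the definition; the body must carry the current id and revision.
resp = requests.put(
    f"{BASE}/_apis/release/definitions?{API}", json=definition, auth=AUTH
)
resp.raise_for_status()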
I'm trying to replicate an Azure DevOps process from one organization to another via the AZDO REST API. I'm working on replicating the layout and am stuck because I can't discover the relationship between a custom field and a picklist when querying the source AZDO instance.
In my scenario I have a test work item type which I've called Issue. On the Issue interface I've created a custom field which is a picklist. While I can retrieve a list of lists via the Rest API and examine the field as well, I can't figure out how the two are related.
Here is a partial payload from the field:
{
"count": 39,
"value": [
...
{
"referenceName": "Custom.IssueSource",
"name": "Issue Source",
"type": "string",
"description": "Who is this attributed to",
"required": true,
"url": "https://dev.azure.com/MYORG/_apis/work/processes/f390103e-7097-4f19-b5b5-f9dbcf92bb6f/behaviors",
"customization": "custom"
},
... ]
}
and here is a partial payload from the lists GET query; I used trial and error to determine that this entry was the picklist I had assigned:
{
"count": 10,
"value": [
...
{
"id": "2998d4e4-2bec-4935-98a1-b67a0b0b6d5d",
"name": "picklist_e854661e-8620-4ad9-be28-b974c5cb3a5d",
"type": "String",
"isSuggested": false,
"url": "https://dev.azure.com/MYORG/_apis/work/processes/lists/2998d4e4-2bec-4935-98a1-b67a0b0b6d5d"
},
...
]
}
Here is a partial layout response for the WIT:
{
"pages": [
{
"id": "d0171d51-ff84-4038-afc1-8800ab613160.System.WorkItemType.Details",
"inherited": true,
"label": "Details",
"pageType": "custom",
"visible": true,
"isContribution": false,
"sections": [
{
"id": "Section1",
"groups": [
...
{
"id": "bf03e049-5062-4d82-b91d-4396541fbed2",
"label": "Custom",
"isContribution": false,
"visible": true,
"controls": [
{
"id": "Custom.IssueSource",
"label": "Issue Source",
"controlType": "FieldControl",
"readOnly": false,
"visible": true,
"isContribution": false
}
]
}
]
},
... ]
}
Using Fiddler against the AZDO web interface, the only time I see a reference to the picklist is in a call to another, non-AZDO API at https://dev.azure.com/MYORG/_apis/Contribution/dataProviders/query
Is there a way to discover the link via the AZDO REST API? I saw this question, which was similar but was about creating the link.
Figured it out. Turns out you need to query from a different scope - work item tracking rather than work item tracking process:
https://dev.azure.com/MYORG/_apis/wit/fields/Custom.IssueSource?api-version=5.0-preview.2
returns
{
"name": "Issue Source",
"referenceName": "Custom.IssueSource",
"description": "Who is this attributed to",
"type": "string",
"usage": "workItem",
"readOnly": false,
"canSortBy": true,
"isQueryable": true,
...
"isIdentity": false,
--> "isPicklist": true,
"isPicklistSuggested": false,
--> "picklistId": "2998d4e4-2bec-4935-98a1-b67a0b0b6d5d",
"url": "https://dev.azure.com/MYORG/_apis/wit/fields/Custom.IssueSource"
}
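Putting the two calls together, here is a small Python sketch (organization URL, PAT, and api-version values are placeholders) that resolves a field to its picklist and then fetches the picklist itself from the processes scope:

import requests

ORG = "https://dev.azure.com/MYORG"       # placeholder organization URL
AUTH = ("", "personal-access-token")      # PAT auth (placeholder)

# 1) The work item tracking scope exposes the picklist linkage on the field.
field = requests.get(
    f"{ORG}/_apis/wit/fields/Custom.IssueSource?api-version=5.0-preview.2",
    auth=AUTH,
).json()

if field.get("isPicklist"):
    # 2) The processes scope returns the picklist itself, including its items.
    picklist = requests.get(
        f"{ORG}/_apis/work/processes/lists/{field['picklistId']}?api-version=5.0-preview.1",
        auth=AUTH,
    ).json()
    print(picklist["name"], picklist.get("items"))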
I am using IBM Graph in Bluemix and am new to it.
I created a graph named 'test' using the GUI provided by Bluemix and uploaded the sample data 'Music Festival' provided by IBM into that graph.
Now I am trying to query all the vertices with the label 'attendee' using the query below:
def gt = graph.traversal();
gt.V().hasLabel("attendee");
But I am getting this error:
Error: Error encountered evaluating script def gt = graph.traversal();gt.V().hasLabel("attendee"); with reason com.thinkaurelius.titan.core.TitanException: Could not find a suitable index to answer graph query and graph scans are disabled: [(~label = attendee)]:VERTEX
Not sure what I am doing wrong.
Can somebody tell me where I am going wrong?
How can I get rid of this error and get the expected output?
Thanks
@Radhika, your Gremlin query is a valid Gremlin query. However, some vendors (such as IBM Graph and Titan) chose to only allow users to start their queries with a step that is indexed. This is to make sure you get good performance from your queries. Calling hasLabel() by itself will give you the "Could not find a suitable index..." error, as you can't create indexes for labels. What you need to do is follow this step with a step that uses an indexed property, as in this query:
def gt = graph.traversal();
gt.V().hasLabel("band").has("genre","pop");
An index for genre has been created in the schema for the sample music festival data, as you can see below:
{
"propertyKeys": [
{ "name": "name", "dataType": "String", "cardinality": "SINGLE" },
{ "name": "gender", "dataType": "String", "cardinality": "SINGLE" },
{ "name": "age", "dataType": "Integer", "cardinality": "SINGLE" },
{ "name": "genre", "dataType": "String", "cardinality": "SINGLE" },
{ "name": "monthly_listeners", "dataType": "String", "cardinality": "SINGLE" },
{ "name":"date","dataType":"String","cardinality":"SINGLE" },
{ "name":"time","dataType":"String","cardinality":"SINGLE" }
],
"vertexLabels": [
{ "name": "attendee" },
{ "name": "band" },
{ "name": "venue" }
],
"edgeLabels": [
{ "name": "bought_ticket", "multiplicity": "MULTI" },
{ "name":"advertised_to","multiplicity":"MULTI" },
{ "name":"performing_at","multiplicity":"MULTI" }
],
"vertexIndexes": [
{ "name": "vByName", "propertyKeys": ["name"], "composite": true, "unique": false },
{ "name": "vByGender", "propertyKeys": ["gender"], "composite": true, "unique": false },
{ "name": "vByGenre", "propertyKeys": ["genre"], "composite": true, "unique": false}
],
"edgeIndexes" :[
{ "name": "eByBoughtTicket", "propertyKeys": ["time"], "composite": true, "unique": false }
]
That's why the above query works and you need to do the same.
1) If you don't have a schema, create one. You can model it after the one above or follow the API doc.
2) Create a (vertex/label) index for the properties that you'll start your traversals from. In this example: name, gender and genre for the vertex properties and time for the edge properties.
3) Call the schema endpoint to add your schema to your graph (see the sketch after this list).
It's recommended to create your schema before adding any data to your graph so that you don't have to reindex later. That'll save you a lot of time.
Once you create your schema, you can't modify what you created already, but you can add new properties/indexes later on.
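For step 3), a rough Python sketch of posting the schema JSON shown earlier. The apiURL, graph name, and credentials are placeholders taken from your service credentials, and the exact authentication flow (basic auth vs. session token) should be checked against the IBM Graph API doc.

import json
import requests

# Placeholders from your IBM Graph service credentials; for a graph named 'test',
# the trailing '/g' in apiURL is typically replaced by the graph name.
API_URL = "https://ibmgraph-alpha.ng.bluemix.net/<service-id>/test"
USER = "<username-from-credentials>"
PASSWORD = "<password-from-credentials>"

# The schema JSON shown above, saved locally.
with open("music-festival-schema.json") as f:
    schema = json.load(f)

# POST the schema before loading any data so you don't have to reindex later.
resp = requests.post(f"{API_URL}/schema", json=schema, auth=(USER, PASSWORD))
resp.raise_for_status()
print(resp.json())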
Look at the following code samples for Java and Node.js for the exact code to use.
I hope that helps
I created a 2-node OrientDB cluster by following the steps below, but when running it distributed, the data present in only one of the nodes is accessible. Please can you help me debug this issue? The OrientDB version is 2.2.6.
Steps involved:
Utilized plocal mode in the ETL tool and stored part of the data in node 1 and the other part in node 2. The data stored actually belongs to just one vertex class. (On checking the data from the console, the data has been ingested properly.)
Then executed both nodes in distributed mode; the data from only one machine is accessible.
The default-distributed-db-config.json file is specified below:
{
"autoDeploy": true,
"readQuorum": 1,
"writeQuorum": 1,
"executionMode": "undefined",
"readYourWrites": true,
"servers": {
"*": "master"
},
"clusters": {
"internal": {
},
"address": {
"servers" : [ "orientmaster" ]
},
"address_1": {
"servers" : [ "orientslave1" ]
},
"*": {
"servers": ["<NEW_NODE>"]
}
}
}
There are two clusters created for the vertex class named address, namely address and address_1. The data in machine orientslave1 is stored using the ETL tool into cluster address_1; similarly, the data in machine orientmaster is stored into cluster address. (I've ensured that both of these cluster ids are different at the time of creation.)
However, when these two machines are connected together in distributed mode, only the data in cluster address_1 is visible.
The ETL JSON is attached below:
{
"source": { "file": { "path": "/home/ubuntu/labvolume1/DataStorage/geo1_5lacs.csv" } },
"extractor": { "csv": {"columnsOnFirstLine": false, "columns":["place:string"] } },
"transformers": [
{ "vertex": { "class": "ADDRESS", "skipDuplicates":true } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/ETL_Test1",
"dbType": "graph",
"dbUser": "admin",
"dbPassword": "admin",
"dbAutoCreate": true,
"wal": false,
"tx":false,
"classes": [
{"name": "ADDRESS", "extends": "V", "clusters":1}
], "indexes": [
{"class":"ADDRESS", "fields":["place:string"], "type":"UNIQUE" }
]
}
}
}
Please let me know if there is anything I'm doing wrong.