How to scan Azure MariaDB in Microsoft Purview - azure-purview

Microsoft Purview supports resources of type Azure Database for MySQL.
I have an Azure Database for MariaDB (which is also a flavor of MySQL), but it seems that I can't register it as a source.
Is there a way to register and scan Azure Database for MariaDB resources?

Unfortunately, Azure Purview does not support Azure Database for MariaDB as a data source (as of August 2022). Hopefully it will be supported soon.
Until then, the following Python code, which scans a MariaDB server and returns a collection of Atlas entities for its databases, tables, and columns, could be useful.
from mysql.connector import errorcode
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient
from pyapacheatlas.core import AtlasEntity, AtlasProcess
from pyapacheatlas.core import AtlasAttributeDef, EntityTypeDef, RelationshipTypeDef
from pyapacheatlas.core.util import GuidTracker
import mysql.connector

def createMariaDbEntities(gt, dbConnParams, serverUri, serverName):
    mariaServerEntity = AtlasEntity(
        name=serverName,
        typeName="azure_mariadb_server",
        qualified_name=serverUri,
        guid=gt.get_guid()
    )
    entities = []
    entities.append(mariaServerEntity)
    try:
        conn = mysql.connector.connect(**dbConnParams)
        #print("Connection established")
    except mysql.connector.Error as err:
        if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
            print("Something is wrong with the user name or password")
        elif err.errno == errorcode.ER_BAD_DB_ERROR:
            print("Database does not exist")
        else:
            print(err)
    else:
        cursor = conn.cursor()
        # Find all databases (this filter keeps only databases whose names
        # start with a digit; adjust it for your naming convention)
        enumerateDatabasesQuery = "SHOW DATABASES"
        cursor.execute(enumerateDatabasesQuery)
        res = cursor.fetchall()
        databases = [r[0] for r in res if r[0][0].isnumeric()]
        for db in databases:
            # Find all the non-empty tables in the database; column 4 of
            # SHOW TABLE STATUS is the row count (NULL for views)
            readTablesQuery = "SHOW TABLE STATUS FROM `{0}`;".format(db)
            cursor.execute(readTablesQuery)
            rows = cursor.fetchall()
            tables = [row[0] for row in rows if row[4] is not None and int(row[4]) > 0]
            if len(tables) > 0:
                #print("database:", db)
                dbEntity = AtlasEntity(
                    name=db,
                    typeName="azure_mariadb_db",
                    qualified_name="{0}/{1}".format(serverUri, db),
                    guid=gt.get_guid(),
                    server=mariaServerEntity
                )
                entities.append(dbEntity)
                for table in tables:
                    #print("Table:", table)
                    tableEntity = AtlasEntity(
                        name=table,
                        typeName="azure_mariadb_table",
                        qualified_name="{0}/{1}/{2}".format(serverUri, db, table),
                        guid=gt.get_guid(),
                        db=dbEntity
                    )
                    entities.append(tableEntity)
                    # Find all the columns in the table
                    columnsQuery = "SHOW COLUMNS FROM `{0}`.`{1}`;".format(db, table)
                    cursor.execute(columnsQuery)
                    rows = cursor.fetchall()
                    columns = [(row[0], row[1]) for row in rows]
                    for column in columns:
                        #print("Column:", column)
                        columnEntity = AtlasEntity(
                            name=column[0],
                            attributes={
                                "dataType": column[1]
                            },
                            typeName="azure_mariadb_table_column",
                            qualified_name="{0}/{1}/{2}/{3}".format(serverUri, db, table, column[0]),
                            guid=gt.get_guid()
                        )
                        columnEntity.addRelationship(table=tableEntity)
                        entities.append(columnEntity)
        # Cleanup
        cursor.close()
        conn.close()
    return entities
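For completeness, here is a hedged sketch of how the function above might be invoked and the resulting entities pushed to Purview. Every host name, credential, and account name below is a placeholder, and the custom `azure_mariadb_*` types are assumed to already be defined in your Purview account:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient
from pyapacheatlas.core.util import GuidTracker
# createMariaDbEntities is the function defined above

# Placeholder connection details for an Azure Database for MariaDB server
dbConnParams = {
    "host": "myserver.mariadb.database.azure.com",
    "port": 3306,
    "user": "myadmin@myserver",
    "password": "<password>",
}
serverUri = "mariadb://myserver.mariadb.database.azure.com"

gt = GuidTracker()
entities = createMariaDbEntities(gt, dbConnParams, serverUri, "myserver")

# Upload all entities to Purview in one batch (service principal
# credentials and the account name are placeholders)
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>")
client = PurviewClient(account_name="<purview-account>", authentication=auth)
client.upload_entities(batch=entities)
```

Since this fragment only wires up placeholder credentials, treat it as a starting point rather than a working scan.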

Related

How to view child nodes of a parent from a pandas dataframe?

I was trying to extract data from MongoDB, so I was using pandas for the dataframe, with a Twitter dataset. The dataset was in JSON, and when I import it into the database it looks like this:
user:Object
id:1292598776
id_str:1292598776
name:ahmd
screen_name:sameh7753
location:
url:null
description:null
protected:false
followers_count:5
friends_count:76
listed_count:0
created_at:Sat Mar 23 21:59:37 +0000 2013
favourites_count:1
utc_offset:null
time_zone:null
geo_enabled:true
lang:ar
contributors_enabled:false
is_translator:false
profile_background_color:C0DEED
profile_use_background_image:true
default_profile:true
default_profile_image:false
follow_request_sent:null
So, here 'user' is the parent and under it there are many children. There are other fields in the dataset too.
I was trying to execute a query that finds any tweet tweeted in 2013 whose location is "US", and store the resulting cursor in a pandas data frame. When I print the data frame I expect to see the screen_name values, but they don't get printed and I can't access that data either.
Here is the code I was using:
import pandas as pd
from pymongo import MongoClient
import matplotlib.pyplot as plt
import re

pd.set_option('display.expand_frame_repr', False)

def _connect_mongo(host, port, db):
    conn = MongoClient(host, port)
    return conn[db]

def read_mongo(db, collection, host, port):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, db=db)
    cursor = db[collection].find({'created_at': {'$regex': '2013'}},
                                 {'place.country': 'US'}, no_cursor_timeout=True).toArray()
    print cursor
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    return df

db = 'twittersmall'  #'twitter'
collection = 'twitterdata'  #'twitterCol'
#query={'lang':'{$exists: true}'}
host = 'localhost'
port = 27017
var = read_mongo(db, collection, host, port)
print var
It only prints this under the user column in the pandas data frame:
False {u'follow_request_sent':
u'profile_use_b...
The rest of the attributes don't get printed, and I can't even access them with var['user.screen_name'] as written in the Python code.
How can I access the data?
First you have to include from pandas.io.json import json_normalize.
Now your read_mongo function should look like this:
def read_mongo(db, collection, host, port):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, db=db)
    cursor = db[collection].find({'created_at': {'$regex': '2013'}},
                                 no_cursor_timeout=True)
    cursor = list(cursor)
    df = json_normalize(cursor)
    return df
Here json_normalize flattens the fields that have children and makes them columns of the pandas dataframe.
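To see the flattening in isolation, here is a minimal standalone sketch using a document shaped like the tweet above (in recent pandas versions json_normalize is exposed at the top level as pd.json_normalize):

```python
import pandas as pd

# A document shaped like the tweets above: 'user' is a nested object
docs = [
    {"created_at": "Sat Mar 23 21:59:37 +0000 2013",
     "user": {"id": 1292598776, "screen_name": "sameh7753", "followers_count": 5}},
]

# json_normalize flattens nested objects into dotted column names
df = pd.json_normalize(docs)

print(df.columns.tolist())
# ['created_at', 'user.id', 'user.screen_name', 'user.followers_count']
print(df["user.screen_name"].iloc[0])
# sameh7753
```

The dotted column names are exactly why var['user.screen_name'] starts working once the cursor is normalized rather than passed straight to the DataFrame constructor.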

how to declare query to fetch data from mongodb?

import pandas as pd
from pymongo import MongoClient
import matplotlib.pyplot as plt

def _connect_mongo(host, port, db):
    """ A util for making a connection to mongo
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
    """
    conn = MongoClient(host, port)
    return conn[db]

def read_mongo(db, collection, host, port, query):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, db=db)
    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    '''
    Delete the _id
    if no_id:
        del df['_id']
    '''
    return df

# initialization
db = 'twittersmall'
collection = 'twitterdata'
query = '{lang:{$exists: true}}'
host = 'localhost'
port = 27017
var = read_mongo(db, collection, host, port, query)
print var

tweets_by_lang = var['lang'].value_counts()
fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
In this code, I am trying to fetch documents from MongoDB where the language field exists (it might be null), so in the query argument I pass the filter to be used in the fetch. The problem is: when I initialize query='{lang:{$exists: true}}', query is a string, but it must be a dictionary. When I declare query={lang:{$exists: true}}, I get a syntax error, obviously, because as far as I know a dictionary is declared as {'key':'value'}. And when I declare query={'lang':'{$exists: true}'} it doesn't work either, because the value is then just a literal string rather than the $exists operator, so nothing matches the lang field in the database.
So, how do I declare this query and pass it to the method?
PS: query={lang:{$exists: true}} works in the WebStorm terminal, but I am currently working in a Jupyter notebook (IPython) so that I can create graphs and charts from the MongoDB data; I also use pandas for the data frame.
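For the translation from shell syntax to PyMongo, the filter is an ordinary Python dict: keys become quoted strings and the shell's true becomes Python's True. A minimal sketch:

```python
# MongoDB shell syntax:   db.twitterdata.find({lang: {$exists: true}})
# PyMongo equivalent: quote the keys, capitalize the boolean
query = {'lang': {'$exists': True}}

# query is now a plain dict, so it can be passed straight to find(),
# e.g. var = read_mongo(db, collection, host, port, query)
print(type(query).__name__)
# dict
```

The shell accepts unquoted keys because it is JavaScript; Python requires string keys, which is why the bare {lang: ...} form is a syntax error in a notebook but works in the WebStorm (shell) terminal.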

How to add data in right collection in pymongo?

I want to add data to the right collection, chosen by name. The code below defines this: collections(db, name) returns the matching collection. But when I save the collection via rightCollection = collections(db, name) and then insert with db.rightCollection.insert({"1" : "Righ collection"}), PyMongo creates the collection under the name rightCollection, not Peter. I want to insert the data into Peter. Why is that, and how can I resolve it?
from pymongo import MongoClient

def collections(db, name):
    if name is 'Peter':
        return db.Peter

client = MongoClient()
db = client.myDB
name = "Peter"
rightCollection = collections(db, name)
db.rightCollection.insert({"1" : "Righ collection"})
Using PyMongo 3.2.2, you don't need the collections function; you can use the collection name directly:
from pymongo import MongoClient
client = MongoClient()
db = client.myDB
db.Peter.insert_one({'1': 'Right collection'})
That should insert the document {'1': 'Right collection'} into collection Peter under database myDB. To verify that the data was inserted correctly, you can use the mongo shell:
> use myDB
> db.Peter.find()
{ "_id": ObjectId("57df7a4f98e914c98d540992"), "1": "Right collection" }
Or, if you need the name Peter to be defined in a variable, you can do:
from pymongo import MongoClient
client = MongoClient()
db = client.myDB
coll_name = 'Peter'
db[coll_name].insert_one({'1': 'Right collection'})

Reading the PostgreSQL hstore type with F#

I'm trying to read a PostgreSQL database with F#. The table I read contains a column of hstore type, but I'm not able to access this column.
I tried two approaches. First, using SqlDataProvider:
type sql = SqlDataProvider<
            ConnectionString = connString,
            DatabaseVendor = Common.DatabaseProviderTypes.POSTGRESQL,
            ResolutionPath = resPath,
            IndividualsAmount = 1000,
            UseOptionTypes = true,
            Owner = "public">

let ctx = sql.GetDataContext()
for item in ctx.Public.Articles.Take(10) do printfn "%s" item.???
My item doesn't contain columns at all.
When I use the other approach:
type dbSchema =
    SqlEntityConnection<
        ConnectionString = connString,
        Provider = "Npgsql">

let ctx = dbSchema.GetDataContext()
for item in ctx.articles do printfn "%s" item.???
The item contains all columns except the column of hstore type.
In the SqlProvider source code I can see some mapping of hstore in the PostgreSQL provider, but I have no idea how to use it.

OrientDB - How do I insert a document with connections to multiple other documents?

Using OrientDB 1.7-rc and Scala, I would like to insert a document (ODocument) into a document (not graph) database, with connections to other documents. How should I do this?
I've tried the following, but it seems to insert an embedded list of documents into the Package document, rather than connect the package to a set of Version documents (which is what I want):
val doc = new ODocument("Package")
  .field("id", "MyPackage")
  .field("versions", List(new ODocument("Version").field("id", "MyVersion")))
EDIT:
I've tried inserting a Package with connections to Versions through SQL, and that seems to produce the desired result:
insert into Package(id, versions) values ('MyPackage', [#10:3, #10:4] )
However, I need to be able to do this from Scala, which has yet to produce the correct results when loading the ODocument back. How can I do it (from Scala)?
You need to create the individual documents first and then inter-link them using the SQL commands below.
Some examples from the OrientDB documentation:
insert into Profile (name, friends) values ('Luca', [#10:3, #10:4] )
OR
insert into Profile SET name = 'Luca', friends = [#10:3, #10:4]
Check here for more details.
I tried posting in the comments above, but somehow the code was not readable, so I'm posting the response separately.
Here is an example of linking two documents in OrientDB, taken from the documentation. Here we are adding a new user to the DB and connecting it to a given role:
var db = orient.getDatabase();
var role = db.query("select from ORole where name = ?", roleName);
if (role == null) {
    response.send(404, "Role not found", "text/plain", "Error: role name not found");
} else {
    db.begin();
    try {
        var result = db.save({ "#class": "OUser", name: "Gaurav", password: "gauravpwd", roles: role });
        db.commit();
        return result;
    } catch (err) {
        db.rollback();
        response.send(500, "Error: Server", "text/plain", err.toString());
    }
}
Hope it helps you and others.
This is how to insert a Package with a linkset referring to an arbitrary number of Versions:
val version = new ODocument("Version")
  .field("id", "1.0")
version.save()

val versions = new java.util.HashSet[ODocument]()
versions.add(version)

// "package" is a reserved word in Scala, so use another name (or backticks)
val pkg = new ODocument("Package")
  .field("id", "MyPackage")
  .field("versions", versions)
pkg.save()
When inserting a Java Set into an ODocument field, OrientDB understands this to mean one wants to insert a linkset, which is an unordered, unique, collection of references.
When reading the Package back out of the database, you should get hold of its Versions like this:
val versions = doc.field[java.util.HashSet[ODocument]]("versions").asScala.toSeq
Just as a HashSet was used when saving the linkset of versions, a HashSet should be used when loading the referenced ODocument instances.
Optionally, to enforce that Package.versions is in fact a linkset of Versions, you may encode this in the database schema (in SQL):
create property Package.versions linkset Version