How to move data to a location after doing a Databricks merge - PySpark

I would like to know how to move the results of a merge in Databricks to a location such as an Azure SQL Database.
The following is a typical Databricks merge sample from:
https://learn.microsoft.com/en-us/azure/databricks/delta/merge
I would like to know how to send the results of the following Python merge to an Azure SQL Database:
from delta.tables import *

deltaTablePeople = DeltaTable.forPath(spark, '/tmp/delta/people-10m')
deltaTablePeopleUpdates = DeltaTable.forPath(spark, '/tmp/delta/people-10m-updates')

dfUpdates = deltaTablePeopleUpdates.toDF()

deltaTablePeople.alias('people') \
  .merge(
    dfUpdates.alias('updates'),
    'people.id = updates.id'
  ) \
  .whenMatchedUpdate(set =
    {
      "id": "updates.id",
      "firstName": "updates.firstName",
      "middleName": "updates.middleName",
      "lastName": "updates.lastName",
      "gender": "updates.gender",
      "birthDate": "updates.birthDate",
      "ssn": "updates.ssn",
      "salary": "updates.salary"
    }
  ) \
  .whenNotMatchedInsert(values =
    {
      "id": "updates.id",
      "firstName": "updates.firstName",
      "middleName": "updates.middleName",
      "lastName": "updates.lastName",
      "gender": "updates.gender",
      "birthDate": "updates.birthDate",
      "ssn": "updates.ssn",
      "salary": "updates.salary"
    }
  ) \
  .execute()

Firstly, you need to create a mount point in Databricks.
Please refer to this link for creating a mount point: https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
Once the merge operation completes, write the resulting DataFrame to ADLS.
Follow this link: https://docs.delta.io/0.2.0/delta-batch.html
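To land the merged data in an Azure SQL Database specifically, one option is to read the Delta table back after the merge and push it out with Spark's built-in JDBC writer. A minimal sketch; the server, database, table and credentials below are hypothetical placeholders (ideally pull the credentials from a secret scope):
# Read back the Delta table that the merge just updated
merged_df = spark.read.format("delta").load("/tmp/delta/people-10m")

# Hypothetical Azure SQL Database connection details - replace with your own
jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=mydatabase;encrypt=true;loginTimeout=30;")

# Write the merged result to an Azure SQL table over JDBC
merged_df.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "dbo.People") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .mode("append") \
    .save()
Use mode("overwrite") instead of append if the target table should be replaced on each run.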

Related

Transforming and Reading json files in Azure Synapse notebook

I have the below JSON in one of my storage accounts and I am able to read it with the code below. I need help reading only the records where "pod" is "kube-apiserver-78" or "kube-apiserver-79" and the username is "system:serviceaccount:xyz" or "system:serviceaccount:poq". Can someone help me translate that into the code below?
df = spark.read.json('abfss://insights-logs-kube-audit@azogs.dfs.core.windows.net/resourceId=/SUBSCRIPTIONS/5IS/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV/y=2022/m=08/d=09/h=11/m=00/')
df.show()
Sample Json file in Storage container Which I read:
{ "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read", "category": "kube-audit", "ccpNamespace": "5f", "resourceId": "/SUBSCRIPTIONS/SID/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV", "properties": {"log":"{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"auditID\":\"b7b1ca3\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/chaos-mesh.org/v1alpha1/namespaces/ve/httpchaos?limit=500\",\"verb\":\"list\",\"user\":{\"username\":\"system:serviceaccount:xyz\",\"uid\":\"3eb35e\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:internal-services\",\"system:authenticated\"]},\"sourceIPs\":[\"100.100.100.100\"],\"userAgent\":\"ktl/v1.18.10 (linux/amd64) kubernetes/62c\",\"objectRef\":{\"resource\":\"httpchaos\",\"namespace\":\"vo\",\"apiGroup\":\"chaos-mesh.org\",\"apiVersion\":\"v1alpha1\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:45:13.140759Z\",\"stageTimestamp\":\"2022-05-23T13:45:13.146101Z\",\"annotations\":{\"authentication.k8s.io/legacy-token\":\"system:serviceaccount:ixyzr\",\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"admin\\\" of ClusterRole \\\"cluster-admin\\\" to ServiceAccount \\\"abc/xyz\\\"\"}}\n","stream":"stdout","pod":"kube-apiserver-78"}, "time": "2022-05-23T13:45:13.0000000Z", "Cloud": "AzureCloud", "Environment": "prod", "UnderlayClass": "hcp-underlay", "UnderlayName": "h-24"}
{ "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read", "category": "kube-audit", "ccpNamespace": "5f", "resourceId": "/SUBSCRIPTIONS/SID/RESOURCEGROUPS/AZURE-DEV/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/AZURE-DEV", "properties": {"log":"{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"auditID\":\"b7b1cax3\",\"stage\":\"ResponseComplete\",\"requestURI\":\"/apis/chaos-mesh.org/v1alpha1/namespaces/ve/httpchaos?limit=500\",\"verb\":\"list\",\"user\":{\"username\":\"system:serviceaccount:xyz\",\"uid\":\"3eb35e\",\"groups\":[\"system:serviceaccounts\",\"system:serviceaccounts:internal-services\",\"system:authenticated\"]},\"sourceIPs\":[\"100.100.100.100\"],\"userAgent\":\"ktl/v1.18.10 (linux/amd64) kubernetes/62c\",\"objectRef\":{\"resource\":\"httpchaos\",\"namespace\":\"vo\",\"apiGroup\":\"chaos-mesh.org\",\"apiVersion\":\"v1alpha1\"},\"responseStatus\":{\"metadata\":{},\"code\":200},\"requestReceivedTimestamp\":\"2022-05-23T13:45:13.140759Z\",\"stageTimestamp\":\"2022-05-23T13:45:13.146101Z\",\"annotations\":{\"authentication.k8s.io/legacy-token\":\"system:serviceaccount:ixyzr\",\"authorization.k8s.io/decision\":\"allow\",\"authorization.k8s.io/reason\":\"RBAC: allowed by ClusterRoleBinding \\\"admin\\\" of ClusterRole \\\"cluster-admin\\\" to ServiceAccount \\\"abc/xyz\\\"\"}}\n","stream":"stdout","pod":"kube-apiserver-78"}, "time": "2022-05-23T13:45:13.0000000Z", "Cloud": "AzureCloud", "Environment": "prod", "UnderlayClass": "hcp-underlay", "UnderlayName": "h-24"}
To query the JSON file after reading it, register the DataFrame as a temporary view in Apache Spark and query it using Spark SQL.
To create the temporary view, use:
df.createOrReplaceTempView("kube_audit")
Then query this view using Spark SQL:
SELECT * FROM kube_audit
WHERE (pod = 'kube-apiserver-78' OR pod = 'kube-apiserver-79')
  AND (username = 'system:serviceaccount:xyz' OR username = 'system:serviceaccount:poq')
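Note that in the sample records above, pod sits inside the properties struct and username is inside the JSON string stored in properties.log, so they need to be extracted before filtering. A minimal sketch, assuming the df read in the question:
from pyspark.sql.functions import col, get_json_object

# pod is a field of the nested properties struct;
# username is buried inside the JSON string held in properties.log
parsed = df \
    .withColumn("pod", col("properties.pod")) \
    .withColumn("username", get_json_object(col("properties.log"), "$.user.username"))

filtered = parsed.filter(
    col("pod").isin("kube-apiserver-78", "kube-apiserver-79") &
    col("username").isin("system:serviceaccount:xyz", "system:serviceaccount:poq")
)
filtered.show(truncate=False)
The same parsed DataFrame can also be registered as a temporary view and filtered with the SQL shown above.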
Reference: Query JSON Files with Azure Synapse Analytics Notebooks

How to upload files from local to staging table in Snowflake using Spark

I was using the Snowflake connector for the same purpose in Python.
The code I had used was:
import snowflake.connector as sf

conn = sf.connect(user=user, password=password, account=account, warehouse=warehouse,
                  database=database, schema=schema)

def execute_query(connection, query):
    cursor = connection.cursor()
    cursor.execute(query)
    cursor.close()

query = "create or replace stage table_stage file_format = (TYPE=CSV);"
execute_query(conn, query)

query = "put file://local_file.csv @table_stage auto_compress=true"
execute_query(conn, query)
Now I need to achieve the same using Spark. The code I'm using is:
sfOptions = {
    "sfURL": "url",
    "sfAccount": "account",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "database",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "warehouse"
}

spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions,
    "create or replace stage table_stage file_format = (TYPE=CSV);")

spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions,
    "put file://local_file.csv @table_stage auto_compress=true")
I'm able to create staging table with this, but not able to upload the files.
Please suggest any alternative method for doing the same.

Get task ids from the Kafka Connect API to print in logs

I have a Kafka Connect sink for which the JSON below is passed via a curl command to register the connector and its tasks.
Please let me know if anyone has any idea how to get the task ids of my connector. For example, in the configuration below, tasks.max is 3, so I need to know
the names of the 3 tasks for logging, i.e. I need to know which line of my log belongs to which task.
In the example below, I know I have 3 tasks - TestCheck-1, TestCheck-2 and TestCheck-3 - based on the Kafka Connect logs. I want to know how to get those task names so that I can print them in my Kafka Connect log lines.
{
  "name": "TestCheck",
  "config": {
    "topics": "topic1",
    "connector.class": "ApplicationSinkTask Class package",
    "tasks.max": "3",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "connector.url": "jdbc connection url",
    "driver.name": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "username": "myusername",
    "password": "mypassword",
    "table.name": "test_table",
    "database.name": "test"
  }
}
When I register it, I get the details below.
curl -X POST -H "Content-Type: application/json" --data @myjson.json http://service:8082/connectors
{"name":"TestCheck","config":{"topics":"topic1","connector.class":"ApplicationSinkTask Class package","tasks.max":"3","key.converter":"org.apache.kafka.connect.storage.StringConverter","value.converter":"org.apache.kafka.connect.storage.StringConverter","connector.url":"jdbc:sqlserver://datahubprod.database.windows.net:1433;","driver.name":"jdbc connection url","username":"myuser","password":"mypassword","table.name":"test_table","database.name":"test","name":"TestCheck"},"tasks":[{"connector":"TestCheck","task":0},{"connector":"TestCheck","task":1},{"connector":"TestCheck","task":2}],"type":null}
You can manage the connectors with the Kafka Connect REST API. There's a whole heap of commands, which you can find here.
The example given in the above link shows you can retrieve all tasks for a given connector using the command:
$ curl localhost:8083/connectors/local-file-sink/tasks
[
  {
    "id": {
      "connector": "local-file-sink",
      "task": 0
    },
    "config": {
      "task.class": "org.apache.kafka.connect.file.FileStreamSinkTask",
      "topics": "connect-test",
      "file": "test.sink.txt"
    }
  }
]
You can use a language of your choice to send the request and load the JSON response into a variable/dictionary for further use, such as printing to a log. Here's a very simple example using Python that assigns the whole output to a variable.
import requests
import json
connectors = 'http://localhost:8083/connectors'
p = requests.get(connectors)
data = p.json()
If you parse the data variable into a dictionary, you can then access each element, e.g. the task id.
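For example, a minimal sketch (assuming the connector from the question is named TestCheck and Connect is reachable on localhost:8083) that builds TestCheck-style task labels from the tasks endpoint:
import requests

# Hypothetical connector name and host, matching the example above
tasks_url = "http://localhost:8083/connectors/TestCheck/tasks"

tasks = requests.get(tasks_url).json()

# Each entry carries the connector name and a numeric task id,
# which can be combined into a "TestCheck-0" style label for log lines
for t in tasks:
    task_label = f"{t['id']['connector']}-{t['id']['task']}"
    print(task_label)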
I hope this helps!

Extract value from cloudant IBM Bluemix NoSQL Database

How do I extract a value from a Cloudant IBM Bluemix NoSQL database document stored in JSON format?
I tried this code
def readDataFrameFromCloudant(host, user, pw, database):
    cloudantdata = spark.read.format("com.cloudant.spark"). \
        option("cloudant.host", host). \
        option("cloudant.username", user). \
        option("cloudant.password", pw). \
        load(database)
    cloudantdata.createOrReplaceTempView("washing")
    spark.sql("SELECT * from washing").show()
    return cloudantdata

hostname = ""
user = ""
pw = ""
database = "database"

cloudantdata = readDataFrameFromCloudant(hostname, user, pw, database)
It is stored in this format:
{
  "_id": "31c24a382f3e4d333421fc89ada5361e",
  "_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
  "d": {
    "count": 9,
    "hardness": 72,
    "temperature": 85,
    "flowrate": 11,
    "fluidlevel": "acceptable",
    "ts": 1502677759234
  }
}
I want the nested fields of d returned as top-level columns (the expected and actual outcomes were attached as screenshots).
Create a dummy dataset for reproducing the issue:
cloudantdata = spark.read.json(sc.parallelize(["""
{
  "_id": "31c24a382f3e4d333421fc89ada5361e",
  "_rev": "1-8ba1be454fed5b48fa493e9fe97bedae",
  "d": {
    "count": 9,
    "hardness": 72,
    "temperature": 85,
    "flowrate": 11,
    "fluidlevel": "acceptable",
    "ts": 1502677759234
  }
}
"""]))
cloudantdata.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', d=Row(count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234))]
Now flatten:
flat_df = cloudantdata.select("_id", "_rev", "d.*")
flat_df.take(1)
Returns:
[Row(_id='31c24a382f3e4d333421fc89ada5361e', _rev='1-8ba1be454fed5b48fa493e9fe97bedae', count=9, flowrate=11, fluidlevel='acceptable', hardness=72, temperature=85, ts=1502677759234)]
I tested this code in an IBM Data Science Experience notebook using Python 3.5 (Experimental) with Spark 2.0.
This answer is based on: https://stackoverflow.com/a/45694796/1033422

wget command line - define format and encoding

Trying to download all categories from an e-commerce website using its REST API and wget (or curl), I cannot get a readable file. The following is the line I am executing:
...>wget https://api.mercadolibre.com/sites/MLB/categories/all --no-check-certificate
I receive output like this:
½Û’Û8².ü*_[6q Hö]>t{\=¶ÇëðÇŽˆ¢ªè–ÄÜmïXûÖôeÇŽ¹˜˜»û®^ìHU €()‰dåŠ1]ì®,$&
I expected something like:
, {
"id": "MLA1743",
"name": "Autos, Motos y Otros"
}, {
"id": "MLA1384",
"name": "Bebés"
}, {
"id": "MLA1039",
"name": "Cámaras y Accesorios"
}, {
"id": "MLA1051",
"name": "Celulares y Teléfonos"
}, {
"id": "MLA1798",
"name": "Coleccionables y Hobbies"
}
Sorry if it's a newbie question, but I cannot find a proper tutorial.
The content is gzip-encoded. You can figure this out by looking at the Content-Encoding header the server sends with the response. You can access the data like this:
wget -O- https://api.mercadolibre.com/sites/MLB/categories/all | zcat
Or just save it to a file first:
wget -O all.gz https://api.mercadolibre.com/sites/MLB/categories/all
gunzip all.gz
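Since the question mentions curl as well: curl can request and decode the compressed response for you with the --compressed flag, so no separate gunzip step is needed, for example:
curl --compressed https://api.mercadolibre.com/sites/MLB/categories/all
# or save it straight to a file
curl --compressed -o all.json https://api.mercadolibre.com/sites/MLB/categories/all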