Can someone let me know what the asterisks (**) achieve when writing to Cosmos DB from Databricks?
# Write configuration
writeConfig = {
    "Endpoint": "https://doctorwho.documents.azure.com:443/",
    "Masterkey": "YOUR-KEY-HERE",
    "Database": "DepartureDelays",
    "Collection": "flights_fromsea",
    "Upsert": "true"
}
# Write to Cosmos DB from the flights DataFrame
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(
    **writeConfig).save()
Thanks
The ** operator unpacks the dictionary into keyword arguments, so every key/value pair in writeConfig is passed to .options() as a separate named argument (a single * does the same for positional arguments from a list or tuple).
So rather than writing:
flights.write.format("com.microsoft.azure.cosmosdb.spark")\
.option("Endpoint", "https://doctorwho.documents.azure.com:443/")\
.option("Upsert", "true")\
.option("Masterkey", "YOUR-KEY-HERE")\
...etc
You simply put all of your options in a dictionary and then pass it like the following:
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(
    **yourdict).save()
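To see what ** does in isolation, here is a tiny self-contained sketch; the connect function and its parameter names are made up purely for illustration and are not part of the Cosmos DB connector:
def connect(Endpoint=None, Masterkey=None, Upsert=None):
    print(Endpoint, Masterkey, Upsert)

opts = {"Endpoint": "https://example.com:443/", "Masterkey": "KEY", "Upsert": "true"}

# The two calls below are equivalent: ** expands the dict into keyword arguments.
connect(**opts)
connect(Endpoint="https://example.com:443/", Masterkey="KEY", Upsert="true")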
Related
I'm using Azure Data Factory to retrieve data and copy it into a database. The source looks like this:
{
"GroupIds": [
"4ee1a-0856-4618-4c3c77302b",
"21259-0ce1-4a30-2a499965d9",
"b2209-4dda-4e2f-029384e4ad",
"63ac6-fcbc-8f7e-36fdc5e4f9",
"821c9-aa73-4a94-3fc0bd2338"
],
"Id": "w5a19-a493-bfd4-0a0c8djc05",
"Name": "Test Item",
"Description": "test item description",
"Notes": null,
"ExternalId": null,
"ExpiryDate": null,
"ActiveStatus": 0,
"TagIds": [
"784083-4c77-b8fb-0135046c",
"86de96-44c1-a497-0a308607",
"7565aa-437f-af36-8f9306c9",
"d5d841-1762-8c14-d8420da2",
"bac054-2b6e-a19b-ef5b0b0c"
],
"ResourceIds": []
}
In my ADF pipeline, I am trying to get the count of GroupIds and store that in a database column (along with the associated Id from the JSON above).
Is there some kind of syntax I can use to tell ADF that I just want the count of GroupIds or is this going to require some kind of recursive loop activity?
You can use the length function in Azure Data Factory (ADF) to check the length of JSON arrays:
length(json(variables('varSource')).GroupIds)
If you are loading the data into a SQL database then you could use OPENJSON. A simple example:
DECLARE @json NVARCHAR(MAX) = '{
"GroupIds": [
"4ee1a-0856-4618-4c3c77302b",
"21259-0ce1-4a30-2a499965d9",
"b2209-4dda-4e2f-029384e4ad",
"63ac6-fcbc-8f7e-36fdc5e4f9",
"821c9-aa73-4a94-3fc0bd2338"
],
"Id": "w5a19-a493-bfd4-0a0c8djc05",
"Name": "Test Item",
"Description": "test item description",
"Notes": null,
"ExternalId": null,
"ExpiryDate": null,
"ActiveStatus": 0,
"TagIds": [
"784083-4c77-b8fb-0135046c",
"86de96-44c1-a497-0a308607",
"7565aa-437f-af36-8f9306c9",
"d5d841-1762-8c14-d8420da2",
"bac054-2b6e-a19b-ef5b0b0c"
],
"ResourceIds": []
}'
SELECT *
FROM OPENJSON( @json, '$.GroupIds' )
SELECT COUNT(*) countOfGroupIds
FROM OPENJSON( @json, '$.GroupIds' );
If your data is stored in a table the code is similar. Make sense?
Another funky way to approach it, if you really need the count in-line, is to convert the JSON to XML using the built-in functions and then run some XPath on it. It's not as complicated as it sounds, and it lets you get the result inside the pipeline.
The Data Factory xml function converts JSON to XML, but that JSON must have a single root property. We can fix up the JSON with concat and a single line of code. In this example I'm using a Set Variable activity, where varSource is your original JSON:
#concat('{"root":', variables('varSource'), '}')
Next, we can just apply the XPath with another simple expression:
@string(xpath(xml(json(variables('varIntermed1'))), 'count(/root/GroupIds)'))
Easy huh. It's a shame there isn't more built-in support for JSONPath, unless I'm missing something, although you can use limited JSONPath in the Copy activity.
You can use a Data flow activity in the Azure Data Factory pipeline to get the count.
Step1:
Connect the source to the JSON dataset, and in the Source options under JSON settings, select Single document.
In the source data preview, you can see there are 5 GroupIds per Id.
Step2:
Use the Flatten transformation to denormalize the GroupIds values into rows.
Select the GroupIds array in Unroll by and Unroll root.
Step3:
Use the Aggregate transformation to get the count of GroupIds grouped by Id.
Under Group by: select the column to group on (Id) from the drop-down.
Under Aggregate: build the expression to get the count of the GroupIds column, for example an expression like count(GroupIds).
Step4: Connect the output to a Sink transformation to load the final output to the database.
I'm using the Cosmos DB Connector for Spark. Is it possible to use Mongo Shell "JSON-style" queries with the Cosmos DB connector instead of SQL queries?
I tried using the MongoDB Connector instead to achieve the same functionality but have run into some annoying bugs with memory limits using the Mongo Connector. So I've abandoned that approach.
This is the way I'd prefer to query:
val results = db.cars.find(
{
"car.actor.account.name": "Bill"
}
)
This is the way the cosmos connector allows:
val readConfig: Config = Config(Map(
"Endpoint" -> config.getString("endpoint"),
"Masterkey" -> config.getString("masterkey"),
"Database" -> config.getString("database"),
"Collection" -> "cars",
"preferredRegions" -> "South Central US",
"schema_samplesize" -> "100",
"query_custom" -> "SELECT * FROM root WHERE root['$v']['car']['$v']['actor']['$v']['account']['$v']['name']['$v'] = 'Bill'"
))
val results = spark.sqlContext.read.cosmosDB(readConfig)
Obviously the SQL-oriented approach doesn't lend itself well to the deeply nested data structures I'm getting from Cosmos DB. It's quite a bit more verbose, too, requiring each nested dictionary to be referenced with "['$v']" for reasons I'm unclear on. I'd much prefer to be able to use the Mongo-style syntax.
The Cosmos DB Connector for Spark mentioned in this link is for the Cosmos DB SQL API, so you can only query it in the SQL-oriented way:
// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config
// Read Configuration
val readConfig = Config(Map(
"Endpoint" -> "https://doctorwho.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE",
"Database" -> "DepartureDelays",
"Collection" -> "flights_pcoll",
"query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'" // Optional
))
// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()
But if your Cosmos DB account uses the MongoDB API, you could query it Mongo-style through the MongoDB Spark connector instead. Please refer to this guide: https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
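A minimal PySpark sketch of what that looks like, assuming the MongoDB Spark connector (v2.x/3.x, registered under the short name "mongo") is installed on the cluster; the URI and database/collection names are placeholders borrowed from the question, so treat this as an illustration rather than a verified configuration:
from pyspark.sql.functions import col

df = (spark.read
      .format("mongo")                                   # MongoDB Spark connector data source
      .option("uri", "mongodb://somehost:27017/mydb.cars")  # placeholder connection URI
      .load())

# Filters push down to MongoDB where possible, similar in spirit to db.cars.find({...})
results = df.filter(col("car.actor.account.name") == "Bill")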
I am trying to do a conditional update to a nested value. Basically, one variable in a nested array of two variables per array item has a boolean component I want to update based on the string value of the other variable.
I also want to do all of that based on a targeted find query. I came up with this below, but it doesn't work.
#!/usr/bin/env python
import ssl
from pymongo import MongoClient
client = MongoClient("somehost", ssl=True, ssl_cert_reqs=ssl.CERT_NONE, replicaSet='rs0')
db = client.maestro
mycollection = db.users
print 'connected, now performing update'
mycollection.find_and_modify(query={'emailAddress':'somedude@someplace.wat'}, update={ "nested.name" : "issomename" }, { "$set": {'nested.$*.value': True}}, upsert=True, full_response=True)
This code results in:
SyntaxError: non-keyword arg after keyword arg
This makes me think that the find_and_modify() method can't handle the conditional update bit.
Is there some way to achieve this, or have I gone down a wrong path? What would you all suggest as a better approach?
#!/usr/bin/env python
import ssl
from pymongo import MongoClient
client = MongoClient("somehost.wat", ssl=True, ssl_cert_reqs=ssl.CERT_NONE, replicaSet='rs0')
db = client.dbname
mycollection = db.docs
user_email = 'user@somehost.wat'
mycollection.update({ "emailAddress": user_email,"nestedvalue": { "$elemMatch" : {"name": "somename"} } }, { "$set": {"nestedvalue.$.value": True}})
This did the trick.
Instead of find_and_modify, use update_one if you want to update just one record, or update_many in the case of many.
The usage is like this:
mycollection.update_one({'emailAddress': 'somedude@someplace.wat', 'nested.name': 'issomename'}, {'$set': {'nested.$.value': True}})
for further detail please go through this link:
https://docs.mongodb.com/getting-started/python/update/
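If you are on MongoDB 3.6+ with a recent PyMongo, the conditional part of the original question can also be expressed with array filters. A sketch reusing the field names from the question, for illustration only:
mycollection.update_many(
    {"emailAddress": "somedude@someplace.wat"},
    {"$set": {"nested.$[elem].value": True}},      # update the matching array elements
    array_filters=[{"elem.name": "issomename"}]    # only elements whose name matches
)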
I'm using the Elasticsearch-Hadoop API, and I was trying to get _mtermvectors using the following Spark code:
val query= """_mtermvectors {
"ids" : ["1256"],
"parameters": {
"fields": [
"tname"
],
"term_statistics": true
}
}"""
var idRdd = sparkContext.esRDD("uindex/type1",query)
It didn't work. Any ideas? Much appreciated!
You can't use endpoints (like _mtermvectors) that are part of the document APIs with ES-Hadoop. Only queries that belong to the query APIs, the query DSL, or an external resource are allowed.
Hope that helps.
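For contrast, a query that ES-Hadoop does accept is a plain query-DSL body passed through es.query. A minimal PySpark sketch, reusing the "uindex/type1" resource and "tname" field from the question; the es.nodes value is a placeholder and the elasticsearch-hadoop connector JAR is assumed to be on the classpath:
query_dsl = '{"query": {"match": {"tname": "some value"}}}'

df = (spark.read
      .format("org.elasticsearch.spark.sql")   # elasticsearch-hadoop Spark SQL data source
      .option("es.nodes", "localhost:9200")    # placeholder cluster address
      .option("es.query", query_dsl)           # query DSL is supported; document APIs are not
      .load("uindex/type1"))

df.show()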
To be more specific,
I loaded the data into MongoDB with PyMongo using this script:
header = ['id', 'info']
for each in reader:
    row = {}
    for field in header:
        row[field] = each[field]
    db.segment.insert_one(row)
The id column holds the unique Id of each user and the info column is a nested JSON document.
For example, here is a document in the db:
{
u'_id': ObjectId('111'),
u'id': u'123',
u'info': {
"TYPE": "food",
"dishes":"166",
"cc": "20160327 040001",
"country": "japan",
"money": 3521,
"info2": [{"type"; "dishes", "number":"2"}]
}
}
What I want to do is read the values inside the nested JSON.
So what I did is:
pipe = [{"$group":{"_id":"$id", "Totalmoney":{"$sum":"$info.money"}}}]
total_money = db.segment.aggregate(pipeline=pipe)
but the result of the sum is always 0 for every id.
What am I doing wrong? How can I fix it?
I have to use MongoDB because the data size is too big to handle in Python.
Thank you in advance.