What is the PySpark equivalent of MERGE INTO for Databricks Delta Lake?

The Databricks documentation describes how to do a merge for Delta tables.
In SQL the syntax
MERGE INTO [db_name.]target_table [AS target_alias]
USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
can be used. Is a Python equivalent available?

I managed to find the documentation with the help of Alexandros Biratsis. The documentation can be found here. An example of such a merge is given by
deltaTable.alias("events").merge(
    source = updatesDF.alias("updates"),
    condition = "events.eventId = updates.eventId"
).whenMatchedUpdate(set =
    {
        "data": "updates.data",
        "count": "events.count + 1"
    }
).whenNotMatchedInsert(values =
    {
        "date": "updates.date",
        "eventId": "updates.eventId",
        "data": "updates.data",
        "count": "1"
    }
).execute()
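For completeness, deltaTable and updatesDF have to exist before the merge call. A minimal setup sketch, assuming the Delta table lives at the hypothetical path /tmp/delta/events (DeltaTable.forName can be used instead for a metastore table):
from delta.tables import DeltaTable

# Handle to an existing Delta table; the path is only an assumption for illustration.
deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")

# updatesDF is any Spark DataFrame holding the new or changed rows;
# the source path below is likewise hypothetical.
updatesDF = spark.read.format("delta").load("/tmp/delta/events_updates")
whenMatchedUpdateAll() and whenNotMatchedInsertAll() are also available when every column should simply be copied from the source.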

Related

Read JSON in ADF

In Azure Data Factory, I need to be able to process a JSON response. I don't want to hardcode the array position in case it changes, so something like this is out of the question:
@activity('Place Details').output.result.components[2].name
How can I get the name "123" where types = "number", given a JSON array like the one below:
"result": {
"components": [
{
"name": "ABC",
"types": [
"alphabet"
]
},
{
"name": "123",
"types": [
"number"
]
}
]
}
One example using the OPENJSON method:
DECLARE @json NVARCHAR(MAX) = '{
    "result": {
        "components": [
            {
                "name": "ABC",
                "types": [
                    "alphabet"
                ]
            },
            {
                "name": "123",
                "types": [
                    "number"
                ]
            }
        ]
    }
}'
;WITH cte AS (
    SELECT
        JSON_VALUE( o.[value], '$.name' ) [name],
        JSON_VALUE( o.[value], '$.types[0]' ) [types]
    FROM OPENJSON( @json, '$.result.components' ) o
)
SELECT [name]
FROM cte
WHERE types = 'number'
I will have a look at other methods.
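Outside of ADF and SQL, the same lookup can be expressed in a few lines of Python; this is only an illustrative sketch of the filtering logic, with the response hard-coded to mirror the example above:
import json

response = json.loads('''{"result": {"components": [
    {"name": "ABC", "types": ["alphabet"]},
    {"name": "123", "types": ["number"]}
]}}''')

# Select the first component whose types list contains "number",
# without relying on its position in the array.
name = next(c["name"] for c in response["result"]["components"] if "number" in c["types"])
print(name)  # prints: 123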

How can I enable a wildcard index at collection creation time using an Azure CLI command in the Cosmos DB Mongo API?

az cosmosdb mongodb collection create -g -a -d -n --idx @indexes.json
Running the command in this format gives the error:
Deployment failed. Correlation ID: ..... Index definition does not contains '_id' index specification.
My indexes.json file content is shown below:
[
    { "_id": 1 },
    {
        "key": { "keys": ["$**"] },
        "options": { "Wildcard": true }
    }
]
Try using this JSON instead.
"indexes": [
{
"key": {
"keys": [
"_id"
]
}
},
{
"key": {
"keys": [
"$**"
]
}
}
]

Cannot parametrize any value under placement.managedCluster.config

My goal is to create a Dataproc workflow template from Python code. At the same time I want the ability to parameterize the placement.managedCluster.config.gceClusterConfig.subnetworkUri field during template instantiation.
I read the template from a JSON file like:
{
    "id": "bigquery-extractor",
    "placement": {
        "managed_cluster": {
            "config": {
                "gce_cluster_config": {
                    "subnetwork_uri": "some-subnet-name"
                },
                "software_config": {
                    "image_version": "1.5"
                }
            },
            "cluster_name": "some-name"
        }
    },
    "jobs": [
        {
            "pyspark_job": {
                "args": [
                    "job_argument"
                ],
                "main_python_file_uri": "gs:///path-to-file"
            },
            "step_id": "extract"
        }
    ],
    "parameters": [
        {
            "name": "CLUSTER_NAME",
            "fields": [
                "placement.managedCluster.clusterName"
            ]
        },
        {
            "name": "SUBNETWORK_URI",
            "fields": [
                "placement.managedCluster.config.gceClusterConfig.subnetworkUri"
            ]
        },
        {
            "name": "MAIN_PY_FILE",
            "fields": [
                "jobs['extract'].pysparkJob.mainPythonFileUri"
            ]
        },
        {
            "name": "JOB_ARGUMENT",
            "fields": [
                "jobs['extract'].pysparkJob.args[0]"
            ]
        }
    ]
}
The code snippet I use:
from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import AlreadyExists
from google.cloud import dataproc_v1 as dataproc

options = ClientOptions(api_endpoint="{}-dataproc.googleapis.com:443".format(region))
client = dataproc.WorkflowTemplateServiceClient(client_options=options)

template_file = open(path_to_file, "r")
template_dict = eval(template_file.read())
print(template_dict)

template = dataproc.WorkflowTemplate(template_dict)
full_region_id = "projects/{project_id}/regions/{region}".format(project_id=project_id, region=region)

try:
    client.create_workflow_template(
        parent=full_region_id,
        template=template
    )
except AlreadyExists as err:
    print(err)
    pass
When I try to run this code I get the following error:
google.api_core.exceptions.InvalidArgument: 400 Invalid field path placement.managed_cluster.configuration.gce_cluster_config.subnetwork_uri: Field gce_cluster_config does not exist.
The behavior is the same if I try to parameterize placement.managedCluster.config.softwareConfig.imageVersion; then I get
google.api_core.exceptions.InvalidArgument: 400 Invalid field path placement.managed_cluster.configuration.software_config.image_version: Field software_config does not exist.
But if I exclude all fields under placement.managedCluster.config from the parameters map, the template is created successfully.
I didn't find any documented restriction on parameterizing these fields. Is there one? Or am I doing something wrong?
This doc lists the parameterizable fields. It seems that of managedCluster only the cluster name is parameterizable:
Managed cluster name. Dataproc will use the user-supplied name as the name prefix, and append random characters to create a unique cluster name. The cluster is deleted at the end of the workflow.
I don't see managedCluster.config listed as parameterizable.
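If the subnetwork does not actually need to vary per instantiation, one possible workaround is to substitute the value into the template dict before creating the template and keep only parameters for supported fields. A rough sketch building on the snippet above; the json module, the "my-subnet" value and the filtering step are my additions, not part of the original code:
import json

with open(path_to_file, "r") as template_file:
    template_dict = json.load(template_file)

# Set the subnetwork directly in the template instead of declaring it as a parameter
# ("my-subnet" is a placeholder value).
template_dict["placement"]["managed_cluster"]["config"]["gce_cluster_config"]["subnetwork_uri"] = "my-subnet"

# Drop parameters that point under placement.managedCluster.config, since those
# field paths are rejected; keep the ones the docs list as parameterizable.
template_dict["parameters"] = [
    p for p in template_dict["parameters"]
    if not p["fields"][0].startswith("placement.managedCluster.config")
]

template = dataproc.WorkflowTemplate(template_dict)
client.create_workflow_template(parent=full_region_id, template=template)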

Use $redact to replace sub-documents with "access denied"

I've written a $redact operation to filter my documents:
db.test.aggregate([
    { $redact: {
        $cond: {
            if: { "$ifNull": ["$_acl.READ", false] },
            then: { $cond: {
                if: { $anyElementTrue: {
                    $map: {
                        input: "$_acl.READ",
                        as: "myfield",
                        in: { $setIsSubset: [ "$$myfield", ["user1"] ] }
                    }
                }},
                then: "$$DESCEND",
                else: "$$PRUNE"
            }},
            else: "$$DESCEND"
        }
    }}
])
This removes all (sub)documents where _acl.READ doesn't contain user1, but it keeps all (sub)documents where _acl.READ is not set.
After the aggregation I can't tell whether some information was removed or whether it simply wasn't part of the document.
I'd like to remove sensitive information but keep some hint that access was denied, i.e.:
{
    id: ...,
    subDoc1: {
        foo: "bar",
        _acl: {
            READ: [ ["user1"] ]
        }
    },
    subDoc2: {
        _error: "ACCESS DENIED"
    }
}
I just can't figure out how to modify the document while using $redact.
Thank you!
The $redact pipeline stage is quite unique in the aggregation framework: it is not only capable of recursively descending into the nested structure of a document, but it can also traverse across all of the keys at any level. It does however still require a concept of "depth", in that a key must either contain a sub-document object or an array which is itself composed of sub-documents.
But what it cannot do is "replace" or "swap out" content. The only actions allowed here are fixed; more specifically, from the documentation:
The argument can be any valid expression as long as it resolves to $$DESCEND, $$PRUNE, or $$KEEP system variables. For more information on expressions, see Expressions.
The possibly misleading part is "The argument can be any valid expression", which is in fact true, but the expression must still resolve to nothing other than one of those system variables.
So in order to return some sort of "Access Denied" response in place of the "redacted" content, you have to process the document differently. You also need to consider the limitations of other operators, which simply cannot work recursively or traverse the structure in the way mentioned earlier.
Keeping with the example from the documentation:
{
    "_id": 1,
    "title": "123 Department Report",
    "tags": [ "G", "STLW" ],
    "year": 2014,
    "subsections": [
        {
            "subtitle": "Section 1: Overview",
            "tags": [ "SI", "G" ],
            "content": "Section 1: This is the content of section 1."
        },
        {
            "subtitle": "Section 2: Analysis",
            "tags": [ "STLW" ],
            "content": "Section 2: This is the content of section 2."
        },
        {
            "subtitle": "Section 3: Budgeting",
            "tags": [ "TK" ],
            "content": {
                "text": "Section 3: This is the content of section3.",
                "tags": [ "HCS" ]
            }
        }
    ]
}
If we want to process this so that content is kept when it matches the user's "role tags" of [ "STLW", "G" ] and replaced otherwise, then you would do something like this instead:
var userAccess = [ "STLW", "G" ];
db.sample.aggregate([
    { "$project": {
        "title": 1,
        "tags": 1,
        "year": 1,
        "subsections": { "$map": {
            "input": "$subsections",
            "as": "el",
            "in": { "$cond": [
                { "$gt": [
                    { "$size": { "$setIntersection": [ "$$el.tags", userAccess ] }},
                    0
                ]},
                "$$el",
                {
                    "subtitle": "$$el.subtitle",
                    "label": { "$literal": "Access Denied" }
                }
            ]}
        }}
    }}
])
That's going to produce a result like this:
{
    "_id": 1,
    "title": "123 Department Report",
    "tags": [ "G", "STLW" ],
    "year": 2014,
    "subsections": [
        {
            "subtitle": "Section 1: Overview",
            "tags": [ "SI", "G" ],
            "content": "Section 1: This is the content of section 1."
        },
        {
            "subtitle": "Section 2: Analysis",
            "tags": [ "STLW" ],
            "content": "Section 2: This is the content of section 2."
        },
        {
            "subtitle": "Section 3: Budgeting",
            "label": "Access Denied"
        }
    ]
}
Basically, instead of $redact we use the $map operator to process the array of items and apply a condition to each element. In this case the $cond operator first checks whether the "tags" field has a non-empty $setIntersection with the userAccess variable we defined earlier.
Where that condition is true, the element is returned unaltered. In the false case, rather than removing the element (not impossible with $map, but that requires another step, since $map returns exactly as many elements as it receives in "input"), you just replace the returned content with something else: in this case an object with a single key and a $literal value of "Access Denied".
So keep in mind what you cannot do here:
You cannot actually traverse document keys. Any processing needs to be explicit to the keys specifically mentioned.
The content therefore cannot be in any form other than an array, as MongoDB cannot traverse across keys; otherwise you would need to evaluate conditionally at each key path.
Filtering the "top-level" document is right out, unless you really want to add an additional stage at the end that does this:
{ "$project": {
"doc": { "$cond": [
{ "$gt": [
{ "$size": { "$setIntersection": [ "$tags", userAccess ] }},
0
]},
"$ROOT",
{
"title": "$title",
"label": { "$literal": "Access Denied" }
}
]}
}}
With all that said and done, there really is not a lot of purpose in any of this unless you actually intend to "aggregate" something at the end of the day. Making the server do exactly the same filtering of document content that you could do in client code is usually not the best use of expensive CPU cycles.
Even in the basic examples given, it makes a lot more sense to just do this in client code, unless you really gain a major benefit from not transferring entries that fail your conditions over the network. In your case there is no such benefit, so it is better to do this in client code instead.
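For illustration only, here is a minimal client-side version of the same check in Python with pymongo; the connection string, database and collection names, and the user_access set are assumptions mirroring the example above:
from pymongo import MongoClient

user_access = {"STLW", "G"}
coll = MongoClient("mongodb://localhost:27017")["test"]["sample"]

for doc in coll.find({}):
    # Replace any subsection whose tags do not intersect the user's roles.
    doc["subsections"] = [
        sub if user_access & set(sub.get("tags", []))
        else {"subtitle": sub.get("subtitle"), "label": "Access Denied"}
        for sub in doc.get("subsections", [])
    ]
    print(doc)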

Why do Doctrine2 ODM's results of findBy() and createQueryBuilder()->getQuery()->execute() differ from each other?

I have tried two different ways to run the same query with Doctrine's MongoDB ODM.
Can you figure out why these two, in my opinion similar, queries return different results? Snippet 1 doesn't return anything, whereas Snippet 2 returns the correct database entries. Both queries look similar in the log file, except that #1 does not have the skip & limit lines.
Snippet 1
$dateDayAgo = new \DateTime('1 day ago');
$recentLogins = $this->get('user_activity_tracker')->findBy(array(
    'targetUser' => $userAccount->getId(),
    'code' => array('$in' => array('login.attempt','login.ok')),
    'ts' => array('$gte', $dateDayAgo)
))->sort(['ts' => 1]);
Symfony's log entries for Snippet 1:
[2012-08-13 09:14:33] doctrine.INFO: MongoDB query: { "find": true, "query": { "targetUser": ObjectId("4fa377e06803fa7303000002"), "code": { "$in": [ "login.attempt", "login.ok" ] }, "ts": [ "$gte", new Date("Sun, 12 Aug 2012 09:14:33 +0000") ] }, "fields": [ ], "db": "eventio_com", "collection": "ActivityEvent" } [] []
[2012-08-13 09:14:33] doctrine.INFO: MongoDB query: { "sort": true, "sortFields": { "ts": 1 }, "query": { "targetUser": ObjectId("4fa377e06803fa7303000002"), "code": { "$in": [ "login.attempt", "login.ok" ] }, "ts": [ "$gte", new Date("Sun, 12 Aug 2012 09:14:33 +0000") ] }, "fields": [ ] } [] []
Snippet 2
$recentLoginsQuery = $this->get('user_activity_tracker')->createQueryBuilder()
    ->field('targetUser')->equals($userAccount->getId())
    ->field('code')->in(array('login.attempt','login.ok'))
    ->field('ts')->gte($dateDayAgo)
    ->sort('ts','asc')
    ->getQuery();
$recentLogins = $recentLoginsQuery->execute();
Log entries for Snippet 2:
[2012-08-13 09:17:30] doctrine.INFO: MongoDB query: { "find": true, "query": { "targetUser": ObjectId("4fa377e06803fa7303000002"), "code": { "$in": [ "login.attempt", "login.ok" ] }, "ts": { "$gte": new Date("Sun, 12 Aug 2012 09:17:30 +0000") } }, "fields": [ ], "db": "eventio_com", "collection": "ActivityEvent" } [] []
[2012-08-13 09:17:30] doctrine.INFO: MongoDB query: { "limit": true, "limitNum": null, "query": { "targetUser": ObjectId("4fa377e06803fa7303000002"), "code": { "$in": [ "login.attempt", "login.ok" ] }, "ts": { "$gte": new Date("Sun, 12 Aug 2012 09:17:30 +0000") } }, "fields": [ ] } [] []
[2012-08-13 09:17:30] doctrine.INFO: MongoDB query: { "skip": true, "skipNum": null, "query": { "targetUser": ObjectId("4fa377e06803fa7303000002"), "code": { "$in": [ "login.attempt", "login.ok" ] }, "ts": { "$gte": new Date("Sun, 12 Aug 2012 09:17:30 +0000") } }, "fields": [ ] } [] []
[2012-08-13 09:17:30] doctrine.INFO: MongoDB query: { "sort": true, "sortFields": { "ts": 1 }, "query": { "targetUser": ObjectId("4fa377e06803fa7303000002"), "code": { "$in": [ "login.attempt", "login.ok" ] }, "ts": { "$gte": new Date("Sun, 12 Aug 2012 09:17:30 +0000") } }, "fields": [ ] } [] []
My 'user_activity_tracker' service works just as a proxy to the underlying Doctrine repository / document manager. Both snippets return a LoggableCursor after the query.
The extra log output with the query builder method is due to Query::prepareCursor(), which always sets additional cursor options. The repository findBy() method, which uses DocumentPersister::loadAll(), only sets options when a non-null value is provided. That explains the difference in log output, but it is unrelated to any difference in result sets.
The logged queries for each example are otherwise nearly identical, apart from a small difference in the ts criteria. If the count() values of both cursors differ and the results are still different after unwrapping the cursors with iterator_to_array(), I would suggest attempting to reproduce this in a failing test case and submitting a pull request against the mongodb-odm repository.