Using multiple dimensions for a dimension extraction function in Druid

Is it possible to use multiple dimensions for a dimension extraction function?
Something like:
{
"type": "extraction",
"dimension": ["dimension_1", "dimension_2"],
"outputName": "new_dimension",
"outputType": "STRING",
"extractionFn": {
"type": "javascript",
"function": "function(x, y){ // do sth with both x and y to return the result }"
}
}

I do not think this is possible. However, you can achieve something like it by first "merging" the two dimensions with a virtualColumn and then applying an extraction function to the combined value. Inside the extraction function you can split the values apart again.
Example query (using https://github.com/level23/druid-client):
$client = new DruidClient([
"router_url" => "https://your.druid"
]);
// Build a groupBy query.
$builder = $client->query("hits")
->interval("now - 1 hour/now")
->select("os_name")
->select("browser")
->virtualColumn("concat(os_name, ';', browser)", "combined")
->sum("hits")
->select("combined", "coolBrowser", function (ExtractionBuilder $extractionBuilder) {
$extractionBuilder->javascript("function(t) { parts = t.split(';'); return parts[0] + ' with cool ' + parts[1] ; }");
})
->where("os_name", "!=", "")
->where("browser", "!=", "")
->orderBy("hits", "desc")
;
// Execute the query.
$response = $builder->groupBy();
Example result:
+--------+-----------------------------------+-----------------+------------+
| hits   | coolBrowser                       | browser         | os_name    |
+--------+-----------------------------------+-----------------+------------+
| 418145 | Android with cool Chrome Mobile   | Chrome Mobile   | Android    |
| 62937  | Windows 10 with cool Edge         | Edge            | Windows 10 |
| 27956  | Android with cool Samsung Browser | Samsung Browser | Android    |
| 9460   | iOS with cool Safari              | Safari          | iOS        |
+--------+-----------------------------------+-----------------+------------+
Raw native Druid JSON query:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2021-10-15T11:25:23.000Z/2021-10-15T12:25:23.000Z"
],
"dimensions": [
{
"type": "default",
"dimension": "os_name",
"outputType": "string",
"outputName": "os_name"
},
{
"type": "default",
"dimension": "browser",
"outputType": "string",
"outputName": "browser"
},
{
"type": "extraction",
"dimension": "combined",
"outputType": "string",
"outputName": "coolBrowser",
"extractionFn": {
"type": "javascript",
"function": "function(t) { parts = t.split(\";\"); return parts[0] + \" with cool \" + parts[1] ; }",
"injective": false
}
}
],
"granularity": "all",
"filter": {
"type": "and",
"fields": [
{
"type": "not",
"field": {
"type": "selector",
"dimension": "os_name",
"value": ""
}
},
{
"type": "not",
"field": {
"type": "selector",
"dimension": "browser",
"value": ""
}
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "hits",
"fieldName": "hits"
}
],
"virtualColumns": [
{
"type": "expression",
"name": "combined",
"expression": "concat(os_name, ';', browser)",
"outputType": "string"
}
],
"context": {
"groupByStrategy": "v2"
},
"limitSpec": {
"type": "default",
"columns": [
{
"dimension": "hits",
"direction": "descending",
"dimensionOrder": "lexicographic"
}
]
}
}
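As a side note: if your Druid version supports SQL, you can often skip the extraction function entirely and build the combined label with plain string functions. A rough SQL sketch of the same idea (the datasource and column names are taken from the example above; the time filter and aliases are assumptions):
SELECT
  CONCAT(os_name, ' with cool ', browser) AS "coolBrowser",
  SUM(hits) AS total_hits
FROM hits
WHERE os_name <> ''
  AND browser <> ''
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
ORDER BY total_hits DESC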

Related

Add MySQL column comment as a metadata to Avro schema through the Debezium connector

Kafka Connect is used through the Confluent platform, and io.debezium.connector.mysql.MySqlConnector is used as the Debezium connector.
In my case, the MySQL table includes columns with sensitive data, and these columns must be tagged as sensitive for further use.
SHOW FULL COLUMNS FROM astronauts;
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
| Field   | Type         | Collation          | Null | Key | Default | Extra | Privileges                      | Comment          |
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
| orderid | int          | NULL               | YES  |     | NULL    |       | select,insert,update,references |                  |
| name    | varchar(100) | utf8mb4_0900_ai_ci | NO   |     | NULL    |       | select,insert,update,references | sensitive column |
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
Notice the MySQL comment on the name column.
Based on this table, I would like to have the following Avro schema in the Schema Registry:
{
"connect.name": "dbserver1.inventory.astronauts.Envelope",
"connect.version": 1,
"fields": [
{
"default": null,
"name": "before",
"type": [
"null",
{
"connect.name": "dbserver1.inventory.astronauts.Value",
"fields": [
{
"default": null,
"name": "orderid",
"type": [
"null",
"int"
]
},
{
"name": "name",
"type": {
"MY_CUSTOM_ATTRIBUTE": "sensitive column",
"type": "string"
}
}
],
"name": "Value",
"type": "record"
}
]
},
{
"default": null,
"name": "after",
"type": [
"null",
"Value"
]
},
{
"name": "source",
"type": {
"connect.name": "io.debezium.connector.mysql.Source",
"fields": [
{
"name": "version",
"type": "string"
},
{
"name": "connector",
"type": "string"
},
{
"name": "name",
"type": "string"
},
{
"name": "ts_ms",
"type": "long"
},
{
"default": "false",
"name": "snapshot",
"type": [
{
"connect.default": "false",
"connect.name": "io.debezium.data.Enum",
"connect.parameters": {
"allowed": "true,last,false,incremental"
},
"connect.version": 1,
"type": "string"
},
"null"
]
},
{
"name": "db",
"type": "string"
},
{
"default": null,
"name": "sequence",
"type": [
"null",
"string"
]
},
{
"default": null,
"name": "table",
"type": [
"null",
"string"
]
},
{
"name": "server_id",
"type": "long"
},
{
"default": null,
"name": "gtid",
"type": [
"null",
"string"
]
},
{
"name": "file",
"type": "string"
},
{
"name": "pos",
"type": "long"
},
{
"name": "row",
"type": "int"
},
{
"default": null,
"name": "thread",
"type": [
"null",
"long"
]
},
{
"default": null,
"name": "query",
"type": [
"null",
"string"
]
}
],
"name": "Source",
"namespace": "io.debezium.connector.mysql",
"type": "record"
}
},
{
"name": "op",
"type": "string"
},
{
"default": null,
"name": "ts_ms",
"type": [
"null",
"long"
]
},
{
"default": null,
"name": "transaction",
"type": [
"null",
{
"connect.name": "event.block",
"connect.version": 1,
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "total_order",
"type": "long"
},
{
"name": "data_collection_order",
"type": "long"
}
],
"name": "block",
"namespace": "event",
"type": "record"
}
]
}
],
"name": "Envelope",
"namespace": "dbserver1.inventory.astronauts",
"type": "record"
}
Notice the custom schema field named MY_CUSTOM_ATTRIBUTE.
Debezium 2.0 supports populating the schema doc field from column comments [DBZ-5489]; however, I personally think the doc field is not appropriate, since:
any implementation of a schema registry or system that processes the schemas is free to drop those fields when encoding/decoding and it's fully spec compliant
Additionally, the doc field is solely intended to provide information to a user of the schema and is not intended as a form of metadata that downstream programs can rely on
source: https://avro.apache.org/docs/1.10.2/spec.html#Schema+Resolution
Based on the Avro schema docs, custom attributes for Avro schemas are allowed and these attributes are known as metadata:
A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
source: https://avro.apache.org/docs/1.10.2/spec.html#schemas
I think Debezium transformations might be a solution; however, I have the following problems (a rough sketch of such a transformation follows the reference link below):
I have no idea how to get the MySQL column comments inside my custom transformation
org.apache.kafka.connect.data.SchemaBuilder does not seem to allow adding arbitrary custom attributes; as far as I know, only doc and the predefined fields are supported
Here are several native transformations for reference: https://github.com/apache/kafka/tree/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/
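For what it's worth, here is a rough, hypothetical sketch of what such a transformation could look like, assuming the only generic extension point SchemaBuilder offers is parameter(). The class name, package, and config keys are invented, and as far as I know the Confluent Avro converter serializes these parameters under connect.parameters rather than as a top-level attribute like MY_CUSTOM_ATTRIBUTE, so this may or may not be close enough. It also side-steps the first problem: the comment text is passed in via the SMT configuration instead of being read from MySQL, and a real Debezium envelope nests the row fields inside the before/after structs, which would need recursive handling that is omitted here.
// Hypothetical SMT sketch: package, class name, and config keys are made up for illustration.
package com.example.smt;

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class TagSensitiveField<R extends ConnectRecord<R>> implements Transformation<R> {

    private String fieldName; // e.g. "name"
    private String tagValue;  // e.g. "sensitive column" -- supplied via config, not read from MySQL

    @Override
    public void configure(Map<String, ?> configs) {
        fieldName = (String) configs.get("field");
        tagValue = (String) configs.get("value");
    }

    @Override
    public R apply(R record) {
        if (record.valueSchema() == null || record.valueSchema().type() != Schema.Type.STRUCT) {
            return record; // tombstones / schemaless records pass through untouched
        }
        Schema oldSchema = record.valueSchema();
        SchemaBuilder newSchemaBuilder = SchemaBuilder.struct()
                .name(oldSchema.name())
                .version(oldSchema.version());
        for (Field f : oldSchema.fields()) {
            if (f.name().equals(fieldName)) {
                // parameter() is the only generic extension point SchemaBuilder offers;
                // the Avro converter emits parameters under "connect.parameters",
                // not as a top-level attribute such as MY_CUSTOM_ATTRIBUTE.
                SchemaBuilder tagged = new SchemaBuilder(f.schema().type())
                        .parameter("MY_CUSTOM_ATTRIBUTE", tagValue);
                if (f.schema().isOptional()) {
                    tagged.optional();
                }
                newSchemaBuilder.field(f.name(), tagged.build());
            } else {
                newSchemaBuilder.field(f.name(), f.schema());
            }
        }
        Schema newSchema = newSchemaBuilder.build();

        // Copy the values field by field into a struct with the new schema.
        Struct oldValue = (Struct) record.value();
        Struct newValue = new Struct(newSchema);
        for (Field f : newSchema.fields()) {
            newValue.put(f.name(), oldValue.get(f.name()));
        }
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef()
                .define("field", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Field to tag")
                .define("value", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Tag value");
    }

    @Override
    public void close() {
    }
}
It would be registered in the connector config like any other SMT, e.g. transforms=tagName, transforms.tagName.type=com.example.smt.TagSensitiveField, transforms.tagName.field=name, transforms.tagName.value=sensitive column.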

How to group by single field and return more values together

I'm starting to use Apache Druid but am having some difficulty running native queries (and some SQL ones too).
1- Is it possible to group by a single column while also returning more channels?
2- How could I group by a single column while returning different grouped items on the same query/row?
Query I'm trying to use:
{
"queryType": "groupBy",
"dataSource": "my-data-source",
"granularity": "all",
"intervals": ["2022-06-27T03:00:00.000Z/2022-06-28T03:00:00.000Z"],
"context:": { "timeout: 30000 },
"dimensions": ["userId"],
"filter": {
"type": "and",
"fields": [
{
"type": "or",
"fields": [{...}]
}
]
},
"aggregations": [
{
"type": "count",
"name": "count"
}
]
}
I tried adding a filtered aggregator inside aggregations: [], but nothing changed.
"aggregations": [
{
"type: "count",
"name": "count"
},
{
"type": "filtered",
"filter": {
"type": "selector",
"dimension": "block_id",
"value": "block1"
},
"aggregator": {
"type": "count",
"name": "block1",
"fieldName": "block_id"
}
}
]
Grouping Aggregator also didn't work.
"aggregations": [
{
"type": "count",
"name": "count"
},
{
"type": "grouping",
"name": "groupedData",
"groupings": ["block_id"]
}
],
Below is the image illustrating the results I'm trying to achieve.
I'm not sure yet how to get the results in exactly the format you want, but as a start, something like this might be a step:
{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "dataTest"
},
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"filter": null,
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "d2_ts2",
"outputType": "STRING"
},
{
"type": "default",
"dimension": "d3_email",
"outputType": "STRING"
}
],
"aggregations": [
{
"type": "count",
"name": "myCount",
}
],
"descending": false
}
I'm curious, what is the use case?
Using a SQL query you can do it this way:
SELECT UserID,
sum(1) FILTER (WHERE BlockId = 'block1') as Block1,
sum(1) FILTER (WHERE BlockId = 'block2') as Block2,
sum(1) FILTER (WHERE BlockId = 'block3') as Block3
FROM inline_data
GROUP BY 1
The native query for this (taken from the explain) is:
{
"queryType": "topN",
"dataSource": {
"type": "table",
"name": "inline_data"
},
"virtualColumns": [
{
"type": "expression",
"name": "v0",
"expression": "1",
"outputType": "LONG"
}
],
"dimension": {
"type": "default",
"dimension": "UserID",
"outputName": "d0",
"outputType": "STRING"
},
"metric": {
"type": "dimension",
"previousStop": null,
"ordering": {
"type": "lexicographic"
}
},
"threshold": 101,
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"filter": null,
"granularity": {
"type": "all"
},
"aggregations": [
{
"type": "filtered",
"aggregator": {
"type": "longSum",
"name": "a0",
"fieldName": "v0",
"expression": null
},
"filter": {
"type": "selector",
"dimension": "BlockId",
"value": "block1",
"extractionFn": null
},
"name": "a0"
},
{
"type": "filtered",
"aggregator": {
"type": "longSum",
"name": "a1",
"fieldName": "v0",
"expression": null
},
"filter": {
"type": "selector",
"dimension": "BlockId",
"value": "block2",
"extractionFn": null
},
"name": "a1"
},
{
"type": "filtered",
"aggregator": {
"type": "longSum",
"name": "a2",
"fieldName": "v0",
"expression": null
},
"filter": {
"type": "selector",
"dimension": "BlockId",
"value": "block3",
"extractionFn": null
},
"name": "a2"
}
],
"postAggregations": [],
"context": {
"populateCache": false,
"sqlOuterLimit": 101,
"sqlQueryId": "bb92e899-c127-49b0-be1b-d4b38909d166",
"useApproximateCountDistinct": false,
"useApproximateTopN": false,
"useCache": false,
"useNativeQueryExplain": true
},
"descending": false
}

Replace values in Json using powershell

I have a JSON file in which I would like to change values and then save it again as JSON:
Values that need to be updated:
domain
repo
[
{
"name": "[concat(parameters('factoryName'), '/LS_New')]",
"type": "Microsoft.DataFactory/factories/linkedServices",
"apiVersion": "2018-06-01",
"properties": {
"description": "Connection",
"annotations": [],
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://url.net",
"accessToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "LS_vault",
"type": "LinkedServiceReference"
},
"secretName": "TOKEN"
},
"newClusterNodeType": "Standard_DS4_v2",
"newClusterNumOfWorker": "2:10",
"newClusterSparkEnvVars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"newClusterVersion": "7.2.x-scala2.12"
}
},
"dependsOn": [
"[concat(variables('factoryId'), '/linkedServices/LS_evaKeyVault')]"
]
},
{
"name": "[concat(parameters('factoryName'), '/PIP_Log')]",
"type": "Microsoft.DataFactory/factories/pipelines",
"apiVersion": "2018-06-01",
"properties": {
"description": "Unzip",
"activities": [
{
"name": "Parse",
"description": "This notebook",
"type": "DatabricksNotebook",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"notebookPath": "/dataPipelines/main_notebook.py",
"baseParameters": {
"businessgroup": {
"value": "#pipeline().parameters.businessgroup",
"type": "Expression"
},
"project": {
"value": "#pipeline().parameters.project",
"type": "Expression"
}
},
"libraries": [
{
"pypi": {
"package": "cytoolz"
}
},
{
"pypi": {
"package": "log",
"repo": "https://b73gxyht"
}
}
]
},
"linkedServiceName": {
"referenceName": "LS_o",
"type": "LinkedServiceReference"
}
}
],
"parameters": {
"businessgroup": {
"type": "string",
"defaultValue": "test"
},
"project": {
"type": "string",
"defaultValue": "log-analytics"
}
},
"annotations": []
},
"dependsOn": [
"[concat(variables('factoryId'), '/linkedServices/LS_o')]"
]
}
]
I tried using regex, but I am only able to update one value:
<valuesToReplace>
<valueToReplace>
<regExSearch>(\/PIP_Log[\w\W]*?[pP]roperties[\w\W]*?[lL]ibraries[\w\W]*?[pP]ypi[\w\W]*?"repo":\s)"(.*?[^\\])"</regExSearch>
<replaceWith>__PATValue__</replaceWith>
</valueToReplace>
<valueToReplace>
<regExSearch>('\/LS_New[\w\W]*?[pP]roperties[\w\W]*?[tT]ypeProperties[\w\W]*?"domain":\s"(.*?[^\\])")</regExSearch>
<replaceWith>__LSDomainName__</replaceWith>
</valueToReplace>
</valuesToReplace>
Here is the PowerShell code. The loop goes through all the values that are to be replaced.
I tried using a dynamic variable in Select-String and looping, but it doesn't seem to work:
foreach($valueToReplace in $configFile.valuesToReplace.valueToReplace)
{
$regEx = $valueToReplace.regExSearch
$replaceValue = '"' + $valueToReplace.replaceWith + '"'
$matches = [regex]::Matches($json, $regEx)
$matchExactValueRegex = $matches.Value | Select-String -Pattern """repo\D:\s*(.*)" | % {$_.Matches.Groups[1].Value}
$updateReplaceValue = $matches.Value | Select-String -Pattern "repo\D:\s\D__(.*)__""" | % {$_.Matches.Groups[1].Value}
$updateReplaceValue = """$patValue"""
$json1 = [regex]::Replace($json, $matchExactValueRegex , $updateReplaceValue)
$matchExactValueRegex1 = $matches.Value | Select-String -Pattern """domain\D:\s*(.*)" | % {$_.Matches.Groups[1].Value}
$updateReplaceValue1 = $matches.Value | Select-String -Pattern "domain\D:\s\D__(.*)__""" | % {$_.Matches.Groups[1].Value}
$updateReplaceValue1 = """$domainURL"""
$json = [regex]::Replace($json1, $matchExactValueRegex1 , $updateReplaceValue1)
}
else
{
Write-Warning "Inactive config value"
}
$json | Out-File $armFileWithReplacedValues
What am I missing?
You should not peek and poke at serialized files (such as JSON files) directly. Instead, deserialize the file with the ConvertFrom-Json cmdlet, make your changes to the resulting object, and serialize it again with the ConvertTo-Json cmdlet:
$Data = ConvertFrom-Json $Json
$Data[0].properties.typeproperties.domain = '__LSDomainName__'
$Data[1].properties.activities.typeproperties.libraries[1].pypi.repo = '__PATValue__'
$Data | ConvertTo-Json -Depth 9 | Out-File $armFileWithReplacedValues
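For completeness, a minimal end-to-end sketch of that approach (the input path variable $armFile is an assumption, since the question only names the output variable; the placeholder values come from the config XML above):
# Read the file as a single string (ConvertFrom-Json expects one string, hence -Raw)
$Json = Get-Content -Raw -Path $armFile
$Data = ConvertFrom-Json $Json

# First array element: the linked service -> set the Databricks domain placeholder
$Data[0].properties.typeProperties.domain = '__LSDomainName__'

# Second array element: the pipeline -> set the repo of the second pypi library
$Data[1].properties.activities[0].typeProperties.libraries[1].pypi.repo = '__PATValue__'

# Serialize again with a generous -Depth so nested objects are not truncated
$Data | ConvertTo-Json -Depth 20 | Out-File $armFileWithReplacedValues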

Search and replace JSON multiline using regex in VSCode

I have a really long JSON schema. Using VSCode, I need to change the partnerName type to be "string, null" (it appears more than 20 times; the snippet below is just one occurrence).
How can I do a multiline search and replace for the entire partnerName entry?
Based on another question, I've tried regexes such as [\n\s]+ and (.*\n)+, e.g.:
"partnerName": {(.*\n)+"type": "null"(.*\n)+}
But it's still not matching.
Search for:
"partnerName": {
"type": "null"
},
Replace with:
"partnerName": {
"type": "string, null"
},
Snippet example:
{
"type": "object",
"properties": {
"node": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"description": {
"type": "string"
},
"type": {
"type": "string"
},
"frequency": {
"type": "string"
},
"maxCount": {
"type": "integer"
},
"points": {
"type": "integer"
},
"startAt": {
"type": "string"
},
"endAt": {
"type": "string"
},
"partnerName": {
"type": "null"
},
"action": {
"type": "null"
}
},
"required": [
"id",
"name",
"description",
"type",
"frequency",
"maxCount",
"points",
"startAt",
"endAt",
"partnerName",
"action"
]
}
},
"required": [
"node"
]
},
Try this regex (with regex mode enabled in the search box):
(partnerName".*\n\s*"type":\s*)"null"
and replace with:
$1"string, null"
The parentheses capture everything from partnerName up to and including the "type": key, and $1 puts that captured text back in the replacement, so only the "null" value is rewritten to "string, null".

OrientDB ETL loading CSV with vertices in one file and edges in another

I have some data in two CSV files: one contains the vertices and the other contains the edges. I'm working out how to set this up using ETL and am close, but not quite there yet. It mostly works, but my edges have properties and I'm not sure they're loading correctly. This question was helpful but I'm still missing something...
Here's my data:
vertices.csv:
label,data,date
v01,0.1234,2015-01-01
v02,0.5678,2015-01-02
v03,0.9012,2015-01-03
edges.csv:
u,v,weight,date
v01,v02,12.4,2015-06-17
v02,v03,17.9,2015-09-14
I import my vertices using this:
commonVertices.json:
{
"begin": [
{ "let": { "name": "$filePath",
"expression": "$fileDirectory.append($fileName)"
}
}
],
"config": { "log": "info"},
"source": { "file": { "path": "$filePath" } },
"extractor": { "csv": { "ignoreEmptyLines": true,
"nullValue": "N/A",
"dateFormat": "yyyy-mm-dd"
}
},
"transformers": [
{ "vertex": { "class": "myVertex" } },
{ "code": { "language": "Javascript",
"code": "print(' Current record: ' + record); record;" }
}
],
"loader": { "orientdb": {
"dbURL": "plocal:my_orientdb",
"dbType": "graph",
"batchCommit": 1000,
"classes": [ { "name": "myVertex", "extends", "V" },
],
"indexes": []
}
}
}
vertices.json:
{ "config": { "log": "info",
"fileDirectory": "./",
"fileName": "vertices.csv"
}
}
commonEdges.json:
{
"begin": [
{ "let": { "name": "$filePath",
"expression": "$fileDirectory.append($fileName )"
}
}
],
"config": { "log": "info"
},
"source": { "file": { "path": "$filePath" } },
"extractor": { "csv": { "ignoreEmptyLines": true,
"nullValue": "N/A",
"dateFormat": "yyyy-mm-dd"
}
},
"transformers": [
{ "merge": { "joinFieldName": "u", "lookup": "myVertex.label" } },
{ "edge": { "class": "myEdge",
"joinFieldName": "v",
"lookup": "myVertex.label",
"direction": "out",
"unresolvedLinkAction": "NOTHING"
}
},
{ "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:my_orientdb",
"dbType": "graph",
"batchCommit": 1000,
"useLightweightEdges": false,
"classes": [
{ "name": "myEdge", "extends", "E" }
],
"indexes": []
}
}
}
edges.json:
{
"config": {
"log": "info",
"fileDirectory": "./",
"fileName": "edges.csv"
}
}
I am running it with oetl.sh like this:
$ oetl.sh vertices.json commonVertices.json
$ oetl.sh edges.json commonEdges.json
Everything runs, but when I query the edges I don't see the weight and date fields (I'm new to OrientDB, so maybe the properties are there and I'm just not seeing them):
orientdb {db=my_orientdb}> SELECT FROM myEdge
+----+-----+------+-----+-----+
|# |#RID |#CLASS|out |in |
+----+-----+------+-----+-----+
|0 |#33:0|myEdge|#25:0|#26:0|
|1 |#34:0|myEdge|#26:0|#27:0|
+----+-----+------+-----+-----+
The vertex table contains the [weight] field from my edges.csv, and the [date] field is getting clobbered in a weird way: the day of the month is getting overwritten with the day from edges.csv, which is undesirable, but it's odd to me that the month itself isn't also getting changed:
orientdb {db=my_orientdb}> SELECT FROM myVertex
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|# |#RID |#CLASS |data |date |label|weight|out_myEdge|in_myEdge|
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|0 |#25:0|myVertex|0.1234|2015-01-17 00:06:00|v01 |12.4 |[#33:0] | |
|1 |#26:0|myVertex|0.5678|2015-01-14 00:09:00|v02 |17.9 |[#34:0] |[#33:0] |
|2 |#27:0|myVertex|0.9012|2015-01-03 00:01:00|v03 | | |[#34:0] |
+----+-----+--------+------+-------------------+-----+------+----------+---------+
I'm sure it's probably a simple tweak, any help would be great!
In the edge transformer, use edgeFields to bind properties onto the edges. Example:
"transformers": [
{ "merge": { "joinFieldName": "u", "lookup": "myVertex.label" } },
{ "edge": { "class": "myEdge",
"joinFieldName": "v",
"lookup": "myVertex.label",
"edgeFields": { "weight": "${input.weight}", "date": "${input.date}" },
"direction": "out",
"unresolvedLinkAction": "NOTHING"
}
},
{ "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
],
Hope it helps.