Split big JSON file by type - PowerShell

Consider a big JSON file in this format (e.g. all.json):
[
  {
    "Name": "abc",
    "Type": "movie"
  },
  {
    "Name": "bcd",
    "Type": "series"
  },
  {
    "Name": "asd",
    "Type": "movie"
  },
  {
    "Name": "sdf",
    "Type": "series"
  }
]
I want to split this file into two files by type:
series.json
[
  {
    "Name": "bcd",
    "Type": "series"
  },
  {
    "Name": "sdf",
    "Type": "series"
  }
]
movie.json
[
  {
    "Name": "abc",
    "Type": "movie"
  },
  {
    "Name": "asd",
    "Type": "movie"
  }
]
What is the best approach to do this split using PowerShell? Can someone help?

Try this:
# Import the data from the file
# (-Raw reads the file as a single string so ConvertFrom-Json parses it as one document)
$Array = Get-Content "C:\temp\test.json" -Raw | ConvertFrom-Json
# Group the data by Type and export one file per group
$Array | Group-Object Type | ForEach-Object {
    $File = "C:\temp\{0}.json" -f $_.Name
    $_.Group | ConvertTo-Json | Out-File $File
}
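One caveat if your objects are more nested than in this example: ConvertTo-Json only serializes two levels deep by default, so pass -Depth explicitly when the items contain nested properties. A minimal sketch of the same grouping with that flag (the depth value of 10 is an assumption):
# Same grouping as above, but with -Depth so nested properties are not truncated
$Array = Get-Content "C:\temp\test.json" -Raw | ConvertFrom-Json
$Array | Group-Object Type | ForEach-Object {
    $_.Group | ConvertTo-Json -Depth 10 | Out-File ("C:\temp\{0}.json" -f $_.Name)
}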

PubSub Subscription error with REPEATED Column Type - Avro Schema

I am trying to use the Pub/Sub subscription "Write to BigQuery" but am running into an issue with the "REPEATED" column type. The message I get when updating the subscription is:
Incompatible schema mode for field 'Values': field is REQUIRED in the topic schema, but REPEATED in the BigQuery table schema
My Avro Schema is:
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "ItemID",
      "type": "string"
    },
    {
      "name": "UserType",
      "type": "string"
    },
    {
      "name": "Values",
      "type": [
        {
          "type": "record",
          "name": "Values",
          "fields": [
            {
              "name": "AttributeID",
              "type": "string"
            },
            {
              "name": "AttributeValue",
              "type": "string"
            }
          ]
        }
      ]
    }
  ]
}
Input JSON that "matches" the schema:
{
  "ItemID": "Item_1234",
  "UserType": "Item",
  "Values": {
    "AttributeID": "TEST_ID_1",
    "AttributeValue": "Value_1"
  }
}
My table looks like:
ItemID         | STRING | NULLABLE
UserType       | STRING | NULLABLE
Values         | RECORD | REPEATED
AttributeID    | STRING | NULLABLE
AttributeValue | STRING | NULLABLE
I am able to "Test" and "Validate Schema" and it comes back with a success. The question is: what am I missing in the Avro schema for the Values node to make it "REPEATED" instead of "REQUIRED" so that the subscription can be created?
The issue is that Values is not an array type in your Avro schema, meaning it expects only one in the message, while it is a repeated type in your BigQuery schema, meaning it expects a list of them.
Per Kamal's comment above, this schema works:
{
  "type": "record",
  "name": "Avro",
  "fields": [
    {
      "name": "ItemID",
      "type": "string"
    },
    {
      "name": "UserType",
      "type": "string"
    },
    {
      "name": "Values",
      "type": {
        "type": "array",
        "items": {
          "name": "NameDetails",
          "type": "record",
          "fields": [
            {
              "name": "ID",
              "type": "string"
            },
            {
              "name": "Value",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}
The payload:
{
  "ItemID": "Item_1234",
  "UserType": "Item",
  "Values": [
    { "AttributeID": "TEST_ID_1" },
    { "AttributeValue": "Value_1" }
  ]
}
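To illustrate the required shape, here is a minimal PowerShell sketch that builds a message body whose Values node is an array (field names taken from the corrected schema above; the concrete values are placeholders):
# Build a message whose Values property serializes as a JSON array,
# matching the array-of-records shape the corrected Avro schema expects
$message = [ordered]@{
    ItemID   = "Item_1234"
    UserType = "Item"
    Values   = @(
        [ordered]@{ ID = "TEST_ID_1"; Value = "Value_1" }
    )
}
$message | ConvertTo-Json -Depth 5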

Druid Using multiple dimensions for a Dimension Extraction Function

Is it possible to use multiple dimensions for a dimension extraction function?
Something like:
{
  "type": "extraction",
  "dimension": ["dimension_1", "dimension_2"],
  "outputName": "new_dimension",
  "outputType": "STRING",
  "extractionFn": {
    "type": "javascript",
    "function": "function(x, y){ // do sth with both x and y to return the result }"
  }
}
I do not think this is possible. However, you can achieve something like it by first "merging" the two dimensions into one using a virtualColumn and then applying an extraction function to that combined dimension. You can then split the values again.
Example query (using https://github.com/level23/druid-client)
$client = new DruidClient([
    "router_url" => "https://your.druid"
]);

// Build a groupBy query.
$builder = $client->query("hits")
    ->interval("now - 1 hour/now")
    ->select("os_name")
    ->select("browser")
    ->virtualColumn("concat(os_name, ';', browser)", "combined")
    ->sum("hits")
    ->select("combined", "coolBrowser", function (ExtractionBuilder $extractionBuilder) {
        $extractionBuilder->javascript("function(t) { parts = t.split(';'); return parts[0] + ' with cool ' + parts[1] ; }");
    })
    ->where("os_name", "!=", "")
    ->where("browser", "!=", "")
    ->orderBy("hits", "desc");

// Execute the query.
$response = $builder->groupBy();
Example result:
+--------+-----------------------------------+-----------------+------------+
| hits   | coolBrowser                       | browser         | os_name    |
+--------+-----------------------------------+-----------------+------------+
| 418145 | Android with cool Chrome Mobile   | Chrome Mobile   | Android    |
| 62937  | Windows 10 with cool Edge         | Edge            | Windows 10 |
| 27956  | Android with cool Samsung Browser | Samsung Browser | Android    |
| 9460   | iOS with cool Safari              | Safari          | iOS        |
+--------+-----------------------------------+-----------------+------------+
Raw native Druid JSON query:
{
  "queryType": "groupBy",
  "dataSource": "hits",
  "intervals": [
    "2021-10-15T11:25:23.000Z/2021-10-15T12:25:23.000Z"
  ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "os_name",
      "outputType": "string",
      "outputName": "os_name"
    },
    {
      "type": "default",
      "dimension": "browser",
      "outputType": "string",
      "outputName": "browser"
    },
    {
      "type": "extraction",
      "dimension": "combined",
      "outputType": "string",
      "outputName": "coolBrowser",
      "extractionFn": {
        "type": "javascript",
        "function": "function(t) { parts = t.split(\";\"); return parts[0] + \" with cool \" + parts[1] ; }",
        "injective": false
      }
    }
  ],
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "not",
        "field": {
          "type": "selector",
          "dimension": "os_name",
          "value": ""
        }
      },
      {
        "type": "not",
        "field": {
          "type": "selector",
          "dimension": "browser",
          "value": ""
        }
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "hits",
      "fieldName": "hits"
    }
  ],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "combined",
      "expression": "concat(os_name, ';', browser)",
      "outputType": "string"
    }
  ],
  "context": {
    "groupByStrategy": "v2"
  },
  "limitSpec": {
    "type": "default",
    "columns": [
      {
        "dimension": "hits",
        "direction": "descending",
        "dimensionOrder": "lexicographic"
      }
    ]
  }
}

Transform nested JSON in Data Factory to SQL

New to Data Factory. I have a JSON file that I need to manipulate, but I can't figure out how to go about it. The file has a generic "Name" property, but it should have the value as the key name. How can I get the value as the key?
So far I have been getting complex JSON errors. This JSON is coming from the file store.
[
  {
    "Version": "1.1",
    "Documents": [
      {
        "DocumentState": "Correct",
        "DocumentData": {
          "Name": "Name1",
          "$type": "Document",
          "Fields": [
            {
              "Name": "Form",
              "$type": "Text",
              "Value": "Birthday Form"
            },
            {
              "Name": "Date",
              "$type": "Text",
              "Value": "12/1/1999"
            },
            {
              "Name": "FirstName",
              "$type": "Text",
              "Value": "John"
            },
            {
              "Name": "FirstName",
              "$type": "Text",
              "Value": "Smith"
            }
          ]
        }
      }
    ]
  },
  {
    "Version": "1.1",
    "Documents": [
      {
        "DocumentState": "Correct",
        "DocumentData": {
          "Name": "Name2",
          "$type": "Document",
          "Fields": [
            {
              "Name": "Form",
              "$type": "Text",
              "Value": "Entry Form"
            },
            {
              "Name": "Date",
              "$type": "Text",
              "Value": "4/3/2010"
            },
            {
              "Name": "FirstName",
              "$type": "Text",
              "Value": "Jane"
            },
            {
              "Name": "LastName",
              "$type": "Text",
              "Value": "Doe"
            }
          ]
        }
      }
    ]
  }
]
Expected output
DocumentData: [
  {
    "Form": "Birthday Form",
    "Date": "12/1/1999",
    "FirstName": "John",
    "LastName": "Smith"
  },
  {
    "Form": "Entry Form",
    "Date": "4/3/2010",
    "FirstName": "Jane",
    "LastName": "Doe"
  }
]
@jaimers,
I was able to achieve it by making use of the Data Flow activity.
The below is the complete data flow.
1) Source1
This step involves getting the data from the source. You will have to configure the source dataset.
The only change I made in the source was to convert Fields.Name, Fields.Type and Fields.Value to string[] (from string).
This was required to create key/value pairs of the fields in the subsequent steps.
2) Flatten1
I made use of Flatten at the Document level
and got the values of DocumentData.DocumentName and DocumentData.Fields.
Note: If you don't want DocumentData.DocumentName, you can safely ignore it.
3) DerivedColumn1
This is the actual step where I convert name:key1, value:value1 into key1:value1.
To do that I made use of the below expression:
keyValues(Fields.Name,Fields.Value)
Note: The keyValues() function expects two array arguments. Hence, in the first step we changed the type of Fields.Name and Fields.Value to array.
4) Select
Just to select the columns that need to be sent as output.
Output
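For comparison, the same Fields name/value pivot can be sketched in plain PowerShell, which may help make the intended transformation concrete (the input path is an assumption, and this is not a Data Flow expression):
# Pivot each document's Fields array (Name/Value pairs) into one object keyed by field name
$docs = Get-Content "C:\temp\input.json" -Raw | ConvertFrom-Json
$result = foreach ($entry in $docs) {
    foreach ($doc in $entry.Documents) {
        $obj = [ordered]@{}
        foreach ($field in $doc.DocumentData.Fields) {
            $obj[$field.Name] = $field.Value
        }
        [pscustomobject]$obj
    }
}
$result | ConvertTo-Json
Each Fields entry contributes one property to the output object, so the result has the same shape as the expected DocumentData output above.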
You mentioned SQL in your title, so if you have access to a SQL database, e.g. Azure SQL DB, then it is quite capable of manipulating JSON, e.g. using the OPENJSON and FOR JSON PATH methods. A simple example:
DECLARE @json VARCHAR(MAX) = '[
  {
    "Version": "1.1",
    "Documents": [
      {
        "DocumentState": "Correct",
        "DocumentData": {
          "Name": "Name1",
          "$type": "Document",
          "Fields": [
            {
              "Name": "Form",
              "$type": "Text",
              "Value": "Birthday Form"
            },
            {
              "Name": "Date",
              "$type": "Text",
              "Value": "12/1/1999"
            },
            {
              "Name": "FirstName",
              "$type": "Text",
              "Value": "John"
            },
            {
              "Name": "FirstName",
              "$type": "Text",
              "Value": "Smith"
            }
          ]
        }
      }
    ]
  },
  {
    "Version": "1.1",
    "Documents": [
      {
        "DocumentState": "Correct",
        "DocumentData": {
          "Name": "Name2",
          "$type": "Document",
          "Fields": [
            {
              "Name": "Form",
              "$type": "Text",
              "Value": "Entry Form"
            },
            {
              "Name": "Date",
              "$type": "Text",
              "Value": "4/3/2010"
            },
            {
              "Name": "FirstName",
              "$type": "Text",
              "Value": "Jane"
            },
            {
              "Name": "LastName",
              "$type": "Text",
              "Value": "Doe"
            }
          ]
        }
      }
    ]
  }
]'
-- Restructure the JSON and add a root
SELECT *
FROM OPENJSON ( @json )
WITH
(
    Form      VARCHAR(50) '$.Documents[0].DocumentData.Fields[0].Value',
    [Date]    DATE        '$.Documents[0].DocumentData.Fields[1].Value',
    FirstName VARCHAR(50) '$.Documents[0].DocumentData.Fields[2].Value',
    LastName  VARCHAR(50) '$.Documents[0].DocumentData.Fields[3].Value'
)
FOR JSON PATH, ROOT('DocumentData');
My results:
NB I've used the ROOT clause to add a root to the JSON document. You could make @json a stored proc parameter and use a Stored Procedure task from the pipeline.

Replace values in JSON using PowerShell

I have a JSON file in which I would like to change values and then save it again as JSON.
Values that need to be updated:
domain
repo
[
  {
    "name": "[concat(parameters('factoryName'), '/LS_New')]",
    "type": "Microsoft.DataFactory/factories/linkedServices",
    "apiVersion": "2018-06-01",
    "properties": {
      "description": "Connection",
      "annotations": [],
      "type": "AzureDatabricks",
      "typeProperties": {
        "domain": "https://url.net",
        "accessToken": {
          "type": "AzureKeyVaultSecret",
          "store": {
            "referenceName": "LS_vault",
            "type": "LinkedServiceReference"
          },
          "secretName": "TOKEN"
        },
        "newClusterNodeType": "Standard_DS4_v2",
        "newClusterNumOfWorker": "2:10",
        "newClusterSparkEnvVars": {
          "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
        },
        "newClusterVersion": "7.2.x-scala2.12"
      }
    },
    "dependsOn": [
      "[concat(variables('factoryId'), '/linkedServices/LS_evaKeyVault')]"
    ]
  },
  {
    "name": "[concat(parameters('factoryName'), '/PIP_Log')]",
    "type": "Microsoft.DataFactory/factories/pipelines",
    "apiVersion": "2018-06-01",
    "properties": {
      "description": "Unzip",
      "activities": [
        {
          "name": "Parse",
          "description": "This notebook",
          "type": "DatabricksNotebook",
          "dependsOn": [],
          "policy": {
            "timeout": "7.00:00:00",
            "retry": 0,
            "retryIntervalInSeconds": 30,
            "secureOutput": false,
            "secureInput": false
          },
          "userProperties": [],
          "typeProperties": {
            "notebookPath": "/dataPipelines/main_notebook.py",
            "baseParameters": {
              "businessgroup": {
                "value": "@pipeline().parameters.businessgroup",
                "type": "Expression"
              },
              "project": {
                "value": "@pipeline().parameters.project",
                "type": "Expression"
              }
            },
            "libraries": [
              {
                "pypi": {
                  "package": "cytoolz"
                }
              },
              {
                "pypi": {
                  "package": "log",
                  "repo": "https://b73gxyht"
                }
              }
            ]
          },
          "linkedServiceName": {
            "referenceName": "LS_o",
            "type": "LinkedServiceReference"
          }
        }
      ],
      "parameters": {
        "businessgroup": {
          "type": "string",
          "defaultValue": "test"
        },
        "project": {
          "type": "string",
          "defaultValue": "log-analytics"
        }
      },
      "annotations": []
    },
    "dependsOn": [
      "[concat(variables('factoryId'), '/linkedServices/LS_o')]"
    ]
  }
]
I tried using regex, but I am only able to update one value:
<valuesToReplace>
  <valueToReplace>
    <regExSearch>(\/PIP_Log[\w\W]*?[pP]roperties[\w\W]*?[lL]ibraries[\w\W]*?[pP]ypi[\w\W]*?"repo":\s)"(.*?[^\\])"</regExSearch>
    <replaceWith>__PATValue__</replaceWith>
  </valueToReplace>
  <valueToReplace>
    <regExSearch>('\/LS_New[\w\W]*?[pP]roperties[\w\W]*?[tT]ypeProperties[\w\W]*?"domain":\s"(.*?[^\\])")</regExSearch>
    <replaceWith>__LSDomainName__</replaceWith>
  </valueToReplace>
</valuesToReplace>
Here is the PowerShell code. The loop goes through all the values that are to be replaced.
I tried using a dynamic variable in Select-String and looping, but it doesn't seem to work:
foreach ($valueToReplace in $configFile.valuesToReplace.valueToReplace)
{
    $regEx = $valueToReplace.regExSearch
    $replaceValue = '"' + $valueToReplace.replaceWith + '"'
    $matches = [regex]::Matches($json, $regEx)
    $matchExactValueRegex = $matches.Value | Select-String -Pattern """repo\D:\s*(.*)" | % {$_.Matches.Groups[1].Value}
    $updateReplaceValue = $matches.Value | Select-String -Pattern "repo\D:\s\D__(.*)__""" | % {$_.Matches.Groups[1].Value}
    $updateReplaceValue = """$patValue"""
    $json1 = [regex]::Replace($json, $matchExactValueRegex, $updateReplaceValue)
    $matchExactValueRegex1 = $matches.Value | Select-String -Pattern """domain\D:\s*(.*)" | % {$_.Matches.Groups[1].Value}
    $updateReplaceValue1 = $matches.Value | Select-String -Pattern "domain\D:\s\D__(.*)__""" | % {$_.Matches.Groups[1].Value}
    $updateReplaceValue1 = """$domainURL"""
    $json = [regex]::Replace($json1, $matchExactValueRegex1, $updateReplaceValue1)
}
else
{
    Write-Warning "Inactive config value"
}
$json | Out-File $armFileWithReplacedValues
What am I missing?
You should not peek and poke in serialized files (e.g. JSON files) directly. Instead, deserialize the file with the ConvertFrom-Json cmdlet, make your changes to the object, and serialize it again with the ConvertTo-Json cmdlet:
# Deserialize, update the two values, then serialize back to disk
$Data = ConvertFrom-Json $Json
$Data[0].properties.typeProperties.domain = '__LSDomainName__'
$Data[1].properties.activities.typeProperties.libraries[1].pypi.repo = '__PATValue__'
$Data | ConvertTo-Json -Depth 9 | Out-File $armFileWithReplacedValues
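If $Json is not already in memory, the full round trip might look like this minimal sketch (the file paths are assumptions):
# Read the ARM template as one string, patch the two values, and write the result
# (-Raw keeps the file as a single string so ConvertFrom-Json parses the whole document at once)
$Json = Get-Content "C:\temp\arm_template.json" -Raw
$Data = ConvertFrom-Json $Json
$Data[0].properties.typeProperties.domain = '__LSDomainName__'
$Data[1].properties.activities.typeProperties.libraries[1].pypi.repo = '__PATValue__'
$Data | ConvertTo-Json -Depth 9 | Set-Content "C:\temp\arm_template_replaced.json"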

Not able to fetch the individual details from JSON data

"Ns": {
"value": [
{
"Nname": "exa",
"SR": [
{
"name": "port1",
"properties": {
"description": "Allow port1",
"destinationPortRange": "1111",
"priority": 100
}
},
{
"name": "port1_0",
"properties": {
"description": "Allow port1",
"destinationPortRange": "1111",
"priority": 150
}
},
{
"name": "port2",
"properties": {
"description": "Allow 1115",
"destinationPortRange": "1115",
"priority": 100,
}
}
]
}
]
}
I want to assert the details of priority and name but was not able to do it.
Here is what I have implemented:
$Ndetails = templateProperties.parameters.Ns.value.SR
foreach ($Ndata in $Ndetails) {
    $Ndata .properties.destinationPortRange |
        Should -BeExactly @('1111','1111','1115')
}
How can I resolve this using Pester in PowerShell?
You don't need to use foreach for this; you can just use Select-Object. Assuming your JSON is as @Mark Wragg linked in the comments:
$Json = @'
[{
  "Ns": {
    "value": [{
      "Nname": "exa",
      "SR": [{
          "name": "port1",
          "properties": {
            "description": "Allow port1",
            "destinationPortRange": "1111",
            "priority": 100
          }
        },
        {
          "name": "port1_0",
          "properties": {
            "description": "Allow port1",
            "destinationPortRange": "1111",
            "priority": 150
          }
        },
        {
          "name": "port2",
          "properties": {
            "description": "Allow 1115",
            "destinationPortRange": "1115",
            "priority": 100
          }
        }
      ]
    }]
  }
}]
'@
$t = $Json | ConvertFrom-Json
Your test file should look like this:
$result = $t.Ns.value.SR.properties.destinationPortRange
It 'destinationPortRange matches' {
    $result | Should -BeExactly @('1111','1111','1115')
}
Explanation
Your use of foreach was incorrect, as you compared a single element (also notice that I deleted the unnecessary space)
$Ndata.properties.destinationPortRange
to the array
| Should -BeExactly @('1111','1111','1115')
What you have to do is compare array to array, as in my example.
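Putting it together, a complete test file might look like this minimal sketch (the file name, the Describe name, and the JSON path are assumptions; Pester v5 syntax):
# JsonPorts.Tests.ps1 - run with: Invoke-Pester -Path .\JsonPorts.Tests.ps1
Describe 'destinationPortRange assertions' {
    BeforeAll {
        # Load and parse the JSON once for all tests in this block
        $Json = Get-Content "C:\temp\template.json" -Raw
        $t = $Json | ConvertFrom-Json
        $result = $t.Ns.value.SR.properties.destinationPortRange
    }
    It 'destinationPortRange matches' {
        $result | Should -BeExactly @('1111','1111','1115')
    }
}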