Split json using NiFi - mongodb

I have a json with all the records with merged I need to split the merged json and load in separate database using NiFi
My file when I execute
db.collection.findOne()
My input looks like:
[
{
"name": "sai",
"id": 101,
"company": "adsdr"
},
{
"name": "siva",
"id": 102,
"company": "shar"
},
{
"name": "vanai",
"id": 103,
"company": "ddr"
},
{
"name": "karti",
"id": 104,
"company": "sir"
}
]
I am getting all the json. I need to get output as:
{name: "sai", id:101, company: "sdr"}
So i only want one record, how can I parse the json using NiFi?

There is a SplitJson processor for this purpose:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitJson/index.html
There are various JSON Path testers online to come up with the correct expression:
https://jsonpath.curiousconcept.com/

Use Split json processor with below configs as shown in the below screenshot
SplitJson Config

As Bryan said, you can use SplitJson processor, and then, you can forward the splitted data flow to other databases. The processor internally using this json pathfinder. You can read there the operations that the processor supports.
Just use this to get the first element by:
// JSON Path Expression for the first element:
$[0]
[
{
"name": "sai",
"id": 101,
"company": "adsdr"
}
]

Related

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this, it's a list of tables and their data types
"Tables": [
{
"name": "account",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "250",
"name": "name"
}
]
},
{
"name": "customer",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "100",
"name": "name"
}
]
}
]
What we need to do is to loop through this JSON and via an ADF data flow we create the required tables in the destination database.
We initially designed the Pipeline with a lookup activity that loads the JSON file then pass the output to a foreach loop. This worked really well when we had only a small JSON file but as we started using real data, the JSON file was over the limit of 4MB resulting in the lookup activity throwing an error.
We then tried using a mapping dataflow by loading the JSON as a source, then setting the sink as a cache and outputting this to an output variable which we then loop through but again this works with smaller datasets but as soon as the dataset is large enough it can't parse it to an output.
I am sure this should be easy to do but just can't get my head around it!
Here is the sample procedure to loop through large JSON file in a Dataflows.
Create a Linked service and dataset to the json file path.
Provision the dataset to the source in the dataflows.
By the flatten formatter will get the input columns from the source through Unroll by option with required input.
Create linked service and dataset to the sink path.
Attach the data flow work item to the Data Flow activity.
Will get result as per the expectations in the sql db.

Fanout data from Kafka to S3 using kafka connect

My kafka gets a wrapper payload in Json format. The wrapper payload looks like this:
{
"format": "wrapper",
"time": 1626814608000,
"events": [
{
"id": "item1",
"type": "product1",
"count": 200
},
{
"id": "item2",
"type": "product2",
"count": 300
}
],
"metadata": {
"schema": "schema-1"
}
}
I should export this to S3. But the catch here is, I should not store the wrapper. Instead, I should be storing the individual events based on the item.
For example, it should be stored in S3 as follows:
bucket/product1:
{"id": "item1", "type": "product1", "count": 200}
bucket/product2:
{"id": "item2", "type": "product2", "count": 300}
If you notice, the input is the wrapper with those events internally. However, my output should be each of those individual events stored in S3 in the same bucket with the product type as prefix.
My question is, is it possible to use Kafka Connect to do this? I see they have this Single Message transformer which seems to be a way to mutate data inside the object, but not to fanout like what I want. Even the signature looks like an R=>R
https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java
So based on my research, it does not seem possible. But I want to check if I am missing something before using a different option.
Transforms accept one event and output one event.
You need to use a stream processor such as Kafka Streams branch or flatMap functions to split an array of events into multiple events or multiple topics.

Azure Data Factory Copy Activity - Append to JSON File

I am working on creating a data factory pipeline that copies data from a REST API endpoint to Azure Blob Storage. The API has a limitation of only returning 1000 records at a time, so I have built in a loop into my pipeline that will iterate through all of the pages. What I am wondering is - would it be possible to use the copy activity to append to the same file in the Azure Blob, rather than create a separate file for each page?
Below is what the API response looks like. The only value that I need from each response is the "records" list, so I was thinking if it is possible, I could get rid of the other stuff and just keep appending to the same file as the loop runs - although I do not know if the copy activity is capable of doing this. Would this be possible? Or is the only way to do this is to land all the responses as separate files in Blob Storage and then combine them after the fact?
Thank You
{
"totalResults": 8483,
"pageResults": 3,
"timeStamp": "2020/08/24 10:43:26",
"parameters": {
"page": 1,
"resultsPerPage": 3,
"filters": [],
"fields": [
"lastName",
"firstName",
"checklistItemsAssigned",
"checklistItemsStarted",
"checklistItemsCompleted",
"checklistItemsOverdue"
],
"sort": {
"field": "lastName",
"direction": "asc"
}
},
"records": [
{
"checklistItemsAssigned": 10,
"lastName": "One",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 10,
"checklistItemsCompleted": 10
},
{
"checklistItemsAssigned": 5,
"lastName": "Two",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 5,
"checklistItemsCompleted": 5
},
{
"checklistItemsAssigned": 5,
"lastName": "Three",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 5,
"checklistItemsCompleted": 5
}
]
}
ADF's Copy activity supports copying blobs from block, append, or page type of blobs but copying data to only block blobs. Blobk blobs can only be overwritten.
You can probably create an append type of blob using Storage SDK, but it would be an overkill for most of the project. I would go with creating new blobs and merging them at the last stage.

how to create Jsonpath file to load data in redshift

one of my sample record for Json:
{
"viewerId": "Ext-04835139",
"sid5": "269410578:2995631181:2211755370:3307088398:33879957",
"firstHbTimems": 1.506283958371E12,
"ipAddress": "74.58.57.31",
"streamUrl": "https://dc3-ll-livedazn-dznlivejp.hs.llnwd.net/live/channel/1007/all/stream.m3u8?event_id=61824040049&h=c912885e2a69ffa7ea84f45dc18c004d",
"asset": "[nlq9biy7trxl1cjceg70rogvd] Saints # Panthers",
"os": "IOS",
"osVersion": "10.3.3",
"deviceModel": "iPhone",
"geoInfo": {
"city": 63666,
"state": 3851,
"isp": 120,
"longitudeTimes1K": -73562,
"country": 37,
"dma": 0,
"asn": 5769,
"latitudeTimes1K": 45502,
"publicIP": 1245329695
},
"totalPlayingTime": 4.097,
"totalBufferingTime": 0.0,
"VST": 1.411,
"avgBitrate": 202.0,
"playStateSwitch": [
"{'seqNum': 0, 'eventNum': 0, 'sessionTimeMs': 7, 'startPlayState': 'eUnknown', 'endPlayState': 'eBuffering'}",
"{'seqNum': 1, 'eventNum': 5, 'sessionTimeMs': 1411, 'startPlayState': 'eBuffering', 'endPlayState': 'ePlaying'}"
],
"bitrateSwitch": [
],
"errorEvent": [
],
"tags": {
"LSsportName": "Football",
"c3.device.model": "iPhone+6+Plus",
"LSvideoType": "LIVE",
"c3.device.ua": "DAZN%2F5560+CFNetwork%2F811.5.4+Darwin%2F16.7.0",
"LSfixtureId": "5trxst8tv7slixckvawmtf949",
"genre": "Sport",
"LScompetitionName": "NFL+Game+Pass",
"show": "NFL+Game+Pass",
"c3.cmp.0._type": "DEVATLAS",
"c3.protocol.type": "cws",
"LSsportId": "9ita1e50vxttzd1xll3iyaulu",
"stageId": "8hm0ew6b8m7907ty8vy8tu4tl",
"LSvenueId": "na",
"syndicator": "None",
"applicationVersion": "2.0.8",
"deviceConnectionType": "wifi",
"c3.client.marketingName": "iPhone+6+Plus",
"playerVersion": "1.2.6.0",
"c3.cmp.0._id": "da",
"drmType": "AES128",
"c3.sh": "dc3-ll-livedazn-dznlivejp.hs.llnwd.net",
"c3.pt.ver": "10.3.3",
"applicationType": "ios",
"c3.viewer.id": "Ext-04835139",
"LSinterfaceLanguage": "en",
"c3.pt.os": "IOS",
"playerVendor": "Open+Source",
"c3.client.brand": "Apple",
"c3.cws.sf": "7",
"c3.cmp.0._ver": "1",
"c3.client.hwType": "Mobile+Phone",
"c3.pt.os.ver": "10.3.3",
"isAd": "false",
"c3.device.cver.bld": "2.124.0.33357",
"stageName": "Regular+Season",
"c3.client.osName": "iOS",
"contentType": "Live",
"c3.device.cver": "2.124.0",
"LScompetitionId": "wy3kluvb4efae1of0d8146c1",
"expireDate": "na",
"c3.client.model": "iPhone+6+Plus",
"c3.client.manufacturer": "Apple",
"LSproductionValue": "na",
"pubDate": "2017-09-23",
"c3.cluster.name": "production",
"accountType": "FreeTrial",
"c3.adaptor.type": "eCws1_7",
"c3.device.brand": "iPhone",
"c3.pt.br": "Non-Browser+Apps",
"contentId": "nlq9biy7trxl1cjceg70rogvd",
"streamingProtocol": "FairPlay",
"LSvenueName": "na",
"c3.device.type": "Mobile",
"c3.protocol.level": "2.4",
"c3.player.name": "AVPlayer",
"contentName": "Saints+%40+Panthers",
"c3.device.manufacturer": "Apple",
"c3.framework": "AVFoundation",
"c3.pt": "iOS",
"c3.device.ver": "6+Plus",
"c3.video.isLive": "T",
"c3.cmp.0._cfg_ver": "1504808821",
"c3.cws.clv": "2.124.0.33357",
"LScountryCode": "America%2FEl_Salvador"
},
"playername": "AVPlayer",
"isLive": "T",
"playerVersion": "1.2.6.0"
}
How to create jsonpath file to load it in redshift ?
Thanks
You have a nested array within your json - so a jsonpath will not expand that out for you.
You have a couple of choices on how to proceed:
You can load your data at the higher level (e.g. playStateSwitch
rather than seqNum within that) - and then try to use redshift to
process that data. This can be tricky as you cannot explode json
data from an array in redshift.
You can preprocess the data using e.g. aws glue / python / pyspark
or some other etl tool that can handle these nested arrays.
It all depends on the end goal, which is not clear form the above description.
I will approach the solution in the following order
Define which fields and array values that are required to be loaded into the Redshift. If the need is to copy all the records then the next check is how to handle the multiple array records.
If array or key/value are missing as part of JSON source then JSONPath will not work as is - So, better to update the JSON to add the missing array prior to COPY the data set over to RS.
The JSON update can be done using Linux commands or external tools like JP or refer additional reference
If all the values in the nested arrays are required then an alternative work around will be using external table an example
Otherwise, the JSONPATH file can be developed in this format
{
"jsonpaths": [
"$.viewerId", ///root level fields
...
"$geoInfo.city", /// object hierarchy
...
"$playStateSwitch[0].seqNum" ///define the required array number
...
]
}
Hope, this helps.

Does the OData protocol provide a way to transform an array of objects to an array of raw values?

Is there a way specify in an OData query that instead of certain name/value pairs being returned, a raw array should be returned instead? For example, if I have an OData query that results in the following:
{
"#odata.context": "http://blah.org/MyService/$metadata#People",
"value": [
{
"Name": "Joe Smith",
"Age": 55,
"Employers": [
{
"Name": "Acme",
"StartDate": "1/1/1990"
},
{
"Name": "Enron",
"StartDate": "1/1/1995"
},
{
"Name": "Amazon",
"StartDate": "1/1/1999"
}
]
},
{
"Name": "Jane Doe",
"Age": 30,
"Employers": [
{
"Name": "Joe's Crab Shack",
"StartDate": "1/1/2007"
},
{
"Name": "TGI Fridays",
"StartDate": "1/1/2010"
}
]
}
]
}
Is there anything I can add to the query to instead get back:
{
"#odata.context": "http://blah.org/MyService/$metadata#People",
"value": [
{
"Name": "Joe Smith",
"Age": 55,
"Employers": [
[ "Acme", "1/1/1990" ],
[ "Enron", "1/1/1995" ],
[ "Amazon", "1/1/1999" ]
]
},
{
"Name": "Jane Doe",
"Age": 30,
"Employers": [
[ "Joe's Crab Shack", "1/1/2007" ],
[ "TGI Fridays", "1/1/2010" ]
]
}
]
}
While I could obviously do the transformation client side, in my use case the field names are very large compared to the data, and I would rather not transmit all those names over the wire nor spend the CPU cycles on the client doing the transformation. Before I come up with my own custom parameters to indicate that the format should be as I desire, I wanted to check if there wasn't already a standardized way to do so.
OData provides several options to control the amount of data and metadata to be included in the response.
In OData v4, you can add odata.metadata=minimal to the Accept header parameters (check the documentation here). This is the default behaviour but even with this, it will still include the field names in the response and for a good reason.
I can see why you want to send only the values without the fields name but keep in mind that this will change the semantic meaning of the response structure. It will make it less intuitive to deal with as a json record on the client side.
So to answer your question, The answer is 'NO',
Other options to minimize the response size:
You can use the $value OData option to gets the raw value of a single property.
Check this example:
services.odata.org/OData/OData.svc/Categories(1)/Products(1)/Supplier/Address/City/$value
You can also use the $select option to cherry pick only the fields you need by selecting a subset of properties to include in the response