How to create a JSONPaths file to load data into Redshift - amazon-redshift

One of my sample records in JSON:
{
"viewerId": "Ext-04835139",
"sid5": "269410578:2995631181:2211755370:3307088398:33879957",
"firstHbTimems": 1.506283958371E12,
"ipAddress": "74.58.57.31",
"streamUrl": "https://dc3-ll-livedazn-dznlivejp.hs.llnwd.net/live/channel/1007/all/stream.m3u8?event_id=61824040049&h=c912885e2a69ffa7ea84f45dc18c004d",
"asset": "[nlq9biy7trxl1cjceg70rogvd] Saints # Panthers",
"os": "IOS",
"osVersion": "10.3.3",
"deviceModel": "iPhone",
"geoInfo": {
"city": 63666,
"state": 3851,
"isp": 120,
"longitudeTimes1K": -73562,
"country": 37,
"dma": 0,
"asn": 5769,
"latitudeTimes1K": 45502,
"publicIP": 1245329695
},
"totalPlayingTime": 4.097,
"totalBufferingTime": 0.0,
"VST": 1.411,
"avgBitrate": 202.0,
"playStateSwitch": [
"{'seqNum': 0, 'eventNum': 0, 'sessionTimeMs': 7, 'startPlayState': 'eUnknown', 'endPlayState': 'eBuffering'}",
"{'seqNum': 1, 'eventNum': 5, 'sessionTimeMs': 1411, 'startPlayState': 'eBuffering', 'endPlayState': 'ePlaying'}"
],
"bitrateSwitch": [
],
"errorEvent": [
],
"tags": {
"LSsportName": "Football",
"c3.device.model": "iPhone+6+Plus",
"LSvideoType": "LIVE",
"c3.device.ua": "DAZN%2F5560+CFNetwork%2F811.5.4+Darwin%2F16.7.0",
"LSfixtureId": "5trxst8tv7slixckvawmtf949",
"genre": "Sport",
"LScompetitionName": "NFL+Game+Pass",
"show": "NFL+Game+Pass",
"c3.cmp.0._type": "DEVATLAS",
"c3.protocol.type": "cws",
"LSsportId": "9ita1e50vxttzd1xll3iyaulu",
"stageId": "8hm0ew6b8m7907ty8vy8tu4tl",
"LSvenueId": "na",
"syndicator": "None",
"applicationVersion": "2.0.8",
"deviceConnectionType": "wifi",
"c3.client.marketingName": "iPhone+6+Plus",
"playerVersion": "1.2.6.0",
"c3.cmp.0._id": "da",
"drmType": "AES128",
"c3.sh": "dc3-ll-livedazn-dznlivejp.hs.llnwd.net",
"c3.pt.ver": "10.3.3",
"applicationType": "ios",
"c3.viewer.id": "Ext-04835139",
"LSinterfaceLanguage": "en",
"c3.pt.os": "IOS",
"playerVendor": "Open+Source",
"c3.client.brand": "Apple",
"c3.cws.sf": "7",
"c3.cmp.0._ver": "1",
"c3.client.hwType": "Mobile+Phone",
"c3.pt.os.ver": "10.3.3",
"isAd": "false",
"c3.device.cver.bld": "2.124.0.33357",
"stageName": "Regular+Season",
"c3.client.osName": "iOS",
"contentType": "Live",
"c3.device.cver": "2.124.0",
"LScompetitionId": "wy3kluvb4efae1of0d8146c1",
"expireDate": "na",
"c3.client.model": "iPhone+6+Plus",
"c3.client.manufacturer": "Apple",
"LSproductionValue": "na",
"pubDate": "2017-09-23",
"c3.cluster.name": "production",
"accountType": "FreeTrial",
"c3.adaptor.type": "eCws1_7",
"c3.device.brand": "iPhone",
"c3.pt.br": "Non-Browser+Apps",
"contentId": "nlq9biy7trxl1cjceg70rogvd",
"streamingProtocol": "FairPlay",
"LSvenueName": "na",
"c3.device.type": "Mobile",
"c3.protocol.level": "2.4",
"c3.player.name": "AVPlayer",
"contentName": "Saints+%40+Panthers",
"c3.device.manufacturer": "Apple",
"c3.framework": "AVFoundation",
"c3.pt": "iOS",
"c3.device.ver": "6+Plus",
"c3.video.isLive": "T",
"c3.cmp.0._cfg_ver": "1504808821",
"c3.cws.clv": "2.124.0.33357",
"LScountryCode": "America%2FEl_Salvador"
},
"playername": "AVPlayer",
"isLive": "T",
"playerVersion": "1.2.6.0"
}
How do I create a JSONPaths file to load this into Redshift?
Thanks

You have a nested array within your JSON, so a JSONPaths file will not expand that out for you.
You have a couple of choices on how to proceed:
You can load your data at the higher level (e.g. playStateSwitch
rather than seqNum within it) and then try to use Redshift to
process that data. This can be tricky, as you cannot explode JSON
data from an array in Redshift.
You can preprocess the data using e.g. AWS Glue / Python / PySpark
or some other ETL tool that can handle these nested arrays.
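For illustration, here is a minimal PySpark sketch of that preprocessing step; the S3 paths and the choice of columns to keep are assumptions, not something from the original question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal preprocessing sketch, assuming the records are stored as JSON lines in S3;
# the bucket and paths below are placeholders.
spark = SparkSession.builder.getOrCreate()
raw = spark.read.json("s3://my-bucket/raw/sessions/")

# Explode the nested playStateSwitch array so each switch becomes its own row,
# keeping a few top-level fields alongside it.
flat = raw.select(
    "viewerId",
    "sid5",
    F.col("geoInfo.city").alias("geo_city"),
    F.explode_outer("playStateSwitch").alias("play_state_switch"),
)

# Write the flattened rows back out; this output can then be COPYed into Redshift
# without a JSONPaths file having to address the array.
flat.write.mode("overwrite").json("s3://my-bucket/flattened/sessions/")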

It all depends on the end goal, which is not clear from the above description.
I will approach the solution in the following order:
Define which fields and array values are required to be loaded into Redshift. If the need is to copy all the records, then the next question is how to handle the multiple array records.
If an array or key/value pair is missing from the JSON source, then JSONPath will not work as is, so it is better to update the JSON to add the missing array prior to COPYing the data set over to Redshift.
The JSON update can be done using Linux commands or external tools like JP (see the additional references).
If all the values in the nested arrays are required, then an alternative workaround is to use an external table (see the linked example).
Otherwise, the JSONPaths file can be developed in this format:
{
"jsonpaths": [
"$.viewerId",                  /// root-level fields
...
"$.geoInfo.city",              /// object hierarchy
...
"$.playStateSwitch[0].seqNum"  /// reference the required array element by index
...
]
}
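As a rough illustration of running the COPY against such a JSONPaths file (wrapped in Python with psycopg2; the table, bucket, IAM role, and connection details are all placeholders):

import psycopg2

# Placeholder table, bucket, IAM role, and connection details.
copy_sql = """
    COPY viewer_sessions
    FROM 's3://my-bucket/raw/sessions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    JSON 's3://my-bucket/jsonpaths/viewer_sessions_jsonpaths.json'
    TIMEFORMAT 'auto';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)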
Hope this helps.

Related

Pyspark: Best way to set json strings in dataframe column

I need to create a couple of columns in a DataFrame where I want to parse and store a JSON string. Here is one JSON that I need to store in one column; the other JSONs are similar. Can you please help with how to transform and store this JSON string in the column? The "values" section needs to be filled with values from other DataFrame columns within the same DataFrame.
{
"name": "",
"headers": [
{
"name": "A",
"dataType": "number"
},
{
"name": "B",
"dataType": "string"
},
{
"name": "C",
"dataType": "string"
}
],
"values": [
[
2,
"some value",
"some value"
]
]
}
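One possible approach, shown only as a minimal sketch (the source column names a, b and c are hypothetical), is to build the structure with struct/array and serialize it with to_json:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical source columns a, b, c that should populate the "values" section.
df = spark.createDataFrame([(2, "some value", "some value")], ["a", "b", "c"])

# Header metadata as literals; values taken from the existing columns.
headers = F.array(
    F.struct(F.lit("A").alias("name"), F.lit("number").alias("dataType")),
    F.struct(F.lit("B").alias("name"), F.lit("string").alias("dataType")),
    F.struct(F.lit("C").alias("name"), F.lit("string").alias("dataType")),
)
payload = F.struct(
    F.lit("").alias("name"),
    headers.alias("headers"),
    # Spark arrays need one element type, so the numeric column is cast to string here.
    F.array(F.array(F.col("a").cast("string"), F.col("b"), F.col("c"))).alias("values"),
)

df = df.withColumn("payload_json", F.to_json(payload))
df.select("payload_json").show(truncate=False)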

How can I select internal fields of CDC JSON to be the key of a record in Kafka using SMT?

I have tried using the SMT configurations ValueToKey and ExtractField$Key for my following CDC JSON data, but since the id field is nested, it gives me an error that the field is not recognized. How can I make nested fields accessible?
"before": null,
"after": {
"id": 4,
"salary": 5000
},
"source": {
"version": "1.5.0.Final",
"connector": "mysql",
"name": "Try-",
"ts_ms": 1623834752000,
"snapshot": "false",
"db": "mysql_db",
"sequence": null,
"table": "EmpSalary",
"server_id": 1,
"gtid": null,
"file": "binlog.000004",
"pos": 374,
"row": 0,
"thread": null,
"query": null
},
"op": "c",
"ts_ms": 1623834752982,
"transaction": null
}
Configuration Used:
transforms=createKey,extractInt
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=id
key.converter.schemas.enable=false
value.converter.schemas.enable=false
With these transformations and changes in the properties file, could I make this possible?
Unfortunately, accessing nested fields is not possible without using a different transform.
If you want to use the built-in ones, you need to extract the after state before you can access its fields:
transforms=extractAfterState,createKey,extractInt
# Add these
transforms.extractAfterState.type=io.debezium.transforms.ExtractNewRecordState
# since you cannot get the ID from null events
transforms.extractAfterState.drop.tombstones=true
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=id

Azure Data Factory Copy Activity - Append to JSON File

I am working on creating a Data Factory pipeline that copies data from a REST API endpoint to Azure Blob Storage. The API has a limitation of only returning 1000 records at a time, so I have built a loop into my pipeline that iterates through all of the pages. What I am wondering is: would it be possible to use the Copy activity to append to the same file in the Azure Blob, rather than create a separate file for each page?
Below is what the API response looks like. The only value that I need from each response is the "records" list, so I was thinking that, if it is possible, I could get rid of the other stuff and just keep appending to the same file as the loop runs, although I do not know whether the Copy activity is capable of doing this. Would this be possible? Or is the only way to land all the responses as separate files in Blob Storage and then combine them after the fact?
Thank You
{
"totalResults": 8483,
"pageResults": 3,
"timeStamp": "2020/08/24 10:43:26",
"parameters": {
"page": 1,
"resultsPerPage": 3,
"filters": [],
"fields": [
"lastName",
"firstName",
"checklistItemsAssigned",
"checklistItemsStarted",
"checklistItemsCompleted",
"checklistItemsOverdue"
],
"sort": {
"field": "lastName",
"direction": "asc"
}
},
"records": [
{
"checklistItemsAssigned": 10,
"lastName": "One",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 10,
"checklistItemsCompleted": 10
},
{
"checklistItemsAssigned": 5,
"lastName": "Two",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 5,
"checklistItemsCompleted": 5
},
{
"checklistItemsAssigned": 5,
"lastName": "Three",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 5,
"checklistItemsCompleted": 5
}
]
}
ADF's Copy activity supports copying from block, append, or page blobs, but it can only copy data to block blobs, and block blobs can only be overwritten.
You can probably create an append blob using the Storage SDK, but that would be overkill for most projects. I would go with creating new blobs and merging them in a final stage.
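As a rough sketch of that final merge stage using the Python Storage SDK (the connection string, container, prefix, and output blob name are placeholders):

import json
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container; the per-page responses are assumed
# to have been landed as block blobs under a "pages/" prefix.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("api-landing")

merged_records = []
for blob in container.list_blobs(name_starts_with="pages/"):
    page = json.loads(container.download_blob(blob.name).readall())
    # Keep only the "records" list from each page response.
    merged_records.extend(page.get("records", []))

# Write one combined blob with just the records.
container.upload_blob(
    name="merged/records.json",
    data=json.dumps(merged_records),
    overwrite=True,
)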

Split json using NiFi

I have a JSON with all the records merged. I need to split the merged JSON and load it into a separate database using NiFi.
When I execute db.collection.findOne(), my input looks like:
[
{
"name": "sai",
"id": 101,
"company": "adsdr"
},
{
"name": "siva",
"id": 102,
"company": "shar"
},
{
"name": "vanai",
"id": 103,
"company": "ddr"
},
{
"name": "karti",
"id": 104,
"company": "sir"
}
]
I am getting all of the JSON. I need to get output like:
{name: "sai", id:101, company: "sdr"}
So I only want one record. How can I parse the JSON using NiFi?
There is a SplitJson processor for this purpose:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitJson/index.html
There are various JSON Path testers online to come up with the correct expression:
https://jsonpath.curiousconcept.com/
Use the SplitJson processor with the configuration shown in the screenshot from the original answer (SplitJson Config).
As Bryan said, you can use the SplitJson processor and then forward the split flow files to other databases. The processor internally uses a JsonPath library (linked in the original answer); you can read there about the operations the processor supports.
Just use this JSON Path expression to get the first element:
// JSON Path Expression for the first element:
$[0]
This returns:
[
{
"name": "sai",
"id": 101,
"company": "adsdr"
}
]

Does the OData protocol provide a way to transform an array of objects to an array of raw values?

Is there a way to specify in an OData query that, instead of certain name/value pairs, a raw array of values should be returned? For example, if I have an OData query that results in the following:
{
"#odata.context": "http://blah.org/MyService/$metadata#People",
"value": [
{
"Name": "Joe Smith",
"Age": 55,
"Employers": [
{
"Name": "Acme",
"StartDate": "1/1/1990"
},
{
"Name": "Enron",
"StartDate": "1/1/1995"
},
{
"Name": "Amazon",
"StartDate": "1/1/1999"
}
]
},
{
"Name": "Jane Doe",
"Age": 30,
"Employers": [
{
"Name": "Joe's Crab Shack",
"StartDate": "1/1/2007"
},
{
"Name": "TGI Fridays",
"StartDate": "1/1/2010"
}
]
}
]
}
Is there anything I can add to the query to instead get back:
{
"#odata.context": "http://blah.org/MyService/$metadata#People",
"value": [
{
"Name": "Joe Smith",
"Age": 55,
"Employers": [
[ "Acme", "1/1/1990" ],
[ "Enron", "1/1/1995" ],
[ "Amazon", "1/1/1999" ]
]
},
{
"Name": "Jane Doe",
"Age": 30,
"Employers": [
[ "Joe's Crab Shack", "1/1/2007" ],
[ "TGI Fridays", "1/1/2010" ]
]
}
]
}
While I could obviously do the transformation client side, in my use case the field names are very large compared to the data, and I would rather not transmit all those names over the wire nor spend the CPU cycles on the client doing the transformation. Before I come up with my own custom parameters to indicate that the format should be as I desire, I wanted to check if there wasn't already a standardized way to do so.
OData provides several options to control the amount of data and metadata to be included in the response.
In OData v4, you can add odata.metadata=minimal to the Accept header parameters (check the documentation). This is the default behaviour, but even with this, the field names will still be included in the response, and for good reason.
I can see why you want to send only the values without the field names, but keep in mind that this changes the semantic meaning of the response structure and makes it less intuitive to deal with as a JSON record on the client side.
So to answer your question: the answer is 'no'.
Other options to minimize the response size:
You can use the $value OData option to get the raw value of a single property.
Check this example:
services.odata.org/OData/OData.svc/Categories(1)/Products(1)/Supplier/Address/City/$value
You can also use the $select option to cherry-pick only the fields you need by selecting a subset of properties to include in the response.
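For illustration, a minimal Python sketch of a $select request against the example service root from the question (the selected fields are just an assumption):

import requests

# Request only Name and Age for each person via $select.
resp = requests.get(
    "http://blah.org/MyService/People",
    params={"$select": "Name,Age"},
    headers={"Accept": "application/json;odata.metadata=minimal"},
)
print(resp.json()["value"])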