Azure Data Factory Copy Activity - Append to JSON File

I am working on creating a Data Factory pipeline that copies data from a REST API endpoint to Azure Blob Storage. The API has a limitation of only returning 1000 records at a time, so I have built a loop into my pipeline that iterates through all of the pages. What I am wondering is - would it be possible to use the Copy activity to append to the same file in the Azure Blob, rather than create a separate file for each page?
Below is what the API response looks like. The only value that I need from each response is the "records" list, so I was thinking that, if it is possible, I could get rid of the other fields and just keep appending to the same file as the loop runs - although I do not know if the Copy activity is capable of doing this. Would this be possible? Or is the only way to do this to land all the responses as separate files in Blob Storage and then combine them after the fact?
Thank You
{
"totalResults": 8483,
"pageResults": 3,
"timeStamp": "2020/08/24 10:43:26",
"parameters": {
"page": 1,
"resultsPerPage": 3,
"filters": [],
"fields": [
"lastName",
"firstName",
"checklistItemsAssigned",
"checklistItemsStarted",
"checklistItemsCompleted",
"checklistItemsOverdue"
],
"sort": {
"field": "lastName",
"direction": "asc"
}
},
"records": [
{
"checklistItemsAssigned": 10,
"lastName": "One",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 10,
"checklistItemsCompleted": 10
},
{
"checklistItemsAssigned": 5,
"lastName": "Two",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 5,
"checklistItemsCompleted": 5
},
{
"checklistItemsAssigned": 5,
"lastName": "Three",
"firstName": "Person",
"checklistItemsOverdue": 0,
"checklistItemsStarted": 5,
"checklistItemsCompleted": 5
}
]
}

ADF's Copy activity supports copying from block, append, or page blobs, but it can write only block blobs - and block blobs can only be overwritten, not appended to.
You could probably create an append blob using the Storage SDK, but that would be overkill for most projects. I would go with landing each page as a new blob and merging them in a final step.
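For the merge step, one option is a second Copy activity over the landed page files whose blob sink sets copyBehavior to MergeFiles, which concatenates them into a single blob. A minimal sketch (the dataset names and wildcard pattern are placeholders, not from your pipeline):
{
    "name": "MergePageFiles",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "JsonSource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "recursive": true,
                "wildcardFileName": "page_*.json"
            }
        },
        "sink": {
            "type": "JsonSink",
            "storeSettings": {
                "type": "AzureBlobStorageWriteSettings",
                "copyBehavior": "MergeFiles"
            }
        }
    },
    "inputs": [ { "referenceName": "PageFilesDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "MergedFileDataset", "type": "DatasetReference" } ]
}
Since MergeFiles simply concatenates file contents, it works best if each per-page file already contains only the "records" array (for example by using records as the collection reference in the per-page Copy activity's schema mapping); otherwise the merged blob will not be a single valid JSON document.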

Related

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this; it's a list of tables and their data types:
"Tables": [
{
"name": "account",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "250",
"name": "name"
}
]
},
{
"name": "customer",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "100",
"name": "name"
}
]
}
]
What we need to do is loop through this JSON and, via an ADF data flow, create the required tables in the destination database.
We initially designed the pipeline with a Lookup activity that loads the JSON file and then passes the output to a ForEach loop. This worked really well when we had only a small JSON file, but as we started using real data the JSON file was over the 4 MB limit, resulting in the Lookup activity throwing an error.
We then tried a mapping data flow, loading the JSON as a source, setting the sink to a cache and outputting this to an output variable which we then loop through. Again this works with smaller datasets, but as soon as the dataset is large enough it can't parse it into an output.
I am sure this should be easy to do but just can't get my head around it!
Here is a sample procedure to loop through a large JSON file in a data flow (the flattened rows it produces are sketched below):
Create a linked service and a dataset pointing to the JSON file path.
Use that dataset as the source of the data flow.
Add a Flatten transformation and select the nested array (here, attributes) in its Unroll by option, choosing the required input columns from the source.
Create a linked service and dataset for the sink path and attach them as the data flow's sink.
Attach the data flow to a Data Flow activity in the pipeline.
The result lands in the SQL DB as expected.
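Assuming the Flatten transformation unrolls by attributes and also maps the parent table name (the column names below are illustrative, not ADF defaults), the rows reaching the sink would look roughly like:
[
    { "tableName": "account", "attributeName": "Id", "dataType": "guid", "maxLength": "-1" },
    { "tableName": "account", "attributeName": "name", "dataType": "string", "maxLength": "250" },
    { "tableName": "customer", "attributeName": "Id", "dataType": "guid", "maxLength": "-1" },
    { "tableName": "customer", "attributeName": "name", "dataType": "string", "maxLength": "100" }
]
From there the table-creation logic can work on one row per attribute instead of the whole multi-megabyte document.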

azure data factory dependencies

I have two activities in my azure data factory.
Activity A1 = a stored proc on a sql db. Input=none, output = DB (output1). The stored proc targets the output dataset.
Activity A2 = an azure copy activity ("type": "Copy") which copies from blob to same sql db. Input = blob, Output = DB (output2)
I need to run activity A1 before A2 and I can't for the life of me figure out what dependencies to put between them.
I tried to mark A2 as having two inputs - the blob + the DB (output1). If I do this, the copy activity doesn't throw an error, but it does NOT copy the blob to the DB (I think it silently uses the DB as the copy source instead of the blob, and so does nothing).
If I remove the DB input (output1) on A2 it can successfully copy the blob to the DB, but then I no longer have the dependency chain that makes A1 run before A2.
thanks!
I figured this out - I was able to keep two dependencies on A2, but I just needed to make sure of the ordering of the 2 inputs. Weird. It looks like the Copy activity just acts on the FIRST input - so when I made the blob the first input it worked! :) (Earlier I had the DB output1 as the first input and it silently did nothing.)
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "MyBlobInput"
},
{
"name": "MyDBOutput1"
}
],
"outputs": [
{
"name": "MyDBOutput2"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 3,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQL",
"description": "Copy Activity"
}
],

how to create Jsonpath file to load data in redshift

Here is one of my sample records of the JSON:
{
"viewerId": "Ext-04835139",
"sid5": "269410578:2995631181:2211755370:3307088398:33879957",
"firstHbTimems": 1.506283958371E12,
"ipAddress": "74.58.57.31",
"streamUrl": "https://dc3-ll-livedazn-dznlivejp.hs.llnwd.net/live/channel/1007/all/stream.m3u8?event_id=61824040049&h=c912885e2a69ffa7ea84f45dc18c004d",
"asset": "[nlq9biy7trxl1cjceg70rogvd] Saints # Panthers",
"os": "IOS",
"osVersion": "10.3.3",
"deviceModel": "iPhone",
"geoInfo": {
"city": 63666,
"state": 3851,
"isp": 120,
"longitudeTimes1K": -73562,
"country": 37,
"dma": 0,
"asn": 5769,
"latitudeTimes1K": 45502,
"publicIP": 1245329695
},
"totalPlayingTime": 4.097,
"totalBufferingTime": 0.0,
"VST": 1.411,
"avgBitrate": 202.0,
"playStateSwitch": [
"{'seqNum': 0, 'eventNum': 0, 'sessionTimeMs': 7, 'startPlayState': 'eUnknown', 'endPlayState': 'eBuffering'}",
"{'seqNum': 1, 'eventNum': 5, 'sessionTimeMs': 1411, 'startPlayState': 'eBuffering', 'endPlayState': 'ePlaying'}"
],
"bitrateSwitch": [
],
"errorEvent": [
],
"tags": {
"LSsportName": "Football",
"c3.device.model": "iPhone+6+Plus",
"LSvideoType": "LIVE",
"c3.device.ua": "DAZN%2F5560+CFNetwork%2F811.5.4+Darwin%2F16.7.0",
"LSfixtureId": "5trxst8tv7slixckvawmtf949",
"genre": "Sport",
"LScompetitionName": "NFL+Game+Pass",
"show": "NFL+Game+Pass",
"c3.cmp.0._type": "DEVATLAS",
"c3.protocol.type": "cws",
"LSsportId": "9ita1e50vxttzd1xll3iyaulu",
"stageId": "8hm0ew6b8m7907ty8vy8tu4tl",
"LSvenueId": "na",
"syndicator": "None",
"applicationVersion": "2.0.8",
"deviceConnectionType": "wifi",
"c3.client.marketingName": "iPhone+6+Plus",
"playerVersion": "1.2.6.0",
"c3.cmp.0._id": "da",
"drmType": "AES128",
"c3.sh": "dc3-ll-livedazn-dznlivejp.hs.llnwd.net",
"c3.pt.ver": "10.3.3",
"applicationType": "ios",
"c3.viewer.id": "Ext-04835139",
"LSinterfaceLanguage": "en",
"c3.pt.os": "IOS",
"playerVendor": "Open+Source",
"c3.client.brand": "Apple",
"c3.cws.sf": "7",
"c3.cmp.0._ver": "1",
"c3.client.hwType": "Mobile+Phone",
"c3.pt.os.ver": "10.3.3",
"isAd": "false",
"c3.device.cver.bld": "2.124.0.33357",
"stageName": "Regular+Season",
"c3.client.osName": "iOS",
"contentType": "Live",
"c3.device.cver": "2.124.0",
"LScompetitionId": "wy3kluvb4efae1of0d8146c1",
"expireDate": "na",
"c3.client.model": "iPhone+6+Plus",
"c3.client.manufacturer": "Apple",
"LSproductionValue": "na",
"pubDate": "2017-09-23",
"c3.cluster.name": "production",
"accountType": "FreeTrial",
"c3.adaptor.type": "eCws1_7",
"c3.device.brand": "iPhone",
"c3.pt.br": "Non-Browser+Apps",
"contentId": "nlq9biy7trxl1cjceg70rogvd",
"streamingProtocol": "FairPlay",
"LSvenueName": "na",
"c3.device.type": "Mobile",
"c3.protocol.level": "2.4",
"c3.player.name": "AVPlayer",
"contentName": "Saints+%40+Panthers",
"c3.device.manufacturer": "Apple",
"c3.framework": "AVFoundation",
"c3.pt": "iOS",
"c3.device.ver": "6+Plus",
"c3.video.isLive": "T",
"c3.cmp.0._cfg_ver": "1504808821",
"c3.cws.clv": "2.124.0.33357",
"LScountryCode": "America%2FEl_Salvador"
},
"playername": "AVPlayer",
"isLive": "T",
"playerVersion": "1.2.6.0"
}
How can I create a jsonpaths file to load this into Redshift?
Thanks
You have a nested array within your JSON - so a jsonpaths file will not expand that out for you.
You have a couple of choices on how to proceed:
You can load your data at the higher level (e.g. playStateSwitch rather than seqNum within it) and then try to use Redshift to process that data. This can be tricky, as you cannot explode JSON data from an array in Redshift.
You can preprocess the data using e.g. AWS Glue / Python / PySpark or some other ETL tool that can handle these nested arrays.
It all depends on the end goal, which is not clear from the above description.
I would approach the solution in the following order:
Define which fields and array values are required to be loaded into Redshift. If the need is to copy all the records, then the next question is how to handle the multiple array records.
If an array or key/value is missing from the JSON source, then the jsonpaths approach will not work as is - so it is better to update the JSON to add the missing array before COPYing the data set over to Redshift.
The JSON update can be done using Linux commands or external tools such as JP.
If all the values in the nested arrays are required, then an alternative workaround is to use an external table.
Otherwise, the jsonpaths file can be developed in this format:
{
"jsonpaths": [
"$.viewerId", ///root-level fields
...
"$.geoInfo.city", ///object hierarchy
...
"$.playStateSwitch[0].seqNum" ///define the required array element
...
]
}
Hope this helps.
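Putting that together for the sample record above, a fuller jsonpaths file might look like this (a sketch covering only a representative subset of the fields; the order of the paths has to match the order of the columns in the COPY target table or its column list):
{
    "jsonpaths": [
        "$.viewerId",
        "$.sid5",
        "$.ipAddress",
        "$.os",
        "$.osVersion",
        "$.deviceModel",
        "$.geoInfo.city",
        "$.geoInfo.state",
        "$.geoInfo.country",
        "$.totalPlayingTime",
        "$.totalBufferingTime",
        "$.VST",
        "$.avgBitrate",
        "$.playStateSwitch[0]",
        "$.tags.contentType",
        "$.tags.accountType",
        "$.playername",
        "$.isLive",
        "$.playerVersion"
    ]
}
Note that "$.playStateSwitch[0]" would land the first array element as a string in a varchar column; fully expanding the nested arrays still needs one of the preprocessing options above.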

Split json using NiFi

I have a JSON with all the records merged. I need to split the merged JSON and load the records into a separate database using NiFi.
This is my file when I execute
db.collection.findOne()
My input looks like:
[
{
"name": "sai",
"id": 101,
"company": "adsdr"
},
{
"name": "siva",
"id": 102,
"company": "shar"
},
{
"name": "vanai",
"id": 103,
"company": "ddr"
},
{
"name": "karti",
"id": 104,
"company": "sir"
}
]
I am getting all of the JSON. I need to get output like:
{name: "sai", id:101, company: "sdr"}
So I only want one record - how can I parse the JSON using NiFi?
There is a SplitJson processor for this purpose:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitJson/index.html
There are various JSON Path testers online to come up with the correct expression:
https://jsonpath.curiousconcept.com/
Use the SplitJson processor with the configuration shown in the screenshot below:
(screenshot: SplitJson Config)
As Bryan said, you can use the SplitJson processor, and then forward the split FlowFiles to other databases. The processor internally uses a JSON Path library; its documentation describes the operations the processor supports.
Just use this to get the first element:
// JSON Path Expression for the first element:
$[0]
[
{
"name": "sai",
"id": 101,
"company": "adsdr"
}
]
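If the goal is one FlowFile per record rather than just the first element, a common setup (an assumption here, since the screenshot above is not visible) is to point SplitJson's JsonPath Expression at all elements of the root array, for example $[*], so that each record becomes its own FlowFile. The first split would then contain just:
{
    "name": "sai",
    "id": 101,
    "company": "adsdr"
}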

MongoDB Storing User Files

I've been trying to become familiar with using MongoDB, particularly GridFS (which is implemented through the client). In a project I'm playing around with, I have users with corresponding locations, followed by corresponding data, but it is currently just stored on a filesystem which will ultimately lead to problems.
So an example data structure I have now is:
./(userId)/images/img1.png
./(userId)/data/sensor1/sensorOutput1.data
./(userId)/data/sensor1/sensorOutput2.data
./(userId)/data/sensor2/sensorOutput1.data
./(userId)/data/sensor2/sensorOutput2.data
So, in looking at GridFS, I see that you can pass attribute/value pairs to be associated with the metadata for each file. Looking at this setup and after reading tutorials on MongoDB, I thought of this approach:
Have one database with multiple collections (such as data or images just to separate some of the data). Then in each of the collections, have attributes for each document such as:
Under image collection:
{
"userId": 5,
"path": "",
"filename":"img1.png"
...
}
or
Under data collection:
{
"userId": 5,
"path": "sensor1",
"filename": "sensorOutput1.data"
...
}
Alternatively, I could just have a single collection, which for the sake of this example I'll call "Everything":
{
"userId": 5,
"path": "images/",
"filename": "img1.png"
...
}
{
"userId": 5,
"path": "data/sensor1/",
"filename": "sensorOutput1.data"
...
}
{
"userId": 5,
"path": "data/sensor2/",
"filename": "sensorOutput1.data"
...
}
Does either of these solutions seem reasonable? Would I then create an index on the "path" attribute? I've seen examples of adding files to MongoDB, but I haven't found one showing how to structure user files.
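For reference, my understanding is that with GridFS the custom attributes would end up under the metadata field of the fs.files document, roughly like this (userId and path are my proposed fields, the other fields are standard GridFS ones, and the values are made up):
{
    "_id": "ObjectId(...)",
    "length": 204800,
    "chunkSize": 261120,
    "uploadDate": "2020-08-24T10:43:26Z",
    "filename": "sensorOutput1.data",
    "metadata": {
        "userId": 5,
        "path": "data/sensor1/"
    }
}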
Thanks!