We currently receive metadata from a third-party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this; it is a list of tables and their column data types:
"Tables": [
  {
    "name": "account",
    "description": "account",
    "$type": "LocalEntity",
    "attributes": [
      {
        "dataType": "guid",
        "maxLength": "-1",
        "name": "Id"
      },
      {
        "dataType": "string",
        "maxLength": "250",
        "name": "name"
      }
    ]
  },
  {
    "name": "customer",
    "description": "account",
    "$type": "LocalEntity",
    "attributes": [
      {
        "dataType": "guid",
        "maxLength": "-1",
        "name": "Id"
      },
      {
        "dataType": "string",
        "maxLength": "100",
        "name": "name"
      }
    ]
  }
]
What we need to do is loop through this JSON and, via an ADF data flow, create the required tables in the destination database.
We initially designed the pipeline with a Lookup activity that loads the JSON file and then passes the output to a ForEach loop. This worked well while the JSON file was small, but once we started using real data the file exceeded the Lookup activity's 4 MB limit and the activity threw an error.
We then tried a mapping data flow: load the JSON as a source, set the sink to a cache, and write the cached output to an output variable that we then loop through. Again this works with smaller datasets, but as soon as the dataset is large enough it cannot parse it into an output.
I am sure this should be easy to do, but I just can't get my head around it!
Here is a sample procedure for looping through a large JSON file with a data flow.
Create a linked service and dataset pointing to the JSON file path.
Use that dataset as the source in the data flow.
Add a Flatten transformation and pick the array to unroll in the Unroll by option; this turns each element of the nested array into its own row, with the parent columns carried alongside it.
Create a linked service and dataset for the sink path.
Attach the data flow to a Data Flow activity in the pipeline.
You will get the expected result in the SQL database.
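If it helps to sanity-check what should end up in the database, here is a minimal sketch, separate from the data flow above, of generating CREATE TABLE statements from the same metadata in code (for example from a small function the pipeline calls). The file name tables.json and the mapping of the supplier's type names to SQL Server types are my assumptions, not something stated in the original post.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class MetadataToDdl {
    public static void main(String[] args) throws Exception {
        // "tables.json" is a placeholder for the supplier's metadata file.
        JsonNode root = new ObjectMapper().readTree(new File("tables.json"));
        for (JsonNode table : root.get("Tables")) {
            StringBuilder ddl = new StringBuilder("CREATE TABLE [" + table.get("name").asText() + "] (");
            JsonNode attrs = table.get("attributes");
            for (int i = 0; i < attrs.size(); i++) {
                JsonNode a = attrs.get(i);
                ddl.append(i > 0 ? ", " : "")
                   .append("[").append(a.get("name").asText()).append("] ")
                   .append(sqlType(a.get("dataType").asText(), a.get("maxLength").asText()));
            }
            ddl.append(");");
            System.out.println(ddl);
        }
    }

    // Assumed mapping from the supplier's dataType/maxLength values to SQL Server types.
    private static String sqlType(String dataType, String maxLength) {
        switch (dataType) {
            case "guid":   return "UNIQUEIDENTIFIER";
            case "string": return "-1".equals(maxLength) ? "NVARCHAR(MAX)" : "NVARCHAR(" + maxLength + ")";
            default:       return "NVARCHAR(MAX)";
        }
    }
}

For the sample metadata above this would print one CREATE TABLE statement per entry in the Tables array.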
My Kafka topic receives a wrapper payload in JSON format. The wrapper payload looks like this:
{
  "format": "wrapper",
  "time": 1626814608000,
  "events": [
    {
      "id": "item1",
      "type": "product1",
      "count": 200
    },
    {
      "id": "item2",
      "type": "product2",
      "count": 300
    }
  ],
  "metadata": {
    "schema": "schema-1"
  }
}
I need to export this to S3, but the catch is that I should not store the wrapper. Instead, I should store the individual events based on the item.
For example, it should be stored in S3 as follows:
bucket/product1:
{"id": "item1", "type": "product1", "count": 200}
bucket/product2:
{"id": "item2", "type": "product2", "count": 300}
If you notice, the input is the wrapper with those events inside it. However, my output should be each of those individual events stored in S3, in the same bucket, with the product type as the prefix.
My question is: is it possible to use Kafka Connect to do this? I see it has Single Message Transforms, which seem to be a way to mutate data inside the object, but not to fan out the way I want. Even the signature looks like R => R:
https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java
So based on my research, it does not seem possible. But I want to check if I am missing something before using a different option.
Single Message Transforms accept one record and output one record; they cannot fan a single record out into several.
You need a stream processor such as Kafka Streams, whose branch or flatMap operations can split an array of events into multiple records or route them to multiple topics.
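For illustration, here is a minimal Kafka Streams sketch of the flatMap approach, assuming the wrapper arrives as a JSON string on a topic called wrapper-events and that Jackson is on the classpath; the topic names and bootstrap server are placeholders. Each event is re-keyed by its type so a downstream S3 sink connector (or your own consumer) can use the key as the prefix.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class WrapperFanout {
    public static void main(String[] args) {
        ObjectMapper mapper = new ObjectMapper();
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> wrappers =
                builder.stream("wrapper-events", Consumed.with(Serdes.String(), Serdes.String()));

        // flatMap turns one wrapper record into zero or more event records,
        // keyed by the event's "type" field.
        KStream<String, String> events = wrappers.flatMap((key, value) -> {
            List<KeyValue<String, String>> out = new ArrayList<>();
            try {
                JsonNode wrapper = mapper.readTree(value);
                for (JsonNode event : wrapper.get("events")) {
                    out.add(KeyValue.pair(event.get("type").asText(), event.toString()));
                }
            } catch (Exception e) {
                // Malformed wrapper: drop it (or route it to a dead-letter topic).
            }
            return out;
        });

        events.to("individual-events", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wrapper-fanout");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}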
Does anyone know if there is a way to pass a schema mapping to multiple CSVs without doing it manually? I have 30 CSVs passed through a data flow in a ForEach activity, so I can't detect or set the field types (I could only do that for the first one).
Thanks for your help! :)
A Copy Activity mapping can be parameterized and changed at runtime if explicit mapping is required. The parameter is just a JSON object that you pass in for each of the files you are processing. It looks something like this:
{
  "type": "TabularTranslator",
  "mappings": [
    {
      "source": {
        "name": "Id"
      },
      "sink": {
        "name": "CustomerID"
      }
    },
    {
      "source": {
        "name": "Name"
      },
      "sink": {
        "name": "LastName"
      }
    },
    {
      "source": {
        "name": "LastModifiedDate"
      },
      "sink": {
        "name": "ModifiedDate"
      }
    }
  ]
}
You can read more about it here: Schema and data type mapping in copy activity
So you can either pre-generate these mappings and fetch them via a Lookup in a previous step of the pipeline, or, if they need to be dynamic, create them at runtime with code (e.g. have an Azure Function that looks up the current schema of the CSV and returns a properly formatted translator object).
Once you have the object as a parameter, you can pass it to the copy activity: on the mapping properties of the copy activity, choose Add dynamic content and select the appropriate parameter.
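If you go the code route, here is a minimal sketch of the kind of helper an Azure Function could wrap, assuming Jackson is available. The class and method names, the parameters, and the fall-back-to-same-name behaviour are illustrative assumptions, not part of the ADF API; the output matches the TabularTranslator shape shown above.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.Map;

public class TranslatorBuilder {
    // Builds a TabularTranslator JSON string from a CSV header line and an
    // optional map of source-to-sink column renames.
    public static String build(String csvHeaderLine, Map<String, String> sourceToSink) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode translator = mapper.createObjectNode();
        translator.put("type", "TabularTranslator");
        ArrayNode mappings = translator.putArray("mappings");

        for (String column : csvHeaderLine.split(",")) {
            String source = column.trim();
            ObjectNode entry = mappings.addObject();
            entry.putObject("source").put("name", source);
            // Fall back to the same name when no explicit sink mapping is provided.
            entry.putObject("sink").put("name", sourceToSink.getOrDefault(source, source));
        }
        return mapper.writerWithDefaultPrettyPrinter().writeValueAsString(translator);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("Id,Name,LastModifiedDate",
                Map.of("Id", "CustomerID", "Name", "LastName", "LastModifiedDate", "ModifiedDate")));
    }
}

The string this returns is what you would pass into the pipeline parameter that the copy activity's mapping references.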
I like to use the Runner in Postman to run / test a whole collection of endpoints. Each endpoint should get different parameter or request body data on each iteration.
So far I have figured out how to use a data file for one endpoint. See https://learning.postman.com/docs/running-collections/working-with-data-files/
But is there a way to provide data for more than one endpoint, where the endpoints need different variables, in the same run?
example:
[GET]categories/:categoryId?lang=en
[GET]articles/?filter[height]=10,40&sort[name]=desc
Datafile for first endpoint:
[
  {
    "categoryId": 1123,
    "lang": "en"
  },
  {
    "categoryId": 3342,
    "lang": "de"
  }
]
Datafile for second endpoint:
[
  {
    "filter": "height",
    "filterValue": "10,40",
    "sort": "name",
    "sortDir": "desc"
  },
  {
    "filter": "material",
    "filterValue": "chrome",
    "sort": "relevance",
    "sortDir": "asc"
  }
]
Right now, there is no way to add more data files. https://community.postman.com/t/pass-multiple-data-files-to-a-collection/899
My suggestion is:
Separate each endpoint that needs a data file into its own collection.
Use newman as a library to run them all.
I am working on creating a data factory pipeline that copies data from a REST API endpoint to Azure Blob Storage. The API has a limitation of only returning 1000 records at a time, so I have built in a loop into my pipeline that will iterate through all of the pages. What I am wondering is - would it be possible to use the copy activity to append to the same file in the Azure Blob, rather than create a separate file for each page?
Below is what the API response looks like. The only value that I need from each response is the "records" list, so I was thinking that, if it is possible, I could get rid of the other stuff and just keep appending to the same file as the loop runs, although I do not know if the copy activity is capable of doing this. Would this be possible? Or is the only way to do this to land all the responses as separate files in Blob Storage and then combine them after the fact?
Thank You
{
  "totalResults": 8483,
  "pageResults": 3,
  "timeStamp": "2020/08/24 10:43:26",
  "parameters": {
    "page": 1,
    "resultsPerPage": 3,
    "filters": [],
    "fields": [
      "lastName",
      "firstName",
      "checklistItemsAssigned",
      "checklistItemsStarted",
      "checklistItemsCompleted",
      "checklistItemsOverdue"
    ],
    "sort": {
      "field": "lastName",
      "direction": "asc"
    }
  },
  "records": [
    {
      "checklistItemsAssigned": 10,
      "lastName": "One",
      "firstName": "Person",
      "checklistItemsOverdue": 0,
      "checklistItemsStarted": 10,
      "checklistItemsCompleted": 10
    },
    {
      "checklistItemsAssigned": 5,
      "lastName": "Two",
      "firstName": "Person",
      "checklistItemsOverdue": 0,
      "checklistItemsStarted": 5,
      "checklistItemsCompleted": 5
    },
    {
      "checklistItemsAssigned": 5,
      "lastName": "Three",
      "firstName": "Person",
      "checklistItemsOverdue": 0,
      "checklistItemsStarted": 5,
      "checklistItemsCompleted": 5
    }
  ]
}
ADF's Copy activity supports copying from block, append, or page blobs, but it can only write to block blobs, and block blobs can only be overwritten, not appended to.
You could probably create an append blob using the Storage SDK, but that would be overkill for most projects. I would go with creating a new blob per page and merging them in a final stage.
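For completeness, if you did go the Storage SDK route, here is a minimal sketch of appending each page's records to a single append blob, assuming the azure-storage-blob v12 Java SDK; the connection string variable, container name, blob name, and sample payload are placeholders.

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.specialized.AppendBlobClient;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class AppendPages {
    public static void main(String[] args) {
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .buildClient()
                .getBlobContainerClient("api-extracts");

        AppendBlobClient blob = container.getBlobClient("records.json").getAppendBlobClient();
        if (!blob.exists()) {
            blob.create(); // an append blob must be created before blocks can be appended
        }

        // In the real pipeline this would be the "records" array of each API page.
        String pageOfRecords = "[{\"lastName\":\"One\",\"firstName\":\"Person\"}]\n";
        byte[] bytes = pageOfRecords.getBytes(StandardCharsets.UTF_8);
        blob.appendBlock(new ByteArrayInputStream(bytes), bytes.length);
    }
}

Note that the Copy activity itself still cannot write to this append blob, so something outside ADF (for example an Azure Function called once per page) would have to do the appending.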
I have a JSON file with all the records merged together. I need to split the merged JSON and load the records into a separate database using NiFi.
When I execute
db.collection.findOne()
my input looks like:
[
  {
    "name": "sai",
    "id": 101,
    "company": "adsdr"
  },
  {
    "name": "siva",
    "id": 102,
    "company": "shar"
  },
  {
    "name": "vanai",
    "id": 103,
    "company": "ddr"
  },
  {
    "name": "karti",
    "id": 104,
    "company": "sir"
  }
]
I am getting all the JSON at once. I need to get output like:
{"name": "sai", "id": 101, "company": "adsdr"}
So I only want one record at a time; how can I parse the JSON using NiFi?
There is a SplitJson processor for this purpose:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitJson/index.html
There are various JSON Path testers online to come up with the correct expression:
https://jsonpath.curiousconcept.com/
Use the SplitJson processor with the configuration shown in the screenshot below.
(Screenshot: SplitJson Config)
As Bryan said, you can use the SplitJson processor and then forward the split flowfiles to the other databases. The processor internally uses a JsonPath library (Jayway json-path); you can read its documentation for the operations the processor supports.
To get just the first element, use this JSON Path expression:
$[0]
which returns:
[
  {
    "name": "sai",
    "id": 101,
    "company": "adsdr"
  }
]
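If you want to see what those expressions evaluate to outside NiFi, here is a minimal sketch using the Jayway json-path library (the same kind of JsonPath engine SplitJson relies on); the hard-coded sample data mirrors the input above, and the expected outputs in the comments are approximate.

import com.jayway.jsonpath.JsonPath;
import java.util.List;
import java.util.Map;

public class JsonPathDemo {
    public static void main(String[] args) {
        String json = "[{\"name\":\"sai\",\"id\":101,\"company\":\"adsdr\"},"
                + "{\"name\":\"siva\",\"id\":102,\"company\":\"shar\"}]";

        // "$[*]" matches every element of the array; this is the kind of expression
        // you would give SplitJson so it emits one flowfile per record.
        List<Map<String, Object>> all = JsonPath.read(json, "$[*]");

        // "$[0]" selects only the first record.
        Map<String, Object> first = JsonPath.read(json, "$[0]");

        System.out.println(all.size()); // 2
        System.out.println(first);      // roughly {name=sai, id=101, company=adsdr}
    }
}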