How do I use ADF copy activity with multiple rows in source? - azure-data-factory

I have a source which is a JSON array; the sink is SQL Server. When I use column mapping and view the generated code, I can see the mapping is done against the first element of the array, so each run produces a single record even though the source has multiple records. How do I use the copy activity to import ALL the rows?
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['#odata.context']": "BuyerFinancing",
"['#odata.nextLink']": "PropertyCondition",
"value[0].AssociationFee": "AssociationFee",
"value[0].AssociationFeeFrequency": "AssociationFeeFrequency",
"value[0].AssociationName": "AssociationName",

Use * as the source field to indicate all elements of the array. For example, with this JSON:
{
    "results": [
        {"field1": "valuea", "field2": "valueb"},
        {"field1": "valuex", "field2": "valuey"}
    ]
}
and a database table with a column result to store the JSON, the mapping with results as the collection and * as the sub-element will create two records:
{"field1": "valuea", "field2": "valueb"}
{"field1": "valuex", "field2": "valuey"}
in the result field.
Copy Data Field Mapping
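Purely as an illustration outside of ADF (not something you run inside the pipeline), here is a minimal Python sketch of the row-per-element expansion that the * mapping should produce for the example above:

import json

source = {
    "results": [
        {"field1": "valuea", "field2": "valueb"},
        {"field1": "valuex", "field2": "valuey"}
    ]
}

# One sink row per array element; each row's "result" column holds that element's JSON
rows = [{"result": json.dumps(element)} for element in source["results"]]
for row in rows:
    print(row["result"])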

ADF supports cross apply for JSON arrays. Please check the example in this doc: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#jsonformat-example
For schema mapping: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#schema-mapping

Related

JSON data retrieved from MongoDB with formats added explicitly inside fields, e.g. {"field": {$numberInt: "20"}}. How to process that data?

I have used mongoimport to import data into MongoDB from CSV files. I am trying to retrieve data from a MongoDB Realm service. The returned data for an entry is as follows:
{
    "_id": "6124edd04543fb222e",
    "Field1": "some string",
    "Field2": {
        "$numberDouble": "145.81"
    },
    "Field3": {
        "$numberInt": "0"
    },
    "Field4": {
        "$numberInt": "15"
    },
    "Field5": {
        "$numberInt": "0"
    }
}
How do I convert this into normal JSON by removing $numberInt and $numberDouble, like:
{
    "_id": "6124edd04543fb222e",
    "Field1": "some string",
    "Field2": 145.81,
    "Field3": 0,
    "Field4": 15,
    "Field5": 0
}
The fields also differ between documents, so I cannot use Mongoose directly. Are there any solutions to this?
It would also help to know why the numbers are being stored as $numberInt: "".
Edit:
For anyone with the same problem, this is how I solved it.
The array of documents is in EJSON format instead of JSON, as said in the upvoted answer. To convert it back into normal JSON, I used JSON.stringify to first convert each document I got from the map function into a string, and then parsed it with EJSON.parse using the {strict: false} option (this option is important) to turn it into normal JSON.
restaurants.map((restaurant) => {
    restaurant = EJSON.parse(JSON.stringify(restaurant), { strict: false });
});
The EJSON.parse documentation is here. The module to install and import is mongodb-extjson.
The format with $numberInt etc. is called (MongoDB) Extended JSON.
You are getting it on the output side either because this is how you inserted your data (meaning your inserted data was incorrect and you need to fix the ingestion side) or because you requested Extended JSON serialization.
If the data in the database is correct, and you want non-extended JSON output, you generally need to write your own serializers to JSON since there are multiple possibilities of how to format the data. MongoDB's JSON output format is the Extended JSON you're seeing in your first quote.
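If you end up doing this conversion from Python rather than JavaScript (an assumption on my part; the question is in a Realm/JavaScript context), the bson package that ships with pymongo can parse Extended JSON back into native types. A minimal sketch:

import json
from bson import json_util   # ships with pymongo

extended = '{"Field2": {"$numberDouble": "145.81"}, "Field3": {"$numberInt": "0"}}'

# json_util.loads understands Extended JSON and returns native Python values
doc = json_util.loads(extended)

# Plain JSON output; default=str covers BSON-only types such as ObjectId
print(json.dumps(doc, default=str))   # {"Field2": 145.81, "Field3": 0}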

How can I get the count of JSON array in ADF?

I'm using Azure Data Factory to retrieve data and copy it into a database. The source looks like this:
{
    "GroupIds": [
        "4ee1a-0856-4618-4c3c77302b",
        "21259-0ce1-4a30-2a499965d9",
        "b2209-4dda-4e2f-029384e4ad",
        "63ac6-fcbc-8f7e-36fdc5e4f9",
        "821c9-aa73-4a94-3fc0bd2338"
    ],
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "Name": "Test Item",
    "Description": "test item description",
    "Notes": null,
    "ExternalId": null,
    "ExpiryDate": null,
    "ActiveStatus": 0,
    "TagIds": [
        "784083-4c77-b8fb-0135046c",
        "86de96-44c1-a497-0a308607",
        "7565aa-437f-af36-8f9306c9",
        "d5d841-1762-8c14-d8420da2",
        "bac054-2b6e-a19b-ef5b0b0c"
    ],
    "ResourceIds": []
}
In my ADF pipeline, I am trying to get the count of GroupIds and store that in a database column (along with the associated Id from the JSON above).
Is there some kind of syntax I can use to tell ADF that I just want the count of GroupIds or is this going to require some kind of recursive loop activity?
You can use the length function in Azure Data Factory (ADF) to check the length of json arrays:
length(json(variables('varSource')).GroupIds)
If you are loading the data to a SQL database then you could use OPENJSON, a simple example:
DECLARE @json NVARCHAR(MAX) = '{
    "GroupIds": [
        "4ee1a-0856-4618-4c3c77302b",
        "21259-0ce1-4a30-2a499965d9",
        "b2209-4dda-4e2f-029384e4ad",
        "63ac6-fcbc-8f7e-36fdc5e4f9",
        "821c9-aa73-4a94-3fc0bd2338"
    ],
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "Name": "Test Item",
    "Description": "test item description",
    "Notes": null,
    "ExternalId": null,
    "ExpiryDate": null,
    "ActiveStatus": 0,
    "TagIds": [
        "784083-4c77-b8fb-0135046c",
        "86de96-44c1-a497-0a308607",
        "7565aa-437f-af36-8f9306c9",
        "d5d841-1762-8c14-d8420da2",
        "bac054-2b6e-a19b-ef5b0b0c"
    ],
    "ResourceIds": []
}'

SELECT *
FROM OPENJSON( @json, '$.GroupIds' );

SELECT COUNT(*) countOfGroupIds
FROM OPENJSON( @json, '$.GroupIds' );
My results:
If your data is stored in a table the code is similar. Make sense?
Another funky way to approach it, if you really need the count in-line, is to convert the JSON to XML using the built-in functions and then run some XPath on it. It's not as complicated as it sounds and it allows you to get the result inside the pipeline.
The Data Factory xml function converts JSON to XML, but that JSON must have a single root property. We can fix up the JSON with concat and a single line of code. In this example I'm using a Set Variable activity, where varSource is your original JSON:
#concat('{"root":', variables('varSource'), '}')
Next, we can just apply the XPath with another simple expression:
@string(xpath(xml(json(variables('varIntermed1'))), 'count(/root/GroupIds)'))
My results:
Easy huh. It's a shame there isn't more built-in support for JPath unless I'm missing something, although you can use limited JPath in the Copy activity.
You can use Data flow activity in the Azure data factory pipeline to get the count.
Step1:
Connect the Source to the JSON dataset and, in Source options under JSON settings, select Single document.
In the source preview, you can see there are 5 GroupIDs per ID.
Step2:
Use the flatten transformation to denormalize the values into rows for GroupIDs.
Select GroupIDs array in Unroll by and Unroll root.
Step3:
Use the Aggregate transformation to get the count of GroupIDs grouped by ID.
Under Group by: Select a column from the drop-down for your aggregation.
Under Aggregate: You can build the expression to get the count of the column (GroupIDs).
Aggregate Data preview:
Step4: Connect the output to a Sink transformation to load the final output to the database.
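Outside of ADF, the same flatten-then-aggregate shape can be sketched in Python with pandas (purely illustrative, using a trimmed copy of the document from the question):

import pandas as pd

# A trimmed version of the source document shown above
doc = {
    "Id": "w5a19-a493-bfd4-0a0c8djc05",
    "GroupIds": [
        "4ee1a-0856-4618-4c3c77302b",
        "21259-0ce1-4a30-2a499965d9",
        "b2209-4dda-4e2f-029384e4ad"
    ]
}

df = pd.DataFrame([doc])                      # one row per source document
flattened = df.explode("GroupIds")            # flatten: one row per GroupId
counts = (flattened.groupby("Id")["GroupIds"]
          .count()
          .reset_index(name="GroupIdCount"))  # aggregate: count grouped by Id
print(counts)                                 # GroupIdCount is 3 for this Id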

Building query in Postgres 9.4.2 for JSONB datatype using builtin function

I have a table schema as follows:
DummyTable
-------------
someData JSONB
All my values will be a JSON object. For example, when you do a select *
from DummyTable, it would look like
someData(JSONB)
------------------
{"values":["P1","P2","P3"],"key":"ProductOne"}
{"values":["P3"],"key":"ProductTwo"}
I want a query which will give me result set as follows:
[
{
"values": ["P1","P2","P3"],
"key": "ProductOne"
},
{
"values": ["P4"],
"key": "ProductTwo"
}
]
I'm using Postgres version 9.4.2. I looked at its documentation page, but could not find a query that would give the above result.
In my API I can build the JSON by iterating over the rows, but I would prefer a query that does the same. I tried json_build_array and row_to_json on the result of select * from table_name, but no luck.
Any help would be appreciated.
Here is the link I looked at to write a query for JSONB.
You can use json_agg or jsonb_agg:
create table dummytable(somedata jsonb not null);
insert into dummytable(somedata) values
('{"values":["P1","P2","P3"],"key":"ProductOne"}'),
('{"values":["P3"],"key":"ProductTwo"}');
select jsonb_pretty(jsonb_agg(somedata)) from dummytable;
Result:
[
{
"key": "ProductOne",
"values": [
"P1",
"P2",
"P3"
]
},
{
"key": "ProductTwo",
"values": [
"P3"
]
}
]
Retrieving the data row by row and building the array on the client side can be more efficient, though, because the server can start sending data much sooner: right after it retrieves the first matching row from storage. If it has to build the JSON array first, it needs to retrieve all the rows and merge them before it can start sending any data.
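For comparison, a minimal client-side sketch in Python with psycopg2 (the connection string is a placeholder) that streams the rows and assembles the array itself:

import json
import psycopg2

# Placeholder connection string; adjust for your environment
conn = psycopg2.connect("dbname=test user=postgres")

with conn, conn.cursor() as cur:
    # The server can start streaming rows as soon as the first one is found;
    # the client accumulates them into a single JSON array.
    cur.execute("SELECT somedata FROM dummytable")
    result = [row[0] for row in cur]   # psycopg2 returns jsonb columns as dicts
    print(json.dumps(result, indent=2))

conn.close()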

Using pymongo, how can I deal with nested JSON format?

To be more specific,
I loaded the data into MongoDB with PyMongo using this script.
header = ['id', 'info']
for each in reader:
    row = {}
    for field in header:
        row[field] = each[field]
    db.segment.insert_one(row)
The id column has the unique Id of each user, and the info column is composed of nested JSON.
For example, here is the data set in the db
{
    u'_id': ObjectId('111'),
    u'id': u'123',
    u'info': {
        "TYPE": "food",
        "dishes": "166",
        "cc": "20160327 040001",
        "country": "japan",
        "money": 3521,
        "info2": [{"type": "dishes", "number": "2"}]
    }
}
What I want to do is read the values in the nested JSON format.
So what I did is:
pipe = [{"$group":{"_id":"$id", "Totalmoney":{"$sum":"$info.money"}}}]
total_money = db.segment.aggregate(pipeline=pipe)
but the result of the sum is always "0" for every id.
What am I doing wrong? How can I fix it?
I have to use MongoDB because the data size is too big to be handled in Python.
Thank you in advance.

How do I manage a sublist in Mongodb?

I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type).
I'm interested in using Mongodb to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I need to also maintain a relational list of id's where this particular product is available (e.g., store location id's).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
name: "O'Reilly Media",
founded: 1980,
location: "CA",
books: [12346789, 234567890, ...]
}
I am currently importing the data with a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to do an ensure index on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the ID's to insert a new related ID to the books?
Your question could possibly be defined a little better, but let's consider the case that you have rows in a spreadsheet or other source that are all de-normalized in some way. So in a JSON representation the rows would be something like this:
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 12346789
},
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 234567890
}
So in order to get that sort of row result into the structure you want, one way to do it would be to use the "upsert" functionality of the .update() method.
Assuming you have some way of looping over the input values, and that they are identified with some structure, an analog to this would be something like:
books.forEach(function(book) {
    db.publishers.update(
        { "name": book.publisher },
        {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        { "upsert": true }
    );
});
This essentially simplifies the code so that MongoDB does all of the data collection work for you. Where the "name" of the publisher is considered to be unique, what the statement does is first search for a document in the collection that matches the query condition given, the "name".
In the case where that document is not found, a new document is inserted. Either the database or the driver will take care of creating the new _id value for this document, and your query "condition" is also automatically inserted into the new document since it is an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
So that would be simplified logic compared to aggregating the new records in code before sending a new insert operation. However, it is not very "batch"-like, as you are still performing one operation against the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
    batch.push({
        "q": { "name": book.publisher },
        "u": {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        "upsert": true
    });

    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({ "update": "publishers", "updates": batch });
        batch = [];
    }
});

if ( batch.length > 0 ) {
    db.runCommand({ "update": "publishers", "updates": batch });
}
What this is doing is setting up all of the constructed update statements and sending them to the server in a single call, with a sensible number of operations per batch, in this case once every 500 items processed. The actual limit is the BSON document maximum of 16MB, so this can be adjusted as appropriate for your data.
If your MongoDB version is lower than 2.6 then you either use the first form or do something similar to the second form using the existing batch insert functionality. But if you choose to insert then you need to do all the pre-aggregation work within your code.
All of the methods are of course supported with the PHP driver, so it is just a matter of adapting this to your actual code and which course you want to take.
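If you end up scripting the load in Python rather than PHP (an assumption on my part), the same batched upsert pattern can be sketched with pymongo's bulk_write; the database and collection names here are placeholders:

from pymongo import MongoClient, UpdateOne

client = MongoClient()                         # assumes a local mongod
publishers = client["test"]["publishers"]      # placeholder database/collection names

books = [  # sample rows; in practice these come from your spreadsheet
    {"publisher": "O'Reilly Media", "founded": 1980, "location": "CA", "book": 12346789},
    {"publisher": "O'Reilly Media", "founded": 1980, "location": "CA", "book": 234567890}
]

ops = []
for book in books:
    ops.append(UpdateOne(
        {"name": book["publisher"]},
        {
            "$setOnInsert": {
                "founded": book["founded"],
                "location": book["location"]
            },
            "$addToSet": {"books": book["book"]}
        },
        upsert=True
    ))
    if len(ops) == 500:                        # send a full batch of 500 operations
        publishers.bulk_write(ops, ordered=False)
        ops = []

if ops:                                        # flush the remainder
    publishers.bulk_write(ops, ordered=False)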