How to create separate CSV for nested JSON array in Az Data Factory - azure-data-factory

I have a JSON as below. I can flatten this in ADF using Data Flow, but the subject_details array can hold a lot of values, for which I would like to create a separate CSV.
{
  "studentid": 99999,
  "schoolid": "100574521",
  "name": "BLUE LAY",
  "set_id": 53,
  "subject_details": [
    {
      "subject_code": "url_key",
      "value": "100574521"
    },
    {
      "subject_code": "band",
      "value": "29732"
    },
    {
      "subject_code": "description",
      "value": "Summer "
    },
    {
      "subject_code": "options_container",
      "value": "container2"
    },
    {
      "subject_code": "has_options",
      "value": "0"
    },
    {
      "subject_code": "category_ids",
      "value": [
        "463",
        "630"
      ]
    }
  ]
}

You can create a separate CSV file from the nested JSON array using the Select transformation in Data flows.
After creating the source for the JSON in the Data flow, add a Select transformation to the source from the options. In the Select settings you will see all the JSON columns; to get only the nested JSON array into the CSV, delete every column apart from subject_details by clicking the delete icon on each column.
Now, add a sink to the Select transformation to save this as a CSV file in Blob storage.
Create the respective linked service and dataset for the CSV and add them to the sink.
Then, in the sink settings, choose Output to single file from the File name option drop-down and provide the CSV file name.
Execute this data flow from a pipeline and you will see the CSV in the blob.
You will get the data in this un-flattened form in the CSV if you do it before the Flatten transformation. So, the better practice is to create the CSV file for the nested JSON array after flattening it.
Please refer to the Microsoft documentation to know more about the Flatten transformation in Data flow.
After flattening the JSON, you can use the same Select option to create the CSV from it by deleting all columns apart from the two flattened columns (subject_code and value), as above. If you want, you can also export the data as CSV from the Data preview of the Select itself.
Then add the sink and execute the data flow, and you will see CSV data like the sample below in the blob.
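For reference, the flattened output would look roughly like this; the exact rendering of the array value in the last row depends on how it is mapped or stringified in the data flow:
subject_code,value
url_key,100574521
band,29732
description,Summer
options_container,container2
has_options,0
category_ids,"[""463"",""630""]"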
In your nested JSON array, the value field has different data types: string in the first five rows and an array in the last one. This makes it difficult to flatten, so if you face any issues with Flatten, try to keep the same data type for value (for example, an array) across all rows.

Related

list contains for structure data type in DMN decision table

I am planning to use Drools for executing the DMN models. However, I am having trouble writing a condition in a DMN decision table where the input is an array of objects with a structure data type and the condition is to check whether the array contains an object with specific fields. For example:
Input to decision table is as below:
[
  {
    "name": "abc",
    "lastname": "pqr"
  },
  {
    "name": "xyz",
    "lastname": "lmn"
  },
  {
    "name": "pqr",
    "lastname": "jkl"
  }
]
Expected output: True if the above list contains an element that matches {"name": "abc", "lastname": "pqr"}, with both fields matching on the same element in the list.
I see that FEEL has support for list contains, but I could not find the syntax for the case where the objects in the array are not primitive types like number or string but structures. So, I need help with writing this condition in the decision table.
Thanks!
Edited description:
I am trying to achieve the following using the decision table, where details is a list of the info structure. Unfortunately, as you can see, I am not getting the desired output even though my input list contains the specific element I am looking for.
Input: details = [{"name": "hello", "lastname": "world"}]
Expected Output = "Hello world" based on condition match in row 1 of the decision table.
Actual Output = null
NOTE: Also, in row 2 of the decision table, I only check the condition for the name field, since that is the only field I am interested in there.
Content for the DMN file can be found over here
In this question, the overall need and requirements for the decision table are not clear.
Regarding the part of the question about:
True if the above list contains an element that match {"name": "abc", "lastname": "pqr"}
...
I see that FEEL has support for list contains, but I could not find a syntax where objects in array are not of primitive types like number,string etc but structures.
This can indeed be achieved with the list contains() function, described here.
Example expression
list contains(my list, {"name": "abc", "lastname": "pqr"})
where my list is the verbatim FEEL list from the original question statement.
Running this example gives the expected output, true.
Naturally, two contexts (complex structures) are the same if all their properties and fields are equivalent.
In DMN, there are multiple ways to achieve the same result.
If I understand the real goal of your use case correctly, I want to suggest a better approach that is much easier to maintain from a design point of view.
First of all, you have a list of users as input, so you define the data types accordingly (a user structure and a list of users).
Then, you have to structure your decision a bit:
The decision node at least one user match will go through the user list and check whether there is at least one user that matches the conditions inside the matching BKM.
at least one user match can be implemented with the following FEEL expression:
some user in users satisfies matching(user)
The great benefit of this approach is that you can reason about a specific element of your list inside the matching BKM, which makes the matching decision table extremely straightforward.
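As a rough sketch of how the pieces fit together in FEEL (the names users, user and matching follow the wording of this answer; the concrete condition inside the BKM is assumed for illustration):
matching BKM, a boxed expression over a single user parameter:
user.name = "abc" and user.lastname = "pqr"
at least one user match decision, delegating the per-element check to the BKM:
some user in users satisfies matching(user)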

Representing a file in MongoDB

I would like to process a CSV or Excel file, convert it into JSON and store it in MongoDB for a particular user. I would then like to do queries that filter depending on the user id, file name, or by attributes in the cells.
The method suggested to me is that each document would represent a row from the CSV/Excel file. I would add the filename and username to every single row.
Here is an example of one document (row)
{ user_id: 1, file_name: "fileName.csv", name: "Michael", surname: "Smith"},
The problem I have with this is that every time a query is executed it will have to go through the whole database and filter out any rows not associated with that user id or filename. If the database contained tens of millions of rows then surely this would be very slow?
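For reference, querying such per-row documents in the mongo shell would look something like this (the collection name rows is an assumption for illustration):
db.rows.find({ user_id: 1, file_name: "fileName.csv" })   // all rows of one file for one user
db.rows.find({ user_id: 1, surname: "Smith" })             // rows across that user's files matching a cell value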
The structure I think is better is the one below, but I've been told it wouldn't be fast to query. I would have thought it would be quicker, as now you just need to find one entry by user id, then the files you want to query, then the rows.
{
  "user_id": 1,
  "files": [
    {
      "file_name": "fileName.csv",
      "rows": [
        {
          "name": "Michael",
          "surname": "Smith"
        }
      ]
    }
  ]
}
I'm still rather new to MongoDB so I'm sure it's just a lack of understanding on my part.
What is the best representation of the data?

Apache Druid - preserving order of elements in a multi-value dimension

I am using Apache Druid to store multi-value dimensions for customers.
While loading data from a CSV, I noticed that the order of the elements in the multi-value dimension is getting changed. E.g. Mumbai|Delhi|Chennai gets ingested as ["Chennai","Mumbai","Delhi"].
It is important for us to preserve the order of elements so that we can apply filters in the query using the MV_OFFSET function. One workaround is to create an explicit order element and concatenate it to each value (like ["3~Chennai","1~Mumbai","2~Delhi"]), but this hampers plain group-by aggregations.
Is there any way to preserve the order of the elements in a multi-value dimension during load time?
Thanks to the response from Navis Ryu on the Druid Slack channel, the following dimension spec will keep the order of the elements unchanged:
"dimensions": [
"page",
"language",
{
"type": "string",
"name": "userId",
"multiValueHandling": "ARRAY"
}
]
More details around the functionality here.
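Applied to the question's data, the spec would presumably look like this (the dimension name cities is an assumption; multiValueHandling set to "ARRAY" is the relevant part):
"dimensions": [
  {
    "type": "string",
    "name": "cities",
    "multiValueHandling": "ARRAY"
  }
]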

Why use Postgres JSON column type?

The JSON column type accepts non-valid JSON.
For example, [1,2,3] can be inserted without being wrapped in {}.
Is there any difference between JSON and string?
While [1,2,3] is valid JSON, as zerkms has stated in the comments, to answer the primary question: Is there any difference between JSON and string?
The answer is yes. A whole new set of query operations, functions, etc. apply to json or jsonb columns that do not apply to text (or related types) columns.
For example, with text columns you would need to use regular expressions and related string functions (or a custom function) to parse the string, whereas with json or jsonb there is a separate set of query operators and functions that work with the structured nature of JSON.
From the Postgres doc, given the following JSON:
{
  "guid": "9c36adc1-7fb5-4d5b-83b4-90356a46061a",
  "name": "Angela Barton",
  "is_active": true,
  "company": "Magnafone",
  "address": "178 Howard Place, Gulf, Washington, 702",
  "registered": "2009-11-07T08:53:22 +08:00",
  "latitude": 19.793713,
  "longitude": 86.513373,
  "tags": [
    "enim",
    "aliquip",
    "qui"
  ]
}
The doc then says:
We store these documents in a table named api, in a jsonb column named jdoc. If a GIN index is created on this column, queries like the following can make use of the index:
-- Find documents in which the key "company" has value "Magnafone"
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"company": "Magnafone"}';
This allows you to query the jsonb (or json) fields very differently than if it were simply a text or related field.
Here is some Postgres doc that provides some of those query operators and functions.
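As a rough sketch of a few of them, reusing the api table and jdoc column from the doc example above (the GIN index statement mirrors the index mentioned in the quote):
CREATE INDEX api_jdoc_idx ON api USING GIN (jdoc);

SELECT jdoc->>'name' AS name,                 -- ->> extracts a field as text
       jdoc->'tags' AS tags,                  -- -> extracts a field as jsonb
       jsonb_array_length(jdoc->'tags') AS n_tags
FROM api
WHERE jdoc->'tags' ? 'enim';                  -- ? checks for a top-level key or array element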
Basically, if you have JSON data that you want to treat as JSON data, then a column is best specified as json or jsonb (which one you choose depends on whether you want to store it as plain text or binary, respectively).
The above data can be stored in a text column, but the JSON data types have the advantage that you can apply JSON rules and operators in those columns. There are several JSON-specific functions that cannot be used on text fields.
Refer to this link to learn about the JSON functions and operators.

Logging file access with MongoDB

I am designing my first MongoDB (and first NoSQL) database and would like to store information about files in a collection. As part of each file document, I would like to store a log of file accesses (both reads and writes).
I was considering creating an array of log messages as part of the document:
{
  "filename": "some_file_name",
  "logs": [
    { "timestamp": "2012-08-27 11:40:45", "user": "joe", "access": "read" },
    { "timestamp": "2012-08-27 11:41:01", "user": "mary", "access": "write" },
    { "timestamp": "2012-08-27 11:43:23", "user": "joe", "access": "read" }
  ]
}
Each log message will contain a timestamp, the type of access, and the username of the person accessing the file. I figured that this would allow very quick access to the logs for a particular file, probably the most common operation that will be performed with the logs.
I know that MongoDB has a 16 MB document size limit. I imagine that files that are accessed very frequently could push against this limit.
Is there a better way to design the NoSQL schema for this type of logging?
Let's first try to calculate the average size of one log record:
timestamp field name = 18 bytes, timestamp value = 8, user field name = 8, user value = 20 (10 characters at most, or on average, I guess), access field name = 12, access value = 10. So the total is about 76 bytes, which means you can fit roughly 220,000 log records in one document.
Half of that physical space is used by field names. If you shorten them to timestamp = t, user = u, access = a, you will be able to store about 440,000 log items.
So, I think this is enough for most systems. In my projects I always try to embed rather than create a separate collection, because it is a way to achieve good performance with MongoDB.
In the future you can move your log records into a separate collection. Also, for performance, you can keep something like the last 30 log records (simply denormalize them) in the file document for fast retrieval, in addition to the logs collection.
Also, if you go with one collection, make sure that you are not loading the logs when you do not need them (you can include/exclude fields in MongoDB). Also, use $slice to do paging.
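For illustration, a minimal sketch of both techniques in the mongo shell; the collection name files is an assumption:
db.files.find({ filename: "some_file_name" }, { logs: 0 })                       // exclude the embedded logs entirely
db.files.find({ filename: "some_file_name" }, { logs: { $slice: [20, 10] } })    // paging: skip 20 log entries, return the next 10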
And one last thing: Enjoy mongo!
If you think the document limit will become an issue, there are a few alternatives.
The obvious one is to simply create a new document for each log entry.
So you will have a collection "logs" with this schema:
{
  "filename": "some_file_name",
  "timestamp": "2012-08-27 11:40:45",
  "user": "joe",
  "access": "read"
}
A query to find which files "joe" read will be something like:
db.logs.find({user: "joe", access: "read"})
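If only the file names are needed, something along these lines could presumably be used as well; the index is an assumption for illustration, not part of the original answer:
db.logs.createIndex({ user: 1, access: 1 })                      // supports the query above on a large collection
db.logs.distinct("filename", { user: "joe", access: "read" })    // just the distinct file names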