Create new DataFrame from JSON element inside XML using PySpark

Hi, I'm dealing with a rather difficult XML file which I'm trying to reformat and clean up for some processing. I've been using PySpark to process the data into a DataFrame, and I am using com.databricks.spark.xml to read the file.
My DataFrame looks like this; each field is JSON-formatted:
+----------------+---------------------------------+
| Identifier| Info|
+----------------+---------------------------------+
| JSON | Json |
| | |
| | |
+----------------+---------------------------------+
This is a sample value from the Identifier column:
{
"Other": [
{
"_Type": "A",
"_VALUE": "999"
},
{
"_Type": "B",
"_VALUE": "31086"
},
{
"_Type": "C",
"_VALUE": "13123"
},
{
"_Type": "D",
"_VALUE": "32323"
},
{
"_Type": "E",
"_VALUE": "2223"
},
{
"_Type": "F",
"_VALUE": "100"
}
]
}
And this is what the Info column looks like:
{
"Demo": {
"BirthDate": "2009-09-13",
"BirthPlace": {
"_VALUE": null,
"_nil": true
},
"Rel": {
"_VALUE": null,
"_nil": true
}
},
"EmailList": {
"_VALUE": null,
"_nil": true
},
"Name": {
"LastName": "Marwan",
"FullName": {
"_VALUE": null,
"_nil": true
},
"GivenName": "Saad",
"MiddleName": null,
"PreferredFamilyName": {
"_VALUE": null,
"_nil": true
}
},
"OtherNames": {
"_VALUE": null,
"_nil": true
}
}
I am trying to create a DataFrame that looks like the following:
+-------+--------+-----------+------------+------------+
| F| E| LastName| GivenName | BirthDate|
+-------+--------+-----------+------------+------------+
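A minimal sketch of one way to get there, assuming the two columns arrive as JSON strings in a DataFrame named df (if spark-xml has already parsed them into structs, you can skip from_json and select the struct fields directly). The schemas below cover only the fields needed and are inferred from the samples above:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

identifier_schema = StructType([
    StructField("Other", ArrayType(StructType([
        StructField("_Type", StringType()),
        StructField("_VALUE", StringType()),
    ])))
])

info_schema = StructType([
    StructField("Demo", StructType([
        StructField("BirthDate", StringType()),
    ])),
    StructField("Name", StructType([
        StructField("LastName", StringType()),
        StructField("GivenName", StringType()),
    ])),
])

parsed = (df
    .withColumn("ident", F.from_json("Identifier", identifier_schema))
    .withColumn("info", F.from_json("Info", info_schema)))

# One row per entry of the Other array, alongside the name/birth fields.
exploded = parsed.select(
    F.col("info.Name.LastName").alias("LastName"),
    F.col("info.Name.GivenName").alias("GivenName"),
    F.col("info.Demo.BirthDate").alias("BirthDate"),
    F.explode("ident.Other").alias("o")
).select("LastName", "GivenName", "BirthDate",
         F.col("o._Type").alias("Type"),
         F.col("o._VALUE").alias("Value"))

# Pivot the _Type values into columns; listing the wanted types keeps the
# pivot cheap. This assumes one person per input row, so grouping by the
# name and birth-date fields reassembles each record.
result = (exploded
    .groupBy("LastName", "GivenName", "BirthDate")
    .pivot("Type", ["F", "E"])
    .agg(F.first("Value"))
    .select("F", "E", "LastName", "GivenName", "BirthDate"))
Add more _Type values to the pivot list if you need the other identifiers as columns too.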

Related

Merge 2 JSON objects from 2 files using jq

I have two JSON files:
1.json
{
"outputs": {
"item1": {
"name": "name1",
"email": "email1"
}
}
}
2.json
{
"outputs": {
"item2": {
"name": "name2",
"email": "email2"
}
}
}
I'm trying to merge them using jq:
jq -s '{
  "Items": {
    "list": .[] | .outputs
  }
}' 1.json 2.json
and I get just two Items objects, but I want to have one Items object with all item* merged, like this:
{
"Items": {
"objects": {
"item1": {
"name": "name1",
"email": "email1"
},
"item2": {
"name": "name2",
"email": "email2"
}
}
}
}
I've tried the .[0] * .[1] trick, but I cannot put it into the object construction.
How can I do this with jq?
Try mapping over the slurped array and merging the outputs objects with add:
jq -s '{Items: {objects: map(.outputs) | add}}' 1.json 2.json
Another approach could be using reduce with inputs and the -n flag:
jq -n 'reduce inputs.outputs as $i ({}; .Items.objects += $i)' 1.json 2.json
Output:
{
"Items": {
"objects": {
"item1": {
"name": "name1",
"email": "email1"
},
"item2": {
"name": "name2",
"email": "email2"
}
}
}
}

MongoDb extract array

I have a MongoDB database from which I want to extract some specific data.
Here is my JSON:
{
"jobs" : [
{
"id": 554523,
"code": "1256-554523",
"name": "Banco de Talentos",
"status": "published",
"type": "vacancy_type_effective",
"publicationType": "external",
"numVacancies": 1,
"departmentId": 108141,
"departmentName": "FUTURAS OPORTUNIDADES",
"roleId": 169970,
"roleName": "BANCO DE TALENTOS",
"createdAt": "2020-10-30T12:23:48.572Z",
"updatedAt": "2020-12-30T23:21:30.403Z",
"branchId": null,
"branchName": null
},
{
"id": 616834,
"code": "1256-616834",
"name": "YYYYYY (o) YYYYY",
"status": "frozen",
"type": "vacancy_type_effective",
"publicationType": "external",
"numVacancies": 1,
"departmentId": 109190,
"departmentName": "TESTE TESTE",
"roleId": 165712,
"roleName": "SL - TESTE PL",
"createdAt": "2020-12-16T14:17:36.187Z",
"updatedAt": "2021-01-29T17:08:43.613Z",
"branchId": 120448,
"branchName": "TESTE TESTE1"
}
],
"application": [
{
"id": 50707344,
"score": 40.251965332031254,
"partnerName": "indeed",
"endedAt": null,
"createdAt": "2020-12-21T11:21:30.587Z",
"updatedAt": "2021-02-18T22:02:35.866Z",
"tags": {},
"candidate": {
"birthdate": "1986-04-04",
"id": 578615,
"name": "TESTE",
"lastName": "TESTE TESTE",
"email": "teste#teste.com.br",
"identificationDocument": "34356792807",
"countryOfOrigin": "BR",
"linkedinProfileUrl": "teste",
"gender": "female",
"mobileNumber": "+5511972319799",
"phoneNumber": "(11)2463-2039"
},
"job": {
"id": 619713,
"name": "XXXXde XXXX Pleno"
},
"manualCandidate": null,
"currentStep": {
"id": 3527370,
"name": "Cadastro",
"status": "done"
}
},
{
"id": 50707915,
"score": 3.75547943115234E+1,
"partnerName": "indeed",
"endedAt": null,
"createdAt": "2020-12-21T11:31:31.877Z",
"updatedAt": "2021-02-18T14:07:06.605Z",
"tags": {},
"candidate": {
"birthdate": "1971-10-02",
"id": 919358,
"name": "TESTE TESTE",
"lastName": "SILVA",
"email": "teste.teste#teste.com",
"identificationDocument": "3232323232",
"countryOfOrigin": "BR",
"linkedinProfileUrl": "teste/",
"gender": "female",
"mobileNumber": "11 94021- 5521",
"phoneNumber": "+5511995685247"
},
"job": {
"id": 619713,
"name": "Analista de XXXXX Pleno"
},
"manualCandidate": null,
"currentStep": {
"id": 3527370,
"name": "Cadastro",
"status": "done"
}
}
]
}
My question is: how can I extract the array objects in jobs and application separately? Does anyone know how to do this in MongoDB?
I need to do this so that afterwards I can insert the extracted data into different collections.
Thanks a lot.
Unfortunately there is no real "good" way of doing this. Here is an example of how I would do it, using $facet and other operators to manipulate the structure:
db.collection.aggregate([
{
$match: {
/** your query*/
}
},
{
$facet: {
jobs: [
{
$unwind: "$jobs"
},
{
$replaceRoot: {
newRoot: "$jobs"
}
}
],
applications: [
{
$unwind: "$application"
},
{
$replaceRoot: {
newRoot: "$application"
}
}
]
}
},
{
$addFields: {
"merged": {
"$concatArrays": [
"$jobs",
"$applications"
]
}
}
},
{
$unwind: "$merged"
},
{
"$replaceRoot": {
"newRoot": "$merged"
}
}
])
Personally, I would just do it in application code after fetching the document.
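For instance, a minimal sketch of that approach in Python with pymongo (the connection string and the mydb/source/jobs/applications names are placeholders for your own setup):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Fetch the wrapper document, then write each embedded array
# into its own collection.
doc = db.source.find_one()
if doc:
    if doc.get("jobs"):
        db.jobs.insert_many(doc["jobs"])
    if doc.get("application"):
        db.applications.insert_many(doc["application"])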

check if a field of type array contains an array

I'm using Mongoose, and I have the following data in a user collection:
[{
"_id": "1",
"notes": [
{
"value": "A90",
"text": "math"
},
{
"value": "A80",
"text": "english"
},
{
"value": "A70",
"text": "art"
}
]
},
{
"_id": "2",
"notes": [
{
"value": "A90",
"text": "math"
},
{
"value": "A80",
"text": "english"
}
]
},
{
"_id": "3",
"notes": [
{
"value": "A80",
"text": "art"
}
]
}]
and I have the following array as a parameter: [ "A90", "A80" ]
I want to make a query that uses this array to return only the records that have all of the array's items among their notes' value fields.
So for the example above it will return:
[{
"_id": "1",
"notes": [
{
"value": "A90",
"text": "math"
},
{
"value": "A80",
"text": "english"
},
{
"value": "A70",
"text": "art"
}
]
},
{
"_id": "2",
"notes": [
{
"value": "A90",
"text": "math"
},
{
"value": "A80",
"text": "english"
}
]
}]
I tried the following find query:
{ "notes": { $elemMatch: { value: { $in: valuesArray } } }}
but it returns a record even if just one element of valuesArray exists.
It turned out to be quite easy:
find({ "notes.value": { $all: arrayValues } })
Unlike $in, which matches as soon as any one value is present, $all only matches documents whose notes.value array contains every element of arrayValues.
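For completeness, a sketch of the same filter from Python with pymongo (the question uses Mongoose, but the filter document is identical; mydb and users are placeholder names):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["mydb"]["users"]

values_array = ["A90", "A80"]
# Returns only documents whose notes.value array contains every element.
matching = list(users.find({"notes.value": {"$all": values_array}}))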

MongoDB Aggregation Error Returning wrong result

I have a JSON object like this:
{
"_id": "5c2e811154855c0012308f00",
"__pclass": "QXRzXFByb2plY3RcTW9kZWxcUHJvamVjdA==",
"id": 44328,
"name": "Test project via postman2//2",
"address": "some random address",
"area": null,
"bidDate": null,
"building": {
"name": "Health Care Facilities",
"type": "Dental Clinic"
},
"collaborators": [],
"createdBy": {
"user": {
"id": 7662036,
"name": "Someone Here"
},
"firm": {
"id": 2520967,
"type": "ATS"
}
},
"createdDate": "2019-01-03T21:39:29Z",
"customers": [],
"doneBy": null,
"file": null,
"firm": {
"id": 1,
"name": "MyFirm"
},
"leadSource": {
"name": "dontknow",
"number": "93794497"
},
"location": {
"id": null,
"city": {
"id": 567,
"name": "Bahamas"
},
"country": {
"id": 38,
"name": "Canada"
},
"province": {
"id": 7,
"name": "British Columbia"
}
},
"modifiedBy": null,
"modifiedDate": null,
"projectPhase": {
"id": 1,
"name": "pre-design"
},
"quotes": [{
"id": 19,
"opportunityValues": {
"Key1": 100,
"Key2 Key2": 100,
"Key3 Key3 Key3": 200,
}
}],
"specForecast": [],
"specIds": [],
"tags": [],
"valuation": "something"
}
I am trying to aggregate using the query below in MongoDB. My aggregation key is four levels deep and also contains spaces. All the online examples show aggregation at the first level; following them, I tried to do the same with my deeply nested key.
db.mydata.aggregate([
{$match: {"id": 44328 } } ,
{$group: { _id: "$quotes.id",
totalKey2:{ $sum: "$quotes.opportunityValues.Key2 Key2"},
totalKey3:{ $sum: "$quotes.opportunityValues.Key3 Key3 Key3"}
}
}
]);
This should return:
_id totalKey2 totalKey3
0 19 100 300
But it is returning
_id totalKey2 totalKey3
0 19 0 0
What am I doing wrong?
Although using spaces in field names is not recommended in Mongo, it works as expected.
The problem with your query is that quotes is an array, so you should $unwind it before grouping; without the unwind, the nested path resolves to an array rather than a number, and $sum returns 0.
This works as expected:
db.mydata.aggregate([
{ $match: { "id": 44328 } } ,
{ $unwind: "$quotes" },
{ $group: { _id: "$quotes.id",
totalKey2:{ $sum: "$quotes.opportunityValues.Key2 Key2" },
totalKey3:{ $sum: "$quotes.opportunityValues.Key3 Key3 Key3" } }
}
]);

OrientDB ETL loading CSV with vertices in one file and edges in another

I have some data in 2 CSV files: one contains the vertices and the other contains the edges. I'm working out how to set this up using ETL and am close but not quite there yet. It mostly works, but my edges have properties and I'm not sure they're loading right. This question was helpful but I'm still missing something...
Here's my data:
vertices.csv:
label,data,date
v01,0.1234,2015-01-01
v02,0.5678,2015-01-02
v03,0.9012,2015-01-03
edges.csv:
u,v,weight,date
v01,v02,12.4,2015-06-17
v02,v03,17.9,2015-09-14
I import my vertices using this:
commonVertices.json:
{
"begin": [
{ "let": { "name": "$filePath",
"expression": "$fileDirectory.append($fileName)"
}
}
],
"config": { "log": "info"},
"source": { "file": { "path": "$filePath" } },
"extractor": { "csv": { "ignoreEmptyLines": true,
"nullValue": "N/A",
"dateFormat": "yyyy-mm-dd"
}
},
"transformers": [
{ "vertex": { "class": "myVertex" } },
{ "code": { "language": "Javascript",
"code": "print(' Current record: ' + record); record;" }
}
],
"loader": { "orientdb": {
"dbURL": "plocal:my_orientdb",
"dbType": "graph",
"batchCommit": 1000,
"classes": [ { "name": "myVertex", "extends", "V" },
],
"indexes": []
}
}
}
vertices.json:
{ "config": { "log": "info",
"fileDirectory": "./",
"fileName": "vertices.csv"
}
}
commonEdges.json:
{
"begin": [
{ "let": { "name": "$filePath",
"expression": "$fileDirectory.append($fileName )"
}
}
],
"config": { "log": "info"
},
"source": { "file": { "path": "$filePath" } },
"extractor": { "csv": { "ignoreEmptyLines": true,
"nullValue": "N/A",
"dateFormat": "yyyy-mm-dd"
}
},
"transformers": [
{ "merge": { "joinFieldName": "u", "lookup": "myVertex.label" } },
{ "edge": { "class": "myEdge",
"joinFieldName": "v",
"lookup": "myVertex.label",
"direction": "out",
"unresolvedLinkAction": "NOTHING"
}
},
{ "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:my_orientdb",
"dbType": "graph",
"batchCommit": 1000,
"useLightweightEdges": false,
"classes": [
{ "name": "myEdge", "extends", "E" }
],
"indexes": []
}
}
}
edges.json:
{
"config": {
"log": "info",
"fileDirectory": "./",
"fileName": "edges.csv"
}
}
I am running it with oetl.sh like this:
$ oetl.sh vertices.json commonVertices.json
$ oetl.sh edges.json commonEdges.json
Everything runs, but I'm new to OrientDB, so maybe the properties are being loaded somewhere else; in any case, when I query the edges I don't see the weight and date fields:
orientdb {db=my_orientdb}> SELECT FROM myEdge
+----+-----+------+-----+-----+
|# |#RID |#CLASS|out |in |
+----+-----+------+-----+-----+
|0 |#33:0|myEdge|#25:0|#26:0|
|1 |#34:0|myEdge|#26:0|#27:0|
+----+-----+------+-----+-----+
The vertex table contains the [weight] field from my edges.csv, and the [date] field is getting clobbered in a weird way: the day of the month is getting overwritten with the day from edges.csv, which is undesirable, but it's odd to me that the month itself isn't also getting changed:
orientdb {db=my_orientdb}> SELECT FROM myVertex
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|# |#RID |#CLASS |data |date |label|weight|out_myEdge|in_myEdge|
+----+-----+--------+------+-------------------+-----+------+----------+---------+
|0 |#25:0|myVertex|0.1234|2015-01-17 00:06:00|v01 |12.4 |[#33:0] | |
|1 |#26:0|myVertex|0.5678|2015-01-14 00:09:00|v02 |17.9 |[#34:0] |[#33:0] |
|2 |#27:0|myVertex|0.9012|2015-01-03 00:01:00|v03 | | |[#34:0] |
+----+-----+--------+------+-------------------+-----+------+----------+---------+
I'm sure it's probably a simple tweak, any help would be great!
In the edge transformer, use edgeFields to bind properties to the edges. Example:
"transformers": [
{ "merge": { "joinFieldName": "u", "lookup": "myVertex.label" } },
{ "edge": { "class": "myEdge",
"joinFieldName": "v",
"lookup": "myVertex.label",
"edgeFields": { "weight": "${input.weight}", "date": "${input.date}" },
"direction": "out",
"unresolvedLinkAction": "NOTHING"
}
},
{ "field": { "fieldNames": ["u", "v"], "operation": "remove" } }
],
As an aside, the odd date behavior comes from the extractor's "dateFormat": "yyyy-mm-dd": in Java date patterns lowercase mm means minutes, not months, so 2015-06-17 parses as January 17 with 6 minutes. Use "yyyy-MM-dd" instead.
Hope it helps.