Parsing Really Messy Nested JSON Strings - pyspark

I have a series of deeply nested json strings in a pyspark dataframe column. I need to explode and filter based on the contents of these strings and would like to add them as columns. I've tried defining the StructTypes but each time it continues to return an empty DF.
Tried using json_tuples to parse but there are no common keys to rejoin the dataframes and the row numbers dont match up? I think it might have to do with some null fields
The sub field can be nullable
Sample JSON
{
"TIME": "datatime",
"SID": "yjhrtr",
"ID": {
"Source": "Person",
"AuthIFO": {
"Prov": "Abc",
"IOI": "123",
"DETAILS": {
"Id": "12345",
"SId": "ABCDE"
}
}
},
"Content": {
"User1": "AB878A",
"UserInfo": "False",
"D": "ghgf64G",
"T": "yjuyjtyfrZ6",
"Tname": "WE ARE THE WORLD",
"ST": null,
"TID": "BPV 1431: 1",
"src": "test",
"OT": "test2",
"OA": "test3",
"OP": "test34
},
"Test": false
}

Related

Pyspark: Best way to set json strings in dataframe column

I need to create couple of columns in Dataframe where I want to parse and store the json string. Here is one json which I need to store in one column. Other json are also similar.Can you please help in how to transform and store this json string in the column. The values section needs to be filled from the values from other dataframe columns within the same data frame.
{
"name": "",
"headers": [
{
"name": "A",
"dataType": "number"
},
{
"name": "B",
"dataType": "string"
},
{
"name": "C",
"dataType": "string"
}
],
"values": [
[
2,
"some value",
"some value"
]
]
}

Updating Mongo DB collection field from object to array of objects

I had to change one of the fields of my collection in mongoDB from an object to array of objects containing a lot of data. New documents get inserted without any problem, but when attempted to get old data, it never maps to the original DTO correctly and runs into errors.
subject is the field that was changed in Students collection.
I was wondering is there any way to update all the records so they all have the same data type, without losing any data.
The old version of Student:
{
"_id": "5fb2ae251373a76ae58945df",
"isActive": true,
"details": {
"picture": "http://placehold.it/32x32",
"age": 17,
"eyeColor": "green",
"name": "Vasquez Sparks",
"gender": "male",
"email": "vasquezsparks#orbalix.com",
"phone": "+1 (962) 512-3196",
"address": "619 Emerald Street, Nutrioso, Georgia, 6576"
},
"subject":
{
"id": 0,
"name": "math",
"module": {
"name": "Advanced",
"semester": "second"
}
}
}
This needs to be updated to the new version like this:
{
"_id": "5fb2ae251373a76ae58945df",
"isActive": true,
"details": {
"picture": "http://placehold.it/32x32",
"age": 17,
"eyeColor": "green",
"name": "Vasquez Sparks",
"gender": "male",
"email": "vasquezsparks#orbalix.com",
"phone": "+1 (962) 512-3196",
"address": "619 Emerald Street, Nutrioso, Georgia, 6576"
},
"subject": [
{
"id": 0,
"name": "math",
"module": {
"name": "Advanced",
"semester": "second"
}
},
{
"id": 1,
"name": "history",
"module": {
"name": "Basic",
"semester": "first"
}
},
{
"id": 2,
"name": "English",
"module": {
"name": "Basic",
"semester": "second"
}
}
]
}
I understand there might be a way to rename old collection, create new and insert data based on old one in to new one. I was wondering for some direct way.
The goal is to turn subject into an array of 1 if it is not already an array, otherwise leave it alone. This will do the trick:
update args are (predicate, actions, options).
db.foo.update(
// Match only those docs where subject is an object (i.e. not turned into array):
{$expr: {$eq:[{$type:"$subject"},"object"]}},
// Actions: set subject to be an array containing $subject. You MUST use the pipeline version
// of the update actions to correctly substitute $subject in the expression!
[ {$set: {subject: ["$subject"] }} ],
// Do this for ALL matches, not just first:
{multi:true});
You can run this converter over and over because it will ignore converted docs.
If the goal is to convert and add some new subjects, preserving the first one, then we can set up the additional subjects and concatenate them into one array as follows:
var mmm = [ {id:8, name:"CORN"}, {id:9, name:"DOG"} ];
rc = db.foo.update({$expr: {$eq:[{$type:"$subject"},"object"]}},
[ {$set: {subject: {$concatArrays: [["$subject"], mmm]} }} ],
{multi:true});

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion so far looks as follows
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
Which however leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in flattenSpec nor did I find something like a string split expression that may be used as a transformSpec.
Any suggestions?
Try setting useFieldDiscover: false in your ingestion spec. when this flag is set to true (which is default case) then it interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link to use flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
Looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using expression string_to_array should do the trick!

Remove Parse-specific fields from database results

Every time I query something from database I get objects like this:
{
"title": "faq",
"description": "",
"type": "application/pdf",
"size": 122974,
"filename": "faq.pdf",
"order": 0,
"createdAt": "2017-08-17T08:10:33.101Z",
"updatedAt": "2017-08-17T08:10:33.101Z",
"ACL": {
"role:cxcccccc_public": {
"read": true
},
"role:cxaaaaaaa_admin": {
"read": true,
"write": true
}
},
"objectId": "l6L5J1mRpH",
"__type": "Object",
"className": "Document"
}
A lot of these fields are Parse-specific: createdAt, updatedAt, ACL, className...
I didn't insert them when I added the row in the database but still they are returned when I do a query.
I want to expose this data through a clean REST API so I want to get rid of them.
Is there a way to get only the data I specified at insert time?
For now I am using lowdash, filtering with
data = _.omit(data, ['ACL', '__type', 'className', ...])

How to query data from mongoDB which column is an array

{
"name": "Grieg And Sibelius Songs",
"status": 1,
"lastupdate": 1306294550,
"intro": "",
"artist": [
"ar9341cd00668311e0a45217d9fa59cf02",
"ar9341cd00668311e0a45217d9fa59cf0x"
],
"publisher": "(Warner Music)",
"release_date": "2006-05-08"
}
I want to find the the data that artist column "ar9341cd00668311e0a45217d9fa59cf02" in array("ar9341cd00668311e0a45217d9fa59cf02","ar9341cd00668311e0a45217d9fa59cf0d","ar9341cd00668311e0a45217d9fa59cf0r") what should I do?
You can use $exists to check the presence of the item in array,
db.yourcollectionname.find({artist: {$exists:'ar9341cd00668311e0a45217d9fa59cf02'}})