PySpark: Best way to set JSON strings in a DataFrame column

I need to create a couple of columns in a DataFrame where I want to store a JSON string. Here is one JSON payload that I need to store in one column; the other payloads are similar. Can you please help with how to build and store this JSON string in the column? The values section needs to be filled from the values of other columns within the same DataFrame.
{
  "name": "",
  "headers": [
    {
      "name": "A",
      "dataType": "number"
    },
    {
      "name": "B",
      "dataType": "string"
    },
    {
      "name": "C",
      "dataType": "string"
    }
  ],
  "values": [
    [
      2,
      "some value",
      "some value"
    ]
  ]
}
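A minimal sketch of one way to build this (the header names and the A, B, C source columns are taken from the sample payload above; adapt to your real schema): construct the payload as a struct with pyspark.sql.functions and serialize it with to_json.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "some value", "some value")], ["A", "B", "C"])

# The headers block is identical for every row, so it is built from literals;
# the values block is filled from the A, B and C columns of the current row.
payload = F.struct(
    F.lit("").alias("name"),
    F.array(
        F.struct(F.lit("A").alias("name"), F.lit("number").alias("dataType")),
        F.struct(F.lit("B").alias("name"), F.lit("string").alias("dataType")),
        F.struct(F.lit("C").alias("name"), F.lit("string").alias("dataType")),
    ).alias("headers"),
    # Spark arrays are homogeneously typed, so the numeric column is cast to
    # string here; the serialized JSON will contain "2" rather than 2.
    F.array(F.array(F.col("A").cast("string"), F.col("B"), F.col("C"))).alias("values"),
)

df = df.withColumn("payload_json", F.to_json(payload))
df.select("payload_json").show(truncate=False)

If the numeric value must stay unquoted in the output, building the string with F.format_string over the raw columns is an alternative, at the cost of handling the escaping of the string values yourself.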

Related

PostgreSQL nested jsonb: update value of complex key/value pairs

I'm starting out with the JSONB data type and hoping someone can help me out.
I have a table (properties) with two columns (id as primary key and data as jsonb).
The data structure is:
{
  "ProductType": "ABC",
  "ProductName": "XYZ",
  "attributes": [
    {
      "name": "Color",
      "type": "STRING",
      "value": "Silver"
    },
    {
      "name": "Case",
      "type": "STRING",
      "value": "Shells"
    },
    ...
  ]
}
I would like to update the value of a specific attributes element, selected by name, for a row with a given id. For example, for the element with "name" = "Case", change the value to "Glass", so it ends up like:
{
  "ProductType": "ABC",
  "ProductName": "XYZ",
  "attributes": [
    {
      "name": "Color",
      "type": "STRING",
      "value": "Silver"
    },
    {
      "name": "Case",
      "type": "STRING",
      "value": "Glass"
    },
    ...
  ]
}
Is this possible with this structure using SQL?
I have created the table structure if any of you would like to give it a shot:
dbfiddle
Use the jsonb concatenation operator, ||, to replace keys on the fly:
WITH properties (id, data) AS (
  VALUES
    (1, '{"ProductType": "ABC","ProductName": "XYZ","attributes": [{"name": "Color","type": "STRING","value": "Silver"},{"name": "Case","type": "STRING","value": "Shells"}]}'::jsonb),
    (2, '{"ProductType": "ABC","ProductName": "XYZ","attributes": [{"name": "Color","type": "STRING","value": "Red"},{"name": "Case","type": "STRING","value": "Shells"}]}'::jsonb)
)
SELECT id,
       data ||
       jsonb_build_object(
         'attributes',
         jsonb_agg(
           CASE
             WHEN attribs->>'name' = 'Case' THEN attribs || '{"value": "Glass"}'::jsonb
             ELSE attribs
           END
         )
       ) AS data
FROM properties m
CROSS JOIN LATERAL jsonb_array_elements(data->'attributes') AS a(attribs)
GROUP BY id, data
Updated fiddle
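The SELECT above only demonstrates the rewrite; to actually persist it, the same aggregation can feed an UPDATE. A minimal sketch in Python, assuming psycopg2 and the properties table from the question (the connection string and parameter values are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE properties p
        SET data = jsonb_set(p.data, '{attributes}', sub.new_attribs)
        FROM (
            SELECT id,
                   -- NB: relies on jsonb_agg keeping the element order
                   -- produced by jsonb_array_elements
                   jsonb_agg(
                       CASE WHEN attribs->>'name' = %(name)s
                            THEN attribs || jsonb_build_object('value', %(value)s)
                            ELSE attribs
                       END
                   ) AS new_attribs
            FROM properties
            CROSS JOIN LATERAL jsonb_array_elements(data->'attributes') AS a(attribs)
            WHERE id = %(id)s
            GROUP BY id
        ) sub
        WHERE p.id = sub.id
        """,
        {"name": "Case", "value": "Glass", "id": 1},
    )
conn.close()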

Ingesting a multi-valued dimension from a comma-separated string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        }
      },
      "dimensionSpec": {
        "dimensions": [
          "event",
          "id",
          "campaigns"
        ]
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your ingestion spec. When this flag is set to true (the default), all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level are interpreted as columns.
Here is a good example of and reference for the flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
It looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using the string_to_array expression should do the trick!
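A sketch of what that could look like, written as a Python dict rather than raw JSON for readability. The placement under dataSchema.transformSpec and the ", " delimiter are assumptions based on the sample event, not tested against a Druid cluster.

# Hypothetical transformSpec splitting the comma-separated string into an
# array via the Druid expression language (string_to_array, Druid >= 0.17.0).
transform_spec = {
    "transforms": [
        {
            "type": "expression",
            "name": "campaigns",
            "expression": "string_to_array(\"campaigns\", ', ')",
        }
    ]
}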

Parsing Really Messy Nested JSON Strings

I have a series of deeply nested JSON strings in a PySpark dataframe column. I need to explode and filter based on the contents of these strings and would like to add them as columns. I've tried defining the StructTypes, but each time it keeps returning an empty DF.
I tried using json_tuple to parse, but there are no common keys to rejoin the dataframes, and the row numbers don't match up. I think it might have to do with some null fields.
The sub-fields can be nullable.
Sample JSON
{
  "TIME": "datatime",
  "SID": "yjhrtr",
  "ID": {
    "Source": "Person",
    "AuthIFO": {
      "Prov": "Abc",
      "IOI": "123",
      "DETAILS": {
        "Id": "12345",
        "SId": "ABCDE"
      }
    }
  },
  "Content": {
    "User1": "AB878A",
    "UserInfo": "False",
    "D": "ghgf64G",
    "T": "yjuyjtyfrZ6",
    "Tname": "WE ARE THE WORLD",
    "ST": null,
    "TID": "BPV 1431: 1",
    "src": "test",
    "OT": "test2",
    "OA": "test3",
    "OP": "test34"
  },
  "Test": false
}
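No answer is recorded here, but a minimal sketch of the usual approach, assuming the sample above is representative (the schema is hand-built from it, and the tiny raw_df below is a placeholder for your real dataframe): define the schema explicitly, parse with from_json, and pull nested fields out with dot paths. An all-null or empty result from from_json usually means the schema doesn't match the document, or the JSON itself is malformed (note the unclosed "OP" string in the original sample).

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Placeholder for your dataframe: one string column holding the JSON.
raw_df = spark.createDataFrame([('{"TIME": "datatime", "SID": "yjhrtr"}',)], ["json_col"])

# Hand-built from the sample document; every field is nullable by default,
# so missing sub-fields parse as null instead of failing the whole row.
schema = T.StructType([
    T.StructField("TIME", T.StringType()),
    T.StructField("SID", T.StringType()),
    T.StructField("ID", T.StructType([
        T.StructField("Source", T.StringType()),
        T.StructField("AuthIFO", T.StructType([
            T.StructField("Prov", T.StringType()),
            T.StructField("IOI", T.StringType()),
            T.StructField("DETAILS", T.StructType([
                T.StructField("Id", T.StringType()),
                T.StructField("SId", T.StringType()),
            ])),
        ])),
    ])),
    T.StructField("Content", T.MapType(T.StringType(), T.StringType())),
    T.StructField("Test", T.BooleanType()),
])

parsed = raw_df.withColumn("parsed", F.from_json("json_col", schema))
flat = parsed.select(
    F.col("parsed.SID").alias("sid"),
    F.col("parsed.ID.AuthIFO.DETAILS.Id").alias("detail_id"),
    F.col("parsed.Content")["Tname"].alias("tname"),
)
flat.show()

Modelling Content as a map is a judgment call: it keeps the many optional keys from having to be spelled out individually, at the cost of every value being read as a string.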

Apache Druid SQL query conversion to a JSON-based query

I am trying to convert the following Druid SQL query to a Druid JSON query, as one of the columns I have is a multi-value dimension, for which Druid does not support SQL-style queries.
My SQL query:
SELECT date_dt, source, type_labels, COUNT(DISTINCT unique_p_hll)
FROM "test"
WHERE type_labels = 'z'
  AND (a_id IN ('a', 'b', 'c') OR b_id IN ('m', 'n', 'p'))
GROUP BY date_dt, source, type_labels;
unique_p_hll is an HLL column with uniques.
The Druid JSON query I came up with is the following:
{
  "queryType": "groupBy",
  "dataSource": "test",
  "granularity": "day",
  "dimensions": ["source", "type_labels"],
  "limitSpec": {},
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "type_labels", "value": "z" },
      { "type": "or", "fields": [
        { "type": "in", "dimension": "a_id", "values": ["a", "b", "c"] },
        { "type": "in", "dimension": "b_id", "values": ["m", "n", "p"] }
      ]}
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "unique_p_hll", "fieldName": "p_id" }
  ],
  "intervals": [ "2018-08-01/2018-08-02" ]
}
But the JSON query seems to return an empty result set.
I can see the output correctly in the Pivot UI, though the array column type_labels values show up as {"array_element": "z"} instead of simply "z".
Does the query return an empty string, or does it return formatted JSON with zero records?
If the former, I can suggest a couple of leads for debugging this issue:
Make sure that the query is properly sent to the Broker, as shown in Druid's query tutorial:
curl -X 'POST' -H 'Content-Type:application/json' -d @query-file.json http://<BROKER-IP>:<BROKER-PORT>/druid/v2?pretty
Also, check the Broker's log for errors.
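An equivalent sketch of the same check in Python, if curl is awkward to use; the broker host and port are placeholders, exactly as in the curl command above.

import json
import requests

BROKER_URL = "http://<BROKER-IP>:<BROKER-PORT>/druid/v2?pretty"  # placeholder

with open("query-file.json") as f:
    query = json.load(f)

resp = requests.post(BROKER_URL, json=query)
resp.raise_for_status()
print(resp.text)  # zero records come back as [] rather than an empty string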

How would I use the attribute as input for Value with JOLT?

For a specific function that I am building, I need to parse my JSON and, in some cases, have the attribute name itself, instead of its value, be used as the value of a new attribute. But how do I manage that with JOLT?
Let's say this is my input:
{
  "Results": [
    {
      "FirstName": "John",
      "LastName": "Doe"
    },
    {
      "FirstName": "Mary",
      "LastName": "Joe"
    },
    {
      "FirstName": "Thomas",
      "LastName": "Edison"
    }
  ]
}
And this should be the outcome:
{
  "Results": [
    {
      "Name": "FirstName",
      "Value": "John"
    },
    {
      "Name": "FirstName",
      "Value": "Mary"
    },
    {
      "Name": "FirstName",
      "Value": "Thomas"
    },
    {
      "Name": "LastName",
      "Value": "Doe"
    },
    {
      "Name": "LastName",
      "Value": "Joe"
    },
    {
      "Name": "LastName",
      "Value": "Edison"
    }
  ]
}
For those interested: I'm building JSON-to-Excel export functionality in Mendix, and it has to be completely dynamic, regardless of the input. To accomplish this, I need an array where each attribute (equal to a column in Excel) is its own object with a column name and a value. If each column's data is its own object, I can simply say "create a column for each object with the same Name". A little difficult to explain, but it 'should' work.
Arrays and Jolt are not the best combination. Basically, there are 3 ways to deal with arrays in Shift:
1. You explicitly assign data to an array position, aka foo[0] and foo[1].
2. You reference a "number" that exists in the input data, aka foo[&2] and foo[&3].
3. You "accumulate" data into a list, aka foo[].
Your input data is an array of size 3. Your desired output is an array of size 6, and you want this to be flexible enough to handle variable inputs.
This means option 3. So you have to "fix" / process your data into its "final form" while maintaining the original input JSON structure (a list with 3 items), and then accumulate all the "built" items into a list.
This means that you are building a list of lists, and then finally "squashing" it down to a single list.
Spec
[
  {
    // Step 1 : Pivot the data into parallel lists of keys and values,
    // maintaining the original outer input list structure.
    "operation": "shift",
    "spec": {
      "Results": {
        "*": { // results index
          "*": { // FirstName / LastName
            "$": "temp[&2].keys[]",
            "@": "temp[&2].values[]"
          }
        }
      }
    }
  },
  {
    // Step 2 : Un-pivot the data into the desired
    // Name/Value pairs, using the inner array index to
    // keep things organized/separated.
    "operation": "shift",
    "spec": {
      "temp": {
        "*": { // temp index
          "keys": {
            "*": "temp[&2].[&].Name"
          },
          "values": {
            "*": "temp[&2].[&].Value"
          }
        }
      }
    }
  },
  {
    // Step 3 : Accumulate the "finished" Name/Value pairs
    // into the final "one big list" output.
    "operation": "shift",
    "spec": {
      "temp": {
        "*": { // outer array
          "*": "Results[]"
        }
      }
    }
  }
]
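To make the intermediate shapes concrete, here is roughly what temp holds at each stage for the sample input, written as Python literals. This is reconstructed from the step comments above rather than captured from a JOLT run, so treat it as an illustration only.

# After step 1: parallel lists of keys and values, one entry per input record.
temp_after_step_1 = [
    {"keys": ["FirstName", "LastName"], "values": ["John", "Doe"]},
    {"keys": ["FirstName", "LastName"], "values": ["Mary", "Joe"]},
    {"keys": ["FirstName", "LastName"], "values": ["Thomas", "Edison"]},
]

# After step 2: a list of lists of finished Name/Value pairs; step 3 then
# flattens these inner lists into the single Results list.
temp_after_step_2 = [
    [{"Name": "FirstName", "Value": "John"}, {"Name": "LastName", "Value": "Doe"}],
    [{"Name": "FirstName", "Value": "Mary"}, {"Name": "LastName", "Value": "Joe"}],
    [{"Name": "FirstName", "Value": "Thomas"}, {"Name": "LastName", "Value": "Edison"}],
]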