PySpark: Explode schema columns does not match with underlying nested schema

I am using pyspark in combination with Azure Synapse. I am reading multiple nested JSON files with the same structure into a dataframe, using the sample below:
{
    "AmountOfOrders": 2,
    "TotalEarnings": 1800,
    "OrderDetails": [
        {
            "OrderNumber": 1,
            "OrderDate": "2022-7-06",
            "OrderLine": [
                { "LineNumber": 1, "Product": "Laptop", "Price": 1000 },
                { "LineNumber": 2, "Product": "Tablet", "Price": 500 },
                { "LineNumber": 3, "Product": "Mobilephone", "Price": 300 }
            ]
        },
        {
            "OrderNumber": 2,
            "OrderDate": "2022-7-06",
            "OrderLine": [
                { "LineNumber": 1, "Product": "Printer", "Price": 100, "Discount": 0 },
                { "LineNumber": 2, "Product": "Paper", "Price": 50, "Discount": 0 },
                { "LineNumber": 3, "Product": "Toner", "Price": 30, "Discount": 0 }
            ]
        }
    ]
}
I am trying to get the LineNumbers of OrderNumber 1 in a separate dataframe, using a generic function which extracts the arrays and structs of the dataframe. I use the code below:
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, StructType

def read_nested_structure(df, excludeList, messageType, coll):
    display(df.limit(10))
    print('read_nested_structure')
    cols = []
    match = 0
    match_field = ""
    print(df.schema[coll].dataType.fields)
    for field in df.schema[coll].dataType.fields:
        for c in excludeList:
            if c == field.name:
                print('Match = ' + field.name)
                match = 1
        if match == 0:
            # cols.append(coll)
            cols.append(col(coll + "." + field.name).alias(field.name))
        match = 0
    # cols.append(coll)
    print(cols)
    df = df.select(cols)
    return df

def read_nested_structure_2(df, excludeList, messageType):
    cols = []
    match = 0
    for coll in df.schema.names:
        if isinstance(df.schema[coll].dataType, ArrayType):
            print(coll + "-- : Array")
            df = df.withColumn(coll, explode(coll).alias(coll))
            cols.append(coll)
        elif isinstance(df.schema[coll].dataType, StructType):
            if messageType == 'Header':
                for field in df.schema[coll].dataType.fields:
                    cols.append(col(coll + "." + field.name).alias(coll + "_" + field.name))
            elif messageType == 'Content':
                print('Struct - Content')
                for field in df.schema[coll].dataType.fields:
                    cols.append(col(coll + "." + field.name).alias(field.name))
        else:
            for c in excludeList:
                if c == coll:
                    match = 1
            if match == 0:
                cols.append(coll)
    df = df.select(cols)
    return df
df = spark.read.load(datalakelocation + '/test.json', format='json')
df = unpack_to_content_dataframe_simple_2(df,exclude)
df = df.filter(df.OrderNumber == 1)
df = unpack_to_content_dataframe_simple_2(df,exclude)
display(df.limit(10))
which results in the following dataframe:
As you can see, the yellow-marked attribute (Discount) is added to the dataframe even though it is not part of OrderNumber 1. How can I filter a row in the dataframe so that the schema is updated as well (in this case, without the Discount attribute)?

I used your read_nested_structure_2() function in the following way to get the same result as yours:
x = read_nested_structure_2(df,[],'Header')
y = read_nested_structure_2(x,[],'Content')
y = y.filter(y.OrderNumber == 1)
z = read_nested_structure_2(y,[],'Header')
final = read_nested_structure_2(z,[],'Content')
display(final)
The output after running this code is:
The Discount column is created even if it is specified for only one Product in the entire input JSON. To remove this column, we have to handle it separately, producing another dataframe without Discount (only when it is invalid).
Since you are going to use the same function to extract data from StructType and ArrayType columns, it is not recommended to add the code that removes fields with all null values (such as Discount) to that same function. Doing so would complicate the code.
Instead, we can write another function which does this work for us: it should remove a column where all of its values are null. The following function can be used to do this.
def exclude_fields_that_dont_exist(filtered_df):
    cols = []
    # iterate through columns
    for column in filtered_df.columns:
        # null_count is the count of null values in a column
        null_count = filtered_df.filter(filtered_df[column].isNull()).count()
        # check if null_count equals the total column value count;
        # if they are equal, those columns are not required (like Discount)
        if filtered_df.select(column).count() != null_count:
            cols.append(column)
    # return dataframe with required columns
    return filtered_df.select(*cols)
When you use this function on the filtered dataframe (final in my case), you get the resulting dataframe shown below:
mydf = exclude_fields_that_dont_exist(final)
# removes columns from a dataframe that have all null values.
display(mydf)
NOTE:
If, for example, for OrderNumber=1 the product Laptop has a 10% discount while the rest of the products in the same order have no discount value in the JSON, then the function keeps the Discount column, since it carries required information.
To avoid extra loops inside the function, you can also consider replacing all null values with 0, since a Product with no Discount specified (a null value) is the same as a Product with a Discount of 0. If this is feasible, you can use the fill() or fillna() functions to replace null values with any desired value.

Related

pyspark how to get the count of records which are not matching with the given date format

I have a csv file that contains (FileName, ColumnName, Rule and RuleDetails) as headers.
As per the Rule Details, I need to get the count of values in the column (INSTALLDATE) which do not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried the below code:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
Expected output is count of records which are not matching with the given dateformat.
For Example - If 4 out of 10 records are not in (YYYY-MM-DD) then the count should be 4
I get the below error message when I run the above code.

UPDATE SET with different value for each row

I have python dict with relationship between elements and their values. For example:
db_rows_values = {
    <element_uuid_1>: 12,
    <element_uuid_2>: "abc",
    <element_uuid_3>: [123, 124, 125],
}
And I need to update all of them in one query. I did this in Python by generating the query in a loop with CASE:
sql_query_elements_values_part = " ".join([f"WHEN '{element_row['element_id']}' "
                                           f"THEN '{ujson.dumps(element_row['value'])}'::JSONB "
                                           for element_row in db_row_values])
query_part_elements_values_update = f"""
elements_value_update AS (
    UPDATE m2m_entries_n_elements
    SET value =
        CASE element_id
            {sql_query_elements_values_part}
            ELSE NULL
        END
    WHERE element_id = ANY(%(elements_ids)s::UUID[])
      AND entry_id = ANY(%(entries_ids)s::UUID[])
    RETURNING element_id, entry_id, value
),
"""
But now I need to rewrite it in plpgsql. I can pass db_rows_values as array of ROWTYPE or as json but how can I make something like WHEN THEN part?
OK, I can pass the dict as JSON, convert it to rows with json_to_recordset, and change the WHEN/THEN part to SET value = (SELECT ... WHERE ...):
WITH input_rows AS (
    SELECT *
    FROM json_to_recordset(
        '[
            {"element_id": 2, "value": "new_value_1"},
            {"element_id": 4, "value": "new_value_2"}
        ]'
    ) AS x("element_id" int, "value" text)
)
UPDATE table1
SET value = (SELECT value FROM input_rows WHERE input_rows.element_id = table1.element_id)
WHERE element_id IN (SELECT element_id FROM input_rows);
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=f8b6cd8285ec7757e0d8f38a1becb960

How to delete a node from a JSONB Array across all table rows in Postgres?

I have a table called "Bookmarks" that contains several standard columns and also a JSONB column called "columnsettings".
The content of this JSONB column looks like this:
[
    { "data": "id", "width": 25 },
    { "data": "field_1", "width": 125 },
    { "data": "field_12", "width": 125 },
    { "data": "field_11", "width": 125 },
    { "data": "field_2", "width": 125 },
    { "data": "field_7", "width": 125 },
    { "data": "field_8", "width": 125 },
    { "data": "field_9", "width": 125 },
    { "data": "field_10", "width": 125 }
]
I am trying to write an update statement which would update this columnsettings by removing a specific node I specify. For example, I might want to update the columnsettings and remove just the node where data='field_2' as an example.
I have tried a number of things...I believe it will look something like this, but this is wrong.
update public."Bookmarks"
set columnsettings =
jsonb_set(columnsettings, (columnsettings->'data') - 'field_2');
What is the correct syntax to remove a node within a JSONB Array like this?
I did get a version working when there is a single row. This correctly updates the JSONB column and removes the node
UPDATE public."Bookmarks"
SET columnsettings = columnsettings - (
    SELECT position - 1
    FROM public."Bookmarks",
         jsonb_array_elements(columnsettings) WITH ORDINALITY arr(elem, position)
    WHERE elem->>'data' = 'field_2'
)::int
However, I want it to apply to every row in the table. When there is more than one row, I get the error "more than one row returned by a subquery used as an expression".
How do I get this query to update all rows in the table?
UPDATED, the answer provided solved my issue.
I now have another JSONB column where I need to do the same filtering. The structure is a bit different; it looks like this:
{
    "filters": [
        {
            "field": "field_8",
            "value": [1],
            "header": "Colors",
            "uitype": 7,
            "operator": "searchvalues",
            "textvalues": ["Red"],
            "displayfield": "field_8_options"
        }
    ],
    "rowHeight": 1,
    "detailViewWidth": 1059
}
I tried using the syntax the same way as follows:
UPDATE public."Bookmarks"
SET tabsettings = filtered_elements.tabsettings
FROM (
    SELECT bookmarkid, JSONB_AGG(el) AS tabsettings
    FROM public."Bookmarks",
         JSONB_ARRAY_ELEMENTS(tabsettings) AS el
    WHERE el->'filters'->>'field' != 'field_8'
    GROUP BY bookmarkid
) AS filtered_elements
WHERE filtered_elements.bookmarkid = public."Bookmarks".bookmarkid;
This gives an error: "cannot extract elements from an object"
I thought I had the syntax correct, but how should this line be formatted?
WHERE el->'filters'->>'field' != 'field_8'
I tried this format as well to get to the array. This doesn't give an error, but it doesn't find any matches, even though there are records.
UPDATE public."Bookmarks"
SET tabsettings = filtered_elements.tabsettings
FROM (
    SELECT bookmarkid, JSONB_AGG(el) AS tabsettings
    FROM public."Bookmarks",
         JSONB_ARRAY_ELEMENTS(tabsettings->'filters') AS el
    WHERE el->>'field' != 'field_8'
    GROUP BY bookmarkid
) AS filtered_elements
WHERE filtered_elements.bookmarkid = public."Bookmarks".bookmarkid;
UPDATED.
This query now seems to work if there is more than one filter in the array.
However, if there is only one element in the array (which should be excluded), it doesn't remove the item.
UPDATE public."Bookmarks"
SET tabsettings = filtered_elements.tabsettings
FROM (
    SELECT bookmarkid,
           tabsettings || JSONB_BUILD_OBJECT('filters', JSONB_AGG(el)) AS tabsettings
    FROM public."Bookmarks",
         -- this must be an array
         JSONB_ARRAY_ELEMENTS(tabsettings->'filters') AS el
    WHERE el->>'field' != 'field_8'
    GROUP BY bookmarkid
) AS filtered_elements
WHERE filtered_elements.bookmarkid = public."Bookmarks".bookmarkid;
You can deconstruct, filter, and re-construct the JSONB array. Something like this should work:
UPDATE bookmarks
SET columnsettings = filtered_elements.columnsettings
FROM (
    SELECT id, JSONB_AGG(el) AS columnsettings
    FROM bookmarks,
         JSONB_ARRAY_ELEMENTS(columnsettings) AS el
    WHERE el->>'data' != 'field_2'
    GROUP BY id
) AS filtered_elements
WHERE filtered_elements.id = bookmarks.id;
Using JSONB_ARRAY_ELEMENTS, you transform the JSONB array into rows, one per object, each of which is called el. You can then access the data attribute to filter out the "field_2" entry. Finally, you group by id to put the remaining values back together, and update the corresponding row.
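The deconstruct-filter-rebuild idea is easy to see outside SQL as well; the same transformation on a single columnsettings value in plain Python (stdlib only, with a shortened sample array) looks like this:

```python
import json

# a shortened columnsettings array like the one in the question
columnsettings = json.loads(
    '[{"data": "id", "width": 25},'
    ' {"data": "field_2", "width": 125},'
    ' {"data": "field_7", "width": 125}]'
)

# expand the array, keep every element whose "data" attribute
# is not "field_2", then rebuild the array
filtered = [el for el in columnsettings if el["data"] != "field_2"]
print(json.dumps(filtered))
```

The SQL version does exactly this per row: JSONB_ARRAY_ELEMENTS is the expansion, the WHERE clause is the filter, and JSONB_AGG is the rebuild.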
EDIT If your data is a nested array in an object, override the object on the specific key:
UPDATE bookmarks
SET tabsettings = filtered_elements.tabsettings
FROM (
    SELECT id,
           tabsettings || JSONB_BUILD_OBJECT('filters', JSONB_AGG(el)) AS tabsettings
    FROM bookmarks,
         -- this must be an array
         JSONB_ARRAY_ELEMENTS(tabsettings->'filters') AS el
    WHERE el->>'field' != 'field_2'
    GROUP BY id
) AS filtered_elements
WHERE filtered_elements.id = bookmarks.id;

Cursor .foreach and array data

I use the cursor .forEach to display the data as a table:
.forEach(function(doc) {
    print(doc.Id + ';' + doc.item + ';' + doc.delta);
})
But this does not work with data like this:
"items" : [
    {
        "amount" : 1,
        "id" : 158
    }
]
How do I bring them to the table using the cursor?
I need something like this:
id-itemId-amount
57 | 158 | 1
58 | 159 | 2
(itemID and amount from array)
The Mongo shell supports standard JavaScript for iterating over an array, and the items attribute is an array. So, inside the forEach callback you can iterate over the items array and print out the id and amount attributes for each subdocument in that array.
For example:
db.collection.find({}).forEach(function(doc) {
    // standard JS loop over an array
    for (var i in doc.items) {
        // print the id from the 'outer' document alongside the id
        // and amount from each sub document in the items array
        print(doc.id + '|' + doc.items[i].id + '|' + doc.items[i].amount);
    }
})
Given the following documents ...
{ "id": 1, "items" : [{"amount": 10, "id": 158}, {"amount": 11, "id": 159}]}
{ "id": 2, "items" : [{"amount": 20, "id": 266}, {"amount": 21, "id": 267}]}
... the above function will print:
1|158|10
1|159|11
2|266|20
2|267|21
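The same nested iteration in plain Python (over equivalent dicts, with the sample documents above) makes the pattern clear: one output row per subdocument, combining the outer id with the inner id and amount.

```python
# the two documents above as plain Python dicts
docs = [
    {"id": 1, "items": [{"amount": 10, "id": 158}, {"amount": 11, "id": 159}]},
    {"id": 2, "items": [{"amount": 20, "id": 266}, {"amount": 21, "id": 267}]},
]

rows = []
for doc in docs:
    # one output row per subdocument in the items array
    for item in doc["items"]:
        rows.append(f'{doc["id"]}|{item["id"]}|{item["amount"]}')

print("\n".join(rows))
```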

List in the Case-When Statement in Spark SQL

I'm trying to convert a dataframe from long to wide as suggested at How to pivot DataFrame?
However, the SQL seems to misinterpret the countries list as variables from the table. Below are the messages I saw from the console, and the sample data and code from the above link. Does anyone know how to resolve this?
Messages from the scala console:
scala> val myDF1 = sqlc2.sql(query)
org.apache.spark.sql.AnalysisException: cannot resolve 'US' given input columns id, tag, value;
id tag value
1 US 50
1 UK 100
1 Can 125
2 US 75
2 UK 150
2 Can 175
and I want:
id US UK Can
1 50 100 125
2 75 150 175
I can create a list with the value I want to pivot and then create a string containing the sql query I need.
val countries = List("US", "UK", "Can")
val numCountries = countries.length - 1

var query = "select *, "
for (i <- 0 to numCountries - 1) {
    query += "case when tag = " + countries(i) + " then value else 0 end as " + countries(i) + ", "
}
query += "case when tag = " + countries.last + " then value else 0 end as " + countries.last + " from myTable"

myDataFrame.registerTempTable("myTable")
val myDF1 = sqlContext.sql(query)
Country codes are literals and should be enclosed in quotes; otherwise the SQL parser will treat them as column names:
val caseClause = countries.map(
    x => s"""CASE WHEN tag = '$x' THEN value ELSE 0 END as $x"""
).mkString(", ")

val aggClause = countries.map(x => s"""SUM($x) AS $x""").mkString(", ")

val query = s"""
SELECT id, $aggClause
FROM (SELECT id, $caseClause FROM myTable) tmp
GROUP BY id"""

sqlContext.sql(query)
The question is why even bother building SQL strings from scratch, when the same thing can be done with the DataFrame API:
def genCase(x: String) = {
    when($"tag" <=> lit(x), $"value").otherwise(0).alias(x)
}

def genAgg(f: Column => Column)(x: String) = f(col(x)).alias(x)

df
    .select($"id" :: countries.map(genCase): _*)
    .groupBy($"id")
    .agg($"id".alias("dummy"), countries.map(genAgg(sum)): _*)
    .drop("dummy")