Cursor .foreach and array data - mongodb

I use the cursor .forEach to display the data as a table.
.forEach(function(doc) {
    print(doc.Id + ';' + doc.item + ';' + doc.delta);
});
But this does not work with data like this:
"items" : [
    {
        "amount" : 1,
        "id" : 158
    }
]
How do I include these array values in the table using the cursor?
I need something like this:
id-itemId-amount
57 | 158 | 1
58 | 159 | 2
(itemID and amount from array)

The Mongo shell supports standard JavaScript for iterating over an array, and the items attribute is an array. So, inside the forEach callback, you can iterate over the items array and print the id and amount attributes of each subdocument in that array.
For example:
db.collection.find({}).forEach(function(doc) {
    // standard JS loop over an array
    for (var i in doc.items) {
        // print the id from the 'outer' document alongside the id
        // and amount from each subdocument in the items array
        print(doc.id + '|' + doc.items[i].id + '|' + doc.items[i].amount);
    }
});
Given the following documents ...
{ "id": 1, "items" : [{"amount": 10, "id": 158}, {"amount": 11, "id": 159}]}
{ "id": 2, "items" : [{"amount": 20, "id": 266}, {"amount": 21, "id": 267}]}
... the above function will print:
1|158|10
1|159|11
2|266|20
2|267|21
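The same nested-loop flattening can be sketched outside the shell. This is a minimal Python illustration, assuming the documents arrive as plain dicts (which is what a PyMongo cursor would yield); the helper name `flatten_items` is hypothetical and no database connection is involved.

```python
def flatten_items(docs):
    """Yield one 'id|itemId|amount' line per element of each document's items array."""
    for doc in docs:
        # .get() with a default tolerates documents that have no items array
        for item in doc.get("items", []):
            yield f"{doc['id']}|{item['id']}|{item['amount']}"


docs = [
    {"id": 1, "items": [{"amount": 10, "id": 158}, {"amount": 11, "id": 159}]},
    {"id": 2, "items": [{"amount": 20, "id": 266}, {"amount": 21, "id": 267}]},
]

lines = list(flatten_items(docs))
for line in lines:
    print(line)
```

Each outer document contributes one output line per array element, so a document with an empty or missing items array contributes nothing.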

Related

PySpark: Explode schema columns does not match with underlying nested schema

I am using pyspark in combination with Azure-Synapse. I am reading multiple nested JSON with the same structure in a dataframe using the below sample:
{
"AmountOfOrders": 2,
"TotalEarnings": 1800,
"OrderDetails": [
{
"OrderNumber": 1,
"OrderDate": "2022-7-06",
"OrderLine": [
{
"LineNumber": 1,
"Product": "Laptop",
"Price": 1000
},
{
"LineNumber": 2,
"Product": "Tablet",
"Price": 500
},
{
"LineNumber": 3,
"Product": "Mobilephone",
"Price": 300
}
]
},
{
"OrderNumber": 2,
"OrderDate": "2022-7-06",
"OrderLine": [
{
"LineNumber": 1,
"Product": "Printer",
"Price": 100,
"Discount": 0
},
{
"LineNumber": 2,
"Product": "Paper",
"Price": 50,
"Discount": 0
},
{
"LineNumber": 3,
"Product": "Toner",
"Price": 30,
"Discount": 0
}
]
}
]
}
I am trying to get the LineNumbers of OrderNumber 1 into a separate dataframe using a generic function that extracts the arrays and structs of the dataframe, using the code below:
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, StructType

def read_nested_structure(df, excludeList, messageType, coll):
    display(df.limit(10))
    print('read_nested_structure')
    cols = []
    match = 0
    match_field = ""
    print(df.schema[coll].dataType.fields)
    for field in df.schema[coll].dataType.fields:
        for c in excludeList:
            if c == field.name:
                print('Match = ' + field.name)
                match = 1
        if match == 0:
            # cols.append(coll)
            cols.append(col(coll + "." + field.name).alias(field.name))
        match = 0
    # cols.append(coll)
    print(cols)
    df = df.select(cols)
    return df
def read_nested_structure_2(df, excludeList, messageType):
    cols = []
    match = 0
    for coll in df.schema.names:
        if isinstance(df.schema[coll].dataType, ArrayType):
            print(coll + "-- : Array")
            df = df.withColumn(coll, explode(coll).alias(coll))
            cols.append(coll)
        elif isinstance(df.schema[coll].dataType, StructType):
            if messageType == 'Header':
                for field in df.schema[coll].dataType.fields:
                    cols.append(col(coll + "." + field.name).alias(coll + "_" + field.name))
            elif messageType == 'Content':
                print('Struct - Content')
                for field in df.schema[coll].dataType.fields:
                    cols.append(col(coll + "." + field.name).alias(field.name))
        else:
            for c in excludeList:
                if c == coll:
                    match = 1
            if match == 0:
                cols.append(coll)
    df = df.select(cols)
    return df
df = spark.read.load(datalakelocation + '/test.json', format='json')
df = unpack_to_content_dataframe_simple_2(df,exclude)
df = df.filter(df.OrderNumber == 1)
df = unpack_to_content_dataframe_simple_2(df,exclude)
display(df.limit(10))
which results in the following dataframe:
As you can see, the yellow-marked attribute is added to the dataframe even though it is not part of OrderNumber 1. How can I filter rows in the dataframe so that the resulting schema is updated (in this case, without the Discount attribute)?
I used the read_nested_structure_2() function in the following way to get the same result as yours:
x = read_nested_structure_2(df,[],'Header')
y = read_nested_structure_2(x,[],'Content')
y = y.filter(y.OrderNumber == 1)
z = read_nested_structure_2(y,[],'Header')
final = read_nested_structure_2(z,[],'Content')
display(final)
The output after running this code is:
The Discount column is created even if it is specified for only one Product in the entire input JSON. To remove this column, we have to handle it separately and produce another dataframe without Discount (only if it is invalid).
Since you are going to use the same function to extract data from StructType or ArrayType columns, it is not recommended to also put the code that removes all-null fields (such as Discount) in that function. Doing so would complicate the code.
Instead, we can write a separate function that removes every column whose values are all null. The following function can be used to do this:
def exclude_fields_that_dont_exist(filtered_df):
    cols = []
    # iterate through columns
    for column in filtered_df.columns:
        # null_count is the count of null values in a column
        null_count = filtered_df.filter(filtered_df[column].isNull()).count()
        # check if null_count equals the total column value count
        # if they are equal, those columns are not required (like Discount)
        if filtered_df.select(column).count() != null_count:
            cols.append(column)
    # return dataframe with required columns
    return filtered_df.select(*cols)
When you use this function on the filtered dataframe (final in my case), then you get a resulting dataframe as shown below:
mydf = exclude_fields_that_dont_exist(final)
# removes columns from a dataframe that have all null values.
display(mydf)
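To make the rule concrete without a Spark session, here is a plain-Python sketch of the same idea applied to rows represented as dicts. It is an illustration only, not PySpark code: `drop_all_null_columns` and the sample rows are hypothetical, with `None` standing in for Spark's null.

```python
def drop_all_null_columns(rows):
    """Return rows keeping only the keys that are non-null in at least one row."""
    # Collect every column name seen, preserving first-seen order
    columns = list(dict.fromkeys(key for row in rows for key in row))
    # A column survives if any row holds a non-None value for it
    keep = [c for c in columns if any(row.get(c) is not None for row in rows)]
    return [{c: row.get(c) for c in keep} for row in rows]


# Order lines of OrderNumber 1 after the filter: Discount exists in the
# schema (inferred from OrderNumber 2) but is null on every remaining row.
order_1 = [
    {"LineNumber": 1, "Product": "Laptop", "Price": 1000, "Discount": None},
    {"LineNumber": 2, "Product": "Tablet", "Price": 500, "Discount": None},
    {"LineNumber": 3, "Product": "Mobilephone", "Price": 300, "Discount": None},
]

cleaned = drop_all_null_columns(order_1)
print(list(cleaned[0]))
```

If any one row had a non-null Discount, the column would be kept for all rows.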
NOTE:
Suppose that, for OrderNumber=1, the product Laptop has a 10% discount and the rest of the products (for the same order number) have no discount value in the JSON.
In that case the function must keep the Discount column, since it carries required information.
To avoid more loops inside a function, you can also consider replacing all null values with 0, since a Product with no Discount specified (null) is equivalent to a Product with a Discount of 0. If this is feasible, you can use the na.fill() or fillna() functions to fill null values with any desired value.

How to delete a node from a JSONB Array across all table rows in Postgres?

I have a table called "Bookmarks" that contains several standard columns and also a JSONB column called "columnsettings".
The content of this JSONB column looks like this.
[
{
"data": "id",
"width": 25
},
{
"data": "field_1",
"width": 125
},
{
"data": "field_12",
"width": 125
},
{
"data": "field_11",
"width": 125
},
{
"data": "field_2",
"width": 125
},
{
"data": "field_7",
"width": 125
},
{
"data": "field_8",
"width": 125
},
{
"data": "field_9",
"width": 125
},
{
"data": "field_10",
"width": 125
}
]
I am trying to write an update statement that updates columnsettings by removing a specific node that I specify. For example, I might want to remove just the node where data='field_2'.
I have tried a number of things...I believe it will look something like this, but this is wrong.
update public."Bookmarks"
set columnsettings =
jsonb_set(columnsettings, (columnsettings->'data') - 'field_2');
What is the correct syntax to remove a node within a JSONB Array like this?
I did get a version working when there is a single row. This correctly updates the JSONB column and removes the node
UPDATE public."Bookmarks" SET columnsettings = columnsettings - (select position-1 from public."Bookmarks", jsonb_array_elements(columnsettings) with ordinality arr(elem, position) WHERE elem->>'data' = 'field_2')::int
However, I want it to apply to every row in the table. When there is more than one row, I get the error "more than one row returned by a subquery used as an expression".
How do I get this query to update all rows in the table?
UPDATED, the answer provided solved my issue.
I now have another JSONB column where I need to do the same filtering. The structure is a bit different; it looks like this:
{
"filters": [
{
"field": "field_8",
"value": [
1
],
"header": "Colors",
"uitype": 7,
"operator": "searchvalues",
"textvalues": [
"Red"
],
"displayfield": "field_8_options"
}
],
"rowHeight": 1,
"detailViewWidth": 1059
}
I tried using the syntax the same way as follows:
UPDATE public."Bookmarks"
SET tabsettings = filtered_elements.tabsettings
FROM (
SELECT bookmarkid, JSONB_AGG(el) as tabsettings
FROM public."Bookmarks",
JSONB_ARRAY_ELEMENTS(tabsettings) AS el
WHERE el->'filters'->>'field' != 'field_8'
GROUP BY bookmarkid
) AS filtered_elements
WHERE filtered_elements.bookmarkid = public."Bookmarks".bookmarkid;
This gives an error: "cannot extract elements from an object"
I thought I had the syntax correct, but how should this line be formatted?
WHERE el->'filters'->>'field' != 'field_8'
I tried this format as well to get to the array. This doesn't give an error, but it doesn't find any matches, even though there are records.
UPDATE public."Bookmarks"
SET tabsettings = filtered_elements.tabsettings
FROM (
SELECT bookmarkid, JSONB_AGG(el) as tabsettings
FROM public."Bookmarks",
JSONB_ARRAY_ELEMENTS(tabsettings->'filters') AS el
WHERE el->>'field' != 'field_8'
GROUP BY bookmarkid
) AS filtered_elements
WHERE filtered_elements.bookmarkid = public."Bookmarks".bookmarkid;
UPDATED.
This query now seems to work if there is more than one "filter" in the array.
However, if there is only 1 element in array which should be excluded, it doesn't remove the item.
UPDATE public."Bookmarks"
SET tabsettings = filtered_elements.tabsettings
FROM (
SELECT bookmarkid,
tabsettings || JSONB_BUILD_OBJECT('filters', JSONB_AGG(el)) as tabsettings
FROM public."Bookmarks",
-- this must be an array
JSONB_ARRAY_ELEMENTS(tabsettings->'filters') AS el
WHERE el->>'field' != 'field_8'
GROUP BY bookmarkid
) AS filtered_elements
WHERE filtered_elements.bookmarkid = public."Bookmarks".bookmarkid;
You can deconstruct, filter, and re-construct the JSONB array. Something like this should work:
UPDATE bookmarks
SET columnsettings = filtered_elements.columnsettings
FROM (
SELECT id, JSONB_AGG(el) as columnsettings
FROM bookmarks,
JSONB_ARRAY_ELEMENTS(columnsettings) AS el
WHERE el->>'data' != 'field_2'
GROUP BY id
) AS filtered_elements
WHERE filtered_elements.id = bookmarks.id;
Using JSONB_ARRAY_ELEMENTS, you transform the JSONB array into rows, one per object, which you call el. Then you can access the data attribute to filter out the "field_2" entry. Finally, you group by id to put the remaining values back together, and update the corresponding row.
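The three steps (expand, filter, re-aggregate) can be mirrored in plain Python to see what the subquery produces. This is only an illustration of the logic, not something to run against the database; the `rows` mapping and the `filter_elements` helper are hypothetical. It also shows one consequence worth checking: an id whose elements are all filtered out produces no group at all, so the UPDATE would leave that row unchanged.

```python
from collections import defaultdict


def filter_elements(rows, unwanted):
    """Mimic the subquery: expand each array, drop unwanted elements, regroup by id."""
    grouped = defaultdict(list)
    for row_id, columnsettings in rows.items():
        # JSONB_ARRAY_ELEMENTS: one row per array element
        for el in columnsettings:
            data = el.get("data")
            # WHERE el->>'data' != 'field_2'; a missing key gives NULL in SQL,
            # and NULL != 'field_2' is not true, so such elements are dropped too
            if data is not None and data != unwanted:
                # JSONB_AGG(el) ... GROUP BY id
                grouped[row_id].append(el)
    return dict(grouped)


rows = {
    1: [
        {"data": "id", "width": 25},
        {"data": "field_2", "width": 125},
        {"data": "field_7", "width": 125},
    ],
    2: [{"data": "field_2", "width": 125}],
}

result = filter_elements(rows, "field_2")
print(result)
```

Row 2 is absent from the result because nothing survived the filter, which is the situation where a join-based UPDATE has no matching row to apply.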
EDIT: If your data is an array nested in an object, overwrite that specific key of the object:
UPDATE bookmarks
SET tabsettings = filtered_elements.tabsettings
FROM (
SELECT id,
tabsettings || JSONB_BUILD_OBJECT('filters', JSONB_AGG(el)) as tabsettings
FROM bookmarks,
-- this must be an array
JSONB_ARRAY_ELEMENTS(tabsettings->'filters') AS el
WHERE el->>'field' != 'field_2'
GROUP BY id
) AS filtered_elements
WHERE filtered_elements.id = bookmarks.id;
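For the nested case, the `||` concatenation replaces the top-level 'filters' key and leaves the other keys untouched. The plain-Python sketch below mirrors that merge with a dict union. Unlike the grouped subquery, it computes the filtered list per object, so an object whose only filter is removed ends up with an empty list rather than being skipped; whether that is the desired outcome is an assumption. `remove_filter` is a hypothetical helper.

```python
def remove_filter(tabsettings, unwanted_field):
    """Return a copy of tabsettings whose 'filters' list omits entries for unwanted_field."""
    # Note: entries without a 'field' key are kept here, unlike the SQL
    # comparison, where a NULL field would not pass the != test.
    kept = [f for f in tabsettings.get("filters", []) if f.get("field") != unwanted_field]
    # Same effect as tabsettings || jsonb_build_object('filters', kept):
    # the right-hand value wins for 'filters', all other keys are preserved.
    return {**tabsettings, "filters": kept}


tabsettings = {
    "filters": [{"field": "field_8", "header": "Colors"}],
    "rowHeight": 1,
    "detailViewWidth": 1059,
}

updated = remove_filter(tabsettings, "field_8")
print(updated)
```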

How to count number of elements in the following JSON

I have a column in my Metabase table where the column entry is like the following:
{ “text_fields”: { “Weight”: “{:optional=>true, :priority=>4, :index=>false}” }, “checkbox_fields”: {}, “dropdown_fields”: { “Brand Name”: “{:optional=>false, :priority=>1, :index=>false, :options=>[“Non Branded”]}” }}
I want to get a net count of
text_fields
checkbox_fields
dropdown_fields
The desired answer, in this case, will be: 2 (1 text field + 0 checkbox field + 1 dropdown field)
Use jsonb_each to iterate over object keys/values, and jsonb_object_keys to extract the keys only.
Example (with sample data included):
SELECT * FROM mytable ;
mycolumn
---------------------------------------------------------------------------------------------
{"text_fields": {"Weight": 1}, "checkbox_fields": {}, "dropdown_fields": {"Brand Name": 1}}
(1 row)
SELECT *,
(SELECT sum((SELECT count(*) FROM jsonb_object_keys(v1)))
FROM jsonb_each(mycolumn) AS j1(k1,v1)
WHERE k1 IN ('text_fields', 'checkbox_fields', 'dropdown_fields')
) AS mytotal
FROM mytable;
mycolumn | mytotal
---------------------------------------------------------------------------------------------+---------
{"text_fields": {"Weight": 1}, "checkbox_fields": {}, "dropdown_fields": {"Brand Name": 1}} | 2
(1 row)
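The query's logic (take the three top-level keys and add up the number of keys inside each) is easy to cross-check in plain Python once the column value is parsed. A minimal sketch, assuming the value is valid JSON as in the sample row; `count_fields` is a hypothetical name.

```python
import json

GROUPS = ("text_fields", "checkbox_fields", "dropdown_fields")


def count_fields(raw_json):
    """Sum the number of keys under each of the three field groups."""
    doc = json.loads(raw_json)
    # Missing groups count as empty objects
    return sum(len(doc.get(group, {})) for group in GROUPS)


sample = '{"text_fields": {"Weight": 1}, "checkbox_fields": {}, "dropdown_fields": {"Brand Name": 1}}'
total = count_fields(sample)
print(total)  # 2
```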

Postgres: How to string pattern match query a json column?

I have a column with json type but I'm wondering how to select filter it i.e.
select * from fooTable where myjson like "orld";
How would I query for a substring match like the above. Say searching for "orld" under "bar" keys?
{ "foo": "hello", "bar": "world"}
I took a look at this documentation but it is quite confusing.
https://www.postgresql.org/docs/current/static/datatype-json.html
Use the ->> operator to get json attributes as text, for example:
with my_table(id, my_json) as (
values
(1, '{ "foo": "hello", "bar": "world"}'::json),
(2, '{ "foo": "hello", "bar": "moon"}'::json)
)
select t.*
from my_table t
where my_json->>'bar' like '%orld'
id | my_json
----+-----------------------------------
1 | { "foo": "hello", "bar": "world"}
(1 row)
Note that you need a placeholder % in the pattern.
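Since LIKE matches the whole string, `'%orld'` only finds values that end in "orld"; for a match anywhere in the value the pattern would be `'%orld%'`. The sketch below emulates LIKE in Python to show the difference. It is an approximation for illustration (no ESCAPE handling), and `like` and `matching_ids` are hypothetical helpers.

```python
import json
import re


def like(value, pattern):
    """Emulate SQL LIKE: % matches any run of characters, _ matches exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # fullmatch, because LIKE must match the entire value
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


rows = [
    (1, '{ "foo": "hello", "bar": "world"}'),
    (2, '{ "foo": "hello", "bar": "moon"}'),
    (3, '{ "foo": "hello", "bar": "worlds apart"}'),
]


def matching_ids(pattern):
    # json.loads(...)["bar"] plays the role of my_json->>'bar'
    return [row_id for row_id, raw in rows if like(json.loads(raw)["bar"], pattern)]


print(matching_ids("orld"))    # []
print(matching_ids("%orld"))   # [1]
print(matching_ids("%orld%"))  # [1, 3]
```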

Query rows for matching JSONB column where key ends with a name and key value is a specific value

Given the following rows with a jsonb column details, how do I write a query that selects the records where a key name ends with _col and its value is B? Here, that would be the records with ids 1 and 2.
id | details
1 | { "one_col": "A", "two_col": "B" }
2 | { "three_col": "B" }
3 | { "another": "B" }
So far I've only found ways to match based on the value, not the key.
Use the function jsonb_each_text() which gives json objects as pairs (key, value):
with the_data(id, details) as (
values
(1, '{ "one_col": "A", "two_col": "B" }'::jsonb),
(2, '{ "three_col": "B" }'),
(3, '{ "another": "B" }')
)
select t.*
from the_data t,
lateral jsonb_each_text(details)
where key like '%_col'
and value = 'B';
id | details
----+----------------------------------
1 | {"one_col": "A", "two_col": "B"}
2 | {"three_col": "B"}
(2 rows)
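One detail to keep in mind: in LIKE, `_` is itself a wildcard for any single character, so `'%_col'` matches any key that ends in "col" and has at least one character before it, not only keys ending in a literal "_col". For the sample data the result is the same. A plain-Python sketch of the literal-suffix check, with `has_col_key` as a hypothetical helper:

```python
def has_col_key(details, value, suffix="_col"):
    """True if any key ends with the literal suffix and maps to the given value."""
    return any(k.endswith(suffix) and v == value for k, v in details.items())


the_data = {
    1: {"one_col": "A", "two_col": "B"},
    2: {"three_col": "B"},
    3: {"another": "B"},
}

selected = [row_id for row_id, details in the_data.items() if has_col_key(details, "B")]
print(selected)  # [1, 2]
```

In PostgreSQL, a literal underscore can be matched by escaping it in the pattern, for example `'%\_col'`, since backslash is the default LIKE escape character.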