Postgres Querying/filtering JSONB with nested arrays - postgresql

Below is my sample requirement.
I want customers who meet all of the below conditions:
In country "xyz", incorporated between 2019 and 2021.
Should have at least one account with a balance between 10000 and 13000, a branch of "abc", and transaction dates between 20200110 and 20210625. The dates are formatted and stored as numbers.
Should have at least one address in the state "state1" with a pin code between 625001 and 625015.
Below is table structure
CREATE TABLE customer_search_ms.customer
(
customer_id integer,
customer_details jsonb
)
There can be millions of rows in the table.
I have created a GIN index of type jsonb_ops on the customer_details column, as we will also be checking existence conditions and range comparisons.
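For reference, a minimal sketch of such an index (the index name is illustrative; jsonb_ops is the default GIN operator class for jsonb, and it supports the jsonpath match operators @@ and @? used below):

```sql
-- hypothetical index name; jsonb_ops is the default opclass,
-- so "jsonb_ops" could also be omitted
CREATE INDEX customer_details_gin
    ON customer_search_ms.customer
 USING gin (customer_details jsonb_ops);
```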
Below is sample data in the customer_details JSONB column:
customer_id : 1
{
"customer_data": {
"name": "abc",
"incorporated_year": 2020,
"country":"xyz",
"account_details": [
{
"transaction_dates": [
20180125, 20190125, 20200125,20200525
],
"account_id": 1016084,
"account_balance": 2000,
"account_branch": "xyz"
},
{
"transaction_dates": [
20180125, 20190125, 20200125
],
"account_id": 1016087,
"account_balance": 12010,
"account_branch": "abc"
}
],
"address": [
{
"address_id": 24739,
"door_no": 4686467,
"street_name":"street1",
"city": "city1",
"state": "state1",
"pin_code": 625001
},
{
"address_id": 24730,
"door_no": 4686442,
"street_name":"street2",
"city": "city1",
"state": "state1",
"pin_code": 625014
}
]
}
}
Now the query I have written for the above is:
SELECT c.customer_id,
c.customer_details
FROM customer_search_ms.customer c
WHERE c.customer_details @@ CAST('$.customer_data.country == "xyz" && $.customer_data.incorporated_year >= 2019 && $.customer_data.incorporated_year <= 2021' AS JSONPATH)
AND c.customer_details @? CAST('$.customer_data.account_details[*] ? (@.account_balance >= 10000) ? (@.account_balance <= 13000) ? (@.account_branch == "abc") ? (@.transaction_dates >= 20200110) ? (@.transaction_dates <= 20210625)' AS JSONPATH)
AND c.customer_details @? CAST('$.customer_data.address[*] ? (@.state == "state1") ? (@.pin_code >= 625001) ? (@.pin_code <= 625015)' AS JSONPATH)
To handle the above scenario, is this the best way to write it? Is it possible to combine all 3 criteria (customer/account/address) into one expression? The table will have millions of rows.
I am of the opinion that having it as one expression and hitting the DB once will give the best performance. Is it possible to combine these 3 conditions into one expression?

Your query does not give me the error you report. Rather, it runs, but gives the "wrong" results compared to what you want. There are several mistakes in it which are not syntax errors; they just produce wrong results.
Your first jsonpath looks fine. It is a Boolean expression, and @@ checks whether that expression yields true.
Your second jsonpath has two problems. It yields a list of objects which match your conditions. But objects are not booleans, so @@ will be unhappy and return SQL NULL, which is treated the same as false here. Instead, you need to test whether that list is empty. This is what @? does, so use that instead of @@. Also, your dates are stored as 8-digit integers, but you are comparing them to 8-character strings. In jsonpath, such cross-type comparisons yield JSON null, which is treated the same as false here. So you either need to change the storage to strings, or change the literals they are compared to into integers.
Your third jsonpath also has the @@ problem. And it has the reverse of the type problem: you have pin_code stored as strings, but you are testing them against integers. Finally, you have pin_code misspelled in one occurrence.
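As for combining all three criteria: yes, they can be folded into a single jsonpath using nested exists() filters. A sketch (untested against your data; it assumes the dates and pin codes are stored as numbers, per the advice above):

```sql
SELECT c.customer_id, c.customer_details
FROM customer_search_ms.customer c
WHERE c.customer_details @? CAST('
  $.customer_data ? (
       @.country == "xyz"
    && @.incorporated_year >= 2019 && @.incorporated_year <= 2021
    && exists(@.account_details[*] ? (
           @.account_balance >= 10000 && @.account_balance <= 13000
        && @.account_branch == "abc"
        && exists(@.transaction_dates[*] ? (@ >= 20200110 && @ <= 20210625))))
    && exists(@.address[*] ? (
           @.state == "state1"
        && @.pin_code >= 625001 && @.pin_code <= 625015)))' AS jsonpath);
```

@? checks whether the path returns at least one item, so one true/false test covers all three criteria. Whether the GIN index can actually support such a nested expression is worth verifying with EXPLAIN on real data.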

Related

Convert an object to array of size 1 in PostgreSQL jsonb and transform the json with nested arrays to rows

I have a two part question
We have a PostgreSQL table with a jsonb column. The values in jsonb are valid jsons, but they are such that for some rows a node would come in as an array whereas for others it will come as an object.
for example, the json we receive could either be like this (node4 is just an object):
"node1": {
"node2": {
"node3": {
"node4": {
"attr1": "7d181b05-9c9b-4759-9368-aa7a38b0dc69",
"attr2": "S1",
"UserID": "WebServices",
"attr3": "S&P 500*",
"attr4": "EI",
"attr5": "0"
}
}
}
}
Or like this (node4 is an array):
"node1": {
"node2": {
"node3": {
"node4": [
{
"attr1": "7d181b05-9c9b-4759-9368-aa7a38b0dc69",
"attr2": "S1",
"UserID": "WebServices",
"attr3": "S&P 500*",
"attr4": "EI",
"attr5": "0"
},
{
"attr1": "7d181b05-9c9b-4759-9368-aa7a38b0dc69",
"attr2": "S1",
"UserID": "WebServices",
"attr3": "S&P 500*",
"attr4": "EI",
"attr5": "0"
}
]
}
}
}
And I have to write a jsonpath query to extract, for example, attr1, for each PostgreSQL row containing this json. I would like to have just one jsonpath query that would always work, irrespective of whether the node is an object or an array. So I want to use a path like the one below, assuming that, if it is an array, it will return the value for all indices in that array.
jsonb_path_query(payload, '$.node1.node2.node3.node4[*].attr1')#>> '{}' AS "ATTR1"
I would like to avoid checking whether the type is an array or an object and then running a separate query for each and doing a union.
Is it possible?
A sub-question related to the above - since I needed the output as text without the quotes, and somewhere I saw to use #>> '{}' - I tried that and it is working, but can someone explain how that works?
The second part of the question is: the incoming json can have multiple sets of nested arrays, and the json and the number of nodes is huge. So the other thing I would like to do is flatten the json into multiple rows. The examples I found require one to identify each level and either use a cross join or unnest. What I was hoping is that there is a way to flatten a node that is an array, including all of the parent information, without knowing which, if any, of its parents are arrays or simple objects. Is this possible as well?
Update
I tried to look at the documentation and to understand the #>> '{}' construct, and then I realised that '{}' is the right-hand operand of the #>> operator, which takes a path; in my case the path is the current attribute value, hence {}. Looking at examples that had a non-empty single-attribute path helped me realise that.
Thank you
You can use a "recursive term" in the JSON path expression:
select t.some_column,
p.attr1
from the_table t
cross join jsonb_path_query(payload, 'strict $.**.attr1') as p(attr1)
Note that the strict modifier is required; otherwise, each value will be returned multiple times.
This will return one row for each key attr1 found in any level of the JSON structure.
For the given sample data, this would return:
attr1
--------------------------------------
"7d181b05-9c9b-4759-9368-aa7a38b0dc69"
"7d181b05-9c9b-4759-9368-aa7a38b0dc69"
"7d181b05-9c9b-4759-9368-aa7a38b0dc69"
"I would like to avoid checking whether the type is an array or an object and then run a separate query for each and do a union. Is it possible?"
Yes, it is, and your jsonpath query works fine in both cases, whether node4 is a jsonb object or a jsonb array, because the jsonpath wildcard array accessor [*] also works with a jsonb object in lax mode, which is the default behavior (but not in strict mode; see the manual). See the test results in dbfiddle.
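A minimal self-contained illustration of that lax-mode behavior (values are made up):

```sql
-- lax mode (the default): [*] unwraps arrays AND auto-wraps a lone object
SELECT jsonb_path_query('{"node4": {"attr1": "a"}}'::jsonb,
                        '$.node4[*].attr1') #>> '{}';   -- one row: a

SELECT jsonb_path_query('{"node4": [{"attr1": "a"}, {"attr1": "b"}]}'::jsonb,
                        '$.node4[*].attr1') #>> '{}';   -- two rows: a, b
```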
"I saw to use #>> '{}' - so I tried that and it is working, but can someone explain, how that works?"
The output of the jsonb_path_query function is of type jsonb, and when the result is a jsonb string, it is automatically displayed with double quotes " in the query results. The operator #>> converts the output into the text type, which is displayed without " in the query results, and the associated empty text array '{}' just points at the root of the passed jsonb data.
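The quoting difference can be seen with a tiny standalone example:

```sql
SELECT '"hello"'::jsonb #>  '{}';  -- jsonb, displayed as "hello"
SELECT '"hello"'::jsonb #>> '{}';  -- text, displayed as hello
```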
" the incoming json can have multiple sets of nested arrays and the json and the number of nodes is huge. So other part I would like to do is flatten the json into multiple rows"
You can refer to the answer of a_horse_with_no_name using the recursive wildcard member accessor .**

How to avoid NULL check in JOIN condition in Tableau

I am trying to change backend DB for Tableau dashboard. Tableau is generating JOIN SQLs with conditions such as:
ON a.col1 = b.col2 OR (a.col1 is null and b.col2 is null)
Is there a way we can avoid having the OR (a.col1 is null and b.col2 is null) condition? I tried ZN and IFNULL with the column name, but such conditions are still getting added. Druid supports JOIN only with equality conditions, and because of the IS NULL check the query is failing. Thanks
Tableau is treating NULL as if it were a value, and in SQL that is not the case; it is the absence of a value. According to
https://help.tableau.com/current/pro/desktop/en-us/joining_tables.htm, in the section named "About null values in join keys", there is an option "Join null values to null values"; perhaps that is turned on in your case.
On the Druid side, if you want to treat NULL as meaning a default value, then a possible route is to transform the NULL into a special value (say -1, or whatever is out of the normal range of the values) and have that value exist in both tables instead of NULL.
In Druid at ingestion time you can use:
...
"transformSpec": {
"transforms": [
{
"type": "expression",
"name": "col1",
"expression": "nvl( col1, -1)"
}
]
...
which will replace col1 with the calculated column col1 (this is called shadowing) that has NULL values replaced with -1.
For more info on the transformSpec and on the expression functions that are available, see:
https://druid.apache.org/docs/latest/misc/math-expr.html#general-functions
https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#transformspec

Creating table views with columns from nested value of a jsonb column

I have a jsonb column in one of the tables that has the following structure:
{
"3424": {
"status": "pending",
"remarks": "sample here"
},
"6436": {
"status": "new",
"remarks": "sample here"
},
"9768": {
"status": "cancelled",
"remarks": null,
"by": "customer"
}
}
I am trying to create a view that will put the statuses in individual columns and the key will be their value:
pending | new | cancelled | accepted | id | transaction
3424 | 6436 | 9768 | null | 1 | testing
The problem is that the key is dynamic (numeric, and corresponds to some id), so I cannot pinpoint the exact key to use the functions/operators stated here: https://www.postgresql.org/docs/9.5/functions-json.html
I've read about jsonb_path_query and was able to extract the statuses here without needing to know the key, but I cannot combine it with the integer key yet.
select mt.id, mt.transaction, hstatus from mytable mt
cross join lateral jsonb_path_query(mt.hist, '$.**.status') hstatus
where mt.id = <id>
but this returns the statuses as rows for now. I'm pretty noob with (postgre)sql so I've only gotten this far.
You can indeed use a PATH query. Unfortunately, it's not possible to access the "parent" inside a jsonpath in Postgres. But you can work around that by expanding the whole value to a list of key/value pairs, so that the id value you have can be accessed through .key
select jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "pending").key') #>> '{}' as pending,
jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "new").key') #>> '{}' as new,
jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "cancelled").key') #>> '{}' as cancelled,
jsonb_path_query_first(the_column, '$.keyvalue() ? (@.value.status == "accepted").key') #>> '{}' as accepted,
id,
"transaction"
from the_table
The jsonpath function $.keyvalue() returns something like this:
{"id": 0, "key": "3424", "value": {"status": "pending", "remarks": "sample here"}}
This is then used to pick an element through a condition on @.value.status, and the accessor .key then returns the corresponding key value (e.g. 3424).
The #>> '{}' is a hack to convert the returned jsonb value into a proper text value (otherwise the result would be e.g. "3424" instead of just 3424).
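A standalone illustration of the keyvalue() approach, with made-up data:

```sql
SELECT jsonb_path_query_first(
         '{"3424": {"status": "pending"}, "6436": {"status": "new"}}'::jsonb,
         '$.keyvalue() ? (@.value.status == "new").key'
       ) #>> '{}';
-- returns the key of the entry whose status is "new": 6436
```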

Postgres - How to perform a LIKE query on a JSONB field?

I have a jsonb field called passengers, with the following structure:
note that persons is an array
{
"adults": {
"count": 2,
"persons": [
{
"age": 45,
"name": "Prof. Kyleigh Walsh II",
"birth_date": "01-01-1975"
},
{
"age": 42,
"name": "Milford Wiza",
"birth_date": "02-02-1978"
}
]
}
}
How may I perform a query against the name field of this JSONB? For example, to select all rows which match the name field Prof?
Here's my rudimentary attempt:
SELECT passengers from opportunities
WHERE 'passengers->adults' != NULL
AND 'passengers->adults->persons->name' LIKE '%Prof';
This returns 0 rows, but as you can see I have one row with the name Prof. Kyleigh Walsh II
This: 'passengers->adults->persons->name' LIKE '%Prof'; checks if the string 'passengers->adults->persons->name' ends with Prof.
Each key for the JSON operator needs to be a separate element, and the column name must not be enclosed in single quotes. So 'passengers->adults->persons->name' needs to be passengers -> 'adults' -> 'persons' -> 'name'
The -> operator returns a jsonb value, you want a text value, so the last operator should be ->>
Also != null does not work, you need to use is not null.
SELECT passengers
from opportunities
WHERE passengers -> 'adults' is not NULL
AND passengers -> 'adults' -> 'persons' ->> 'name' LIKE 'Prof%';
The is not null condition isn't really necessary, because that is implied with the second condition. The second condition could be simplified to:
SELECT passengers
from opportunities
WHERE passengers #>> '{adults,persons,name}' LIKE 'Prof%';
But as persons is an array, the above wouldn't work and you need to use a different approach.
With Postgres 9.6 you will need a sub-query to unnest the array elements (and thus iterate over each one).
SELECT passengers
from opportunities
WHERE exists (select *
from jsonb_array_elements(passengers -> 'adults' -> 'persons') as p(person)
where p.person ->> 'name' LIKE 'Prof%');
To match a string at the beginning with LIKE, the wildcard needs to be at the end. '%Prof' would match 'Some Prof' but not 'Prof. Kyleigh Walsh II'
With Postgres 12, you could use a SQL/JSON Path expression:
SELECT passengers
from opportunities
WHERE passengers @? '$.adults.persons[*] ? (@.name like_regex "Prof.*")'
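A tiny self-contained check of that jsonpath, against a made-up row:

```sql
SELECT '{"adults": {"persons": [{"name": "Prof. Kyleigh Walsh II"}]}}'::jsonb
       @? '$.adults.persons[*] ? (@.name like_regex "Prof.*")';
-- true; an anchored alternative is: ? (@.name starts with "Prof")
```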

PostgreSQL: Find and delete duplicated jsonb data, excluding a key/value pair when comparing

I have been searching all over to find a way to do this.
I am trying to clean up a table with a lot of duplicated jsonb fields.
There are some examples out there, but as a little twist, I need to exclude one key/value pair in the jsonb field, to get the result I need.
Example jsonb
{
"main": {
"orders": {
"order_id": "1",
"customer_id": "1",
"update_at": "11/23/2017 17:47:13"
}
}
}
Compared to:
{
"main": {
"orders": {
"order_id": "1",
"customer_id": "1",
"updated_at": "11/23/2017 17:49:53"
}
}
}
If I can exclude the "updated_at" key when comparing, the query should find it a duplicate, and this, and possibly other, duplicated entries should be deleted, keeping only the first "original" one.
I have found this query to try to find the duplicates, but it doesn't take my situation into account. Maybe someone can help restructure it to meet the requirements.
SELECT t1.jsonb_field
FROM customers t1
INNER JOIN (SELECT jsonb_field, COUNT(*) AS CountOf
FROM customers
GROUP BY jsonb_field
HAVING COUNT(*)>1
) t2 ON t1.jsonb_field=t2.jsonb_field
WHERE
t1.customer_id = 1
Thanks in advance :-)
If updated_at is always at the same path, then you can remove it before comparing:
SELECT t1.jsonb_field
FROM customers t1
INNER JOIN (SELECT jsonb_field, COUNT(*) AS CountOf
FROM customers
GROUP BY jsonb_field
HAVING COUNT(*)>1
) t2 ON
t1.jsonb_field #-'{main,orders,updated_at}'
=
t2.jsonb_field #-'{main,orders,updated_at}'
WHERE
t1.customer_id = 1
See the additional jsonb operators at https://www.postgresql.org/docs/9.5/static/functions-json.html
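To actually delete the duplicates (rather than just select them), here is a hedged sketch using the system column ctid to keep one row per group; the customer_id grouping column is assumed from the question, and this should be tested on a copy first:

```sql
-- removes all but the "first" physical row within each duplicate group,
-- ignoring the updated_at timestamp when comparing the jsonb values
DELETE FROM customers c1
USING customers c2
WHERE c1.ctid > c2.ctid                 -- keep the earlier row
  AND c1.customer_id = c2.customer_id   -- assumed grouping column
  AND c1.jsonb_field #- '{main,orders,updated_at}'
    = c2.jsonb_field #- '{main,orders,updated_at}';
```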
EDIT
If you don't have #-, you might just cast to text and do a regex replace:
regexp_replace(t1.jsonb_field::text, '"update_at": "[^"]*?"','')::jsonb
=
regexp_replace(t2.jsonb_field::text, '"update_at": "[^"]*?"','')::jsonb
I even think you don't need to cast it back to jsonb, but better safe than sorry.
Mind that the regex matches ANY "update_at" field (by key) in the json. It should not match data, because it would not match an escaped closing quote \", nor find the colon after it.
Note the regex actually should be '"update_at": "[^"]*?",?'
But on SQL Fiddle that fails (maybe it depends on the postgres build; check with your version, because as far as the regex goes, this is correct).
If the comma is not removed, the cast to jsonb fails.
You can try '"update_at": "[^"]*?",'
(no ?): that will remove the comma, but fail if update_at was the last key in the object.
Worst case, nest the two:
regexp_replace(
regexp_replace(t1.jsonb_field::text, '"update_at": "[^"]*?",',''),
'"update_at": "[^"]*?"','')::jsonb
For PostgreSQL 9.4:
Though SQLFiddle only has 9.3 and 9.6, and 9.3 is missing json_object_agg. The postgres docs say it is in 9.4, so this should work.
It will only work if all records have objects under the important keys:
main->orders
If main->orders is a json array, or a scalar, then this may give an error. Same if {"main": [1,2]} => error.
Each json_each returns a table with a row for each key in the json.
json_object_agg aggregates them back into a json object.
The case statement filters the one key on each level that needs to be handled.
In the deepest nesting level, it filters out the updated_at row.
On SQLFiddle set the query separator to '//'
If you use the psql client, replace the // with ;
create or replace function foo(x json)
returns jsonb
language sql
as $$
select json_object_agg(key,
case key when 'main' then
(select json_object_agg(t2.key,
case t2.key when 'orders' then
(select json_object_agg(t3.key, t3.value)
from json_each(t2.value) as t3
WHERE t3.key <> 'updated_at'
)
else t2.value
end)
from json_each(t1.value) as t2
)
else t1.value
end)::jsonb
from json_each(x) as t1
$$ //
select foo(x)
from
(select '{ "main":{"orders":{"order_id": "1", "customer_id": "1", "updated_at": "11/23/2017 17:49:53" }}}'::json as x) as t1
x (the argument) may need to be jsonb, if that is your datatype