How to avoid NULL check in JOIN condition in Tableau - Druid

I am trying to change the backend DB for a Tableau dashboard. Tableau is generating JOIN SQL with conditions such as:
ON a.col1 = b.col2 OR (a.col1 IS NULL AND b.col2 IS NULL)
Is there a way to avoid the OR (a.col1 IS NULL AND b.col2 IS NULL) condition? I tried ZN and IFNULL on the column name, but such conditions are still added. Druid supports JOINs only with equality conditions, and the query fails because of the IS NULL check. Thanks

Tableau is treating NULL as if it were a value, but in SQL it is not: it is the absence of a value. According to
https://help.tableau.com/current/pro/desktop/en-us/joining_tables.htm, the section named "About null values in join keys" mentions an option to "Join null values to null values"; perhaps that is turned on in your case.
On the Druid side, if you want to treat NULL as meaning a default value, one possible route is to transform the NULL into a special value (say -1, or whatever is outside the normal range of values) and have that value exist in both tables instead of NULL.
In Druid, at ingestion time, you can use:
...
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "col1",
      "expression": "nvl(col1, -1)"
    }
  ]
}
...
This replaces col1 with a calculated column of the same name (this is called shadowing) in which NULL values have been replaced with -1.
For more info on the transformSpec and on the expression functions that are available, see:
https://druid.apache.org/docs/latest/misc/math-expr.html#general-functions
https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#transformspec
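If re-ingesting is not an option, the same sentinel idea can sometimes be applied at query time instead. This is only a sketch: the table and column names come from the question, -1 is an assumed out-of-range sentinel, and whether Druid can plan an expression inside the join condition depends on your Druid version.

```sql
-- Null-safe equi-join by mapping NULL to a sentinel on both sides.
-- Only valid if -1 can never occur as a real value in col1 or col2.
SELECT a.*, b.*
FROM table_a a
JOIN table_b b
  ON COALESCE(a.col1, -1) = COALESCE(b.col2, -1)
```

Since Tableau generates the SQL itself, this form would have to come from a custom SQL data source rather than the drag-and-drop join.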

Related

Postgres ignoring null values when using filter as not in array

SELECT * FROM Entity e WHERE e.Status <> ANY(ARRAY[1,2,3]);
Here Status is a nullable integer column. Using the above query, I am unable to fetch the records whose Status value is NULL.
SELECT * FROM Entity e WHERE (e.Status is NULL OR e.Status = 4);
This query does the trick. Could someone explain why the first query was not working as expected?
NULL kinda means "unknown", so the expressions
NULL = NULL
and
NULL != NULL
are neither true nor false, they're NULL. Because it is not known whether an "unknown" value is equal or unequal to another "unknown" value.
Since <> ANY compares the left-hand value against each array element, if the value searched in the array is NULL, then the result will be NULL.
So your second query is correct.
It is spelled out in the docs under Array ANY:
If the array expression yields a null array, the result of ANY will be null. If the left-hand expression yields null, the result of ANY is ordinarily null (though a non-strict comparison operator could possibly yield a different result). Also, if the right-hand array contains any null elements and no true comparison result is obtained, the result of ANY will be null, not false (again, assuming a strict comparison operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.
FYI:
e.Status IS NULL OR e.Status = 4
can be shortened to:
e.Status IS NOT DISTINCT FROM 4
per the Comparison operators documentation.
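A few self-contained examples of these semantics, runnable in any Postgres session:

```sql
-- = and <> against NULL are neither true nor false; they are NULL:
SELECT NULL = NULL;                       -- NULL
SELECT NULL <> ANY(ARRAY[1, 2, 3]);       -- NULL, so WHERE filters the row out

-- IS NOT DISTINCT FROM is the null-safe equality test:
SELECT NULL IS NOT DISTINCT FROM NULL;    -- true
SELECT 4    IS NOT DISTINCT FROM 4;       -- true
```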

Druid SQL: filter on result of expression

I have HTTP access log data in a Druid data source, and I want to see access patterns based on certain identifiers in the URL path. I wrote this query, and it works fine:
select regexp_extract(path, '/id/+([0-9]+)', 1) as "id",
sum("count") as "request_count"
from "access-logs"
where __time >= timestamp '2022-01-01'
group by 1
The only problem is that not all requests match that pattern, so I get one row in the result with an empty "id". I tried adding an extra condition in the where clause:
select regexp_extract(path, '/id/+([0-9]+)', 1) as "id",
sum("count") as "request_count"
from "access-logs"
where __time >= timestamp '2022-01-01' and "id" != ''
group by 1
But when I do that, I get this error message:
Error: Plan validation failed: org.apache.calcite.runtime.CalciteContextException:
From line 4, column 46 to line 4, column 49: Column 'id' not found in any table
So it doesn't let me reference the result of the expression in the where clause. I could of course just copy the entire regexp_extract expression, but is there a cleaner way of doing this?
Since id is computed in the SELECT list, it is not visible to the WHERE clause (which runs before grouping); you would need a HAVING clause to filter on it.
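A sketch of that, reusing the query from the question. If your Druid version rejects the alias in HAVING, repeat the regexp_extract expression there instead; and depending on Druid's null-handling mode, the non-matching value may be NULL rather than '', in which case filter with IS NOT NULL.

```sql
SELECT regexp_extract(path, '/id/+([0-9]+)', 1) AS "id",
       SUM("count") AS "request_count"
FROM "access-logs"
WHERE __time >= TIMESTAMP '2022-01-01'
GROUP BY 1
HAVING "id" != ''
```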

Postgres Querying/filtering JSONB with nested arrays

Below is my sample requirement. I want customers who meet all of the conditions below:
In country "xyz", incorporated between 2019 and 2021.
Has at least one account with a balance between 10000 and 13000, branch "abc", and transaction dates between 20200110 and 20210625 (dates are formatted and stored as numbers).
Has at least one address in the state "state1" with a pin code between 625001 and 625015.
Below is table structure
CREATE TABLE customer_search_ms.customer
(
customer_id integer,
customer_details jsonb
)
There can be millions of rows in the table.
I have created a GIN index of type jsonb_ops on the customer_details column, as we would also be checking existence conditions and range comparisons.
Below is sample data in the customer_details JSONB column:
customer_id : 1
{
  "customer_data": {
    "name": "abc",
    "incorporated_year": 2020,
    "country": "xyz",
    "account_details": [
      {
        "transaction_dates": [
          20180125, 20190125, 20200125, 20200525
        ],
        "account_id": 1016084,
        "account_balance": 2000,
        "account_branch": "xyz"
      },
      {
        "transaction_dates": [
          20180125, 20190125, 20200125
        ],
        "account_id": 1016087,
        "account_balance": 12010,
        "account_branch": "abc"
      }
    ],
    "address": [
      {
        "address_id": 24739,
        "door_no": 4686467,
        "street_name": "street1",
        "city": "city1",
        "state": "state1",
        "pin_code": 625001
      },
      {
        "address_id": 24730,
        "door_no": 4686442,
        "street_name": "street2",
        "city": "city1",
        "state": "state1",
        "pin_code": 625014
      }
    ]
  }
}
Now the query I have written for the above is:
SELECT c.customer_id,
c.customer_details
FROM customer_search_ms.customer c
WHERE c.customer_details @@ CAST('$.customer_data.country == "xyz" && $.customer_data.incorporated_year >= 2019 && $.customer_data.incorporated_year <= 2021' AS JSONPATH)
AND c.customer_details @? CAST('$.customer_data.account_details[*] ? (@.account_balance >= 10000) ? (@.account_balance <= 13000) ? (@.account_branch == "abc") ? (@.transaction_dates >= 20200110) ? (@.transaction_dates <= 20210625)' AS JSONPATH)
AND c.customer_details @? CAST('$.customer_data.address[*] ? (@.state == "state1") ? (@.pin_code >= 625001) ? (@.pin_code <= 625015)' AS JSONPATH)
Is this the best way to handle the above scenario? Is it possible to combine all three criteria (customer/account/address) into one expression? The table will have millions of rows, and I am of the opinion that having it as one expression and hitting the DB once will give the best performance.
Your query does not give me the error you report. Rather, it runs, but gives the "wrong" results compared to what you want. There are several mistakes in it which are not syntax errors, but simply give wrong results.
Your first jsonpath looks fine. It is a boolean expression, and @@ checks whether that expression yields true.
Your second jsonpath has two problems. It yields a list of objects which match your conditions. But objects are not booleans, so @@ will be unhappy and return SQL NULL, which is treated the same as false here. Instead, you need to test whether that list is empty. This is what @? does, so use it instead of @@. Also, your dates are stored as 8-digit integers, but you are comparing them to 8-character strings. In jsonpath, such cross-type comparisons yield JSON null, which is treated the same as false here. So you either need to change the storage to strings, or change the literals they are compared against to integers.
Your third jsonpath also has the @@ problem. And it has the reverse of the type problem: you have pin_code stored as strings, but you are testing it against integers. Finally, you have 'pin_code' misspelled in one occurrence.
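Putting those fixes together, the second predicate might look like this sketch (assuming the dates are kept as 8-digit integers; note that in lax mode each filter unwraps the transaction_dates array independently, so the two date bounds may be satisfied by different elements, just as in the original):

```sql
AND c.customer_details @? CAST('$.customer_data.account_details[*]
      ? (@.account_balance >= 10000 && @.account_balance <= 13000
         && @.account_branch == "abc")
      ? (@.transaction_dates >= 20200110 && @.transaction_dates <= 20210625)'
    AS JSONPATH)
```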

NULL values in NOT NULL expression

I have this ID column coming out as NULL even though I have put a NOT NULL condition on it. It's not an empty-string issue or a special-character issue, because I tried using this expression:
AND COALESCE(SYSTEM_ID,ASSET_NUMBER,SERIAL_NUMBER,'X') <> 'X'
in my WHERE clause (and the same in SELECT), and 'X' still comes out in the output. Why is this happening? I have provided sample SELECT code below. I am able to solve it by putting the filter in a HAVING clause or by using an outer query.
SELECT COALESCE(COL1,COL2,COL3) AS ID,
COL4,
...
COL16
FROM TABLE
WHERE COALESCE(COL1,COL2,COL3) IS NOT NULL
GROUP BY
COL4,
...
COL16
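The outer-query workaround mentioned above can be sketched like this (TABLE is a reserved word, so a stand-in name is used; the column list is shortened, and the derived ID is added to the GROUP BY here):

```sql
SELECT *
FROM (SELECT COALESCE(COL1, COL2, COL3) AS ID,
             COL4
      FROM MY_TABLE
      GROUP BY COALESCE(COL1, COL2, COL3), COL4) t
WHERE t.ID IS NOT NULL
```

The HAVING variant keeps the original query's shape and adds HAVING COALESCE(COL1, COL2, COL3) IS NOT NULL after the GROUP BY.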

PostgreSQL: Find and delete duplicated jsonb data, excluding a key/value pair when comparing

I have been searching all over for a way to do this.
I am trying to clean up a table with a lot of duplicated jsonb fields.
There are some examples out there, but as a little twist, I need to exclude one key/value pair in the jsonb field to get the result I need.
Example jsonb:
{
  "main": {
    "orders": {
      "order_id": "1",
      "customer_id": "1",
      "updated_at": "11/23/2017 17:47:13"
    }
  }
}
Compared to:
{
  "main": {
    "orders": {
      "order_id": "1",
      "customer_id": "1",
      "updated_at": "11/23/2017 17:49:53"
    }
  }
}
If I can exclude the "updated_at" key when comparing, the query should find these entries to be duplicates, and the duplicated entries should be deleted, keeping only the first "original" one.
I have found this query to try to find the duplicates, but it doesn't take my situation into account. Maybe someone can help restructure it to meet the requirements.
SELECT t1.jsonb_field
FROM customers t1
INNER JOIN (SELECT jsonb_field, COUNT(*) AS CountOf
FROM customers
GROUP BY jsonb_field
HAVING COUNT(*)>1
) t2 ON t1.jsonb_field=t2.jsonb_field
WHERE
t1.customer_id = 1
Thanks in advance :-)
If updated_at is always at the same path, then you can remove it before comparing:
SELECT t1.jsonb_field
FROM customers t1
INNER JOIN (SELECT jsonb_field #- '{main,orders,updated_at}' AS stripped_field,
                   COUNT(*) AS CountOf
            FROM customers
            GROUP BY jsonb_field #- '{main,orders,updated_at}'
            HAVING COUNT(*) > 1
           ) t2
        ON t1.jsonb_field #- '{main,orders,updated_at}' = t2.stripped_field
WHERE t1.customer_id = 1
Note that the inner query must also group on the stripped value; grouping on the raw jsonb_field would never count rows that differ only in updated_at as duplicates.
See the additional operators at https://www.postgresql.org/docs/9.5/static/functions-json.html
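To go from finding the duplicates to actually deleting them while keeping one row per group, a Postgres-specific sketch (ctid is the physical row identifier; the surviving row is arbitrary, not necessarily the oldest; test on a copy first):

```sql
-- Delete every row that has a lower-ctid twin with the same
-- jsonb content once updated_at is stripped out.
DELETE FROM customers c1
USING customers c2
WHERE c1.ctid > c2.ctid
  AND c1.jsonb_field #- '{main,orders,updated_at}'
    = c2.jsonb_field #- '{main,orders,updated_at}';
```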
EDIT
If you don't have #-, you might just cast to text and do a regex replace:
regexp_replace(t1.jsonb_field::text, '"updated_at": "[^"]*?"', '')::jsonb
=
regexp_replace(t2.jsonb_field::text, '"updated_at": "[^"]*?"', '')::jsonb
I even think you don't need to cast it back to jsonb, but do so to be safe.
Mind that the regex matches ANY "updated_at" field (by key) anywhere in the json. It should not match data, because in data it would not match an escaped closing quote \", nor find the colon after it.
Note the regex actually should be '"updated_at": "[^"]*?",?'
But on SQL Fiddle that fails (maybe it depends on the Postgres build; check with your version, because as far as the regex goes, this is correct).
If the comma is not removed, the cast back to jsonb fails.
You can try '"updated_at": "[^"]*?",'
(no ?): that will remove the comma, but fail if updated_at was the last key in the object.
Worst case, nest the two:
regexp_replace(
  regexp_replace(t1.jsonb_field::text, '"updated_at": "[^"]*?",', ''),
  '"updated_at": "[^"]*?"', '')::jsonb
For PostgreSQL 9.4 (SQL Fiddle only has 9.3 and 9.6; 9.3 is missing json_object_agg, but the Postgres docs say it is available from 9.4 on, so this should work):
This will only work if all records have objects under the important keys (main -> orders). If main -> orders is a json array or a scalar, it may give an error; same for {"main": [1,2]} => error.
json_each returns a set with one row per key in the json; json_object_agg aggregates the rows back into a json object. The CASE statement picks out the one key on each level that needs special handling, and at the deepest nesting level it filters out the updated_at row.
On SQL Fiddle, set the query separator to '//'; if you use the psql client, replace the // with ;
create or replace function foo(x json)
returns jsonb
language sql
as $$
  select json_object_agg(key,
    case key when 'main' then
      (select json_object_agg(t2.key,
         case t2.key when 'orders' then
           (select json_object_agg(t3.key, t3.value)
            from json_each(t2.value) as t3
            where t3.key <> 'updated_at')
         else t2.value
         end)
       from json_each(t1.value) as t2)
    else t1.value
    end)::jsonb
  from json_each(x) as t1
$$ //
select foo(x)
from
(select '{ "main":{"orders":{"order_id": "1", "customer_id": "1", "updated_at": "11/23/2017 17:49:53" }}}'::json as x) as t1
x (the argument) may need to be jsonb, if that is your datatype