Replacing null values with the average of values grouped by concatenated categories in Teradata

Suppose I have many NULL (missing) values in a column named 'score'. I want to replace them not with the overall average of 'score', but with per-group averages, where the groups are built as a cross-category from two concatenated category columns.
This kind of query works for getting averages by groups:
SELECT
category1 || ' > ' || category2 AS crosscategory,
ROUND(CAST(AVG(score) AS FLOAT), 2) AS score_avg
FROM DatabaseName.TableName
GROUP BY crosscategory
ORDER BY score_avg;
This one works to replace NULL values by a constant:
SELECT
NVL(score, 0) AS score_without_missing_values
FROM DatabaseName.TableName
The problem I cannot solve now is how to combine the two: replacing the NULL values not with a constant but with the group averages computed with AVG and GROUP BY.
Thank you very much for your help!

Seems you want a Group Average:
SELECT
t.*,
coalesce(score, AVG(score) OVER (PARTITION BY category1, category2)) AS score_avg
FROM DatabaseName.TableName AS t
I removed the ROUND/CAST because AVG returns FLOAT by default and ROUND is probably not needed (if you do need it, you might be better off casting to a DECIMAL).
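As an aside (not part of the original answer), if rounding is still wanted, the cast-to-DECIMAL variant might look like this sketch against the same table; score_filled is just an illustrative alias:
SELECT
t.*,
COALESCE(score,
         CAST(AVG(score) OVER (PARTITION BY category1, category2) AS DECIMAL(10,2))) AS score_filled
FROM DatabaseName.TableName AS t;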

Create rows from part of column names

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
12345678 | null              | null              | 06/01/2020        | E                 | null              | null
98765432 | 05/18/2020        | B                 | null              | null              | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
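If the to_date() conversion mentioned above is wanted, the string values can be wrapped directly in the final select. A minimal sketch reusing the parse CTE above; the 'MM/DD/YYYY' format string is an assumption based on the sample data shown here:
select id,
       to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
       to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;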
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down a bit, and it seemed I could get the same results using a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, the query went from an average of 7 seconds on my local machine to an average of 0.9 seconds.
Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;

What does filter mean?

select driverid, count(*)
from f1db.results
where position is null
group by driverid order by driverid
My thought: first find all the records where position is null, then apply the aggregate function.
select driverid, count(*) filter (where position is null) as outs
from f1db.results
group by driverid order by driverid
This is the first time I have come across the FILTER clause, and I am not sure what it means.
The two code blocks return different results.
I have already googled, but there don't seem to be many tutorials about FILTER. Kindly share some links.
A more helpful term to search for is aggregate filters, since FILTER (WHERE) is only used with aggregates.
A filter clause is an additional filter that is applied only to that aggregate and nowhere else; it doesn't affect the rest of the query!
You can use it to aggregate over a subset of the data, for example to get the percentage of cats in a pet shop:
SELECT shop_id,
CAST(COUNT(*) FILTER (WHERE species = 'cat')
AS DOUBLE PRECISION) / COUNT(*) as "percentage"
FROM animals
GROUP BY shop_id
For more information, see the docs
FILTER ehm... filters the records which should be aggregated - in your case, your aggregation counts all positions that are NULL.
E.g. if you have 10 records,
SELECT COUNT(*)
would return 10.
If 2 of the records have position = NULL, then
SELECT COUNT(*) FILTER (WHERE position IS NULL)
would return 2.
What you want to achieve is to count all non-NULL records:
SELECT COUNT(*) FILTER (WHERE position IS NOT NULL)
In the example, this returns 8.
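A side note not from the original answers: on engines without FILTER support, the same per-aggregate filtering is commonly emulated with CASE inside the aggregate, because COUNT only counts non-NULL results. A sketch against the same table:
SELECT driverid,
       COUNT(CASE WHEN position IS NULL THEN 1 END) AS outs
FROM f1db.results
GROUP BY driverid ORDER BY driverid;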

Use sum function in calculated column

Is it possible to use a SUM function in a calculated column?
If yes, I would like to create a calculated column that computes the sum of another column in the same table over the rows whose date is smaller than the date of the current entry. Is this possible?
And lastly, would this speed up repeated reads of this value compared with the view exemplified below?
SELECT ProductGroup, SalesDate,
    (SELECT SUM(Sales)
     FROM SomeList
     WHERE ProductGroup = KVU.ProductGroup AND SalesDate <= KVU.SalesDate) AS cumulated
FROM SomeList AS KVU
Is it possible to use a sum function in a calculated column?
Yes, it's possible using a scalar-valued function (scalar UDF) for your computed column, but this would be a disaster. Using scalar UDFs for computed columns destroys performance, and a scalar UDF that accesses data (which would be required here) makes things even worse.
It sounds to me like you just need a good ol' fashioned index to speed things up. First some sample data:
IF OBJECT_ID('dbo.somelist','U') IS NOT NULL DROP TABLE dbo.somelist;
GO
CREATE TABLE dbo.somelist
(
ProductGroup INT NOT NULL,
[Month] TINYINT NOT NULL CHECK ([Month] <= 12),
Sales DECIMAL(10,2) NOT NULL
);
INSERT dbo.somelist
VALUES (1,1,22),(2,1,45),(2,1,25),(2,1,19),(1,2,100),(1,2,200),(2,2,50.55);
and the correct index:
CREATE NONCLUSTERED INDEX nc_somelist ON dbo.somelist(ProductGroup,[Month])
INCLUDE (Sales);
With this index in place this query would be extremely efficient:
SELECT s.ProductGroup, s.[Month], SUM(s.Sales)
FROM dbo.somelist AS s
GROUP BY s.ProductGroup, s.[Month];
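A side note not in the original answer: if the cumulated total from the question is what is actually needed, a window aggregate (available in SQL Server 2012+) computes it in a single pass, without the correlated subquery:
SELECT s.ProductGroup, s.[Month], s.Sales,
       SUM(s.Sales) OVER (PARTITION BY s.ProductGroup
                          ORDER BY s.[Month]
                          ROWS UNBOUNDED PRECEDING) AS cumulated
FROM dbo.somelist AS s;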
If you needed to get a COUNT by month & product group you could create an indexed view like so:
CREATE VIEW dbo.vw_somelist WITH SCHEMABINDING AS
SELECT s.ProductGroup, s.[Month], TotalSales = COUNT_BIG(*)
FROM dbo.somelist AS s
GROUP BY s.ProductGroup, s.[Month];
GO
CREATE UNIQUE CLUSTERED INDEX uq_cl__vw_somelist ON dbo.vw_somelist(ProductGroup, [Month]);
Once that indexed view was in place, your COUNTs would be pre-aggregated. You can also include SUM in an indexed view, but only over an expression that cannot be NULL (Sales is declared NOT NULL here), and the view must contain COUNT_BIG(*) as well.
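A sketch of that variant; the view name vw_somelist_sum is illustrative, not from the original answer:
CREATE VIEW dbo.vw_somelist_sum WITH SCHEMABINDING AS
SELECT s.ProductGroup, s.[Month],
       SUM(s.Sales) AS TotalSales, -- allowed because Sales is declared NOT NULL
       COUNT_BIG(*) AS Cnt         -- required in an aggregated indexed view
FROM dbo.somelist AS s
GROUP BY s.ProductGroup, s.[Month];
GO
CREATE UNIQUE CLUSTERED INDEX uq_cl__vw_somelist_sum ON dbo.vw_somelist_sum(ProductGroup, [Month]);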

How to group by date and calculate the averages at the same time

I am quite new to this, so here it goes: I am trying to convert from unixtime to date format and then group by that date while calculating the average of another column. This is in MariaDB.
CREATE OR REPLACE
VIEW `history_uint_view` AS select
`history_uint`.`itemid` AS `itemid`,
date(from_unixtime(`history_uint`.`clock`)) AS `mydate`,
AVG(`history_uint`.`value`) AS `value`
from
`history_uint`
where
    month(from_unixtime(`history_uint`.`clock`)) = month(now() - interval 1 month)
    and `history_uint`.`value` in (1, 0)
    and `history_uint`.`itemid` in (54799, 54810, 54821, 54832, 54843, 54854, 54865, 54876, 54887, 54898, 54909, 54920, 58165, 58226, 59337, 59500, 59503, 59506, 60621, 60624, 60627, 60630, 60633, 60636, 60639, 60642, 60645, 60648, 60651, 60654, 60657, 60660, 60663, 60666, 60669, 60672, 60675, 60678, 60681, 60684, 60687, 60690, 60693, 60696, 60699, 64610)
GROUP by 'itemid', 'mydate', 'value'
When you select aggregate functions (like AVG) together with non-aggregated columns, you should list all of the non-aggregated columns in the GROUP BY clause.
So your group by should look like:
GROUP by itemid, mydate
If you use single quotes (like 'itemid'), MariaDB treats them as strings, not columns.
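Putting both fixes together, a minimal self-contained sketch of the corrected pattern (the month and itemid filters from the question are omitted here for brevity):
CREATE OR REPLACE VIEW `history_uint_view` AS
select `history_uint`.`itemid` AS `itemid`,
       date(from_unixtime(`history_uint`.`clock`)) AS `mydate`,
       AVG(`history_uint`.`value`) AS `value`
from `history_uint`
where `history_uint`.`value` in (1, 0)
GROUP by `itemid`, `mydate`;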

How to convert null rows to 0 and sum the entire column using DB2?

I'm using the following query to sum entire columns. The TO_REMOVEALLPRIV column contains both integer and null values.
I want to sum over both the null and the integer values and print the total sum.
Here is my query, which prints the sum as null:
select
sum(URT.PRODSYS) as URT_SUM_PRODSYS,
sum(URT.Users) as URT_SUM_USERS,
sum(URT.total_orphaned) as URT_SUM_TOTAL_ORPHANED,
sum(URT.Bp_errors) as URT_SUM_BP_ERRORS,
sum(URT.Ma_errors) as URT_SUM_MA_ERRORS,
sum(URT.Pp_errors) as URT_SUM_PP_ERRORS,
sum(URT.REQUIREURTCBN) as URT_SUM_CBNREQ,
sum(URT.REQUIREURTQEV) as URT_SUM_QEVREQ,
sum(URT.REQUIREURTPRIV) as URT_SUM_PRIVREQ,
sum(URT.cbnperf) as URT_SUM_CBNPERF,
sum(URT.qevperf) as URT_SUM_QEVPERF,
sum(URT.privperf) as URT_SUM_PRIVPERF,
sum(URT.TO_REMOVEALLPRIV) as TO_REMOVEALLPRIV_SUM
from
URTCUSTSTATUS URT
inner join CUSTOMER C on URT.customer_id=C.customer_id;
Expected output: instead of null, I want the sum of whichever rows have integers.
The SUM function automatically handles that for you. You said the column had a mix of NULL and numbers; the SUM automatically ignores the NULL values and correctly returns the sum of the numbers. You can read it on IBM Knowledge Center:
The function is applied to the set of values derived from the argument values by the elimination of null values.
Note: all aggregate functions ignore NULL values; the exception is COUNT(*), which counts rows rather than values. Example: if you have two records with values 5 and NULL, SUM and AVG both return 5, COUNT(value) returns 1, and COUNT(*) returns 2.
However, it seems that you misunderstood why you're getting NULL as a result. It's not because the column contains null values; SUM returns NULL only when no records are selected at all (or when every selected value is NULL). If you want to return zero in this case, you can use the COALESCE or IFNULL function. Both behave the same for this scenario:
COALESCE(sum(URT.TO_REMOVEALLPRIV), 0) as TO_REMOVEALLPRIV_SUM
or
IFNULL(sum(URT.TO_REMOVEALLPRIV), 0) as TO_REMOVEALLPRIV_SUM
I'm guessing that you want to do the same to all other columns in your query, so I'm not sure why you only complained about the TO_REMOVEALLPRIV column.
What you're looking for is the COALESCE function:
select
sum(URT.PRODSYS) as URT_SUM_PRODSYS,
sum(URT.Users) as URT_SUM_USERS,
sum(URT.total_orphaned) as URT_SUM_TOTAL_ORPHANED,
sum(URT.Bp_errors) as URT_SUM_BP_ERRORS,
sum(URT.Ma_errors) as URT_SUM_MA_ERRORS,
sum(URT.Pp_errors) as URT_SUM_PP_ERRORS,
sum(URT.REQUIREURTCBN) as URT_SUM_CBNREQ,
sum(URT.REQUIREURTQEV) as URT_SUM_QEVREQ,
sum(URT.REQUIREURTPRIV) as URT_SUM_PRIVREQ,
sum(URT.cbnperf) as URT_SUM_CBNPERF,
sum(URT.qevperf) as URT_SUM_QEVPERF,
sum(URT.privperf) as URT_SUM_PRIVPERF,
sum(COALESCE(URT.TO_REMOVEALLPRIV,0)) as TO_REMOVEALLPRIV_SUM
from
URTCUSTSTATUS URT
inner join CUSTOMER C on URT.customer_id=C.customer_id;
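One subtlety worth noting about the two answers, as an observation: sum(COALESCE(col, 0)) still yields NULL when the join produces no rows at all, whereas COALESCE(sum(col), 0) yields 0 in that case. If the zero-row case must also come out as 0, the two can be combined:
COALESCE(sum(COALESCE(URT.TO_REMOVEALLPRIV, 0)), 0) as TO_REMOVEALLPRIV_SUM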