array_agg group by and null - postgresql

Given this table:
SELECT * FROM CommodityPricing order by dateField
"SILVER";60.45;"2002-01-01"
"GOLD";130.45;"2002-01-01"
"COPPER";96.45;"2002-01-01"
"SILVER";70.45;"2003-01-01"
"GOLD";140.45;"2003-01-01"
"COPPER";99.45;"2003-01-01"
"GOLD";150.45;"2004-01-01"
"MERCURY";60;"2004-01-01"
"SILVER";80.45;"2004-01-01"
As of 2004, COPPER was dropped and MERCURY was introduced.
How can I get the value of (array_agg(value order by date desc) ) [1] as NULL for COPPER?
select commodity, (array_agg(value order by date desc)) --[1]
from CommodityPricing
group by commodity

Result:
"COPPER";"{99.45,96.45}"
"GOLD";"{150.45,140.45,130.45}"
"MERCURY";"{60}"
"SILVER";"{80.45,70.45,60.45}"

SQL Fiddle
select commodity,
       array_agg(
           case when commodity = 'COPPER' then null else value end
           order by date desc
       )
from CommodityPricing
group by commodity;

To "pad" missing rows with NULL values in the resulting array, build your query on a full grid of rows and LEFT JOIN the actual values to the grid.
Given this table definition:
CREATE TEMP TABLE price (
commodity text
, value numeric
, ts timestamp -- using ts instead of the inappropriate name date
);
I use generate_series() to get a list of timestamps representing the years and CROSS JOIN to a unique list of all commodities (SELECT DISTINCT ...).
SELECT commodity, array_agg(value ORDER BY ts DESC) AS years
FROM generate_series('2002-01-01 00:00:00'::timestamp
                   , '2004-01-01 00:00:00'::timestamp
                   , interval '1y') t(ts)
CROSS JOIN (SELECT DISTINCT commodity FROM price) c(commodity)
LEFT JOIN price p USING (ts, commodity)
GROUP BY commodity;
Result:
COPPER {NULL,99.45,96.45}
GOLD {150.45,140.45,130.45}
MERCURY {60,NULL,NULL}
SILVER {80.45,70.45,60.45}
SQL Fiddle.
I cast the array to text in the fiddle, because the display sucks and would swallow NULL values otherwise.
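A minimal standalone demo of the effect, using nothing beyond stock PostgreSQL: the text representation of an array spells out NULL elements instead of hiding them.
SELECT ARRAY[NULL, 99.45, 96.45]::text;
-- returns '{NULL,99.45,96.45}', with the NULL element visible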

Related

Use coalesce function in SQL to return zero counts of records without showing any column value as null

I am trying to return zero counts as results for a query. But when I run this query, a particular column's values come back as null.
select tab1.source_type,
       coalesce(tab2.numberofrecords, 0) as numberofrecords,
       coalesce(dt, current_date - 40) as dt,
       coalesce(client_id, client_id) as client_id
from (select distinct source_type
      from integration_customers) as tab1
left join
     (select count(id) as numberofrecords,
             source_type,
             date(created_at) as dt,
             client_id
      from integration_customers ic
      where date(created_at) = current_date - 39
        and source_type in
            (select distinct "source" from integration_integrationconfig ii where status = 'complete')
      group by source_type, dt, client_id
      order by dt desc) as tab2
on tab1.source_type = tab2.source_type
But the results of this query look something like this:
I want to remove these null values and show the client_id for each zero-count record as well.
The table integration_customers has the columns client_id, created_at and source_type.
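The zero-count rows have no match in tab2, so there is nothing for COALESCE to pull a client_id from. The fix is the same grid technique as in the array_agg answer above: build the full set of (source_type, client_id) combinations first, then LEFT JOIN the counts to it. A rough sketch (untested; it assumes the valid combinations can be taken from integration_customers itself, and it leaves out the integration_integrationconfig filter for brevity):
select g.source_type,
       g.client_id,
       coalesce(t.numberofrecords, 0) as numberofrecords,
       coalesce(t.dt, current_date - 40) as dt
from (select distinct source_type, client_id
      from integration_customers) as g          -- full grid of combinations
left join (select source_type,
                  client_id,
                  date(created_at) as dt,
                  count(id) as numberofrecords
           from integration_customers
           where date(created_at) = current_date - 39
           group by source_type, client_id, dt) as t
       on  g.source_type = t.source_type
       and g.client_id = t.client_id;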

Update null values in a column based on non null values percentage of the column

I need to update the null values of a column in a table, for each category, based on the percentage distribution of the non-null values.
There are only two types of values in the column, CV and NCV.
The number of rows with null values is 7; I need to randomly populate them based on the percentage share of the non-null values: 38% (CV) of 7 = 3, and 63% (NCV) of 7 = 4.
If you want to dynamically calculate the "NULL rate", one way to do it could be:
with pcts as (
    select
        (select count(*)::numeric from the_table where type = 'cv')
            / (select count(*) from the_table where type is not null) as cv_pct,
        (select count(*)::numeric from the_table where type = 'ncv')
            / (select count(*) from the_table where type is not null) as ncv_pct,
        (select count(*) from the_table where type is null) as null_count
), calc as (
    select d.ctid,
           p.cv_pct,
           p.ncv_pct,
           row_number() over () as rn,
           case
               when row_number() over () <= round(p.null_count * p.cv_pct) then 'cv'
               else 'ncv'
           end as new_type
    from the_table d
    cross join pcts p
    where type is null
)
update the_table t
set type = c.new_type
from calc c
where t.ctid = c.ctid
The first CTE calculates the percentage of each type and the total number of NULL values (in theory the percentage of the NCV type isn't really needed, but I included it for completeness).
The second CTE then calculates, for each NULL row, which new type should be used. This is done by comparing the row's number with the expected share of NULL rows (null_count * cv_pct) in the CASE expression.
This is then used to update the target table. I have used the ctid as a substitute for a primary key, because your sample data does not have any unique column (or combination of columns). If you do have a primary key that you haven't shown, replace ctid with that primary key column.
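For illustration only, with a hypothetical primary key column id (and calc selecting d.id instead of d.ctid), the final statement would become:
update the_table t
set type = c.new_type
from calc c
where t.id = c.id;  -- id stands in for your real primary key column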
I wouldn't be surprised though, if there was a shorter, more efficient way to do it, but for now I can't think of a better alternative.
Online example
If you are on PG11 or later, you can use the groups frame to do this in what should be close to a single pass (except reordering for output when sorted by tid) with window functions:
select tid, category, id, type,
       case
           when type is not null then type
           when round(
                    (count(*) over (partition by category
                                    order by type nulls last
                                    groups between 2 preceding and 2 preceding))::numeric
                    /
                    coalesce(
                        nullif(
                            count(*) over (partition by category
                                           order by type nulls last
                                           groups 2 preceding exclude group), 0), 1)
                    *
                    count(*) over (partition by category
                                   order by type nulls last
                                   groups current row)
                ) >= row_number() over (partition by category, type
                                        order by tid)
               then first_value(type) over (partition by category
                                            order by type nulls last
                                            groups between 2 preceding and 2 preceding)
           else first_value(type) over (partition by category
                                        order by type nulls last
                                        groups 1 preceding exclude group)
       end as extended_type
from cv_ncv
order by tid;
Working fiddle here.

How to join on closest date in Postgresql

Suppose I have the following tables.
product_prices:
product|price|date
-------+-----+----------
apple  |10   |2014-03-01
apple  |20   |2014-05-02
egg    |2    |2014-03-03
egg    |4    |2015-10-12
purchases:
user|product|date
----+-------+----------
John|apple  |2014-03-02
John|apple  |2014-06-03
John|egg    |2014-08-13
John|egg    |2016-08-13
What I need is a table similar to this:
name|product|purchase date|price date|price
----+-------+-------------+----------+-----
John|apple  |2014-03-02   |2014-03-01|10
John|apple  |2014-06-03   |2014-05-02|20
John|egg    |2014-08-13   |2014-03-03|2
John|egg    |2016-08-13   |2015-10-12|4
In other words: "what is the price for this product on this day?", where the price is determined from the closest preceding date in the product_prices table.
On the real DB I tried something similar to:
SELECT name, product, pu.date, pp.date, pp.price
FROM purchases AS pu
LEFT JOIN product_prices AS pp
ON pu.date = (
SELECT date
FROM product_prices
ORDER BY date DESC LIMIT 1);
But I keep getting either only the left part of the table (with (null) instead of price dates and prices) or many rows with all the combinations of prices and dates.
I would suggest changing the product_prices table to use a daterange column instead (or at least a start_date and an end_date).
You can use an exclusion constraint to make sure you never have overlapping ranges for one product and an insert trigger that "closes" the "current" prices and creates a new unbounded range for the newly inserted price.
A daterange can efficiently be indexed and with that in place the query gets as easy as:
SELECT name, product, pu.date, pp.valid_during, pp.price
FROM purchases AS pu
LEFT JOIN product_prices AS pp ON pu.date <@ pp.valid_during
(assuming the range column is named valid_during; <@ is the "element is contained by range" operator)
The exclusion constraint would only work however if the product was an integer (not a varchar) - but I guess your real product_purchases table uses a foreign key to some product table anyway (which is an integer).
The new table definitions could look something like this:
create table purchase_prices
(
product_id integer not null references products,
price numeric(16,4) not null,
valid_during daterange not null
);
And the constraint that prevents overlapping ranges:
alter table purchase_prices
add constraint check_price_range
exclude using gist (product_id with =, valid_during with &&);
The constraint needs the btree_gist extension.
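btree_gist is one of the standard contrib modules, so it only needs to be enabled once per database:
CREATE EXTENSION IF NOT EXISTS btree_gist;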
As always, improving query speed comes at a price, and in this case it's the higher maintenance cost of the GiST index. You would need to run some tests to see if the easier (and most probably much faster) query outweighs the slower insert performance on purchase_prices.
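The insert trigger that "closes" the current price range (mentioned at the top of this answer) is not shown; here is a minimal sketch, assuming new prices are inserted with an unbounded upper bound such as daterange(current_date, null):
create function close_current_price() returns trigger
language plpgsql as
$$
begin
    -- close the still-open range for this product right before the new one starts
    update purchase_prices
       set valid_during = daterange(lower(valid_during), lower(new.valid_during), '[)')
     where product_id = new.product_id
       and upper_inf(valid_during);
    return new;
end;
$$;

create trigger close_price_before_insert
    before insert on purchase_prices
    for each row execute procedure close_current_price();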
Look at your scalar sub-query very closely. It is not correlated back to the outer query. In other words, it will return the same result every time: the latest date in the product_prices table. Period. Think about the query out of context:
SELECT date
FROM product_prices
ORDER BY date DESC LIMIT 1
There are two problems with it:
It will return 2015-10-12 for every row in the join, and since nothing was purchased on that date, the join never matches: hence the nulls.
Your approximation of closest is that the dates are equal. Unless you have a product_prices row for every product for every single date, you'll always have misses. "Closest" implies distance and ranking.
WITH close_prices_by_purchase AS (
    SELECT
        p."user",
        p.product,
        p.date AS purchase_date,
        pp.date AS price_date,
        pp.price,
        row_number() over (partition by p."user", p.product, p.date
                           order by pp.date desc) as distance -- rank price dates per purchase, closest first
    FROM purchases AS p
    INNER JOIN product_prices AS pp ON pp.product = p.product
    WHERE pp.date <= p.date
)
SELECT "user" AS name, product, purchase_date, price_date, price
FROM close_prices_by_purchase AS cpbp
WHERE distance = 1; -- shortest distance
You can try something like this, although I am sure there's a better way:
with diffs as (
select
a.*,
b."date" as bdate,
b.price,
b."date" - a."date" as diffdays,
row_number() over (
partition by "user", a."product", a."date"
order by "user", a."product", a."date", b."date" - a."date" desc
) as sr
from purchases a
inner join product_prices b on a.product = b.product
where b."date" - a."date" < 1
)
select
"user" as "name",
product,
"date" as "purchase date",
bdate as "price date",
price
from diffs
where sr = 1
Example: https://www.db-fiddle.com/f/dwQ9EXmp1SdpNpxyV1wc6M/0
Explanation
I joined both tables, computed the difference between the purchase date and the price date, and ranked the matches by the closest price date prior to the purchase. Rank 1 goes to the closest date; rows with rank 1 are then extracted.
This is a great place to use date ranges! We know the start date of the price range and we can use a window function to get the next date. At that point, it's really easy to figure out the price on any day.
with price_ranges as
(
    select product,
           price,
           date as price_date,
           daterange(date,
                     lead(date, 1) over (partition by product order by date),
                     '[)') as valid_price_range
    from product_prices
)
select "user" as name,
       purchases.product,
       purchases.date,
       price_date,
       price
from purchases
join price_ranges on purchases.product = price_ranges.product
                 and purchases.date <@ price_ranges.valid_price_range
order by purchases.date;

postgres "ntile(2) over (order by column)" invalid order

I've noticed strange behavior using Postgres ntile on a big table. It does not seem to order correctly. The example is:
select tile, min(price) as minprice, max(price) as maxprice, count(*)
from (
select ntile(2) over(order by price ASC) as tile, adult_price as price
from statistics."trips"
where back_date IS NULL AND adult_price IS NOT null
and departure = 297 and arrival = 151
) as t
group by tile
It gives a really strange result, in my opinion:
tile  minprice  maxprice  count
1     2250      5359      74257
2     2250      27735     74257
Nothing is special about count and maxprice; they could be correct for some data sets. But minprice shows that something is wrong with the ordering.
How can the minprice of the 2nd tile be less than the maxprice of the 1st? Yet this is confirmed by checking the result of the inner select: it is ordered only partially and has some "mixed" parts where the price order is broken.
Splitting into 2 parts manually shows the same min and max prices, but different middle prices:
select * from (
select row_number() over (order by adult_price ASC) as row_number, adult_price
from statistics."trips"
where back_date IS NULL AND adult_price IS NOT null
and departure = 297 and arrival = 151
order by price
) as t
where row_number IN(1, 74257, 74258, 148514)
row_number  adult_price
1           2250
74257       4075
74258       4075
149413      27735
It must be the confusing column names.
The column price in the outer query is actually adult_price from the inner query, while the ntile is calculated ordering by a different column (the table's own price column in the inner query).
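Based on that, a sketch of the corrected query, with the window ordered by the same column the aggregates use:
select tile, min(price) as minprice, max(price) as maxprice, count(*)
from (
    select ntile(2) over (order by adult_price ASC) as tile,
           adult_price as price
    from statistics."trips"
    where back_date IS NULL AND adult_price IS NOT NULL
    and departure = 297 and arrival = 151
) as t
group by tile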

How to Calculate Gap Between two Dates in SQL Server 2005?

I have a data set as shown in the picture.
I am trying to get the date difference between eligenddate (first row) and eligstartdate (second row). I would really appreciate any suggestions.
Thank you
SQL2005:
One solution is to insert the rows, together with a row number (generated using [elig]startdate as the ORDER BY criterion; also see note #1), into a table variable (@DateWithRowNum, if the number of rows is small) or a temp table (#DateWithRowNum, if the number of rows is high) and then to self-join, thus:
DECLARE @DateWithRowNum TABLE (
    memberid VARCHAR(50) NOT NULL,
    rownum INT,
    PRIMARY KEY(memberid, rownum),
    startdate DATETIME NOT NULL,
    enddate DATETIME NOT NULL
)
INSERT @DateWithRowNum (memberid, rownum, startdate, enddate)
SELECT memberid,
       ROW_NUMBER() OVER(PARTITION BY memberid ORDER BY startdate),
       startdate,
       enddate
FROM dbo.MyTable

SELECT crt.*, DATEDIFF(MONTH, prev.enddate, crt.startdate) AS gap
FROM @DateWithRowNum crt
LEFT JOIN @DateWithRowNum prev ON crt.memberid = prev.memberid AND crt.rownum - 1 = prev.rownum
ORDER BY crt.memberid, crt.rownum
Another solution is to use a common table expression instead of the table variable / temp table, thus:
;WITH DateWithRowNum AS (
    SELECT memberid,
           ROW_NUMBER() OVER(PARTITION BY memberid ORDER BY startdate) AS rownum,
           startdate,
           enddate
    FROM dbo.MyTable
)
SELECT crt.*, DATEDIFF(MONTH, prev.enddate, crt.startdate) AS gap
FROM DateWithRowNum crt
LEFT /*HASH*/ JOIN DateWithRowNum prev ON crt.memberid = prev.memberid AND crt.rownum - 1 = prev.rownum
ORDER BY crt.memberid, crt.rownum
Note #1: I assume that you need to calculate these values for every memberid
Note #2: The HASH hint (activated by removing the comment markers around /*HASH*/) forces SQL Server to evaluate every data source (crt or prev) of the LEFT JOIN just once.
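For reference, the activated form of the hint looks like this (in T-SQL the join hint sits between the join type and the JOIN keyword):
FROM DateWithRowNum crt
LEFT HASH JOIN DateWithRowNum prev
    ON crt.memberid = prev.memberid AND crt.rownum - 1 = prev.rownum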