I have some Postgres data like this:
date | count
2015-01-01 | 20
2015-01-02 | 15
2015-01-05 | 30
I want to run a query that pulls this data with 0s in place for the dates that are missing, like this:
date | count
2015-01-01 | 20
2015-01-02 | 15
2015-01-03 | 0
2015-01-04 | 0
2015-01-05 | 30
This is for a very large range of dates, and I need it to fill in all the gaps. How can I accomplish this with just SQL?
Given a table junk of:
d | c
------------+----
2015-01-01 | 20
2015-01-02 | 15
2015-01-05 | 30
Running
select fake.d, coalesce(j.c, 0) as c
from (select min(d) + generate_series(0,7,1) as d from junk) fake
left outer join junk j on fake.d=j.d;
gets us:
d | c
------------+----------
2015-01-01 | 20
2015-01-02 | 15
2015-01-03 | 0
2015-01-04 | 0
2015-01-05 | 30
2015-01-06 | 0
2015-01-07 | 0
2015-01-08 | 0
You could, of course, adjust the start date for the series, the length it runs for, and so on.
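If you don't want to hard-code the series length, a variant (just a sketch, assuming the same junk table with columns d and c) can derive the bounds from the data itself:

select fake.d, coalesce(j.c, 0) as c
from (
  -- generate one row per day between the earliest and latest dates present
  select generate_series(min(d), max(d), interval '1 day')::date as d
  from junk
) fake
left outer join junk j on fake.d = j.d
order by fake.d;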
Where is this data going? To an outside source or another table or view?
There's probably a better solution, but you could create a new table (or do this in Excel, or wherever the data is going) that has the entire date range you want plus an integer column of null values. Then update that table from your current dataset and replace all remaining nulls with zero.
It's a really roundabout way to do things, but it'll work.
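If you did that directly in Postgres, a rough sketch (the date_fill table name is illustrative, and it reuses the junk table from the answer above) could look like:

create table date_fill (
    d date primary key,
    c integer
);

-- seed the full date range with null counts
insert into date_fill (d)
select generate_series(date '2015-01-01', date '2015-01-05', interval '1 day')::date;

-- copy in the counts that do exist
update date_fill f
set    c = j.c
from   junk j
where  j.d = f.d;

-- replace the remaining nulls with zero
update date_fill
set    c = 0
where  c is null;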
I don't have enough rep to comment :(
This is also a good reference: Using COALESCE to handle NULL values in PostgreSQL
I have a dataset with a column for Date that looks like this:
| Date | Another column |
| -------- | -------------- |
| 1.2019 | row1 |
| 2.2019 | row2 |
| 11.2018 | row3 |
| 8.2021 | row4 |
| 6.2021 | row5 |
The Date column is interpreted as a float dtype, but in reality 1.2019 means month 1 - that is, January - of the year 2019. I changed it to string type and it seemed to work well. But I want to plot this data against the total count of something, which is the second column of the dataset, and when I plot it:
the x-axis is not ordered. Well, why would it be? There is no ordered relationship between the strings 1.2019 and 2.2019: there is no way to know the first is January 2019 and the second is February. I thought of using regex, or even mapping 1.2019 to jan-2019, but the problem persists: the strings have no date ordering. I know there is the datetime method, but I don't know if it would help me.
How can I proceed? It is probably very easy, but I am stuck here!
Convert to datetime with pandas.to_datetime:
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%m.%Y')
or if you have a pandas version that refuses to convert if the day is missing:
pd.to_datetime('1.'+df['Date'].astype(str), format='%d.%m.%Y')
output:
Date Another column
0 2019-01-01 row1
1 2019-02-01 row2
2 2018-11-01 row3
3 2021-08-01 row4
4 2021-06-01 row5
I will be very grateful for your advice regarding the following issue.
Given:
PostgreSQL database
Initial (basic) query
select day, Value_1, Value_2, Value_3
from table
where day=current_date
which returns a row with the following columns:
Day | Value_1(int) | Value_2(int) | Value 3 (int)
2019-11-14 | 10 | 10 | 14
I need to create a view with this starting information that adds a new row every day, based on the outcome of the initial query executed at 22:00.
The expected outcome tomorrow at 22:01 will be:
Day | Value_1 | Value_2 | Value_3
2019-11-14 | 10 | 10 | 14
2019-11-15 | N | M | P
Many thanks in advance for your time and support.
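A view on its own can't accumulate rows over time, so one common pattern (only a sketch, assuming a scheduler such as cron or pg_cron is available; daily_snapshot and source_table are illustrative names) is a history table that a scheduled job appends to every day at 22:00:

create table daily_snapshot (
    day     date primary key,
    value_1 integer,
    value_2 integer,
    value_3 integer
);

-- run once per day at 22:00 by the scheduler;
-- source_table stands in for the table used in the initial query
insert into daily_snapshot (day, value_1, value_2, value_3)
select day, Value_1, Value_2, Value_3
from   source_table
where  day = current_date
on conflict (day) do nothing;

A view can then be defined over daily_snapshot if the result really needs to be exposed as a view.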
I have a query that returns a table of results and runs on a frequent basis. The new table will contain the results of the old table as well, but I only want to take whatever is new in the most recent run and send that as an email. I already have the line for the email and trade data, but I just need a way to:
display the results of the new table to be emailed
save the complete results of the new table to be used in the next run of the query
e.g.
Old results: tbl
| idx | name | age |
| 0 | Tom | 30 |
| 1 | Jerry | 25 |
| 2 | Bob | 30 |
| 3 | Ken | 45 |
New results: tbl
| idx | name | age |
| 0 | Tom | 30 |
| 1 | Jerry | 25 |
| 2 | Bob | 30 |
| 3 | Ken | 45 |
| 4 | Sam | 40 |
output required:
| 4 | Sam | 40 |
and then save the New results to be used in the next run
Thanks! :)
If the only change between runs is that records are appended onto the new table, you could just keep a variable denoting the last index seen and then select only those rows where idx is larger than that.
If the indexes are always increasing, this could be achieved using a query like
lastidx:exec last idx from tbl
select from tbl where idx>lastidx
If the idx values don't always increase monotonically, you could keep a count of the number of rows instead and only select rows whose virtual row index i is at or beyond that count:
lasti:count tbl
select from tbl where i>=lasti
This doesn't require saving the whole table in memory for use in the next iteration.
E.g. to start with, the old table had 4 rows, so lasti = 4:
q)tbl
idx name age
-------------
0 Tom 30
1 Jerry 25
2 Bob 30
3 Ken 45
q)lasti
4
The new table comes in and running the command selects the new row
q)tbl
idx name age
-------------
0 Tom 30
1 Jerry 25
2 Bob 30
3 Ken 45
4 Sam 40
q)select from tbl where i>=lasti
idx name age
------------
4 Sam 40
lasti can then be updated to reflect the new count
q)lasti:count tbl
q)lasti
5
One way you can get this done, assuming idx is the unique key:
q)old:([] idx:0 1 2 3; name:`T`J`B`K; age: 30 25 30 45)
q)new:old,enlist `idx`name`age!(4; `S;40) //new output from your query
q)out:()
q)if[0<count i:new[`idx] except old[`idx] ; out:new i ; old:new]
q)out
idx name age
------------
4 S 40
Another way, if your new records are always appended after the old records:
q)old:([] idx:0 1 2 3; name:`T`J`B`K; age: 30 25 30 45)
q)i:count old
q)new:old,enlist `idx`name`age!(4; `S;40) //new output from your query
q)out:()
q)if[i<c:count new ; out:(i-c)#new ; old:new; i:c]
q)out
idx name age
------------
4 S 40
I suspect I require some sort of windowing function to do this. I have the following item data as an example:
count | date
------+-----------
3 | 2017-09-15
9 | 2017-09-18
2 | 2017-09-19
6 | 2017-09-20
3 | 2017-09-21
So there are gaps in my data first off, and I have another query here:
select until_date, until_date - (lag(until_date) over ()) as delta_days from ranges
Which I have generated the following data:
until_date | delta_days
-----------+-----------
2017-09-08 |
2017-09-11 | 3
2017-09-13 | 2
2017-09-18 | 5
2017-09-21 | 3
2017-09-22 | 1
So I'd like my final query to produce this result:
start_date | ending_date | total_items
-----------+-------------+--------------
2017-09-08 | 2017-09-10 | 0
2017-09-11 | 2017-09-12 | 0
2017-09-13 | 2017-09-17 | 3
2017-09-18 | 2017-09-20 | 15
2017-09-21 | 2017-09-22 | 3
Which tells me the total count of items from the first table, per day, based on the custom ranges from the second table.
In this particular example, I would be summing up total_items BETWEEN start AND end (since there would be overlap on the dates, I'd subtract 1 from the end date to not count duplicates)
Anyone know how to do this?
Thanks!
Use the daterange type. Note that you do not have to calculate delta_days; just convert the ranges to dateranges and use the <@ operator (element is contained by).
with counts(count, date) as (
values
(3, '2017-09-15'::date),
(9, '2017-09-18'),
(2, '2017-09-19'),
(6, '2017-09-20'),
(3, '2017-09-21')
),
ranges (until_date) as (
values
('2017-09-08'::date),
('2017-09-11'),
('2017-09-13'),
('2017-09-18'),
('2017-09-21'),
('2017-09-22')
)
select daterange, coalesce(sum(count), 0) as total_items
from (
select daterange(lag(until_date) over (order by until_date), until_date)
from ranges
) s
left join counts on date <@ daterange
where not lower_inf(daterange)
group by 1
order by 1;
daterange | total_items
-------------------------+-------------
[2017-09-08,2017-09-11) | 0
[2017-09-11,2017-09-13) | 0
[2017-09-13,2017-09-18) | 3
[2017-09-18,2017-09-21) | 17
[2017-09-21,2017-09-22) | 3
(5 rows)
Note that in the dateranges above the lower bounds are inclusive while the upper bounds are exclusive.
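For example (a quick illustrative query, not part of the original answer), a boundary date belongs to the range that starts on it, not the one that ends on it:

select '2017-09-11'::date <@ daterange('2017-09-08'::date, '2017-09-11'::date) as in_first_range,
       '2017-09-11'::date <@ daterange('2017-09-11'::date, '2017-09-13'::date) as in_second_range;

 in_first_range | in_second_range
----------------+-----------------
 f              | t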
If you want to calculate items per day in the dateranges:
select
daterange, total_items,
round(total_items::dec/(upper(daterange)- lower(daterange)), 2) as items_per_day
from (
select daterange, coalesce(sum(count), 0) as total_items
from (
select daterange(lag(until_date) over (order by until_date), until_date)
from ranges
) s
left join counts on date <@ daterange
where not lower_inf(daterange)
group by 1
) s
order by 1;
daterange | total_items | items_per_day
-------------------------+-------------+---------------
[2017-09-08,2017-09-11) | 0 | 0.00
[2017-09-11,2017-09-13) | 0 | 0.00
[2017-09-13,2017-09-18) | 3 | 0.60
[2017-09-18,2017-09-21) | 17 | 5.67
[2017-09-21,2017-09-22) | 3 | 3.00
(5 rows)
I have two tables that I need to join, along with a table that is generated inline using WITH. The WITH is a date range, and I need to display all rows from one table for all months, even where no data exists in the second table.
This is the data within the tables :
Table REFERRAL_GROUPINGS
referral_group
--------------
VER
FRD
FCC
Table DATA_VALUES
referral_group | task_date | task_id | over_threshold
---------------+------------+---------+---------------
VER | 2015-10-01 | 10 | 0
FRD | 2015-11-04 | 20 | 1
The date range will need to cover 3 months:
Oct-2015
Nov-2015
Dec-2015
The data I expect to end up with will be :
MonthYear | referral_group | count_of_group | total_over_threshold
----------+----------------+----------------+---------------------
Oct-2015 | VER | 1 | 0
Oct-2015 | FRD | 0 | 0
Oct-2015 | FCC | 0 | 0
Nov-2015 | VER | 0 | 0
Nov-2015 | FRD | 1 | 1
Nov-2015 | FCC | 0 | 0
Dec-2015 | VER | 0 | 0
Dec-2015 | FRD | 0 | 0
Dec-2015 | FCC | 0 | 0
DDL to create the 2 tables and populate them with data is below:
CREATE TABLE test_data (
referral_group char(3),
task_date date,
task_id integer,
over_threshold integer);
insert into test_data values
('VER','2015-10-01',10,1),
('FRD','2015-11-04',20,0);
CREATE TABLE referral_grouper (
referral_group char(3));
insert into referral_grouper values
('FRD'),
('VER'),
('FCC');
This is a very cut-down example using minimal tables and columns, which is why I have no primary keys or indexes.
I can get this running under DB2 LUW with no problem by using NOT EXISTS in the joins, as per this SQL:
WITH DATERANGE(FROM_DTE,yyyymm, TO_DTE) AS
(
SELECT DATE('2015-10-01'), YEAR('2015-10-01')*100+MONTH('2015-10-01'), '2015-12-31'
FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT FROM_DTE + 1 DAY, YEAR(FROM_DTE+1 DAY)*100+MONTH(FROM_DTE+1 DAY), TO_DTE
FROM DATERANGE
WHERE FROM_DTE < TO_DTE
)
select
referral_grouper.referral_group,
daterange.yyyymm,
count(test_data.task_id) AS total_count,
COALESCE(SUM(over_threshold),0) AS total_over_threshold
FROM
test_data
RIGHT OUTER JOIN daterange ON (daterange.from_dte=test_data.task_date OR NOT EXISTS (SELECT 1 FROM daterange d2 WHERE d2.from_dte=test_data.task_date))
RIGHT OUTER JOIN referral_grouper ON (referral_grouper.referral_group=test_data.referral_group OR NOT EXISTS (SELECT 1 FROM referral_grouper g2 WHERE g2.referral_group=test_data.referral_group))
GROUP BY
referral_grouper.referral_group,
daterange.yyyymm
However, this needs to work on DB2 for z/OS, and under z/OS you cannot use subqueries with EXISTS in a join. Removing the NOT EXISTS means the non-existing rows no longer show up.
There must be a way to write the SQL to return all rows from the 2 linking tables without using NOT EXISTS, but I just cannot seem to find it. Any help with this would be much appreciated, as it has me stumped.
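One possible rewrite (only a sketch, not verified on z/OS) is to build every date/group combination first by joining the generated date range to referral_grouper unconditionally (written as ON 1=1 in case CROSS JOIN isn't available), and only then LEFT OUTER JOIN test_data onto that grid, so no EXISTS subquery is needed in the join:

WITH DATERANGE(FROM_DTE, YYYYMM, TO_DTE) AS
(
SELECT DATE('2015-10-01'), YEAR('2015-10-01')*100+MONTH('2015-10-01'), '2015-12-31'
FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT FROM_DTE + 1 DAY, YEAR(FROM_DTE+1 DAY)*100+MONTH(FROM_DTE+1 DAY), TO_DTE
FROM DATERANGE
WHERE FROM_DTE < TO_DTE
)
select
referral_grouper.referral_group,
daterange.yyyymm,
count(test_data.task_id) AS total_count,
COALESCE(SUM(test_data.over_threshold),0) AS total_over_threshold
FROM
daterange
INNER JOIN referral_grouper ON 1=1
LEFT OUTER JOIN test_data
ON test_data.task_date = daterange.from_dte
AND test_data.referral_group = referral_grouper.referral_group
GROUP BY
referral_grouper.referral_group,
daterange.yyyymm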