Generating a list of dates between two date fields - PostgreSQL

I would like to get the list of dates from two fields: start and end.
I found a case here: Show a list of dates between two dates?
But I would like a better solution that does not go through an intermediate table.
Here is the initial table:
Here is the result I would like to have:
Thank you

Use generate_series()
select t.id, t.name, g.dt::date as start_end
from the_table t
cross join generate_series(t.date_start, t.date_end, interval '1 day') as g(dt)
order by t.id, g.dt;
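For illustration, a minimal hypothetical setup and the rows the query would then produce (the data is made up; the table and column names follow the query above):
create table the_table (id int, name text, date_start date, date_end date);
insert into the_table values (1, 'foo', '2020-01-01', '2020-01-03');

-- the cross join expands each row into one row per day:
--  id | name | start_end
-- ----+------+------------
--   1 | foo  | 2020-01-01
--   1 | foo  | 2020-01-02
--   1 | foo  | 2020-01-03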

Related

Selecting certain columns from a table with dates as columns

I have a table where the column names are month labels like "2020-05", "2020-06", "2020-07" etc., with many such months as columns. I need to select only the current month, the next month and the month after that from this table. (DB: PostgreSQL version 11)
But since the column names are TEXT in the format YYYY-MM, how can I select only the current month and the next two months from this table without hard-coding the column names?
Below is the table structure. Name: static_data
The required select statement looks like this. The table contains 14 months of data as in the screenshot above, with dates as columns. From this I want the current month and the next two months, along with their data, something like below.
SELECT "2020-05","2020-06","2020-07" from static
-- SELECT Current month and next 2 months
Required output:
It's nearly impossible to get the actual value of the current month as the column name, but you can do something like this:
select d.item_sku,
       d.status,
       to_jsonb(d) ->> to_char(current_date, 'yyyy-mm') as current_month,
       to_jsonb(d) ->> to_char(current_date + interval '1 month', 'yyyy-mm') as "month + 1",
       to_jsonb(d) ->> to_char(current_date + interval '2 month', 'yyyy-mm') as "month + 2"
from bad_design d;
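The trick here: to_jsonb(d) turns the whole row into a JSON object keyed by column name, so ->> can look up a column whose name is computed at runtime. A quick hypothetical illustration (the values are made up):
select to_jsonb(d) from bad_design d;
-- {"item_sku": 1, "status": "ok", "2020-05": "10", "2020-06": "20", ...}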
Technically, you can use the information schema to achieve this. But, like GMB said, please re-design your schema and do not approach the problem this way in the first place.
The special schema information_schema contains meta-data about your DB. Among this is information about existing columns. In other words, you can query it and convert the column names into dates to compare them to what you need.
Here are a few hints.
Query existing column names.
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'your_schema'
AND table_name = 'your_table'
Compare two dates.
SELECT now() + INTERVAL '3 months' < now() AS compare;
compare
---------
f
(1 row)
You're already pretty close with the conversion yourself.
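Putting the hints together, a hedged sketch (the schema name, table name and regex filter are assumptions):
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'static_data'
  AND column_name ~ '^\d{4}-\d{2}$'
  AND to_date(column_name || '-01', 'YYYY-MM-DD')
      BETWEEN date_trunc('month', now())
          AND date_trunc('month', now()) + INTERVAL '2 months';
The resulting names would still have to be interpolated into dynamically built SQL (e.g. with format() in a PL/pgSQL function), which is exactly the complexity the re-design advice is about.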
Have fun and re-design your schema!
Disclaimer: this does not answer your question - but it's too long for a comment.
You need to fix the design of this table. Instead of storing dates in columns, you should have each date on a separate row.
There are numerous drawbacks to your current design:
very simple queries become utterly complicated: filtering on dates, aggregation... all these operations require dynamic SQL, which adds a great deal of complexity
adding or removing new dates requires modifying the structure of the table
storage is wasted for rows where not all columns are filled
Instead, consider this simple design, with one table that stores the master data of each item_sku, and a child table that stores one value per item and date:
create table myskus (
    item_sku int primary key,
    name text,
    cat_level_3_name text
);

create table myvalues (
    item_sku int references myskus(item_sku),
    date_sku date,
    value_sku text,
    primary key (item_sku, date_sku)
);
Now your original question is easy to solve:
select v.*, s.name, s.cat_level_3_name
from myskus s
inner join myvalues v on v.item_sku = s.item_sku
where
v.date_sku >= date_trunc('month', now())
and v.date_sku < date_trunc('month', now()) + interval '3 month'
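To make it concrete, hypothetical sample rows for the new design (the values are made up):
insert into myskus values (1, 'Widget', 'Hardware');
insert into myvalues values
    (1, '2020-05-01', '10'),
    (1, '2020-06-01', '20');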

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that there are many rows for each given calendar day, and there is no fixed number of rows for a given day.
Based on this dataset I need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old fashioned way and tried with a scalar subquery:
SELECT
    xx.*,
    (
        SELECT COUNT(DISTINCT yy.UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Herein lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time. I would need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on exact speed, but it should be a lot faster than your current query. Hopefully you have indexes on both these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
    SELECT calday, COUNT(DISTINCT userid) AS daily
    FROM data_table
    GROUP BY calday
) t1
JOIN data_table t2
    ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table for all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30 day on that. Keeps the join much smaller and returns quickly (just under 1 second for 45000 rows in the source table on my system).
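Since both variants join on calday and count distinct userid, a composite index should help the range join; a minimal sketch (the index name is made up):
CREATE INDEX idx_data_table_calday_userid ON data_table (calday, userid);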

Find rows which have different attribute values in ONE day for the same product (PostgreSQL)

Can someone help me write a query?
I have for example columns:
Date
product_key
category_code
Within one day I expect the same category_code for a given product, but I want to verify this with SQL.
Thank you.
If you want to find the day, product_key and category_code combinations that occur more than once, you can use a query like this:
SELECT date,
       product_key,
       category_code,
       count(1)
FROM your_table
GROUP BY date, product_key, category_code
HAVING count(1) > 1;
You can group your results by date and product, and use count(distinct ...) to find whether there is more than one category code for a product. You can then keep only the groups having more than one distinct category.
SELECT Date, product_key, count(distinct category_code) AS categories
FROM my_table
GROUP BY Date, product_key
HAVING count(distinct category_code) > 1
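A tiny hypothetical example of what this catches:
-- given these rows (Date, product_key, category_code):
--   2020-01-01 | 42 | A
--   2020-01-01 | 42 | B
--   2020-01-02 | 42 | A
-- the query returns only the day with two distinct categories:
--   2020-01-01 | 42 | 2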

Produce a row for dates that do not exist in a table [duplicate]

I have a PostgreSQL table userDistributions like this:
user_id, start_date, end_date, project_id, distribution
I need to write a query that, given a date range and a user id, outputs the sum of all distributions for every day for that user.
So for the input '2-2-2012' - '2-4-2012' and some user id, the output should be like this:
Date       SUM(Distribution)
2-2-2012   12
2-3-2012   15
2-4-2012   34
A user has distribution in many projects, so I need to sum the distributions in all projects for each day and output that sum against that day.
My problem is what I should group by. If I had a single date field (instead of start_date and end_date), then I could just write something like
select date, SUM(distribution) from userDistributions group by date;
but in this case I am stumped as to what to do. Thanks for the help.
Use generate_series to produce your dates, something like this:
select dt.d::date, sum(u.distribution)
from userdistributions u
join generate_series('2012-02-02'::date, '2012-02-04'::date, '1 day') as dt(d)
  on dt.d::date between u.start_date and u.end_date
group by dt.d::date
Your date format is ambiguous, so I guessed when converting it to ISO 8601.
This is much like @mu's answer.
However, to cover days with no matches you should use LEFT JOIN:
SELECT d.d::date, sum(u.distribution) AS dist_sum
FROM generate_series('2012-02-02'::date, '2012-02-04'::date, '1 day') AS d(d)
LEFT JOIN userdistributions u ON d.d::date BETWEEN u.start_date AND u.end_date
GROUP BY 1
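Note that days with no matching rows come back with a NULL sum under the LEFT JOIN; wrapping the aggregate in COALESCE yields 0 instead (a minimal variation of the query above):
SELECT d.d::date, COALESCE(sum(u.distribution), 0) AS dist_sum
FROM generate_series('2012-02-02'::date, '2012-02-04'::date, '1 day') AS d(d)
LEFT JOIN userdistributions u ON d.d::date BETWEEN u.start_date AND u.end_date
GROUP BY 1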

How to execute SELECT DISTINCT ON query using SQLAlchemy

I have a requirement to display a spend estimation for the last 30 days. SpendEstimation is calculated multiple times a day. This can be achieved using a simple SQL query:
SELECT DISTINCT ON (date) date(time) AS date, resource_id , time
FROM spend_estimation
WHERE
resource_id = '<id>'
and time > now() - interval '30 days'
ORDER BY date DESC, time DESC;
Unfortunately I can't seem to be able to do the same using SQLAlchemy. It always creates SELECT DISTINCT on all columns; the generated query does not contain DISTINCT ON.
query = session.query(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    'date'
).order_by(
    'date',
    SpendEstimation.time
)
SELECT DISTINCT
date(time) AS date,
resource_id,
time
FROM spend
ORDER BY date, time
It is missing the ON (date) bit. If I use query.group_by, then SQLAlchemy adds DISTINCT ON, though I can't think of a solution to the given problem using group by.
I tried using the function in the distinct part and the order by part as well:
query = session.query(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    func.date(SpendEstimation.time).label('date')
).order_by(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.time
)
Which resulted in this SQL:
SELECT DISTINCT
date(time) AS date,
resource_id,
time,
date(time) AS date -- only difference
FROM spend
ORDER BY date, time
Which is still missing DISTINCT ON.
Your SQLAlchemy version might be the culprit.
See this related question: Sqlalchemy with postgres. Try to get 'DISTINCT ON' instead of 'DISTINCT'
It links to this bug report:
https://bitbucket.org/zzzeek/sqlalchemy/issues/2142
A fix wasn't backported to 0.6; it looks like it was fixed in 0.7.
Stupid question: have you tried distinct on SpendEstimation.date instead of 'date'?
EDIT: It just struck me that you're trying to use the named column from the SELECT. SQLAlchemy is not that smart. Try passing the func expression into the distinct() call.
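A hedged sketch of that suggestion (assuming a reasonably recent SQLAlchemy, where passing expressions to distinct() on the PostgreSQL dialect emits DISTINCT ON):
from sqlalchemy import func  # session and SpendEstimation as defined above

date_expr = func.date(SpendEstimation.time)
query = session.query(
    date_expr.label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    date_expr                  # the expression itself, not the label string
).order_by(
    date_expr.desc(),          # DISTINCT ON requires a matching leading ORDER BY
    SpendEstimation.time.desc()
)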