PostgreSQL arrange calculated columns horizontally - postgresql

I have a table of real estate properties and I want to create a table that shows me the count of properties that fall in certain price ranges, by zone, something like this:
Zone   0-149k   150-300k
North  25       150
South  150      350
For example for the first result my query would be:
SELECT COUNT(*) FROM my_table
WHERE zone = 'North' AND price < 150000
and similar for the other fields
But I'm unable to find a unified query that shows me the data in the desired way. I've tried with the UNION command but this shows me all the data as continuous rows. Any thoughts?

You can use a FILTER on an aggregate:
SELECT
    zone,
    COUNT(*) FILTER (WHERE price < 150000) AS "0-149k",
    COUNT(*) FILTER (WHERE 150000 <= price AND price < 300000) AS "150-300k"
FROM my_table
GROUP BY zone;
(If you have an unknown number of price ranges, see this approach).
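For what it's worth, FILTER requires PostgreSQL 9.4 or later. On older versions, a CASE expression inside the aggregate gives the same pivot (COUNT ignores the NULLs produced when the CASE has no ELSE):
SELECT
    zone,
    COUNT(CASE WHEN price < 150000 THEN 1 END) AS "0-149k",
    COUNT(CASE WHEN price >= 150000 AND price < 300000 THEN 1 END) AS "150-300k"
FROM my_table
GROUP BY zone;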

Related

Postgres: Storing output of moving average query to a column

I have a table in Postgres 14.2
Table name is test
There are 3 columns: date, high, and five_day_mavg (date is PK if it matters)
I have a select statement which properly calculates a 5 day moving average based on the data in high.
select date,
avg(high) over (order by date rows between 4 preceding and current row) as mavg_calc
from test
It produces the expected output.
I have 2 goals:
First, to store the output of the query in five_day_mavg.
Second, to store it in such a way that when I add a new row with data in high, the value is calculated automatically.
The closest I got was:
update test set five_day_mavg = a.mav_calc
from (
select date,
avg(high) over (order by date rows between 4 preceding and current row) as mav_calc
from test
) a;
but all that does is set five_day_mavg in every row to the overall average of high, not each row's own moving average.
Thanks to @a_horse_with_no_name, I played around with the WHERE clause:
update test l
set five_day_mavg = b.five_day_mavg
from (
    select date,
           avg(high) over (order by date rows between 4 preceding and current row) as five_day_mavg
    from test
) b
where l.date = b.date;
A couple of things: I aliased each table. The original table I aliased as l, and the derived table created by the window function (the select statement in parentheses) I aliased as b, and I joined them in the WHERE clause on date, which is the index/primary key.
Also, I was originally using 'a' as the alias letter, and I think that may have contributed to the issue.
Either way, solved now.
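For the second goal, which the thread leaves open, one option is a trigger that fills five_day_mavg as rows arrive. A minimal sketch, assuming rows are inserted in date order (the function and trigger names set_five_day_mavg and trg_five_day_mavg are made up):
create or replace function set_five_day_mavg() returns trigger as $$
declare
  prev_sum numeric;
  prev_cnt integer;
begin
  -- sum and count of the up-to-4 rows that precede the new date
  select coalesce(sum(high), 0), count(*)
    into prev_sum, prev_cnt
  from (select high from test
        where date < new.date
        order by date desc
        limit 4) p;
  -- average over those rows plus the newly inserted high
  new.five_day_mavg := (prev_sum + new.high) / (prev_cnt + 1);
  return new;
end;
$$ language plpgsql;

create trigger trg_five_day_mavg
before insert on test
for each row execute function set_five_day_mavg();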

Divide count of Table 1 by count of Table 2 on the same time interval in Tableau

I have two tables with IDs and time stamps. Table 1 has two columns: ID and created_at. Table 2 has two columns: ID and post_date. I'd like to create a chart in Tableau that displays the Number of Records in Table 1 divided by Number of Records in Table 2, by week. How can I achieve this?
One way might be to use Custom SQL like this to create a new data source for your visualization:
SELECT created_table.created_date,
       created_table.created_count,
       posted_table.posted_count
FROM (SELECT TRUNC(created_at) AS created_date, COUNT(*) AS created_count
      FROM Table1
      GROUP BY TRUNC(created_at)) created_table
LEFT JOIN
     (SELECT TRUNC(post_date) AS posted_date, COUNT(*) AS posted_count
      FROM Table2
      GROUP BY TRUNC(post_date)) posted_table
ON created_table.created_date = posted_table.posted_date
This would give you dates and counts from both tables for those dates, which you could group using Tableau's date functions in the visualization. I made created_table the first part of the left join on the assumption that some records would be created and not posted, but you wouldn't have posts without creations. If that isn't the case you will want a different join.
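If the underlying data source happens to be PostgreSQL, the weekly ratio can also be computed directly in SQL. A sketch under that assumption (date_trunc and NULLIF are PostgreSQL; the NULLIF guards against dividing by zero):
SELECT c.week,
       c.created_count::numeric / NULLIF(p.posted_count, 0) AS created_per_posted
FROM (SELECT date_trunc('week', created_at) AS week, COUNT(*) AS created_count
      FROM Table1 GROUP BY 1) c
LEFT JOIN
     (SELECT date_trunc('week', post_date) AS week, COUNT(*) AS posted_count
      FROM Table2 GROUP BY 1) p
  ON c.week = p.week;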

On-demand Median Aggregation on a Large Dataset

TLDR: I need to make several median aggregations on a large dataset for a webapp, but the performance is poor. Can my query be improved/is there a better DB than AWS Redshift for this use-case?
I'm working on a team project which involves on-demand aggregations of a large dataset for visualization through our web app. We're using Amazon Redshift loaded with almost 1,000,000,000 rows, dist-keyed by date (we have data from 2014 up to today, with 900,000 data points ingested every day) and sort-keyed by a unique id. The unique id has a possibly one-to-many relationship with other unique ids, where the 'many' side can be thought of as the id's 'children'.
Due to confidentiality, think of the table structures like this
TABLE NAME: meal_nutrition
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
calories integer,
fats integer,
carbohydrates integer,
proteins integer,
cholesterol integer,
sodium integer
TABLE NAME: patient_hierarchy
DISTKEY(date date),
SORTKEY(patient_id integer),
patient_name varchar,
parent_id integer,
child_id integer,
distance integer
Think of this as a world in which there's a hierarchy of doctors. 'Patients' covers both actual patients and the doctors themselves, since doctors can be patients of other doctors. Doctors can transfer ownership of patients/doctors at any time, so the hierarchy is constantly changing.
DOCTOR (id: 1)
├── PATIENT (id: 2)
│   ├── P (id: 4)
│   └── D (id: 8)
└── DOCTOR (id: 3)
    └── D (id: 20)
(... and so on, each node possibly having further children)
One visualization that we're having trouble with (due to performance) is a time-series graph showing the day-to-day median of several metrics for which the default date-range must be 1 year. So in this example, we want the median of fats, carbohydrates, and proteins of all meals consumed by a patient/doctor and their 'children', given a patient_id. The query used would be:
SELECT patient_name,
date,
max(median_fats),
max(median_carbs),
max(median_proteins)
FROM (SELECT mn.date AS date,
             ph.patient_name AS patient_name,
             MEDIAN(fats) OVER (PARTITION BY mn.date) AS median_fats,
             MEDIAN(carbohydrates) OVER (PARTITION BY mn.date) AS median_carbs,
             MEDIAN(proteins) OVER (PARTITION BY mn.date) AS median_proteins
      FROM meal_nutrition mn
      JOIN patient_hierarchy ph
        ON (mn.patient_id = ph.child_id)
      WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
        AND ph.parent_id = ?
        AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
      ) t
GROUP BY date, patient_name
The heaviest operations in this query are the sorts for each of the medians (each requires sorting ~200,000,000 rows), and we cannot avoid them. As a result, this query takes ~30s to complete, which translates to bad UX. Can the query I'm making be improved? Is there a better DB for this kind of use-case? Thanks!
As said in the comments, the sorting/distribution of your data is very important. If you take just one date slice of patient_hierarchy, then with distribution by date all the data you're using sits on one node. It's better to distribute by meal_nutrition.patient_id and patient_hierarchy.child_id, so that data being joined likely sits on the same node, and to sort the tables by date, patient_id and date, child_id respectively, so you can find the necessary date slices/ranges efficiently and then look up patients efficiently.
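A sketch of that change, assuming the tables can be rebuilt with CREATE TABLE AS (the _dist table names are made up):
CREATE TABLE meal_nutrition_dist
DISTKEY(patient_id)
SORTKEY(date, patient_id)
AS SELECT * FROM meal_nutrition;

CREATE TABLE patient_hierarchy_dist
DISTKEY(child_id)
SORTKEY(date, child_id)
AS SELECT * FROM patient_hierarchy;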
As for the query itself, there are some options that you can try:
1) Approximate median like this:
SELECT mn.date AS date,
       ph.patient_name AS patient_name,
       APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph
  ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
  AND ph.parent_id = ?
  AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
GROUP BY 1,2
Notes: this might not work if the memory stack is exceeded. Also, you can have only one such function per subquery, so you can't get fats, carbs, and proteins in the same subquery, but you can calculate them separately and then join (see the sketch below). If this works, you can test the accuracy by running your 30s statement for a few IDs and comparing results.
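A sketch of the calculate-separately-then-join idea for two of the metrics (proteins would be a third CTE following the same pattern; the CTE names are made up):
WITH fats_m AS (
    SELECT mn.date, ph.patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date BETWEEN '2016-12-17' AND '2017-12-17'
    GROUP BY 1, 2
),
carbs_m AS (
    SELECT mn.date, ph.patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY carbohydrates) AS median_carbs
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date BETWEEN '2016-12-17' AND '2017-12-17'
    GROUP BY 1, 2
)
SELECT f.date, f.patient_name, f.median_fats, c.median_carbs
FROM fats_m f
JOIN carbs_m c ON f.date = c.date AND f.patient_name = c.patient_name;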
2) Binning. First group by each value, or set reasonable bins, then find the group/bin that sits in the middle of the distribution. That will be your median. A single-variable example would be:
WITH
groups as (
    SELECT mn.date AS date,
           ph.patient_name AS patient_name,
           fats,
           count(1) AS cnt
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph
      ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
    GROUP BY 1,2,3
)
,running_groups as (
    SELECT *
          ,sum(cnt) over (partition by date, patient_name order by fats rows between unbounded preceding and current row) as running_total
          ,sum(cnt) over (partition by date, patient_name) as total
    FROM groups
)
,distance_from_median as (
    SELECT *
          ,row_number() over (partition by date, patient_name order by abs(0.5 - (1.0 * running_total / total))) as distance_from_median
    FROM running_groups
)
SELECT date,
       patient_name,
       fats
FROM distance_from_median
WHERE distance_from_median = 1
That would likely allow grouping values on each individual node, and the subsequent operations on bins will be more lightweight and avoid sorting the raw sets. Again, you have to benchmark. The fewer unique values you have, the higher your performance gain will be, because you'll have a small number of bins out of a big number of raw values and sorting will be much cheaper. The result is accurate except with an even number of distinct values (for 1,2,3,4 it would return 2, not 2.5), but this is solvable by adding another layer if it's critical. The main question is whether the approach itself improves performance significantly.
3) Materialize the calculation for every date/patient id. If your only parameter is the patient and you always calculate medians for the last year, you can run the query overnight into a summary table and query that one. It's worth doing even if (1) or (2) helps to optimize performance. You can also copy the summary table to a Postgres instance after materializing and use it as the backend for your app; you'll have better ping (Redshift is good for materializing large amounts of data but not good as a web app backend). It comes with the cost of maintaining a data transfer job, so if materializing/optimization does a good enough job you can leave it in Redshift.
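A sketch of the overnight materialization for one metric, using Redshift's exact MEDIAN aggregate (the summary table name is made up, and the one-function-per-query caveat from option (1) may apply to MEDIAN as well, so carbs and proteins may need their own runs):
CREATE TABLE median_fats_summary AS
SELECT ph.parent_id,
       ph.patient_name,
       mn.date,
       MEDIAN(fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
GROUP BY ph.parent_id, ph.patient_name, mn.date;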
I'm really interested in getting feedback if you try any of the suggested options; this is a good use case for Redshift.

fetch data from and to date to get all matching results

Hello everyone. I have to fetch data between a from date and a to date; I tried using a BETWEEN clause, but it fails to retrieve the data I need. Here is what I need.
I have table called hall_info which has following structure
hall_info
id | hall_name | address | contact_no
1  | abc       | India   | XXXX-XXXX-XX
2  | xyz       | India   | XXXX-XXXX-XX
Now I have one more table, events, which contains data about when and which hall is booked on what date; the structure is as follows.
id | hall_info_id | event_date (booked_date) | event_name
1  | 2            | 2015-10-25               | Marriage
2  | 1            | 2015-10-28               | Marriage
3  | 2            | 2015-10-26               | Marriage
So what I need now is to show the hall_names that are not booked on the selected dates. Suppose the user chooses from 2015-10-23 to 2015-10-30: I want to list all halls that are not booked for the entire range. In the above case both halls (hall_info_id 1 and 2) have bookings in the given range, but I still want to show them because they are free on the 23rd, 24th, 27th, and 29th.
In the second case, suppose the user chooses dates from 2015-10-25 to 2015-10-26: only hall_info_id 2 is booked on both the 25th and 26th, so in this case I want to show only hall_info_id 1, as hall_info_id 2 is booked.
I tried using an inner query and a BETWEEN clause but I am not getting the required result. To keep it simple I have given only selected fields; I have more tables to join, so I can't paste my query. Please help with this. Thanks in advance to all who are trying.
Some changes in Yasen Zhelev's code:
SELECT * FROM hall_info
WHERE id NOT IN (
    SELECT hall_info_id FROM events
    WHERE event_date >= '2015-10-23' AND event_date <= '2015-10-30'
    GROUP BY hall_info_id
    HAVING COUNT(DISTINCT event_date) > DATE_PART('day', '2015-10-30'::timestamp - '2015-10-23'::timestamp)
)
I have not tried it, but how about checking whether the number of bookings per hall is less than the actual number of days in the selected period:
SELECT * FROM hall_info WHERE id NOT IN
(SELECT hall_info_id FROM events
WHERE event_date >= '2015-10-23' AND event_date <= '2015-10-30'
GROUP BY hall_info_id
HAVING COUNT(id) < DATEDIFF(day, '2015-10-30', '2015-10-23')
);
That will only work if you have one booking per day per hall.
To get the "available dates" for the hall returned, your query needs a row source of all possible dates. For example, if you had a calendar table populated with possible date values, e.g.
CREATE TABLE cal (dt DATE NOT NULL PRIMARY KEY) Engine=InnoDB
;
INSERT INTO cal (dt) VALUES ('2015-10-23')
,('2015-10-24'),('2015-10-25'),('2015-10-26'),('2015-10-27')
,('2015-10-28'),('2015-10-29'),('2015-10-30'),('2015-10-31')
;
Then you could use a query that performs a cross join between the calendar table and hall_info... to get every hall on every date... and an anti-join pattern to eliminate rows that are already booked.
The anti-join pattern is an outer join with a restriction in the WHERE clause to eliminate matching rows.
For example:
SELECT cal.dt, h.id, h.hall_name, h.address
FROM cal cal
CROSS
JOIN hall_info h
LEFT
JOIN events e
ON e.hall_info_id = h.id
AND e.event_date = cal.dt
WHERE e.id IS NULL
AND cal.dt >= '2015-10-23'
AND cal.dt <= '2015-10-30'
The cross join between cal and hall_info gets all halls for all dates (restricted in the WHERE clause to a specified range of dates.)
The outer join to events finds matching rows in the events table (matching on hall_info_id and event_date). The trick is the predicate (condition) in the WHERE clause, e.id IS NULL. That throws out any rows that had a match, leaving only rows that don't have a match.
This type of problem is similar to other "sparse data" problems. e.g. How do you return a zero total for sales by a given store on a given date, when there are no rows with that store and date...
In your case, the query needs a source of rows with available date values. That doesn't necessarily have to be a table named calendar. (Other databases give us the ability to dynamically generate a row source; someday, MySQL may have similar features.)
If you want the row source to be dynamic in MySQL, then one approach would be to create a temporary table, and populate it with the dates, run the query referencing the temporary table, and then dropping the temporary table.
Another approach is to use an inline view to return the rows...
SELECT cal.dt, h.id, h.hall_name, h.address
FROM (
SELECT '2015-10-23'+INTERVAL 0 DAY AS dt
UNION ALL SELECT '2015-10-24'
UNION ALL SELECT '2015-10-25'
UNION ALL SELECT '2015-10-26'
UNION ALL SELECT '2015-10-27'
UNION ALL SELECT '2015-10-28'
UNION ALL SELECT '2015-10-29'
UNION ALL SELECT '2015-10-30'
) cal
CROSS
JOIN hall_info h
LEFT
JOIN events e
ON e.hall_info_id = h.id
AND e.event_date = cal.dt
WHERE e.id IS NULL
FOLLOWUP: When this question was originally posted, it was tagged with mysql. The SQL in the examples above is for MySQL.
In terms of writing a query to return the specified results, the general issue is still the same in PostgreSQL. The general problem is "sparse data".
The SQL query needs a row source for the "missing" date values, but the specification doesn't provide any source for those date values.
The answer above discusses several possible row sources in MySQL: 1) a table, 2) a temporary table, 3) an inline view.
The answer also mentions that some databases (not MySQL) provide other mechanisms that can be used as a row source.
For example, PostgreSQL provides a nifty generate_series function (reference: http://www.postgresql.org/docs/9.1/static/functions-srf.html).
It should be possible to use the generate_series function as a row source to supply the set of rows containing the date values needed by the query to produce the specified result.
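For example, a sketch of the same anti-join with generate_series as the row source (this assumes the events column is named hall_info_id, as in the question's schema):
SELECT cal.dt::date AS dt, h.id, h.hall_name, h.address
FROM generate_series('2015-10-23'::date, '2015-10-30'::date, interval '1 day') AS cal(dt)
CROSS JOIN hall_info h
LEFT JOIN events e
  ON e.hall_info_id = h.id
 AND e.event_date = cal.dt::date
WHERE e.id IS NULL;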
This answer demonstrates the approach to solving the "sparse data" problem.
If the specification is to return just the list of halls, and not the dates they are available, the queries above can be easily modified to remove the date expression from the SELECT list, and add a GROUP BY clause to collapse the rows into a distinct list of halls.

how do you sum over a related period

I need to sum values that fall within +2 months, or within a quarter period (there is a related date table).
Is there a way to use dense rank to partition those periods (custom periods)?
select
FiscalMonth
,Value
from table
The SQL will have to do the following:
Join the value table and the period table
Include the period in the select list and sum the value, grouping by the period
i.e.
select b.period, sum(a.value)
from table a
inner join period b on a.FiscalMonth between b.StartMonth and b.EndMonth
group by b.period
Note: The join condition will have to be modified based on what data you actually have in the period table.
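To make the join concrete, here is a hypothetical shape for the period table (all names and the yyyymm FiscalMonth format are assumptions, not something given in the question):
CREATE TABLE period (
    period     varchar(10), -- e.g. '2023-Q1' (hypothetical label)
    StartMonth int,         -- e.g. 202301, assuming FiscalMonth is stored as yyyymm
    EndMonth   int          -- e.g. 202303
);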
Hope this helps
Well, if you need values from an X interval by month, you could use something like:
SELECT *
FROM yourTable
WHERE MONTH(some_date) = MONTH(CURRENT_DATE - INTERVAL 1 MONTH) -- could be any X interval!
This example shows the results for the previous month relative to the current one. The point is that the query can be massaged with functions on intervals.
Of course, you could use SUM for the actual adding, as sketched below.
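For instance, a sketch of the sum itself (MySQL syntax; the table and column names are carried over from the snippets above and are assumptions; the YEAR predicate keeps the same month of other years out of the total):
SELECT SUM(Value) AS prev_month_total
FROM yourTable
WHERE MONTH(some_date) = MONTH(CURRENT_DATE - INTERVAL 1 MONTH)
  AND YEAR(some_date) = YEAR(CURRENT_DATE - INTERVAL 1 MONTH);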