Can the stddev_samp() function be used with window functions? - amazon-redshift

I want to compute the moving average and moving standard deviation for the last 10 purchases for a customer.
avg(price) over (partition by customer_id order by purchase_date rows between 10 preceding and current row) as avg_price_10_orders,
I know I can compute the average, but I have read conflicting information about whether I can compute the standard deviation this way:
stddev_samp(price) over (partition by customer_id order by purchase_date rows between 10 preceding and current row) as sd_10_orders
I am on an Amazon Redshift cluster, and the AWS documentation indicates it's okay: http://docs.aws.amazon.com/redshift/latest/dg/r_WF_STDDEV.html
However, a Cloudera source says stddev_samp() cannot be used with an over() clause: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/impala_stddev.html
Any pointers?
Thanks!
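For what it's worth, the AWS page linked above lists STDDEV_SAMP among Redshift's supported window functions, so both columns can be computed in one pass along these lines (a minimal sketch, assuming a purchases table with customer_id, purchase_date and price; note that this frame covers 11 rows, the current purchase plus the 10 preceding ones):
select customer_id,
       purchase_date,
       price,
       -- purchases, customer_id, purchase_date and price are stand-in names from the question
       avg(price) over (partition by customer_id order by purchase_date
                        rows between 10 preceding and current row) as avg_price_10_orders,
       stddev_samp(price) over (partition by customer_id order by purchase_date
                                rows between 10 preceding and current row) as sd_10_orders
from purchases;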

Related

In TSQL, how do I add a count column that counts the number of rows in my query?

This can be done a number of ways, which I will explain at the end. For now, I have been given a work assignment that includes the following (simplified):
"Create a record each week to track the current status that has the following: account numbers (unique within each report), a random number (provided), their status (Green, Orange, or Blue), and make sure the record also has a column which tells me how many records their are this week."
I do not need code to generate a random number.
Columns: Account, RanNum, Status, NumberOfRowsThisWeek
How do I handle adding a column that determines the number of rows in my query and produces that number, static, within each row of that column?
I may try to tweak the request and apply a rising (incrementing) number instead. How would I go about doing it in that case?
Edit: SQL Server 2014
You are not telling us which database you are using.
In SQL Server, at least in the newer versions, you have window (analytic) functions available, and they are also available in most other popular RDBMSs.
You could do what you want in SQL Server by adding this to your select
,count(*) over (partition by 1) as [NrOfRows]
An analytic function runs after the "standard" query, applying the window calculation to the result set.
The count above counts the rows in the result set, partitioned by the constant 1, which is of course the same for every row, so it gives the full row count.
It is perhaps not standard in all databases to allow a constant in that way; this variant may work better in some, and I know it works in SQL Server:
,count(*) over (partition by (select 1 n)) as [NrOfRows]
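For context, a minimal sketch of how this could sit in the full query (WeeklyStatus is a made-up table name; an empty OVER () also returns the total row count in SQL Server):
select Account,
       RanNum,
       Status,
       count(*) over () as NumberOfRowsThisWeek  -- counts every row returned by this query
from WeeklyStatus;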
It sounds like you want to do some kind of simple count() / group by query:
select Account, RanNum, Status, count(*) as NumberOfRowsThisWeek
from tablename
group by Account, RanNum, Status
You may need to do something like this instead:
select t.Account, t.RanNum, t.Status, c.NumberOfRowsThisWeek
from tablename t
join (
    select Account, Status, count(*) as NumberOfRowsThisWeek
    from tablename
    group by Account, Status
) c on c.Account = t.Account and c.Status = t.Status
because grouping on the random number would make every row unique and give a count of 1 for each.

On-demand Median Aggregation on a Large Dataset

TLDR: I need to make several median aggregations on a large dataset for a webapp, but the performance is poor. Can my query be improved/is there a better DB than AWS Redshift for this use-case?
I'm working on a team project which involves on-demand aggregations of a large dataset for visualization through our web-app. We're using Amazon Redshift loaded with almost 1,000,000,000 rows, dist-key by date (we have data from 2014 up to today's date, with 900,000 data points being ingested every day) and sort-key by a unique id. The unique id has a possibly one-to-many relationship with other unique ids, for which the 'many' relationship can be thought of as the id's 'children'.
Due to confidentiality, think of the table structures like this
TABLE NAME: meal_nutrition
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
calories integer,
fat integer,
carbohydrates integer,
protein integer,
cholesterol integer,
sodium integer
TABLE NAME: patient_hierarchy
DISTKEY(date date),
SORTKEY(patient_id integer),
parent_id integer,
child_id integer,
distance integer
Think of this as a world in which there is a hierarchy of doctors. Patients cover both actual patients and the doctors themselves, since a doctor can be the 'patient' of another doctor. Doctors can transfer ownership of patients/doctors at any time, so the hierarchy is constantly changing.
DOCTOR (id: 1)
├── PATIENT (id: 2)
│   ├── P (id: 4)
│   │   └── ...
│   └── D (id: 8)
│       └── ...
└── DOCTOR (id: 3)
    └── D (id: 20)
        └── ...
One visualization that we're having trouble with (due to performance) is a time-series graph showing the day-to-day median of several metrics for which the default date-range must be 1 year. So in this example, we want the median of fats, carbohydrates, and proteins of all meals consumed by a patient/doctor and their 'children', given a patient_id. The query used would be:
SELECT patient_name,
date,
max(median_fats),
max(median_carbs),
max(median_proteins)
FROM (SELECT mn.date date,
ph.patient_name patient_name,
MEDIAN(fats) over (PARTITION BY date) AS median_fats,
MEDIAN(carbohydrates) over (PARTITION BY date) AS median_carbs,
MEDIAN(proteins) over (PARTITION BY date) AS median_proteins
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
) sub
GROUP BY date, patient_name
The heaviest operations in this query are the sorts for each of the medians (each requiring a sort of ~200,000,000 rows), but we cannot avoid them. As a result, this query takes ~30s to complete, which translates to bad UX. Can the query I'm making be improved? Is there a better DB for this kind of use-case? Thanks!
As said in the comments, the sorting/distribution of your data is very important. With distribution by date, once you take just one date slice of patient_hierarchy, all the data you're using sits on a single node. It's better to distribute by meal_nutrition.patient_id and patient_hierarchy.child_id, so that rows being joined are likely to sit on the same node, and to sort the tables by date, patient_id and date, child_id respectively, so you can find the necessary date slices/ranges efficiently and then look up patients efficiently.
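As a rough sketch of what that could look like in DDL (the column lists are abbreviated placeholders taken from the sketches above, not the real schema):
CREATE TABLE meal_nutrition (
    date date,
    patient_id integer,
    patient_name varchar,
    fats integer,
    carbohydrates integer,
    proteins integer
)
DISTKEY (patient_id)
SORTKEY (date, patient_id);

CREATE TABLE patient_hierarchy (
    date date,
    parent_id integer,
    child_id integer,
    distance integer
)
DISTKEY (child_id)
SORTKEY (date, child_id);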
As for the query itself, there are some options that you can try:
1) Approximate median like this:
SELECT mn.date date,
ph.patient_name patient_name,
APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
GROUP BY 1,2
Notes: this might not work if the memory stack is exceeded. Also, you can only have one such function per subquery, so you can't get fats, carbs and proteins in the same subquery, but you can calculate them separately and then join. If this works, you can then test the accuracy by running your 30s statement for a few IDs and comparing the results.
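A minimal sketch of the calculate-separately-then-join idea, shown for just fats and carbs (whether Redshift accepts both approximations as CTEs of a single statement, or needs them run as fully separate queries into temp tables, is something to verify; the CTE names f and c are made up):
WITH f AS (
SELECT mn.date, ph.patient_name,
APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
GROUP BY 1, 2
),
c AS (
SELECT mn.date, ph.patient_name,
APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY carbohydrates) AS median_carbs
FROM meal_nutrition mn
JOIN patient_hierarchy ph ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
GROUP BY 1, 2
)
SELECT f.date, f.patient_name, f.median_fats, c.median_carbs
FROM f
JOIN c ON c.date = f.date AND c.patient_name = f.patient_name;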
2) Binning. First group by each value, or set reasonable bins, then find the group/bin that sits in the middle of the distribution. That will be your median. An example for one variable would be:
WITH
groups as (
SELECT mn.date date,
ph.patient_name patient_name,
fats,
count(1) as cnt
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
GROUP BY 1,2,3
)
,running_groups as (
SELECT *
,sum(cnt) over (partition by date, patient_name order by fats rows between unbounded preceding and current row) as running_total
,sum(cnt) over (partition by date, patient_name) as total
FROM groups
)
,distance_from_median as (
SELECT *
,row_number() over (partition by date, patient_name order by abs(0.5-(1.0*running_total/total))) as distance_from_median
FROM running_groups
)
SELECT
date,
patient_name,
fats
FROM distance_from_median
WHERE distance_from_median=1
That would likely allow grouping the values on each individual node, and the subsequent operations on the bins will be more lightweight and avoid sorting the raw sets. Again, you have to benchmark. The fewer unique values you have, the higher your performance gain will be, because you'll have a small number of bins out of a big number of raw values and sorting will be much cheaper. The result is accurate except in the case of an even number of distinct values (for 1,2,3,4 it would return 2, not 2.5), but this is solvable by adding another layer if it's critical. The main question is whether the approach itself improves performance significantly.
3) Materialize the calculation for every date/patient id. If your only parameter is the patient and you always calculate medians for the last year, you can run the query overnight into a summary table and query that one. This is worthwhile even if (1) or (2) helps to optimize performance. You can also copy the summary table to a Postgres instance after materializing and use it as the backend for your app; you'll have better ping (Redshift is good for materializing large amounts of data but not good as a web app backend). It comes with the cost of maintaining a data transfer job, so if materializing/optimization does a good enough job you can leave it in Redshift.
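A minimal sketch of that overnight materialization, following the structure of the original query but for all parents at once (the table name daily_median_summary and the one-year window via dateadd are assumptions; a scheduler would drop and rebuild the table each night):
DROP TABLE IF EXISTS daily_median_summary;

CREATE TABLE daily_median_summary
DISTKEY (parent_id)
SORTKEY (parent_id, date)
AS
SELECT parent_id,
date,
max(median_fats) AS median_fats,
max(median_carbs) AS median_carbs,
max(median_proteins) AS median_proteins
FROM (
SELECT ph.parent_id,
mn.date,
MEDIAN(fats) OVER (PARTITION BY ph.parent_id, mn.date) AS median_fats,
MEDIAN(carbohydrates) OVER (PARTITION BY ph.parent_id, mn.date) AS median_carbs,
MEDIAN(proteins) OVER (PARTITION BY ph.parent_id, mn.date) AS median_proteins
FROM meal_nutrition mn
JOIN patient_hierarchy ph ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND mn.date >= dateadd(year, -1, current_date)
) sub
GROUP BY 1, 2;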
I'm really interested in getting feedback if you try any of the suggested options; this is a good use case for Redshift.

Filter a value relevant to the maximum field

Here is my detail field with Order number and Amount.
Order Number Amount
2 3450
4 2300
8 4500
3 5100
Here the latest order is the one with the maximum order number, and I need to show only that record in the report as follows, not all the other records. So I need to pick up the maximum order number and the amount that goes with it. Help please.
Order Number Amount
8 4500
There are many ways to solve this; one of them is to use SQL Expression Fields.
Create a new SQL expression field and write the formula below.
DB2 syntax:
select order_number, amount from orders order by order_number desc fetch first row only
Oracle syntax:
SELECT order_number, amount FROM (
select order_number, amount, ROW_NUMBER() OVER (ORDER BY order_number DESC) RowNo from orders)
WHERE RowNo < 2
Now drag this to the detail section.
Note: the first syntax above is for DB2; if you are using Oracle, use the second. Let me know if you are using a database other than DB2 or Oracle.
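If the report ends up pointing at yet another database, a more portable pattern for the same thing is a correlated subquery on the maximum order number (a sketch, assuming the column is named order_number):
select order_number, amount
from orders
where order_number = (select max(order_number) from orders)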

Group output of single column postgresql

First off I'm a total SQL noob - Thanks in advance for any assistance you can offer.
I have a FortiAnalyzer that uses a Postgres DB to store firewall logs. The Analyzer is then used to report on usage etc.
Basically I need to write a custom query that can show the Top 10 Users by bandwidth used for the top 10 Websites/destinations per user.
I can get all of the relevant information out of the unit, but I cannot get the output formatted correctly.
I would be happy with the output showing a username 10 times with the top 10 sites next to the username. First prize however would be to show the username in Column A only once, then in column B and C the destination address and bandwidth used respectively.
Here is the query I have so far:
select coalesce(nullifna(`user`), `src`) as user_src,
coalesce(hostname, dstname, 'unknown') as web_site,
sum(rcvd + sent)/1024 as bandwidth from $log
where $filter and user is not null and status in ('passthrough', 'filtered')
group by `user_src` , web_site order by user_src desc
Once the query is linked to a report chart, I then have options to limit the output by x values. I could, for example, limit the user_src column to 100 (i.e. 10 users with 10 outputs each).
I hope this is clear to you... If not, I will do my best to answer any questions.
I would start with a table aggregated at the web_site, user_src level. Then it is not difficult to get the top X sites for the top Y users. You will need window functions to get the desired result.
Sample data:
create table test (web_site varchar, user_src varchar, bandwidth numeric);
insert into test values
('a','s1',18),
('b','s1',12),
('c','s1',13),
('d','s2',14),
('e','s2',15),
('f','s2',16),
('g','s3',17),
('h','s3',18),
('i','s3',19)
;
Get top X websites for top Y users:
with cte as (
select
user_src,
web_site,
bandwidth,
dense_rank() over(order by site_bandwidth desc) as user_rank,
dense_rank() over(partition by user_src order by bandwidth desc) as website_rank
from
test
join (select user_src, sum(bandwidth) site_bandwidth from test group by user_src) a using (user_src)
)
select
*
from
cte
where
user_rank <= 2
and website_rank <=2
order by
user_rank,
website_rank
SQLFiddle
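For reference, against the sample data above the query should return something like this (s3 and s2 are the top two users by total bandwidth, with the two highest-bandwidth sites for each):
user_src | web_site | bandwidth | user_rank | website_rank
---------+----------+-----------+-----------+-------------
s3       | i        |        19 |         1 |            1
s3       | h        |        18 |         1 |            2
s2       | f        |        16 |         2 |            1
s2       | e        |        15 |         2 |            2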

Calculate Mode - "Highest frequency row" DB2

What would be the most efficient way to calculate the mode across tables with joins in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID, a candidate key for the joined table) on a given date.
The idea is to get the most common value from the table, which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.
You can use common table expressions (CTEs), introduced by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells the most frequently on each date (perhaps assuming a record is quantity one).
WITH s as
(
SELECT itemID, saleDate, flavor, count(*) as tally
FROM sales
GROUP BY itemID, saleDate, flavor
), r as
(
SELECT itemID, saleDate, flavor, tally,
RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets from their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column holds the RANK() of the tally value on each summary row from the first section "s", within the window of itemID and saleDate. The ORDER BY on tally is descending because we want the largest value first, which gets a RANK() of 1. Then in the main SELECT we simply pick those summary records which were first in their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID, saleDate if they are tied for first place. This could be eliminated by replacing RANK() with ROW_NUMBER(), but that would arbitrarily pick one of the tied flavors as the winner, and this may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
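For example, a minimal sketch of that variant, assuming a hypothetical salesqty column on the sales table:
WITH s as
(
SELECT itemID, saleDate, flavor, SUM(salesqty) as units  -- salesqty is an assumed column name
FROM sales
GROUP BY itemID, saleDate, flavor
), r as
(
SELECT itemID, saleDate, flavor, units,
RANK() OVER (PARTITION BY itemID, saleDate ORDER BY units desc) as pri
FROM s
)
SELECT itemID, saleDate, flavor, units
FROM r
WHERE pri = 1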