Calculating Zscore for multiple rows - tsql

My table stores measurements taken over several visits by a person. I want to calculate the z-score for each person. My result is wrong because it is calculated for each row only; I want, for each ID, the z-score across all four visits. Below is what I have so far.
`select ID,VNo1,VNo2,VNo3,VNo4
.W1,AVG(W1) AS Mean , STDEVP(W1) AS StandardDeviation
, STDEVP(W1) * STDEVP(W1) AS Zscore,
from dbo.measurement
GROUP BY ID,VNo1,VNo2,VNo3,VNo4
`

I'm not sure what values the VNoX columns hold, but if you have one row per visit, you just need to remove those columns from your GROUP BY.
select
ID,
AVG(W1) AS Mean,
STDEVP(W1) AS StandardDeviation,
STDEVP(W1) * STDEVP(W1) AS Zscore -- formula kept from the question; this product is the population variance
from
dbo.measurement
GROUP BY
ID
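A z-score is usually defined per measurement as (value - mean) / standard deviation, so it cannot come out of a single aggregate row per ID. The following is a minimal Python sketch of that per-ID calculation, not the answerer's method; the function name `zscores_per_id` and the sample rows are made up for illustration, and the population standard deviation is used to match STDEVP.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores_per_id(rows):
    """rows: list of (person_id, w1). Returns a list of (person_id, w1, z)."""
    groups = defaultdict(list)
    for pid, w1 in rows:
        groups[pid].append(w1)

    # Mean and population standard deviation per person, like AVG / STDEVP.
    stats = {pid: (mean(vals), pstdev(vals)) for pid, vals in groups.items()}

    result = []
    for pid, w1 in rows:
        m, sd = stats[pid]
        # A zero standard deviation leaves the z-score undefined.
        z = (w1 - m) / sd if sd else None
        result.append((pid, w1, z))
    return result


# Hypothetical data: person 1 has four visits, person 2 has two identical ones.
rows = [(1, 10.0), (1, 12.0), (1, 14.0), (1, 16.0), (2, 5.0), (2, 5.0)]
scored = zscores_per_id(rows)
```

In SQL the same idea would typically be expressed with AVG and STDEVP as window functions partitioned by ID, so that every visit row keeps its own z-score.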


PostgreSQL MIN and MAX function doesn't return one result

In a PostgreSQL database, I have a table called Trip.
There is a column called id and a column called meta in the table.
An example of id in one row looks like:
id = 123456
An example of meta in one row looks like:
meta = {"runTime": 3922000, "distance": 85132, "duration": 4049000, "fuelUsed": 19.595927498516176}
I want to select the trip which has the minimum kph from the Trip table and show trip id and minimum kph. This is my query:
select tp."id" tripid, MIN((3600 * (tp."meta"->>'distance')::numeric)
/ ((tp."meta"->>'runTime')::NUMERIC)) minkph FROM "Trip" tp
WHERE tp."createdAt" BETWEEN '2020-04-01 00:00:00+00'
and '2020-04-30 00:00:00+00'
GROUP BY tp."id"
However this query returns all trips' id and division calculation results, not only one row.
Could you please help?
You can order by the calculated kph field and return only the first:
select tp."id" tripid, MIN((3600 * (tp."meta"->>'distance')::numeric)
/ ((tp."meta"->>'runTime')::NUMERIC)) minkph FROM "Trip" tp
WHERE tp."createdAt" BETWEEN '2020-04-01 00:00:00+00'
and '2020-04-30 00:00:00+00'
GROUP BY tp."id"
order by 2
limit 1
Approach 1 - Global minimum only:
Your query selects the column tp.id and groups by it, so MIN() is evaluated once per id group. If you want the global MIN() for your query, drop the id column and the GROUP BY:
SELECT MIN((3600 * (tp."meta"->>'distance')::numeric) / ((tp."meta"->>'runTime')::NUMERIC)) minkph
FROM "Trip" tp
WHERE tp."createdAt" BETWEEN '2020-04-01 00:00:00+00'
AND '2020-04-30 00:00:00+00'
An aggregate function is evaluated once per group of rows. If you select nothing but MIN() and omit GROUP BY, all rows form a single group and the query returns one row with the global result.
Approach 2 - Minimum with its trip id:
If you want the minimum and the respective id, drop the aggregate, order by the computed value and take the first row with LIMIT 1:
SELECT tp."id" AS tripid, ((3600 * (tp."meta"->>'distance')::numeric) / ((tp."meta"->>'runTime')::NUMERIC)) minkph
FROM "Trip" tp
WHERE tp."createdAt" BETWEEN '2020-04-01 00:00:00+00'
AND '2020-04-30 00:00:00+00'
ORDER BY 2
LIMIT 1
You can also use window functions, but that is a bit more complex.
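The "order by the computed value and keep the first row" idea can be sketched outside the database as well. This is only an illustration in Python: the id 123456 and its meta values come from the question, the other two trips are invented, and `kph` and `slowest_trip` are hypothetical helper names. Skipping trips with a zero runTime is an added assumption to avoid a division by zero.

```python
def kph(meta):
    # Same expression as in the queries: 3600 * distance / runTime.
    return 3600 * meta["distance"] / meta["runTime"]


def slowest_trip(trips):
    """Return (trip_id, kph) for the trip with the lowest kph, or None."""
    # Assumption: trips with a missing or zero runTime are ignored.
    candidates = [t for t in trips if t["meta"].get("runTime")]
    if not candidates:
        return None
    best = min(candidates, key=lambda t: kph(t["meta"]))
    return best["id"], kph(best["meta"])


trips = [
    {"id": 123456, "meta": {"runTime": 3922000, "distance": 85132}},
    {"id": 2, "meta": {"runTime": 3600000, "distance": 50000}},  # made up
    {"id": 3, "meta": {"runTime": 0, "distance": 100}},          # made up
]
slowest = slowest_trip(trips)
```

`min` with a `key` plays the role of `ORDER BY 2 LIMIT 1`: it keeps the whole row, so the id travels with the minimum value.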

Change the relation between two tables to outer join

I have a table (table1) that holds fact data. Let's say its columns are (products, start, end, value1, month [calculated column]), where start and end are timestamps.
What I am trying to build is a table and a bar chart that show the sum of value1 for each month, divided by a factor that depends on the month. (The report is on a yearly basis; that is, I load one year of data into Qlik Sense.)
I used start and end to generate an autoCalendar as a timestamp field in the Qlik Sense data manager. Then I take the month from start and store it in the calculated column "month" in table1 using the autoCalendar feature (Month(start.autoCalendar.Month)).
After that, I created another table with two columns (month, value2). The value2 column is the factor by which I need to divide value1 for each month, that is (sum(value1) / 1520 [for January], sum(value1) / 650 [for February]) and so on. The two month columns are the linking columns in Qlik Sense. In my expression I can then calculate sum(value1) and pick up the value2 that matches the month from table2.
The calculation works correctly, but one thing is still missing. The products do not have a value (value1) in every month. For example, say I have products (p1, p2, ...). Table1 has data for p1 in (Jun, Feb, Nov) and for p2 in (Mar, Apr, May, Dec). Hence, when the data are presented in a Qlik Sense table or in a bar chart, I see only the months that have values in the fact table. The Qlik Sense table contains two dimensions, [products] and [month], and the measure m1 [sum(value1)/value2].
What I want is a yearly report showing all 12 months. In my example I see only 3 months for p1 and 4 months for p2. When there is no data, the measure column [m1] should be 0, and I want that 0 to appear in my table and chart.
I think a solution might be to show the data in the Qlik Sense table as a right outer join of my relationship (table1.month >> table2.month). So, is it possible in Qlik Sense to have an outer join in such a case? Or is there a better solution to my problem?
Update
Got it. I'm not sure this is the best approach, but in cases like this I usually fill in the missing records during the script load.
// Main table
Sales:
Load
*,
ProductId & '-' & Month as Key_Product_Month
;
Load * Inline [
ProductId, Month, SalesAmount
P1 , 1 , 10
P1 , 2 , 20
P1 , 3 , 30
P2 , 1 , 40
P2 , 2 , 50
];
// Get distinct products and assign 0 as SalesAmount
Products_Temp:
Load
distinct ProductId,
0 as SalesAmount
Resident
Sales
;
join (Products_Temp) // Cross join in this case
Load
distinct Month
Resident
Sales
;
// After the cross join Products_Temp table contains
// all possible combinations between ProductId and Month
// and for each combination SalesAmount = 0
Products_Temp_1:
Load
*,
ProductId & '-' & Month as Key_Product_Month1 // Generate the unique id
Resident
Products_Temp
;
Drop Table Products_Temp; // we don't need this anymore
Concatenate (Sales)
// Concatenate to the main table only the ProductId-Month
// combinations that are missing
Load
*
Resident
Products_Temp_1
Where
Not Exists(Key_Product_Month, Key_Product_Month1)
;
Drop Table Products_Temp_1; // not needed any more
Drop Fields Key_Product_Month1, Key_Product_Month; // not needed any more
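The logic of the load script (cross join all products with all months, then keep the existing amounts and use 0 elsewhere) can be sketched in a few lines of Python. This is only an illustration of the idea, not Qlik script; `fill_missing` is a hypothetical name and the data mirrors the inline table above.

```python
from itertools import product


def fill_missing(sales):
    """sales: dict mapping (product_id, month) -> amount.

    Returns a dict with one entry for every product/month combination,
    using 0 where the original data had no record.
    """
    products = sorted({p for p, _ in sales})
    months = sorted({m for _, m in sales})
    # itertools.product is the cross join of the two distinct lists.
    return {(p, m): sales.get((p, m), 0) for p, m in product(products, months)}


sales = {
    ("P1", 1): 10, ("P1", 2): 20, ("P1", 3): 30,
    ("P2", 1): 40, ("P2", 2): 50,
}
filled = fill_missing(sales)
```

With this data the only added combination is ("P2", 3), which gets the amount 0. Note that only months present somewhere in the data are generated; a month with no sales for any product would need a separate calendar list.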
Before the script:
After the script:
A table link in Qlik Sense (and QlikView) behaves more like a full outer join. If you want to show the id values from only one table (and not all of them), you can create an additional field in that table and perform your calculations on that field instead of on the linked one. For example:
Table1:
Load
id,
value1
From
MyQVD1.qvd (qvd)
;
Table2:
Load
id,
id as MyRightId,
value2
From
MyQVD2.qvd (qvd)
;
In the example above, both tables are still linked on the id field, but if you want to count only the id values in the right table (Table2), you just need to type
count( MyRightId )
I know this question has been answered, and I quite like Stefan's approach, but I hope my answer will help other users. I recently ran into something similar and used slightly different logic, with the following script:
// Main table
Sales:
Load * Inline [
ProductId, Month, SalesAmount
P1 , 1 , 10
P1 , 2 , 20
P1 , 3 , 30
P2 , 1 , 40
P2 , 2 , 50
];
Cartesian:
//Create a combination of all ProductId and Month and then load the existing data into this table
NoConcatenate Load distinct ProductId Resident Sales;
Join
Load Distinct Month Resident Sales;
Join Load ProductId, Month, SalesAmount Resident Sales; //Existing data loaded
Drop Table Sales;
This results in the following output table:
The Null value in the new (bottom-most) row can stay as it is, but if you prefer to replace it, use the Map ... Using statement.

Managing overflows in LISTAGG on Amazon Redshift

Using the example from this post: https://blogs.oracle.com/datawarehousing/entry/managing_overflows_in_listagg
The following statement:
SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp
GROUP BY deptno;
will generate the following output:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING;MILLER
20 SMITH;JONES;SCOTT;ADAMS;FORD
30 ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES
Let’s assume that the above statement does not run and that each row returned by our LISTAGG function is limited to 15 characters. (The actual limit on Amazon Redshift is 65535.)
We would want the following to be returned in this case:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING
10 MILLER
20 SMITH;JONES
20 SCOTT;ADAMS
20 FORD
30 ALLEN;WARD
30 MARTIN;BLAKE
30 TURNER;JAMES
What would be the best way to recreate this result in Amazon Redshift to avoid any data loss and taking speed into consideration?
It's possible to achieve this with two subqueries:
First:
SELECT id, field,
sum(length(field) + 1) over
(partition by id order by RANDOM() rows unbounded preceding) as total_length_now
from my_schema.my_table
First we calculate the running number of characters for each id in our table. A window function computes it incrementally for each row. In the ORDER BY clause you can use any unique field you have. If you don't have one, you can simply use random or a hash function, but it is mandatory that the ordering value is unique; otherwise the function will not work as intended.
The '+1' in the length represents the semicolon that we will use in the listagg function.
Second:
SELECT id, field, total_length_now / 65535 as sub_id
FROM (sub_query_1)
Now we create a sub_id based on the length that we calculated before. Each time total_length_now crosses a multiple of the limit size (in this case 65535), the integer division yields a new sub_id.
Last Step
SELECT id, sub_id, listagg(field, ';') as namelist
FROM (sub_query_2)
GROUP BY id, sub_id
ORDER BY id, sub_id
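Now we can simply call the listagg function grouping by id and sub_id. Each row is assigned to a group by where its running total ends, so a group can overshoot the divisor by up to one field's length; choose a divisor somewhat below the hard limit to leave headroom.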
Complete query
SELECT id, sub_id, listagg(field, ';') as namelist
FROM (
SELECT id, field, total_length_now / 65535 as sub_id
FROM (SELECT id,
field,
sum(length(field) + 1) over
(partition by id order by field rows unbounded preceding) as total_length_now
from support.test))
GROUP BY id, sub_id
order by id, sub_id
Example with your data (with size limit = 10)
First and second query output:
id, field, total_length_now, sub_id
10,KING,5,0
10,CLARK,11,1
10,MILLER,18,1
20,ADAMS,6,0
20,SMITH,12,1
20,JONES,18,1
20,FORD,23,2
20,SCOTT,29,2
30,JAMES,6,0
30,BLAKE,12,1
30,WARD,17,1
30,MARTIN,24,2
30,TURNER,31,3
30,ALLEN,37,3
Final query output:
id,sub_id,namelist
10,0,KING
10,1,CLARK;MILLER
20,0,ADAMS
20,1,SMITH;JONES
20,2,FORD;SCOTT
30,0,JAMES
30,1,BLAKE;WARD
30,2,MARTIN
30,3,TURNER;ALLEN
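To make the bucketing rule easy to check by hand, here is a small Python sketch of the same logic: a running total of length(field) + 1 per id, integer-divided by the limit to get sub_id, then a join per (id, sub_id). It is an illustration only, not Redshift code; `chunked_listagg` is a hypothetical name, and the rows are taken in the order shown in the example output above.

```python
from itertools import groupby


def chunked_listagg(rows, limit, sep=";"):
    """rows: list of (id, field), already ordered and contiguous per id.

    Returns a list of (id, sub_id, joined_fields).
    """
    out = []
    for key, group in groupby(rows, key=lambda r: r[0]):
        running = 0
        buckets = {}
        for _, field in group:
            running += len(field) + 1          # +1 for the separator
            sub_id = running // limit          # integer division
            buckets.setdefault(sub_id, []).append(field)
        for sub_id in sorted(buckets):
            out.append((key, sub_id, sep.join(buckets[sub_id])))
    return out


rows = [
    (10, "KING"), (10, "CLARK"), (10, "MILLER"),
    (20, "ADAMS"), (20, "SMITH"), (20, "JONES"), (20, "FORD"), (20, "SCOTT"),
    (30, "JAMES"), (30, "BLAKE"), (30, "WARD"), (30, "MARTIN"),
    (30, "TURNER"), (30, "ALLEN"),
]
chunks = chunked_listagg(rows, limit=10)
```

With limit=10 this reproduces the final query output above. It also shows that a chunk such as CLARK;MILLER (12 characters) can be longer than the divisor, so the divisor should sit below the real limit.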
It is possible to create a partial list and then emit the remaining values as separate rows in one go, but if the number of rows is unbounded you really need a loop to turn those remaining rows into the next list, then the rows left over after that, and so on.
So this is really a task for Apache Spark (or any other map-reduce technology).

How to reference output rows with window functions?

Suppose I have a table with a quantity column.
CREATE TABLE transfers (
user_id integer,
quantity integer,
created timestamp default now()
);
I'd like to iteratively go thru a partition using window functions, but access the output rows, not the input table rows.
To access the input table rows I could do something like this:
SELECT LAG(quantity, 1, 0)
OVER (PARTITION BY user_id ORDER BY created)
FROM transfers;
I need to access the previous output row to calculate the next output row. How can I access the lag row in the output? Something like:
CREATE VIEW balance AS
SELECT LAG(balance.total, 1, 0) + quantity AS total
OVER (PARTITION BY user_id ORDER BY created)
FROM transfers;
Edit
This is a minimal example to support the question of how to access the previous output row within a window partition. I don't actually want a sum.
It seems you are trying to calculate a running sum. Luckily, that is exactly what the sum() window function does:
WITH transfers AS(
SELECT i, random()-0.3 AS quantity FROM generate_series(1,100) as i
)
SELECT i, quantity, sum(quantity) OVER (ORDER BY i) from transfers;
Looking at the question, I guess that all you need is a cumulative sum.
To calculate a cumulative sum, use this query:
SELECT *,
SUM( CASE WHEN quantity IS NULL THEN 0 ELSE quantity END)
OVER ( PARTITION BY user_id ORDER BY created
ROWS BETWEEN unbounded preceding AND current row
) As cumulative_sum
FROM transfers
ORDER BY user_id, created
;
But if you want more complex calculations, especially ones containing conditions (decisions) that depend on the result from the previous row, then you need a recursive approach (for example a recursive CTE).
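The difference between a plain running sum and a calculation that feeds on the previous output row can be shown with a short Python sketch. The rule used here (the balance may never drop below zero) is purely a made-up example of such a condition, and `running_balance` is a hypothetical name; it requires Python 3.8 or later for the `initial` argument.

```python
from itertools import accumulate


def running_balance(quantities, floor=0):
    """Each output depends on the previous output, not only on the inputs.

    Hypothetical rule: the balance is clamped so it never drops below `floor`.
    """
    steps = accumulate(
        quantities,
        lambda prev, q: max(floor, prev + q),  # prev is the previous OUTPUT
        initial=0,
    )
    return list(steps)[1:]  # drop the seed value


quantities = [5, -8, 3, -1]
clamped = running_balance(quantities)
plain = list(accumulate(quantities))  # ordinary running sum for comparison
```

Here the ordinary running sum goes negative after the second row, while the clamped balance restarts from zero; that dependence on the previous result is what a window SUM() cannot express.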

Retrieving Representative Records for Unique Values of Single Column

For Postgresql 8.x, I have an answers table containing (id, user_id, question_id, choice) where choice is a string value. I need a query that will return a set of records (all columns returned) for all unique choice values. What I'm looking for is a single representative record for each unique choice. I also want to have an aggregate votes column that is a count() of the number of records matching each unique choice accompanying each record. I want to force choice to lowercase for this comparison to be made (HeLLo and Hello should be considered equal). I can't GROUP BY lower(choice) because I want all columns in the result-set. Grouping by all columns causes all records to return, including all duplicates.
1. Closest I've gotten
select lower(choice), count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
The issue with this is it will not return all columns.
lower | votes
-----------------------------------------------+-------
dancing in the moonlight | 8
pumped up kicks | 7
party rock anthem | 6
sexy and i know it | 5
moves like jagger | 4
2. Trying with all columns
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
Because I am not specifying every column from the SELECT in my GROUP BY, this throws an error telling me to do so.
3. Specifying all columns in the GROUP BY
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice), id, user_id, question_id, choice order by votes desc;
This simply dumps the table with votes column as 1 for all records.
How can I get the vote count and unique representative records from 1., but with all columns from the table returned?
Join the grouped results back to the primary table, then keep only one row for each (question, choice) combination.
Something like this:
WITH top5 AS (
select question_id, lower(choice) as choice_lc, count(*) as votes
from answers
where question_id = 21
group by question_id, lower(choice)
order by count(*) desc
limit 5
)
-- USING accepts only column names, so join on the lowercased value with ON
SELECT DISTINCT ON (a.question_id, top5.choice_lc) a.*, top5.votes
FROM top5
JOIN answers a ON a.question_id = top5.question_id AND lower(a.choice) = top5.choice_lc
ORDER BY a.question_id, top5.choice_lc, a.id;
Here's what I ended up with:
SELECT answers.*, cc.votes as votes FROM answers join (
select max(id) as id, count(id) as votes
from answers
group by trim(lower(choice))
) cc
on answers.id = cc.id ORDER BY votes desc, lower(response) asc
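For readers who want to check the grouping logic on a small sample, here is a rough Python equivalent of the final query: group by the trimmed, lowercased choice, keep the row with the highest id as the representative, and attach the vote count. The function name `representatives` and the sample rows are invented for illustration; Python's `strip()` removes all surrounding whitespace, which is slightly broader than SQL trim().

```python
def representatives(answers):
    """answers: list of dicts with at least 'id' and 'choice' keys.

    Returns one representative row per normalized choice, with a 'votes'
    count, ordered by votes descending and then by choice.
    """
    best, votes = {}, {}
    for row in answers:
        key = row["choice"].strip().lower()
        votes[key] = votes.get(key, 0) + 1
        # Like max(id): keep the row with the highest id in each group.
        if key not in best or row["id"] > best[key]["id"]:
            best[key] = row
    result = [dict(best[k], votes=votes[k]) for k in best]
    result.sort(key=lambda r: (-r["votes"], r["choice"].strip().lower()))
    return result


answers = [
    {"id": 1, "user_id": 7, "question_id": 21, "choice": "Hello"},
    {"id": 2, "user_id": 8, "question_id": 21, "choice": "HeLLo"},
    {"id": 3, "user_id": 9, "question_id": 21, "choice": "world"},
    {"id": 4, "user_id": 5, "question_id": 21, "choice": " hello "},
]
reps = representatives(answers)
```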