T-SQL: Rows to Columns With Count - tsql

Let me draw up the table first (there are dozens of columns and dozens of values under Code in reality)
Code | Pat | Col1 | Col2 | Col3
---------------------------------
ABC | 001 | | XX | Q1
ABC | 002 | xx | xx | Q1
ABC | 003 | xx | xxx | Q1
DEF | 004 | xx | xx | Q1
DEF | 005 | xx | xx | Q1
DEF | 006 | xx | xxx | Q1
The resulting table need to look like
ABC | DEF
---------
2 | 3
3 | 3
Let me try and explain. For each 'Code' column, I would need to count the number of entries in Col1 to ColX where the cell is not null/empty.
So in example above, Code ABC has a count of 2 in Col1 and a count of 3 in Col2 Similarly for DEF, both have a count of 3
I've tried lots of things but got to the point where I'm now looking at a blank page again!
ALTERNATIVELY
Code | Col1 | Col2
--------------------
ABC | 2 | 3
DEF | 3 | 3
Please advise

The alternative solution can be reached by using GROUP BY and summing up a calculated number:
SELECT
[Code],
SUM(CASE WHEN ISNULL(Col1, '') = '' THEN 0 ELSE 1 END) as [Col1],
SUM(CASE WHEN ISNULL(Col2, '') = '' THEN 0 ELSE 1 END) as [Col2],
...
FROM T
GROUP by [Code]

Related

SQL 5.7 Lead Function

I'm struggling emulating a lead function to calculate the difference of (after date - current date)
I'm currently using mysql 5.7 to accomplish this. I have tried looking at various sources on stack overflow but I'm not sure how to get the result.
This is what I want:
What I currently have now is the same thing without the days column.
I would also like to know how to get a column of dates that grabs the date after the current date.
This seems to work (except for the unclear row=4):
DROP TABLE IF EXISTS table4;
CREATE TABLE table4 (id integer, user_id integer, product varchar(10), `date` date);
INSERT INTO table4 VALUES
(1,1,'item1','2020-01-01'),
(2,1,'item2','2020-01-01'),
(3,1,'item3','2020-01-02'),
(4,1,'item4','2020-01-02'),
(5,2,'item5','2020-01-06'),
(6,2,'item6','2020-01-09'),
(7,2,'item7','2020-01-09'),
(8,2,'item8','2020-01-10');
SELECT
id,
user_id,
product,
date,
(SELECT date FROM table4 t4 WHERE t4.id>t1.id LIMIT 1) x,
COALESCE(DATEDIFF((SELECT date FROM table4 t4 WHERE t4.id>t1.id LIMIT 1),date),0) as days
FROM table4 t1
output:
+ ------- + ------------ + ------------ + --------- + ----------- + --------- +
| id | user_id | product | date | x | days |
+ ------- + ------------ + ------------ + --------- + ----------- + --------- +
| 1 | 1 | item1 | 2020-01-01 | 2020-01-01 | 0 |
| 2 | 1 | item2 | 2020-01-01 | 2020-01-02 | 1 |
| 3 | 1 | item3 | 2020-01-02 | 2020-01-02 | 0 |
| 4 | 1 | item4 | 2020-01-02 | 2020-01-06 | 4 |
| 5 | 2 | item5 | 2020-01-06 | 2020-01-09 | 3 |
| 6 | 2 | item6 | 2020-01-09 | 2020-01-09 | 0 |
| 7 | 2 | item7 | 2020-01-09 | 2020-01-10 | 1 |
| 8 | 2 | item8 | 2020-01-10 | | 0 |
+ ------- + ------------ + ------------ + ---------- + ---------- + --------- +
The column x is only here for to see which date is returned from the subquery, and not really needed for the final result.
DBFIDDLE
EDIT: when there are no "gaps" in the numbering of id, you could do this to get a solution which should have more performance:
SELECT
t1.id,
t1.user_id,
t1.product,
t1.date,
COALESCE(DATEDIFF(t2.date,t1.date),0) as days
FROM table4 t1
LEFT JOIN table4 t2 on t2.id = t1.id+1
I added this to the DBFIDDLE

How can I write a function with two tables inputs and one table output in PostgreSQL?

I want to create a function that can create a table, in which part of the columns is derived from the other two tables.
input table1:
This is a static table for each loan. Each loan has only one row with information related to that loan. For example, original unpaid balance, original interest rate...
| id | loan_age | ori_upb | ori_rate | ltv |
| --- | -------- | ------- | -------- | --- |
| 1 | 360 | 1500 | 4.5 | 0.6 |
| 2 | 360 | 2000 | 3.8 | 0.5 |
input table2:
This is a dynamic table for each loan. Each loan has seraval rows show the loan performance in each month. For example, current unpaid balance, current interest rate, delinquancy status...
| id | month| cur_upb | cur_rate |status|
| ---| --- | ------- | -------- | --- |
| 1 | 01 | 1400 | 4.5 | 0 |
| 1 | 02 | 1300 | 4.5 | 0 |
| 1 | 03 | 1200 | 4.5 | 1 |
| 2 | 01 | 2000 | 3.8 | 0 |
| 2 | 02 | 1900 | 3.8 | 0 |
| 2 | 03 | 1900 | 3.8 | 1 |
| 2 | 04 | 1900 | 3.8 | 2 |
output table:
The output table contains information from table1 and table2. Payoffupb is the last record of cur_upb in table2. This table is built for model development.
| id | loan_age | ori_upb | ori_rate | ltv | payoffmonth| payoffupb | payoffrate |lastStatus | modification |
| ---| -------- | ------- | -------- | --- | ---------- | --------- | ---------- |---------- | ------------ |
| 1 | 360 | 1500 | 4.5 | 0.6 | 03 | 1200 | 4.5 | 1 | null |
| 2 | 360 | 2000 | 3.8 | 0.5 | 04 | 1900 | 3.8 | 2 | null |
Most columns in the output table can directly get or transferred from columns in the two input tables, but some columns can not get then leave blank.
My main question is how to write a function to take two tables as inputs and output another table?
I already wrote the feature transformation part for data files in 2018, but I need to do the same thing again for data files in some other years. That's why I want to create a function to make things easier.
As you want to insert the latest entry of table2 against each entry of table1 try this
insert into table3 (id, loan_age, ori_upb, ori_rate, ltv,
payoffmonth, payoffupb, payoffrate, lastStatus )
select distinct on (t1.id)
t1.id, t1.loan_age, t1.ori_upb, t1.ori_rate, t1.ltv, t2.month, t2.cur_upb,
t2.cur_rate, t2.status
from
table1 t1
inner join
table2 t2 on t1.id=t2.id
order by t1.id , t2.month desc
DEMO1
EDIT for your updated question:
Function to do the above considering table1, table2, table3 structure will be always identical.
create or replace function insert_values(table1 varchar, table2 varchar, table3 varchar)
returns int as $$
declare
count_ int;
begin
execute format('insert into %I (id, loan_age, ori_upb, ori_rate, ltv, payoffmonth, payoffupb, payoffrate, lastStatus )
select distinct on (t1.id) t1.id, t1.loan_age, t1.ori_upb,
t1.ori_rate,t1.ltv,t2.month,t2.cur_upb, t2.cur_rate, t2.status
from %I t1 inner join %I t2 on t1.id=t2.id order by t1.id , t2.month desc',table3,table1,table2);
GET DIAGNOSTICS count_ = ROW_COUNT;
return count_;
end;
$$
language plpgsql
and call above function like below which will return the number of inserted rows:
select * from insert_values('table1','table2','table3');
DEMO2

Generate a histogram of values grouped by a column

I have the following data in a reviews table for certain set of items, using a score system that ranges from 0 to 100
+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1 | 1 | 90 |
+-----------+---------+-------+
| 2 | 1 | 40 |
+-----------+---------+-------+
| 3 | 1 | 10 |
+-----------+---------+-------+
| 4 | 2 | 90 |
+-----------+---------+-------+
| 5 | 2 | 90 |
+-----------+---------+-------+
| 6 | 2 | 70 |
+-----------+---------+-------+
| 7 | 3 | 80 |
+-----------+---------+-------+
| 8 | 3 | 80 |
+-----------+---------+-------+
| 9 | 3 | 80 |
+-----------+---------+-------+
| 10 | 3 | 80 |
+-----------+---------+-------+
| 11 | 4 | 10 |
+-----------+---------+-------+
| 12 | 4 | 30 |
+-----------+---------+-------+
| 13 | 4 | 50 |
+-----------+---------+-------+
| 14 | 4 | 80 |
+-----------+---------+-------+
I am trying to create a histogram of the score values with a bin size of five. My goal is to generate a histogram per item. In order to create a histogram of the entire table, it is possible to use the width_bucket. This can also be tuned to operate on a per-item basis:
SELECT item_id, g.n as bucket, COUNT(m.score) as count
FROM generate_series(1, 5) g(n) LEFT JOIN
review as m
ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;
However, the result looks like this:
+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1 | 5 | 1 |
+---------+--------+-------+
| 1 | 3 | 1 |
+---------+--------+-------+
| 1 | 1 | 1 |
+---------+--------+-------+
| 2 | 5 | 2 |
+---------+--------+-------+
| 2 | 4 | 2 |
+---------+--------+-------+
| 3 | 4 | 4 |
+---------+--------+-------+
| 4 | 1 | 1 |
+---------+--------+-------+
| 4 | 2 | 1 |
+---------+--------+-------+
| 4 | 3 | 1 |
+---------+--------+-------+
| 4 | 4 | 1 |
+---------+--------+-------+
That is, bins with no entries are not included. While I find this not to be a bad solution, I would rather have either all buckets, with 0 on those with no entries. Even better, using this structure:
+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1 | 1 | 0 | 1 | 0 | 1 |
+---------+----------+----------+----------+----------+----------+
| 2 | 0 | 0 | 0 | 2 | 2 |
+---------+----------+----------+----------+----------+----------+
| 3 | 0 | 0 | 0 | 4 | 0 |
+---------+----------+----------+----------+----------+----------+
| 4 | 1 | 1 | 1 | 1 | 0 |
+---------+----------+----------+----------+----------+----------+
I prefer this solution as it uses a row per item (instead of 5n), which is simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows:
select item_id,
(sum(case when score >= 0 and score <= 19 then 1 else 0 end)) as bucket_1,
(sum(case when score >= 20 and score <= 39 then 1 else 0 end)) as bucket_2,
(sum(case when score >= 40 and score <= 59 then 1 else 0 end)) as bucket_3,
(sum(case when score >= 60 and score <= 79 then 1 else 0 end)) as bucket_4,
(sum(case when score >= 80 and score <= 100 then 1 else 0 end)) as bucket_5
from review;
Even though this query satisfies my requirements, I am curious to see if there might be a more elegant approach. so many case statements are not easy to read and changes in the bin criteria might require updating every sum. Also I am curious about the potential performance concerns that this query might have.
The second query can be rewritten to use ranges to make editing and writing the query a bit easier:
with buckets (b1, b2, b3, b4, b5) as (
values (
int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 100)
)
)
select item_id,
count(*) filter (where b1 #> score) as bucket_1,
count(*) filter (where b2 #> score) as bucket_2,
count(*) filter (where b3 #> score) as bucket_3,
count(*) filter (where b4 #> score) as bucket_4,
count(*) filter (where b5 #> score) as bucket_5
from review
cross join buckets
group by item_id
order by item_id;
A range constructed with int4range(0,20) includes the lower end and excludes the upper end.
The CTE named buckets only creates a single row, so the cross join does not change the number of rows from the review table.
I found this post useful
CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
WITH
source AS (
SELECT * FROM %s
),
min_max AS (
SELECT min(%s) AS min, max(%s) AS max FROM source
),
temp_histogram AS (
SELECT
width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
count(%s) AS freq
FROM source, min_max
WHERE %s IS NOT NULL
GROUP BY bucket
ORDER BY bucket
)
SELECT
bucket,
"range",
freq::bigint,
repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
FROM temp_histogram',
table_name_or_subquery,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name
);
END
$func$ LANGUAGE plpgsql;
Use the bucket numbers(100 in above script) in your favour.
Invoke like this
SELECT * FROM histogram($table_name_or_subquery, $column_name);
Example:
SELECT * FROM histogram('transactions_tbl', 'amount_colm');

Ranking rows according some number

I have table like this
id | name
----------
1 | A
2 | B
5 | C
100 | D
200 | E
201 | F
202 | G
I need ranking rows from 1 to 3 order by id, that is, I need result:
id | name | ranking
---------------------------
1 | A | 1
2 | B | 2
5 | C | 3
100 | D | 1
200 | E | 2
201 | F | 3
202 | G | 1
How to make this?
P.S.
I am trying:
SELECT id, name, row_number() OVER( order by id RANGE BETWEEN 1 AND 3 ) AS ranking FROM t
This gives syntax error.
RANGE is actually used for something else:
http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
http://www.postgresql.org/docs/current/static/sql-select.html
Try using a modulus instead:
SELECT id, name, 1 + (row_number() OVER( order by id ) - 1) % 3 AS ranking
FROM t

I'm a bit new to PostgreSQL and need how to construct complex query

I need to list all the cities you can get to after stopping off at exactly one other city, starting off from any city of my choice. And list with it the distance to the final city and the intermediate city.
The tables in the database consist of cities, with the attributes:
| city_id | name |
1 Edinburgh
2 Newcastle
3 Manchester
citypairs:
| citypair_id | city_id |
1 1
1 2
2 1
2 3
3 2
3 3
and distances:
| citypair_id | distance |
1 1234
2 1324
3 1324
and trains:
| train_id | departure_city_id | destination_city_id |
1 1 2
2 2 3
3 1 3
4 3 2
I haven't put any of the data in but basically if a city.name is chosen at random by me I need to find out which cities I can get to from this city if I go via another city (i.e. in two journeys) and then the distance to the final and intermediate city.
How would you, or how should I, go about forming a query to return the desired table?
Edited to include data and a missing table! As an example you can go from Edinburgh(1) to Manchester(3) via Newcastle(2) and you can go from Edinburgh to Newcastle via Manchester, however you can not go from Manchester to Edinburgh via Newcastle (since a train departs from 3, arrives at 2, but no train from 2 arrives in 1) and this route should not be returned from the query. Apologies for any confusion beforehand.
I've got a CTE that builds a tree of all the destinations.
WITH RECURSIVE trip AS (
SELECT c.city_id AS start_city,
ARRAY[c.city_id] AS route,
cast(c.name AS varchar(100)) AS route_text,
c.city_id AS leg_start_city,
c.city_id AS leg_end_city,
0 AS trip_count,
0 AS leg_length,
0 AS total_length
FROM cities c
UNION ALL
SELECT
trip.start_city,
trip.route || t.destination_city_id,
cast(trip.route_text || ',' || c.name AS varchar(100)),
t.departure_city_id,
t.destination_city_id,
trip.trip_count + 1,
d.distance,
trip.total_length + d.distance
FROM trains t
INNER JOIN trip
ON t.departure_city_id = trip.leg_end_city
INNER JOIN citypairs cps
ON t.departure_city_id = cps.city_id
INNER JOIN citypairs cpe
ON t.destination_city_id = cpe.city_id AND
cpe.citypair_id = cps.citypair_id
INNER JOIN distances d
ON cps.citypair_id = d.citypair_id
INNER JOIN cities c
ON t.destination_city_id = c.city_id
WHERE NOT (array[t.destination_city_id] <# trip.route))
SELECT *
FROM trip
WHERE trip_count = 2
AND start_city = (SELECT city_id FROM cities WHERE name = 'Edinburgh');
The CTE starts from each city (in the non-recursive part at the start), then determines all the destination cities it can go to. It keeps a track of all the cities its been to in an array (the route column), so it won't loop back to itself again. As it progresses, it keeps track of the overall trip distance, and the number of trains taken (in trip_count).
As it goes through the tree, it keeps a running total of the distance.
This gives results of
| START_CITY | ROUTE | ROUTE_TEXT | LEG_START_CITY | LEG_END_CITY | TRIP_COUNT | LEG_LENGTH | TOTAL_LENGTH |
--------------------------------------------------------------------------------------------------------------------------------
| 1 | 1,2,3 | Edinburgh,Newcastle,Manchester | 2 | 3 | 2 | 1324 | 2558 |
| 1 | 1,3,2 | Edinburgh,Manchester,Newcastle | 3 | 2 | 2 | 1324 | 2648 |
If you change remove the final WHERE clause it'll show all the possible trips in the data, likewise you can change the trip_count to find all single train destinations etc.
| START_CITY | ROUTE | ROUTE_TEXT | LEG_START_CITY | LEG_END_CITY | TRIP_COUNT | LEG_LENGTH | TOTAL_LENGTH |
--------------------------------------------------------------------------------------------------------------------------------
| 1 | 1 | Edinburgh | 1 | 1 | 0 | 0 | 0 |
| 2 | 2 | Newcastle | 2 | 2 | 0 | 0 | 0 |
| 3 | 3 | Manchester | 3 | 3 | 0 | 0 | 0 |
| 1 | 1,2 | Edinburgh,Newcastle | 1 | 2 | 1 | 1234 | 1234 |
| 1 | 1,3 | Edinburgh,Manchester | 1 | 3 | 1 | 1324 | 1324 |
| 2 | 2,3 | Newcastle,Manchester | 2 | 3 | 1 | 1324 | 1324 |
| 3 | 3,2 | Manchester,Newcastle | 3 | 2 | 1 | 1324 | 1324 |
| 1 | 1,2,3 | Edinburgh,Newcastle,Manchester | 2 | 3 | 2 | 1324 | 2558 |
| 1 | 1,3,2 | Edinburgh,Manchester,Newcastle | 3 | 2 | 2 | 1324 | 2648 |
The cast( ... as varchar(100)) is a bit hacky, and I'm not sure why it was needed, but I haven't had a chance to get around that yet.
The SQL is here for testing: http://sqlfiddle.com/#!1/93964/24
The first part is easy:
SELECT c2.name
FROM cities AS c
JOIN trains t ON c.city_id=t.departure_city_id
JOIN trains t2 ON t.destination_city_id=t2.departure_city_id
JOIN cities AS c2 ON t2.destination_city_id=c2.city_id
WHERE c2.city_id!=c.city_id
AND c.name='Edinburgh';
http://sqlfiddle.com/#!12/a656f/14
In PG 9.1+ you could even do it with a recursive CTE for any number of cities in between. The distances are a little more complicated and you probably would be better off transforming city_pairs into actual pairs.