T-SQL MIN of Subquery - tsql

I have a query that tries to return a customer number and the number of consecutive years they've been a customer. It does this by building a list of activity-year and customer, then comparing that to a list of possible years and returning the lowest year with no activity. The problem is that the possible years list is a large cross-join. I think this would run much more quickly if I could bake the EXCEPT logic inside the MIN and just reuse my list of 10 possible years.
The Query:
SELECT SUBSTRING(D,3,9) AS Cust, MIN(SUBSTRING(D,1,1)) AS Years FROM
(SELECT DISTINCT
CAST (y.years AS VARCHAR) + '-' + CAST(pm.BillToCustomerId AS VARCHAR ) AS D
FROM [DW_Mart].[dbo].[vProMaster] pm
cross join
(VALUES ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9')) AS y(years)
EXCEPT
SELECT DISTINCT CAST (DATEDIFF(YEAR,[ShipmentDate],CURRENT_TIMESTAMP) AS VARCHAR)
+ '-' + CAST ([BillToCustomerId] AS VARCHAR ) AS D
FROM [DW_Mart].[dbo].[vProMaster] pm ) AS X GROUP BY SUBSTRING(D,3,9)
My pseudocode revised query:
SELECT SUBSTRING(D,3,9) AS Cust, MIN((VALUES ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9')) EXCEPT SUBSTRING(D,1,1)) AS Years FROM
(SELECT DISTINCT CAST (DATEDIFF(YEAR,[ShipmentDate],CURRENT_TIMESTAMP) AS VARCHAR)
+ '-' + CAST ([BillToCustomerId] AS VARCHAR ) AS D
FROM [DW_Mart].[dbo].[vProMaster] pm ) AS X GROUP BY SUBSTRING(D,3,9)

How about something like this?
select billtocustomerid, max( DATEDIFF(YEAR,[ShipmentDate],CURRENT_TIMESTAMP))
from dw_mart.dbo.vpromaster
group by billtocustomerid

Turns out that the cross join crushed the server.
I’m not sure how proud I am of this puppy, but it works and it’s quick:
SELECT SUBSTRING(D,10,9) AS Cust,
CHARINDEX('1',RIGHT('00000000'+(CAST (11111111-SUM(CAST(SUBSTRING(D,1,8) AS INT)) AS VARCHAR)),8)) AS Years
FROM
(SELECT DISTINCT RIGHT('00000000'+(CAST (POWER(10,(FLOOR(8-DATEDIFF(MONTH,[ShipmentDate],CURRENT_TIMESTAMP)/12))) AS VARCHAR)),8)
+ '-' + CAST ([BillToCustomerId] AS VARCHAR ) AS D
FROM [DW_Mart].[dbo].[vProMaster] pm
where ShipmentDate < CURRENT_TIMESTAMP and ShipmentDate > DATEADD(YEAR,-8,CURRENT_TIMESTAMP)) AS X GROUP BY SUBSTRING(D,10,9)
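The digit trick in the query above can be easier to follow outside SQL. The Python sketch below is not part of the original answer. It assumes each customer's activity has already been reduced to a set of whole "years ago" buckets, and the name `first_gap_year` and the `width` parameter are made up for illustration. Each active bucket sets one decimal digit, subtracting from 11111111 flips the digits, and the position of the first remaining 1 is the first year without activity (0 when there is no gap, like CHARINDEX).

```python
def first_gap_year(years_ago, width=8):
    """Return the first bucket in 1..width with no activity, or 0 if none.

    years_ago: iterable of whole-year buckets in which the customer was active
    (hypothetical input; the SQL derives it from ShipmentDate).
    """
    # One decimal digit per bucket: bucket k sets the k-th digit from the left.
    # Buckets outside 1..width are ignored, as in the SQL after RIGHT(..., 8).
    mask = sum(10 ** (width - k) for k in set(years_ago) if 1 <= k <= width)
    # Subtracting from 111...1 leaves a 1 exactly where no activity was seen.
    gaps = str(int("1" * width) - mask).zfill(width)
    # str.find is 0-based and returns -1 when absent, so +1 mimics CHARINDEX.
    return gaps.find("1") + 1


print(first_gap_year({1, 2, 3, 5}))  # 4
```

Summing the digits only works because each bucket is counted once per customer, which is what the DISTINCT in the inner query guarantees.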

Related

Pivot table using crosstab and count

I have to display a table like this:
 Year | Month | Delivered | Not delivered | Not Received
------+-------+-----------+---------------+--------------
 2021 | Jan   |        10 |            86 |           75
 2021 | Feb   |        13 |            36 |           96
 2021 | March |        49 |             7 |           61
 2021 | Apr   |         3 |            21 |           72
Using raw data generated by this query:
SELECT
year,
TO_CHAR( creation_date, 'Month') AS month,
marking,
COUNT(*) AS count
FROM invoices
GROUP BY 1,2,3
I have tried using crosstab() but I got an error:
SELECT * FROM crosstab('
SELECT
year,
TO_CHAR( creation_date, ''Month'') AS month,
marking,
COUNT(*) AS count
FROM invoices
GROUP BY 1,2,3
') AS ct(year text, month text, marking text)
I would prefer not to type all the marking values manually because there are a lot of them.
ERROR: invalid source data SQL statement
DETAIL: The provided SQL must return 3 columns: rowid, category, and values.
1. Static solution with a limited list of marking values :
SELECT year
, TO_CHAR( creation_date, 'Month') AS month
, COUNT(*) FILTER (WHERE marking = 'Delivered') AS Delivered
, COUNT(*) FILTER (WHERE marking = 'Not delivered') AS "Not delivered"
, COUNT(*) FILTER (WHERE marking = 'Not Received') AS "Not Received"
FROM invoices
GROUP BY 1,2
2. Full dynamic solution with a large list of marking values :
This proposal is an alternative solution to the crosstab solution as proposed in A and B.
The solution proposed here only requires a dedicated composite type, which can be created dynamically, and otherwise relies on the jsonb type and standard functions.
Starting from your query, which counts the number of rows per year, month and marking value:
Using the jsonb_object_agg function, the resulting rows are first aggregated by year and month into jsonb objects whose keys correspond to the marking values and whose values correspond to the counts.
The resulting jsonb objects are then converted into records using the jsonb_populate_record function and the dedicated composite type.
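The two aggregation steps described above can be sketched in plain Python. This is only an illustration of the idea, not the PostgreSQL implementation: the function name `pivot_counts` and the sample rows are assumptions, the dictionary per (year, month) plays the role of the jsonb object, and the fixed `markings` list plays the role of the composite type.

```python
from collections import defaultdict


def pivot_counts(rows, markings):
    """Pivot (year, month, marking, count) rows into one dict per year/month."""
    # Step 1: collect marking -> count per (year, month), like jsonb_object_agg.
    grouped = defaultdict(dict)
    for year, month, marking, count in rows:
        grouped[(year, month)][marking] = count
    # Step 2: spread each object over a fixed column list, like
    # jsonb_populate_record; markings missing for a month come out as None.
    return [
        {"year": year, "month": month, **{m: cells.get(m) for m in markings}}
        for (year, month), cells in grouped.items()
    ]


rows = [
    (2021, "Jan", "Delivered", 10),
    (2021, "Jan", "Not delivered", 86),
    (2021, "Feb", "Delivered", 13),
]
for record in pivot_counts(rows, ["Delivered", "Not delivered"]):
    print(record)
```

As with the composite type, the column list has to be known before the pivot runs, which is why it is passed in separately.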
First we dynamically create a composite type which corresponds to the ordered list of marking values :
CREATE OR REPLACE PROCEDURE create_composite_type() LANGUAGE plpgsql AS $$
DECLARE
column_list text ;
BEGIN
SELECT string_agg(DISTINCT quote_ident(marking) || ' bigint', ',' ORDER BY quote_ident(marking) || ' bigint' ASC)
INTO column_list
FROM invoices ;
EXECUTE 'DROP TYPE IF EXISTS composite_type' ;
EXECUTE 'CREATE TYPE composite_type AS (' || column_list || ')' ;
END ;
$$ ;
CALL create_composite_type() ;
Then the expected result is provided by the following query :
SELECT a.year
, TO_CHAR(a.year_month, 'Month') AS month
, (jsonb_populate_record( null :: composite_type
, jsonb_object_agg(a.marking, a.count)
)
).*
FROM
( SELECT year
, date_trunc('month', creation_date) AS year_month
, marking
, count(*) AS count
FROM invoices AS v
GROUP BY 1,2,3
) AS a
GROUP BY 1,2
ORDER BY month
Obviously, if the list of marking values may vary over time, then you have to call the create_composite_type() procedure again just before executing the query. If you don't update the composite_type, the query will still work (no error!) but some old marking values may be obsolete (not used anymore), and some new marking values may be missing from the query result (not displayed as columns).
See the full demo in dbfiddle.
You need to generate the crosstab() call dynamically.
But since SQL does not allow dynamic return types, you need a two-step workflow:
Generate query
Execute query
If you are unfamiliar with crosstab(), read this first:
PostgreSQL Crosstab Query
It's odd to generate the month from creation_date, but not the year. To simplify, I use a combined column year_month instead.
Query to generate the crosstab() query:
SELECT format(
$f$SELECT * FROM crosstab(
$q$
SELECT to_char(date_trunc('month', creation_date), 'YYYY_Month') AS year_month
, marking
, COUNT(*) AS ct
FROM invoices
GROUP BY date_trunc('month', creation_date), marking
ORDER BY date_trunc('month', creation_date) -- optional
$q$
, $c$VALUES (%s)$c$
) AS ct(year_month text, %s);
$f$, string_agg(quote_literal(sub.marking), '), (')
, string_agg(quote_ident (sub.marking), ' int, ') || ' int'
)
FROM (SELECT DISTINCT marking FROM invoices ORDER BY 1) sub;
If the table invoices is big with only few distinct values for marking (which seems likely) there are faster ways to get distinct values. See:
Optimize GROUP BY query to retrieve latest row per user
Generates a query of the form:
SELECT * FROM crosstab(
$q$
SELECT to_char(date_trunc('month', creation_date), 'YYYY_Month') AS year_month
, marking
, COUNT(*) AS ct
FROM invoices
GROUP BY date_trunc('month', creation_date), marking
ORDER BY date_trunc('month', creation_date) -- optional
$q$
, $c$VALUES ('Delivered'), ('Not Delivered'), ('Not Received')$c$
) AS ct(year_month text, "Delivered" int, "Not Delivered" int, "Not Received" int);
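The "generate, then execute" workflow can also be driven from client code. The Python sketch below only mirrors the string building done by format() above. The helpers `quote_literal` and `quote_ident` are simplified stand-ins written for this example (they always quote and do not handle backslashes), and `build_crosstab_sql` is a hypothetical name; the list of markings is assumed to come from a prior SELECT DISTINCT.

```python
def quote_literal(value):
    # Simplified stand-in: wrap in single quotes and double embedded ones.
    return "'" + value.replace("'", "''") + "'"


def quote_ident(name):
    # Simplified stand-in: always double-quote, doubling embedded quotes.
    return '"' + name.replace('"', '""') + '"'


def build_crosstab_sql(markings):
    """Build the crosstab() statement text for the given marking values."""
    categories = ", ".join(f"({quote_literal(m)})" for m in markings)
    columns = ", ".join(f"{quote_ident(m)} int" for m in markings)
    source = (
        "SELECT to_char(date_trunc('month', creation_date), 'YYYY_Month') AS year_month\n"
        "     , marking\n"
        "     , COUNT(*) AS ct\n"
        "FROM   invoices\n"
        "GROUP  BY date_trunc('month', creation_date), marking\n"
        "ORDER  BY date_trunc('month', creation_date)"
    )
    return (
        f"SELECT * FROM crosstab(\n$q$\n{source}\n$q$\n"
        f", $c$VALUES {categories}$c$\n"
        f") AS ct(year_month text, {columns});"
    )


print(build_crosstab_sql(["Delivered", "Not Delivered", "Not Received"]))
```

The printed text would then be sent to the server as a second statement, which is the "execute" half of the workflow.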
The simplified query does not need "extra" columns. See:
Pivot on Multiple Columns using Tablefunc
Note the use of date_trunc('month', creation_date) in GROUP BY and ORDER BY. This produces a valid sort order, and faster, too. See:
Cumulative sum of values by month, filling in for missing months
How to get rows by max(date) group by Year-Month in Postgres?
Also note the use of dollar-quotes to avoid quoting hell. See:
Insert text with single quotes in PostgreSQL
Months without entries don't show up in the result, and no markings for an existing month show as NULL. You can adapt either if need be. See:
Join a count query on generate_series() and retrieve Null values as '0'
Then execute the generated query.
db<>fiddle here (reusing Edouard's fiddle, kudos!)
See:
Execute a dynamic crosstab query
In psql
In psql you can use \gexec to immediately execute the generated query. See:
Simulate CREATE DATABASE IF NOT EXISTS for PostgreSQL?
In Postgres 9.6 or later, you can also use the meta-command \crosstabview instead of crosstab():
test=> SELECT to_char(date_trunc('month', creation_date), 'YYYY_Month') AS year_month
test-> , marking
test-> , COUNT(*) AS count
test-> FROM invoices
test-> GROUP BY date_trunc('month', creation_date), 2
test-> ORDER BY date_trunc('month', creation_date)\crosstabview
year_month | Not Received | Delivered | Not Delivered
----------------+--------------+-----------+---------------
2020_January | 1 | 1 | 1
2020_March | | 2 | 2
2021_January | 1 | 1 | 2
2021_February | 1 | |
2021_March | | 1 |
2021_August | 2 | 1 | 1
2022_August | | 2 |
2022_November | 1 | 2 | 3
2022_December | 2 | |
(9 rows)
Note that \crosstabview - unlike crosstab() - does not support "extra" columns. If you insist on separate year and month columns, you need crosstab().
See:
How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?

PostgreSQL - SQL function to loop through all months of the year and pull 10 random records from each

I am attempting to pull 10 random records from each month of this year using the query below, but I get the error "ERROR: relation "c1" does not exist".
Not sure where I'm going wrong. I think I may be using MySQL syntax instead, but how do I resolve this?
My desired output is like this
Month
Another header
2021-01
random email 1
2021-01
random email 2
total of ten random emails from January, then ten more for each month this year (til November of course as Dec yet to happen)..
With CTE AS
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
DISTINCT(TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM')) AS month
,CASE
WHEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
)
SELECT *
FROM CTE C1
LEFT JOIN
(SELECT RN
,month
,email
FROM CTE C2
WHERE C2.month = C1.month
ORDER BY RANDOM() LIMIT 10) C3
ON C1.RN = C3.RN
ORDER By month ASC
You can't reference an outer table inside a derived table with a regular join. You need to use left join lateral to make that work.
I did end up finding a more elegant solution to my query via this source from GitHub:
SELECT
month
,email
FROM
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM') AS month
,CASE
WHEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
) q
WHERE
RN <=10
ORDER BY month ASC
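What the Row_Number() approach computes is a random sample of at most ten rows per month. The following Python sketch shows the same idea on in-memory data; the function name `sample_per_month`, the `rng` parameter and the example addresses are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_month(rows, k=10, rng=None):
    """Pick up to k random emails per month from (month, email) pairs."""
    rng = rng or random.Random()
    # Partition the rows by month, like PARTITION BY month.
    by_month = defaultdict(list)
    for month, email in rows:
        by_month[month].append(email)
    # Sampling without replacement matches "number rows in random order,
    # keep RN <= k"; months with fewer than k rows keep everything.
    return {
        month: rng.sample(emails, min(k, len(emails)))
        for month, emails in sorted(by_month.items())
    }


rows = [("2021-01", f"user{i}@example.com") for i in range(25)]
rows += [("2021-02", f"other{i}@example.com") for i in range(3)]
for month, emails in sample_per_month(rows).items():
    print(month, len(emails))
```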

TSQL Fuzzy address matching grouping, 2019 Edition

I have a situation where people have asked me to group records on bad addresses. I have to work with the tools and environment I have; the Google API or third-party data science tools are not an option. I did my homework and found posts that are several years old, so I still want to check whether any updates are available.
In my scenario, people want IDs 1-6 grouped into a single record; I added the rest as negative tests.
SELECT * INTO #t FROM ( --test data: select * from #t drop table #t
SELECT 1 Id, '1 CROLANA HEIGHTS' Adr UNION -- A vs O
SELECT 2 Id, '1 CROLONA HEIGHTS' Adr union
SELECT 3 Id, '1 CROLONA HEIGHT DRIVE' Adr union
SELECT 4 Id,'1 CROLONA HEIGHTS DR' Adr union
SELECT 5 Id, '1 CROLONA HGHTS DR' Adr union
SELECT 6 Id, '1 CROLONA HTS DR' Adr UNION
---------------------------------------- rest should not match
SELECT 7 Id, '1 CORWING DR' Adr UNION
SELECT 8 Id, '1 SUNNYHILL DRIVE' Adr UNION
SELECT 9 Id, '1 CROWN HILL DR' Adr UNION
SELECT 10 Id, '1 ADDISON DRv' Adr ) a
------------------- and below is my fuzzy working script which can be improved)
SELECT id, adr, LEAD(adr,1) OVER ( ORDER BY adr ) adr_lead,
SOUNDEX(adr) Sdx, DIFFERENCE(adr, LEAD(adr,1) OVER ( ORDER BY adr )) diff
--- SOUNDEX(adr), COUNT(*) c
FROM #t
--GROUP BY SOUNDEX(adr)
WHERE SOUNDEX(adr) = SOUNDEX('1 CROLANA HEIGHTS')
There are suggestions, which I gladly take. I'm using intelligent replacement at the end of the string and on standalone words to improve the data.
DECLARE @st VARCHAR(100) = 'La_Beg_10 La_midleMacy La' --replace at the end of string
SELECT 'ryba', @st, '-->' f, CASE WHEN @st LIKE '%' + ' La'
THEN SUBSTRING(@st,1,LEN(@st) - LEN('La')) + 'Lane' ELSE @st END N
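The "replace standalone words, then compare" idea can be prototyped outside the database. The Python sketch below is only an illustration under stated assumptions: the abbreviation table is invented for the sample data, and the similarity score comes from the standard library's difflib, which is a different technique from SOUNDEX/DIFFERENCE. Unlike the CASE expression above, it replaces a matching standalone word anywhere in the string, not only at the end.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would need many more entries.
ABBREVIATIONS = {
    "HTS": "HEIGHTS",
    "HGHTS": "HEIGHTS",
    "HEIGHT": "HEIGHTS",
    "DR": "DRIVE",
    "LA": "LANE",
}


def normalize(address):
    """Upper-case the address and expand standalone abbreviated words."""
    tokens = address.upper().split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a, b):
    """Score between 0 and 1 for how alike two normalized addresses are."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("1 CROLONA HTS DR"))
print(similarity("1 CROLANA HEIGHTS", "1 CROLONA HEIGHTS"))
print(similarity("1 CROLONA HEIGHTS", "1 SUNNYHILL DRIVE"))
```

A threshold on the score would still have to be chosen by inspecting real data, since short addresses can look similar by accident.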

Include value from cte when it has not match

In my table, no entry is older than 2016-01-04 (January 4, 2016) according to the table's date column.
Now I would like to make a query which more or less counts the number of rows which have a specific date value, but I'd like this query to be able to return a 0 count for dates not present in table.
I have this:
with date_count as (
select '2016-01-01'::date + CAST(offs || ' days' as interval) as date
from generate_series(0, 6, 1) AS offs
)
select date_count.date, count(allocation_id) as packs_used
from medicine_allocation, date_count
where site_id = 1 and allocation_id is not null
and timestamp between date_count.date and date_count.date + interval '1 days'
group by date_count.date
order by date_count.date;
This surely gives me a nice aggregated view of the date in my table, but since no rows are from before January 4 2016, they don't show in the result:
"2016-01-04 00:00:00";1
"2016-01-05 00:00:00";2
"2016-01-06 00:00:00";4
"2016-01-07 00:00:00";3
I would like this:
"2016-01-01 00:00:00";0
"2016-01-02 00:00:00";0
"2016-01-03 00:00:00";0
"2016-01-04 00:00:00";1
"2016-01-05 00:00:00";2
"2016-01-06 00:00:00";4
"2016-01-07 00:00:00";3
I have also tried right join on the cte, but this yields the same result. I cannot quite grasp how to do this... any help out there?
Best,
Janus
You simply need a left join:
with date_count as (
select '2016-01-01'::date + CAST(offs || ' days' as interval) as date
from generate_series(0, 6, 1) AS offs
)
select dc.date, count(ma.allocation_id) as packs_used
from date_count dc left join
medicine_allocation ma
on ma.site_id = 1 and ma.allocation_id is not null and
ma.timestamp between dc.date and dc.date + interval '1 days'
group by dc.date
order by dc.date;
A word of advice: Never use commas in the FROM clause. Always use explicit JOIN syntax.
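You will also notice that the where conditions were moved to the ON clause. That is necessary because they are on the second table: in a WHERE clause they would filter out the rows for dates without a match, which turns the left join back into an inner join.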
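The effect of the left join is "start from every date, then attach whatever matches". A minimal Python sketch of the same logic follows; the function name `counts_per_day` and the sample event dates are assumptions, and events are matched per calendar day rather than with the BETWEEN range used in the query.

```python
from collections import Counter
from datetime import date, timedelta


def counts_per_day(start, days, event_dates):
    """Return (day, count) for every day in the range, 0 when nothing matched."""
    counts = Counter(event_dates)
    result = []
    for offset in range(days):  # plays the role of generate_series(0, days - 1)
        day = start + timedelta(days=offset)
        # Looking up with a default of 0 is what the left join plus
        # count(ma.allocation_id) achieves for unmatched dates.
        result.append((day, counts.get(day, 0)))
    return result


events = [date(2016, 1, 4), date(2016, 1, 5), date(2016, 1, 5)]
for day, used in counts_per_day(date(2016, 1, 1), 7, events):
    print(day, used)
```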

T-SQL - how to get around the order by restriction in CTEs

I have the following CTE. Its purpose is to provide unique Month/Year pairs. Later code will use the CTE to produce a concatenated string list of the Month/Year pairs.
;WITH tblStoredWillsInPeriod AS
(
SELECT DISTINCT Kctc.GetMonthAndYearString(DateWillReceived) Month
FROM Kctc.StoredWills
WHERE DateWillReceived BETWEEN '2010/01/01' AND '2010/03/31'
ORDER BY DateWillReceived
)
I have omitted the implementation of the GetMonthAndYearString function as it is trivial.
Edit: As requested by Martin, here is the surrounding code:
DECLARE @PivotColumnHeaders nvarchar(MAX)
--CTE declaration as above---
SELECT @PivotColumnHeaders =
COALESCE(
@PivotColumnHeaders + ',[' + Month + ']',
'[' + Month + ']'
)
FROM tblStoredWillsInPeriod
SELECT @PivotColumnHeaders
Sadly, it seems T-SQL is always one step ahead. When I run this code, it tells me I'm not allowed to use ORDER BY in a CTE unless I also use TOP (or FOR XML, whatever that is.) If I use TOP, it tells me I can't use it with DISTINCT. Yup, T-SQL has all the answers.
Can anyone think of a solution to this problem which is quicker than simply slashing my wrists? I understand that death from blood loss can be surprisingly lingering, and I have deadlines to meet.
Thanks for your help.
David
Will this work?
DECLARE @PivotColumnHeaders VARCHAR(MAX)
;WITH StoredWills AS
(
SELECT GETDATE() AS DateWillReceived
UNION ALL
SELECT '2010-03-14 11:48:07.580'
UNION ALL
SELECT '2010-03-12 11:48:07.580'
UNION ALL
SELECT '2010-02-12 11:48:07.580'
),
tblStoredWillsInPeriod AS
(
SELECT DISTINCT STUFF(RIGHT(convert(VARCHAR, DateWillReceived, 106),8), 4, 1, '-') AS MMMYYYY,
DatePart(Year,DateWillReceived) AS Year,
DatePart(Month,DateWillReceived) AS Month
FROM StoredWills
WHERE DateWillReceived BETWEEN '2010-01-01' AND '2010-03-31'
)
SELECT @PivotColumnHeaders =
COALESCE(
@PivotColumnHeaders + ',[' + MMMYYYY + ']',
'[' + MMMYYYY + ']'
)
FROM tblStoredWillsInPeriod
ORDER BY Year, Month
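The key move in this answer is to keep the numeric year and month next to the display string so that the de-duplicated rows can be sorted chronologically. A small Python sketch of the same idea is shown here; the function name `pivot_headers` and the hard-coded English month abbreviations are assumptions made for illustration.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def pivot_headers(dates, start, end):
    """Build '[Mon-YYYY],...' from the distinct months in [start, end]."""
    # A set of (year, month) tuples de-duplicates like DISTINCT, and sorting
    # the tuples gives chronological order, which sorting the display
    # strings alone would not.
    pairs = sorted({(d.year, d.month) for d in dates if start <= d <= end})
    return ",".join(f"[{MONTHS[month - 1]}-{year}]" for year, month in pairs)


dates = [date(2010, 3, 14), date(2010, 3, 12), date(2010, 2, 12), date(2011, 1, 1)]
print(pivot_headers(dates, date(2010, 1, 1), date(2010, 3, 31)))
```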
Could you clarify why you need the data in the CTE to be ordered, and why you are not able to order the data in the query that uses the CTE? Remember, data in an ordinary subquery can't be ordered either.
What about?
;WITH tblStoredWillsInPeriod AS
(
SELECT DISTINCT Kctc.GetMonthAndYearString(DateWillReceived) Month
FROM Kctc.StoredWills
WHERE DateWillReceived BETWEEN '2010/01/01' AND '2010/03/31'
),
tblStoredWillsInPeriodOrdered AS
(
SELECT TOP 100 PERCENT Month
FROM tblStoredWillsInPeriod
ORDER BY Month
)
And you think you know T-SQL syntax!
Turns out I was wrong about not being able to use TOP and DISTINCT together.
This yields a syntax error...
SELECT TOP 100 PERCENT DISTINCT...
whereas this is absolutely fine...
SELECT DISTINCT TOP 100 PERCENT...
Work that one out.
One drawback is that you have to include the ORDER BY field in the SELECT list, which in all likelihood will interfere with your expected DISTINCT results. Sometimes T-SQL has you running around in circles.
But for now, my wrists are left unmarked.
SELECT DISTINCT TOP 100 PERCENT ...
ORDER BY ...