SP/View for daily aggregation of 3-hour values - PostgreSQL

I'm trying to group 3-hour forecast values into a daily table. The problem is that I need to apply non-standard aggregate operations to the values. Here is an example (provided by OpenWeather):
time temp press desc w_sp w_dir
"2017-12-20 00:00:00" -4.49 1023.42 "clear" 1.21 198.501
"2017-12-20 03:00:00" -2.51 1023.63 "clouds" 1.22 180.501
"2017-12-20 06:00:00" -0.07 1024.43 "clouds" 1.53 169.503
"2017-12-20 09:00:00" 0.57 1024.83 "snow" 1.77 138.502
"2017-12-20 12:00:00" 0.95 1024.41 "snow" 1.61 271.001
"2017-12-20 15:00:00" -0.47 1024.17 "snow" 0.61 27.5019
"2017-12-20 18:00:00" -2.52 1024.52 "clear" 1.16 13.0007
"2017-12-20 21:00:00" -2.63 1024.73 "clear" 1.07 131.504
In my case I need to derive the overall daily weather description from a mix of the top 2 occurring labels, and for wind direction I cannot AVG the 8 values; I have to apply a specific formula.
I'm familiar with SQL but not so much with Postgres stored procedures. I think I need something like a cursor, but I'm a bit lost here. I'm sure this can be achieved in many ways, but I'm asking you to point me down the path. So far I have a draft of a stored procedure, but I'm a bit clueless:
CREATE FUNCTION meteo_forecast_daily ()
RETURNS TABLE (
forecasting_date DATE,
temperature NUMERIC,
pressure NUMERIC,
description VARCHAR(20),
w_speed NUMERIC,
w_dir NUMERIC
)
AS $$
DECLARE
clouds INTEGER;
snow INTEGER;
clear INTEGER;
rain INTEGER;
thunderstorm INTEGER;
BEGIN
RETURN QUERY SELECT
m.forecasting_time::date as forecasting_date,
avg(m.temperature) as temperature,
avg(m.pressure) as pressure,
description???, -- the top-2 label mix goes here
avg(m.w_sp) as w_speed,
w_dir???? -- the custom wind-direction formula goes here
FROM
meteo_forecast_last_update m
WHERE
m.forecasting_time > now()
group by m.forecasting_time::date; -- grouping by the alias would be ambiguous with the OUT column
END; $$
LANGUAGE 'plpgsql';
Thus my question is: how can I retrieve the 8 elements for each date and process them somehow separately?
Desired result:
time temp press desc w_sp w_dir
"2017-12-20" -4.49 1023.42 "clear,clouds,rain,..." 1.21 (198.501, 212.23..)
"2017-12-21" -4.49 1023.42 "rain,snow,rain,..." 1.45 (211.501, 112.26..)
"2017-12-22" -4.49 1023.42 "clear,clouds,rain,..." 1.89 (156.501, 312.53..)
Thanks in advance and happy new year :)

You should be able to achieve this with:
SELECT m.forecasting_time::date AS forecasting_date,
AVG(m.temperature) as temperature,
AVG(m.pressure) as pressure,
STRING_AGG(DISTINCT m.description, ',') AS description,
AVG(m.w_sp) as w_speed,
ARRAY_AGG(m.w_dir) AS w_dir
FROM meteo_forecast_last_update m
WHERE m.forecasting_time > now()
GROUP BY 1 ORDER BY 1;
You may use DISTINCT inside an aggregate function; it applies the aggregate function only to the distinct values.
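Note that STRING_AGG(DISTINCT ...) collects every distinct label per day rather than the top 2, and ARRAY_AGG just hands back the raw directions. Here is a hedged sketch of how the two non-standard aggregates could be computed in plain SQL; it reuses the question's table and column names, and assumes a sine/cosine vector average is an acceptable "specific formula" for wind direction (swap in your own formula if not):

SELECT m.forecasting_time::date AS forecasting_date,
       AVG(m.temperature) AS temperature,
       AVG(m.pressure) AS pressure,
       d.description,
       AVG(m.w_sp) AS w_speed,
       -- average the directions as unit vectors; a negative result can be
       -- normalized into [0,360) by adding 360
       DEGREES(ATAN2(AVG(SIN(RADIANS(m.w_dir))),
                     AVG(COS(RADIANS(m.w_dir))))) AS w_dir
FROM meteo_forecast_last_update m
CROSS JOIN LATERAL (
    -- the two most frequent labels of that day, joined into one string
    SELECT STRING_AGG(x.description, ',') AS description
    FROM (SELECT i.description
          FROM meteo_forecast_last_update i
          WHERE i.forecasting_time::date = m.forecasting_time::date
          GROUP BY i.description
          ORDER BY COUNT(*) DESC
          LIMIT 2) AS x
) AS d
WHERE m.forecasting_time > now()
GROUP BY 1, d.description
ORDER BY 1;

Since this is pure set-based SQL, no cursor or loop is needed; the whole query can live in a view or in the RETURN QUERY of the function above.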

Related

PLSQL to TSQL - REGEXP

I'm trying to convert a script from PL/SQL to T-SQL and am stuck on a couple of lines:
table(cast(multiset(select level from dual connect by level <= len (regexp_replace(t.image, '[^**]+'))/2) as sys.OdciNumberList)) levels
where substr(REGEXP_SUBSTR (t.image, '[^**]+',1, levels.column_value),1,instr( REGEXP_SUBSTR (t.image, '[^**]+',1, levels.column_value),'=',1) -1)
IMAGE
Any help would be great.
Chris
For a better answer it would be good to include some sample input and desired results, especially when addressing a different dialect of SQL. Perhaps including a PL/SQL tag would help find someone who understands both PL/SQL and T-SQL. It would also be helpful to include DDL, specifically the datatype for "level". Again, I say this not to be critical but rather to guide you towards getting better answers here.
All that said, you can accomplish what you are trying to do in T-SQL leveraging a tally table, an N-Grams function, and a couple of other functions, which are included at the end of this post.
regexp_replace
To replace or remove characters that match a pattern in T-SQL you can use patReplace8K. Here's an example of how to use it to replace digits with *'s:
SELECT pr.NewString
FROM samd.patReplace8K('My phone number is 555-2211','[0-9]','*') AS pr;
Returns: My phone number is ***-****
regexp_substr
Here's an example of how to extract all phone numbers from a string:
DECLARE
@string VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222,
(313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
@pattern VARCHAR(50) = '%[^0-9()+.-]%';
-- EXTRACTOR
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
ItemIndex = f.position,
ItemLength = itemLen.l,
Item = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
FROM samd.NGrams8k(@string, 1) AS ng
WHERE PATINDEX(@pattern, ng.token) < --<< this token does NOT match the pattern
ABS(SIGN(ng.position-1)-1) + --<< are you the first row? OR
PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX(@pattern,f.token),0), --CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l)
WHERE itemLen.L > 6 -- this filter is more harmful to the extractor than the splitter
ORDER BY ItemNumber;
T-SQL INSTR Function
I included a T-SQL version of Oracle's INSTR function at the end of this post. Note these examples:
DECLARE
@string VARCHAR(8000) = 'AABBCC-AA123-AAXYZPDQ-AA-54321',
@search VARCHAR(8000) = '-AA',
@position INT = 1,
@occurance INT = 2;
-- 1.1. Get me the 2nd @occurance of "-AA" in @string beginning at @position 1
SELECT f.* FROM samd.instr8k(@string,@search,@position,@occurance) AS f;
-- 1.2. Retrieve everything *BEFORE* the second instance of "-AA"
SELECT
ItemIndex = f.ItemIndex,
Item = SUBSTRING(@string,1,f.itemindex-1)
FROM samd.instr8k(@string,@search,@position,@occurance) AS f;
-- 1.3. Retrieve everything *AFTER* the second instance of "-AA"
SELECT
ItemIndex = MAX(f.ItemIndex),
Item = MAX(SUBSTRING(@string,f.itemindex+f.itemLength,8000))
FROM samd.instr8k(@string,@search,@position,@occurance) AS f;
regexp_replace (ADVANCED)
Here's a more complex example, leveraging ngrams8k to replace phone numbers with the text "<REMOVED>":
DECLARE
@string VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222, (313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
@pattern VARCHAR(50) = '%[0-9()+.-]%';
SELECT NewString = (
SELECT IIF(IsMatch=1 AND patSplit.item LIKE '%[0-9][0-9][0-9]%','<REMOVED>', patSplit.item)
FROM
(
SELECT 1, i.Idx, SUBSTRING(@string,1,i.Idx), CAST(0 AS BIT)
FROM (VALUES(PATINDEX(@pattern,@string)-1)) AS i(Idx) --FROM (VALUES(PATINDEX('%'+@pattern+'%',@string)-1)) AS i(Idx)
WHERE SUBSTRING(@string,1,1) NOT LIKE @pattern
UNION ALL
SELECT r.RN,
itemLength = LEAD(r.RN,1,DATALENGTH(@string)+1) OVER (ORDER BY r.RN)-r.RN,
item = SUBSTRING(@string,r.RN,
LEAD(r.RN,1,DATALENGTH(@string)+1) OVER (ORDER BY r.RN)-r.RN),
isMatch = ABS(t.p-2+1)
FROM core.rangeAB(1,DATALENGTH(@string),1,1) AS r
CROSS APPLY (VALUES (
CAST(PATINDEX(@pattern,SUBSTRING(@string,r.RN,1)) AS BIT),
CAST(PATINDEX(@pattern,SUBSTRING(@string,r.RN-1,1)) AS BIT),
SUBSTRING(@string,r.RN,r.Op+1))) AS t(c,p,s)
WHERE t.c^t.p = 1
) AS patSplit(ItemIndex, ItemLength, Item, IsMatch)
FOR XML PATH(''), TYPE).value('.','varchar(8000)');
Returns:
Call me later at <REMOVED> or tomorrow at <REMOVED>, <REMOVED>, or at <REMOVED> before noon. Thanks!
CREATE FUNCTION core.rangeAB
(
@Low BIGINT, -- (start) Lowest number in the set
@High BIGINT, -- (stop) Highest number in the set
@Gap BIGINT, -- (step) Difference between each number in the set
@Row1 BIT -- Base: 0 or 1; should RN begin with 0 or 1?
)
/****************************************************************************************
[Purpose]:
Creates a lazy, in-memory, forward-ordered sequence of up to 531,441,000,000 integers
starting with @Low and ending with @High (inclusive). RangeAB is a pure, 100% set-based
alternative to solving SQL problems using iterative methods such as loops, cursors and
recursive CTEs. RangeAB is based on Itzik Ben-Gan's getnums function for producing a
sequence of integers and uses logic from Jeff Moden's fnTally function which includes a
parameter for determining if the "row-number" (RN) should begin with 0 or 1.
I wanted to use the name "Range" because it functions and performs almost identically to
the Range function built into Python and Clojure. RANGE is a reserved SQL keyword so I
went with "RangeAB". Functions/Algorithms developed using rangeAB can be easily ported
over to Python, Clojure or any other programming language that leverages a lazy sequence.
The two major differences between RangeAB and the Python/Clojure versions are:
1. RangeAB is *Inclusive* where the other two are *Exclusive*. range(0,3) in Python and
Clojure returns [0 1 2], core.rangeAB(0,3) returns [0 1 2 3].
2. RangeAB has a fourth parameter (@Row1) to determine if RN should begin with 0 or 1.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
SELECT r.RN, r.OP, r.N1, r.N2
FROM core.rangeAB(@Low,@High,@Gap,@Row1) AS r;
[Parameters]:
@Low = BIGINT; represents the lowest value for N1.
@High = BIGINT; represents the highest value for N1.
@Gap = BIGINT; represents how much N1 and N2 will increase each row. @Gap is also the
difference between N1 and N2.
@Row1 = BIT; represents the base (first) value of RN. When @Row1 = 0, RN begins with 0,
when @Row1 = 1 then RN begins with 1.
[Returns]:
Inline Table Valued Function returns:
RN = BIGINT; a row number that works just like T-SQL ROW_NUMBER() except that it can
start at 0 or 1 which is dictated by @Row1. If you need the numbers
(0 or 1) through @High, then use RN as your "N" value (@Row1=0 for 0, @Row1=1 for 1),
otherwise use N1.
OP = BIGINT; returns the "finite opposite" of RN. When RN begins with 0 the first number
in the set will be 0 for RN, the last number in will be 0 for OP. When returning the
numbers 1 to 10, 1 to 10 is returned in ascending order for RN and in descending
order for OP.
Given the numbers 1 to 3, 3 is the opposite of 1, 2 the opposite of 2, and 1 is the
opposite of 3. Given the numbers -1 to 2, the opposite of -1 is 2, the opposite of 0
is 1, and the opposite of 1 is 0.
The best practice is to only use OP when @Gap > 1; use core.O instead. Doing so will
improve performance by 1-2% (not huge but every little bit counts)
N1 = BIGINT; This is the "N" in your tally table/numbers function. This is your *lazy*
sequence of numbers starting at @Low and incrementing by @Gap until the next number
in the sequence is greater than @High.
N2 = BIGINT; a lazy sequence of numbers starting @Low+@Gap and incrementing by @Gap. N2
will always be greater than N1 by @Gap. N2 can also be thought of as:
LEAD(N1,1,N1+@Gap) OVER (ORDER BY RN)
[Dependencies]:
N/A
[Developer Notes]:
1. core.rangeAB returns one billion rows in exactly 90 seconds on my laptop:
4X 2.7GHz CPUs, 32 GB - multiple versions of SQL Server (2005-2019)
2. The lowest and highest possible numbers returned are whatever is allowable by a
bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3).
3. @Gap does not affect RN; RN will begin at @Row1 and increase by 1 until the last row
unless it's used in a subquery where a filter is applied to RN.
4. @Gap must be greater than 0 or the function will not return any rows.
5. Keep in mind that when @Row1 is 0 then the highest RN value (ROWNUMBER) will be the
number of rows returned minus 1
6. If all you need is a sequential set beginning at 0 or 1 then, for best performance,
use the RN column. Use N1 and/or N2 when you need to begin your sequence at any
number other than 0 or 1 or if you need a gap between your sequence of numbers.
7. Although @Gap is a bigint it must be a positive integer or the function will
not return any rows.
8. The function will not return any rows when one of the following conditions is true:
* any of the input parameters are NULL
* @High is less than @Low
* @Gap is not greater than 0
To force the function to return all NULLs instead of not returning anything you can
add the following code to the end of the query:
UNION ALL
SELECT NULL, NULL, NULL, NULL
WHERE NOT (@High&@Low&@Gap&@Row1 IS NOT NULL AND @High >= @Low AND @Gap > 0)
This code was excluded as it adds a ~5% performance penalty.
9. There is no performance penalty for sorting by RN ASC; there is a large performance
penalty, however, for sorting in descending order. If you need a descending sort then
use OP in place of RN and sort by RN ASC.
10. When setting @Row1 to 0 and sorting by RN you will see that the 0 is added via
MERGE JOIN concatenation. Under the hood the function is essentially concatenating
but, because it's using a MERGE JOIN operator instead of concatenation, the cost
estimations are needlessly high. You can circumvent this problem by changing:
ORDER BY core.rangeAB.RN to: ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
[Examples]:
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20140518 - Initial Development - AJB
Rev 05 - 20191122 - Developed this "core" version for open source distribution;
updated notes and did some final code clean-up
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally(RN) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
SELECT
RN = 0,
OP = (@High-@Low)/@Gap,
N1 = @Low,
N2 = @Gap+@Low
WHERE @Row1 = 0
UNION ALL -- (@High-@Low)/@Gap+1:
SELECT TOP (ABS((ISNULL(@High,0)-ISNULL(@Low,0))/ISNULL(@Gap,0)+ISNULL(@Row1,1)))
RN = i.RN,
OP = (@High-@Low)/@Gap+(2*@Row1)-i.RN,
N1 = (i.rn-@Row1)*@Gap+@Low,
N2 = (i.rn-(@Row1-1))*@Gap+@Low
FROM iTally AS i
ORDER BY i.RN
) AS r
WHERE @High&@Low&@Gap&@Row1 IS NOT NULL AND @High >= @Low
AND @Gap > 0;
GO
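For orientation, here's a quick smoke test of rangeAB (the expected values follow from the parameter docs above: RN counts up, OP counts down, N1 mirrors RN when @Low=1 and @Gap=1, and N2 runs one @Gap ahead):

SELECT r.RN, r.OP, r.N1, r.N2
FROM core.rangeAB(1, 5, 1, 1) AS r;
-- RN: 1,2,3,4,5   OP: 5,4,3,2,1   N1: 1,2,3,4,5   N2: 2,3,4,5,6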
CREATE FUNCTION samd.ngrams8k
(
@String VARCHAR(8000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@String). Accepts strings up to 8000 varchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.Position, ng.Token
FROM samd.ngrams8k(@String,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.Position, ng.Token
FROM dbo.SomeTable AS s
CROSS APPLY samd.ngrams8k(s.SomeValue,@N) AS ng;
[Parameters]:
@String = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = BIGINT; the position of the token in the input string
Token = VARCHAR(8000); a @N-sized character-level N-Gram token
[Dependencies]:
1. core.rangeAB (iTVF)
[Developer Notes]:
1. ngrams8k is not case sensitive;
2. Many functions that use ngrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @String or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input force the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@String)) OR (@N IS NULL OR @String IS NULL)
4. ngrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Split the string, "abcd" into unigrams, bigrams and trigrams
SELECT ng.Position, ng.Token FROM samd.ngrams8k('abcd',1) AS ng; -- unigrams (@N=1)
SELECT ng.Position, ng.Token FROM samd.ngrams8k('abcd',2) AS ng; -- bigrams (@N=2)
SELECT ng.Position, ng.Token FROM samd.ngrams8k('abcd',3) AS ng; -- trigrams (@N=3)
[Revision History]:
------------------------------------------------------------------------------------------
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 05 - 20171228 - Small simplification; changed:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@String,''))-(ISNULL(@N,1)-1)),0)))
to:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@String,''))+1-ISNULL(@N,1)),0)))
Rev 06 - 20180612 - Using CHECKSUM(N) to convert N in the token output instead of
using CAST(N AS int); CHECKSUM removes the need to convert to int.
Rev 07 - 20180612 - Re-designed to use core.rangeAB - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
Position = r.RN,
Token = SUBSTRING(@String,CHECKSUM(r.RN),@N)
FROM core.rangeAB(1,LEN(@String)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@String);
GO
CREATE FUNCTION samd.patReplace8K
(
@string VARCHAR(8000),
@pattern VARCHAR(50),
@replace VARCHAR(20)
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace),
patReplace8K will replace any character in @string that matches the @pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplace8K(@String,@Pattern,@Replace) AS pr;
[Developer Notes]:
1. Requires SQL Server 2008+
2. @Pattern IS case sensitive but can be easily modified to make it case insensitive
3. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
4. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
[Examples]:
--===== 1. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplace8K('My phone number is 555-2211','[0-9]','*') AS pr;
[Revision History]:
Rev 00 - 10/27/2014 - Initial Development - Alan Burstein
Rev 01 - 10/29/2014 - Alan Burstein
- Redesigned based on dbo.STRIP_NUM_EE by Eirikur Eiriksson
(see: http://www.sqlservercentral.com/Forums/Topic1585850-391-2.aspx)
- changed how the cte tally table is created
- put the include/exclude logic in a CASE statement instead of a WHERE clause
- added Latin1_General_BIN collation
- added code to use the pattern as a parameter
Rev 02 - 20141106
- Added final performance enhancement (more kudos to Eirikur Eiriksson)
- Put 0 = PATINDEX filter logic into the WHERE clause
Rev 03 - 20150516
- Updated to deal with special XML characters
Rev 04 - 20170320
- changed @replace from char(1) to varchar(1) to address how spaces are handled
Rev 05 - Re-write using samd.NGrams
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
WHEN @pattern+@replace+@string IS NOT NULL THEN
CASE WHEN PATINDEX(@pattern,ng.token COLLATE Latin1_General_BIN)=0
THEN ng.token ELSE @replace END END
FROM samd.NGrams8K(@string, 1) AS ng
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('text()[1]', 'VARCHAR(8000)');
GO
CREATE FUNCTION samd.Instr8k
(
@string VARCHAR(8000),
@search VARCHAR(8000),
@position INT,
@occurance INT
)
/*****************************************************************************************
[Purpose]:
Returns the position (ItemIndex) of the Nth (@occurance) occurrence of one string (@search)
within another (@string). Similar to Oracle's PL/SQL INSTR function.
https://www.techonthenet.com/oracle/functions/instr.php
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT ins.ItemIndex, ins.ItemLength, ins.ItemCount
FROM samd.Instr8k(@string,@search,@position,@occurance) AS ins;
--===== Against a table using APPLY
SELECT s.SomeID, ins.ItemIndex, ins.ItemLength, ins.ItemCount
FROM dbo.SomeTable AS s
CROSS APPLY samd.Instr8k(s.string,@search,@position,@occurance) AS ins
[Parameters]:
@string = VARCHAR(8000); Input string to evaluate
@search = VARCHAR(8000); Token to search for inside of @string
@position = INT; Where to begin searching for @search; identical to the third
parameter in SQL Server CHARINDEX [, start_location]
@occurance = INT; Represents the Nth instance of the search string (@search)
[Returns]:
ItemIndex = Position of the Nth (@occurance) instance of @search inside @string
ItemLength = Length of @search (in case you need it, no need to re-evaluate the string)
ItemCount = Number of times @search appears inside @string
[Dependencies]:
1. samd.ngrams8k
1.1. core.rangeAB (iTVF)
2. samd.substringCount8K_lazy
[Developer Notes]:
1. samd.Instr8k does not treat the input strings (@string and @search) as case sensitive.
2. Don't use instr8k for "SubstringBetween" functionality; for better performance use
samd.SubstringBetween8k instead.
3. The @position parameter is the key benefit of this function when dealing with long
strings where the search item is towards the back of the string. For example, take a
5000 character string where what you are looking for is always *at least* 3000
characters deep. Setting @position to 3000 will dramatically improve performance.
4. Unlike Oracle's PL/SQL INSTR function, Instr8k does not accept numbers less than 1.
[Examples]:
[Revision History]:
------------------------------------------------------------------------------------------
Rev 00 - 20191112 - Initial Development - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
ItemIndex = ISNULL(MAX(ISNULL(instr.Position,1)+(a.Pos-1)),0),
ItemLength = ISNULL(MAX(LEN(@search)),LEN(@search)),
ItemCount = ISNULL(MAX(items.SubstringCount),0)
FROM (VALUES(ISNULL(@position,1),LEN(@search))) AS a(Pos,SrchLn)
CROSS APPLY (VALUES(SUBSTRING(@string,a.Pos,8000))) AS f(String)
CROSS APPLY samd.substringCount8K_lazy(f.string,@search) AS items
CROSS APPLY
(
SELECT TOP (@occurance) RN = ROW_NUMBER() OVER (ORDER BY ng.position), ng.position
FROM samd.ngrams8k(f.string,a.SrchLn) AS ng
WHERE ng.token = @search
ORDER BY RN
) AS instr
WHERE a.Pos > 0
AND @occurance <= items.SubstringCount
AND instr.RN = @occurance;
GO
CREATE FUNCTION samd.substringCount8K_lazy
(
@string varchar(8000),
@searchstring varchar(1000)
)
/*****************************************************************************************
[Purpose]:
Scans the input string (@string) and counts how many times the search string
(@searchstring) appears. This function is based on Itzik Ben-Gan's cte numbers table logic.
[Compatibility]:
SQL Server 2008+
Uses the table VALUES constructor (not available pre-2008)
[Author]: Alan Burstein
[Syntax]:
--===== Autonomous
SELECT f.substringCount
FROM samd.substringCount8K_lazy(@string,@searchString) AS f;
--===== Against a table using APPLY
SELECT f.substringCount
FROM dbo.someTable AS t
CROSS APPLY samd.substringCount8K_lazy(t.col, @searchString) AS f;
[Parameters]:
@string = VARCHAR(8000); input string to analyze
@searchString = VARCHAR(1000); substring to search for
[Returns]:
Inline table valued function returns -
substringCount = int; Number of times that @searchString appears in @string
[Developer Notes]:
1. substringCount8K_lazy does NOT take overlapping values into consideration. For
example, this query will return a 1 but the correct result is 2:
SELECT substringCount FROM samd.substringCount8K_lazy('xxx','xx')
When overlapping values are a possibility or concern then use substringCountAdvanced8k
2. substringCount8K_lazy is what is referred to as an "inline" scalar UDF. Technically
it's an inline table valued function (iTVF) but it performs the same task as a scalar
valued user defined function (UDF); the difference is that it requires the APPLY table
operator to accept column values as a parameter. For more about "inline" scalar UDFs
see this article by SQL MVP Jeff Moden:
http://www.sqlservercentral.com/articles/T-SQL/91724/
and for more about how to use APPLY see this article by SQL MVP Paul White:
http://www.sqlservercentral.com/articles/APPLY/69953/.
Note the above syntax example and usage examples below to better understand how to
use the function. Although the function is slightly more complicated to use than a
scalar UDF it will yield notably better performance for many reasons. For example,
unlike a scalar UDF or multi-line table valued function, the inline scalar UDF does
not restrict the query optimizer's ability to generate a parallel query execution plan.
3. substringCount8K_lazy returns NULL when either input parameter is NULL and returns 0
when either input parameter is blank.
4. substringCount8K_lazy does not treat parameters as case sensitive.
5. substringCount8K_lazy is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. How many times does the substring "abc" appear?
SELECT f.* FROM samd.substringCount8k_lazy('abc123xxxabc','abc') AS f;
--===== 2. Return records from a table where the substring "ab" appears more than once
DECLARE @table TABLE (string varchar(8000));
DECLARE @searchString varchar(1000) = 'ab';
INSERT @table VALUES ('abcabc'),('abcd'),('bababab'),('baba'),(NULL);
SELECT searchString = @searchString, t.string, f.substringCount
FROM @table AS t
CROSS APPLY samd.substringCount8k_lazy(string,'ab') AS f
WHERE f.substringCount > 1;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20180625 - Initial Development - Alan Burstein
Rev 01 - 20190102 - Added logic to better handle @searchstring = char(32) - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT substringCount = (LEN(v.s)-LEN(REPLACE(v.s,v.st,'')))/d.l
FROM (VALUES(DATALENGTH(@searchstring))) AS d(l)
CROSS APPLY (VALUES(@string,CASE WHEN d.l>0 THEN @searchstring END)) AS v(s,st);
GO

How to make a self-referential window function

I have a table like this:
amount type app owe
1 a 10 10
2 a 8 -2
3 a 20 12
4 i 30 10
5 a 40 10
owe is:
owe is:
  if type = 'a':  app - sum(owe) over rows where amount < (amount for current row)
  otherwise:      max(app - sum(owe) over rows where amount < (amount for current row), 0)
So I'd need a window function over the very column the window function produces. There are frames like ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING, but the sum has to run over a different column, not the column I'm producing. Is there a way to reference the same column the window function is on?
I tried an alias
case
when type = 'a'
then app - sum(owe) over (ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
else
greatest(0, app - sum(owe) over (ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING))
end as owe
But since owe doesn't exist when I made it, I get:
owe doesn't exist.
Is there some other way?
You cannot do that with window functions. Your only chance using SQL is a recursive CTE. Note that PostgreSQL allows neither an aggregate over the recursive self-reference nor ORDER BY/LIMIT before UNION ALL without parentheses, so the running sum has to be carried along as an extra column:
WITH RECURSIVE tab_owe AS (
   (SELECT amount, type, app,
           CASE WHEN type = 'a'
                THEN app
                ELSE GREATEST(app, 0)
           END AS owe,
           CASE WHEN type = 'a'
                THEN app
                ELSE GREATEST(app, 0)
           END AS owe_sum
    FROM tab
    ORDER BY amount
    LIMIT 1)
UNION ALL
   SELECT t.amount, t.type, t.app, x.owe, tab_owe.owe_sum + x.owe
   FROM tab_owe
   CROSS JOIN LATERAL (SELECT amount, type, app
                       FROM tab
                       WHERE amount > tab_owe.amount
                       ORDER BY amount
                       LIMIT 1) AS t
   CROSS JOIN LATERAL (VALUES (CASE WHEN t.type = 'a'
                                    THEN t.app - tab_owe.owe_sum
                                    ELSE GREATEST(t.app - tab_owe.owe_sum, 0)
                               END)) AS x(owe)
)
SELECT amount, type, app, owe
FROM tab_owe;
(untested)
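If the logic above is right, the question's sample data should come back with the owe values from the original post:
 amount | type | app | owe
--------+------+-----+-----
      1 | a    |  10 |  10
      2 | a    |   8 |  -2
      3 | a    |  20 |  12
      4 | i    |  30 |  10
      5 | a    |  40 |  10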
This would be much easier to write in procedural code, so consider using a table function.
This is what I came up with. Of course, I'm not a real programmer, so I'm sure there's a smarter way:
insert into mort (amount, "type", app)
values
(1,'a',10),
(2,'a',8),
(3,'a',20),
(4,'i',30),
(5,'a',40);
CREATE OR REPLACE FUNCTION mort_v ()
RETURNS TABLE (
zamount int,
ztype text,
zapp int,
zowe double precision
) AS $$
DECLARE
var_r record;
charlie double precision;
sam double precision;
BEGIN
charlie = 0;
FOR var_r IN(SELECT
amount,
"type",
app
FROM mort order by 1)
LOOP
zamount = var_r.amount;
ztype = var_r.type;
zapp = var_r.app;
sam = var_r.app - charlie;
if ztype = 'a' then
zowe = sam;
else
zowe = greatest(sam, 0);
end if;
charlie = charlie + zowe;
RETURN NEXT;
END LOOP;
END; $$
LANGUAGE 'plpgsql';
select * from mort_v()
So with my limited skills you'll notice I had to add a 'z' in front of the columns that are already in the table so I can spit them out again. If your table has 30 columns you'd normally have to do this 30 times. But I asked a real engineer, and he mentioned that if you just spit out the primary key with the calculated column, you can join it back to the original table. That's smarter than what I have; if there's an even better solution, that would be great. This does serve as a nice reference for how to do something like a cursor in Postgres, and how to make variables without an '@' in front like in MS SQL Server. A sketch of the join-back idea is below.
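Here's a minimal sketch of that join-back approach, assuming amount is the primary key of mort (the function name mort_owe is made up for the example):
CREATE OR REPLACE FUNCTION mort_owe()
RETURNS TABLE (
zamount int,
zowe double precision
) AS $$
DECLARE
var_r record;
charlie double precision := 0;
BEGIN
FOR var_r IN (SELECT amount, "type", app FROM mort ORDER BY 1)
LOOP
zamount := var_r.amount;
IF var_r.type = 'a' THEN
zowe := var_r.app - charlie;
ELSE
zowe := greatest(var_r.app - charlie, 0);
END IF;
charlie := charlie + zowe;
RETURN NEXT;
END LOOP;
END; $$
LANGUAGE plpgsql;
-- only the key and the calculated column come back; the rest is joined in,
-- so the function body never has to repeat all 30 columns
SELECT m.*, o.zowe AS owe
FROM mort AS m
JOIN mort_owe() AS o ON o.zamount = m.amount;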

Why does PostgreSQL write huge temporary files and fill my disk within a loop?

Problem: the function (below, on PostgreSQL 9.3) runs fine with few iterations, but with many iterations it writes a ~1 GB file to disk on each iteration until the disk is full, and the code then terminates with a write failure.
Question: is there a way to not write these files to disk, or some other way to circumvent the problem? Ideally I would like to leave the code running overnight and analyse the results the next day.
The tables are supposed to be overwritten every iteration, so I don't understand why it fills my disk. In my previous attempts it also ran out of memory, but I increased max_locks_per_transaction from 64 to 256 in postgresql.conf.
What am I doing:
I have a function whose parameters control the loops inside: a start and an end timestamp, a time bin (delta), a time span, and a time jump. Something like this: SELECT ib_run2('2009-06-28 13:30:00', '2009-06-29 13:50:59', '10 minute', '0.5 hour', '24 hour');
So the function divides the time between start and stop into bins; in this example, the time from 2009-06-28 13:30:00 is divided into 10-minute intervals for 0.5 hour, then it jumps 24 hours and does that again, until 2009-06-29 13:50:59.
For each 10-minute bin some calculations are made on a spatiotemporal dataset including selection by time and location and calculated distances.
Inside the function there is an unavoidable sequential scan of a big table (6,154,794 rows) and of several smaller ones with selection of a subset from each. The function performs calculations on these subsets and writes results into created tables.
All tables are created with CREATE TABLE. Tables starting with IB_000_ are created before the loops and updated with INSERT INTO inside the loops. Tables starting with IB_i_ are dropped and recreated within the loops each iteration.
Calculations for the IB_i_ tables involve other IB_i_ tables created within the same iteration, or external tables.
The function:
CREATE OR REPLACE FUNCTION ib_run2(
start_dt TEXT DEFAULT '2009-06-28 13:30:00'
, end_dt TEXT DEFAULT '2009-06-28 13:59:59'
, deltat TEXT DEFAULT '10 minute'
, spant TEXT DEFAULT '2 hour'
, jump_txt TEXT DEFAULT '24 hour'
) RETURNS TEXT AS
$func$
DECLARE n INT DEFAULT 1; DECLARE m INT DEFAULT 1; DECLARE iteration INT DEFAULT 0;
DECLARE delta INTERVAL; DECLARE span INTERVAL; DECLARE jump INTERVAL;
DECLARE mytext TEXT DEFAULT 'iMarinka';
DECLARE start_time_query TIMESTAMP DEFAULT now();
DECLARE dt0 TIMESTAMP;
DECLARE dt1 TIMESTAMP;
DECLARE dt TIMESTAMP;
BEGIN
dt0:=start_dt :: TIMESTAMP;
dt1:=end_dt :: TIMESTAMP;
delta:=deltat :: INTERVAL;
span:=spant :: INTERVAL;
jump:=jump_txt :: INTERVAL;
iteration:=0;
n:=ceiling(extract(EPOCH FROM (dt1-dt0) )*1.0/extract(EPOCH FROM (jump ) ));
m:=ceiling(extract(EPOCH FROM ( (dt0+span) -dt0) )*1.0/extract(EPOCH FROM (delta) ));
DROP TABLE IF EXISTS IB_000_times;
CREATE TABLE IB_000_times (
gid serial primary key, i INT, j INT
, t_from_v TIMESTAMP, t_to_v TIMESTAMP
, t_from_c TIMESTAMP, t_to_c TIMESTAMP
, t_day TEXT, date_t DATE, t TIME
, delta_t INTERVAL
, dt0 TIMESTAMP, dt1 TIMESTAMP
, dt TIMESTAMP, delta INTERVAL, span INTERVAL , jump INTERVAL );
mytext:=(m+1)*(n+1)||' iterations '||n+1||' of i '||m+1||' of j'; RAISE NOTICE '%', mytext;
FOR i IN 0..n LOOP -----------------------------------------
FOR j IN 0..m LOOP -----------------------------------------
dt := dt0 + j * delta + i * jump;
iteration := iteration + 1;
DROP TABLE IF EXISTS IB_i_times;
CREATE TABLE IB_i_times AS (
WITH a AS (SELECT dt::DATE date_t, dt::TIME t , delta delta_t)
SELECT date_t+ t - delta_t AS t_from_v
, date_t+ t AS t_to_v
, date_t+ t AS t_from_c
, date_t+ t + delta_t AS t_to_c
, to_char(date_t, 'day') AS t_day
, a.date_t , a.t, a.delta_t
FROM a
);
INSERT INTO IB_000_times (i , j,
t_from_v , t_to_v , t_from_c , t_to_c , t_day , date_t , t , delta_t ,
dt0 , dt1 , delta , span , jump , dt)
SELECT i,j, t.t_from_v, t.t_to_v, t.t_from_c, t.t_to_c, t.t_day, t.date_t, t.t, t.delta_t ,
dt0 , dt1 , delta , span , jump , dt
FROM IB_i_times t;
COPY ( select * FROM IB_000_times ) TO '/Volumes/1TB/temp/IB_000_times.csv' CSV HEADER DELIMITER ';' ;
mytext := iteration||'/'||(n+1)*(m+1)||' -----> '||' dt= '||to_char(dt,'YYYY-MM-DD HH24:MI:SS'); RAISE NOTICE '%', mytext;
mytext := 'Fin '||': i='||i||', dt='|| to_char(dt,'YYYY-MM-DD HH24:MI:SS')||', started '||start_time_query;
END LOOP;----------------------------------------------------
END LOOP;----------------------------------------------------
RETURN mytext;
END;
$func$
LANGUAGE plpgsql;
Besides the tables IB_i_times and IB_000_times there are a bunch of other tables (not shown here to save space; the code has ~500 lines) that the function creates before (and some inside) the loops and updates inside the loops.
It's hard to say why Postgres generates temp files from this source code. Enable temporary-file logging with log_temp_files; once you identify the statements that produce temp files, you can identify the reason. Usually the cause is a too-small work_mem. From the documentation:
Controls logging of temporary file names and sizes. Temporary files can be created for sorts, hashes, and temporary query results. A log entry is made for each temporary file when it is deleted. A value of zero logs all temporary file information, while positive values log only files whose size is greater than or equal to the specified number of kilobytes. The default setting is -1, which disables such logging. Only superusers can change this setting. https://www.postgresql.org/docs/current/static/runtime-config-logging.html
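A minimal sketch of that diagnosis (log_temp_files and work_mem are real PostgreSQL settings; the values here are only examples). As a superuser, in the session that runs the function:
SET log_temp_files = 0;   -- log every temporary file this session creates
SET work_mem = '256MB';   -- give sorts/hashes more memory before they spill to disk
SELECT ib_run2('2009-06-28 13:30:00', '2009-06-29 13:50:59',
               '10 minute', '0.5 hour', '24 hour');
-- then check the server log for "temporary file" entries to see which
-- statements are spilling to disk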

Trigger is working but no result

CREATE or replace FUNCTION billtesting() RETURNS trigger AS
$$
BEGIN
if (destination_number ~ '^(?:[0-7] ?){6,14}[0-9]$' and destination_number !~ '15487498')
then
insert into isp_cdr(destination_number,caller_id,duration,billsec)
values (destination_number,caller_id,duration,billsec);
UPDATE isp_cdr
SET nibble_total_billed = billsec * user_rate;
end if;
RETURN NEW;
END
$$
LANGUAGE plpgsql;
CREATE TRIGGER bill_testing_update
BEFORE UPDATE
ON isp_cdr
FOR EACH ROW
EXECUTE PROCEDURE billtesting();
This is the query for testing
insert into isp_cdr(destination_number,caller_id,duration,billsec)
values ('012687512123','123125641','43','35');
This is the sample lcr table information
digits user_rate
1 0.02
23 0.07
652 0.12
1123 0.28
87521 0.15
123161 0.54
9641231 1.20
65491641 0.89
this is the sample isp_cdr table information
destination_number caller_id duration billsec nibble_total_billed
123561231 1315142 67 58 0
With this trigger procedure I want to match digits against destination_number and use the matching user_rate to do the calculation:
user_rate * billsec = nibble_total_billed
But after running the test INSERT the calculation doesn't work; nothing shows up in nibble_total_billed. I think my pattern-matching part has a mistake: I need to match digits against destination_number.
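For reference, here is a hedged sketch of how the rate lookup could be done in a BEFORE INSERT trigger instead. It assumes the rate table is named lcr, that digits should match as a prefix of destination_number with the longest prefix winning, and that billsec/user_rate are numeric-compatible (adjust casts to your actual column types):
CREATE OR REPLACE FUNCTION billtesting() RETURNS trigger AS
$$
BEGIN
    -- find the longest digits prefix of the new destination_number and
    -- price the row before it is written; this also avoids the recursive
    -- INSERT/UPDATE on isp_cdr from inside its own trigger
    SELECT NEW.billsec * l.user_rate
      INTO NEW.nibble_total_billed
      FROM lcr AS l
     WHERE NEW.destination_number LIKE l.digits::text || '%'
     ORDER BY length(l.digits::text) DESC
     LIMIT 1;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;
CREATE TRIGGER bill_testing_insert
BEFORE INSERT ON isp_cdr
FOR EACH ROW
EXECUTE PROCEDURE billtesting();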

Loading, listing, and using R Modules and Functions in PL/R

I am having difficulty with:
Listing the R packages and functions available to PostgreSQL.
Installing a package (such as Kendall) for use with PL/R
Calling an R function within PostgreSQL
Listing Available R Packages
Q.1. How do you find out what R modules have been loaded?
SELECT * FROM r_typenames();
That shows the types that are available, but what about checking if Kendall( X, Y ) is loaded? For example, the documentation shows:
CREATE TABLE plr_modules (
modseq int4,
modsrc text
);
That seems to allow inserting records to dictate that Kendall is to be loaded, but the following code doesn't explain, syntactically, how to ensure that it gets loaded:
INSERT INTO plr_modules
VALUES (0, 'pg.test.module.load <-function(msg) {print(msg)}');
Q.2. What would the above line look like if you were trying to load Kendall?
Q.3. Is it applicable?
Installing R Packages
Using the "synaptic" package manager the following packages have been installed:
r-base
r-base-core
r-base-dev
r-base-html
r-base-latex
r-cran-acepack
r-cran-boot
r-cran-car
r-cran-chron
r-cran-cluster
r-cran-codetools
r-cran-design
r-cran-foreign
r-cran-hmisc
r-cran-kernsmooth
r-cran-lattice
r-cran-matrix
r-cran-mgcv
r-cran-nlme
r-cran-quadprog
r-cran-robustbase
r-cran-rpart
r-cran-survival
r-cran-vr
r-recommended
Q.4. How do I know if Kendall is in there?
Q.5. If it isn't, how do I find out what package it is in?
Q.6. If it isn't in a package suitable for installing with apt-get (aptitude, synaptic, dpkg, what have you), how do I go about installing it on Ubuntu?
Q.7. Where are the installation steps documented?
Calling R Functions
I have the following code:
EXECUTE 'SELECT '
'regr_slope( amount, year_taken ),'
'regr_intercept( amount, year_taken ),'
'corr( amount, year_taken ),'
'sum( measurements ) AS total_measurements '
'FROM temp_regression'
INTO STRICT slope, intercept, correlation, total_measurements;
This code calls the PostgreSQL function corr to calculate Pearson's correlation over the data. Ideally, I'd like to do the following (by switching corr for plr_kendall):
EXECUTE 'SELECT '
'regr_slope( amount, year_taken ),'
'regr_intercept( amount, year_taken ),'
'plr_kendall( amount, year_taken ),'
'sum( measurements ) AS total_measurements '
'FROM temp_regression'
INTO STRICT slope, intercept, correlation, total_measurements;
Q.8. Do I have to write plr_kendall myself?
Q.9. Where can I find a simple example that walks through:
Loading an R module into PG.
Writing a PG wrapper for the desired R function.
Calling the PG wrapper from a SELECT.
For example, would the last two steps look like:
create or replace function plr_kendall( _float8, _float8 ) returns float as '
agg_kendall(arg1, arg2)
' language 'plr';
CREATE AGGREGATE agg_kendall (
sfunc = plr_array_accum,
basetype = float8, -- ???
stype = _float8, -- ???
finalfunc = plr_kendall
);
And then the SELECT as above?
Thank you!
Overview
These steps list how to call an R function from PostgreSQL using PL/R.
Prerequisites
You must already have PostgreSQL, R, and PL/R installed.
Steps
Find R Module name (e.g., Kendall)
Change to the database user:
sudo su - postgres
Run R
R
Install R Module (accept $HOME/R/x86_64-pc-linux-gnu-library/2.9/):
install.packages("Kendall", dependencies = TRUE)
Choose a CRAN Mirror, when prompted.
Create the following table:
CREATE TABLE plr_modules (
modseq int4,
modsrc text
);
Insert into that table the directive to load the R Module in question:
INSERT INTO plr_modules
VALUES (0, 'library(Kendall)' );
Restart the database (or SELECT * FROM reload_plr_modules();):
sudo /etc/init.d/postgresql-8.4 restart
Create a wrapper function in PostgreSQL:
CREATE OR REPLACE FUNCTION climate.plr_corr_kendall(
double precision[],
double precision[] )
RETURNS double precision AS
$BODY$
Kendall(arg1, arg2)
$BODY$
LANGUAGE 'plr' VOLATILE STRICT;
Create a function that uses the wrapper function.
Test the new function.
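For a quick sanity check of the wrapper before wiring it into a larger query, it can be called directly with array literals (the values here are arbitrary; Kendall's tau should come back as a number between -1 and 1):
SELECT climate.plr_corr_kendall(
    ARRAY[1,2,3,4,5]::double precision[],
    ARRAY[2,1,4,3,5]::double precision[] );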
Wrapper Function
This function performs the work of gathering data from the database and creating two arrays. These arrays are passed into the plr_corr_kendall wrapper function.
CREATE OR REPLACE FUNCTION climate.analysis_vector()
RETURNS double precision AS
$BODY$
DECLARE
v_year_taken double precision[];
v_amount double precision[];
i RECORD;
BEGIN
FOR i IN (
SELECT
extract(YEAR FROM m.taken) AS year_taken,
avg( m.amount ) AS amount
FROM
climate.city c,
climate.station s,
climate.station_category sc,
climate.measurement m
WHERE
c.id = 5148 AND
earth_distance(
ll_to_earth(c.latitude_decimal,c.longitude_decimal),
ll_to_earth(s.latitude_decimal,s.longitude_decimal)) <= 30 AND
s.elevation BETWEEN 0 AND 3000 AND
s.applicable AND
sc.station_id = s.id AND
sc.category_id = 1 AND
extract(YEAR FROM sc.taken_start) >= 1900 AND
extract(YEAR FROM sc.taken_end) <= 2009 AND
m.station_id = s.id AND
m.taken BETWEEN sc.taken_start AND sc.taken_end AND
m.category_id = sc.category_id
GROUP BY
extract(YEAR FROM m.taken)
ORDER BY
extract(YEAR FROM m.taken)
) LOOP
SELECT array_append( v_year_taken, i.year_taken ) INTO v_year_taken;
SELECT array_append( v_amount, i.amount::double precision ) INTO v_amount;
END LOOP;
RAISE NOTICE '%', v_year_taken;
RAISE NOTICE '%', v_amount;
RETURN climate.plr_corr_kendall( v_year_taken, v_amount );
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;
Test
Test the function as follows:
SELECT
*
FROM
climate.analysis_vector();
Result
A number: -0.0578900910913944