I'm trying to convert a script from PL/SQL to T-SQL and am stuck on a couple of lines:
table(cast(multiset(select level from dual connect by level <= len (regexp_replace(t.image, '[^**]+'))/2) as sys.OdciNumberList)) levels
where substr(REGEXP_SUBSTR (t.image, '[^**]+',1, levels.column_value),1,instr( REGEXP_SUBSTR (t.image, '[^**]+',1, levels.column_value),'=',1) -1)
IMAGE
Any help would be great.
Chris
For a better answer it would be good to include some sample input and desired results, especially when addressing a different dialect of SQL. Perhaps including a PL/SQL tag would help find someone who understands both PL/SQL and T-SQL. It would also be helpful to include DDL, specifically the datatype for "Level". Again, I say this not to be critical but rather to guide you towards getting better answers here.
All that said, you can accomplish what you are trying to do in T-SQL by leveraging a tally table, an N-Grams function, and a couple of other functions, which are included at the end of this post.
regexp_replace
To replace or remove characters that match a pattern in T-SQL you can use patReplace8K. Here's an example of how to use it to replace numbers with *'s:
SELECT pr.NewString
FROM samd.patReplace8K('My phone number is 555-2211','[0-9]','*') AS pr;
Returns: My phone number is ***-****
regexp_substr
Here's an example of how to extract all phone numbers from a string:
DECLARE
@string VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222,
(313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
@pattern VARCHAR(50) = '%[^0-9()+.-]%';
-- EXTRACTOR
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
ItemIndex = f.position,
ItemLength = itemLen.l,
Item = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
FROM samd.NGrams8k(@string, 1) AS ng
WHERE PATINDEX(@pattern, ng.token) < --<< this token does NOT match the pattern
ABS(SIGN(ng.position-1)-1) + --<< are you the first row? OR
PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX(@pattern,f.token),0), --CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l)
WHERE itemLen.L > 6 -- this filter is more harmful to the extractor than the splitter
ORDER BY ItemNumber;
T-SQL INSTR Function
I included a T-SQL version of Oracle's INSTR function at the end of this post. Note these examples:
DECLARE
@string VARCHAR(8000) = 'AABBCC-AA123-AAXYZPDQ-AA-54321',
@search VARCHAR(8000) = '-AA',
@position INT = 1,
@occurance INT = 2;
-- 1.1. Get me the 2nd @occurance of "-AA" in @string beginning at @position 1
SELECT f.* FROM samd.instr8k(@string,@search,@position,@occurance) AS f;
-- 1.2. Retrieve everything *BEFORE* the second instance of "-AA"
SELECT
ItemIndex = f.ItemIndex,
Item = SUBSTRING(@string,1,f.itemindex-1)
FROM samd.instr8k(@string,@search,@position,@occurance) AS f;
-- 1.3. Retrieve everything *AFTER* the second instance of "-AA"
SELECT
ItemIndex = MAX(f.ItemIndex),
Item = MAX(SUBSTRING(@string,f.itemindex+f.itemLength,8000))
FROM samd.instr8k(@string,@search,@position,@occurance) AS f;
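To tie this back to the original PL/SQL snippet — it walks '**'-delimited key=value pairs in t.image and keeps the text before each '=' — here is a minimal sketch of that logic using the core.rangeAB function included at the end of this post. The sample value and the assumed format of image are invented for illustration; a dedicated splitter (e.g. DelimitedSplit8K) is the more robust way to do the splitting:
DECLARE @image VARCHAR(8000) = 'AB=12**CD=34**EF=56'; -- assumed '**'-delimited key=value format
SELECT KeyName = SUBSTRING(split.Item, 1, CHARINDEX('=', split.Item + '=') - 1)
FROM
(
SELECT Item = SUBSTRING(@image, n.Pos,
ISNULL(NULLIF(CHARINDEX('**', @image, n.Pos), 0), LEN(@image) + 1) - n.Pos)
FROM
(
SELECT 1 -- the first item starts at position 1
UNION ALL
SELECT r.RN + 2 -- each later item starts right after a '**'
FROM core.rangeAB(1, LEN(@image) - 1, 1, 1) AS r
WHERE SUBSTRING(@image, r.RN, 2) = '**'
) AS n(Pos)
) AS split;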
regexp_replace (ADVANCED)
Here's a more complex example, leveraging NGrams8k to replace phone numbers with the text "<REMOVED>":
DECLARE
@string VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222, (313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
@pattern VARCHAR(50) = '%[0-9()+.-]%';
SELECT NewString = (
SELECT IIF(IsMatch=1 AND patSplit.item LIKE '%[0-9][0-9][0-9]%','<REMOVED>', patSplit.item)
FROM
(
SELECT 1, i.Idx, SUBSTRING(@string,1,i.Idx), CAST(0 AS BIT)
FROM (VALUES(PATINDEX(@pattern,@string)-1)) AS i(Idx) --FROM (VALUES(PATINDEX('%'+@pattern+'%',@string)-1)) AS i(Idx)
WHERE SUBSTRING(@string,1,1) NOT LIKE @pattern
UNION ALL
SELECT r.RN,
itemLength = LEAD(r.RN,1,DATALENGTH(@string)+1) OVER (ORDER BY r.RN)-r.RN,
item = SUBSTRING(@string,r.RN,
LEAD(r.RN,1,DATALENGTH(@string)+1) OVER (ORDER BY r.RN)-r.RN),
isMatch = ABS(t.p-2+1)
FROM core.rangeAB(1,DATALENGTH(@string),1,1) AS r
CROSS APPLY (VALUES (
CAST(PATINDEX(@pattern,SUBSTRING(@string,r.RN,1)) AS BIT),
CAST(PATINDEX(@pattern,SUBSTRING(@string,r.RN-1,1)) AS BIT),
SUBSTRING(@string,r.RN,r.Op+1))) AS t(c,p,s)
WHERE t.c^t.p = 1
) AS patSplit(ItemIndex, ItemLength, Item, IsMatch)
FOR XML PATH(''), TYPE).value('.','varchar(8000)');
Returns:
Call me later at <REMOVED> or tomorrow at <REMOVED>, <REMOVED>, or at <REMOVED> before noon. Thanks!
CREATE FUNCTION core.rangeAB
(
@Low BIGINT, -- (start) Lowest number in the set
@High BIGINT, -- (stop) Highest number in the set
@Gap BIGINT, -- (step) Difference between each number in the set
@Row1 BIT -- Base: 0 or 1; should RN begin with 0 or 1?
)
/****************************************************************************************
[Purpose]:
Creates a lazy, in-memory, forward-ordered sequence of up to 531,441,000,000 integers
starting with @Low and ending with @High (inclusive). RangeAB is a pure, 100% set-based
alternative to solving SQL problems using iterative methods such as loops, cursors and
recursive CTEs. RangeAB is based on Itzik Ben-Gan's getnums function for producing a
sequence of integers and uses logic from Jeff Moden's fnTally function, which includes a
parameter for determining if the "row number" (RN) should begin with 0 or 1.
I wanted to use the name "Range" because it functions and performs almost identically to
the range function built into Python and Clojure. RANGE is a reserved SQL keyword so I
went with "RangeAB". Functions/algorithms developed using rangeAB can be easily ported
over to Python, Clojure or any other programming language that leverages a lazy sequence.
The two major differences between RangeAB and the Python/Clojure versions are:
1. RangeAB is *inclusive* where the other two are *exclusive*: range(0,3) in Python and
Clojure returns [0 1 2]; core.rangeAB(0,3) returns [0 1 2 3].
2. RangeAB has a fourth parameter (@Row1) to determine if RN should begin with 0 or 1.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
SELECT r.RN, r.OP, r.N1, r.N2
FROM core.rangeAB(@Low,@High,@Gap,@Row1) AS r;
[Parameters]:
@Low = BIGINT; represents the lowest value for N1.
@High = BIGINT; represents the highest value for N1.
@Gap = BIGINT; represents how much N1 and N2 will increase each row. @Gap is also the
difference between N1 and N2.
@Row1 = BIT; represents the base (first) value of RN. When @Row1 = 0, RN begins with 0;
when @Row1 = 1, RN begins with 1.
[Returns]:
Inline Table Valued Function returns:
RN = BIGINT; a row number that works just like T-SQL ROW_NUMBER() except that it can
start at 0 or 1, which is dictated by @Row1. If you need the numbers
(0 or 1) through @High, then use RN as your "N" value (@Row1=0 for 0, @Row1=1 for 1);
otherwise use N1.
OP = BIGINT; returns the "finite opposite" of RN. When RN begins with 0, the first number
in the set will be 0 for RN and the last number in the set will be 0 for OP. When
returning the numbers 1 to 10, they are returned in ascending order for RN and in
descending order for OP.
Given the numbers 1 to 3, 3 is the opposite of 1, 2 the opposite of 2, and 1 the
opposite of 3. Given the numbers -1 to 2, the opposite of -1 is 2, the opposite of 0
is 1, and the opposite of 1 is 0.
The best practice is to only use OP when @Gap > 1; otherwise use core.O instead. Doing
so will improve performance by 1-2% (not huge but every little bit counts).
N1 = BIGINT; this is the "N" in your tally table/numbers function. This is your *lazy*
sequence of numbers starting at @Low and incrementing by @Gap until the next number
in the sequence is greater than @High.
N2 = BIGINT; a lazy sequence of numbers starting at @Low+@Gap and incrementing by @Gap. N2
will always be greater than N1 by @Gap. N2 can also be thought of as:
LEAD(N1,1,N1+@Gap) OVER (ORDER BY RN)
[Dependencies]:
N/A
[Developer Notes]:
1. core.rangeAB returns one billion rows in exactly 90 seconds on my laptop:
4X 2.7GHz CPUs, 32 GB - multiple versions of SQL Server (2005-2019)
2. The lowest and highest possible numbers returned are whatever is allowable by a
bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3).
3. @Gap does not affect RN; RN will begin at @Row1 and increase by 1 until the last row
unless it's used in a subquery where a filter is applied to RN.
4. @Gap must be greater than 0 or the function will not return any rows.
5. Keep in mind that when @Row1 is 0 then the highest RN value (ROWNUMBER) will be the
number of rows returned minus 1.
6. If all you need is a sequential set beginning at 0 or 1 then, for best performance,
use the RN column. Use N1 and/or N2 when you need to begin your sequence at any
number other than 0 or 1 or if you need a gap between your sequence of numbers.
7. Although @Gap is a bigint it must be a positive integer or the function will
not return any rows.
8. The function will not return any rows when one of the following conditions is true:
* any of the input parameters are NULL
* @High is less than @Low
* @Gap is not greater than 0
To force the function to return all NULLs instead of not returning anything you can
add the following code to the end of the query:
UNION ALL
SELECT NULL, NULL, NULL, NULL
WHERE NOT (@High&@Low&@Gap&@Row1 IS NOT NULL AND @High >= @Low AND @Gap > 0)
This code was excluded as it adds a ~5% performance penalty.
9. There is no performance penalty for sorting by RN ASC; there is a large performance
penalty, however, for sorting in descending order. If you need a descending sort,
use OP in place of RN and then sort by RN ASC.
10. When setting @Row1 to 0 and sorting by RN you will see that the 0 is added via a
MERGE JOIN concatenation. Under the hood the function is essentially concatenating
but, because it's using a MERGE JOIN operator instead of concatenation, the cost
estimations are needlessly high. You can circumvent this problem by changing:
ORDER BY core.rangeAB.RN to: ORDER BY ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
[Examples]:
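--===== (Examples added for illustration; expected results derived from the
--       parameter notes above)
--===== 1. A lazy sequence of numbers, 1 to 5:
SELECT r.RN, r.OP, r.N1, r.N2 FROM core.rangeAB(1,5,1,1) AS r; -- RN/N1: 1..5; OP: 5..1
--===== 2. Every 10th number from 0 to 100:
SELECT r.N1 FROM core.rangeAB(0,100,10,1) AS r; -- 0,10,20,...,100
--===== 3. RN beginning at 0 (@Row1 = 0):
SELECT r.RN, r.N1 FROM core.rangeAB(1,3,1,0) AS r; -- RN: 0,1,2; N1: 1,2,3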
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20140518 - Initial Development - AJB
Rev 05 - 20191122 - Developed this "core" version for open source distribution;
updated notes and did some final code clean-up
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally(RN) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
SELECT
RN = 0,
OP = (@High-@Low)/@Gap,
N1 = @Low,
N2 = @Gap+@Low
WHERE @Row1 = 0
UNION ALL -- (@High-@Low)/@Gap+1:
SELECT TOP (ABS((ISNULL(@High,0)-ISNULL(@Low,0))/ISNULL(@Gap,0)+ISNULL(@Row1,1)))
RN = i.RN,
OP = (@High-@Low)/@Gap+(2*@Row1)-i.RN,
N1 = (i.rn-@Row1)*@Gap+@Low,
N2 = (i.rn-(@Row1-1))*@Gap+@Low
FROM iTally AS i
ORDER BY i.RN
) AS r
WHERE @High&@Low&@Gap&@Row1 IS NOT NULL AND @High >= @Low
AND @Gap > 0;
GO
CREATE FUNCTION samd.ngrams8k
(
@String VARCHAR(8000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@String). Accepts strings up to 8000 varchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.Position, ng.Token
FROM samd.ngrams8k(@String,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.Position, ng.Token
FROM dbo.SomeTable AS s
CROSS APPLY samd.ngrams8k(s.SomeValue,@N) AS ng;
[Parameters]:
@String = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = BIGINT; the position of the token in the input string
token = VARCHAR(8000); a @N-sized character-level N-Gram token
[Dependencies]:
1. core.rangeAB (iTVF)
[Developer Notes]:
1. ngrams8k is not case sensitive.
2. Many functions that use ngrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic, which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-forcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @String or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@String)) OR (@N IS NULL OR @String IS NULL)
4. ngrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Split the string, "abcd" into unigrams, bigrams and trigrams
SELECT ng.Position, ng.Token FROM samd.ngrams8k('abcd',1) AS ng; -- unigrams (#N=1)
SELECT ng.Position, ng.Token FROM samd.ngrams8k('abcd',2) AS ng; -- bigrams (#N=2)
SELECT ng.Position, ng.Token FROM samd.ngrams8k('abcd',3) AS ng; -- trigrams (#N=3)
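--       (Output added for illustration, derived from the definition below:
--        unigrams: (1,a),(2,b),(3,c),(4,d); bigrams: (1,ab),(2,bc),(3,cd);
--        trigrams: (1,abc),(2,bcd))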
[Revision History]:
------------------------------------------------------------------------------------------
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 05 - 20171228 - Small simplification; changed:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@String,''))-(ISNULL(@N,1)-1)),0)))
to:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@String,''))+1-ISNULL(@N,1)),0)))
Rev 06 - 20180612 - Using CHECKSUM(N) to convert N in the token output instead of
CAST(N AS INT); CHECKSUM removes the need to convert to int.
Rev 07 - 20180612 - re-designed to: Use core.rangeAB - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
Position = r.RN,
Token = SUBSTRING(@String,CHECKSUM(r.RN),@N)
FROM core.rangeAB(1,LEN(@String)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@String);
GO
CREATE FUNCTION samd.patReplace8K
(
@string VARCHAR(8000),
@pattern VARCHAR(50),
@replace VARCHAR(20)
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace),
patReplace8K will replace any character in @string that matches the @pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplace8K(@String,@Pattern,@Replace) AS pr;
[Developer Notes]:
1. Requires SQL Server 2008+
2. @Pattern is case sensitive but can be easily modified to make it case insensitive
3. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
4. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
[Examples]:
--===== 1. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplace8K('My phone number is 555-2211','[0-9]','*') AS pr;
[Revision History]:
Rev 00 - 10/27/2014 - Initial Development - Alan Burstein
Rev 01 - 10/29/2014 - Alan Burstein
- Redesigned based on dbo.STRIP_NUM_EE by Eirikur Eiriksson
(see: http://www.sqlservercentral.com/Forums/Topic1585850-391-2.aspx)
- changed how the cte tally table is created
- put the include/exclude logic in a CASE statement instead of a WHERE clause
- added Latin1_General_BIN collation
- added code to use the pattern as a parameter
Rev 02 - 20141106
- Added final performance enhancement (more kudos to Eirikur Eiriksson)
- Put 0 = PATINDEX filter logic into the WHERE clause
Rev 03 - 20150516
- Updated to deal with special XML characters
Rev 04 - 20170320
- changed @replace from char(1) to varchar(1) to address how spaces are handled
Rev 05 - Re-write using samd.NGrams
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
WHEN @pattern+@replace+@string IS NOT NULL THEN
CASE WHEN PATINDEX(@pattern,ng.token COLLATE Latin1_General_BIN)=0
THEN ng.token ELSE @replace END END
FROM samd.NGrams8K(@string, 1) AS ng
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('text()[1]', 'VARCHAR(8000)');
GO
CREATE FUNCTION samd.Instr8k
(
@string VARCHAR(8000),
@search VARCHAR(8000),
@position INT,
@occurance INT
)
/*****************************************************************************************
[Purpose]:
Returns the position (ItemIndex) of the Nth (@occurance) occurrence of one string (@search)
within another (@string). Similar to Oracle's PL/SQL INSTR function.
https://www.techonthenet.com/oracle/functions/instr.php
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT ins.ItemIndex, ins.ItemLength, ins.ItemCount
FROM samd.Instr8k(@string,@search,@position,@occurance) AS ins;
--===== Against a table using APPLY
SELECT s.SomeID, ins.ItemIndex, ins.ItemLength, ins.ItemCount
FROM dbo.SomeTable AS s
CROSS APPLY samd.Instr8k(s.string,@search,@position,@occurance) AS ins
[Parameters]:
@string = VARCHAR(8000); input string to evaluate
@search = VARCHAR(8000); token to search for inside of @string
@position = INT; where to begin searching for @search; identical to the third
parameter in SQL Server CHARINDEX [, start_location]
@occurance = INT; represents the Nth instance of the search string (@search)
[Returns]:
ItemIndex = Position of the Nth (@occurance) instance of @search inside @string
ItemLength = Length of @search (in case you need it, no need to re-evaluate the string)
ItemCount = Number of times @search appears inside @string
[Dependencies]:
1. samd.ngrams8k
1.1. core.rangeAB (iTVF)
2. samd.substringCount8K_lazy
[Developer Notes]:
1. samd.Instr8k does not treat the input strings (@string and @search) as case sensitive.
2. Don't use instr8k for "SubstringBetween" functionality; for better performance use
samd.SubstringBetween8k instead.
3. The @position parameter is the key benefit of this function when dealing with long
strings where the search item is towards the back of the string. For example, take a
5000 character string where what you are looking for is always *at least* 3000
characters deep. Setting @position to 3000 will dramatically improve performance.
4. Unlike Oracle's PL/SQL INSTR function, Instr8k does not accept numbers less than 1.
[Examples]:
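--===== (Example added for illustration, mirroring section 1.1 earlier in this post)
--===== 1. Position of the 2nd occurrence of "-AA", searching from position 1:
SELECT f.ItemIndex, f.ItemLength, f.ItemCount
FROM samd.Instr8k('AABBCC-AA123-AAXYZPDQ-AA-54321','-AA',1,2) AS f;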
[Revision History]:
------------------------------------------------------------------------------------------
Rev 00 - 20191112 - Initial Development - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
ItemIndex = ISNULL(MAX(ISNULL(instr.Position,1)+(a.Pos-1)),0),
ItemLength = ISNULL(MAX(LEN(@search)),LEN(@search)),
ItemCount = ISNULL(MAX(items.SubstringCount),0)
FROM (VALUES(ISNULL(@position,1),LEN(@search))) AS a(Pos,SrchLn)
CROSS APPLY (VALUES(SUBSTRING(@string,a.Pos,8000))) AS f(String)
CROSS APPLY samd.substringCount8K_lazy(f.string,@search) AS items
CROSS APPLY
(
SELECT TOP (@occurance) RN = ROW_NUMBER() OVER (ORDER BY ng.position), ng.position
FROM samd.ngrams8k(f.string,a.SrchLn) AS ng
WHERE ng.token = @search
ORDER BY RN
) AS instr
WHERE a.Pos > 0
AND @occurance <= items.SubstringCount
AND instr.RN = @occurance;
GO
CREATE FUNCTION samd.substringCount8K_lazy
(
@string varchar(8000),
@searchstring varchar(1000)
)
/*****************************************************************************************
[Purpose]:
Scans the input string (@string) and counts how many times the search string
(@searchstring) appears. This function is based on Itzik Ben-Gan's cte numbers table logic.
[Compatibility]:
SQL Server 2008+
Uses TABLE VALUES constructor (not available pre-2008)
[Author]: Alan Burstein
[Syntax]:
--===== Autonomous
SELECT f.substringCount
FROM samd.substringCount8K_lazy(@string,@searchString) AS f;
--===== Against a table using APPLY
SELECT f.substringCount
FROM dbo.someTable AS t
CROSS APPLY samd.substringCount8K_lazy(t.col, @searchString) AS f;
Parameters:
@string = VARCHAR(8000); input string to analyze
@searchString = VARCHAR(1000); substring to search for
[Returns]:
Inline table valued function returns -
substringCount = int; Number of times that @searchstring appears in @string
[Developer Notes]:
1. substringCount8K_lazy does NOT take overlapping values into consideration. For
example, this query will return 1 but the correct result is 2:
SELECT substringCount FROM samd.substringCount8K_lazy('xxx','xx')
When overlapping values are a possibility or concern then use substringCountAdvanced8k.
2. substringCount8K_lazy is what is referred to as an "inline scalar UDF." Technically
it's an inline table valued function (iTVF) but it performs the same task as a scalar
valued user defined function (UDF); the difference is that it requires the APPLY table
operator to accept column values as a parameter. For more about "inline" scalar UDFs
see this article by SQL MVP Jeff Moden:
http://www.sqlservercentral.com/articles/T-SQL/91724/
and for more about how to use APPLY see this article by SQL MVP Paul White:
http://www.sqlservercentral.com/articles/APPLY/69953/.
Note the above syntax example and usage examples below to better understand how to
use the function. Although the function is slightly more complicated to use than a
scalar UDF it will yield notably better performance for many reasons. For example,
unlike scalar UDFs or multi-line table valued functions, the inline scalar UDF does
not restrict the query optimizer's ability to generate a parallel query execution plan.
3. substringCount8K_lazy returns NULL when either input parameter is NULL and returns 0
when either input parameter is blank.
4. substringCount8K_lazy does not treat parameters as case sensitive.
5. substringCount8K_lazy is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. How many times does the substring "abc" appear?
SELECT f.* FROM samd.substringCount8k_lazy('abc123xxxabc','abc') AS f;
--===== 2. Return records from a table where the substring "ab" appears more than once
DECLARE @table TABLE (string varchar(8000));
DECLARE @searchString varchar(1000) = 'ab';
INSERT @table VALUES ('abcabc'),('abcd'),('bababab'),('baba'),(NULL);
SELECT searchString = @searchString, t.string, f.substringCount
FROM @table AS t
CROSS APPLY samd.substringCount8k_lazy(string,'ab') AS f
WHERE f.substringCount > 1;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20180625 - Initial Development - Alan Burstein
Rev 01 - 20190102 - Added logic to better handle @searchstring = char(32) - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT substringCount = (LEN(v.s)-LEN(REPLACE(v.s,v.st,'')))/d.l
FROM (VALUES(DATALENGTH(@searchstring))) AS d(l)
CROSS APPLY (VALUES(@string,CASE WHEN d.l>0 THEN @searchstring END)) AS v(s,st);
GO
I have 2 VARCHAR(64) values that are decimals in this case (say COLUMN1 and COLUMN2, both varchars, both decimal (money) numbers). I need to create a WHERE clause where I say this:
COLUMN1 < COLUMN2
I believe I have to convert these 2 varchar columns to a different data type to compare them like that, but I'm not sure how to go about it. I tried a straightforward CAST:
CAST(COLUMN1 AS DECIMAL(9,2)) < CAST(COLUMN2 AS DECIMAL(9,2))
But I should have known that would be too easy. Any help is appreciated. Thanks!
You can create a UDF like this to check which values can't be cast to DECIMAL
CREATE OR REPLACE FUNCTION IS_DECIMAL(i VARCHAR(64)) RETURNS INTEGER
CONTAINS SQL
--ALLOW PARALLEL -- can use this on Db2 11.5 or above
NO EXTERNAL ACTION
DETERMINISTIC
BEGIN
DECLARE NOT_VALID CONDITION FOR SQLSTATE '22018';
DECLARE EXIT HANDLER FOR NOT_VALID RETURN 0;
RETURN CASE WHEN CAST(i AS DECIMAL(31,8)) IS NOT NULL THEN 1 END;
END
For example
CREATE TABLE S ( C VARCHAR(32) );
INSERT INTO S VALUES ( ' 123.45 '),('-00.12'),('£546'),('12,456.88');
SELECT C FROM S WHERE IS_DECIMAL(c) = 0;
would return
C
---------
£546
12,456.88
It really is that easy...this works fine...
select cast('10.15' as decimal(9,2)) - 1
from sysibm.sysdummy1;
You've got something besides a valid numerical character in your data..
And it's something besides leading or trailing whitespace...
Try the following...
select *
from table
where translate(column1, ' ','0123456789.')
<> ' '
or translate(column2, ' ','0123456789.')
<> ' '
That will show you the rows with alpha characters...
If the above doesn't return anything, then you've probably got a string with double decimal points or something...
You could use a regex to find those.
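For example, on Db2 11.1 or later (where REGEXP_LIKE is available) a sketch along these lines should flag them; the exact pattern is an assumption about what you consider a valid decimal:
SELECT *
FROM table
WHERE NOT REGEXP_LIKE(TRIM(column1), '^[+-]?[0-9]+(\.[0-9]+)?$')
OR NOT REGEXP_LIKE(TRIM(column2), '^[+-]?[0-9]+(\.[0-9]+)?$');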
There is a built-in ability to do this without UDFs.
The xmlcast function below does "safe" casting between (var)char and decfloat (you may use as double or as decimal(X, Y) instead, if you want). It returns NULL if it's impossible to cast.
You may use such an expression twice in the WHERE clause.
SELECT
S
, xmlcast(xmlquery('if ($v castable as xs:decimal) then xs:decimal($v) else ()' passing S as "v") as decfloat) D
FROM (VALUES ( ' 123.45 '),('-00.12'),('£546'),('12,456.88')) T (S);
|S |D |
|---------|------------------------------------------|
| 123.45 |123.45 |
|-00.12 |-0.12 |
|£546 | |
|12,456.88| |
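Applied to the comparison in the original question it would look something like this (COLUMN1/COLUMN2 are from the question; the table name here is a placeholder):
SELECT *
FROM mytable
WHERE xmlcast(xmlquery('if ($v castable as xs:decimal) then xs:decimal($v) else ()' passing COLUMN1 as "v") as decfloat)
< xmlcast(xmlquery('if ($v castable as xs:decimal) then xs:decimal($v) else ()' passing COLUMN2 as "v") as decfloat);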
I'm trying to write a query to produce a list of all nodes in a tree given a root, and also the paths (using the names parents give their children) taken to get there. The recursive CTE I have working is a textbook CTE straight from the docs here; however, it's proven difficult to get the paths working in this case.
Following the git model, names are given to children by their parents as a result of paths created by traversing the tree. This implies a map to children ids, like git's tree structure.
I've been looking online for a solution for a recursive query, but they all seem to contain solutions that use parent ids or materialized paths, which would break the structural-sharing concepts that Rich Hickey's "database as a value" talk is all about.
current implementation
Imagine the objects table is dead simple (for simplicity's sake, let's assume integer ids):
drop table if exists objects;
create table objects (
id INT,
data jsonb
);
-- A
-- / \
-- B C
-- / \ \
-- D E F
INSERT INTO objects (id, data) VALUES
(1, '{"content": "data for f"}'), -- F
(2, '{"content": "data for e"}'), -- E
(3, '{"content": "data for d"}'), -- D
(4, '{"nodes":{"f":{"id":1}}}'), -- C
(5, '{"nodes":{"d":{"id":2}, "e":{"id":3}}}'), -- B
(6, '{"nodes":{"b":{"id":5}, "c":{"id":4}}}') -- A
;
drop table if exists work_tree;
create table work_tree (
id INT NOT NULL,
path text,
ref text,
data jsonb,
primary key (ref, id) -- TODO change to ref, path
);
create or replace function get_nested_ids_array(data jsonb) returns int[] as $$
select array_agg((value->>'id')::int) as nested_id
from jsonb_each(data->'nodes')
$$ LANGUAGE sql STABLE;
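For example (an illustrative call; note that array_agg does not guarantee element order):
select get_nested_ids_array('{"nodes":{"d":{"id":2}, "e":{"id":3}}}'::jsonb);
-- returns {2,3}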
create or replace function checkout(root_id int, ref text) returns void as $$
with recursive nodes(id, nested_ids, data) AS (
select id, get_nested_ids_array(data), data
from objects
where id = root_id
union
select child.id, get_nested_ids_array(child.data), child.data
from objects child, nodes parent
where child.id = ANY(parent.nested_ids)
)
INSERT INTO work_tree (id, data, ref)
select id, data, ref from nodes
$$ language sql VOLATILE;
SELECT * FROM checkout(6, 'master');
SELECT * FROM work_tree;
If you are familiar with git, these objects' data property looks similar to git blobs/trees, mapping names to ids or storing content. So imagine you want to create an index: after a "checkout", you need to query for the list of nodes, and potentially the paths, to produce a working tree or index:
Current Output:
id path ref data
6 NULL master {"nodes":{"b":{"id":5}, "c":{"id":4}}}
4 NULL master {"nodes":{"f":{"id":1}}}
5 NULL master {"nodes":{"d":{"id":2}, "e":{"id":3}}}
1 NULL master {"content": "data for f"}
2 NULL master {"content": "data for d"}
3 NULL master {"content": "data for e"}
Desired Output:
id path ref data
6 / master {"nodes":{"b":{"id":5}, "c":{"id":4}}}
5 /b master {"nodes":{"d":{"id":2}, "e":{"id":3}}}
4 /c master {"nodes":{"f":{"id":1}}}
2 /b/d master {"content": "data for d"}
3 /b/e master {"content": "data for e"}
1 /c/f master {"content": "data for f"}
What's the best way to aggregate path in this case? I'm aware that I'm compressing the information when I call get_nested_ids_array in the recursive query, so I'm not sure, with this top-down approach, how to properly aggregate with the CTE.
EDIT use case for children ids
to explain more about why I need to use children ids instead of parent:
Imagine a data structure like so:
A
/ \
B C
/ \ \
D E F
If you make a modification to F, you only add a new root A' and children nodes C' and F', leaving the old tree intact:
A' A
/ \ / \
C' B C
/ / \ \
F' D E F
If you make a deletion, you only add a new root A" that only points to B and you still have A if you ever need to time travel (and they share the same objects, just like git!):
A" A
\ / \
B C
/ \ \
D E F
So it seems that the best way to achieve this is with children ids so children can have multiple parents - across time and space! If you think there's another way to achieve this, by all means, let me know!
Edit #2 case for not using parent_ids
Using parent_ids has cascading effects that require editing the entire tree. For example,
A
/ \
B C
/ \ \
D E F
If you make a modification to F, you still need a new root A' to maintain immutability. And if we use parent_ids, that means both B and C now have a new parent. Hence, you can see how it ripples through the entire tree, immediately requiring that every node be touched:
A A'
/ \ / \
B C B' C'
/ \ \ / \ \
D E F D' E' F'
EDIT #3 use case for parents giving names to children
We can make a recursive query where objects store their own name, but the question I'm asking is specifically about constructing a path where names are given to children by their parents. This models a data structure similar to the git tree. For example, if you look at the git graph pictured below, in the 3rd commit there is a tree (a folder) bak that points to the original root, which represents a folder of all the files at the 1st commit. If that root object had its own name, it wouldn't be possible to achieve this as simply as adding a ref. That's the beauty of git: it's as simple as making a reference to a hash and giving it a name.
This is the relationship I'm setting up, which is why the jsonb data structure exists: it provides a mapping from a name to an id (a hash in git's case). I know it's not ideal, but it does the job of providing the hash map. If there's another way to create this mapping of names to ids, and thus a way for parents to give names to children in a top-down tree, I'm all ears!
Any help is appreciated!
Store the parent of a node instead of its children. It is a simpler and cleaner solution, in which you do not need structured data types.
Here is an example model with data equivalent to that in the question:
create table objects (
id int primary key,
parent_id int,
label text,
content text);
insert into objects values
(1, 4, 'f', 'data for f'),
(2, 5, 'e', 'data for e'),
(3, 5, 'd', 'data for d'),
(4, 6, 'c', ''),
(5, 6, 'b', ''),
(6, 0, 'a', '');
And a recursive query:
with recursive nodes(id, path, content) as (
select id, label, content
from objects
where parent_id = 0
union all
select o.id, concat(path, '->', label), o.content
from objects o
join nodes n on n.id = o.parent_id
)
select *
from nodes
order by id desc;
id | path | content
----+---------+------------
6 | a |
5 | a->b |
4 | a->c |
3 | a->b->d | data for d
2 | a->b->e | data for e
1 | a->c->f | data for f
(6 rows)
The variant with children_ids.
drop table if exists objects;
create table objects (
id int primary key,
children_ids int[],
label text,
content text);
insert into objects values
(1, null, 'f', 'data for f'),
(2, null, 'e', 'data for e'),
(3, null, 'd', 'data for d'),
(4, array[1], 'c', ''),
(5, array[2,3], 'b', ''),
(6, array[4,5], 'a', '');
with recursive nodes(id, children, path, content) as (
select id, children_ids, label, content
from objects
where id = 6
union all
select o.id, o.children_ids, concat(path, '->', label), o.content
from objects o
join nodes n on o.id = any(n.children)
)
select *
from nodes
order by id desc;
id | children | path | content
----+----------+---------+------------
6 | {4,5} | a |
5 | {2,3} | a->b |
4 | {1} | a->c |
3 | | a->b->d | data for d
2 | | a->b->e | data for e
1 | | a->c->f | data for f
(6 rows)
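For completeness, the same idea can be pointed directly at the question's jsonb schema, taking each child's name from the parent's "nodes" keys. A sketch (assumes PostgreSQL 9.4+ for jsonb and the objects table from the question):
with recursive nodes(id, path, data) as (
  select id, ''::text, data
  from objects
  where id = 6
  union all
  select (c.value->>'id')::int, n.path || '/' || c.key, o.data
  from nodes n
  cross join lateral jsonb_each(n.data->'nodes') as c(key, value)
  join objects o on o.id = (c.value->>'id')::int
)
select id, coalesce(nullif(path, ''), '/') as path, data
from nodes;
This produces the /, /b, /b/d style paths from the desired output without storing names on the nodes themselves.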
@klin's excellent answer inspired me to experiment with PostgreSQL, trees (paths), and recursive CTEs! :-D
Preamble: my motivation is storing data in PostgreSQL, but visualizing those data in a graph. While the approach here has limitations (e.g. undirected edges; ...), it may otherwise be useful in other contexts.
Here, I adapted @klin's code so the CTE does not depend on the table id, though I do use the ids to deal with the issue of loops in the data, e.g.
a,b
b,a
which throw the CTE into a nonterminating loop.
To solve that, I employed the rather brilliant approach suggested by @a-horse-with-no-name in SO 31739150 -- see my comments in the script, below.
PSQL script ("tree with paths.sql"):
-- File: /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
-- Adapted from: https://stackoverflow.com/questions/44620695/recursive-path-aggregation-and-cte-query-for-top-down-tree-postgres
-- See also: /mnt/Vancouver/FC/RDB - PostgreSQL/Recursive CTE - Graph Algorithms in a Database Recursive CTEs and Topological Sort with Postgres.pdf
-- https://www.fusionbox.com/blog/detail/graph-algorithms-in-a-database-recursive-ctes-and-topological-sort-with-postgres/620/
-- Run this script in psql, at the psql# prompt:
-- \! cd /mnt/Vancouver/Programming/data/metabolism/practice/sql/
-- \i /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
\c practice
DROP TABLE tree;
CREATE TABLE tree (
-- id int primary key
id SERIAL PRIMARY KEY
,s TEXT -- s: source node
,t TEXT -- t: target node
,UNIQUE(s, t)
);
INSERT INTO tree(s, t) VALUES
('a','b')
,('b','a') -- << adding this 'back relation' breaks CTE_1 below, as it enters a loop and cannot terminate
,('b','c')
,('b','d')
,('c','e')
,('d','e')
,('e','f')
,('f','g')
,('g','h')
,('c','h');
SELECT * FROM tree;
-- SELECT s,t FROM tree WHERE s='b';
-- RECURSIVE QUERY 1 (CTE_1):
-- WITH RECURSIVE nodes(src, path, tgt) AS (
-- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'a'
-- -- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'c'
-- UNION ALL
-- SELECT t.s, concat(path, '->', t.t), t.t FROM tree t
-- JOIN nodes n ON n.tgt = t.s
-- )
-- -- SELECT * FROM nodes;
-- SELECT path FROM nodes;
-- RECURSIVE QUERY 2 (CTE_2):
-- Deals with "loops" in Postgres data, per
-- https://stackoverflow.com/questions/31739150/to-find-infinite-recursive-loop-in-cte
-- "With Postgres it's quite easy to prevent this by collecting all visited nodes in an array."
WITH RECURSIVE nodes(id, src, path, tgt, all_parent_ids) AS (
SELECT id, s, concat(s, '->', t), t
,array[id] as all_parent_ids
FROM tree WHERE s = 'a'
UNION ALL
SELECT t.id, t.s, concat(path, '->', t.t), t.t, all_parent_ids||t.id FROM tree t
JOIN nodes n ON n.tgt = t.s
AND t.id <> ALL (all_parent_ids) -- this is the trick to exclude the endless loops
)
-- SELECT * FROM nodes;
SELECT path FROM nodes;
Script execution / output (PSQL):
# \i tree_with_paths.sql
You are now connected to database "practice" as user "victoria".
DROP TABLE
CREATE TABLE
INSERT 0 10
id | s | t
----+---+---
1 | a | b
2 | b | a
3 | b | c
4 | b | d
5 | c | e
6 | d | e
7 | e | f
8 | f | g
9 | g | h
10 | c | h
path
---------------------
a->b
a->b->a
a->b->c
a->b->d
a->b->c->e
a->b->d->e
a->b->c->h
a->b->d->e->f
a->b->c->e->f
a->b->c->e->f->g
a->b->d->e->f->g
a->b->d->e->f->g->h
a->b->c->e->f->g->h
You can change the starting node (e.g. start at node "d") in the SQL script -- giving, e.g.:
# \i tree_with_paths.sql
...
path
---------------
d->e
d->e->f
d->e->f->g
d->e->f->g->h
Network visualization:
I exported those data (at the PSQL prompt) to a CSV,
# \copy (SELECT s, t FROM tree) TO '/tmp/tree.csv' WITH CSV
COPY 9
# \! cat /tmp/tree.csv
a,b
b,c
b,d
c,e
d,e
e,f
f,g
g,h
c,h
... which I visualized (image above) in a Python 3.5 venv:
>>> import networkx as nx
>>> import pylab as plt
>>> G = nx.read_edgelist("/tmp/tree.csv", delimiter=",")
>>> G.nodes()
['b', 'a', 'd', 'f', 'c', 'h', 'g', 'e']
>>> G.edges()
[('b', 'a'), ('b', 'd'), ('b', 'c'), ('d', 'e'), ('f', 'g'), ('f', 'e'), ('c', 'e'), ('c', 'h'), ('h', 'g')]
>>> G.number_of_nodes()
8
>>> G.number_of_edges()
9
>>> from networkx.drawing.nx_agraph import graphviz_layout
## There is a bug in Python or NetworkX: you may need to run this
## command 2x, as you may get an error the first time:
>>> nx.draw(G, pos=graphviz_layout(G), node_size=1200, node_color='lightblue', linewidths=0.25, font_size=10, font_weight='bold', with_labels=True)
>>> plt.show()
>>> nx.dijkstra_path(G, 'a', 'h')
['a', 'b', 'c', 'h']
>>> nx.dijkstra_path(G, 'a', 'f')
['a', 'b', 'd', 'e', 'f']
Note that the dijkstra_path returned by NetworkX is one of several possible paths, whereas all paths are returned by the Postgres CTE in a visually appealing manner.
I have built SQL Spatial triggers so that if users of our GIS move a pit (point feature), the pipes (lines) that end or start at the pit have their endpoint/startpoint altered. I basically replace the entire geometry.
The problem is, some pipes have more than one point. In this case I only want to alter the endpoint or startpoint and leave the other points as is; otherwise the other vertices are lost. Can you set the STEndPoint/STStartPoint, or is there another way to alter only these points of a line?
The code below is my attempt so far, after trying a recursive CTE, which I gave up on as I couldn't reference a table variable inside it. Instead I have built a temp table which stores the x and y values. It would be good not to have to repeat the script 4 times, but it works (pipes will have 2 to 4 points). I now need to alter the geometry x/y for just the pipe endpoints/startpoints after the user moves the pit. I will look at this tomorrow.
--Create temp table to store X and Y values of pipe vertices
CREATE table #PipeGeom
(
ID int IDENTITY(1,1) NOT NULL,
N int,
X float,
Y float,
PipeGMSCKey int
)
--insert vertices. Most pipes have two points. A few have 4.
DECLARE @VertexCount INT
SET @VertexCount = 1
INSERT INTO #PipeGeom (N, X, Y, PipeGMSCKey)
SELECT N=@VertexCount, pipe.Geometry_SPA.STPointN(1).STX AS X, pipe.Geometry_SPA.STPointN(1).STY AS Y, pipe.GMSC_Key FROM [Assets_GMSC_Dev].[dbo].[vw_Pipes] pipe
INNER JOIN [Assets_GMSC_Dev].[dbo].[vw_Pits] pit ON pipe.Geometry_SPA.STDistance(pit.Geometry_SPA) <0.1
WHERE pit.GMSC_Key = '1481532'
--UNION ALL
INSERT INTO #PipeGeom (N, X, Y, PipeGMSCKey)
SELECT N=2, pipe.Geometry_SPA.STPointN(2).STX AS X, pipe.Geometry_SPA.STPointN(2).STY AS Y, pipe.GMSC_Key FROM [Assets_GMSC_Dev].[dbo].[vw_Pipes] pipe
INNER JOIN [Assets_GMSC_Dev].[dbo].[vw_Pits] pit ON pipe.Geometry_SPA.STDistance(pit.Geometry_SPA) <0.1
WHERE pit.GMSC_Key = '1481532'
--UNION ALL
INSERT INTO #PipeGeom (N, X, Y, PipeGMSCKey)
SELECT N=3, pipe.Geometry_SPA.STPointN(3).STX AS X, pipe.Geometry_SPA.STPointN(3).STY AS Y, pipe.GMSC_Key FROM [Assets_GMSC_Dev].[dbo].[vw_Pipes] pipe
INNER JOIN [Assets_GMSC_Dev].[dbo].[vw_Pits] pit ON pipe.Geometry_SPA.STDistance(pit.Geometry_SPA) <0.1
WHERE pit.GMSC_Key = '1481532'
--UNION ALL
INSERT INTO #PipeGeom (N, X, Y, PipeGMSCKey)
SELECT N=4, pipe.Geometry_SPA.STPointN(4).STX AS X, pipe.Geometry_SPA.STPointN(4).STY AS Y, pipe.GMSC_Key FROM [Assets_GMSC_Dev].[dbo].[vw_Pipes] pipe
INNER JOIN [Assets_GMSC_Dev].[dbo].[vw_Pits] pit ON pipe.Geometry_SPA.STDistance(pit.Geometry_SPA) <0.1
WHERE pit.GMSC_Key = '1481532'
--User moves pit, then get new pit location, which pipe start/endpoint will need to move to.
--New Start and Finish points
DECLARE @PitX FLOAT = (SELECT Geometry_SPA.STX FROM vw_Pits WHERE GMSC_Key = '1481532');
DECLARE @PitY FLOAT = (SELECT Geometry_SPA.STY FROM vw_Pits WHERE GMSC_Key = '1481532');
--need to grab just the endpoint/startpoint of line here and build Geometry String
drop table #PipeGeom
Thanks
This might not be an optimal solution but I hope it's of use. Also, I'm on 2014/2016 so it's untested on 2008R2.
--Initial (Existing) Line
DECLARE @g GEOMETRY = GEOMETRY::STGeomFromText('LINESTRING(5 5, 20 10, 30 20, 0 50)', 0);
--New Start and Finish points
DECLARE @X1 VARCHAR(2) = '10'
DECLARE @Y1 VARCHAR(2) = '10'
DECLARE @X2 VARCHAR(2) = '20'
DECLARE @Y2 VARCHAR(2) = '20'
DECLARE @Coords NVARCHAR(MAX)
;WITH Points(N, X, Y) AS
(
SELECT 2, @g.STPointN(2).STX, @g.STPointN(2).STY
UNION ALL
SELECT N + 1, @g.STPointN(N + 1).STX, @g.STPointN(N + 1).STY
FROM Points GP
WHERE N < @g.STNumPoints() - 1
)
SELECT @Coords = COALESCE(@Coords + ', ', '') + CAST(X AS NVARCHAR(50)) + ' ' + CAST(Y AS NVARCHAR(50)) FROM Points
SELECT GEOMETRY::STGeomFromText('LINESTRING('+@X1+' '+@Y1+', '+@Coords+', '+@X2+' '+@Y2+' '+')', 0) NewGeom
This uses a recursive CTE to parse out the coordinates, except the first and last, of the geometry @g into an NVARCHAR. The new first and last points are then concatenated on and a new geometry returned. You could probably wrap this up in a stored procedure or function.
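If you do wrap it up, a sketch of what a scalar function version might look like (the name, signature, and the loop are my own inventions; untested):
CREATE FUNCTION dbo.MoveLineEnds
(
    @g        GEOMETRY, -- the existing pipe geometry
    @newStart GEOMETRY, -- new first point (e.g. the moved pit)
    @newEnd   GEOMETRY  -- new last point
)
RETURNS GEOMETRY
AS
BEGIN
    DECLARE @coords NVARCHAR(MAX), @n INT = 2;
    -- collect the interior points (2 .. n-1), exactly as the CTE above does
    WHILE @n < @g.STNumPoints()
    BEGIN
        SET @coords = COALESCE(@coords + ', ', '')
            + CAST(@g.STPointN(@n).STX AS NVARCHAR(50)) + ' '
            + CAST(@g.STPointN(@n).STY AS NVARCHAR(50));
        SET @n += 1;
    END;
    -- rebuild the line: new start, original interior points (if any), new end
    RETURN GEOMETRY::STGeomFromText('LINESTRING('
        + CAST(@newStart.STX AS NVARCHAR(50)) + ' ' + CAST(@newStart.STY AS NVARCHAR(50))
        + ISNULL(', ' + @coords, '') + ', '
        + CAST(@newEnd.STX AS NVARCHAR(50)) + ' ' + CAST(@newEnd.STY AS NVARCHAR(50))
        + ')', @g.STSrid);
END;
-- e.g. dbo.MoveLineEnds(pipe.Geometry_SPA, @newPit, pipe.Geometry_SPA.STEndPoint())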
Our solution needs us to work in hierarchies of regions which are as follows.
STATE
|
DISTRICT
|
TALUK
/ \
/ \
HOBLI PANCHAYAT
\ /
\ /
\ /
VILLAGE
There are 2 ways to navigate to a village from a taluk: either through HOBLI or through PANCHAYAT.
We need a PK (non-business key) and a SERIAL_NUMBER/ID for each STATE, DISTRICT, TALUK, HOBLI, PANCHAYAT, VILLAGE; however, each village has 8 additional attributes.
How do I design this structure in PostgreSQL 8.4?
My previous experience was on Oracle, so I'm wondering how to navigate hierarchical structures in PostgreSQL 8.4. If at all possible, the solution should be friendly for READ/navigation speed.
================================================================
@Quassnoi: Here is a sample hierarchy
KARNATAKA
|
|
TUMKUR (District)
|
|
|
KUNIGAL (Taluk)
/ \
/ \
/ \
HULIYUR DURGA(Hobli) CHOWDANAKUPPE(Panchayat)
\ /
\ /
\ /
\ /
\ /
Voddarakempapura(Village)
Ankanahalli(Village)
Chowdanakuppe(Village)
Yedehalli(Village)
NAVIGATE: For now, I will be presenting 2 separate UI screens, each having a separate navigable hierarchy:
#1 using HOBLI and
So, for #1, I will need the entire tree starting from STATE, DISTRICT(s), TALUK(s), HOBLI(s), VILLAGE(s). Using the above tree, I will need
KARNATAKA (State)
|
|
|---TUMKUR (District)
|
|
|-----KUNIGAL(Taluk)
|
|
**|----HULIYUR DURGA(Hobli)**
|
|
|---VODDARAKEMPAPURA(Village)
|
|---Yedehalli(Village)
|
|---Ankanahalli(Village)
#2 using PANCHAYAT.
So, for #2, I will need the entire tree starting from STATE, DISTRICT(s), TALUK(s), PANCHAYAT(s), VILLAGE(s)
KARNATAKA (state)
|
|
|---TUMKUR (District)
|
|
|-----KUNIGAL(Taluk)
|
|
**|----CHOWDANAKUPPE (Panchayat)**
|
|
|---VODDARAKEMPAPURA(Village)
|
|---Ankanahalli(Village)
|
|---Chowdanakuppe(Village)
ResultSet
We should be able to create the above trees with the following details.
We need a PK (non-business key) and a SERIAL_NUMBER/ID for each STATE, DISTRICT, TALUK, HOBLI, PANCHAYAT, VILLAGE, along with a name and the LEVEL of the relationship (similar to Oracle's LEVEL).
For now, getting the above result set is OK. But in the future, we will need the ability to do reporting (some aggregation) at a HOBLI/PANCHAYAT/TALUK level.
=====================================
@Quassnoi #2,
Thank you very much,
"If you are planning to add some more hierarchy axes, it may be worth creating a separate table to store the hierarchies (with the axis field added) rather than adding the fields to the table."
Actually, I simplified the existing requirement so as NOT to confuse anyone. The actual hierarchy is like this
STATE
|
DISTRICT
|
TALUK
/ \
/ \
HOBLI PANCHAYAT
\ /
\ /
\ /
REVENUE VILLAGE
|
|
HABITATION
Sample data for such a hierarchy is like below
KARNATAKA
|
TUMKUR (District)
|
KUNIGAL (Taluk)
/ \
/ \
HULIYUR DURGA(Hobli) CHOWDANAKUPPE(Panchayat)
\ /
\ /
Thavarekere(Revenue Village)
/ \
Bommanahalli(habitation) Tavarekere(Habitation)
Will anything in your solution below change with the above modification?
Also, would you recommend that I create another table like the one below to store the 7 properties of the habitations? Is there a better way to store such info?
CREATE TABLE habitatDetails
(
id BIGINT NOT NULL PRIMARY KEY,
serialNumber BIGINT NOT NULL,
habitatid BIGINT NOT NULL, -- we will add these details only for habitats
prop1 VARCHAR(128),
prop2 VARCHAR(128),
prop3 VARCHAR(128),
prop4 VARCHAR(128),
prop5 VARCHAR(128),
prop6 VARCHAR(128),
prop7 VARCHAR(128),
CONSTRAINT "habitatdetails_fk" FOREIGN KEY ("habitatid")
REFERENCES "public"."t_hierarchy"("id")
);
Thank you,
CREATE TABLE t_hierarchy
(
id BIGINT NOT NULL PRIMARY KEY,
type VARCHAR(128) NOT NULL,
name VARCHAR(128) NOT NULL,
tax_parent BIGINT,
gov_parent BIGINT,
CHECK (NOT (tax_parent IS NULL AND gov_parent IS NULL))
);
CREATE INDEX ix_hierarchy_taxparent ON t_hierarchy (tax_parent);
CREATE INDEX ix_hierarchy_govparent ON t_hierarchy (gov_parent);
INSERT
INTO t_hierarchy
VALUES (1, 'State', 'Karnataka', 0, 0),
(2, 'District', 'Tumkur', 1, 1),
(3, 'Taluk', 'Kunigal', 2, 2),
(4, 'Hobli', 'Huliyur Durga', 3, NULL),
(5, 'Panchayat', 'Chowdanakuppe', NULL, 3),
(6, 'Village', 'Voddarakempapura', 4, 5),
(7, 'Village', 'Ankanahalli', 4, 5),
(8, 'Village', 'Chowdanakuppe', 4, 5),
(9, 'Village', 'Yedehalli', 4, 5);
CREATE OR REPLACE FUNCTION fn_hierarchy_tax(level INT, start BIGINT)
RETURNS TABLE (level INT, h t_hierarchy)
AS
$$
SELECT $1, h
FROM t_hierarchy h
WHERE h.id = $2
UNION ALL
SELECT (f).*
FROM (
SELECT fn_hierarchy_tax($1 + 1, h.id) f
FROM t_hierarchy h
WHERE h.tax_parent = $2
) q;
$$
LANGUAGE 'sql';
CREATE OR REPLACE FUNCTION fn_hierarchy_tax(start BIGINT)
RETURNS TABLE (level INT, h t_hierarchy)
AS
$$
SELECT fn_hierarchy_tax(1, $1);
$$
LANGUAGE 'sql';
CREATE OR REPLACE FUNCTION fn_hierarchy_gov(level INT, start BIGINT)
RETURNS TABLE (level INT, h t_hierarchy)
AS
$$
SELECT $1, h
FROM t_hierarchy h
WHERE h.id = $2
UNION ALL
SELECT (f).*
FROM (
SELECT fn_hierarchy_gov($1 + 1, h.id) f
FROM t_hierarchy h
WHERE h.gov_parent = $2
) q;
$$
LANGUAGE 'sql';
CREATE OR REPLACE FUNCTION fn_hierarchy_gov(start BIGINT)
RETURNS TABLE (level INT, h t_hierarchy)
AS
$$
SELECT fn_hierarchy_gov(1, $1);
$$
LANGUAGE 'sql';
SELECT ht.level, (ht.h).*
FROM fn_hierarchy_tax(1) ht;
SELECT ht.level, (ht.h).*
FROM fn_hierarchy_gov(1) ht;
The main idea is to keep the two parents in two different fields, and to use CONNECT BY emulation (rather than a recursive CTE) to preserve the order.
If you are planning to add some more hierarchy axes, it may be worth creating a separate table to store the hierarchies (with the axis field added) rather than adding the fields to the table.
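For reference, a plain recursive CTE over the same table is sketched below (WITH RECURSIVE is available in PostgreSQL 8.4); it returns the same rows with a level, but without the depth-first ordering the functions above preserve. Swap tax_parent for gov_parent to walk the other axis:
WITH RECURSIVE tax(level, id, type, name) AS (
    SELECT 1, id, type, name
    FROM t_hierarchy
    WHERE id = 1 -- start at the state
    UNION ALL
    SELECT t.level + 1, h.id, h.type, h.name
    FROM t_hierarchy h
    JOIN tax t ON h.tax_parent = t.id
)
SELECT * FROM tax;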
Update:
Will anything in your solution below change by the above modification?
No, it will work alright.
By "axes" I mean hierarchy chains. Currently, you have two axes: political hierarchy (though hablis) and tax hierarchy (through panchayats). If you are planning to add some more axes (which is of course improbable), you may consider storing the hierarchies in another table and adding "axis" field to that table. Again, it's very improbable that you want to do this, I just mentioned this possibility for the other readers who may have a similar problem.
Also, would you recommend that I create another Table like below to store the 7 properties of the Habitats ? Is there a better way to store such info ?
Yes, keeping them in a separate table is a good idea.