Matching characters between two strings - TSQL

I have two pieces of code, each of which splits a string into single characters and returns them row by row. Does anyone know of any built-in functions that can take the split strings and determine whether they are similar to each other?
SELECT SUBSTRING(Aux.Name, X.number+1, 1) AS Split
FROM
(SELECT 'Wes Davids' as Name) AS Aux
INNER JOIN master..spt_values X ON X.number < LEN(Aux.Name)
WHERE X.type = 'P'
1 W
2 e
3 s
4
5 D
6 a
7 v
8 i
9 d
10 s
SELECT SUBSTRING(Aux.Name, X.number+1, 1) AS Split
FROM
(SELECT 'W Davids' as Name) AS Aux
INNER JOIN master..spt_values X ON X.number < LEN(Aux.Name)
WHERE X.type = 'P'
1 W
2
3 D
4 a
5 v
6 i
7 d
8 s

For splitting a string into N-Grams, specifically unigrams in your case, you should use ngrams8k. For example:
SELECT ng.* FROM dbo.ngrams8k('Wes Davids',1) AS ng;
Returns:
position token
----------- ---------
1 W
2 e
3 s
4
5 D
6 a
7 v
8 i
9 d
10 s
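To make the idea of position/token pairs concrete outside the database, here is a minimal Python sketch of an n-gram splitter. The function name `ngrams` is an assumption for illustration only; it is not the ngrams8k implementation, it only mimics the shape of its output.

```python
def ngrams(text, n):
    """Return (position, token) pairs for every substring of length n (1-based positions)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [(i + 1, text[i:i + n]) for i in range(len(text) - n + 1)]


# Unigrams reproduce the row-by-row character split shown above.
unigrams = ngrams("Wes Davids", 1)
# Bigrams slide a two-character window over the same string.
bigrams = ngrams("Wes Davids", 2)
```

With n = 1 this yields one row per character, including the space at position 4.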
You can use it, for example, to quickly get the longest common substring between two strings, as shown below. You can then create a similarity score by dividing the length of the longest common substring (LCSS) by the length of the longer of the two strings (L2):
DECLARE
@string1 VARCHAR(100) = 'Joe Cook',
@string2 VARCHAR(100) = 'J Cook';
-- b holds the shorter string (S1), the longer string (S2) and their lengths (L1, L2)
SELECT TOP (1) *, LCSS = LEN(TRIM(ng.token)), similarity = 1.*LEN(TRIM(ng.token))/b.L2
FROM (VALUES(
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string1 ELSE @string2 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string2 ELSE @string1 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string1) ELSE LEN(@string2) END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string2) ELSE LEN(@string1) END
)) AS b(S1,S2,L1,L2)
CROSS JOIN master..spt_values AS x
-- split S1 into tokens of every length from 1 to L1
CROSS APPLY dbo.ngrams8k(b.S1,x.number+1) AS ng
WHERE x.[type] = 'P'
AND x.number < b.L1
-- keep only the tokens that also occur in S2
AND CHARINDEX(ng.token,b.S2) > 0
ORDER BY LEN(TRIM(ng.token)) DESC
GO
Returns:
S1 S2 position token LCSS Similarity
-------------- ------------ -------------------- ---------- ----- ---------------------------------------
J Cook Joe Cook 3 Cook 4 0.50000000000
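The scoring logic above can be sanity-checked outside SQL Server. The following is a rough Python sketch, not the ngrams8k-based query itself: `lcss_similarity` is a hypothetical helper name, and the brute-force substring scan is only an assumption that is reasonable for short strings such as names.

```python
def lcss_similarity(string1, string2):
    """Return (LCSS, similarity), where LCSS is the trimmed length of the longest
    substring of the shorter string that also occurs in the longer string."""
    shorter, longer = sorted((string1, string2), key=len)
    if not longer:
        return 0, 0.0  # both strings are empty; avoid dividing by zero
    best = 0
    for start in range(len(shorter)):
        for end in range(start + 1, len(shorter) + 1):
            token = shorter[start:end]
            if token in longer:
                # mirror LEN(TRIM(token)) from the query above
                best = max(best, len(token.strip()))
    return best, best / len(longer)


lcss, similarity = lcss_similarity("Joe Cook", "J Cook")
```

For 'Joe Cook' and 'J Cook' the longest shared token is ' Cook', which trims to 4 characters, so the score is 4/8.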
You can get a better similarity score by subtracting the Levenshtein distance (Lev) from the length of the shorter of the two strings (L1), then dividing that value by L2: (L1-Lev)/L2. You can use Phil Factor's Levenshtein function for this.
DECLARE
@string1 VARCHAR(100) = 'James Cook',
@string2 VARCHAR(100) = 'Jamess Cook';
SELECT
Lev = dbo.LEVENSHTEIN(@string1,@string2),
Similarity = (1.*b.L1-dbo.LEVENSHTEIN(@string1,@string2))/b.L2
FROM (VALUES(
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string1 ELSE @string2 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN @string2 ELSE @string1 END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string1) ELSE LEN(@string2) END,
CASE WHEN LEN(@string1)<= LEN(@string2) THEN LEN(@string2) ELSE LEN(@string1) END
)) AS b(S1,S2,L1,L2)
GO
Returns:
Lev Similarity
----------- ---------------------------------------
1 0.81818181818
This is an example of how to use the Levenshtein distance for measuring similarity. There are other algorithms, such as the Damerau–Levenshtein distance and the longest common subsequence. Damerau–Levenshtein is more precise but slower (Phil Factor has a Damerau–Levenshtein function in the aforementioned link, as well as a longest common subsequence function in a different post). The formula for similarity is the same: (L1-DLev)/L2. The longest common subsequence (LCSSq) is more accurate (but slower) than the longest common substring, and uses the same formula for calculating a similarity score: LCSSq/L2.
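As a cross-check of the (L1-Lev)/L2 formula, here is a small Python sketch using a textbook dynamic-programming Levenshtein distance. It is not Phil Factor's T-SQL function; the names `levenshtein` and `lev_similarity` are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def lev_similarity(string1, string2):
    """(L1 - Lev) / L2, with L1 the shorter length and L2 the longer length."""
    l1, l2 = sorted((len(string1), len(string2)))
    if l2 == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return (l1 - levenshtein(string1, string2)) / l2


lev = levenshtein("James Cook", "Jamess Cook")
score = lev_similarity("James Cook", "Jamess Cook")
```

Here the distance is 1 and the score is 9/11, which agrees with the 0.818 shown in the result set above.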
Hopefully this will get you started.

Related

Include zero count in groupby

I have this table
person outcome
Peter positive
Peter positive
Peter positive
Eric positive
Eric positive
Eric negative
and want to count the number of rows for each person and outcome (positive/negative).
select person, outcome, count(*)
from public.test123
group by person, outcome
person outcome count
Peter positive 3
Eric positive 2
Eric negative 1
But I also want a zero count for Peter negative. I've seen answers like this but I have nothing to join the table to?
How can I groupby, count and include zeros?
person outcome count
Peter positive 3
Peter negative 0
Eric positive 2
Eric negative 1
create table public.test123 (
person VARCHAR(20),
outcome VARCHAR(20));
insert into public.test123(person, outcome)
VALUES
('Peter', 'positive'),
('Peter', 'positive'),
('Peter', 'positive'),
('Eric', 'positive'),
('Eric', 'positive'),
('Eric', 'negative');
step-by-step demo:db<>fiddle
SELECT
s.person,
s.outcome,
SUM((t.outcome IS NOT NULL)::int) as cnt -- 4
FROM (
SELECT
*
FROM unnest(ARRAY['positive', 'negative']) as x(outcome), -- 1
(
SELECT DISTINCT -- 2
person
FROM test123
) s
) s
LEFT JOIN test123 t ON t.person = s.person AND t.outcome = s.outcome -- 3
GROUP BY s.person, s.outcome
1. Create a list of all possible outcome values.
2. Join it with all distinct person values. Now you have a Cartesian product containing every possible combination.
3. LEFT JOIN your original table to these combinations.
4. Count the non-NULL values for each combination (here I use SUM() over an expression that is 1 for non-NULL values and 0 otherwise).
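The four steps can be tried end to end without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module, so it is an assumption-laden stand-in: SQLite has no unnest(), and a small UNION ALL subquery plays the role of the outcome list, while COUNT(t.person) replaces the SUM over a cast boolean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test123 (person TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO test123 VALUES (?, ?)",
    [("Peter", "positive"), ("Peter", "positive"), ("Peter", "positive"),
     ("Eric", "positive"), ("Eric", "positive"), ("Eric", "negative")],
)

query = """
SELECT p.person, o.outcome, COUNT(t.person) AS cnt       -- step 4: count matches only
FROM (SELECT DISTINCT person FROM test123) AS p          -- step 2: every person
CROSS JOIN (SELECT 'positive' AS outcome
            UNION ALL SELECT 'negative') AS o            -- step 1: every outcome
LEFT JOIN test123 AS t                                   -- step 3: attach the real rows
  ON t.person = p.person AND t.outcome = o.outcome
GROUP BY p.person, o.outcome
ORDER BY p.person, o.outcome
"""
counts = conn.execute(query).fetchall()
conn.close()
```

Because COUNT ignores NULLs, the combination that has no matching row (Peter, negative) comes back with a count of 0.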

Find and Replace numbers in a string

If I input a string as given below, it should be converted as shown.
Ex 1: String 5AB89C should be converted to 0000000005AB0000000089C
Ex 2: String GH1HJ should be converted to GH0000000001HJ
Ex 3: String N99K7H45 should be converted to N0000000099K0000000007H0000000045
Each number should be left-padded with zeros to a total of 10 digits, including the number itself. In Ex 1, the number 5 gets 9 leading zeros to make 10 digits; in the same way, 89 gets 8 leading zeros to make a total of 10 digits. Letters and any special characters should be left untouched.
Once you get a copy of PatternSplitCM, this is easy as pie.
Here's how we do it with one value:
DECLARE @String VARCHAR(8000) = '5AB89C';
-- Left-pad each run of digits to 10 characters (RIGHT would truncate runs longer than 10)
SELECT CASE f.[matched] WHEN 1 THEN RIGHT('0000000000' + f.item, 10) ELSE f.item END
FROM dbo.patternsplitCM(@String,'[0-9]') AS f
ORDER BY f.ItemNumber
FOR XML PATH('');
Returns: 0000000005AB0000000089C
Now against a table:
-- sample data
DECLARE @table TABLE (StringId INT IDENTITY, String VARCHAR(8000));
INSERT @table(String)
VALUES('5AB89C'),('GH1HJ'),('N99K7H45');
SELECT t.StringId, oldstring = t.String, newstring = f.padded
FROM @table AS t
CROSS APPLY
(
-- Left-pad each run of digits to 10 characters (RIGHT would truncate runs longer than 10)
SELECT CASE f.[matched] WHEN 1 THEN RIGHT('0000000000' + f.item, 10) ELSE f.item END
FROM dbo.patternsplitCM(t.String,'[0-9]') AS f
ORDER BY f.ItemNumber
FOR XML PATH('')
) AS f(padded);
Returns:
StringId oldstring newstring
----------- ----------------- --------------------------------------
1 5AB89C 0000000005AB0000000089C
2 GH1HJ GH0000000001HJ
3 N99K7H45 N0000000099K0000000007H0000000045
... and that's it. The code to create PatternSplitCM is below.
PatternSplitCM Code:
CREATE FUNCTION dbo.PatternSplitCM
(
@List VARCHAR(8000) = NULL
,@Pattern VARCHAR(50)
) RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
-- numbers: one row per character position in @List (up to 10,000 rows)
WITH numbers AS (
SELECT TOP(ISNULL(DATALENGTH(@List), 0))
n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n))
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),
Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),
Matched
FROM (
-- Grouper is constant within each run of consecutive matched (or unmatched) characters
SELECT n, y.Matched, Grouper = n - ROW_NUMBER() OVER(ORDER BY y.Matched,n)
FROM numbers
CROSS APPLY (
SELECT Matched = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern THEN 1 ELSE 0 END
) y
) d
GROUP BY Matched, Grouper;
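The split-into-runs, pad, and concatenate approach can be mirrored in a few lines of Python to check the expected strings from the question. This is only a sketch of the same idea, not a port of PatternSplitCM; `pattern_split` and `pad_numbers` are hypothetical names, and the digit test is limited to ASCII 0-9 by assumption.

```python
from itertools import groupby


def is_ascii_digit(char):
    return "0" <= char <= "9"


def pattern_split(text, is_match):
    """Yield (item_number, item, matched) for each run of consecutive characters
    that all match, or all fail to match, the predicate."""
    for number, (matched, run) in enumerate(groupby(text, key=is_match), start=1):
        yield number, "".join(run), matched


def pad_numbers(text, width=10):
    """Left-pad every run of digits with zeros to the given width; leave the rest alone."""
    return "".join(
        item.zfill(width) if matched else item
        for _, item, matched in pattern_split(text, is_ascii_digit)
    )


padded = [pad_numbers(s) for s in ("5AB89C", "GH1HJ", "N99K7H45")]
```

Unlike RIGHT('0000000000' + item, 10) in T-SQL, zfill leaves a digit run longer than the width untouched instead of truncating it.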

Check for equal amounts of negative numbers as positive numbers

I have a table with two columns: intGroupID, decAmount
I want to have a query that can basically return the intGroupID as a result if for every positive(+) decAmount, there is an equal and opposite negative(-) decAmount.
So a table of (id=1,amount=1.0),(1,2.0),(1,-1.0),(1,-2.0) would return back the intGroupID of 1, because for each positive number there exists a negative number to match.
What I know so far is that there must be an even number of decAmount rows (so I enforce count(*) % 2 = 0) and the sum of all rows must equal 0.0. However, some cases that get past that logic are:
ID | Amount
1 | 1.0
1 | -1.0
1 | 2.0
1 | -2.0
1 | 3.0
1 | 2.0
1 | -4.0
1 | -1.0
This has a sum of 0.0 and has an even number of rows, but there is not a 1-for-1 relationship of positives to negatives. I need a query that can basically tell me if there is a negative amount for each positive amount, without reusing any of the rows.
I tried counting the distinct absolute values of the numbers and enforcing that it is less than the count of all rows, but it's not catching everything.
The code I have so far:
DECLARE @tblTest TABLE(
intGroupID INT
,decAmount DECIMAL(19,2)
);
INSERT INTO @tblTest (intGroupID ,decAmount)
VALUES (1,-1.0),(1,1.0),(1,2.0),(1,-2.0),(1,3.0),(1,2.0),(1,-4.0),(1,-1.0);
DECLARE @intABSCount INT = 0
,@intFullCount INT = 0;
SELECT @intFullCount = COUNT(*) FROM @tblTest;
SELECT @intABSCount = COUNT(*) FROM (
SELECT DISTINCT ABS(decAmount) AS absCount FROM @tblTest GROUP BY ABS(decAmount)
) AS absCount
SELECT t1.intGroupID
FROM @tblTest AS t1
/* Make Sure Even Number Of Rows */
INNER JOIN
(SELECT COUNT(*) AS intCount FROM @tblTest
)
AS t2 ON t2.intCount % 2 = 0
/* Make Sure Sum = 0.0 */
INNER JOIN
(SELECT SUM(decAmount) AS decSum FROM @tblTest)
AS t3 ON decSum = 0.0
/* Make Sure Count of Absolute Values < Count of Values */
WHERE
@intABSCount < @intFullCount
GROUP BY t1.intGroupID
I think there is probably a better way to check this table, possibly by finding pairs and removing them from the table and seeing if there's anything left in the table once there are no more positive/negative matches, but I'd rather not have to use recursion/cursors.
Create TABLE #tblTest (
intA INT
,decA DECIMAL(19,2)
);
INSERT INTO #tblTest (intA,decA)
VALUES (1,-1.0),(1,1.0),(1,2.0),(1,-2.0),(1,3.0),(1,2.0),(1,-4.0),(1,-1.0), (5,-5.0),(5,5.0) ;
SELECT * FROM #tblTest;
SELECT
intA
, MIN(Result) as IsBalanced
FROM
(
SELECT intA, X,Result =
CASE
WHEN count(*)%2 = 0 THEN 1
ELSE 0
END
FROM
(
---- Start thinking here --- inside-out
SELECT
intA
, x =
CASE
WHEN decA < 0 THEN
-1 * decA
ELSE
decA
END
FROM #tblTest
) t1
Group by intA, X
)t2
GROUP BY intA
Not tested but I think you can get the idea
This returns the ids that do not conform.
The negative case is easier to test and debug.
select pos.*, neg.*
from
( select id, amount, count(*) as ccount
from tbl
where amount > 0
group by id, amount ) pos
full outer join
( select id, amount, count(*) as ccount
from tbl
where amount < 0
group by id, amount ) neg
on pos.id = neg.id
and pos.amount = -neg.amount
and pos.ccount = neg.ccount
where pos.id is null
or neg.id is null
I think this will return a list of id that do conform
select distinct(id) from tbl
except
select distinct(isnull(pos.id, neg.id))
from
( select id, amount, count(*) as ccount
from tbl
where amount > 0
group by id, amount ) pos
full outer join
( select id, amount, count(*) as ccount
from tbl
where amount < 0
group by id, amount ) neg
on pos.id = neg.id
and pos.amount = -neg.amount
and pos.ccount = neg.ccount
where pos.id is null
or neg.id is null
Boy, I found a simpler way to do this than my previous answers. I hope all my crazy edits are saved for posterity.
This works by grouping all numbers for an id by their absolute value (1, -1 grouped by 1).
The sum of the group determines if there are an equal number of pairs. If it is 0 then it is equal, any other value for the sum means there is an imbalance.
The detection of evenness by the COUNT aggregate is only necessary to detect an even number of zeros. I assumed that 0's could exist and they should occur an even number of times. Remove it if this isn't a concern, as 0 will always pass the first test.
I rewrote the query a bunch of different ways to get the best execution plan. The final result below only has one big heap sort which was unavoidable given the lack of an index.
Query
WITH tt AS (
SELECT intGroupID,
CASE WHEN SUM(decAmount) <> 0 OR COUNT(*) % 2 = 1 THEN 1 ELSE 0 END unequal
FROM #tblTest
GROUP BY intGroupID, ABS(decAmount)
)
SELECT tt.intGroupID,
CASE WHEN SUM(unequal) != 0 THEN 'not equal' ELSE 'equals' END [pair]
FROM tt
GROUP BY intGroupID;
Tested Values
(1,-1.0),(1,1.0),(1,2),(1,-2), -- should work
(2,-1.0),(2,1.0),(2,2),(2,2), -- fail, two positive twos
(3,1.0),(3,1.0),(3,-1.0), -- fail two 1's , one -1
(4,1),(4,2),(4,-.5),(4,-2.5), -- fail: adds up the same sum, but different values
(5,1),(5,-1),(5,0),(5,0), -- work, test zeros
(6,1),(6,-1),(6,0), -- fail, test zeros
(7,1),(7,-1),(7,-1),(7,1),(7,1) -- fail, 3 x 1
Results
A pairs
_ _____
1 equal
2 not equal
3 not equal
4 not equal
5 equal
6 not equal
7 not equal
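The pairing rule can also be expressed directly as a count comparison, which makes it easy to test against the values listed above. This Python sketch is an illustration under assumptions, not the T-SQL answer itself: `balanced_groups` is a hypothetical name, and zero is treated as balanced only when it occurs an even number of times, as in the answer.

```python
from collections import Counter, defaultdict


def balanced_groups(rows):
    """Map each group id to True when every positive amount has exactly as many
    matching negative amounts, and zeros occur an even number of times."""
    counts = defaultdict(Counter)
    for group_id, amount in rows:
        counts[group_id][amount] += 1
    return {
        group_id: all(
            (n % 2 == 0) if amount == 0 else (n == c.get(-amount, 0))
            for amount, n in c.items()
        )
        for group_id, c in counts.items()
    }


rows = [
    (1, -1.0), (1, 1.0), (1, 2), (1, -2),
    (2, -1.0), (2, 1.0), (2, 2), (2, 2),
    (3, 1.0), (3, 1.0), (3, -1.0),
    (4, 1), (4, 2), (4, -.5), (4, -2.5),
    (5, 1), (5, -1), (5, 0), (5, 0),
    (6, 1), (6, -1), (6, 0),
    (7, 1), (7, -1), (7, -1), (7, 1), (7, 1),
]
result = balanced_groups(rows)
```

Only groups 1 and 5 come out balanced, matching the results table above. For real DECIMAL data, decimal.Decimal values would be a safer key than floats.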
The following should return the unbalanced groups:
;with pos as (
select intGroupID, ABS(decAmount) m
from TableName
where decAmount > 0
), neg as (
select intGroupID, ABS(decAmount) m
from TableName
where decAmount < 0
)
select distinct IsNull(p.intGroupID, n.intGroupID) as intGroupID
from pos p
full join neg n on n.intGroupID = p.intGroupID and abs(n.m - p.m) < 1e-8
where p.m is NULL or n.m is NULL
To get the unpaired elements, the select statement can be changed to the following:
select IsNull(p.intGroupID, n.intGroupID) as intGroupID, IsNull(p.m, -n.m) as decAmount
from pos p
full join neg n on n.intGroupID = p.intGroupID and abs(n.m - p.m) < 1e-8
where p.m is NULL or n.m is NULL
Does this help?
-- Expected result - group 1 and 3
declare @matches table (groupid int, value decimal(5,2))
insert into @matches select 1, 1.0
insert into @matches select 1, -1.0
insert into @matches select 2, 2.0
insert into @matches select 2, -2.0
insert into @matches select 2, -2.0
insert into @matches select 3, 3.0
insert into @matches select 3, 3.5
insert into @matches select 3, -3.0
insert into @matches select 3, -3.5
insert into @matches select 4, 4.0
insert into @matches select 4, 4.0
insert into @matches select 4, -4.0
-- Get groups where we have matching positive/negatives, with the same number of each
select mat.groupid, min(case when pos.PositiveCount = neg.NegativeCount then 1 else 0 end) as 'Match'
from @matches mat
LEFT JOIN (select groupid, SUM(1) as 'PositiveCount', Value
from @matches where value > 0 group by groupid, value) pos
on pos.groupid = mat.groupid and pos.value = ABS(mat.value)
LEFT JOIN (select groupid, SUM(1) as 'NegativeCount', Value
from @matches where value < 0 group by groupid, value) neg
on neg.groupid = mat.groupid and neg.value = case when mat.value < 0 then mat.value else mat.value * -1 end
group by mat.groupid
-- If at least one pair within a group doesn't match, reject
having min(case when pos.PositiveCount = neg.NegativeCount then 1 else 0 end) = 1
You can compare your values this way:
declare @t table(id int, amount decimal(4,1))
insert @t values(1,1.0),(1,-1.0),(1,2.0),(1,-2.0),(1,3.0),(1,2.0),(1,-4.0),(1,-1.0),(2,-1.0),(2,1.0)
;with a as
(
select count(*) cnt, id, amount
from @t
group by id, amount
)
select id from @t
except
select b.id from a
full join a b
on a.cnt = b.cnt and a.amount = -b.amount
where a.id is null
For some reason I can't write comments. However, Daniel's comment is not correct: my solution does accept (6,1),(6,-1),(6,0), which can be considered correct. Zero is not specified in the question, and since it is a 0 value it can be handled either way. My answer does NOT accept (3,1.0),(3,1.0),(3,-1.0).
To Blam: No, I am not missing
or b.id is null
My solution is like yours, but not identical.

Getting a label associated with a maximum value over a partition (SQL)

I know there must be a better way to do this and I'm brain dead today.
I have two tables :
Reference
Id Label
1 Apple
2 Banana
3 Cherry
Elements
Id ReferenceId P1 P2 Qty
1 1 1 2 8
2 2 2 3 14
3 1 3 2 1
4 3 2 1 6
5 3 1 2 3
I want to group these up primarily by (P1, P2) but independent of the order of P1 and P2 - so that (1,2) and (2,1) map to the same group. That's fine.
The other part is that I want to get the label that has the largest sum(qty) for a given P1, P2 pair. In other words, I want the result set to be:
P1 P2 TotalQty MostRepresentativeLabel
1 2 17 Cherry
2 3 15 Banana
All I can come up with is this awful mess:
select endpoint1, endpoint2, totalTotal, mostRepresentativeLabelByQty from
(
select SUM(qty)as total
,case when (p1<p2) then p1 else p2 end as endpoint1
,case when (p1<p2) then p2 else p1 end as endpoint2
,reference.label as mostRepresentativeLabelByQty
from elements inner join reference on elements.fkId = reference.id
group by case when (p1<p2) then p1 else p2 end
,case when (p1<p2) then p2 else p1 end
,label
) a inner join
(
select MAX(total) as highestTotal, SUM(total) as totalTotal from
(
select SUM(qty)as total
,case when (p1<p2) then p1 else p2 end as endpoint1
,case when (p1<p2) then p2 else p1 end as endpoint2
,reference.label as mostRepresentativeLabelByQty
from elements inner join reference on elements.fkId = reference.id
group by case when (p1<p2) then p1 else p2 end
,case when (p1<p2) then p2 else p1 end
,label
) byLabel
group by endpoint1, endpoint2
) b
on a.total = b.highestTotal
Which .. works ... but I'm not convinced. This ultimately is going to be running on much larger datasets (200,000 rows or so) so I'm not liking this approach - is there a simpler way to express "use the value from this column where some other column is maximized" that I'm totally blanking on?
(SQL Server 2008 R2 by the way)
I use the sum of the BINARY_CHECKSUMs of P1 and P2 to identify each group independently of the order of P1 and P2. This sum is exposed through the BC alias and permits the grouping needed to find the label with the largest total in each group.
DECLARE @Reference TABLE(ID INT, Label VARCHAR(10));
DECLARE @Elements TABLE(ID INT, ReferenceID INT, P1 INT, P2 INT, Qty INT);
INSERT INTO @Reference VALUES
(1,'Apple')
, (2,'Banana')
, (3,'Cherry');
INSERT INTO @Elements VALUES
(1,1,1,2,8)
, (2,2,2,3,14)
, (3,1,3,2,1)
, (4,3,2,1,6)
, (5,3,1,2,3);
; WITH a AS (
SELECT
P1, P2=P2, Qty, BC=ABS(BINARY_CHECKSUM(CAST(P1 AS VARCHAR(10))))+ABS(BINARY_CHECKSUM(CAST(P2 AS VARCHAR(10))))
, Label
, LabelSum=SUM(Qty)OVER(PARTITION BY ABS(BINARY_CHECKSUM(CAST(P1 AS VARCHAR(10))))+ABS(BINARY_CHECKSUM(CAST(P2 AS VARCHAR(10)))),Label)
, GroupSum=SUM(Qty)OVER(PARTITION BY ABS(BINARY_CHECKSUM(CAST(P1 AS VARCHAR(10))))+ABS(BINARY_CHECKSUM(CAST(P2 AS VARCHAR(10)))))
FROM @Elements e
INNER JOIN @Reference r on r.ID=e.ReferenceID
)
, r AS (
SELECT *, rnk=RANK()OVER(PARTITION BY BC ORDER BY LabelSum DESC)
FROM a
)
SELECT P1=MIN(P1)
, P2=MAX(P2)
, TotalQty=GroupSum
, MostRepresentativeLabel=Label
FROM r
WHERE rnk=1
GROUP BY GroupSum,Label
ORDER BY GroupSum DESC;
GO
Result:
EDIT: Wrapped each BINARY_CHECKSUM in ABS. Because BINARY_CHECKSUM returns a signed INT, this decreases the chance of a collision between two different groups where a positive BINARY_CHECKSUM is summed with a negative BINARY_CHECKSUM.
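For comparison, an order-independent group key can be built without checksums by putting the smaller endpoint first, which is what the CASE expressions in the question do. The Python sketch below is an assumption-based illustration of that grouping on the sample data, not a translation of the query above; `most_representative` is a hypothetical name.

```python
from collections import defaultdict

reference = {1: "Apple", 2: "Banana", 3: "Cherry"}
# (Id, ReferenceId, P1, P2, Qty)
elements = [(1, 1, 1, 2, 8), (2, 2, 2, 3, 14), (3, 1, 3, 2, 1),
            (4, 3, 2, 1, 6), (5, 3, 1, 2, 3)]


def most_representative(elements, reference):
    """Return {(low, high): (total_qty, label)} where label has the largest Qty sum."""
    per_label = defaultdict(lambda: defaultdict(int))
    for _id, reference_id, p1, p2, qty in elements:
        key = (min(p1, p2), max(p1, p2))  # (1,2) and (2,1) share one key
        per_label[key][reference[reference_id]] += qty
    return {
        key: (sum(totals.values()), max(totals, key=totals.get))
        for key, totals in per_label.items()
    }


summary = most_representative(elements, reference)
```

On the sample data this gives 17/Cherry for (1,2) and 15/Banana for (2,3). If two labels tie, max() simply keeps the first one it encounters.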

Switch Case in T-SQL In where Clause

Why doesn't this work? Please help.
SELECT X
FROM Y
WHERE Z >= 5
AND A IN (CASE @someParameter when 1 THEN (5) ELSE (4,5) END)
Whereas the query below works:
SELECT X
FROM Y
WHERE Z >= 5
AND A = (CASE @someParameter when 1 THEN 5 ELSE 4 END)
You could accomplish that without case, like:
WHERE Z >= 5
AND (
@SomeParameter = 1 AND A = 5
OR
@SomeParameter <> 1 AND A IN (4,5)
)
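The AND/OR rewrite can be exercised without a SQL Server instance. The sketch below uses Python's built-in sqlite3 module as a stand-in, so the table contents, the lowercase names, and the `:p` named parameter (in place of the T-SQL variable) are all assumptions made for illustration.

```python
import sqlite3


def rows_for(param):
    """Run the AND/OR version of the filter against a tiny in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE y (x TEXT, z INTEGER, a INTEGER)")
    conn.executemany(
        "INSERT INTO y VALUES (?, ?, ?)",
        [("r1", 5, 4), ("r2", 6, 5), ("r3", 7, 6), ("r4", 1, 5)],
    )
    sql = """
        SELECT x FROM y
        WHERE z >= 5
          AND ((:p = 1 AND a = 5)
            OR (:p <> 1 AND a IN (4, 5)))
        ORDER BY x
    """
    result = [row[0] for row in conn.execute(sql, {"p": param})]
    conn.close()
    return result


only_five = rows_for(1)
four_or_five = rows_for(2)
```

With the parameter set to 1 only the row where a = 5 survives; with any other value both a = 4 and a = 5 pass, and the z >= 5 condition applies in both cases.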
You can use a select statement instead of mentioning the numbers directly. I haven't tried executing it, but the idea is to get the required set of numbers by using a select query inside the brackets.
SELECT X
FROM Y
WHERE Z >= 5
AND A IN (CASE @someParameter when 1 THEN (SELECT 4) ELSE (SELECT 4 UNION SELECT 5) END)
You can't return a set, range, or table (or anything else you want to call multiple values) from a CASE expression; each branch must produce a single scalar value. That (4,5) expression is just not allowed. My advice instead is to build these values as a small lookup table you can select from.
SELECT X
FROM Y
WHERE Z >= 5
AND
CASE @someParameter
when 1 THEN A = 5
ELSE (A = 4 OR A = 5)
END
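Note: I haven't tried this syntax. T-SQL does not allow a CASE branch to return a predicate such as A = 5, so this will most likely need to be rewritten with AND/OR, as in the first answer.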