A simple query that returns the top 10 results by similarity:
SELECT name, similarity(name, 'some text') as sm
FROM table
WHERE name % 'some text'
ORDER BY sm DESC
LIMIT 10
But there is one case where I need to expand the limit on the returned data.
For example, let's say the table has 11 rows containing 'some text' and 20 rows containing 'some text 2'.
These texts are similar, so after the query executes the result will be only 10 rows with 'some text'.
How can I return all rows that are duplicates, and after that a LIMITed amount of other data?
The expected result would be:
11 rows with 'some text',
and after that 10 rows with other similarity, in this case 'some text 2'.
All returned results: 21.
How can I achieve this?
If I understand you correctly, you need to use UNION:
-- Unlimited output of 'some text'
select
    name
    , similarity(name, 'some text') as sm
from table
where name % 'some text'
-----
union all  -- union all: plain UNION would collapse the identical duplicate rows
-----
select
    sub.name
    , sub.sm
from ( -- Limited output of 'some text 2'
    select
        name
        , similarity(name, 'some text 2') as sm
    from table
    where name % 'some text 2'
    order by sm desc  -- order before limiting, so the best matches are kept
    limit 10
) sub
order by
    sm desc
If you need "some text" to be limited as well, wrap it in a subquery similar to that for "some text 2" with the desired limit:
select
    sub1.name
    , sub1.sm
from ( -- Limited output of 'some text'
    select
        name
        , similarity(name, 'some text') as sm
    from table
    where name % 'some text'
    order by sm desc  -- order before limiting, so the best matches are kept
    limit 11
) sub1
-----
union all  -- union all: plain UNION would collapse identical rows
-----
select
    sub2.name
    , sub2.sm
from ( -- Limited output of 'some text 2'
    select
        name
        , similarity(name, 'some text 2') as sm
    from table
    where name % 'some text 2'
    order by sm desc
    limit 10
) sub2
order by
    sm desc
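As an aside: if the goal is "every row tied at the top similarity, then stop", PostgreSQL 13 and later offer FETCH FIRST ... WITH TIES, which keeps every row whose sort key equals that of the last returned row. A sketch, assuming the same placeholder table and search term as above (my_table stands in for your real table name):

SELECT name, similarity(name, 'some text') AS sm
FROM my_table
WHERE name % 'some text'
ORDER BY sm DESC
FETCH FIRST 10 ROWS WITH TIES;

With 11 identical 'some text' rows all sharing the top similarity, this returns all 11 even though only 10 rows were requested. Note the ties are decided on the ORDER BY key (sm), so rows of lower similarity are still cut off at the limit.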
I have data in Redshift:
id: 210396
created: 2021-09-01 05:42:15.80726
inputs_super: [{"desc":" Please check the pledge box, Pledge content","name":"pledge","type":"dropdown","values":["Agree","Disagree"]}]
desc: " Please check the pledge box, Pledge content"
name: "pledge"
values: ["Agree","Disagree"]
I need to parse the "values" list in Redshift and create a row for each of the list's items.
Example:
id: 210396
created: 2021-09-01 05:42:15.80726
inputs_super: [{"desc":" Please check the pledge box, Pledge content","name":"pledge","type":"dropdown","values":["Agree","Disagree"]}]
desc: " Please check the pledge box, Pledge content"
name: "pledge"
values: ["Agree"]
id: 210396
created: 2021-09-01 05:42:15.80726
inputs_super: [{"desc":" Please check the pledge box, Pledge content","name":"pledge","type":"dropdown","values":["Agree","Disagree"]}]
desc: " Please check the pledge box, Pledge content"
name: "pledge"
values: ["Disagree"]
I created this query to do this operation:
CREATE TEMP TABLE seq_0_to_100 AS (
SELECT 0 AS i UNION ALL
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5
-- I am stopping here, you could easily generate this as a VIEW with 100+ real rows...
);
WITH all_values AS (
SELECT c.*, d.desc, d.name, d.values
FROM (
SELECT id, created, JSON_PARSE(inputs) AS inputs_super
FROM course.table
WHERE prompttype = 'input'
) AS c,
c.inputs_super AS d
ORDER BY created DESC
LIMIT 10
), split_values AS (
SELECT id, json_extract_array_element_text(values, seq.i, True) AS size
FROM all_values, seq_0_to_100 AS seq
WHERE seq.i < JSON_ARRAY_LENGTH(values)
)
SELECT * FROM split_values;
But I get an error at the last step, when trying to split the list (in the "split_values" step):
ERROR: function json_extract_array_element_text(super, integer, boolean) does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.
Maybe you know how I can fix it?
The issue is that values is still of type SUPER and needs to be a string. The function that converts SUPER to a string is json_serialize().
I was curious so I built a test case from your question. Here's my working version:
drop table super;
create table super as select 210396 as id, '2021-09-01 05:42:15.80726'::timestamp as created,
'[{"desc":" Please check the pledge box, Pledge content","name":"pledge","type":"dropdown","values":["Agree","Disagree"]}]'::text as inputs,
'Please check the pledge box, Pledge content' as desc, 'pledge' as name;
drop table seq_0_to_100;
CREATE TEMP TABLE seq_0_to_100 AS (
SELECT 0 AS i UNION ALL
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5
-- I am stopping here, you could easily generate this as a VIEW with 100+ real rows...
);
WITH all_values AS (
SELECT c.*, d.desc, d.name, d.values
FROM (
SELECT id, created, JSON_PARSE(inputs) AS inputs_super
FROM super
--WHERE prompttype = 'input'
) AS c,
c.inputs_super AS d
ORDER BY created DESC
--LIMIT 10
), split_values AS (
SELECT id, i, json_serialize(values), json_extract_array_element_text(json_serialize(values), seq.i) AS size
FROM all_values, seq_0_to_100 AS seq
WHERE seq.i < JSON_ARRAY_LENGTH(json_serialize(values))
)
SELECT * FROM split_values;
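As an alternative worth considering: Redshift's PartiQL syntax can also unnest a SUPER array directly in the FROM clause, which would avoid both json_serialize() and the sequence table entirely. A sketch against the same test table as above (column names as in the question; treat this as an untested variant, support depends on your cluster version):

SELECT c.id, c.created, d."name", v AS single_value
FROM (
    SELECT id, created, JSON_PARSE(inputs) AS inputs_super
    FROM super
) AS c,
c.inputs_super AS d,   -- iterate the outer array of objects
d."values" AS v;       -- iterate the inner "values" array

Each element of "values" comes back as its own row in v, which is exactly the expected result in the question.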
SELECT XMLSERIALIZE(
XMLELEMENT(
NAME "row",
XMLFOREST(
A.TITLE AS "title",
A.TAG as "tag" )))
FROM ARTICLES A;
Expected result:
Instead of naming the columns (TITLE, TAG), can we keep '*' (meaning: select * from Articles)?
My table contains 150 columns.
As @data_henrik said, your question is not clear...
But if I understood you correctly, you want to create an XML doc for all columns of your base table, without having to code all 150 columns in the query.
I don't know a direct way of doing it, but you can do it in two steps.
The first step is a SELECT statement that builds your final SELECT based on the columns of your base table.
Let's take DEPARTMENT table from sample db as an example:
db2 => select * from department
DEPTNO DEPTNAME MGRNO ADMRDEPT LOCATION
------ ------------------------------------ ------ -------- ----------------
A00 SPIFFY COMPUTER SERVICE DIV. 000010 A00 -
B01 PLANNING 000020 A00 -
C01 INFORMATION CENTER 000030 A00 -
...
it has 5 columns. The following select will build your final SELECT ...
select
'SELECT XMLROW( ' ||
listagg(rtrim(colname) || ' AS "' || lcase(rtrim(colname)) || '"' , ', ')
|| ' ) FROM DEPARTMENT'
from syscat.columns
where tabname = 'DEPARTMENT'
------------------------------------------------------------------------------------------------------------
SELECT XMLROW( ADMRDEPT AS "admrdept", DEPTNAME AS "deptname", DEPTNO AS "deptno", LOCATION AS "location", MGRNO AS "mgrno" ) FROM DEPARTMENT
If you execute the resulting SELECT, it will produce XML similar to what you want.
----------------------------------------------------------------------------------------------------------
<row><admrdept>A00</admrdept><deptname>SPIFFY COMPUTER SERVICE DIV.</deptname><deptno>A00</deptno><mgrno>000010</mgrno></row>
<row><admrdept>A00</admrdept><deptname>PLANNING</deptname><deptno>B01</deptno><mgrno>000020</mgrno></row>
<row><admrdept>A00</admrdept><deptname>INFORMATION CENTER</deptname><deptno>C01</deptno><mgrno>000030</mgrno></row>
Note:
I used XMLROW for simplicity, instead of your XMLSERIALIZE(XMLELEMENT(XMLFOREST...)), just as an example, so you get the idea.
I have a T-SQL query that looks like this:
select * from (
    SELECT
        [Id],
        replace(ca.[AKey], '-', '') as [AKey1],
        rtrim(replace(replace(replace(lower([Name]), '#', ''), '(1.0)', ''), '(2.5)', '')) as [Name],
        [Key],
        dw.[AKey],
        replace(lower(trim([wName])), '#', '') as [wName]
    FROM [dbo].[wTable] ca
    FULL JOIN (select * from [dw].[wTable]) dw
        on rtrim(left(replace(replace(replace(lower(dw.[wName]), '(1.0)', ''), '(2.5)', ''), '#', ''), 5)) + '%'
           like
           rtrim(left(replace(replace(replace(lower(ca.[Name]), '(1.0)', ''), '(2.5)', ''), '#', ''), 5)) + '%'
        and
           right(rtrim(replace(replace(replace(lower(dw.[wName]), '(1.0)', ''), '(2.5)', ''), '#', '')), 2)
           like
           right(rtrim(replace(replace(replace(lower(ca.[Name]), '(1.0)', ''), '(2.5)', ''), '#', '')), 2)
) tp
As you can see, during the JOIN, it's removing some fuzzy characters that may or may not exist, and it's checking to see if the first 5 characters in the wName column match with the first 5 characters in the Name column, then doing the same for the last 2 characters in the columns.
So essentially, it's matching on the first 5 characters AND last 2 characters.
What I'm trying to add is an additional column that will tell me if the resulting columns are an exact match or if they are fuzzy. In other words, if they are an exact match it should say 'True' or something like that, and if they are a fuzzy match I would ideally like it to tell me how far off they are. For example, how many characters do not match.
As JNevil mentioned, you could use Levenshtein. You can also use Damerau-Levenshtein or the Longest Common Substring, depending on how accurate you want to get and what your performance expectations are.
Below are two solutions. The first is a Levenshtein solution using a copy I grabbed from Phil Factor here. The Longest Common Substring solution uses my version of the Longest Common Substring, which is the fastest available for SQL Server (by far).
-- sample data
declare @t1 table (string1 varchar(100));
declare @t2 table (string2 varchar(100));
insert @t1 values ('abc'),('xxyz'),('1234'),('9923');
insert @t2 values ('abcd'),('xyz'),('2345'),('zzz');

-- Levenshtein
select string1, string2, Ld
from
(
    select *, Ld = dbo.LEVENSHTEIN(t1.string1, t2.string2)
    from @t1 t1
    cross join @t2 t2
) compare
where Ld <= 2;

-- Longest Common Substring
select string1, string2, lcss = item, lcssLen = itemLen, diff = mx.L - itemLen
from @t1 t1
cross join @t2 t2
cross apply dbo.lcssWindowAB(t1.string1, t2.string2, 20)
cross apply (values (IIF(len(string1) > len(string2), len(string1), len(string2)))) mx(L)
where mx.L - itemLen <= 2;
RESULTS
string1 string2 Ld
-------- -------- -----
abc abcd 1
xxyz xyz 1
1234 2345 2
string1 string2 lcss lcssLen diff
-------- -------- ----- ----------- -----------
abc abcd abc 3 1
xxyz xyz xyz 3 1
1234 2345 234 3 1
9923 2345 23 2 2
This does not answer your question but should get you started.
P.S. The distance reported between "9923" and "2345" is 4, which is actually correct for Levenshtein (delete the two 9s, then append 4 and 5); the pair only looks two apart under the Longest Common Substring measure. There are other Levenshtein functions out there, though.
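For the exact-match column the question asks about, one possible sketch (reusing the table-variable sample data style above, and assuming the same dbo.LEVENSHTEIN scalar function) is to compare the cleaned strings directly and report the distance alongside:

declare @t1 table (string1 varchar(100));
declare @t2 table (string2 varchar(100));
insert @t1 values ('abc'),('xxyz'),('1234');
insert @t2 values ('abc'),('xyz'),('2345');

select string1, string2,
       MatchType = case when string1 = string2 then 'Exact' else 'Fuzzy' end,
       CharsOff  = dbo.LEVENSHTEIN(string1, string2)  -- 0 for exact matches
from @t1 t1
cross join @t2 t2;

In the original query, string1/string2 would be the REPLACE/RTRIM-cleaned expressions already used in the join conditions, so "Exact" means the cleaned values match character for character.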
I have records like this in a table called "Entry":
TABLE: Entry
ID Tags
--- ------------------------------------------------------
1 Coffee, Tea, Cake, BBQ
2 Soda, Lemonade
...etc.
TABLE: Tags
ID TagName
---- -----------
1 Coffee
2 Tea
3 Soda
...
TABLE: TagEntry
ID TAGID ENTRYID
--- ----- -------
1 1 1
2 2 1
3 3 2
....
I need to loop through each record in the Entry table. For each row I need to split the comma-delimited tags, look up each tag name in the Tags table to grab its TagID, and then insert (TagID, EntryID) into the TagEntry bridge table for each tag.
I'm not sure how to go about this.
Try this
;with entry as
(
select 1 id, 'Coffee, Tea, Cake, BBQ' tags
Union all
select 2, 'Soda, Lemonade'
), tags as
(
select 1 id,'Coffee' TagName union all
select 2,'Tea' union all
select 3,'Soda'
), entryxml as
(
SELECT id, ltrim(rtrim(r.value('.','VARCHAR(MAX)'))) as Item from (
select id, CONVERT(XML, N'<root><r>' + REPLACE(tags,',','</r><r>') + '</r></root>') as XmlString
from entry ) x
CROSS APPLY x.XmlString.nodes('//root/r') AS RECORDS(r)
)
select e.id EntryId, t.id TagId from entryxml e
inner join tags t on e.Item = t.TagName
This SQL will split your Entry table, for joining to the others:
with raw as (
select * from ( values
(1, 'Coffee, Tea, Cake, BBQ'),
(2, 'Soda, Lemonade')
) Entry(ID,Tags)
)
, data as (
select ID, Tag = convert(varchar(255),' '), Tags, [Length] = len(Tags) from raw
union all
select
ID = ID,
Tag = case when charindex(',',Tags) = 0 then Tags else convert(varchar(255), substring(Tags, 1, charindex(',',Tags)-1) ) end,
Tags = substring(Tags, charindex(',',Tags)+1, 255),
[Length] = [Length] - case when charindex(',',Tags) = 0 then len(Tags) else charindex(',',Tags) end
from data
where [Length] > 0
)
select ID, Tag = ltrim(Tag)
from data
where Tag <> ''
and returns this for the given input:
ID Tag
----------- ------------
2 Soda
2 Lemonade
1 Coffee
1 Tea
1 Cake
1 BBQ
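On SQL Server 2016 and later, the same split can be done without XML tricks or recursion via the built-in STRING_SPLIT function. A sketch over the same sample rows (note STRING_SPLIT does not trim spaces, hence the LTRIM, and it does not guarantee element order before SQL Server 2022's ordinal argument):

select e.ID, Tag = ltrim(s.value)
from (values (1, 'Coffee, Tea, Cake, BBQ'),
             (2, 'Soda, Lemonade')) e(ID, Tags)
cross apply string_split(e.Tags, ',') s;

The resulting (ID, Tag) rows can then be joined to Tags on TagName and inserted into TagEntry, exactly as in the answers above.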
I've found a small annoyance that I was wondering how to get around...
In a simplified example, say I need to return "TEST B-19" and "TEST B-20"
I have a where clause that looks like:
where [Name] LIKE 'TEST B-[12][90]'
and it works... unless there's a "TEST B-10" or "TEST B-29" value that I don't want.
I'd rather not resort to doing both cases, because in more complex situations that would become prohibitive.
I tried:
where [Name] LIKE 'TEST B-[19-20]'
but of course that doesn't work because it is looking for single characters...
Thoughts? Again, this is a very simple example, I'd be looking for ways to grab ranges from 16 to 32 or 234 to 459 without grabbing all the extra values that could be created.
EDITED to include test examples...
You might see "TEXAS 22" or "THX 99-20-110-B6" or "E-19" or "SOUTHERN B" or "122 FLOWERS" in that field. The presence of digits is common, but not a steadfast rule, and there are absolutely no general patterns for hyphens, digits, characters, order, etc.
I would divide the Name column into the text part and the number part, convert the number part into an integer, and then check whether that value is between the bounds. Something like:
where cast(substring([Name], 7, 2) as integer) between 19 and 20
And, of course, if the possible structure of [Name] is much more complex, you'd have to calculate the values for 7 and 2, not hardcode them....
EDIT: If you want to filter out the ones not conforming to the pattern first, do the following:
where [Name] LIKE '%TEST B-__%'
and cast(substring([Name], CHARINDEX('TEST B-', [Name]) + LEN('TEST B-'), 2) as integer) between 19 and 20
Maybe it's faster to use CHARINDEX in place of the LIKE in the first line too, especially if you put an index on the computed value, but... that is only optimization... :)
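The "index on the computed value" idea can be sketched as a persisted computed column (table and column names are illustrative; TRY_CAST needs SQL Server 2012+ and returns NULL for rows that don't fit the pattern, rather than erroring):

ALTER TABLE MyTable ADD TestNo AS
    TRY_CAST(SUBSTRING([Name],
                       CHARINDEX('TEST B-', [Name]) + LEN('TEST B-'), 2) AS int)
    PERSISTED;

CREATE INDEX IX_MyTable_TestNo ON MyTable (TestNo);

-- the range check then becomes a plain indexed comparison:
-- SELECT * FROM MyTable WHERE TestNo BETWEEN 19 AND 20;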
EDIT: Tested the procedure. Given the following data:
jajajajajajajTEST B-100
jajajajajajajTEST B-85
jajajajjTEST B-100
jajjajajTEST B-100
jajajajajajajTEST B-00
jajajajaTEST B-100
jajajajajajajEST B-99
jajajajajajajTEST B-100
jajajajajajajTEST B-19
jajajajjTEST B-100
jajjajajTEST B-120
jajajajajajajTEST B-00
jajajajaTEST B-150
jajajajajajajEST B-20
TEST B-20asdfh asdfkh
The query returns the following rows:
jajajajajajajTEST B-19
TEST B-20asdfh asdfkh
Wildcards or no, you still have to edit the query every time you want to change the range definition. If you're always dealing with a range (and it's not always the same range), you might use parameters. For example:
DECLARE @min varchar(5)
DECLARE @max varchar(5)
SET @min = 'B-19'
SET @max = 'B-20'
SELECT
...
WHERE NAME BETWEEN @min AND @max
You should avoid transforming [NAME] as others have suggested (using functions on it) -- this way, your search can benefit from an index on it.
In any case -- you might re-consider your table structure. It sounds like 'TEST B-19' is a composite (non-normalized) value of category ('TEST') + sub-category ('B') + instance ('19'). Put it in a lookup table with 4 columns (id being the first), and then join it by id in whatever query needs to output the composite value. This will make searching and indexing much easier and faster.
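The normalized layout suggested above might look like this (all names illustrative):

CREATE TABLE TestName (
    id       int IDENTITY PRIMARY KEY,
    category varchar(20),  -- 'TEST'
    subcat   varchar(10),  -- 'B'
    instance int           -- 19
);

-- the range search becomes a plain integer comparison,
-- and the composite value is rebuilt only for display:
SELECT category + ' ' + subcat + '-' + CAST(instance AS varchar(10)) AS FullName
FROM TestName
WHERE instance BETWEEN 19 AND 20;

With instance stored as an integer, ranges like 16 to 32 or 234 to 459 need no string gymnastics at all.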
In the absence of test data, I generated my own. I just removed the 'TEST B-' prefix, converted to int, and did a BETWEEN.
With Numerals As
(
Select top 100 row_number() over (order by name) TestNumeral
from sys.columns
),
TestNumbers AS
(
Select 'TEST B-' + Convert (VarChar, TestNumeral) TestNumber
From Numerals
)
Select *
From TestNumbers
Where Cast (Replace (TestNumber, 'TEST B-', '') as Integer) between 1 and 16
This gave me
TestNumber
-------------------------------------
TEST B-1
TEST B-2
TEST B-3
TEST B-4
TEST B-5
TEST B-6
TEST B-7
TEST B-8
TEST B-9
TEST B-10
TEST B-11
TEST B-12
TEST B-13
TEST B-14
TEST B-15
TEST B-16
This means, however, that if you have different strategies for naming tests, you would have to remove all different kinds of prefixes.
Now, on the other hand, if your Test numbers are in the TEST-Space-TestType-Hyphen-TestNumber format, you could use PatIndex and SubString
With Numerals As
(
Select top 100 row_number() over (order by name) TestNumeral
from sys.columns
),
TestNumbers AS
(
Select 'TEST B-' + Convert (VarChar, TestNumeral) TestNumber
From Numerals
Where TestNumeral Between 10 and 19
UNION
Select 'TEST A-' + Convert (VarChar, TestNumeral) TestNumber
From Numerals
Where TestNumeral Between 20 and 29
)
Select *
From TestNumbers
Where Cast (SubString (TestNumber, PATINDEX ('%-%', TestNumber)+1, Len (TestNumber) - PATINDEX ('%-%', TestNumber)) as Integer) between 16 and 26
That should yield the following
TestNumber
-------------------------------------
TEST A-20
TEST A-21
TEST A-22
TEST A-23
TEST A-24
TEST A-25
TEST A-26
TEST B-16
TEST B-17
TEST B-18
TEST B-19
All of your examples seem to have the test numbers at the end. So if you can create a table of patterns and then JOIN using a LIKE statement, you may be able to make it work. Here is an example:
;
With TestNumbers As
(
select 'E-1' TestNumber
union select 'E-2'
union select 'E-3'
union select 'E-4'
union select 'E-5'
union select 'E-6'
union select 'E-7'
union select 'SOUTHERN B1'
union select 'SOUTHERN B2'
union select 'SOUTHERN B3'
union select 'SOUTHERN B4'
union select 'SOUTHERN B5'
union select 'SOUTHERN B6'
union select 'SOUTHERN B7'
union select 'Southern CC'
union select 'Southern DD'
union select 'Southern EE'
union select 'TEST B-1'
union select 'TEST B-2'
union select 'TEST B-3'
union select 'TEST B-4'
union select 'TEST B-5'
union select 'TEST B-6'
union select 'TEST B-7'
union select 'TEXAS 1'
union select 'TEXAS 2'
union select 'TEXAS 3'
union select 'TEXAS 4'
union select 'TEXAS 5'
union select 'TEXAS 6'
union select 'TEXAS 7'
union select 'THX 99-20-110-B1'
union select 'THX 99-20-110-B2'
union select 'THX 99-20-110-B3'
union select 'THX 99-20-110-B4'
union select 'THX 99-20-110-B5'
union select 'THX 99-20-110-B6'
union select 'THX 99-20-110-B7'
union select 'Southern AA'
union select 'Southern CC'
union select 'Southern DD'
union select 'Southern EE'
),
Prefixes as
(
Select 'TEXAS ' TestPrefix
Union Select 'THX 99-20-110-B'
Union Select 'E-'
Union Select 'SOUTHERN B'
Union Select 'TEST B-'
)
Select TN.TestNumber
From TestNumbers TN, Prefixes P
Where 1=1
And TN.TestNumber Like '%' + P.TestPrefix + '%'
And Cast (REPLACE (Tn.TestNumber, p.TestPrefix, '') AS INTEGER) between 4 and 6
This will give you
TestNumber
----------------
E-4
E-5
E-6
SOUTHERN B4
SOUTHERN B5
SOUTHERN B6
TEST B-4
TEST B-5
TEST B-6
TEXAS 4
TEXAS 5
TEXAS 6
THX 99-20-110-B4
THX 99-20-110-B5
THX 99-20-110-B6
(15 row(s) affected)
Is this acceptable:
WHERE [Name] IN ( 'TEST B-19', 'TEST B-20' )
The list of values can come from a subquery, e.g.:
WHERE [Name] IN ( SELECT [Name] FROM Elsewhere WHERE ... )