Check SQL Server table values against themselves - tsql

Imagine I had this table:
declare #tmpResults table ( intItemId int, strTitle nvarchar(100), intWeight float )
insert into #tmpResults values (1, 'Item One', 7)
insert into #tmpResults values (2, 'Item One v1', 6)
insert into #tmpResults values (3, 'Item Two', 6)
insert into #tmpResults values (4, 'Item Two v1', 7)
And a function, which we'll call fn_Lev that takes two strings, compares them to one another and returns the number of differences between them as an integer (i.e. the Levenshtein distance).
What's the most efficient way to query that table, check the fn_Lev value of each strTitle against all the other strTitles in the table and delete rows are similar to one another by a Levenshtein distance of 3, preferring to keeping higher intWeights?
So the after the delete, #tmpResults should contain
1 Item One 7
4 Item Two v1 7
I can think of ways to do this, but nothing that isn't horribly slow (i.e iterative). I'm sure there's a faster way?
Cheers,
Matt

SELECT strvalue= CASE
WHEN t1.intweight >= t2.intweight THEN t1.strtitle
ELSE t2.strtitle
END,
dist = Fn_lev(t1.strtitle, t2.strtitle)
FROM #tmpResults AS t1
INNER JOIN #tmpResults AS t2
ON t1.intitemid < t2.intitemid
WHERE Fn_lev(t1.strtitle, t2.strtitle) = 3
This will perform a self join that will match each row only once. It will excluding matching a row on itself or reverse of a previous match ie if A<->B is a match then B<->A isn't.
The case statement selects the highest weighted result

If I've understood you correctly, you can use a cross join
SELECT t1.intItemId AS Id1, t2.intItemId AS Id2, fn_Lev(t1.strTitle, t2.strTitle) AS Lev
FROM #tmpResults AS t1
CROSS JOIN #tmpResults AS t2
The cross join will give you the results of every combination of rows between the left and right side of the join (hence it doesn't need any ON clause, as it is matching everything to everything else). You can then use the result of the SELECT to choose which to delete.

Related

PostgreSQL - Can I get the inverse of distinct rows?

I have a table of contacts. Each contact has an associating website. Each website can have multiple contacts.
I ran a query to get one contact with Select distinct on (website). This works fine.
But I want to do something the the rest of the data not selected but Select distinct on (website). Is there an inverse command where I can find all records from websites that have NOT been processed?
Use except. Here is an illustration. order by is for clarity.
create temporary table the_table (i integer, tx text);
insert into the_table values
(1, 'one'),
(1, 'one more one'),
(1, 'yet another one'),
(2, 'two'),
(2, 'one more two'),
(2, 'yet another two'),
(3, 'three'),
(3, 'three alternative');
select * from the_table
EXCEPT
select distinct on (i) * from the_table
order by i;
i
tx
1
one more one
1
yet another one
2
yet another two
2
one more two
3
three alternative

Set repeating IDs till first record repeats (bulk load csv file)

I have a file that I imported via bulk-insert and I want to assign group IDs/sequences.
I would like to assign the IDs till the first record with the first character is repeated. In this example its "A".
The challenge I have is how to achieve this example and set the IDs like this example:
ID
data
1
A000abcefd
1
E00asoaskdaok
1
C000dasdasok
2
A100abcasds
2
E100aandas
2
C100adsokdas
Here is one way to do it, but given the limited info you provided I will make the following assumptions:
**The data in your table has some order to it. This obviously will not work if that is not the case. I used an ID, you use what you have.
**The first row in the table has the character you are looking for.
CREATE TABLE #tmp(ID int, [data] varchar(20))
INSERT INTO #tmp
VALUES
(1, 'A000abcefd'),
(2, 'E00asoaskdaok'),
(3, 'C000dasdasok'),
(4, 'A100abcasds'),
(5, 'E100aandas'),
(6, 'C100adsokdas')
DECLARE #CHAR varchar(1)
SELECT #CHAR = (SELECT TOP 1 SUBSTRING([data],1,1) FROM #tmp ORDER BY ID)
SELECT SUM(CASE WHEN SUBSTRING([data],1,1) = #CHAR THEN 1 ELSE 0 END)
OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) SeqNum
,[data]
FROM #tmp

Convert comma separated id to comma separated string in postgresql

I have comma separated column which represents the ids of emergency type like:
ID | Name
1 | 1,2,3
2 | 1,2
3 | 1
I want to make query to get name of the this value field.
1 - Ambulance
2 - Fire
3 - Police
EXPECTED OUTPUT
1 - Ambulance, Fire, Police
2 - Ambulance, Fire
3 - Ambulance
I just need to write select statement in postgresql to display string values instead of integer values in comma separated.
Comma separated values is bad database design practice, though postgre is so feature rich, that you can handle this task easily.
-- just simulate tables
with t1(ID, Name) as(
select 1 ,'1,2,3' union all
select 2 ,'1,2' union all
select 3 ,'1'
),
t2(id, name) as(
select 1, 'Ambulance' union all
select 2, 'Fire' union all
select 3, 'Police'
)
-- here is actual query
select s1.id, string_agg(t2.name, ',') from
( select id, unnest(string_to_array(Name, ','))::INT as name_id from t1 ) s1
join t2
on s1.name_id = t2.id
group by s1.id
demo
Though, if you can, change your approach. Right database design means easy queries and better performance.
To get the values for each of the ids is a simple query: select * from ;. Once you have the values you will have to parse the strings with a delimiter of ','. then you would have to assign the parsed sting values to the appropriate job titles, and remake the list. Are you writing this in a specific language?
or you could just assign the sorted value to something like 1,2,3 is equal to "some string", 1,2 is equal to "some other string", etc.
Assuming you have a table with the ids and values for Ambulance, Police and Fire, then you can use something like the following.
CREATE TABLE public.test1
(
id integer NOT NULL,
commastring character varying,
CONSTRAINT pk_test1 PRIMARY KEY (id)
);
INSERT INTO public.test1
VALUES (1, '1,2,3'), (2, '1,2'), (3, '1');
CREATE TABLE public.test2
(
id integer NOT NULL,
description character varying,
CONSTRAINT pk_test2 PRIMARY KEY (id)
);
INSERT INTO public.test2
VALUES (1, 'Ambulance'), (2, 'Fire'), (3, 'Police');
with descs as
(with splits as
(SELECT id, split_part(commastring, ',', 1) as col2,
split_part(commastring, ',', 2) as col3, split_part(commastring, ',', 3) as col4 from test1)
select splits.id, t21.description as d1, t22.description as d2, t23.description as d3
from splits
inner join test2 t21 on t21.id::character varying = splits.col2
left join test2 t22 on t22.id::character varying = splits.col3
left join test2 t23 on t23.id::character varying = splits.col4)
SELECT descs.id, CASE WHEN d2 IS NOT NULL AND d3 IS NOT NULL
THEN CONCAT_WS(',', d1,d2,d3) ELSE CASE WHEN d2 IS NOT NULL
THEN CONCAT_WS(',', d1,d2) ELSE d1 END END FROM descs
ORDER BY id;
By way of explanation, I give the create table and insert commands, so that you (and others) can follow the logic. It would help enormously, if you were to do this in your questions, as it saves everyone time and avoids misunderstandings.
My inner CTE then splits the string using split_part. The syntax here is quite simple, field, separator and desired column within the field to be split (so in this case we need one, two and three). I then join the split columns to test2. Note two things here: the first join is an inner join, as there will always be at least one column in the split (I am assuming!!!), whereas the other two are left joins; secondly, the split of a character varying field in turn produces character varying splits, so I have to cast the int id to character varying for the join to work. Doing the cast this way round (i.e. id to character varying rather than character varying to id) means I don't have to bother about nulls. Finally depending on the number of nulls present, I concatenate the results with the given separator. Again I am assuming d1 will always have a value.
HTH

Concatenated columns should not match in 2 tables

I'll just put this in layman's terms since I'm a complete noobie:
I have 2 tables A and B, both having 2 columns of interest namely: employee_number and salary.
What I am looking to do is to extract rows of 'combination' of employee_number and salary from A that are NOT present in B, but each of employee_number and salary should be present in both.
I am looking to doing it with the 2 following conditions(please forgive the wrong function
names.. this is just to present the problem 'eloquently'):
1.) A.unique(employee_number) exists in B.unique(employee_number) AND A.unique(salary)
exists in B.unique(salary)
2.) A.concat(employee_number,salary) <> B.concat(employee_number,salary)
Note: A and B are in different databases, so I'm looking to use dblink to do this.
This is what I tried doing:
SELECT distinct * FROM dblink('dbname=test1 port=5432
host=test01 user=user password=password','SELECT employee_number,salary, employee_number||salary AS ENS FROM empsal.A')
AS A(employee_number int8, salary integer, ENS numeric)
LEFT JOIN empsalfull.B B on B.employee_number = A.employee_number AND B.salary = A.salary
WHERE A.ENS not in (select distinct employee_number || salary from empsalfull.B)
but it turned out to be wrong as I had it cross-checked by using spreadsheets and I don't get the same result.
Any help would be greatly appreciated. Thanks.
For easier understanding I left out the dblink.
Because, the first one selects lines in B that equal the employeenumber in A as well as the salery in A, so their concatenated values will equal as well (if you expect this to not be true, please provide some test data).
SELECT * from firsttable A
LEFT JOIN secondtable B where
(A.employee_number = B.employee_number AND a.salery != b.salery) OR
(A.salery = B.salery AND A.employee_number != B.employee_number)
If you have troubles with lines containing nulls, you might also try somthing like this:
AND (a.salery != b.salery OR (a.salery IS NULL AND b.salery IS NOT NULL) or (a.salery IS NOT
NULL and b.salery IS NULL))
I think you're looking for something along these lines.
(Sample data)
create table A (
employee_number integer primary key,
salary integer not null
);
create table B (
employee_number integer primary key,
salary integer not null
);
insert into A values
(1, 20000),
(2, 30000),
(3, 20000); -- This row isn't in B
insert into B values
(1, 20000), -- Combination in A
(2, 20000), -- Individual values in A
(3, 50000); -- Only emp number in A
select A.employee_number, A.salary
from A
where (A.employee_number, A.salary) NOT IN (select employee_number, salary from B)
and A.employee_number IN (select employee_number from B)
and A.salary IN (select salary from B)
output: 3, 20000

T-SQL grouping question

Every once and a while I have a scenario like this, and can never come up with the most efficient query to pull in the information:
Let's say we have a table with three columns (A int, B int, C int). My query needs to answer a question like this: "Tell me what the value of column C is for the largest value of column B where A = 5." A real world scenario for something like this would be 'A' is your users, 'B' is the date something happened, and 'C' is the value, where you want the most recent entry for a specific user.
I always end up with a query like this:
SELECT
C
FROM
MyTable
WHERE
A = 5
AND B = (SELECT MAX(B) FROM MyTable WHERE A = 5)
What am I missing to do this in a single query (opposed to nesting them)? Some sort of 'Having' clause?
BoSchatzberg's answer works when you only care about the 1 result where A=5. But I suspect this question is the result of a more general case. What if you want to list the top record for each distinct value of A?
SELECT t1.*
FROM MyTable t1
INNER JOIN
(
SELECT A, MAX(B)
FROM MyTable
GROUP BY A
) t2 ON t1.A = t2.A AND t1.B = t2.B
--
SELECT C
FROM MyTable
INNER JOIN (SELECT A, MAX(B) AS MAX_B FROM MyTable GROUP BY A) AS X
ON MyTable.A = X.A
AND MyTable.B = MAX_B
--
WHERE MyTable.A = 5
In this case the first section (between the comments) can also easily be moved into a view for modularity or reuse.
You can do this:
SELECT TOP 1 C
FROM MyTable
WHERE A = 5
ORDER BY b DESC
I think you are close (and what you have would work). You could use something like the following:
select C
, max(B)
from MyTable
where A = 5
group by C
After a little bit of testing, I don't think that this can be done without doing it the way you're already doing it (i.e. a subquery). Since you need the max of B and you can't get the value of C without also including that in a GROUP BY or HAVING clause, a subquery seems to be the best way.
create table #tempints (
a int,
b int,
c int
)
insert into #tempints values (1, 8, 10)
insert into #tempints values (1, 8, 10)
insert into #tempints values (2, 4, 10)
insert into #tempints values (5, 8, 10)
insert into #tempints values (5, 3, 10)
insert into #tempints values (5, 7, 10)
insert into #tempints values (5, 8, 15)
/* this errors out with "Column '#tempints.c' is invalid in the select list because it is not contained in either an
aggregate function or the GROUP BY clause." */
select t1.c, max(t1.b)
from #tempints t1
where t1.a=5
/* this errors with "An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING
clause or a select list, and the column being aggregated is an outer reference." */
select t1.c, max(t1.b)
from #tempints t1, #tempints t2
where t1.a=5 and t2.b=max(t1.b)
/* errors with "Column '#tempints.a' is invalid in the HAVING clause because it is not contained in either an aggregate
function or the GROUP BY clause." */
select c
from #tempints
group by b, c
having a=5 and b=max(b)
drop table #tempints