Merge SQL to Exclude Duplicate Records So Merge 2nd time Doesn't Fail - tsql

I have three tables, only one of which I directly control, and I am doing a MERGE between them. See my abbreviated but working example here (sqlfiddle example).
I am doing a MERGE from Table 1 and Table 2 into Table 3. Table 1 has duplicate data which the MERGE (erroneously) handles on the first run (the insert) but then fails with this message on the second run (the update):
The MERGE statement attempted to UPDATE or DELETE the same row more than once.
My question is, can the MERGE be written to either use an EXCEPT such as
SELECT AdFull FROM [dbo].[Users] WHERE AdFull IS NOT NULL
EXCEPT
SELECT AdFull FROM [dbo].[Users] WHERE AdFull IS NOT NULL
GROUP BY AdFull
HAVING COUNT(*) = 1
or a different Join to only show users that are not duplicated? Or even a way to select a specific one of the duplicates?
Answered Questions
The MERGE works as an insert due to the nature of Fiddle. But because of the (AFAIK) stateless nature of Fiddle, one never sees the error on a second run: a merge never happens with the data, only inserts.
Ignore Rows: Actually, I would eventually like to use an individual duplicate row, divined from the set based on a condition. The actual data table I am dealing with, away from the fiddle example, has more columns, and it would be nice to select a specific row in a duplicate set based on a specific condition.
The example doesn't bear it out, but yes, the duplicates are due to the computed AdFull column. Think of a system adding a temp employee: that user gets a row. Then the temp employee gets hired on as full-time, keeps the AD account, but gets another row in the user table. Yes, I know it shouldn't happen. So that is how a duplicate comes about.
(Duplicate values in Table 3) Table 3 is a result table that can be cleaned of any duplicates to start this process afresh.

In your MERGE statement, can you do something similar to this?
MERGE INTO [dbo].Table3 AS T3
USING
(
SELECT
AdFull,
MAX(StartedOn) AS StartedOn
FROM [dbo].Table2 AS [ad]
GROUP BY AdFull
) AS T2
ON (T2.AdFull = T3.AdFull)
WHEN MATCHED THEN UPDATE blah
WHEN NOT MATCHED THEN INSERT blah
Using the MAX aggregate with a GROUP BY should give you only the information from when the temp was hired on. Then if the AdFull matches you can simply UPDATE Table3 with the most recent information and if there is no match then INSERT a new row.
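If you eventually want to pick a specific row out of each duplicate set based on a condition (as the question mentions), a ROW_NUMBER() window in the USING clause is one way to do it. This is only a sketch in the same skeleton as above; the ORDER BY inside the OVER clause encodes whatever "keep this one" rule you choose:
MERGE INTO [dbo].Table3 AS T3
USING
(
    SELECT AdFull, StartedOn
    FROM
    (
        SELECT
            AdFull,
            StartedOn,
            -- rank the duplicates; the ORDER BY decides which duplicate wins
            ROW_NUMBER() OVER (PARTITION BY AdFull ORDER BY StartedOn DESC) AS rn
        FROM [dbo].Table2
    ) AS ranked
    WHERE rn = 1 -- keep exactly one row per AdFull
) AS T2
ON (T2.AdFull = T3.AdFull)
WHEN MATCHED THEN UPDATE blah
WHEN NOT MATCHED THEN INSERT blah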
UPDATE: If I fail to mention that MERGE should be used with caution, I will take flak from @AaronBertrand.

Unexpected sort order on postgres left outer join

Background
I'm using Postgres 11 and pgAdmin4 v5.2. The problem I describe below is on my dev machine which has both the postgres server and pgAdmin client.
Questions I've looked at on SO that deal with incorrect ordering seem to involve collation issues with the ordering of text fields, whereas my problem is with an integer field.
Setup
I have a table norm_plans that contains ~5k records.
Column   | Type
---------+------------------------
canon_id | integer
name     | character varying(200)
...      | (other fields)
canon_id is autopopulated using a sequence.
I've created a new table norm_plans_cmp as a copy of norm_plans (CREATE TABLE norm_plans_cmp AS TABLE norm_plans WITH DATA;)
I next insert some new records into norm_plans and update some existing records (fields other than canon_id).
The new records increment the sequence and are assigned canon_id values as expected.
I now want to compare norm_plans against norm_plans_cmp so I perform a left outer join:
select a.*, b.*
from norm_plans a
left outer join norm_plans_cmp b
on a.canon_id = b.canon_id
order by a.canon_id
Problem
I would expect records to be sorted by canon_id. This holds true from 1 to 2,000, but after 2,000 I get canon_ids from 5,001 to 5,111 (the last canon_id), and then it picks up again from 2,001. I'm viewing this data in pgAdmin; screenshot 1 below shows the shift from 2,000 to 5,001, and screenshot 2 shows the transition from 5,111 back to 2,001.
Additional observations
While incorrect, the ordering seems consistent. Running the query multiple times results in the same (incorrect) ordering.
Despite my question title, I'm not totally sure the left join has anything to do with this.
Running SELECT * ... ORDER BY canon_id on norm_plans or norm_plans_cmp alone also results in incorrect ordering, albeit at different points in the order.
Answers to this SO question suggest index corruption may be a contributing problem, but I have no indexes on either norm_plans or norm_plans_cmp (canon_id is not defined as a PK).
At this point, I'm stumped!

INSERT INTO .. SELECT causing possible race condition?

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
or, written differently:
WITH selection AS
(SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A))
INSERT INTO A SELECT * FROM selection;
If these queries run multiple times simultaneously, is it possible that I will end up with duplicated rows in A?
How does Postgres process these queries? Is it one or multiple?
If it is multiple queries (find max(timestamp)[1], select[2] then insert[3]) I can imagine this will cause duplicated rows.
If that is correct, would wrapping it in BEGIN/END (a transaction) help?
Yes, that might result in duplicate values.
A single statement sees a consistent view of the data in all tables as of the point in time when the statement started.
Wrapping that single statement in a transaction won't change that (a single statement is always executed atomically, regardless of the number of sub-queries involved).
The statement will never see uncommitted data from other transactions, which is the root cause of why you can wind up with duplicate values.
The only safe way to avoid duplicate values, is to create a unique constraint (or index) on that column. In that case the INSERT would result in an error if such a value already exists.
If you want to avoid the error, use insert ... on conflict
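For example (a sketch only: ON CONFLICT requires PostgreSQL 9.5+, the constraint name is made up, and it assumes timestamp by itself should be unique per row):
-- One-time setup: the unique constraint that detects duplicates.
ALTER TABLE A ADD CONSTRAINT a_timestamp_key UNIQUE (timestamp);

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A)
ON CONFLICT (timestamp) DO NOTHING; -- skip rows a concurrent run already inserted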
This depends on the isolation level set in your database.
Per the Postgres documentation, the default is Read Committed, under which each statement sees a snapshot of the data as of the moment that statement began. If two queries read before either one writes, then you will get duplicate data in these tables.
If you want to avoid having duplicate entries, you have a few options:
Try using the isolation level Serializable (sketched after this list).
Apply a unique index to a column of A. The timestamp is not a great contender, as you might legitimately have two rows with the same timestamp; the row id carried over from table B is probably a good option.
Take a lock at the application level before performing such a query (also sketched after this list).
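Sketches of the first and third options, reusing the query from the question (pg_advisory_xact_lock requires PostgreSQL 9.1+, and the lock key 42 is arbitrary):
-- Option 1: SERIALIZABLE; the database aborts one of two conflicting transactions.
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
COMMIT; -- may fail with serialization_failure (SQLSTATE 40001); the application must retry

-- Option 3: an advisory lock serializes the writers explicitly;
-- the lock blocks until free and is released automatically at commit/rollback.
BEGIN;
SELECT pg_advisory_xact_lock(42);
INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
COMMIT;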

ERROR: syntax error at or near "group"

Hello, I'm writing an SQL query, but I am getting a syntax error on the line with the GROUP BY. What could the problem be? Please help if you can.
UPDATE intersection_points i
SET nbr_victimes = sum(tue+bl+bg)
FROM accident_ma a ,intersection_points i
WHERE (ST_DWithin(i.st_intersection,a.geom_acc, 10000) group by st_intersection)) ;
GROUP BY is its own clause; it's not part of the WHERE clause.
This is what you have:
WHERE (
ST_DWithin(i.st_intersection,a.geom_acc, 10000)
group by st_intersection
)
This is what you need:
WHERE ST_DWithin(i.st_intersection,a.geom_acc, 10000)
group by st_intersection
Edit: In response to comments, it sounds like your JOIN is a bit more complex than the UPDATE ... FROM syntax is suited for. Take a look at the "Notes" section on this page:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
Because of this indeterminacy, referencing other tables only within sub-selects is safer, though often harder to read and slower than using a join.
Normally this would involve changing the syntax to something like:
UPDATE SomeTable
SET SomeColumn = 'Some Value'
WHERE AnotherColumn =
(SELECT AnotherColumn
FROM AnotherTable
-- etc.)
However, the use of ST_DWithin() in this query may complicate that quite a bit. Without much deeper knowledge of the table structures, relationships, and overall intent of this update there probably isn't much more help I can give. Essentially you're going to need to clarify for the database exactly what records need to be updated and how to update them, which may involve changing your query to this latter sub-select syntax in some way.
I don't understand your data structure, so I created the following tables from your query; please check the table structure.
If your tables are structured like this, your query must be:
UPDATE intersection_points
SET nbr_victimes = (
    SELECT SUM(a.tue + a.bl + a.bg)
    FROM accident_ma a
    WHERE ST_DWithin(st_intersection, a.geom_acc, 10000)
);

returning rows from an implicit JOIN where results either exist in both tables OR only one table

I've got a PostgreSQL-9.0.x database that manages an automated testing environment. There are a bunch of tables that contain assorted static data (OS versions, test names, etc) named 'buildlist' & 'osversmap'. However, there are also two tables which contain data which changes often. The first is a 'pending' table which is effectively a test queue where pending tests are self-selected by the test systems, and then deleted when the test run has completed. The second is a 'results' table which contains the test results as they are produced (in progress and completed).
The records in the pending table have a one to many relationship with the records in the results table (each row in pending can have 0 or more rows in results). For example, if no test systems have self-assigned a pending row, then there will be zero associated rows in results, and then once a pending row is assigned, the number of rows in results will increase for each pending row. An added catch is that I always want only the newest results table row associated with each pending table row. What I need to do is query the 'pending' table for pending tests, and then also get a 'logurl' from the results table that corresponds to each pending table row.
All of this is rather similar to this problem, except that I have the added burden of the two additional tables with the static data (buildlist & osversmap):
PHP/SQL: Using only one query, SELECT rows from two tables if data is in both tables, or just SELECT from one table if not
I'm stumbling over how to integrate those two tables with static data into the query. The following query works fine as long as there's at least one row in the 'results' table that corresponds to each row in the 'pending' table (however, it doesn't return anything for rows that exist in 'pending' but not yet in 'results'):
SELECT
pending.cl,
pending.id,
pending.buildid,
pending.build_type,
pending.active,
pending.submittracker,
pending.os,
pending.arch,
pending.osversion,
pending.branch,
pending.comment,
osversmap.osname,
buildlist.buildname,
results.logurl
FROM pending, osversmap, buildlist, results
WHERE
pending.buildid=buildlist.id
AND pending.os=osversmap.os
AND pending.osversion=osversmap.osversion
AND pending.owner='$owner'
AND pending.completed='f'
AND results.hostname=pending.active
AND results.submittracker=pending.submittracker
AND pending.cl=results.cl
AND results.current_status!='PASSED'
AND results.current_status NOT LIKE '%FAILED'
ORDER BY pending.submittracker,pending.branch,pending.os,pending.arch
thanks in advance!
Does the following work for you?
SELECT pending.cl,
pending.id,
pending.buildid,
pending.build_type,
pending.active,
pending.submittracker,
pending.os,
pending.arch,
pending.osversion,
pending.branch,
pending.comment,
osversmap.osname,
buildlist.buildname,
results.logurl
FROM pending
JOIN osversmap
ON ( pending.os = osversmap.os
AND pending.osversion = osversmap.osversion )
JOIN buildlist
ON ( pending.buildid = buildlist.id )
LEFT OUTER JOIN results
ON ( pending.active = results.hostname
AND pending.submittracker = results.submittracker
AND pending.cl = results.cl
AND results.current_status != 'PASSED'
AND results.current_status NOT LIKE '%FAILED'
)
WHERE pending.owner = '$owner'
AND pending.completed = 'f'
ORDER BY pending.submittracker,
pending.branch,
pending.os,
pending.arch
Note that the filters on results sit in the LEFT OUTER JOIN's ON clause rather than in the WHERE clause; moving them to WHERE would reject the all-NULL rows produced for pending rows that have no results yet, silently turning the outer join back into an inner join.
All of this is rather similar to this problem, except that I have the added burden of the two additional tables with the static data (buildlist & osversmap):
Simplest approach would be to build a view that returns the right rows without referencing buildlist and osversmap. Then join those two tables to the view.
CREATE VIEW your_meaningful_view_name AS
SELECT
pending.cl,
pending.id,
pending.buildid,
pending.build_type,
pending.active,
pending.submittracker,
pending.os,
pending.arch,
pending.osversion,
pending.branch,
pending.comment,
results.logurl
FROM pending
-- No DDL or sample INSERT statements. You might need an outer join.
INNER JOIN results
ON (results.hostname=pending.active AND
results.submittracker=pending.submittracker AND
results.cl=pending.cl)
WHERE pending.owner='$owner' AND
pending.completed='f' AND
-- Are *both* these really necessary?
results.current_status!='PASSED' AND
results.current_status NOT LIKE '%FAILED'
And then
SELECT osversmap.osname, buildlist.buildname, ymvn.*
FROM your_meaningful_view_name ymvn
INNER JOIN osversmap ON ymvn.os=osversmap.os
AND ymvn.osversion=osversmap.osversion
INNER JOIN buildlist ON ymvn.buildid=buildlist.id
ORDER BY ymvn.submittracker, ymvn.branch, ymvn.os, ymvn.arch
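Note that neither query above keeps only the newest results row per pending row, which the question also asks for. One PostgreSQL way to get that is to replace the plain results table in the outer join with a DISTINCT ON subquery. A sketch, assuming results has some recency column, here called created_at (a hypothetical name, not in the original schema):
LEFT OUTER JOIN (
    SELECT DISTINCT ON (hostname, submittracker, cl)
           hostname, submittracker, cl, current_status, logurl
    FROM results
    -- DISTINCT ON keeps the first row per (hostname, submittracker, cl),
    -- so newest-first ordering keeps the newest row
    ORDER BY hostname, submittracker, cl, created_at DESC
) AS results
ON ( pending.active = results.hostname
     AND pending.submittracker = results.submittracker
     AND pending.cl = results.cl )
The current_status filters can then move into the subquery's WHERE clause if they should apply before picking the newest row.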

SQLite - a smart way to remove and add new objects

I have a table in my database, and I want each row to have a unique id, with the rows numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. And afterwards when I add more data, the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every id so I can access my table arbitrarily (by index).
Is there a way in SQLite to do this? Or do I have to manually manage the removing and adding of data?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
and then delete the row where id=2, the table will afterwards be:
id name
1 First
2 Third
3 Fourth
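For reference, this self-contained script reproduces that test (table and column names assumed from the example above):
CREATE TABLE table_name (id INTEGER, name TEXT);
INSERT INTO table_name VALUES (1, 'First');
INSERT INTO table_name VALUES (2, 'Second');
INSERT INTO table_name VALUES (3, 'Third');
INSERT INTO table_name VALUES (4, 'Fourth');
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
DELETE FROM table_name WHERE id = 2; -- fires the trigger
SELECT * FROM table_name;            -- 1|First, 2|Third, 3|Fourth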
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using a PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
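For example, fetching the fifth row in id order means passing 4:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET 4; -- rows at offsets 0-3 are skipped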
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum