Query to remove all redundant entries from a table - postgresql

I have a Postgres table that describes relationships between entities; it is populated by a process that I cannot modify. This is an example of that table:
+-----+-----+
| e1  | e2  |
|-----+-----|
| A   | B   |
| C   | D   |
| D   | C   |
| ... | ... |
+-----+-----+
I want to write a SQL query that will remove all unnecessary relationships from the table. For example, the relationship [D, C] is redundant because it is already defined by [C, D].
I have a query that deletes using a self join, but it removes both rows of a reciprocal pair, e.g.:
DELETE FROM foo USING foo b WHERE foo.e2 = b.e1 AND foo.e1 = b.e2;
Results in:
+-----+-----+
| e1  | e2  |
|-----+-----|
| A   | B   |
| ... | ... |
+-----+-----+
However, I need a query that will leave me with one of the relationships; it doesn't matter which one remains, either [C, D] or [D, C], but not both.
I feel like there is a simple solution here, but it's escaping me.

A general solution is to use the always unique pseudo-column ctid:
DELETE FROM foo USING foo b
WHERE foo.e2 = b.e1
  AND foo.e1 = b.e2
  AND foo.ctid > b.ctid;
Incidentally it keeps the tuple whose physical location is nearest to the first data page of the table.
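Here is a minimal, self-contained demo of that statement, using a hypothetical temp table built from the question's sample data:
CREATE TEMP TABLE foo (e1 text, e2 text);
INSERT INTO foo VALUES ('A', 'B'), ('C', 'D'), ('D', 'C');

-- delete the member of each reciprocal pair with the larger ctid
DELETE FROM foo USING foo b
WHERE foo.e2 = b.e1
  AND foo.e1 = b.e2
  AND foo.ctid > b.ctid;

SELECT * FROM foo;  -- leaves (A,B) and (C,D)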

Assuming that an exact duplicate row is constrained against, there will always be at most two rows for a given relationship: (C,D) and (D,C) in your example. The same constraint also means the two columns have distinct values: the pair (C,C) might be legal, but it cannot be duplicated.
Assuming that the datatype involved has a sane definition of >, you can add a condition that the row to be deleted is the one where the first column > the second column, and leave the other untouched.
In your sample query, this would mean adding AND foo.e1 > foo.e2.
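Putting that together with the sample query from the question, the full statement would look like this (a sketch; it assumes e1 and e2 have a type with a usable > operator, such as text):
DELETE FROM foo USING foo b
WHERE foo.e2 = b.e1
  AND foo.e1 = b.e2
  AND foo.e1 > foo.e2;  -- keep the row whose first column sorts lower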


Postgres Unique JSON Array Aggregate Values

I have a table that stores values like this:
| id | thing | values |
|----|-------|--------|
| 1  | a     | [1, 2] |
| 2  | b     | [2, 3] |
| 3  | a     | [2, 3] |
And would like to use an aggregate function to group by thing but store only the unique values of the array such that the result would be:
| thing | values    |
|-------|-----------|
| a     | [1, 2, 3] |
| b     | [2, 3]    |
Is there a simple and performant way of doing this in Postgres?
First you take the JSON array apart with json_array_elements(). This is a set-returning function; with a JOIN LATERAL you get one row per array element, each carrying id, thing, and the element's value.
Then you select DISTINCT records for thing and value.
Finally you aggregate the records back together with json_agg(), ordering inside the aggregate so the resulting arrays come out sorted.
In SQL that looks like:
SELECT thing, json_agg(value ORDER BY value) AS "values"
FROM (
  -- DISTINCT and ORDER BY need equality/ordering operators,
  -- which the json type lacks, so cast each element to jsonb
  SELECT DISTINCT thing, v.value::jsonb AS value
  FROM t
  JOIN LATERAL json_array_elements(t."values") AS v(value) ON true
) x
GROUP BY thing;
In general you would want to use the jsonb type, as it is more efficient than json; then you'd use the corresponding jsonb_...() functions.
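For illustration, a sketch of the jsonb variant, assuming the values column is of type jsonb. Since jsonb has equality and ordering operators, DISTINCT can move straight into the aggregate and the subquery disappears:
SELECT thing, jsonb_agg(DISTINCT value ORDER BY value) AS "values"
FROM t
JOIN LATERAL jsonb_array_elements(t."values") AS v(value) ON true
GROUP BY thing;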

joining with a DISTINCT ON on an ordered subquery in sqlalchemy

Here is (an extremely simplified version of) my problem.
I'm using Postgresql as the backend and trying to build a sqlalchemy query
from another query.
Table setup
Here are the tables with some random data for the example.
You can assume that each table was declared in sqlalchemy declaratively, with
the name of the mappers being respectively Item and ItemVersion.
At the end of the question you can find a link where I put the code for
everything in this question, including the table definitions.
Some items.
item
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
+----+
A table containing versions of each item. Each has at least one.
item_version
+----+---------+---------+-----------+
| id | item_id | version | text      |
+----+---------+---------+-----------+
| 1  | 1       | 0       | item_1_v0 |
| 2  | 1       | 1       | item_1_v1 |
| 3  | 2       | 0       | item_2_v0 |
| 4  | 3       | 0       | item_3_v0 |
+----+---------+---------+-----------+
The query
Now, for a given sqlalchemy query over Item, I want a function that returns
another query, but this time over (Item, ItemVersion), where the Items are
the same as in the original query (and in the same order!), and where the
ItemVersions are the corresponding latest versions for each Item.
Here is an example in SQL, which is pretty straightforward:
First a random query over the item table
SELECT item.id as item_id
FROM item
WHERE item.id != 2
ORDER BY item.id DESC
which corresponds to
+---------+
| item_id |
+---------+
| 3 |
| 1 |
+---------+
Then from that query, if I want to join the right versions, I can do
SELECT sq2.item_id AS item_id,
       sq2.item_version_id AS item_version_id,
       sq2.item_version_text AS item_version_text
FROM (
  SELECT DISTINCT ON (sq.item_id)
         sq.item_id AS item_id,
         iv.id AS item_version_id,
         iv.text AS item_version_text
  FROM (
    SELECT item.id AS item_id
    FROM item
    WHERE id != 2
    ORDER BY id DESC) AS sq
  JOIN item_version AS iv
    ON iv.item_id = sq.item_id
  ORDER BY sq.item_id, iv.version DESC) AS sq2
ORDER BY sq2.item_id DESC
Note that it has to be wrapped in a subquery a second time because DISTINCT ON requires an ORDER BY that leads with the DISTINCT ON expression, which clobbers the ordering I actually want.
Now the challenge is to write a function that does that in sqlalchemy.
Here is what I have so far.
First the initial sqlalchemy query over the items:
session.query(Item).filter(Item.id != 2).order_by(desc(Item.id))
Then I'm able to build my second query but without the original ordering. In
other words I don't know how to do the second subquery wrapping that I did in
SQL to get back the ordering that was discarded by the DISTINCT ON.
def join_version(session, query):
    sq = aliased(Item, query.subquery('sq'))
    sq2 = session.query(sq, ItemVersion) \
        .distinct(sq.id) \
        .join(ItemVersion) \
        .order_by(sq.id, desc(ItemVersion.version))
    return sq2
I think this SO question could be part of the answer but I'm not quite
sure how.
The code to run everything in this question (database creation, population and
a failing unit test with what I have so far) can be found here. Normally
if you can fix the join_version function, it should make the test pass!
Ok so I found a way. It's a bit of a hack, but it still only queries the database twice, so I guess I will survive! Basically I'm querying the database for the Items first, and then I do another query for the ItemVersions, filtering on item_id and then reordering with a trick I found here (this is also relevant).
Here is the code:
from sqlalchemy import desc, text
from sqlalchemy.orm import aliased

def join_version(session, query):
    items = query.all()
    item_ids = [i.id for i in items]
    # latest version per item, via Postgres' DISTINCT ON
    items_v_sq = session.query(ItemVersion) \
        .distinct(ItemVersion.item_id) \
        .filter(ItemVersion.item_id.in_(item_ids)) \
        .order_by(ItemVersion.item_id, desc(ItemVersion.version)) \
        .subquery('sq')
    sq = aliased(ItemVersion, items_v_sq)
    # reorder to match the original query; recent SQLAlchemy
    # releases require raw SQL fragments to be wrapped in text()
    items_v = session.query(sq) \
        .order_by(text('idx(array{}, sq.item_id)'.format(item_ids)))
    return zip(items, items_v)
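One caveat worth noting: idx() is not a built-in function; it comes from the intarray extension, so the database needs it installed once (assuming you have the privileges to do so):
CREATE EXTENSION IF NOT EXISTS intarray;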

How to merge data in Stata

I'm learning Stata and trying to understand merging. Can someone explain the difference between the kinds of merging (1:1, 1:m, m:1, m:m)?
In case the Stata manual is unclear, here's a quick overview.
First, it's important to clear up the terminology.
A merge basically connects rows in two datasets (Stata calls them observations) based on a specified variable or list of variables, called key variables. You have to start with one dataset already in memory (Stata calls this the master dataset), and you merge another dataset to it (the other dataset is called the using dataset). What you're left with is a single dataset containing all of the variables from the master, plus any variable from the using that didn't already exist in the master.
The merge also generates a new variable called _merge indicating, for each row, whether it came from the master only, the using only, or matched in both. The merged dataset (unless otherwise specified) will contain all rows from master and using, regardless of whether the key variables matched between the two.
The concept of a "unique identifier" is important. If a variable (or combination of variables) has a different value in every row, it uniquely identifies rows. This is important for the details about 1:1, 1:m etc.
1:1 means the key variable provides unique identifiers in both datasets. You will be left with all of the rows from both datasets in memory.
1:m means the key variable in the master dataset uniquely identifies rows, but the key variable from the using dataset doesn't. You will still be left with all of the rows from both datasets, but if a key variable has duplicate observations in the using dataset, the master dataset will gain duplicates to match them.
m:1 is the opposite of 1:m. The key variable in the master dataset doesn't uniquely identify rows, but the key variable in the using dataset does.
m:m is kind of weird. The key variable doesn't uniquely identify rows in either dataset. Contrary to what you might expect, Stata does not form all pairwise combinations within a key (that is what the separate joinby command does); instead it pairs duplicate observations off in the order they appear, which is rarely what you want, and the manual itself warns against m:m merges.
Example:
** make a dataset and save as a tempfile called `b'. Note that k uniquely identifies rows
clear
set obs 3
gen k = _n
gen b = "b"
list
+-------+
| k b |
|-------|
1. | 1 b |
2. | 2 b |
3. | 3 b |
+-------+
tempfile b
save `b'
** make another dataset and merge `b' to it. Note that k uniquely identifies rows
clear
set obs 3
gen k = _n
gen a = "a"
list
+-------+
| k a |
|-------|
1. | 1 a |
2. | 2 a |
3. | 3 a |
+-------+
merge 1:1 k using `b'
list
+-------------------------+
| k a b _merge |
|-------------------------|
1. | 1 a b matched (3) |
2. | 2 a b matched (3) |
3. | 3 a b matched (3) |
+-------------------------+
** make another dataset and merge `b' to it. Note that k does not uniquely identify rows and that k=2 and k=3 do not exist in the master dataset
clear
set obs 3
gen k = 1
gen a = "a"
list
+-------+
| k a |
|-------|
1. | 1 a |
2. | 1 a |
3. | 1 a |
+-------+
merge m:1 k using `b'
list
+----------------------------+
| k a b _merge |
|----------------------------|
1. | 1 a b matched (3) |
2. | 1 a b matched (3) |
3. | 1 a b matched (3) |
4. | 2 b using only (2) |
5. | 3 b using only (2) |
+----------------------------+

Query join result appears to be incorrect

I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
  sw.sku_id,
  sw.sku_warehouse_id "A",
  sbb.sku_warehouse_id "B",
  sbb.best_before,
  sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
  ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice it to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id | A | B | best_before | quantity |
+--------+-----+----+-------------+----------+
| 20251 | 643 | 11 | <<null>> | 140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps just sw.sku_warehouse_id instead of sw.warehouse_id?
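Applying that fix, the view would compare like with like (a sketch of the corrected definition, reusing the columns from the question):
CREATE VIEW v_sku_best_before AS
SELECT
  sw.sku_id,
  sw.sku_warehouse_id "A",
  sbb.sku_warehouse_id "B",
  sbb.best_before,
  sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
  ON sbb.sku_warehouse_id = sw.sku_warehouse_id  -- was sw.warehouse_id
ORDER BY sbb.best_before;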

SQL query to break down count of every combination

I need a Postgresql Query that returns the count of every type of combination of record.
For example, I have a table T with columns A, B, C, D, E and other columns that are not of importance:
Table T
--------------
A | B | C | D | E
The query should return a table R with the values from columns A, B, C, D, and a count for how many times each configuration occurs with the specified E value.
Table R
---------------
A | B | C | D | count
When all of the counts for each record are added together, it should equal the total number of records in the original table.
It seems like a very simple problem, but due to my lack of SQL knowledge, I cannot figure out how to do this.
The only solution I can think of is this:
select a, b, c, d, count(*)
from T
where e = 'abc'
group by a, b, c, d
But when I add up the counts from this query, the total is far more than the row count of the original table. It seems like count(*) shouldn't be used, or I'm just totally going about this the wrong way. I'd really appreciate any advice as to how I should go about this. Thank you all.
NULL values can't be what's fooling you here: GROUP BY treats NULLs as equal, so every row still lands in exactly one group. Consider this demo:
WITH t(a,b,c,d) AS (
  VALUES
    (1,2,3,4)
  , (1,2,3,NULL)
  , (2,2,3,NULL)
  , (2,2,3,NULL)
  , (2,2,3,4)
  , (2,NULL,NULL,NULL)
  , (NULL,NULL,NULL,NULL)
)
SELECT a, b, c, d, count(*)
FROM t
GROUP BY a, b, c, d
ORDER BY a, b, c, d;
 a | b | c | d | count
---+---+---+---+-------
 1 | 2 | 3 | 4 |     1
 1 | 2 | 3 |   |     1
 2 | 2 | 3 | 4 |     1
 2 | 2 | 3 |   |     2
 2 |   |   |   |     1
   |   |   |   |     1
There must be some other misunderstanding here.
I figured it out; it was something really silly. I forgot to include the WHERE e = 'abc' clause in the SELECT count(*) I was comparing the totals against. Thanks anyway for your help, guys!
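For the record, a quick sanity check (a sketch assuming the column e and the value 'abc' from the question): the grouped counts add back up to the filtered row count, not to the unfiltered total.
SELECT sum(cnt)
FROM (
  SELECT count(*) AS cnt
  FROM T
  WHERE e = 'abc'
  GROUP BY a, b, c, d
) s;
-- should equal:
SELECT count(*) FROM T WHERE e = 'abc';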