Postgres UNION not union'ing everything

This is using Postgres 12.
I have a table that stages an import from a spreadsheet. It has a structure like:
code   | yes  | no
-------+------+------
XX1000 | yes  | null
ZX1001 | null | no
I am trying to get all the results in a query so I can do other stuff with it.
When I run
select substring(code, 3), 'Y' from table1 where yes = 'yes' and no is null
I get the correct number of results (let's say 100).
When I run
select substring(code, 3), 'N' from table1 where yes is null and no is not null
I get the proper number of results (let's say 6).
If I run
select substring(code, 3), 'Y' from table1 where yes = 'yes' and no is null
UNION
select substring(code, 3), 'N' from table1 where yes is null and no is not null
I get 102 results, with 4 results missing from the yes query. Extracting all the results and comparing in Excel, I can see that no substring value appears in the result sets of both queries (each id is in the result set of the yes query once). I can also guarantee there are no overlapping substring(code, 3) values, since the spreadsheet is populated from a different system where the characters after the first two are the id column in the table (I also verified that the second query, run separately, returns distinct values compared to the first query).
Running a UNION ALL gives me 106 results that are all unique.
What is going on here? I am so confused why the UNION is dropping unique results.

According to the documentation, UNION effectively appends the result of query2 to the result of query1 (although there is no guarantee that this is the order in which the rows are actually returned). Furthermore, it eliminates duplicate rows from its result, in the same way as DISTINCT, unless UNION ALL is used.
Note that the deduplication runs over the combined result as a whole, so rows duplicated within a single input query are collapsed too. Since UNION is just UNION ALL followed by duplicate elimination, the four dropped rows must be exact duplicates of other rows somewhere in the combined result; duplicates inside the first query alone are easy to miss when you only compare across the two queries.
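A minimal, self-contained illustration (not the poster's data):
select x, 'Y' from (values ('a'), ('a'), ('b')) t(x)
union
select x, 'N' from (values ('c')) t(x);
-- returns 3 rows, not 4: the two ('a', 'Y') rows are collapsed into one,
-- even though both came from the same branch of the UNION
To hunt for such duplicates in the real data, group the first query by its output:
select substring(code, 3), count(*)
from table1
where yes = 'yes' and no is null
group by 1
having count(*) > 1;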

Related

UNION producing duplicates in output (MySQL Workbench)

I am trying to compare UNION vs UNION ALL in MySQL Workbench. I am combining two tables that are not identical (hence the NULL AS column names) with UNION. However, I am getting a duplicate, and therefore UNION ALL and UNION are producing the same outputs. I know that the purpose of UNION vs UNION ALL is that UNION is supposed to only report unique values and not include duplicates in the output, but it is not doing that for me. Thank you for the help.
SELECT
e.emp_no,
e.first_name,
e.last_name,
NULL AS dept_no,
NULL AS from_date
FROM
employees_dup AS e
WHERE
e.emp_no = 10001
UNION SELECT
NULL AS emp_no,
NULL AS first_name,
NULL AS last_name,
m.dept_no,
m.from_date
FROM
dept_manager AS m;
Link to screenshot of code and output: https://i.stack.imgur.com/0Vpii.png
I don't see any duplicates in the sample you gave. Yes, there are a lot of rows with the same value for department numbers, but they all have different from dates. A UNION filters out duplicate rows, not duplicate values. You can have a list of twenty names all with the last name Smith, but Brian Smith and Carl Smith are clearly different names.
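A minimal illustration (generic values, not the poster's tables):
select 'd001' as dept_no, '2020-01-01' as from_date
union
select 'd001', '2021-01-01';
-- 2 rows: same dept_no, but the rows differ in from_date, so neither is a duplicate
select 'd001', '2020-01-01'
union
select 'd001', '2020-01-01';
-- 1 row: only rows that are identical in every column are collapsed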

ERROR: array must not contain nulls PostgreSQL

My Query is
SELECT
id,
ARRAY_AGG(session_os)::integer[]
FROM
t
GROUP BY id
HAVING ARRAY_AGG(session_os)::integer[] && ARRAY[1,NULL]
It's giving ERROR: array must not contain nulls
Actually I want to get rows like
id  | Session_OS
----+------------
641 | {1, 2}
642 | {NULL, 2}
643 | {NULL}
Kindly check the sample data here
https://dbfiddle.uk/?rdbms=postgres_13&fiddle=7793fa763a360bf7334787e4249d6107
The && operator does not support NULL values, so you need another approach. For example, you could join the data to the table first. This gives you the ids which are linked to your required data. In a second step you can aggregate all values using these ids.
Step-by-step demo: db<>fiddle
SELECT
id,
ARRAY_AGG(session_os) -- 4
FROM t
WHERE id IN ( -- 3
SELECT
id
FROM
t
JOIN (
SELECT unnest(ARRAY[1, null]) as a -- 1
)s ON s.a IS NOT DISTINCT FROM t.session_os -- 2
)
GROUP BY id
1. Create a table or query result which contains your relevant data, incl. the NULL value.
2. You can join the data, incl. the NULL value, using the operator IS NOT DISTINCT FROM, which considers the NULL.
3. Now you have fetched the relevant id values, which can be used in the WHERE filter.
4. Finally you can do your aggregation.
The extension intarray installs its own && operator for int[]; this one doesn't allow NULLs, and it takes precedence over the built-in && operator.
If you are not using intarray, you can just uninstall it (except for in dbfiddle, where you can't). If you are using it occasionally, I think it is best to install it in its own schema which is not in your search path. Then you need to schema qualify its operators when you do need them.
Alternatively, you can leave intarray in place and schema qualify the normal built-in operators when you need those ones specifically, as shown here.
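For example, with intarray still installed, you can reach the built-in anyarray operator explicitly via the OPERATOR syntax (a sketch based on the query above; note the built-in && simply ignores NULL elements rather than erroring, so it still won't match rows like {NULL}):
SELECT
id,
ARRAY_AGG(session_os)
FROM t
GROUP BY id
HAVING ARRAY_AGG(session_os) OPERATOR(pg_catalog.&&) ARRAY[1, NULL];
-- no error, but NULL elements never compare equal, so the
-- IS NOT DISTINCT FROM approach is still needed to match them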

Why does using DISTINCT ON () at different points in a query return different (unintuitive) results?

I’m querying from a table that has repeated uuids, and I want to remove duplicates. I also want to exclude some irrelevant data which requires joining on another table. I can remove duplicates and then exclude irrelevant data, or I can switch the order and exclude then remove duplicates. Intuitively, I feel like if anything, removing duplicates then joining should produce more rows than joining and then removing duplicates, but that is the opposite of what I’m seeing. What am I missing here?
In this one, I remove duplicates in the first subquery and filter in the second, and I get 500k rows:
with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
),
tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
)
select * from tbl2
If I filter then remove duplicates, I get 550k rows:
with tbl1 as (
select uuid, foreign_key
from original_data
where date > some_date
),
tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
),
tbl3 as (
select distinct on (uuid) uuid
from tbl2
)
select * from tbl3
Is there an explanation here?
Does original_data.foreign_key lack a foreign key constraint referencing other_data.id, allowing foreign_key values that don't link to any id in other_data?
Is the other_data.category or original_data.foreign_key column missing a NOT NULL constraint?
In either of these cases Postgres would filter out all records with
- a missing link (foreign_key = null)
- a broken link (foreign_key doesn't match any id in other_data)
- a link to an other_data record with a category set to null
in both of your approaches, regardless of whether they're a duplicate or not, as other_data.category <> something evaluates to null for them, which does not satisfy the WHERE clause. That, combined with the missing ORDER BY causing DISTINCT ON to drop different duplicates randomly each time, could result in dropping the duplicates that then get filtered out in tbl2 in the first approach, but not in the second.
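The three-valued logic is easy to check directly:
select null::text <> 'something';                -- null, which the WHERE clause treats like false
select null::text is distinct from 'something';  -- true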
Example:
pgsql122=# select * from original_data;
uuid | foreign_key | comment
------+-------------+---------------------------------------------------
1 | 1 | correct, non-duplicate record with a correct link
3 | 2 | duplicate record with a broken link
3 | 1 | duplicate record with a correct link
4 | null | duplicate record with a missing link
4 | 1 | duplicate record with a correct link
5 | 3 | duplicate record with a correct link, but a null category behind it
5 | 1 | duplicate record with a correct link
6 | null | correct, non-duplicate record with a missing link
7 | 2 | correct, non-duplicate record with a broken link
8 | 3 | correct, non-duplicate record with a correct link, but a null category behind it
pgsql122=# select * from other_data;
id | category
----+----------
1 | a
3 | null
Both of your approaches keep uuid 1 and eliminate uuid 6, 7 and 8 even though they're unique.
Your first approach randomly keeps between 0 and 3 out of the 3 pairs of duplicates (uuid 3, 4 and 5), depending on which one in each pair gets discarded by DISTINCT ON.
Your second approach always keeps one record for each of uuid 3, 4 and 5. Each clone with a missing link, a broken link, or a link with a null category behind it is already gone by the time you discard duplicates.
As @a_horse_with_no_name suggested, ORDER BY should make DISTINCT ON consistent and predictable, but only as long as records vary in the columns used for ordering. It also won't help if you have other issues, like the ones suggested above.
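If those filtered-out rows should in fact survive, a NULL-safe variant of the filter might look like this (a sketch reusing the poster's placeholder names some_date and something):
select distinct on (uuid) uuid
from original_data od
left join other_data x on od.foreign_key = x.id
where od.date > some_date
and x.category is distinct from something  -- null-safe: keeps missing links, broken links, and null categories
order by uuid;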

SqlAlchemy: count of distinct over multiple columns

I can't do:
>>> session.query(
...     func.count(distinct(Hit.ip_address, Hit.user_agent))).first()
TypeError: distinct() takes exactly 1 argument (2 given)
I can do:
session.query(
    func.count(distinct(func.concat(Hit.ip_address, Hit.user_agent)))).first()
Which is fine (count of unique users in a 'pageload' db table).
This isn't correct in the general case, e.g. will give a count of 1 instead of 2 for the following table:
col_a | col_b
------+------
xx    | yy
xxy   | y
Is there any way to generate the following SQL (which is valid in postgresql at least)?
SELECT count(distinct (col_a, col_b)) FROM my_table;
distinct() accepts more than one argument when appended to the query object:
session.query(Hit).distinct(Hit.ip_address, Hit.user_agent).count()
It should generate something like:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT ON (hit.ip_address, hit.user_agent)
hit.ip_address AS hit_ip_address, hit.user_agent AS hit_user_agent
FROM hit) AS anon_1
which is even a bit closer to what you wanted.
The exact query can be produced using the tuple_() construct:
session.query(
func.count(distinct(tuple_(Hit.ip_address, Hit.user_agent)))).scalar()
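It should generate something like the form you asked for (sketched, not verified against a live session):
SELECT count(DISTINCT (hit.ip_address, hit.user_agent)) AS count_1
FROM hit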
Looks like sqlalchemy distinct() accepts only one column or expression.
Another way around is to use group_by and count. This should be more efficient than using concat of two columns: with GROUP BY the database is able to use indexes if they exist:
session.query(Hit.ip_address, Hit.user_agent).\
group_by(Hit.ip_address, Hit.user_agent).count()
The generated query would still look different from what you asked about:
SELECT count(*) AS count_1
FROM (SELECT hittable.user_agent AS hittable_user_agent, hittable.ip_address AS hittable_ip_address
FROM hittable GROUP BY hittable.user_agent, hittable.ip_address) AS anon_1
You can add a separator character in the concat function in order to keep the combinations distinct. Taking your example as reference, it would be:
session.query(
    func.count(distinct(func.concat(Hit.ip_address, "-", Hit.user_agent)))).first()
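The generated SQL should then be something like (assuming the table is named hit):
SELECT count(DISTINCT concat(hit.ip_address, '-', hit.user_agent)) AS count_1
FROM hit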

Duplicate values returned with joins

I was wondering if there is a way, using a T-SQL join statement (or any other available option), to only display certain values. I will try and explain exactly what I mean.
My database has tables called Job, consign, dechead, decitem. Job, consign, and dechead will only ever have one line per record, but decitem can have multiple records, all tied to dechead with a foreign key. I am writing a query that pulls various values from each table. This is fine with all the tables except decitem. From dechead I need to pull an invoice value, and from decitem I need to grab the net weights. When the results are returned, if a dechead row has multiple child decitem rows, it displays all values from both tables. What I need it to do is display the dechead values only once, and then all the decitem values.
e.g.
1 ¦123¦£2000¦15.00¦1
2 ¦--¦------¦20.00¦2
3 ¦--¦------¦25.00¦3
Line 1 displays values from dechead and the first line/join from decitem. Lines 2 and 3 just display values from decitem. If I then export the query to, say, Excel, I do not have duplicate values in the first two fields of lines 2 and 3,
e.g.
1 ¦123¦£2000¦15.00¦1
2 ¦123¦£2000¦20.00¦2
3 ¦123¦£2000¦25.00¦3
Thanks in advance.
Check out GROUP BY for your RDBMS: http://msdn.microsoft.com/en-US/library/ms177673%28v=SQL.90%29.aspx
This is a task best left for the application, but if you must do it in SQL, try this:
SELECT
    CASE
        WHEN RowVal = 1 THEN dt.Col1
        ELSE NULL
    END AS Col1
    ,CASE
        WHEN RowVal = 1 THEN dt.Col2
        ELSE NULL
    END AS Col2
    ,dt.Col3
    ,dt.Col4
FROM (SELECT
          col1, col2, col3, col4
          ,ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1, col4) AS RowVal
      FROM ...rest of your big query here...
     ) dt
ORDER BY dt.Col1, dt.Col4
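Applied to the poster's tables it might look like the following sketch; the join and column names (InvoiceNo, InvoiceValue, NetWeight, ItemNo) are assumptions, since the real schema wasn't shown:
SELECT
    CASE WHEN dt.RowVal = 1 THEN dt.InvoiceNo END AS InvoiceNo,       -- blanked after the first row of each group
    CASE WHEN dt.RowVal = 1 THEN dt.InvoiceValue END AS InvoiceValue, -- blanked after the first row of each group
    dt.NetWeight,
    dt.ItemNo
FROM (SELECT h.InvoiceNo, h.InvoiceValue, i.NetWeight, i.ItemNo,
             ROW_NUMBER() OVER (PARTITION BY h.InvoiceNo ORDER BY i.ItemNo) AS RowVal
      FROM dechead h
      JOIN decitem i ON i.InvoiceNo = h.InvoiceNo) dt
ORDER BY dt.InvoiceNo, dt.ItemNo;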