Why does Postgres choose different data solely based on columns selected? - postgresql

I'm running two different queries with two unions each inside a subquery:
So the structure is:
SELECT *
FROM (subquery_1
UNION SELECT subquery_2)
Now, if I perform the query on the left, I get this result:
However, the query on the right returns this result:
How are the results differing even though the conditions have not changed in either query, and the only difference was one of the selected columns in a subquery?
This is very counter-intuitive.

The UNION operator removes duplicate rows from the returned result set.
Removing a column from the SELECT list can make rows identical that were distinct while the removed column was still there, so UNION then collapses them.
Try UNION ALL instead, which returns every row of both queries, duplicates included.
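A minimal sketch of that effect, using SQLite through Python's sqlite3 module as a stand-in for Postgres (an assumption made only to keep the example self-contained). The tables t1 and t2 and their contents are hypothetical.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for Postgres here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER, name TEXT);
    INSERT INTO t1 VALUES (1, 'alice'), (2, 'alice');
    INSERT INTO t2 VALUES (3, 'bob');
""")


def run(sql):
    return conn.execute(sql).fetchall()


# With id selected, every row is unique, so UNION removes nothing.
with_id = run("SELECT id, name FROM t1 UNION SELECT id, name FROM t2")

# Without id, the two 'alice' rows become identical and UNION collapses them.
without_id = run("SELECT name FROM t1 UNION SELECT name FROM t2")

# UNION ALL keeps every row regardless of the columns selected.
without_id_all = run("SELECT name FROM t1 UNION ALL SELECT name FROM t2")

print(len(with_id), len(without_id), len(without_id_all))
```

The WHERE conditions never change between the three queries; only the selected columns decide which rows count as duplicates.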

Related

Why does adding FROM clause resolve the effect of WHERE condition in subquery on main query?

In my research, I want to find diseases correlated with diabetes by listing every disease that co-occurs with it, i.e. where at least one patient has a medical record of both diabetes and that disease. At first I tried this query:
use [ng_data]
select distinct [disease]
from [dbo].[Final_View_2]
where [encode_id] in
(
select [encode_id]
where [disease] like '%diabetes%'
)
Here encode_id is the patient's id. But this query only returns diseases whose names contain 'diabetes'. It looks like the condition in the subquery affects the results of the main query.
Then I tried this query:
use [ng_data]
select distinct [disease]
from [dbo].[Final_View_2]
where [encode_id] in
(
select [encode_id]
from [dbo].[Final_View_2]
where [disease] like '%diabetes%'
)
it works correctly. It seems that adding a FROM clause to the subquery removes the effect of the subquery's WHERE clause on the main query. Could someone explain how the queries are carried out and why they produce these results? I'm confused by the dependence of the main query on the subquery.
Your first query is acting as a correlated subquery: because the subquery has no FROM clause, the columns it references are resolved against the current row of the outer query. In effect it just asks "is the disease value in this row like '%diabetes%'?", so it is no different from putting that WHERE clause in the outer query.
By adding the FROM to your subquery you create a separate result set of encode_ids that have diabetes in the disease column, and then select all rows from the outer query whose encode_id is in that set, regardless of their own disease value.
Take a look at the execution plan for each query to see what it is doing.
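The difference can be reproduced on a tiny data set. This is only a sketch: it uses SQLite through Python's sqlite3 module rather than SQL Server (an assumption; SQLite also accepts a subquery without FROM and resolves its columns against the outer row), and the final_view table with its three rows is hypothetical.

```python
import sqlite3

# Hypothetical data: patient 1 has diabetes and hypertension, patient 2 has asthma.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE final_view (encode_id INTEGER, disease TEXT);
    INSERT INTO final_view VALUES
        (1, 'type 2 diabetes'), (1, 'hypertension'),
        (2, 'asthma');
""")

# No FROM in the subquery: encode_id and disease refer to the outer row,
# so the subquery only tests the current row's own disease.
no_from = conn.execute("""
    SELECT DISTINCT disease FROM final_view
    WHERE encode_id IN (SELECT encode_id WHERE disease LIKE '%diabetes%')
    ORDER BY disease
""").fetchall()

# With FROM: the subquery builds an independent set of diabetic patients.
with_from = conn.execute("""
    SELECT DISTINCT disease FROM final_view
    WHERE encode_id IN (SELECT encode_id FROM final_view
                        WHERE disease LIKE '%diabetes%')
    ORDER BY disease
""").fetchall()

print(no_from)
print(with_from)
```

Only the second query reports hypertension, because only there is the diabetes test applied to a different row than the one being returned.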

Unexpected behavior in a postgres group by query

I am used to writing group by queries in t-sql. In a t-sql group by, this would generate a list where items with the same categorytext were grouped together, then items within a category text group that had the same type text would be grouped together. But that does not seem to be what is happening here:
Select "CategoryText", "TypeText"
from "NewOrleans911Categories"
group by "CategoryText", "TypeText";
Here is some output from postgres. Why are the NAs not getting grouped together?
CategoryText; TypeText
"BrokenWindows";"DRUG VIOLATIONS"
"NA";"BOMB SCARE"
"Weapon";"DISCHARGING FIREARMS"
"NA";"NEGLIGENT INJURY"
"In a t-sql group by, this would generate a list where items with the same categorytext were grouped together, then items within a category text group that had the same type text would be grouped together."
In SQL, the order in which a query returns rows is unspecified unless you add an ORDER BY clause. Typically you get the rows in whatever order the executor happens to produce them, which depends entirely on the query plan. (As far as I'm aware, T-SQL behaves the same way.)
At any rate, you'd want to add the missing order by clause to get the expected result:
Select "CategoryText", "TypeText"
from "NewOrleans911Categories"
group by "CategoryText", "TypeText"
order by "CategoryText", "TypeText";
Or (and I suspect this is what you're actually looking for) replace the group by with an order by clause:
Select "CategoryText", "TypeText"
from "NewOrleans911Categories"
order by "CategoryText", "TypeText";
You are grouping by two columns. Rows are only grouped together when they match on both columns.
In your case the two NA rows have different TypeText values, so they stay in separate groups. With no aggregates in the select list this is much like using DISTINCT, which would accomplish the same thing here.
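Here is a small sketch of both points, run against SQLite through Python's sqlite3 module instead of Postgres (an assumption for the sake of a self-contained example). The calls table and its column names are hypothetical stand-ins for "NewOrleans911Categories".

```python
import sqlite3

# Hypothetical table with one repeated (category, type) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (category TEXT, type TEXT);
    INSERT INTO calls VALUES
        ('NA', 'BOMB SCARE'), ('Weapon', 'DISCHARGING FIREARMS'),
        ('NA', 'NEGLIGENT INJURY'), ('NA', 'BOMB SCARE');
""")

# One output row per distinct (category, type) pair; ORDER BY is what
# places the NA rows next to each other.
grouped = conn.execute("""
    SELECT category, type, COUNT(*) FROM calls
    GROUP BY category, type
    ORDER BY category, type
""").fetchall()

# Without aggregates, DISTINCT yields the same pairs.
distinct_pairs = conn.execute("""
    SELECT DISTINCT category, type FROM calls
    ORDER BY category, type
""").fetchall()

for row in grouped:
    print(row)
```

The two NA rows remain separate groups because their types differ; only the repeated ('NA', 'BOMB SCARE') pair is merged.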
Maybe you need a query like this:
select distinct on ("CategoryText") "CategoryText", "TypeText"
from "NewOrleans911Categories"
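because with GROUP BY you cannot select columns that are not in the GROUP BY clause (unless they are wrapped in an aggregate). Note that DISTINCT ON is PostgreSQL-specific, and without an ORDER BY the TypeText kept for each category is unpredictable.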

Tsql, union changes result order, union all doesn't

I know UNION removes duplicates, but it changes the result order even when there are no duplicates.
I have two SELECT statements, with no ORDER BY anywhere.
I want to union them, with or without ALL,
i.e.
SELECT A
UNION (all)
SELECT B
"Select B" actually returns nothing; it contains no rows.
If I use "Select A union Select B", the order of the result is different from just "Select A".
If I use:
SELECT A
UNION ALL
SELECT B
the order of the result is the same as "Select A" itself, and there are no duplicates in "Select A" at all.
Why is this? It seems unpredictable.
The only way to get a particular order of results from an SQL query is to use an ORDER BY clause. Anything else is just relying on coincidence and the particular (transitory) state of the server at the time you issue your query.
So if you want/need a particular order, use an ORDER BY.
As to why it changes the ordering of results - first, UNION (without ALL) guarantees to remove all duplicates from the result - not just duplicates arising from the different queries - so if the first query returns duplicate rows and the second query returns no rows, UNION still has to eliminate them.
One common, easy way to determine whether you have duplicates in a bag of results is to sort those results (in whatever sort order is most convenient to the system) - in this way, duplicates end up next to each other and so you can then just iterate over these sorted results and if(results[index] == results[index-1]) skip;.
So, you'll commonly find that the results of a UNION (without ALL) query have been sorted - in some arbitrary order. But, to re-emphasise the original point, what ordering was applied is not defined, and certainly shouldn't be relied upon - any patches to the software, changes in indexes or statistics may result in the system choosing a different sort order the next time the query is executed - unless there's an ORDER BY clause.
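To make the sort-then-skip idea concrete, here is a minimal Python sketch. It illustrates the general strategy only and is not how SQL Server or any other engine is actually implemented; sort_dedupe, select_a and select_b are hypothetical names.

```python
def sort_dedupe(rows):
    """Remove duplicates by sorting, then skipping rows equal to their predecessor."""
    ordered = sorted(rows)
    result = []
    for index, row in enumerate(ordered):
        if index > 0 and row == ordered[index - 1]:
            continue  # duplicate of the previous row, skip it
        result.append(row)
    return result


# "Select A" has no duplicates; "Select B" returns nothing.
select_a = [("pear",), ("apple",), ("fig",)]
select_b = []

union_all = select_a + select_b            # plain concatenation, order kept
union = sort_dedupe(select_a + select_b)   # sorted as a side effect of deduplication

print(union_all)
print(union)
```

Even with nothing to remove, the deduplicating path reorders the rows, which matches the behaviour described in the question.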
One of the most important points to understand about SQL is that a table has no guaranteed order, because a table is supposed to represent a set (or multiset if it has duplicates), and a set has no order. This means that when you query a table without specifying an ORDER BY clause, the query returns a table result, and SQL Server is free to return the rows in the output in any order. If the results happen to be ordered, it may be due to optimization reasons. The point I'm trying to make is that any order of the rows in the output is considered valid, and no specific order is guaranteed. The only way for you to guarantee that the rows in the result are sorted is to explicitly specify an ORDER BY clause.

sql query to retrieve DISTINCT rows on left join

I am developing a T-SQL query that left joins two tables. When I select records from Table A alone, I get only 2 records. The problem is that when I left join it to Table B, I get 4 records. How can I reduce this to just 2 records?
One complication is that I am only aware of one PK/FK pair linking these two tables.
The value of the field you are using for the join must occur more than once in table B; this is why multiple rows are returned by the join. To reduce the row count you will have to either add further fields to the join condition, or add a WHERE clause to filter out the rows you do not need.
Alternatively you could use a GROUP BY statement to group the rows up, but this may not be what you need.
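A small sketch of the row multiplication and the two remedies, using SQLite through Python's sqlite3 module in place of SQL Server (an assumption). The tables a and b, the kind column and all values are hypothetical, since the question does not show its schema.

```python
import sqlite3

# Hypothetical schema: each row of a matches two rows of b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (a_id INTEGER, kind TEXT, note TEXT);
    INSERT INTO a VALUES (1, 'first'), (2, 'second');
    INSERT INTO b VALUES (1, 'main', 'x'), (1, 'extra', 'y'),
                         (2, 'main', 'z'), (2, 'extra', 'w');
""")

# The join key occurs twice per a.id in b, so 2 rows become 4.
plain = conn.execute(
    "SELECT a.id, b.note FROM a LEFT JOIN b ON b.a_id = a.id"
).fetchall()

# Remedy 1: add a further condition to the join so at most one b row matches.
narrowed = conn.execute("""
    SELECT a.id, b.note FROM a
    LEFT JOIN b ON b.a_id = a.id AND b.kind = 'main'
    ORDER BY a.id
""").fetchall()

# Remedy 2: group back to one row per a.id and aggregate the b side.
grouped = conn.execute("""
    SELECT a.id, COUNT(b.a_id) FROM a
    LEFT JOIN b ON b.a_id = a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

print(len(plain), narrowed, grouped)
```

Which remedy fits depends on whether you need one specific row from Table B or a summary of all of them.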
Remember that a left join returns NULLs for the joined table's columns when there is no match.
You could also use SELECT DISTINCT, but I can't quite see your issue. Can you give us more details?

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query, it fails with the following error:
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I searched and didn't find any solution. The only workaround I found is to truncate (SUBSTRING()) tblJobAnnouncement.JobClassDesc in the query; that column has a size of around 8000.
Is there any workaround that does not require truncating the values? Can this query be optimised? Is there any relevant setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with:
- the data context (expected number of rows, presence/absence of index, size of row...)
- the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet all the possible plans associated with a "DISTINCT query" involve some form of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (the records that satisfy the WHERE, JOIN, etc. parts of the query) and then sorts this list (which possibly includes some duplicates), retaining only the first occurrence of each distinct row. In other cases, for example when only a few columns are selected and an index covering these columns is available, no explicit sorting step appears in the query plan, but the reliance on an index implies that the underlying columns are sortable. In yet other cases the query optimizer selects steps involving various forms of merging or hashing, and these too ultimately require the ability to compare two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server is that sorting is not possible on rows bigger than a given size, and the DISTINCT keyword is the only apparent reason for the query to require any sorting (many other SQL constructs, such as UNION, also imply sorting). Hence the idea of removing the DISTINCT, if that is logically possible.
In fact you should remove it, for test purposes, to confirm that the query completes without DISTINCT (even if it then returns some duplicates). Once this is confirmed, and if the query can indeed produce duplicate rows, look into ways of producing a duplicate-free result without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
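One such construct is to replace a join that is only used for filtering with an EXISTS subquery, so duplicates never arise and no DISTINCT is needed. The sketch below uses SQLite through Python's sqlite3 module rather than SQL Server 2000 (an assumption), and the job and job_tag tables are hypothetical; whether the rewrite applies to the query in the question depends on where its duplicates actually come from.

```python
import sqlite3

# Hypothetical schema: job carries a wide text column, job_tag is used only to filter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (job_id INTEGER PRIMARY KEY, long_desc TEXT);
    CREATE TABLE job_tag (job_id INTEGER, tag TEXT);
    INSERT INTO job VALUES (1, 'long text one'), (2, 'long text two');
    INSERT INTO job_tag VALUES (1, 'therapy'), (1, 'therapy-family'), (2, 'admin');
""")

# Joining to the filter table repeats job 1 once per matching tag,
# which is what tempts one to add DISTINCT over the wide column.
joined = conn.execute("""
    SELECT j.job_id, j.long_desc FROM job j
    JOIN job_tag t ON t.job_id = j.job_id
    WHERE t.tag LIKE 'therapy%'
""").fetchall()

# EXISTS tests for a match without multiplying rows, so no DISTINCT is required.
with_exists = conn.execute("""
    SELECT j.job_id, j.long_desc FROM job j
    WHERE EXISTS (SELECT 1 FROM job_tag t
                  WHERE t.job_id = j.job_id AND t.tag LIKE 'therapy%')
""").fetchall()

print(len(joined), with_exists)
```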
An unrelated hint: use table aliases, short strings that avoid repeating these long table names. For example (I only aliased a few tables, but you get the idea):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0, 'dbo.TableName')
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
- Split it into two queries and combine the results elsewhere:
  SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
  SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
- Truncate the columns appropriately:
  SELECT LEFT(LongColumn,2000)...
- Remove any redundant columns from the SELECT:
  SELECT ColumnA, ColumnB --, IDColumnNotUsedInOutput
  FROM TableA
- Migrate off of SQL Server 2000.