Postgresql Faulty Syntax on select/join/group - postgresql

What about the following is not proper syntax for Postgresql?
select p.*, SUM(vote) as votes_count
FROM votes v, posts p
where p.id = v.`voteable_id`
AND v.`voteable_type` = 'Post'
group by v.voteable_id
order by votes_count DESC limit 20
I am in the process of installing postgresql locally but wanted to get this out sooner :)
Thank you

MySQL is a lot looser in its interpretation of standard SQL than PostgreSQL is. There are two issues with your query:
Backtick quoting is a MySQL thing.
Your GROUP BY is invalid.
The first one can be fixed by simply removing the offending quotes. The second one requires more work; from the fine manual:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
This means that every column mentioned in your SELECT either has to appear in an aggregate function or in the GROUP BY clause. So, you have to expand your p.* and make sure that all those columns are in the GROUP BY, you should end up with something like this but with real columns in place of p.column...:
select p.id, p.column..., sum(v.vote) as votes_count
from votes v, posts p
where p.id = v.voteable_id
and v.voteable_type = 'Post'
group by p.id, p.column...
order by votes_count desc
limit 20
This is a pretty common problem when moving from MySQL to anything else.

Related

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id

How to specify two expressions in the select list when the subquery is not introduced with EXISTS

I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
a.Part,
b.Location,
b.LeadTime
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
AND
Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
a.Part,
b.Location,
b.LeadTime
ORDER BY
a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
p.Part,
p.Location, -- from *p*, otherwise if no match we'll get a NULL
v.LeadTime
FROM
dbo.Parts p
OUTER APPLY (
SELECT TOP (1) * -- * here is okay because we specify columns outside
FROM dbo.Vendor v
WHERE p.Location = v.Location -- the correlation part
ORDER BY v.Date DESC
) v
WHERE
p.Location IN ('A','B','C')
ORDER BY
p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just easily add another APPLY clause, and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple tables scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you corrrectly, you want the rows with the max date for locations A, B and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
SELECT
a.Part,
b.Location,
b.LeadTime,
RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
)
SELECT Part,
Location,
LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(Date)
FROM dbo.Vender v
WHERE v.Location = b.Location
AND v.Part = b.Part
)
So this will bring back one date for each location and part

DB2 Query Structure Using User-Defined Function as a Table

I'm a little new to DB2, and am having trouble developing a query. I have created a user-defined function that returns a table of data which I want to then join and select from in larger select statement. I'm working on a sensitive db, so the query below isn't what I'm literally running, but it's almost exactly like it (without the other 10 joins I have to do lol).
select
A.customerId,
A.firstname,
A.lastname,
B.orderId,
B.orderDate,
F.currentLocationDate,
F.currentLocation
from
customer A
INNER JOIN order B
on A.customerId = B.customerId
INNER JOIN table(getShippingHistory(B.customerId)) as F
on B.orderId = F.orderId
where B.orderId = 35
This works great if I run this query without the where clause (or some other where clause that doesn't check for an ID). When I include the where clause, I get the following error:
Error during Prepare 58004(-901)[IBM][CLI Driver][DB2/LINUXX8664]
SQL0901N The SQL statement failed because of a non-severe system
error. Subsequent SQL statements can be processed. (Reason "Bad Plan;
Unresolved QNC found".) SQLSTATE=58004
I have tracked the issue down to fact that I'm using one of join criteria for the parameters (B.customerId). I have validated this fact by replacing B.customerId with a valid customerId, and the query works great. Problem is, I don't know the customerId when calling this query. I know only the orderId (in this example).
Any thoughts on how to restructure this so I can make only 1 call to get all the info? I know the plan is the problem b/c the customerId isn't getting resolved before the function is called.
So if I understand correctly, the function getShippingHistory(customerId) returns a table.
And if you call it with a single customer Id that table gets joined in your query above no problem at all.
But the way you have the query written above, you are asking db2 to call the function for every row returned by your query (i.e. every b.customerId that matches your join and where conditions).
So I'm not sure what behaviour you are expecting, because what you're asking for is a table back for every row in your query, and db2 (nor I) can figure out what the result is supposed to look like.
So in terms of restructuring your query, think about how you can change the getShippingHistory logic when multiple customer Ids are involved.
i found the best solution (given the current query structure) is to use a LEFT join instead of an INNER join in order force the LEFT part of the join to happen which will resolve the customerId to a value by the time it gets to the function call.
select
A.customerId,
A.firstname,
A.lastname,
B.orderId,
B.orderDate,
F.currentLocationDate,
F.currentLocation
from
customer A
INNER JOIN order B
on A.customerId = B.customerId
LEFT JOIN table(getShippingHistory(B.customerId)) as F
on B.orderId = F.orderId
where B.orderId = 35

HAVING clause in PostgreSQL

I'm rewriting the MySQL queries to PostgreSQL. I have table with articles and another table with categories. I need to select all categories, which has at least 1 article:
SELECT c.*,(
SELECT COUNT(*)
FROM articles a
WHERE a."active"=TRUE AND a."category_id"=c."id") "count_articles"
FROM articles_categories c
HAVING (
SELECT COUNT(*)
FROM articles a
WHERE a."active"=TRUE AND a."category_id"=c."id" ) > 0
I don't know why, but this query is causing an error:
ERROR: column "c.id" must appear in the GROUP BY clause or be used in an aggregate function at character 8
The HAVING clause is a bit tricky to understand. I'm not sure about how MySQL interprets it. But the Postgres documentation can be found here:
http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-HAVING
It essentially says:
The presence of HAVING turns a query
into a grouped query even if there is
no GROUP BY clause. This is the same
as what happens when the query
contains aggregate functions but no
GROUP BY clause. All the selected rows
are considered to form a single group,
and the SELECT list and HAVING clause
can only reference table columns from
within aggregate functions. Such a
query will emit a single row if the
HAVING condition is true, zero rows if
it is not true.
The same is also explained in this blog post, which shows how HAVING without GROUP BY implicitly implies a SQL:1999 standard "grand total", i.e. a GROUP BY ( ) clause (which isn't supported in PostgreSQL)
Since you don't seem to want a single row, the HAVING clause might not be the best choice.
Considering your actual query and your requirement, just rewrite the whole thing and JOIN articles_categories to articles:
SELECT DISTINCT c.*
FROM articles_categories c
JOIN articles a
ON a.active = TRUE
AND a.category_id = c.id
alternative:
SELECT *
FROM articles_categories c
WHERE EXISTS (SELECT 1
FROM articles a
WHERE a.active = TRUE
AND a.category_id = c.id)
SELECT * FROM categories c
WHERE
EXISTS (SELECT 1 FROM article a WHERE c.id = a.category_id);
should be fine... perhaps simpler ;)

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, blJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When i try to execute the query it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way is to truncate (substring())the "tblJobAnnouncement.JobClassDesc" in the query which has column size of around 8000.
Do we have any work around so that i need not truncate the values. Or Can this query be optimised? Any setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan "fist" produces the list of qualifying rows (records) (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
An unrelated hint, is to use table aliases, using a short string to avoid repeating these long table names. For example (only did a few tables, but you get the idea...)
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0,[dbo.TableName])
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000