Unexpected behavior in a postgres group by query - postgresql

I am used to writing group by queries in t-sql. In a t-sql group by, this would generate a list where items with the same categorytext were grouped together, then items within a category text group that had the same type text would be grouped together. But that does not seem to be what is happening here:
Select "CategoryText", "TypeText"
from "NewOrleans911Categories"
group by "CategoryText", "TypeText";
Here is some output from postgres. Why are the NAs not getting grouped together?
CategoryText; TypeText
"BrokenWindows";"DRUG VIOLATIONS"
"NA";"BOMB SCARE"
"Weapon";"DISCHARGING FIREARMS"
"NA";"NEGLIGENT INJURY"

In a t-sql group by, this would generate a list where items with the same categorytext were grouped together, then items within a category text group that had the same type text would be grouped together.
In SQL, the order in which rows are returned by a query is unspecified unless you add an order by clause. Without one, you typically get the rows in whatever order the executor happens to produce them, and that depends entirely on the query plan. (As far as I'm aware, t-sql behaves the same way.)
At any rate, you'd want to add the missing order by clause to get the expected result:
Select "CategoryText", "TypeText"
from "NewOrleans911Categories"
group by "CategoryText", "TypeText"
order by "CategoryText", "TypeText";
Or (and I suspect this is what you're actually looking for) replace the group by with an order by clause:
Select "CategoryText", "TypeText"
from "NewOrleans911Categories"
order by "CategoryText", "TypeText";
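To see the effect of the explicit order by, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Postgres, and the rows are sample data based on the output in the question plus one made-up duplicate.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "NewOrleans911Categories" ("CategoryText" TEXT, "TypeText" TEXT)'
)
conn.executemany(
    'INSERT INTO "NewOrleans911Categories" VALUES (?, ?)',
    [
        ("BrokenWindows", "DRUG VIOLATIONS"),
        ("NA", "BOMB SCARE"),
        ("Weapon", "DISCHARGING FIREARMS"),
        ("NA", "NEGLIGENT INJURY"),
        ("NA", "BOMB SCARE"),  # made-up duplicate pair, collapsed by GROUP BY
    ],
)

# GROUP BY removes the duplicate pair; ORDER BY is what puts the NA rows together.
ordered = conn.execute(
    'SELECT "CategoryText", "TypeText" '
    'FROM "NewOrleans911Categories" '
    'GROUP BY "CategoryText", "TypeText" '
    'ORDER BY "CategoryText", "TypeText"'
).fetchall()

for row in ordered:
    print(row)
```

With the order by in place, the two NA rows come out next to each other regardless of the plan the database picks.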

You are grouping by two columns, so rows are only grouped together when they match on both columns.
Here the two NA rows have different TypeText values, so they stay in separate groups. With no aggregate functions in the select list, this query does the same thing as a select distinct on those two columns.
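A small sketch of that equivalence, again assuming SQLite via Python's sqlite3 as a stand-in for Postgres and using made-up sample rows: grouping by both columns gives the same pairs as select distinct, while grouping by "CategoryText" alone is what collapses all the NA rows into one group.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "NewOrleans911Categories" ("CategoryText" TEXT, "TypeText" TEXT)'
)
conn.executemany(
    'INSERT INTO "NewOrleans911Categories" VALUES (?, ?)',
    [
        ("NA", "BOMB SCARE"),
        ("NA", "BOMB SCARE"),
        ("NA", "NEGLIGENT INJURY"),
        ("Weapon", "DISCHARGING FIREARMS"),
    ],
)

# Grouping by both columns: one row per distinct (CategoryText, TypeText) pair.
grouped = set(conn.execute(
    'SELECT "CategoryText", "TypeText" FROM "NewOrleans911Categories" '
    'GROUP BY "CategoryText", "TypeText"'
).fetchall())

# SELECT DISTINCT on the same two columns yields the same pairs.
distinct = set(conn.execute(
    'SELECT DISTINCT "CategoryText", "TypeText" FROM "NewOrleans911Categories"'
).fetchall())

# Grouping by CategoryText only: one row per category, with a row count.
per_category = conn.execute(
    'SELECT "CategoryText", COUNT(*) FROM "NewOrleans911Categories" '
    'GROUP BY "CategoryText" ORDER BY "CategoryText"'
).fetchall()

print(grouped == distinct, per_category)
```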

Maybe you need a query like this:
select distinct on ("CategoryText") "CategoryText", "TypeText"
from "NewOrleans911Categories"
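because with group by you cannot select columns that are neither in the group by clause nor wrapped in an aggregate function. Note that without an order by, which TypeText you get for each CategoryText is unpredictable.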

Related

Why does Postgres choose different data solely based on columns selected?

I'm running two different queries with two unions each inside a subquery:
So the structure is:
SELECT *
FROM (subquery_1
UNION SELECT subquery_2)
Now, if I perform the query on the left, I get this result:
However, the query on the right returns this result:
How are the results differing even though the conditions have not changed in either query, and the only difference was one of the selected columns in a subquery?
This is very counter-intuitive.
The operator UNION removes duplicate rows from the returned resultset.
Removing a column from the SELECT statement may produce duplicate rows that would not exist if the removed column was there.
Try UNION ALL instead, which always returns all the rows of the unioned queries.
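The effect can be reproduced with a minimal sketch. It assumes SQLite via Python's sqlite3 as a stand-in for Postgres, and the tables t1 and t2 are hypothetical, since the original queries are not shown.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, label TEXT)")
conn.execute("CREATE TABLE t2 (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, "a"), (2, "a")])
conn.executemany("INSERT INTO t2 VALUES (?, ?)", [(3, "b")])

# With id selected, every row is unique, so UNION keeps all three.
both_columns = conn.execute(
    "SELECT id, label FROM t1 UNION SELECT id, label FROM t2"
).fetchall()

# Without id, the two 'a' rows become duplicates and UNION drops one.
label_only = conn.execute(
    "SELECT label FROM t1 UNION SELECT label FROM t2"
).fetchall()

# UNION ALL keeps duplicates, so all three rows come back.
label_only_all = conn.execute(
    "SELECT label FROM t1 UNION ALL SELECT label FROM t2"
).fetchall()

print(len(both_columns), len(label_only), len(label_only_all))
```

Same tables, same conditions; only the selected columns changed, and with them the number of rows that count as duplicates.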

Why do I need to group by columns that I don't need to group by?

Say I have a query like this:
SELECT
car.id,
car.make,
car.model,
car.vin,
car.year,
car.color
FROM car GROUP BY car.make
I want to group the result by make so I can eliminate any duplicate makes. I'm essentially trying to do a SELECT DISTINCT. But I get this error:
ERROR column must appear in the GROUP BY clause or be used in an aggregate function
It seems silly to group by each column when I don't want to see any of them in a group. How do I get around this?
Instead of GROUP BY, use DISTINCT ON:
SELECT DISTINCT ON (c.make) c.*
FROM car c
ORDER BY c.make;
This will return one row for each make; which row you get is arbitrary. You can include a second key in the ORDER BY to determine the particular row you want (cheapest, oldest, etc.).
All column names in the SELECT list must appear in the GROUP BY clause unless the column is used only inside an aggregate function. PostgreSQL only lets you omit from the GROUP BY clause columns that are functionally dependent on the grouped columns, for example when you group by the table's primary key.
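As an illustration of "one row per make, and a second key decides which", here is a sketch that runs on SQLite via Python's sqlite3. SQLite has no DISTINCT ON, so this swaps in a correlated subquery that picks the oldest car per make; in Postgres the same result should come from SELECT DISTINCT ON (c.make) c.* FROM car c ORDER BY c.make, c.year, c.id. The sample rows and the reduced column list are made up.

```python
import sqlite3

# In-memory SQLite database; the car rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE car (id INTEGER PRIMARY KEY, make TEXT, model TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO car VALUES (?, ?, ?, ?)",
    [
        (1, "Ford", "Focus", 2012),
        (2, "Ford", "Fiesta", 2008),
        (3, "Honda", "Civic", 2015),
        (4, "Honda", "Accord", 2015),
    ],
)

# For each make, keep the row with the lowest year; id breaks ties so the
# choice is deterministic. This stands in for DISTINCT ON, which SQLite lacks.
oldest_per_make = conn.execute(
    """
    SELECT c.id, c.make, c.model, c.year
    FROM car c
    WHERE c.id = (
        SELECT c2.id
        FROM car c2
        WHERE c2.make = c.make
        ORDER BY c2.year, c2.id
        LIMIT 1
    )
    ORDER BY c.make
    """
).fetchall()

for row in oldest_per_make:
    print(row)
```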

Postgres: Distinct but only for one column

I have a table on pgsql with names (having more than 1 million rows), but I also have many duplicates. I select 3 fields: id, name, metadata.
I want to select them randomly with ORDER BY RANDOM() and LIMIT 1000, so I do this in many steps to save some memory in my PHP script.
But how can I do that so it only gives me a list having no duplicates in names?
For example [1,"Michael Fox","2003-03-03,34,M,4545"] will be returned but not [2,"Michael Fox","1989-02-23,M,5633"]. The name field is the most important and must be unique in the list every time I do the select, and it must be random.
I tried with GROUP BY name, but then it expects me to have id and metadata in the GROUP BY as well or in an aggregate function, but I don't want to have them somehow filtered.
Anyone knows how to fetch many columns but do only a distinct on one column?
To do a distinct on only one (or n) column(s):
select distinct on (name)
name, col1, col2
from names
This will return any of the rows containing the name. If you want to control which of the rows will be returned you need to order:
select distinct on (name)
name, col1, col2
from names
order by name, col1
This will return, for each name, the first row when ordered by col1.
From the PostgreSQL documentation on DISTINCT ON:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
Anyone knows how to fetch many columns but do only a distinct on one column?
You want the DISTINCT ON clause.
You didn't provide sample data or a complete query so I don't have anything to show you. You want to write something like:
SELECT DISTINCT ON (name) fields, id, name, metadata FROM the_table;
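This will return an unpredictable (but not "random") set of rows. If you want to make it predictable, add an ORDER BY per Clodaldo's answer. If you want a truly random row for each name, use ORDER BY name, random(): the DISTINCT ON expression has to come first in the ORDER BY. To then take a random sample of those names, wrap the query in a subquery and apply ORDER BY random() LIMIT 1000 to the outer query.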
To do a distinct on n columns:
select distinct on (col1, col2) col1, col2, col3, col4 from names
SELECT NAME,MAX(ID) as ID,MAX(METADATA) as METADATA
from SOMETABLE
GROUP BY NAME
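One caveat with the aggregate version above: MAX(ID) and MAX(METADATA) are computed independently, so the two values can come from different rows. A minimal sketch, assuming SQLite via Python's sqlite3 as a stand-in for Postgres; the "Michael Fox" rows are from the question and the "Ann Lee" row is made up. Joining the per-name MAX(id) back to the table is one way to get whole rows instead.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT, metadata TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?, ?)",
    [
        (1, "Michael Fox", "2003-03-03,34,M,4545"),
        (2, "Michael Fox", "1989-02-23,M,5633"),
        (3, "Ann Lee", "1990-01-01,F,1111"),  # hypothetical extra row
    ],
)

# Each MAX() is evaluated on its own, so id and metadata may not belong together.
mixed = conn.execute(
    "SELECT name, MAX(id), MAX(metadata) FROM names GROUP BY name ORDER BY name"
).fetchall()

# Pick one id per name, then join back to fetch that row's real metadata.
whole_rows = conn.execute(
    """
    SELECT n.id, n.name, n.metadata
    FROM names n
    JOIN (SELECT name, MAX(id) AS id FROM names GROUP BY name) m ON n.id = m.id
    ORDER BY n.name
    """
).fetchall()

print(mixed)
print(whole_rows)
```

For "Michael Fox" the first query pairs id 2 with the metadata of id 1, while the join returns id 2 with its own metadata.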

Create a query to select two columns; (Company, No. of Films) from the database

I have created a database as part of university assignment and I have hit a snag with the question in the title.
More likely I am being asked to find out how many films each company has made. Which suggests to me a group by query. But I have no idea where to begin. It is only a two mark question but the syntax is not clicking in my head.
My schema is:
CREATE TABLE Movie
(movieID CHAR(3) ,
title CHAR(36),
year NUMBER,
company CHAR(50),
totalNoms NUMBER,
awardsWon NUMBER,
DVDPrice NUMBER(5,2),
discountPrice NUMBER(5,2))
There are other tables but at first glance I don't think they are relevant to this question.
I am using sqlplus10
The answer you need comes from three basic SQL concepts, and I'll step through them with you. If you need more assistance to create an answer from these hints, let me know and I can try to keep guiding you.
Group By
As you mentioned, SQL offers a GROUP BY clause that can help you.
A SQL Query utilizing GROUP BY would look like the following.
SELECT list, fields, aggregate(value)
FROM tablename
--WHERE goes here, if you need to restrict your result set
GROUP BY list, fields
A GROUP BY query can only return fields listed in the GROUP BY clause, or aggregate functions acting on each group.
Aggregate Functions
Your homework question also needs an aggregate function called Count, which counts rows. A simple query like the following returns the number of rows in the table.
SELECT Count(*)
FROM tablename
The two can be combined, allowing you to get the Count of each group in the following way.
SELECT list, fields, count(*)
FROM tablename
GROUP BY list, fields
Column Aliases
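Another answer also tried to introduce you to SQL column aliases, using the AS keyword:
SELECT Count(*) as count
...
Oracle accepts a column alias with or without AS. Putting the alias in double quotes preserves its exact case and allows spaces, which you would need for a heading like "No. of Films":
SELECT Count(*) "count"
...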
I'm not going to provide you the SQL, but instead a way to think about it.
What you want to do is select where the company matches and count the total rows returned. That count is the number of films made by the specified company.
Hope that points you in the right direction.
Select company, count(*) AS count
from Movie
group by company
select * with group by company won't work in Oracle, because every selected column must be in the group by clause or inside an aggregate function.
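For trying the group by and count combination outside sqlplus, here is a small sketch. It assumes SQLite via Python's sqlite3 rather than Oracle, uses a cut-down Movie table with only three of the columns from the schema, and the film and company values are made up.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Oracle (assumption);
# only a subset of the Movie columns is recreated here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (movieID TEXT, title TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    [
        ("001", "Film A", "Studio One"),
        ("002", "Film B", "Studio One"),
        ("003", "Film C", "Studio Two"),
    ],
)

# One output row per company; COUNT(*) counts the films in each group.
# The quoted alias keeps the spaces and capitalisation of the heading.
films_per_company = conn.execute(
    'SELECT company, COUNT(*) AS "No. of Films" '
    "FROM Movie GROUP BY company ORDER BY company"
).fetchall()

for company, film_count in films_per_company:
    print(company, film_count)
```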

sql query to retrieve DISTINCT rows on left join

I am developing a t-sql query to return a left join of two tables. When I just select records from Table A, it gives me only 2 records. The problem is that when I left join it with Table B, it gives me 4 records. How can I reduce this to just 2 records?
One problem though is that I am only aware of one PK/FK to link these two tables.
The field you are joining on must have values that appear more than once in Table B; this is why the join returns multiple rows. To reduce the row count you will have to either add further fields to the join, or add a where clause to filter out the rows you don't need.
Alternatively you could use a GROUP BY statement to group the rows up, but this may not be what you need.
Remember that a left join returns NULL for the joined table's fields when there is no match.
You could also use select distinct, but I can't see your issue clearly. Can you give us more details?
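Since the question has no sample data, here is a hypothetical sketch of the situation, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server. Tables a and b and their columns are made up: a has 2 rows, and each one matches 2 rows in b, so the join returns 4. Grouping by the columns from a and aggregating b brings it back to 2.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption);
# tables a and b are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, note TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, "first"), (2, "second")])
conn.executemany(
    "INSERT INTO b VALUES (?, ?, ?)",
    [(10, 1, "x"), (11, 1, "y"), (12, 2, "z"), (13, 2, "w")],
)

# Each row of a matches two rows of b, so the join produces four rows.
joined = conn.execute(
    "SELECT a.id, b.note FROM a LEFT JOIN b ON b.a_id = a.id"
).fetchall()

# Grouping by a's columns and aggregating b gives one row per row of a.
collapsed = conn.execute(
    "SELECT a.id, a.name, COUNT(b.id) "
    "FROM a LEFT JOIN b ON b.a_id = a.id "
    "GROUP BY a.id, a.name ORDER BY a.id"
).fetchall()

print(len(joined), collapsed)
```

Whether an aggregate, an extra join condition, or a where clause is the right fix depends on which of the matching rows in Table B you actually want.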