How to find duplicate rows in JPA - jpa

Is there a way to find duplicate entries in a data set using JPA?
| id | text |
-------------
| 1 | foo |
| 2 | bar |
| 3 | foo |
I want to have only entries 1 & 3 in my set.
I can't make it unique on this field.
—
DISTINCT would give me rows 1 & 2.
If it’s a query, a join with the same table? I’m not sure how that would work. I couldn’t get group by to function.

Edited
I believe you can use the following syntax without inner query:
SELECT id, text, COUNT(*) FROM entity GROUP BY text HAVING COUNT(*) > 1

You can apply common practice from SQL to JPQL with the following query:
SELECT e FROM Entity e WHERE e.text IN (SELECT text FROM Entity d GROUP BY text HAVING COUNT(*)>1.
A sub-query is required so you'd need an index on text column for it to be efficient.

Related

Inserting data from one table based on some Id in postgres

I have two tables with almost 13,000 records and looks something like this
TableA:
ID Status Option
----------------------
1 | Approved |
2 | Reject |
3 | Approved |
4
.
.
13,000
TableB
Name Option Status
-----------------------------------------------
First | {'data':'Add into box','ID':'1'} | Approved
Second | {'data':'Don't Add','ID':'2'} | Reject
Third | {'data':'Add into box','ID':'3'} | Approved
.
.
.
13,000
I want to fill the Option column (data type varchar)in table A with similar data to that of Table B Option column (data type B) based on same ID which is also in option json object. How do i fill them in one go rather than going one by one.
An update query where we set the "option" in TableA using a subquery, where we filter the result based on "id" of TableA matching with "id" inside varchar column "option" of TableB.
update tablea
set option = (select option from tableb
where tablea.id::text = tableb.option::json ->> 'id'
limit 1);
-- assuming id has a 1:1 relation in both tables

selecting multiple columns but group by only one in postgres

I have a simple table in postgres:
remoteaddr count
142.4.218.156 592
158.69.26.144 613
167.114.209.28 618
Which I pulled using the following:
select remoteaddr,
count (remoteaddr)
from domain_visitors
group by remoteaddr
having count (remoteaddr) > 500
How do I select additional columns and still only group by remoteaddr?
Option 1: You could use the array_agg() function to concatenate the additional column values into a grouped list:
SELECT
remoteaddr,
array_agg(DISTINCT username) AS unique_users,
array_agg(username) AS repeated_users,
count(remoteaddr) as remote_count
FROM domain_visitors
GROUP BY remoteaddr;
See this SQL Fiddle. This query would return something like the below:
+----------------+---------------------------------+-----------------------------------------------------------------------------------------------------+--------------+
| remoteaddr | unique_users | repeated_users | remote_count |
+----------------+---------------------------------+-----------------------------------------------------------------------------------------------------+--------------+
| 142.4.218.156 | anotheruser,user9688766,vistor1 | user9688766,anotheruser,vistor1,vistor1,vistor1,vistor1,vistor1,anotheruser,anotheruser,anotheruser | 10 |
| 158.69.26.144 | anotheruser,user9688766 | anotheruser,user9688766,user9688766,user9688766,user9688766 | 5 |
| 167.114.209.28 | vistor1 | vistor1 | 1 |
+----------------+---------------------------------+-----------------------------------------------------------------------------------------------------+--------------+
Option 2: You could put your first query in a common table expression (aka a "WITH" clause), and join it against the original table, like this:
WITH grouped_addr AS (
SELECT remoteaddr, count(remoteaddr) AS remote_count
FROM domain_visitors
GROUP BY remoteaddr
)
SELECT ga.remoteaddr, dv.username, ga.remote_count
FROM grouped_addr ga
INNER JOIN domain_visitors dv
ON ga.remoteaddr = dv.remoteaddr
WHERE remote_count > 500;
Here is a SQL Fiddle.
Bear in mind that this will return repeated results for any additional columns (in this example, username). This is not usually what you want. Note each of the SELECT examples in the Fiddles and see which best suits your purpose.

PostgreSQL Group By not working as expected - wants too many inclusions

I have a simple postgresql table that I'm tying to query. Imaging a table like this...
| ID | Account_ID | Iteration |
|----|------------|-----------|
| 1 | 100 | 1 |
| 2 | 101 | 1 |
| 3 | 100 | 2 |
I need to get the ID column for each Account_ID where Iteration is at its maximum value. So, you'd think something like this would work
SELECT "ID", "Account_ID", MAX("Iteration")
FROM "Table_Name"
GROUP BY "Account_ID"
And I expect to get:
| ID | Account_ID | MAX(Iteration) |
|----|------------|----------------|
| 2 | 101 | 1 |
| 3 | 100 | 2 |
But when I do this, Postgres complains:
ERROR: column "ID" must appear in the GROUP BY clause or be used in an aggregate function
Which, when I do that it just destroys the grouping altogether and gives me the whole table!
Is the best way to approach this using the following?
SELECT DISTINCT ON ("Account_ID") "ID", "Account_ID", "Iteration"
FROM "Marketing_Sparks"
ORDER BY "Account_ID" ASC, "Iteration" DESC;
The GROUP BY statement aggregates rows with the same values in the columns included in the group by into a single row. Because this row isn't the same as the original row, you can't have a column that is not in the group by or in an aggregate function. To get what you want, you will probably have to select without the ID column, then join the result to the original table. I don't know PostgreSQL syntax, but I assume it would be something like the following.
SELECT Table_Name.ID, aggregate.Account_ID, aggregate.MIteration
(SELECT Account_ID, MAX(Iteration) AS MIteration
FROM Table_Name
GROUP BY Account_ID) aggregate
LEFT JOIN Table_Name ON aggregate.Account_ID = Table_Name.Account_ID AND
aggregate.MIteration = Tabel_Name.Iteration

How to use COUNT() in more that one column?

Let's say I have this 3 tables
Countries ProvOrStates MajorCities
-----+------------- -----+----------- -----+-------------
Id | CountryName Id | CId | Name Id | POSId | Name
-----+------------- -----+----------- -----+-------------
1 | USA 1 | 1 | NY 1 | 1 | NYC
How do you get something like
---------------------------------------------
CountryName | ProvinceOrState | MajorCities
| (Count) | (Count)
---------------------------------------------
USA | 50 | 200
---------------------------------------------
Canada | 10 | 57
So far, the way I see it:
Run the first SELECT COUNT (GROUP BY Countries.Id) on Countries JOIN ProvOrStates,
store the result in a table variable,
Run the second SELECT COUNT (GROUP BY Countries.Id) on ProvOrStates JOIN MajorCities,
Update the table variable based on the Countries.Id
Join the table variable with Countries table ON Countries.Id = Id of the table variable.
Is there a possibility to run just one query instead of multiple intermediary queries? I don't know if it's even feasible as I've tried with no luck.
Thanks for helping
Use sub query or derived tables and views
Basically If You You Have 3 Tables
select * from [TableOne] as T1
join
(
select T2.Column, T3.Column
from [TableTwo] as T2
join [TableThree] as T3
on T2.CondtionColumn = T3.CondtionColumn
) AS DerivedTable
on T1.DepName = DerivedTable.DepName
And when you are 100% percent sure it's working you can create a view that contains your three tables join and call it when ever you want
PS: in case of any identical column names or when you get this message
"The column 'ColumnName' was specified multiple times for 'Table'. "
You can use alias to solve this problem
This answer comes from #lotzInSpace.
SELECT ct.[CountryName], COUNT(DISTINCT p.[Id]), COUNT(DISTINCT c.[Id])
FROM dbo.[Countries] ct
LEFT JOIN dbo.[Provinces] p
ON ct.[Id] = p.[CountryId]
LEFT JOIN dbo.[Cities] c
ON p.[Id] = c.[ProvinceId]
GROUP BY ct.[CountryName]
It's working. I'm using LEFT JOIN instead of INNER JOIN because, if a country doesn't have provinces, or a province doesn't have cities, then that country or province doesn't display.
Thanks again #lotzInSpace.

Why do I have to provide an items.id column to the group by clause?

I want to return unique items based on condition, sorted by price asc. My query fails because Postgres wants items.id to be present in the group by clause. If it's included the query returns everything matching the where clause, which is not what I want. Why do I need to include the column?
select items.*
from items
where product_id = 1 and items.status = 'in_stock'
group by condition /* , items.id returns everything */
order by items.price asc
| id | condition | price |
--------------------------
| 1 | new | 9 |
| 2 | good | 5 |
| 3 | good | 3 |
I only want items with ids 1 and 3.
Update: Here's a fiddle using the answer below, which still produces the error:
http://sqlfiddle.com/#!1/33786/2
The problem is that PostgreSQL has no way of knowing which items records you want to take values from; that is, it can't tell that you want this:
| id | condition | price |
--------------------------
| 1 | new | 9 |
| 3 | good | 3 |
and not this:
| id | condition | price |
--------------------------
| 1 | new | 9 |
| 2 | good | 5 |
To fix this, you need to use some sort of aggregation function, such as MAX:
SELECT MAX(id) AS id,
condition,
MAX(price) AS price
FROM items
WHERE product_id = 1
AND status = 'in_stock'
GROUP BY condition
ORDER BY price ASC
which gives:
| id | condition | price |
--------------------------
| 1 | new | 9 |
| 3 | good | 5 |
(This restriction is part of the SQL standard, and most DBMSes enforce it. One exception is MySQL, which allows your query, but with the caveat that "The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate" [link].)
SQL Fiddle
select *
from (
select distinct on (cond)
id, cond, price
from items
where product_id = 1 and items.status = 'in_stock'
order by cond, price
) s
order by price
The SQL standard requires this behaviour, though some databases like MySQL ignore it and instead return unpredictable results.
If there's more than one row for "cond = good" and you ask for the "id" of the row where "cond = good", which row should the database give you? The row with id = 3, or id = 2? How should it know which to pick? MySQL picks an arbitrary row if there are multiple candidates, but this isn't allowed by the standard.
In your case you seem to want to pick the lowest-price row for each condition.
PostgreSQL provides an extension, DISTINCT ON ..., to help with this. Clodaldo has demonstrated this in his answer, so I won't repeat that here. Using DISTINCT ON will be much more efficient than the example below.
The SQL-standard way would be to use a window to rank the results, then filter on the ranked data. Unfortunately this is pretty inefficient as it requires all rows that match the inner where clause to be collected and sorted.
SELECT *
FROM (
SELECT *, dense_rank() OVER w AS itemrank
FROM items
WHERE product_id = 1 AND items.status = 'in_stock'
WINDOW w AS (PARTITION BY cond ORDER BY price ASC)
) ranked_items
WHERE itemrank = 1;
(http://sqlfiddle.com/#!1/33786/19)
Another SQL-standard way is to use an aggregation subquery to find the min prices for each category then display all rows with the min price:
SELECT *
FROM items INNER JOIN (
SELECT cond, min(price) AS minprice
FROM items
WHERE product_id = 1 AND items.status = 'in_stock'
GROUP BY cond
) minprices(cond, price)
ON (items.price = minprices.price AND items.cond = minprices.cond)
ORDER BY items.price;
Unlike the DISTINCT ON version, though, this will display multiple entries if the lowest priced item has more than one entry with the same cond and price.
So.. you should really use the DISTINCT ON approach, but you need to understand it. Start with the PostgreSQL documentation here.
On a side note, newer PostgreSQL versions allow you to refer to any column of a table whose primary key you've listed in GROUP BY; they identify the functional dependency of the other columns on the primary key. So you don't have to aggregate other cols if you've mentioned the PK in newer versions. That's what the standard requires, but older versions weren't smart enough to figure it out and required all columns to be listed explicitly.
That's what people who ask this question usually want to know, but doesn't apply strictly to your question since it turns out you're trying to use GROUP BY to filter rows.