PostgreSQL: Count Number of Occurrences in Columns - postgresql

BACKGROUND
I have three large tables (employee_info, driver_info, school_info) that I have joined together on common attributes using a series of LEFT OUTER JOIN operations. After each join, the resulting number of records increased slightly, indicating that there are duplicate IDs in the data. To try and find all of the duplicates in the IDs, I dumped the ID columns into a temp table like so:
Original Dump of ID Columns
first_name  last_name  employee_id  driver_id  school_id
Mickey      Mouse      1234         abcd       wxyz
Donald      Duck       2423         heca       qwer
Mary        Poppins    1111         acbe       aaaa
Wiley       Cayote     1234         strf       aaaa
Daffy       Duck       1256         acbe       pqrs
Bugs        Bunny      9999         strf       yxwv
Pink        Panther    2222         zzzz       zzaa
Michael     Archangel  0000         rstu       aaaa
In this overly simplified example, you will see that IDs 1234 (employee_id), strf (driver_id), and aaaa (school_id) are each duplicated at least once. I would like to add a count column for each of the ID columns, and populate them with the count for each ID used, like so:
ID Columns with Counts
first_name  last_name  employee_id  employee_id_count  driver_id  driver_id_count  school_id  school_id_count
Mickey      Mouse      1234         2                  abcd       1                wxyz       1
Donald      Duck       2423         1                  heca       1                qwer       1
Mary        Poppins    1111         1                  acbe       1                aaaa       3
Wiley       Cayote     1234         2                  strf       2                aaaa       3
Daffy       Duck       1256         1                  acbe       1                pqrs       1
Bugs        Bunny      9999         1                  strf       2                yxwv       1
Pink        Panther    2222         1                  zzzz       1                zzaa       1
Michael     Archangel  0000         1                  rstu       1                aaaa       3
You can see that IDs 1234 and strf each have 2 in the count, and aaaa has 3. After generating this table, my goal is to pull out all records where any of the counts are greater than 1, like so:
All Records with One or More Duplicate IDs
first_name  last_name  employee_id  employee_id_count  driver_id  driver_id_count  school_id  school_id_count
Mickey      Mouse      1234         2                  abcd       1                wxyz       1
Mary        Poppins    1111         1                  acbe       1                aaaa       3
Wiley       Cayote     1234         2                  strf       2                aaaa       3
Bugs        Bunny      9999         1                  strf       2                yxwv       1
Michael     Archangel  0000         1                  rstu       1                aaaa       3
Real World Perspective
In my real-world work, the JOIN'd table contains 100 columns, 15 different ID fields and over 30,000 records, and the final table came out 28 records larger than the original. This may seem like a small number, but each of the 28 represents a broken link that we must fix.
Is there a simple way to get the counts populated like in the second table above? I have been wrestling with this for hours already, and have not been able to make this work. I tried some aggregate functions, but they cannot be used in table UPDATE operations.

The COUNT function, when used as an analytic function, can do what you want here, e.g.
WITH cte AS (
    SELECT *,
           -- number of rows sharing each ID value
           COUNT(employee_id) OVER (PARTITION BY employee_id) AS employee_id_count,
           COUNT(driver_id) OVER (PARTITION BY driver_id) AS driver_id_count,
           COUNT(school_id) OVER (PARTITION BY school_id) AS school_id_count
    FROM yourTable
)
SELECT *
FROM cte
WHERE
    employee_id_count > 1 OR
    driver_id_count > 1 OR
    school_id_count > 1;
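As a quick check of the window-function approach, here is a minimal sketch that runs the same kind of query against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only to keep the example self-contained; SQLite needs version 3.25 or later for window functions). The table name id_dump and the function name find_duplicate_rows are hypothetical.

```python
import sqlite3

# Sample rows copied from the "Original Dump of ID Columns" table.
ROWS = [
    ("Mickey", "Mouse", "1234", "abcd", "wxyz"),
    ("Donald", "Duck", "2423", "heca", "qwer"),
    ("Mary", "Poppins", "1111", "acbe", "aaaa"),
    ("Wiley", "Cayote", "1234", "strf", "aaaa"),
    ("Daffy", "Duck", "1256", "acbe", "pqrs"),
    ("Bugs", "Bunny", "9999", "strf", "yxwv"),
    ("Pink", "Panther", "2222", "zzzz", "zzaa"),
    ("Michael", "Archangel", "0000", "rstu", "aaaa"),
]

QUERY = """
WITH cte AS (
    SELECT *,
           COUNT(employee_id) OVER (PARTITION BY employee_id) AS employee_id_count,
           COUNT(driver_id)   OVER (PARTITION BY driver_id)   AS driver_id_count,
           COUNT(school_id)   OVER (PARTITION BY school_id)   AS school_id_count
    FROM id_dump
)
SELECT first_name, employee_id_count, driver_id_count, school_id_count
FROM cte
WHERE employee_id_count > 1
   OR driver_id_count > 1
   OR school_id_count > 1
ORDER BY first_name
"""


def find_duplicate_rows():
    """Load the sample rows into an in-memory table and return flagged rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE id_dump (first_name TEXT, last_name TEXT, "
        "employee_id TEXT, driver_id TEXT, school_id TEXT)"
    )
    conn.executemany("INSERT INTO id_dump VALUES (?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in find_duplicate_rows():
        print(row)
```

Note that in the sample rows as posted, acbe also appears twice in driver_id (Mary and Daffy), so the computed driver_id_count for it is 2 and Daffy's row is returned as well, unlike in the hand-written tables in the question.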

Related

Replace empty strings with NULL instead of empty strings when using JOIN

I have two tables:
table_a
id name
1 john
2 dave
3 tim
4 marta
5 jim
table_b
id sum random_metric
1 10.50 abc
3 11.5 efg
5 5.76 ghj
I have joined them on id
SELECT ...
FROM table_a
LEFT JOIN table_b ON table_a.id = table_b.id
and I get:
id name sum random_metric
1 john 10.5 abc
2 dave
3 tim 11.5 efg
4 marta
5 jim 5.76 ghj
Then I want to convert the sum column to double precision but since it has empty strings in rows 2, 4 it does not work.
How could I join tables so that I would have this:
id name sum random_metric
1 john 10.5 abc
2 dave NULL NULL
3 tim 11.5 efg
4 marta NULL NULL
5 jim 5.76 ghj
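One possible approach, sketched here under assumptions: a LEFT JOIN by itself already produces NULL for unmatched rows, so empty strings would have to come from the stored data. If sum is a text column that really holds empty strings, wrapping it in NULLIF before the cast turns them into NULL. The sketch uses Python's built-in sqlite3 module as a stand-in database; the extra table_b row with id 2 and empty strings is hypothetical and added only to show the empty-string case, and the function name join_with_nulls is also hypothetical. In PostgreSQL the equivalent cast would be NULLIF(b.sum, '')::double precision.

```python
import sqlite3

QUERY = """
SELECT a.id,
       a.name,
       CAST(NULLIF(b."sum", '') AS REAL) AS "sum",   -- '' becomes NULL before the cast
       NULLIF(b.random_metric, '')      AS random_metric
FROM table_a AS a
LEFT JOIN table_b AS b ON a.id = b.id
ORDER BY a.id
"""


def join_with_nulls():
    """Build the two sample tables in memory and run the NULLIF join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_a (id INTEGER, name TEXT)")
    conn.execute('CREATE TABLE table_b (id INTEGER, "sum" TEXT, random_metric TEXT)')
    conn.executemany(
        "INSERT INTO table_a VALUES (?, ?)",
        [(1, "john"), (2, "dave"), (3, "tim"), (4, "marta"), (5, "jim")],
    )
    conn.executemany(
        "INSERT INTO table_b VALUES (?, ?, ?)",
        [
            (1, "10.50", "abc"),
            (2, "", ""),          # hypothetical row: stored empty strings
            (3, "11.5", "efg"),
            (5, "5.76", "ghj"),
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in join_with_nulls():
        print(row)
```

Rows 2 and 4 come back with NULL (None in Python) in both columns: row 2 because of NULLIF, row 4 because the LEFT JOIN found no match.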

Autoincrement in query

I need to create a query in which each row's value is 8% higher than the value in the previous row.
Table (let's name it money) contains one row (and two columns), and it looks like
AMOUNT ID
100.00 AAA
I just need to generate data from this table like this (one select from this table, e.g. 6 iterations):
100.00 AAA
108.00 AAA
116.64 AAA
125.97 AAA
136.04 AAA
146.93 AAA
You can do that with a common table expression.
E.g. if your source looks like this:
db2 "create table money(amount decimal(31,2), id varchar(10))"
db2 "insert into money values (100,'AAA')"
You can create the input data with the following query (I will include counter column for clarity):
db2 "with
cte(c1,c2,counter)
as
(select
amount, id, 1
from
money
union all
select
c1*1.08, c2, counter+1
from
cte
where counter < 10)
select * from cte"
C1 C2 COUNTER
--------------------------------- ---------- -----------
100.00 AAA 1
108.00 AAA 2
116.64 AAA 3
125.97 AAA 4
136.04 AAA 5
146.92 AAA 6
158.67 AAA 7
171.36 AAA 8
185.06 AAA 9
199.86 AAA 10
To populate the existing table without repeating the existing row you use e.g. an insert like this:
$ db2 "insert into money
with
cte(c1,c2,counter)
as
(select
amount*1.08, id, 1
from
money
union all
select
c1*1.08, c2, counter+1
from
cte
where counter < 10) select c1,c2 from cte"
$ db2 "select * from money"
AMOUNT ID
--------------------------------- ----------
100.00 AAA
108.00 AAA
116.64 AAA
125.97 AAA
136.04 AAA
146.93 AAA
158.68 AAA
171.38 AAA
185.09 AAA
199.90 AAA
215.89 AAA
11 record(s) selected.
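The two listings above differ in the last digit from the sixth row on (146.92 versus 146.93). The following Python sketch, which is not DB2 itself, shows that the first listing matches truncating to two decimals after every multiplication, while the second matches carrying full precision and truncating only at the end. This is an observation about the arithmetic, not a statement about DB2's internal type rules; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

RATE = Decimal("1.08")  # 8% growth per row
CENT = Decimal("0.01")  # two decimal places, as in DECIMAL(31,2)


def truncate_each_step(start, rows):
    """Cut every intermediate value to two decimals before the next multiply."""
    values = [start.quantize(CENT, rounding=ROUND_DOWN)]
    while len(values) < rows:
        values.append((values[-1] * RATE).quantize(CENT, rounding=ROUND_DOWN))
    return values


def truncate_at_end(start, rows):
    """Carry full precision through the chain; truncate only the reported value."""
    exact = start
    values = []
    for _ in range(rows):
        values.append(exact.quantize(CENT, rounding=ROUND_DOWN))
        exact *= RATE
    return values


if __name__ == "__main__":
    start = Decimal("100.00")
    print([str(v) for v in truncate_each_step(start, 10)])
    print([str(v) for v in truncate_at_end(start, 11)])
```

Which behaviour is wanted depends on whether the 8% should apply to the stored two-decimal amount or to the exact running value.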

query multiple attribute in a table with single attribute in another table

I can't explain my problem well in English, so I will describe it with an example.
user_id name surname
1 john great
2 mary white
3 joseph alann
event_id official_id assistant_id date
1 1 2 2017-12-19
2 1 3 2017-12-20
3 2 3 2017-12-21
I want to get names at the same time when I query an event. I tried:
SELECT * FROM event a, user b WHERE a.official_id=b.user_id AND a.assistant_id=b.user_id
When I use "OR" instead of "AND", it gives me a Cartesian result. I want a result like this:
event_id off_id off_name asst_id asst_name date
1 1 john 2 mary 2017-12-19
2 1 john 3 joseph 2017-12-20
3 2 mary 3 joseph 2017-12-21
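One common way to get this shape, sketched here under assumptions, is to join the user table twice under different aliases, once for the official and once for the assistant. The AND version returns nothing because a single user row cannot equal both IDs at once. The sketch runs on Python's built-in sqlite3 module as a stand-in database; the aliases o and a and the function name events_with_names are hypothetical, and "user" is quoted because it is a reserved word in some databases.

```python
import sqlite3

QUERY = """
SELECT e.event_id,
       e.official_id  AS off_id,
       o.name         AS off_name,
       e.assistant_id AS asst_id,
       a.name         AS asst_name,
       e.date
FROM event AS e
JOIN "user" AS o ON o.user_id = e.official_id    -- first copy: the official
JOIN "user" AS a ON a.user_id = e.assistant_id   -- second copy: the assistant
ORDER BY e.event_id
"""


def events_with_names():
    """Create the two sample tables in memory and resolve both names per event."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "user" (user_id INTEGER, name TEXT, surname TEXT)')
    conn.execute(
        "CREATE TABLE event (event_id INTEGER, official_id INTEGER, "
        "assistant_id INTEGER, date TEXT)"
    )
    conn.executemany(
        'INSERT INTO "user" VALUES (?, ?, ?)',
        [(1, "john", "great"), (2, "mary", "white"), (3, "joseph", "alann")],
    )
    conn.executemany(
        "INSERT INTO event VALUES (?, ?, ?, ?)",
        [(1, 1, 2, "2017-12-19"), (2, 1, 3, "2017-12-20"), (3, 2, 3, "2017-12-21")],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in events_with_names():
        print(row)
```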

TSQL advanced ranking, grouping to find date spans

I need to do some advanced grouping in TSQL with data that looks like this:
PK YEARMO DATA
1 201201 AAA
1 201202 AAA
1 201203 AAA
1 201204 AAA
1 201205 (null)
1 201206 BBB
1 201207 AAA
2 201301 CCC
2 201302 CCC
2 201303 CCC
2 201304 DDD
2 201305 DDD
And then, every time DATA changes per primary key, pull up the date range for said item so that it looks something like this:
PK START_DT STOP_DT DATA
1 201201 201204 AAA
1 201205 201205 (null)
1 201206 201206 BBB
1 201207 201207 AAA
2 201301 201303 CCC
2 201304 201305 DDD
I've been playing around with ranking functions but haven't had much success. Any pointers in the right direction would be supremely awesome and appreciated.
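You can use the row_number() function to partition your data into ranges:
SELECT
    PK,
    START_DT = MIN(YEARMO),
    STOP_DT = MAX(YEARMO),
    DATA
FROM (
    SELECT
        PK, DATA, YEARMO,
        -- the difference of the two row numbers stays constant within a run of equal DATA;
        -- both are numbered per PK so overlapping YEARMO values across keys do not interfere
        ROW_NUMBER() OVER (PARTITION BY PK ORDER BY YEARMO) -
        ROW_NUMBER() OVER (PARTITION BY PK, DATA ORDER BY YEARMO) grp
    FROM your_table
) A
GROUP BY PK, DATA, grp
ORDER BY PK, MIN(YEARMO)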
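To see the row-number difference at work, here is a minimal sketch that runs the same idea on the sample data from the question. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption for self-containment, so column aliases use AS instead of the T-SQL name = expression form; SQLite 3.25 or later is needed). Both row numbers are partitioned by pk here, so that overlapping YEARMO values across keys cannot disturb the numbering. The function name date_spans is hypothetical.

```python
import sqlite3

ROWS = [
    (1, 201201, "AAA"), (1, 201202, "AAA"), (1, 201203, "AAA"), (1, 201204, "AAA"),
    (1, 201205, None), (1, 201206, "BBB"), (1, 201207, "AAA"),
    (2, 201301, "CCC"), (2, 201302, "CCC"), (2, 201303, "CCC"),
    (2, 201304, "DDD"), (2, 201305, "DDD"),
]

QUERY = """
SELECT pk, MIN(yearmo) AS start_dt, MAX(yearmo) AS stop_dt, data
FROM (
    SELECT pk, data, yearmo,
           -- constant within each consecutive run of the same data value
           ROW_NUMBER() OVER (PARTITION BY pk ORDER BY yearmo)
         - ROW_NUMBER() OVER (PARTITION BY pk, data ORDER BY yearmo) AS grp
    FROM your_table
) AS a
GROUP BY pk, data, grp
ORDER BY pk, MIN(yearmo)
"""


def date_spans():
    """Load the sample rows and collapse consecutive equal values into spans."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (pk INTEGER, yearmo INTEGER, data TEXT)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in date_spans():
        print(row)
```

For pk 1, the AAA rows get a difference of 0 for 201201 through 201204 and 2 for 201207, which is what separates the two AAA spans.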

Return all records regardless if there is a match

A record in Table 1 may have a matching entry in Table 2 whose address column is null, or it may have no matching entry in Table 2 at all.
I want to present all the records in Table 1 but also present corresponding entries from Table 2. My RESULT is what I am trying to achieve.
Table 1
ID First Last
1 John Smith
2 Bob Long
3 Bill Davis
4 Sam Bird
5 Tom Fenton
6 Mary Willis
Table 2
RefID ID Address
1 1 123 Main
2 2 555 Center
3 3 626 Smith
4 4 412 Walnut
5 1
6 2 555 Center
7 3
8 4 412 Walnut
Result
Id First Last Address
1 John Smith 123 Main
2 Bob Long 555 Center
3 Bill Davis 626 Smith
4 Sam Bird 412 Walnut
5 Tom Fenton
6 Mary Willis
You need an outer join for this:
SELECT * FROM Table1 t1 LEFT OUTER JOIN Table2 t2 ON t1.ID = t2.RefID
How do you join those two tables? If Table 2 has more than one matching address, how do you want to display them? Please clarify in your question.
Here is a query based on my assumptions.
SELECT
ID, First, Last,
Address = (SELECT MAX(Address) FROM Table2 t2 WHERE t1.ID = t2.ID)
FROM Table1 t1
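The correlated-subquery version can be checked against the sample tables with a short sketch. It uses Python's built-in sqlite3 module as a stand-in database and assumes the blank addresses in Table 2 are stored as NULL (the question does not say whether they are NULL or empty strings; MAX picks the non-blank address in either case). The function name one_address_per_person is hypothetical, and First and Last are quoted only as a precaution against keyword clashes.

```python
import sqlite3

QUERY = """
SELECT t1.ID, t1."First", t1."Last",
       -- one address per person; MAX ignores NULLs, no match yields NULL
       (SELECT MAX(t2.Address) FROM Table2 AS t2 WHERE t2.ID = t1.ID) AS Address
FROM Table1 AS t1
ORDER BY t1.ID
"""


def one_address_per_person():
    """Return every Table1 row with at most one address from Table2."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE Table1 (ID INTEGER, "First" TEXT, "Last" TEXT)')
    conn.execute("CREATE TABLE Table2 (RefID INTEGER, ID INTEGER, Address TEXT)")
    conn.executemany(
        "INSERT INTO Table1 VALUES (?, ?, ?)",
        [(1, "John", "Smith"), (2, "Bob", "Long"), (3, "Bill", "Davis"),
         (4, "Sam", "Bird"), (5, "Tom", "Fenton"), (6, "Mary", "Willis")],
    )
    conn.executemany(
        "INSERT INTO Table2 VALUES (?, ?, ?)",
        [(1, 1, "123 Main"), (2, 2, "555 Center"), (3, 3, "626 Smith"),
         (4, 4, "412 Walnut"), (5, 1, None), (6, 2, "555 Center"),
         (7, 3, None), (8, 4, "412 Walnut")],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in one_address_per_person():
        print(row)
```

Because the subquery returns a single value per Table 1 row, the duplicate entries in Table 2 do not multiply the output rows the way a plain join on ID would.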