'View' (NOT DELETE) Duplicate Rows from a PostgreSQL table obtained from joins

I have a temp table that I created by joining three tables:
Trips
Stops
Stop_times
The Stop_times table has a list of trip_ids, the corresponding stops and the scheduled arrival and departure times of buses at those stops.
I searched online, and everywhere I look I find answers on how to delete duplicates (using ctid or nested queries), but not on how to view them.
My query looks something like this:
CREATE TEMP TABLE temp as
SELECT
(CASE st.arrival_time < current_timestamp::time
WHEN true THEN (current_timestamp::date + interval '1 day') + st.arrival_time
ELSE (current_timestamp::date) + st.arrival_time
END) as arrival,
CASE st.departure_time < current_timestamp::time
WHEN true THEN (current_timestamp::date + interval '1 day') + st.departure_time
ELSE (current_timestamp::date) + st.departure_time
END as departure, st.trip_id, st.stop_id, st.stop_headsign,route_id, t.trip_headsign, s.stop_code, s.stop_name, s.stop_lat, s.stop_lon
FROM schema.stop_times st
JOIN schema.trips t ON t.trip_id=st.trip_id
JOIN schema.stops s ON s.stop_id=st.stop_id
order by arrival, departure;
I know that there are duplicates (from comparing select * and select DISTINCT on temp); I just need to identify them. Any help will be appreciated!
PS: I know I can use DISTINCT to get rid of the duplicates, but it slows the query down a lot, so I need to rework the query, and for that I need to identify the duplicates. The result has more than 200,000 records, so exporting them to Excel and filtering duplicates there is not an option either (I tried, but Excel can't handle it).

I believe this will give you what you want:
SELECT arrival, departure, trip_id, stop_id, stop_headsign, route_id,
trip_headsign, stop_code, stop_name, stop_lat, stop_lon, count(*)
FROM temp
GROUP BY arrival, departure, trip_id, stop_id, stop_headsign, route_id,
trip_headsign, stop_code, stop_name, stop_lat, stop_lon
HAVING count(*) > 1;
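The GROUP BY / HAVING pattern above can be tried on a tiny data set. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table name temp_rows, its three columns and the sample values are made up for illustration and are not taken from the question.

```python
import sqlite3

# SQLite stands in for PostgreSQL; temp_rows and its values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_rows (trip_id TEXT, stop_id TEXT, arrival TEXT)")
conn.executemany(
    "INSERT INTO temp_rows VALUES (?, ?, ?)",
    [
        ("T1", "S1", "08:00"),
        ("T1", "S1", "08:00"),  # exact duplicate of the row above
        ("T1", "S2", "08:10"),  # unique row
        ("T2", "S1", "09:00"),
        ("T2", "S1", "09:00"),
        ("T2", "S1", "09:00"),  # appears three times in total
    ],
)

# Group on every column; groups with more than one member are the duplicates.
duplicates = conn.execute(
    """
    SELECT trip_id, stop_id, arrival, count(*) AS copies
    FROM temp_rows
    GROUP BY trip_id, stop_id, arrival
    HAVING count(*) > 1
    ORDER BY trip_id, stop_id, arrival
    """
).fetchall()

for row in duplicates:
    print(row)
```

Each returned row is one distinct combination that occurs more than once, together with the number of copies, which is usually enough to trace the duplication back to one of the joined tables.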

Related

How to make postgres (cursor?) start at particular row

I have created the following query:
select t.id, t.row_id, t.content, t.location, t.retweet_count, t.favorite_count, t.happened_at,
a.id, a.screen_name, a.name, a.description, a.followers_count, a.friends_count, a.statuses_count,
c.id, c.code, c.name,
t.parent_id
from tweets t
join accounts a on a.id = t.author_id
left outer join countries c on c.id = t.country_id
where t.row_id > %s
-- order by t.row_id
limit 100
Here %s is a number that starts at 0 and is incremented by 100 after each query. I want to fetch all records from the database this way, just increasing the %s in the WHERE condition. I found this approach at https://ivopereira.net/efficient-pagination-dont-use-offset-limit. I also added a column to my table that corresponds to the row number (I named it row_id). The problem is that the first time I run this query, it returns rows that have a row_id of 3 million. I would like the cursor (not sure if my terminology is correct) to start with the rows with row_id 1 through 100, and so on. The table contains 7 million rows. Am I missing something obvious that would let me achieve this?
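A likely cause is the commented-out ORDER BY: without it, the database is free to return any 100 rows that satisfy row_id > %s. Below is a minimal sketch of this style of pagination, assuming row_id is unique; it uses Python's sqlite3 module instead of PostgreSQL, and the cut-down tweets table and the helper fetch_page are hypothetical. Note that it continues from the last row_id actually seen instead of adding 100, which also copes with gaps in row_id.

```python
import sqlite3

# SQLite stands in for PostgreSQL; this tweets table is a hypothetical stub.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (row_id INTEGER PRIMARY KEY, content TEXT)")
# Insert in descending order so that insertion order differs from row_id order.
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(i, f"tweet {i}") for i in range(25, 0, -1)],
)


def fetch_page(connection, last_row_id, page_size):
    """Return the next page of rows whose row_id is greater than last_row_id."""
    return connection.execute(
        "SELECT row_id, content FROM tweets "
        "WHERE row_id > ? ORDER BY row_id LIMIT ?",
        (last_row_id, page_size),
    ).fetchall()


last_seen = 0
pages = []
while True:
    page = fetch_page(conn, last_seen, 10)
    if not page:
        break
    pages.append([row[0] for row in page])
    last_seen = page[-1][0]  # continue after the last row actually returned

print(pages)
```

With the ORDER BY in place, an index on row_id lets the database jump straight to the starting point of each page.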

Looping SQL query - PostgreSQL

I'm trying to get a query to loop through a set of pre-defined integers.
I've made the query very simple for this question. This is obviously pseudocode as well!
my_id = 0
WHILE my_id < 10
SELECT * from table where id = :my_id
my_id += 1
END
I know that for this query I could just do something like where id < 10, but the actual query I'm running is about 60 lines long, with quite a few window functions that all refer to the variable in question.
It works and gets me the results I want when the variable is set to a single value. I just need to be able to re-run the query 10 times with different values, hopefully ending up with one single set of results.
So far I have this:
CREATE OR REPLACE FUNCTION stay_prices ( a_product_id int ) RETURNS TABLE (
pid int,
pp_price int
) AS $$
DECLARE
nights int;
nights_arr INT[] := ARRAY[1,2,3,4];
j int;
BEGIN
j := 1;
FOREACH nights IN ARRAY nights_arr LOOP
-- query here..
END LOOP;
RETURN;
END;
$$ LANGUAGE plpgsql;
But I'm getting this back:
ERROR: query has no destination for result data
HINT: If you want to discard the results of a SELECT, use PERFORM instead.
So do I need to get my query to SELECT ... INTO the returning table somehow? Or is there something else I can do?
EDIT: this is an example of the actual query I'm running:
\x auto
\set nights 7
WITH x AS (
SELECT
product_id, night,
LAG(night, (:nights - 1)) OVER (
PARTITION BY product_id
ORDER BY night
) AS night_start,
SUM(price_pp_gbp) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS pp_price,
MIN(spaces_available) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_spaces_available,
MIN(period_date_from) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_period_date_from,
MAX(period_date_to) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS max_period_date_to
FROM products_nightlypriceperiod pnpp
WHERE
spaces_available >= 1
AND min_group_size <= 1
AND night >= '2016-01-01'::date
AND night <= '2017-01-01'::date
)
SELECT
product_id as pid,
CASE WHEN x.pp_price > 0 THEN x.pp_price::int ELSE null END as pp_price,
night_start as from_date,
night as to_date,
(night-night_start)+1 as duration,
min_spaces_available as spaces
FROM x
WHERE
night_start = night - (:nights - 1)
AND min_period_date_from = night_start
AND max_period_date_to = night;
That will get me all the N-night periods (N = 7 here, from \set nights 7) available for all my products in 2016, along with the price for the period and the maximum number of spaces I could fill in that period.
I'd like to be able to run this query to get all the periods available between 2 and 30 days for all my products.
This is likely to produce a table with millions of rows. The plan is to re-create this table periodically to enable a very quick look up of what's available for a particular date. The products_nightlypriceperiod represents a night of availability of a product - e.g. Product X has 3 spaces left for Jan 1st 2016, and costs £100 for the night.
Why use a loop? You can do something like this (using your first query):
with params as (
select generate_series(1, 10) as id
)
select t.*
from params cross join
table t
where t.id = params.id;
You can modify params to have the values you really want. Then just use cross join and let the database "do the looping."
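To illustrate letting the database "do the looping", here is a minimal sketch using Python's sqlite3 module. It makes several assumptions: SQLite has no built-in generate_series in every build, so a recursive CTE generates the parameter values instead; the prices table and its values are made up; and the query simply sums the price of the first n nights for each n, which is far simpler than the real 60-line query.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the prices table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (night INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(1, 100), (2, 110), (3, 120), (4, 130), (5, 140)],
)

# params plays the role of generate_series(1, 3): one row per value of n.
# Joining against it evaluates the query once per parameter value,
# so no procedural loop is needed.
totals = conn.execute(
    """
    WITH RECURSIVE params(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM params WHERE n < 3
    )
    SELECT p.n, SUM(pr.price) AS total_price
    FROM params p
    JOIN prices pr ON pr.night <= p.n
    GROUP BY p.n
    ORDER BY p.n
    """
).fetchall()

print(totals)
```

The result is a single set of rows tagged with the parameter value that produced each one, which is the "one single set of results" the question asks for.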

Creating 'Empty' Records for Days of the Month Without Records

I have a very simple Postgres (9.3) query that looks like this:
SELECT a.date, b.status
FROM sis.table_a a
JOIN sis.table_b b ON a.thing_id = b.thing_id
WHERE EXTRACT(MONTH FROM a.date) = 06
AND EXTRACT(YEAR FROM a.date) = 2015
Some days of the month of June do not exist in table_a and thus are obviously not joined to table_b. What is the best way to create records for these not represented days and assign a placeholder (e.g. 'EMPTY') to their 'status' column? Is this even possible to do using pure SQL?
Basically, you need LEFT JOIN and it looks like you also need generate_series() to provide the full set of days:
SELECT d.date
, a.date IS NOT NULL AS a_exists
, COALESCE(b.status, 'status_missing') AS status
FROM (
SELECT date::date
FROM generate_series('2015-06-01'::date
, '2015-06-30'::date
, interval '1 day') date
) d
LEFT JOIN sis.table_a a USING (date)
LEFT JOIN sis.table_b b USING (thing_id)
ORDER BY 1;
Use sargable WHERE conditions. What you had cannot use a plain index on date and has to default to a much more expensive sequential scan. (There are no more WHERE conditions in my final query.)
Aside: don't use the basic type name (and reserved word in standard SQL) date as identifier.
Related (2nd chapter):
PostgreSQL: running count of rows for a query 'by minute'
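The calendar-plus-LEFT JOIN idea can be tried on a small scale. This is a minimal sketch under stated assumptions: it runs on Python's sqlite3 module, so a recursive CTE replaces generate_series(); the date column is named day to avoid using date as an identifier; only five days are generated; and the sample rows are invented.

```python
import sqlite3

# SQLite stands in for PostgreSQL; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (day TEXT, thing_id INTEGER)")
conn.execute("CREATE TABLE table_b (thing_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [("2015-06-01", 1), ("2015-06-03", 2)],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?)",
    [(1, "OPEN"), (2, "CLOSED")],
)

# Build the full list of days first, then LEFT JOIN the data onto it so that
# days without a match survive with NULLs, which COALESCE turns into 'EMPTY'.
rows = conn.execute(
    """
    WITH RECURSIVE days(day) AS (
        SELECT '2015-06-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < '2015-06-05'
    )
    SELECT d.day, COALESCE(b.status, 'EMPTY') AS status
    FROM days d
    LEFT JOIN table_a a ON a.day = d.day
    LEFT JOIN table_b b ON b.thing_id = a.thing_id
    ORDER BY d.day
    """
).fetchall()

for row in rows:
    print(row)
```

The key design choice is that the generated calendar is the leftmost table, so every day appears at least once regardless of what the data tables contain.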

multiple extract() with WHERE clause possible?

So far I have come up with the below:
WHERE (extract(month FROM orderdate)) =
(SELECT min(extract(month from orderdate))
FROM orders)
However, that will consequently return zero to many rows, and in my case, many, because many orders exist within that same earliest (minimum) month, i.e. 4th February, 9th February, 15th Feb, ...
I know that a WHERE clause can contain multiple columns, so why wouldn't the below work?
WHERE (extract(day FROM orderdate)), (extract(month FROM orderdate)) =
(SELECT min(extract(day from orderdate)), min(extract(month FROM orderdate))
FROM orders)
I simply get: SQL Error: ORA-00920: invalid relational operator
Any help would be great, thank you!
Sample data:
02-Feb-2012
14-Feb-2012
22-Dec-2012
09-Feb-2013
18-Jul-2013
01-Jan-2014
Output:
02-Feb-2012
14-Feb-2012
Desired output:
02-Feb-2012
I recreated your table and found out you just messed up the brackets a bit. The following works for me:
where
(extract(day from OrderDate),extract(month from OrderDate))
=
(select
min(extract(day from OrderDate)),
min(extract(month from OrderDate))
from orders
)
Use something like this:
with cte1 as (
select
extract(month from OrderDate) date_month,
extract(day from OrderDate) date_day,
OrderNo
from tablename
), cte2 as (
select min(date_month) min_date_month, min(date_day) min_date_day
from cte1
)
select cte1.*
from cte1
where (date_month, date_day) = (select min_date_month, min_date_day from cte2)
A common table expression lets you restructure your data and then select from the restructured result. The first CTE block (cte1) selects the month and the day for each of your table rows. cte2 then selects min(month) and min(day). The last select combines both CTEs to return all rows from cte1 that have the desired month and day.
There is probably a shorter solution, but I like common table expressions because they are almost always easier to understand than the "optimal, shortest" query.
If that is really what you want, as bizarre as it seems, then as a different approach you could forget the extracts and the subquery against the table to get the minimums, and use an analytic approach instead:
select orderdate
from (
select o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
from orders o
)
where rn = 1;
ORDERDATE
---------
01-JAN-14
The row_number() effectively adds a pseudo-column to every row in your original table, based on the month and day in the order date. The rn values are unique, so there will be one row marked as 1, which will be from the earliest day in the earliest month. If you have multiple orders with the same day/month, say 01-Jan-2013 and 01-Jan-2014, then you'll still only get exactly one with rn = 1, but which is picked is indeterminate. You'd need to add further order by conditions to make it deterministic, but I have no idea what you might want.
That is done in the inner query; the outer query then filters so that only the record marked with rn = 1 is returned, so you get exactly one row back from the overall query.
This also avoids the situation where the earliest day number is not in the earliest month number. Say you only had 02-Jan-2014 and 01-Feb-2014: taking the minimum day and the minimum month separately would look for 01-Jan, which doesn't exist.
SQL Fiddle (with Thomas Tschernich's answer thrown in too, giving the same result for this data).
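The difference between taking the minimum day and minimum month separately and ordering by month and day together can be reproduced with two rows. This is a minimal sketch using Python's sqlite3 module in place of Oracle, so strftime() stands in for extract() and to_char(), and ORDER BY ... LIMIT 1 stands in for row_number(); the two dates are hypothetical.

```python
import sqlite3

# SQLite stands in for Oracle; the two order dates are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderdate TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [("2014-01-02",), ("2014-02-01",)],
)

# Approach 1: minimum day and minimum month taken independently.
# min(day) = 1 and min(month) = 1, but no row is dated 01-Jan.
separate_minima = conn.execute(
    """
    SELECT orderdate
    FROM orders
    WHERE CAST(strftime('%d', orderdate) AS INTEGER) =
          (SELECT MIN(CAST(strftime('%d', orderdate) AS INTEGER)) FROM orders)
      AND CAST(strftime('%m', orderdate) AS INTEGER) =
          (SELECT MIN(CAST(strftime('%m', orderdate) AS INTEGER)) FROM orders)
    """
).fetchall()

# Approach 2: order by month and day together and keep the first row.
# The extra orderdate sort key makes ties deterministic.
earliest = conn.execute(
    """
    SELECT orderdate
    FROM orders
    ORDER BY strftime('%m%d', orderdate), orderdate
    LIMIT 1
    """
).fetchall()

print(separate_minima)
print(earliest)
```

The first query returns no rows for this data, while the second returns the order from 2 January.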
To join the result against your invoice table, you don't need to join to the orders table again - especially not with a cross join, which is skewing your results. You can do the join (at least) two ways:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
) o, invoices i
WHERE i.invno = o.invno
AND rn = 1;
Or:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT orderno, orderdate, invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
)
WHERE rn = 1
) o, invoices i
WHERE i.invno = o.invno;
The first looks like it does more work but the execution plans are the same.
SQL Fiddle with your pastebin-supplied query that gets two rows back, and these two that get one.

Simple SELECT, but adding JOIN returns too many rows

The query below returns 9,817 records. Now I want to SELECT one more field from another table. See the 2 lines that are commented out, where I've simply selected this additional field and added a JOIN to bring in the new column. With these lines added, the query returns 649,200 records and I can't figure out why! I guess something is wrong with my WHERE criteria in conjunction with the JOIN. Please help, thanks.
SELECT DISTINCT dbo.IMPORT_DOCUMENTS.ITEMID, BEGDOC, BATCHID
--, dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.CATEGORY_ID
FROM IMPORT_DOCUMENTS
--JOIN dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS ON
--dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID = dbo.IMPORT_DOCUMENTS.ITEMID
WHERE (BATCHID LIKE 'IC0%' OR BATCHID LIKE 'LP0%')
AND dbo.IMPORT_DOCUMENTS.ITEMID IN
(SELECT dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID FROM
CATEGORY_COLLECTION_CATEGORY_RESULTS
WHERE SCORE >= .7 AND SCORE <= .75 AND CATEGORY_ID IN(
SELECT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
AND Sample_Id > 0)
AND dbo.IMPORT_DOCUMENTS.ITEMID NOT IN
(SELECT ASSIGNMENT_FOLDER_DOCUMENTS.Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
One possible reason is that one of your tables holds data at a finer grain than your join key. For example, there may be multiple records per item id, so the same item id is repeated X number of times and each matching row is multiplied by the join. Without knowing the data, I would fix the query as shown below. Try running the modified query; if the output is not what you're looking for, convert it into a SELECT within a SELECT.
Hope this helps....
Try this SQL:
SELECT DISTINCT a.ITEMID, a.BEGDOC, a.BATCHID, b.CATEGORY_ID
FROM IMPORT_DOCUMENTS a
JOIN (
SELECT DISTINCT ITEMID, CATEGORY_ID
FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
WHERE SCORE >= .7 AND SCORE <= .75
AND CATEGORY_ID IN (SELECT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
AND Sample_Id > 0
) b ON a.ITEMID = b.ITEMID
WHERE (a.BATCHID LIKE 'IC0%' OR a.BATCHID LIKE 'LP0%')
AND a.ITEMID NOT IN (SELECT Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
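One plausible reading of the jump in row count is that the added JOIN is not restricted by the score filter, which only applies inside the IN subquery, so every category row of a qualifying item comes through. The sketch below reproduces that effect with Python's sqlite3 module instead of SQL Server; the tables docs and results and all their values are hypothetical simplifications of the tables in the question.

```python
import sqlite3

# SQLite stands in for SQL Server; docs and results are hypothetical stubs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (itemid INTEGER, batchid TEXT)")
conn.execute(
    "CREATE TABLE results (itemid INTEGER, category_id INTEGER, score REAL)"
)
conn.executemany("INSERT INTO docs VALUES (?, ?)", [(1, "IC001"), (2, "LP001")])
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(1, 10, 0.72), (1, 11, 0.20), (1, 12, 0.95), (2, 10, 0.71), (2, 13, 0.10)],
)

# Original shape: the filter lives only in the IN subquery, one row per item.
in_only = conn.execute(
    """
    SELECT DISTINCT d.itemid
    FROM docs d
    WHERE d.itemid IN
          (SELECT itemid FROM results WHERE score >= 0.7 AND score <= 0.75)
    ORDER BY d.itemid
    """
).fetchall()

# Adding an unfiltered JOIN pulls in every category row of each item.
unfiltered_join = conn.execute(
    """
    SELECT DISTINCT d.itemid, r.category_id
    FROM docs d
    JOIN results r ON r.itemid = d.itemid
    WHERE d.itemid IN
          (SELECT itemid FROM results WHERE score >= 0.7 AND score <= 0.75)
    ORDER BY d.itemid, r.category_id
    """
).fetchall()

# Applying the same filter to the joined rows keeps only matching categories.
filtered_join = conn.execute(
    """
    SELECT DISTINCT d.itemid, r.category_id
    FROM docs d
    JOIN results r
      ON r.itemid = d.itemid AND r.score >= 0.7 AND r.score <= 0.75
    ORDER BY d.itemid, r.category_id
    """
).fetchall()

print(len(in_only), len(unfiltered_join), len(filtered_join))
```

Here the unfiltered join returns five rows for two documents, while the filtered join returns one row per document and qualifying category.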