I have to delete some records that are considered useless.
There is an address file and an order history file. The company sells consumer products, and it gets many product inquiries, or starts of sale, that never become an actual sale.
Each inquiry gets a record in the address file, keyed by customer number. The order history file has the same customer number plus a suffix field, which starts at 000 and increments with each new order; the bulk of the business is in fact a recurring model.
A customer who has only '000' records (there can be multiple 000s) never bought anything, and we wish to purge them from both files.
I am thinking of a simple RPG program, but am also interested in just using SQL, if that is possible, or some other method.
At this stage we would not actually be deleting; we would copy the proposed purge records to an output file, which will be reviewed and also stored in case we need to revert.
F Addressfile   IF E
F OrderHistory  IF E
 * create 2 output file clones, but with altered names for now
F Zaddressfile  O  E
F ZorderHistory O  E
*inlr     doweq  *off
          read   Addressfile              lr
*inlr     ifeq   *off
          move   *off       Flg000
          exsr   Chk000
Flg000    ifeq   *on
          iter
          else
          exsr   Purge
          endif
          endif
enddo
Chk000    begsr
 * Basically: SETLL to a different logical on OrderHistory, then READE
 * for as long as the customer number matches; if any record has a
 * suffix not equal to 000, turn on Flg000 and get out.
          endsr
The Purge subroutine will have to read through the order history again to get the records to purge, using the customer number still available from the address-file read, because I would not be sure what customer value the subroutine has, and I don't want to store it.
Then it would write those records, plus the address-file record, to the new files, and we can iterate to read the next customer in the address file.
Also, we cannot assume that if someone did buy, they have an 001 record; maybe it got deleted over the years.
If we could, I could simply CHAIN on that.
There are all sorts of steps you have to do in RPG. This can be done more simply in SQL in a variety of ways. SQL is adept at processing and analyzing groups of records in an entire file all at once.
CREATE TABLE zaddresses AS
( SELECT *
FROM addressFile
WHERE cust IN (SELECT cust
FROM orderHistory
GROUP BY cust
                 HAVING max(suffix)='000'
)
)
WITH DATA
NOT LOGGED INITIALLY;
CREATE TABLE zorderHst AS
( SELECT *
FROM orderHistory
WHERE cust IN (SELECT cust
FROM zaddresses
)
)
WITH DATA
NOT LOGGED INITIALLY;
There, you've defined each holding table and populated it in one single statement. There is some nested logic, but nonetheless only two statements.
To purge them:
DELETE FROM addressfile
WHERE cust IN (SELECT cust FROM zaddresses);
DELETE FROM orderHistory
WHERE cust IN (SELECT cust FROM zaddresses);
A grand total of four SQL statements. (I won't even ask how many lines you'd have in your RPG program.)
Once you understand SQL, you can think about processing entire files, not just record-by-record instructions. It's much simpler to get things done, and it's almost always faster when done well.
(You may hear arguments about performance under particular circumstances, but most often they simply aren't using SQL as well as they should. If you write poor RPG, it performs badly too. ;-)
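As a sanity check before running the deletes, you could count what the two holding tables captured (a sketch using the table names from the statements above; SYSIBM.SYSDUMMY1 is DB2's built-in one-row dummy table):

```sql
-- How many address records and history rows will the purge touch?
SELECT (SELECT COUNT(*) FROM zaddresses) AS addresses_to_purge,
       (SELECT COUNT(*) FROM zorderHst)  AS history_rows_to_purge
FROM sysibm.sysdummy1;
```

If the counts look wrong, nothing has been deleted yet and you can simply drop the holding tables and start over.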
I would use SQL.
-- Save only the rows to be deleted
CREATE TABLE ZADDRESSFILE AS
(SELECT *
FROM ADDRESSFILE af
WHERE NOT EXISTS
(SELECT 1
           FROM ORDERHISTORY sub
WHERE sub.CUSTNO = af.CUSTNO
AND sub.SUFFIX <> '000' -- (or <> 0 if numeric)
)
)
-- If ZADDRESSFILE exists and you want to add the rows
-- to ZADDRESSFILE instead....
INSERT INTO ZADDRESSFILE
(SELECT *
FROM ADDRESSFILE af
WHERE NOT EXISTS
(SELECT 1
           FROM ORDERHISTORY sub
WHERE sub.CUSTNO = af.CUSTNO
AND sub.SUFFIX <> '000' -- (or <> 0 if numeric)
)
   AND NOT EXISTS
(SELECT 1
FROM ZADDRESSFILE sub
WHERE sub.CUSTNO = af.CUSTNO
)
)
-- Get number of rows to be deleted
SELECT COUNT(*)
FROM ADDRESSFILE af
WHERE NOT EXISTS
(SELECT 1
           FROM ORDERHISTORY sub
           WHERE sub.CUSTNO = af.CUSTNO
             AND sub.SUFFIX <> '000'
)
-- Delete 'em
DELETE
FROM ADDRESSFILE af
WHERE NOT EXISTS
(SELECT 1
           FROM ORDERHISTORY sub
           WHERE sub.CUSTNO = af.CUSTNO
             AND sub.SUFFIX <> '000'
)
I have a database of courses. I need to get a name, a topic, a teacher, a duration, and the number of students registered. I get the first four successfully, but not the last one.
That's the successful part for the first four:
SELECT c.name,
       t.topic_name AS topic,
       u.name || ' ' || u.surname AS teacher,
       ((c.end_date - c.start_date) / 7)::int AS duration
FROM public.course c
RIGHT JOIN public.topic t ON c.topic_id = t.topic_id
RIGHT JOIN public.teacher_course tc ON c.course_id = tc.course_id
RIGHT JOIN public.user u ON tc.teacher_id = u.user_id
WHERE u.role_id = 2;
Basically, to know the number of registered students per course, I only need to count records in the journal table for each course, but when I add
count(j.id_record) AS students_registered
it just breaks and asks me to GROUP BY everything else.
I'm confused about that. How to get this number correctly for each course?
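The complaint is standard SQL behavior: once one aggregate appears in the SELECT list, every other selected column must appear in GROUP BY. A sketch of the usual fix, assuming `journal` has a `course_id` column linking each registration record to its course (the journal is LEFT JOINed so courses with no registrations still show 0):

```sql
SELECT c.name,
       t.topic_name AS topic,
       u.name || ' ' || u.surname AS teacher,
       ((c.end_date - c.start_date) / 7)::int AS duration,
       COUNT(j.id_record) AS students_registered  -- counts only matching journal rows
FROM public.course c
RIGHT JOIN public.topic t ON c.topic_id = t.topic_id
RIGHT JOIN public.teacher_course tc ON c.course_id = tc.course_id
RIGHT JOIN public."user" u ON tc.teacher_id = u.user_id
LEFT JOIN public.journal j ON j.course_id = c.course_id
WHERE u.role_id = 2
GROUP BY c.course_id, c.name, t.topic_name, u.name, u.surname,
         c.end_date, c.start_date;
```

Grouping by `c.course_id` as well keeps two courses that happen to share a name from being merged into one row.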
We have 2 tables renewal_bkp and adhoc_bkp and 1 MV as test_mv1.
I basically want to create a script that will update one row of renewal_bkp and adhoc_bkp and then select the data from the above MV.
This needs to be done in a loop fashion. Below is an example:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
update renewal_bkp set network_status='provisioned' where msisdn='3234561010240';
update adhoc_bkp set status='provisioned' where msisdn='3234561010240';
select * from test_mv1 where msisdn='3234561010240';
...
...
and so on
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
These same statements need to be generated 1000 times with different msisdn numbers.
Can you help me create a script to do so, rather than manually writing down each statement?
Thanks,
Sandeep
Although it is totally unclear how you are accessing the msisdns, here is a compact version that does all three things in one atomic batch, with the help of data-modifying CTEs:
WITH
ids (id) AS (
VALUES ('3234561010240'), ('...'), ...
),
renewals AS (
UPDATE renewal_bkp SET network_status = 'provisioned'
WHERE msisdn IN (SELECT id FROM ids)
),
adhoc AS (
UPDATE adhoc_bkp SET status = 'provisioned'
WHERE msisdn IN (SELECT id FROM ids)
)
SELECT *
FROM test_mv1
WHERE msisdn IN (SELECT id FROM ids)
Instead of the VALUES clause, you can also put a regular SELECT from a dedicated table, which would make sense if you're going to execute this more often than once or twice a year.
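If what you actually want is the generated script itself (the 1000 statement blocks written out to run later), you can let SQL build the text for you. A sketch, assuming the msisdns sit in a table called `msisdn_list` (a name invented here for illustration):

```sql
-- Each output row is one ready-to-run block of three statements.
SELECT 'update renewal_bkp set network_status=''provisioned'' where msisdn=''' || msisdn || ''';' || chr(10)
    || 'update adhoc_bkp set status=''provisioned'' where msisdn=''' || msisdn || ''';'          || chr(10)
    || 'select * from test_mv1 where msisdn=''' || msisdn || ''';'
FROM msisdn_list;
```

Spool the result to a file (for example with psql's `\o`) and you have the script, with no manual writing.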
I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL parent_Id, it is parented to itself).
The question link is here: Creating a recursive CTE with no root record
The answer to that question has been provided, and I now have the data I require; however, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. It looked like this:
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Now, these two scripts are run on the same server; however, the temp table approach yields the results in approximately 15 seconds.
The multiple-CTE approach takes upwards of 5 minutes (so long, in fact, that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth, I believe it is to do with the record counts. The base table has 200k records in it, and from memory CTE performance is severely degraded when dealing with large data sets, but I cannot seem to prove that, so I thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer for this, some further research into the generalities of the subject turned up a number of other threads with similar problems.
This one seems to cover many of the variations between temp tables and CTEs, so it is most useful for people looking to read around the issue:
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs caused the issue: a CTE is not cached anywhere, so recreating it each time it is referenced later has a large impact.
This might not be exactly the same issue you experienced, but I came across a similar one just a few days ago, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear, we are using SQL Server 2008 R2.
The pattern that I identified, and that seems to throw the SQL Server optimizer off the rails, is using temporary tables in CTEs that are joined with other temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
SELECT DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query into the second one, and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across an issue like this one that reminds me it is a Microsoft product after all, but in the end you could say that other database systems have their own quirks.
I can list the partitions with
SELECT
child.relname AS child_schema
FROM pg_inherits
JOIN pg_class child ON pg_inherits.inhrelid = child.oid ;
Is it guaranteed that they are listed in creation order? Because then only an additional LIMIT 1 would be required. Otherwise, this will print the oldest, the one with the lowest number in its name (my partitions are named name_1, name_2, name_3, ...):
SELECT
MIN ( trim(leading 'name_' from child.relname)::int ) AS child_schema
FROM pg_inherits
JOIN pg_class child ON pg_inherits.inhrelid = child.oid ;
Then I need to create a script which uses the result to execute DROP TABLE? Is there no easier way?
Is it guaranteed that they are listed in creation order?
No. It is likely as long as the plan uses sequential scans and no tables have been dropped, but if you change the query and the plan changes, you could get rather unexpected results ordering-wise. I would also expect the ordering to change once free space is re-used.
Your current trim query is the best way. Stick with it.
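For the follow-up about executing the DROP TABLE: an anonymous DO block can combine the MIN() lookup above with a dynamic drop, so no external script is needed. A sketch, reusing your trim expression and assuming the name_N naming scheme (`format('%I')` safely quotes the identifier):

```sql
DO $$
DECLARE
    oldest text;
BEGIN
    -- Find the partition with the lowest number in its name.
    SELECT 'name_' || MIN(trim(leading 'name_' from child.relname)::int)
    INTO oldest
    FROM pg_inherits
    JOIN pg_class child ON pg_inherits.inhrelid = child.oid;

    -- EXECUTE is required to run dynamically built DDL.
    EXECUTE format('DROP TABLE %I', oldest);
END $$;
```

If other tables in the database also use inheritance, you would want to filter pg_inherits on `inhparent` as well so only your partitions are considered.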
I know this might be redundant, but I have had the same query running for almost 3 days, and before I kill it I would like to get a community sanity check.
DELETE
FROM mytble
WHERE ogc_fid NOT IN
(SELECT MAX(dup.ogc_fid)
FROM mytble As dup
GROUP BY dup.id)
mytble is the name of the table, ogc_fid is the name of the unique id field, and id is the name of the field that I want to be the unique id. There are 41 million records in the table, and the indexes are built and everything, so I am still a bit concerned about why it's taking so long to complete. Any thoughts on this?
If I understood correctly, you want to delete all the records for which a record with the same id
(but with a higher ogc_fid) exists, and keep only those with the highest ogc_fid.
-- DELETE -- uncomment this line and comment the next line if proven innocent.
SELECT COUNT(*)
FROM mytble mt
WHERE EXISTS (
SELECT *
FROM mytble nx
    WHERE nx.id = mt.id             -- there exists a row with the same id
    AND nx.ogc_fid > mt.ogc_fid     -- ... but with a higher ogc_fid
);
With an index on id (and maybe on ogc_fid) this should run in maybe a few minutes for 41M records.
UPDATE: if no indexes exist, you could speed up the above queries by first creating an index:
CREATE UNIQUE INDEX sinterklaas ON mytble (id, ogc_fid);
It would be nice if you provided EXPLAIN output, but what you're doing might be faster when done like this (again, I'd look at the EXPLAIN):
DELETE FROM mytble d
USING mytble m
      LEFT JOIN (SELECT max(ogc_fid) AS f FROM mytble GROUP BY id) AS q ON m.ogc_fid = q.f
WHERE d.ogc_fid = m.ogc_fid AND q.f IS NULL;
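To get the EXPLAIN output the answer asks for without waiting another three days, you can plan the original statement without executing it. Plain EXPLAIN (without ANALYZE) only shows the estimated plan and never runs the DELETE:

```sql
-- Shows the estimated plan for the 3-day query; nothing is deleted.
EXPLAIN
DELETE FROM mytble
WHERE ogc_fid NOT IN
      (SELECT MAX(dup.ogc_fid)
       FROM mytble AS dup
       GROUP BY dup.id);
```

If the plan shows a sequential scan inside the subquery, that is a strong hint the missing or unused index is the problem.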