I have thousands of lines of duplicate data in a PostgreSQL database. To find out which rows are duplicated, I am using this query:
SELECT "Date" FROM stockdata
group by "Date"
having count("Date")>1
This again produced thousands of lines of dates that have more than one entry. How can I remove the duplicate rows so that just one entry per date remains?
P.S. I cannot use a primary key when entering data.
Update
As per the comments: there is no primary key. Also, the Date should be unique, so there should never be two or more rows with the same date.
The df looks like this:
Date High Low Open Close Volume Adj Close
0 2017-04-03 893.489990 885.419983 888.000000 891.510010 3422300 891.510010
1 2017-04-04 908.539978 890.280029 891.500000 906.830017 4984700 906.830017
2 2017-04-05 923.719971 905.619995 910.820007 909.280029 7508400 909.280029
3 2017-04-06 917.190002 894.489990 913.799988 898.280029 6344100 898.280029
4 2017-04-07 900.090027 889.309998 899.650024 894.880005 3710900 894.880005
... ... ... ... ... ... ... ...
12595 2022-03-28 1097.880005 1053.599976 1065.099976 1091.839966 34168700 1091.839966
12596 2022-03-29 1114.770020 1073.109985 1107.989990 1099.569946 24538300 1099.569946
12597 2022-03-30 1113.949951 1084.000000 1091.170044 1093.989990 19955000 1093.989990
12598 2022-03-31 1103.140015 1076.640015 1094.569946 1077.599976 16265600 1077.599976
12599 2022-04-01 1094.750000 1066.640015 1081.150024 1076.352783 11449987 1076.352783
12600 rows × 7 columns
The data is repeated a few times in places. However, rows with the same date will have the same data.
This is not actually stock data (I am using it as a troubleshooting example) but data from a Yokogawa data logger. https://www.yokogawa.com/in/solutions/products-platforms/data-acquisition/data-logger/#Overview
There are redundancies in the system, and the earlier integrator had just dumped all the data into one database, so whenever a redundant logger comes online the database gets multiple entries. I need to remove the duplicates so we can actually use the data. I don't have access to their software.
Further Update:
Using this code, as suggested in the comments:
delete from stockdata s
using
  (SELECT "Date", max(ctid) as max_ctid from stockdata group by "Date") t
where s.ctid <> t.max_ctid
  and s."Date" = t."Date";
It was able to do the job, but going forward, is this a dangerous solution for production?
This should do the trick:
DELETE FROM
    stockdata a
USING stockdata b
WHERE
    a.ctid < b.ctid
    AND a."Date" = b."Date";
(Since the table has no primary key, the system column ctid is used here to tell duplicate rows apart, and "Date" needs double quotes because it was created as a quoted identifier.)
But be careful: this will immediately delete all duplicates, and there is no way to restore them.
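If you are worried about running this in production: a more cautious pattern is to run the delete inside a transaction, verify the result, and only then commit; a unique index then keeps the duplicates from coming back. A minimal sketch, assuming the stockdata table above (the index name is made up):
BEGIN;

DELETE FROM stockdata s
USING (
    SELECT "Date", max(ctid) AS max_ctid
    FROM stockdata
    GROUP BY "Date"
) t
WHERE s.ctid <> t.max_ctid
  AND s."Date" = t."Date";

-- Verify before committing: this should now return zero rows.
SELECT "Date" FROM stockdata GROUP BY "Date" HAVING count(*) > 1;

COMMIT; -- or ROLLBACK; if the check above still shows duplicates

-- Keep the problem from recurring (index name is hypothetical):
CREATE UNIQUE INDEX stockdata_date_uniq ON stockdata ("Date");
Note that ctid values are only stable within a transaction, so avoid running this while the loggers are actively writing.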
Related
I have data streaming through Kafka into Druid. It's de-normalized eCommerce order event data, where the status and a few other fields get updated with every event.
I need to run aggregate queries based on timestamp using only the most recent entry per order.
For example: If data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-07T02:03:33Z"}
Now, if I want to query all orders stuck in "Initiated" status for more than 2 days, then for the above data it should only return orderId "abc".
But if I query something like
SELECT orderId, qty, paymentId FROM order WHERE status = 'Initiated' AND "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
This query returns both orders "123" and "abc", but "123" received another event later, so its earlier events should not be included in the result.
Is there any good, optimized way to handle this kind of scenario in Apache Druid?
One way I was considering is to use a separate lookup table that stores each orderId and its latest status, and to join that lookup with the aggregation query above on orderId and status.
EDIT 1:
This query works, but it joins over the whole table, which can hit resource limit exceptions for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
Tried with:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done more easily. If you map the status to a numeric representation, it becomes much simpler to work with.
First, use a lookup to translate the status. See this page for how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now, we have for example these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
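For illustration, the core of such a lookup is just a map from status string to number. A sketch of the map part of the definition, based on the lookups documentation linked above (the surrounding registration payload with tier and version is omitted here):
{
  "type": "map",
  "map": {
    "Initiated": "1",
    "Shipped": "2"
  }
}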
Since we now have a numeric representation, you can simply do a GROUP BY query and take the max status number. A query like this should be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING clause makes sure that you only see orders whose latest status is Initiated.
I hope this solves your problem.
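If the planner error from EDIT 2 persists, one possible cause is the alias reference in HAVING. A hedged variant repeats the aggregate expression instead and compares against the string '1', since LOOKUP returns strings:
SELECT
  inline_data.orderId,
  MAX(LOOKUP(status, 'status_as_number')) AS last_status
FROM inline_data
WHERE inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING MAX(LOOKUP(status, 'status_as_number')) = '1'
This is untested against your cluster; if string comparison is a concern, casting the looked-up value to a number first is another option.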
I have a table with the below records:
Sno  A
---  ----------
1    spoo74399p
2    spoo75399p
I want to update the above records by removing the 'oo' (the letter 'o' twice) that comes right after 'sp'.
Required OUTPUT
----------------
Sno A
1 sp74399p
2 sp75399p
UPDATE my_table
SET A = REPLACE(A, 'spoo', 'sp');
This would update all the records.
To clarify a bit, this would find any instances of "spoo" and replace them with "sp". The end result is that any "oo" right after "sp" would be deleted.
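Note that REPLACE changes 'spoo' anywhere in the string, not just at the start. If you only want to touch values that begin with 'spoo', a sketch like this restricts the update (assuming Oracle-style SUBSTR and ||; adjust for your database):
UPDATE my_table
SET A = 'sp' || SUBSTR(A, 5)   -- keep everything from the 5th character on
WHERE A LIKE 'spoo%';          -- only rows that start with 'spoo'
For 'spoo74399p' this yields 'sp' || '74399p' = 'sp74399p'.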
I'm trying to solve this problem:
I have a query/view that joins ~10 tables to extract some fields for a report (if any). The query doesn't use any grouping functions, only joins, and trims off some unneeded data.
I have to take this one big view, group by the first column, take the max of a date in the second column, and pull all the other fields from the record that holds that max value.
I have not been able to do this in Postgres.
As pseudo-code, I can give this:
select 1
, max(2)
, 3 referred to the record from max(2)
, 4 referred to the record from max(2)
, ...
, 20 referred to the record from max(2)
from (ViewWithAllJoins) a
group by 1
For privacy and business reasons I had to obfuscate some information; 1/2/3/4... stand for the column names of the view "ViewWithAllJoins". I hope the problem is still understandable and solvable!
I've tried the WINDOW clause as shown in "Convert keep dense_rank from Oracle query into postgres", but I could not make it work with the grouping I need. I also tried dense_rank as shown in "Dense_rank first Oracle to Postgresql convert", but I can't make any assumptions about the ordering of the data in any field other than 1 and 2, so I can't use aggregate functions on them.
Any ideas? Preferably without adding too many subqueries.
Thank you!
EDIT:
As suggested I'll add some synthetic data to better understand the problem and what I want.
Start:
ID DATE COLUMN1 COLUMN2 COLUMN3
=====================================================================
88888888;"2016-04-02 09:00:00";"aaaaaaaaaaa";"TEXT89" ; 999999999
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
88888888;"2017-11-09 09:00:00";"zzzz" ;"TEXT80000" ; 850580582
75858585;"2017-01-31 09:00:00";"~~~~~~~~~~~";"TEXT10" ; 101010101
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
What I want:
ID DATE COLUMN1 COLUMN2 COLUMN3
===================================================================
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
So: group by the ID, take the max of the date, and keep the other fields from the row with that max date.
-- So you have duplicate records per ID, and for every ID you want to select the record with the most recent date ?
Use NOT EXISTS:
SELECT id,zdate,column1,column2,column3 -- , ...
FROM queryview t
WHERE NOT EXISTS (
SELECT *
FROM queryview x
WHERE x.id=t.id
AND x.zdate > t.zdate
);
Or, use row_number() over a window, and pick only the row with the final date:
SELECT id,zdate,column1,column2,column3 -- , ...
FROM ( SELECT *
, row_number() OVER(PARTITION BY id ORDER BY zdate DESC) AS rn
FROM queryview
) q
WHERE q.rn = 1
;
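In PostgreSQL specifically, DISTINCT ON is a common shorthand for this greatest-row-per-group pattern. A minimal sketch against the same queryview:
SELECT DISTINCT ON (id)
       id, zdate, column1, column2, column3 -- , ...
FROM queryview
ORDER BY id, zdate DESC;
DISTINCT ON (id) keeps the first row per id according to the ORDER BY, i.e. the row with the most recent zdate.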
I have a table from which I need to pull the latest date across different categories, and the dates are not always filled in. I have tried MAX, MIN, etc.; it has not worked.
e.g.
ID    1st Game Date  2nd Game Date  3rd Game Date
Joe   6/1/16         missing        missing
Anna  missing        7/2/16         7/6/16
Rita  missing        7/31/16        missing
Needs to Return:
ID Date
Joe 6/1/16
Anna 7/6/16
Rita 7/31/16
I do have this SQL that works well, but it requires that all the dates are filled in; otherwise it doesn't return the latest date:
ApptDate: Switch([Pt1stApptDate]>=[2ndApptDate] And [Pt1stApptDate]>=
[3rdApptDate],[Pt1stApptDate],[2ndApptDate]>=[Pt1stApptDate] And [2ndApptDate]>=
[3rdApptDate],[2ndApptDate],[3rdApptDate]>=[Pt1stApptDate] And [3rdApptDate]>=
[2ndApptDate],[3rdApptDate])
Much appreciation in advance for all your help
Use the Nz function:
ApptDate: Switch(Nz([Pt1stApptDate],0)>=Nz([2ndApptDate],0) And
Nz([Pt1stApptDate],0)>= Nz([3rdApptDate],0), Nz([Pt1stApptDate],0),
Nz([2ndApptDate],0)>=Nz([Pt1stApptDate],0) And Nz([2ndApptDate],0)>=
Nz([3rdApptDate],0),Nz([2ndApptDate],0),
Nz([3rdApptDate],0)>=Nz([Pt1stApptDate],0) And Nz([3rdApptDate],0)>=
Nz([2ndApptDate],0),Nz([3rdApptDate],0))
Having said that, your table design is incorrect.
You should be storing each ApptDate per ID in a separate row:
ApptID ID ApptDate ApptNr
1 Joe 6/1/2016 1
2 Anna 7/2/2016 2
3 Anna 7/6/2016 3
4 Rita 7/31/2016 2
where ApptID is an autonumber and ApptNr is a sequence per ID (what you seem to call a category).
When you are having problems writing what should be simple queries (SQL DML), you should consider that you may have design flaws (in your SQL DDL).
The missing values force you away from the MAX set function and compel you to handle nulls in queries (note: the Nz() function will cause errors outside of the Access UI). It is better to model missing data by simply not adding a row to the table. Think about it: you want the smallest amount of data possible in your database, and you can infer the remainder, e.g. if Joe was not gaming on 1 Jan, 2 Jan, 3 Jan, 4 Jan, and so on, then simply don't add anything to your database for those dates.
The following SQL DDL requires ANSI-92 Query Mode (but you can create the same tables/views using the Access GUI tools):
CREATE TABLE Attendance
( gamer_name VARCHAR( 35 ) NOT NULL REFERENCES Gamers ( gamer_name ),
  game_sequence INTEGER NOT NULL CHECK ( game_sequence BETWEEN 1 AND 3 ),
  game_date DATETIME NOT NULL,
  UNIQUE ( game_date, game_sequence ) );
INSERT INTO Attendance VALUES ( 'Joe', 1, '2016-06-01' );
INSERT INTO Attendance VALUES ( 'Anna', 2, '2016-07-02' );
INSERT INTO Attendance VALUES ( 'Anna', 3, '2016-07-06' );
INSERT INTO Attendance VALUES ( 'Rita', 1, '2016-07-31' );
CREATE VIEW MostRecentAttendance
AS
SELECT gamer_name, MAX ( game_date ) AS game_date
FROM Attendance
GROUP
BY gamer_name;
SELECT *
FROM Attendance a
WHERE EXISTS ( SELECT *
FROM MostRecentAttendance r
WHERE r.gamer_name = a.gamer_name
AND r.game_date = a.game_date );
To find the missing sequence values for players, create a table of all possible sequence numbers { 1, 2, 3 } to which you can 'anti-join' (e.g. NOT EXISTS), as sketched below.
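A sketch of that anti-join, assuming a helper table named Sequences (the table and column names are made up):
CREATE TABLE Sequences ( game_sequence INTEGER NOT NULL PRIMARY KEY );
INSERT INTO Sequences VALUES ( 1 );
INSERT INTO Sequences VALUES ( 2 );
INSERT INTO Sequences VALUES ( 3 );

SELECT g.gamer_name, s.game_sequence
FROM Gamers g, Sequences s
WHERE NOT EXISTS ( SELECT *
                   FROM Attendance a
                   WHERE a.gamer_name = g.gamer_name
                   AND a.game_sequence = s.game_sequence );
This lists every (gamer, sequence) pair that has no attendance row.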
Given a table of the following format in MATLAB:
userid | itemid | keywords
A = [ 3 10 'book'
3 10 'briefcase'
3 10 'boat'
12 20 'windows'
12 20 'picture'
12 35 'love'
4 10 'day'
12 10 'working day'
... ... ... ];
where A is a table of size 58000×3. I want to write the data to a csv file with the following format:
csv.file
itemid keywords
10 book, briefcase, boat, day, working day, ...
20 windows, picture, ...
35 love, ...
where the list of itemids is stored in Iids = [10,20,35,...]
I would like to avoid using loops for this because, as you can imagine, the matrix is large. Any idea is appreciated.
I wasn't able to think of a solution without loops, but you can optimize your loop by:
using logical indexing
running the loop only M times (where M is the number of unique itemid elements) instead of N times (where N is the number of rows in your table).
The solution I come up with is this.
First of all, create your table
A=table([3;3;3;12;12;12;4;12], [10;10;10;20;20;35;10;10],{'book','briefcase','boat','windows','picture','love','day','working day'}','VariableNames',{'userid','itemid','keywords'});
Select the unique values for column itemid (your Iids):
Iids=unique(A.itemid);
Create a new, empty, table which will contain the results:
NewTable=table();
And now the minimal loop I've come up with:
for id=Iids'
% select rows with given itemid value
RowsWithGivenId=A(A.itemid==id,:);
% create new row in NewTable with the id and the (joined together) keywords from the selected rows
NewTable=[NewTable; table(id,{strjoin(RowsWithGivenId.keywords,', ')})];
end
Also, set the new column names on NewTable:
NewTable.Properties.VariableNames = {'itemid','keywords'};
NewTable now contains one row per itemid, with that item's keywords joined together.
Please note: because the keywords in the new table are themselves separated by commas, a csv file is not the format I recommend. If you export it with writetable() as writetable(NewTable,'myfile.csv'); the commas inside the keywords column collide with the commas that separate the csv fields. If you instead use ';' as the separating character in strjoin(), you'll get a nicer format.
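A sketch of that variant (the same construction as above, only the strjoin() separator changed, plus the export):
% Rebuild NewTable, joining keywords with '; ' so they don't collide
% with the comma that separates csv columns:
NewTable = table();
for id = Iids'
    RowsWithGivenId = A(A.itemid == id, :);
    NewTable = [NewTable; table(id, {strjoin(RowsWithGivenId.keywords, '; ')})];
end
NewTable.Properties.VariableNames = {'itemid','keywords'};
writetable(NewTable, 'myfile.csv');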