I have a table into which a file is imported daily. Unfortunately, the table was created without a unique constraint, so I have to find cases where the same records may have been imported two days in a row.
So, I want to write a query that will tell me when records with a particular "header date" were imported more than once (the header date should be unique each day). The field I'm using for import date is a datetime "dataDate" field. My header date field is called "headerDate" and is a datetime field as well, and my table is tblCases. Any help is appreciated. Thanks!
This will give you the dates with more than one row:
SELECT headerDate
FROM tblCases
GROUP BY headerDate
HAVING COUNT(*) > 1
Of course this is extendable to give you the rows that are duplicated in the table for whatever columns you specify:
SELECT headerDate, col1, col2, col3, ...
FROM tblCases
GROUP BY headerDate, col1, col2, col3, ...
HAVING COUNT(*) > 1
If you want the detail rows, including the dataDate, then:
SELECT *
FROM tblCases
WHERE headerDate IN
(
SELECT headerDate
FROM tblCases
GROUP BY headerDate
HAVING COUNT(*) > 1
)
ORDER BY headerDate, dataDate
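If your database supports window functions, here is a sketch (not part of the original answer) of an equivalent single-pass query over tblCases:
-- count the rows per headerDate alongside each row, then keep only the duplicated ones
SELECT *
FROM
(
SELECT t.*, COUNT(*) OVER (PARTITION BY headerDate) AS rows_per_header
FROM tblCases t
) x
WHERE rows_per_header > 1
ORDER BY headerDate, dataDate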
Related
I need to upload multiple Excel files to a PostgreSQL table, but they can overlap each other in several records, so I need to be aware of IntegrityErrors. I'm following two approaches:
cursor.copy_from: the fastest approach, but I don't know how to catch and control all the IntegrityErrors caused by duplicate records
streamCSV = StringIO()
streamCSV.write(invoicing_info.to_csv(index=None, header=None, sep=';'))
streamCSV.seek(0)

with conn.cursor() as c:
    c.copy_from(streamCSV, "staging.table_name", columns=dataframe.columns, sep=';')
    conn.commit()
cursor.execute: I can count and handle each exception, but it is very slow.
data = invoicing_info.to_dict(orient='records')

with cursor as c:
    for entry in data:
        try:
            c.execute(DLL_INSERT, entry)
            successful_inserts += 1
            connection.commit()
            print('Successful insert. Operation number {}'.format(successful_inserts))
        except psycopg2.IntegrityError as duplicate:
            duplicate_registers += 1
            connection.rollback()
            print('Duplicate entry. Operation number {}'.format(duplicate_registers))
At the end of the routine, I need to determine the following info:
print("Initial shape: {}".format(invoicing_info.shape))
print("Successful inserts: {}".format(successful_inserts))
print("Duplicate entries: {}".format(duplicate_registers))
How can I modify the first approach to control all exceptions? How can I optimize the second approach?
Since you have duplicate IDs across different Excel sheets, you first have to decide which sheet's data to trust when they conflict.
As long as you load each sheet into its own table, and are happy to keep at least one row from each conflicting pair, you can always do the following:
create a temporary table for each Excel sheet
upload the data for each Excel sheet into its temporary table (in bulk, as you do now)
insert from a SELECT DISTINCT ON (id), along these lines:
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, col1, col2
FROM
(
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data
With such an insert, PostgreSQL will keep an arbitrary row out of each set of rows sharing an id.
If you would rather trust a particular sheet (say, the first one), you can add an ordering column:
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, col1, col2
FROM
(
SELECT id, 1 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, 2 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, 3 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data
ORDER BY id, ordering_column -- with DISTINCT ON, the ORDER BY must start with id; ordering_column then breaks ties
For the initial count of rows:
-- UNION ALL (not UNION), so that sheets with identical counts are not collapsed into one row
SELECT sum(count)
FROM
(
SELECT count(*) as count FROM temp_table_for_excel_sheet1
UNION ALL
SELECT count(*) as count FROM temp_table_for_excel_sheet2
UNION ALL
SELECT count(*) as count FROM temp_table_for_excel_sheet3
) as data
After finishing these bulk inserts, you can run select count(*) FROM staging.table_name to get the total number of inserted records.
For the duplicate count, you can run:
SELECT sum(count)
FROM
(
SELECT count(*) as count
FROM temp_table_for_excel_sheet2 WHERE id in (select id FROM temp_table_for_excel_sheet1)
UNION ALL
SELECT count(*) as count
FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet1)
UNION ALL
SELECT count(*) as count
FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet2)
) as data
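If you want to drive those steps from Python rather than psql, here is a rough, untested sketch with psycopg2; the connection string, the df_sheet1/df_sheet2 DataFrames, and the id/col1/col2 column names are placeholders for your real data:
import io
import psycopg2

# Placeholder connection; df_sheet1 / df_sheet2 are assumed pandas DataFrames,
# one per Excel sheet, already read into memory.
conn = psycopg2.connect("dbname=mydb user=me")

with conn, conn.cursor() as cur:
    for temp_name, frame in [("temp_sheet1", df_sheet1), ("temp_sheet2", df_sheet2)]:
        # temp table with the same columns as the target table
        cur.execute("CREATE TEMP TABLE {} (LIKE staging.table_name)".format(temp_name))
        buf = io.StringIO(frame.to_csv(index=None, header=None, sep=';'))
        cur.copy_from(buf, temp_name, sep=';', columns=list(frame.columns))

    # keep one row per id across the sheets
    cur.execute("""
        INSERT INTO staging.table_name (id, col1, col2)
        SELECT DISTINCT ON (id) id, col1, col2
        FROM (
            SELECT id, col1, col2 FROM temp_sheet1
            UNION
            SELECT id, col1, col2 FROM temp_sheet2
        ) AS data
    """)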
If the Excel sheets contain duplicate records, Pandas seems a likely choice for identifying and eliminating dupes: https://33sticks.com/python-for-business-identifying-duplicate-data/. Or is the issue that different records in different sheets have the same id/index? If so, a similar approach could work: use Pandas to isolate the ids used multiple times and then correct them with unique identifiers before attempting to upload to the SQL db.
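For illustration, a minimal Pandas sketch along those lines; the file paths and the "id" column name are assumptions:
import pandas as pd

# read each sheet, stack them, and look at the id column
frames = [pd.read_excel(path) for path in ["sheet1.xlsx", "sheet2.xlsx"]]
combined = pd.concat(frames, ignore_index=True)

dupes = combined[combined.duplicated(subset="id", keep=False)]   # every row whose id repeats
deduped = combined.drop_duplicates(subset="id", keep="first")    # keep the first occurrence

print("Initial shape: {}".format(combined.shape))
print("Duplicate entries: {}".format(len(combined) - len(deduped)))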
For a bulk upload, I'd use an ORM. SQLAlchemy has some great info on bulk uploads: http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#bulk-operations, and there's a related discussion here: Bulk insert with SQLAlchemy ORM
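A small, illustrative sketch of what a SQLAlchemy bulk insert can look like; the Invoice model, its columns, and the connection string are made up, so map them onto your real staging table:
from sqlalchemy import create_engine, Column, Integer, Numeric
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Invoice(Base):
    __tablename__ = "table_name"
    __table_args__ = {"schema": "staging"}
    id = Column(Integer, primary_key=True)
    amount = Column(Numeric)

engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")  # placeholder DSN
Session = sessionmaker(bind=engine)
session = Session()

# in practice this would be invoicing_info.to_dict(orient='records'), as in the question
data = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 20.0}]
session.bulk_insert_mappings(Invoice, data)
session.commit()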
I have dates in yyyymmdd format in an int column and I would like to group by month, i.e. yyyymm. I've tried the two versions below:
select to_char(to_timestamp(create_dt),'YYYYMM'),count(*) from table_name
group by to_char(to_timestamp(create_dt),'YYYYMM')
order by to_char(to_timestamp(create_dt),'YYYYMM') desc
AND
select to_char(create_dt,'YYYYMM'),count(*) from table_name
group by to_char(create_dt,'YYYYMM')
order by to_char(create_dt,'YYYYMM') desc
-- integer division by 100 turns yyyymmdd into yyyymm
select create_dt / 100, count(*)
from t
group by 1
order by 1 desc
limit 6
Figured it out; any alternate ways would be helpful.
select substring(create_dt::int8,1,6),count(*) from table
group by substring(create_dt::int8,1,6)
order by substring(create_dt::int8,1,6) desc
limit 6;
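For completeness, a variant (a sketch, assuming PostgreSQL) that converts the int to a real date first; the to_timestamp(create_dt) attempt fails because it treats the int as a Unix epoch:
-- cast the yyyymmdd int to text, parse it as a date, then format as yyyymm
select to_char(to_date(create_dt::text, 'YYYYMMDD'), 'YYYYMM'), count(*)
from table_name
group by 1
order by 1 desc
limit 6;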
I have a table named table_1 with 4 columns: id, text, fromDate, toDate. The table represents work experience. I want to create a function which will return the row (columns id, text) for the job the employee worked most recently, i.e. the row whose toDate is closest to today.
Here is a demonstration of my code:
Select (abs("toDate"-now())) as date_diff
from table_1
Select id,text
from table_1
where (abs("toDate"-now()))=select min(date_diff)
Is this correct or is there something better I can do?
I would try something like this:
Select id,text
from table_1
where "toDate" = ( select max ("toDate") from table_1 )
It will return the row with the latest "toDate" value.
Try this:
select * from table_1
order by "toDate" desc
limit 1
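If you really do need it wrapped in a function, here is a minimal sketch assuming PostgreSQL; the function name and the integer type of id are guesses, so adjust them to your schema:
-- returns the single most recent experience row
CREATE OR REPLACE FUNCTION most_recent_experience()
RETURNS TABLE (id integer, text text) AS
$$
    SELECT t.id, t.text
    FROM table_1 t
    ORDER BY t."toDate" DESC
    LIMIT 1;
$$ LANGUAGE sql;

SELECT * FROM most_recent_experience();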
I prepared the following report. It displays three columns (AccruedInterest, Tip & CardRevenue) by month for a given year. Now I want to "rotate" the result so that the StartDate values turn into 12 columns.
I have this:
I need this:
I have tried pivoting the table but I need multiple columns to be aggregated as you see.
You have to unpivot your data and then pivot.
Note: I put my own values into your table so as to make each row unique, so you can verify the data is correct.
--Create YourTable
SELECT * INTO YourTable
FROM
(
SELECT CAST('2015-01-01' AS DATE) StartDate,
607.834 AS AccruedInterest,
1 AS Tip,
3 AS CardRevenue
UNION ALL
SELECT CAST('2015-02-01' AS DATE) StartDate,
643.298 AS AccruedInterest,
16.8325 AS Tip,
5 AS CardRevenue
) A;
GO
--This pivots your data
SELECT *
FROM
(
--This unpivots your data using cross apply
SELECT col,val,StartDate
FROM YourTable
CROSS APPLY
(
SELECT 'AccruedInterest', CAST(AccruedInterest AS VARCHAR(100))
UNION ALL
SELECT 'Tip', CAST(Tip AS VARCHAR(100))
UNION ALL
SELECT 'CardRevenue', CAST(CardRevenue AS VARCHAR(100))
) A(col,val)
) B
PIVOT
(
MAX(val) FOR startdate IN([2015-01-01],[2015-02-01])
) pvt
Results:
col 2015-01-01 2015-02-01
AccruedInterest 607.834 643.298
CardRevenue 3 5
Tip 1.0000 16.8325
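The IN list above is hardcoded to the two sample dates; for a full year you would list all twelve StartDate values, or build the list dynamically. A rough, untested sketch of the dynamic variant against the same YourTable (not part of the original answer):
--Build the pivot column list from the StartDate values actually present
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(CONVERT(VARCHAR(10), StartDate, 120))
    FROM YourTable
    GROUP BY StartDate
    ORDER BY StartDate
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'
SELECT *
FROM
(
    SELECT col, val, StartDate
    FROM YourTable
    CROSS APPLY
    (
        SELECT ''AccruedInterest'', CAST(AccruedInterest AS VARCHAR(100))
        UNION ALL
        SELECT ''Tip'', CAST(Tip AS VARCHAR(100))
        UNION ALL
        SELECT ''CardRevenue'', CAST(CardRevenue AS VARCHAR(100))
    ) A(col, val)
) B
PIVOT ( MAX(val) FOR StartDate IN (' + @cols + ') ) pvt;';

EXEC sp_executesql @sql;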
First time posting here; I'm a newbie to SQL, and I'm not exactly sure how to word this, but I'll try my best.
I have a query:
select report_month, employee_id, split_bonus,sum(salary) FROM empsal
where report_month IN('2010-12-01','2010-11-01','2010-07-01','2010-04-01','2010-09-01','2010-10-01','2010-08-01')
AND employee_id IN('100','101','102','103','104','105','106','107')
group by report_month, employee_id, split_bonus;
Now, to the result of this query, I want to add a new column split_bonus_cumulative that is essentially what you'd get by adding sum(split_bonus) to the select clause, except that for this column the GROUP BY should only have report_month and employee_id.
Can anyone show me how to do this with a single query? Thanks in advance.
Try:
SELECT
report_month,
employee_id,
SUM(split_bonus),
SUM(salary)
FROM
empsal
WHERE
report_month IN('2010-12-01','2010-11-01','2010-07-01','2010-04-01','2010-09-01','2010-10-01','2010-08-01')
AND
employee_id IN('100','101','102','103','104','105','106','107')
GROUP BY
report_month,
employee_id;
Assuming you're using Postgres, you might also find window functions useful:
http://www.postgresql.org/docs/9.0/static/tutorial-window.html
Unless I'm mistaken, you want something that resembles the following:
select report_month, employee_id, salary, split_bonus,
sum(salary) over w as sum_salary,
sum(split_bonus) over w as sum_bonus
from empsal
where ...
window w as (partition by employee_id);
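For instance, a sketch (untested) that keeps your original GROUP BY and layers a window aggregate on top of it, so split_bonus_cumulative is totalled per report_month and employee_id:
select report_month, employee_id, split_bonus,
       sum(salary) as salary,
       -- window function applied over the grouped result
       sum(sum(split_bonus)) over (partition by report_month, employee_id) as split_bonus_cumulative
from empsal
where report_month IN ('2010-12-01','2010-11-01','2010-07-01','2010-04-01','2010-09-01','2010-10-01','2010-08-01')
  and employee_id IN ('100','101','102','103','104','105','106','107')
group by report_month, employee_id, split_bonus;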
CTEs are also convenient:
http://www.postgresql.org/docs/9.0/static/queries-with.html
WITH
rows as (
SELECT foo.*
FROM foo
WHERE ...
),
report1 as (
SELECT aggregates
FROM rows
WHERE ...
),
report2 as (
SELECT aggregates
FROM rows
WHERE ...
)
SELECT *
FROM report1, report2, ...
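For illustration, the same CTE shape applied to your empsal query might look like this (a sketch, not tested against your schema):
WITH
filtered as (
    SELECT report_month, employee_id, split_bonus, salary
    FROM empsal
    WHERE report_month IN ('2010-12-01','2010-11-01','2010-07-01','2010-04-01','2010-09-01','2010-10-01','2010-08-01')
      AND employee_id IN ('100','101','102','103','104','105','106','107')
),
per_bonus as (
    -- your original grouping
    SELECT report_month, employee_id, split_bonus, SUM(salary) as salary
    FROM filtered
    GROUP BY report_month, employee_id, split_bonus
),
per_employee as (
    -- the coarser grouping for the cumulative column
    SELECT report_month, employee_id, SUM(split_bonus) as split_bonus_cumulative
    FROM filtered
    GROUP BY report_month, employee_id
)
SELECT b.report_month, b.employee_id, b.split_bonus, b.salary, e.split_bonus_cumulative
FROM per_bonus b
JOIN per_employee e
  ON e.report_month = b.report_month
 AND e.employee_id = b.employee_id;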