SQL Server: calculating how many instances occur on each day - sql-server-2008-r2
I have a table that has an ID, a start date, and an end date:
Start_date End_Date ID
2016-03-01 06:30:00.000 2016-03-07 17:30:00.000 782772
2016-03-01 09:09:00.000 2016-03-07 10:16:00.000 782789
2016-03-01 11:17:00.000 2016-03-08 20:10:00.000 782882
2016-03-01 12:22:00.000 2016-03-21 19:40:00.000 782885
2016-03-01 13:15:00.000 2016-03-24 13:37:00.000 783000
2016-03-01 13:23:00.000 2016-03-07 19:15:00.000 782964
2016-03-01 13:55:00.000 2016-03-14 15:45:00.000 782972
2016-03-01 14:05:00.000 2016-03-07 20:32:00.000 783065
2016-03-01 18:06:00.000 2016-03-09 12:42:00.000 782988
2016-03-01 19:05:00.000 2016-04-01 20:00:00.000 782942
2016-03-01 19:15:00.000 2016-03-10 13:30:00.000 782940
2016-03-01 19:15:00.000 2016-03-07 18:00:00.000 783111
2016-03-01 20:10:00.000 2016-03-08 14:05:00.000 783019
2016-03-01 22:15:00.000 2016-03-24 12:46:00.000 782979
2016-03-02 08:00:00.000 2016-03-08 09:02:00.000 783222
2016-03-02 09:31:00.000 2016-03-15 09:16:00.000 783216
2016-03-02 11:04:00.000 2016-03-19 18:49:00.000 783301
2016-03-02 11:23:00.000 2016-03-14 19:49:00.000 783388
2016-03-02 11:46:00.000 2016-03-08 18:10:00.000 783368
2016-03-02 12:03:00.000 2016-03-23 08:44:00.000 783246
2016-03-02 12:23:00.000 2016-03-11 14:45:00.000 783302
2016-03-02 12:24:00.000 2016-03-12 15:30:00.000 783381
2016-03-02 12:30:00.000 2016-03-09 13:58:00.000 783268
2016-03-02 13:00:00.000 2016-03-10 11:30:00.000 783391
2016-03-02 13:35:00.000 2016-03-17 04:40:00.000 783309
2016-03-02 15:05:00.000 2016-04-04 11:52:00.000 783295
2016-03-02 15:08:00.000 2016-03-15 16:15:00.000 783305
2016-03-02 15:32:00.000 2016-03-08 14:20:00.000 783384
2016-03-02 16:49:00.000 2016-03-08 11:40:00.000 783367
2016-03-02 16:51:00.000 2016-03-11 16:00:00.000 783387
2016-03-02 18:00:00.000 2016-03-10 17:00:00.000 783242
2016-03-02 18:37:00.000 2016-03-25 13:30:00.000 783471
2016-03-02 18:45:00.000 2016-03-11 20:15:00.000 783498
2016-03-02 19:41:00.000 2016-03-17 12:34:00.000 783522
2016-03-02 20:08:00.000 2016-03-22 15:30:00.000 783405
2016-03-02 20:16:00.000 2016-03-31 12:30:00.000 783512
2016-03-02 21:45:00.000 2016-03-15 12:25:00.000 783407
2016-03-03 09:59:00.000 2016-03-09 15:00:00.000 783575
2016-03-03 11:18:00.000 2016-03-16 10:30:00.000 783570
2016-03-03 11:27:00.000 2016-03-15 17:28:00.000 783610
2016-03-03 11:36:00.000 2016-03-11 16:05:00.000 783572
2016-03-03 11:55:00.000 2016-03-10 20:15:00.000 783691
2016-03-03 12:10:00.000 2016-03-09 19:50:00.000 783702
2016-03-03 12:11:00.000 2016-03-15 14:08:00.000 783611
2016-03-03 12:55:00.000 2016-03-10 11:50:00.000 783571
2016-03-03 13:20:00.000 2016-04-20 20:37:00.000 783856
2016-03-03 14:08:00.000 2016-03-10 16:00:00.000 783728
2016-03-03 15:10:00.000 2016-03-10 17:00:00.000 783727
2016-03-03 15:20:00.000 2016-03-17 15:14:00.000 783768
2016-03-03 16:55:00.000 2016-03-09 14:09:00.000 783812
2016-03-03 17:00:00.000 2016-03-12 12:33:00.000 783978
2016-03-03 17:17:00.000 2016-03-10 16:00:00.000 783729
2016-03-03 17:42:00.000 2016-03-10 12:13:00.000 783975
2016-03-03 18:23:00.000 2016-03-09 17:00:00.000 783820
2016-03-03 18:31:00.000 2016-03-11 14:00:00.000 783891
2016-03-03 18:59:00.000 2016-03-10 17:00:00.000 783772
2016-03-03 19:48:00.000 2016-03-11 17:30:00.000 783724
2016-03-03 19:50:00.000 2016-03-09 18:00:00.000 783829
2016-03-03 20:48:00.000 2016-03-11 11:04:00.000 783745
2016-03-03 23:00:00.000 2016-03-13 10:59:00.000 783983
2016-03-04 02:50:00.000 2016-03-10 10:45:00.000 783991
2016-03-04 11:25:00.000 2016-03-14 14:50:00.000 784102
2016-03-04 11:28:00.000 2016-03-18 16:21:00.000 784011
2016-03-04 12:01:00.000 2016-03-11 13:20:00.000 784014
2016-03-04 12:15:00.000 2016-03-11 08:00:00.000 784004
2016-03-04 13:06:00.000 2016-03-11 15:00:00.000 784012
2016-03-04 13:37:00.000 2016-03-10 18:00:00.000 784200
2016-03-04 13:52:00.000 2016-04-22 21:30:00.000 784132
2016-03-04 14:11:00.000 2016-03-14 19:00:00.000 784136
2016-03-04 14:17:00.000 2016-03-11 16:52:00.000 784176
2016-03-04 14:42:00.000 2016-03-13 15:25:00.000 784070
2016-03-04 16:00:00.000 2016-03-11 17:30:00.000 784655
2016-03-04 16:30:00.000 2016-03-10 23:30:00.000 784652
2016-03-04 17:25:00.000 2016-03-22 14:00:00.000 784028
2016-03-04 19:50:00.000 2016-03-10 12:42:00.000 784303
2016-03-04 20:00:00.000 2016-03-10 16:13:00.000 784006
2016-03-04 21:30:00.000 2016-03-10 18:00:00.000 784042
2016-03-04 22:25:00.000 2016-04-02 19:40:00.000 784044
2016-03-04 22:40:00.000 2016-03-15 17:30:00.000 784276
2016-03-04 22:55:00.000 2016-03-13 13:50:00.000 784257
2016-03-04 23:10:00.000 2016-03-15 13:19:00.000 784266
2016-03-05 10:30:00.000 2016-03-11 07:45:00.000 784295
2016-03-05 10:30:00.000 2016-03-16 19:00:00.000 784305
2016-03-05 11:05:00.000 2016-03-17 15:26:00.000 784320
2016-03-05 12:30:00.000 2016-03-14 11:25:00.000 784368
2016-03-05 12:50:00.000 2016-03-17 13:27:00.000 784419
2016-03-05 13:01:00.000 2016-03-11 17:00:00.000 784298
2016-03-05 14:34:00.000 2016-03-11 19:00:00.000 784286
2016-03-05 14:45:00.000 2016-04-07 12:01:00.000 784316
2016-03-05 16:00:00.000 2016-03-24 17:00:00.000 784334
2016-03-05 19:22:00.000 2016-04-12 15:56:00.000 784335
2016-03-05 19:25:00.000 2016-03-14 11:59:00.000 784346
2016-03-05 19:25:00.000 2016-03-11 16:10:00.000 784399
2016-03-05 20:15:00.000 2016-03-15 16:20:00.000 784362
2016-03-05 20:26:00.000 2016-03-12 15:03:00.000 784347
2016-03-05 23:30:00.000 2016-03-17 16:45:00.000 784476
2016-03-06 11:57:00.000 2016-03-15 21:00:00.000 784524
2016-03-06 13:17:00.000 2016-03-29 08:09:00.000 784472
2016-03-06 14:07:00.000 2016-03-15 13:55:00.000 784497
2016-03-06 15:00:00.000 2016-03-16 12:24:00.000 784474
What I am looking to do is get, for every day, a count of how many entries occur on (i.e. span) that day.
Example Output
date Instances
01/03/2016 113
02/03/2016 100
03/03/2016 106
04/03/2016 127
05/03/2016 81
06/03/2016 59
07/03/2016 115
08/03/2016 104
09/03/2016 92
10/03/2016 105
11/03/2016 128
12/03/2016 71
13/03/2016 64
14/03/2016 99
15/03/2016 106
16/03/2016 101
17/03/2016 96
18/03/2016 127
19/03/2016 75
20/03/2016 62
21/03/2016 93
22/03/2016 109
23/03/2016 102
24/03/2016 104
25/03/2016 85
26/03/2016 87
27/03/2016 72
28/03/2016 61
29/03/2016 86
30/03/2016 90
31/03/2016 122
This is the query I am using:
with [dates] as (
    select convert(datetime, '2016-01-01') as [date] -- start
    union all
    select dateadd(day, 1, [date])
    from [dates]
    where [date] < GETDATE() -- end
)
select [date],
       Sum(Case when [date] between ws.Start_DTTM
                                and Case when Cast(ws.End_DTTM as date) is null then [date]
                                         else Cast(ws.End_DTTM as date) end
                then 1 else 0 end) as Instances
from [dates]
join [STAYS] ws on Case when Cast(ws.End_DTTM as date) is null then GETDATE() - 1
                        else Cast(ws.End_DTTM as date) end = [dates].[date]
where ws.End_DTTM between '2016-01-01' and GETDATE()
group by [date]
order by [date]
option (maxrecursion 0)
However, I am not getting the right answer. These are the expected figures, currently calculated in Excel:
Date Instances
01/03/2016 343
02/03/2016 326
03/03/2016 327
04/03/2016 332
05/03/2016 318
06/03/2016 317
07/03/2016 337
08/03/2016 332
09/03/2016 345
10/03/2016 349
11/03/2016 341
12/03/2016 323
13/03/2016 333
14/03/2016 349
15/03/2016 344
16/03/2016 358
17/03/2016 349
18/03/2016 350
19/03/2016 347
20/03/2016 351
21/03/2016 371
22/03/2016 369
23/03/2016 340
24/03/2016 335
25/03/2016 319
26/03/2016 341
27/03/2016 355
28/03/2016 351
29/03/2016 367
30/03/2016 379
31/03/2016 385
Update, as per OP's comment:
In summary, for the row below:
Start_date End_Date ID
2016-03-01 06:30:00.000 2016-03-07 17:30:00.000 782772
Expected output would be:
01/03/2016 1
02/03/2016 1
03/03/2016 1
04/03/2016 1
05/03/2016 1
06/03/2016 1
07/03/2016 1
Like this, I want to calculate the count for all rows, per date.
select convert(varchar(10), startdate, 103) as [date], count(*) as occurrences
from [table]
group by convert(varchar(10), startdate, 103)
Update:
Try this:
;with cte as
(
    select startdate,
           enddate,
           datediff(day, startdate, enddate) as cnt
    from [table]
)
select convert(varchar(10), startdate, 103) as [date],
       sum(cnt) as instances
from cte
group by convert(varchar(10), startdate, 103)
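Note, though, that summing day counts grouped by start date still does not give the per-date figures shown in the question. If the goal is for every row to count once for each date it spans (as in the single-row example above), a calendar CTE joined on an interval-containment predicate is a more direct route. Below is a minimal sketch, reusing the STAYS table and the Start_DTTM / End_DTTM columns from the question's own query, and assuming a NULL end date means the stay is still open:

;with [dates] as (
    -- calendar of the dates we want to report on
    select convert(date, '2016-03-01') as [date]
    union all
    select dateadd(day, 1, [date])
    from [dates]
    where [date] < convert(date, '2016-03-31')
)
select d.[date],
       count(*) as instances
from [dates] d
join [STAYS] ws
    -- a stay is counted on every calendar date its interval contains;
    -- a NULL end date (assumed to mean "still open") is treated as today
    on d.[date] between cast(ws.Start_DTTM as date)
                    and cast(coalesce(ws.End_DTTM, GETDATE()) as date)
group by d.[date]
order by d.[date]
option (maxrecursion 0)

The key difference from the question's query is the join predicate: each calendar date matches every row whose interval contains it, so a stay from 2016-03-01 to 2016-03-07 contributes 1 to each of those seven days.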
Related
How to extract the whole hours from a time range in Postgresql and get the duration of each extracted hour
I'm new to databases (even more so to Postgres), so I hope you can help me. I have a table something like this:

id_interaction  start_time           end_time
0001            2022-06-03 12:40:10  2022-06-03 12:45:16
0002            2022-06-04 10:50:40  2022-06-04 11:10:12
0003            2022-06-04 16:30:00  2022-06-04 18:20:00
0004            2022-06-05 23:00:00  2022-06-06 10:30:12

Basically I need to create a query to get the duration, separated by hour, for example:

id_interaction  start_time           end_time             hour      duration
0001            2022-06-03 12:40:10  2022-06-03 12:45:16  12:00:00  00:05:06
0002            2022-06-04 10:50:40  2022-06-04 11:10:12  10:00:00  00:09:20
0002            2022-06-04 10:50:40  2022-06-04 11:10:12  11:00:00  00:10:12
0003            2022-06-04 16:30:00  2022-06-04 18:20:00  16:00:00  00:30:00
0003            2022-06-04 16:30:00  2022-06-04 18:20:00  17:00:00  01:00:00
0003            2022-06-04 16:30:00  2022-06-04 18:20:00  18:00:00  00:20:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  23:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  24:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  01:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  02:00:00  01:00:00
0004            2022-06-05 23:00:00  2022-06-06 03:30:12  03:00:00  00:30:12

I need all the hours from start to finish. For example: if an id starts at 17:10 and ends at 19:00, I need the duration within 17:00, 18:00 and 19:00.
If you're trying to get the duration in each whole hour interval overlapped by your data, this can be achieved by rounding timestamps using date_trunc(), using generate_series() to move around the intervals, and casting between time, interval and timestamp:

create or replace function hours_crossed(starts timestamp, ends timestamp)
returns integer
language sql as '
select case
    when date_trunc(''hour'', starts) = date_trunc(''hour'', ends) then 0
    when date_trunc(''hour'', starts) = starts
        then floor(extract(''epoch'' from ends - starts)::numeric / 60.0 / 60.0)
    else floor(extract(''epoch'' from ends - starts)::numeric / 60.0 / 60.0) + 1
end';

select *
from (
    select id_interacao,
           tempo_inicial,
           tempo_final,
           to_char(hora, 'HH24:00')::time as hora,
           least(tempo_final, hora + '1 hour'::interval)
             - greatest(tempo_inicial, hora) as duracao
    from (
        select *,
               date_trunc('hour', tempo_inicial)
                 + (generate_series(0, hours_crossed(tempo_inicial, tempo_final))::text
                    || ' hours')::interval as hora
        from test_times
    ) a
) a
where duracao <> '0'::interval;

This also fixes your first entry, which lasts 5 minutes but shows as 45. You'll need to decide how you want to handle zero-length intervals and ones that end on an exact hour - I added a condition to skip them. Here's a working example.
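The heart of that query is the least/greatest clamp: the time spent inside a given hour bucket is the overlap between the row's interval and that bucket. A one-row illustration (values taken from the 11:00 bucket of id_interaction 0002 in the expected output above):

-- overlap of [10:50:40, 11:10:12] with the bucket [11:00:00, 12:00:00)
select least(timestamp '2022-06-04 11:10:12',
             timestamp '2022-06-04 11:00:00' + interval '1 hour')
     - greatest(timestamp '2022-06-04 10:50:40',
                timestamp '2022-06-04 11:00:00') as duration;
-- returns 00:10:12, matching the expected row for hour 11:00:00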
Filter excel data based on column value powershell
I have an Excel file with around 34k rows of data like the below:

SYMM_ID    DATE                    INSTANCE                 Total Response Time
297900076  2022-10-23 11:25:00 PM  GS_GTS_ORACLUL_L_PRDPRF  0.21
297900076  2022-10-24 02:15:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.36
297900076  2022-10-24 04:20:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.96
297900076  2022-10-24 04:25:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.3
297900076  2022-10-24 04:30:00 AM  GS_GTS_ORACLUL_L_PRDPRF  1.21
297900076  2022-10-24 04:35:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.48
297900076  2022-10-24 04:40:00 AM  GS_GTS_ORACLUL_L_PRDPRF  1.17
297900076  2022-10-24 04:45:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.33
297900076  2022-10-24 04:50:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.57
297900076  2022-10-24 04:55:00 AM  GS_GTS_ORACLUL_L_PRDPRF  0.34
297900076  2022-10-23 05:00:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.35
297900076  2022-10-23 05:05:00 AM  GS_GTS_ORACLUL_L_PRDSTD  1.02
297900076  2022-10-23 05:10:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.68
297900076  2022-10-23 05:15:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.72
297900076  2022-10-23 05:20:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.67
297900076  2022-10-23 05:25:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.57
297900076  2022-10-23 05:30:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.64
297900076  2022-10-23 05:35:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.61
297900076  2022-10-23 05:40:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.56
297900076  2022-10-23 05:45:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.57
297900076  2022-10-23 05:50:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.5
297900076  2022-10-23 05:55:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.66
297900076  2022-10-23 06:00:00 AM  GS_GTS_ORACLUL_L_PRDSTD  0.63
297900076  2022-10-23 06:05:00 AM  GS_GTS_ORACLUL_L_PRDSTD  1.1
297900076  2022-10-23 11:00:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.36
297900076  2022-10-23 11:05:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.84
297900076  2022-10-23 11:10:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.93
297900076  2022-10-23 11:15:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.55
297900076  2022-10-23 11:20:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.53
297900076  2022-10-23 11:25:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.48
297900076  2022-10-23 11:30:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.58
297900076  2022-10-23 11:35:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.67
297900076  2022-10-23 11:40:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.38
297900076  2022-10-23 11:45:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.74
297900076  2022-10-23 11:50:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.51
297900076  2022-10-23 11:55:00 PM  GS_GTS_LOCCLUL_L_PRDPRF  0.56
297900076  2022-10-24 12:00:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.45
297900076  2022-10-24 12:05:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  1.88
297900076  2022-10-24 12:10:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.75
297900076  2022-10-24 12:15:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.6
297900076  2022-10-24 12:20:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.63
297900076  2022-10-24 12:25:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.96
297900076  2022-10-24 12:30:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.58
297900076  2022-10-24 12:35:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.64
297900076  2022-10-24 12:40:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.73
297900076  2022-10-24 12:45:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.54
297900076  2022-10-24 12:50:00 AM  GS_GTS_LOCCLUL_L_PRDPRF  0.57

I am using the below code to get the total number of rows from the sheet:

$workbook = $excel.workbooks.open('C:\SLAFile.xlsx')
$worksheet = $workbook.Worksheets.Item(1)
$rows = $worksheet.range("D2").currentregion.rows.count

But I have to filter the data and get the count of rows having "PRF" and "STD" in INSTANCE, since I have to execute different formulas based on INSTANCE (one formula for PRF, another for STD). Please let me know how to filter the column data and get the row counts.
I think you might be better off using the ImportExcel module. That would give you an object in PowerShell we can easily work with. There are a few ways you can do this given the number of rows, but the simplest one would be Where-Object:

Import-Module ImportExcel
$eFile = Import-Excel 'C:\Working\temp\test.xlsx'
$PRF = $eFile.'INSTANCE ' | Where-Object {$_ -like '*PRF'}
$STD = $eFile.'INSTANCE ' | Where-Object {$_ -like '*STD'}
Write-Host "*PRF - $($PRF.count)"
Write-Host "*STD - $($STD.count)"
Extracting all rows containing a specific datetime value (MATLAB)
I have a table which looks like this:

Entry number  Timestamp         Value1  Value2  Value3  Value4
5758          28-06-2018 16:30  34      63      34.2    60.9
5759          28-06-2018 17:00  33.5    58      34.9    58.4
5758          28-06-2018 16:30  34      63      34.2    60.9
5759          28-06-2018 17:00  33.5    58      34.9    58.4
5760          28-06-2018 17:30  33      53      35.2    58.5
5761          28-06-2018 18:00  33      63      35      57.9
5762          28-06-2018 18:30  33      61      34.6    58.9
5763          28-06-2018 19:00  33      59      34.1    59.4
5764          28-06-2018 19:30  28      89      33.5    64.2
5765          28-06-2018 20:00  28      89      33      66.1
5766          28-06-2018 20:30  28      83      32.5    67
5767          28-06-2018 21:00  29      89      32.2    68.4

where '28-06-2018 16:30' is under one column, so I have 6 columns: Entry number, Timestamp, Value1, Value2, Value3, Value4. I want to extract all rows that belong to '28-06-2018', i.e. all data pertaining to that day. Since my table is too large I couldn't fit more data here; however, the entries under the timestamp column range over a couple of months.
t = table([5758;5759], ["28-06-2018 16:30";"29-06-2018 16:30"], [34;33.5], 'VariableNames', {'Entry number','Timestamp','Value1'})

t =
  2×3 table
    Entry number        Timestamp         Value1
    ____________    __________________    ______
        5758        "28-06-2018 16:30"      34
        5759        "29-06-2018 16:30"     33.5

t(contains(t.('Timestamp'), "28-06"), :)

ans =
  1×3 table
    Entry number        Timestamp         Value1
    ____________    __________________    ______
        5758        "28-06-2018 16:30"      34
SUM in Postgres not behaving as expected
With the query below, I would like to obtain the SUM of account_move_line.balance AS amounteur whenever account_id, partner_id, invoice_id and account_account.code are equal.

SELECT account_move_line.name,
       account_move_line.account_id,
       account_move_line.partner_id,
       account_move_line.invoice_id,
       account_move_line.journal_id,
       CASE
           WHEN account_account.code LIKE '40%%' THEN '400000'
           WHEN account_account.code LIKE '44%%' THEN '440000'
           ELSE account_account.code
       END AS ACCOUNTGL,
       CASE
           WHEN account_account.code = '702000' THEN SUM(account_move_line.balance)
           ELSE (round(account_move_line.balance, 2))
       END AS AMOUNTEUR
FROM public.account_move_line
JOIN account_account ON (account_account.id = account_move_line.account_id)
WHERE (account_move_line.date BETWEEN '2020-03-01' AND '2020-03-31')
GROUP BY account_move_line.account_id,
         account_move_line.partner_id,
         account_move_line.invoice_id,
         account_move_line.journal_id,
         account_account.code,
         account_move_line.balance,
         account_move_line.name
ORDER BY account_move_line.account_id,
         account_move_line.invoice_id;

The result I get:

NAME                        account_id  Partner_id  Invoice_id  J_id  accountgl  amounteur
"Taxe led"                  186         2476        1883        1     "702000"   -0.83
"Taxe eclairage"            186         2476        1883        1     "702000"   -0.11
"Taxe gros et petit blanc"  186         3090        1884        1     "702000"   -0.83
"Taxe eclairage"            186         2077        1885        1     "702000"   0.25
"Taxe eclairage"            186         2077        1887        1     "702000"   -0.25
"Taxe eclairage"            186         2077        1888        1     "702000"   -0.02
"Taxe led"                  186         2481        1916        1     "702000"   -0.83
"Taxe eclairage"            186         2481        1916        1     "702000"   -0.52

I expected:

NAME                        account_id  Partner_id  Invoice_id  J_id  accountgl  amounteur
                            186         2476        1883        1     "702000"   -0.94
"Taxe gros et petit blanc"  186         3090        1884        1     "702000"   -0.83
"Taxe eclairage"            186         2077        1885        1     "702000"   0.25
"Taxe eclairage"            186         2077        1887        1     "702000"   -0.25
"Taxe eclairage"            186         2077        1888        1     "702000"   -0.02
                            186         2481        1916        1     "702000"   -1.35

Thanks
I'm guessing, but it seems you expect the results to be grouped by account_id, partner_id, invoice_id, and perhaps journal_id. But you've told it to group by many more columns:

account_move_line.account_id,
account_move_line.partner_id,
account_move_line.invoice_id,
account_move_line.journal_id,
account_account.code,
account_move_line.balance,
account_move_line.name

To be grouped, rows would have to have the same account, partner, invoice, and journal IDs, plus the same code, balance, and name. Cut your GROUP BY back to just the four IDs. This will mean you cannot select some columns, because the group has several values for that column: for example, the name. Each group will contain several names, so no single name can be selected.
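A minimal sketch of that pared-down grouping, assuming the same tables as in the question; the CASE translations are dropped for brevity, and min(name) / min(code) are just one arbitrary-but-valid way to keep a representative value per group:

SELECT min(account_move_line.name)        AS name,       -- one name per group, illustrative only
       account_move_line.account_id,
       account_move_line.partner_id,
       account_move_line.invoice_id,
       account_move_line.journal_id,
       min(account_account.code)          AS accountgl,
       round(SUM(account_move_line.balance), 2) AS amounteur
FROM public.account_move_line
JOIN account_account ON account_account.id = account_move_line.account_id
WHERE account_move_line.date BETWEEN '2020-03-01' AND '2020-03-31'
GROUP BY account_move_line.account_id,   -- only the four IDs define a group
         account_move_line.partner_id,
         account_move_line.invoice_id,
         account_move_line.journal_id
ORDER BY account_move_line.account_id,
         account_move_line.invoice_id;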
Dataframe merge creates duplicate records in pandas (0.7.3)
When I merge two CSV files of the format (date, someValue), I see some duplicate records. If I reduce the records by half, the problem goes away; however, if I double the size of both files, it worsens. I'd appreciate any help! My code:

i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
total_df = pd.merge(i, e, right_index=False, left_index=False, right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')

(Note the duplicate records for 11/15, 11/16, 12/17, 12/18.)

In [7]: total_df
Out[7]:
          date           Cost  netCost
25  2012-11-15 00:00:00  1     2
26  2012-11-15 00:00:00  1     2
31  2012-11-16 00:00:00  1     2
32  2012-11-16 00:00:00  1     2
37  2012-11-17 00:00:00  1     2
2   2012-11-18 00:00:00  1     2
5   2012-11-19 00:00:00  1     2
8   2012-11-20 00:00:00  1     2
11  2012-11-21 00:00:00  1     2
14  2012-11-22 00:00:00  1     2
17  2012-11-23 00:00:00  1     2
20  2012-11-24 00:00:00  1     2
23  2012-11-25 00:00:00  1     2
29  2012-11-26 00:00:00  1     2
35  2012-11-27 00:00:00  1     2
0   2012-11-28 00:00:00  1     2
3   2012-11-29 00:00:00  1     2
6   2012-11-30 00:00:00  1     2
9   2012-12-01 00:00:00  1     2
12  2012-12-02 00:00:00  1     2
15  2012-12-03 00:00:00  1     2
18  2012-12-04 00:00:00  1     2
21  2012-12-05 00:00:00  1     2
24  2012-12-06 00:00:00  1     2
30  2012-12-07 00:00:00  1     2
36  2012-12-08 00:00:00  1     2
1   2012-12-09 00:00:00  2     2
4   2012-12-10 00:00:00  2     2
7   2012-12-11 00:00:00  2     2
10  2012-12-12 00:00:00  2     2
13  2012-12-13 00:00:00  1     2
16  2012-12-14 00:00:00  2     2
19  2012-12-15 00:00:00  2     2
22  2012-12-16 00:00:00  2     2
27  2012-12-17 00:00:00  1     2
28  2012-12-17 00:00:00  1     2
33  2012-12-18 00:00:00  1     2
34  2012-12-18 00:00:00  1     2

i.csv:

date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
2012-11-18 00:00:00,1
2012-11-19 00:00:00,1
2012-11-20 00:00:00,1
2012-11-21 00:00:00,1
2012-11-22 00:00:00,1
2012-11-23 00:00:00,1
2012-11-24 00:00:00,1
2012-11-25 00:00:00,1
2012-11-26 00:00:00,1
2012-11-27 00:00:00,1
2012-11-28 00:00:00,1
2012-11-29 00:00:00,1
2012-11-30 00:00:00,1
2012-12-01 00:00:00,1
2012-12-02 00:00:00,1
2012-12-03 00:00:00,1
2012-12-04 00:00:00,1
2012-12-05 00:00:00,1
2012-12-06 00:00:00,1
2012-12-07 00:00:00,1
2012-12-08 00:00:00,1
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,1
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,1
2012-12-18 00:00:00,1

e.csv:

date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
2012-11-18 00:00:00,2
2012-11-19 00:00:00,2
2012-11-20 00:00:00,2
2012-11-21 00:00:00,2
2012-11-22 00:00:00,2
2012-11-23 00:00:00,2
2012-11-24 00:00:00,2
2012-11-25 00:00:00,2
2012-11-26 00:00:00,2
2012-11-27 00:00:00,2
2012-11-28 00:00:00,2
2012-11-29 00:00:00,2
2012-11-30 00:00:00,2
2012-12-01 00:00:00,2
2012-12-02 00:00:00,2
2012-12-03 00:00:00,2
2012-12-04 00:00:00,2
2012-12-05 00:00:00,2
2012-12-06 00:00:00,2
2012-12-07 00:00:00,2
2012-12-08 00:00:00,2
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,2
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,2
2012-12-18 00:00:00,2
This does seem like a bug with pandas 0.7.3 or numpy 1.6. It only happens if the column being merged on is a date (internally converted to numpy.datetime64). My solution was to convert the date into a string:

def _DatetimeToString(datetime64):
    timestamp = datetime64.astype(long) / 1000000000
    return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')

i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
i['date'] = i['date'].map(_DatetimeToString)
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
e['date'] = e['date'].map(_DatetimeToString)
total_df = pd.merge(i, e, right_index=False, left_index=False, right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')
This issue/bug came up for me as well. I was not merging on a datetime series; however, I did have a datetime series in the left dataframe. My solution was to de-dupe:

len(pophist)
2347

pop_merged = pd.merge(left=pophist, right=df_labels, how='left', left_on='candidate', right_on='Slug', indicator=True)
pop_merged.shape
3303

# note: de-dupping is required due to an issue in how pandas handles datetime dtypes on merge
pop_merged2 = pop_merged.drop_duplicates()
len(pop_merged2)
2347