Removing Duplicated Values

Removing Duplicated Values - date

After data cleaning and aggregation I was left with a data table like this:
df
id d1 v1 d2 v2 d3 v3 d4 v4
1 1-1-2018 1 1-1-2018 1 1-1-2018 1 1-1-2018 1
2 1-1-2018 1 1-2-2018 2 1-2-2018 2 1-2-2018 2
3 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-3-2018 3
4 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-4-2018 4
I am trying to remove any values from a column in the above data frame that are duplicates of earlier columns.
I have already tried:
df$v2[df$v1 == df$v2] <- NA
but this removed all of the values from the v2 column.
I want my data frame to look like this at the end:
df
id d1 v1 d2 v2 d3 v3 d4 v4
1 1-1-2018 1 NA NA NA NA NA NA
2 1-1-2018 1 1-2-2018 2 NA NA NA NA
3 1-1-2018 1 1-2-2018 2 1-3-2018 3 NA NA
4 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-4-2018 4

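Try df[...condition here...]$column <- NA
Or with data.table, blanking each date/value pair whose date repeats the previous date column (this assumes the date columns are character, not factor, and works from the last pair back to the first so that each comparison still sees the original values):
library(data.table)
dt <- data.table(df)
dt[d4 == d3, c("d4", "v4") := NA]
dt[d3 == d2, c("d3", "v3") := NA]
dt[d2 == d1, c("d2", "v2") := NA]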
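For comparison, here is a minimal Python sketch of the same clean-up on plain dictionaries rather than an R data frame. It assumes the intent is to blank every date/value pair whose date repeats the date in the column just before it; the names make_row, blank_repeated_pairs and n_pairs are hypothetical and only serve the illustration.

```python
def make_row(row_id, pairs):
    """Build one table row from a list of (date, value) pairs."""
    row = {"id": row_id}
    for k, (d, v) in enumerate(pairs, start=1):
        row[f"d{k}"] = d
        row[f"v{k}"] = v
    return row


def blank_repeated_pairs(row, n_pairs=4):
    """Return a copy of row with repeated (d_k, v_k) pairs set to None."""
    cleaned = dict(row)
    for k in range(2, n_pairs + 1):
        # Compare against the ORIGINAL row, so blanking one pair cannot
        # hide a duplicate further to the right.
        if row[f"d{k}"] == row[f"d{k - 1}"]:
            cleaned[f"d{k}"] = None
            cleaned[f"v{k}"] = None
    return cleaned


# The example table from the question.
df = [
    make_row(1, [("1-1-2018", 1)] * 4),
    make_row(2, [("1-1-2018", 1)] + [("1-2-2018", 2)] * 3),
    make_row(3, [("1-1-2018", 1), ("1-2-2018", 2)] + [("1-3-2018", 3)] * 2),
    make_row(4, [("1-1-2018", 1), ("1-2-2018", 2), ("1-3-2018", 3), ("1-4-2018", 4)]),
]

cleaned = [blank_repeated_pairs(row) for row in df]
for row in cleaned:
    print(row)
```

Because every comparison reads the untouched input row, the order in which the pairs are processed does not matter here, unlike the in-place R version.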

Related

R intersecting a date between dates in another row and assigning end dates by group

Hi, I need a solution to a problem in the R programming language.
library(gtools)
end_date <- "2021-12-31"
ddf1 <- data.frame(pnr=c("1","1","2","2","3","3","3","4"),
in_date=as.POSIXct(c("2010-08-18","2010-09-01","2019-04-02","2018-03-27",
"2019-07-12","2013-10-20","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04",NA,"2019-05-17",NA,
NA,"2017-08-19",NA,NA)),
treat1=c(1,1,1,1,1,1,1,1)
)
ddf2 <- data.frame(pnr=c("4","4","3","3","2","2","2","1"),
in_date=as.POSIXct(c("2010-08-18","2010-09-01","2019-04-02","2018-03-27",
"2019-07-12","2013-10-20","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04",NA,"2019-05-17",NA,
NA,"2017-08-19",NA,NA)),
treat2=c(1,1,1,1,1,1,1,1)
)
expected_output1 <- data.frame(pnr=c("1","2","3","3","4"),
in_date=as.POSIXct(c("2010-08-18","2018-03-27",
"2019-07-12","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04","2019-05-17",end_date,
"2017-08-19",end_date)),
treat1=c(1,1,1,1,1)
)
expected_output2 <- data.frame(pnr=c("4","3","2","2","1"),
in_date=as.POSIXct(c("2010-08-18","2018-03-27",
"2019-07-12","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04","2019-05-17",end_date,
"2017-08-19",end_date)),
treat2=c(1,1,1,1,1)
)
ddf <- smartbind(ddf1,ddf2)
expected_output <- smartbind(expected_output1,expected_output2)
> ddf
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 1 2010-09-01 <NA> 1 NA
1:3 2 2019-04-02 2019-05-17 1 NA
1:4 2 2018-03-27 <NA> 1 NA
1:5 3 2019-07-12 <NA> 1 NA
1:6 3 2013-10-20 2017-08-19 1 NA
1:7 3 2012-07-01 <NA> 1 NA
1:8 4 2015-05-02 <NA> 1 NA
2:1 4 2010-08-18 2010-12-04 NA 1
2:2 4 2010-09-01 <NA> NA 1
2:3 3 2019-04-02 2019-05-17 NA 1
2:4 3 2018-03-27 <NA> NA 1
2:5 2 2019-07-12 <NA> NA 1
2:6 2 2013-10-20 2017-08-19 NA 1
2:7 2 2012-07-01 <NA> NA 1
2:8 1 2015-05-02 <NA> NA 1
> expected_output
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 2 2018-03-27 2019-05-17 1 NA
1:3 3 2019-07-12 2021-12-31 1 NA
1:4 3 2012-07-01 2017-08-19 1 NA
1:5 4 2015-05-02 2021-12-31 1 NA
2:1 4 2010-08-18 2010-12-04 NA 1
2:2 3 2018-03-27 2019-05-17 NA 1
2:3 2 2019-07-12 2021-12-31 NA 1
2:4 2 2012-07-01 2017-08-19 NA 1
2:5 1 2015-05-02 2021-12-31 NA 1
I have some individuals who have gone through different treatments, treat1 and treat2.
I need to deal with the fact that some treatment courses were started but lack an out_date.
When an out_date is missing, it should be replaced with the end_date of the study:
end_date <- "2021-12-31"
However, consider a pair of observations like this one:
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 1 2010-09-01 <NA> 1 NA
Here the in_date (the start of the treatment) of the second row falls within the period of another treatment for the same person ("pnr"), so the correct output is:
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
because 2010-08-18 is the earliest in_date.
However, when the row without an out_date has the earlier in_date, that earlier date should be used, which is the case for pnr 2:
pnr in_date out_date treat1 treat2
1:3 2 2019-04-02 2019-05-17 1 NA
1:4 2 2018-03-27 <NA> 1 NA
becomes:
pnr in_date out_date treat1 treat2
1:2 2 2018-03-27 2019-05-17 1 NA
So the whole period of treatment is covered, with the earliest in_date and latest out_date.
In cases where there is no out_date the end_date should be set instead;
so that:
pnr in_date out_date treat1 treat2
1:8 4 2015-05-02 <NA> 1 NA
becomes:
1:5 4 2015-05-02 2021-12-31 1 NA
The function should also handle the special case where a person has both an earlier (or intersecting) in_date and a later in_date with a missing out_date, as with pnr 3:
1:5 3 2019-07-12 <NA> 1 NA
1:6 3 2013-10-20 2017-08-19 1 NA
1:7 3 2012-07-01 <NA> 1 NA
should become:
pnr in_date out_date treat1 treat2
1:3 3 2019-07-12 2021-12-31 1 NA
1:4 3 2012-07-01 2017-08-19 1 NA
OPTIONAL: If at all possible, it would be great if the function could handle this separately for each treatment, so that each pnr is processed within treat1 and within treat2 independently, as shown in expected_output.
I have attempted to write some code that flags whether an out_date is NA and computes the differences between dates, but I still can't fathom how to continue:
library(data.table)  # needed for data.table(), := and shift()
ddf$end_replaced <- as.integer(ifelse(is.na(ddf$out_date),1,0))
ddf <- data.table(ddf)
ddf <- ddf[order(ddf$treat1,ddf$pnr,ddf$in_date,ddf$out_date),]
ddf[, diffx := difftime(in_date, shift(in_date, fill=in_date[1L]),
units="days"), by=pnr]
Thanks for reading
UPDATE
I ended up solving the issue. It's not pretty, so if anyone has a better solution than this, please let me know.
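One possible reading of the rules in this question, sketched in Python rather than R: within one person and one treatment, sort the rows by in_date; give an open row the out_date of the next closed row after it; drop an open row that starts inside an earlier period; otherwise close it with the study end date; finally merge overlapping periods. This is an interpretation of the expected output, not the asker's own solution, and the function name collapse_periods is hypothetical.

```python
from datetime import date

END_DATE = date(2021, 12, 31)  # study end date taken from the question


def collapse_periods(rows, end_date=END_DATE):
    """Collapse (in_date, out_date or None) rows of ONE person and treatment."""
    rows = sorted(rows, key=lambda r: r[0])
    filled = []
    latest_out = None
    for i, (start, stop) in enumerate(rows):
        if stop is None:
            # Assumption: an open row borrows the out_date of the next closed row.
            later = [r[1] for r in rows[i + 1:] if r[1] is not None]
            if later:
                stop = later[0]
            elif latest_out is not None and start <= latest_out:
                continue  # starts inside an earlier period, so it is absorbed
            else:
                stop = end_date  # still running at the end of the study
        filled.append((start, stop))
        latest_out = stop if latest_out is None else max(latest_out, stop)

    # Merge periods that overlap after filling.
    merged = []
    for start, stop in filled:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


# pnr 3, treat1 from the question
pnr3 = [
    (date(2019, 7, 12), None),
    (date(2013, 10, 20), date(2017, 8, 19)),
    (date(2012, 7, 1), None),
]
for period in collapse_periods(pnr3):
    print(period)
```

To apply it to the whole table, the rows would first be grouped by treatment and pnr, and the function called once per group. The result comes back sorted by in_date, which differs from the row order in expected_output.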

Power BI - Counting number of projects in execution per month

I want to make a line graph with the number of running projects by fortnight (could be monthly, whatever is easier).
Given table (ETC used when the project is not finished yet):
project ID  Start date  Finish or ETC date  Category
1           04/12/2022  08/23/2022          Type A
2           04/14/2022  09/21/2022          Type B
3           05/18/2022  12/17/2022          Type A
4           06/21/2022  09/25/2022          Type C
5           06/28/2022  10/02/2022          Type A
6           07/08/2022  12/23/2022          Type C
7           07/20/2022  12/08/2022          Type C
8           07/29/2022  10/12/2022          Type B
In Excel, I am using the COUNTIFS function to determine how many projects were in progress at the same time. For example: There is 1 project running (ID 1) on the first fortnight of April (1-14) and 2 projects running (IDs 1 and 2) on the second fortnight (14-30)
My table in excel looks like this:
Fortnight  Type A  Type B  Type C  Total
04/22 F1   1       -       -       1
04/22 F2   1       1       -       2
05/22 F1   1       1       -       2
05/22 F2   2       1       -       3
06/22 F1   2       1       -       3
06/22 F2   3       1       1       5
07/22 F1   3       1       2       6
07/22 F2   3       2       3       8
08/22 F1   3       2       3       8
08/22 F2   3       2       3       8
09/22 F1   2       2       3       7
09/22 F2   2       2       3       7
10/22 F1   2       1       2       5
10/22 F2   1       -       2       3
11/22 F1   1       -       2       3
11/22 F2   1       -       2       3
12/22 F1   1       -       2       3
12/22 F2   1       -       1       2
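The COUNTIFS logic described in this question boils down to an interval-overlap test: a project is running in a period if it starts on or before the period's last day and ends on or after its first day. The following is a minimal Python sketch of that test, not Power BI code. It assumes that the single date recorded per project is its finish date or ETC, and that the caller supplies the period boundaries; the name running_by_category is hypothetical.

```python
from collections import Counter
from datetime import date

# (project id, start date, finish or ETC date, category) from the question
PROJECTS = [
    (1, date(2022, 4, 12), date(2022, 8, 23), "Type A"),
    (2, date(2022, 4, 14), date(2022, 9, 21), "Type B"),
    (3, date(2022, 5, 18), date(2022, 12, 17), "Type A"),
    (4, date(2022, 6, 21), date(2022, 9, 25), "Type C"),
    (5, date(2022, 6, 28), date(2022, 10, 2), "Type A"),
    (6, date(2022, 7, 8), date(2022, 12, 23), "Type C"),
    (7, date(2022, 7, 20), date(2022, 12, 8), "Type C"),
    (8, date(2022, 7, 29), date(2022, 10, 12), "Type B"),
]


def running_by_category(projects, period_start, period_end):
    """Count projects whose date range overlaps [period_start, period_end]."""
    counts = Counter()
    for _project_id, start, end, category in projects:
        if start <= period_end and end >= period_start:
            counts[category] += 1
    return counts


# Second half of July 2022 (boundaries chosen for illustration)
july_f2 = running_by_category(PROJECTS, date(2022, 7, 16), date(2022, 7, 31))
print(dict(july_f2), "total:", sum(july_f2.values()))
```

Calling the function once per fortnight (or month) yields one point of the line graph per period; the exact fortnight boundaries are a modelling choice and change the counts near the 14th or 15th of a month.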

How to merge multiple cases into one in SPSS?

I want to fill in missing values for a case with values from cases in a different file. The corresponding cases have the same reference number, variable REF. In the end, there should be only one case per reference number, with no missing values in any variable. I already tried Data -> Merge files -> Add variables -> many to one, but I still end up with multiple cases per reference number, or with no change at all in the table. I can't figure out how this works.
My two data sets:
REF p1 p2 p3
1 5 NA NA
2 3 NA NA
3 4 NA NA
REF p1 p2 p3
1 NA 3 NA
1 NA NA 1
2 NA 2 NA
2 NA NA 4
3 NA 1 NA
3 NA NA 1
Desired output:
REF p1 p2 p3
1 5 3 1
2 3 2 4
3 4 1 1

I suggest you first stack the two files, so that all the data is in one table, then use aggregation to get all the data for each case into one line. I suggest aggregation using the max function under the assumption that for every REF only one value exists in each column, so the aggregation will select this value and leave out the other "competing" missing values.
EDITED to leave only one line per "REF":
add files /file = dataset1 /file = dataset2.
exe.
dataset name gen.
aggregate /outfile=* /break=REF /P1 P2 P3=max(P1 P2 P3).
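The same stack-then-aggregate idea can be sketched outside SPSS. The Python snippet below is only an illustration of the logic, under the same assumption as the answer above (for every REF at most one non-missing value exists per column, so taking the maximum simply picks that value). The function name merge_cases is hypothetical, and None stands in for a missing value.

```python
def merge_cases(*datasets):
    """Stack all rows, then keep the max non-missing value per REF and column."""
    merged = {}
    for rows in datasets:
        for row in rows:
            target = merged.setdefault(row["REF"], {})
            for key, value in row.items():
                if key == "REF" or value is None:
                    continue  # skip the break variable and missing values
                target[key] = max(value, target.get(key, value))
    return merged


dataset1 = [
    {"REF": 1, "p1": 5, "p2": None, "p3": None},
    {"REF": 2, "p1": 3, "p2": None, "p3": None},
    {"REF": 3, "p1": 4, "p2": None, "p3": None},
]
dataset2 = [
    {"REF": 1, "p1": None, "p2": 3, "p3": None},
    {"REF": 1, "p1": None, "p2": None, "p3": 1},
    {"REF": 2, "p1": None, "p2": 2, "p3": None},
    {"REF": 2, "p1": None, "p2": None, "p3": 4},
    {"REF": 3, "p1": None, "p2": 1, "p3": None},
    {"REF": 3, "p1": None, "p2": None, "p3": 1},
]

result = merge_cases(dataset1, dataset2)
for ref, values in sorted(result.items()):
    print(ref, values)
```

If a REF could carry two different non-missing values in the same column, the maximum would silently win, so that case would need an explicit rule.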

The simple recursive function doesn't work well

It's just a simple recursive function test.
It should stop at n = 3, but it doesn't.
Could you please tell me what is wrong in my code?
Thank you!
>> recursiveFunction(0)
101
1
g
102
1
g
103
1
2
3
2
g
103
1
2
3
3
g
103
1
2
3
2
g
102
1
g
103
1
2
3
2
g
103
1
2
3
3
g
103
1
2
3
3
g
102
1
g
103
1
2
3
2
g
103
1
2
3
3
g
103
1
2
3
function recursiveFunction(callHierarchie)
    callHierarchie = callHierarchie + 1;
    disp(callHierarchie + 100);
    for n = 1:3
        disp(n);
        if callHierarchie <= 2
            disp('g');
            recursiveFunction(callHierarchie);
        end
    end
end

The problem is both how you're generating your output and how you're interpreting your output. Here's a Python equivalent function that generates the same output:
def recursiveFunction(callHierarchie):
    callHierarchie = callHierarchie + 1
    print("{:>6}".format(callHierarchie + 100))
    for n in range(1, 4):
        print("{:>6}".format(n))
        if callHierarchie <= 2:
            print('g')
            recursiveFunction(callHierarchie)

recursiveFunction(0)
Folks can verify it produces the same output. Let's modify the code to indent based on the recursion level:
def recursiveFunction(callHierarchie):
    callHierarchie = callHierarchie + 1
    print(" " * callHierarchie, "{:>6}".format(callHierarchie + 100))
    for n in range(1, 4):
        print(" " * callHierarchie, "{:>6}".format(n))
        if callHierarchie <= 2:
            print(" " * callHierarchie, 'g')
            recursiveFunction(callHierarchie)
Now the output displays slightly differently:
% python3 test.py
     101
       1
  g
      102
        1
   g
       103
         1
         2
         3
        2
   g
       103
         1
         2
         3
        3
   g
       103
         1
         2
         3
       2
  g
      102
        1
   g
       103
         1
         2
         3
        2
   g
       103
         1
         2
         3
        3
   g
       103
         1
         2
         3
       3
  g
      102
        1
   g
       103
         1
         2
         3
        2
   g
       103
         1
         2
         3
        3
   g
       103
         1
         2
         3
%
You can see that n does stop at 3, but the extra numbers you were seeing were n at a different level of recursion!
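Another way to see this is to count how many times the function is entered at each recursion level instead of printing. The sketch below is a hypothetical variant (the name trace and the calls counter are not part of the original code) that keeps the same loop and the same stopping condition.

```python
from collections import Counter


def trace(level=0, calls=None):
    """Same call structure as recursiveFunction, but counts calls per level."""
    if calls is None:
        calls = Counter()
    level += 1
    calls[level] += 1
    for n in range(1, 4):  # every call runs its own loop from 1 to 3
        if level <= 2:
            trace(level, calls)
    return calls


calls = trace()
print(dict(calls))  # {1: 1, 2: 3, 3: 9}
```

One call at level 1 starts three calls at level 2, and each of those starts three at level 3, so thirteen separate loops each print their own 1, 2, 3, interleaved in the output.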

output for different changing variables

I have 4 variables a, b, c, d. Each one varies over a range: a = 1,2; b = 1,2,3; c = 1,2,3,4; d = 1,2,3,4,5. By varying each element I want to generate an output value, i.e.
a b c d output
1 1 1 1 1
1 1 1 2 2
1 1 1 3 3
1 1 1 4 4
1 1 1 5 5
Now c moves to its next value and d again takes all of its values, i.e.
a b c d output
1 1 2 1 6
1 1 2 2 7
1 1 2 3 8
1 1 2 4 9
1 1 2 5 10
Now change c to 3 and do the same, getting output 11, 12, 13, 14, 15. When c reaches its maximum value, change b, i.e.
a b c d output
1 2 1 1 16
1 2 1 2 17
1 2 1 3 18
1 2 1 4 19
1 2 1 5 20
then
a b c d output
1 2 2 1 21
1 2 2 2 22
1 2 2 3 23
1 2 2 4 24
1 2 2 5 25
I want to proceed like this and get the output for all combinations of a, b, c, d. How can I do this in MATLAB, or is there an equation for it? Above, a, b, c, d take 2, 3, 4, 5 values, i.e. in increasing order, but in the general case the counts need not be increasing, e.g. a, b, c, d can take 7, 4, 9, 13 values.

A possible algorithm is to build the combinations column by column, considering the number of times each value has to be repeated, starting from the array d.
Define:
len_a the length of the array a
len_b the length of the array b
len_c the length of the array c
len_d the length of the array d
You need to replicate the d array len_a * len_b * len_c times.
Each value of the array c needs to be repeated len_d times, which gives a block of len_c * len_d rows covering the "right side" combinations; this block then has to be replicated len_a * len_b times to account for the "left side" to come.
A similar approach applies to the definition of the arrays a and b.
To have the set of combinations in a "random" sequence, it is sufficient to
use the randperm function.
% Define the input arrays
a=1:2;
len_a=length(a);
b=1:3;
len_b=length(b);
c=1:4;
len_c=length(c);
d=1:5;
len_d=length(d);
% Generate the fourth column of the table
%
d1=repmat(d',len_a*len_b*len_c,1)
%
% Generate the third column of the table
c1=repmat(reshape(bsxfun(@plus,zeros(len_d,1),[1:len_c]),len_c*len_d,1),len_a*len_b,1)
%
% Generate the second column of the table
b1=repmat(reshape(bsxfun(@plus,zeros(len_c*len_d,1),[1:len_b]),len_b*len_c*len_d,1),len_a,1)
%
% Generate the first column of the table
a1=reshape(bsxfun(@plus,zeros(len_b*len_c*len_d,1),[1:len_a]),len_a*len_b*len_c*len_d,1)
%
% Merge the columns and add the counter in the fifth column
combination_set_1=[a1 b1 c1 d1 (1:len_a*len_b*len_c*len_d)']
% Randomize the rows
combination_set_2=combination_set_1(randperm(len_a*len_b*len_c*len_d),:)
Output:
1 1 1 1 1
1 1 1 2 2
1 1 1 3 3
1 1 1 4 4
1 1 1 5 5
1 1 2 1 6
1 1 2 2 7
1 1 2 3 8
1 1 2 4 9
1 1 2 5 10
1 1 3 1 11
1 1 3 2 12
1 1 3 3 13
1 1 3 4 14
1 1 3 5 15
1 1 4 1 16
1 1 4 2 17
1 1 4 3 18
1 1 4 4 19
1 1 4 5 20
1 2 1 1 21
1 2 1 2 22
1 2 1 3 23
1 2 1 4 24
1 2 1 5 25
1 2 2 1 26
1 2 2 2 27
1 2 2 3 28
1 2 2 4 29
1 2 2 5 30
1 2 3 1 31
1 2 3 2 32
1 2 3 3 33
1 2 3 4 34
1 2 3 5 35
1 2 4 1 36
1 2 4 2 37
1 2 4 3 38
1 2 4 4 39
1 2 4 5 40
1 3 1 1 41
1 3 1 2 42
1 3 1 3 43
1 3 1 4 44
1 3 1 5 45
1 3 2 1 46
1 3 2 2 47
1 3 2 3 48
1 3 2 4 49
1 3 2 5 50
1 3 3 1 51
1 3 3 2 52
1 3 3 3 53
1 3 3 4 54
1 3 3 5 55
1 3 4 1 56
1 3 4 2 57
1 3 4 3 58
1 3 4 4 59
1 3 4 5 60
2 1 1 1 61
2 1 1 2 62
2 1 1 3 63
2 1 1 4 64
2 1 1 5 65
2 1 2 1 66
2 1 2 2 67
2 1 2 3 68
2 1 2 4 69
2 1 2 5 70
2 1 3 1 71
2 1 3 2 72
2 1 3 3 73
2 1 3 4 74
2 1 3 5 75
2 1 4 1 76
2 1 4 2 77
2 1 4 3 78
2 1 4 4 79
2 1 4 5 80
2 2 1 1 81
2 2 1 2 82
2 2 1 3 83
2 2 1 4 84
2 2 1 5 85
2 2 2 1 86
2 2 2 2 87
2 2 2 3 88
2 2 2 4 89
2 2 2 5 90
2 2 3 1 91
2 2 3 2 92
2 2 3 3 93
2 2 3 4 94
2 2 3 5 95
2 2 4 1 96
2 2 4 2 97
2 2 4 3 98
2 2 4 4 99
2 2 4 5 100
2 3 1 1 101
2 3 1 2 102
2 3 1 3 103
2 3 1 4 104
2 3 1 5 105
2 3 2 1 106
2 3 2 2 107
2 3 2 3 108
2 3 2 4 109
2 3 2 5 110
2 3 3 1 111
2 3 3 2 112
2 3 3 3 113
2 3 3 4 114
2 3 3 5 115
2 3 4 1 116
2 3 4 2 117
2 3 4 3 118
2 3 4 4 119
2 3 4 5 120
Hope this helps.
Qapla'
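The question also asks for an equation. As a hedged sketch in Python rather than MATLAB: the table is the Cartesian product of the four ranges with the last variable changing fastest, and the output number is the mixed-radix position of the combination. The function names enumerate_combinations and combination_index are hypothetical, and the formula assumes each variable runs from 1 up to its count.

```python
from itertools import product


def enumerate_combinations(*ranges):
    """Return rows (a, b, c, d, output) with the last variable varying fastest."""
    return [combo + (i,) for i, combo in enumerate(product(*ranges), start=1)]


def combination_index(values, sizes):
    """Closed form: output number of one combination (values start at 1)."""
    index = 0
    for value, size in zip(values, sizes):
        index = index * size + (value - 1)
    return index + 1


rows = enumerate_combinations(range(1, 3), range(1, 4), range(1, 5), range(1, 6))
print(len(rows))   # 120
print(rows[20])    # (1, 2, 1, 1, 21)
print(combination_index((1, 2, 1, 1), (2, 3, 4, 5)))  # 21
```

Written out for four variables, the closed form is output = (((a-1)*len_b + (b-1))*len_c + (c-1))*len_d + d, which works for any counts such as 7, 4, 9, 13.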
