R intersecting a date between dates in another row and assigning end dates by group - date

Hi, I need a solution to a problem in the R programming language:
library(gtools)
end_date <- "2021-12-31"
ddf1 <- data.frame(pnr=c("1","1","2","2","3","3","3","4"),
                   in_date=as.POSIXct(c("2010-08-18","2010-09-01","2019-04-02","2018-03-27",
                                        "2019-07-12","2013-10-20","2012-07-01","2015-05-02")),
                   out_date=as.POSIXct(c("2010-12-04",NA,"2019-05-17",NA,
                                         NA,"2017-08-19",NA,NA)),
                   treat1=c(1,1,1,1,1,1,1,1)
)
ddf2 <- data.frame(pnr=c("4","4","3","3","2","2","2","1"),
                   in_date=as.POSIXct(c("2010-08-18","2010-09-01","2019-04-02","2018-03-27",
                                        "2019-07-12","2013-10-20","2012-07-01","2015-05-02")),
                   out_date=as.POSIXct(c("2010-12-04",NA,"2019-05-17",NA,
                                         NA,"2017-08-19",NA,NA)),
                   treat2=c(1,1,1,1,1,1,1,1)
)
expected_output1 <- data.frame(pnr=c("1","2","3","3","4"),
                               in_date=as.POSIXct(c("2010-08-18","2018-03-27",
                                                    "2019-07-12","2012-07-01","2015-05-02")),
                               out_date=as.POSIXct(c("2010-12-04","2019-05-17",end_date,
                                                     "2017-08-19",end_date)),
                               treat1=c(1,1,1,1,1)
)
expected_output2 <- data.frame(pnr=c("4","3","2","2","1"),
                               in_date=as.POSIXct(c("2010-08-18","2018-03-27",
                                                    "2019-07-12","2012-07-01","2015-05-02")),
                               out_date=as.POSIXct(c("2010-12-04","2019-05-17",end_date,
                                                     "2017-08-19",end_date)),
                               treat2=c(1,1,1,1,1)
)
ddf <- smartbind(ddf1,ddf2)
expected_output <- smartbind(expected_output1,expected_output2)
> ddf
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 1 2010-09-01 <NA> 1 NA
1:3 2 2019-04-02 2019-05-17 1 NA
1:4 2 2018-03-27 <NA> 1 NA
1:5 3 2019-07-12 <NA> 1 NA
1:6 3 2013-10-20 2017-08-19 1 NA
1:7 3 2012-07-01 <NA> 1 NA
1:8 4 2015-05-02 <NA> 1 NA
2:1 4 2010-08-18 2010-12-04 NA 1
2:2 4 2010-09-01 <NA> NA 1
2:3 3 2019-04-02 2019-05-17 NA 1
2:4 3 2018-03-27 <NA> NA 1
2:5 2 2019-07-12 <NA> NA 1
2:6 2 2013-10-20 2017-08-19 NA 1
2:7 2 2012-07-01 <NA> NA 1
2:8 1 2015-05-02 <NA> NA 1
> expected_output
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 2 2018-03-27 2019-05-17 1 NA
1:3 3 2019-07-12 2021-12-31 1 NA
1:4 3 2012-07-01 2017-08-19 1 NA
1:5 4 2015-05-02 2021-12-31 1 NA
2:1 4 2010-08-18 2010-12-04 NA 1
2:2 3 2018-03-27 2019-05-17 NA 1
2:3 2 2019-07-12 2021-12-31 NA 1
2:4 2 2012-07-01 2017-08-19 NA 1
2:5 1 2015-05-02 2021-12-31 NA 1
I have some individuals who have gone through different treatments, treat1 and treat2.
I need to deal with the fact that some treatment courses were started but lack an out_date.
When an out_date is missing, it should be replaced with the end_date of the study:
end_date <- "2021-12-31"
However, if an observation like the one
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 1 2010-09-01 <NA> 1 NA
where the in_date, meaning the start of the treatment, falls within the period of another treatment for the same person ("pnr"), then the correct output is:
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
because 2010-08-18 is the earliest in_date.
However, in another case, where the row without an out_date has the earlier in_date, that earlier date should be used. This is the case for pnr 2:
pnr in_date out_date treat1 treat2
1:3 2 2019-04-02 2019-05-17 1 NA
1:4 2 2018-03-27 <NA> 1 NA
becomes:
pnr in_date out_date treat1 treat2
1:2 2 2018-03-27 2019-05-17 1 NA
So the whole period of treatment is covered, with the earliest in_date and latest out_date.
In cases where there is no out_date the end_date should be set instead;
so that:
pnr in_date out_date treat1 treat2
1:8 4 2015-05-02 <NA> 1 NA
becomes:
1:5 4 2015-05-02 2021-12-31 1 NA
In the special case where a person has both a row with an earlier (or intersecting) in_date and a row with a later in_date and a missing out_date, the function should handle both, as with pnr 3:
1:5 3 2019-07-12 <NA> 1 NA
1:6 3 2013-10-20 2017-08-19 1 NA
1:7 3 2012-07-01 <NA> 1 NA
should become:
pnr in_date out_date treat1 treat2
1:3 3 2019-07-12 2021-12-31 1 NA
1:4 3 2012-07-01 2017-08-19 1 NA
OPTIONAL: If at all possible, it would be great if the function could handle each treatment separately, so that each pnr is processed independently within treat1 and within treat2, as shown in expected_output.
I have attempted to write some code that flags whether an out_date is NA and computes the differences between dates, but I still can't fathom how to continue:
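library(data.table)  # needed for data.table(), shift() and :=
# flag rows whose out_date is missing
ddf$end_replaced <- as.integer(ifelse(is.na(ddf$out_date),1,0))
ddf <- data.table(ddf)
ddf <- ddf[order(ddf$treat1,ddf$pnr,ddf$in_date,ddf$out_date),]
# days between each in_date and the previous in_date of the same pnr
ddf[, diffx := difftime(in_date, shift(in_date, fill=in_date[1L]),
                        units="days"), by=pnr]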
Thanks for reading
UPDATE
I ended up solving the issue. It's not pretty, so if anyone has a better solution than this, please let me know.
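The poster's own solution is not shown, so the following is only a rough Python sketch of one reading of the rules described above, not the author's method. It assumes that, within each pnr and treatment, an open row (no out_date) that starts inside a closed course is absorbed, an open row that starts before a closed course borrows that course's out_date, and any other open row runs to the study end. The function name `collapse_courses` and the tuple layout are hypothetical, chosen for illustration.

```python
from datetime import date
from itertools import groupby

STUDY_END = date(2021, 12, 31)  # end_date from the question


def collapse_courses(rows, study_end=STUDY_END):
    """Collapse treatment rows per (pnr, treatment).

    rows: iterable of (pnr, treatment, in_date, out_date); out_date is None
    when missing. The rules applied here are an assumed interpretation.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    result = []
    for (pnr, treat), grp in groupby(ordered, key=lambda r: (r[0], r[1])):
        grp = list(grp)
        # closed courses, already sorted by in_date
        closed = [(s, e) for _, _, s, e in grp if e is not None]
        spans = list(closed)
        for _, _, start, end in grp:
            if end is not None:
                continue
            if any(cs <= start <= ce for cs, ce in closed):
                continue  # open row starts inside a closed course: absorbed
            later_ends = [ce for cs, ce in closed if cs > start]
            # borrow the next closed course's out_date, else use the study end
            spans.append((start, later_ends[0] if later_ends else study_end))
        merged = []
        for start, end in sorted(spans):
            if merged and start <= merged[-1][1]:
                # overlapping spans: keep earliest start, latest end
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        result.extend((pnr, treat, s, e) for s, e in merged)
    return result


# treat1 rows for pnr 3, taken from the question
pnr3 = [
    ("3", "treat1", date(2019, 7, 12), None),
    ("3", "treat1", date(2013, 10, 20), date(2017, 8, 19)),
    ("3", "treat1", date(2012, 7, 1), None),
]
collapsed = collapse_courses(pnr3)
```

Grouping on both pnr and treatment covers the OPTIONAL request, since each treatment is collapsed independently for each person.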

Related

The simple recursive function doesn't work well

It's just a simple test of a recursive function.
The loop should stop at n = 3, but it doesn't seem to.
Could you please tell me what is wrong in my code?
Thank you!
>> recursiveFunction(0)
101
1
g
102
1
g
103
1
2
3
2
g
103
1
2
3
3
g
103
1
2
3
2
g
102
1
g
103
1
2
3
2
g
103
1
2
3
3
g
103
1
2
3
3
g
102
1
g
103
1
2
3
2
g
103
1
2
3
3
g
103
1
2
3
function recursiveFunction(callHierarchie)
    callHierarchie = callHierarchie + 1;
    disp(callHierarchie + 100);
    for n = 1:3
        disp(n);
        if callHierarchie <= 2
            disp('g');
            recursiveFunction(callHierarchie);
        end
    end
end
The problem is both how you're generating your output and how you're interpreting your output. Here's a Python equivalent function that generates the same output:
def recursiveFunction1(callHierarchie):
    callHierarchie = callHierarchie + 1
    print("{:>6}".format(callHierarchie + 100))
    for n in range(1, 4):
        print("{:>6}".format(n))
        if callHierarchie <= 2:
            print('g')
            recursiveFunction1(callHierarchie)

recursiveFunction1(0)
Folks can verify it produces the same output. Let's modify the code to indent based on the recursion level:
def recursiveFunction(callHierarchie):
    callHierarchie = callHierarchie + 1
    print(" " * callHierarchie, "{:>6}".format(callHierarchie + 100))
    for n in range(1, 4):
        print(" " * callHierarchie, "{:>6}".format(n))
        if callHierarchie <= 2:
            print(" " * callHierarchie, 'g')
            recursiveFunction(callHierarchie)
Now the output displays slightly differently:
% python3 test.py
     101
       1
  g
      102
        1
   g
       103
         1
         2
         3
        2
   g
       103
         1
         2
         3
        3
   g
       103
         1
         2
         3
       2
  g
      102
        1
   g
       103
         1
         2
         3
        2
   g
       103
         1
         2
         3
        3
   g
       103
         1
         2
         3
       3
  g
      102
        1
   g
       103
         1
         2
         3
        2
   g
       103
         1
         2
         3
        3
   g
       103
         1
         2
         3
%
You can see that n does stop at 3, but the extra numbers you were seeing were n at a different level of recursion!
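Another way to check this, sketched here with a hypothetical helper `trace_calls`, is to record each loop value together with its recursion depth instead of printing it. The control flow is the same as in the function above; only the logging is an addition.

```python
from collections import Counter


def trace_calls(depth=0, log=None):
    """Same control flow as recursiveFunction, but records (depth, n) pairs."""
    if log is None:
        log = []
    depth += 1
    for n in range(1, 4):
        log.append((depth, n))
        if depth <= 2:
            trace_calls(depth, log)
    return log


log = trace_calls()
largest_n = max(n for _, n in log)               # never exceeds 3
per_depth = Counter(depth for depth, _ in log)   # loop iterations per level
```

The loop runs 3 times at depth 1, 9 times at depth 2 and 27 times at depth 3, which is why the flat printout looks as if n keeps going after 3.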

Removing Duplicated Values

After data cleaning and aggregation I was left with a data table like this:
df
id d1 v1 d2 v2 d3 v3 d4 v4
1 1-1-2018 1 1-1-2018 1 1-1-2018 1 1-1-2018 1
2 1-1-2018 1 1-2-2018 2 1-2-2018 2 1-2-2018 2
3 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-3-2018 3
4 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-4-2018 4
I am trying to remove any values from a column in the above data frame that are duplicates of earlier columns.
I have already tried:
df$v2[df$v1 == df$v2] <- NA
this removed all of the values from the v2 column
I want my data frame to look like this at the end:
df
id d1 v1 d2 v2 d3 v3 d4 v4
1 1-1-2018 1 NA NA NA NA NA NA
2 1-1-2018 1 1-2-2018 2 NA NA NA NA
3 1-1-2018 1 1-2-2018 2 1-3-2018 3 NA NA
4 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-4-2018 4
Try df$column[...condition here...] <- NA
Or with data.table:
library(data.table)
dt <- data.table(df)
# blank the second date/value pair wherever its date repeats the first
dt[d2 == d1, c("d2", "v2") := list(NA, NA)]
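To cover all four date/value pairs at once, the idea can be sketched in plain Python. This assumes each row is a dict with keys d1..d4 and v1..v4, and that a pair counts as a duplicate when the same (date, value) already appeared in an earlier pair of that row. `blank_repeats` is a hypothetical name.

```python
def blank_repeats(row, n_pairs=4):
    """Return a copy of row with repeated (d_k, v_k) pairs set to None."""
    out = dict(row)
    seen = set()
    for k in range(1, n_pairs + 1):
        # compare against the original values, not the already blanked ones
        pair = (row[f"d{k}"], row[f"v{k}"])
        if pair in seen:
            out[f"d{k}"] = None
            out[f"v{k}"] = None
        seen.add(pair)
    return out


# row with id 2 from the question
row2 = {"id": 2, "d1": "1-1-2018", "v1": 1, "d2": "1-2-2018", "v2": 2,
        "d3": "1-2-2018", "v3": 2, "d4": "1-2-2018", "v4": 2}
cleaned = blank_repeats(row2)
```

Reading from the untouched row matters: if d3 were compared with an already blanked d2, the third pair would wrongly survive.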

How to join tables in Matlab (2018) by matching time intervals?

I have two tables A and B. I want to join them based on their validity time intervals.
A has product quality (irregular times) and B has hourly settings during the production period. I need to create a table like C that includes the parameters p1 and p2 for all A's RefDates that fall in the time range of B's ValidFrom ValidTo.
A
RefDate result
'11-Oct-2017 00:14:00' 17
'11-Oct-2017 00:14:00' 19
'11-Oct-2017 00:20:00' 5
'11-Oct-2017 01:30:00' 25
'11-Oct-2017 01:30:00' 18
'11-Oct-2017 03:03:00' 28
B
ValidFrom ValidTo p1 p2
'11-Oct-2017 00:13:00' '11-Oct-2017 01:12:59' 2 1
'11-Oct-2017 01:13:00' '11-Oct-2017 02:12:59' 3 1
'11-Oct-2017 02:13:00' '11-Oct-2017 03:12:59' 4 5
'11-Oct-2017 03:13:00' '11-Oct-2017 04:12:59' 6 1
'11-Oct-2017 04:13:00' '11-Oct-2017 05:12:59' 7 9
I need to get something like this.
C
RefDate res p1 p2
'11-Oct-2017 00:14:00' 17 2 1
'11-Oct-2017 00:14:00' 19 2 1
'11-Oct-2017 00:20:00' 5 2 1
'11-Oct-2017 01:30:00' 25 3 1
'11-Oct-2017 01:30:00' 18 3 1
'11-Oct-2017 03:03:00' 28 4 5
I know how to do this in SQL and I think I have figured out how to do this row by row in MatLab but this is horribly slow. The data set is rather large. I just assume there must be a more elegant way that I just couldn't find.
Something that caused many of my approaches to fail is that the RefDate column is not unique.
edit:
the real tables have thousands of rows and hundreds of variables.
C (in reality)
RefDate res res2 ... res200 p1 p2 ... p1000
11-Oct-2017 00:14:00 17 2 1
11-Oct-2017 00:14:00 19 2 1
11-Oct-2017 00:20:00 5 2 1
11-Oct-2017 01:30:00 25 3 1
11-Oct-2017 01:30:00 18 3 1
11-Oct-2017 03:03:00 28 4 5
This can actually be done in a single line of code. Assuming your ValidTo value always ends immediately before the ValidFrom in the next row (which it does in your example), you only need to use your ValidFrom values. First, convert those and your RefDate values to serial date numbers using datenum. Then use the discretize function to bin the RefDate values using the ValidFrom values as the edges, which will give you the row index in B that contains each time in A. Then use that index to extract the p1 and p2 values and append them to A:
>> C = [A B(discretize(datenum(A.RefDate), datenum(B.ValidFrom)), 3:end)]
C =
RefDate result p1 p2
______________________ ______ __ __
'11-Oct-2017 00:14:00' 17 2 1
'11-Oct-2017 00:14:00' 19 2 1
'11-Oct-2017 00:20:00' 5 2 1
'11-Oct-2017 01:30:00' 25 3 1
'11-Oct-2017 01:30:00' 18 3 1
'11-Oct-2017 03:03:00' 28 4 5
The above solution should work for any number of columns pN in B.
If there are any times in A that don't fall in any of the ranges in B, you will have to break the solution into multiple lines so you can check whether or not the index returned from discretize contains NaN values. Assuming you want to exclude those rows from C, this would be the new solution:
index = discretize(datenum(A.RefDate), datenum(B.ValidFrom));
C = [A(~isnan(index), :) B(index(~isnan(index)), 3:end)];
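For comparison, the same lookup can be sketched in Python with a binary search over the ValidFrom values. This assumes B is sorted by ValidFrom with non-overlapping intervals, and it also checks ValidTo, so rows of A outside every interval are dropped. `interval_join` and the tuple layout are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

FMT = "%d-%b-%Y %H:%M:%S"  # e.g. '11-Oct-2017 00:14:00'


def interval_join(a_rows, b_rows):
    """a_rows: (RefDate, result, ...); b_rows: (ValidFrom, ValidTo, p1, p2, ...)."""
    starts = [datetime.strptime(b[0], FMT) for b in b_rows]
    joined = []
    for ref, *results in a_rows:
        t = datetime.strptime(ref, FMT)
        k = bisect_right(starts, t) - 1  # last interval starting at or before t
        if k >= 0 and t <= datetime.strptime(b_rows[k][1], FMT):
            joined.append((ref, *results, *b_rows[k][2:]))
    return joined


A = [("11-Oct-2017 00:14:00", 17), ("11-Oct-2017 00:14:00", 19),
     ("11-Oct-2017 00:20:00", 5), ("11-Oct-2017 01:30:00", 25),
     ("11-Oct-2017 01:30:00", 18), ("11-Oct-2017 03:03:00", 28)]
B = [("11-Oct-2017 00:13:00", "11-Oct-2017 01:12:59", 2, 1),
     ("11-Oct-2017 01:13:00", "11-Oct-2017 02:12:59", 3, 1),
     ("11-Oct-2017 02:13:00", "11-Oct-2017 03:12:59", 4, 5),
     ("11-Oct-2017 03:13:00", "11-Oct-2017 04:12:59", 6, 1),
     ("11-Oct-2017 04:13:00", "11-Oct-2017 05:12:59", 7, 9)]
C = interval_join(A, B)
```

Each lookup is logarithmic in the number of rows in B, and duplicate RefDate values in A cause no trouble because every row of A is looked up independently.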
The following code does exactly what you are asking for:
% convert to datetime
A.RefDate = datetime(A.RefDate);
B.ValidFrom = datetime(B.ValidFrom);
B.ValidTo = datetime(B.ValidTo);
% for each row in A, find the matching row in B
i = cellfun(@find, arrayfun(@(x) (x >= B.ValidFrom) & (x <= B.ValidTo), A.RefDate, 'UniformOutput', false), 'UniformOutput', false);
% find rows in A that were not matched
j = cellfun(@isempty, i, 'UniformOutput', false);
% build the result
C = [B(cell2mat(i),:) A(~cell2mat(j),:)];
% display output
C

How to combine numbers in different columns using matlab

Excel has a function to combine numbers from different columns, as you can see in the picture below. Given the same data in MATLAB, how can I combine numbers in different columns?
I have the following data:
a b c d
1 1 1 3
2 1 0 5
1 2 5 30
3 4 1 26
-1 1 1 3
Since 111 and -111 have the same value of d, I combine them so that the first cell in the first column becomes 111,-111 and their d becomes 6, because I add the two values. Can MATLAB do that? Thanks.
a=[1 1 1 3;2 1 0 5; 1 2 5 30; 3 4 1 26; -1 1 1 3]
len=size(a);
x2=[]
for i=1:len(1)
    s=num2str(a(i,1:len(2)-1));
    s=s(s~=' ');
    x2(i,:)=[str2num(s) (a(i,len(2)))];
end
Result:
x2 =
111 3
210 5
125 30
341 26
-111 3
now to find the repeated indices:
u=unique(x2(:,2));
n=histc(x2(:,2),u);
ind=find(x2(:,2)==u(n>1))
Result:
ind =
1
5
Ok now to sum and combine:
xx=x2(ind,:)
ss=sum(xx(:,2));
s=num2str(xx(:,1)');
s=strrep(s, ' ', ',')
x2(min(ind),2) = ss;
x2(ind(ind~=min(ind)),:) = []
C = num2cell(x2);
C(min(ind),1) = cellstr(s)
The final result is:
C =
'111,-111' [ 6]
[ 210] [ 5]
[ 125] [30]
[ 341] [26]
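The same grouping can be sketched more compactly in Python with a dictionary keyed on d. This assumes, as in the steps above, that rows are combined whenever their d values are equal. `combine_by_d` is a hypothetical name.

```python
def combine_by_d(rows):
    """Join the digit columns of each row and merge rows sharing the same d."""
    groups = {}  # d value -> labels, in order of first appearance
    for *digits, d in rows:
        label = "".join(str(x) for x in digits)  # [1, 1, 1] -> "111"
        groups.setdefault(d, []).append(label)
    # rows in a group all share d, so their sum is d times the group size
    return [(",".join(labels), d * len(labels)) for d, labels in groups.items()]


a = [[1, 1, 1, 3], [2, 1, 0, 5], [1, 2, 5, 30], [3, 4, 1, 26], [-1, 1, 1, 3]]
combined = combine_by_d(a)
```

Unlike the indexing with u(n>1), this handles any number of repeated d values in one pass.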

How to sum values among matrices of structures under certain conditions in Matlab?

I am new to MATLAB and not familiar with arrays of matrices. I have a number of n-by-6 matrices:
<26x6 double>
<21x6 double>
<27x6 double>
<36x6 double>
<29x6 double>
<30x6 double>
....
Each matrix is of this type:
>> Matrix{1,1}
A B C D E F
1 2 6 223 735064.287500000 F11
2 3 6 223 735064.288194445 F12
3 4 6 223 735064.288888889 F13
4 5 6 223 735064.290277778 F14
>> Matrix{2,1}
A B C D E F
1 2 6 223 735064.700694445 F21
2 3 6 223 735064.701388889 F22
3 4 6 223 735064.702083333 F23
4 5 6 223 735064.702777778 F24
>> Matrix{3,1}
A B C D E F
1 2 7 86 735064.3541666666 F31
2 3 7 86 735064.3548611112 F32
3 4 7 86 735064.3555555555 F33
4 5 7 86 735064.3562499999 F34
5 6 7 86 735064.702777778 F35
>> Matrix{4,1}
A B C D E F
1 2 7 86 735064.3569444444 F41
2 3 7 86 735064.3576388888 F42
3 4 7 86 735064.3583333333 F43
4 5 7 86 735064.3590277778 F44
5 6 6 86 735064.702777778 F45
where E and F are dates in datenum format. Specifically, F is a time difference.
Considering all matrices at once, I would like to sum the values of column F among all the matrices that have equal values in columns A, B, D.
For each value of the column D (the number of bus), I would like to obtain a new matrix like the following one:
A B C D H
1 2 6 223 F11+F21
2 3 6 223 F12+F22
3 4 6 223 F13+F23
4 5 6 223 F14+F24
A B C D H
1 2 7 86 F31+F41
2 3 7 86 F32+F42
3 4 7 86 F33+F43
4 5 7 86 F34+F44
5 6 7 86 F35+F45
Thank you in advance for your help!
This approach should get you started. I suggest setting up a matrix that stores the comparison between columns 1, 2 and 4. Based on that matrix you can then generate your output matrix. This saves you nested if statements and checks in your loop.
Here's an example (please note that I changed row 3 of Matrix{2,1}):
Matrix{1,1} = [ ...
1 2 6 223 735064.287500000 1;
2 3 6 223 735064.288194445 2;
3 4 6 223 735064.288888889 3;
4 5 6 223 735064.290277778 4];
Matrix{2,1} = [ ...
1 2 6 223 735064.700694445 10;
2 3 6 223 735064.701388889 10;
2 4 6 223 735064.702083333 10;
4 5 6 223 735064.702777778 10];
COMP = Matrix{1,1}(:,[1:2 4])==Matrix{2,1}(:,[1:2 4]);
a = 1;
for i=1:size(Matrix{1,1},1)
    if sum(COMP(i,:)) == 3
        SUM{1,1}(a,1:5) = Matrix{1,1}(i,1:5);
        SUM{1,1}(a,6) = Matrix{1,1}(i,6) + Matrix{2,1}(i,6);
        a = a + 1;
    end
end
The matrix COMP stores a 1 for each element that is the same in Matrix{1,1} and Matrix{2,1} when comparing columns 1, 2 and 4.
This reduces the if-statement to a check if all elements in a row agree (hence sum == 3). If that condition is satisfied, a new matrix is generated (SUM{1,1}) which sums the entries in column 6, in this case:
SUM{1,1}(:,6) =
11
12
14
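To extend this beyond two matrices with matching row order, one option is to accumulate column F in a dictionary keyed on (A, B, D). The Python sketch below assumes that only keys found in at least two rows should be reported, which mirrors the example above. `sum_f_by_key` is a hypothetical name.

```python
from collections import defaultdict


def sum_f_by_key(matrices):
    """Sum column F over all rows that share the same (A, B, D) values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for matrix in matrices:
        for a, b, c, d, e, f in matrix:
            totals[(a, b, d)] += f
            counts[(a, b, d)] += 1
    # assumption: keep only keys that occurred more than once
    return {key: totals[key] for key in totals if counts[key] > 1}


m1 = [[1, 2, 6, 223, 735064.287500000, 1],
      [2, 3, 6, 223, 735064.288194445, 2],
      [3, 4, 6, 223, 735064.288888889, 3],
      [4, 5, 6, 223, 735064.290277778, 4]]
m2 = [[1, 2, 6, 223, 735064.700694445, 10],
      [2, 3, 6, 223, 735064.701388889, 10],
      [2, 4, 6, 223, 735064.702083333, 10],
      [4, 5, 6, 223, 735064.702777778, 10]]
sums = sum_f_by_key([m1, m2])
```

Because rows are matched by key rather than by position, the matrices may have different lengths or row orders.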